Brian Ward, Ph.D.

Blog

R coding

Find species occurrences within a region

  • December 30, 2017 (updated January 6, 2018)
  • by brian_ward

Over the past few weeks I have been negligent in my duties to this website; I attribute my lapse in activity (of all kinds) to the holidays. However, I haven’t been totally sedentary. I was just home in New Hampshire for a few days, where we received 6 inches of fresh snow on Christmas day, on top of the foot or so that was already on the ground. I went out for a snowshoe with my parents and we were truly in a winter wonderland:

winter hunting cabin
The 2017-2018 winter in New Hampshire, off to a good start

However, the downside is that temperatures dropped precipitously after that, eventually staying in the vicinity of 0° F (-18° C) for the remainder of my trip. The moral of the story is that while it’s nice to visit New Hampshire in the winter, it’s also nice to be able to leave.

Anyways, I guess that’s enough about me. On to the primary topic of this post:

Problem

A coworker recently posed the following problem that she needed to solve in an extensible way:

Given a country X, find all species of weeds that have been reported there, and compare this list against a preexisting list containing the names of weed species of concern

In order to solve this problem, the first resource I would turn to is the Global Biodiversity Information Facility (GBIF). GBIF hosts a massive database that collects observations of species (referred to as “occurrences”) from all over the world; as of this writing the database contains roughly 1 billion species occurrences total.

Outline

We can break this problem down into a few simple steps:

  1. Clean our preexisting list of species names to match GBIF formatting. In the example I’ll use, we will have to get rid of some extra information included with the species names. One example is naming authorities – so for our purposes the gray wolf, first named by Linnaeus in 1758, should be written as Canis lupus instead of Canis lupus L. In addition, we will remove information regarding subspecies, varieties, and cultivars.
  2. Identify an appropriate dataset to download. We already know that we need to download species occurrences for a particular country, but downloading all occurrences from any given country could be far more information than we need. Therefore, we need to restrict the size of the download as much as is feasible.
  3. Initiate the download from GBIF servers
  4. Clean species names within the downloaded GBIF data. Generally the species names in GBIF datasets are formatted well, though it pays to also clean this data to ensure reliable matches with our own list.
  5. Compare our list of species names with the GBIF data
  6. [Optional] Write out the downloaded occurrence data to a .csv file

R example

The script that this code is copied from is available here.

For the example, I will be using the excellent rOpenSci package rgbif, which, you may have guessed, provides several R functions for accessing GBIF data. If you don’t have rgbif installed, just open an R terminal and type:

install.packages("rgbif")

Input variables and data

Now that that is out of the way, let’s start by loading required packages, defining some variables, and simulating some data. We’ll use the country of Chad as an example. For our list of species, I am just going to create a data frame containing various plants along with two animals whose scientific names I find humorous. Normally, our list of species would probably come from a preexisting .csv file that we read in. Finally, we need to supply a model or type species, and a desired taxonomic rank of interest (more on that below).

library(rgbif)

## Country of interest
country <- "Chad"

## Species of interest - would normally obtain this from a pre-existing .csv file
species <- data.frame("sci_name" = c("Pica pica L.", 
                                     "Tuta absoluta (Meyrick)", 
                                     "Arachis hypogaea var. Georgia-12Y", 
                                     "Vigna vexillata (L.) A. Rich", 
                                     "ZEA MAYS"),
                      "common_name" = c("Eurasian magpie",
                                        "tomato leafminer",
                                        "peanut variety Georgia-12Y",
                                        "wild cowpea",
                                        "maize"))

## Model or type species, and taxonomic rank of interest
mod_sp <- "Arabidopsis thaliana"
rank <- "phylum"

## Select whether to output occurrence data (TRUE/FALSE)
## And specify output file name (.csv extension required)
## File name irrelevant if output_occs not set to TRUE
output_occs <- FALSE
out_file_name <- "Chad_vascular_plant_occurrences.csv"

Define Functions

Now we need to create a couple of functions. The first will read in a vector of species names, and will output a cleaned-up version of this vector. Specifically, it will only keep the genus and species names, and will ensure that the genus name is capitalized (and that the species name is not). It will also get rid of any additional information such as naming authority, variety, subspecies, etc.

clean_sp <- function(s) {
  
  ## Replace underscores with spaces, and convert everything to lowercase
  s <- gsub("_", " ", s, fixed = TRUE)
  s <- tolower(s)
  
  ## Split each name on spaces and keep only the first two words
  ## (the genus and specific epithet)
  l <- strsplit(s, " ", fixed = TRUE)
  l <- lapply(l, function(x) {paste(x[1], x[2], sep = " ")})
  cleaned <- unlist(l)
  
  ## Capitalize the first letter of the genus name
  substr(cleaned, 1, 1) <- toupper(substr(cleaned, 1, 1))
  
  return(cleaned)
}
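As a quick check, here is what the function returns for a few of the names from our data frame (output shown beneath the call):

clean_sp(c("Pica pica L.", "Arachis hypogaea var. Georgia-12Y", "ZEA MAYS"))

[1] "Pica pica"        "Arachis hypogaea" "Zea mays"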

The second function will take a species that we choose as input, as well as a taxonomic rank of our choosing (either kingdom, phylum, class, order, family, or genus). The GBIF functions that we use below require a “key” (which is an integer) for identifying a specific taxonomic rank. We can get these keys using the rgbif function name_backbone(). For example, if we want to find all occurrences of vascular plants (phylum Tracheophyta) within a country, the easiest way to get the GBIF key for Tracheophyta is to look up an example plant species that we know is in this phylum using the name_backbone() function, and then extract the key from it. The phylum key for Tracheophyta happens to be 7707728 – it doesn’t exactly roll off the tongue. The function tax_key() below is a wrapper for name_backbone() which allows us to easily look up a rank key for any given species.

tax_key <- function(species, rank) {
  
  ## Make sure a valid taxonomic rank was supplied
  if(!rank %in% c("kingdom", "phylum", "class", "order", "family", "genus")) {
    stop("rank argument requires value of 'kingdom', 'phylum', 'class', 'order', 'family', or 'genus'")
  }
  
  ## Look up the species in the GBIF backbone taxonomy, then extract
  ## the key for the requested rank (e.g. "phylumKey")
  tax <- name_backbone(species)
  key <- unlist(tax[grep(paste(rank, "Key", sep = ""), names(tax))])
  print(paste("Using GBIF key", key, "for", rank, tax[[rank]]))
  
  return(key)
}
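As a quick sanity check, calling this function with our model species should print the Tracheophyta key mentioned above:

key <- tax_key("Arabidopsis thaliana", "phylum")

[1] "Using GBIF key 7707728 for phylum Tracheophyta"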

Executable example

Now we’re finally able to run some code. First we’ll clean up our species names, obtain the key value for phylum Tracheophyta (using the species Arabidopsis thaliana as an example), and obtain the two-letter ISO 3166-1 country code for Chad:

## Clean up input vector of species names
species$clean_name <- clean_sp(species$sci_name)

## Obtain integer key for given species/rank combination
key <- tax_key(species = mod_sp, rank = rank)

## Obtain ISO 3166-1 two-letter country code ("TD" for Chad)
country_code <- rgb_country_codes(country)$code

The rgbif package includes three functions that can download data from GBIF servers:

  • occ_download() – performs exactly the same operation as a manual data download. Requires GBIF username and password (which can be obtained for free through the GBIF website)
  • occ_search() – allows for finer control of download queries. Can download data, metadata, or hierarchical trees
  • occ_data() – faster version of occ_search() optimized for downloading data only

For the rest of this example, we’ll focus on the latter two functions, though using occ_download() is a good idea if you will be grabbing a large amount of data.

Before we download any occurrence data, it is always a good idea to check on how many results there are for our desired search query. We can do this using the occ_search() function, by telling it to return metadata, and setting the limit value to 0. You’ll notice below that I’m calling occ_search() in an unusual fashion, by using the function do.call(). do.call() allows us to specify a function, and then supply its arguments as a list. The reason I’m doing this has to do with our key value for phylum Tracheophyta. Normally if I called occ_search() and wanted to only select this phylum, I would need to supply the argument manually:

occ_search(phylumKey = 7707728, …).

However, if on another occasion I wanted to restrict my search to the family Brassicaceae (key 3112), I would need to use the argument “familyKey”:

occ_search(familyKey = 3112, …)

This means that I would have to manually change the function call every time I search by a different taxonomic rank. The tax_key() function that we defined earlier returns a named integer for the key, so we can supply its name (“phylumKey” in this example) in the list supplied to do.call() to avoid having to ever manually change the occ_search() call.
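If do.call() is unfamiliar, here is a minimal, generic illustration (separate from the GBIF workflow): the two calls below are equivalent, but in the first one the arguments – names and all – live in a list that can be built programmatically:

## A named element in the argument list becomes a named argument
args <- c(list("phylum", "Key"), list(sep = ""))
do.call(paste, args)     # equivalent to paste("phylum", "Key", sep = "")

[1] "phylumKey"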

So, in the block below I define a list, gbif_args, which contains some arguments that we will use repeatedly. Notice “key” is listed without any assignment (=), because it already contains both a name and a value. When I actually call the occ_search() function below, I append some additional arguments to return metadata, and to limit the number of occurrences to 0. Note that I also include a prompt so the user can select whether or not to proceed with the download in the next step. The script must be sourced (rather than pasted into the console line by line) for the prompt to work properly:

## Construct list of arguments for GBIF functions
gbif_args <- list(country = country_code,
                  key,
                  hasCoordinate = TRUE,
                  hasGeospatialIssue = FALSE,
                  curlopts = list(progress = TRUE))

## Find number of occurrences using specified arguments
n_occs <- do.call(occ_search, c(gbif_args, list(return = "meta", limit = 0)))$count
print(paste("Number of occurrences returned by specified parameters:", n_occs))
y_n <- readline(prompt = "Proceed with occurrences download? [y/n]")
if (y_n != "y") {
  stop("User declined to download occurrences, or did not supply a y/n response")
}

We can now finally download our species occurrence data for real. We have our choice of either the occ_search() or occ_data() functions. Of the two, occ_data() tends to be faster. Running the code below on my computer/internet connection, occ_search() downloaded 3,225 occurrences in 0.64 minutes, while occ_data() completed the same task in 0.24 minutes.

## Download occurrences into data frame using occ_search() function
start_time <- Sys.time()
occs <- do.call(occ_search, c(gbif_args, list(return = "data", limit = n_occs)))
end_time <- Sys.time()

print(paste("Elapsed time to download", nrow(occs), "occurences:", 
            round(difftime(end_time, start_time, units = "mins"), 2), "minutes"))

## Download occurrences into data frame using occ_data() function
## Note "return" argument is not used by occ_data()
start_time <- Sys.time()
occs <- do.call(occ_data, c(gbif_args, list(limit = n_occs)))$data
end_time <- Sys.time()

print(paste("Elapsed time to download", nrow(occs), "occurences:", 
            round(difftime(end_time, start_time, units = "mins"), 2), "minutes"))

Finally, we can compare the GBIF species occurrences with our own list of species. Some GBIF occurrences are only identified to the genus level, so I’ll remove these by deleting any occurrences with scientific names that don’t contain a space. While the species names in the GBIF data tend to be pretty clean, we might as well run them through our cleaning function to be thorough. While we’re at it, we might as well save the occurrence data to a file, though this step is optional.

occs <- occs[grep(" ", occs$name, fixed = TRUE), ]
occs$clean_name <- clean_sp(occs$name)

intersect(species$clean_name, occs$clean_name)

if (output_occs) {
  write.csv(occs, file = out_file_name, row.names = FALSE)
}

When we run intersect(), we find that the three plant species in our original dataframe (which were not at all cherry-picked) are indeed among the list of vascular plants that have been observed in Chad. Critically, the Eurasian magpie and tomato leafminer are not shown here, since they are not vascular plants (or plants of any kind for that matter).
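Based on the description above, the intersect() call should return something like the following (assuming GBIF has occurrence records for all three plants):

[1] "Arachis hypogaea" "Vigna vexillata"  "Zea mays"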

Going further

The rgbif package allows for many customized operations. For instance, we can isolate occurrences from a specific state or province. In this example, I count the number of occurrences of vascular plants in the U.S. state of Delaware:

## Count all occurrences in US state of Delaware
gbif_args <- list(country = "US",
                  stateProvince = "Delaware",
                  key,
                  hasCoordinate = TRUE,
                  hasGeospatialIssue = FALSE,
                  limit = 0,
                  return = "meta",
                  curlopts = list(progress = TRUE))
n_occs <- do.call(occ_search, gbif_args)$count

If country borders seem a little too arbitrary, you can also define a custom polygon in well-known text (WKT) format. I recommend this site for constructing the polygon (the site only seems to work with MS Internet Explorer or MS Edge). Below we search for vascular plants within a rectangle covering portions of Chad, Cameroon, and Nigeria:

## Find all occurrences within a polygon
library(wicket)
wkt <- "POLYGON((17.02 11.52, 10.47 11.03, 10.78 7.78, 17.20 8.57, 17.02 11.52))"
wkt <- wkt_correct(wkt)
gbif_args <- list(geometry = wkt,
                  hasCoordinate = TRUE,
                  hasGeospatialIssue = FALSE,
                  limit = 100,
                  curlopts = list(progress = TRUE))
occs <- do.call(occ_data, gbif_args)$data

Cautionary notes

This is a pretty powerful method for finding species occurrences within a specific region. However, there are a couple of rules you should keep in mind:

  1. Reduce, reduce, reduce. You want your queries to be as specific as possible while still being relevant. Otherwise you’ll end up wasting time downloading lots of irrelevant data. I recommend using the lowest taxonomic rank and smallest geographic region possible. In this example, we downloaded vascular plant occurrences from Chad. If I had searched for all vascular plant occurrences in China, or the United States, the download would take a long time, or would just fail outright. That brings us to the next rule:
  2. Be nice to GBIF. They are providing a very valuable free service. Please don’t use scripts like this to spam their servers with multiple huge download requests.

Anyways, that’s about it. Now you can search out species occurrence data to your heart’s content.

Uncategorized

Tissue Grinding for 96-well DNA Extraction

  • December 3, 2017 (updated June 17, 2018)
  • by brian_ward

Background

In a previous post I presented a simple protocol for collecting plant tissue for DNA extraction in a 96-well plate format. Later on we’ll get to the details of performing the DNA extraction. Before getting to that, we need to talk about the proper way of grinding plant tissue to get a nice, homogeneous powder. Like the subject of the previous post, this is a seemingly simple step that often goes underappreciated. I promise that I will get around to a protocol for performing an actual DNA extraction sometime soon.

Tissue grinding is one of those things that can soak up hours of time if it doesn’t go right. If samples don’t grind properly, you’ll have to spend a lot of time trying to shake them again and again; you may even need to remove tissue from some wells if they are overfilled, which takes time and opens you up to more risk of cross-contamination.

Clearly, it’s best to grind the tissue properly in one fell swoop, so you can be on your way to the extraction with no problems. The previous post was all about gathering samples in a highly consistent fashion to enable an easy grinding step. Depending on whether you had a lyophilizer at your disposal for the last step, your tissue should now either be dried, or else frozen.

Materials

  • 96-well deep-well plates (Axygen Scientific # P-9641-1S)

Reagents

  • Liquid nitrogen (if using frozen tissue)

Laboratory Equipment

  • Spex SamplePrep Genogrinder 2010
  • Styrofoam shipping cooler (if using frozen tissue)
  • Tongs (if using frozen tissue)

 

Protocol

Step 1: Label Racks

The deep-well plates that I listed above have wells that are wide enough to fit our microtiter tubes in, so we will essentially be using them as heavy-duty tube racks. For the rest of this post, I will refer to them as “racks” instead of plates to avoid confusion. The racks that the tubes come with are too flimsy for tissue grinding – after a round or two in the Genogrinder, they will likely self-destruct. For the DNA extraction, you’ll be using anywhere from one to four plates of samples (i.e. up to 384 samples). So, the steps are:

  1. Label your deep-well racks something like “Grind-1”, “Grind-2” etc., for however many you need.

NOTE: The Genogrinder requires two racks at once, due to its clamp design. So, if you only have a single plate of samples, you should still label two deep-well racks to be used for grinding (more on that below).

  2. Transfer the tubes into the racks that will be used for grinding. If you have a single plate of samples, split the tubes into two racks, spacing them evenly across each rack for uniform clamping inside the Genogrinder. This is one step that should convince you that the unambiguous labeling of tube strips that we performed in the previous post was a good idea.

 

Step 2: Deep Freeze

For lyophilized tissue

If you have lyophilized tissue, you may be tempted to put it into the grinder at room temperature. I’m here to tell you that it won’t work – although the tissue is dry, it is still too flexible to grind properly at room temperature. You’ll just end up with large pieces of leaf jammed into the bottoms or tops of the tubes. Despite this, I can’t emphasize enough how much easier it is to work with lyophilized tissue, because you can grind it easily after letting it sit in a -80° C freezer for a while. So, if you have lyophilized tissue, this part is easy:

  1. Place tubes in grinding racks in a -80° C freezer for a few hours or overnight.

For frozen tissue

If you’re using frozen tissue, -80° C probably won’t be cold enough for adequate grinding. Unfortunately this means you’ll need to use liquid nitrogen. So, the steps are:

  1. Pour an inch or so of liquid nitrogen from a vacuum flask into an adequately-sized styrofoam cooler (the square ones used as inserts for shipping refrigerated/frozen lab supplies work well)
  2. Use tongs to submerge the grinding racks holding the sample tubes in the liquid nitrogen for a minute or so

 

Step 3: Grinding

Whether you have used lyophilized or frozen tissue, the next steps are the same:

  1. Quickly transfer the racks into the Genogrinder (I probably shouldn’t have to mention this, but please use tongs and not your bare or gloved hands to do this if the racks are sitting in liquid nitrogen)
  2. Secure racks in place with the clamping mechanism and close the Genogrinder’s lid
  3. Grind the samples! I typically run the machine at 1,000 rpm for 1 minute.

After grinding, the samples should have turned into nice, pale-green flaky powder, like the first five tubes in the picture below:

Results of properly grinding plant tissue prior to DNA extraction
Properly ground wheat leaves

Epilogue, and Reflection

That’s all there is to it. Like I said at the beginning, grinding plant tissue is not rocket science, but doing it properly can save you hours of frustration. You can now:

  1. Put the sample tubes back in their original (flimsy) racks
  2. Place the samples back in a freezer for storage, OR
  3. Proceed directly to DNA extraction

One thing to note is that we can reuse the grinding racks – just check prior to each grinding to make sure that cracks aren’t forming in them. If they are getting cracked, just discard them and use a new set. In general, things tend to break when they are exposed to liquid nitrogen and then immediately shaken at high speeds – don’t be surprised if you need to replace racks often when using liquid nitrogen, or if you need to replace some cracked tube caps.

One final warning: any rack that goes into the tissue grinder should never go into a centrifuge later. Take it from me: you might end up really, really sorry if you ever do this. There’s nothing that can ruin your day quite like having a rack holding tubes full of chloroform shatter inside a spinning centrifuge.

Everyday Annoyances

Everyday Annoyances: String Formatting

  • November 12, 2017 (updated January 9, 2018)
  • by brian_ward

This post is intended to be the first in a series of everyday annoyances that one typically runs into when working with data. Today’s example should be familiar to anyone who has ever worked with data recorded by multiple people or groups of people. Let me give a little vignette to illustrate the problem. I’ll assume that you are working with data that contains a column with some sort of identifier for research subjects, genotypes, etc.

The Tyranny of Identifier Formatting

You have just received a set of data from one of your collaborators. Excited to see what it contains, you get to work compiling it with your own data. However, within a couple of minutes you realize that something is wrong. For instance, you try to perform a merge using the identifier columns – and find that nothing matches between the two sets of data. Or, you try binding the rows of the two data tables together, and you suddenly end up with about twice as many different IDs as you had anticipated. Your spirits begin to sink as you experience an acute wave of panic.

After taking a minute to cool off, you look at both sets of data side by side. You look up one particular ID – “OR_2016_47_3.” When you find the same ID in your collaborator’s data, you see the problem – it’s listed there as “OR-2016-47-3.” The mystery is solved – at some time in the past, some person or script decided to take matters into their own hands, deviating from the identifier formatting that everyone involved in the project (hopefully) agreed on at the outset.

The situation above is just one of those things that happens – more often than not – when you deal with data that has been worked over by somebody else. Luckily there’s an easy fix. We simply need a sort of disciplinarian function that we can sic on any vector or list of strings to enforce some order. Below are some examples written in both R and Python. I’m sure there might be some versions of these functions written into this or that R package or Python module that I’m not aware of, but it’s easy enough to make our own versions.

R example

The full R code snippet is available here.

Note that this function requires R version 3.2.0 or later, as it uses the trimws() function internally. It simply takes in a vector of strings, and for each element converts everything to uppercase, and then removes any leading and/or trailing whitespace. The alien-looking syntax in the gsub() call below uses regular expressions to convert several common word delimiters (space, dash, underscore, period) into one delimiter of the user’s choosing.

stand_str <- function(str_in, word_delim = "-") {
  
  ## Check for appropriate word delimiter selection
  if(!word_delim %in% c(" ", "-", "_", ".")) {
    stop('Please select one of the following for the output word delimiter:
         " " [space], ".", "-", "_"')
  }
  
  ## Convert everything to uppercase
  str_out <- toupper(str_in)
  
  ## Remove leading/trailing whitespace
  str_out <- trimws(str_out, which = "both")
  
  ## Swap out whitespace, dash, period, underscore for selected delimiter
  str_out <- gsub("\\s|-|\\.|_", word_delim, str_out)
  
  return(str_out)
}

So, running it on a vector consisting of some variations on the theme of “New York City”:

ny <- c("  New York City  ", "New-York_City", "new.york.city", "New_York.city")
stand_str(ny, "_")

[1] "NEW_YORK_CITY" "NEW_YORK_CITY" "NEW_YORK_CITY" "NEW_YORK_CITY"

Nice – our messy New York City strings are now all converted to a common format.

Python Example

Below is the same function written in Python, and the full code snippet that it comes from is available here.

def stand_strings(str_list, delim):
    
    ## Check for appropriate word delimiter selection
    if delim not in [' ', '-', '_', '.']:
        sys.exit('''Please select one of the following for the output word delimiter:
            " " (space), "-", "_", "." ''')
    
    ## Convert to uppercase, strip leading/trailing whitespace        
    str_list = [x.upper().strip() for x in str_list]
    
    ## Swap out whitespace, dash, period, underscore for selected delimiter
    str_list = [re.sub(r'\s|\.|-|_', delim, x) for x in str_list]
    return str_list

And here is the same example (though note that we have to import a couple modules for the function to work properly):

import re
import sys

ny = ['  New York City  ', 'New-York_City', 'new.york.city', 'New_York.city']
print(stand_strings(ny, "_"))

['NEW_YORK_CITY', 'NEW_YORK_CITY', 'NEW_YORK_CITY', 'NEW_YORK_CITY']

Conclusion – and Caution for R

That’s all there is to it. These functions aren’t anything fancy, but like I said I do end up using one or the other almost every time I receive data from somebody else. Getting in the habit of using these on all my data files has certainly saved me from many headaches. It would also be easy to create a version of these functions that takes user-inputted search and replace patterns, but 98% of the time, you’ll probably be dealing with one of the four different common word delimiters used in these functions.
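For illustration, here is a minimal sketch of what such a generalized R version might look like (the pattern argument is my own addition, not part of the functions above):

## Sketch: same cleanup, but with a user-supplied regular expression
## describing the delimiters to replace
stand_str_custom <- function(str_in, pattern = "\\s|-|\\.|_", word_delim = "-") {
  str_out <- trimws(toupper(str_in), which = "both")
  gsub(pattern, word_delim, str_out)
}

stand_str_custom("new/york/city", pattern = "/", word_delim = "_")

[1] "NEW_YORK_CITY"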

Do keep in mind that if your data is set up with identifiers as column headers, R will try to do its own formatting upon import if these names don’t live up to its standards. Specifically, it will add an “X” to the front of any column name that starts with a number, and it will convert spaces or dashes to periods. For instance, if we read in a .csv file with a column named “2017-New-York-City”, R will convert this to “X2017.New.York.City” unless we set check.names = FALSE in the call to the read.csv() function.
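As a concrete (hypothetical) example:

## Suppose nyc.csv has a header row containing "2017-New-York-City"
dat <- read.csv("nyc.csv")                       # name becomes "X2017.New.York.City"
dat <- read.csv("nyc.csv", check.names = FALSE)  # name stays "2017-New-York-City"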

In a future post, I may get a little more sophisticated by going over approximate or “fuzzy” string matching.

R coding

Multi-Panel Figures in R

  • November 9, 2017 (updated November 12, 2017)
  • by brian_ward

Background

This post will detail how to make some nice, (nearly) publication-ready, multi-panel figures in R. I want to stress that this isn’t a post intended for R beginners. I will be mixing some standard R code with tidyverse syntax, using the latter’s almighty pipe operator (%>%) and some left-to-right assignment (->).

As the title suggests, this post will be about that most dreaded part of preparing a manuscript for publication – making multi-panel figures. These are often an intimidating part of writing any article; if you’re too much of a perfectionist like me, it can be hard enough to make a single graph that looks just right, let alone making multiple graphs and fitting them all together.

When I first started using R, I was at a loss when it came time to put multiple graphs together. At first I would just create them individually, pasting them together afterwards in image editing software. Unfortunately this technique often leads to things that don’t quite look right – panels that don’t exactly line up, lines/points/labels that are different sizes, etc.

Eventually I learned about packages such as lattice and ggplot, which allow you to take the data stored within a dataframe, split it up into pieces based upon a factor variable, and then plot each piece. So, here’s an example using lattice and some car engine data from the famous Motor Trend dataset:

library(lattice)
data(mtcars)
xyplot(hp ~ disp | factor(cyl), pch = 16, data = mtcars)

The xyplot() function is plotting engine displacement on the x-axis against horsepower on the y-axis, and splitting the data up by number of cylinders. The result:

Plotting engine displacement, horsepower, and cylinders with the lattice package

Here’s how to make approximately the same thing using ggplot, and the output:

library(ggplot2)
data(mtcars)
ggplot(mtcars, aes(x = disp, y = hp)) +
  geom_point(color = "blue", fill = "blue") +
  facet_grid(~ cyl) +
  theme_bw()
Plotting engine displacement, horsepower, and cylinders with the ggplot package

Not too bad – the ggplot code is a little more involved, but offers nearly endless options for customization. I want to emphasize nearly, because there are certain things that ggplot and lattice can do with multiple panels, and certain things they can’t. If you’re ever using ggplot and you think to yourself “this graph would be great if I could just change the scale on only the left panel,” you’ll quickly learn that it’s relatively difficult to do. This is by design – “customized” multi-panel figures are one way in which people commonly try to cook up some significant-looking findings.

In the example below, I’ll show you how to mix and match different plots together. I want to emphasize that this is often a bad idea, but just like everything in life, there are exceptions. Ultimately it is up to you to decide when it is appropriate to use this technique. However, even if you are working with graphs that aren’t drastically different, this can be a good method to let you format and customize each one individually to get everything looking good. I’ll first go through the example, and at the end give my justification for why I think the practice is okay in this situation.

For the example, I will create a dataset consisting of 5 sunflower genotypes tested across 4 environments (props if you can figure out which state I pulled these town names from). I will simulate two traits for all genotype/environment combinations – plant height, and seed oil content. Then I’ll make genotype-by-environment interaction plots for both traits. The code will use ggplot in conjunction with the cowplot package to arrange plots into a nice, professional-looking grid.

Data Simulation

The first step is to create our simulated data. I’ll define a function, dat_sim(), and then use it to create a dataset “gxe”, consisting of the environment name, genotype name, and then the oil and height data. The dat_sim() function will create random data defined by both the standard deviation between environments, and the standard deviation between genotypes within environments. So, I’ll give the oil content trait some messy genotype-by-environment interaction – the trait variation will be much greater across environments than across genotypes. Conversely, I’ll give the height trait highly stable interaction, with trait variation across environments much less than the variation across genotypes:

library(tidyverse)
library(cowplot)
set.seed(99)

dat_sim <- function(n_genos, n_envs, grand_mean, env_sd, geno_sd) {
  vec <- abs(rnorm(n_envs, mean = grand_mean, sd = env_sd))
  out_list <- list()
  for (i in 1:length(vec)) {
    out_list[[i]] <- rnorm(n_genos, vec[i], sd = geno_sd)
  }
  return(do.call(c, out_list))
}

gxe <- data.frame("env" = sort(rep(c("Alpena", "Mellette", "Kranzburg", "Centerville"), 5)),
                  "geno" = rep(c("Zebulon", "Chocolate", "Sunbright", "Firecracker", "Greenburst"), 4),
                  "oil" = dat_sim(5, 4, 0.25, 0.08, 0.03),
                  "height" = dat_sim(5, 4, 180, 30, 10))

One thing you’ll notice is that the trait units here are completely different. The oil content is expressed as a fraction, and the height data is expressed in centimeters.

The Plotting Function

Now I’m going to create a quick-and-dirty plotting function that uses ggplot() internally. This function, plot_gen(), will have input arguments consisting of the gxe dataframe (df), the trait name (trait), and a label for the trait (ylabel). It will then perform a few steps:

  1. Find the mean trait value of all genotypes within each environment
  2. Sort the environments from the lowest mean to the highest mean
  3. Sort the dataframe by environment to match this ordering
  4. Plot a regression for each genotype across the environments

Keep in mind that we will be running this function separately for each trait.

plot_gen <- function(df, trait, ylabel) {
  
  ## Find the order of environment means
  group_by(df, env) %>%
    summarise_at(.vars = trait, .funs = mean) %>%
    arrange_(trait) %>%
    pull(env) %>%
    as.character() ->
    env_order

  ## This assigns the env variable's levels in the order of
  ## their mean values before sorting
  df$env <- factor(df$env, levels = env_order)
  df <- arrange(df, desc(env), geno)

  ## Plot regressions
  ggplot(df, aes_string(x = "env", y = trait)) +
    geom_smooth(aes(group = geno, color = geno), size = 1, method = "lm", se = FALSE) +
    xlab("Environment") +
    ylab(ylabel) +
    theme_bw() ->
    plot_out
  
  return(plot_out)
}

IMPORTANT: Notice that when I sort the dataframe with arrange() above, I am first sorting by environment, and then by genotype. In this case, this is a critical step. The genotype column in our dataframe must be sorted the same way for each plot that is created, because ggplot will read the genotype names in the order that it encounters them. If the ordering of this column doesn’t stay consistent, the coloring of the genotypes will change between plots. It doesn’t matter what order the environments are in, as long as the genotypes are always listed in the same order within each environment.

The method that I’m using – sorting environments by their mean values, and then performing a regression for each genotype – is similar to a method for more formally assessing genotype-by-environment interaction, first described by Finlay and Wilkinson (1963).
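As an aside, here is a minimal sketch (not from the original post) of what that more formal approach might look like for the oil trait: regress each genotype’s values on the environment means, where a slope near 1 indicates average stability:

## Mean oil content per environment
env_means <- tapply(gxe$oil, gxe$env, mean)
gxe$env_mean <- env_means[as.character(gxe$env)]

## One regression slope per genotype
fw_slopes <- sapply(split(gxe, gxe$geno),
                    function(d) coef(lm(oil ~ env_mean, data = d))[2])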

The advantage to doing our plotting inside a function is that it makes it easy to reproduce the same plot for different traits. So, to make the plot for oil content (our low-heritability trait), we can just write:

plot_gen(gxe, "oil", "Oil Content")

Which gives us:

Sunflower oil content genotype-by-environment interaction plot

Nice. Most of the genotypes are behaving alike across environments. There’s just that one troublemaker – the genotype ‘Greenburst’ (which conveniently got colored green), which is exhibiting strong crossover interaction with all the other genotypes.

Assembling the Plot

Making our oil content plot is all well and good, but now we can get fancier by plotting the two traits together. This is where the cowplot package comes into play. It contains a function, plot_grid(), which allows us to easily smoosh together plots by supplying them as arguments. You can also specify the layout, and give each plot labels. So, to plot our two traits together, it’s as simple as:

plot_grid(plot_gen(gxe, "oil", "Oil Content"),
          plot_gen(gxe, "height", "Plant Height (cm)"),
          ncol = 2,
          labels = c("A", "B"))

The output:

An early attempt…

So… it’s not great. The big problem is that the legend appears twice. This is where it comes in handy to have all our plotting wrapped up in our plot_gen() function. First, to show an example of how we could tweak the plots themselves, I’ll rotate the x-axis labels 45 degrees, just for the heck of it. This just requires adding one additional theme() line to the end of our call to ggplot(), so that plot_gen() becomes:

plot_gen <- function(df, trait, ylabel) {
  
  ## Find the order of environment means
  group_by(df, env) %>%
    summarise_at(.vars = trait, .funs = mean) %>%
    arrange_(trait) %>%
    pull(env) %>%
    as.character() ->
    env_order

  ## This assigns the env variable's levels in the order of
  ## their mean values before sorting
  df$env <- factor(df$env, levels = env_order)
  df <- arrange(df, desc(env), geno)

  ## Plot regressions
  ggplot(df, aes_string(x = "env", y = trait)) +
    geom_smooth(aes(group = geno, color = geno), size = 1, method = "lm", se = FALSE) +
    xlab("Environment") +
    ylab(ylabel) +
    theme_bw() +
    theme(axis.text.x = element_text(angle = 45, hjust = 1)) ->
    plot_out
  
  return(plot_out)
}

NOTE that our added call to theme() comes after our use of theme_bw() on the line above it. If theme_bw() comes after, it will override any custom theme options that we have set – you can tell I know this by hard-won experience.

Now to take care of the real problem – the multiple legends. The cowplot package has another really handy function – get_legend(). As the name implies, this function pulls the legend out of a figure generated by ggplot. So, in the code block below, we will:

  1. Generate a dummy figure (just one of the plots that will be in the final figure)
  2. Pull the legend out of it using get_legend()
  3. Change the title of the legend from “geno” to “Genotype”
  4. Assemble the final multi-panel figure using our two traits and our extracted legend
  5. Label the two panels “A” and “B”

Note that we can also assign the portion of the overall plotting space that is taken up by each panel. So below, I am dividing the space up into fifths, and assigning each plot to 2/5 of the overall space, and the legend to 1/5.

dummy_plot <- plot_gen(gxe, "oil", "Oil Content") +
  scale_color_discrete(name = "Genotype")
legend <- get_legend(dummy_plot)

plot_grid(plot_gen(gxe, "oil", "Oil Content") + theme(legend.position = "none"),
          plot_gen(gxe, "height", "Plant Height (cm)") + theme(legend.position = "none"),
          legend,
          ncol = 3,
          rel_widths = c(2, 2, 1),
          labels = c("A", "B", ""))

Our final multi-panel figure:

Genotype-by-environment interaction plots for sunflower oil content (A) and plant height (B) measured across four environments

Discussion

That gets us pretty close to what we want. You’ll notice that not only are the units different for each plot, but the ordering of environments on the x-axes is different as well. If there are additional tweaks that we want, we can always easily modify the plot_gen() function.

I won’t go into the details here, but as a final thought, I always suggest saving images in svg format. This allows us to both scale the image to any size without loss in fidelity, and to further tweak things by hand in a program like Adobe Illustrator or Inkscape. A great example of an easy svg tweak appears in our final image above – you’ll notice that the x-axis is labeled with “Environment” on both plots. It would be nice to have that appear just once, centered below both plots. This is an easy fix with an svg image – we can just manually remove one of the “Environment” labels, and drag the other copy to where we want it.
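For reference, a minimal sketch of saving the assembled figure as an svg directly from R (this assumes the svglite package is installed; the file name and dimensions are arbitrary):

library(svglite)  # provides an svg-capable device for ggsave()

final_fig <- plot_grid(plot_gen(gxe, "oil", "Oil Content") + theme(legend.position = "none"),
                       plot_gen(gxe, "height", "Plant Height (cm)") + theme(legend.position = "none"),
                       legend,
                       ncol = 3,
                       rel_widths = c(2, 2, 1),
                       labels = c("A", "B", ""))

ggsave("gxe_figure.svg", final_fig, width = 10, height = 4)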

The graphs look pretty close to what I imagined – our messy trait (oil content) has some crossover genotype-by-environment interaction, while our more stable trait (plant height) exhibits nice, parallel lines in its interaction plot. I’m sure any plant breeder would prefer to work with the latter trait.

Now, as promised, my explanation as to why we did this in the first place. My justification is that in this case, we are more interested in the patterns of interaction for each of these two traits, rather than specific data points. To be more specific, it’s a matter of the scope of our inference. I could imagine generating this sort of plot if we were using a model that treated both genotype and environment as random effects. In that case, we would be more interested in characterizing some pool of possible genotypes and environments that we sampled in our experiment, rather than characterizing the specific genotypes or environments that we happened to use. (I use the term pool instead of population here, because I find the latter gets confusing when we start talking about populations in the genetic sense vs. the statistical sense).

With that, you now have everything you need to go forth and compile all manner of multi-panel figures in R – please just commit to using this power wisely.

References

Finlay, K.W., and Wilkinson, G.N. (1963). The analysis of adaptation in a plant-breeding programme. Australian Journal of Agricultural Research 14(6): 742–754.

 

Lab Protocols

Sampling Plant Tissue for 96-well DNA Extraction

  • November 4, 2017 (updated June 17, 2018)
  • by brian_ward

Background

Today I’m going to present a method of collecting plant tissue for DNA extraction. This method is intended for performing extractions in a 96-well format, using a tissue grinder and ball bearings for homogenization prior to extraction. Prior to graduate school, I had a job in which I performed low-throughput extractions, mostly using individual microcentrifuge tubes. However, in graduate school I worked on plant breeding projects that required genotyping hundreds of individuals. Clearly the single-tube method wouldn’t cut it anymore. Ultimately, the extracted DNA will be used in either 96-well or 384-well plates, so it’s easiest to collect tissue in this format to begin with.

In additional posts I’ll outline procedures for grinding the tissue samples and performing the extraction itself using a relatively low-cost method. However, before getting to that, I need to touch on an important topic: harvesting tissue for the extractions. If you’ve ever done plant DNA extractions, you’re probably wondering why I’m bothering with this topic. It’s pretty easy to sample tissue from plants – you just snip off some tissue, put it in a tube, freeze it for future use, and that’s that. There are the usual extra complexities if you will be performing RNA extractions, but here we will specifically be focusing on DNA.

Harvesting tissue might sound like a trivial step, but it’s important to talk about because it can cause a lot of problems if you start performing extractions on hundreds or thousands of samples. Grinding won’t work for tubes that contain too much tissue, meaning that you’ll either have to live without DNA for those samples, or else try to remove excess tissue, risking cross-contamination. Without further ado, on to the protocol (after a quick disclaimer):

Disclaimer

I present some catalog numbers for products below. I am not paid by these companies; these are products that I have successfully used in the past (sometimes after considerable trial and error). Other companies may market similar products that are just as effective.

Materials

  • 96-cell seed germination trays
  • 96-well format 1.2ml microtiter 8-tube strips with racks – sterile (Corning #4413) or non-sterile (Corning #4412)
  • Caps for tubes above (Corning #4418)
  • Clippers or scissors
  • Paper towels or Kimwipes
  • Ball bearing dispenser for 96-well plates (Spex SamplePrep #2100)
  • 4mm Stainless Steel Ball Bearings (Spex SamplePrep #2150)

Note: The ball bearings listed above are autoclavable. If you don’t need sterile ball bearings for tissue grinding, a much cheaper alternative is to use BBs for pellet guns.

Reagents

  • Isopropyl or ethyl alcohol (preferably not denatured) for wiping down scissors/clippers between each sample

Lab Equipment

  • (Optional but recommended) laboratory lyophilizer

Protocol

  0. Yes, step zero. First, of course, you need to grow the plants in 96-cell germination trays that mirror the plates that the tissue will eventually go into. I prefer to use potting soil, because it allows you to keep the seedlings alive for some time (more on that in step 4 below). Some people simply use cotton balls or other cheap germination media. With eudicot species I imagine you could wait until after the cotyledons unfurl to sample tissue. With wheat, I typically wait until seedlings hit the two-leaf stage:
Wheat seedlings in 2-leaf stage, ready for tissue sampling
Wheat seedlings ready for tissue sampling
  1. Before touching scissors to plant, you need to set up and label the plates. The tubes come in strips of 8, so each strip corresponds to a column of a plate (tubes will usually come set up in this orientation in the rack). Here’s a picture of the general kind of product we’re talking about (this is a picture of Simport T100 Biotubes – I haven’t personally tried this product but it looks promising).

 

8-strip cluster tubes with rack

So, to prepare the tubes for grinding, you need to:

  1. Use the bearing dispenser to place bearing(s) into each tube – I usually use two bearings for some extra grinding heft.
  2. Label each plate (the plastic rack holding the tubes) with whatever plate identification scheme you’re using.
  3. Also label the strip tubes themselves – otherwise if they were to fall out of their racks, you would have no way of telling them apart. To unambiguously label each strip, you need to give it a plate identifier, column identifier, and some way to tell which end is the top row (row A), and which is the bottom row (row H). If I am working on a project with multiple plates, I will often give each column consecutive numbers across plates. So, for instance, the first plate will have columns 1-12, the second 13-24, and so on. I label both ends of each strip, so that, for instance, the top (row A) of column 2 looks like:
Top of strip-tube for column 2 labeled "2A"
Labeling top of strip-tube for column 2

and the bottom (row H) looks like:

Bottom of strip-tube for column 2 labeled "H"
Bottom row of strip tube
IMPORTANT: If you have an odd number of tube strips overall, you’ll need to add an extra dummy strip. This should be set up exactly like the other strips (i.e. with ball bearings), and you can label it “D” or “N”. If we have an odd number of tube strips, we will need this extra one in order to keep our plates balanced when the tubes eventually get centrifuged during the DNA extraction.

 

  2. Now that the tubes are labeled and are holding the bearings, it’s time to take the tissue samples. The point here is to take a moderate amount of tissue, and break it into a few pieces. Most importantly, you should strive to take roughly the same amount of tissue for each sample. For monocots, this is relatively easy. Folks working with dicots might need to do a bit more work to get tissue pieces to easily fit into the tubes. For wheat, I just cut a piece of leaf the length of my pinky:
Pinky-length section of wheat seedling leaf
An easy measure of length

Then cut it in half, and cut each of those halves into halves:

Wheat seedling leaf cut into quarters
Quartered leaf

Then just place the four leaf pieces into the tube, and move on to the next sample:

Wheat tissue ready for DNA extraction
Some wheat tissue samples, capped and ready for freezing

I wipe down the scissors/clippers with alcohol between each sample to try and minimize cross-contamination.

  3. If you will be freezing the tissue, then you can place the caps onto each strip as you go. If you are able to lyophilize the tissue, then leave the caps off. You can then place the tubes in the lyophilizer for about 48 hours, giving you nice, paper-dry tissue. I then cap the tubes and store the lyophilized tissue in the fridge, though it should be able to safely stay at room temperature for a long time if fridge space is tight.
  4. It’s also a good idea to save extra tissue in case something goes wrong during an extraction. There are three ways to do this:
    • Keep the plants alive until extractions are completed (easiest, but requires potting mix and may be impractical if extractions won’t be performed for a while)
    • Collect two sets of tissue samples (you might have guessed… this takes twice the effort and materials compared to taking single samples)
    • Save extra seed of all lines (sometimes extra seed isn’t available, and you’ll waste a week or two waiting for seeds to germinate for any repeat extractions)

Epilogue

That’s really all there is to it. You now have nice, consistent tissue samples ready for DNA extraction.

Observations

Inaugural Post for Halloween

  • October 31, 2017
  • by brian_ward

I’ll start off this blog with a post that is not really related to anything scientific – just some observations that I’ve had upon moving to the south. I was born and raised in New Hampshire, but now reside in the Research Triangle region of North Carolina. A couple of weeks ago I was back up north for a few days, during the peak of foliage season. It rained throughout most of the trip, but I managed to get this picture while walking through the woods with some friends during a brief break in the weather.

As I write this, I am back in North Carolina, and it’s the day before Halloween. All of the festivities and decorations are in full swing, but I can’t help but feel like I am trapped outside of time, in a season that feels something like early September. The trees here still have quite a bit of foliage, most of it green. In New England, leaves are already dropping at this time of year, ushering the region into the glorious, somber, and altogether eerie month of November. It’s a time of year that evokes hauntings and deceased ancestors, with encroaching darkness and plummeting temperatures – no wonder people’s thoughts turned to spirits, demons, and monsters. Perhaps it was integral to the survival of the Puritan settlers of the region, sobering them up for the harsh realities of the five-month-long winter ahead. I remember trick-or-treating as a child with snow flurries falling and thin skins of ice forming on puddles on the road.

In retrospect, it’s probably better that children don’t spend hours walking outside in sub-freezing temperatures wearing poorly-insulated costumes. Still, I can’t help but feel a little lost at this time of year without that profound elegiac feeling of an ending hanging in the air. Maybe it’s a bit strange to look forward to an annual period of desolation and austerity, but to me, it’s just as natural as breathing.
