Skip to content
Brian Ward, Ph.D.
  • Home
  • Blog
  • CV
  • Contact
R coding

Find species occurrences within a region

  • December 30, 2017January 6, 2018
  • by brian_ward

Over the past few weeks I have been negligent in my duties to this website; I attribute my lapse in activity (of all kinds) to the holidays. However, I haven’t been totally sedentary. I was just home in New Hampshire for a few days, where we received 6 inches of fresh snow on Christmas day, on top of the foot or so that was already on the ground. I went out for a snowshoe with my parents and we were truly in a winter wonderland:

winter hunting cabin
The 2017-2018 winter in New Hampshire, off to a good start

However, the downside is that temperatures dropped precipitously after that, eventually staying in the vicinity of 0° F (-18° C) for the remainder of my trip. The moral of the story is that while it’s nice to visit New Hampshire in the winter, it’s also nice to be able to leave.

Anyways, I guess that’s enough about me. On to the primary topic of this post:

Problem

A coworker recently posed the following problem that she needed to solve in an extensible way:

Given a country X, find all species of weeds that have been reported there, and compare this list against a preexisting list containing the names of weed species of concern

In order to solve this problem, the first resource I would turn to is the Global Biodiversity Information Facility (GBIF). GBIF hosts a massive database that collects observations of species (referred to as “occurrences”) from all over the world; as of this writing the database contains roughly 1 billion species occurrences total.

Outline

We can break this problem down into a few simple steps:

  1. Clean our preexisting list of species names to match GBIF formatting. In the example I’ll use, we will have to get rid of some extra information included with the species names. One example is naming authorities – so for our purposes the gray wolf, first named by Linnaeus in 1758, should be written as Canis lupus instead of Canis lupus L. In addition, we will remove information regarding subspecies, varieties, and cultivars.
  2. Identify an appropriate dataset to download. We already know that we need to download species occurrences for a particular country, but downloading all occurrences from any given country could be far more information than we need. Therefore, we need to restrict the size of the download as much as is feasible.
  3. Initiate the download from GBIF servers
  4. Clean species names within the downloaded GBIF data. Generally the species names in GBIF datasets are formatted well, though it pays to also clean this data to ensure reliable matches with our own list.
  5. Compare our list of species names with the GBIF data
  6. [Optional] write out the downloaded occurrence data to a .csv file

R example

The script that this code is copied from is available here.

For the example, I will be using the excellent rOpenSci package rgbif, which, you may have guessed, provides several R functions for accessing GBIF data. If you don’t have rgbif installed, just open an R terminal and type:

install.packages("rgbif")

Input variables and data

Now that that is out of the way, let’s start by loading required packages, defining some variables, and simulating some data. We’ll use the country of Chad as an example. For our list of species, I am just going to create a data frame containing some various plants along with two animals with scientific names that I find humorous. Normally, our list of species would probably come from a preexisting .csv file that we read in. Finally, we need supply a model or type species, and a desired taxonomic rank of interest (more on that below).

library(rgbif)

## Country of interest
country <- "Chad"

## Species of interest - would normally obtain this from a pre-existing .csv file
species <- data.frame("sci_name" = c("Pica pica L.", 
                                     "Tuta absoluta (Meyrick)", 
                                     "Arachis hypogaea var. Georgia-12Y", 
                                     "Vigna vexillata (L.) A. Rich", 
                                     "ZEA MAYS"),
                      "common_name" = c("Eurasian magpie",
                                        "tomato leafminer",
                                        "peanut variety Georgia-12Y",
                                        "wild cowpea",
                                        "maize"))

## Model or type species, and taxonomic rank of interest
mod_sp <- "Arabidopsis thaliana"
rank <- "phylum"

## Select whether to output occurrence data (TRUE/FALSE)
## And specify output file name (.csv extension required)
## File name irrelevant if output_occs not set to TRUE
output_occs <- FALSE
out_file_name <- "Chad_vascular_plant_occurences.csv"

Define Functions

Now we need to create a couple of functions. The first will read in a vector of species names, and will output a cleaned-up version of this vector. Specifically, it will only keep the genus and species names, and will ensure that the genus name is capitalized (and that the species name is not). It will also get rid of any additional information such as naming authority, variety, subspecies, etc.

clean_sp <- function(s) {
  
  s <- gsub("_", " ", s, fixed = TRUE)
  s <- tolower(s)
  l <- strsplit(s, " ", fixed = TRUE)
  l <- lapply(l, function(x) {paste(x[1], x[2], sep = " ")})
  cleaned <- unlist(l)
  
  substr(cleaned, 1, 1) <- toupper(substr(cleaned, 1, 1))
  
  return(cleaned)
}

The second function will take a species that we choose as input, as well a taxonomic rank of our choosing (either kingdom, phylum, class, order, family, or genus). The GBIF functions that we use below require a “key” (which is an integer) for identifying a specific taxonomic rank. We can get these keys using the rgbif function name_backbone(). For example, if we want to find all occurrences of vascular plants (phylum Tracheophyta) within a country, the easiest way to get the GBIF key for Tracheophyta is to look up an example plant species that we know is in this phylum using the name_backbone() function, and then extract the key from it. The phylum key for Tracheophyta happens to be 7707728 – it doesn’t exactly roll off the tongue. The function tax_key() below is a wrapper for name_backbone() which allows us to easily look up a rank key for any given species.

tax_key <- function(species, rank) {
  
  if(!rank %in% c("kingdom", "phylum", "class", "order", "family", "genus")) {
    stop("rank argument requires value of 'kingdom', 'phylum', 'class', 'order', 'family', or 'genus'")
  }
  
  tax <- name_backbone(species)
  key <- unlist(tax[grep(paste(rank, "Key", sep = ""), names(tax))])
  print(paste("Using GBIF key", key, "for", rank, tax[[rank]]))
  
  return(key)
}

Executable example

Now we’re finally able to run some code. First we’ll clean up our species names, obtain the key value for phylum Tracheophyta (using the species Arabidopsis thaliana as an example), and obtain the two-letter ISO 3166-1 country code for Chad:

## Clean up input vector of species names
species$clean_name <- clean_sp(species$sci_name)

## Obtain integer key for given species/rank combination
key <- tax_key(species = mod_sp, rank = rank)

## Obtain ISO 3166-1 two-letter country code
country_code <- rgb_country_codes(country)$code

The rgbif package includes three functions that can download data from GBIF servers:

  • occ_download() – performs exactly the same operation as a manual data download. Requires GBIF username and password (which can be obtained for free through the GBIF website)
  • occ_search() – allows for finer control of download queries. Can download data, metadata, or hierarchical trees
  • occ_data() – faster version of occ_search() optimized for downloading data only

For the rest of this example, we’ll focus on the later two functions, though using occ_download() is a good idea if you will be grabbing a large amount of data.

Before we download any occurrence data, it is always a good idea to check on how many results there are for our desired search query. We can do this using the occ_search() function, by telling it to return metadata, and setting the limit value to 0. You’ll notice below that I’m calling occ_search() in an unusual fashion, by using the function do.call(). do.call() allows us to specify a function, and then supply its arguments as a list. The reason I’m doing this has to do with our key value for phylum Tracheophyta. Normally if I called occ_search() and wanted to only select this phylum, I would need to supply the argument manually:

occ_search(phylumKey = 7707728, …).

However, if on another occasion I wanted to restrict my search to the family Brassicaceae (key 3112), I would need to use the argument “familyKey”:

occ_search(familyKey = 3112, …)

This means that I would have to manually change the function call every time I search by a different taxonomic rank. The tax_rank() function that we defined earlier returns a named integer for the key, so we can supply its name (“phylumKey” in this example) in the list supplied to do.call() to avoid having to ever manually change the occ_search() call.

So, in the block below I define a list, gbif_args, which contains some arguments that we will use repeatedly. Notice “key” is listed without any assignment (=), because it contains both a name and a value. When I actually call the occ_search() function below, I append on some additional arguments to return metadata, and to limit the number of occurrences to 0. Note that I also include a prompt so the user can select whether or not to proceed with the download in the next step. The script must be sourced (i.e. not run interactively) for it to kick in:

## Construct list of arguments for GBIF functions
gbif_args <- list(country = country_code,
                  key,
                  hasCoordinate = TRUE,
                  hasGeospatialIssue = FALSE,
                  curlopts = list(progress = TRUE))

## Find number of occurences using specified arguments
n_occs <- do.call(occ_search, c(gbif_args, list(return = "meta", limit = 0)))$count
print(paste("Number of occurences returned by specified parameters:", n_occs))
y_n <- readline(prompt = "Proceed with occurrences download? [y/n]")
if (y_n != "y") {
  stop("User declined to download occurrences, or did not supply a y/n response")
}

We can now finally download our species occurrence data for real. We have our choice of either the occ_search() or occ_data() functions. Of the two, occ_data() tends to be faster. Running the code below on my computer/internet connection, occ_search() downloaded 3,225 occurrences in 0.64 minutes, while occ_data() completed the same task in 0.24 minutes.

## Download occurences into data frame using occ_search() function
start_time <- Sys.time()
occs <- do.call(occ_search, c(gbif_args, list(return = "data", limit = n_occs)))
end_time <- Sys.time()

print(paste("Elapsed time to download", nrow(occs), "occurences:", 
            round(difftime(end_time, start_time, units = "mins"), 2), "minutes"))

## Download occurences into data frame using occ_data() function
## Note "return" argument is not used by occ_data()
start_time <- Sys.time()
occs <- do.call(occ_data, c(gbif_args, list(limit = n_occs)))$data
end_time <- Sys.time()

print(paste("Elapsed time to download", nrow(occs), "occurences:", 
            round(difftime(end_time, start_time, units = "mins"), 2), "minutes"))

Finally, we can compare the GBIF species occurrences with our own list of species. Some GBIF occurrences are only identified to the genus level, so I’ll remove these by deleting any occurrences with scientific names that don’t contain a space. While the species names in the GBIF data tend to be pretty clean, we might as well run them through our cleaning function to be thorough. While we’re at it, we might as well save the occurrence data to a file, though this step is optional.

occs <- occs[grep(" ", occs$name, fixed = TRUE), ]
occs$clean_name <- clean_sp(occs$name)

intersect(species$clean_name, occs$clean_name)

if (output_occs) {
 write.csv(occs, file = out_file_name, row.names = FALSE)
}

When we run intersect(), we find that the three plant species in our original dataframe (which were not at all cherry-picked) are indeed among the list of vascular plants that have been observed in Chad. Critically, the Eurasian magpie and tomato leafminer are not shown here, since they are not vascular plants (or plants of any kind for that matter).

Going further

The rgbif package allows for many customized operations. For instance, we can isolate occurrences from a specific state or province. In this example, I count the number of occurrences of vascular plants in the U.S. state of Delaware:

## Count all occurences in US state of Delaware
gbif_args <- list(country = "US",
                  stateProvince = "Delaware",
                  key,
                  hasCoordinate = TRUE,
                  hasGeospatialIssue = FALSE,
                  limit = 0,
                  return = "meta",
                  curlopts = list(progress = TRUE))
n_occs <- do.call(occ_search, gbif_args)$count

If country borders seem a little too arbitrary, you can also define a custom polygon in well known text (WKT) format. I recommend this site for constructing the polygon (the site only seems to work with MS Internet Explorer or MS Edge). Below we search for vascular plants within a rectangle covering portions of Chad, Cameroon, and Nigeria:

## Find all occurences within a polygon
library(wicket)
wkt <- "POLYGON((17.02 11.52, 10.47 11.03, 10.78 7.78, 17.20 8.57, 17.02 11.52))"
wkt <- wkt_correct(wkt)
gbif_args <- list(geometry = wkt,
                  hasCoordinate = TRUE,
                  hasGeospatialIssue = FALSE,
                  limit = 100,
                  curlopts = list(progress = TRUE))
occs <- do.call(occ_data, gbif_args)$data

Cautionary notes

This is a pretty powerful method for finding species occurrences within a specific region. However, there are a few rules you should keep in mind:

  1. Reduce, reduce, reduce. You want your queries to be as specific as possible while still being relevant. Otherwise you’ll end up wasting time downloading lots of irrelevant data. I recommend using the lowest taxonomic rank and smallest geographic region possible. In this example, we downloaded vascular plant occurrences from Chad. If I had searched for all vascular plant occurrences in China, or the United States, the download would take a long time, or would just fail outright. That brings us to the next rule:
  2. Be nice to GBIF. They are providing a very valuable free service. Please don’t use scripts like this to spam their servers with multiple huge download requests.

Anyways, that’s about it. Now you can search out species occurrence data to your heart’s content.

Uncategorized

Tissue Grinding for 96-well DNA Extraction

  • December 3, 2017June 17, 2018
  • by brian_ward

Background

In a previous post I presented a simple protocol for collecting plant tissue for DNA extraction in a 96-well plate format. Later on we’ll get to the details of performing the DNA extraction. Before getting to that, we need to talk about the proper way of grinding plant tissue to get a nice, homogeneous powder. Like the subject of the previous post, this is a seemingly-simple step that often goes underappreciated. I promise that I will get around to a protocol for performing an actual DNA extraction sometime soon.

Tissue grinding is one of those things that can soak up hours of time if it doesn’t go right. If samples don’t grind properly, you’ll have to spend a lot of time trying to shake them again and again; you may even need to remove tissue from some wells if they are overfilled, which takes time and opens you up to more risk of cross-contamination.

Clearly, it’s best to grind the tissue properly in one fell swoop, so you can be on your way to the extraction with no problems. The previous post was all about gathering samples in a highly consistent fashion to enable an easy grinding step. Depending on whether you had a lyophilizer at your disposal for the last step, your tissue should now either be dried, or else frozen.

Materials

  • 96 well deep-well plates (Axygen Scientific # P-9641-1S)

Reagents

  • Liquid nitrogen (if using frozen tissue)

Laboratory Equipment

  • Spex SamplePrep Genogrinder 2010
  • Styrofoam shipping cooler (if using frozen tissue)
  • Tongs (if using frozen tissue)

 

Protocol

Step 1: Label Racks

The deep-well plates that I listed above have wells that are wide enough to fit our microtiter tubes in, so we will essentially be using them as heavy-duty tube racks. For the rest of this post, I will refer to them as “racks” instead of plates to avoid confusion. The racks that the tubes come with are too flimsy for tissue grinding – after a round or two in the Genogrinder, they will likely self-destruct. For the DNA extraction, you’ll be using anywhere from one to four plates of samples (i.e. up to 384 samples). So, the steps are:

  1. Label your deep-well racks something like “Grind-1”, “Grind-2” etc., for however many you need.

NOTE: The Genogrinder requires two racks at once, due to its clamp design. So, if you only have a single plate of samples, you should still label two deep-well racks to be used for grinding (more on that below).

  1. Transfer the tubes into the racks that will be used for grinding. If you have a single plate of samples, split the tubes into two racks, spacing them evenly across each rack for uniform clamping inside the Genogrinder. This is one step that should convince you that the unambiguous labeling of tube strips that we performed in the previous post was a good idea.

 

Step 2: Deep Freeze

For lyophilized tissue

If you have lyophilized tissue, you may be tempted to put it into the grinder at room temperature. I’m here to tell you that it won’t work – although the tissue is dry, it is still too flexible to grind properly at room temperature. You’ll just end up with large pieces of leaf that are jammed into the bottom or tops of the tubes. Despite this, I can’t emphasize enough how much easier it is to work with lyophilized tissue, because you can grind easily it after letting it sit in a -80° C freezer for a while. So, if you have lyophilized tissue, this part is easy:

  1. Place tubes in grinding racks in a -80° C freezer for a few hours or overnight.

For frozen tissue

If you’re using frozen tissue, -80° C probably won’t be cold enough for adequate grinding. Unfortunately this means you’ll need to use liquid nitrogen. So, the steps are:

  1. Pour an inch or so of liquid nitrogen from a vacuum flask into an adequately-sized styrofoam cooler (the square ones used as inserts for shipping refrigerated/frozen lab supplies work well)
  2. Use tongs to submerge the grinding racks holding the sample tubes in the liquid nitrogen for a minute or so

 

Step 3: Grinding

Whether you have used lyophilized or frozen tissue, the next steps are the same:

  1. Quickly transfer the racks into the Genogrinder (I probably shouldn’t have to mention this, but please use tongs and not your bare or gloved hands to do this if the racks are sitting in liquid nitrogen)
  2. Secure racks in place with the clamping mechanism and close the Genogrinder’s lid
  3. Grind the samples! I typically run the machine at 1,000 rpm for 1 minute.

After grinding, the samples should have turned into nice, pale-green flaky powder, like the first five tubes in the picture below:

Results of properly grinding plant tissue prior to DNA extraction
Properly ground wheat leaves

Epilogue, and Reflection

That’s all there is to it. Like I said at the beginning, grinding plant tissue is not rocket science, but doing it properly can save you hours of frustration. You can now:

  1. Put the sample tubes back in their original (flimsy) racks
  2. Place the samples back in a freezer for storage, OR
  3. Proceed directly to DNA extraction

One thing to note is that we can reuse the grinding racks – just check prior to each grinding to make sure that cracks aren’t forming in them. If they are getting cracked, just discard them and use a new set. In general, things tend to break when they are exposed to liquid nitrogen and then immediately shaken at high speeds – don’t be surprised if you need to replace racks often when using liquid nitrogen, or if you need to replace some cracked tube caps.

One final warning: any rack that goes into the tissue grinder should never go into a centrifuge later. Take it from me: you might end up really, really sorry if you ever do this. There’s nothing that can ruin your day quite like having a rack holding tubes full of chloroform shatter inside a spinning centrifuge.

Recent Posts

  • Julia vs. Python
  • What Software Should I Learn? Pt. II
  • What Software Should I Learn? Pt. I
  • Misconceptions About Genetics and Evolution in Pop Culture, Part II
  • Investigating COVID-19 From Your Couch

Recent Comments

    Archives

    • October 2021
    • March 2021
    • January 2021
    • June 2020
    • May 2020
    • January 2020
    • October 2019
    • February 2019
    • June 2018
    • February 2018
    • December 2017
    • November 2017
    • October 2017

    Categories

    • Everyday Annoyances
    • Lab Protocols
    • Observations
    • R coding
    • shell scripting
    • Uncategorized

    Meta

    • Log in
    • Entries feed
    • Comments feed
    • WordPress.org
    LinkedIn   ResearchGate   ORCID

    © 2017 Brian Ward
    Theme by Colorlib Powered by WordPress
     

    Loading Comments...