Skip to content
Brian Ward, Ph.D.
  • Home
  • Blog
  • CV
  • Contact
R coding

Find species occurrences within a region

  • December 30, 2017January 6, 2018
  • by brian_ward

Over the past few weeks I have been negligent in my duties to this website; I attribute my lapse in activity (of all kinds) to the holidays. However, I haven’t been totally sedentary. I was just home in New Hampshire for a few days, where we received 6 inches of fresh snow on Christmas day, on top of the foot or so that was already on the ground. I went out for a snowshoe with my parents and we were truly in a winter wonderland:

winter hunting cabin
The 2017-2018 winter in New Hampshire, off to a good start

However, the downside is that temperatures dropped precipitously after that, eventually staying in the vicinity of 0° F (-18° C) for the remainder of my trip. The moral of the story is that while it’s nice to visit New Hampshire in the winter, it’s also nice to be able to leave.

Anyways, I guess that’s enough about me. On to the primary topic of this post:

Problem

A coworker recently posed the following problem that she needed to solve in an extensible way:

Given a country X, find all species of weeds that have been reported there, and compare this list against a preexisting list containing the names of weed species of concern

In order to solve this problem, the first resource I would turn to is the Global Biodiversity Information Facility (GBIF). GBIF hosts a massive database that collects observations of species (referred to as “occurrences”) from all over the world; as of this writing the database contains roughly 1 billion species occurrences total.

Outline

We can break this problem down into a few simple steps:

  1. Clean our preexisting list of species names to match GBIF formatting. In the example I’ll use, we will have to get rid of some extra information included with the species names. One example is naming authorities – so for our purposes the gray wolf, first named by Linnaeus in 1758, should be written as Canis lupus instead of Canis lupus L. In addition, we will remove information regarding subspecies, varieties, and cultivars.
  2. Identify an appropriate dataset to download. We already know that we need to download species occurrences for a particular country, but downloading all occurrences from any given country could be far more information than we need. Therefore, we need to restrict the size of the download as much as is feasible.
  3. Initiate the download from GBIF servers
  4. Clean species names within the downloaded GBIF data. Generally the species names in GBIF datasets are formatted well, though it pays to also clean this data to ensure reliable matches with our own list.
  5. Compare our list of species names with the GBIF data
  6. [Optional] write out the downloaded occurrence data to a .csv file

R example

The script that this code is copied from is available here.

For the example, I will be using the excellent rOpenSci package rgbif, which, you may have guessed, provides several R functions for accessing GBIF data. If you don’t have rgbif installed, just open an R terminal and type:

install.packages("rgbif")

Input variables and data

Now that that is out of the way, let’s start by loading required packages, defining some variables, and simulating some data. We’ll use the country of Chad as an example. For our list of species, I am just going to create a data frame containing some various plants along with two animals with scientific names that I find humorous. Normally, our list of species would probably come from a preexisting .csv file that we read in. Finally, we need supply a model or type species, and a desired taxonomic rank of interest (more on that below).

library(rgbif)

## Country of interest
country <- "Chad"

## Species of interest - would normally obtain this from a pre-existing .csv file
species <- data.frame("sci_name" = c("Pica pica L.", 
                                     "Tuta absoluta (Meyrick)", 
                                     "Arachis hypogaea var. Georgia-12Y", 
                                     "Vigna vexillata (L.) A. Rich", 
                                     "ZEA MAYS"),
                      "common_name" = c("Eurasian magpie",
                                        "tomato leafminer",
                                        "peanut variety Georgia-12Y",
                                        "wild cowpea",
                                        "maize"))

## Model or type species, and taxonomic rank of interest
mod_sp <- "Arabidopsis thaliana"
rank <- "phylum"

## Select whether to output occurrence data (TRUE/FALSE)
## And specify output file name (.csv extension required)
## File name irrelevant if output_occs not set to TRUE
output_occs <- FALSE
out_file_name <- "Chad_vascular_plant_occurences.csv"

Define Functions

Now we need to create a couple of functions. The first will read in a vector of species names, and will output a cleaned-up version of this vector. Specifically, it will only keep the genus and species names, and will ensure that the genus name is capitalized (and that the species name is not). It will also get rid of any additional information such as naming authority, variety, subspecies, etc.

clean_sp <- function(s) {
  
  s <- gsub("_", " ", s, fixed = TRUE)
  s <- tolower(s)
  l <- strsplit(s, " ", fixed = TRUE)
  l <- lapply(l, function(x) {paste(x[1], x[2], sep = " ")})
  cleaned <- unlist(l)
  
  substr(cleaned, 1, 1) <- toupper(substr(cleaned, 1, 1))
  
  return(cleaned)
}

The second function will take a species that we choose as input, as well a taxonomic rank of our choosing (either kingdom, phylum, class, order, family, or genus). The GBIF functions that we use below require a “key” (which is an integer) for identifying a specific taxonomic rank. We can get these keys using the rgbif function name_backbone(). For example, if we want to find all occurrences of vascular plants (phylum Tracheophyta) within a country, the easiest way to get the GBIF key for Tracheophyta is to look up an example plant species that we know is in this phylum using the name_backbone() function, and then extract the key from it. The phylum key for Tracheophyta happens to be 7707728 – it doesn’t exactly roll off the tongue. The function tax_key() below is a wrapper for name_backbone() which allows us to easily look up a rank key for any given species.

tax_key <- function(species, rank) {
  
  if(!rank %in% c("kingdom", "phylum", "class", "order", "family", "genus")) {
    stop("rank argument requires value of 'kingdom', 'phylum', 'class', 'order', 'family', or 'genus'")
  }
  
  tax <- name_backbone(species)
  key <- unlist(tax[grep(paste(rank, "Key", sep = ""), names(tax))])
  print(paste("Using GBIF key", key, "for", rank, tax[[rank]]))
  
  return(key)
}

Executable example

Now we’re finally able to run some code. First we’ll clean up our species names, obtain the key value for phylum Tracheophyta (using the species Arabidopsis thaliana as an example), and obtain the two-letter ISO 3166-1 country code for Chad:

## Clean up input vector of species names
species$clean_name <- clean_sp(species$sci_name)

## Obtain integer key for given species/rank combination
key <- tax_key(species = mod_sp, rank = rank)

## Obtain ISO 3166-1 two-letter country code
country_code <- rgb_country_codes(country)$code

The rgbif package includes three functions that can download data from GBIF servers:

  • occ_download() – performs exactly the same operation as a manual data download. Requires GBIF username and password (which can be obtained for free through the GBIF website)
  • occ_search() – allows for finer control of download queries. Can download data, metadata, or hierarchical trees
  • occ_data() – faster version of occ_search() optimized for downloading data only

For the rest of this example, we’ll focus on the later two functions, though using occ_download() is a good idea if you will be grabbing a large amount of data.

Before we download any occurrence data, it is always a good idea to check on how many results there are for our desired search query. We can do this using the occ_search() function, by telling it to return metadata, and setting the limit value to 0. You’ll notice below that I’m calling occ_search() in an unusual fashion, by using the function do.call(). do.call() allows us to specify a function, and then supply its arguments as a list. The reason I’m doing this has to do with our key value for phylum Tracheophyta. Normally if I called occ_search() and wanted to only select this phylum, I would need to supply the argument manually:

occ_search(phylumKey = 7707728, …).

However, if on another occasion I wanted to restrict my search to the family Brassicaceae (key 3112), I would need to use the argument “familyKey”:

occ_search(familyKey = 3112, …)

This means that I would have to manually change the function call every time I search by a different taxonomic rank. The tax_rank() function that we defined earlier returns a named integer for the key, so we can supply its name (“phylumKey” in this example) in the list supplied to do.call() to avoid having to ever manually change the occ_search() call.

So, in the block below I define a list, gbif_args, which contains some arguments that we will use repeatedly. Notice “key” is listed without any assignment (=), because it contains both a name and a value. When I actually call the occ_search() function below, I append on some additional arguments to return metadata, and to limit the number of occurrences to 0. Note that I also include a prompt so the user can select whether or not to proceed with the download in the next step. The script must be sourced (i.e. not run interactively) for it to kick in:

## Construct list of arguments for GBIF functions
gbif_args <- list(country = country_code,
                  key,
                  hasCoordinate = TRUE,
                  hasGeospatialIssue = FALSE,
                  curlopts = list(progress = TRUE))

## Find number of occurences using specified arguments
n_occs <- do.call(occ_search, c(gbif_args, list(return = "meta", limit = 0)))$count
print(paste("Number of occurences returned by specified parameters:", n_occs))
y_n <- readline(prompt = "Proceed with occurrences download? [y/n]")
if (y_n != "y") {
  stop("User declined to download occurrences, or did not supply a y/n response")
}

We can now finally download our species occurrence data for real. We have our choice of either the occ_search() or occ_data() functions. Of the two, occ_data() tends to be faster. Running the code below on my computer/internet connection, occ_search() downloaded 3,225 occurrences in 0.64 minutes, while occ_data() completed the same task in 0.24 minutes.

## Download occurences into data frame using occ_search() function
start_time <- Sys.time()
occs <- do.call(occ_search, c(gbif_args, list(return = "data", limit = n_occs)))
end_time <- Sys.time()

print(paste("Elapsed time to download", nrow(occs), "occurences:", 
            round(difftime(end_time, start_time, units = "mins"), 2), "minutes"))

## Download occurences into data frame using occ_data() function
## Note "return" argument is not used by occ_data()
start_time <- Sys.time()
occs <- do.call(occ_data, c(gbif_args, list(limit = n_occs)))$data
end_time <- Sys.time()

print(paste("Elapsed time to download", nrow(occs), "occurences:", 
            round(difftime(end_time, start_time, units = "mins"), 2), "minutes"))

Finally, we can compare the GBIF species occurrences with our own list of species. Some GBIF occurrences are only identified to the genus level, so I’ll remove these by deleting any occurrences with scientific names that don’t contain a space. While the species names in the GBIF data tend to be pretty clean, we might as well run them through our cleaning function to be thorough. While we’re at it, we might as well save the occurrence data to a file, though this step is optional.

occs <- occs[grep(" ", occs$name, fixed = TRUE), ]
occs$clean_name <- clean_sp(occs$name)

intersect(species$clean_name, occs$clean_name)

if (output_occs) {
 write.csv(occs, file = out_file_name, row.names = FALSE)
}

When we run intersect(), we find that the three plant species in our original dataframe (which were not at all cherry-picked) are indeed among the list of vascular plants that have been observed in Chad. Critically, the Eurasian magpie and tomato leafminer are not shown here, since they are not vascular plants (or plants of any kind for that matter).

Going further

The rgbif package allows for many customized operations. For instance, we can isolate occurrences from a specific state or province. In this example, I count the number of occurrences of vascular plants in the U.S. state of Delaware:

## Count all occurences in US state of Delaware
gbif_args <- list(country = "US",
                  stateProvince = "Delaware",
                  key,
                  hasCoordinate = TRUE,
                  hasGeospatialIssue = FALSE,
                  limit = 0,
                  return = "meta",
                  curlopts = list(progress = TRUE))
n_occs <- do.call(occ_search, gbif_args)$count

If country borders seem a little too arbitrary, you can also define a custom polygon in well known text (WKT) format. I recommend this site for constructing the polygon (the site only seems to work with MS Internet Explorer or MS Edge). Below we search for vascular plants within a rectangle covering portions of Chad, Cameroon, and Nigeria:

## Find all occurences within a polygon
library(wicket)
wkt <- "POLYGON((17.02 11.52, 10.47 11.03, 10.78 7.78, 17.20 8.57, 17.02 11.52))"
wkt <- wkt_correct(wkt)
gbif_args <- list(geometry = wkt,
                  hasCoordinate = TRUE,
                  hasGeospatialIssue = FALSE,
                  limit = 100,
                  curlopts = list(progress = TRUE))
occs <- do.call(occ_data, gbif_args)$data

Cautionary notes

This is a pretty powerful method for finding species occurrences within a specific region. However, there are a few rules you should keep in mind:

  1. Reduce, reduce, reduce. You want your queries to be as specific as possible while still being relevant. Otherwise you’ll end up wasting time downloading lots of irrelevant data. I recommend using the lowest taxonomic rank and smallest geographic region possible. In this example, we downloaded vascular plant occurrences from Chad. If I had searched for all vascular plant occurrences in China, or the United States, the download would take a long time, or would just fail outright. That brings us to the next rule:
  2. Be nice to GBIF. They are providing a very valuable free service. Please don’t use scripts like this to spam their servers with multiple huge download requests.

Anyways, that’s about it. Now you can search out species occurrence data to your heart’s content.

Recent Posts

  • Julia vs. Python
  • What Software Should I Learn? Pt. II
  • What Software Should I Learn? Pt. I
  • Misconceptions About Genetics and Evolution in Pop Culture, Part II
  • Investigating COVID-19 From Your Couch

Recent Comments

    Archives

    • October 2021
    • March 2021
    • January 2021
    • June 2020
    • May 2020
    • January 2020
    • October 2019
    • February 2019
    • June 2018
    • February 2018
    • December 2017
    • November 2017
    • October 2017

    Categories

    • Everyday Annoyances
    • Lab Protocols
    • Observations
    • R coding
    • shell scripting
    • Uncategorized

    Meta

    • Log in
    • Entries feed
    • Comments feed
    • WordPress.org
    LinkedIn   ResearchGate   ORCID

    © 2017 Brian Ward
    Theme by Colorlib Powered by WordPress
     

    Loading Comments...