Brian Ward, Ph.D.

Blog


What Software Should I Learn? Pt. I

  • January 17, 2021 (updated January 23, 2021)
  • by brian_ward

Linux and Shell Scripting

Photo by Jean van der Meulen from Pexels

If you are preparing to enter graduate school in the sciences, or have recently started a graduate program, you have probably become keenly aware that working with modern datasets requires the use of tools that are more powerful and more esoteric than Excel. One way or another, you will have to learn how to code, and it probably behooves you to learn this skill sooner rather than later. You might start off graduate school like me, having taken one intro to computer science class as an undergraduate, where I learned to write rudimentary programs in C. On the other end of the spectrum, you may come to graduate school with a fair amount of coding experience under your belt, looking to refine your skills and learn how to apply them to your chosen research subject.

Whatever the case, I think it is helpful to review the various applications for coding that you’re likely to run into in graduate school. My advice here is geared towards the life sciences, as my field of study falls within this domain. I won’t offer any specific advice on how exactly to learn a new language, because there isn’t any magical, easy way to go about it. You should learn how to code the same way you learn everything in school – by taking a class, either in-person or online. More importantly, during or after that class you should then learn the new coding language for real by working on a project that requires you to use it. At first you will stumble about, writing terrible code that gives you grotesque, misshapen output. Eventually you’ll progress to writing terrible code that gives you the correct output. Someday down the road you may write very good code that other people use in their research. Otherwise you’ll be like me, writing mediocre code that gets the job done. Either way, you will have learned a valuable skill that will make your future academic life easier.

What is Linux?

If your research project(s) will involve any degree of bioinformatic work, in particular the processing of sequencing data, learning how to use Linux will be absolutely essential. It's very unlikely that you haven't heard of Linux, but if that is the case, Linux is one of several "Unix-like" operating systems. Also included among the ranks of Unix-like operating systems is one you may not have heard of, BSD, and one you have heard of – MacOS. The Unix-like OSes are, of course, like Unix, which is an operating system that was first developed at Bell Labs starting in 1969. Linux itself is named after its developer, Linus Torvalds, and was first released in 1991. Since then, many thousands of different versions, or distributions, of Linux have been developed. Whatever distribution you choose (more on that below), you can open a terminal (actually technically a terminal emulator) and begin typing in commands by hand. The interface between the user and the operating system is called a shell. There are several different shells out there – sh, zsh, fish, and the most popular, bash. The magic of Linux and other Unix-like operating systems is that the commands that you would normally type out by hand can be combined into scripts – shell scripts – to automate the process.

What should I use the Linux shell for?

Why bother with learning how to use Linux, when Windows and MacOS dominate the personal computer market? In your mind, Linux and shell scripting should be associated with one type of data – large text files. This speaks both to the strengths and weaknesses of the shell. You see, the shell’s greatest weakness is that it treats everything it sees as strings (i.e. text). That means that if you type in something that you think of as a number, like “12,” the shell will treat it as text, unless you specifically tell it to interpret it as a number using some syntax that is frankly rather clunky. On the other hand, the shell, along with all its associated utilities, does its job of text processing very, very well.
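
As a tiny illustration (the variable name is arbitrary), here is how bash treats the same value as text unless you explicitly ask for arithmetic:

n="12"
echo "$n + 1"        # the shell happily prints the literal text: 12 + 1
echo $(( n + 1 ))    # arithmetic expansion forces a numeric interpretation: 13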

It just so happens that many of the files containing biological data that you encounter will be large text files. This isn’t by coincidence. These include:

  • Files of unaligned sequences, either with associated quality information (FASTQ) or without (FASTA). These are typically referred to as “formats” or “specifications,” but that is technically incorrect. There is no formal specification for how these files should be formatted. Rather, they represent sets of informal conventions that have been adopted by the community over time. (An example FASTQ record is shown just after this list.)
  • Sequence alignment files (SAM). Typically you would generate a SAM file by aligning the reads in a FASTQ or FASTA file against a reference sequence.
  • Genetic variant data stored in variant call format (VCF) files. These are typically generated by comparing multiple sequence alignment files and seeing where they differ.
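
For reference, a single FASTQ record spans four lines – an ID line starting with "@", the sequence, a separator line starting with "+", and a quality string the same length as the sequence. A made-up example:

@read1_id
AGTCCGCATAG
+
IIIIHHGGFFE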

You may have already encountered these types of files. They usually are compressed (often with the extension “.gz”), and if you try to open one with a text editor on your typical laptop it may promptly crash. Why does this happen? Because the file when uncompressed could easily be larger than the amount of memory on your computer. For instance, you may encounter a 4GB file with the extension “.fastq.gz.” However, when uncompressed this file could actually be 15GB. If you try to open it on a computer with 8GB of RAM, it will completely fill all available memory, bringing everything to a standstill. In Linux, many tools avoid this problem by streaming – reading an input file one line at a time, doing some operation on each line, and then writing it to an output file. Since these utilities only hold one line in memory at a time, they can process files of arbitrary size. It’s important to note that they are also quite fast – likely far faster than anything you could write yourself, since they have been developed over the course of several decades. The primary limiting factor for file operations becomes the read/write speed of the storage medium.
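
As a concrete example of streaming (the file name here is hypothetical), you can count the reads in an arbitrarily large compressed FASTQ file without ever holding more than a line or so in memory:

# each FASTQ record spans four lines, so the number of reads is the line count divided by 4
echo $(( $(gunzip -c my_reads.fastq.gz | wc -l) / 4 ))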

The ability of most Linux command-line utilities to handle very large files is one reason that the large computers you will encounter in your research – servers and high performance computing clusters (i.e. supercomputers) – will almost invariably run some kind of Linux distribution. They will also almost invariably have no graphical user interface. So when I talk about “learning how to use Linux,” I specifically mean learning how to work with the command line. If your experience is like that of many thousands of graduate students around the world, you may get some general guidance from your university’s IT department on how to log in to some particular server using a secure shell (ssh) interface. Upon logging in, you will be confronted by an empty black screen with a blinking cursor. What now?
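
Incidentally, that login step is usually a single command along the lines of the following (the user and host names here are made up):

ssh my_username@cluster.my-university.edu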

Some Simple Examples

Linux, like Unix, uses the input redirect (<), pipe (|), and output redirect (>) operators. Most command line utilities will by default read from standard input (the keyboard) and write to standard output (the screen), but these operators allow you to read input from a file, use the output of one command as the input of another, and send output to a file instead of the screen. The input redirect operator is used less frequently than the other two, because many command line utilities can be pointed directly to an input file to read from. For instance, say we had a gzip-compressed fasta file in the typical format of sequence IDs preceded by ">", sequence descriptions appearing after the first space in ID lines, and sequences as everything not preceded by ">". For example:

>sequence1_id sequence1_description
AGTCCGCATAG
>sequence2_id sequence2_description
GGTCACTTTGA
>sequence3_id sequence3_description
TTACTGGCAAA
...

If we wanted to, say, isolate the first 10,000 sequence IDs (without the “>” preceding them) and place them into a text file, we could write:

gunzip -c my_fasta_file.fa.gz |
    grep ">" -m 10000 |
    sed 's/>//' |
    sed 's/ .*$//' > my_fasta_ids.txt

Here we perform the following steps, using pipes to supply the output of one step as the input for the next:

  1. Decompress the gzipped fasta file using gunzip -c. The "-c" option tells gunzip to write the decompressed output to standard output (normally the screen) rather than replacing the file on disk; here we redirect that output to the next command using the pipe operator
  2. Use grep, which pulls out any line containing “>”. Using the -m (for max) option limits the output to the first 10,000 matches.
  3. Use the Unix stream editor program (sed) to reformat our sequence ID lines. Here we are using find/replace commands, written in the strange-looking nomenclature 's/find/replace/'. The "s" stands for "substitute," so in this first sed command we're replacing the string ">" with nothing (that is, just removing it).
  4. Perform a second sed replacement using a regular expression to replace everything after the first space (including the space) with nothing.
  5. Redirect our final output to my_fasta_ids.txt.

Notably, you can use this short command to isolate the first 10,000 sequence IDs from a fasta file containing 100,000 sequences, or 100,000,000 (as long as it can fit on your hard drive). Some, but not all, Linux systems will have the program zgrep installed, which is a version of grep that can directly read gzip-compressed files. In addition, we can supply sed with multiple find/replace expressions by using the -e option multiple times, so we could condense the above command to:

zgrep ">" -m 10000 my_fasta.fa.gz |
    sed -e 's/>//' -e 's/ .*$//' > my_fasta_ids.txt

Many bioinformatics programs are designed to process specific types of the files listed above, using syntax that is more or less identical to general shell syntax. So we have the program samtools for working with SAM files, and the program bcftools (or the older vcftools) for working with VCF files. We can do various piping/redirection operations using these, so for instance to output the chromosome, position, reference allele and alternate allele for all SNPs on chromosome 2 in a given VCF file, we could write:

bcftools view -v snps --regions chr2 my_input.vcf.gz |
    bcftools query -f '%CHROM\t%POS\t%REF\t%ALT\n' - > chr2_snp_info.txt

Here we are using bcftools view to isolate just the SNPs on chromosome 2, and then passing that off to bcftools query (the lonesome "-" here means "read from standard input," i.e. the pipe's output). The format string supplied to the -f option is then a very literal statement showing bcftools query how to write the output, separating our desired output fields by tabs ("\t") and signifying that we want each SNP to be listed on a separate line ("\n").
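
In the same spirit, here is a minimal samtools sketch (the file names are hypothetical and the quality cutoff is arbitrary) that filters, sorts, and indexes an alignment file in one pass:

# keep only alignments with mapping quality >= 20 and convert to BAM,
# then sort by coordinate (the trailing "-" tells samtools sort to read from standard input)
samtools view -b -q 20 my_alignments.sam |
    samtools sort -o my_alignments_sorted.bam -

# index the sorted BAM file for fast random access
samtools index my_alignments_sorted.bam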

Hardware for Learning Linux

You always have the option of learning Linux without any graphical interface, as you would on a university server or cluster, or else through a virtual machine. However, it can be a hassle to have to deal with these systems before you have the basics down. If you want to go a more local route, you have a few options:

  1. A PC running a Linux distribution. This is by far the best option. There are a huge number of distributions to choose from – some popular ones are Ubuntu (though it faces some criticism due to corporate practices of its parent company Canonical), Linux Mint (a very popular community-driven fork of Ubuntu), Debian, Zorin OS and Elementary OS (both good for people used to Windows or Mac), Manjaro, Gentoo, and the list goes on. I personally prefer Linux Mint. Working on a Linux computer will allow you to try out various commands and scripts on small test datasets before running larger jobs on servers or clusters. It's also much easier to learn how to use the command line locally, where you don't have to worry about constantly logging in to a remote computer, making other users mad by hogging resources, or else scheduling jobs like you would on a cluster. It's also probably easier than you think to obtain a computer running Linux. If you have an old laptop or desktop lying around that struggles to run Windows or MacOS, you can often install a lightweight Linux distribution and magically resurrect it. Some good distributions to consider in this case would be those using the Xfce desktop environment, such as Linux Mint Xfce edition, or those using the LXDE desktop environment, such as Lubuntu.
  2. A dual-boot computer. This is when you have two operating systems installed side-by-side on the same machine. This way you can have Windows or MacOS installed alongside a Linux OS. This might seem like the best of both worlds, but there are some caveats. Switching between the two OSes requires restarting the machine, which can become tedious. Also, setting up dual boot may void your warranty if it is still active.
  3. Use Windows 10 or MacOS. Traditionally, Macs were considered superior to Windows PCs if you wanted a popular, commercial operating system that somewhat resembled Linux. Recall that both Linux and MacOS are Unix-like. However, MacOS more closely resembles BSD than Linux, and it requires some tweaking to truly get it working more like Linux. Even then, it's never quite the same. By comparison, Windows was the odd one out, excepting a few solutions like Cygwin or virtual machines (VMs). However, that changed somewhat with the release of Windows 10 and the Windows Subsystem for Linux (WSL), which allows you to install one of several Linux distributions inside of Windows. The first version of WSL was a compatibility layer rather than a true VM, which made it quite a bit faster to start than a traditional VM, though still somewhat clunky. The newer WSL 2 runs a real Linux kernel inside a lightweight VM and has vastly improved performance over the original WSL. However, there is a catch – WSL 2 uses its own filesystem, and it can be quite slow when working with anything in the Windows filesystem on the computer. With the release of WSL 2, Windows has become more competitive with Mac when it comes to Linux functionality. However, if you're looking for a seamless experience for learning Linux, both Windows and Mac will still give you some headaches from time to time.

Whatever route you choose, the first bit of learning might be a little bit tough and frustrating. However, now that you can look up or ask any question on the internet, the task of learning Linux has become a lot easier. Soon enough you’ll be breezing through the command line and processing enormous files, surprising your advisor by getting through huge computational projects with relative ease.


Misconceptions About Genetics and Evolution in Pop Culture, Part II

  • June 8, 2020
  • by brian_ward
Incredible Hulk #1
Strength and durability: +5000%
Reading and comprehension: -8000%

Mutation

For part II of this series, we’ll tackle the big one – perhaps one of the most misunderstood natural phenomena in all of fiction. In writing, movies, television shows, video games and countless other media, genetic mutation is often viewed as something frightening or uncontrollable, frequently brought about through the abuse or misuse of science, or else through unfortunate accidents. For instance, genetic mutation turns the physicist Bruce Banner into the Hulk – a violent, simple-minded brute whose favorite hobby is smashing things.

In reality, mutation is neither a good thing nor a bad thing. It is simply a natural process, like the weather or plate tectonics. Mutation is the raw fuel for natural selection and evolution. It is the spontaneous change in DNA sequence over generations due to either environmental factors (e.g. exposure to ionizing radiation or mutagenic chemicals) or DNA replication errors. This last part is important – mutations would occur in a population over time even if it were magically shielded from any environmental factors that cause mutation. Most small-scale mutations are single nucleotide polymorphisms (SNPs), where a single nucleotide is changed, for instance changing the sequence ATTGCTCCA to ATTGATCCA. This can also include deletions: ATTGCTCCA to ATTGTCCA, and insertions: ATTGCTCCA to ATTGCATCCA. However, larger-scale mutations such as the deletion of entire genes are also possible.

How Does Fiction Get it Wrong?

In general, fictional accounts of genetic mutation fail to grasp one major concept: the difference between somatic mutations and germline mutations. Germline cells are the small number of cells in your body that are involved in passing your genes on to your progeny. They include eggs, sperm, the cells of the ovaries involved in egg formation (oogenesis), and the cells of the testicles involved in sperm formation (spermatogenesis). Because these cells pass genetic information on to offspring, mutations arising in them will potentially be passed on to every cell of an offspring’s body as it forms through a complex series of cell divisions following the fusion of an egg and sperm. In general, female mammals are born with all of their eggs already made (there is some debate on this though). Therefore mutations to the cells involved in oogenesis after birth will not be passed on to a woman’s offspring. This is not the case in males, where the cells involved in spermatogenesis are busy creating sperm throughout most of their life.

In contrast to germline cells, somatic cells make up nearly the entire mass of your body. While some are involved in the act of reproduction (e.g. the uterus in females, or the urethra in males), the defining characteristic of somatic cells is that they do not pass any genetic information on to an organism's progeny. This brings us to the primary error regarding mutation in fiction – mutations to somatic cells cannot affect an organism's body as a whole. For instance, a mutation occurring in a skin cell on your forearm due to ultraviolet light exposure is unlikely to give you any superpowers. In contrast, an organism can acquire a mutation in a germline cell, which is then passed on to all cells in their offspring. These two distinct outcomes of mutation continue to get confused in science fiction. Below we'll shed a little more light on the difference between the two.

The Effects of Somatic Mutation

Usually, mutations in somatic cells have no effect. Your genome contains a large amount of genetic material that doesn’t directly affect your phenotype. This includes repetitive elements – somewhat autonomous DNA sequences that either “copy and paste” or “cut and paste” themselves across the genome. It also includes genes that have been epigenetically silenced, and pseudogenes, which are genes that have lost their function (due to prior mutations). Mutations within any of these regions, and many others, will typically not have any measurable effect on a cell’s phenotype.

But what if a mutation occurs inside a gene?

Then things can get a little more interesting. There's still a very good chance that a mutation inside of a gene will have little or no effect. Genes are first transcribed, meaning that an RNA copy of the DNA sequence is created. Then the RNA transcript is translated into a protein (which is a chain of amino acids linked together). However, after transcription and before translation, portions of the RNA are removed. Since these regions, called introns, don't make it to the final protein product of the gene, mutations within them are often "silent."

But what if a mutation occurs in a coding region in a gene?

The regions of genes that are transcribed into RNA and then translated into proteins are called exons. Even if a mutation occurs in one of these regions, it may not have any effect. For one thing, the genetic code is degenerate. This doesn't mean that it's out making trouble (usually). Rather, the genetic code consists of three-letter "words," called codons. Each codon is translated into an amino acid – the building blocks of proteins mentioned above. So, for instance, the three codons CTTGTTGGC will be translated into the amino acid chain Leucine-Valine-Glycine. However, nearly all the amino acids are encoded by multiple codons. For instance, glycine is encoded by the codons GGC, GGT, GGA, and GGG, so any substitution of the last base of a codon starting with "GG" will not have any effect on the protein that is translated from the gene. This is what is referred to as a "synonymous" mutation.
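
If you like to see ideas as code, here is a minimal bash sketch (using just a handful of entries from the full 64-codon table) that translates the example sequence above three letters at a time; note that all four codons starting with "GG" map to the same amino acid:

# partial codon table as an associative array (requires bash 4+)
declare -A codon_table=(
    [CTT]=Leucine [GTT]=Valine
    [GGC]=Glycine [GGT]=Glycine [GGA]=Glycine [GGG]=Glycine
)

seq="CTTGTTGGC"
for (( i=0; i<${#seq}; i+=3 )); do
    printf '%s ' "${codon_table[${seq:i:3}]}"   # read the sequence three letters at a time
done
printf '\n'   # prints: Leucine Valine Glycine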

But what if a coding region mutation isn’t synonymous?

Now things start to get more interesting. A mutation that results in a change to an amino acid is referred to as a "missense" mutation. These mutations can have little or no effect, or major effects depending on how the amino acid swap changes the overall shape or function of the protein. In addition, there are two types of mutations that can have more dramatic effects. One is the formation of a premature stop codon. There are three codons that signal for translation to stop – TAG, TAA, and TGA. If a mutation converts, for instance, a cysteine codon (TGC) into the stop codon TGA, then it will result in a shortened protein, which can have highly significant effects. This is referred to as a "nonsense" mutation. Finally, the other more serious type of mutation is a frameshift. This occurs when an insertion or deletion of one or more bases completely changes the "words" (codons) of the genetic code in a coding sequence. To illustrate, here's an English sentence that happens to consist of three-letter words, like the genetic code: "The fat cat and big dog ate." If we insert an adenine (A) at the beginning of the sentence, but keep the word separations at the same place, we end up with: "Ath efa tca tan dbi gdo gat e." Basically, a frameshift mutation scrambles codons, resulting in the production of a completely different protein.

Mutations with Major Effects

As we have seen, mutations often have no beneficial or detrimental effect. In fact, the neutral theory of molecular evolution, proposed by Motoo Kimura in 1968, posits that the majority of evolution on the molecular scale is due to the genetic drift (the random change in allele frequencies over time) of mutations with neutral effects. However, there are cases where mutations can have major phenotypic effects. Germline mutations with major effects may be enriched in or purged from a population through natural selection. Highly deleterious mutations may be lethal to organisms, while highly beneficial mutations may grant an organism greater fitness, increasing the likelihood that the mutation will be passed on to future generations.

In somatic cells, we only really see the outcome of deleterious mutations. The cells making up your body must all "play by the rules." Specifically, they have to behave as part of a multicellular organism. Mutations that allow a cell to function beyond its very narrowly-regulated niche within the body can have very dangerous consequences. Cells have many contingencies to deal with such situations. Apoptosis is programmed cell death, whereby a cell is triggered, either internally or externally, to self-destruct. Many cells within your body undergo this process every day. There are many ways in which apoptosis is initiated; one way is in response to mutations that alter how a cell grows and multiplies. In this way, cells that pick up mutations that could allow them to stop playing nice with the other cells around them often automatically self-destruct. Unfortunately these mechanisms don't always work. Sometimes the very pathways meant to induce apoptosis become corrupted. These changes don't give us new superpowers – instead they may lead to the formation of cancerous cells.

Who Gets it Right

Given all the intricacies of how mutation works in reality, different works of fiction fall close to or wide of the mark. In the Marvel Universe, mutants are those born with mutations (presumably acquired as germline mutations by their parents or other ancestors), whereas mutates are characters who acquire their powers from some external force, energy, catastrophe, etc. So, comics that focus mostly on mutants, such as the X-Men, largely get the concept of genetic mutation correct. It's still not clear how mutations seem to give everyone big muscles or big breasts. Nor is it clear how they occasionally give people god-like powers. In contrast, comics about mutates, such as The Hulk, Spider-Man, or the Fantastic Four, have a less realistic depiction of the effects of mutations (relatively speaking), as we are supposed to believe that the exposure of a fully-developed human to some mutagen could alter their entire body. As we have seen, the most dramatic effect on somatic cells due to exposure to mutagens is cancer. This isn't quite as exciting a subject as heroes fighting monsters, though there is a genre of comics and graphic novels dealing with illness and medicine. Interestingly, there are some cases in the Marvel Universe where mutates later give birth to mutants. One example is the mutates Sue Storm and Reed Richards of the Fantastic Four, who give birth to the mutant Franklin Richards. It turns out Franklin is essentially a god – he can alter reality itself and create universes. It's not clear how the Marvel Multiverse still exists at all when children wield this much power.

Franklin Richards creates a universe
There goes the neighborhood universe…

Who Gets it Wrong

There are too many to list here – take your pick really. We already mentioned the mutates in the Marvel Universe. We could add to that an almost unlimited number of fictional characters, including many of the characters and monsters from the Resident Evil video game series, the supermutants from the Fallout video games, Joker from the Batman comic books, the Teenage Mutant Ninja Turtles – the list goes on and on.

Room for Artistic License

There is one potential mutagen that could more feasibly create mutations in a wide range of somatic cells – viruses, specifically retroviruses. These are viruses with RNA genomes which are reverse-transcribed into DNA copies, which are then inserted into the host cell genome. In this way, retroviruses naturally genetically modify their host cells. For this reason, retroviruses are the vectors of choice for gene therapy. Therefore it’s not too hard to imagine a scenario in which gene therapy is used not just to cure disease, but to grant some enhancement or even powers to humans. This ignores the many hurdles that have made successful gene therapy difficult to achieve, but it is at least within the realm of possibility. Interestingly, the human genome contains many inactive endogenous retroviruses. These were retroviruses which at one time infected germline cells in our ancestors, and have since been passed down through the generations. While retroviruses are a (somewhat) plausible method of creating body-wide mutations in an adult, we first need much more research into using them to cure disease, before we start using them to give people rockin’ bods and superpowers.


Investigating COVID-19 From Your Couch

  • May 12, 2020
  • by brian_ward
I need you to help fight COVID-19!

I don't need to remind you that the world has changed dramatically since the worldwide spread of the novel coronavirus that causes COVID-19; millions are suddenly unemployed, or else working at home. If you suddenly find yourself in one of these situations, this can be a good time to pick up some new coding and data science skills while helping to investigate the cause of our current crisis.

You’ll see many statistics and analyses regarding COVID-19 getting thrown around online, but you don’t necessarily have to take their word for it. Many large sets of COVID-19 data are publicly available. If you’re one of the many who suddenly finds themselves with a lot more free time on their hands, now could be a great opportunity to learn some new things while playing around with real-world data sets. It’s not likely that you’ll be generating insights beyond those of professional epidemiologists, but it never hurts to have more eyes on the data. Below is a list of a few compiled sources of COVID-19 information.

COVID-19 Open Research Dataset

The White House and a coalition of research institutions have made the COVID-19 Open Research Dataset (CORD-19) available for download. This dataset consists of 59,000 journal articles (47,000 with full text) on COVID-19, SARS-CoV-2 and other coronaviruses. This represents a remarkably large collection of freely accessible scientific literature, so if you’re looking for a large dataset to try learning some text mining techniques, this is as good a place to start as any.

Johns Hopkins COVID-19 Data Repository

The folks at the Johns Hopkins University Center for Systems Science and Engineering are the ones behind the popular ArcGIS global COVID-19 tracking map. The curated set of data that they use to produce this map is available here. This page has a handy list of all of the project’s data sources, including the WHO, CDC, and various international government statistics, in case you wanted to do more sleuthing on your own. This collection previously only contained case and death counts by country, with a few select U.S. regions delineated (e.g. New York City), but has recently been upgraded to also report county-level data like the New York Times data set mentioned below. This git repository is automatically updated on a daily basis, so don’t forget to git pull often.
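
If you have git installed, keeping a local copy current takes only a couple of commands (the URL below is the repository's address at the time of writing – check the project page if it has since moved):

git clone https://github.com/CSSEGISandData/COVID-19.git   # one-time download
cd COVID-19
git pull   # run periodically to pick up each day's update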

New York Times U.S. COVID Data

The New York Times has its own set of interactive graphics on the COVID-19 outbreak, and like the JHU site mentioned above, the company has made the backend data for these graphics freely available. This data set tracks reported COVID-19 cases and deaths at the nation, state, and county levels, and is updated on a daily basis.

Finding Ways to Spend Your Time

As you can see, there are several different sources of COVID-19 data to play around with. Many other services make use of one or more of these sources – for instance Tableau’s COVID-19 Data Hub includes a free trial for a starter workbook, utilizing the Johns Hopkins-compiled data mentioned above. This allows users to rapidly prototype different virus-related visualizations, in case you want to try playing around with data, without getting into too much hardcore coding.

If you are interested in learning more about coding, this could be a great time to jump into some free online training (maybe even if you wish to join the endangered, but very important, ranks of COBOL programmers). Probably of more interest to most people are free courses in more modern languages, such as Google's Python machine learning crash course. I do want to be clear here – it is okay to feel isolated, afraid or bored in these times, and it's okay to respond to this pressure by bingeing Tiger King on Netflix in a single day. However, if you decide you would like to try something new, there's always a need for more people who can find patterns underlying the phenomena that shape our world and society. If it makes us better prepared for the next pandemic, then all the better.


Frequent Misconceptions About Genetics and Evolution in Pop Culture

  • January 6, 2020
  • by brian_ward

Lately, the consolidation of the entertainment industry has gotten me thinking about all the ways in which genetics and evolution have historically been misrepresented in popular culture. With the popularization of nerd culture, we now see many old misconceptions long present in speculative fiction suddenly reaching a much wider audience. Whether or not you have any personal interest in the Marvel Cinematic Universe, the Star Wars Universe, or myriad other fictional universes is irrelevant – these intellectual properties have become embedded in the zeitgeist, and for the average consumer there is no escape. A child growing up today will encounter myriad fictional interpretations of biological principles, some quite closely related to reality, and some wildly divergent.

Many of the common misconceptions regarding genetics in fiction arise from ones that scientists themselves often held for years or decades. However, even after a misconception has been expunged from the scientific community, it may linger in the collective popular culture consciousness for many more years. Here I wanted to start a series where I explore some of the most common ways in which pop culture gets genetics and evolution wrong. Along the way we can look at how different pieces of fiction are either ignorant of the current scientific consensus, or how they address or circumvent it. We'll touch on topics such as:

  • Genes that influence traits
  • Mutation
  • The purpose of evolution
  • The direction of evolution
The Thing reacts after learning that his origin story was based on a misunderstanding of genetic principles

Before proceeding, I want to make one thing clear: this is all just for illustrative and educational purposes. Speculative fiction is interesting to us because of all the ways in which its settings don’t match with our world. To illustrate just how boring reality can be, here’s a more plausible retelling of the origins (or lack thereof) of the Incredible Hulk:

In the desert, mild-mannered physicist Bruce Banner saves a teenager who unwittingly wanders onto the test site of a nuclear bomb. In the process, his body is exposed to high levels of gamma radiation. Nothing happens. Thirty years later, he develops leukemia, which may or may not have been caused by his previous radiation exposure.

This is not the type of story that will cause comic books to leap off of the shelves into the hands of children – or adults for that matter. This is why science fiction should not necessarily attempt to be scientifically sound. Everyone needs some escapism now and then, after all. Now, without further ado, we can proceed on to our first major misconception:

Part 1: The Gene for X

Here’s something that you’ve probably seen dozens of times across many forms of media: the headline “Scientists discover gene for X.” In fact, you may have seen it so many times that you’ve subconsciously begun to expect that every trait or disease is controlled by a gene. These sorts of headlines typically arise from genome-wide association studies (GWAS), which, as we previously examined, seek to associate the phenotypic variation we see for a trait in a population with underlying genetic variation.

The problem with GWAS studies is that they are easily misinterpreted. While it is true that a well-designed GWAS study can often uncover genes influencing trait X, what is often not reported on is that variation in each of these genes may explain only a very tiny portion (think a few percent or less) of the phenotypic variation that is observed in the population. What’s more, even if we add up all the variation explained by all the significant genes identified in a GWAS study conducted on a very large sample, we still typically only explain a fraction of the variation that we know is due to genetics (i.e. the heritable portion of the phenotypic variation). The cause of this phenomenon is still being debated, and it is fittingly referred to as the missing heritability problem. A recent study examining human disease risk predicted by multiple GWAS studies for multiple diseases found that, on average, the heritability explained by SNPs is quite low.

To be clear here, there are many examples of genetic diseases that are caused by mutations in one or a few genes. Currently, the U.S. National Library of Medicine lists over 1,300 diseases with a genetic basis. For example, sickle cell anemia is caused by mutations within the HBB gene, which encodes the β-globin subunits of hemoglobin. Individuals who are homozygous for mutations of HBB (that is, who carry a mutated copy from both of their parents) will have a malformed version of hemoglobin, which in turn leads to malformed (sickle-shaped) red blood cells that get stuck in small capillaries, limiting blood and oxygen availability to certain tissues.

However, there are just as many if not more examples of traits which have far more complex genetic underpinnings. In a previous post, we talked about quantitative traits – those which are controlled by many different genes. An example would be human body weight. If this trait were controlled by a single gene, we would expect that all adult humans would have just a handful of different possible body weights. However, clearly this is not the case. An adult’s body weight can be any value over a range of several hundred pounds. What’s more, an individual’s body weight can obviously change over time. This brings us to another characteristic of quantitative traits – they are influenced by the environment to varying degrees. Many times there simply is no gene for trait X. Rather, it is more accurate to say that many traits are influenced by hundreds to thousands of genes, and their interactions with each other, and with the organism’s environment.

Although we now generally know better than to try to find single genes underlying complex traits, this doesn't stop research from being readily misinterpreted. Take, for instance, the recent GWAS study of sexual orientation which definitively stated that there is no single "gay gene" underlying same-sex sexual behavior, and the controversy that erupted when this study's data was immediately misused to create a "How Gay Are You?" app. During the Human Genome Project, many scientists expressed optimism that human traits would be "solved" to one extent or another – that we would finally have the key to unlocking the genetic puzzle underlying many of our characteristics and ailments. Twenty years and thousands of GWAS studies later, it is okay to admit that we remain humbled in light of the baffling complexity of the genetics influencing who we are.

Returning to the “gene for X” trope in fiction, it all begins to seem a bit silly talking about a gene for genius-level intellect, or telekinesis, or superhuman strength (actually there is a rare mutation for that one), when there doesn’t appear to be any particular gene underlying many of the traits that make us who we are. To see how far the gene for X trope can diverge from reality, we can look to the subject of transgenic organisms. Genetically-modified crop plants have been grown in parts of the world for several decades. These are typically produced by introducing one gene that does one specific thing into a plant from another organism – usually making the plant immune to a particular herbicide. But what if we could take this transgenic mixing and matching much further? That is where we cross into the realm of science fiction.

An Egregious Example

The 2015 blockbuster Jurassic World shares much of the implausible science of Jurassic Park, but without the work that Michael Crichton put in to give the latter its well-crafted illusion of authenticity. In it, theme park scientists, no longer content to merely resurrect extinct dinosaur species, create a new species of their own. The resulting Indominus rex is made through a process that would be the envy of any real-life synthetic biologist. Take some T. rex big n' scary genes, mix in some Velociraptor claw genes, add a dash of crocodile toothy grin genes. Get rid of all those pesky feather genes. Whisk until batter is smooth. (As an interesting aside, the "raptors" in Jurassic Park were originally modeled after the related, but distinct, Deinonychus antirrhopus. Both are in the family Dromaeosauridae, but Velociraptor and Deinonychus are separate genera within this family).

“I can speak five languages and solve complex systems of equations thanks to my Homo sapiens language and math genes, but all you see is a monster”

The Indominus rex displays many remarkable characteristics and abilities, which seem to be inherited from a general blending of the various species that make up its genome. The height of this wildly successful cookie baking experiment comes when the I. rex encounters a group of Velociraptors, and begins to hold a casual conversation with them (in roaring/screeching dinosaur speak). We should keep in mind that the I. rex was raised in isolation in its own pen, so this would be akin to a human being raised in solitary confinement encountering another human and immediately striking up a conversation about the weather. Evidently the raptor language genes (?) that the I. rex carries are quite potent – the biological equivalent of a universal translator.

It shouldn’t come as a surprise to anyone that this is all far from reality. It is true that a sort-of synthetic organism was recently created by the J. Craig Venter Institute. However, it is a monumental leap to go from engineering a bacterium to engineering a multi-ton dinosaur. There is also the question of why we should be in such a rush to create new species when we are rapidly eliminating the ones that already exist. Our dreams of creating a “clever girl” that could one day outsmart Robert Muldoon at his own game will have to wait a little longer. In the meantime, if you want to see dinosaurs in real life, you can always get a pair of binoculars, walk out into the woods, and go bird watching.


Multi-Environment Genomic Prediction Paper

  • October 21, 2019
  • by brian_ward
sheep looking down
“You’d better get those predictions right…”

In my last post (a very long time ago), I talked about a genome-wide association study (GWAS) that I recently worked on and published. I promised to talk about a second paper that came out around the same time, although I struggled to think about the best way to describe it. In the second paper, my coauthors and I performed genomic prediction (aka genomic selection) on the same set of phenotypic data. This method is sort of the anti-GWAS. Recall from the last post that a GWAS of highly polygenic traits may not be very useful, because each individual gene has a small contribution to phenotype. Instead of searching out genes that are affecting our trait of interest, with genomic prediction we throw up our hands and say "let's just use all the genes." Specifically, we use a set of markers spread across the genome, so that we can assume that every area of the genome that is influencing our trait of interest is in linkage with at least one marker. We then fit a model using data from a training set of individuals that have both genotypic and phenotypic data collected, and then generate predictions for a testing set of individuals that have only been genotyped with our marker set. Although the specifics vary based on the exact model used, in this method we typically ignore the functions of individual genes. Instead, we use the genetic markers to help us estimate the relationships between individuals in the training set and the testing set.

Why would you want to go through the trouble of doing this? Increasing computational power has enabled the adoption of routine predictive modeling across many fields of study and sectors of industry. In the case of plant breeding, genotypic data is now much cheaper to obtain than phenotypic data. Currently, we can get decent genome-wide genotypic data for around $10 per line. Compare that to the cost of phenotyping a line in the field, which includes all the costs and labor required to plant it in one or more locations, manage it over a growing season, collect data, harvest it, and then potentially collect more data following harvest. Therefore, breeders are very interested in collecting phenotypic data on a subset of lines, and then predicting performance of additional lines using only cheaper genotypic data. Genomic prediction also has the potential to offer higher genetic gains per unit of time, because we can rapidly grow each generation, run our marker panel on it, and make selections without taking the time to grow plants in the field.

Genomic Prediction – A Brief History

Using genetic markers to predict a line’s performance sounds very high-tech, but it’s actually just a new variation on an old technique. The theoretical underpinnings of this method date back to a paper from 1918, when the geneticist and statistician R.A. Fisher established the infinitesimal model. This paper settled a long-standing dispute between two groups of geneticists – the biometricians and the Mendelians. These two groups were like the Jets and the Sharks, or the Bloods and the Crips of early 20th century genetics research. Perhaps that’s an overstatement, but things were certainly heated between the two camps. The biometricians worked on developing mathematical models to describe what they could observe – namely, that many traits exhibit continuous variation (think of a trait like human height or body weight). This school of thought descended from Darwin’s theory of blending inheritance. This theory at least had the advantage of an unambiguous name – it posited that for any given trait, progeny represented a blending of their parents’ characteristics. However, the theory did not describe the mechanism underlying this blending. In addition, the biometricians had difficulty explaining things such as transgressive segregants (progeny that exhibit trait values that are outside the range of their parents, for example people who are taller or shorter than both of their parents).

Meanwhile, the Mendelians had become converts to the theory of inheritance via genes, after rediscovering Gregor Mendel’s work around the beginning of the 20th century. (Note that here “gene” is used to mean “unit of inheritance.” This was still decades before the structure of DNA was discovered through the efforts of Rosalind Franklin, Maurice Wilkins, James Watson, and Francis Crick). However, while the Mendelians had a mechanism to describe inheritance, they couldn’t square this mechanism with the observation that many traits vary continuously, in contrast to the more simple qualitative traits like pea flower color that Mendel had originally described.

Fisher’s paper essentially unified these two fields and demonstrated that they were both right, in part. He showed that progeny do inherit their parent’s characteristics via genes (that is, in a Mendelian fashion). However, his major insight was determining that inheritance via genes could lead to continuous variation if each gene contributed a tiny amount to the variation observed for a trait. Hence, in his theory each gene’s effect is considered equal and infinitesimal. This paper also has the distinction of introducing many statistical concepts and methods that are used almost universally today, including variance, and the analysis of variance (ANOVA) model.

R.A. Fisher sitting and smoking a pipe
R.A. Fisher demonstrating the smug satisfaction that comes with publishing one of the most important papers in the fields of genetics and statistics

The mathematical framework underlying what we now call genomic prediction was established by the animal breeder C.R. Henderson in the 1970s. Animal breeders originally performed prediction using pedigrees – that is, records of each individual's ancestors and relatives. For some time, animal breeders suspected that they could replace pedigree data with genome-wide marker data. Using marker data gives us a better idea of the relationship between individuals. The key concept here is that each individual inherits, on average, 50% of their alleles from their mother and 50% from their father. However, there is a wide variation in the proportions of genetic material, and the specific regions of the genome, inherited from each parent. We all know that siblings are often quite different from one another (especially the middle one). However, if we're just going off pedigree data, we can only generate a single prediction for all the siblings of any two given parents.

Plants vs. Animals – Advantages and Challenges

It’s not much of a secret that animal breeders tend to be smarter than us plant breeders. This has arisen out of necessity. Compared to animals, most plants are prodigious reproducers, generating dozens to thousands of seeds each generation, depending on the species. Therefore, plant breeders can create large populations, and grow them in many different environments. If you’re dealing with a self-pollinating plant like wheat, things get even easier, because a plant line that reproduces primarily with itself generation after generation will become “immortalized.” That is, all heterozygous genes will eventually become fixed to one homozygous state or the other. At this point, the line will change very little over generations, so if you plant it, and then collect and plant its seed, its progeny will be nearly identical. Therefore the same line can be tested over and over again, across different environments, giving plant breeders who work with self-pollinating species a brute-force option for evaluating line performance. In contrast, livestock animals reproduce at a much slower rate, and individuals with good genetic merit can potentially be worth millions of dollars. How do you predict how an animal will perform for a given trait when there’s only one of them? The answer, of course, is to look at their relatives.

However, I did just mention one way in which plant breeders’ jobs are harder, when I spoke about testing lines in different environments. In general, plant breeders must deal with more genotype-by-environment interaction (GxE). For complex traits controlled by many genes, an individual’s surroundings (i.e. its environment) starts to play a larger part in its phenotype. Animals have the advantage of being mobile. If they get thirsty, they can go get a drink. If they get hot, they can lie underneath a tree. There are of course environmental factors that do come into play if an animal lives in, say, North Dakota vs. Georgia, but on the whole being able to move around gives animals (including humans) a lot of advantages that we often take for granted. On the other hand, plants are stuck in one place, and have to take whatever nature throws at them. Therefore, adequately assessing GxE becomes crucial. That’s where this paper comes in.

I won't go into the nitty-gritty details, but there are a few ways in which we can deal with GxE interaction. One is to just ignore it, and hope that our predictions don't suffer too much. This is what most genomic prediction in plants has done to date, on account of using models originally developed by animal breeders. We can get fancier by explicitly including GxE effects in our models. As my coauthors and I found, this offers little or no benefit when trying to predict the performance of lines that have never been tested before. However, we see more benefit when we either a) implement sparse testing (that is, testing some lines in one environment, different lines in another environment, and then using the relationships between the lines to predict everyone's performance across all environments), or b) use multiple correlated traits for prediction.

At any rate, I think this is probably enough on the topic of genomic prediction for the time being. Plant breeders are now largely past the theorizing and how-to stages of studying this technique. Now we have arrived at the stage where people simply have to try it out, and see how it compares against the age-old method of direct phenotypic selection. It should at least be interesting to see whether it ends up being worth all the hassle or not.


New Wheat GWAS Publication

  • February 25, 2019
  • by brian_ward
Picture of printing press
Photo by Wendelin Jacober from Pexels

As usual, I’ve been remiss in updating this site. However, this time I have a good reason for it. I am the first author on two journal articles that were just published within a day of each other at the end of last week. These articles both used the same data collected from the field, consisting of around 320 wheat lines grown in two locations across two years, but they encompass quite different analyses. A common theme that I’ve picked up when talking to friends and family is that nobody knows what I do for work, nor why I do it. In that spirit, I wanted to give some less-technical background on these papers here – if you want the nitty-gritty, both articles are available open-access. I’ll just go over one of them for now, and will save the other one for a future post.

Definitions

Before starting, we’ll have to define a few things, so that everything that follows will make sense. An individual’s phenotype is what you can observe about them. We typically divide phenotype into specific traits. In the case of plants these might be height, grain weight, branching, flowering time, etc. An individual’s genotype is their genetic makeup. Genotype contributes to phenotype, in conjunction with environment (an individual’s surroundings). Finally, an organism’s genome is composed of a set of very long molecules called chromosomes, each one of these being the familiar double helix consisting of a chain of bases or nucleotides, which form an organism’s “genetic code”. A marker is some specific genomic feature that we can use to differentiate between different lines. In this case, the markers we’re referring to are single nucleotide polymorphisms (SNPs), which is when a single nucleotide is substituted for another one.

Here we will be talking about bread wheat, which is the product of ancient hybridizations between multiple grass species in the Middle East. This means that it confusingly has three genomes inherited from three separate progenitor species, labeled A, B, and D. The A genome was inherited from red wild einkorn wheat (Triticum urartu). The B genome was likely inherited from a close relative of the grass species Aegilops speltoides, while the D genome was inherited from Tausch’s goat grass (Aegilops tauschii). Each genome has seven chromosomes, labeled 1A through 7A, 1B through 7B, and 1D through 7D, for a total of 21 chromosomes. The separate genomes are sometimes referred to as “subgenomes,” but here we will just refer to them collectively as “the genome.” A very common question is “where is the C genome?” The C genome is present in another Middle Eastern grass species, Aegilops caudata, which is not a progenitor species of modern wheat. I suspect that this is a historical quirk due to taxonomic reclassifications, though it’s not something I’ve looked into in great detail.

Finding marker-trait associations

In this first paper, published in PLOS One, my coauthors and I performed a genome-wide association study of various yield-related traits. What does that mean? A genome-wide association study, or GWAS, is a type of study in which we genotype all the included lines with a set of markers spread throughout the genome. Then we determine which markers (if any) explain a significant portion of the phenotypic variation that we observe in the field, that is, which markers are involved in significant marker-trait associations. In our case, we used tens of thousands of markers, though in many other species with more advanced genomic data available (especially in humans), it’s customary to use millions or tens of millions of markers.

When I say “yield-related traits,” I’m referring both to grain yield, the most important trait in cereals, as well as traits like grain production per unit area, grains per head, heads per unit area, etc. We also included several phenological traits (i.e. traits relating to the timing of plant development), such as heading date, plant maturity date, and the period of time between the two.

GWAS studies are a dime a dozen these days. What I think makes this one interesting is that it is one of the first wheat GWAS to use a reference genome. When I started graduate school, there was no reference genome, and nobody thought that there would be one anytime soon. Soon afterwards, the International Wheat Genome Sequencing Consortium sequenced the largest wheat chromosome, 3B. Weighing in at about 1 billion bases, chromosome 3B is more than twice as large as the entire rice genome. The full wheat genome is enormous, consisting of somewhere around 15 billion bases – around five times as large as the human genome. To make matters worse, much of this DNA (about 85%) consists of repetitive elements – quasi-autonomous sequences which "copy and paste" themselves throughout the genome. I personally find these sequences fascinating, but they do greatly complicate the task of assembling sequencer reads together. The monumental task of sequencing chromosome 3B convinced many researchers that finishing the entire genome could take a very long time. However, the rapid advancement of assembly technologies brought us a surprise – a fully assembled reference genome published last year.

Candidate genes

Having a reference genome available is a real game-changer. Previously, we had to rely on genetic positions for markers. These have the unfortunate property of being variable depending on the population that was used to calculate them. With a reference genome, we can locate markers both by their genetic position, and by their physical position on the chromosome. This makes it much easier to compare significant results between studies. In addition, once we have found a significant marker, we can now more easily look for candidate genes (i.e. genes that we think are causing the significant marker-trait association) surrounding it.

Sometimes we get lucky, and a particular marker is itself causing variation within a gene, which is in turn causing phenotypic variation that we can observe. However, more often than not, a marker is merely floating out between genes somewhere, presumably close to one that we are actually interested in. In this case, the alleles of the marker and different alleles of the gene will be held together in linkage disequilibrium (LD) or gametic phase disequilibrium. These are two complicated terms for the same, somewhat simple concept. If you learned about Gregor Mendel's peas and Punnett squares in high school biology, you likely learned about Mendel's Law of Independent Assortment, which states that alleles (i.e. different "versions") of separate genes are inherited independently. This is often true, but not always. Specifically, if two genes, or a gene and a marker, are located close to each other on a chromosome, their alleles will be in LD, meaning they will be inherited together in a non-random fashion. There are other factors that can cause genes to be in LD with each other, but physical proximity is the most prevalent. Using the reference genome, we could finally look at patterns of LD in terms of physical distances. In the figure below, we have LD plotted against physical distance, both in the A, B, and D genomes as a whole (top panel), and in each of the seven chromosomes within each genome (bottom panel). The dashed line is a somewhat arbitrary threshold of "significant" LD, but according to this, we would say that LD is significant out to a distance of 1 million bases (1 Mb). Therefore, if we have a significant marker, we should take a look at all the genes within a 1 Mb radius of it. This sounds like a large distance, but keep in mind that it is 1/1000 the size of the 3B chromosome.

For some marker-trait associations, certain genes within our 1 Mb window really stood out as likely candidates. For instance, one marker was highly associated with the traits grains per square meter, grains per head, and heads per square meter. This marker happened to be close to a gene that is similar to the rice gene aberrant panicle organization 1 (APO1). As the name suggests, this gene is known to cause variation in rice panicle size and organization (the word "panicle" is used to refer to the rice equivalent of the "head" or "spike" in the wheat plant). Therefore, we had found a marker influencing traits related to the number of seeds per unit area, next to a gene similar to one known to affect the shape of rice plant heads. Among other candidate genes, we also found a heat-stress response gene close to a marker associated with the trait seeds per head. A previous study suggested that the expression of this gene is up-regulated during flowering and seed formation, so this will be an interesting finding to follow up on.

What’s next

While GWAS papers are great, we can't draw many conclusions beyond the specific hypothesis that is being tested: does marker X have some association with trait Y, or no association? While locating nearby candidate genes is good practice, we still often don't know the function of these genes. For that, we need follow-up studies. One brute-force method for deducing a gene's function is to disable it completely (a "knock-out" mutation). You may have heard of CRISPR-Cas9 technology – this is one recently-developed way to introduce targeted mutations into a plant, including mutations that can effectively silence genes. Keep in mind that genes are variably expressed (i.e. transcribed and translated into proteins) at different stages of development, so another brute-force method of determining gene function is to put a gene under the control of a sequence that just turns up expression to 11 all the time. The hope is that some of these discoveries could in turn lead to future discoveries that could... you know, save the planet and mankind. No pressure.

While taking this approach of identifying promising genes is all well and good, one problem is that many of these traits are highly polygenic (controlled by many genes). This means that any single gene likely only determines a small portion of the overall phenotypic variation that we observe. For highly polygenic traits, progress in breeding is mostly made by considering the genetic contributions of a vast set of genes as a whole. That is a topic to explore when we talk about the next paper.

Lab Protocols

96-well DNA Extraction

  • June 17, 2018June 17, 2018
  • by brian_ward

My apologies for a (very) long absence from this site – I’ve been at my new job for about six months, so I guess I can’t use the new job excuse anymore, but I am at the tail end of about a month of travel. I was in Ireland for a week, which was beautiful (and relatively affordable):

Connemara, County Galway, Ireland

Then I was at Cornell for a week-long training hosted by Ed Buckler’s lab:

Fall Creek, Ithaca, NY

(sadly this gorge is not where the actual training took place).  Unfortunately I got a cold while in Ithaca, and I’m still not quite over it – if you were also at the training and ended up getting sick, I’m sorry! Anyways, that’s enough about me – if you’re here you’re probably looking for info on DNA extractions.

In a previous post I presented a simple protocol for collecting plant tissue for DNA extraction in a 96-well plate format. Then we went over how to grind the tissue. The next step is of course performing the extraction itself. There are many different considerations to make when choosing a protocol for DNA extraction. Many labs have moved towards kit-based extractions, and for good reason: kit-based protocols offer good reproducibility, require minimal reagent preparation, and minimize cross-contamination. For labs that are working with large numbers of samples, automated platforms can perform thousands of extractions per day while limiting consumable use. Finally, send-away extraction services, where a researcher mails out tissue samples and receives DNA a few weeks later, are becoming increasingly feasible.

Of course, not every lab can pay for a robotic liquid-handling system or mail-order extraction services. With that in mind, I'll be outlining the method that my lab in graduate school used for performing DNA extractions in a 96-well format. This method is relatively straightforward, and doesn't require any specialized equipment. It is a modification of a method that uses cetyl trimethylammonium bromide (CTAB) for cell lysis and chloroform for phase separation (Saghai-Maroof et al. 1984; see References below).

Warnings and Disclaimers

Sorry – before going any further, I have to mention these:

Chloroform Safety

This protocol has a few different options for separating aqueous and organic phases after cell lysis, but they all involve chloroform. Chloroform needs to be respected, but is not especially dangerous on its own. Yes, it acts as an anesthetic when inhaled. Whenever you work with chloroform, you need to be working in a fume hood, and using all necessary personal protective equipment (PPE), including gloves, a lab coat, and goggles. You will also need a waste stream dedicated to used reagents and materials containing chloroform – which is essentially all of the reagents/materials used in the extraction. Keep in mind that chloroform will dissolve many polymers – including nitrile gloves. Therefore, it's a good idea to change gloves if you splash chloroform on them.

Optionally, you can use a phenol/chloroform mixture for phase separation. Phenol can cause deep chemical burns upon contact with skin or mucous membranes. This is made more dangerous when combined with chloroform’s ability to dissolve nitrile gloves. If you are working with phenol/chloroform mixtures, you should take all the precautions of working with chloroform, but you should also wear a liquid-proof lab coat or apron, and chemical-resistant gloves such as those made of Silver Shield or Viton.

Cross-contamination

The method that is described here can be prone to cross-contamination, due to the very close proximity of sample tubes to one another. Good technique can minimize this risk, but not eliminate it entirely. How much of this risk you can tolerate will depend greatly upon the downstream application of the isolated DNA. If you will be performing an assay to determine the presence/absence of some genomic feature, then you may have no tolerance for any cross-contamination. On the other hand, an assay for discriminating between different alleles may be less exacting. It may be best to test the influence of cross-contamination empirically, for instance by performing an extraction using a plate of two alternating known genotypes, and then using the DNA in whatever assay will ultimately be performed.

Listed Products

In addition, just like in the protocol on sampling tissue for extraction, I present some catalog numbers for products below. I am not paid by these companies; these are just products that I have successfully used in the past (sometimes after trial and error). Other companies/suppliers may market similar products that are just as effective.

DNA Extraction Protocol

With all of that out of the way, we can go on to the protocol itself:

Materials

  • 1.2ml microtiter strip-tubes (Corning #4413)
  • Strip-caps for 1.2ml tubes above (Corning #4418)
  • 96 well deep-well plates (ISC Bioexpress Cat. # P-9641-1)
  • Reagent reservoirs for 8-channel multichannel pipets
  • 8-channel 1000ul or 1200ul pipet with corresponding tips

Reagents

  • CTAB Extraction Buffer (recipe for 1 liter is given at the end of the protocol)
  • EITHER:
    • 24:1 chloroform:octanol mixture, OR
    • 25:24:1 phenol:chloroform:isoamyl alcohol mixture (this should be pH 8.0 for DNA extractions, i.e. Acros Organics #327115000)
  • Pure isopropanol
  • 75% ethanol (NOT denatured)
  • RNase A powder suspended to 10 ng/μl in molecular-grade water (Sigma #R6513)

Laboratory Equipment

  • Spex SamplePrep Genogrinder 2010
  • Tabletop centrifuge with rotor for 96/384 well plates
  • Rocking Platform (e.g. VWR # 12620-906)
  • Small Laboratory Oven or Water Bath set at 60°C – 65°C
  • (optional) laboratory lyophilizer or vacuum chamber

Step 0: Labeling Tubes and Plates

Sorry – I’m very fond of including step zero in my protocols. Anyways, the first thing we need to do is some labeling and organizing. First, we need to label some tube strips:

  1. If you have an odd number of tube strips in total, you’ll need to add an additional dummy strip (with ball bearings) – this is to properly balance everything when centrifuging. Label this tube strip something like “D” or “X”
  2. Create a new set of tube strips that are labeled exactly the same as the set of tubes holding the tissue (including the dummy strip if present). Remember that in the previous post we labeled all tube strips with unique identifiers. If you have a dummy tube strip, then this set of tubes also gets a dummy strip.
  3. Use a multichannel pipet to fill the new set of tubes with 450µL of isopropanol
  4. [optional] Place the new set of tubes in a -20° C freezer for the time being (with the caps off)

Now we need to label a new set of deep 96-well plates, which will really just be used as heavy-duty racks to hold the tubes. These racks will be used for centrifuging, so label them C-1, C-2, C-3, etc. Then place the centrifuging racks next to the centrifuge, where they will stay. Any tubes that will be spun will get transferred into these racks prior to going into the centrifuge. Remember from the last post that the most important thing is that RACKS THAT WERE USED FOR TISSUE GRINDING NEVER GO INTO THE CENTRIFUGE.

Step 1: Cell Lysis

After following the steps in the previous post on tissue grinding, we should have tubes containing a nice pale green powder. Now we are ready to begin the extraction procedure:

  1. Add suspended RNase A to CTAB buffer in the ratio of 450µL buffer:0.75µL RNase for each sample. Be sure to make plenty of extra (e.g. multiply the number of total samples by 1.15) due to inefficiencies of using a reagent reservoir for the multichannel pipet. Mix the CTAB/RNase solution by inverting several times or by using a stir bar, depending on the container that is chosen
  2. Add 450µL of CTAB/RNase mixture to each tube using a multichannel pipet
  3. Place the tubes into a 60° C water bath or oven with CAPS OFF (let the tubes get up to temperature before capping them – otherwise the caps will all pop off). I like to place the tubes in a water bath for about 10 minutes, take them out, cap them, and then transfer them to a 60° C oven so they dry off.
  4. Incubate tubes at 60° C for one hour, inverting them several times every 15 minutes
  5. Transfer tubes to centrifuge racks and give them a quick spin. After the spin, transfer the tubes back to the cheap racks they came in for the next step.

Step 2: Separation

In a fume hood:

  1. Uncap tubes and add 450µl of chloroform/octanol or phenol/chloroform/isoamyl alcohol mixture to each well
  2. Recap tubes and mix by inverting several times.
  3. Place tubes in racks, place a folded paper towel under rack lid (the chloroform-CTAB buffer mixture has much less surface tension than water, and a small amount can leak out around the tube caps), and secure rack lid with rubber band
  4. Place racks on rocker for 30mins. to 1 hour
  5. Check paper towel for chloroform leaks. Dry any leaked chloroform off of the outside of tubes with paper towel or kimwipe.
  6. Place tubes into centrifuge racks, DOUBLE-CHECK RACKS ARE BALANCED, centrifuge plate for 30 min. at 4,000 rpm at room temp. (20°C – 25°C)

Step 3: Remove supernatant

This is the trickiest part of the extraction. After you remove the tubes from the centrifuge, they should exhibit a nice phase separation, with the greenish/brown/black chloroform and organic “junk” on the bottom, and the clear (or slightly pink/reddish if using phenol) aqueous phase containing the DNA on top. The two phases are usually separated by a somewhat hardened skin.

Now you need to carefully remove the top phase, and place it into the new set of tubes containing isopropanol. Note that you need a clean pipet tip for each sample, so this step can potentially go through a lot of tips. If you have a full-plate pipetting station, you can do this step for an entire plate at once. Otherwise, you’ll need to use the multichannel pipet and go one tube strip at a time.

When you are actually pipetting, move quickly – the aqueous phase can slowly leak out of pipet tips after it is sucked up, so you don’t want any drips falling into the wrong tubes when you are transferring. Otherwise, take your time – it is easy to get mixed up here. It’s easiest if the layout of the new set of tubes containing the isopropanol exactly mirrors the layout of the old set of tubes. Also, you can physically move the old set of tubes to some other rack as you remove the supernatant from each one.

Step 4: Precipitation

Once the aqueous phase is transferred into the new tubes containing isopropanol, cap them and then invert them several times. Hopefully in a few tubes you’ll see some white clumps of DNA floating around. Don’t worry if you can’t see the DNA in all the tubes – just trust that it’s in there. Alternatively, if you see big globs of DNA in every tube, it probably means that you harvested too much leaf tissue.

After a few inversions, place the tubes into the centrifuge racks and then centrifuge them for 30 minutes at 4,000 rpm at 4° C.

Step 5: Rinsing

After the tubes are done centrifuging, uncap them and gently pour the isopropanol out into a beaker or other waste container. The DNA should be firmly stuck to the bottom of the tube in a pellet, so you shouldn’t have to worry about throwing the baby out with the bath water.

Once all the isopropanol is drained out, add 450 μL of the 75% ethanol (from the reagent list) to each tube and recap. Invert the tubes several times (the DNA pellets should now be a brighter white color, and may detach from the bottom of the tube). Then place the tubes into their centrifuge racks, and centrifuge them for 30 minutes at 4,000 rpm and 4° C.

If you suspect that your DNA is particularly dirty for any reason, you can repeat the whole rinse step a second time.

Step 6: Drying & Resuspension

After centrifuging the DNA in ethanol, uncap the tubes and gently pour the ethanol out into your waste container. You may want to also dab the ends of the tubes on a paper towel while they are upside-down to remove some extra ethanol.

One problem with using such narrow tubes is that a fair amount of residual ethanol will remain after decanting – even with the best pouring and dabbing technique. My recommendation is to place all the tubes in a vacuum chamber or lyophilizer (without the refrigeration enabled), which should evaporate any remaining ethanol in a few minutes. If you don’t have this equipment available, the tubes can be air-dried for several hours. DNA is relatively resistant to short-term thermal degradation, so you can even dry tubes in an oven set to 40° C for a short period, if you get really desperate.

After drying, just use a multichannel pipet or pipetting station to add 100 μL of water, Tris buffer, or Tris-EDTA buffer to each tube. Your DNA extraction is finished; cap the tubes, place them in the fridge overnight, and the next day you should have resuspended DNA ready for quantification.

References

Saghai-Maroof, M. A., K. M. Soliman, R. A. Jorgensen, and R. W. Allard. 1984. “Ribosomal DNA Spacer-Length Polymorphisms in Barley: Mendelian Inheritance, Chromosomal Location, and Population Dynamics.” Proceedings of the National Academy of Sciences of the United States of America 81(24):8014–18.

 

Recipe for 1 liter of CTAB DNA extraction buffer

  • 1M Tris buffer (pH 8.0): 100 mL
  • 0.25M EDTA buffer (pH 8.0): 80 mL
  • NaCl powder: 82 grams
  • CTAB powder: 20 grams
  • ddH2O: remainder of 1 L (~820 mL)

Combine all ingredients and mix over moderate heat using a stir bar until CTAB powder goes into solution (stir at a slow pace to avoid foaming).

I won’t go into the details of making the Tris or EDTA buffers here – these are also sold ready-to-use in the required molarity/pH by several vendors.

shell scripting

Parsing command line options in bash

  • February 26, 2018
  • by brian_ward

I just started a new job a few weeks ago, so I've been pretty negligent regarding this website. However, I think I've finally gotten my footing a bit better, so hopefully I can spend a little more time here. I wanted to mark my return by writing about terrible shell scripts, some techniques for writing less terrible shell scripts, and – most importantly – deciding when it's appropriate to use these techniques. Just for clarification, when I'm talking about shells in this post, I'm talking about bash, though some of the methods mentioned may work for other shells as well.

Terrible shell scripts – a brief history

Throughout most of graduate school, I wrote a lot of really bad shell scripts. This is likely because I started coding in R. As an interpreted language, R allows for relatively interactive sessions, with the ability to run code line-by-line. The excellent RStudio IDE can also make writing R code enjoyable, if writing code is your thing. There's nothing interactive about shell scripting, so at first I really struggled when I transitioned to it. In my early attempts, if I needed to set the value of a variable, I simply performed the assignment inside of the script, up near the top. For instance, to set the path to an input file, I would write something like:

infile=/path/to/my/input/file.txt

This is how everyone learns to write shell scripts. However, I quickly grew disenchanted with this approach. For one thing, I always had to look inside of a script to remember how it worked. In addition, storing a script like this inside of a git repository was annoying, because git would always be tracking any changes made to assignments of constants.

I eventually progressed to using positional arguments. For instance, typing something like the following:

./script.sh in_file.txt foo bar

would assign three strings (“in_file.txt,” “foo,” and “bar”) to the variables $1, $2, and $3 respectively. However, one downside of using this approach was that I still had to look inside the script to remember how to use it, and it got pretty cumbersome if there were more than a few arguments.
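
For illustration, a script written this way might look something like the following minimal sketch (the variable names are hypothetical):

#!/bin/bash
## Positional parameters: $1, $2, and $3 hold the first, second, and third
## "words" typed after the script name
infile="$1"
foo="$2"
bar="$3"

echo "Input file: $infile"
echo "foo is set to: $foo, bar is set to: $bar"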

I also experimented with reading arguments in from files. For example, the following code will read in lines from the second column of a tab-delimited file, place them in an array, and then extract the elements of the array into variables:

## Read the second (tab-delimited) column of args_in.txt into an array
mapfile -t args_array < <(cut -d$'\t' -f 2 args_in.txt)

echo "Arguments read in from args_in.txt:"
printf '%s\n' "${args_array[@]}"

## Assign array elements to named variables
infile=${args_array[0]}
foo=${args_array[1]}
bar=${args_array[2]}

This method works well if there are a large number of arguments to be input. Your argument input file can consist of a table, with one column containing input labels for reference, and another column containing their values (column 2 in the example above). However, one downside is that the argument input file must always tag along with the script. In addition, you’ll still have to look inside either the argument file or the script itself to figure out what is going on.
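
For illustration, a tab-delimited args_in.txt to go along with the snippet above might look like this (the labels in column 1 are only for human reference; the values in column 2 are hypothetical):

infile	/path/to/my/input/file.txt
foo	some_value
bar	another_value

With this file in place, the snippet above would assign the three values in the second column to $infile, $foo, and $bar.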

Parsing command line options

Finally, I decided that I wanted to write scripts that more closely resemble all the bash built-ins and C programs that I typically work with. For instance, below is a grep command to count the number of lines in a file that do not contain the phrase “foo,” written with short options:

grep -v -c "foo" file.txt

or the same command using the equivalent long options:

grep --invert-match --count "foo" file.txt

Notably, I never have to dive into the source code for grep to change its operation. In addition, if I don't know how to use grep for a particular purpose, I can just type "grep --help" (or "man grep") to get some advice.

After some research and some trial and error, I finally realized my goal… mostly. I will say that I don’t think there is currently any perfect solution for parsing command line options in shell scripts, but there are some that get us most of the way there. Confusingly, the two programs that we can use to parse options in bash have almost identical names:

  1. getopts
  2. getopt

To make matters more confusing, there is an older version of getopt, and an enhanced GNU version. Whether getopts or getopt is preferable is a matter of eternal debate. I won’t go into much detail on it, but here are some main points to consider:

  1. getopts is a bash built-in, and is compatible with multiple POSIX-compliant shells (a minimal getopts sketch follows this list, for comparison)
  2. The GNU version of getopt is available in the util-linux package, which comes installed on nearly all Linux distributions. However, it does not come installed on Mac OS or FreeBSD. On Mac, it can be installed using MacPorts or Homebrew (the gnu-getopt package).
  3. The GNU getopt is the only one that can handle long option names – this is technically possible using getopts, but the solution is a bit hackish
  4. getopts can make life a bit easier, because it does the actual parsing of command options – getopt only formats supplied options to make them easier to parse.
  5. The most important thing is to avoid the old version of getopt – at this point there’s no reason to ever use it.
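
For comparison, below is a rough sketch of how the example script further down might handle its short options using the getopts built-in instead (no long option names, and no optional arguments, since getopts doesn't support those cleanly):

#!/bin/bash
## Minimal getopts sketch: -a and -b take required arguments, -l and -h are flags
log="false"
while getopts "a:b:lh" opt; do
    case "$opt" in
        a ) num_a="$OPTARG" ;;
        b ) num_b="$OPTARG" ;;
        l ) log="true" ;;
        h ) echo "Usage: $0 -a NUM -b NUM [-l]"; exit 1 ;;
        * ) echo "Unknown option"; exit 1 ;;
    esac
done
shift $((OPTIND - 1))   ## any remaining positional arguments now start at $1

echo "num_a=$num_a num_b=$num_b log=$log"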

Example

I’m going to just show an example bash script, called getopt_example.sh, that parses command line arguments using getopt here, and then give some explanation as to what it’s doing. Note that the code is also posted here. This script just takes two floating point numbers as input, and returns their sum. It can optionally output the text to a file instead of to standard out, and can insert some custom text along with the summation output.

#!/bin/bash

# Set script Name variable
script=$(basename ${BASH_SOURCE[0]})

## Template getopt string below
## Short options are specified after -o. An option name followed by nothing
## indicates no argument (i.e. a flag), one colon indicates a required
## argument, and two colons indicate an optional argument
##
## Long options are specified after --long, separated by commas, with the
## same colon syntax as for short options
opts=$(getopt -o a:b:hlt:: --long numa:,numb:,help,log,text:: -n 'option-parser' -- "$@")
eval set -- "$opts"

## Set fonts used for help function.
norm=$(tput sgr0)
bold=$(tput bold)

## help function
function help {
echo -e \\n"help documentation for ${bold}${script}.${norm}"
echo "
REQUIRED ARGUMENTS:
-a or --numa            Floating point number - first number to be summed
-b or --numb            Floating point number - second number to be summed

OPTIONAL ARGUMENTS
-t or --text            Optional descriptive text to include in output

FLAGS
-l or --log             If included, prints output to a file named 'sum.log',
                        otherwise prints to stdout
-h or --help            Displays this message."
    
echo -e \\n"USAGE EXAMPLE: 
${bold}./$script --numa=2 --numb=3 --text='Here is the output' --log ${norm}"\\n

exit 1;
}

## If no arguments supplied, print help message and exit
if [[ "$1" == "--" ]]; then help; fi

## Set initial values for arguments
num_a="init"
num_b="init"
text="Floating Point Sum:"
log="false"

## Parse out command-line arguments
while true; do
    case "$1" in
        -a | --numa ) 
            case "$2" in
                "" ) shift 2;;
                * ) num_a="$2"; shift 2;;
            esac ;;
        -b | --numb )
            case "$2" in
                "" ) shift 2;;
                * ) num_b="$2"; shift 2;;
            esac ;;
        -t | --text )
            case "$2" in
                "" ) shift 2;;
                * ) text="$2"; shift 2;;
            esac ;;
        -l | --log ) log="true"; shift ;;
        -h | --help ) help ;;
        -- ) shift; break;;
        * ) echo "Internal Error!"; break;;
    esac
done

## Check if both number a and b were supplied
if [[ "$num_a" == "init" ]] || [[ "$num_b" == "init" ]]; then
    echo "--numa and --numb arguments are both required. Type './getopt_example.sh -h' for help"
    exit 1;
fi

## Create the sum of the numbers - a little more involved for floating point numbers
if [[ $log == "true" ]]; then
    echo "$text" > sum.log
    echo "The sum of $num_a and $num_b is $(echo $num_a + $num_b | bc)" >> sum.log
else
    echo "$text"
    echo "The sum of $num_a and $num_b is $(echo $num_a + $num_b | bc)"
fi

exit 0;

Explanation

That’s a lot to take in – we’ll break it down piece by piece below.

The call to getopt

getopt just performs one function – it takes a string supplied after the script name, and breaks it apart into an array of “words” – i.e. the options and (possibly) their arguments. I say “possibly” because not all options have arguments. In our case, the output array is stored in variable $opts.
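
To see what this normalized output looks like, you can run the getopt command by itself in a terminal, supplying a hypothetical set of options for the example script:

getopt -o a:b:hlt:: --long numa:,numb:,help,log,text:: -n 'option-parser' -- --numa=2 -b 3 --log

which should print something like the following (note the quoting, and the "--" marking the end of the options):

 --numa '2' -b '3' --log --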

In our call to getopt, we supply the short versions of the options after -o. Pay attention to the colons – if there are no colons after an option letter (h and l in this case) it signifies that the option doesn't take an argument (i.e. it operates as a flag). One colon means that the option requires an argument, and two colons mean that it can take an optional argument. We then list out the corresponding long versions of the options after --long. Everything is the same as for the short options, except we need to place commas between each option name.

Here's a good place to point out one of the idiosyncrasies of getopt – for some reason that I don't understand, it cannot handle a space between a long option and an optional argument. For instance, in our call to the above script, if we wanted to customize the output header text, we would have to write --text="Output below" instead of simply --text "Output below". For that reason, I recommend always using the equal sign assignment for long options (whether they have required or optional arguments), and writing the usage example in the help function (more on this below) to reflect this.

The -n option is used to give a name to the parser, to be output along with any error messages should option parsing fail. The "--" signifies the end of getopt's own options, and "$@" passes along everything that was typed after the script name, including any additional positional arguments supplied after the named options.

Finally, the line beginning with eval set replaces the script's positional parameters ($1, $2, etc.) with getopt's normalized output, which preserves whitespace within quoted arguments. Interestingly, you could delete the getopt call and the eval set line entirely, and the script would still technically work, though it would easily get confused by more intricate command line inputs.

The help function

This is really the most important part of this whole exercise. You don't have to write a help message, but in my mind having one is the whole point of going to the trouble of parsing arguments in the first place. The help function is pretty simple – its job is to print out information on each of the possible options/flags, and then exit the script. You'll notice below the function definition I have a single line that tests whether the first parsed argument ($1 after the eval set line) is "--"; if it is, the script prints the help message and exits. This will be the case if no options are supplied, so if the user just types:

./getopt_example.sh

the script will just print the help message. (This also means that this particular script can't be driven by positional arguments alone.)

Initial values

Next we set the starting values for all arguments. My one piece of advice here is to set the initial value for any options requiring an argument to some standard string (such as “init”) – more on this later. You should set the initial value for any option with an optional argument to some sensible default value. Finally, the initial values for any flags should be set to some version of “false” – it could be “F” or “FALSE” – whatever works best for you.

The Parsing Loop

Now – what is going on inside of the while loop? Remember that the output of the call to getopt is an array of strings, called $opts in our example. So, the while loop will simply loop through this array (which the eval set line loaded into the positional parameters). For each iteration of the loop, an option is assigned to the variable $1, and its argument (if present) is assigned to the variable $2. The case…esac statement is a way of performing conditional tests, equivalent to the switch statement found in many languages. Basically, it is a way of testing for a match to multiple possible patterns, with syntax that is easier than using a long string of if… elif… else statements. You'll notice all the shift commands in there – for any option that has an argument, we will assign the argument value to a variable, and then shift two positions over in the getopt-supplied array (the option name, and its argument). For any option without an argument (i.e. a flag), we only need to shift one position through the array.

There are a few different ways to go about handling required vs. optional arguments, but I prefer to handle them the same way, to make it possible to copy and paste code within the option parsing block. For each option with an argument, I have a nested case…esac statement. If the supplied argument is anything ("*"), it is assigned to the relevant variable. If it is an empty string (""), nothing happens and the while loop moves on.

Finally, the --log/-l and --help/-h options are flags with no arguments. In the case of --log, the value of the variable $log is set to "true," while --help/-h calls up the help function.

When "--" is encountered, it signals the end of the entered options, and the loop is broken. Any other value aside from those listed above signals that something has gone wrong, raising an error message.

Checking user input

getopt can make life easier, but it still isn't very smart. For instance, right below the option parsing block in the code is a test to make sure that neither --numa/-a nor --numb/-b are still set to their initial values of "init". getopt will only catch the omission of a required argument if the user types a short or long option that expects a space-separated argument, and then leaves the argument off entirely, for instance:

./getopt_example.sh --numa 2.5 --numb

However, it won't catch the problem if the user uses an equal sign for a long option:

./getopt_example.sh --numa=4 --numb=

or if the user never enters the option for number b in the first place:

./getopt_example.sh --numa=4

Therefore, it’s up to us to make sure that all required arguments are supplied. By setting a default value of “init” for all required arguments, and by setting up our parsing block to do nothing if it encounters an empty string for a required argument, we can reliably exit the script if any required argument was not supplied with the following test:

## Check if both number a and b were supplied
if [[ "$num_a" == "init" ]] || [[ "$num_b" == "init" ]]; then
    echo "--numa and --numb arguments are both required. Type './getopt_example.sh -h' for help"
    exit 1;
fi

Conclusion

That’s all there is to it – you can play around with this simple script to see how different inputs affect its output; I’m sure there are some ways to break it that I haven’t thought of. Now you know everything you need to write nice, professional-looking scripts in bash. That being said, you should think about when to use this technique. There is something to be said for crappy scripts – sometimes you need to just write something, run it once or twice, and move on. The boilerplate code required to use getopt is somewhat long and tedious to modify, especially if the script has a large number of options. However, if you are writing something that you will need to run many times in the future, it probably pays to take the extra time to make it a little more professional.

R coding

Find species occurrences within a region

  • December 30, 2017January 6, 2018
  • by brian_ward

Over the past few weeks I have been negligent in my duties to this website; I attribute my lapse in activity (of all kinds) to the holidays. However, I haven’t been totally sedentary. I was just home in New Hampshire for a few days, where we received 6 inches of fresh snow on Christmas day, on top of the foot or so that was already on the ground. I went out for a snowshoe with my parents and we were truly in a winter wonderland:

The 2017-2018 winter in New Hampshire, off to a good start

However, the downside is that temperatures dropped precipitously after that, eventually staying in the vicinity of 0° F (-18° C) for the remainder of my trip. The moral of the story is that while it’s nice to visit New Hampshire in the winter, it’s also nice to be able to leave.

Anyways, I guess that’s enough about me. On to the primary topic of this post:

Problem

A coworker recently posed the following problem that she needed to solve in an extensible way:

Given a country X, find all species of weeds that have been reported there, and compare this list against a preexisting list containing the names of weed species of concern

In order to solve this problem, the first resource I would turn to is the Global Biodiversity Information Facility (GBIF). GBIF hosts a massive database that collects observations of species (referred to as “occurrences”) from all over the world; as of this writing the database contains roughly 1 billion species occurrences total.

Outline

We can break this problem down into a few simple steps:

  1. Clean our preexisting list of species names to match GBIF formatting. In the example I’ll use, we will have to get rid of some extra information included with the species names. One example is naming authorities – so for our purposes the gray wolf, first named by Linnaeus in 1758, should be written as Canis lupus instead of Canis lupus L. In addition, we will remove information regarding subspecies, varieties, and cultivars.
  2. Identify an appropriate dataset to download. We already know that we need to download species occurrences for a particular country, but downloading all occurrences from any given country could be far more information than we need. Therefore, we need to restrict the size of the download as much as is feasible.
  3. Initiate the download from GBIF servers
  4. Clean species names within the downloaded GBIF data. Generally the species names in GBIF datasets are formatted well, though it pays to also clean this data to ensure reliable matches with our own list.
  5. Compare our list of species names with the GBIF data
  6. [Optional] write out the downloaded occurrence data to a .csv file

R example

The script that this code is copied from is available here.

For the example, I will be using the excellent rOpenSci package rgbif, which, you may have guessed, provides several R functions for accessing GBIF data. If you don’t have rgbif installed, just open an R terminal and type:

install.packages("rgbif")

Input variables and data

Now that that is out of the way, let's start by loading required packages, defining some variables, and simulating some data. We'll use the country of Chad as an example. For our list of species, I am just going to create a data frame containing various plants along with two animals with scientific names that I find humorous. Normally, our list of species would probably come from a preexisting .csv file that we read in. Finally, we need to supply a model or type species, and a desired taxonomic rank of interest (more on that below).

library(rgbif)

## Country of interest
country <- "Chad"

## Species of interest - would normally obtain this from a pre-existing .csv file
species <- data.frame("sci_name" = c("Pica pica L.", 
                                     "Tuta absoluta (Meyrick)", 
                                     "Arachis hypogaea var. Georgia-12Y", 
                                     "Vigna vexillata (L.) A. Rich", 
                                     "ZEA MAYS"),
                      "common_name" = c("Eurasian magpie",
                                        "tomato leafminer",
                                        "peanut variety Georgia-12Y",
                                        "wild cowpea",
                                        "maize"))

## Model or type species, and taxonomic rank of interest
mod_sp <- "Arabidopsis thaliana"
rank <- "phylum"

## Select whether to output occurrence data (TRUE/FALSE)
## And specify output file name (.csv extension required)
## File name irrelevant if output_occs not set to TRUE
output_occs <- FALSE
out_file_name <- "Chad_vascular_plant_occurences.csv"

Define Functions

Now we need to create a couple of functions. The first will read in a vector of species names, and will output a cleaned-up version of this vector. Specifically, it will only keep the genus and species names, and will ensure that the genus name is capitalized (and that the species name is not). It will also get rid of any additional information such as naming authority, variety, subspecies, etc.

clean_sp <- function(s) {
  
  s <- gsub("_", " ", s, fixed = TRUE)
  s <- tolower(s)
  l <- strsplit(s, " ", fixed = TRUE)
  l <- lapply(l, function(x) {paste(x[1], x[2], sep = " ")})
  cleaned <- unlist(l)
  
  substr(cleaned, 1, 1) <- toupper(substr(cleaned, 1, 1))
  
  return(cleaned)
}
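
As a quick sanity check, running the function on a couple of the names from our example data frame should give:

clean_sp(c("Pica pica L.", "ZEA MAYS"))
## [1] "Pica pica" "Zea mays"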

The second function will take a species that we choose as input, as well a taxonomic rank of our choosing (either kingdom, phylum, class, order, family, or genus). The GBIF functions that we use below require a “key” (which is an integer) for identifying a specific taxonomic rank. We can get these keys using the rgbif function name_backbone(). For example, if we want to find all occurrences of vascular plants (phylum Tracheophyta) within a country, the easiest way to get the GBIF key for Tracheophyta is to look up an example plant species that we know is in this phylum using the name_backbone() function, and then extract the key from it. The phylum key for Tracheophyta happens to be 7707728 – it doesn’t exactly roll off the tongue. The function tax_key() below is a wrapper for name_backbone() which allows us to easily look up a rank key for any given species.

tax_key <- function(species, rank) {
  
  if(!rank %in% c("kingdom", "phylum", "class", "order", "family", "genus")) {
    stop("rank argument requires value of 'kingdom', 'phylum', 'class', 'order', 'family', or 'genus'")
  }
  
  tax <- name_backbone(species)
  key <- unlist(tax[grep(paste(rank, "Key", sep = ""), names(tax))])
  print(paste("Using GBIF key", key, "for", rank, tax[[rank]]))
  
  return(key)
}
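
For instance, calling it with the model species and rank we defined above should print the message inside the function and return the named key (assuming the GBIF backbone lookup still returns the Tracheophyta key mentioned earlier):

key <- tax_key(species = "Arabidopsis thaliana", rank = "phylum")
## [1] "Using GBIF key 7707728 for phylum Tracheophyta"

key
## phylumKey 
##   7707728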

Executable example

Now we’re finally able to run some code. First we’ll clean up our species names, obtain the key value for phylum Tracheophyta (using the species Arabidopsis thaliana as an example), and obtain the two-letter ISO 3166-1 country code for Chad:

## Clean up input vector of species names
species$clean_name <- clean_sp(species$sci_name)

## Obtain integer key for given species/rank combination
key <- tax_key(species = mod_sp, rank = rank)

## Obtain ISO 3166-1 two-letter country code
country_code <- rgb_country_codes(country)$code

The rgbif package includes three functions that can download data from GBIF servers:

  • occ_download() – performs exactly the same operation as a manual data download. Requires GBIF username and password (which can be obtained for free through the GBIF website)
  • occ_search() – allows for finer control of download queries. Can download data, metadata, or hierarchical trees
  • occ_data() – faster version of occ_search() optimized for downloading data only

For the rest of this example, we'll focus on the latter two functions, though using occ_download() is a good idea if you will be grabbing a large amount of data.

Before we download any occurrence data, it is always a good idea to check on how many results there are for our desired search query. We can do this using the occ_search() function, by telling it to return metadata, and setting the limit value to 0. You’ll notice below that I’m calling occ_search() in an unusual fashion, by using the function do.call(). do.call() allows us to specify a function, and then supply its arguments as a list. The reason I’m doing this has to do with our key value for phylum Tracheophyta. Normally if I called occ_search() and wanted to only select this phylum, I would need to supply the argument manually:

occ_search(phylumKey = 7707728, …).

However, if on another occasion I wanted to restrict my search to the family Brassicaceae (key 3112), I would need to use the argument “familyKey”:

occ_search(familyKey = 3112, …)

This means that I would have to manually change the function call every time I search by a different taxonomic rank. The tax_key() function that we defined earlier returns a named integer for the key, so we can supply it, along with its name ("phylumKey" in this example), in the list supplied to do.call() to avoid having to ever manually change the occ_search() call.
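
As a tiny standalone illustration of this trick (using the Tracheophyta key from earlier and Chad's two-letter code "TD"; the limit argument is included just for show):

## A named integer, as returned by tax_key()
key <- c(phylumKey = 7707728)

## c() merges the named value into the list, and its name becomes the
## argument name when the list is passed to do.call()
args <- c(list(country = "TD", limit = 0), key)
str(args)

## So do.call(occ_search, args) is equivalent to calling:
## occ_search(country = "TD", limit = 0, phylumKey = 7707728)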

So, in the block below I define a list, gbif_args, which contains some arguments that we will use repeatedly. Notice "key" is listed without any assignment (=), because it already contains both a name and a value. When I actually call the occ_search() function below, I append some additional arguments to return metadata, and to limit the number of occurrences to 0. Note that I also include a prompt so the user can select whether or not to proceed with the download in the next step. The script must be sourced (rather than run by pasting lines into the console) for the prompt to work properly:

## Construct list of arguments for GBIF functions
gbif_args <- list(country = country_code,
                  key,
                  hasCoordinate = TRUE,
                  hasGeospatialIssue = FALSE,
                  curlopts = list(progress = TRUE))

## Find number of occurences using specified arguments
n_occs <- do.call(occ_search, c(gbif_args, list(return = "meta", limit = 0)))$count
print(paste("Number of occurences returned by specified parameters:", n_occs))
y_n <- readline(prompt = "Proceed with occurrences download? [y/n]")
if (y_n != "y") {
  stop("User declined to download occurrences, or did not supply a y/n response")
}

We can now finally download our species occurrence data for real. We have our choice of either the occ_search() or occ_data() functions. Of the two, occ_data() tends to be faster. Running the code below on my computer/internet connection, occ_search() downloaded 3,225 occurrences in 0.64 minutes, while occ_data() completed the same task in 0.24 minutes.

## Download occurences into data frame using occ_search() function
start_time <- Sys.time()
occs <- do.call(occ_search, c(gbif_args, list(return = "data", limit = n_occs)))
end_time <- Sys.time()

print(paste("Elapsed time to download", nrow(occs), "occurences:", 
            round(difftime(end_time, start_time, units = "mins"), 2), "minutes"))

## Download occurences into data frame using occ_data() function
## Note "return" argument is not used by occ_data()
start_time <- Sys.time()
occs <- do.call(occ_data, c(gbif_args, list(limit = n_occs)))$data
end_time <- Sys.time()

print(paste("Elapsed time to download", nrow(occs), "occurences:", 
            round(difftime(end_time, start_time, units = "mins"), 2), "minutes"))

Finally, we can compare the GBIF species occurrences with our own list of species. Some GBIF occurrences are only identified to the genus level, so I’ll remove these by deleting any occurrences with scientific names that don’t contain a space. While the species names in the GBIF data tend to be pretty clean, we might as well run them through our cleaning function to be thorough. While we’re at it, we might as well save the occurrence data to a file, though this step is optional.

occs <- occs[grep(" ", occs$name, fixed = TRUE), ]
occs$clean_name <- clean_sp(occs$name)

intersect(species$clean_name, occs$clean_name)

if (output_occs) {
 write.csv(occs, file = out_file_name, row.names = FALSE)
}

When we run intersect(), we find that the three plant species in our original dataframe (which were not at all cherry-picked) are indeed among the list of vascular plants that have been observed in Chad. Critically, the Eurasian magpie and tomato leafminer are not shown here, since they are not vascular plants (or plants of any kind for that matter).

Going further

The rgbif package allows for many customized operations. For instance, we can isolate occurrences from a specific state or province. In this example, I count the number of occurrences of vascular plants in the U.S. state of Delaware:

## Count all occurences in US state of Delaware
gbif_args <- list(country = "US",
                  stateProvince = "Delaware",
                  key,
                  hasCoordinate = TRUE,
                  hasGeospatialIssue = FALSE,
                  limit = 0,
                  return = "meta",
                  curlopts = list(progress = TRUE))
n_occs <- do.call(occ_search, gbif_args)$count

If country borders seem a little too arbitrary, you can also define a custom polygon in well known text (WKT) format. I recommend this site for constructing the polygon (the site only seems to work with MS Internet Explorer or MS Edge). Below we search for vascular plants within a rectangle covering portions of Chad, Cameroon, and Nigeria:

## Find all occurences within a polygon
library(wicket)
wkt <- "POLYGON((17.02 11.52, 10.47 11.03, 10.78 7.78, 17.20 8.57, 17.02 11.52))"
wkt <- wkt_correct(wkt)
gbif_args <- list(geometry = wkt,
                  hasCoordinate = TRUE,
                  hasGeospatialIssue = FALSE,
                  limit = 100,
                  curlopts = list(progress = TRUE))
occs <- do.call(occ_data, gbif_args)$data

Cautionary notes

This is a pretty powerful method for finding species occurrences within a specific region. However, there are a few rules you should keep in mind:

  1. Reduce, reduce, reduce. You want your queries to be as specific as possible while still being relevant. Otherwise you’ll end up wasting time downloading lots of irrelevant data. I recommend using the lowest taxonomic rank and smallest geographic region possible. In this example, we downloaded vascular plant occurrences from Chad. If I had searched for all vascular plant occurrences in China, or the United States, the download would take a long time, or would just fail outright. That brings us to the next rule:
  2. Be nice to GBIF. They are providing a very valuable free service. Please don’t use scripts like this to spam their servers with multiple huge download requests.

Anyways, that’s about it. Now you can search out species occurrence data to your heart’s content.

Uncategorized

Tissue Grinding for 96-well DNA Extraction

  • December 3, 2017June 17, 2018
  • by brian_ward

Background

In a previous post I presented a simple protocol for collecting plant tissue for DNA extraction in a 96-well plate format. Later on we’ll get to the details of performing the DNA extraction. Before getting to that, we need to talk about the proper way of grinding plant tissue to get a nice, homogeneous powder. Like the subject of the previous post, this is a seemingly-simple step that often goes underappreciated. I promise that I will get around to a protocol for performing an actual DNA extraction sometime soon.

Tissue grinding is one of those things that can soak up hours of time if it doesn’t go right. If samples don’t grind properly, you’ll have to spend a lot of time trying to shake them again and again; you may even need to remove tissue from some wells if they are overfilled, which takes time and opens you up to more risk of cross-contamination.

Clearly, it’s best to grind the tissue properly in one fell swoop, so you can be on your way to the extraction with no problems. The previous post was all about gathering samples in a highly consistent fashion to enable an easy grinding step. Depending on whether you had a lyophilizer at your disposal for the last step, your tissue should now either be dried, or else frozen.

Materials

  • 96 well deep-well plates (Axygen Scientific # P-9641-1S)

Reagents

  • Liquid nitrogen (if using frozen tissue)

Laboratory Equipment

  • Spex SamplePrep Genogrinder 2010
  • Styrofoam shipping cooler (if using frozen tissue)
  • Tongs (if using frozen tissue)

 

Protocol

Step 1: Label Racks

The deep-well plates that I listed above have wells that are wide enough to fit our microtiter tubes in, so we will essentially be using them as heavy-duty tube racks. For the rest of this post, I will refer to them as “racks” instead of plates to avoid confusion. The racks that the tubes come with are too flimsy for tissue grinding – after a round or two in the Genogrinder, they will likely self-destruct. For the DNA extraction, you’ll be using anywhere from one to four plates of samples (i.e. up to 384 samples). So, the steps are:

  1. Label your deep-well racks something like “Grind-1”, “Grind-2” etc., for however many you need.

NOTE: The Genogrinder requires two racks at once, due to its clamp design. So, if you only have a single plate of samples, you should still label two deep-well racks to be used for grinding (more on that below).

  2. Transfer the tubes into the racks that will be used for grinding. If you have a single plate of samples, split the tubes into two racks, spacing them evenly across each rack for uniform clamping inside the Genogrinder. This is one step that should convince you that the unambiguous labeling of tube strips that we performed in the previous post was a good idea.

 

Step 2: Deep Freeze

For lyophilized tissue

If you have lyophilized tissue, you may be tempted to put it into the grinder at room temperature. I'm here to tell you that it won't work – although the tissue is dry, it is still too flexible to grind properly at room temperature. You'll just end up with large pieces of leaf that are jammed into the bottoms or tops of the tubes. Despite this, I can't emphasize enough how much easier it is to work with lyophilized tissue, because you can grind it easily after letting it sit in a -80° C freezer for a while. So, if you have lyophilized tissue, this part is easy:

  1. Place tubes in grinding racks in a -80° C freezer for a few hours or overnight.

For frozen tissue

If you’re using frozen tissue, -80° C probably won’t be cold enough for adequate grinding. Unfortunately this means you’ll need to use liquid nitrogen. So, the steps are:

  1. Pour an inch or so of liquid nitrogen from a vacuum flask into an adequately-sized styrofoam cooler (the square ones used as inserts for shipping refrigerated/frozen lab supplies work well)
  2. Use tongs to submerge the grinding racks holding the sample tubes in the liquid nitrogen for a minute or so

 

Step 3: Grinding

Whether you have used lyophilized or frozen tissue, the next steps are the same:

  1. Quickly transfer the racks into the Genogrinder (I probably shouldn’t have to mention this, but please use tongs and not your bare or gloved hands to do this if the racks are sitting in liquid nitrogen)
  2. Secure racks in place with the clamping mechanism and close the Genogrinder’s lid
  3. Grind the samples! I typically run the machine at 1,000 rpm for 1 minute.

After grinding, the samples should have turned into nice, pale-green flaky powder, like the first five tubes in the picture below:

Properly ground wheat leaves

Epilogue, and Reflection

That’s all there is to it. Like I said at the beginning, grinding plant tissue is not rocket science, but doing it properly can save you hours of frustration. You can now:

  1. Put the sample tubes back in their original (flimsy) racks
  2. Place the samples back in a freezer for storage, OR
  3. Proceed directly to DNA extraction

One thing to note is that we can reuse the grinding racks – just check prior to each grinding to make sure that cracks aren’t forming in them. If they are getting cracked, just discard them and use a new set. In general, things tend to break when they are exposed to liquid nitrogen and then immediately shaken at high speeds – don’t be surprised if you need to replace racks often when using liquid nitrogen, or if you need to replace some cracked tube caps.

One final warning: any rack that goes into the tissue grinder should never go into a centrifuge later. Take it from me: you might end up really, really sorry if you ever do this. There’s nothing that can ruin your day quite like having a rack holding tubes full of chloroform shatter inside a spinning centrifuge.
