Brian Ward, Ph.D.

What Software Should I Learn? Pt. I

  • January 17, 2021 (updated January 23, 2021)
  • by brian_ward

Linux and Shell Scripting


If you are preparing to enter graduate school in the sciences, or have recently started a graduate program, you have probably become keenly aware that working with modern datasets requires tools that are more powerful and more esoteric than Excel. One way or another, you will have to learn how to code, and it probably behooves you to learn this skill sooner rather than later. You might start off graduate school the way I did, having taken a single introductory computer science class as an undergraduate and learned to write rudimentary programs in C. On the other end of the spectrum, you may come to graduate school with a fair amount of coding experience under your belt, looking to refine your skills and learn how to apply them to your chosen research subject.

Whatever the case, I think it is helpful to review the various applications for coding that you’re likely to run into in graduate school. My advice here is geared towards the life sciences, as my field of study falls within this domain. I won’t offer any specific advice on how exactly to learn a new language, because there isn’t any magical, easy way to go about it. You should learn how to code the same way you learn everything in school – by taking a class, either in-person or online. More importantly, during or after that class you should then learn the new coding language for real by working on a project that requires you to use it. At first you will stumble about, writing terrible code that gives you grotesque, misshapen output. Eventually you’ll progress to writing terrible code that gives you the correct output. Someday down the road you may write very good code that other people use in their research. Otherwise you’ll be like me, writing mediocre code that gets the job done. Either way, you will have learned a valuable skill that will make your future academic life easier.

What is Linux?

If your research project(s) will involve any degree of bioinformatic work, in particular the processing of sequencing data, learning how to use Linux will be absolutely essential. It’s very unlikely that you haven’t heard of Linux, but just in case: Linux is one of several “Unix-like” operating systems. Also included among the ranks of Unix-like operating systems is one you may not have heard of, BSD, and one you have heard of – MacOS. The Unix-like OSes are, of course, like Unix, an operating system first developed at Bell Labs starting in 1969. Linux itself is named after its creator, Linus Torvalds, and was first released in 1991. Since then, many thousands of different versions, or distributions, of Linux have been developed. Whatever distribution you choose (more on that below), you can open a terminal (technically a terminal emulator) and begin typing in commands by hand. The program that interprets these commands, serving as the interface between the user and the operating system, is called a shell. There are several different shells out there – sh, zsh, fish, and the most popular, bash. The magic of Linux and other Unix-like operating systems is that the commands you would normally type out by hand can be combined into scripts – shell scripts – to automate the process.
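
To give a taste of what that looks like, here is a minimal, hypothetical shell script that prints the line count of every text file in the current directory:

#!/bin/bash
# For each .txt file in the current directory, print its name
# and the number of lines it contains.
for f in *.txt; do
    echo "$f: $(wc -l < "$f") lines"
done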

What should I use the Linux shell for?

Why bother with learning how to use Linux, when Windows and MacOS dominate the personal computer market? In your mind, Linux and shell scripting should be associated with one type of data – large text files. This speaks both to the strengths and weaknesses of the shell. You see, the shell’s greatest weakness is that it treats everything it sees as strings (i.e. text). That means that if you type in something that you think of as a number, like “12,” the shell will treat it as text, unless you specifically tell it to interpret it as a number using some syntax that is frankly rather clunky. On the other hand, the shell, along with all its associated utilities, does its job of text processing very, very well.
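
A quick sketch of the difference, in bash (the variable name is arbitrary):

x=12
echo "$x + 1"       # treated as text: prints "12 + 1"
echo $(( x + 1 ))   # arithmetic expansion forces a numeric interpretation: prints 13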

It just so happens that many of the files containing biological data that you encounter will be large text files. This is no coincidence. These include:

  • Files of unaligned sequences, either with associated quality information (FASTQ) or without (FASTA); an example FASTQ record is shown just after this list. These are typically referred to as “formats” or “specifications,” but that is technically incorrect. There is no formal specification for how these files should be formatted. Rather, they represent sets of informal conventions that have been adopted by the community over time.
  • Sequence alignment files (SAM). Typically you would generate a SAM file by aligning the reads in a FASTQ or FASTA file against a reference sequence.
  • Genetic variant data stored in variant call format (VCF) files. These are typically generated by comparing multiple sequence alignment files and seeing where they differ.
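
As promised above, here is what a single FASTQ record looks like (the sequence and quality values here are invented). Each record spans four lines: an ID line beginning with “@”, the sequence itself, a “+” separator line, and a string of characters encoding the quality score of each base:

@sequence1_id sequence1_description
AGTCCGCATAG
+
IIIIHHGGFED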

You may have already encountered these types of files. They usually are compressed (often with the extension “.gz”), and if you try to open one with a text editor on your typical laptop it may promptly crash. Why does this happen? Because the file when uncompressed could easily be larger than the amount of memory on your computer. For instance, you may encounter a 4GB file with the extension “.fastq.gz.” However, when uncompressed this file could actually be 15GB. If you try to open it on a computer with 8GB of RAM, it will completely fill all available memory, bringing everything to a standstill. In Linux, many tools avoid this problem by streaming – reading an input file one line at a time, doing some operation on each line, and then writing it to an output file. Since these utilities only hold one line in memory at a time, they can process files of arbitrary size. It’s important to note that they are also quite fast – likely far faster than anything you could write yourself, since they have been developed over the course of several decades. The primary limiting factor for file operations becomes the read/write speed of the storage medium.
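
As a concrete (and hypothetical) illustration of streaming, here is how you could count the reads in an arbitrarily large gzipped FASTQ file. gunzip -c streams out the decompressed lines, wc -l counts them as they pass by, and we divide by four because each FASTQ record spans four lines; at no point is more than a line of the file held in memory:

echo $(( $(gunzip -c my_reads.fastq.gz | wc -l) / 4 ))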

The ability of most Linux command-line utilities to handle very large files is one reason that the large computers you will encounter in your research – servers and high performance computing clusters (i.e. supercomputers) – will almost invariably run some kind of Linux distribution. They will also almost invariably have no graphical user interface. So when I talk about “learning how to use Linux,” I specifically mean learning how to work with the command line. If your experience is like that of many thousands of graduate students around the world, you may get some general guidance from your university’s IT department on how to log in to some particular server using a secure shell (ssh) interface. Upon logging in, you will be confronted by an empty black screen with a blinking cursor. What now?

Some Simple Examples

Linux, like Unix, uses the input redirect (<), pipe (|), and output redirect (>) operators. Most command line utilities will by default read from standard input (the keyboard) and write to standard output (the screen), but these operators allow you to read input from a file, use the output of one command as the input of another, and send output to a file instead of the screen. The input redirect operator is used less frequently than the other two, because many command line utilities can be pointed directly to an input file to read from. For instance, say we had a gzip-compressed fasta file in the typical format of sequence IDs preceded by “>”, sequence descriptions appearing after the first space in ID lines, and sequences as everything not preceded by “>”. For instance:

>sequence1_id sequence1_description
AGTCCGCATAG
>sequence2_id sequence2_description
GGTCACTTTGA
>sequence3_id sequence3_description
TTACTGGCAAA
...

If we wanted to, say, isolate the first 10,000 sequence IDs (without the “>” preceding them) and place them into a text file, we could write:

gunzip -c my_fasta_file.fa.gz |
    grep ">" -m 10000 |
    sed 's/>//' |
    sed 's/ .*$//' > my_fasta_ids.txt

Here we perform the following steps, using pipes to supply the output of one step as the input for the next:

  1. Decompress the gzipped fasta file using gunzip -c. The “-c” option (short for “--stdout”) tells gunzip to write its output to standard output, which would normally be the screen, except here we redirect it using the pipe operator
  2. Use grep, which pulls out any line containing “>”. Using the -m (for max) option limits the output to the first 10,000 matches.
  3. Use the Unix stream editor program (sed) to reformat our sequence ID lines. Here we are using find/replace commands, written in the strange-looking notation ‘s/find/replace/’, where “s” stands for “substitute.” So in the first sed command we’re replacing the string “>” with nothing (that is, just removing it).
  4. Perform a second sed replacement using a regular expression to replace everything after the first space (including the space) with nothing.
  5. Redirect our final output to my_fasta_ids.txt.

Notably, you can use this short command to isolate the first 10,000 sequence IDs from a fasta file containing 100,000 sequences, or 100,000,000 (as long as it can fit on your hard drive). Some, but not all, Linux systems will have the program zgrep installed, which is a version of grep that can directly read gzip-compressed files. In addition, we can supply sed with multiple find/replace expressions by using the -e option multiple times, so we could condense the above command to:

zgrep ">" -m 10000 my_fasta.fa.gz |
    sed -e 's/>//' -e 's/ .*$//' > my_fasta_ids.txt
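
And since our output file is just plain text, the input redirect operator mentioned earlier gives us a quick sanity check, feeding the file into a command’s standard input:

wc -l < my_fasta_ids.txt    # should print 10000 (or fewer, for smaller input files)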

Many bioinformatics tools are designed to process specific types of the files listed above, using syntax that is more or less identical to general shell syntax. So we have the program samtools for working with SAM files, and the program bcftools (or the older vcftools) for working with VCF files. We can perform various piping/redirection operations using these, so for instance to output the chromosome, position, reference allele and alternate allele for all SNPs on chromosome 2 in a given VCF file, we could write:

bcftools view -v snps --regions chr2 my_input.vcf.gz |
    bcftools query -f '%CHROM\t%POS\t%REF\t%ALT\n' - > chr2_snp_info.txt

Here we are using bcftools view to isolate just the SNPs (-v snps) on chromosome 2, and then passing the result off to bcftools query (the lonesome “-” here means “read from standard input,” i.e. the pipe’s output). The -f option then gives bcftools query a very literal statement of how to write the output, separating our desired output fields by tabs (“\t”) and signifying that we want each SNP to be listed on a separate line (“\n”).

Hardware for Learning Linux

You always have the option of learning Linux without any graphical interface, as you would on a university server or cluster, or else through a virtual machine. However, it can be a hassle to have to deal with these systems before you have the basics down. If you want to go a more local route, you have a few options:

  1. A PC running a Linux distribution. This is by far the best option. There are a huge number of distributions to choose from – some popular ones are Ubuntu (though it faces some criticism due to corporate practices of its parent company Canonical), Linux Mint (a very popular community-driven fork of Ubuntu), Debian, Zorin OS and Elementary OS (both good for people used to Windows or Mac), Manjaro, Gentoo, and the list goes on. I personally prefer Linux Mint. Working on a Linux computer will allow you to try out various commands and scripts on small test datasets before running larger jobs on servers or clusters. It’s also much easier to learn how to use the command line locally, where you don’t have to worry about constantly logging in to a remote computer, making other users mad by hogging resources, or else scheduling jobs like you would on a cluster. It’s also probably easier than you think to obtain a computer running Linux. If you have an old laptop or desktop lying around that struggles to run Windows or MacOS, you can often install a lightweight Linux distribution and magically resurrect it. Some good distributions to consider in this case would be those using the Xfce desktop environment, such as Linux Mint Xfce edition, or those using the LXDE desktop environment, such as Lubuntu.
  2. A dual-boot computer. This is when you have two operating systems installed side-by-side on the same machine. This way you can have Windows or MacOS installed alongside a Linux OS. This might seem like the best of both worlds, but there are some caveats. Switching between the two OSes requires restarting the machine, which can become tedious. Also, be aware that setting up dual boot may void your warranty if it is still active.
  3. Use Windows 10 or MacOS. Traditionally, Macs were considered superior to Windows PCs if you wanted a popular, commercial operating system that somewhat resembled Linux. Recall that both Linux and MacOS are Unix-like. However, MacOS more closely resembles BSD than Linux, and it requires some tweaking to truly get it working more like Linux. Even then, it’s never quite the same. By comparison, Windows was the odd one out, excepting a few solutions like Cygwin or virtual machines (VMs). However, that changed somewhat with the release of Windows 10 and the Windows Subsystem for Linux (WSL), which allows you to install one of several Linux distributions inside of Windows. The first version of WSL was not a true VM, but rather a compatibility layer that translates Linux system calls into Windows ones, which made it quite a bit faster than a traditional VM, though still somewhat clunky. The newer WSL 2 runs a real Linux kernel inside a lightweight VM and has vastly improved performance over the original WSL. However, there is a catch – WSL 2 keeps its own filesystem, and it can be quite slow when working with anything in the Windows filesystem on the computer. With the release of WSL 2, Windows has become more competitive with Mac when it comes to Linux functionality. However, if you’re looking for a seamless experience for learning Linux, both Windows and Mac will still give you some headaches from time to time.

Whatever route you choose, the first bit of learning might be a little bit tough and frustrating. However, now that you can look up or ask any question on the internet, the task of learning Linux has become a lot easier. Soon enough you’ll be breezing through the command line and processing enormous files, surprising your advisor by getting through huge computational projects with relative ease.


Misconceptions About Genetics and Evolution in Pop Culture, Part…

  • June 8, 2020
  • by brian_ward
Incredible Hulk #1
Strength and durability: +5000%
Reading and comprehension: -8000%

Mutation

For part II of this series, we’ll tackle the big one – perhaps one of the most misunderstood natural phenomena in all of fiction. In writing, movies, television shows, video games and countless other media, genetic mutation is often viewed as something frightening or uncontrollable, frequently brought about through the abuse or misuse of science, or else through unfortunate accidents. For instance, genetic mutation turns the physicist Bruce Banner into the Hulk – a violent, simple-minded brute whose favorite hobby is smashing things.

In reality, mutation is neither a good thing nor a bad thing. It is simply a natural process, like the weather or plate tectonics. Mutation is the raw fuel for natural selection and evolution. It is the spontaneous change in DNA sequence over generations due to either environmental factors (e.g. exposure to ionizing radiation or mutagenic chemicals) or DNA replication errors. This last part is important – mutations would occur in a population over time even if it were magically shielded from any environmental factors that cause mutation. Most small-scale mutations are single-nucleotide substitutions, where one nucleotide is swapped for another, for instance changing the sequence ATTGCTCCA to ATTGATCCA (when such a variant is present in a population, it is called a single nucleotide polymorphism, or SNP). Small-scale mutations also include deletions: ATTGCTCCA to ATTGTCCA, and insertions: ATTGCTCCA to ATTGCATCCA. However, larger-scale mutations such as the deletion of entire genes are also possible.

How Does Fiction Get it Wrong?

In general, fictional accounts of genetic mutation fail to grasp one major concept: the difference between somatic mutations and germline mutations. Germline cells are the small number of cells in your body that are involved in passing your genes on to your progeny. They include eggs, sperm, the cells of the ovaries involved in egg formation (oogenesis), and the cells of the testicles involved in sperm formation (spermatogenesis). Because these cells pass genetic information on to offspring, mutations arising in them will potentially be passed on to every cell of an offspring’s body as it forms through a complex series of cell divisions following the fusion of an egg and sperm. In general, female mammals are born with all of their eggs already made (there is some debate on this though). Therefore mutations to the cells involved in oogenesis after birth will not be passed on to a woman’s offspring. This is not the case in males, where the cells involved in spermatogenesis are busy creating sperm throughout most of their life.

In contrast to germline cells, somatic cells make up nearly the entire mass of your body. While some are involved in the act of reproduction (e.g. the uterus in women, or the urethra in males), the defining characteristic of somatic cells is that they do not pass any genetic information on to an organism’s progeny. This brings us to the primary error regarding mutation in fiction – mutations to somatic cells cannot affect an organism’s body as a whole. For instance, a mutation occurring in a skin cell on your forearm due to ultraviolet light exposure is unlikely to give you any superpowers. In contrast, an organism can acquire a mutation in a germline cell, which is then passed on to all cells in their offspring. These two distinct outcomes of mutation continue to get confused in science fiction. Below we’ll shed a little more light on the difference between the two.

The Effects of Somatic Mutation

Usually, mutations in somatic cells have no effect. Your genome contains a large amount of genetic material that doesn’t directly affect your phenotype. This includes repetitive elements – somewhat autonomous DNA sequences that either “copy and paste” or “cut and paste” themselves across the genome. It also includes genes that have been epigenetically silenced, and pseudogenes, which are genes that have lost their function (due to prior mutations). Mutations within any of these regions, and many others, will typically not have any measurable effect on a cell’s phenotype.

But what if a mutation occurs inside a gene?

Then things can get a little more interesting. There’s still a very good chance that a mutation inside of a gene will have little or no effect. Genes are first transcribed, meaning that an RNA copy of the DNA sequence is created. Then the RNA transcript is translated into a protein (which is a chain of amino acids linked together). However, after transcription and before translation, portions of the RNA are removed. Since these regions, called introns, don’t make it into the final protein product of the gene, mutations within them are often “silent.”

But what if a mutation occurs in a coding region in a gene?

The regions of genes that are transcribed into RNA and then translated into proteins are called exons. Even if a mutation occurs in one of these regions, it may not have any effect. For one thing, the genetic code is degenerate. This doesn’t mean that it’s out making trouble (usually). Rather, the genetic code consists of three-letter “words,” called codons. Each codon is translated into an amino acid – the building blocks of proteins mentioned above. So, for instance, the three codons CTTGTTGGC will be translated into the amino acid chain Leucine-Valine-Glycine. However, nearly all the amino acids are encoded by multiple codons. For instance, glycine is encoded by the codons GGC, GGT, GGA, and GGG, so any substitution of the last base of a codon starting with “GG” will not have any effect on the protein that is translated from the gene. This is what is referred to as a “synonymous” mutation.

But what if a coding region mutation isn’t synonymous?

Now things start to get more interesting. A mutation that results in a change to an amino acid is referred to as a “missense” mutation. These mutations can have little or no effect, or major effects, depending on how the amino acid swap changes the overall shape or function of the protein. In addition, there are two types of mutations that can have more dramatic effects. One is the formation of a premature stop codon. There are three codons that signal for translation to stop – TAG, TAA, and TGA. If a mutation converts, for instance, a cysteine codon (TGC) into the stop codon TGA, then it will result in a shortened protein, which can have highly significant effects. This is referred to as a “nonsense” mutation. Finally, the other more serious type of mutation is a frameshift. This occurs when an insertion or deletion of one or more bases completely changes the “words” (codons) of the genetic code in a coding sequence. To illustrate, here’s an English sentence that happens to consist of three-letter words, like the genetic code: “The fat cat and big dog ate.” If we insert an adenine (A) at the beginning of the sentence, but keep the word separations at the same place, we end up with: “Ath efa tca tan dbi gdo gat e.” Basically, a frameshift mutation scrambles codons, resulting in the production of a completely different protein.
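
Incidentally, you can watch a frameshift happen with the standard Unix fold utility, which breaks text into fixed-width chunks (three-letter “codons” in this case). Inserting a single letter up front scrambles every codon downstream:

echo "THEFATCATANDBIGDOGATE" | fold -w3     # THE FAT CAT AND BIG DOG ATE, one codon per line
echo "ATHEFATCATANDBIGDOGATE" | fold -w3    # ATH EFA TCA TAN DBI GDO GAT E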

Mutations with Major Effects

As we have seen, mutations often have no beneficial or detrimental effect. In fact, the neutral theory of molecular evolution, proposed by Motoo Kimura in 1968, posits that the majority of evolution on the molecular scale is due to the genetic drift (the random change in allele frequencies over time) of mutations with neutral effects. However, there are cases where mutations can have major phenotypic effects. Germline mutations with major effects may be enriched in or purged from a population through natural selection. Highly deleterious mutations may be lethal to organisms, while highly beneficial mutations may grant an organism greater fitness, increasing the likelihood that the mutation will be passed on to future generations.

In somatic cells, we only really see the outcome of deleterious mutations. The cells making up your body must all “play by the rules.” Specifically, they have to behave as part of a multicellular organism. Mutations that allow a cell to function beyond its very narrowly-regulated niche within the body can have very dangerous consequences. Cells have many contingencies to deal with such situations. Apoptosis is programmed cell death, whereby a cell is triggered, either internally or externally, to self-destruct. Many cells within your body undergo this process every day. There are many ways in which apoptosis is initiated; one way is in response to mutations that alter how a cell grows and multiplies. In this way, cells that pick up mutations that could allow them to stop playing nice with the other cells around them often automatically self-destruct. Unfortunately these mechanisms don’t always work. Sometimes the very pathways meant to induce apoptosis become corrupted. These changes don’t give us new superpowers – instead they may lead to the formation of cancerous cells.

Who Gets it Right

Given all the intricacies of how mutation works in reality, different works of fiction fall close to or wide of the mark. In the Marvel Universe, mutants are those born with mutations (presumably acquired as germline mutations by their parents or other ancestors), whereas mutates are characters who acquire their powers from some external force, energy, catastrophe, etc. So, comics that focus mostly on mutants, such as the X-Men, largely get the concept of genetic mutation correct. It’s still not clear how mutations seem to give everyone big muscles or big breasts. Nor is it clear how they occasionally give people god-like powers. In contrast, comics about mutates, such as The Hulk, Spiderman, or the Fantastic Four, have a less realistic depiction of the effects of mutations (relatively speaking), as we are supposed to believe that the exposure of a fully-developed human to some mutagen could alter their entire body. As we have seen, the most dramatic effect on somatic cells due to exposure to mutagens is cancer. This isn’t quite as exciting a subject as heroes fighting monsters, though there is a genre of comics and graphic novels dealing with illness and medicine. Interestingly, there are some cases in the Marvel Universe where mutates later give birth to mutants. One example is the pair of mutates Sue Storm and Reed Richards of the Fantastic Four, who give birth to the mutant Franklin Richards. It turns out Franklin is essentially a god – he can alter reality itself and create universes. It’s not clear how the Marvel Multiverse still exists at all when children wield this much power.

Franklin Richards creates a universe
There goes the neighborhood universe…

Who Gets it Wrong

There are too many to list here – take your pick really. We already mentioned the mutates in the Marvel Universe. We could add to that an almost unlimited number of fictional characters, including many of the characters and monsters from the Resident Evil video game series, the supermutants from the Fallout video games, Joker from the Batman comic books, the Teenage Mutant Ninja Turtles – the list goes on and on.

Room for Artistic License

There is one potential mutagen that could more feasibly create mutations in a wide range of somatic cells – viruses, specifically retroviruses. These are viruses with RNA genomes which are reverse-transcribed into DNA copies, which are then inserted into the host cell genome. In this way, retroviruses naturally genetically modify their host cells. For this reason, retroviruses are the vectors of choice for gene therapy. Therefore it’s not too hard to imagine a scenario in which gene therapy is used not just to cure disease, but to grant some enhancement or even powers to humans. This ignores the many hurdles that have made successful gene therapy difficult to achieve, but it is at least within the realm of possibility. Interestingly, the human genome contains many inactive endogenous retroviruses. These were retroviruses which at one time infected germline cells in our ancestors, and have since been passed down through the generations. While retroviruses are a (somewhat) plausible method of creating body-wide mutations in an adult, we first need much more research into using them to cure disease, before we start using them to give people rockin’ bods and superpowers.


Investigating COVID-19 From Your Couch

  • May 12, 2020
  • by brian_ward
I need you to help fight COVID-19!

I don’t need to remind you that the world has changed dramatically since the worldwide spread of SARS-CoV-2, the novel coronavirus that causes COVID-19; millions are suddenly unemployed, or else working at home. If you suddenly find yourself in one of these situations, this can be a good time to pick up some new coding and data science skills while helping to investigate the cause of our current crisis.

You’ll see many statistics and analyses regarding COVID-19 getting thrown around online, but you don’t necessarily have to take their word for it. Many large sets of COVID-19 data are publicly available. If you’re one of the many who suddenly finds themselves with a lot more free time on their hands, now could be a great opportunity to learn some new things while playing around with real-world data sets. It’s not likely that you’ll be generating insights beyond those of professional epidemiologists, but it never hurts to have more eyes on the data. Below is a list of a few compiled sources of COVID-19 information.

COVID-19 Open Research Dataset

The White House and a coalition of research institutions have made the COVID-19 Open Research Dataset (CORD-19) available for download. This dataset consists of 59,000 journal articles (47,000 with full text) on COVID-19, SARS-CoV-2 and other coronaviruses. This represents a remarkably large collection of freely accessible scientific literature, so if you’re looking for a large dataset to try learning some text mining techniques, this is as good a place to start as any.

Johns Hopkins COVID-19 Data Repository

The folks at the Johns Hopkins University Center for Systems Science and Engineering are the ones behind the popular ArcGIS global COVID-19 tracking map. The curated set of data that they use to produce this map is available here. This page has a handy list of all of the project’s data sources, including the WHO, CDC, and various international government statistics, in case you wanted to do more sleuthing on your own. This collection previously only contained case and death counts by country, with a few select U.S. regions delineated (e.g. New York City), but has recently been upgraded to also report county-level data like the New York Times data set mentioned below. This git repository is automatically updated on a daily basis, so don’t forget to git pull often.
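
Grabbing and refreshing a local copy is just the standard git workflow. A sketch, assuming the repository lives at its usual GitHub address (follow the project’s own link to be sure):

git clone https://github.com/CSSEGISandData/COVID-19.git
cd COVID-19
git pull    # run daily to pick up the latest numbers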

New York Times U.S. COVID Data

The New York Times has its own set of interactive graphics on the COVID-19 outbreak, and like the JHU site mentioned above, the company has made the backend data for these graphics freely available. This data set tracks reported COVID-19 cases and deaths at the nation, state, and county levels, and is updated on a daily basis.
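
If you’d rather peek at the data before cloning anything, you can stream a file straight from the repository. A sketch, assuming the raw-file URL follows the project’s usual GitHub layout:

curl -sL https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-states.csv | head -n 5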

Finding Ways to Spend Your Time

As you can see, there are several different sources of COVID-19 data to play around with. Many other services make use of one or more of these sources – for instance Tableau’s COVID-19 Data Hub includes a free trial for a starter workbook, utilizing the Johns Hopkins-compiled data mentioned above. This allows users to rapidly prototype different virus-related visualizations, in case you want to try playing around with data, without getting into too much hardcore coding.

If you are interested in learning more about coding, this could be a great time to jump into some free online training (maybe even if you wish to join the endangered, but very important, ranks of COBOL programmers). Probably of more interest to most people are free courses in more modern languages, such as Google’s Python machine learning crash course. I do want to be clear here – it is okay to feel isolated, afraid or bored in these times, and it’s okay to respond to this pressure by bingeing Tiger King on Netflix in a single day. However, if you decide you would like to try something new, there’s always a need for more people who can find patterns underlying the phenomena that shape our world and society. If it makes us better prepared for the next pandemic, then all the better.


Frequent Misconceptions About Genetics and Evolution in Pop Culture

  • January 6, 2020
  • by brian_ward

Lately, the consolidation of the entertainment industry has gotten me thinking about all the ways in which genetics and evolution have historically been misrepresented in popular culture. With the popularization of nerd culture, we now see many old misconceptions long present in speculative fiction suddenly reaching a much wider audience. Whether or not you have any personal interest in the Marvel Cinematic Universe, the Star Wars Universe, or myriad other fictional universes is irrelevant – these intellectual properties have become embedded in the zeitgeist, and for the average consumer there is no escape. A child growing up today will encounter myriad fictional interpretations of biological principles, some quite closely related to reality, and some wildly divergent.

Many of the common misconceptions regarding genetics in fiction arise from ones that scientists themselves often held for years or decades. However, even after a misconception has been expunged from the scientific community, it may linger in the collective popular culture consciousness for many more years. Here I wanted to start a series where I explore some of the most common ways in which pop culture gets genetics and evolution wrong. Along the way we can look at how different pieces of fiction are either ignorant of the current scientific consensus, or how they address or circumvent it. We’ll touch on topics such as:

  • Genes that influence traits
  • Mutation
  • The purpose of evolution
  • The direction of evolution
The Thing reacts after learning that his origin story was based on a misunderstanding of genetic principles

Before proceeding, I want to make one thing clear: this is all just for illustrative and educational purposes. Speculative fiction is interesting to us because of all the ways in which its settings don’t match with our world. To illustrate just how boring reality can be, here’s a more plausible retelling of the origins (or lack thereof) of the Incredible Hulk:

In the desert, mild-mannered physicist Bruce Banner saves a teenager who unwittingly wanders onto the test site of a nuclear bomb. In the process, his body is exposed to high levels of gamma radiation. Nothing happens. Thirty years later, he develops leukemia, which may or may not have been caused by his previous radiation exposure.

This is not the type of story that will cause comic books to leap off of the shelves into the hands of children – or adults for that matter. This is why science fiction should not necessarily attempt to be scientifically sound. Everyone needs some escapism now and then, after all. Now, without further ado, we can proceed on to our first major misconception:

Part 1: The Gene for X

Here’s something that you’ve probably seen dozens of times across many forms of media: the headline “Scientists discover gene for X.” In fact, you may have seen it so many times that you’ve subconsciously begun to expect that every trait or disease is controlled by a gene. These sorts of headlines typically arise from genome-wide association studies (GWAS), which, as we previously examined, seek to associate the phenotypic variation we see for a trait in a population with underlying genetic variation.

The problem with GWAS studies is that they are easily misinterpreted. While it is true that a well-designed GWAS study can often uncover genes influencing trait X, what is often not reported on is that variation in each of these genes may explain only a very tiny portion (think a few percent or less) of the phenotypic variation that is observed in the population. What’s more, even if we add up all the variation explained by all the significant genes identified in a GWAS study conducted on a very large sample, we still typically only explain a fraction of the variation that we know is due to genetics (i.e. the heritable portion of the phenotypic variation). The cause of this phenomenon is still being debated, and it is fittingly referred to as the missing heritability problem. A recent study examining human disease risk predicted by multiple GWAS studies for multiple diseases found that, on average, the heritability explained by SNPs is quite low.

To be clear here, there are many examples of genetic diseases that are caused by mutations in one or a few genes. Currently, the U.S. National Library of Medicine lists over 1,300 diseases with a genetic basis. For example, sickle cell anemia is caused by mutations within the HBB gene encoding the two β-globin subunits of hemoglobin. Individuals who are homozygous for mutations of HBB (that is, who carry them from both of their parents) will have a malformed version of hemoglobin, which in turn leads to malformed (sickle-shaped) red blood cells which get stuck in small capillaries, limiting blood and oxygen availability to certain tissues.

However, there are just as many if not more examples of traits which have far more complex genetic underpinnings. In a previous post, we talked about quantitative traits – those which are controlled by many different genes. An example would be human body weight. If this trait were controlled by a single gene, we would expect that all adult humans would have just a handful of different possible body weights. However, clearly this is not the case. An adult’s body weight can be any value over a range of several hundred pounds. What’s more, an individual’s body weight can obviously change over time. This brings us to another characteristic of quantitative traits – they are influenced by the environment to varying degrees. Many times there simply is no gene for trait X. Rather, it is more accurate to say that many traits are influenced by hundreds to thousands of genes, and their interactions with each other, and with the organism’s environment.

Although we now generally know better than to try and find single genes underlying complex traits, this doesn’t stop research from being readily misinterpreted. Take, for instance, the recent GWAS study of sexual orientation which definitively stated that there is no single “gay gene” underlying same-sex sexual behavior, and the controversy that erupted when this study’s data was immediately misused to create a “How Gay Are You?” app. During the Human Genome Project, many scientists expressed optimism that human traits would be “solved” to one extent or another – that we would finally have the key to unlocking the genetic puzzle underlying many of our characteristics and ailments. Twenty years and thousands of GWAS studies later, it is okay to admit that we remain humbled in light of the baffling complexity of the genetics influencing who we are.

Returning to the “gene for X” trope in fiction, it all begins to seem a bit silly talking about a gene for genius-level intellect, or telekinesis, or superhuman strength (actually there is a rare mutation for that one), when there doesn’t appear to be any particular gene underlying many of the traits that make us who we are. To see how far the gene for X trope can diverge from reality, we can look to the subject of transgenic organisms. Genetically-modified crop plants have been grown in parts of the world for several decades. These are typically produced by introducing one gene that does one specific thing into a plant from another organism – usually making the plant immune to a particular herbicide. But what if we could take this transgenic mixing and matching much further? That is where we cross into the realm of science fiction.

An Egregious Example

The 2015 blockbuster Jurassic World shares much of the implausible science of Jurassic Park, but without the work that Michael Crichton put in to give the latter its well-crafted illusion of authenticity. In it, theme park scientists, no longer content to merely resurrect extinct dinosaur species, create a new species of their own. The resulting Indominus rex is made through a process that would be the envy of any real-life synthetic biologist. Take some T. rex big n’ scary genes, mix in some Velociraptor claw genes, add a dash of crocodile toothy grin genes. Get rid of all those pesky feather genes. Whisk until batter is smooth. (As an interesting aside, the “raptors” in Jurassic Park were originally modeled after the related, but distinct, Deinonychus antirrhopus. Both are in the family Dromaeosauridae, but Velociraptor and Deinonychus are separate genera within this family).

“I can speak five languages and solve complex systems of equations thanks to my Homo sapiens language and math genes, but all you see is a monster”

The Indominus rex displays many remarkable characteristics and abilities, which seem to be inherited from a general blending of the various species that make up its genome. The height of this wildly successful cookie baking experiment comes when the I. rex encounters a group of Velociraptors, and begins to hold a casual conversation with them (in roaring/screeching dinosaur speak). We should keep in mind that the I. rex was raised in isolation in its own pen, so this would be akin to a human being raised in solitary confinement encountering another human and immediately striking up a conversation about the weather. Evidently the raptor language genes (?) that the I. rex carries are quite potent – the biological equivalent of a universal translator.

It shouldn’t come as a surprise to anyone that this is all far from reality. It is true that a sort-of synthetic organism was recently created by the J. Craig Venter Institute. However, it is a monumental leap to go from engineering a bacterium to engineering a multi-ton dinosaur. There is also the question of why we should be in such a rush to create new species when we are rapidly eliminating the ones that already exist. Our dreams of creating a “clever girl” that could one day outsmart Robert Muldoon at his own game will have to wait a little longer. In the meantime, if you want to see dinosaurs in real life, you can always get a pair of binoculars, walk out into the woods, and go bird watching.


Multi-Environment Genomic Prediction Paper

  • October 21, 2019
  • by brian_ward
sheep looking down
“You’d better get those predictions right…”

In my last post (a very long time ago), I talked about a genome-wide association study (GWAS) that I recently worked on and published. I promised to talk about a second paper that came out around the same time, although I struggled to think about the best way to describe it. In the second paper, my coauthors and I performed genomic prediction (aka genomic selection) on the same set of phenotypic data. This method is sort of the anti-GWAS. Recall from the last post that a GWAS of highly polygenic traits may not be very useful, because each individual gene has a small contribution to phenotype. Instead of searching out genes that are affecting our trait of interest, with genomic prediction we instead throw up our hands and say “let’s just use all the genes.” Specifically, we use a set of markers spread across the genome, so that we can assume that every area of the genome that is influencing our trait of interest is in linkage with at least one marker. We then fit a model using data from a training set of individuals, for which both genotypic and phenotypic data have been collected, and then generate predictions for a testing set of individuals, which has only been genotyped with our marker set. Although the specifics vary based on the exact model used, in this method we typically ignore the functions of individual genes. Instead, we use the genetic markers to help us estimate the relationships between individuals in the training set and the testing set.

Why would you want to go through the trouble of doing this? Increasing computational power has enabled the adoption of routine predictive modeling across many fields of study and sectors of industry. In the case of plant breeding, genotypic data is now much cheaper to obtain than phenotypic data. Currently, we can get decent genome-wide genotypic data for around $10 per line. Compare that to the cost of phenotyping a line in the field, which includes all the costs and labor required to plant it in one or more locations, manage it over a growing season, collect data, harvest it, and then potentially collect more data following harvest. Therefore, breeders are very interested in collecting phenotypic data on a subset of lines, and then predicting performance of additional lines using only cheaper genotypic data. Genomic prediction also has the potential to offer higher genetic gains per unit of time, because we can rapidly grow each generation, run our marker panel on it, and make selections without taking the time to grow plants in the field.

Genomic Prediction – A Brief History

Using genetic markers to predict a line’s performance sounds very high-tech, but it’s actually just a new variation on an old technique. The theoretical underpinnings of this method date back to a paper from 1918, when the geneticist and statistician R.A. Fisher established the infinitesimal model. This paper settled a long-standing dispute between two groups of geneticists – the biometricians and the Mendelians. These two groups were like the Jets and the Sharks, or the Bloods and the Crips of early 20th century genetics research. Perhaps that’s an overstatement, but things were certainly heated between the two camps. The biometricians worked on developing mathematical models to describe what they could observe – namely, that many traits exhibit continuous variation (think of a trait like human height or body weight). This school of thought descended from the old theory of blending inheritance. This theory at least had the advantage of an unambiguous name – it posited that for any given trait, progeny represented a blending of their parents’ characteristics. However, the theory did not describe the mechanism underlying this blending. In addition, the biometricians had difficulty explaining things such as transgressive segregants (progeny that exhibit trait values that are outside the range of their parents, for example people who are taller or shorter than both of their parents).

Meanwhile, the Mendelians had become converts to the theory of inheritance via genes, after rediscovering Gregor Mendel’s work around the beginning of the 20th century. (Note that here “gene” is used to mean “unit of inheritance.” This was still decades before the structure of DNA was discovered through the efforts of Rosalind Franklin, Maurice Wilkins, James Watson, and Francis Crick). However, while the Mendelians had a mechanism to describe inheritance, they couldn’t square this mechanism with the observation that many traits vary continuously, in contrast to the more simple qualitative traits like pea flower color that Mendel had originally described.

Fisher’s paper essentially unified these two fields and demonstrated that they were both right, in part. He showed that progeny do inherit their parents’ characteristics via genes (that is, in a Mendelian fashion). However, his major insight was determining that inheritance via genes could lead to continuous variation if each gene contributed a tiny amount to the variation observed for a trait. Hence, in his theory each gene’s effect is considered equal and infinitesimal. This paper also has the distinction of introducing many statistical concepts and methods that are used almost universally today, including variance and the analysis of variance (ANOVA) model.

R.A. Fisher sitting and smoking a pipe
R.A. Fisher demonstrating the smug satisfaction that comes with publishing one of the most important papers in the fields of genetics and statistics

The mathematical framework underlying what we now call genomic prediction was established by the animal breeder C.R. Henderson in the 1970s. Animal breeders originally performed prediction using pedigrees – that is, records of each individual’s ancestors and relatives. For some time, animal breeders suspected that they could replace pedigree data with genome-wide marker data. Using marker data gives us a better idea of the relationship between individuals. The key concept here is that while two full siblings, for example, share 50% of their alleles on average, the actual proportion of genetic material, and the specific regions of the genome, shared by any pair of relatives varies widely around that expectation. We all know that siblings are often quite different from one another (especially the middle one). However, if we’re just going off pedigree data, we can only generate a single prediction for all the siblings of any two given parents.

Plants vs. Animals – Advantages and Challenges

It’s not much of a secret that animal breeders tend to be smarter than us plant breeders. This has arisen out of necessity. Compared to animals, most plants are prodigious reproducers, generating dozens to thousands of seeds each generation, depending on the species. Therefore, plant breeders can create large populations, and grow them in many different environments. If you’re dealing with a self-pollinating plant like wheat, things get even easier, because a plant line that reproduces primarily with itself generation after generation will become “immortalized.” That is, all heterozygous genes will eventually become fixed to one homozygous state or the other. At this point, the line will change very little over generations, so if you plant it, and then collect and plant its seed, its progeny will be nearly identical. Therefore the same line can be tested over and over again, across different environments, giving plant breeders who work with self-pollinating species a brute-force option for evaluating line performance. In contrast, livestock animals reproduce at a much slower rate, and individuals with good genetic merit can potentially be worth millions of dollars. How do you predict how an animal will perform for a given trait when there’s only one of them? The answer, of course, is to look at their relatives.

However, I did just mention one way in which plant breeders’ jobs are harder, when I spoke about testing lines in different environments. In general, plant breeders must deal with more genotype-by-environment interaction (GxE). For complex traits controlled by many genes, an individual’s surroundings (i.e. its environment) start to play a larger part in its phenotype. Animals have the advantage of being mobile. If they get thirsty, they can go get a drink. If they get hot, they can lie underneath a tree. There are of course environmental factors that do come into play if an animal lives in, say, North Dakota vs. Georgia, but on the whole being able to move around gives animals (including humans) a lot of advantages that we often take for granted. On the other hand, plants are stuck in one place, and have to take whatever nature throws at them. Therefore, adequately assessing GxE becomes crucial. That’s where this paper comes in.

I won’t go into the nitty-gritty details, but there are a few ways in which we can deal with GxE interaction. One is to just ignore it, and hope that our predictions don’t suffer too much. This is what most genomic prediction in plants has done to date, on account of using models originally developed by animal breeders. We can get more fancy by explicitly including GxE effects in our models. As my coauthors and I found, this offers little or no benefit when trying to predict the performance of lines that have never been tested before. However, we see more benefit when we either a) implement sparse testing (that is, testing some lines in one environment, different lines in another environment, and then using the relationships between the lines to predict everyone’s performance across all environments), or b) when we use multiple correlated traits for prediction.

At any rate, I think this is probably enough on the topic of genomic prediction for the time being. Plant breeders are now largely past the theorizing and how-to stages of studying this technique. Now we have arrived at the stage where people simply have to try it out, and see how it compares against the age-old method of direct phenotypic selection. It should at least be interesting to see whether it ends up being worth all the hassle or not.


New Wheat GWAS Publication

  • February 25, 2019
  • by brian_ward

As usual, I’ve been remiss in updating this site. However, this time I have a good reason for it. I am the first author on two journal articles that were just published within a day of each other at the end of last week. These articles both used the same data collected from the field, consisting of around 320 wheat lines grown in two locations across two years, but they encompass quite different analyses. A common theme that I’ve picked up when talking to friends and family is that nobody knows what I do for work, nor why I do it. In that spirit, I wanted to give some less-technical background on these papers here – if you want the nitty-gritty, both articles are available open-access. I’ll just go over one of them for now, and will save the other one for a future post.

Definitions

Before starting, we’ll have to define a few things, so that everything that follows will make sense. An individual’s phenotype is what you can observe about them. We typically divide phenotype into specific traits. In the case of plants these might be height, grain weight, branching, flowering time, etc. An individual’s genotype is their genetic makeup. Genotype contributes to phenotype, in conjunction with environment (an individual’s surroundings). Finally, an organism’s genome is composed of a set of very long molecules called chromosomes, each one of these being the familiar double helix, whose sequence of bases or nucleotides forms an organism’s “genetic code”. A marker is some specific genomic feature that we can use to differentiate between different lines. In this case, the markers we’re referring to are single nucleotide polymorphisms (SNPs): sites where a single nucleotide is substituted for another one.

Here we will be talking about bread wheat, which is the product of ancient hybridizations between multiple grass species in the Middle East. This means that it confusingly has three genomes inherited from three separate progenitor species, labeled A, B, and D. The A genome was inherited from red wild einkorn wheat (Triticum urartu). The B genome was likely inherited from a close relative of the grass species Aegilops speltoides, while the D genome was inherited from Tausch’s goat grass (Aegilops tauschii). Each genome has seven chromosomes, labeled 1A through 7A, 1B through 7B, and 1D through 7D, for a total of 21 chromosomes. The separate genomes are sometimes referred to as “subgenomes,” but here we will just refer to them collectively as “the genome.” A very common question is “where is the C genome?” The C genome is present in another Middle Eastern grass species, Aegilops caudata, which is not a progenitor species of modern wheat. I suspect that this is a historical quirk due to taxonomic reclassifications, though it’s not something I’ve looked into in great detail.

Finding marker-trait associations

In this first paper, published in PLOS One, my coauthors and I performed a genome-wide association study of various yield-related traits. What does that mean? A genome-wide association study, or GWAS, is a type of study in which we genotype all the included lines with a set of markers spread throughout the genome. Then we determine which markers (if any) explain a significant portion of the phenotypic variation that we observe in the field, that is, which markers are involved in significant marker-trait associations. In our case, we used tens of thousands of markers, though in many other species with more advanced genomic data available (especially in humans), it’s customary to use millions or tens of millions of markers.

When I say “yield-related traits,” I’m referring both to grain yield, the most important trait in cereals, as well as traits like grain production per unit area, grains per head, heads per unit area, etc. We also included several phenological traits (i.e. traits relating to the timing of plant development), such as heading date, plant maturity date, and the period of time between the two.

GWAS studies are a dime a dozen these days. What I think makes this one interesting is that it is one of the first wheat GWAS to use a reference genome. When I started graduate school, there was no reference genome, and nobody thought that there would be one anytime soon. Soon afterwards, the International Wheat Genome Sequencing Consortium sequenced the largest wheat chromosome, 3B. Weighing in at about 1 billion bases, chromosome 3B is more than twice as large as the entire rice genome. The full wheat genome is enormous, consisting of somewhere around 15 billion bases – over five times as large as the human genome. To make matters worse, much of this DNA (about 85%) consists of repetitive elements – quasi-autonomous sequences which “copy and paste” themselves throughout the genome. I personally find these sequences fascinating, but they do greatly complicate the task of assembling sequencer reads together. The monumental task of sequencing chromosome 3B convinced many researchers that finishing the entire genome could take a very long time. However, the rapid advancement of assembly technologies brought us a surprise – a fully assembled reference genome published last year.

Candidate genes

Having a reference genome available is a real game-changer. Previously, we had to rely on genetic positions for markers. These have the unfortunate property of being variable depending on the population that was used to calculate them. With a reference genome, we can locate markers both by their genetic position, and by their physical position on the chromosome. This makes it much easier to compare significant results between studies. In addition, once we have found a significant marker, we can now more easily look for candidate genes (i.e. genes that we think are causing the significant marker-trait association) surrounding it.

Sometimes we get lucky, and a particular marker is itself causing variation within a gene, which in turn causes phenotypic variation that we can observe. More often than not, however, a marker is merely floating out between genes somewhere, presumably close to one that we are actually interested in. In this case, the alleles of the marker and the different alleles of the gene will be held together in linkage disequilibrium (LD), also called gametic phase disequilibrium. These are two complicated terms for the same, fairly simple concept. If you learned about Gregor Mendel’s peas and Punnett squares in high school biology, you likely learned about Mendel’s Law of Independent Assortment, which states that alleles (i.e. different “versions”) of separate genes are inherited independently. This is often true, but not always. Specifically, if two genes, or a gene and a marker, are located close to each other on a chromosome, their alleles will be in LD, meaning they will be inherited together in a non-random fashion. Other factors can also cause genes to be in LD with each other, but physical proximity is the most prevalent.

Using the reference genome, we could finally look at patterns of LD in terms of physical distances. In the figure below, LD is plotted against physical distance, both in the A, B, and D genomes as a whole (top panel), and in each of the seven chromosomes within each genome (bottom panel). The dashed line is a somewhat arbitrary threshold of “significant” LD, but according to it, LD is significant out to a distance of 1 million bases (1 Mb). Therefore, if we have a significant marker, we should take a look at all the genes within a 1 Mb radius of it. This sounds like a large distance, but keep in mind that it is 1/1000 the size of the 3B chromosome.
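If you’re curious about the mechanics, the common r² measure of LD is just the squared correlation between the allele dosages (0/1/2 scores) of two markers, which makes LD decay straightforward to compute once you have physical positions. Below is a rough R sketch. Since the genotypes are simulated independently, there is no real LD in them and the resulting plot will be flat; real data would show high r² at short distances tailing off with distance, as in the figure described above.

```r
# Sketch of an LD-decay calculation from 0/1/2 allele dosages. The
# positions and genotypes are simulated, so unlike real data there is
# no actual LD here; this just shows the mechanics.
set.seed(1)
n_lines   <- 200
n_markers <- 300

pos  <- sort(runif(n_markers, min = 0, max = 50e6))   # positions in bases
geno <- matrix(rbinom(n_lines * n_markers, 2, 0.3), nrow = n_lines)

# Pairwise physical distances; keep pairs closer than 5 Mb
dist_mat <- abs(outer(pos, pos, "-"))
keep     <- which(dist_mat < 5e6 & upper.tri(dist_mat), arr.ind = TRUE)

# r^2 between two markers: squared correlation of their dosages
r2      <- mapply(function(i, j) cor(geno[, i], geno[, j])^2,
                  keep[, 1], keep[, 2])
dist_bp <- dist_mat[keep]

# LD decay: r^2 against physical distance (in Mb)
plot(dist_bp / 1e6, r2, xlab = "Distance (Mb)", ylab = expression(r^2))
```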

For some marker-trait associations, certain genes within our 1 Mb window really stood out as likely candidates. For instance, one marker was highly associated with the traits grains per square meter, grains per head, and heads per square meter. This marker happened to be close to a gene that is similar to the rice gene aberrant panicle organization 1 (APO1). As the name suggests, this gene is known to cause variation in rice panicle size and organization (the word “panicle” refers to the rice equivalent of the “head” or “spike” of the wheat plant). Therefore, we had found a marker influencing traits related to the number of seeds per unit area, next to a gene similar to one known to affect the shape of rice plant heads. Among other candidate genes, we also found a heat-stress response gene close to a marker associated with the trait seeds per head. A previous study suggested that the expression of this gene is up-regulated during flowering and seed formation, so this will be an interesting finding to follow up on.
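Mechanically, the candidate gene search is just a window query against a gene annotation table. Here is a minimal R sketch; the gene IDs, coordinates, and marker position are placeholders I made up, not real wheat annotations.

```r
# Sketch: pull genes within 1 Mb of a significant marker. The gene IDs,
# coordinates, and marker position are all placeholders, not real wheat
# annotation data.
genes <- data.frame(
  gene_id = c("gene_001", "gene_002", "gene_003", "gene_004"),
  chrom   = c("7A", "7A", "7A", "3B"),
  start   = c(8.20e6, 11.45e6, 14.90e6, 11.00e6),
  end     = c(8.22e6, 11.47e6, 14.95e6, 11.02e6)
)

marker_chrom <- "7A"
marker_pos   <- 11.0e6   # physical position of the significant SNP
window       <- 1e6      # the ~1 Mb radius suggested by the LD analysis

# Keep genes on the same chromosome whose span overlaps the window
candidates <- subset(genes,
                     chrom == marker_chrom &
                       end >= marker_pos - window &
                       start <= marker_pos + window)
print(candidates)   # only gene_002 falls inside the 10-12 Mb window
```

Against a real annotation you would do the same thing with the published wheat gene models, or with a command-line interval tool like bedtools, but the logic doesn’t change.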

What’s next

While GWAS papers are great, we can’t draw many conclusions beyond the specific hypothesis being tested: does marker X have some association with trait Y, or not? While locating nearby candidate genes is good practice, we still often don’t know the function of these genes. For that, we need follow-up studies. One brute-force method for deducing a gene’s function is to disable it completely (a “knock-out” mutation). You may have heard of CRISPR-Cas9 technology – this is one recently-developed way to introduce targeted mutations into a plant, including mutations that can effectively silence genes. Keep in mind that genes are variably expressed (i.e. transcribed and translated into proteins) at different stages of development, so another brute-force method of determining gene function is to put a gene under the control of a sequence that just turns its expression up to 11 all the time. The hope is that some of these discoveries could in turn lead to future discoveries that could… you know, save the planet and mankind. No pressure.

While taking this approach of identifying promising genes is all well and good, one problem is that many of these traits are highly polygenic (controlled by many genes). This means that any single gene likely only determines a small portion of the overall phenotypic variation that we observe. For highly polygenic traits, progress in breeding is mostly made by considering the genetic contributions of a vast set of genes as a whole. That is a topic to explore when we talk about the next paper.

Uncategorized

Tissue Grinding for 96-well DNA Extraction

  • December 3, 2017June 17, 2018
  • by brian_ward

Background

In a previous post I presented a simple protocol for collecting plant tissue for DNA extraction in a 96-well plate format. Later on we’ll get to the details of performing the DNA extraction, but before that, we need to talk about the proper way of grinding plant tissue to get a nice, homogeneous powder. Like the subject of the previous post, this is a seemingly simple step that often goes underappreciated. I promise that I will get around to a protocol for performing an actual DNA extraction sometime soon.

Tissue grinding is one of those things that can soak up hours of time if it doesn’t go right. If samples don’t grind properly, you’ll have to spend a lot of time shaking them again and again; you may even need to remove tissue from some wells if they are overfilled, which takes time and increases the risk of cross-contamination.

Clearly, it’s best to grind the tissue properly in one fell swoop, so you can be on your way to the extraction with no problems. The previous post was all about gathering samples in a highly consistent fashion to enable an easy grinding step. Depending on whether you had a lyophilizer at your disposal for the last step, your tissue should now be either dried or frozen.

Materials

  • 96-well deep-well plates (Axygen Scientific # P-9641-1S)

Reagents

  • Liquid nitrogen (if using frozen tissue)

Laboratory Equipment

  • Spex SamplePrep Genogrinder 2010
  • Styrofoam shipping cooler (if using frozen tissue)
  • Tongs (if using frozen tissue)

 

Protocol

Step 1: Label Racks

The deep-well plates that I listed above have wells that are wide enough to fit our microtiter tubes in, so we will essentially be using them as heavy-duty tube racks. For the rest of this post, I will refer to them as “racks” instead of plates to avoid confusion. The racks that the tubes come with are too flimsy for tissue grinding – after a round or two in the Genogrinder, they will likely self-destruct. For the DNA extraction, you’ll be using anywhere from one to four plates of samples (i.e. up to 384 samples). So, the steps are:

  1. Label your deep-well racks something like “Grind-1”, “Grind-2” etc., for however many you need.

NOTE: The Genogrinder requires two racks at once, due to its clamp design. So, if you only have a single plate of samples, you should still label two deep-well racks to be used for grinding (more on that below).

  2. Transfer the tubes into the racks that will be used for grinding. If you have a single plate of samples, split the tubes into two racks, spacing them evenly across each rack for uniform clamping inside the Genogrinder. This is one step that should convince you that the unambiguous labeling of tube strips that we performed in the previous post was a good idea (see the optional bookkeeping sketch after this list).
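Since we’re now shuffling tube strips between racks, I find it helps to keep a small manifest of what went where. Here is an optional R sketch; it assumes your 8-tube strips run down the 12 plate columns, and the strip numbering is just a placeholder, so adjust it to match your own layout.

```r
# Optional bookkeeping sketch: record which tube strip went into which
# grinding rack. Assumes 8-tube strips running down the 12 plate
# columns; adjust to match your own layout.
strips <- 1:12                                            # plate columns
racks  <- ifelse(strips %% 2 == 1, "Grind-1", "Grind-2")  # alternate racks
manifest <- data.frame(strip = strips, rack = racks)
print(manifest)
write.csv(manifest, "grind_rack_manifest.csv", row.names = FALSE)
```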

 

Step 2: Deep Freeze

For lyophilized tissue

If you have lyophilized tissue, you may be tempted to put it into the grinder at room temperature. I’m here to tell you that it won’t work – although the tissue is dry, it is still too flexible to grind properly at room temperature. You’ll just end up with large pieces of leaf jammed into the bottoms or tops of the tubes. Even so, I can’t emphasize enough how much easier it is to work with lyophilized tissue, because you can grind it easily after letting it sit in a -80° C freezer for a while. So, if you have lyophilized tissue, this part is easy:

  1. Place tubes in grinding racks in a -80° C freezer for a few hours or overnight.

For frozen tissue

If you’re using frozen tissue, -80° C probably won’t be cold enough for adequate grinding. Unfortunately this means you’ll need to use liquid nitrogen. So, the steps are:

  1. Pour an inch or so of liquid nitrogen from a vacuum flask into an adequately-sized styrofoam cooler (the square ones used as inserts for shipping refrigerated/frozen lab supplies work well)
  2. Use tongs to submerge the grinding racks holding the sample tubes in the liquid nitrogen for a minute or so

 

Step 3: Grinding

Whether you have used lyophilized or frozen tissue, the next steps are the same:

  1. Quickly transfer the racks into the Genogrinder (I probably shouldn’t have to mention this, but please use tongs and not your bare or gloved hands to do this if the racks are sitting in liquid nitrogen)
  2. Secure racks in place with the clamping mechanism and close the Genogrinder’s lid
  3. Grind the samples! I typically run the machine at 1,000 rpm for 1 minute.

After grinding, the samples should have turned into nice, pale-green flaky powder, like the first five tubes in the picture below:

[Figure: properly ground wheat leaves, showing the pale-green flaky powder, prior to DNA extraction]

Epilogue, and Reflection

That’s all there is to it. Like I said at the beginning, grinding plant tissue is not rocket science, but doing it properly can save you hours of frustration. You can now:

  1. Put the sample tubes back in their original (flimsy) racks
  2. Place the samples back in a freezer for storage, OR
  3. Proceed directly to DNA extraction

One thing to note is that we can reuse the grinding racks – just check prior to each grinding run to make sure that cracks aren’t forming in them. If they are getting cracked, discard them and use a new set. In general, things tend to break when they are exposed to liquid nitrogen and then immediately shaken at high speeds – don’t be surprised if you need to replace racks often when using liquid nitrogen, or if you need to replace some cracked tube caps.

One final warning: any rack that goes into the tissue grinder should never go into a centrifuge later. Take it from me: you might end up really, really sorry if you ever do this. There’s nothing that can ruin your day quite like having a rack holding tubes full of chloroform shatter inside a spinning centrifuge.
