As usual, I’ve been remiss in updating this site. However, this time I have a good reason for it. I am the first author on two journal articles that were just published within a day of each other at the end of last week. These articles both used the same data collected from the field, consisting of around 320 wheat lines grown in two locations across two years, but they encompass quite different analyses. A common theme that I’ve picked up when talking to friends and family is that nobody knows what I do for work, nor why I do it. In that spirit, I wanted to give some less-technical background on these papers here – if you want the nitty-gritty, both articles are available open-access. I’ll just go over one of them for now, and will save the other one for a future post.
Before starting, we’ll have to define a few things, so that everything that follows will make sense. An individual’s phenotype is what you can observe about them. We typically divide phenotype into specific traits. In the case of plants these might be height, grain weight, branching, flowering time, etc. An individual’s genotype is their genetic makeup. Genotype contributes to phenotype, in conjunction with environment (an individual’s surroundings). Finally, an organism’s genome is composed of a set of very long molecules called chromosomes, each one of these being the familiar double helix consisting of a chain of bases or nucleotides, which form an organism’s “genetic code”. A marker is some specific genomic feature that we can use to tell different lines apart. In this case, the markers we’re referring to are single nucleotide polymorphisms (SNPs), which are positions in the genome where a single nucleotide differs from one line to another.
Here we will be talking about bread wheat, which is the product of ancient hybridizations between multiple grass species in the Middle East. This means that it confusingly has three genomes inherited from three separate progenitor species, labeled A, B, and D. The A genome was inherited from red wild einkorn wheat (Triticum urartu). The B genome was likely inherited from a close relative of the grass species Aegilops speltoides, while the D genome was inherited from Tausch’s goat grass (Aegilops tauschii). Each genome has seven chromosomes, labeled 1A through 7A, 1B through 7B, and 1D through 7D, for a total of 21 chromosomes. The separate genomes are sometimes referred to as “subgenomes,” but here we will just refer to them collectively as “the genome.” A very common question is “where is the C genome?” The C genome is present in another Middle Eastern grass species, Aegilops caudata, which is not a progenitor species of modern wheat. I suspect that this is a historical quirk due to taxonomic reclassifications, though it’s not something I’ve looked into in great detail.
Finding marker-trait associations
In this first paper, published in PLOS One, my coauthors and I performed a genome-wide association study of various yield-related traits. What does that mean? A genome-wide association study, or GWAS, is a type of study in which we genotype all the included lines with a set of markers spread throughout the genome. Then we determine which markers (if any) explain a significant portion of the phenotypic variation that we observe in the field, that is, which markers are involved in significant marker-trait associations. In our case, we used tens of thousands of markers, though in many other species with more advanced genomic data available (especially in humans), it’s customary to use millions or tens of millions of markers.
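For the programmers in the audience, here is a minimal sketch of the core idea behind a marker-trait association scan. It is not the analysis from the paper: real GWAS software uses statistical models that also account for how related the lines are to each other, and this toy version just correlates each marker with the trait one at a time. The marker names, allele codes (0 and 1), and yield values are all made up for illustration.

```python
from math import sqrt


def marker_trait_scan(genotypes, phenotypes):
    """Naive single-marker scan.

    genotypes: dict mapping marker name -> list of allele codes (0 or 1),
               one entry per line
    phenotypes: list of trait values, one per line, in the same order
    Returns a dict mapping marker name -> (r_squared, t_statistic).
    """
    n = len(phenotypes)
    mean_y = sum(phenotypes) / n
    ss_y = sum((y - mean_y) ** 2 for y in phenotypes)
    results = {}
    for marker, alleles in genotypes.items():
        mean_x = sum(alleles) / n
        ss_x = sum((x - mean_x) ** 2 for x in alleles)
        if ss_x == 0 or ss_y == 0:
            # A marker with no variation cannot explain anything; skip it.
            continue
        s_xy = sum((x - mean_x) * (y - mean_y)
                   for x, y in zip(alleles, phenotypes))
        r = s_xy / sqrt(ss_x * ss_y)
        r_squared = r * r  # share of trait variation explained by the marker
        if r_squared >= 1:
            t_stat = float("inf")
        else:
            t_stat = r * sqrt((n - 2) / (1 - r_squared))
        results[marker] = (r_squared, t_stat)
    return results


# Hypothetical toy data: 8 lines, 3 markers, yield in arbitrary units.
toy_genotypes = {
    "snp_a": [0, 0, 0, 0, 1, 1, 1, 1],  # tracks yield closely
    "snp_b": [0, 1, 0, 1, 0, 1, 0, 1],  # mostly unrelated to yield
    "snp_c": [0, 0, 0, 0, 0, 0, 0, 0],  # no variation at all
}
toy_yields = [4.1, 4.3, 3.9, 4.2, 5.0, 5.2, 4.9, 5.1]

scan = marker_trait_scan(toy_genotypes, toy_yields)
for name, (r_squared, t_stat) in sorted(scan.items()):
    print(f"{name}: r^2 = {r_squared:.3f}, t = {t_stat:.2f}")
```

A real study would then ask whether each test statistic is large enough to count as significant after correcting for the tens of thousands of markers being tested.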
When I say “yield-related traits,” I’m referring both to grain yield, the most important trait in cereals, and to traits like grain production per unit area, grains per head, heads per unit area, etc. We also included several phenological traits (i.e. traits relating to the timing of plant development), such as heading date, plant maturity date, and the period of time between the two.
Genome-wide association studies are a dime a dozen these days. What I think makes this one interesting is that it is one of the first wheat GWAS to use a reference genome. When I started graduate school, there was no reference genome, and nobody thought that there would be one anytime soon. Soon afterwards, the International Wheat Genome Sequencing Consortium sequenced the largest wheat chromosome, 3B. Weighing in at about 1 billion bases, chromosome 3B is more than twice as large as the entire rice genome. The full wheat genome is enormous, consisting of somewhere around 15 billion bases – about five times as large as the human genome. To make matters worse, much of this DNA (about 85%) consists of repetitive elements – quasi-autonomous sequences which “copy and paste” themselves throughout the genome. I personally find these sequences fascinating, but they do greatly complicate the task of assembling sequencer reads together. The monumental task of sequencing chromosome 3B convinced many researchers that finishing the entire genome could take a very long time. However, the rapid advancement of assembly technologies brought us a surprise – a fully assembled reference genome published last year.
Having a reference genome available is a real game-changer. Previously, we had to rely on genetic positions for markers. These have the unfortunate property of being variable depending on the population that was used to calculate them. With a reference genome, we can locate markers both by their genetic position, and by their physical position on the chromosome. This makes it much easier to compare significant results between studies. In addition, once we have found a significant marker, we can now more easily look for candidate genes (i.e. genes that we think are causing the significant marker-trait association) surrounding it.
Sometimes we get lucky, and a particular marker is itself causing variation within a gene, which is in turn causing phenotypic variation that we can observe. However, more often than not, a marker is merely floating out between genes somewhere, presumably close to one that we are actually interested in. In this case, the alleles of the marker and different alleles of the gene will be held together in linkage disequilibrium (LD) or gametic phase disequilibrium. These are two complicated terms for the same, somewhat simple concept. If you learned about Gregor Mendel’s peas and Punnett squares in high school biology, you likely learned about Mendel’s Law of Independent Assortment, which states that alleles (i.e. different “versions”) of separate genes are inherited independently. This is often true, but not always. Specifically, if two genes, or a gene and a marker, are located close to each other on a chromosome, their alleles will be in LD, meaning they will be inherited together in a non-random fashion. There are other factors that can cause genes to be in LD with each other, but physical proximity is the most prevalent. Using the reference genome, we could finally look at patterns of LD in terms of physical distances. In the figure below, we have LD plotted against physical distance, both in the A, B, and D genomes as a whole (top panel), and in each of the seven chromosomes within each genome (bottom panel). The dashed line is a somewhat arbitrary threshold of “significant” LD, but according to this, we would say that LD is significant out to a distance of 1 million bases (1Mb). Therefore, if we have a significant marker, we should take a look at all the genes within a 1Mb radius of it. This sounds like a large distance, but keep in mind that it is 1/1000 the size of the 3B chromosome.
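To make LD a bit more concrete, here is a small sketch of one common way to put a number on it: the squared correlation (usually written r²) between the alleles at two positions. I am assuming each line carries a single allele, coded 0 or 1, at each position, which is a reasonable simplification for inbred wheat lines. The allele lists are invented for the example, and I am not claiming this is the exact estimator used in the paper.

```python
def ld_r_squared(locus_1, locus_2):
    """Squared correlation (r^2) between alleles at two biallelic loci.

    Each argument is a list of allele codes (0 or 1), one per line,
    with lines in the same order in both lists.
    """
    n = len(locus_1)
    p_a = sum(locus_1) / n  # frequency of allele 1 at the first locus
    p_b = sum(locus_2) / n  # frequency of allele 1 at the second locus
    # Frequency of lines carrying allele 1 at both loci.
    p_ab = sum(1 for a, b in zip(locus_1, locus_2) if a == 1 and b == 1) / n
    # D measures how far the pair is from independent inheritance.
    d = p_ab - p_a * p_b
    denominator = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denominator == 0:
        raise ValueError("both loci must vary among the lines")
    return d * d / denominator


# Hypothetical alleles for 8 lines.
marker = [1, 1, 1, 1, 0, 0, 0, 0]
nearby_gene = [1, 1, 1, 0, 0, 0, 0, 0]   # usually travels with the marker
distant_gene = [1, 0, 1, 0, 1, 0, 1, 0]  # inherited independently of it

print(ld_r_squared(marker, nearby_gene))   # high: strong LD
print(ld_r_squared(marker, distant_gene))  # zero: no LD
```

A value of 1 means the two positions are always inherited together, and 0 means knowing one tells you nothing about the other. An LD decay plot is essentially this number computed for many marker pairs and plotted against the physical distance between them.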
For some marker-trait associations, certain genes within our 1Mb window really stood out as likely candidates. For instance, one marker was highly associated with the traits grains per square meter, grains per head, and heads per square meter. This marker happened to be close to a gene that is similar to the rice gene aberrant panicle organization 1 (APO1). As the name suggests, this gene is known to cause variation in rice panicle size and organization (the word “panicle” is used to refer to the rice equivalent of the “head” or “spike” in the wheat plant). Therefore, we had found a marker influencing traits related to the number of seeds per unit area, next to a gene similar to one known to affect the shape of rice plant heads. Among other candidate genes, we also found a heat-stress response gene close to a marker associated with the trait seeds per head. A previous study suggested that the expression of this gene is up-regulated during flowering and seed formation, so this will be an interesting finding to follow up on.
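With physical positions in hand, the candidate gene search boils down to a simple lookup: collect every gene on the same chromosome that lies within 1Mb of the significant marker. The sketch below shows that idea under some assumptions of my own. The gene names, chromosome labels, and coordinates are hypothetical placeholders, not real annotations, and a real pipeline would read gene positions from the reference genome's annotation file.

```python
WINDOW_BP = 1_000_000  # the 1Mb radius suggested by the LD decay plot


def candidate_genes(marker_chrom, marker_pos, genes, window=WINDOW_BP):
    """Return names of genes within `window` bases of a marker, nearest first.

    genes: list of (name, chromosome, start, end) tuples, positions in bases
    """
    hits = []
    for name, chrom, start, end in genes:
        if chrom != marker_chrom:
            continue
        # Distance from the marker to the closest edge of the gene;
        # zero if the marker sits inside the gene itself.
        if marker_pos < start:
            distance = start - marker_pos
        elif marker_pos > end:
            distance = marker_pos - end
        else:
            distance = 0
        if distance <= window:
            hits.append((distance, name))
    return [name for distance, name in sorted(hits)]


# Hypothetical gene annotations (name, chromosome, start, end).
toy_genes = [
    ("gene_1", "1A", 10_000_000, 10_003_000),
    ("gene_2", "1A", 10_900_000, 10_905_000),
    ("gene_3", "1A", 16_000_000, 16_002_000),  # too far away
    ("gene_4", "1B", 10_100_000, 10_104_000),  # wrong chromosome
]

print(candidate_genes("1A", 10_200_000, toy_genes))
```

The output is only a shortlist. Deciding which of those genes is a plausible candidate still takes the kind of detective work described above, such as checking for similarity to a well-studied gene in rice.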
While GWAS papers are great, we can’t draw many conclusions beyond the specific hypothesis that is being tested: does marker X have some association with trait Y, or no association? While locating nearby candidate genes is good practice, we still often don’t know the function of these genes. For that, we need follow-up studies. One brute-force method for deducing a gene’s function is to disable it completely (a “knock-out” mutation). You may have heard of CRISPR-Cas9 technology – this is one recently-developed way to introduce targeted mutations into a plant, including mutations that can effectively silence genes. Keep in mind that genes are variably expressed (i.e. transcribed and ultimately translated into proteins) at different stages of development, so another brute-force method of determining a gene’s function is to put it under the control of a sequence that just turns up expression to 11 all the time. The hope is that some of these discoveries could in turn lead to others that could... you know, save the planet and mankind. No pressure.
While taking this approach of identifying promising genes is all well and good, one problem is that many of these traits are highly polygenic (controlled by many genes). This means that any single gene likely only determines a small portion of the overall phenotypic variation that we observe. For highly polygenic traits, progress in breeding is mostly made by considering the genetic contributions of a vast set of genes as a whole. That is a topic to explore when we talk about the next paper.