
Multi-Environment Genomic Prediction Paper

In my last post (a very long time ago), I talked about a genome-wide association study (GWAS) that I recently worked on and published. I promised to talk about a second paper that came out around the same time, although I struggled to think about the best way to describe it. In the second paper, my coauthors and I performed genomic prediction (aka genomic selection) on the same set of phenotypic data. This method is sort of the anti-GWAS. Recall from the last post that a GWAS of highly polygenic traits may not be very useful, because each individual gene has a small contribution to phenotype. Instead of searching out genes that are affecting our trait of interest, with genomic prediction we instead throw up our hands and say “let’s just use all the genes.” Specifically, we use a set of markers spread across the genome, so that we can assume that every area of the genome that is influencing our trait of interest is in linkage with at least one marker. We then feed a model data from a training set of individuals, which has both genotypic and phenotypic data collected, and then generate predictions for a testing set of individuals, which is only genotyped with our marker set. Although the specifics vary based on the exact model used, in this method we typically ignore the functions of individual genes. Instead, we use the genetic markers to help us estimate the relationships between individuals in the training set and the testing set.
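To make the training-set/testing-set workflow concrete, here is a minimal sketch on simulated data. It is not the model from the paper: it uses plain ridge regression on the markers, which under standard assumptions gives the same predictions as a model built on marker-based relationships between lines. Every name, size, and value in it (`n_lines`, `effect_sd`, the allele frequencies, and so on) is an assumption made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)

# All sizes and values below are made up for illustration.
n_lines, n_markers, n_train = 500, 200, 400
effect_sd = 0.05

# Marker matrix coded as 0/1/2 copies of one allele at each marker
allele_freq = rng.uniform(0.1, 0.9, size=n_markers)
M = rng.binomial(2, allele_freq, size=(n_lines, n_markers)).astype(float)

# Polygenic trait: every marker has a small effect, plus environmental noise
true_effects = rng.normal(0.0, effect_sd, size=n_markers)
genetic_value = M @ true_effects
noise_sd = genetic_value.std()          # roughly half the variance is genetic
phenotype = genetic_value + rng.normal(0.0, noise_sd, size=n_lines)

train = np.arange(n_train)              # genotyped and phenotyped
test = np.arange(n_train, n_lines)      # genotyped only

# Center markers and phenotypes using the training set only
marker_means = M[train].mean(axis=0)
X_train = M[train] - marker_means
X_test = M[test] - marker_means
y_mean = phenotype[train].mean()

# Ridge penalty = noise variance / marker-effect variance. Known here because
# the data are simulated; with real data it would have to be estimated.
lam = noise_sd**2 / effect_sd**2
lhs = X_train.T @ X_train + lam * np.eye(n_markers)
marker_effects = np.linalg.solve(lhs, X_train.T @ (phenotype[train] - y_mean))

# Predict the testing set from its markers alone
predicted = y_mean + X_test @ marker_effects
accuracy = np.corrcoef(predicted, genetic_value[test])[0, 1]
print(f"correlation of predictions with true genetic values: {accuracy:.2f}")
```

Note that no single marker is ever singled out as "the gene." All of them are shrunk toward zero together, and the prediction is simply their combined contribution.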
Why would you want to go through the trouble of doing this? Increasing computational power has enabled the adoption of routine predictive modeling across many fields of study and sectors of industry. In the case of plant breeding, genotypic data is now much cheaper to obtain than phenotypic data. Currently, we can get decent genome-wide genotypic data for around $10 per line. Compare that to the cost of phenotyping a line in the field, which includes all the costs and labor required to plant it in one or more locations, manage it over a growing season, collect data, harvest it, and then potentially collect more data following harvest. Therefore, breeders are very interested in collecting phenotypic data on a subset of lines, and then predicting performance of additional lines using only cheaper genotypic data. Genomic prediction also has the potential to offer higher genetic gains per unit of time, because we can rapidly grow each generation, run our marker panel on it, and make selections without taking the time to grow plants in the field.
Genomic Prediction – A Brief History
Using genetic markers to predict a line’s performance sounds very high-tech, but it’s actually just a new variation on an old technique. The theoretical underpinnings of this method date back to a paper from 1918, when the geneticist and statistician R.A. Fisher established the infinitesimal model. This paper settled a long-standing dispute between two groups of geneticists – the biometricians and the Mendelians. These two groups were like the Jets and the Sharks, or the Bloods and the Crips, of early-20th-century genetics research. Perhaps that’s an overstatement, but things were certainly heated between the two camps. The biometricians worked on developing mathematical models to describe what they could observe – namely, that many traits exhibit continuous variation (think of a trait like human height or body weight). This school of thought descended from the theory of blending inheritance that prevailed in Darwin’s day. This theory at least had the advantage of an unambiguous name – it posited that for any given trait, progeny represented a blending of their parents’ characteristics. However, the theory did not describe the mechanism underlying this blending. In addition, the biometricians had difficulty explaining things such as transgressive segregants (progeny that exhibit trait values outside the range of their parents, for example people who are taller or shorter than both of their parents).
Meanwhile, the Mendelians had become converts to the theory of inheritance via genes, after rediscovering Gregor Mendel’s work around the beginning of the 20th century. (Note that here “gene” is used to mean “unit of inheritance.” This was still decades before the structure of DNA was discovered through the efforts of Rosalind Franklin, Maurice Wilkins, James Watson, and Francis Crick.) However, while the Mendelians had a mechanism to describe inheritance, they couldn’t square this mechanism with the observation that many traits vary continuously, in contrast to the simpler qualitative traits, like pea flower color, that Mendel had originally described.
Fisher’s paper essentially unified these two fields and demonstrated that they were both right, in part. He showed that progeny do inherit their parents’ characteristics via genes (that is, in a Mendelian fashion). However, his major insight was determining that inheritance via genes could lead to continuous variation if each gene contributed a tiny amount to the variation observed for a trait. Hence, in his theory each gene’s effect is considered equal and infinitesimal. This paper also has the distinction of introducing many statistical concepts and methods that are used almost universally today, including variance and the analysis of variance (ANOVA) model.
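A small simulation can illustrate that insight. The sketch below assumes a simplified setup that is not taken from Fisher's paper: two parents that are heterozygous at every locus, loci that are inherited independently, and equal effects at every locus. The function name `f2_trait_values` and the numbers of loci and progeny are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)

def f2_trait_values(n_loci, n_progeny=2000):
    """Trait values of progeny when each locus adds an equal, small effect."""
    # Each progeny carries 0, 1, or 2 copies of the "plus" allele at every
    # locus, in the Mendelian 1:2:1 ratio expected from two heterozygous parents.
    allele_counts = rng.binomial(2, 0.5, size=(n_progeny, n_loci))
    effect_per_allele = 1.0 / (2 * n_loci)   # keeps the trait on a 0-to-1 scale
    return allele_counts.sum(axis=1) * effect_per_allele

for n_loci in (1, 10, 100):
    values = f2_trait_values(n_loci)
    print(f"{n_loci:>3} loci -> {len(np.unique(values))} distinct trait values")
```

With a single locus the progeny fall into three discrete classes, just like one of Mendel's pea traits. As the number of loci grows, the classes multiply and crowd together until the trait looks continuous, even though every locus is still inherited in a strictly Mendelian way.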

The mathematical framework underlying what we now call genomic prediction was established by the animal breeder C.R. Henderson in the 1970s. Animal breeders originally performed prediction using pedigrees – that is, records of each individual’s ancestors and relatives. For some time, animal breeders suspected that they could replace pedigree data with genome-wide marker data. Using marker data gives us a better idea of the relationship between individuals. The key concept here is that each individual inherits 50% of their alleles from their mother and 50% from their father, so full siblings share 50% of their alleles on average. However, there is wide variation around that average, because each sibling inherits a different set of alleles, from different regions of the genome, from each parent. We all know that siblings are often quite different from one another (especially the middle one). However, if we’re just going off pedigree data, we can only generate a single prediction for all the siblings of any two given parents.
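The gap between the pedigree expectation and what siblings actually share can be sketched with a toy simulation. This is only an illustration under stated assumptions: the genome is treated as a hypothetical set of `n_segments` chunks that are inherited independently, and we track which parental copy of each chunk every sibling received. With real data, the same sharing would be estimated from marker genotypes rather than observed directly.

```python
import numpy as np

rng = np.random.default_rng(7)

n_sibs = 6
n_segments = 80   # hypothetical number of independently inherited genome segments

# For every sibling and segment, record which of the mother's two copies (0 or 1)
# and which of the father's two copies (0 or 1) was passed on.
from_mother = rng.integers(0, 2, size=(n_sibs, n_segments))
from_father = rng.integers(0, 2, size=(n_sibs, n_segments))

def realized_relationship(i, j):
    """Fraction of the genome that siblings i and j inherited as identical copies."""
    same_maternal = from_mother[i] == from_mother[j]
    same_paternal = from_father[i] == from_father[j]
    return 0.5 * same_maternal.mean() + 0.5 * same_paternal.mean()

pairs = [(i, j) for i in range(n_sibs) for j in range(i + 1, n_sibs)]
realized = np.array([realized_relationship(i, j) for i, j in pairs])

pedigree_value = 0.5   # the single number a pedigree assigns to every full-sib pair
print(f"pedigree value for all {len(pairs)} pairs: {pedigree_value}")
print(f"realized sharing ranges from {realized.min():.2f} to {realized.max():.2f}")
```

A pedigree-based model treats all fifteen sibling pairs as identical, whereas the realized values scatter around 0.5. That scatter is the extra information markers provide.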
Plants vs. Animals – Advantages and Challenges
It’s not much of a secret that animal breeders tend to be smarter than us plant breeders. This has arisen out of necessity. Compared to animals, most plants are prodigious reproducers, generating dozens to thousands of seeds each generation, depending on the species. Therefore, plant breeders can create large populations, and grow them in many different environments. If you’re dealing with a self-pollinating plant like wheat, things get even easier, because a plant line that reproduces primarily with itself generation after generation will become “immortalized.” That is, all heterozygous genes will eventually become fixed to one homozygous state or the other. At this point, the line will change very little over generations, so if you plant it, and then collect and plant its seed, its progeny will be nearly identical. Therefore the same line can be tested over and over again, across different environments, giving plant breeders who work with self-pollinating species a brute-force option for evaluating line performance. In contrast, livestock animals reproduce at a much slower rate, and individuals with good genetic merit can potentially be worth millions of dollars. How do you predict how an animal will perform for a given trait when there’s only one of them? The answer, of course, is to look at their relatives.
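The "immortalizing" effect of self-pollination follows from simple arithmetic: at any heterozygous locus, half of the selfed progeny are expected to be homozygous, so the expected share of heterozygous loci halves every generation. A minimal sketch (the function name `expected_heterozygosity` is hypothetical, and selection and outcrossing are ignored):

```python
def expected_heterozygosity(generations_of_selfing, initial=1.0):
    """Expected fraction of loci still heterozygous after repeated selfing."""
    return initial * 0.5 ** generations_of_selfing

for generation in range(0, 9, 2):
    share = expected_heterozygosity(generation)
    print(f"after {generation} generations of selfing: {share:.4f} heterozygous")
```

Starting from a fully heterozygous plant, fewer than 2% of loci are expected to remain heterozygous after six generations of selfing, which is why such a line then breeds nearly true.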
However, I did just mention one way in which plant breeders’ jobs are harder, when I spoke about testing lines in different environments. In general, plant breeders must deal with more genotype-by-environment interaction (GxE). For complex traits controlled by many genes, an individual’s surroundings (i.e. its environment) start to play a larger part in its phenotype. Animals have the advantage of being mobile. If they get thirsty, they can go get a drink. If they get hot, they can lie underneath a tree. There are of course environmental factors that do come into play if an animal lives in, say, North Dakota vs. Georgia, but on the whole being able to move around gives animals (including humans) a lot of advantages that we often take for granted. On the other hand, plants are stuck in one place, and have to take whatever nature throws at them. Therefore, adequately assessing GxE becomes crucial. That’s where this paper comes in.
I won’t go into the nitty-gritty details, but there are a few ways in which we can deal with GxE interaction. One is to just ignore it, and hope that our predictions don’t suffer too much. This is what most genomic prediction in plants has done to date, on account of using models originally developed by animal breeders. We can get fancier by explicitly including GxE effects in our models. As my coauthors and I found, this offers little or no benefit when trying to predict the performance of lines that have never been tested before. However, we see more benefit when we either a) implement sparse testing (that is, testing some lines in one environment, different lines in another environment, and then using the relationships between the lines to predict everyone’s performance across all environments), or b) use multiple correlated traits for prediction.
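To give a feel for sparse testing, here is a rough sketch on simulated data. It is not the model used in the paper. It assumes one common way of writing a multi-environment model, in which the covariance between any two line-by-environment combinations is the product of the lines' genomic relationship and the genetic correlation between the two environments. The correlation matrix `E`, the sizes, and all variable names are made up, and the variance components are treated as known only because the data are simulated.

```python
import numpy as np

rng = np.random.default_rng(3)

n_lines, n_markers, n_envs = 120, 300, 3

# Hypothetical marker data (0/1/2), centered, and a simple genomic relationship matrix
M = rng.binomial(2, 0.5, size=(n_lines, n_markers)).astype(float)
X = M - M.mean(axis=0)
G = X @ X.T / n_markers

# Assumed genetic correlations between environments (made-up values)
E = np.array([[1.0, 0.7, 0.5],
              [0.7, 1.0, 0.6],
              [0.5, 0.6, 1.0]])

# Marker effects differ by environment but are correlated according to E
effects = rng.normal(size=(n_markers, n_envs)) @ np.linalg.cholesky(E).T
genetic = X @ effects / np.sqrt(n_markers)        # lines x environments
noise_var = genetic.var()                         # about half the variance is genetic
phenotype = genetic + rng.normal(0.0, np.sqrt(noise_var), size=genetic.shape)

# Sparse testing: each line is grown in only one of the three environments
tested_env = np.arange(n_lines) % n_envs
observed = np.zeros((n_lines, n_envs), dtype=bool)
observed[np.arange(n_lines), tested_env] = True

# Covariance between all line-environment combinations,
# ordered as index = environment * n_lines + line
K = np.kron(E, G)
obs = observed.T.ravel()
y_obs = phenotype.T.ravel()[obs]

# Predict every untested combination from the tested ones
K_oo = K[np.ix_(obs, obs)]
K_uo = K[np.ix_(~obs, obs)]
weights = np.linalg.solve(K_oo + noise_var * np.eye(obs.sum()), y_obs - y_obs.mean())
predicted_untested = y_obs.mean() + K_uo @ weights

true_untested = genetic.T.ravel()[~obs]
accuracy = np.corrcoef(predicted_untested, true_untested)[0, 1]
print(f"accuracy for untested line-environment combinations: {accuracy:.2f}")
```

Only a third of the line-by-environment combinations are ever phenotyped here. The rest are filled in by borrowing information from the same line in a correlated environment and from related lines grown elsewhere.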
At any rate, I think this is probably enough on the topic of genomic prediction for the time being. Plant breeders are now largely past the theorizing and how-to stages of studying this technique. Now we have arrived at the stage where people simply have to try it out, and see how it compares against the age-old method of direct phenotypic selection. It should at least be interesting to see whether it ends up being worth all the hassle or not.