Hundreds of thousands of peer-reviewed articles in the field of genetics use a method called principal component analysis. But new research shows that this method is very biased. This means that multitudes of major studies concerning ancient populations may be radically wrong!

The irresistible draw of PCA

It’s hard to make new friends, especially after your thirties. As Seinfeld so eloquently said: “Whatever group you have now, that’s what you’re going with. You are not interviewing, you are not looking for new people, you are not interested in seeing applications.” For DNA scientists, and scientists as a whole, the situation is even worse. The long hours and isolation required to carry out our research have detrimental effects on our social lives. Of course, there are always exceptions. Sometimes you get to know someone who is always there to support you, someone who asks for little and gives a lot. Someone you can always show up to a party with and be proud of. Someone your friends and co-workers will look up to because they make you look smart and cool, with a deep understanding of the science involved. Who doesn’t want a friend like that? Replace “someone” with “something” and you will understand what principal component analysis (PCA) is for scientists, especially population geneticists.

What is Principal Component Analysis?

PCA is a mathematical transformation that takes a complex dataset, like 10,000 genomes from 2,000 people around the world, and reduces it to a colored XY scatterplot with the click of a button. It’s the best friend of the procrastinating student who has a lecture tomorrow and needs to get results fast, of the lecturer who is looking to churn out papers in a hurry, and of the professor who is looking for a promotion by making fashionable statements without evidence. The number of friends PCA has is a throwback to the good old days of MySpace: with around 200,000 citations in genetics alone, multiplied by an average of 10 authors per article, we get 2,000,000 scholars who have authored an article using PCA.
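For readers who want to see what that button does, here is a minimal sketch of the transformation in Python with NumPy. The data are randomly generated stand-ins (the matrix size, the 0/1/2 marker coding, and the function name `pca_2d` are all assumptions made for illustration, not anything taken from the study): each row is an individual, each column a genetic marker, and the output is the pair of XY coordinates that ends up on the scatterplot.

```python
import numpy as np

def pca_2d(genotypes):
    """Project samples (rows) onto their first two principal components."""
    # Center every marker (column) so each has zero mean across samples.
    centered = genotypes - genotypes.mean(axis=0)
    # SVD of the centered matrix: the rows of vt are the principal axes,
    # ordered from the largest to the smallest share of variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    # Coordinates of each sample along the top two axes (the XY plot).
    return centered @ vt[:2].T

# Hypothetical toy data: 200 individuals x 1,000 markers coded 0, 1 or 2.
rng = np.random.default_rng(0)
genotypes = rng.integers(0, 3, size=(200, 1000)).astype(float)

coords = pca_2d(genotypes)
print(coords.shape)  # one (x, y) pair per individual
```

Note that the axes are computed from whatever samples are fed in, so the picture depends on who is included in the dataset, not only on how the individuals are related.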

PCA is used to examine the population structure of a group of individuals to determine their ancestry, analyze demographic history and admixture, decide the genetic similarity of individuals and exclude outliers, decide how to model populations, describe ancient and modern genetic relationships between individuals, infer family ties, identify ancestral trends in data, detect genomic signatures of natural selection, identify evolutionary trends, support genetic studies of diseases, geolocate individuals, draw historical and ethnobiological conclusions, and so on. It’s “The little cloud of points that could”.

The problem with PCA was also its biggest advantage. It always told everyone what they wanted to hear, so no one dared to challenge it. So, naturally, I did.

PCA: A dubious method?

In an article published in Scientific Reports, I have shown that PCA results are much more sensitive to input than anyone has realized. By analogy, think of PCA as an oven with flour, sugar, and eggs as input. The oven always does the same thing, but the result, a cake, depends on the ratio of the ingredients and how they are combined. Similarly, minor changes in the input data cause PCA to generate drastically different outputs, resulting in incorrect results, misconceptions, and lack of replication.

One of the areas considered PCA’s best friends forever is paleogenomics, where we want to learn more about ancient peoples and individuals such as Copper Age Europeans. They are expected to be similar to Europeans, and scientists have used PCA to show that Copper Age Europeans clustered with Europeans. Why? Because the rationale for using PCA is that it can be used to create a genetic map that positions the unknown population alongside the populations it is most related to. Since PCA only sees the data (without the labels), we assume that it is a neutral and unbiased tool, and that the answer it gives is correct.

My study has shown that small changes in the number of individuals and the choice of populations can produce a very large difference in PCA results, allowing the experimenter complete control of the results.

In this way, the experimenter (in this case, me) can produce very different answers to the simple question “To which population are Copper Age Europeans genetically closest?”, placing them close to any population. I did this by changing the number of individuals in each population (Oceanians, South Asians, etc.) and choosing different subpopulations. What happened? Our supposedly unbiased tool, the Geneticists’ Compass, produced four different historical scenarios (out of virtually endless historical versions), all mathematically “correct”, but only one can be biologically correct (if any).
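The effect of sample sizes can be reproduced in a few lines. The following is a toy sketch, not the data or code from the study: the three reference populations "A", "B" and "C", the "test" group, their positions, the noise level, and the function names are all hypothetical. By construction the test group is exactly the same distance from all three references, yet the 2-D PCA plot places it next to whichever reference contributed the fewest individuals.

```python
import numpy as np

def pca_2d(data):
    """Project samples (rows) onto their first two principal components."""
    centered = data - data.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T

def closest_reference(sizes, seed=0):
    """Name the reference whose centroid is nearest the test group on the 2-D PCA plot."""
    rng = np.random.default_rng(seed)
    # Hypothetical population centers: "test" sits at distance 1 from A, B and C alike.
    means = {"A": [1, 0, 0], "B": [0, 1, 0], "C": [0, 0, 1], "test": [0, 0, 0]}
    labels, rows = [], []
    for name, n in sizes.items():
        # n individuals scattered tightly around their population center.
        rows.append(np.asarray(means[name], float) + rng.normal(0, 0.05, size=(n, 3)))
        labels += [name] * n
    coords = pca_2d(np.vstack(rows))
    labels = np.array(labels)
    centroid = {name: coords[labels == name].mean(axis=0) for name in sizes}
    dist = {name: np.linalg.norm(centroid[name] - centroid["test"])
            for name in sizes if name != "test"}
    return min(dist, key=dist.get)

# Same populations, different sample sizes, different "closest relative".
print(closest_reference({"A": 100, "B": 100, "C": 5, "test": 100}))
print(closest_reference({"A": 5, "B": 100, "C": 100, "test": 100}))
```

The under-sampled population contributes little variance, so the two retained axes largely ignore the direction that separates it from the test group, and the two collapse onto each other on the plot. Nothing about the populations themselves changed between the two runs, only how many individuals were included.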

Such “conclusions” are derived from PCA in nearly every article on population genetics relating to humans, plants, animals, medical genetics, and drug testing (where cases and controls are matched). PCA results are not limited to scientific articles. They are also integrated into large datasets, used by genetic testing companies, and used to support policy decisions. There isn’t a single reader who isn’t touched by PCA, whether they know what it is or are learning about it now. As many as 216,000 peer-reviewed papers in the field of genetics alone have used PCA to explore and visualize similarities and differences between individuals and populations and have based their conclusions on these findings.

This figure shows four of the countless PCA results describing the origins of Copper Age Europeans. The PCA plots were generated using the same reference populations but with different population sizes allowing everyone to choose their preferred historical scenario. (Provided by the author)

Scientific conclusions can be radically wrong

To put these examples into context, consider the recent publication “12th Century Ashkenazi Jewish Graves in England” by Mark G. Thomas (who was criticized for misappropriation of evidence) and Ian Barnes. This study “explores” the ancestry of six newly discovered ancient individuals and, as always, it begins with a PCA plot where ancient individuals are projected onto known modern individuals to identify their ancestry (remember, overlap = ancestry).

A few things immediately emerge from this plot. First, Ashkenazi Jews cluster with Southern Europeans (i.e., they are genetically indistinguishable from them); thus, the whole premise of this article is wrong. These people could well have been Italians. Second, although three of the ancient individuals are siblings, they do not cluster together, which should already raise concerns about the validity of this approach. Third, there are very few non-Jewish populations at the bottom of the plot, which was done to avoid showing a) modern Jews overlapping with modern non-Jews and b) ancient individuals overlapping with Africans. Finally, there are no other ancient populations clustering with their respective modern populations to convince us that this tool actually works.

We can see that although this plot is presented as an exploration of hypotheses, the experimenters built it to give the desired results, which, unfortunately, it hardly delivered! Nevertheless, the authors concluded that “these findings are consistent with Chapelfield individuals having Jewish ancestry”, citing an irrelevant article to add credence to their findings. Despite these issues, and although at no point did these samples overlap with Ashkenazi Jews, it was concluded that they were of Ashkenazi descent, and the article was featured in Nature (a for-profit journal) with my short review somewhere inside. In this area, truth is as important as the socks you took off yesterday after a long, hot day.

PCA chart of unknown ancient individuals (black) and known modern populations (color) (provided by author)


PCA is an illustration of dataism in population genetics. Dataism describes an ideology formed by the emergence of Big Data, where the measurement of data is the ultimate achievement. Proponents of dataism believe that with enough data and computing power, the mysteries of the world will be revealed. Dataism enthusiasts rarely ask whether PCA results are correct, but rather how to correctly interpret the results. As such, clustering is interpreted as identity due to common ancestry and its absence as genetic drift. In PCA-based science, almost all answers are equally acceptable, and the truth is in the eyes of the beholder. Although PCA does not explain anything, it illustrates Seinfeld’s point. It’s really hard to make friends when you’re old, especially if you’re a scientist.

Independent commentary on the paper

“Techniques that offer such flexibility encourage bad science and are particularly dangerous in a world where the pressure to publish is intense. If a researcher runs PCA multiple times, the temptation will always be to select the output that makes the best story,” added Professor William Amos, professor of evolutionary genetics at the University of Cambridge, who was not involved in the study.

Top image: Geneticist contemplating his DNA dataset. Source: Grispb / Adobe Stock

DNA of ancient origins

By Eran Elhaik

