Review: Geographic population structure analysis of worldwide human populations infers their biogeographical origins

This post is part of an experiment where I will be posting summaries and critiques of the main points of papers I review for journals. Apologies in advance for any misunderstandings and errors on my end; please correct these in the comments.

TL;DR: I have a conceptual disagreement with a paper on learning about the geographic origin of individuals from genetic information.

I recently reviewed a manuscript titled “Geographic population structure analysis of worldwide human populations infers their biogeographical origins”, which has now been published. Overall I found the paper difficult to review because the authors and I have fundamentally different views about what genetic information can tell us about geography. I hope to explain this a bit in this post.

(Side note: some of the authors have started a company called Prosapia Genetics to sell a product based on this paper, but in the paper they write “The authors declare no competing financial interests”. This seems to run counter to the spirit of these types of disclosures.)

(Side note 2: Pseudonymous blogger Dienekes Pontikos notes that the method in this paper is extremely similar to one he developed a few years ago. Regardless of the intentions of the authors, I personally apologize to Dienekes for not noticing his previous work).

[Figure: ncomms_edit]

The goal of the paper and the associated method

Imagine you had my genome sequence. The goal of this paper is to develop an algorithm to place me on a map–that is, to find the latitude and longitude of my “biogeographical origins”, a concept that I think can be vaguely defined as the geographic location of my ancestors sometime in the recent past (for a European-American, maybe sometime in the last few hundred years prior to the major European migrations to the US).

One way to do this is to imagine the world as a grid (either in 2D or 3D space), and build some model for how the frequencies of genetic variants vary across space. If you had my genotypes at a number of variants, you could then find the best spot for me on this grid. This is the basic idea underlying previous work on this topic, for example in Spatial Ancestry analysis.
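To make the grid idea concrete, here is a minimal sketch. It is not the implementation of any published method: I assume, purely for illustration, that each variant's allele frequency follows a logistic surface over a 2D grid, and the coefficients, grid, and genotypes below are made up. The sketch scores every grid point by the binomial log-likelihood of a diploid genotype vector and returns the best-scoring point.

```python
import math

def allele_freq(coef, x, y):
    """Hypothetical logistic frequency surface; coef = (a, b, c) for one variant."""
    a, b, c = coef
    return 1.0 / (1.0 + math.exp(-(a * x + b * y + c)))

def log_likelihood(genotypes, coefs, x, y):
    """Binomial log-likelihood (up to a constant) of genotypes coded 0/1/2 at (x, y)."""
    total = 0.0
    for g, coef in zip(genotypes, coefs):
        p = allele_freq(coef, x, y)
        total += g * math.log(p) + (2 - g) * math.log(1.0 - p)
    return total

def best_grid_point(genotypes, coefs, grid):
    """Return the grid point with the highest log-likelihood."""
    return max(grid, key=lambda pt: log_likelihood(genotypes, coefs, pt[0], pt[1]))

# Toy data (assumed): four variants whose frequencies rise or fall along x or y.
coefs = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (-1.0, 0.0, 0.0), (0.0, -1.0, 0.0)]
grid = [(x, y) for x in range(-3, 4) for y in range(-3, 4)]
genotypes = [2, 0, 0, 2]  # copies of the counted allele at each variant

print(best_grid_point(genotypes, coefs, grid))  # (3, -3)
```

The point is that geography enters the model directly: the frequencies are a function of location, so any point on the grid can be scored, including points where no reference individuals were sampled.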

The authors of this paper take a different approach. Instead of explicitly modeling geographic variation in the frequencies of alleles, they first perform a clustering analysis on a reference set of individuals with known geographic locations. They then (more or less) find the clusters I fall closest to, and copy over the geographic information from those clusters. That is, if genetically I seem most similar to a reference group of French and German individuals, then they say that my “biogeographic origin” is between France and Germany.
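A rough sketch of this "match, then copy the coordinates" logic is below. It is my simplification, not the authors' code. I assume each reference group is summarised by a vector of cluster proportions plus one latitude/longitude pair, and that the test individual is placed at a distance-weighted average of the coordinates of its closest groups. The group names, proportions, and rounded coordinates are invented for illustration.

```python
import math

# Hypothetical reference panel: genetic summary plus a rough (lat, lon) label.
REFERENCE = {
    "French":  {"genetic": (0.60, 0.30, 0.10), "latlon": (46.0, 2.0)},
    "German":  {"genetic": (0.50, 0.40, 0.10), "latlon": (51.0, 10.0)},
    "Distant": {"genetic": (0.05, 0.05, 0.90), "latlon": (35.0, 105.0)},
}

def place(sample, reference, k=2):
    """Average the coordinates of the k genetically closest reference groups,
    weighting each group by the inverse of its genetic distance to the sample."""
    ranked = sorted(reference.values(),
                    key=lambda r: math.dist(sample, r["genetic"]))
    nearest = ranked[:k]
    # Small constant avoids division by zero for an exact match.
    weights = [1.0 / (math.dist(sample, r["genetic"]) + 1e-9) for r in nearest]
    total = sum(weights)
    lat = sum(w * r["latlon"][0] for w, r in zip(weights, nearest)) / total
    lon = sum(w * r["latlon"][1] for w, r in zip(weights, nearest)) / total
    return lat, lon

# A sample genetically halfway between the French and German groups
# lands roughly halfway between their coordinates, near (48.5, 6.0).
print(place((0.55, 0.35, 0.10), REFERENCE))
```

Note that the coordinates are only ever read at the last step. The matching itself is done entirely in genetic space.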

Conceptual issue: This is a paper about genetic clustering, not about geography

Basically, geographic information plays no role in this algorithm except in a post hoc manner. Instead, this is a standard genetic clustering algorithm. This means it has the same limitations as any such algorithm. For example, in the Figure above, imagine a set of reference individuals colored according to their inferred “cluster”. Now imagine matching test individual 1 to those clusters. In this case, it’s simple: individual 1 matches cluster 1, and so copying over the geographic information from cluster 1 to individual 1 seems reasonable. But what about individual 2? This individual doesn’t match any of the reference clusters, so the algorithm can’t do anything with it. If the algorithm were truly learning about geography, this wouldn’t be the case.
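The two test individuals can be mimicked with a toy matcher. Everything here is an assumption for illustration: the 2D "genetic" coordinates, the three cluster centroids, and the `max_distance` cutoff are made up, and nothing in the paper specifies such a cutoff.

```python
import math

# Hypothetical cluster centroids in some 2D genetic summary space.
CENTROIDS = {
    "cluster 1": (0.0, 0.0),
    "cluster 2": (5.0, 0.0),
    "cluster 3": (0.0, 5.0),
}

def match_cluster(sample, centroids, max_distance=1.0):
    """Return the nearest cluster's name, or None if even the nearest is too far."""
    name, centroid = min(centroids.items(),
                         key=lambda kv: math.dist(sample, kv[1]))
    if math.dist(sample, centroid) > max_distance:
        return None
    return name

individual_1 = (0.2, -0.1)    # sits inside cluster 1
individual_2 = (10.0, 10.0)   # far from every reference cluster

print(match_cluster(individual_1, CENTROIDS))  # cluster 1
print(match_cluster(individual_2, CENTROIDS))  # None
```

Without the cutoff, individual 2 would simply inherit the location of whichever cluster happens to be nearest, which is arguably worse than returning nothing.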

As the authors note, a whole host of other limitations come along with this. For example, the authors assume the reference populations can’t have changed geographic locations in the time frame of interest (implying the method is limited to populations with historical records attesting to their residency in a geographic location). That is, imagine that 200 years ago, everyone from one reference village moved 200 km to the west for some reason–this algorithm would place the descendants of that population in the present-day location, rather than the historical location. This is all fine if the goal is genetic clustering, but the authors interpret their algorithm strongly in terms of geography. This leads to something of a tautology: this algorithm can use genetic information to infer geography only if you assume genetic clusters are geographically meaningful*. The utility of the method thus depends on whether this is the case in any particular application.

*The sentence has been edited for clarity.

2 thoughts on “Review: Geographic population structure analysis of worldwide human populations infers their biogeographical origins”

  1. Global similarity methods sometimes provide very misleading, or even amusing, information with admixed individuals. For example, my Latin American friends get placed very precisely among distinct Pakistani ethnic groups by 23andMe’s Global Similarity tool. I’m putting a screenshot of that on Twitter. But I’m not criticizing 23andMe, which tells users that the tool won’t work if they are admixed. It strikes me that most people who would want to use Prosapia, or any method like this, are probably immigrants, and they are almost all likely to get misleading results. In particular, anyone with ancestors in the New World before 1900 is almost certainly admixed (even if it’s just different countries on the same continent) and their “origin” will be averaged (in genome space, which might not map very well to real space). An analysis based on nuclear haplotypes identical by descent is usually more informative (such methods are available through 23andMe and Family Tree DNA). I wonder if you’re familiar with hybrid methods (using something like variation in the apparent geographic origin across the genome).

    • Thanks, I agree with all of that. I think 23andMe has almost completely moved away from allele frequency-based estimation of ancestry to haplotype-based things, no?

      In this line of work, besides the 23andMe-like haplotype-based analyses, there’s SPA-mix for admixed individuals.

      Of course all of this assumes a very specific situation–you’re looking for recent ancestry (only a few generations ago, as in the case of European-Americans looking for the location of their European “roots”) and you assume the reference populations haven’t moved in that time.
