SPSmart

General considerations

The population groups are based on a previous study of global variability (Rosenberg et al. 2002), which used the program STRUCTURE to arrange populations into groups based on patterns of variability and found a close match between the geographic distribution of populations and their genetic clustering. This allows the groups to be pre-processed for faster results; however, any populations can be grouped as desired in the advanced search section.

Only unrelated individuals are considered when building the statistical indexes provided, so the number of samples and genotypes stored in the data mart is slightly lower than the total number present in each database.

Population acronyms

ASW	African ancestry in Southwest USA
CEU	Utah residents with N & W European ancestry from the CEPH collection
CHB	Han Chinese in Beijing, China
CHS	Han Chinese South
CLM	Colombian in Medellín, Colombia
FIN	Finnish in Finland
GBR	British in England and Scotland
IBS	Iberian populations in Spain
JPT	Japanese in Tokyo, Japan
LWK	Luhya in Webuye, Kenya
MXL	Mexican ancestry in Los Angeles, California
PUR	Puerto Rican in Puerto Rico
TSI	Toscans in Italy
YRI	Yoruba in Ibadan, Nigeria

The world map

From the clickable map on the front page you can select any population (individual dots on the map) or group of populations (coloured according to the listed geographic groupings) to activate the quick search function. The same information can be obtained from an advanced search.

The advanced search

Multiple population selection is permitted, up to a maximum of 5 populations or groupings. Selecting any combination of populations builds a custom query, and the statistical summaries are generated from the merged genotype data accordingly.
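As a rough illustration of what merging genotype data across selected populations involves, the following Python sketch pools genotype counts for a single SNP and derives allele frequencies from the pooled counts. The function names and the example counts are hypothetical and are not taken from SPSmart; only the limit of 5 selections comes from the text above.

```python
from collections import Counter

MAX_SELECTION = 5  # limit stated in the advanced search description


def merge_genotype_counts(selected):
    """Pool per-population genotype counts for one SNP.

    `selected` maps a population code to a Counter of genotype -> count,
    for example {"CEU": Counter({"AA": 10, "AG": 5})}.
    """
    if not 1 <= len(selected) <= MAX_SELECTION:
        raise ValueError(f"select between 1 and {MAX_SELECTION} populations")
    merged = Counter()
    for counts in selected.values():
        merged.update(counts)  # Counter.update adds counts together
    return merged


def allele_frequencies(genotype_counts):
    """Turn genotype counts into allele frequencies (two alleles per genotype)."""
    alleles = Counter()
    for genotype, n in genotype_counts.items():
        for allele in genotype:
            alleles[allele] += n
    total = sum(alleles.values())
    return {allele: count / total for allele, count in alleles.items()}


# Illustrative (made-up) counts for two populations.
merged = merge_genotype_counts({
    "CEU": Counter({"AA": 10, "AG": 10}),
    "TSI": Counter({"AG": 10, "GG": 10}),
})
frequencies = allele_frequencies(merged)
```

Pooling counts before computing frequencies weights each population by its sample size, which is one reasonable reading of "merged genotype data"; the tool itself may proceed differently.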

A SNP query can be performed in 3 different ways: defining a chromosome region (all SNPs between the given chromosome positions will be included), entering a gene name or a list of genes (all SNPs inside those genes will be included), or entering a list of rs-numbers.

The results page

The frequencies tab contains each frequency set, its bar-chart representation, pie charts for the complete dataset group, and summary pie charts arranged by major population group. If selected, the corresponding global information from the other datasets that hold data for the SNP is also provided.

The statistics tab contains each SNP's information in a population-per-row arrangement, showing sample size (N), all the alleles found in the queried population set (alleles), minor allele (MA), minor allele frequency (MAF), observed and expected heterozygosities (H_OBS and H_EXP), and selected statistical indexes: local inbreeding (F_S) for single populations, genetic differentiation (F_ST) for groups of populations, and the informativeness for group assignment (I_n). As a visual aid, F_ST values are written in yellow when differentiation starts to be significant (>0.05), in orange when it is significant (>0.15) and in red when the populations are highly differentiated (>0.25). In addition, descriptive SNP information is extracted from dbSNP build 132: chromosome, chromosome position, validation status, gene, reference allele on the same strand as the SNP (obtained from the current genome reference, hg19) and ancestral allele (obtained from the chimpanzee genome).
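The indexes listed above can be illustrated with a minimal Python sketch for a biallelic SNP. It uses the common textbook definitions (H_EXP = 2pq, F_ST = (H_T - H_S) / H_T with sample-size weighting), which is an assumption; the exact estimators used by SPSmart may differ. Function names are hypothetical, and only the colour thresholds are taken from the text.

```python
def snp_summary(n_aa, n_ab, n_bb):
    """N, MAF, H_OBS and H_EXP from genotype counts of a biallelic SNP."""
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)   # frequency of the first allele
    q = 1.0 - p
    return {
        "N": n,
        "MAF": min(p, q),
        "H_OBS": n_ab / n,            # observed share of heterozygotes
        "H_EXP": 2 * p * q,           # Hardy-Weinberg expectation
        "p": p,
    }


def fst(populations):
    """Textbook F_ST = (H_T - H_S) / H_T, weighting by sample size.

    `populations` is a list of (n_aa, n_ab, n_bb) tuples.
    """
    summaries = [snp_summary(*counts) for counts in populations]
    total_n = sum(s["N"] for s in summaries)
    h_s = sum(s["N"] * s["H_EXP"] for s in summaries) / total_n
    p_bar = sum(s["N"] * s["p"] for s in summaries) / total_n
    h_t = 2 * p_bar * (1 - p_bar)
    return 0.0 if h_t == 0 else (h_t - h_s) / h_t


def fst_colour(value):
    """Colour coding described for the statistics tab."""
    if value > 0.25:
        return "red"
    if value > 0.15:
        return "orange"
    if value > 0.05:
        return "yellow"
    return None


# Made-up genotype counts for two populations of 100 individuals each.
example_fst = fst([(25, 50, 25), (100, 0, 0)])
example_colour = fst_colour(example_fst)
```

In this made-up example the second population is fixed for one allele, so the within-population heterozygosity is low relative to the total and the F_ST value falls in the red band.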

When available, the downloads tab allows the user to retrieve all the statistical information in a single CSV-formatted file that uses ";" as the field separator. Genotypes from alternative datasets are also available for download, using the population filters defined by the user's queries.
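Because the file is separated by ";" rather than commas, it has to be read with an explicit delimiter. The sketch below uses Python's standard csv module; the column names and the sample row are hypothetical placeholders, since the actual header of the download is not described here.

```python
import csv
import io


def read_semicolon_csv(handle):
    """Read a ';'-separated file into a list of dicts keyed by header name."""
    return list(csv.DictReader(handle, delimiter=";"))


# Hypothetical content standing in for a downloaded file.
sample = io.StringIO("rs;population;MAF\nrs0000001;CEU;0.12\n")
rows = read_semicolon_csv(sample)
maf_values = [float(row["MAF"]) for row in rows]
```

For a real download, `open(path, newline="")` can be passed in place of the `io.StringIO` object.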

Symmetrical bases

During data mart creation, some checks are applied to unify how the available alleles for each SNP are listed. The reference allele is used as the fixed point, considering both forward and reverse strands. These checks are not perfect, as there are situations involving symmetrical bases where a correction should not be applied (a SNP may be genotyped in forward as AT/GC in one database, and in reverse as TA/GC in another).

The above process is automatic and does not flag symmetrical bases, so users must ensure that symmetrical base sites are properly compared across databases and be aware that queries can return inverted results. Inspecting the MAF values of the queried populations may be the best way to make such comparisons.
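The kind of manual check suggested above can be sketched in Python. The first function flags SNPs whose two alleles are complementary (A/T or C/G), where the strand cannot be inferred from the alleles alone. The second compares allele frequencies for the same population in two databases and reports whether they agree better after complementing one of them. Both functions, and the `margin` parameter, are hypothetical helpers and not part of SPSmart.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def is_symmetrical(allele1, allele2):
    """True for A/T and C/G SNPs, whose alleles read the same on both strands."""
    return COMPLEMENT[allele1.upper()] == allele2.upper()


def looks_strand_flipped(freqs_a, freqs_b, margin=0.2):
    """Compare allele frequencies of one population in two databases.

    Returns True when database B matches database A clearly better after
    complementing its alleles. `margin` guards against calling a flip on
    small differences; near a MAF of 0.5 the comparison is inconclusive.
    """
    direct = sum(abs(f - freqs_b.get(a, 0.0)) for a, f in freqs_a.items())
    flipped = sum(
        abs(f - freqs_b.get(COMPLEMENT[a], 0.0)) for a, f in freqs_a.items()
    )
    return flipped + margin < direct


# Made-up frequencies for the same population in two databases.
db_one = {"A": 0.8, "T": 0.2}
db_two = {"A": 0.2, "T": 0.8}
flip_suspected = is_symmetrical("A", "T") and looks_strand_flipped(db_one, db_two)
```

A frequency-based check of this kind only gives a hint: a genuine frequency difference between datasets can look the same as a strand flip, so suspicious sites still need to be inspected individually.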

The statistical references

The algorithms used for calculating the expected heterozygosities and the F-statistics have been taken from "Principles of Population Genetics" (Hartl 1997, Sinauer Associates) and from "Population Genetics, a concise guide" (Gillespie 1998, Johns Hopkins University Press), following Dr. David McDonald's advice from his worked example of calculating F-statistics from genotypic data.

The I_n values are computed using equation (4) from "Informativeness of Genetic Markers for Inference of Ancestry" (Rosenberg 2003, Am. J. Hum. Genet. 73:1402-1422).
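For reference, the informativeness for assignment in that paper takes the form I_n = sum over alleles j of [ -p_j ln p_j + sum over populations i of (p_ij / K) ln p_ij ], where p_ij is the frequency of allele j in population i, p_j is its mean frequency across the K populations, and 0 ln 0 is taken as 0. The Python sketch below follows that form; the function name and the input layout are assumptions made for illustration, not SPSmart's own code.

```python
import math


def informativeness(freqs_by_pop):
    """I_n for one marker, with populations weighted equally.

    `freqs_by_pop` is a list with one dict per population, mapping
    allele -> frequency (each dict should sum to 1).
    """
    k = len(freqs_by_pop)
    if k < 2:
        raise ValueError("at least two populations are required")
    alleles = set().union(*freqs_by_pop)
    total = 0.0
    for allele in alleles:
        p_ij = [pop.get(allele, 0.0) for pop in freqs_by_pop]
        p_j = sum(p_ij) / k                      # mean frequency of the allele
        if p_j > 0:
            total -= p_j * math.log(p_j)
        # Terms with p == 0 contribute nothing (0 * ln 0 is taken as 0).
        total += sum((p / k) * math.log(p) for p in p_ij if p > 0)
    return total


# Two populations fixed for different alleles: the marker is fully informative.
fully_informative = informativeness([{"A": 1.0}, {"G": 1.0}])
# Identical frequencies in both populations: the marker carries no information.
uninformative = informativeness([{"A": 0.3, "G": 0.7}, {"A": 0.3, "G": 0.7}])
```

With natural logarithms, the value ranges from 0 (identical frequencies everywhere) to ln K (each population fixed for a private allele).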
