Have questions? Visit https://www.reddit.com/r/SNPedia


From SNPedia

Public Genomes
Summary a 41k year old member of the genus Homo
Promethease report
Platform Full Sequencing

Explained at http://snpedia.blogspot.com/2012/02/denisova.html

rs# VCF at http://s3.amazonaws.com/snpedia/Denisova/Denisova-dbsnp132.vcf.gz

I've read your blogpost on the Denisova genome and I used your Rs-ID-annotated VCF file you provide to transform it into a 23andMe-style file which I uploaded to openSNP (you can find the page here: http://opensnp.org/users/278) as I thought users might be interested in comparing their SNPs with Denisova. Thanks for your work on this!

But I've found some strange things after converting the file and either I'm not clever enough to parse VCF or this is an upstream-error. The thing: You mentioned in your blogpost that there are Y-chromosomal SNP-calls. But I also found that those Y-chromosomal SNPs seem to be diploid and heterozygotous for some sites. And the same is true for the mitochondrial SNPs

This is a line of the annotated VCF as an example:

Y 59033159 rs28447236 A G 1585.24 . AC=1;AF=0.50;AN=2;BaseQRankSum=2.001;DB;DP=151;Dels=0.01;FS=14.641;HRun=0;HaplotypeScore=15.4950;MQ=33.20;MQ0=2;MQRankSum=0.678;QD=10.50;ReadPosRankSum=-0.114 GT:AD:DP:GQ:PL 0/1:89,60:149:99:1615,0,2746

As far as I've understood VCF the genotype is encoded by the cell 0/1:89,60:149:99:1615,0,2746 with 0/1 being the genotype (0 = major allele, 1 = minor allele), is this right? If I'm right: Do you have a clue why those SNPs got called as diploid?

When in doubt, the X&Y weirdness is due to the pseudoautosomal region

While experimenting with my own confusion in the regions these commands came in handy, so I repaste them here

grab the Y snps

gzip -dc denisovan1.vcf.gz | grep '^Y' | cut -f2 > thepos.txt

count how many are in each 1mb region

perl -p -e'$_ = int($_/1000000)."\n"' thepos.txt  | uniq -c
  43 snps in megabase 2
 540 snps in megabase  3
 368 snps in megabase  4
 273 ... 5
 205 ...  6
 116 7
  40 8
3370 9
3084 10
11389 13
 168 14
 111 15
  83 16
  87 17
  61 18
  49 19
   6 20
  91 21
  46 22
  59 23
   9 24
   3 25
   6 26
   1 27
1039 28
1513 58
 748 59

Note that megabase 13 has a ton of snps. (betcha didn't know 11389 snps weighs 2000lbs).

visit http://www.broadinstitute.org/gsa/wiki/index.php/Unified_genotyper#Calling_sex_chromosomes and the dropbox presentation

My weird SNPs weren't well clustered to the PARs, but there were some spikes around 13Mb and 58Mb.

In the wiki text they talk about "with proper, full data processing as per the best practices in the GATK should lead to very good calls". Well I didn't do best practices (never found an obvious link to any) I ran with defaults, and I figured that contributed to the weirdness.

I've not been able to explain it, except to conclude that Unified Genotyper doesn't do well with haploid, and hoped others might dig deeper. So far it seems you're the first.

I think this conversation is worthy of wider awareness, so I'm reposting to http://www.snpedia.com/index.php?title=User:Denisova