Talk:Genes for Good
Genes for Good is using an old version of dbSNP for annotation: many, if not most, of the rsIDs are missing from the ID column and replaced with the genomic location and nucleotide change. Can the web-based analysis handle this automatically, or will annotations for those sites be missed? Also, the files contain empty contig headers, which the desktop version doesn't like, but I assume the web version has been modified to handle them by now?
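To get a feel for how many sites are affected, the ID column can be scanned for entries that are not rs-numbers. This is only a minimal sketch: the function name `count_missing_rsids` is made up for illustration, and it assumes well-formed, tab-separated VCF data lines with the ID in the third column. It says nothing about how Promethease itself treats such records.

```python
import re

# An rsID is "rs" followed by digits; anything else (".", "1:200:G:A", ...)
# is counted as missing.
RSID = re.compile(r"^rs\d+$")


def count_missing_rsids(vcf_lines):
    """Return (records without an rsID, total records) for an iterable of VCF lines."""
    missing = total = 0
    for line in vcf_lines:
        # Skip meta-information lines, the column header, and blank lines.
        if line.startswith("#") or not line.strip():
            continue
        total += 1
        # The VCF ID column may hold several identifiers separated by ";".
        ids = line.rstrip("\n").split("\t")[2].split(";")
        if not any(RSID.match(i) for i in ids):
            missing += 1
    return missing, total
```

Called on an open file handle, for example `count_missing_rsids(open("sample.vcf"))` (the file name is a placeholder), it returns the two counts as a tuple.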
Pooling the directly tested and imputed variants is definitely best. The GFG chip is a combined GWAS + exome chip, but less than half of the SNPs on it are shared with other chips, so a run with only the directly tested variants yields relatively few annotations. Conversely, not all tested variants are included in the imputed file, so running with just the imputed file isn't a good idea either (though presumably most if not all annotated variants are in the imputed file...).
However, if for no other reason than this, it would be useful for Promethease to be able to display which variants are imputed and which are directly tested. I guess the only information available in the original VCFs is whether the genotypes are unphased (directly tested) or statistically pseudophased (imputed). Of course, a lot of people will also have a GWAS kit like 23andMe or AncestryDNA and may want to include those as well.
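The phased/unphased distinction above can be read off the allele separator in the GT field: the VCF format uses `/` for unphased and `|` for phased genotypes. The sketch below treats that separator as a proxy for "directly tested" versus "imputed". That mapping is the guess made in this comment, not something documented by GFG, and the function names are hypothetical.

```python
def genotype_source(gt):
    """Guess the origin of a VCF GT value from its allele separator.

    Assumption (from the discussion, not from GFG documentation):
    unphased calls ("/") are directly tested, phased calls ("|") are imputed.
    """
    if "|" in gt:
        return "imputed"
    if "/" in gt:
        return "directly tested"
    return "unknown"  # haploid or missing call: the separator gives no hint


def classify_line(vcf_line):
    """Return (variant ID, guessed source) for one single-sample VCF data line."""
    fields = vcf_line.rstrip("\n").split("\t")
    format_keys = fields[8].split(":")    # FORMAT column, e.g. "GT:DS"
    sample_values = fields[9].split(":")  # first (only) sample column
    gt = sample_values[format_keys.index("GT")]
    return fields[2], genotype_source(gt)
```

This only works while the files keep the separators as produced. Any tool that re-phases or normalises genotypes would erase the distinction.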
As an aside, while testing this on the desktop version, I used the latest dbSNP snapshot to annotate the VCF file. That adds the RV (rev) tag, which causes the desktop version to flood errors about multisamples and apparently produce annotations in both orientations. Is this a bug, or am I missing something? Removing the RV tags fixed the results, however. --Donwulff (talk) 20:22, 1 February 2016 (UTC)
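For anyone wanting to reproduce the workaround, removing the RV flag amounts to dropping one entry from the semicolon-separated INFO column. The following is a rough sketch under that assumption. The helper name `strip_info_flag` is invented here, and it is not necessarily how the tags were removed in the test described above.

```python
def strip_info_flag(vcf_line, flag="RV"):
    """Return the VCF line with `flag` removed from its INFO column."""
    # Header and meta-information lines are passed through untouched.
    if vcf_line.startswith("#"):
        return vcf_line
    newline = "\n" if vcf_line.endswith("\n") else ""
    fields = vcf_line.rstrip("\n").split("\t")
    if len(fields) < 8:
        return vcf_line  # no INFO column to edit
    kept = [
        entry
        for entry in fields[7].split(";")
        if entry != flag and not entry.startswith(flag + "=")
    ]
    # An empty INFO column is written as "." in VCF.
    fields[7] = ";".join(kept) or "."
    return "\t".join(fields) + newline
```

The matching `##INFO=<ID=RV,...>` header line is left in place. Leaving a declared but unused INFO key is harmless as far as the format is concerned.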
One thought that arises is that most genetic analysis techniques (the DNA microarray chips that the DTC companies use, and short-read sequencing, to name a few) are prone to non-specific hybridization or, in the case of short-read sequencing, multiple mapping. What this means is that they detect variants by locally matching the expected nucleotide sequence at the location of the variant. But sometimes, perhaps quite frequently, other areas of the genome have identical or nearly identical sequences, which will match too. It is possible, even likely, that labs don't insert the proper dbSNP ID into VCF files when they believe the probe may map to multiple locations (and a VCF record really only supports listing a variant at a single location).
Of course, none of the variants are certain until they've been "clinically validated", so, as the saying goes, this is "for informational and educational purposes only". I personally take the stance that more information is better, though at this stage automated DNA interpretation isn't indicated for people with the rs4680 worrier variant ;) However, it would be nice if in the future SNPedia were able to annotate whether a variant was imputed, along with its quality score (GFG sadly doesn't provide those, but DNA.Land imputation does), whether the site name was a match or not, and whether a source (such as the Illumina QC files, or perhaps OpenSNP variant frequencies) indicates the variant is imprecisely mapped, or a genomic scan of the binding region indicates it's duplicated elsewhere.--Donwulff (talk) 09:08, 4 February 2016 (UTC)