
User:Cariaso/SequenceExchangeFormat

From SNPedia

23andMe asks

@cariaso Why do you hope for XML? Seems inefficient to encode large quantities of data. Thanks for the VCF link, have not see it before.


"... premature optimization is the root of all evil" @Knuth


This is not theoretical. The differences between the 23andMe v1 and v2 file formats have caused a misdiagnosis. As with 23andMe's current format, I expect a .zip wrapper for compression. I would also welcome a SAM/BAM-style solution that offers the best of both worlds: a readable text form alongside an equivalent compact binary form.

Will 23etAl be returning a series of 'raw reads'? If so, these are specific to the platform, and would therefore vary between companies and across time within one company.

Perhaps a distilled assembly instead, but that is only a best guess, and alternative assemblies will exist. I'd anticipate multiple revisions and some notation to suggest possible alternatives for low-confidence regions. Perhaps a measure of confidence, similar to FASTQ?
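
To make the FASTQ comparison concrete, here is a minimal sketch of how per-base confidence is read from a FASTQ quality string. It assumes the Phred+33 encoding (older files may use a different offset); the function names and the cutoff of Q20 are illustrative choices, not part of any proposed format.

```python
def phred33_scores(quality_string):
    """Decode a FASTQ quality string into Phred scores, assuming the +33 offset."""
    return [ord(ch) - 33 for ch in quality_string]


def error_probability(q):
    """Phred definition: Q = -10 * log10(p), so p = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)


def low_confidence_positions(quality_string, min_q=20):
    """Return 0-based positions whose score falls below min_q (an arbitrary cutoff)."""
    return [i for i, q in enumerate(phred33_scores(quality_string)) if q < min_q]
```

An exchange format could carry a comparable score per call or per region, so that a reader can decide for itself which regions to trust.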

If we want an optimized format, we should be prefiltering the 99% of the genome that is identical to the reference. How shall we communicate the remaining non-identical 'diffs'?
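
As a rough illustration of what such 'diffs' might look like, the sketch below reports only the positions where a sample differs from a reference. It is deliberately simplified: it assumes two already-aligned sequences of equal length and handles substitutions only, and the function name and tuple layout are hypothetical. Insertions and deletions are exactly where the harder representation questions begin.

```python
def substitutions(reference, sample, start=1):
    """List (position, ref_base, sample_base) for each mismatch.

    Assumes both sequences are aligned and the same length; indels are
    out of scope for this sketch. Positions are 1-based by default.
    """
    if len(reference) != len(sample):
        raise ValueError("sequences must be aligned and of equal length")
    return [
        (start + i, r, s)
        for i, (r, s) in enumerate(zip(reference, sample))
        if r != s
    ]
```

For example, `substitutions("ACGTACGT", "ACGAACGT")` yields a single entry for position 4, and everything identical to the reference is simply omitted.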

How will we resolve cases where there are multiple ways of describing the same DNA sequence, akin to Ambiguous flips?
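
One small instance of this problem can be sketched in code, assuming 'ambiguous flips' refers to SNPs whose two alleles are complements of each other (A/T or C/G). For those, reporting the opposite strand produces the same pair of letters, so the alleles alone cannot say which strand was meant. The helper names below are hypothetical.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def flip(alleles):
    """Return the allele set as it would read on the opposite strand."""
    return frozenset(COMPLEMENT[a] for a in alleles)


def is_strand_ambiguous(alleles):
    """True when flipping strands leaves the allele set unchanged (A/T or C/G)."""
    alleles = frozenset(alleles)
    return flip(alleles) == alleles
```

An A/G SNP flips to T/C and is easy to recognize, whereas an A/T SNP flips to itself. A format that records the strand explicitly avoids having to guess.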

XML is best able to handle this sort of uncertainty and evolution.
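
A minimal sketch of the kind of record this suggests, parsed with Python's standard xml.etree.ElementTree. Every element name, attribute name, and value here is invented for illustration; no such schema exists. The point is only that alternative calls, confidence values, and an explicit strand can sit side by side, and that a reader which looks up the names it knows keeps working when later revisions add elements it has never seen.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: all names and values are made up for this sketch.
DOC = """<genome format-version="2" reference="hypothetical-build">
  <variant id="snp-1" pos="4" ref="T" strand="plus">
    <call alleles="AT" confidence="0.99"/>
    <call alleles="TT" confidence="0.01"/>
    <note-added-in-later-revision>ignored by older readers</note-added-in-later-revision>
  </variant>
</genome>"""


def best_calls(xml_text):
    """Map each variant id to its highest-confidence alleles."""
    root = ET.fromstring(xml_text)
    result = {}
    for variant in root.iter("variant"):
        calls = variant.findall("call")
        if not calls:
            continue  # nothing called for this variant
        best = max(calls, key=lambda c: float(c.get("confidence", "0")))
        result[variant.get("id")] = best.get("alleles")
    return result
```

Here `best_calls(DOC)` picks the "AT" call for the one variant and is unaffected by the extra element.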