2021 has been a year of significant product innovations at Streamline Genomics. Recently, we launched our Seeq search engine, which aggregates information about genes, variants, treatments and clinical trials into one intuitive and fast search engine. There are other tools available to explore the human genome, the aggregate body of knowledge of mutations and their relevance to pathology, and the latest personalized treatments available, but with Seeq, our aim is to make the experience of navigating this complex information as easy as possible.
Today, our mission to democratize the tools necessary to unlock genomic information extends even further. We are releasing SeeqVCF, a lightweight tool that allows anyone to analyze Variant Call Format (VCF) files locally in the browser, for free. We have identified several barriers to wider adoption of easy VCF analysis and addressed them in our approach to building SeeqVCF.
VCF specifies the format of a text file used in bioinformatics for storing gene sequence variations. The format has been, and continues to be, developed alongside the exponential growth in Next Generation Sequencing (NGS) and the corresponding proliferation of digitized genomic data, and it is now a common and widely understood way to store genomic data. For the lay reader, a VCF acts like a compression algorithm for raw genomic sequencing data. A generalized, raw file format for storing the entirety of one person’s genome would record every single position in that genome, each at a measured level of statistical reliability. In practice, the vast majority of this information is exactly the same in all human genomes and is effectively redundant. A VCF stores only the variations relative to a “reference genome” and can therefore convey information about the entire genome in a fraction of the space. Common file formats for images, audio and video work on the same principle: with the right “compression algorithm”, virtually the same fidelity can be captured with a small subset of the data.
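The idea of storing only the differences from a reference can be sketched in a few lines of Python. This is a toy illustration, not how variant calling or SeeqVCF actually works: the sequences are made up, only single-base substitutions between equal-length strings are handled, and the names `Variant`, `call_snvs` and `to_vcf_line` are hypothetical. The only thing taken from the VCF format itself is the order of the eight fixed, tab-separated columns (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO).

```python
from typing import Iterator, NamedTuple


class Variant(NamedTuple):
    chrom: str
    pos: int  # 1-based position, as in VCF
    ref: str  # base in the reference
    alt: str  # base observed in the sample


def call_snvs(chrom: str, reference: str, sample: str) -> Iterator[Variant]:
    """Yield one record per position where the sample differs from the reference."""
    if len(reference) != len(sample):
        raise ValueError("this toy example only handles equal-length sequences")
    for i, (r, s) in enumerate(zip(reference, sample)):
        if r != s:
            yield Variant(chrom, i + 1, r, s)


def to_vcf_line(v: Variant) -> str:
    # Fixed VCF columns: CHROM POS ID REF ALT QUAL FILTER INFO ("." means missing).
    return "\t".join([v.chrom, str(v.pos), ".", v.ref, v.alt, ".", ".", "."])


# Made-up 20-base sequences; the sample differs from the reference at two positions.
reference = "ACGTACGTACGTACGTACGT"
sample = "ACGTACGAACGTACGTCCGT"

variants = list(call_snvs("chr1", reference, sample))
for v in variants:
    print(to_vcf_line(v))
```

Twenty bases of sample sequence reduce to two short records. At the scale of a whole genome, where most positions match the reference, the same principle is what keeps a VCF small.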
While the VCF format does compress raw genomic data into what is essentially an index of variants, a VCF does not itself say which of those variants are clinically significant or significant to research. A sequenced human genome may contain, for example, four to five million individual variants, plus structural variations affecting over twenty million bases, yet only a handful of these variants may be clinically relevant in the context of, say, breast cancer treatment.
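To make the gap between "an index of variants" and "the handful that matter" concrete, here is a minimal sketch of narrowing a VCF to a set of target genes. It rests on several assumptions that are not part of this announcement: the records are toy data with invented positions, each record is assumed to carry a hypothetical `GENE=` key in its INFO column (real annotated files usually store gene names in annotator-specific INFO fields), and `parse_info` and `filter_by_genes` are illustrative helper names. It is not a description of how SeeqVCF is implemented.

```python
from typing import Dict, Iterable, List

# Toy VCF content: one meta line, the header line, and four invented records.
# The GENE=... INFO key is an assumption made for this example.
VCF_TEXT = """\
##fileformat=VCFv4.2
#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO
chr17\t1000\t.\tG\tA\t50\tPASS\tGENE=BRCA1
chr13\t2000\t.\tC\tT\t60\tPASS\tGENE=BRCA2
chr7\t3000\t.\tA\tG\t45\tPASS\tGENE=EGFR
chr1\t4000\t.\tT\tC\t30\tPASS\tDP=12
"""


def parse_info(info: str) -> Dict[str, str]:
    """Split an INFO field like 'GENE=BRCA1;DP=12' into a dict; flags map to ''."""
    if info == ".":
        return {}
    out: Dict[str, str] = {}
    for item in info.split(";"):
        key, _, value = item.partition("=")
        out[key] = value
    return out


def filter_by_genes(lines: Iterable[str], target_genes: Iterable[str]) -> List[dict]:
    """Keep only the records whose (hypothetical) GENE annotation is a target gene."""
    targets = set(target_genes)
    hits = []
    for line in lines:
        if not line or line.startswith("#"):
            continue  # skip blank lines, meta lines and the column header
        chrom, pos, _id, ref, alt, _qual, _filter, info = line.split("\t")[:8]
        gene = parse_info(info).get("GENE")
        if gene in targets:
            hits.append({"chrom": chrom, "pos": int(pos), "ref": ref, "alt": alt, "gene": gene})
    return hits


hits = filter_by_genes(VCF_TEXT.splitlines(), ["BRCA1", "BRCA2"])
for hit in hits:
    print(hit)
```

Of the four toy records, only the two annotated with a target gene survive the filter. A real analysis would also have to weigh variant quality, consequence and clinical evidence, which this sketch deliberately leaves out.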
VCF analysis is therefore a bioinformatics problem that is growing exponentially along with the volume of sequencing data, as more and more people around the world have their genomes sequenced, as patients and in other contexts.
Several factors make private, fast VCF analysis difficult:
- VCF analysis takes the input of sensitive datasets that may inherently include personal health information.
- Accessible third-party services that leverage proprietary techniques often require the use of server-side or cloud-based processes. Essentially, one must send data to the cloud.
- VCF analysis tools can present a hurdle: they often require fluency with basic query languages, open-source code libraries, command-line interfaces, and intermediate software and database skills. That requirement excludes some users and demands a time commitment even from technically knowledgeable ones.
SeeqVCF specifically addresses these challenges with innovations in how we distribute and run our tool:
- SeeqVCF is a drag-and-drop interface: all you need to do is choose the VCF file from its source and run the analysis. There is no need to download open-source libraries, install command-line tools, or set up interoperating infrastructure on your device.
- SeeqVCF is fast. Input your target genes along with your file and SeeqVCF will output analysis in under a minute. Analyzing VCFs should be fast, safe and reliable.
We hope that together we can accelerate our genomic future and democratize meaningful genomic analysis for all.