STRipy: Enhancing Genotyping for Short Tandem Repeats

In the popular imagination, genetic mutations are mostly single nucleotide variants: for instance, a “C” gets swapped for a “T” in the wrong place and, voilà, you have an unexpected variation with unexpected consequences. Users of genomic data, however, often encounter changes to the genetic code that take predictable but more complex forms. One such category is Short Tandem Repeats (STRs). The “short” in STR refers to the short length of the repeating sequence of bases (2 to 16 base pairs), while “tandem” indicates that the repeats occur next to each other, sequentially. STRs are also called microsatellites.
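To make the idea of a "tandem" repeat concrete, here is a minimal Python sketch that counts how many copies of a motif sit back to back in a sequence. The function name `longest_tandem_run` and the toy sequence are made up for illustration. This is not how STRipy or ExpansionHunter genotype repeats from sequencing reads; it only illustrates the definition.

```python
import re


def longest_tandem_run(sequence: str, motif: str) -> int:
    """Return the largest number of back-to-back copies of `motif` in `sequence`."""
    if not motif:
        raise ValueError("motif must be non-empty")
    # One or more consecutive copies of the motif, matched greedily.
    pattern = re.compile(f"(?:{re.escape(motif.upper())})+")
    runs = [
        len(match.group(0)) // len(motif)
        for match in pattern.finditer(sequence.upper())
    ]
    return max(runs, default=0)


# Toy sequence (hypothetical): one run of five CAGs and one run of two.
toy = "TTGA" + "CAG" * 5 + "TCC" + "CAG" * 2 + "AT"
print(longest_tandem_run(toy, "CAG"))  # prints 5
```

Only the longest uninterrupted run is reported, since the two extra CAG copies further along are separated from the first run and are therefore not in tandem with it.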

STRs can be phenotypically neutral yet still useful as identifiers, as in forensics, where they are used to identify individuals. Other STRs have been found to have an impact on disease. Naturally having longer repeats can even be a good thing; for example, the length of the CAG repeat in the androgen receptor (AR) gene has been shown to affect the expression level of the gene. Specifically, more repeats lower the transcription rate. In prostate cancer, this has been linked to better outcomes.

Measuring the length of an STR requires genomic sizing or sequencing of individual samples. Unsurprisingly, sequencing has been found to be more accurate than the cruder sizing technique, and for disease diagnostics this difference can have a significant impact on an individual’s care.

New methodologies are available to measure and report STRs. One of them, STRipy, is built around an existing tool called ExpansionHunter. It adds functionality to accommodate longer repeats and, crucially, adds more data to the database of known repeats.

The group behind STRipy published a much larger dataset covering 55 loci (previously 28), including all known disease-causing loci, each of which has been expertly curated. This curation is necessary in a clinical context to relate the STR loci (where the repeat is found in the genome) to linked diseases.

The tool takes as input the alignment that is already generated for any sequencing sample (BAM or CRAM formats are accepted). This keeps the cost of running the tool low, especially when it is used alongside a separate variant caller. The visualizations and GUI are also readily available, which makes the tool very easy to download and use!
