Skip to content
Adam English edited this page Dec 13, 2023 · 2 revisions

tdb compatible tandem repeat representations in VCFs

VCFs with tdb compatible representations describe the full region of interest in the CHROM/POS/REF columns and a sample's full allele that would replace the region's reference sequence in the ALT column. We call this a 'replacement' style representation since the entry describes 'replace this span of reference sequence with the sample's alternate sequence'. For example:

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO     FORMAT  SAMPLE
chr1    1234    .       CGCG    CGCGCGCG .      PASS    MOTIF=CG GT      1/1

The reference sequence from chr1:1234-1238 is CGCG, we see an expansion of two additional copy in our sample, and the total length of the sample's allele is 8bp compared to the reference's 4bp. By representing the change from the sample over the full region, we're able to more easily look at tandem repeat relevant properties (e.g. allele length) and compare alleles across samples. If, however, the variant was represented as an insertion of CGCG, we would need additional INFO fields or analysis to describe the total length of the sample's tandem repeat whereas, with this representation, we simply need to look at the length of the alternate sequence. Furthermore, the CGCG insertion could placed at multiple positions over the region which would again require more information or analysis to compare samples.

tdb structure

VCFs are the standard format for genomic variant reporting. However, the format is not normalized. For example, the VCF format's INFO field violates first normal form's rule of atomicity since multiple INFO fields are in the same column.

There are three distinct groups of information on a single VCF line: Locus; Allele; Sample;. A Locus is a location in the genome, namely the chromosome, position, and reference sequence. Allele information describes non-reference sequence observed at the Locus. And Sample information is sample specific properties about an allele observed at a locus.

For example, consider a variant entry:

chr1 1234 . CCTCCT   CCT,CCTCCTCCT . . INFO=1,3   GT:SD 1/2:10,20.

If we were to normalize this information, we'd have:

## Locus
# chrom  start  end
chr1  1234  1240

## Allele
# allele_number  allele_length  sequence
0  6  CCTCCT
1  3  CCT
2  9  CCTCCTCCT

## Sample
# allele_number  spanning_reads
1  10
2  20

If we simply apply primary keys to these rows such that one can combine information across these three tables, we'd have an easier data structure to query.

tdb tutorial

tdb compatible VCFs can be collected into a database for easier querying. For an interactive tutorial, see notebooks/Introduction.ipynb

The database is a directory with a .tdb extension containing a set of parquet files. For example:

project.tdb/
├── locus.pq
├── allele.pq
├── sample.HG002.pq
├── sample.HG003.pq
└── sample.HG004.pq

Table Descriptions

locus.pq: The locations of tandem repeats

column definition
LocusID identifier for the locus
chrom chromosome of the locus
start 0-based start position of the locus
end 0-based end position of the locus

allele.pq: Tandem repeat alleles found on a locus

column definition
LocusID locus to which the allele belongs
allele_number identifier of the allele on the locus. '0' is the reference allele
allele_length length of the allele
sequence sequence of the allele

sample.*.pq: sample properties of alleles

column definition
LocusID identifier for the locus
allele_number identifier of the allele on the locus. '0' is the reference allele
spanning_reads number of spanning reads supporting per allele
length_range_lower allele minimum predicted length
length_range_upper allele maximum predicted length
average_methylation allele's mean methylation scaled 0-1

Sample Table Information

The columns in the Sample table are parsed from specific FORMAT fields in the VCF. The VCF header tags which are parsed are defined as:

##FORMAT=<ID=ALLR,Number=A,Type=String,Description="Length range per allele">
##FORMAT=<ID=SD,Number=A,Type=Integer,Description="Number of spanning reads supporting per allele">
##FORMAT=<ID=AM,Number=A,Type=Float,Description="Mean methylation level per allele">

So each of the fields have two values, one for each haplotype. These fields may have Nones (.) and the VCF can still be converted to a tdb.

Creating a database

To create a database, we can use the following command:

tdb create -o proband.tdb proband.vcf.gz

The output must end with .tdb. We can create a database from multiple vcfs or tdb files.

tdb create -o family.tdb proband.tdb father.vcf.gz mother.vcf.gz

We can also append to an existing database.

tdb append --to family.tdb --fr cousin.vcf.gz

Querying a database

Once we have a database created, we can then perform standard queries.

By default, results are written as a tsv to stdout. The output can be redirected to a file with --output (-o) and output with multiple formats using --output-type (-O).

tdb query gtmerge family.tdb -O c -o genotypes.csv

Python library

tdb's can be manipulated with python using the tdb library

import tdb

Loading a tdb is simply:

data = tdb.load_tdb("family.tdb")

This returns a dictionary with structure

{
  "locus": pandas.DataFrame,
  "allele": pandas.DataFrame,
  "sample": {
    "proband": pandas.DataFrame, 
    "father": pandas.DataFrame, 
    "mother": pandas.DataFrame, 
    "cousin": pandas.DataFrame,
  }
}

Accessing the DataFrame:

data['locus'].head()
   LocusID chrom  start    end
0        3  chr1  16682  16774
1        9  chr1  19275  19473
2        4  chr1  20798  20893
3        2  chr1  29714  29822
4        0  chr1  30824  30989

Some of the tdb queries are available programmatically:

ac = tdb.allele_count(data)

Pandas is very helpful for manipulating data and can make building your own queries easy. Here's an example query for collecting all loci with at least 3 alleles

loci_count = data['allele'].groupby(["LocusID"]).size()
min_3_alleles = loci_count[loci_count >= 3]

data['locus'].set_index("LocusID", inplace=True)
data['locus'].loc[min_3_alleles.index].head()

Here's an example for calculating what percent of alleles per-locus have over 20x spanning reads:

# Build an allele coverage table
act = data['locus']["LocusID"].copy().to_frame().set_index("LocusID")
act["Total"] = 0
act["Covered"] = 0

# Add each sample to the coverage table
for samp, table in data['sample'].items():
    act["Total"] = act["Total"].add(table.groupby(["LocusID"]).size(), fill_value=0)
    act["Covered"] = act["Covered"].add(table[table['spanning_reads'] >= 20].groupby("LocusID").size(), fill_value=0)

# Create a column with the per-locus percent and get the mean.
act["Percent"] = act["Covered"] / act["Total"]
act["Percent"].mean()

Note that this measurement of 'average percent of alleles per-locus with >= 20x' is different from the 'total percent of alleles with >= 20x', which would be:

m_sum = act.sum()
m_sum['Covered'] / m_sum['Total']

When loading a tdb, we can pass filters to pyarrow and only load subsets of data.

chr1 = tdb.load_tdb("family.tdb",
		samples=['proband', 'mother'],
		lfilters=[("chrom", "=", "chr1"), 
		          ("start", ">", 71685300),
			  ("end", "<", 72497289)],
		afilters=[("allele_length", ">=", 1000)])

See help(tdb.load_tdb) for details on the filters.