# Home
VCFs with tdb compatible representations describe the full region of interest in the CHROM/POS/REF columns and a sample's full allele that would replace the region's reference sequence in the ALT column. We call this a 'replacement' style representation since the entry describes 'replace this span of reference sequence with the sample's alternate sequence'. For example:
```
#CHROM  POS   ID  REF   ALT       QUAL  FILTER  INFO      FORMAT  SAMPLE
chr1    1234  .   CGCG  CGCGCGCG  .     PASS    MOTIF=CG  GT      1/1
```
The reference sequence over chr1:1234-1238 is CGCG; in our sample we see an expansion of two additional copies, so the sample's allele is 8bp long compared to the reference's 4bp. By representing the sample's change over the full region, we can more easily inspect tandem-repeat-relevant properties (e.g. allele length) and compare alleles across samples. If, instead, the variant were represented as an insertion of CGCG, we would need additional INFO fields or analysis to determine the total length of the sample's tandem repeat, whereas with this representation we simply look at the length of the alternate sequence. Furthermore, the CGCG insertion could be placed at multiple positions over the region, which would again require more information or analysis to compare samples.
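To make the advantage concrete, here is a minimal Python sketch of the arithmetic described above (the variable names and motif-counting are illustrative, not part of any tdb API):

```python
# Replacement-style entry: REF spans the full region and ALT is the
# sample's full sequence for that region (values from the example above).
ref = "CGCG"       # reference sequence over the region
alt = "CGCGCGCG"   # sample's full allele for the same region
motif = "CG"       # repeat motif from the INFO field

# The allele length is simply the length of the alternate sequence...
allele_length = len(alt)                            # 8bp vs. the reference's 4bp
# ...and the expansion is the length difference counted in motif units.
extra_copies = (len(alt) - len(ref)) // len(motif)  # 2 additional copies
```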
VCFs are the standard format for genomic variant reporting. However, the format is not normalized. For example, the VCF format's INFO field violates first normal form's rule of atomicity since multiple INFO fields are in the same column.
There are three distinct groups of information on a single VCF line: Locus, Allele, and Sample. A Locus is a location in the genome, namely the chromosome, position, and reference sequence. Allele information describes non-reference sequence observed at the Locus. Sample information comprises sample-specific properties of an allele observed at a Locus.
For example, consider a variant entry:
```
chr1  1234  .  CCTCCT  CCT,CCTCCTCCT  .  .  INFO=1,3  GT:SD  1/2:10,20
```
If we were to normalize this information, we'd have:
```
## Locus
# chrom start end
chr1 1234 1240

## Allele
# allele_number allele_length sequence
0 6 CCTCCT
1 3 CCT
2 9 CCTCCTCCT

## Sample
# allele_number spanning_reads
1 10
2 20
```
If we apply primary keys to these rows so that information can be combined across the three tables, we get a data structure that is much easier to query.
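With primary keys in place, combining the tables is a pair of joins. A minimal pandas sketch of the normalized tables above (the merge calls are illustrative; this is not a tdb API):

```python
import pandas as pd

# The three normalized tables from the example, keyed by LocusID.
locus = pd.DataFrame({"LocusID": [0], "chrom": ["chr1"],
                      "start": [1234], "end": [1240]})
allele = pd.DataFrame({"LocusID": [0, 0, 0],
                       "allele_number": [0, 1, 2],
                       "allele_length": [6, 3, 9],
                       "sequence": ["CCTCCT", "CCT", "CCTCCTCCT"]})
sample = pd.DataFrame({"LocusID": [0, 0],
                       "allele_number": [1, 2],
                       "spanning_reads": [10, 20]})

# Join sample observations to their alleles, then alleles to their locus.
merged = (sample.merge(allele, on=["LocusID", "allele_number"])
                .merge(locus, on="LocusID"))
```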
tdb compatible VCFs can be collected into a database for easier querying. For an interactive tutorial, see `notebooks/Introduction.ipynb`.
The database is a directory with a `.tdb` extension containing a set of parquet files. For example:
```
project.tdb/
├── locus.pq
├── allele.pq
├── sample.HG002.pq
├── sample.HG003.pq
└── sample.HG004.pq
```
`locus.pq`: The locations of tandem repeats

column | definition
---|---
LocusID | identifier for the locus
chrom | chromosome of the locus
start | 0-based start position of the locus
end | 0-based end position of the locus
`allele.pq`: Tandem repeat alleles found on a locus

column | definition
---|---
LocusID | locus to which the allele belongs
allele_number | identifier of the allele on the locus. '0' is the reference allele
allele_length | length of the allele
sequence | sequence of the allele
`sample.*.pq`: sample properties of alleles

column | definition
---|---
LocusID | identifier for the locus
allele_number | identifier of the allele on the locus. '0' is the reference allele
spanning_reads | number of spanning reads supporting the allele
length_range_lower | allele minimum predicted length
length_range_upper | allele maximum predicted length
average_methylation | allele's mean methylation scaled 0-1
The columns in the Sample table are parsed from specific FORMAT fields in the VCF. The VCF header tags which are parsed are defined as:
```
##FORMAT=<ID=ALLR,Number=A,Type=String,Description="Length range per allele">
##FORMAT=<ID=SD,Number=A,Type=Integer,Description="Number of spanning reads supporting per allele">
##FORMAT=<ID=AM,Number=A,Type=Float,Description="Mean methylation level per allele">
```
Each of these fields has two values, one per haplotype. These fields may contain missing values (`.`), and the VCF can still be converted to a tdb.
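As a sketch of what that parsing involves, the function below splits one VCF sample column by its FORMAT keys and maps `.` entries to missing values (the function itself is illustrative; tdb's actual parser may differ):

```python
def parse_sample_field(format_keys, sample_field):
    """Split a VCF sample column into {FORMAT key: per-allele values},
    turning '.' entries into None."""
    parsed = {}
    for key, value in zip(format_keys.split(":"), sample_field.split(":")):
        parsed[key] = [None if v == "." else v for v in value.split(",")]
    return parsed

# One value per haplotype; here the second AM value is missing ('.').
fields = parse_sample_field("GT:SD:AM", "1/2:10,20:0.82,.")
```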
To create a database, we can use the following command:
```
tdb create -o proband.tdb proband.vcf.gz
```
The output must end with `.tdb`. We can create a database from multiple VCFs or tdb files.
```
tdb create -o family.tdb proband.tdb father.vcf.gz mother.vcf.gz
```
We can also append to an existing database.
```
tdb append --to family.tdb --fr cousin.vcf.gz
```
Once we have a database created, we can then perform standard queries.
By default, results are written as a TSV to stdout. The output can be redirected to a file with `--output` (`-o`) and written in other formats using `--output-type` (`-O`).
```
tdb query gtmerge family.tdb -O c -o genotypes.csv
```
tdbs can be manipulated in Python using the `tdb` library:

```python
import tdb
```
Loading a tdb is simply:

```python
data = tdb.load_tdb("family.tdb")
```
This returns a dictionary with the structure:

```python
{
    "locus": pandas.DataFrame,
    "allele": pandas.DataFrame,
    "sample": {
        "proband": pandas.DataFrame,
        "father": pandas.DataFrame,
        "mother": pandas.DataFrame,
        "cousin": pandas.DataFrame,
    }
}
```
Accessing the DataFrame:

```python
data['locus'].head()
```

```
   LocusID chrom  start    end
0        3  chr1  16682  16774
1        9  chr1  19275  19473
2        4  chr1  20798  20893
3        2  chr1  29714  29822
4        0  chr1  30824  30989
```
Some of the tdb queries are available programmatically:
```python
ac = tdb.allele_count(data)
```
Pandas is very helpful for manipulating data and can make building your own queries easy. Here's an example query collecting all loci with at least 3 alleles (note we avoid `inplace=True` so that `data['locus']` is left unchanged for later examples):

```python
loci_count = data['allele'].groupby(["LocusID"]).size()
min_3_alleles = loci_count[loci_count >= 3]
loci = data['locus'].set_index("LocusID")
loci.loc[min_3_alleles.index].head()
```
Here's an example for calculating what percent of alleles per-locus have over 20x spanning reads:
```python
# Build an allele coverage table
act = data['locus']["LocusID"].copy().to_frame().set_index("LocusID")
act["Total"] = 0
act["Covered"] = 0
# Add each sample to the coverage table
for samp, table in data['sample'].items():
    act["Total"] = act["Total"].add(table.groupby(["LocusID"]).size(), fill_value=0)
    act["Covered"] = act["Covered"].add(table[table['spanning_reads'] >= 20].groupby("LocusID").size(), fill_value=0)
# Create a column with the per-locus percent and get the mean.
act["Percent"] = act["Covered"] / act["Total"]
act["Percent"].mean()
```
Note that this measurement of 'average percent of alleles per-locus with >= 20x' is different from the 'total percent of alleles with >= 20x', which would be:
```python
m_sum = act.sum()
m_sum['Covered'] / m_sum['Total']
```
When loading a tdb, we can pass filters to pyarrow and only load subsets of data.
```python
chr1 = tdb.load_tdb("family.tdb",
                    samples=['proband', 'mother'],
                    lfilters=[("chrom", "=", "chr1"),
                              ("start", ">", 71685300),
                              ("end", "<", 72497289)],
                    afilters=[("allele_length", ">=", 1000)])
```
See `help(tdb.load_tdb)` for details on the filters.