Gene names converted to dates in Xena's PANCAN_mutation dataset #4

Open
dhimmel opened this issue Jul 16, 2016 · 5 comments
Comments

@dhimmel
Member

dhimmel commented Jul 16, 2016

I've noticed that some gene names have been converted to dates in PANCAN_mutation (version info, Xena Browser). Here are some of the affected rows:

| sample | chr | start | end | reference | alt | gene | effect | DNA_VAF | RNA_VAF | Amino_Acid_Change |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| TCGA-KK-A8IH-01 | chr4 | 164534558 | 164534558 | G | C | 1-Mar | Missense_Mutation | 0.320754716981 | | p.N33K |
| TCGA-EJ-7125-01 | chr16 | 4829717 | 4829717 | C | A | 12-Sep | Missense_Mutation | 0.0357142857143 | | p.R266L |
| TCGA-CH-5762-01 | chr7 | 55874871 | 55874871 | T | C | 14-Sep | Missense_Mutation | 0.0251256281407 | | p.T300A |
| TCGA-G9-6351-01 | chrX | 118767429 | 118767429 | C | A | 6-Sep | Missense_Mutation | 0.0280373831776 | | p.R328M |
| TCGA-G9-6342-01 | chr5 | 132098260 | 132098260 | C | A | 8-Sep | Missense_Mutation | 0.0485436893204 | | p.M204I |

The gene-to-date conversion is a well-documented feature of Microsoft Excel. While the number of corrupted rows in PANCAN_mutation looks minimal, it's disturbing that the data has passed through Excel, since workflows that use Excel tend to be manual rather than scripted, and thus error-prone and irreproducible.
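
A minimal sketch of how the affected rows could be flagged with a script rather than by eye. It assumes the rows are available as dictionaries with a `gene` key, and that only the `Mar`, `Sep`, and `Dec` month abbreviations matter (an assumption based on the MARCH, SEPT, and DEC gene families). The function name and the third sample row are hypothetical.

```python
import re

# Matches the "day-month" strings Excel produces, e.g. "1-Mar" or "12-Sep".
# Limiting the months to Mar/Sep/Dec is an assumption, not an exhaustive rule.
DATE_LIKE = re.compile(r"^\d{1,2}-(Mar|Sep|Dec)$")


def find_date_like_genes(rows, gene_column="gene"):
    """Return the rows whose gene field looks like an Excel-converted date."""
    return [row for row in rows if DATE_LIKE.match(row[gene_column])]


rows = [
    {"sample": "TCGA-KK-A8IH-01", "chr": "chr4", "gene": "1-Mar"},
    {"sample": "TCGA-EJ-7125-01", "chr": "chr16", "gene": "12-Sep"},
    # Hypothetical unaffected row, included only for contrast.
    {"sample": "hypothetical-sample", "chr": "chr17", "gene": "TP53"},
]
flagged = find_date_like_genes(rows)
```

Here `flagged` holds the first two rows and leaves the unaffected one out.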

@dhimmel
Member Author

dhimmel commented Jul 25, 2016

Mary Goldman from the UCSC Xena Browser team investigated this issue and wrote:

> We just checked our files and the gene names that are converted to dates are part of the input MAF data file we got from TCGA DCC, which is from the sequencing center, such as Broad.

As a workaround, we could always remap mutations to genes using genomic location as @clairemcleod began experimenting with. See #6 (comment).

@clairemcleod
Member

@dhimmel Do you think it would be better to remap all genes based on genomic location, or simply those that have this identified date issue? I've implemented the liftover procedure @gwaygenomics described above, and will submit that via a pull request soon. Remapping everything seems best from a consistency perspective, but would obviously require more computation/time.

@dhimmel
Member Author

dhimmel commented Jul 26, 2016

> Do you think it would be better to remap all genes based on genomic location, or simply those that have this identified date issue?

I like the comprehensive (not patchwork) mapping approach. I do think it will be important to check for consistency with the Xena mapping. In instances where different genes are called, what happened, why, and who's right?

If we find that our mapping seems to have issues, then I would advocate using the Xena gene calls for all resolvable symbols and then remapping by location only the unresolvable symbols.
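
The fallback rule and the consistency check could be sketched roughly as follows. This is only an illustration: `known_symbols` stands in for whatever set of resolvable symbols is used, both function names are hypothetical, and the sample pairs are made-up inputs rather than results from the dataset.

```python
def resolve_gene(xena_symbol, location_symbol, known_symbols):
    """Prefer the Xena call when it is a resolvable symbol.

    Otherwise fall back to the location-based call. Returns (symbol, source).
    """
    if xena_symbol in known_symbols:
        return xena_symbol, "xena"
    return location_symbol, "location"


def disagreements(pairs, known_symbols):
    """List (xena, location) pairs where a resolvable Xena call differs
    from the location-based call, so they can be inspected by hand."""
    return [
        (xena, location)
        for xena, location in pairs
        if xena in known_symbols and location is not None and xena != location
    ]


# Made-up inputs for illustration only.
known_symbols = {"TP53", "SEPT12"}
pairs = [("TP53", "TP53"), ("TP53", "WRAP53"), ("12-Sep", "SEPT12")]
conflicts = disagreements(pairs, known_symbols)
fallback = resolve_gene("12-Sep", "SEPT12", known_symbols)
```

Keeping the source label alongside each symbol makes it easy to count how many calls came from each route.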

Computation time is less of a concern -- are we talking minutes (acceptable) or hours (acceptable but not ideal)?

@clairemcleod
Member

The liftover for the whole dataset takes maybe 10-15 min. If we remap the whole dataset, the part I am more concerned about (and have not yet found an efficient way to do) is mapping a genomic location to an ID. In theory we'll have two tables: one with each observed mutation's location, and another with IDs and their corresponding location ranges. Somehow we'll need to merge these tables wherever the observed location falls within a range.
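
One possible way to do this range merge, sketched with only the standard library: sort each chromosome's ranges by start, use binary search to skip ranges that start after the position, then keep those whose end still covers it. The function names, the `geneA`/`geneB`/`geneC` IDs, and the range coordinates are all made up for illustration. Coordinates are assumed to be inclusive and on the same genome build as the mutations.

```python
from bisect import bisect_right
from collections import defaultdict


def build_index(gene_ranges):
    """Group (chrom, start, end, gene_id) tuples by chromosome, sorted by start."""
    by_chrom = defaultdict(list)
    for chrom, start, end, gene_id in gene_ranges:
        by_chrom[chrom].append((start, end, gene_id))
    index = {}
    for chrom, intervals in by_chrom.items():
        intervals.sort()
        starts = [start for start, _, _ in intervals]
        index[chrom] = (starts, intervals)
    return index


def genes_at(index, chrom, position):
    """Return every gene ID whose range contains the position (may be several)."""
    if chrom not in index:
        return []
    starts, intervals = index[chrom]
    # Only intervals[:upper] start at or before the position.
    upper = bisect_right(starts, position)
    return [gene_id for _, end, gene_id in intervals[:upper] if end >= position]


# Made-up ranges; real ones would come from a gene annotation table.
index = build_index([
    ("chr4", 164500000, 164600000, "geneA"),
    ("chr4", 164530000, 164540000, "geneB"),
    ("chr16", 4800000, 4810000, "geneC"),
])
hits = genes_at(index, "chr4", 164534558)
```

Overlapping genes return more than one ID, so the caller still has to decide how to break ties. The scan after the binary search is linear in the worst case, which should be acceptable for a sketch but may need an interval tree at larger scale.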

The other problem I've been running into is finding a source for a location-to-Entrez ID mapping. I thought I'd found something useful in UCSC's knownGene and keggEntrez tables, but these actually contain only ~5300 unique Entrez IDs. Are there other resources anyone would recommend for this mapping?

> If we find that our mapping seems to have issues, then I would advocate using the Xena gene calls for all resolvable symbols and then remapping by location only the unresolvable symbols.

Unless solutions to the above issues clearly present themselves, this seems like a good path forward. Even an inelegant solution to the location -> ID mapping should work fine on that scale.

@clairemcleod
Member

@Inquisitive-Geek New plan (courtesy of @dhimmel): Use a combination of chromosome and gene symbol to map observed mutations to Entrez IDs. Hopefully the combination will be sufficient to resolve most ambiguity. To address the date conversion in Issue #4, we can either use a location-based mapping, or back out the gene names that Excel could have "translated".
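
A rough sketch of the "back out" option, combined with the chromosome check. The prefix table is an assumption about which symbols Excel turns into each month and is not exhaustive. The `symbol_to_chrom` entries are illustrative and should be loaded from, and verified against, a real annotation source. All function names are hypothetical.

```python
import re

# Assumed symbol prefixes for each month abbreviation (not exhaustive).
MONTH_TO_PREFIXES = {"Mar": ["MARCH", "MARC"], "Sep": ["SEPT"], "Dec": ["DEC"]}
DATE_LIKE = re.compile(r"^(\d{1,2})-(Mar|Sep|Dec)$")


def candidate_symbols(value):
    """Return the symbols a date-like value could have come from, else []."""
    match = DATE_LIKE.match(value)
    if match is None:
        return []
    number, month = match.groups()
    return [f"{prefix}{int(number)}" for prefix in MONTH_TO_PREFIXES[month]]


def back_out_symbol(value, chrom, symbol_to_chrom):
    """Pick the single candidate on the observed chromosome.

    Values that are not date-like pass through unchanged. Returns None when
    zero or several candidates match, so those rows can be handled separately.
    """
    if DATE_LIKE.match(value) is None:
        return value
    matches = [s for s in candidate_symbols(value) if symbol_to_chrom.get(s) == chrom]
    return matches[0] if len(matches) == 1 else None


# Illustrative lookup; verify against an annotation source before relying on it.
symbol_to_chrom = {"MARCH1": "chr4", "MARC1": "chr1", "SEPT12": "chr16"}
recovered = back_out_symbol("1-Mar", "chr4", symbol_to_chrom)
```

The chromosome is what separates the two candidates for `1-Mar` here. Rows that come back as `None` would still need the location-based mapping.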
