Gene names converted to dates in Xena's PANCAN_mutation dataset #4

dhimmel · 2016-07-16T20:33:33Z

I've noticed that some gene names have been converted to dates in PANCAN_mutation (version info, Xena Browser). Here are some of the affected rows:

sample	chr	start	end	reference	alt	gene	effect	DNA_VAF	Amino_Acid_Change
TCGA-KK-A8IH-01	chr4	164534558	164534558	G	C	1-Mar	Missense_Mutation	0.320754716981	p.N33K
TCGA-EJ-7125-01	chr16	4829717	4829717	C	A	12-Sep	Missense_Mutation	0.0357142857143	p.R266L
TCGA-CH-5762-01	chr7	55874871	55874871	T	C	14-Sep	Missense_Mutation	0.0251256281407	p.T300A
TCGA-G9-6351-01	chrX	118767429	118767429	C	A	6-Sep	Missense_Mutation	0.0280373831776	p.R328M
TCGA-G9-6342-01	chr5	132098260	132098260	C	A	8-Sep	Missense_Mutation	0.0485436893204	p.M204I

The gene-to-date conversion is a well-documented feature of Microsoft Excel. While the number of corrupted rows in PANCAN_mutation looked minimal, it's disturbing that the data has passed through Excel, since workflows that use Excel tend to be manual rather than scripted and are thus error-prone and irreproducible.
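For anyone who wants to count the corrupted rows, here is a minimal sketch of a check. It assumes the mangled symbols always look like `<day>-<Mon>` (as in the rows above) and that rows are dicts keyed by the header names; `is_date_like` and `find_date_like_rows` are hypothetical helper names, not part of any existing pipeline.

```python
import re

# Assumption: Excel-mangled symbols look like "<day>-<Mon>", e.g. "1-Mar".
# Other date renderings (locale-dependent formats) are not covered here.
DATE_LIKE = re.compile(
    r"^\d{1,2}-(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)$"
)


def is_date_like(symbol):
    """Return True if a gene symbol looks like an Excel-style date."""
    return bool(DATE_LIKE.match(symbol.strip()))


def find_date_like_rows(rows, gene_field="gene"):
    """Return the rows whose gene field looks like a date."""
    return [row for row in rows if is_date_like(row[gene_field])]
```

The rows could come from `csv.DictReader(handle, delimiter="\t")` run over the tab-separated mutation file.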

dhimmel · 2016-07-25T20:51:28Z

Mary Goldman from the UCSC Xena Browser team investigated this issue and wrote:

> We just checked our files and the gene names that are converted to dates are part of the input MAF data file we got from TCGA DCC, which is from the sequencing center, such as Broad.

As a workaround, we could always remap mutations to genes using genomic location as @clairemcleod began experimenting with. See #6 (comment).

clairemcleod · 2016-07-26T15:04:03Z

@dhimmel Do you think it would be better to remap all genes based on genomic location, or simply those that have this identified date issue? I've implemented the liftover procedure @gwaygenomics described above, and will submit that via a pull request soon. Remapping everything seems best from a consistency perspective, but would obviously require more computation/time.

dhimmel · 2016-07-26T15:30:46Z

> Do you think it would be better to remap all genes based on genomic location, or simply those that have this identified date issue?

I like the comprehensive (not patchwork) mapping approach. I do think it will be important to check for consistency with the Xena mapping. In instances where different genes are called, what happened, why, and who's right?

If we find that our mapping seems to have issues, then I would advocate using the Xena gene calls for all resolvable symbols and then remapping by location only the unresolvable symbols.
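A rough sketch of that fallback, just to make the idea concrete. Everything here is hypothetical: `symbol_to_entrez` stands in for whatever symbol lookup table we end up with, and `locate_by_position` for whatever location-based mapper gets implemented.

```python
def resolve_entrez_id(symbol, chrom, pos, symbol_to_entrez, locate_by_position):
    """Prefer the symbol-based call; fall back to location otherwise.

    symbol_to_entrez: dict mapping gene symbol -> Entrez ID (hypothetical).
    locate_by_position: callable (chrom, pos) -> Entrez ID or None (hypothetical).
    Returns (entrez_id, source), where source records which path was used.
    """
    entrez_id = symbol_to_entrez.get(symbol)
    if entrez_id is not None:
        return entrez_id, "symbol"
    # Unresolvable symbol (e.g. a date such as "1-Mar"): remap by location.
    return locate_by_position(chrom, pos), "location"
```

Keeping the `source` label around would make it easy to audit how many calls came from each path.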

Computation time is less of a concern -- are we talking minutes (acceptable) or hours (acceptable but not ideal)?

clairemcleod · 2016-07-26T20:55:02Z

The liftover for the whole dataset takes maybe 10-15 min. If we re-map the whole dataset, the part I am more concerned about (and have not yet come up with an efficient way to do) is mapping a genomic location to an ID. Theoretically we'll have two tables: one with each observed mutation's location and another with IDs and a corresponding location range. Somehow we'll need to merge these tables wherever the observed location falls within the range.
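One possible shape for that merge, as a standard-library sketch rather than a tested solution. It assumes inclusive `[start, end]` gene ranges in the same coordinate system as the mutations; `build_index` and `genes_at` are made-up names. The bisect step only prunes genes that start after the position, and the rest is a linear filter, so an interval tree would likely be faster on a large gene table.

```python
from bisect import bisect_right
from collections import defaultdict


def build_index(genes):
    """Group gene ranges by chromosome, sorted by start position.

    genes: iterable of (chrom, start, end, gene_id) with inclusive coordinates.
    """
    by_chrom = defaultdict(list)
    for chrom, start, end, gene_id in genes:
        by_chrom[chrom].append((start, end, gene_id))
    index = {}
    for chrom, intervals in by_chrom.items():
        intervals.sort()
        starts = [start for start, _, _ in intervals]
        index[chrom] = (starts, intervals)
    return index


def genes_at(index, chrom, pos):
    """Return the IDs of all genes whose range contains pos.

    Several IDs can come back when genes overlap; an empty list means
    the position is intergenic or the chromosome is not in the index.
    """
    if chrom not in index:
        return []
    starts, intervals = index[chrom]
    # Only genes starting at or before pos can contain it.
    hi = bisect_right(starts, pos)
    return [gene_id for _, end, gene_id in intervals[:hi] if end >= pos]
```

Building the index once and calling `genes_at` per mutation avoids a full cross join of the two tables.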

The other problem I've been running into is finding a source for the location-to-Entrez ID mapping. I thought I'd found something useful with UCSC's knownGene and keggEntrez tables, but together these actually only contain ~5300 unique Entrez IDs. Are there other resources anyone would recommend for finding this mapping?

> If we find that our mapping seems to have issues, then I would advocate using the Xena gene calls for all resolvable symbols and then remapping by location only the unresolvable symbols.

Unless solutions to the above issues clearly present themselves, this seems like a good path forward. Even an inelegant solution to the location -> ID mapping should work fine on that scale.

clairemcleod · 2016-07-27T02:43:52Z

@Inquisitive-Geek New plan (courtesy of @dhimmel): Use a combination of chromosome and gene symbol to map observed mutations to Entrez IDs. Hopefully the combination will be sufficient to resolve most ambiguity. To address the date conversion in Issue #4, we can either use a location-based mapping, or back out the gene names that Excel could have "translated".
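A minimal sketch of the "back out" option. It assumes the dates follow the `<number>-<Mon>` pattern seen in Issue #4 and that only a few symbol families are involved; the prefix table is illustrative and would need checking against a current gene nomenclature table. `candidate_symbols` is a hypothetical helper name.

```python
import re

# Assumption: symbol prefixes that Excel may have turned into each month.
# Illustrative only; verify against an authoritative symbol list.
MONTH_TO_PREFIXES = {
    "Mar": ["MARCH", "MARC"],
    "Sep": ["SEPT"],
    "Dec": ["DEC"],
}

DATE_LIKE = re.compile(r"^(\d{1,2})-([A-Z][a-z]{2})$")


def candidate_symbols(value):
    """Return the gene symbols a date-like value could have come from.

    Values that do not look like dates are returned unchanged in a
    one-element list. Date-like values with no known prefix give [].
    """
    match = DATE_LIKE.match(value)
    if not match:
        return [value]
    number, month = match.groups()
    prefixes = MONTH_TO_PREFIXES.get(month, [])
    return [f"{prefix}{int(number)}" for prefix in prefixes]
```

A value like `1-Mar` comes back with more than one candidate, which is where the chromosome column could break the tie.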

dhimmel mentioned this issue Jul 17, 2016

Converting Xena datasets to standard identifiers rather than gene symbols #6

Closed

dhimmel mentioned this issue Jul 27, 2016

July 19–26 Project Cognoma Acknowledgements cognoma/cognoma#23

Closed

clairemcleod mentioned this issue Jul 27, 2016

Map mutation gene symbols to Entrez IDs #12

Merged
