Gene names converted to dates in Xena's PANCAN_mutation dataset #4
Mary Goldman from the UCSC Xena Browser team investigated this issue and wrote:
As a workaround, we could always remap mutations to genes using genomic location, as @clairemcleod began experimenting with. See #6 (comment).
@dhimmel Do you think it would be better to remap all genes based on genomic location, or only those affected by the date issue? I've implemented the liftover procedure @gwaygenomics described above, and will submit it via a pull request soon. Remapping everything seems best from a consistency perspective, but would obviously require more computation time.
I like the comprehensive (not patchwork) mapping approach. I do think it will be important to check for consistency with the Xena mapping. In instances where different genes are called, what happened, why, and who's right? If we find that our mapping seems to have issues, then I would advocate using the Xena gene calls for all resolvable symbols and then remapping by location only the unresolvable symbols. Computation time is less of a concern -- are we talking minutes (acceptable) or hours (acceptable but not ideal)?
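The consistency check suggested above could be sketched roughly as follows. This is only an illustration: the function name, the mutation keys, and the gene identifiers are hypothetical placeholders, not taken from the Xena data or from the project's code.

```python
def compare_calls(xena_calls, location_calls):
    """Report mutations where the two mappings call different genes.

    Both arguments are dicts mapping a mutation key (here a hypothetical
    (sample, chromosome, position) tuple) to a gene identifier.
    Only mutations present in both mappings are compared.
    """
    shared = xena_calls.keys() & location_calls.keys()
    return {
        key: (xena_calls[key], location_calls[key])
        for key in sorted(shared)
        if xena_calls[key] != location_calls[key]
    }


# Placeholder data for illustration only.
xena = {("S1", "1", 100): "GENE_A", ("S1", "2", 500): "GENE_B"}
by_location = {("S1", "1", 100): "GENE_A", ("S1", "2", 500): "GENE_C"}
print(compare_calls(xena, by_location))
```

Each disagreement is returned as a pair (Xena call, location-based call), which makes it easier to inspect by hand what happened and which mapping is right.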
The liftover for the whole dataset takes maybe 10-15 min. If we remap the whole dataset, the part I am more concerned about (and have not yet come up with an efficient way to do) is mapping a genomic location to an ID. In principle we'll have two tables: one with each observed mutation's location and another with IDs and their corresponding location ranges. Somehow we'll need to merge these tables wherever the observed location falls within a range. The other problem I've been running into is finding a source for the location-to-Entrez ID mapping. I thought I'd found something useful with UCSC's knownGene and keggEntrez tables, but together these actually contain only ~5300 unique Entrez IDs. Are there other resources anyone would recommend for this mapping?
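One possible way to do the range merge described above, as a minimal sketch using only the Python standard library. The function names, coordinates, and IDs are hypothetical placeholders; coordinates are assumed to be inclusive on both ends, and overlapping genes are all returned rather than resolved.

```python
from bisect import bisect_right
from collections import defaultdict


def build_index(gene_ranges):
    """Group (chrom, start, end, gene_id) tuples by chromosome, sorted by start."""
    grouped = defaultdict(list)
    for chrom, start, end, gene_id in gene_ranges:
        grouped[chrom].append((start, end, gene_id))
    index = {}
    for chrom, intervals in grouped.items():
        intervals.sort()
        starts = [start for start, _, _ in intervals]
        index[chrom] = (starts, intervals)
    return index


def genes_at(index, chrom, pos):
    """Return IDs of all genes on `chrom` whose [start, end] range contains `pos`."""
    if chrom not in index:
        return []
    starts, intervals = index[chrom]
    # Only intervals starting at or before `pos` can contain it.
    upper = bisect_right(starts, pos)
    return [gene_id for _, end, gene_id in intervals[:upper] if end >= pos]


# Placeholder ranges for illustration only (not real annotations).
index = build_index([
    ("chr1", 100, 200, 111),
    ("chr1", 150, 300, 222),
    ("chr2", 50, 80, 333),
])
print(genes_at(index, "chr1", 175))
```

The binary search only narrows the candidates by start coordinate; the remaining scan is linear in the worst case, which is inelegant but likely acceptable for a dataset of this size.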
Unless solutions to the above issues clearly present themselves, this seems like a good path forward. Even an inelegant solution to the location -> ID mapping should work fine on that scale.
@Inquisitive-Geek New plan (courtesy of @dhimmel): Use a combination of chromosome and gene symbol to map observed mutations to Entrez IDs. Hopefully the combination will be sufficient to resolve most ambiguity. To address the date conversion in Issue #4, we can either use a location-based mapping or back out the gene names that Excel could have "translated".
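A rough sketch of the chromosome-plus-symbol idea, including backing out Excel-style dates. Several things here are assumptions: the date format (`1-Mar` style, which may not match what is actually in the file), the helper names, and the lookup table, whose chromosomes and IDs are placeholders rather than real annotations.

```python
import re

# Assumed Excel output style such as "1-Sep"; the format in the file may differ.
DATE_LIKE = re.compile(r"^(\d{1,2})-(Mar|Sep|Dec)$")

# Symbol prefixes that Excel could have turned into each month abbreviation.
MONTH_TO_PREFIXES = {"Mar": ["MARCH", "MARC"], "Sep": ["SEPT"], "Dec": ["DEC"]}


def candidate_symbols(value):
    """Return the gene symbols a value could stand for.

    Date-like values expand to every symbol Excel could have converted;
    anything else is returned unchanged as the only candidate.
    """
    match = DATE_LIKE.match(value)
    if match is None:
        return [value]
    number = int(match.group(1))
    return [f"{prefix}{number}" for prefix in MONTH_TO_PREFIXES[match.group(2)]]


def resolve_entrez(chrom, value, lookup):
    """Return the ID if exactly one candidate exists on `chrom`, else None."""
    hits = {
        lookup[(chrom, symbol)]
        for symbol in candidate_symbols(value)
        if (chrom, symbol) in lookup
    }
    return hits.pop() if len(hits) == 1 else None


# Placeholder table: (chromosome, symbol) -> ID. Values are illustrative only.
lookup = {("4", "MARCH1"): 1001, ("1", "MARC1"): 1002, ("17", "TP53"): 1004}
print(resolve_entrez("4", "1-Mar", lookup))
```

A value like `1-Mar` is ambiguous by symbol alone, since both MARCH1 and MARC1 collapse to it, so the chromosome is what disambiguates. Rows that still match zero or several candidates return `None` and would need the location-based mapping.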
I've noticed that some gene names have been converted to dates in PANCAN_mutation (version info, Xena Browser). Here are some of the affected rows:

The gene-to-date conversion is a well-documented feature of Microsoft Excel. While the number of corrupted rows in PANCAN_mutation looked minimal, it's disturbing that the data has passed through Excel, since workflows that use Excel tend to be manual rather than scripted and thus error-prone and irreproducible.