Phase 2 species identification #5

Open
alimanfoo opened this issue Nov 7, 2017 · 0 comments

Reported by Dan Lawson:

I ran some quick and dirty QC (Perl one-liners) on the phase 2 AR1 release and stumbled at almost the first hurdle: species identification.

I took the M/S assignment (column 9 of the samples.meta.txt file) and the 5 columns that constitute the samples.species.txt file, to check whether the SNP-based inference of M or S form (now An. coluzzii, An. gambiae, or a hybrid) is consistent. Here's what I found.

// Assignments to M, S or hybrid

perl -ne 'next if (/ox_code/);chomp;@f=split/\t/;print "$f[8]\n";' samples.meta.txt | sort | uniq -c | sort -rn
720 S
287 M
113
22 M/S

=> 113 individuals do not have an assignment

// Check consistency where an assignment present

Count [samples.species columns] == m_s

650 [S S S S S] == S
56 [S . S S S] == S
7 [S M/S S S S] == S
4 [S M/S M/S M/S M/S] == S
1 [S S S S M/S] == S
1 [S S M/S M/S M/S] == S
1 [S M/S M M M] == S

260 [M M M M M] == M
15 [M M M M M/S] == M
6 [M . M M M] == M
3 [M . M M M/S] == M
2 [M S S S S] == M
1 [M M/S M/S M/S M/S] == M

8 [M/S M/S M/S M/S M/S] == M/S
6 [M/S S M/S M/S M/S] == M/S
4 [M/S S S S S] == M/S
2 [M/S M M/S M/S M/S] == M/S
1 [M/S M/S S S S] == M/S
1 [M/S M/S M/S M/S M] == M/S

101 [ S S S S] == AWOL S ?
7 [ . S S S] == AWOL S ?
2 [ M/S S S S] == AWOL S ?
1 [ S S S M/S] == AWOL S ?
1 [ S S M/S S] == AWOL S ?
1 [ M/S M/S M/S M/S] == AWOL ?
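
For anyone wanting to reproduce a tally like the one above, here is a rough Python sketch. It is not the Perl that produced the table. It assumes samples.species.txt is tab-delimited with one header row and that its five columns are the calls in the order shown inside the brackets; treating "." and an empty field as "no call" is also an assumption. The function names and the toy header and rows are made up for illustration.

```python
from collections import Counter


def tally_call_patterns(lines, call_cols=(0, 1, 2, 3, 4), has_header=True):
    """Count how often each combination of species calls occurs.

    `lines` is any iterable of tab-delimited text lines (an open file works).
    `call_cols` are the zero-based positions of the call columns; the default
    of "the first five columns" is an assumption about the file layout.
    """
    counts = Counter()
    for i, line in enumerate(lines):
        if has_header and i == 0:
            continue
        # Strip only the line ending so a leading empty field is preserved.
        fields = line.rstrip("\r\n").split("\t")
        # Pad short rows so a missing trailing field counts as empty.
        fields += [""] * (max(call_cols) + 1 - len(fields))
        counts[tuple(fields[c] for c in call_cols)] += 1
    return counts


def is_discordant(pattern, missing=("", ".")):
    """True if the non-missing calls in a pattern are not all identical."""
    called = {call for call in pattern if call not in missing}
    return len(called) > 1


# Toy input: hypothetical header names, rows copied from patterns in the table.
example = [
    "meta\tsine_hmm\tsnp_a\tsnp_b\tsnp_c\n",
    "S\tS\tS\tS\tS\n",
    "M\tS\tS\tS\tS\n",
    "\tS\tS\tS\tS\n",
]
for pattern, n in tally_call_patterns(example).most_common():
    flag = "  <- discordant" if is_discordant(pattern) else ""
    print(n, "[" + " ".join(pattern) + "]" + flag)
```

Passing an open file handle for samples.species.txt instead of `example` would give the counts for the real data, provided the layout assumptions hold.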

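My take on this is twofold. First, I don't know what the various columns relate to: the *snp columns are diagnostic SNPs taken from the reads, but I don't know what meta and sine_hmm pertain to. Second, there is quite a lot of heterogeneity here, including samples whose M/S assignment disagrees with most or all of the other calls (e.g. the 2 [M S S S S] cases).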

I'd love to understand this a bit more, but a key thing is to review the missing data for the 113 individuals.
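
As a starting point for that review, here is a minimal Python sketch that lists which samples have an empty M/S field. It assumes samples.meta.txt is tab-delimited, that the header row is the one containing "ox_code" (as skipped in the Perl one-liner above), and that the sample identifier is in the first column; that last point is a guess, and the function name and toy rows are hypothetical.

```python
def samples_missing_m_s(lines, m_s_col=8, id_col=0):
    """Return the IDs of samples whose M/S field is empty.

    m_s_col=8 is column 9 (zero-based), as used in the issue.
    id_col=0 assumes the identifier is the first column (an assumption).
    """
    missing = []
    for line in lines:
        if not line.strip():
            continue  # ignore blank lines
        if "ox_code" in line:
            continue  # header row, skipped the same way as in the one-liner
        fields = line.rstrip("\r\n").split("\t")
        # A row too short to reach column 9 is treated as having no assignment.
        m_s = fields[m_s_col] if len(fields) > m_s_col else ""
        if not m_s.strip():
            missing.append(fields[id_col])
    return missing


# Toy input with made-up sample IDs and filler columns.
header = "\t".join(["ox_code"] + ["col%d" % i for i in range(2, 9)] + ["m_s"])
toy = [
    header,
    "\t".join(["ID_A"] + ["x"] * 7 + ["S"]),
    "\t".join(["ID_B"] + ["x"] * 7 + [""]),
]
print(samples_missing_m_s(toy))
```

Run against the real file (for example by passing an open file handle), the length of the returned list should match the 113 unassigned individuals if the assumptions about the layout are right.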
