subPos & match score feature request #5

Open
zztin opened this issue May 13, 2020 · 9 comments
@zztin

zztin commented May 13, 2020

Hi Gao,
I tried to retrieve the repeated subunits from the long read and feed them into other consensus calling methods (such as Medaka by ONT or majority voting).

According to the README:
subPos: start coordinates of all the tandem repeat unit sequence, followed by the end coordinate of the last tandem repeat unit sequence, separated by ",", all coordinates are 1-based.

  • Problems I faced:
  1. When 5' and 3' primers are given, the subPos is the start of the tandem repeat sequence, not the start of the targeted sequence. However, the length is the targeted sequence length. The tandem repeat length is not reported.
  • Is it possible to report the start location at the position where the target sequence starts instead of the whole tandem repeat?
  • Is it possible to include the (start, end) position of each sub-unit? Or to have an option to export all the repeat subunits in a fastq file (with identifiable read name such as >readname_consX_repY).
  2. In some reads, multiple consensus sequences of different lengths are reported with (completely) overlapping regions. Is it possible to include a column to report the overall alignment score of the subunits?
  • I see there is a criterion to filter by maximum divergence rate between two consecutive repeats, but this does not necessarily report the quality of the overall consensus. Is this a correct interpretation? Is there a possibility to add a score to report the divergence rate of all repeats to the consensus sequence?

Thank you very much!!

@yangao07
Collaborator

yangao07 commented May 14, 2020

Thanks for your comments and questions.

  • For the issues related to the subPos column, I updated the README file and added examples to illustrate how the coordinates are defined.
    Right now, it is not easy to output the coordinates of "target sequence" instead of the tandem repeats, since it needs accurate alignment to determine how the "target sequence" is contained in each tandem repeat unit. We may implement this in the future.
  • For the score or divergence rate, we could add a column of average accuracy, calculated based on the alignment between each repeat unit and the consensus sequence. Will this work for you?

Yan

@zztin
Author

zztin commented May 14, 2020

Hi Yan,

  • I understand.
  • Yes, that would be nice!
  • If I have several consensus sequences derived from one long nanopore read, would you recommend a method to assess whether they are actually the same sequence that got split up? (What I do now is align them to the genome sequence, but I wonder whether you have any reference-free ideas.)
    [Figure not shown: blue marks sense-strand repeats, red marks anti-sense repeats.]

@yangao07
Collaborator

Not sure if I understand your question correctly.
You could align one consensus to another consensus sequence, see if they have enough matched bases.
Since each consensus sequence may start from any position of the target sequence, you can append one more copy to each consensus, and align the two copies to each other.
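The doubling idea above can be sketched in a few lines of Python. This is only an illustration, not TideHunter code: the function names are made up, it doubles just one of the two sequences, it uses a plain edit-distance scan in place of a real aligner, and it also tries the reverse complement on the assumption that sense and anti-sense consensuses should count as the same sequence.

```python
def revcomp(seq):
    """Reverse complement; assumes upper-case A/C/G/T only."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def best_edit_distance(pattern, text):
    """Smallest edit distance between `pattern` and any substring of `text`."""
    prev = [0] * (len(text) + 1)  # row 0 is all zeros: a match may start anywhere
    for i, p in enumerate(pattern, 1):
        cur = [i] + [0] * len(text)
        for j, t in enumerate(text, 1):
            cur[j] = min(prev[j - 1] + (p != t),  # match or substitution
                         prev[j] + 1,             # base of pattern left unmatched
                         cur[j - 1] + 1)          # extra base in text
        prev = cur
    return min(prev)  # a match may end anywhere


def rotation_identity(a, b):
    """Identity of `a` against a doubled copy of `b`, on either strand."""
    doubled = b + b  # contains every rotation of b as a substring
    dist = min(best_edit_distance(a, doubled),
               best_edit_distance(revcomp(a), doubled))
    return 1.0 - dist / len(a)


a = "ACGTTGCAAG"
b = a[4:] + a[:4]  # same unit, different starting position
print(rotation_identity(a, b))  # 1.0
```

The score is only meaningful when the two consensuses have similar lengths, because a short sequence can match perfectly inside a much longer one. The scan is quadratic, which is fine for unit-sized sequences but slow for long ones.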

@yangao07
Collaborator

Check out the -u/--unit-seq option in the latest release: v1.4.0.
It will give you all the unit sequences of each tandem repeat.

@zztin
Author

zztin commented May 20, 2020

Hi Yan,
Thank you for the new feature --unit-seq. I tried it out and it looks good!
I have a question about the aveMatch score. The test_50x4 example gives a score of 98.0 although the sequences are exactly identical to each other. Is this expected?
Is the aveMatch score on a scale of 0 - 100 (%)?

Another unrelated question: in this example, the 4 repeats start at 51, 101, 151, 201.
I would expect the subPos to be 51,101,151,201,250 instead of 101,151,201,250. Is it always the case that the first repeat is not included in the subPos list, even if it is complete? Or did I misinterpret something?

Thank you very much!

output of the test_50x4.fa test case:

>test_50x4_rep0_300_51_250_50_4.0_98.0_0_101,151,201,250
CAGCTAGTCGATCGATCGGCATGCATGCATGCTAGTCGATGCATCGGGAT
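For readers who want to recover per-unit coordinates from a header like the one above, here is a minimal Python sketch. The subPos rule follows the README text quoted earlier in the thread (unit start coordinates followed by the end of the last unit, 1-based). The names and meanings of the other fields (read length, consensus length, copy number, full-length flag) are guesses from this one example, and both function names are hypothetical.

```python
def parse_header(header):
    """Split a header like the test_50x4 one into named fields.

    Only start, end, aveMatch and subPos are named in the thread;
    the remaining field names are assumptions.
    """
    # The read name may itself contain "_", so split from the right.
    (name, rep, read_len, start, end, cons_len,
     copy_num, ave_match, full_len, sub_pos) = header.lstrip(">").rsplit("_", 9)
    return {
        "read_name": name,
        "rep_id": rep,
        "read_len": int(read_len),
        "start": int(start),
        "end": int(end),
        "cons_len": int(cons_len),
        "copy_num": float(copy_num),
        "ave_match": float(ave_match),
        "full_len": int(full_len),
        "sub_pos": [int(x) for x in sub_pos.split(",")],
    }


def unit_intervals(sub_pos):
    """Turn subPos into 1-based, inclusive (start, end) pairs, one per unit."""
    starts, last_end = sub_pos[:-1], sub_pos[-1]
    # Each unit ends one base before the next unit starts.
    ends = [s - 1 for s in starts[1:]] + [last_end]
    return list(zip(starts, ends))


record = parse_header(">test_50x4_rep0_300_51_250_50_4.0_98.0_0_101,151,201,250")
print(unit_intervals(record["sub_pos"]))
# [(101, 150), (151, 200), (201, 250)]
```

With this header the list yields three intervals, which matches the observation that the unit starting at 51 is missing from subPos.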

@zztin zztin closed this as completed May 20, 2020
@zztin zztin reopened this May 20, 2020
@yangao07
Collaborator

yangao07 commented May 20, 2020

You are right. The sequence was shifted by 1 bp.
I will fix this bug soon.
Thanks for pointing it out.

@yangao07
Collaborator

Just updated to v1.4.1.
Please try the new version.

@yangao07
Collaborator

For your other questions:

  • The aveMatch score is the average percentage of # matched bases over the total length of each unit, so it is 0~100 (%).
  • The subPos information is based on the kmer matches, so it does not necessarily point to the very start position of the first tandem repeat unit. This is expected, since there may not be enough matched kmers around that start position. The start and end information, which are 51 and 250 in this toy example, denote the start and end coordinates of the whole tandem repeat. To obtain these two positions, TideHunter aligns the generated consensus sequence back to the raw read.
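As a rough illustration of the aveMatch definition above (average percentage of matched bases over the length of each unit), here is a short Python sketch. It is not TideHunter's implementation: the function names are made up, and the longest common subsequence stands in for the real unit-to-consensus alignment as a simple way to count matched bases.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of a and b."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def ave_match(units, consensus):
    """Average over units of 100 * matched bases / unit length."""
    percents = [100.0 * lcs_length(u, consensus) / len(u) for u in units]
    return sum(percents) / len(percents)


consensus = "CAGCTAGTCGATCGATCGGCATGCATGCATGCTAGTCGATGCATCGGGAT"
print(ave_match([consensus] * 4, consensus))  # 100.0
```

Under this definition, four units identical to the consensus score 100.0, so any lower value for identical units would have to come from how the units are delimited or aligned.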

@yangao07
Collaborator

I think it is feasible to derive a set of subPos that includes as many units as possible.
The same applies to the subPos for full-length consensus sequences.
I will work on that.
