Questions about the Barcode/UMI correction #136
Comments
Hi Anne, thanks for your questions.
1: Barcode Correction
"The barcode sequence extracted from adapter1 based on its position and length is referred to as the "uncorrected barcode." This sequence is then filtered based on its sequencing quality and whether it appears in the known barcode file for 10X. Sequences meeting these criteria are termed "corrected barcodes" and are stored in the shortlist file. Is this correct?"
A: Sequences that meet the quality criteria and have a 100% match in the 10x whitelist are classed as high-quality barcodes, that is, barcodes we can be confident exist in the dataset. These are used to build the shortlist of all barcodes we can expect to see.
B: The threshold calculation is defined here:
The calculated threshold is shown as the vertical line in the knee plot generated in the report; cells to the right of this line are filtered out.
C: "Additionally, you consider barcodes not in the whitelist that may have been affected by minor indels during sequencing. These are compared to the whitelist barcodes, and the closest matching barcode is included. Is my understanding correct?"
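For reference, the nearest-match idea described in point C can be sketched as below. This only illustrates the description in the question and is not taken from the wf-single-cell source: the function names, the `max_dist` cutoff, and the rule of rejecting ties are all assumptions.

```python
def edit_distance(a, b):
    """Levenshtein distance: substitutions, insertions and deletions cost 1."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # match or substitute
            ))
        prev = cur
    return prev[-1]


def closest_barcode(query, candidates, max_dist=2):
    """Return the single closest candidate barcode, or None.

    max_dist is a hypothetical cutoff; ties for the best distance are
    treated as ambiguous and rejected (also an assumption).
    """
    scored = sorted((edit_distance(query, bc), bc) for bc in candidates)
    if not scored or scored[0][0] > max_dist:
        return None
    if len(scored) > 1 and scored[1][0] == scored[0][0]:
        return None
    return scored[0][1]
```

With candidates `["ACGTACGT", "TTTTGGGG"]`, the query `"ACGTCGT"` (one deleted base) resolves to `"ACGTACGT"`, while a query that is equally close to two candidates is left unassigned.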
2: UMI Correction
I hope this helps, Neil
Ask away!
To whom it may concern,
I am a user of wf-single-cell and I have some questions regarding the internal workings of the pipeline, specifically concerning the barcode and UMI correction steps. I would greatly appreciate your assistance in clarifying these matters.
Barcode Correction
The barcode sequence extracted from adapter1 based on its position and length is referred to as the "uncorrected barcode." This sequence is then filtered based on its sequencing quality and whether it appears in the known barcode file for 10X. Sequences meeting these criteria are termed "corrected barcodes" and are stored in the shortlist file. Is this correct?
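To make the filtering step in this question concrete, here is a minimal sketch of counting only barcodes that pass a quality check and match the whitelist exactly. The input layout (pairs of barcode and per-base Phred scores), the function name, and the `min_qual` value are hypothetical and not taken from the workflow.

```python
from collections import Counter


def high_quality_counts(reads, whitelist, min_qual=15):
    """Count reads per barcode, keeping only exact whitelist matches
    whose lowest per-base quality is at least min_qual.

    reads: iterable of (barcode, phred_scores) pairs (assumed layout).
    min_qual: hypothetical cutoff, not the workflow's actual value.
    """
    counts = Counter()
    for barcode, quals in reads:
        if barcode in whitelist and min(quals) >= min_qual:
            counts[barcode] += 1
    return counts
```

The resulting barcode-to-count mapping corresponds to the two-column barcode/count file discussed further down.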
I noticed your documentation mentions: “This threshold is determined by ranking the cells by read count and taking the top n cells (n = expected_cells). The read count 95th percentile / 20 is the threshold used. This threshold can be visualized in the knee plots generated by the workflow.” I am not entirely clear on this explanation.
There are two columns in the barcode file: one for barcode and one for count. Here, count refers to the number of reads associated with that barcode, i.e., the number of reads per cell. Suppose I set the parameter `expected_cells` to 100, which means I sort counts in descending order and select the top 100 cells. Using these 100 cells as a whole, the threshold for filtering is based on the 95th percentile of their counts divided by 20. Are the barcodes that meet this threshold then used for further filtering?
Additionally, you consider barcodes not in the whitelist that may have been affected by minor indels during sequencing. These are compared to the whitelist barcodes, and the closest matching barcode is included. Is my understanding correct?
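My reading of the documented threshold rule can be written out as the sketch below. It assumes the 95th percentile is taken over the counts of the top `expected_cells` barcodes with linear interpolation, and that barcodes with counts at or above the threshold are kept; the function names and the `>=` comparison are my assumptions, not the workflow's code.

```python
def percentile(values, q):
    """Linear-interpolation percentile (q in [0, 100]) of a non-empty list."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def read_count_threshold(counts, expected_cells):
    """95th percentile of the top-n read counts, divided by 20."""
    top = sorted(counts.values(), reverse=True)[:expected_cells]
    return percentile(top, 95) / 20


def shortlist(counts, expected_cells):
    """Barcodes whose read count reaches the threshold (>= is assumed)."""
    threshold = read_count_threshold(counts, expected_cells)
    return {bc for bc, n in counts.items() if n >= threshold}
```

For example, with counts of 1000, 800, 600, 30 and 5 and `expected_cells` set to 3, the 95th percentile of the top three counts is about 980, so the threshold is about 49 and only the first three barcodes are retained.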
UMI Correction
For UMI correction, since UMIs are random sequences without a reference sequence, you rely on minimap alignment results. Based on these results, you cluster reads mapped to the same gene and extract the shared UMI information. Are there specific criteria used for filtering to determine which UMIs are considered corrected? Furthermore, does the UMI assignment depend on the genes included in the alignment? For example, if I use only a small subset of genes' GTF files as the index, does the number of identified UMIs depend on the number of genes that can be aligned?
Metrics for Correction Proportions
Is there a way to calculate the proportion of corrected UMIs relative to all incorrect UMIs during the filtering process? The same question applies to barcodes.
Thank you very much for your assistance!
Best regards,
Anne