DNA_Coder

A tool to encode Data in DNA. From the 2021 iGEM Team Aachen.

It encodes bytes into 6-trit trytes. Then it encodes this sequence of trytes in to a string of bases in the following way:

We encode only the transitions between bases, so Homonucleotides don't matter for the encoding.

As A and G are both purin bases and C and T are pyrimidinbases and A/T, C/G are both compatible. There are certain groups between the different bases. There is no "group change" between C/T, A/G so it is a 0. From A/T, G/C there is one group change so its a 1. From A/C, G/T there are two changes so its a 2.

A --0-- G
| \   / |
1   2   1
| /   \ |
T --0-- C

Installation

Clone this repository and install pipenv first. Then:

pipenv install

Usage

To Encode a File

dna_utils.py encode --input <input> --output <output>

To Correct a File

dna_utils.py correct --input <input> --output <output> --config <config>

This command takes a fastq file as input and generates a fasta file with a single DNA sequence as output. It also takes a yaml file with additional parameters.

The command does the following:

Scan all sequences for the given primer.
- If the primer is in the sequence, remove the primer and everything before it.
- If the primer is not found, remove the sequence from the file.
Scan all sequences for the given reverse primer.
- If the reverse primer is in the sequence, remove it and everything behind it.
- If it is not found, remove the sequence.
Compare all remaining sequences and "merge" them, meaning that the tool creates one sequence that represents the average of all sequences.

The config file should look something like this:

primer:
  sequence: ACAATTCAT...TCCATGTTGAT
  max_distance: 7
reverse_primer:
  sequence: TACCA
  max_distance: 2
support: 70

For primer and reverse_primer, max_distance is the error with which a subsequence is still recognized as the primer. Technically, the Levenshtein distance is used for this. It describes how many insertions, deletions and substitutions need to be made to match two given strings.

The parameter support is used for the merging step. It determines how long the merged sequence should be. To do so, the tool first compares the lengths of all sequences. It then sets teh length to the minimum that support% of the sequences have.

python dna_utils.py correct --input original_files/BC23_small.fastq --output generated_files/BC23_corrected.fasta --config example_config.yaml

To Decode a File

dna_utils.py decode --input <input> --output <output>

Name		Name	Last commit message	Last commit date
Latest commit History 32 Commits
.gitignore		.gitignore
Pipfile		Pipfile
Pipfile.lock		Pipfile.lock
README.md		README.md
dna_coder.py		dna_coder.py
dna_utils.py		dna_utils.py
error_corrector.py		error_corrector.py
example_config.yaml		example_config.yaml
example_config_reverse.yaml		example_config_reverse.yaml
fastq.py		fastq.py
homonucleotide.py		homonucleotide.py
read.py		read.py
test.py		test.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

DNA_Coder

Installation

Usage

To Encode a File

To Correct a File

To Decode a File

About

Releases

Packages

Languages

Palaract/Aachen_dna-utils

Folders and files

Latest commit

History

Repository files navigation

DNA_Coder

Installation

Usage

To Encode a File

To Correct a File

To Decode a File

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages