A pipeline tool that converts Cytosense data files to an EcoTaxa-compatible format.
- Automated conversion from Cytosense (.cyz) to EcoTaxa format
- Built-in cyz2json binary for data extraction
- Cross-platform support (Linux, Windows, macOS)
sudo apt install jq
You only need the install.sh file from the repository:
sudo ./install.sh --github
Without --github, cyz2json needs to be installed in the repository beforehand. There are some bugs with this path.
sudo ./install.sh
On Windows, use install.ps1 instead of install.sh (not tested).
When you use a manual build, you need to bypass the macOS security quarantine. You have two options. The first is the xattr command:
xattr -d com.apple.quarantine /opt/cytosense_to_ecotaxa_pipeline/bin/*
Alternatively, if you do not want to use xattr, go to System Settings > Security & Privacy > General and allow the app to be opened for each of the 10+ libraries. You need to relaunch the cyz2json binary several times to obtain all the necessary permissions.
/usr/local/bin/cytosense_to_ecotaxa_pipeline Deployment\ 1\ 2024-07-18\ 21h12.cyz --extra extra_data.json
The results are written to the folder where you run the command. There you will find:
- the json file (generated by cyz2json)
- the images folder (with ecotaxa tsv file in it)
- the zip file, which you can upload to EcoTaxa
(venv) python Deployment\ 1\ 2024-07-18\ 21h12.json --extra extra_data.json
sudo ./uninstall.sh
On Windows, use uninstall.ps1 instead of uninstall.sh.
Commit your code and push it to GitHub. Then add a tag to the commit and push the tag to GitHub, as in the sample below:
git tag v0.0.65 && git push origin v0.0.65
The GitHub Action will build the binary and publish it to the release.
You will find the build results at https://github.com/ecotaxa/cytosense_to_ecotaxa_pipeline/actions and the releases at https://github.com/ecotaxa/cytosense_to_ecotaxa_pipeline/releases
- During install, the script needs to find the version of Python used in the venv and create the links to the cyz2json binary and the Python scripts.
- extra_data.json: no other file name is allowed.
To update your mapping, edit main.py and search for the column_mapping variable. This is a dictionary where each key is the path to a Cytosense JSON feature and each value is an object that defines how to store and transform the data for EcoTaxa. The object has 3 fields:
+ name: the name of the column in EcoTaxa
+ type: the type of the column in EcoTaxa, [t] or [f] (text or float)
+ transform: a function applied to the value before it is stored; it can be a lambda or a named function
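As a minimal sketch of this structure, the mapping below has two entries with the three fields. The column names, the second path, and the helper `tsv_header_rows` are illustrative assumptions, not code taken from main.py. The helper assumes the EcoTaxa TSV convention of a column-name row followed by a type row.

```python
# Illustrative mapping: keys are paths into the Cytosense JSON,
# values describe the target EcoTaxa column. Entries are examples only.
column_mapping = {
    "instrument.name": {
        "name": "acq_instrument_name",      # hypothetical column name
        "type": "[t]",                      # text column
        "transform": lambda value: value.strip(),
    },
    "instrument.measurementSettings.depth": {  # hypothetical path
        "name": "sample_depth",             # hypothetical column name
        "type": "[f]",                      # float column
        "transform": float,
    },
}


def tsv_header_rows(mapping):
    """Return the column-name row and the type row for a mapping."""
    names = [entry["name"] for entry in mapping.values()]
    types = [entry["type"] for entry in mapping.values()]
    return "\t".join(names), "\t".join(types)


names_row, types_row = tsv_header_rows(column_mapping)
print(names_row)
print(types_row)
```

Keeping name, type, and transform together in one entry means a new column only requires one new dictionary item.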
The Cytosense key is a path in the JSON file. You can use dot notation to access nested objects.
For example, instrument.name is the path to read the name feature in:
{
    "instrument": {
        "name": "instrument name"
    }
}
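The dot notation can be illustrated with a short sketch. `get_by_path` is a hypothetical helper written for this example; the pipeline's own lookup code may work differently.

```python
import json
from functools import reduce


def get_by_path(data, path):
    """Follow a dot-separated path through nested dictionaries.

    Returns None when a key along the path is missing.
    """
    def step(node, key):
        return node.get(key) if isinstance(node, dict) else None

    return reduce(step, path.split("."), data)


document = json.loads('{"instrument": {"name": "instrument name"}}')
print(get_by_path(document, "instrument.name"))  # instrument name
```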
To read features that are defined in the particles array, use the square bracket notation, as in "particles[].pulseShapes". In this case the dot path is limited to the first feature inside each particle, so a path such as particles[].pulseShapes.FWS does not work directly.
To get the FWS pulse shape, you need a transform function that finds the data inside pulseShapes:
"particles[].pulseShapes*FWS": {"name": "object_pulseShape_FWS","type": "[t]","transform":search_pulse_shapes("FWS")}
search_pulse_shapes("FWS") returns a function that searches the pulseShapes array for the feature FWS and returns its values. You can add processing in the inner function, for example to convert the values into a polynomial function or into a string (mind the data size: strings are limited to 250 characters).
The inner function makes it possible to pass extra parameters, such as the description, to the transform.
def search_pulse_shapes(description):
    """
    Return a transform that finds one pulse shape by its description.

    In your mapping you can use it like:
    {"name": "pulseShape_FWS", "type": "[t]", "transform": search_pulse_shapes("FWS")}
    """
    def search(value):
        # value is the pulseShapes list of one particle
        result = next((item for item in value if item['description'] == description), None)
        if result:
            return result["values"]
        return None
    return search
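To illustrate the closure, here is a small usage sketch with made-up data. It assumes each pulse shape is a dictionary with `description` and `values` keys, as the function above implies. `pulse_shape_as_text` is a hypothetical variant showing the kind of extra processing mentioned earlier: it joins the values into a string and truncates it to the 250-character limit.

```python
def search_pulse_shapes(description):
    """Build a transform that returns the values of one pulse shape."""
    def search(value):
        result = next(
            (item for item in value if item["description"] == description),
            None,
        )
        return result["values"] if result else None
    return search


def pulse_shape_as_text(description, max_length=250):
    """Hypothetical variant: return the values as a truncated string."""
    find = search_pulse_shapes(description)

    def to_text(value):
        values = find(value)
        if values is None:
            return None
        # Plain truncation: the last number may be cut short.
        return ",".join(str(v) for v in values)[:max_length]
    return to_text


# Made-up sample data for one particle.
pulse_shapes = [
    {"description": "FWS", "values": [0.1, 0.5, 0.2]},
    {"description": "SWS", "values": [1.0, 2.0]},
]

print(search_pulse_shapes("FWS")(pulse_shapes))     # [0.1, 0.5, 0.2]
print(pulse_shape_as_text("SWS")(pulse_shapes))     # 1.0,2.0
print(search_pulse_shapes("FL Red")(pulse_shapes))  # None
```

Because the truncation is a plain slice, a real transform would probably want to cut at a separator or reduce the number of values instead.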
If you need to use the same Cytosense feature several times in the mapping, you can reuse the same path with a different suffix. The suffix must be an id prefixed by a star (*).
EcoTaxa has one column for the date and one for the time, but the Cytosense data holds both in a single feature. You therefore need to split the feature and store it in two columns.
For example, the path to my feature is "instrument.measurementResults.start", so I define two mappings. I use the suffixes "*date" and "*hour" to create distinct entries, but you are free to choose any names: each column just needs a different suffix, so *1 and *2 would work, as would one entry with no suffix and one with *mysuffix. The suffix is not used by the code; it only exists because Python dictionary keys must be unique.
"instrument.measurementResults.start*date": {"name": "sample_measurementResults_Start", "type": "[t]", "transform": extract_date_utc},
"instrument.measurementResults.start*hour": {"name": "sample_measurementResults_StartH", "type": "[t]", "transform": extract_time_utc},