The charset used to read data is one of:

- The default value: UTF-8
- The value provided as a command line argument (--encoding)
- The value provided in a metadata file (identificationInfo[1]/*/characterSet)
For now, the value provided in a companion file (such as the .cpg file of a shapefile) is ignored, because it may contain values such as "system". A plausible resolution order is sketched below.
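As an illustration only, the selection could be resolved as follows in Java; the precedence shown here (metadata value over command line option over default) is an assumption made for this sketch, not a documented behaviour:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CharsetResolver {

    /**
     * Hypothetical resolution of the reading charset. The precedence
     * (metadata value over CLI option over UTF-8 default) is an
     * ASSUMPTION, not necessarily the validator's actual behaviour.
     */
    public static Charset resolve(String cliEncoding, String metadataCharset) {
        if (metadataCharset != null) {
            return Charset.forName(metadataCharset);
        }
        if (cliEncoding != null) {
            return Charset.forName(cliEncoding);
        }
        return StandardCharsets.UTF_8;
    }
}
```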
Deep character validation is based on attempting to apply the transforms described below. A validation error is reported if a transform modifies the string.
If a dataset is encoded in UTF-8 but declared as LATIN1, the reading process cannot detect an error. However, the resulting strings will contain character sequences that are rarely present in real data.
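As a minimal sketch of this rule (the names below are hypothetical, not the validator's actual API), a value passes a transform-based check exactly when the transform leaves it unchanged:

```java
import java.util.function.UnaryOperator;

public class TransformValidator {

    /**
     * A string is valid with respect to a transform if applying the
     * transform does not modify it; otherwise the caller reports a
     * validation error.
     */
    public static boolean isValid(String value, UnaryOperator<String> transform) {
        return transform.apply(value).equals(value);
    }
}
```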
The validator optionally searches for sequences of double-encoded UTF-8 characters and replaces them with the original characters.

Command line option: --string-fix-utf8
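A classic way to implement this fix, shown here as a sketch under the assumption that the double encoding went through LATIN1 (not necessarily how the validator does it), is to re-encode the string as ISO-8859-1 bytes and strictly decode them as UTF-8; if decoding fails, the string was not double-encoded and is left untouched:

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.StandardCharsets;

public class DoubleUtf8Fixer {

    /**
     * Re-encodes the string to ISO-8859-1 bytes and strictly decodes
     * them as UTF-8. Both steps report errors, so a string that is not
     * double-encoded is returned unchanged.
     */
    public static String fix(String input) {
        try {
            ByteBuffer bytes = StandardCharsets.ISO_8859_1.newEncoder()
                    .encode(CharBuffer.wrap(input));
            return StandardCharsets.UTF_8.newDecoder()
                    .decode(bytes).toString();
        } catch (CharacterCodingException e) {
            return input; // not a double-encoded string
        }
    }

    public static void main(String[] args) {
        // "é" encoded in UTF-8 (0xC3 0xA9) then misread as LATIN1
        // shows up as "Ã©"; the fix recovers "é".
        System.out.println(fix("Ã©"));
    }
}
```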
The validator optionally applies character replacements to produce normalized data:

- With better support by common fonts (common simplification)
- That can be encoded in a given charset (charset-specific simplification)
The file validator-core/src/main/resources/simplify/common.csv defines these replacements.

Command line option: --string-simplify-common
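A minimal sketch of such a CSV-driven replacement is given below; the two-column layout (source sequence, replacement) is an assumption for illustration, not necessarily the exact format of common.csv:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.LinkedHashMap;
import java.util.Map;

public class CsvSimplifier {

    private final Map<String, String> replacements = new LinkedHashMap<>();

    /**
     * Loads replacement pairs from a CSV file. The assumed layout is
     * "source,replacement", one pair per line.
     */
    public CsvSimplifier(Path csvFile) throws IOException {
        try (BufferedReader reader =
                Files.newBufferedReader(csvFile, StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                String[] parts = line.split(",", 2);
                if (parts.length == 2) {
                    replacements.put(parts[0], parts[1]);
                }
            }
        }
    }

    /** Applies every loaded replacement to the input string. */
    public String simplify(String input) {
        String result = input;
        for (Map.Entry<String, String> entry : replacements.entrySet()) {
            result = result.replace(entry.getKey(), entry.getValue());
        }
        return result;
    }
}
```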
The file validator-core/src/main/resources/simplify/ISO-8859-1.csv defines these replacements for LATIN1.

Command line option: --string-simplify-charset <CHARSET>
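Reusing the hypothetical CsvSimplifier sketched above with the LATIN1 file could look like this; whether ISO-8859-1.csv actually maps "œ" to "oe" is an assumption ("œ" is a typical candidate, since it is not part of ISO-8859-1):

```java
import java.nio.file.Path;

public class Latin1SimplifyExample {
    public static void main(String[] args) throws Exception {
        CsvSimplifier latin1 = new CsvSimplifier(
                Path.of("validator-core/src/main/resources/simplify/ISO-8859-1.csv"));
        // Hypothetical mapping: "œ" -> "oe", so "cœur" becomes "coeur".
        System.out.println(latin1.simplify("cœur"));
    }
}
```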
Non-standard control characters are detected and escaped in hexadecimal form (e.g. \u0092).

Command line option: --string-escape-controls
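A sketch of this escaping, assuming the usual whitespace controls (tab, line feed, carriage return) are kept as-is:

```java
public class ControlEscaper {

    /**
     * Escapes ISO control characters (C0 and C1 ranges) using the
     * backslash-u hexadecimal form, e.g. the 0x92 control character
     * becomes the six literal characters of its escape. Which
     * whitespace controls are preserved (tab, LF, CR here) is an
     * assumption of this sketch.
     */
    public static String escapeControls(String input) {
        StringBuilder out = new StringBuilder(input.length());
        for (char c : input.toCharArray()) {
            if (Character.isISOControl(c) && c != '\t' && c != '\n' && c != '\r') {
                out.append(String.format("\\u%04x", (int) c));
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }
}
```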
To ensure compatibility with a given charset, it is also possible to escape unsupported characters.

Command line option: --string-escape-charset <CHARSET>
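A sketch of charset-driven escaping built on CharsetEncoder.canEncode; for brevity, the per-character loop does not handle surrogate pairs (characters outside the BMP):

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

public class CharsetEscaper {

    /**
     * Replaces every character that cannot be encoded in the target
     * charset by its backslash-u hexadecimal form. Surrogate pairs are
     * not handled by this sketch.
     */
    public static String escapeUnsupported(String input, Charset charset) {
        CharsetEncoder encoder = charset.newEncoder();
        StringBuilder out = new StringBuilder(input.length());
        for (char c : input.toCharArray()) {
            if (encoder.canEncode(c)) {
                out.append(c);
            } else {
                out.append(String.format("\\u%04x", (int) c));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // "œ" is not part of ISO-8859-1, so it gets escaped.
        System.out.println(escapeUnsupported("cœur", Charset.forName("ISO-8859-1")));
    }
}
```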