Skip to content

Latest commit

 

History

History
60 lines (32 loc) · 2.18 KB

characters.md

File metadata and controls

60 lines (32 loc) · 2.18 KB

Characters validation

Data charset

The charset used to read data is either :

  • The default value : UTF-8
  • The value provided as a command line argument (--encoding)
  • The value provided in a metadata file (identificationInfo[1]/*/characterSet)

For now, value provided in companion file such as .cpg for shapefile is ignored because it may contains value such as "system".

Data charset validation

Characters validation

Deep character validation

Deep character validation is based on the attempt to apply the following transforms. A validation error is triggered if the string is modified.

Double UTF-8 encoding

If a dataset encoded UTF-8 and declared as LATIN1, the reading process can't detect an error. Meanwhile, strings will contains character sequence rarely presents in real data.

The validator optionally search sequences of double encoded UTF-8 characters and replace them by original characters.

Command line option : --string-fix-utf8

Character simplification

The validator optionnaly apply character replacement to produce normalized data :

  • With a better supports by main fonts (common simplification)
  • That supports encoding in a given charset (charset specification simplification)

Common simplification

The file validator-core/src/main/resources/simplify/common.csv defines this replacements.

Command line option : --string-simplify-common

Charset specific simplification

The file validator-core/src/main/resources/simplify/ISO-8859-1.csv defines this replacements for LATIN1.

Command line option : --string-simplify-charset <CHARSET>

Character escaping

Control characters

Non standard control characters are detected and escaped in hexadecimal form (ex : \u0092)

Command line option : --string-escape-controls

Characters not supported by specific charset

To ensure compatibility with a given charset, it is possible to escape unsupported characters too.

Command line option : --string-escape-charset <CHARSET>