- Name of the datatype
- What fields will be queryable
- Types for the fields
- What fields are unique identifiers
- Open your own values.yaml.
- Add the datatype name to the `dwv-ingest->config->liveDataTypes` value. This value contains a comma-delimited list of all your available datatypes (see the sketch after this list).
- Add a new section under `dwv-ingest->config->types`.
- Fill in all the sections within the YAML. See New Datatype Configuration for additional details.
- Deploy using the DataWave Deployment section.
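A minimal sketch of the values.yaml change, assuming a hypothetical datatype named `mydatatype` (the key layout follows the `dwv-ingest->config` paths above; the other entries are placeholders for whatever already exists in your file):

```yaml
dwv-ingest:
  config:
    # Comma-delimited list of all available datatypes; mydatatype is a placeholder name.
    liveDataTypes: "<existing datatypes>,mydatatype"
    types:
      # New section for the datatype; see New Datatype Configuration below for the full contents.
      - name: mydatatype
```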
To build the helm package: `build_pkg.sh`
To install via helm: `helm install <release name> datawave-system-X.X.X.tgz -n <namespace>`
To upgrade via helm: `helm upgrade <release name> datawave-system-X.X.X.tgz -n <namespace>`
When performing an upgrade that contains any datatype changes, you may need to roll the ingest pod to pick up the configuration changes (see the sketch below). You may also not see the changes in the dictionary until you ingest more data.
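A minimal sketch of rolling the ingest pod, assuming the ingest runs as a Deployment; the workload and namespace names are placeholders and depend on your chart:

```bash
# Hypothetical: restart the ingest workload so it picks up the regenerated configuration.
kubectl rollout restart deployment/<ingest deployment> -n <namespace>
```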
- Remote into the hadoop-nn pod (any of them if there is more than one); a kubectl sketch follows this list.
- Run the following command: `hdfs dfs -ls hdfs://hdfs-nn:9000/data`
- Verify all the datatype folders exist.
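A minimal sketch of running the same check via kubectl, where the pod and namespace names are placeholders for your deployment:

```bash
# Hypothetical: list the ingest input folders without opening a shell in the pod.
kubectl exec -n <namespace> <hadoop-nn pod> -- hdfs dfs -ls hdfs://hdfs-nn:9000/data
```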
Example of what to add
The new section will need to follow this format:

```yaml
- name: <name of the datatype>
  flagMakerConfig:
    liveFolder: <name of datatype>
    bulkFolder: <name of datatype>-bulk
  config:
    distrubutionArgs: none
    extraIngestArgs: "-data.name.override=<name of datatype>"
    inputFormat: datawave.ingest.json.mr.input.JsonInputFormat
    lifo: false
    properties:
      "file.input.format": datawave.ingest.json.mr.input.JsonInputFormat
      "data.name": <name of datatype>
      "<name of datatype>.output.name": <name of datatype>
      "<name of datatype>.ingest.helper.class": datawave.ingest.json.config.helper.JsonIngestHelper
      "<name of datatype>.reader.class": datawave.ingest.json.mr.input.JsonRecordReader
      "<name of datatype>.handler.classes": "datawave.ingest.json.mr.handler.ContentJsonColumnBasedHandler,datawave.ingest.mapreduce.handler.facet.FacetHandler"
      "<name of datatype>.data.category.uuid.fields": <insert fields here>
      "<name of datatype>.data.separator": ","
      "<name of datatype>.data.header": <insert header here>
      "<name of datatype>.data.process.extra.fields": true
      "<name of datatype>.data.json.flattener.mode": GROUPED_AND_NORMAL
      "<name of datatype>.SUMMARY.data.field.marking": PUBLIC
      "<name of datatype>.data.category.marking.visibility.field": VISIBILITY
      "<name of datatype>.data.category.date.formats": yyyy-MM-dd,yyyy-MM-dd'T'HH:mm:ss'Z',yyyy-MM-dd HH:mm:ss
      "<name of datatype>.data.category.index": <insert queryable fields here>
      "<name of datatype>.data.category.index.reverse": <insert queryable fields here>
      "<name of datatype>.data.category.token.fieldname.designator": _TOKEN
      "<name of datatype>.data.category.index.tokenize.allowlist": <>
      "<name of datatype>.data.category.index.only": <>
      "<name of datatype>.data.default.normalization.failure.policy": FAIL
      "<name of datatype>.data.default.type.class": datawave.data.type.LcNoDiacriticsType
```
See https://code.nsa.gov/datawave/docs/6.x/ingest/configuration for more detail (can only be accessed outside AVD).
Field Name | Description |
---|---|
file.input.format | |
data.name | This is the name of the datatype, which distinguishes it from other types for the purposes of ingest processing and perhaps even for dataflow/transport concerns. As such, this can be used to denote a subtype of some common data format, like CSV files for example, which could originate from any number of sources |
(data.name).output.name | This is the name to use on the data in Accumulo |
(data.name).ingest.helper.class | |
(data.name).reader.class | |
(data.name).handler.classes | List of classes that should process each event |
(data.name).data.category.uuid.fields | List of known fields that contain UUIDs |
(data.name).data.separator | This is the separator to use for delimited text, and between configuration file parameters with multiple values. |
(data.name).data.header | Known metadata fields that may be expected to appear in every json document. Often, these may be "required" fields, and/or fields that you want to use for policy enforcement, quality assurance, etc |
(data.name).data.process.extra.fields | If true, "extra" fields within the json tree (ie, those outside the defined "header") should be processed. Otherwise, everything outside the header will be ignored unless explicitly whitelisted |
(data.name).data.json.flattener.mode | The classes datawave.ingest.json.mr.input.JsonRecordReader and datawave.ingest.json.config.helper.JsonIngestHelper support 4 different json-flattening modes: SIMPLE, NORMAL, GROUPED, and GROUPED_AND_NORMAL |
(data.name).data.category.marking.visibility.field | Known field in every record that will contain the event's ColumnVisibility for Accumulo. If the raw data doesn't convey security markings, then utilize the '.data.category.marking.default' property instead, to declare the default marking to be applied to every field |
(data.name).data.category.date | Known date field to be used, if found, for the shard row id. Otherwise, current date will be used |
(data.name).data.category.date.formats | Known/valid date formats for *.data.category.date field |
Field Name | Description |
---|---|
(data.name).data.category.index | List of known fields to index |
(data.name).data.category.index.reverse | List of known fields to reverse index |
(data.name).data.category.token.fieldname.designator | Field name suffix to be applied to field names that are tokenized. See *.data.category.index.tokenize.allowlist |
(data.name).data.category.index.tokenize.allowlist | These are the fields to tokenize and index. Tokenization allows fields to be parsed for searching the content of those fields (rather than the whole value) |
(data.name).data.category.index.only | Fields that will exist only in the global index. Will not be stored as part of the event/document |
Field Name | Description |
---|---|
(data.name).data.default.normalization.failure.policy | For field normalization failures: DROP, LEAVE, FAIL. FAIL: the entire event/document will be dropped, and possibly written to the error schema in Accumulo. LEAVE: the non-normalized value will be kept as-is. DROP: the failed field will be dropped, and everything else retained |
(data.name).data.default.type.class | Default type |
(data.name).(FieldName).data.field.type.class | Fully-qualified class name of the DataWave type to be used to interpret and normalize "FieldName" values Example types are datawave.data.type.DateType, datawave.data.type.NumberType, datawave.data.type.GeoType, etc |
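For example, a minimal sketch of per-field overrides, assuming a hypothetical datatype named `mydatatype` with `EVENT_DATE` and `LOCATION` fields (the field names and chosen types are placeholders):

```yaml
# Hypothetical per-field overrides, added alongside the other properties above.
"mydatatype.EVENT_DATE.data.field.type.class": datawave.data.type.DateType
"mydatatype.LOCATION.data.field.type.class": datawave.data.type.GeoType
```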
Role | Description |
---|---|
Administrator | Provides all access to admin functions within Datawave |
AuthorizedQueryServer | Allowed to perform queries |
AuthorizedServer | Used to provide proxy entities from a server |
AuthorizedUser | A normal user |
InternalUser | Used for monitoring users within the system |
JBossAdministrator | Same as Administrator, provides all access to admin functions within Datawave |
MetricsAdministrator | TBD |
SecurityUser | Security admin functions? |
Using DataWave with self-signed certificates takes a bit of setup. You have to create the Certificate Authority to sign the certificates with. For our demo we utilized self-signed certificates created by the NSA tool MADCert. Creating the new certificates is straightforward and covered in the MADCert documentation, so it is not repeated here.
After the server certificates are created for DataWave, the following steps will get them into DataWave so they can be utilized.
- Switch `web.certificates.externalSecret.enabled` from false to true within `web->values.yaml`.
- Add the `web.certificates.externalSecret.name` value within `web->values.yaml`.
- Create a new file under `web->templates` and fill it with the following content:
```yaml
{{ if .Values.web.certificates.externalSecret.enabled }}
apiVersion: v1
kind: Secret
metadata:
  name: {{ .Values.web.certificates.externalSecret.name }}
type: Opaque
data:
  keystore.p12: |-
    <fill me in>
  truststore.jks: |-
    <fill me in>
{{ end }}
```
- Use the following command to create/update the truststore.jks: `keytool -import -alias abacus-intermediate -file <certificate to add> -keystore <name of the truststore>`
- Use `base64` to encode both the server pkcs12 file and the truststore created, and fill in the output within the yaml file created above (see the sketch after this list).
- Deploy the changes.
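A minimal sketch of the encoding step, assuming the keystore and truststore file names are placeholders for the files created above (`-w0` disables line wrapping on GNU base64; adjust for other platforms):

```bash
# Hypothetical: produce base64 strings to paste into the Secret template above.
base64 -w0 <server keystore>.p12 > keystore.p12.b64
base64 -w0 <name of the truststore> > truststore.jks.b64
```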