Complete rewrite of CBA_fnc_parseJSON #1582

initramfs · 2023-07-07T07:03:08Z

As compared to my last PR (#1580) this represents a complete rewrite of the JSON parsing code using the latest features available (as of the time of writing: version 2.12) in the scripting engine (though heavily inspired by the original shift-reduce design).

The parser is now (assuming bug-free) a fully validating parser that supports all the grammar defined at json.org, including previously unsupported features such as hex escape strings.

In addition to simply parsing a JSON document it can also track row/column numbers of the input tokens in order to generate appropriate error messages upon encountering an invalid JSON document. Row/column numbers are tracked at the unicode code point level rather than at the actual grapheme level for simplicity (so the use of things like U+200D ZERO WIDTH JOINER may cause an offset error in the reported column index). Row numbers are defined strictly by the presence of the code point U+000A LINE FEED so they should always be correct (technically there exists documents out there that exclusively use U+000D CARRIAGE RETURN or other exotic characters to denote new lines but catering to this would represent an unreasonable increase complexity that I was not willing to write).

This parser (as it's currently written) does represent an API change for CBA_fnc_parseJSON as instead of returning nil on error it throws an exception with an appropriate string message (though the old behaviour can be easily restored by wrapping the internal parse function call in a try-catch, this will effectively eliminate the purpose of the row/column tracking feature however).

When merged this pull request will:

Turn CBA_fnc_parseJSON into a fully validating parsing function to deserialize objects from JSON strings, throwing exceptions at any parsing errors (including row/column numbers and error reasoning).
Enable parsing of more JSON documents that could not previously be handled due to missing grammar implementations.
Causes some invalid JSON documents that were previously acceptable to be rejected (given the validating nature of the parser).

Side notes:

I wrote this parser predominately for myself (i.e. for my own mod) but since I drew inspiration from the original CBA function I decided to adapt it to fit the CBA API. If this code represents an unreasonable increase in complexity and/or seems unacceptable, feel free to disregard/close this PR.
There is a technical inefficiency in the number tokenizer where we repeatedly run a regex over an increasingly growing string (in order to find the longest token match without requiring the regex to be run over the entire document). This is done to avoid writing a FSM exclusively for number matching and relying on pure regex. There are ways to get around this limitation to make the tokenization single-pass but I considered those solutions to be overly complex or messy. In any case only pathological examples (documents consisting entirely of extremely long numbers [by character count]) would this really be an issue for.
I made up the error messages without really consulting any official documentation as to what, if any, canonical error messages should be reported, so they shouldn't be considered authoritative (i.e. I am willing to change the message text if my existing choices are not acceptable).

addons/hashes/fnc_parseJSON.sqf

commy2 · 2023-07-11T22:04:35Z

Grew by almost 300 lines, more than doubled. Not a fan of try/throw/catch in SQF. Does it still pass test_parseJSON.sqf?

initramfs · 2023-07-12T15:07:57Z

@commy2 As stated in my sidenotes I wrote this parser for my own mod, as I am not on the CBA team I cannot gauge if the increase in complexity is worth the functionality/feature improvement. I would like to point out however that the original implementation does not fulfill it's contractual obligations in the preamble:

Returns:
_object - The deserialized JSON object or nil if JSON is invalid.
<LOCATION, ARRAY, STRING, NUMBER, BOOL, HASHMAP, NIL>

There are quite a few cases where the script will error out due to invalid parameters being passed into scripting commands (e.g. into parseNumber) or unacceptable strings being accepted (e.g. no code point check) or valid JSON being rejected/misparsed (e.g. missing hex escape codes), which for my usecase makes it completely unacceptable.

My motivation for writing my own parser came from my frustration when debugging a script that I wrote where I spent way too long trying to pinpoint the issue down, never expecting it to be from a CBA function (see PR #1580).

With respect to the usage of SQF exceptions, I don't really see any other way to pass information back to the caller in a structured way without further API changes (e.g. changing the return value to a n-tuple [array] representing if the parsing was successful + the JSON object [if successful] or error message and line numbers [if unsuccessful]). This is especially true given the output of this function could be almost all data types SQF supports (especially when you consider the additional _objectType parameter).

The exception type being thrown is always a string in this case but I can always change it to something more structured (e.g. array or hashmap) if the caller wants to be able to structurally parse the exception object.

As an additional note: as stated above the current implementation of the parser can cause runtime errors that are uncatchable (given those are errors from invalid parameters to the scripting engine) which is, in my opinion, worse than exceptions.

Of course another solution to this entire issue would be to force in the preconditions of the function that the JSON string must always be valid, though I would argue that represents the largest API change/deviation among all the ones I have mentioned.

With respect to the unit tests:

21:28:47 File x\cba\addons\hashes\test.sqf..., line 18
21:28:47 [CBA] (hashes) Test OK: (CBA_fnc_parseJSON is defined) x\cba\addons\hashes\test_parseJSON.sqf:14
21:28:47 [CBA] (hashes) Test OK: (not (_result)) x\cba\addons\hashes\test_parseJSON.sqf:20
21:28:47 [CBA] (hashes) Test OK: (_result isEqualTo _expected) x\cba\addons\hashes\test_parseJSON.sqf:25
21:28:47 [CBA] (hashes) Test OK: (_result == _expected) x\cba\addons\hashes\test_parseJSON.sqf:29
21:28:47 [CBA] (hashes) Test OK: (_result == _expected) x\cba\addons\hashes\test_parseJSON.sqf:33
21:28:47 [CBA] (hashes) Test OK: (_result == _expected) x\cba\addons\hashes\test_parseJSON.sqf:37
21:28:47 [CBA] (hashes) Test OK: (_result == _expected) x\cba\addons\hashes\test_parseJSON.sqf:41
21:28:47 [CBA] (hashes) Test OK: (_result == _expected) x\cba\addons\hashes\test_parseJSON.sqf:45
21:28:47 [CBA] (hashes) Test OK: (not (_result)) x\cba\addons\hashes\test_parseJSON.sqf:48
21:28:47 [CBA] (hashes) Test OK: (isNull _result) x\cba\addons\hashes\test_parseJSON.sqf:51
21:28:47 [CBA] (hashes) Test OK: (_result) x\cba\addons\hashes\test_parseJSON.sqf:57
21:28:47 [CBA] (hashes) Test OK: (_result isEqualTo _expected) x\cba\addons\hashes\test_parseJSON.sqf:62
21:28:47 [CBA] (hashes) Test OK: (_result == _expected) x\cba\addons\hashes\test_parseJSON.sqf:66
21:28:47 [CBA] (hashes) Test OK: (_result == _expected) x\cba\addons\hashes\test_parseJSON.sqf:70
21:28:47 [CBA] (hashes) Test OK: (_result == _expected) x\cba\addons\hashes\test_parseJSON.sqf:74
21:28:47 [CBA] (hashes) Test OK: (_result == _expected) x\cba\addons\hashes\test_parseJSON.sqf:78
21:28:47 [CBA] (hashes) Test OK: (_result == _expected) x\cba\addons\hashes\test_parseJSON.sqf:82
21:28:47 [CBA] (hashes) Test OK: (not (_result)) x\cba\addons\hashes\test_parseJSON.sqf:85
21:28:47 [CBA] (hashes) Test OK: (isNull _result) x\cba\addons\hashes\test_parseJSON.sqf:88
21:28:47 [CBA] (hashes) Test OK: (CBA_fnc_encodeJSON is defined) x\cba\addons\hashes\test_parseJSON.sqf:95
21:28:47 "null"
21:28:47 [CBA] (hashes) Test OK: (_input == _output) x\cba\addons\hashes\test_parseJSON.sqf:118
21:28:47 "true"
21:28:47 [CBA] (hashes) Test OK: (_input == _output) x\cba\addons\hashes\test_parseJSON.sqf:118
21:28:47 "1.2"
21:28:47 [CBA] (hashes) Test OK: (_input == _output) x\cba\addons\hashes\test_parseJSON.sqf:118
21:28:47 """Hello, World!"""
21:28:47 [CBA] (hashes) Test OK: (_input == _output) x\cba\addons\hashes\test_parseJSON.sqf:118
21:28:47 "[]"
21:28:47 [CBA] (hashes) Test OK: (_input == _output) x\cba\addons\hashes\test_parseJSON.sqf:118
21:28:47 "{}"
21:28:47 [CBA] (hashes) Test OK: (_input == _output) x\cba\addons\hashes\test_parseJSON.sqf:118
21:28:47 "[null, true, 1.2, ""Hello, World!""]"
21:28:47 [CBA] (hashes) Test OK: (_input == _output) x\cba\addons\hashes\test_parseJSON.sqf:118
21:28:47 "[{""nested"": [{""nested"": [{""nested"": [{""nested"": [{""nested"": [{""nested"": [{""nested"": []}]}]}]}]}]}]}]"
21:28:47 [CBA] (hashes) Test OK: (_input == _output) x\cba\addons\hashes\test_parseJSON.sqf:118
21:28:47 "null"
21:28:47 [CBA] (hashes) Test OK: (_input == _output) x\cba\addons\hashes\test_parseJSON.sqf:118
21:28:47 "true"
21:28:47 [CBA] (hashes) Test OK: (_input == _output) x\cba\addons\hashes\test_parseJSON.sqf:118
21:28:47 "1.2"
21:28:47 [CBA] (hashes) Test OK: (_input == _output) x\cba\addons\hashes\test_parseJSON.sqf:118
21:28:47 """Hello, World!"""
21:28:47 [CBA] (hashes) Test OK: (_input == _output) x\cba\addons\hashes\test_parseJSON.sqf:118
21:28:47 "[]"
21:28:47 [CBA] (hashes) Test OK: (_input == _output) x\cba\addons\hashes\test_parseJSON.sqf:118
21:28:47 "{}"
21:28:47 [CBA] (hashes) Test OK: (_input == _output) x\cba\addons\hashes\test_parseJSON.sqf:118
21:28:47 "[null, true, 1.2, ""Hello, World!""]"
21:28:47 [CBA] (hashes) Test OK: (_input == _output) x\cba\addons\hashes\test_parseJSON.sqf:118
21:28:47 "[{""nested"": [{""nested"": [{""nested"": [{""nested"": [{""nested"": [{""nested"": [{""nested"": []}]}]}]}]}]}]}]"
21:28:47 [CBA] (hashes) Test OK: (_input == _output) x\cba\addons\hashes\test_parseJSON.sqf:118
21:28:47 "null"
21:28:47 [CBA] (hashes) Test OK: (_input == _output) x\cba\addons\hashes\test_parseJSON.sqf:118
21:28:47 "true"
21:28:47 [CBA] (hashes) Test OK: (_input == _output) x\cba\addons\hashes\test_parseJSON.sqf:118
21:28:47 "1.2"
21:28:47 [CBA] (hashes) Test OK: (_input == _output) x\cba\addons\hashes\test_parseJSON.sqf:118
21:28:47 """Hello, World!"""
21:28:47 [CBA] (hashes) Test OK: (_input == _output) x\cba\addons\hashes\test_parseJSON.sqf:118
21:28:47 "[]"
21:28:47 [CBA] (hashes) Test OK: (_input == _output) x\cba\addons\hashes\test_parseJSON.sqf:118
21:28:47 "{}"
21:28:47 [CBA] (hashes) Test OK: (_input == _output) x\cba\addons\hashes\test_parseJSON.sqf:118
21:28:47 "[null, true, 1.2, ""Hello, World!""]"
21:28:47 [CBA] (hashes) Test OK: (_input == _output) x\cba\addons\hashes\test_parseJSON.sqf:118
21:28:47 "[{""nested"": [{""nested"": [{""nested"": [{""nested"": [{""nested"": [{""nested"": [{""nested"": []}]}]}]}]}]}]}]"
21:28:47 [CBA] (hashes) Test OK: (_input == _output) x\cba\addons\hashes\test_parseJSON.sqf:118
21:28:47 "null"
21:28:47 [CBA] (hashes) Test OK: (_input == _output) x\cba\addons\hashes\test_parseJSON.sqf:118
21:28:47 "true"
21:28:47 [CBA] (hashes) Test OK: (_input == _output) x\cba\addons\hashes\test_parseJSON.sqf:118
21:28:47 "1.2"
21:28:47 [CBA] (hashes) Test OK: (_input == _output) x\cba\addons\hashes\test_parseJSON.sqf:118
21:28:47 """Hello, World!"""
21:28:47 [CBA] (hashes) Test OK: (_input == _output) x\cba\addons\hashes\test_parseJSON.sqf:118
21:28:47 "[]"
21:28:47 [CBA] (hashes) Test OK: (_input == _output) x\cba\addons\hashes\test_parseJSON.sqf:118
21:28:47 "{}"
21:28:47 [CBA] (hashes) Test OK: (_input == _output) x\cba\addons\hashes\test_parseJSON.sqf:118
21:28:47 "[null, true, 1.2, ""Hello, World!""]"
21:28:47 [CBA] (hashes) Test OK: (_input == _output) x\cba\addons\hashes\test_parseJSON.sqf:118
21:28:47 "[{""nested"": [{""nested"": [{""nested"": [{""nested"": [{""nested"": [{""nested"": [{""nested"": []}]}]}]}]}]}]}]"
21:28:47 [CBA] (hashes) Test OK: (_input == _output) x\cba\addons\hashes\test_parseJSON.sqf:118
21:28:47 "null"
21:28:47 [CBA] (hashes) Test OK: (_input == _output) x\cba\addons\hashes\test_parseJSON.sqf:118
21:28:47 "true"
21:28:47 [CBA] (hashes) Test OK: (_input == _output) x\cba\addons\hashes\test_parseJSON.sqf:118
21:28:47 "1.2"
21:28:47 [CBA] (hashes) Test OK: (_input == _output) x\cba\addons\hashes\test_parseJSON.sqf:118
21:28:47 """Hello, World!"""
21:28:47 [CBA] (hashes) Test OK: (_input == _output) x\cba\addons\hashes\test_parseJSON.sqf:118
21:28:47 "[]"
21:28:47 [CBA] (hashes) Test OK: (_input == _output) x\cba\addons\hashes\test_parseJSON.sqf:118
21:28:47 "{}"
21:28:47 [CBA] (hashes) Test OK: (_input == _output) x\cba\addons\hashes\test_parseJSON.sqf:118
21:28:47 "[null, true, 1.2, ""Hello, World!""]"
21:28:47 [CBA] (hashes) Test OK: (_input == _output) x\cba\addons\hashes\test_parseJSON.sqf:118
21:28:47 "[{""nested"": [{""nested"": [{""nested"": [{""nested"": [{""nested"": [{""nested"": [{""nested"": []}]}]}]}]}]}]}]"
21:28:47 [CBA] (hashes) Test OK: (_input == _output) x\cba\addons\hashes\test_parseJSON.sqf:118
21:28:47 [CBA] (hashes) Test OK: (count _properties == 6) x\cba\addons\hashes\test_parseJSON.sqf:126
21:28:47 [CBA] (hashes) Test OK: (typeName _value == _x) x\cba\addons\hashes\test_parseJSON.sqf:129
21:28:47 [CBA] (hashes) Test OK: (typeName _value == _x) x\cba\addons\hashes\test_parseJSON.sqf:129
21:28:47 [CBA] (hashes) Test OK: (typeName _value == _x) x\cba\addons\hashes\test_parseJSON.sqf:129
21:28:47 [CBA] (hashes) Test OK: (typeName _value == _x) x\cba\addons\hashes\test_parseJSON.sqf:129
21:28:47 [CBA] (hashes) Test OK: (typeName _value == _x) x\cba\addons\hashes\test_parseJSON.sqf:129
21:28:47 [CBA] (hashes) Test OK: (typeName _value == _x) x\cba\addons\hashes\test_parseJSON.sqf:129

So it appears it still passes all the unit tests.

commy2 · 2023-07-12T15:48:55Z

CBA never uses exceptions. I get the possibilities compared to just using the ERROR_N macro and returning nil, but meh.

Probably needs to think about what value to throw, because that would become part of the public API.

The helper function part should be simplified and optimized to:

    private ["_createObject", "_objectSet"];

    switch (_objectType) do {
        case false;
        case 0: {
            _createObject = {[] call CBA_fnc_createNamespace};
            _objectSet = {
                params ["_obj", "_key", "_val"];
                _obj setVariable [_key, _val];
            };
        };

        case true;
        case 1: {
            _createObject = {[] call CBA_fnc_hashCreate};
            _objectSet = {
                params ["_obj", "_key", "_val"];
                [_obj, _key, _val] call CBA_fnc_hashSet;
            };
        };

        case 2: {
            _createObject = {createHashMap};
            _objectSet = {
                params ["_obj", "_key", "_val"];
                _obj set [_key, _val];
            };
        };
    };

No need to look up _objectType repeatedly when it never changes.

commy2 · 2023-07-12T15:54:21Z

addons/hashes/fnc_parseJSON.sqf

+forceUnicode 0;
+
+/**
+ * Helper function returning the code used for object creation based on the "_objectType" parameter.


Comment explains what the code does. That's just duplication and not useful. If you want to keep it as separator, just use // helper function.

I've removed the small helper functions with the original code so there are no additional function declarations anymore. The original rationale for the comment is merely my codestyle of always having a preamble for any function declaration.

I believe the preambles for _tokenize, _reduce definitely provide additional information (wrt. output format, preconditions, etc). To a lesser extent _shift and _parse do as well, and it seems a little inconsistent to me for some functions to get a full descriptor and some to not.

initramfs · 2023-07-12T16:43:08Z

CBA never uses exceptions. I get the possibilities compared to just using the ERROR_N macro and returning nil, but meh.

Probably needs to think about what value to throw, because that would become part of the public API.

The helper function part should be simplified and optimized to:

    private ["_createObject", "_objectSet"];

    switch (_objectType) do {
        case false;
        case 0: {
            _createObject = {[] call CBA_fnc_createNamespace};
            _objectSet = {
                params ["_obj", "_key", "_val"];
                _obj setVariable [_key, _val];
            };
        };

        case true;
        case 1: {
            _createObject = {[] call CBA_fnc_hashCreate};
            _objectSet = {
                params ["_obj", "_key", "_val"];
                [_obj, _key, _val] call CBA_fnc_hashSet;
            };
        };

        case 2: {
            _createObject = {createHashMap};
            _objectSet = {
                params ["_obj", "_key", "_val"];
                _obj set [_key, _val];
            };
        };
    };

No need to look up _objectType repeatedly when it never changes.

I've essentially restored what the original code did which does a single check and sets _createObject and _objectSet as function-local variables. The reason for the change was merely legal, as I was doing a cleanroom implementation (but I had already studied what the original implementation did) I didn't want to copy any part of GPL-ed code from the original. Now that I'm submitting the code back I guess that point is moot.

With respect to exceptions, this is the first piece of CBA code that I have touched so I'm not familiar with what kind of public API is expected (for example, I didn't know that there were no CBA functions that threw exceptions). As for what type of object to throw on parse error, I really don't have any input as to what is appropriate, I simply wrote something that was appropriate for myself.

In essence, I really don't have any stake in the matter and I can change the exception (and the exception documentation) to anything you want (or even just try/catch and return nil). The reason I didn't add the exception throwing behaviour to the function documentation was because I didn't know how to format it and was expecting directions there.

I suggest perhaps discussion with other CBA maintainers for input, or if you're in a position to make decisions yourself, just tell me what you want me to do.

commy2 · 2023-07-12T20:20:40Z

As for what type of object to throw on parse error, I really don't have any input as to what is appropriate, I simply wrote something that was appropriate for myself.

Idk either, because there was never any need to do it. Just saying that we should think about what is useful, because changing it later could break stuff.

I suggest perhaps discussion with other CBA maintainers for input

They may speak below:

jonpas · 2023-09-03T17:12:37Z

While I see the goals and pros of this approach, I dislike the use of try/catch/throw in SQF and the significant increase in codebase and complexity with that.

@BaerMitUmlaut Would be interesting to see your take on that.

BaerMitUmlaut · 2023-09-03T17:35:09Z

If you want to know my opinion, this is way too complex, the function is now twice the size. I don't like full rewrites in general. There's also no tests for the new functionality.

dedmen · 2023-09-05T17:53:00Z

(as of the time of writing: version 2.12)

Well we have 2.14 now ;)

Complete rewrite of CBA_fnc_parseJSON (CBATeam#1582)

8148521

initramfs force-pushed the rewrite-parse-json branch from 5118207 to 8148521 Compare July 7, 2023 07:04