Added transformers to serialize/unserialize entire row into/from entry #1358

Merged (1 commit) on Jan 12, 2025

norberttech
Member

Change Log

Added

  • Transformers to serialize/unserialize entire row under/from one entry

Description

Reference: #1317

One of the steps needed to offload join (and other) operations to an external computation engine is serializing the entire row (more details here).

I tried to integrate this as ScalarFunctions, serialize_row() and unserialize_row(), but quickly realized that it breaks the scalar function contract. A scalar function is meant to return a scalar value (or values) that EntryFactory later turns into Entries. Serializing an entire row operates at the Rows collection level, which is why it had to be implemented as a Transformer.
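To illustrate why this is a row-level operation rather than a scalar one, here is a minimal sketch in plain PHP. It does not use Flow's classes: rows are modeled as associative arrays, `pack_rows()` and `unpack_rows()` are hypothetical helper names, the `serialized_row` entry name is made up, and JSON is only an assumed serialization format, not necessarily what the transformers in this PR use.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper (not part of Flow): replace every row with a
 * single-entry row whose only value is the serialized original row.
 *
 * @param list<array<string, mixed>> $rows
 * @return list<array<string, string>>
 */
function pack_rows(array $rows, string $entryName = 'serialized_row'): array
{
    return array_map(
        // JSON is an assumption made for this sketch only.
        static fn (array $row): array => [$entryName => json_encode($row, JSON_THROW_ON_ERROR)],
        $rows
    );
}

/**
 * Hypothetical inverse of pack_rows(): rebuild each original row
 * from its single serialized entry.
 *
 * @param list<array<string, string>> $rows
 * @return list<array<string, mixed>>
 */
function unpack_rows(array $rows, string $entryName = 'serialized_row'): array
{
    return array_map(
        static fn (array $row): array => json_decode($row[$entryName], true, 512, JSON_THROW_ON_ERROR),
        $rows
    );
}

$rows = [
    ['id' => 1, 'name' => 'alice'],
    ['id' => 2, 'name' => 'bob'],
];

$packed = pack_rows($rows);       // every row now has exactly one entry
$restored = unpack_rows($packed); // round trip back to the original rows
```

Each function needs to see the whole row and replaces its entire set of entries, which is the part that does not fit a function expected to produce a value for a single entry.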

Tip

These transformers are not exposed in the main DataFrame API; however, they can be used through the DataFrame::transform() method, which accepts a Transformer or a Transformation.

This might not be the best solution, and SQLite-based implementations might need more precise or advanced configuration, but regardless of the outcome, I believe these two transformers might be useful in other places.

Flow PHP - Benchmarks

Benchmark results from this PR are compared with the results from the 1.x branch.

Extractors
+-----------------------+-------------------+------+-----+-----------------+------------------+------------------+
| benchmark             | subject           | revs | its | mem_peak        | mode             | rstdev           |
+-----------------------+-------------------+------+-----+-----------------+------------------+------------------+
| CSVExtractorBench     | bench_extract_10k | 1    | 3   | 4.744mb +0.04%  | 545.200ms +0.99% | ±2.35% +1209.16% |
| JsonExtractorBench    | bench_extract_10k | 1    | 3   | 4.811mb +0.04%  | 1.063s +0.77%    | ±0.36% -60.33%   |
| ParquetExtractorBench | bench_extract_10k | 1    | 3   | 86.463mb +0.00% | 923.827ms +2.71% | ±2.35% +578.50%  |
| TextExtractorBench    | bench_extract_10k | 1    | 3   | 4.477mb +0.05%  | 33.595ms -0.41%  | ±0.69% +88.63%   |
| XmlExtractorBench     | bench_extract_10k | 1    | 3   | 4.454mb +0.05%  | 600.856ms -0.09% | ±0.30% -17.91%   |
+-----------------------+-------------------+------+-----+-----------------+------------------+------------------+
Transformers
+-----------------------------+--------------------------+------+-----+------------------+-----------------+---------------+
| benchmark                   | subject                  | revs | its | mem_peak         | mode            | rstdev        |
+-----------------------------+--------------------------+------+-----+------------------+-----------------+---------------+
| RenameEntryTransformerBench | bench_transform_10k_rows | 1    | 3   | 108.472mb +0.00% | 58.274ms -2.85% | ±0.47% -0.08% |
+-----------------------------+--------------------------+------+-----+------------------+-----------------+---------------+
Loaders
+--------------------+----------------+------+-----+------------------+-----------------+-----------------+
| benchmark          | subject        | revs | its | mem_peak         | mode            | rstdev          |
+--------------------+----------------+------+-----+------------------+-----------------+-----------------+
| CSVLoaderBench     | bench_load_10k | 1    | 3   | 54.020mb +0.00%  | 97.688ms -2.48% | ±1.98% +12.75%  |
| JsonLoaderBench    | bench_load_10k | 1    | 3   | 76.780mb +0.00%  | 93.539ms +0.00% | ±0.50% -36.02%  |
| ParquetLoaderBench | bench_load_10k | 1    | 3   | 166.960mb +0.00% | 20.605s -0.03%  | ±0.08% -87.10%  |
| TextLoaderBench    | bench_load_10k | 1    | 3   | 17.063mb +0.01%  | 31.349ms +2.34% | ±1.18% +160.92% |
+--------------------+----------------+------+-----+------------------+-----------------+-----------------+
Building Blocks
+-------------------------+----------------------------+------+-----+------------------+------------------+-------------------------------+
| benchmark               | subject                    | revs | its | mem_peak         | mode             | rstdev                        |
+-------------------------+----------------------------+------+-----+------------------+------------------+-------------------------------+
| RowsBench               | bench_chunk_10_on_10k      | 2    | 3   | 80.247mb +0.00%  | 3.197ms -9.00%   | ±3.03% -13.19%                |
| RowsBench               | bench_diff_left_1k_on_10k  | 2    | 3   | 97.523mb +0.00%  | 188.700ms -0.58% | ±0.08% -85.56%                |
| RowsBench               | bench_diff_right_1k_on_10k | 2    | 3   | 80.243mb +0.00%  | 19.119ms +0.91%  | ±0.47% -78.50%                |
| RowsBench               | bench_drop_1k_on_10k       | 2    | 3   | 81.122mb +0.00%  | 1.690ms -1.75%   | ±1.72% -18.42%                |
| RowsBench               | bench_drop_right_1k_on_10k | 2    | 3   | 81.122mb +0.00%  | 1.445ms -20.04%  | ±3.66% -6.50%                 |
| RowsBench               | bench_entries_on_10k       | 2    | 3   | 79.283mb +0.00%  | 3.645ms -2.28%   | ±1.23% +82.27%                |
| RowsBench               | bench_filter_on_10k        | 2    | 3   | 79.812mb +0.00%  | 14.956ms -1.53%  | ±0.95% +433.91%               |
| RowsBench               | bench_find_on_10k          | 2    | 3   | 79.812mb +0.00%  | 15.402ms +1.48%  | ±3.13% +855.51%               |
| RowsBench               | bench_find_one_on_10k      | 10   | 3   | 78.503mb +0.00%  | 1.794μs -5.58%   | ±2.67% +22832449392073000.00% |
| RowsBench               | bench_first_on_10k         | 10   | 3   | 78.503mb +0.00%  | 0.300μs -25.00%  | ±0.00% -100.00%               |
| RowsBench               | bench_flat_map_on_1k       | 2    | 3   | 86.840mb +0.00%  | 12.686ms +0.93%  | ±0.72% -76.77%                |
| RowsBench               | bench_map_on_10k           | 2    | 3   | 114.188mb +0.00% | 58.424ms -3.56%  | ±0.69% -61.25%                |
| RowsBench               | bench_merge_1k_on_10k      | 2    | 3   | 80.332mb +0.00%  | 1.344ms -19.31%  | ±1.89% -49.36%                |
| RowsBench               | bench_partition_by_on_10k  | 2    | 3   | 83.622mb +0.00%  | 62.495ms +1.36%  | ±1.26% +205.46%               |
| RowsBench               | bench_remove_on_10k        | 2    | 3   | 81.384mb +0.00%  | 3.677ms -9.25%   | ±3.40% +229.23%               |
| RowsBench               | bench_sort_asc_on_1k       | 2    | 3   | 78.784mb +0.00%  | 43.205ms +2.23%  | ±1.21% +110.64%               |
| RowsBench               | bench_sort_by_on_1k        | 2    | 3   | 78.785mb +0.00%  | 42.898ms +2.54%  | ±0.41% -74.95%                |
| RowsBench               | bench_sort_desc_on_1k      | 2    | 3   | 78.784mb +0.00%  | 42.905ms +1.33%  | ±0.99% +144.30%               |
| RowsBench               | bench_sort_entries_on_1k   | 2    | 3   | 80.944mb +0.00%  | 8.110ms -1.61%   | ±1.61% +123.28%               |
| RowsBench               | bench_sort_on_1k           | 2    | 3   | 78.694mb +0.00%  | 29.244ms -1.57%  | ±1.96% +53.48%                |
| RowsBench               | bench_take_1k_on_10k       | 10   | 3   | 78.503mb +0.00%  | 14.054μs +0.24%  | ±3.18% +256.30%               |
| RowsBench               | bench_take_right_1k_on_10k | 10   | 3   | 78.503mb +0.00%  | 16.218μs -0.40%  | ±0.87% -72.05%                |
| RowsBench               | bench_unique_on_1k         | 2    | 3   | 97.524mb +0.00%  | 194.297ms -0.29% | ±1.09% +152.00%               |
| NativeEntryFactoryBench | bench_entry_factory        | 1    | 3   | 98.639mb +0.00%  | 436.299ms -0.80% | ±0.48% -50.71%                |
| NativeEntryFactoryBench | bench_entry_factory        | 1    | 3   | 51.469mb +0.00%  | 226.566ms +1.04% | ±0.43% -47.81%                |
| NativeEntryFactoryBench | bench_entry_factory        | 1    | 3   | 13.641mb +0.01%  | 48.661ms -0.23%  | ±0.61% -78.58%                |
| TypeDetectorBench       | bench_type_detector        | 1    | 3   | 43.773mb +0.00%  | 361.250ms +0.25% | ±0.46% -57.58%                |
| TypeDetectorBench       | bench_type_detector        | 1    | 3   | 11.582mb +0.01%  | 72.775ms -0.85%  | ±0.48% -37.40%                |
+-------------------------+----------------------------+------+-----+------------------+------------------+-------------------------------+

codecov bot commented Jan 12, 2025

Codecov Report

Attention: Patch coverage is 93.54839% with 2 lines in your changes missing coverage. Please review.

Project coverage is 82.58%. Comparing base (17837ca) to head (ad73e4f).
Report is 2 commits behind head on 1.x.

Additional details and impacted files
@@            Coverage Diff             @@
##              1.x    #1358      +/-   ##
==========================================
+ Coverage   82.55%   82.58%   +0.02%     
==========================================
  Files         643      645       +2     
  Lines       17307    17337      +30     
==========================================
+ Hits        14288    14317      +29     
- Misses       3019     3020       +1     
+-----------------------------+-----------------------------+
| Components                  | Coverage Δ                  |
+-----------------------------+-----------------------------+
| etl                         | 86.01% <93.54%> (+0.03%) ⬆️ |
| cli                         | 85.17% <ø> (ø)              |
| lib-array-dot               | 94.53% <ø> (ø)              |
| lib-azure-sdk               | 62.56% <ø> (ø)              |
| lib-doctrine-dbal-bulk      | 97.36% <ø> (ø)              |
| lib-filesystem              | 76.23% <ø> (ø)              |
| lib-parquet                 | 84.57% <ø> (ø)              |
| lib-parquet-viewer          | 82.02% <ø> (ø)              |
| lib-rdsl                    | 87.09% <ø> (ø)              |
| lib-snappy                  | 91.16% <ø> (+0.46%) ⬆️      |
| bridge-filesystem-async-aws | 90.38% <ø> (ø)              |
| bridge-filesystem-azure     | 89.92% <ø> (ø)              |
| bridge-monolog-http         | 96.38% <ø> (ø)              |
| symfony-http-foundation     | 77.10% <ø> (ø)              |
| adapter-chartjs             | 86.45% <ø> (ø)              |
| adapter-csv                 | 89.49% <ø> (ø)              |
| adapter-doctrine            | 90.14% <ø> (ø)              |
| adapter-elasticsearch       | 97.19% <ø> (ø)              |
| adapter-google-sheet        | 78.04% <ø> (ø)              |
| adapter-http                | 59.15% <ø> (ø)              |
| adapter-json                | 92.85% <ø> (ø)              |
| adapter-logger              | 53.84% <ø> (ø)              |
| adapter-meilisearch         | 97.75% <ø> (ø)              |
| adapter-parquet             | 59.88% <ø> (ø)              |
| adapter-text                | 84.44% <ø> (ø)              |
| adapter-xml                 | 83.15% <ø> (ø)              |
+-----------------------------+-----------------------------+

norberttech merged commit 6b3bbe1 into flow-php:1.x on Jan 12, 2025
22 checks passed