Parallel loading from parquet and Pandas (#1732)

* start implementing parallel df loading (most of the infrastructure is there but still need to update all the loaders)
* start implementing the parallel loaders
* implement parallel loading from DfView
* PropCols needs to return empty rows if not specified so the zipping doesn't terminate early
* remove the len method from PropCol as it is not used anymore
* fix merge issue
* need to sort test output as order is no longer guaranteed
* add missing feature tags
* GID node state should implement Ord
* make it possible to compare NodeState with dict
* add sort_by_id for NodeState
* clean up error handling and make missing values an error again
* fix all the tests so they do not rely on insertion order which is no longer preserved
* one more order-dependent test
* resolve all nodes first
* try to drop the pair lock earlier
* try chunking by min of src/dst to reduce contention
* pull the edge initialisation out of the node locks
* num_shards exposed
* try to improve the contention
* expose number of shards to python
* try to fix the decontention-sort
* add jemallocator to fix some weirdness
* snmalloc for slightly better performance and hopefully better compatibility
* hopefully fix the python import error
* fix python take 2
* last try
* just on macos for now until we figure out what is going on
* Revert the pre-sorting of the updates as it doesn't seem to help
* clean up the handling of num_shards
* remove unused method and bump the chunk size up in the pandas loader for a bit more speed
* fix merge error and clean up allocator dependency management
* fix dead code warnings
* fix random breakage in async_graphql
* no more debug symbols in the CI to hopefully save some disk space
* fix the nextest invocation
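
The note about PropCols returning empty rows reflects a general property of `Iterator::zip`: it stops as soon as either side runs out, so an unspecified column represented as an empty iterator would silently drop every row. The sketch below illustrates that behaviour only; `prop_rows` and `zip_rows` are hypothetical names and do not reproduce the actual `PropCols` implementation in this commit.

```rust
use std::iter;

/// Hypothetical stand-in for an optional property column.
/// When the column is not specified, yield `None` indefinitely so that
/// the other side of the zip decides how many rows are produced.
fn prop_rows(values: Option<Vec<i64>>) -> Box<dyn Iterator<Item = Option<i64>>> {
    match values {
        Some(v) => Box::new(v.into_iter().map(Some)),
        None => Box::new(iter::repeat(None)),
    }
}

/// Pair each node id with its (possibly missing) property value.
fn zip_rows(ids: Vec<u64>, props: Option<Vec<i64>>) -> Vec<(u64, Option<i64>)> {
    ids.into_iter().zip(prop_rows(props)).collect()
}

fn main() {
    // Zipping against an empty iterator ends immediately: no rows survive.
    let dropped: Vec<(u64, i64)> = vec![1u64, 2, 3]
        .into_iter()
        .zip(iter::empty::<i64>())
        .collect();
    assert!(dropped.is_empty());

    // With empty rows repeated, every id is kept.
    let kept = zip_rows(vec![1, 2, 3], None);
    assert_eq!(kept.len(), 3);
    println!("{:?}", kept);
}
```

The same reasoning applies in the other direction: a specified column that is shorter than the id column would still truncate the zip, which is presumably one reason the loader treats missing values as an error.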
Pometry · Sep 3, 2024 · 47e3329 · 47e3329 · github-actions · Sep 3, 2024
1 parent fcf885a
commit 47e3329
Showing 31 changed files with 1,657 additions and 1,409 deletions.
diff --git a/.github/workflows/test_rust_disk_storage_workflow.yml b/.github/workflows/test_rust_disk_storage_workflow.yml
@@ -67,7 +67,7 @@ jobs:
           RUSTFLAGS: -Awarnings ${{ matrix.flags }}
           TEMPDIR: ${{ runner.temp }}
         run: |
-          cargo nextest run --all --no-default-features --features "storage"
+          cargo nextest run --all --no-default-features --features "storage" --cargo-profile test-ci
       - name: Check all features
         env:
           RUSTFLAGS: -Awarnings

diff --git a/.github/workflows/test_rust_workflow.yml b/.github/workflows/test_rust_workflow.yml
@@ -55,7 +55,7 @@ jobs:
           RUSTFLAGS: -Awarnings
           TEMPDIR: ${{ runner.temp }}
         run: |
-          cargo nextest run --all --no-default-features
+          cargo nextest run --all --no-default-features --cargo-profile test-ci
   doc-test:
     if: ${{ !inputs.skip_tests }}
     name: "Doc tests"
| Benchmark suite | Current: `47e3329` | Previous: `fcf885a` | Ratio |
|-|-|-|-|
| `lotr_graph/iterate nodes` | `13015` ns/iter (`± 55`) | `4433` ns/iter (`± 111`) | `2.94` |
| `lotr_graph_window_100/iterate nodes` | `13211` ns/iter (`± 55`) | `4446` ns/iter (`± 133`) | `2.97` |
| `lotr_graph_window_10/iterate nodes` | `14933` ns/iter (`± 133`) | `5861` ns/iter (`± 130`) | `2.55` |
| `lotr_graph_subgraph_10pc/num_nodes` | `31550` ns/iter (`± 857`) | `15384` ns/iter (`± 1182`) | `2.05` |
| `lotr_graph_subgraph_10pc/iterate nodes` | `11722` ns/iter (`± 67`) | `3163` ns/iter (`± 85`) | `3.71` |
| `lotr_graph_subgraph_10pc/iterate edges` | `18143` ns/iter (`± 94`) | `8892` ns/iter (`± 224`) | `2.04` |
| `lotr_graph_subgraph_10pc_windowed/iterate nodes` | `11840` ns/iter (`± 120`) | `3375` ns/iter (`± 41`) | `3.51` |
| `lotr_graph_subgraph_10pc_windowed/iterate edges` | `17553` ns/iter (`± 57`) | `8390` ns/iter (`± 178`) | `2.09` |
| `lotr_graph_window_50_layered/iterate nodes` | `16348` ns/iter (`± 20`) | `7938` ns/iter (`± 82`) | `2.06` |