-
Notifications
You must be signed in to change notification settings - Fork 310
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Optimize the drop-duplicate functionality #4095
Optimize the drop-duplicate functionality #4095
Conversation
…2_enable-drop-duplicates
…2_enable-drop-duplicates
warnings.warn( | ||
"'multi' is deprecated as the removal of multi edges is " | ||
"only supported when creating a multi-graph", | ||
FutureWarning, | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I'm missing something: does this mean a user who wants a MultiGraph could also want to remove the multiple edges? Why would they want a MultiGraph without multiple edges? It seems like this option would be more useful for a Graph, but it's being deprecated here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I'm missing something: does this mean a user who wants a MultiGraph could also want to remove the multiple edges
No, It means that the only way to drop multi edges is by creating a multi-graph. In fact prior to that, we were using the flag multi
in the symmetrize
function to drop multi-edges (if creating a directed or undirected Graph) through the _memory_efficient_drop_duplicates
function which is dask based. But we can't just get rid of that flag because some users might still have it in their code base if they were directly calling the symmetrize
function directly
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, thanks. I think my concerns will go away when Vibhu's feedback is addressed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good mostly apart from some changes around docstrings
.
Can we also quickly also run the graph creation benchmarks here too see how perf may change.
multi : bool, optional (default=False) | ||
Deprecated. | ||
Set to True if graph is a Multi(Di)Graph. This allows multiple | ||
edges instead of dropping them. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would also mention the warning statement along with Deprecated
, otherwise it is confusing for the user.
multi : bool, optional (default=False) | |
Deprecated. | |
Set to True if graph is a Multi(Di)Graph. This allows multiple | |
edges instead of dropping them. | |
multi : bool, optional (default=False) | |
[Deprecated, Multi will be removed in future version, and the removal of multi | |
edges will no longer be supported from 'symmetrize'. Multi edges will be removed | |
upon creation of graph instance directly based on if the graph is | |
`curgaph.MultiGraph` or `cugraph.Graph`. ] | |
Set to True if graph is a Multi(Di)Graph. This allows multiple | |
edges instead of dropping them. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also will suggest filing an issue for tracking this deprecation.
@@ -208,6 +224,7 @@ def symmetrize( | |||
weights column name. | |||
|
|||
multi : bool, optional (default=False) | |||
Deprecated. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same comment here.
…2_enable-drop-duplicates
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mostly looks good, just some clarification around if we are dropping duplicates for single GPU's only.
edgelist_df = edgelist_df.groupby( | ||
by=[*vertex_col_name], as_index=False | ||
).min() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just confirming, edgelist_df
is a cuDF
data frame right ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes it is a cuDF dataframe
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks
if not self.properties.multi_edge: | ||
# Drop parallel edges for non MultiGraph | ||
# FIXME: Drop multi edges with the CAPI instead. | ||
df = df.groupby(by=[*vertex_col_name], as_index=False).min() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same question here , assuming this a cuDF dataframe.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes it is a cuDF dataframe
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks
@@ -1176,10 +1176,12 @@ def test_extract_subgraph_vertex_prop_condition_only( | |||
) | |||
# Should result in two edges, one a "relationship", the other a "referral" | |||
expected_edgelist = cudf.DataFrame( | |||
{"src": [89216, 78634], "dst": [78634, 89216], "weights": [99, 8]} | |||
{"src": [78634, 89216], "dst": [89216, 78634], "weights": [8, 99]} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wonder why we flipped it around ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I initially removed the drop_duplicate
call which was swapping the order of some columns. This expected result was set to match the actual_result. But this is no longer relevant since I am no longer updating this test file
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for fixing this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Approving assuming all of @VibhuJawa 's feedback gets addressed.
@VibhuJawa that's correct |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
…2_enable-drop-duplicates
/merge |
Our current python API leverages dask to implement the
drop-duplicate
functionality but it carries a lot of overhead as it draws a significant amount of host memory and results into a crash when processing large graphs (4+ billion edges).This PR
multi
which, when set to False, triggers the dask baseddrop-duplicate
functionalitydo_expensive_check
to check forNULL
values in the edgelist