make cugraph-ops optional for cugraph-gnn packages #99
base: branch-25.02
Conversation
Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually. Contributors can view more details about this message here.
/ok to test
dependencies.yaml (outdated)
packages:
- pytorch-cuda=12.4
- matrix: {cuda: "11.8"}
- pytorch-gpu>=2.3=*cuda120*
This is already using conda-forge, I think? `pytorch-gpu` is a conda-forge package, not a `pytorch` channel package. Also, the latest conda-forge builds are built with CUDA 12.6; CUDA 12.0 is no longer used to build.
For compatibility reasons we may want to stick to older builds of `pytorch-gpu` (built with `cuda120`) for now. We will hopefully be able to relax this in the future.
Yes, this PR switches to `conda-forge::pytorch-gpu`, since the `pytorch` channel will be discontinued.
> Also, the latest conda-forge builds are built with CUDA 12.6. CUDA 12.0 is no longer used to build.

> For compatibility reasons we may want to stick to older builds of pytorch-gpu (built with cuda120) for now. We will hopefully be able to relax this in the future.

Oh, I had not noticed that the most recent build (`_306`) is only built against 12.6. I agree with keeping 12.0 for better backward compatibility. However, the CUDA 11 build seems to be missing. Do we have details on their build matrix?
CUDA 11 builds were dropped recently, so you may need an older version for CUDA 11 compatibility. I also saw this while working on rapidsai/cudf#17475. `mamba search -c conda-forge "pytorch=*=cuda118*"` indicates that the latest version with CUDA 11 support is 2.5.1 build 303; the latest build overall is 2.5.1 build 306.
For completeness, the latest CUDA 12.0 build was also 2.5.1 build 303.
Got it, thanks. It shouldn't be a dealbreaker unless another test component ends up requiring a newer version of torch on CUDA 11 down the line.
`pytorch-gpu` requires `__cuda`, and is not installable on systems without a CUDA driver. This makes it impossible to resolve the conda environment needed for the devcontainers jobs in CI, which are CPU-only.

Note: many CUDA packages, including RAPIDS, are explicitly designed not to have `__cuda` as a run requirement, because that would make it impossible to install the environment on a CPU-only node and then use it on another system with a GPU.
It looks like if we just use `pytorch` instead of `pytorch-gpu`, we still get GPU builds:

- CUDA 11 driver present: `CONDA_OVERRIDE_CUDA="11.8" conda create -n test --dry-run pytorch` shows `pytorch 2.5.1 cuda118_py313h40cdc2d_303 conda-forge`
- CUDA 12 driver present: `CONDA_OVERRIDE_CUDA="12.5" conda create -n test --dry-run pytorch` shows `pytorch 2.5.1 cuda126_py313hae2543e_306 conda-forge`
- No CUDA driver present: `CONDA_OVERRIDE_CUDA="" mamba create -n test --dry-run pytorch` shows `pytorch 2.5.1 cpu_mkl_py313_h90df46e_108 conda-forge`
This should be sufficient. Let's try using just `pytorch` instead of `pytorch-gpu` with specific CUDA build selectors.
There are two benefits here, if my proposal above works:
- the devcontainers CI job would get CPU-only builds, which should still be fine for building
- we don't need to specify CUDA versions, so this dependency doesn't have to be "specific" to CUDA 11/12
I agree, let's try with `pytorch` instead of `pytorch-gpu`.

That opens up a risk that there may be situations where the solver chooses a CPU-only version because of some conflict, but hopefully `cugraph-pyg` can detect that with `torch.cuda.is_available()` or similar and raise an informative error saying something like "if using conda, try `conda install cugraph-pyg pytorch-gpu`".
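A minimal sketch of that kind of guard (the function name and exact message here are hypothetical, not from this PR):

```python
def _require_gpu_torch():
    """Raise an informative error if the installed PyTorch build cannot see a GPU."""
    import torch  # assumed to be installed alongside cugraph-pyg

    if not torch.cuda.is_available():
        raise RuntimeError(
            "cugraph-pyg requires a CUDA-enabled PyTorch, but the installed build "
            "cannot access a GPU (it may be a CPU-only build, or no driver is present). "
            "If using conda, try 'conda install cugraph-pyg pytorch-gpu'."
        )
```

Calling something like this early, when the GPU-backed modules are first imported, would surface the problem before any computation starts.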
> We don't need to specify CUDA versions, so this dependency doesn't have to be "specific" to CUDA 11/12

I looked into this today... we shouldn't have needed to specify CUDA versions in build strings for `pytorch-gpu` anyway, as long as we're pinning the `cuda-version` package somewhere (for example, in the `run:` dependencies of `cugraph`).

Looks like `pytorch-gpu` is `==`-pinned to a specific `pytorch`, and the `pytorch` CUDA builds all have `run:` dependencies on `cuda-version`. So here in `cugraph-pyg`, just having `cuda-version` as a `run:` dependency would be enough to ensure a compatible `pytorch-gpu` / `pytorch` is pulled.
@jameslamb These simplifications to drop build string info are only possible now with conda-forge, iirc. I believe more complexity was required when we used the pytorch channel, and we probably just carried that over when switching to conda-forge.
/ok to test
/ok to test
Hey @tingyu66, I'd like to help keep this moving forward. I left some questions for your consideration.
@@ -18,7 +18,6 @@ files:
- depends_on_cugraph
- depends_on_cudf
- depends_on_dask_cudf
- depends_on_pylibcugraphops
Can the dependencies in the conda recipes be removed as well?
- pylibcugraphops ={{ minor_version }}
- pylibcugraphops ={{ minor_version }}
Yes
@tingyu66 could you please make this change (removing the dependencies on `pylibcugraphops` in the conda recipes) in this PR?
@jameslamb Done, thank you!
except Exception as e:
    warnings.warn(f"Unexpected error while importing pylibcugraphops: {e}")

if HAVE_CUGRAPH_OPS:
Why is this PR putting cugraph-ops stuff behind a conditional import instead of completely removing it?

Or, said another way... how are you expecting that someone would have `cugraph-pyg=25.2.*` and `pylibcugraphops=25.2.*` installed together? Is this being left here so it's still possible to build and install `pylibcugraphops` from source?

I'm hoping we can completely stop publishing `pylibcugraphops` packages in the 25.02 release.
The conditional import is there because we planned to migrate cugraph-ops to a new location, and the new package will possibly retain the same name. We understand that RAPIDS won't release cugraph-ops in 25.02; this is more of a placeholder for the migrated package.
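For reference, the guard under discussion follows the usual optional-dependency pattern, roughly like the sketch below (the surrounding `try` block and the body of the `if` are assumptions, not the exact contents of the file in this PR):

```python
import warnings

HAVE_CUGRAPH_OPS = False
try:
    import pylibcugraphops  # noqa: F401  # optional backend; may be replaced by the migrated package
    HAVE_CUGRAPH_OPS = True
except ImportError:
    # Not installed: fall back to code paths that do not need cugraph-ops.
    pass
except Exception as e:
    warnings.warn(f"Unexpected error while importing pylibcugraphops: {e}")

if HAVE_CUGRAPH_OPS:
    # cugraph-ops-backed implementations would be enabled/registered here.
    pass
```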
Oh interesting, did not know that!
Ok thank you.
Since this new home for the code doesn't exist yet, can we remove the warning on line 24? It seems wrong to warn about something you know does not exist.
/ok to test
/ok to test
/ok to test
/ok to test
/ok to test
The packaging changes look mostly right to me. There are a few areas that may have been missed:

- `ci/release/update-version.sh` has references to `libcugraphops` and `pylibcugraphops`
- `readme_pages/CONTRIBUTING.md` says "The cugraph-ops package is the expection being a closed-source package." (also probably a typo: expection -> exception)
- `python/cugraph-pyg/pytest.ini` has references to `cugraph_ops`, and several tests are marked with `cugraph_ops`

I would like to see those addressed (or responded to here) before approving.
The cugraph-gnn library changes can probably go further than shown in this PR. I don't think it makes sense to keep so many references to a library that no longer exists, even if you expect a new package. Maybe it's better to drop all cugraph-ops references and then add those code paths back when the new home of that code exists.
@@ -179,6 +180,7 @@ def test_get_source_destination_range():
    assert output_d == expected_output


@pytest.mark.skip(reason="Skipping due to missing cugraph-ops backend.")
def test__create_homogeneous_cugraph_dgl_nn_sparse_graph():
nit: Looks like a typo with the double underscore here.

Suggested change:
-def test__create_homogeneous_cugraph_dgl_nn_sparse_graph():
+def test_create_homogeneous_cugraph_dgl_nn_sparse_graph():
Fixes #55

RAPIDS will stop shipping `cugraph-ops` in 25.02 (ref: rapidsai/cugraph-gnn#99). This proposes removing all references to it in docs here.

## Notes for Reviewers

In addition to auto-assigned CODEOWNERS, let's please not merge this until @alexbarghi-nv approves.

### How I found these references

```shell
git grep -i cugraphops
git grep -i -E 'cugraph.*ops'
git grep -i ops
```

Authors:
- James Lamb (https://github.com/jameslamb)

Approvers:
- Don Acosta (https://github.com/acostadon)
- Bradley Dice (https://github.com/bdice)
- Alex Barghi (https://github.com/alexbarghi-nv)

URL: #72
I'm +1 to this. I think it'd be better to fully remove the effectively-unreachable code here and then re-introduce it when the successor to `cugraph-ops` exists. But I'm also ok with deferring that to later if it means we could get this PR merged sooner, as that would unblock a lot of other work for removing `cugraph-ops`.
Address #81