Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] Update Jobeframework's IsActive() for RayJob #3949

Draft
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

mszadkow
Copy link
Contributor

@mszadkow mszadkow commented Jan 9, 2025

What type of PR is this?

/kind bug

What this PR does / why we need it:

The default value for the JobDeploymentStatus is “”, which equals JobDeploymentStatusNew.

Ray-operator is the one to change the JobDeploymentStatus from New to Initialising.
When the RayJob is Suspend it puts the JobDeploymentStatus to Suspending and then to Suspended.

However in MultiKueue setup we lack ray-operator in the manager cluster.
JobDeploymentStatus prevents workloads from being created in the manager cluster and copied to worker clusters as it remains "New" and it's not updated to Suspended.

Which issue(s) this PR fixes:

Relates #3822

Special notes for your reviewer:

Does this PR introduce a user-facing change?

NONE

To allow for workload creation in MultiKueue clusters.
@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. kind/bug Categorizes issue or PR as related to a bug. labels Jan 9, 2025
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: mszadkow
Once this PR has been reviewed and has the lgtm label, please assign mimowo for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Jan 9, 2025
Copy link

netlify bot commented Jan 9, 2025

Deploy Preview for kubernetes-sigs-kueue ready!

Name Link
🔨 Latest commit 48f7006
🔍 Latest deploy log https://app.netlify.com/sites/kubernetes-sigs-kueue/deploys/677f89e73acb6000082f9dbe
😎 Deploy Preview https://deploy-preview-3949--kubernetes-sigs-kueue.netlify.app
📱 Preview on mobile
Toggle QR Code...

QR Code

Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify site configuration.

@mszadkow
Copy link
Contributor Author

mszadkow commented Jan 9, 2025

@andrewsykim @mimowo
I have created a separate PR to discuss it, based on the comment.
Is it the proper approach and explanation?

@mszadkow
Copy link
Contributor Author

mszadkow commented Jan 9, 2025

/cc @andrewsykim @mimowo

@mimowo
Copy link
Contributor

mimowo commented Jan 9, 2025

Could you clarify what are the user-facing consequences of the bug? Is it reproducible on a running system reliably or a rare race condition?


Kueue’s IsActive()old condition prevents workload creation as the IsActive is used for workload construction, to ensure that old workload is not running.

Can you show a link where this is used specifically so that we can better understand the e2e consequences of the issue?

@mimowo
Copy link
Contributor

mimowo commented Jan 9, 2025

Also, would this issue be reproducible and testable with integration tests for the RayJob, as being added in #3892?

@mszadkow
Copy link
Contributor Author

mszadkow commented Jan 9, 2025

Also, would this issue be reproducible and testable with integration tests for the RayJob, as being added in #3892?

Ray-operator is the one to change the JobDeploymentStatus from New to Initializing.
However bc the Job is Suspend it puts the JobDeploymentStatus to Suspending and then to Suspended.
It can only happen in MultiKueue scenario where we lack ray-operator...

We need some solution for that problem.

@mszadkow mszadkow changed the title Update Jobeframework's IsActive() for RayJob [WIP] Update Jobeframework's IsActive() for RayJob Jan 9, 2025
@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jan 9, 2025
@mszadkow mszadkow marked this pull request as draft January 9, 2025 12:23
@mimowo
Copy link
Contributor

mimowo commented Jan 9, 2025

It can only happen in MultiKueue scenario

I'm still not sure what makes it MK-specific. Can this happen if a regular RayJob is getting suspended, for example, due to preemption or setting the CQ to inactive? Also, in integration tests we could simulate what the Ray-operator would do.

@mszadkow
Copy link
Contributor Author

mszadkow commented Jan 9, 2025

It can only happen in MultiKueue scenario

I'm still not sure what makes it MK-specific. Can this happen if a regular RayJob is getting suspended, for example, due to preemption or setting the CQ to inactive? Also, in integration tests we could simulate what the Ray-operator would do.

  1. Regular(single cluster scenario) RayJob is not affected by this, bc we have running ray-operator.
    The ray-operator is responsible to put RayJob into non-active state, from Kueue perspective.
    Non-active meaning its .Status.JobDeploymentStatus is equal rayv1.JobDeploymentStatusSuspended and that's perfectly normal if the job is suspended.

  2. We do simulate ray-operator in the tests for single cluster by manually updating .Status.JobDeploymentStatus to rayv1.JobDeploymentStatusSuspended.
    As a matter of fact I can do the same for MultiKueue integration tests.
    However there is no agency to change .Status.JobDeploymentStatus from New to Suspended in MK scenario in e2e tests.

@mimowo
Copy link
Contributor

mimowo commented Jan 9, 2025

Thank you for the summary @mszadkow - this clarifies a lot. Since this is MK-specific I would prefer to again combine the change with the main PR - as initially done.

However there is no agency to change .Status.JobDeploymentStatus from New to Suspended in MK scenario in e2e tests.

This makes me wonder about the remaining question - why we didn't have an issue with IsActive for other JobCRDs before supporting managedBy in MultiKueue, for example, with kubeflow Jobs?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. kind/bug Categorizes issue or PR as related to a bug. release-note-none Denotes a PR that doesn't merit a release note. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants