Fix read permission denied on train script when run as non-root #2373

Open
wants to merge 1 commit into
base: master

astefanutti
Contributor

What this PR does / why we need it:

This PR writes the train function script into a tmp directory so it can be read when TrainJob containers run as non-root.
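As a rough illustration of the idea, the sketch below builds a shell snippet that writes the function source under a freshly created temporary directory and then runs it. It is modelled on the `printf`/`{entrypoint}` lines commented on in the review; the helper name `build_train_command` and the use of `mktemp -d` and `shlex.quote` are assumptions made for this example, not code taken from the PR.

```python
import shlex


def build_train_command(func_code: str, func_file: str, entrypoint: str) -> str:
    """Return a shell snippet that writes func_code to a temp dir and runs it.

    Hypothetical helper: the real SDK may assemble its command differently.
    """
    quoted_file = shlex.quote(func_file)
    return "\n".join(
        [
            # Assumption: mktemp -d creates a directory owned by the current
            # UID, so a non-root user can write and read the script there.
            "program_path=$(mktemp -d)",
            # Quote the source so the shell passes it to printf verbatim.
            f'printf "%s" {shlex.quote(func_code)} > "$program_path"/{quoted_file}',
            # Run the script with the configured entrypoint (e.g. python).
            f'{entrypoint} "$program_path"/{quoted_file}',
        ]
    )


if __name__ == "__main__":
    print(build_train_command("print('hello')", "train.py", "python"))
```

The trade-off raised later in the review applies to any variant of this: once the script lives in a temporary directory, it no longer sits next to the user's other modules.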

Which issue(s) this PR fixes (optional, in Fixes #<issue number>, #<issue number>, ... format, will close the issue(s) when PR gets merged):

Fixes #2372.

Checklist:

  • Docs included if any changes are user facing

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign tenzen-y for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@coveralls

coveralls commented Jan 7, 2025

Pull Request Test Coverage Report for Build 12695578194

Details

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage remained the same at 100.0%

Totals Coverage Status
Change from base Build 12692161092: 0.0%
Covered Lines: 85
Relevant Lines: 85


Comment on lines 169 to 170
printf "%s" \"$SCRIPT\" > \"$program_path\"/\"{func_file}\"
{entrypoint} \"$program_path\"/\"{func_file}\""""
andreyvelich (Member) commented Jan 7, 2025

If we create this file under the HOME directory, will it still fail for non-root containers?

I intentionally write the script into the same file so that the file system layout is the same in the user's workspace and in the Kubernetes Pod. That way, users can use standard Python imports while developing ML training code, similar to what we showcased in the KubeCon demo: https://youtu.be/Lgy4ir1AhYw?t=446

Related: #2347.

cc @shravan-achar

Contributor Author

Under the stricter security policy, containers run with a random UID and have no fixed HOME directory.

I need to research the other possible options a bit more, and I will get back to you.
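To make the constraint concrete, here is a minimal sketch of how a process could probe which candidate directories it can actually write to when it runs under an arbitrary UID. The function name `first_writable_dir` and the candidate list are hypothetical and serve only as illustration.

```python
import os
import tempfile
from typing import Iterable, Optional


def first_writable_dir(candidates: Iterable[Optional[str]]) -> Optional[str]:
    """Return the first existing directory the current user can write into."""
    for path in candidates:
        # Skip unset values (e.g. a missing HOME) and paths that do not exist.
        if path and os.path.isdir(path) and os.access(path, os.W_OK | os.X_OK):
            return path
    return None


if __name__ == "__main__":
    # With a random UID, HOME may be unset or not writable, and the working
    # directory may be owned by root; the temp dir is the usual fallback.
    print(first_writable_dir([os.environ.get("HOME"), os.getcwd(), tempfile.gettempdir()]))
```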

Contributor Author

I've given this another cycle and it doesn't seem possible to change the working directory permissions easily.

On the other hand, mounting an emptyDir volume as the /workspace directory in the training runtime works as expected with the stricter security policy.

My understanding is that this approach aligns well with the initializer and worker pods coordinating via PVCs. The default emptyDir fits the current setup, in which the initializers run as init containers, and it can be replaced with a PVC mount once they run as separate Jobs.
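A minimal sketch of that idea, written against a plain-dict Pod spec rather than any particular client library: it adds a volume and mounts it at /workspace in every container, using an emptyDir by default and a PVC when a claim name is given. The helper `with_workspace_volume`, the volume name `workspace`, and the claim name in the usage example are assumptions for illustration; only the `volumes`, `emptyDir`, `persistentVolumeClaim` and `volumeMounts` field names come from the Kubernetes Pod API.

```python
import copy
from typing import Optional

WORKSPACE_VOLUME = "workspace"  # hypothetical volume name


def with_workspace_volume(
    pod_spec: dict, mount_path: str = "/workspace", claim_name: Optional[str] = None
) -> dict:
    """Return a copy of pod_spec with a workspace volume mounted in all containers."""
    spec = copy.deepcopy(pod_spec)
    # emptyDir suits init containers sharing a Pod with the trainer; a PVC is
    # needed once initializers and workers run in separate Pods.
    if claim_name:
        source = {"persistentVolumeClaim": {"claimName": claim_name}}
    else:
        source = {"emptyDir": {}}
    spec.setdefault("volumes", []).append({"name": WORKSPACE_VOLUME, **source})
    for container in spec.get("initContainers", []) + spec.get("containers", []):
        container.setdefault("volumeMounts", []).append(
            {"name": WORKSPACE_VOLUME, "mountPath": mount_path}
        )
    return spec


if __name__ == "__main__":
    base = {"containers": [{"name": "trainer", "image": "example/trainer"}]}
    print(with_workspace_volume(base))
    print(with_workspace_volume(base, claim_name="shared-workspace"))
```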

Successfully merging this pull request may close these issues.

Permission denied when reading TrainJob function script when run as non-root user
3 participants