Fix read permission denied on train script when run as non-root #2373

astefanutti · 2025-01-07T14:05:11Z

What this PR does / why we need it:

This PR writes the train function script into a tmp directory so it can be read when TrainJob containers run as non-root.
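
For illustration only, the idea can be sketched in Python rather than in the shell snippet the SDK actually generates. This is a minimal sketch, assuming the temporary directory (typically /tmp) is writable by whatever UID the container runs as; `run_script_from_tmp` and the default file name `train.py` are hypothetical names, not part of the SDK.

```python
import os
import subprocess
import sys
import tempfile


def run_script_from_tmp(script: str, func_file: str = "train.py") -> str:
    """Write `script` into a fresh temporary directory and execute it.

    Hypothetical helper for illustration; not the SDK's implementation.
    """
    # mkdtemp creates a directory owned by the current UID (mode 0700),
    # so it does not depend on HOME or on the working directory's permissions.
    program_path = tempfile.mkdtemp()
    script_path = os.path.join(program_path, func_file)

    with open(script_path, "w") as f:
        f.write(script)

    # Run the script with the current interpreter and capture its output.
    result = subprocess.run(
        [sys.executable, script_path],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


if __name__ == "__main__":
    print(run_script_from_tmp('print("training started")'), end="")
```

Because the directory is created by the running process itself, the script stays readable even when the container is assigned an arbitrary UID.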

Which issue(s) this PR fixes (optional, in Fixes #<issue number>, #<issue number>, ... format, will close the issue(s) when PR gets merged):

Fixes #2372.

Checklist:

Docs included if any changes are user facing

google-oss-prow · 2025-01-07T14:05:18Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign tenzen-y for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

coveralls · 2025-01-07T14:10:32Z

Pull Request Test Coverage Report for Build 12695578194

Details

0 of 0 changed or added relevant lines in 0 files are covered.
No unchanged relevant lines lost coverage.
Overall coverage remained the same at 100.0%

Totals
Change from base Build 12692161092:	0.0%
Covered Lines:	85
Relevant Lines:	85

💛 - Coveralls

andreyvelich · 2025-01-07T18:46:51Z

sdk_v2/kubeflow/training/utils/utils.py

+                printf "%s" \"$SCRIPT\" > \"$program_path\"/\"{func_file}\"
+                {entrypoint} \"$program_path\"/\"{func_file}\""""


If we create this file under the HOME directory, will it still fail for non-root containers?

I intentionally write the script to the same file so that the file system layout is the same in the user's workspace and in the Kubernetes Pod. This means users can use standard Python imports while developing ML training code, similar to what we showcased in the KubeCon demo: https://youtu.be/Lgy4ir1AhYw?t=446

Related: #2347.

cc @shravan-achar

astefanutti · 2025-01-08

Under the stricter security policy, containers run with a random UID and no fixed HOME directory.

I need to research the other possible options a bit more and will get back to you.

astefanutti · 2025-01-09

I've given this another cycle, and it doesn't seem possible to change the working directory permissions easily.

On the other hand, mounting an emptyDir volume as the /workspace directory in the training runtime works as expected with the stricter security policy.

My understanding is that this approach aligns well with the initializer and worker pods coordinating via PVCs. The default emptyDir fits with the initializers currently being run as init containers, but this can be adapted to mount a PVC once they run as separate Jobs.
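
As a rough sketch of the emptyDir approach described above, the snippet below patches a Pod spec, represented as a plain Python dict, so that an emptyDir volume is mounted at /workspace. The `volumes`, `emptyDir`, `volumeMounts` and `mountPath` fields are standard Kubernetes Pod spec fields; the helper name `with_workspace_volume`, the volume name `workspace`, the container name `trainer` and the image are assumptions made for illustration and are not taken from the actual training runtime manifests.

```python
import copy


def with_workspace_volume(pod_spec: dict, container_name: str = "trainer") -> dict:
    """Return a copy of `pod_spec` with an emptyDir volume mounted at /workspace.

    Hypothetical helper; names other than the Kubernetes field names are assumed.
    """
    spec = copy.deepcopy(pod_spec)

    # An emptyDir volume is created empty when the Pod starts and is
    # removed together with the Pod.
    spec.setdefault("volumes", []).append({"name": "workspace", "emptyDir": {}})

    # Mount the volume into the matching container only.
    for container in spec.get("containers", []):
        if container.get("name") == container_name:
            container.setdefault("volumeMounts", []).append(
                {"name": "workspace", "mountPath": "/workspace"}
            )
    return spec


if __name__ == "__main__":
    base = {
        "securityContext": {"runAsNonRoot": True},
        "containers": [{"name": "trainer", "image": "example.com/trainer:latest"}],
    }
    print(with_workspace_volume(base))
```

Swapping the `emptyDir` entry for a `persistentVolumeClaim` entry would be the corresponding change if the initializers later run as separate Jobs.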

Signed-off-by: Antonin Stefanutti <[email protected]>

google-oss-prow bot requested review from jinchihe and kuizhiqing January 7, 2025 14:05

google-oss-prow bot added the size/XS label Jan 7, 2025

astefanutti force-pushed the pr-07 branch from 6c5a1dc to 855a879 Compare January 7, 2025 16:20

andreyvelich reviewed Jan 7, 2025

Fix read permission denied on train script when run as non-root

5cabee9

Signed-off-by: Antonin Stefanutti <[email protected]>

astefanutti force-pushed the pr-07 branch from 855a879 to 5cabee9 Compare January 9, 2025 17:45
