SageMaker inference download locations are misconfigured, and models are downloaded twice #949

Open
thvasilo opened this issue Aug 2, 2024 · 0 comments
Labels: 0.4, bug (Something isn't working), sagemaker

thvasilo commented Aug 2, 2024

GSF tries to download the models into /opt/ml/gsgnn_model, as seen here https://github.com/thvasilo/graphstorm/blob/8e7c4c2e10accb114f2beccaa36ec3094d01241c/python/graphstorm/sagemaker/sagemaker_infer.py#L173

On a job with a large model (learnable embeddings included), the logs show the following disk usage:

Filesystem      Size  Used Avail Use% Mounted on
overlay         120G   31G   90G  26% /
tmpfs            64M     0   64M   0% /dev
tmpfs           374G     0  374G   0% /sys/fs/cgroup
/dev/nvme0n1p1   70G   47G   24G  67% /usr/sbin/docker-init
/dev/nvme2n1   1008G  178G  779G  19% /tmp
shm             372G     0  372G   0% /dev/shm
/dev/nvme1n1    120G   31G   90G  26% /etc/hosts
tmpfs           374G     0  374G   0% /proc/acpi
tmpfs           374G     0  374G   0% /sys/firmware

The partition mounted under /, which I think includes /opt, has only 90 GB available.

To be able to download larger datasets/models, we need to use the partition mounted under /tmp, which has 779 GB available.
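
One way to make the choice of download location less brittle is to check free space before downloading. The following is only a minimal sketch, not GraphStorm's actual code: pick_download_dir and its arguments are hypothetical names introduced for illustration, and the only library call it relies on is shutil.disk_usage from the Python standard library.

```python
import os
import shutil


def pick_download_dir(candidates, required_bytes=0):
    """Return the candidate directory whose filesystem has the most free space.

    Hypothetical helper, for illustration only. Raises OSError if no
    existing candidate has at least `required_bytes` free.
    """
    best_dir, best_free = None, -1
    for path in candidates:
        if not os.path.isdir(path):
            continue  # skip directories that do not exist on this host
        free = shutil.disk_usage(path).free
        if free > best_free:
            best_dir, best_free = path, free
    if best_dir is None or best_free < required_bytes:
        raise OSError(
            f"No candidate in {candidates} has {required_bytes} bytes free"
        )
    return best_dir
```

With the disk layout shown above, calling pick_download_dir(["/opt/ml", "/tmp"]) would be expected to prefer /tmp, since that partition reports 779G available against 90G for /.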

Also, in our inference launch script we define

https://github.com/thvasilo/graphstorm/blob/8e7c4c2e10accb114f2beccaa36ec3094d01241c/sagemaker/launch/launch_infer.py#L120

That will download the model data from the provided S3 path into /opt/ml/input/data/<channel_name>, which by default for models is /opt/ml/input/data/model (see the Estimator docs).

But then, in the download step linked at the top of this issue, we try to download the model again, this time into /opt/ml/gsgnn_model, so the model ends up being downloaded twice.
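
The second download could be skipped by first checking whether SageMaker has already placed the model in the input channel directory. This is a minimal sketch under stated assumptions, not the fix adopted by the project: resolve_model_dir and download_fn are hypothetical names that belong to neither GraphStorm nor the SageMaker SDK, and the channel directory is assumed to be the default /opt/ml/input/data/model mentioned above.

```python
import os


def resolve_model_dir(channel_dir, fallback_dir, download_fn):
    """Return a directory containing the model, downloading only if needed.

    channel_dir:  where SageMaker places the channel data,
                  e.g. /opt/ml/input/data/model (assumed default).
    fallback_dir: where to download when the channel is missing or empty.
    download_fn:  hypothetical callable that downloads the model into
                  the directory it is given.
    """
    if os.path.isdir(channel_dir) and os.listdir(channel_dir):
        # Model data is already on disk, so skip the second download.
        return channel_dir
    os.makedirs(fallback_dir, exist_ok=True)
    download_fn(fallback_dir)
    return fallback_dir
```

Passing a fallback directory on the larger /tmp partition would also address the disk space concern for the case where a download is still required.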

@thvasilo thvasilo added bug Something isn't working 0.4 labels Aug 2, 2024
@classicsong classicsong added this to the 0.4 release milestone Aug 22, 2024