DENG-6965 Track the trained model as W&B Artifact #41

Draft
wants to merge 2 commits into
base: main

Conversation

abhi-agg
Collaborator

Goal(s) of this Pull Request:

Now that we have agreed (https://mozilla-hub.atlassian.net/browse/DENG-6977) that BYOB (bring your own bucket) is our preferred approach for the migration, we would like to test whether it works on a test W&B Team (a Team that none of our clients uses).

Why?

We would like to discover potential issues with the BYOB approach before we start migrating existing clients/Teams to our GCS storage.

Example of how it works:

...

Acceptance criteria:

In order to approve, reviewer must be able to...

  • ...

@abhi-agg
Collaborator Author

Testing

  1. Run the ImageClassifier Flow (replace your-api-key-here with your own API key)
    WANDB_API_KEY=your-api-key-here WANDB_PROJECT=byob-test-project python image_classifier_flow.py --environment=pypi run --offline False
    
  2. Verify that the trained model is stored in the wb-byob-test-abhi bucket in the mlops-inference-nonprod GCP project
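
Step 2 can also be checked from a script instead of the Cloud Console. The following is a minimal sketch, assuming the google-cloud-storage package is installed and that application default credentials can read the bucket. The helper name `list_object_names` is hypothetical, and the `wb-data/` prefix in the usage comment is taken from the paths observed in this test.

```python
def list_object_names(bucket_name, prefix, project=None, client=None):
    """Return the names of all objects in `bucket_name` that start with `prefix`."""
    if client is None:
        # Imported lazily so the helper can be exercised with a stub client.
        # Assumption: the google-cloud-storage package is installed.
        from google.cloud import storage

        # Uses application default credentials for authentication.
        client = storage.Client(project=project)
    return [blob.name for blob in client.list_blobs(bucket_name, prefix=prefix)]


# Hypothetical usage against the test bucket from this PR:
#   names = list_object_names(
#       "wb-byob-test-abhi", "wb-data/", project="mlops-inference-nonprod"
#   )
#   print(len(names), "objects found under wb-data/")
```

A non-empty result after the flow has run suggests that W&B wrote to the bucket; it does not by itself identify which object is the model.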

Here is the OB run.

An excerpt from the logs showing the W&B run:

wandb: ⭐️ View project at https://wandb.ai/mlops-byob-test/byob-test-project
2024-12-13 10:41:53.725 [248/train/861 (pid 57563)] [7aac132c-ffb3-446d-b757-9bd2bbfc8c0b] wandb: 🚀 View run at https://wandb.ai/mlops-byob-test/byob-test-project/runs/68f3fopy
wandb: 0.269 MB of 0.269 MB uploaded
wandb: 🚀 View run snowy-smoke-2 at: https://wandb.ai/mlops-byob-test/byob-test-project/runs/68f3fopy
2024-12-13 10:44:00.725 [248/train/861 (pid 57563)] [7aac132c-ffb3-446d-b757-9bd2bbfc8c0b] wandb: ⭐️ View project at: https://wandb.ai/mlops-byob-test/byob-test-project
2024-12-13 10:44:00.725 [248/train/861 (pid 57563)] [7aac132c-ffb3-446d-b757-9bd2bbfc8c0b] wandb: Synced 5 W&B file(s), 0 media file(s), 1 artifact file(s) and 1 other file(s)

The corresponding run is stored in wb-byob-test-abhi/wb-data/mlops-byob-test/byob-test-project/68f3fopy

The corresponding artifact is stored in wb-byob-test-abhi/wb-data/wandb_artifacts/502456953/1369576845
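
The run location above follows the pattern `wb-data/<entity>/<project>/<run-id>`, so the bucket prefix can be derived from the run URL printed in the logs. This is a small sketch of that mapping. The function name `run_prefix_from_url` is hypothetical, and the layout is inferred from this single run rather than from documented W&B behavior.

```python
from urllib.parse import urlparse


def run_prefix_from_url(run_url, root="wb-data"):
    """Map a W&B run URL to the bucket prefix observed in this test."""
    parts = urlparse(run_url).path.strip("/").split("/")
    # Expected URL path shape: <entity>/<project>/runs/<run-id>
    if len(parts) != 4 or parts[2] != "runs":
        raise ValueError(f"not a W&B run URL: {run_url!r}")
    entity, project, _, run_id = parts
    return f"{root}/{entity}/{project}/{run_id}"


print(run_prefix_from_url(
    "https://wandb.ai/mlops-byob-test/byob-test-project/runs/68f3fopy"
))
# wb-data/mlops-byob-test/byob-test-project/68f3fopy
```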

Note

W&B doesn't store artifacts in the bucket under the name that you pass to the wandb.Artifact() API call. Instead, the object name in the bucket is an internally generated hash, which the W&B team also confirmed on Slack.
However, the content of the wb-byob-test-abhi/wb-data/mlops-byob-test/byob-test-project/68f3fopy/artifact/1369576845/wandb_manifest.json file confirms that the artifact linked above is indeed trained_model.pt:

{
    "version": 1,
    "storagePolicy": "wandb-storage-policy-v1",
    "storagePolicyConfig": {
        "storageLayout": "V2"
    },
    "contents": {
        "trained_model.pt": {
            "digest": "Z5ewEffEcraBHDG22jgK+Q==",
            "birthArtifactID": "QXJ0aWZhY3Q6MTM2OTU3Njg0NQ==",
            "size": 251788
        }
    }
}
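
The manifest fields can be decoded to cross-check the artifact. The sketch below, using only the Python standard library, shows that `birthArtifactID` base64-decodes to the artifact ID seen in the bucket path and that `digest` decodes to 16 bytes. Treating the digest as a base64-encoded MD5 checksum is an assumption based on its length, and the helper names `describe_manifest` and `md5_b64` are hypothetical.

```python
import base64
import hashlib
import json

# Manifest content copied from the wandb_manifest.json shown above.
MANIFEST_JSON = """
{
    "version": 1,
    "storagePolicy": "wandb-storage-policy-v1",
    "storagePolicyConfig": {"storageLayout": "V2"},
    "contents": {
        "trained_model.pt": {
            "digest": "Z5ewEffEcraBHDG22jgK+Q==",
            "birthArtifactID": "QXJ0aWZhY3Q6MTM2OTU3Njg0NQ==",
            "size": 251788
        }
    }
}
"""


def describe_manifest(manifest):
    """Summarize each file entry, decoding the base64-encoded fields."""
    rows = []
    for name, entry in manifest["contents"].items():
        rows.append({
            "file": name,
            "size": entry["size"],
            "digest_bytes": len(base64.b64decode(entry["digest"])),
            "birth_artifact": base64.b64decode(entry["birthArtifactID"]).decode("utf-8"),
        })
    return rows


def md5_b64(path):
    """Base64-encoded MD5 of a local file (assumed to be the digest format)."""
    with open(path, "rb") as handle:
        return base64.b64encode(hashlib.md5(handle.read()).digest()).decode("ascii")


for row in describe_manifest(json.loads(MANIFEST_JSON)):
    print(row)
# {'file': 'trained_model.pt', 'size': 251788, 'digest_bytes': 16, 'birth_artifact': 'Artifact:1369576845'}
```

If the assumption about the digest format holds, comparing `md5_b64("trained_model.pt")` on a local copy of the model with the `digest` value would confirm that the stored file matches.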


This shows that the BYOB approach works for this example.
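
For context, logging a model file as a W&B artifact from a training step typically looks like the sketch below. It is not the code from this PR: the function name `log_model_artifact`, the artifact name `trained_model`, and the artifact type `model` are assumptions chosen for illustration. It relies only on `wandb.init`, `wandb.Artifact`, `Artifact.add_file`, and `Run.log_artifact`.

```python
def log_model_artifact(model_path, artifact_name="trained_model", project=None):
    """Upload `model_path` as a W&B artifact attached to a new run."""
    # Imported lazily; requires the wandb package and a valid WANDB_API_KEY.
    import wandb

    # When `project` is None, wandb falls back to the WANDB_PROJECT variable.
    with wandb.init(project=project) as run:
        # The artifact name and type here are illustrative assumptions.
        artifact = wandb.Artifact(name=artifact_name, type="model")
        artifact.add_file(model_path)
        run.log_artifact(artifact)
        return artifact


# Hypothetical usage inside the train step:
#   log_model_artifact("trained_model.pt")
```

With BYOB configured for the Team, no storage-specific code is needed on the client side; the same call is expected to place the files in the Team's bucket.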

Detailed console logs (ignore the Internal Server Error and GCS upload failures):
```
Metaflow 2.12.27.1+obcheckpoint(0.1.1);ob(v1) executing ImageClassifierFlow for user:[email protected]
Validating your flow...
    The graph looks good!
Running pylint...
    Pylint is happy!
2024-12-13 10:31:18.488 Bootstrapping virtual environment(s) ...
2024-12-13 10:32:11.105 Virtual environment(s) bootstrapped!
2024-12-13 10:32:14.016 Workflow starting (run-id 248), see it in the UI at https://ui.desertowl.obp.outerbounds.com/p/default/ImageClassifierFlow/248
2024-12-13 10:32:33.328 [248/start/859 (pid 57543)] Task is starting.
2024-12-13 10:32:40.347 [248/start/859 (pid 57543)] [pod t-8039a268-n45cw-ggjb8] Task is starting (Pod is pending)...
2024-12-13 10:33:30.854 [248/start/859 (pid 57543)] [pod t-8039a268-n45cw-ggjb8] Setting up task environment.
2024-12-13 10:33:35.150 [248/start/859 (pid 57543)] [pod t-8039a268-n45cw-ggjb8] Downloading code package...
2024-12-13 10:33:36.009 [248/start/859 (pid 57543)] [pod t-8039a268-n45cw-ggjb8] Code package downloaded.
2024-12-13 10:33:36.500 [248/start/859 (pid 57543)] [pod t-8039a268-n45cw-ggjb8] Task is starting.
2024-12-13 10:33:37.330 [248/start/859 (pid 57543)] [pod t-8039a268-n45cw-ggjb8] Bootstrapping virtual environment...
2024-12-13 10:35:31.039 [248/start/859 (pid 57543)] [pod t-8039a268-n45cw-ggjb8] Environment bootstrapped.
2024-12-13 10:35:41.154 [248/start/859 (pid 57543)] [pod t-8039a268-n45cw-ggjb8] start step: downloading and normalizing dataset
2024-12-13 10:35:41.470 [248/start/859 (pid 57543)] [pod t-8039a268-n45cw-ggjb8] Downloading https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz to ./data/cifar-10-python.tar.gz
2024-12-13 10:35:45.731 [248/start/859 (pid 57543)] [pod t-8039a268-n45cw-ggjb8] Extracting ./data/cifar-10-python.tar.gz to ./data
2024-12-13 10:35:45.445 [248/start/859 (pid 57543)] [pod t-8039a268-n45cw-ggjb8] 100.0%
2024-12-13 10:35:48.095 [248/start/859 (pid 57543)] [pod t-8039a268-n45cw-ggjb8] Files already downloaded and verified
2024-12-13 10:36:12.182 [248/start/859 (pid 57543)] [pod t-8039a268-n45cw-ggjb8] Task finished with exit code 0.
2024-12-13 10:36:16.579 [248/start/859 (pid 57543)] Task finished successfully.
2024-12-13 10:36:23.226 [248/train/861 (pid 57563)] Task is starting.
2024-12-13 10:37:37.668 [248/train/861 (pid 57563)] Error occurred while polling for result: HTTP Error 500: Internal Server Error
2024-12-13 10:37:37.669 1 task is running: train (1 running; 0 done).
2024-12-13 10:37:37.669 No tasks are waiting in the queue.
2024-12-13 10:37:37.669 3 steps have not started: end, upload_model_to_gcs, evaluate.
2024-12-13 10:37:37.670 [248/train/861 (pid 57563)] [7aac132c-ffb3-446d-b757-9bd2bbfc8c0b] Task status: SUBMITTED...
2024-12-13 10:37:47.870 [248/train/861 (pid 57563)] Error occurred while polling for result: HTTP Error 500: Internal Server Error
2024-12-13 10:38:00.009 [248/train/861 (pid 57563)] Error occurred while polling for result: HTTP Error 500: Internal Server Error
2024-12-13 10:38:12.079 [248/train/861 (pid 57563)] Error occurred while polling for result: HTTP Error 500: Internal Server Error
2024-12-13 10:38:24.237 [248/train/861 (pid 57563)] Error occurred while polling for result: HTTP Error 500: Internal Server Error
2024-12-13 10:38:36.349 [248/train/861 (pid 57563)] Error occurred while polling for result: HTTP Error 500: Internal Server Error
2024-12-13 10:38:48.454 [248/train/861 (pid 57563)] Error occurred while polling for result: HTTP Error 500: Internal Server Error
2024-12-13 10:39:00.546 [248/train/861 (pid 57563)] Error occurred while polling for result: HTTP Error 500: Internal Server Error
2024-12-13 10:39:12.700 [248/train/861 (pid 57563)] Error occurred while polling for result: HTTP Error 500: Internal Server Error
2024-12-13 10:39:24.815 [248/train/861 (pid 57563)] Error occurred while polling for result: HTTP Error 500: Internal Server Error
2024-12-13 10:39:36.934 [248/train/861 (pid 57563)] Error occurred while polling for result: HTTP Error 500: Internal Server Error
2024-12-13 10:39:43.016 [248/train/861 (pid 57563)] [7aac132c-ffb3-446d-b757-9bd2bbfc8c0b] Setting up task environment.
2024-12-13 10:42:43.722 1 task is running: train (1 running; 0 done).
2024-12-13 10:42:43.722 No tasks are waiting in the queue.
2024-12-13 10:42:43.722 3 steps have not started: end, upload_model_to_gcs, evaluate.
2024-12-13 10:39:47.064 [248/train/861 (pid 57563)] [7aac132c-ffb3-446d-b757-9bd2bbfc8c0b] Downloading code package...
2024-12-13 10:39:51.664 [248/train/861 (pid 57563)] [7aac132c-ffb3-446d-b757-9bd2bbfc8c0b] Code package downloaded.
2024-12-13 10:39:52.007 [248/train/861 (pid 57563)] [7aac132c-ffb3-446d-b757-9bd2bbfc8c0b] Task is starting.
2024-12-13 10:39:52.294 [248/train/861 (pid 57563)] [7aac132c-ffb3-446d-b757-9bd2bbfc8c0b] Bootstrapping virtual environment...
2024-12-13 10:41:29.149 [248/train/861 (pid 57563)] [7aac132c-ffb3-446d-b757-9bd2bbfc8c0b] Environment bootstrapped.
2024-12-13 10:41:54.177 [248/train/861 (pid 57563)] [7aac132c-ffb3-446d-b757-9bd2bbfc8c0b] Fri Dec 13 09:41:54 2024
2024-12-13 10:41:54.177 [248/train/861 (pid 57563)] [7aac132c-ffb3-446d-b757-9bd2bbfc8c0b] +-----------------------------------------------------------------------------------------+
2024-12-13 10:41:54.177 [248/train/861 (pid 57563)] [7aac132c-ffb3-446d-b757-9bd2bbfc8c0b] | NVIDIA-SMI 550.90.07              Driver Version: 550.90.07      CUDA Version: 12.4     |
2024-12-13 10:41:54.177 [248/train/861 (pid 57563)] [7aac132c-ffb3-446d-b757-9bd2bbfc8c0b] |-----------------------------------------+------------------------+----------------------+
2024-12-13 10:41:54.177 [248/train/861 (pid 57563)] [7aac132c-ffb3-446d-b757-9bd2bbfc8c0b] | GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
2024-12-13 10:41:54.177 [248/train/861 (pid 57563)] [7aac132c-ffb3-446d-b757-9bd2bbfc8c0b] | Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
2024-12-13 10:41:54.177 [248/train/861 (pid 57563)] [7aac132c-ffb3-446d-b757-9bd2bbfc8c0b] |                                         |                        |               MIG M. |
2024-12-13 10:41:54.177 [248/train/861 (pid 57563)] [7aac132c-ffb3-446d-b757-9bd2bbfc8c0b] |=========================================+========================+======================|
2024-12-13 10:41:54.225 [248/train/861 (pid 57563)] [7aac132c-ffb3-446d-b757-9bd2bbfc8c0b] |   0  NVIDIA H100 80GB HBM3          On  |   00000000:0B:00.0 Off |                    0 |
2024-12-13 10:41:54.225 [248/train/861 (pid 57563)] [7aac132c-ffb3-446d-b757-9bd2bbfc8c0b] | N/A   32C    P0             68W /  700W |       4MiB /  81559MiB |      0%      Default |
2024-12-13 10:41:54.225 [248/train/861 (pid 57563)] [7aac132c-ffb3-446d-b757-9bd2bbfc8c0b] |                                         |                        |             Disabled |
2024-12-13 10:41:54.225 [248/train/861 (pid 57563)] [7aac132c-ffb3-446d-b757-9bd2bbfc8c0b] +-----------------------------------------+------------------------+----------------------+
2024-12-13 10:41:54.225 [248/train/861 (pid 57563)] [7aac132c-ffb3-446d-b757-9bd2bbfc8c0b]
2024-12-13 10:41:54.225 [248/train/861 (pid 57563)] [7aac132c-ffb3-446d-b757-9bd2bbfc8c0b] +-----------------------------------------------------------------------------------------+
2024-12-13 10:41:54.225 [248/train/861 (pid 57563)] [7aac132c-ffb3-446d-b757-9bd2bbfc8c0b] | Processes:                                                                              |
2024-12-13 10:41:54.225 [248/train/861 (pid 57563)] [7aac132c-ffb3-446d-b757-9bd2bbfc8c0b] |  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
2024-12-13 10:41:54.225 [248/train/861 (pid 57563)] [7aac132c-ffb3-446d-b757-9bd2bbfc8c0b] |        ID   ID                                                               Usage      |
2024-12-13 10:41:54.225 [248/train/861 (pid 57563)] [7aac132c-ffb3-446d-b757-9bd2bbfc8c0b] |=========================================================================================|
2024-12-13 10:41:54.225 [248/train/861 (pid 57563)] [7aac132c-ffb3-446d-b757-9bd2bbfc8c0b] |  No running processes found                                                             |
2024-12-13 10:41:54.225 [248/train/861 (pid 57563)] [7aac132c-ffb3-446d-b757-9bd2bbfc8c0b] +-----------------------------------------------------------------------------------------+
2024-12-13 10:41:54.310 [248/train/861 (pid 57563)] [7aac132c-ffb3-446d-b757-9bd2bbfc8c0b] 0
2024-12-13 10:41:54.311 [248/train/861 (pid 57563)] [7aac132c-ffb3-446d-b757-9bd2bbfc8c0b] Training on: cuda
2024-12-13 10:42:12.442 [248/train/861 (pid 57563)] [7aac132c-ffb3-446d-b757-9bd2bbfc8c0b] [1,  2000] loss: 2.240
2024-12-13 10:42:16.094 [248/train/861 (pid 57563)] [7aac132c-ffb3-446d-b757-9bd2bbfc8c0b] [1,  4000] loss: 1.840
2024-12-13 10:42:19.788 [248/train/861 (pid 57563)] [7aac132c-ffb3-446d-b757-9bd2bbfc8c0b] [1,  6000] loss: 1.659
2024-12-13 10:42:23.475 [248/train/861 (pid 57563)] [7aac132c-ffb3-446d-b757-9bd2bbfc8c0b] [1,  8000] loss: 1.586
2024-12-13 10:42:27.170 [248/train/861 (pid 57563)] [7aac132c-ffb3-446d-b757-9bd2bbfc8c0b] [1, 10000] loss: 1.522
2024-12-13 10:42:30.880 [248/train/861 (pid 57563)] [7aac132c-ffb3-446d-b757-9bd2bbfc8c0b] [1, 12000] loss: 1.502
2024-12-13 10:42:35.520 [248/train/861 (pid 57563)] [7aac132c-ffb3-446d-b757-9bd2bbfc8c0b] [2,  2000] loss: 1.415
2024-12-13 10:42:39.189 [248/train/861 (pid 57563)] [7aac132c-ffb3-446d-b757-9bd2bbfc8c0b] [2,  4000] loss: 1.389
2024-12-13 10:41:53.119 [248/train/861 (pid 57563)] [7aac132c-ffb3-446d-b757-9bd2bbfc8c0b] wandb: Currently logged in as: aaggarwal (mlops-byob-test). Use `wandb login --relogin` to force relogin
2024-12-13 10:41:53.723 [248/train/861 (pid 57563)] [7aac132c-ffb3-446d-b757-9bd2bbfc8c0b] wandb: Tracking run with wandb version 0.17.9
2024-12-13 10:41:53.723 [248/train/861 (pid 57563)] [7aac132c-ffb3-446d-b757-9bd2bbfc8c0b] wandb: Run data is saved locally in /tmp/tmpb6sk8x8i/metaflow/wandb/run-20241213_094153-68f3fopy
2024-12-13 10:41:53.723 [248/train/861 (pid 57563)] [7aac132c-ffb3-446d-b757-9bd2bbfc8c0b] wandb: Run `wandb offline` to turn off syncing.
2024-12-13 10:41:53.725 [248/train/861 (pid 57563)] [7aac132c-ffb3-446d-b757-9bd2bbfc8c0b] wandb: Syncing run snowy-smoke-2
2024-12-13 10:41:53.725 [248/train/861 (pid 57563)] [7aac132c-ffb3-446d-b757-9bd2bbfc8c0b] wandb: ⭐️ View project at https://wandb.ai/mlops-byob-test/byob-test-project
2024-12-13 10:41:53.725 [248/train/861 (pid 57563)] [7aac132c-ffb3-446d-b757-9bd2bbfc8c0b] wandb: 🚀 View run at https://wandb.ai/mlops-byob-test/byob-test-project/runs/68f3fopy
2024-12-13 10:42:42.832 [248/train/861 (pid 57563)] [7aac132c-ffb3-446d-b757-9bd2bbfc8c0b] [2,  6000] loss: 1.347
2024-12-13 10:42:46.487 [248/train/861 (pid 57563)] [7aac132c-ffb3-446d-b757-9bd2bbfc8c0b] [2,  8000] loss: 1.335
2024-12-13 10:42:50.182 [248/train/861 (pid 57563)] [7aac132c-ffb3-446d-b757-9bd2bbfc8c0b] [2, 10000] loss: 1.302
2024-12-13 10:42:53.820 [248/train/861 (pid 57563)] [7aac132c-ffb3-446d-b757-9bd2bbfc8c0b] [2, 12000] loss: 1.284
2024-12-13 10:42:54.775 [248/train/861 (pid 57563)] [7aac132c-ffb3-446d-b757-9bd2bbfc8c0b] Finished Training
wandb: 0.269 MB of 0.269 MB uploaded
wandb: 🚀 View run snowy-smoke-2 at: https://wandb.ai/mlops-byob-test/byob-test-project/runs/68f3fopy
2024-12-13 10:44:00.725 [248/train/861 (pid 57563)] [7aac132c-ffb3-446d-b757-9bd2bbfc8c0b] wandb: ⭐️ View project at: https://wandb.ai/mlops-byob-test/byob-test-project
2024-12-13 10:44:00.725 [248/train/861 (pid 57563)] [7aac132c-ffb3-446d-b757-9bd2bbfc8c0b] wandb: Synced 5 W&B file(s), 0 media file(s), 1 artifact file(s) and 1 other file(s)
2024-12-13 10:44:00.725 [248/train/861 (pid 57563)] [7aac132c-ffb3-446d-b757-9bd2bbfc8c0b] wandb: Find logs at: ./wandb/run-20241213_094153-68f3fopy/logs
2024-12-13 10:44:00.725 [248/train/861 (pid 57563)] [7aac132c-ffb3-446d-b757-9bd2bbfc8c0b] wandb: WARNING The new W&B backend becomes opt-out in version 0.18.0; try it out with `wandb.require("core")`! See https://wandb.me/wandb-core for more information.
2024-12-13 10:44:14.941 [248/train/861 (pid 57563)] [7aac132c-ffb3-446d-b757-9bd2bbfc8c0b] Task finished with exit code 0.
2024-12-13 10:44:23.272 [248/train/861 (pid 57563)] Task finished successfully.
2024-12-13 10:44:29.699 [248/evaluate/862 (pid 57602)] Task is starting.
2024-12-13 10:44:36.830 [248/evaluate/862 (pid 57602)] [pod t-2dd27ae3-4nfnm-lf94z] Task is starting (Pod is running, Container is running)...

2024-12-13 10:44:34.077 [248/evaluate/862 (pid 57602)] [pod t-2dd27ae3-4nfnm-lf94z] Setting up task environment.
2024-12-13 10:44:38.110 [248/evaluate/862 (pid 57602)] [pod t-2dd27ae3-4nfnm-lf94z] Downloading code package...
2024-12-13 10:44:39.071 [248/evaluate/862 (pid 57602)] [pod t-2dd27ae3-4nfnm-lf94z] Code package downloaded.
2024-12-13 10:44:39.536 [248/evaluate/862 (pid 57602)] [pod t-2dd27ae3-4nfnm-lf94z] Task is starting.
2024-12-13 10:44:40.385 [248/evaluate/862 (pid 57602)] [pod t-2dd27ae3-4nfnm-lf94z] Bootstrapping virtual environment...
2024-12-13 10:46:29.703 [248/evaluate/862 (pid 57602)] [pod t-2dd27ae3-4nfnm-lf94z] Environment bootstrapped.
2024-12-13 10:46:35.873 [248/evaluate/862 (pid 57602)] [pod t-2dd27ae3-4nfnm-lf94z] Evaluating on: cpu
2024-12-13 10:46:47.254 [248/evaluate/862 (pid 57602)] [pod t-2dd27ae3-4nfnm-lf94z] Accuracy of the network on the 10000 test images: 55 %
2024-12-13 10:46:59.750 [248/evaluate/862 (pid 57602)] [pod t-2dd27ae3-4nfnm-lf94z] Task finished with exit code 0.
2024-12-13 10:47:05.485 [248/evaluate/862 (pid 57602)] Task finished successfully.
2024-12-13 10:47:11.296 [248/upload_model_to_gcs/863 (pid 57615)] Task is starting.
2024-12-13 10:47:18.280 [248/upload_model_to_gcs/863 (pid 57615)] [pod t-21118d6a-kspgx-n4qf2] Task is starting (Pod is running, Container is running)...
2024-12-13 10:47:15.540 [248/upload_model_to_gcs/863 (pid 57615)] [pod t-21118d6a-kspgx-n4qf2] Setting up task environment.
2024-12-13 10:47:47.777 1 task is running: upload_model_to_gcs (1 running; 0 done).
2024-12-13 10:47:47.777 No tasks are waiting in the queue.
2024-12-13 10:47:47.777 end step has not started
2024-12-13 10:47:19.655 [248/upload_model_to_gcs/863 (pid 57615)] [pod t-21118d6a-kspgx-n4qf2] Downloading code package...
2024-12-13 10:47:20.516 [248/upload_model_to_gcs/863 (pid 57615)] [pod t-21118d6a-kspgx-n4qf2] Code package downloaded.
2024-12-13 10:47:20.974 [248/upload_model_to_gcs/863 (pid 57615)] [pod t-21118d6a-kspgx-n4qf2] Task is starting.
2024-12-13 10:47:21.772 [248/upload_model_to_gcs/863 (pid 57615)] [pod t-21118d6a-kspgx-n4qf2] Bootstrapping virtual environment...
2024-12-13 10:47:42.161 [248/upload_model_to_gcs/863 (pid 57615)] [pod t-21118d6a-kspgx-n4qf2] Environment bootstrapped.
2024-12-13 10:47:44.920 [248/upload_model_to_gcs/863 (pid 57615)] [pod t-21118d6a-kspgx-n4qf2] Uploading model to gcs
2024-12-13 10:47:45.351 [248/upload_model_to_gcs/863 (pid 57615)] [pod t-21118d6a-kspgx-n4qf2] <flow ImageClassifierFlow step upload_model_to_gcs> failed:
2024-12-13 10:47:46.538 [248/upload_model_to_gcs/863 (pid 57615)] [pod t-21118d6a-kspgx-n4qf2]     Internal error
2024-12-13 10:47:46.540 [248/upload_model_to_gcs/863 (pid 57615)] [pod t-21118d6a-kspgx-n4qf2] Traceback (most recent call last):
2024-12-13 10:47:46.540 [248/upload_model_to_gcs/863 (pid 57615)] [pod t-21118d6a-kspgx-n4qf2]   File "/metaflow/metaflow/cli.py", line 1139, in main
2024-12-13 10:47:46.540 [248/upload_model_to_gcs/863 (pid 57615)] [pod t-21118d6a-kspgx-n4qf2]     start(auto_envvar_prefix="METAFLOW", obj=state)
2024-12-13 10:47:46.540 [248/upload_model_to_gcs/863 (pid 57615)] [pod t-21118d6a-kspgx-n4qf2]   File "/metaflow/metaflow/tracing/tracing_modules.py", line 111, in wrapper_func
2024-12-13 10:47:46.540 [248/upload_model_to_gcs/863 (pid 57615)] [pod t-21118d6a-kspgx-n4qf2]     return func(*args, **kwargs)
2024-12-13 10:47:46.540 [248/upload_model_to_gcs/863 (pid 57615)] [pod t-21118d6a-kspgx-n4qf2]            ^^^^^^^^^^^^^^^^^^^^^
2024-12-13 10:47:46.540 [248/upload_model_to_gcs/863 (pid 57615)] [pod t-21118d6a-kspgx-n4qf2]   File "/metaflow/metaflow/_vendor/click/core.py", line 829, in __call__
2024-12-13 10:47:46.541 [248/upload_model_to_gcs/863 (pid 57615)] [pod t-21118d6a-kspgx-n4qf2]     return self.main(*args, **kwargs)
2024-12-13 10:47:46.541 [248/upload_model_to_gcs/863 (pid 57615)] [pod t-21118d6a-kspgx-n4qf2]            ^^^^^^^^^^^^^^^^^^^^^^^^^^
2024-12-13 10:47:46.541 [248/upload_model_to_gcs/863 (pid 57615)] [pod t-21118d6a-kspgx-n4qf2]   File "/metaflow/metaflow/_vendor/click/core.py", line 782, in main
2024-12-13 10:47:46.541 [248/upload_model_to_gcs/863 (pid 57615)] [pod t-21118d6a-kspgx-n4qf2]     rv = self.invoke(ctx)
2024-12-13 10:47:46.541 [248/upload_model_to_gcs/863 (pid 57615)] [pod t-21118d6a-kspgx-n4qf2]          ^^^^^^^^^^^^^^^^
2024-12-13 10:47:46.541 [248/upload_model_to_gcs/863 (pid 57615)] [pod t-21118d6a-kspgx-n4qf2]   File "/metaflow/metaflow/_vendor/click/core.py", line 1259, in invoke
2024-12-13 10:47:46.541 [248/upload_model_to_gcs/863 (pid 57615)] [pod t-21118d6a-kspgx-n4qf2]     return _process_result(sub_ctx.command.invoke(sub_ctx))
2024-12-13 10:47:46.541 [248/upload_model_to_gcs/863 (pid 57615)] [pod t-21118d6a-kspgx-n4qf2]                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2024-12-13 10:47:46.541 [248/upload_model_to_gcs/863 (pid 57615)] [pod t-21118d6a-kspgx-n4qf2]   File "/metaflow/metaflow/_vendor/click/core.py", line 1066, in invoke
2024-12-13 10:47:46.541 [248/upload_model_to_gcs/863 (pid 57615)] [pod t-21118d6a-kspgx-n4qf2]     return ctx.invoke(self.callback, **ctx.params)
2024-12-13 10:47:46.541 [248/upload_model_to_gcs/863 (pid 57615)] [pod t-21118d6a-kspgx-n4qf2]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2024-12-13 10:47:46.541 [248/upload_model_to_gcs/863 (pid 57615)] [pod t-21118d6a-kspgx-n4qf2]   File "/metaflow/metaflow/_vendor/click/core.py", line 610, in invoke
2024-12-13 10:47:46.541 [248/upload_model_to_gcs/863 (pid 57615)] [pod t-21118d6a-kspgx-n4qf2]     return callback(*args, **kwargs)
2024-12-13 10:47:46.541 [248/upload_model_to_gcs/863 (pid 57615)] [pod t-21118d6a-kspgx-n4qf2]            ^^^^^^^^^^^^^^^^^^^^^^^^^
2024-12-13 10:47:46.541 [248/upload_model_to_gcs/863 (pid 57615)] [pod t-21118d6a-kspgx-n4qf2]   File "/metaflow/metaflow/_vendor/click/decorators.py", line 21, in new_func
2024-12-13 10:47:46.541 [248/upload_model_to_gcs/863 (pid 57615)] [pod t-21118d6a-kspgx-n4qf2]     return f(get_current_context(), *args, **kwargs)
2024-12-13 10:47:46.541 [248/upload_model_to_gcs/863 (pid 57615)] [pod t-21118d6a-kspgx-n4qf2]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2024-12-13 10:47:46.541 [248/upload_model_to_gcs/863 (pid 57615)] [pod t-21118d6a-kspgx-n4qf2]   File "/metaflow/metaflow/cli.py", line 469, in step
2024-12-13 10:47:46.541 [248/upload_model_to_gcs/863 (pid 57615)] [pod t-21118d6a-kspgx-n4qf2]     task.run_step(
2024-12-13 10:47:46.541 [248/upload_model_to_gcs/863 (pid 57615)] [pod t-21118d6a-kspgx-n4qf2]   File "/metaflow/metaflow/task.py", line 656, in run_step
2024-12-13 10:47:46.541 [248/upload_model_to_gcs/863 (pid 57615)] [pod t-21118d6a-kspgx-n4qf2]     self._exec_step_function(step_func)
2024-12-13 10:47:46.541 [248/upload_model_to_gcs/863 (pid 57615)] [pod t-21118d6a-kspgx-n4qf2]   File "/metaflow/metaflow/task.py", line 62, in _exec_step_function
2024-12-13 10:47:46.541 [248/upload_model_to_gcs/863 (pid 57615)] [pod t-21118d6a-kspgx-n4qf2]     step_function()
2024-12-13 10:47:54.119 [248/upload_model_to_gcs/863 (pid 57615)] Kubernetes error:
2024-12-13 10:47:54.119 [248/upload_model_to_gcs/863 (pid 57615)] Error: gle/cloud/storage/client.py", line 907, in get_bucket
2024-12-13 10:47:54.211 [248/upload_model_to_gcs/863 (pid 57615)] bucket.reload(
2024-12-13 10:47:54.211 [248/upload_model_to_gcs/863 (pid 57615)] File "/metaflow/linux-64/7614dedb0e414d5/lib/python3.11/contextlib.py", line 81, in inner
2024-12-13 10:47:54.212 [248/upload_model_to_gcs/863 (pid 57615)] return func(*args, **kwds)
2024-12-13 10:47:54.212 [248/upload_model_to_gcs/863 (pid 57615)] ^^^^^^^^^^^^^^^^^^^
2024-12-13 10:47:54.212 [248/upload_model_to_gcs/863 (pid 57615)] File "/metaflow/linux-64/7614dedb0e414d5/lib/python3.11/site-packages/google/cloud/storage/bucket.py", line 1147, in reload
2024-12-13 10:47:54.212 [248/upload_model_to_gcs/863 (pid 57615)] super(Bucket, self).reload(
2024-12-13 10:47:54.212 [248/upload_model_to_gcs/863 (pid 57615)] File "/metaflow/linux-64/7614dedb0e414d5/lib/python3.11/site-packages/google/cloud/storage/_helpers.py", line 303, in reload
2024-12-13 10:47:54.212 [248/upload_model_to_gcs/863 (pid 57615)] api_response = client._get_resource(
2024-12-13 10:47:54.212 [248/upload_model_to_gcs/863 (pid 57615)] ^^^^^^^^^^^^^^^^^^^^^
2024-12-13 10:47:54.212 [248/upload_model_to_gcs/863 (pid 57615)] File "/metaflow/linux-64/7614dedb0e414d5/lib/python3.11/site-packages/google/cloud/storage/client.py", line 474, in _get_resource
2024-12-13 10:47:54.212 [248/upload_model_to_gcs/863 (pid 57615)] return self._connection.api_request(
2024-12-13 10:47:54.212 [248/upload_model_to_gcs/863 (pid 57615)] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2024-12-13 10:47:54.212 [248/upload_model_to_gcs/863 (pid 57615)] File "/metaflow/linux-64/7614dedb0e414d5/lib/python3.11/site-packages/google/cloud/storage/_http.py", line 90, in api_request
2024-12-13 10:47:54.212 [248/upload_model_to_gcs/863 (pid 57615)] return call()
2024-12-13 10:47:54.212 [248/upload_model_to_gcs/863 (pid 57615)] ^^^^^^
2024-12-13 10:47:54.212 [248/upload_model_to_gcs/863 (pid 57615)] File "/metaflow/linux-64/7614dedb0e414d5/lib/python3.11/site-packages/google/api_core/retry/retry_unary.py", line 293, in retry_wrapped_func
2024-12-13 10:47:54.212 [248/upload_model_to_gcs/863 (pid 57615)] return retry_target(
2024-12-13 10:47:54.212 [248/upload_model_to_gcs/863 (pid 57615)] ^^^^^^^^^^^^^
2024-12-13 10:47:54.212 [248/upload_model_to_gcs/863 (pid 57615)] File "/metaflow/linux-64/7614dedb0e414d5/lib/python3.11/site-packages/google/api_core/retry/retry_unary.py", line 153, in retry_target
2024-12-13 10:47:54.212 [248/upload_model_to_gcs/863 (pid 57615)] _retry_error_helper(
2024-12-13 10:47:54.212 [248/upload_model_to_gcs/863 (pid 57615)] File "/metaflow/linux-64/7614dedb0e414d5/lib/python3.11/site-packages/google/api_core/retry/retry_base.py", line 212, in _retry_error_helper
2024-12-13 10:47:54.212 [248/upload_model_to_gcs/863 (pid 57615)] raise final_exc from source_exc
2024-12-13 10:47:54.212 [248/upload_model_to_gcs/863 (pid 57615)] File "/metaflow/linux-64/7614dedb0e414d5/lib/python3.11/site-packages/google/api_core/retry/retry_unary.py", line 144, in retry_target
2024-12-13 10:47:54.212 [248/upload_model_to_gcs/863 (pid 57615)] result = target()
2024-12-13 10:47:54.212 [248/upload_model_to_gcs/863 (pid 57615)] ^^^^^^^^
2024-12-13 10:47:54.212 [248/upload_model_to_gcs/863 (pid 57615)] File "/metaflow/linux-64/7614dedb0e414d5/lib/python3.11/site-packages/google/cloud/_http/__init__.py", line 494, in api_request
2024-12-13 10:47:54.212 [248/upload_model_to_gcs/863 (pid 57615)] raise exceptions.from_http_response(response)
2024-12-13 10:47:54.212 [248/upload_model_to_gcs/863 (pid 57615)] google.api_core.exceptions.NotFound: 404 GET https://storage.googleapis.com/storage/v1/b/your-gcs-bucket-here?projection=noAcl&prettyPrint=false: The specified bucket does not exist.
2024-12-13 10:47:54.212 [248/upload_model_to_gcs/863 (pid 57615)]
2024-12-13 10:47:54.212 [248/upload_model_to_gcs/863 (pid 57615)] (exit code 1). This could be a transient error. Use @retry to retry.
2024-12-13 10:47:54.212 [248/upload_model_to_gcs/863 (pid 57615)]
2024-12-13 10:47:46.542 [248/upload_model_to_gcs/863 (pid 57615)] [pod t-21118d6a-kspgx-n4qf2]   File "/metaflow/image_classifier_flow.py", line 219, in upload_model_to_gcs
2024-12-13 10:47:46.542 [248/upload_model_to_gcs/863 (pid 57615)] [pod t-21118d6a-kspgx-n4qf2]     storage_client.store(
2024-12-13 10:47:46.542 [248/upload_model_to_gcs/863 (pid 57615)] [pod t-21118d6a-kspgx-n4qf2]   File "/metaflow/linux-64/7614dedb0e414d5/lib/python3.11/site-packages/mozmlops/cloud_storage_api_client.py", line 45, in store
2024-12-13 10:47:46.542 [248/upload_model_to_gcs/863 (pid 57615)] [pod t-21118d6a-kspgx-n4qf2]     bucket = client.get_bucket(self.gcs_bucket_name)
2024-12-13 10:47:46.542 [248/upload_model_to_gcs/863 (pid 57615)] [pod t-21118d6a-kspgx-n4qf2]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2024-12-13 10:47:46.542 [248/upload_model_to_gcs/863 (pid 57615)] [pod t-21118d6a-kspgx-n4qf2]   File "/metaflow/linux-64/7614dedb0e414d5/lib/python3.11/contextlib.py", line 81, in inner
2024-12-13 10:47:46.542 [248/upload_model_to_gcs/863 (pid 57615)] [pod t-21118d6a-kspgx-n4qf2]     return func(*args, **kwds)
2024-12-13 10:47:46.542 [248/upload_model_to_gcs/863 (pid 57615)] [pod t-21118d6a-kspgx-n4qf2]            ^^^^^^^^^^^^^^^^^^^
2024-12-13 10:47:46.542 [248/upload_model_to_gcs/863 (pid 57615)] [pod t-21118d6a-kspgx-n4qf2]   File "/metaflow/linux-64/7614dedb0e414d5/lib/python3.11/site-packages/google/cloud/storage/client.py", line 907, in get_bucket
2024-12-13 10:47:46.542 [248/upload_model_to_gcs/863 (pid 57615)] [pod t-21118d6a-kspgx-n4qf2]     bucket.reload(
2024-12-13 10:47:46.542 [248/upload_model_to_gcs/863 (pid 57615)] [pod t-21118d6a-kspgx-n4qf2]   File "/metaflow/linux-64/7614dedb0e414d5/lib/python3.11/contextlib.py", line 81, in inner
2024-12-13 10:47:46.542 [248/upload_model_to_gcs/863 (pid 57615)] [pod t-21118d6a-kspgx-n4qf2]     return func(*args, **kwds)
2024-12-13 10:47:46.542 [248/upload_model_to_gcs/863 (pid 57615)] [pod t-21118d6a-kspgx-n4qf2]            ^^^^^^^^^^^^^^^^^^^
2024-12-13 10:47:46.542 [248/upload_model_to_gcs/863 (pid 57615)] [pod t-21118d6a-kspgx-n4qf2]   File "/metaflow/linux-64/7614dedb0e414d5/lib/python3.11/site-packages/google/cloud/storage/bucket.py", line 1147, in reload
2024-12-13 10:47:46.542 [248/upload_model_to_gcs/863 (pid 57615)] [pod t-21118d6a-kspgx-n4qf2]     super(Bucket, self).reload(
2024-12-13 10:47:46.542 [248/upload_model_to_gcs/863 (pid 57615)] [pod t-21118d6a-kspgx-n4qf2]   File "/metaflow/linux-64/7614dedb0e414d5/lib/python3.11/site-packages/google/cloud/storage/_helpers.py", line 303, in reload
2024-12-13 10:47:46.542 [248/upload_model_to_gcs/863 (pid 57615)] [pod t-21118d6a-kspgx-n4qf2]     api_response = client._get_resource(
2024-12-13 10:47:46.542 [248/upload_model_to_gcs/863 (pid 57615)] [pod t-21118d6a-kspgx-n4qf2]                    ^^^^^^^^^^^^^^^^^^^^^
2024-12-13 10:47:46.542 [248/upload_model_to_gcs/863 (pid 57615)] [pod t-21118d6a-kspgx-n4qf2]   File "/metaflow/linux-64/7614dedb0e414d5/lib/python3.11/site-packages/google/cloud/storage/client.py", line 474, in _get_resource
2024-12-13 10:47:46.542 [248/upload_model_to_gcs/863 (pid 57615)] [pod t-21118d6a-kspgx-n4qf2]     return self._connection.api_request(
2024-12-13 10:47:46.542 [248/upload_model_to_gcs/863 (pid 57615)] [pod t-21118d6a-kspgx-n4qf2]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2024-12-13 10:47:46.542 [248/upload_model_to_gcs/863 (pid 57615)] [pod t-21118d6a-kspgx-n4qf2]   File "/metaflow/linux-64/7614dedb0e414d5/lib/python3.11/site-packages/google/cloud/storage/_http.py", line 90, in api_request
2024-12-13 10:47:46.542 [248/upload_model_to_gcs/863 (pid 57615)] [pod t-21118d6a-kspgx-n4qf2]     return call()
2024-12-13 10:47:46.542 [248/upload_model_to_gcs/863 (pid 57615)] [pod t-21118d6a-kspgx-n4qf2]            ^^^^^^
2024-12-13 10:47:46.542 [248/upload_model_to_gcs/863 (pid 57615)] [pod t-21118d6a-kspgx-n4qf2]   File "/metaflow/linux-64/7614dedb0e414d5/lib/python3.11/site-packages/google/api_core/retry/retry_unary.py", line 293, in retry_wrapped_func
2024-12-13 10:47:46.543 [248/upload_model_to_gcs/863 (pid 57615)] [pod t-21118d6a-kspgx-n4qf2]     return retry_target(
2024-12-13 10:47:46.543 [248/upload_model_to_gcs/863 (pid 57615)] [pod t-21118d6a-kspgx-n4qf2]            ^^^^^^^^^^^^^
2024-12-13 10:47:46.543 [248/upload_model_to_gcs/863 (pid 57615)] [pod t-21118d6a-kspgx-n4qf2]   File "/metaflow/linux-64/7614dedb0e414d5/lib/python3.11/site-packages/google/api_core/retry/retry_unary.py", line 153, in retry_target
2024-12-13 10:47:46.543 [248/upload_model_to_gcs/863 (pid 57615)] [pod t-21118d6a-kspgx-n4qf2]     _retry_error_helper(
2024-12-13 10:47:46.543 [248/upload_model_to_gcs/863 (pid 57615)] [pod t-21118d6a-kspgx-n4qf2]   File "/metaflow/linux-64/7614dedb0e414d5/lib/python3.11/site-packages/google/api_core/retry/retry_base.py", line 212, in _retry_error_helper
2024-12-13 10:47:46.543 [248/upload_model_to_gcs/863 (pid 57615)] [pod t-21118d6a-kspgx-n4qf2]     raise final_exc from source_exc
2024-12-13 10:47:46.543 [248/upload_model_to_gcs/863 (pid 57615)] [pod t-21118d6a-kspgx-n4qf2]   File "/metaflow/linux-64/7614dedb0e414d5/lib/python3.11/site-packages/google/api_core/retry/retry_unary.py", line 144, in retry_target
2024-12-13 10:47:46.543 [248/upload_model_to_gcs/863 (pid 57615)] [pod t-21118d6a-kspgx-n4qf2]     result = target()
2024-12-13 10:47:46.543 [248/upload_model_to_gcs/863 (pid 57615)] [pod t-21118d6a-kspgx-n4qf2]              ^^^^^^^^
2024-12-13 10:47:46.543 [248/upload_model_to_gcs/863 (pid 57615)] [pod t-21118d6a-kspgx-n4qf2]   File "/metaflow/linux-64/7614dedb0e414d5/lib/python3.11/site-packages/google/cloud/_http/__init__.py", line 494, in api_request
2024-12-13 10:47:46.543 [248/upload_model_to_gcs/863 (pid 57615)] [pod t-21118d6a-kspgx-n4qf2]     raise exceptions.from_http_response(response)
2024-12-13 10:47:46.543 [248/upload_model_to_gcs/863 (pid 57615)] [pod t-21118d6a-kspgx-n4qf2] google.api_core.exceptions.NotFound: 404 GET https://storage.googleapis.com/storage/v1/b/your-gcs-bucket-here?projection=noAcl&prettyPrint=false: The specified bucket does not exist.
2024-12-13 10:47:46.543 [248/upload_model_to_gcs/863 (pid 57615)] [pod t-21118d6a-kspgx-n4qf2]
2024-12-13 10:47:54.573 [248/upload_model_to_gcs/863 (pid 57615)] Task failed.
2024-12-13 10:47:55.559 Workflow failed.
2024-12-13 10:47:55.559 Terminating 0 active tasks...
2024-12-13 10:47:55.559 Flushing logs...
    Step failure:
    Step upload_model_to_gcs (task-id 863) failed.

```
