Updates and bug fixes. Add sagemaker launch script for papers100m conversion job
thvasilo committed Jan 10, 2025
1 parent 899479f commit 63ca7e6
Showing 7 changed files with 143 additions and 51 deletions.
52 changes: 25 additions & 27 deletions examples/sagemaker-pipelines-graphbolt/README.md
@@ -90,7 +90,7 @@ aws ec2 run-instances \
--instance-type "m6in.4xlarge" \
--key-name my-key-name \
--block-device-mappings '[{
"DeviceName": "/dev/sdf",
"DeviceName": "/dev/sda1",
"Ebs": {
"VolumeSize": 300,
"VolumeType": "gp3",
@@ -108,15 +108,14 @@ Once logged in, you can set up your Python environment to run GraphStorm
```bash
conda init
eval $SHELL
conda create -y --name gsf python=3.10
conda activate gsf
# Available on the DLAMI, otherwise create a new conda env
conda activate pytorch

# Install dependencies
pip install sagemaker boto3 ogb pyarrow
pip install sagemaker[local] boto3 ogb pyarrow

# Clone the GraphStorm repository to access the example code
git clone https://github.com/awslabs/graphstorm.git ~/graphstorm
cd ~/graphstorm/examples/sagemaker-pipelines-graphbolt
```
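
Before moving on, you can optionally sanity-check the environment from Python. The snippet below is just a quick check sketch; it assumes the instance profile or your AWS credentials and a default region are already configured:

```python
import boto3
import sagemaker

# Verify the SageMaker SDK is importable and that credentials/region resolve
session = sagemaker.Session()
print("SageMaker SDK version:", sagemaker.__version__)
print("Default region:", session.boto_region_name)
print("Account:", boto3.client("sts").get_caller_identity()["Account"])
```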

### Download and prepare datasets
@@ -136,12 +135,11 @@ You'll download the smaller-scale [ogbn-arxiv](https://ogb.stanford.edu/docs/no
BUCKET_NAME=<your-s3-bucket>
```


You will use this script to directly download, transform and upload the data to S3:


```bash
python convert_ogb_arxiv_to_gconstruct.py \
cd ~/graphstorm/examples/sagemaker-pipelines-graphbolt
python convert_arxiv_to_gconstruct.py \
--output-prefix s3://$BUCKET_NAME/ogb-arxiv-input
```
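
To spot-check what the conversion script uploaded, you can list the prefix with boto3. This is a minimal sketch and assumes `BUCKET_NAME` is exported in your shell environment:

```python
import os

import boto3

bucket = os.environ["BUCKET_NAME"]  # the bucket you set above
s3 = boto3.client("s3")

# List the GConstruct input files under the ogb-arxiv-input prefix
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=bucket, Prefix="ogb-arxiv-input/"):
    for obj in page.get("Contents", []):
        print(f"{obj['Size']:>12}  {obj['Key']}")
```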

@@ -188,19 +186,23 @@ bash build_and_push_papers100M_image.sh
# $ACCOUNT_ID.dkr.ecr.$REGION.amazonaws.com/papers100m-processor

# Run a SageMaker job to do the processing and upload the output to S3
SAGEMAKER_EXECUTION_ROLE=<your-sagemaker-execution-role>
SAGEMAKER_EXECUTION_ROLE_ARN=<your-sagemaker-execution-role-arn>
ACCOUNT_ID=<your-aws-account-id>
REGION=us-east-1
python sagemaker_convert_papers100M.py \

aws configure set region $REGION
python sagemaker_convert_papers100m.py \
--output-bucket $BUCKET_NAME \
--execution-role-arn $SAGEMAKER_EXECUTION_ROLE \
--execution-role-arn $SAGEMAKER_EXECUTION_ROLE_ARN \
--region $REGION \
--instance-type ml.m5.4xlarge \
--image-uri $ACCOUNT_ID.dkr.ecr.$REGION.amazonaws.com/papers100m-processor
```

This will produce the processed data at `s3://$BUCKET_NAME/ogb-papers100M-input` which can then be used as input to GraphStorm.
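
While the conversion job runs, you can check its status from Python. The sketch below just lists the most recent processing jobs; the exact job name is whatever `sagemaker_convert_papers100m.py` assigns:

```python
import boto3

sm = boto3.client("sagemaker", region_name="us-east-1")

# Show the five most recently created processing jobs and their status
response = sm.list_processing_jobs(
    SortBy="CreationTime", SortOrder="Descending", MaxResults=5
)
for job in response["ProcessingJobSummaries"]:
    print(job["ProcessingJobName"], "-", job["ProcessingJobStatus"])
```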

> NOTE: Ensure your instance IAM profile is allowed to perform `iam:GetRole` and `iam:GetPolicy` on your `SAGEMAKER_EXECUTION_ROLE_ARN`.
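
If the job fails at launch with an access error, you can confirm the role is visible to your instance credentials. This is a sketch only; the role name below is a placeholder for the name portion of your `SAGEMAKER_EXECUTION_ROLE_ARN`:

```python
import boto3
from botocore.exceptions import ClientError

iam = boto3.client("iam")
role_name = "MySageMakerExecutionRole"  # placeholder: name part of SAGEMAKER_EXECUTION_ROLE_ARN

try:
    role = iam.get_role(RoleName=role_name)
    print("Execution role resolves to:", role["Role"]["Arn"])
except ClientError as err:
    # An AccessDenied error here usually means the instance profile lacks iam:GetRole
    print("Could not read the role:", err.response["Error"]["Code"])
```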

#### [Optional] Prepare the ogbn-papers100M dataset locally

@@ -220,7 +222,8 @@ mkdir ~/papers100M-raw-data
cd ~/papers100M-raw-data
axel -n 16 http://snap.stanford.edu/ogb/data/nodeproppred/papers100M-bin.zip
ripunzip unzip-file papers100M-bin.zip
ripunzip unzip-file papers100M-bin/raw/data.npz && rm papers100M-bin/raw/data.npz
cd papers100M-bin/raw
ripunzip unzip-file data.npz && rm data.npz

# Install process script dependencies
python -m pip install \
@@ -232,6 +235,7 @@


# Process and upload to S3, this will take around 20 minutes
cd ~/graphstorm/examples/sagemaker-pipelines-graphbolt
python convert_ogb_papers100m_to_gconstruct.py \
--input-dir ~/papers100M-raw-data \
--output-dir s3://$BUCKET_NAME/ogb-papers100M-input
@@ -248,20 +252,13 @@ sudo apt update
sudo apt install -y docker.io
docker -v

# Enter you account ID here
ACCOUNT_ID=<aws-account-id>
REGION=us-east-1

cd ~/graphstorm

bash ./docker/build_graphstorm_image.sh --environment sagemaker --device cpu

bash docker/push_graphstorm_image.sh -e sagemaker -r $REGION -a $ACCOUNT_ID -d cpu
# This will push an image to
# ${ACCOUNT_ID}.dkr.ecr.us-east-1.amazonaws.com/graphstorm:sagemaker-cpu

# Install sagemaker with support for local mode
pip install sagemaker[local]
```
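
To confirm the image landed in ECR before wiring it into a pipeline, a short boto3 check works. This sketch assumes the repository name and tag produced by the push script above:

```python
import boto3

ecr = boto3.client("ecr", region_name="us-east-1")

# Confirm the graphstorm:sagemaker-cpu image exists in your account's registry
response = ecr.describe_images(
    repositoryName="graphstorm",
    imageIds=[{"imageTag": "sagemaker-cpu"}],
)
detail = response["imageDetails"][0]
print("Pushed at:", detail["imagePushedAt"], "size (bytes):", detail["imageSizeInBytes"])
```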

Next, you will create a SageMaker Pipeline to run the jobs that are necessary to train GNN models with GraphStorm.
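
Under the hood, the deploy scripts call `create_sm_pipeline.py`, which assembles these jobs into a SageMaker Pipeline definition. The sketch below is not the GraphStorm implementation, only a minimal illustration of the idea; the role ARN, image URI, and entry point script are placeholders:

```python
from sagemaker.processing import ScriptProcessor
from sagemaker.workflow.parameters import ParameterString
from sagemaker.workflow.pipeline import Pipeline
from sagemaker.workflow.steps import ProcessingStep

# Placeholders: substitute your own role, image, and entry point
role_arn = "arn:aws:iam::111122223333:role/MySageMakerExecutionRole"
image_uri = "111122223333.dkr.ecr.us-east-1.amazonaws.com/graphstorm:sagemaker-cpu"

input_s3 = ParameterString(name="InputDataS3", default_value="s3://my-bucket/ogb-arxiv-input")

processor = ScriptProcessor(
    image_uri=image_uri,
    command=["python3"],
    role=role_arn,
    instance_type="ml.m5.4xlarge",
    instance_count=1,
)

# A single graph-construction step; the real pipeline chains partition,
# training, and inference steps after it in the same way
construct_step = ProcessingStep(
    name="GraphConstruction",
    processor=processor,
    code="my_gconstruct_entrypoint.py",  # hypothetical entry point
    job_arguments=["--input-data", input_s3],
)

pipeline = Pipeline(
    name="example-gs-pipeline",
    parameters=[input_s3],
    steps=[construct_step],
)
pipeline.upsert(role_arn=role_arn)  # create or update the pipeline definition
```

Because instance types, image URIs, and the list of jobs are passed in as arguments, the same pipeline template can serve both the ogbn-arxiv and papers-100M runs.
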
@@ -276,10 +273,10 @@ In this section, you will create a [Sagemaker Pipeline](https://docs.aws.amazon.

```bash
PIPELINE_NAME="ogbn-arxiv-gs-pipeline"
BUCKET_NAME="my-s3-bucket"

bash deploy_arxiv_pipeline.sh \
--account "<aws-account-id>" \
--bucket-name $BUCKET_NAME --role "<execution-role>" \
--account $ACCOUNT_ID \
--bucket-name $BUCKET_NAME --execution-role $SAGEMAKER_EXECUTION_ROLE_ARN \
--pipeline-name $PIPELINE_NAME \
--use-graphbolt false
```
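
After the deploy script finishes, you can confirm the pipeline was registered. A small sketch using boto3, assuming the same region and pipeline name:

```python
import boto3

sm = boto3.client("sagemaker", region_name="us-east-1")

# Confirm the pipeline exists and see when it was last modified
desc = sm.describe_pipeline(PipelineName="ogbn-arxiv-gs-pipeline")
print(desc["PipelineName"], "-", desc["PipelineStatus"], "-", desc["LastModifiedTime"])
```
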
@@ -303,8 +300,8 @@ The ogbn-arxiv data are small enough that you can execute the pipeline locally.

```bash
PIPELINE_NAME="ogbn-arxiv-gs-pipeline"
cd ~/graphstorm/sagemaker/pipeline
python execute_sm_pipeline.py \

python ~/graphstorm/sagemaker/pipeline/execute_sm_pipeline.py \
--pipeline-name $PIPELINE_NAME \
--region us-east-1 \
--local-execution | tee arxiv-local-logs.txt
@@ -382,7 +379,7 @@ bash deploy_arxiv_pipeline.sh \
--pipeline-name $PIPELINE_NAME \
--use-graphbolt true
# Execute the pipeline locally
python execute_sm_pipeline.py \
python ~/graphstorm/sagemaker/pipeline/execute_sm_pipeline.py \
--pipeline-name $PIPELINE_NAME \
--region us-east-1 \
--local-execution | tee arxiv-local-gb-logs.txt
@@ -439,6 +436,7 @@ Now you are ready to deploy your initial pipeline for papers-100M

```bash
PIPELINE_NAME="ogb-papers100M-pipeline"
cd ~/graphstorm/examples/sagemaker-pipelines-graphbolt/
bash deploy_papers100M_pipeline.sh \
--account <aws-account-id> \
--bucket-name <s3-bucket> --execution-role <execution-role> \
@@ -449,7 +447,7 @@
Execute the pipeline and let it run in the background.

```bash
python execute_sm_pipeline.py \
python ~/graphstorm/sagemaker/pipeline/execute_sm_pipeline.py \
--pipeline-name $PIPELINE_NAME \
--region us-east-1 \
--async-execution
@@ -469,7 +467,7 @@ bash deploy_papers100M_pipeline.sh \
--use-graphbolt true

# Execute the GraphBolt-enabled pipeline on SageMaker
python execute_sm_pipeline.py \
python ~/graphstorm/sagemaker/pipeline/execute_sm_pipeline.py \
--pipeline-name $PIPELINE_NAME \
--region us-east-1 \
--async-execution
@@ -100,7 +100,7 @@ def convert_ogbn_arxiv(output_prefix: str):
"node_id_col": "nid",
"node_type": "node",
"format": {"name": "parquet"},
"files": [f"{output_prefix}/nodes/paper/nodes.parquet"],
"files": ["nodes/paper/nodes.parquet"],
"features": [
{
"feature_col": "feat",
Expand All @@ -118,9 +118,9 @@ def convert_ogbn_arxiv(output_prefix: str):
"task_type": "classification",
"custom_split_filenames": {
"column": "nid",
"train": f"{output_prefix}/splits/train_idx.parquet",
"valid": f"{output_prefix}/splits/valid_idx.parquet",
"test": f"{output_prefix}/splits/test_idx.parquet",
"train": "splits/train_idx.parquet",
"valid": "splits/valid_idx.parquet",
"test": "splits/test_idx.parquet",
},
"label_stats_type": "frequency_cnt",
}
Expand All @@ -133,14 +133,14 @@ def convert_ogbn_arxiv(output_prefix: str):
"dest_id_col": "dst",
"relation": ["node", "cites", "node"],
"format": {"name": "parquet"},
"files": [f"{output_prefix}/edges/paper-cites-paper/edges.parquet"],
"files": ["edges/paper-cites-paper/edges.parquet"],
},
{
"source_id_col": "dst",
"dest_id_col": "src",
"relation": ["node", "cites-rev", "node"],
"format": {"name": "parquet"},
"files": [f"{output_prefix}/edges/paper-cites-paper/edges.parquet"],
"files": ["/edges/paper-cites-paper/edges.parquet"],
},
],
}
Expand All @@ -160,4 +160,4 @@ def convert_ogbn_arxiv(output_prefix: str):
if __name__ == "__main__":
args = parse_args()

convert_ogbn_arxiv(args.output_prefix)
convert_ogbn_arxiv(args.output_s3_prefix)
@@ -15,6 +15,7 @@
Convert papers100M data and prepare for input to GConstruct
"""

import argparse
import gzip
import json
@@ -87,7 +88,7 @@ def process_data(input_dir, output_dir, filesystem):
num_nodes, num_features = node_feat.shape
num_edges = edge_index.shape[1]
logging.info(
f"Node features shape: {node_feat.shape:,}, Number of edges: {num_edges:,}"
f"Node features shape: {node_feat.shape}, Number of edges: {num_edges:,}"
)

# Define schemas for nodes and edges
12 changes: 6 additions & 6 deletions examples/sagemaker-pipelines-graphbolt/deploy_arxiv_pipeline.sh
@@ -26,15 +26,15 @@ parse_params() {
case "${1-}" in
-h | --help) usage ;;
-x | --verbose) set -x ;;
-r | --role)
ROLE="${2-}"
-r | --execution-role)
ROLE_ARN="${2-}"
shift
;;
-a | --account)
ACCOUNT="${2-}"
shift
;;
-b | --bucket)
-b | --bucket-name)
BUCKET_NAME="${2-}"
shift
;;
@@ -54,8 +54,8 @@

# check required params and arguments
[[ -z "${ACCOUNT-}" ]] && die "Missing required parameter: -a/--account <aws-account-id>"
[[ -z "${BUCKET-}" ]] && die "Missing required parameter: -b/--bucket <s3-bucket>"
[[ -z "${ROLE-}" ]] && die "Missing required parameter: -r/--role <execution-role-arn>"
[[ -z "${BUCKET_NAME-}" ]] && die "Missing required parameter: -b/--bucket <s3-bucket>"
[[ -z "${ROLE_ARN-}" ]] && die "Missing required parameter: -r/--execution-role <execution-role-arn>"
[[ -z "${USE_GRAPHBOLT-}" ]] && die "Missing required parameter: -g/--use-graphbolt <true|false>"

return 0
@@ -102,6 +102,7 @@ fi

python3 $SCRIPT_DIR/../../sagemaker/pipeline/create_sm_pipeline.py \
--cpu-instance-type ${TRAIN_CPU_INSTANCE} \
--execution-role "${ROLE_ARN}" \
--graph-construction-args "--num-processes 8" \
--graph-construction-instance-type ${GCONSTRUCT_INSTANCE} \
--graph-construction-config-filename ${GCONSTRUCT_CONFIG} \
@@ -119,7 +120,6 @@ python3 $SCRIPT_DIR/../../sagemaker/pipeline/create_sm_pipeline.py \
--partition-output-json ${PARTITION_OUTPUT_JSON} \
--partition-algorithm ${PARTITION_ALGORITHM} \
--region ${REGION} \
--role "${ROLE}" \
--train-on-cpu \
--train-inference-task ${TASK_TYPE} \
--train-yaml-s3 "${TRAIN_YAML_S3}" \
@@ -27,15 +27,15 @@ parse_params() {
case "${1-}" in
-h | --help) usage ;;
-x | --verbose) set -x ;;
-r | --role)
ROLE="${2-}"
-r | --execution-role)
ROLE_ARN="${2-}"
shift
;;
-a | --account)
ACCOUNT="${2-}"
shift
;;
-b | --bucket)
-b | --bucket-name)
BUCKET_NAME="${2-}"
shift
;;
@@ -56,7 +56,7 @@
# check required params and arguments
[[ -z "${ACCOUNT-}" ]] && die "Missing required parameter: -a/--account <aws-account-id>"
[[ -z "${BUCKET_NAME-}" ]] && die "Missing required parameter: -b/--bucket <s3-bucket>"
[[ -z "${ROLE-}" ]] && die "Missing required parameter: -r/--role <execution-role-arn>"
[[ -z "${ROLE_ARN-}" ]] && die "Missing required parameter: -r/--execution-role <execution-role-arn>"
[[ -z "${USE_GRAPHBOLT-}" ]] && die "Missing required parameter: -g/--use-graphbolt <true|false>"

return 0
Expand All @@ -78,6 +78,7 @@ fi

JOBS_TO_RUN="gconstruct train inference"

DATASET_S3_PATH="s3://${BUCKET_NAME}/papers-100M-input"
OUTPUT_PATH="s3://${BUCKET_NAME}/pipelines-output"
GRAPH_NAME="papers-100M"
INSTANCE_COUNT="4"
@@ -91,7 +92,7 @@ GSF_CPU_IMAGE_URI=${ACCOUNT}.dkr.ecr.$REGION.amazonaws.com/graphstorm:sagemaker-cpu
GSF_GPU_IMAGE_URI=${ACCOUNT}.dkr.ecr.$REGION.amazonaws.com/graphstorm:sagemaker-gpu

GCONSTRUCT_CONFIG="gconstruct_config_papers100m.json"
GRAPH_CONSTRUCTION_ARGS="--add-reverse-edges False --num-processes 16"
GRAPH_CONSTRUCTION_ARGS="--num-processes 16"

PARTITION_OUTPUT_JSON="metadata.json"
PARTITION_OUTPUT_JSON="$GRAPH_NAME.json"
@@ -111,7 +112,7 @@ if [[ -z "${PIPELINE_NAME-}" ]]; then
fi

python3 $SCRIPT_DIR/../../sagemaker/pipeline/create_sm_pipeline.py \
--execution-role "${ROLE}" \
--execution-role "${ROLE_ARN}" \
--cpu-instance-type ${CPU_INSTANCE_TYPE} \
--gpu-instance-type ${TRAIN_GPU_INSTANCE} \
--graph-construction-args "${GRAPH_CONSTRUCTION_ARGS}" \
@@ -124,7 +125,7 @@ python3 $SCRIPT_DIR/../../sagemaker/pipeline/create_sm_pipeline.py \
--inference-yaml-s3 "${INFERENCE_YAML_S3}" \
--input-data-s3 "${DATASET_S3_PATH}" \
--instance-count ${INSTANCE_COUNT} \
--jobs-to-run "${JOBS_TO_RUN}" \
--jobs-to-run ${JOBS_TO_RUN} \
--num-trainers ${NUM_TRAINERS} \
--output-prefix-s3 "${OUTPUT_PATH}" \
--pipeline-name "${PIPELINE_NAME}" \
4 changes: 1 addition & 3 deletions examples/sagemaker-pipelines-graphbolt/process_papers100M.sh
@@ -5,7 +5,6 @@ trap cleanup SIGINT SIGTERM ERR EXIT
cleanup() {
trap - SIGINT SIGTERM ERR EXIT
# script cleanup here
kill $DISK_USAGE_PID > /dev/null 2>&1 || true
}

# Download and unzip data in parallel
@@ -16,7 +15,6 @@ cd $TEMP_DATA_PATH || exit 1

echo "Will execute script $1 with output prefix $2"


echo "$(date -u '+%Y-%m-%dT%H:%M:%SZ'): Downloading files using axel, this will take at least 10 minutes depending on network speed"
time axel -n 16 --quiet http://snap.stanford.edu/ogb/data/nodeproppred/papers100M-bin.zip

@@ -29,6 +27,6 @@ time ripunzip unzip-file data.npz && rm data.npz

# Run the processing script
echo "$(date -u '+%Y-%m-%dT%H:%M:%SZ'): Processing data and uploading to S3, this will take around 20 minutes"
python3 /opt/ml/code/"$1" \
time python3 /opt/ml/code/"$1" \
--input-dir "$TEMP_DATA_PATH/papers100M-bin/" \
--output-prefix "$2"