Migrate data-processing clusters to us-central1 #1092

Open
13 of 17 tasks
stephen-soltesz opened this issue Jun 9, 2022 · 12 comments
stephen-soltesz (Contributor) commented Jun 9, 2022

The data-processing clusters in mlab-sandbox and mlab-staging are in us-east1, while the archive-measurement-lab bucket is in us-central1. These clusters should be redeployed to us-central1, and their output buckets recreated in us-central1. Since we want the GKE clusters to be managed by Terraform, we will recreate the production cluster as well.

  • create new data-processing cluster in us-central1 for sandbox & staging
  • create new etl-$PROJECT replacement bucket in us-central1
  • create new etl-$PROJECT-us-central1 buckets & update the etl & gardener configurations to use them
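As a sketch, the replacement buckets could be created with gsutil; the project below is one of the source values, and the bucket name follows the etl-$PROJECT-us-central1 pattern from the checklist:

```shell
# Sketch: create a new regional ETL bucket in us-central1. The bucket name
# follows the etl-$PROJECT-us-central1 pattern from the checklist above.
PROJECT=mlab-sandbox
BUCKET="etl-${PROJECT}-us-central1"

# Create a single-region bucket in us-central1 under the target project.
gsutil mb -p "${PROJECT}" -l us-central1 "gs://${BUCKET}"
```

The same pattern would repeat per project (mlab-sandbox, mlab-staging, mlab-oti).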

Production deployment

  • Tag terraform-support repo to create data-pipeline cluster in production
  • Import service account into TF by hand.
    terraform import module.data-pipeline.google_service_account.stats_pipeline  \
        projects/mlab-oti/serviceAccounts/[email protected]
    
  • Add role binding to new GKE cluster:
    kubectl create clusterrolebinding additional-cluster-admins  --clusterrole=cluster-admin  \
        --user=<id>@cloudbuild.gserviceaccount.com
    
  • Update Cloud Build (CB) substitutions for the six data pipeline service repos.
  • Tag all six data pipeline service repos to deploy to data-pipeline cluster
  • Create DNS record for prometheus-data-pipeline.mlab-oti.measurementlab.net using Cluster LB address
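A hedged sketch of the DNS step, assuming the record is managed with Cloud DNS; the zone name and LB address below are placeholders, not the actual values:

```shell
# Hypothetical sketch: point prometheus-data-pipeline.mlab-oti.measurementlab.net
# at the new cluster's load balancer. ZONE and LB_IP are placeholders.
ZONE="measurementlab-net"    # assumed Cloud DNS managed-zone name
LB_IP="203.0.113.10"         # placeholder: the cluster LB address
RECORD="prometheus-data-pipeline.mlab-oti.measurementlab.net."

gcloud dns record-sets create "${RECORD}" \
    --zone="${ZONE}" --type=A --ttl=300 --rrdatas="${LB_IP}" \
    --project=mlab-oti
```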

Clean up tasks after deployments:

  • Remove services from sandbox & staging data-processing cluster
  • Remove services from production data-processing cluster
  • Remove prometheus-data-processing.$PROJECT.* DNS records
  • Remove old data sources from prometheus-support & Grafana
  • Remove etl-$PROJECT intermediate buckets
  • Remove data-processing clusters

Consider

  • recreating etl-$PROJECT bucket in us-central & update etl parser to use the short name again
  • recreating the archive-$PROJECT buckets to be single-region (not multi-region) in us-central
The autolabel bot added the review/triage label (Team should review and assign priority) on Jun 9, 2022.
stephen-soltesz (Author) commented:

Due to the v2 data-pipeline cluster location in some projects, data must be transferred between regions in the sandbox and staging projects. This can be eliminated by placing these clusters in the us-central1 region.

Project       Source bucket             Bucket region  data-processing region
mlab-oti      archive-measurement-lab   us-central1    us-central1
mlab-staging  archive-measurement-lab   us-central1    us-east1
mlab-sandbox  archive-measurement-lab   us-central1    us-east1

Bucket            Created                   Location type  Region
etl-mlab-sandbox  Jun 13, 2017, 3:22:04 PM  Region         us-east1
etl-mlab-staging  Jul 31, 2020, 4:03:17 PM  Region         us-east1
etl-mlab-oti      Aug 6, 2020, 7:48:10 PM   Region         us-central1

Since this only requires updates to the sandbox and staging projects, the disruption will be minimal.

Changing the data-processing cluster locations will be easy. Changing the output target buckets may not be.
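One way to confirm the region mismatch, as a sketch: query each bucket's location constraint with gsutil (bucket names taken from the table above):

```shell
# Sketch: print the location of each etl bucket to confirm the region mismatch.
for BUCKET in etl-mlab-sandbox etl-mlab-staging etl-mlab-oti; do
  echo -n "${BUCKET}: "
  gsutil ls -L -b "gs://${BUCKET}" | grep "Location constraint"
done
```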

stephen-soltesz (Author) commented Aug 11, 2022

The data-processing cluster includes multiple node pools for service-specific workloads:

Pool                 Status  Version           Nodes             Machine type
default-pool         Ok      1.21.12-gke.2200  1 (0-1 per zone)  n1-standard-4
downloader-pool      Ok      1.21.12-gke.2200  3 (1 per zone)    n1-standard-2
parser-pool          Ok      1.21.12-gke.2200  8 (2-3 per zone)  n1-standard-16
prometheus-pool      Ok      1.21.12-gke.2200  3 (1 per zone)    n1-standard-4
stats-pipeline-pool  Ok      1.21.12-gke.2200  3 (1 per zone)    n2-standard-8

The commands originally used to create these node pools vary, and are likely dated or incomplete.
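For reference, a hedged sketch of recreating one such pool with gcloud; the cluster name, region, and autoscaling bounds are assumptions based on the table above, not the original commands:

```shell
# Hypothetical sketch: recreate the parser-pool. Cluster name, region, and
# autoscaling bounds are assumptions derived from the node-pool table above.
CLUSTER=data-processing
POOL=parser-pool

gcloud container node-pools create "${POOL}" \
    --cluster="${CLUSTER}" \
    --region=us-east1 \
    --machine-type=n1-standard-16 \
    --enable-autoscaling --min-nodes=2 --max-nodes=3 \
    --project=mlab-sandbox
```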

stephen-soltesz (Author) commented Aug 11, 2022

Repositories with services on the data-processing cluster (one per node pool):

  • etl
  • etl-gardener
  • prometheus-support
  • stats-pipeline
  • downloader
  • autoloader

stephen-soltesz (Author) commented:
This should be completed using Terraform, not manual, ad hoc recreation.
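As a hedged sketch, the replacement cluster could be declared in Terraform roughly like this; the resource names, project, region, and node counts are placeholders, not the actual module configuration:

```hcl
# Hypothetical sketch of a Terraform declaration for the replacement cluster.
# Names, project, and sizes are placeholders, not the real module configuration.
resource "google_container_cluster" "data_pipeline" {
  name     = "data-pipeline"
  project  = "mlab-sandbox"
  location = "us-central1"

  # Manage node pools separately; remove the auto-created default pool.
  remove_default_node_pool = true
  initial_node_count       = 1
}

resource "google_container_node_pool" "parser" {
  name     = "parser-pool"
  cluster  = google_container_cluster.data_pipeline.name
  location = "us-central1"

  autoscaling {
    min_node_count = 2
    max_node_count = 3
  }

  node_config {
    machine_type = "n1-standard-16"
  }
}
```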

stephen-soltesz (Author) commented Aug 21, 2023

Evidently, while gcloud supports bulk-export for some resource types, GKE clusters are not yet among them.

Documentation on the Terraform gke module

stephen-soltesz (Author) commented Aug 21, 2023

The GKE resources are called something else in this context: ContainerCluster and ContainerNodePool.

Running this command requires additional permissions beyond the basic roles alone: https://cloud.google.com/asset-inventory/docs/access-control#required_permissions
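A sketch of granting the missing permission, assuming the Cloud Asset Viewer role covers it (the member email below is a placeholder):

```shell
# Hypothetical sketch: grant the Cloud Asset Viewer role required by
# resource-config bulk-export. The member email is a placeholder.
PROJECT=mlab-sandbox
MEMBER="user:someone@example.com"

gcloud projects add-iam-policy-binding "${PROJECT}" \
    --member="${MEMBER}" \
    --role="roles/cloudasset.viewer"
```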

gcloud beta resource-config bulk-export \
   --resource-types=ContainerCluster,ContainerNodePool \
   --project=mlab-sandbox --resource-format=terraform \
   --path=output

Additional types are ComputeNetwork and ComputeSubnetwork for declaring the VPC networks over which the cluster communicates.

gcloud beta resource-config list-resource-types
gcloud beta resource-config bulk-export  \
    --resource-types=ComputeNetwork,ComputeSubnetwork \
    --project=mlab-sandbox --resource-format=terraform --path=output

@stephen-soltesz stephen-soltesz self-assigned this Aug 21, 2023
stephen-soltesz (Author) commented:
Current data processing cluster workloads are using deprecated APIs.

[Screenshot: workloads using deprecated Kubernetes APIs, Aug 22, 2023]

stephen-soltesz (Author) commented:
The deprecated APIs appear to come from kube-state-metrics (v2.2.4) in the prometheus-support configuration. Attempting to update to v2.9.2.

stephen-soltesz (Author) commented:
The archive-* buckets are "Multi-region" buckets:

  • archive-mlab-sandbox
  • archive-mlab-staging

It is unclear whether this has a significant cost impact when a bucket is not explicitly in the cluster's region.

stephen-soltesz (Author) commented:
Grafana must be restarted in each project to pick up the new datasources for the data-pipeline cluster.
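Assuming Grafana runs as a Kubernetes deployment (the deployment and namespace names below are guesses), a restart sketch:

```shell
# Hypothetical sketch: restart the Grafana deployment so it reloads the
# provisioned datasources. Deployment and namespace names are assumptions.
NAMESPACE=default
kubectl rollout restart "deployment/grafana" -n "${NAMESPACE}"
kubectl rollout status "deployment/grafana" -n "${NAMESPACE}"
```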

stephen-soltesz (Author) commented:
The egress traffic from measurement-lab to sandbox/staging appears to have decreased significantly over the weekend, after stopping the data-processing cluster in us-east1 last week.

[Screenshot: egress traffic graph, Aug 28, 2023]

stephen-soltesz (Author) commented Aug 28, 2023

And gardener & autoloader appear to be working as intended (WAI) in staging over the weekend as well.

[Screenshots: gardener and autoloader dashboards, Aug 28, 2023]
