description |
---|
How to update the AMI on our ECS cluster instances |
The Digital webapps cluster uses the Elastic Container Service on AWS. We have a handful of EC2 instances that actually host the containers.
These instances use a stock Amazon Machine Image (AMI) from Amazon designed for Docker that comes with the ECS agent pre-installed. From time to time, Amazon releases a new version of this “ECS-optimized” image, either to upgrade the ECS agent or the underlying OS.
Thanks to our instance-drain Lambda function, updating the cluster EC2 images is a zero-downtime process. Nevertheless, it’s best to run this during the weekly digital maintenance window, and make sure that staging looks good before doing it on production.
This process is sometimes referred to as “rolling” the cluster, though it is more accurate to say that we set up a second cluster of machines and migrate to it.
1. Find the latest ID for the ECS-optimized AMI. You can do this on the Amazon ECS-optimized AMIs page.
2. In a browser, navigate to the CityOfBoston/digital-terraform repository and edit the `apps/clusters.tf` file.
3. Update the `instance_image_id` value for the `staging_cluster` module to the new AMI ID from step 1 above. Save/commit the file as a new branch, not directly to the `production` branch.
4. Make a PR which merges the new branch into the `production` branch, and assign a person to review the changes.
5. When you make the PR, GitHub will automatically execute an `atlantis plan` process (see the section on what Atlantis is).
   When the plan is done, inspect the output and expect to see changes to:
   - resource "aws_autoscaling_group" "instances"
   - resource "aws_cloudwatch_metric_alarm" "low_instance_alarm"
   - resource "aws_launch_configuration" "instances" (This last one will have the new AMI ID)

   Any other changes the plan identifies should be carefully investigated.
   Terraform may be proposing to make changes to the AWS environment you don't want, or at least are not expecting.
6. After viewing the plan, if you need to update the Terraform scripts, be sure to save the changes to the new branch.
   If committing your changes does not trigger the Atlantis plan automatically, you can run it manually by creating a new comment with **`atlantis plan`**.
7. Once the Atlantis plan is finished and the PR has been approved, create a new comment `atlantis apply`.
   This will cause Atlantis to apply changes to AWS. (Atlantis runs a `terraform apply` command in a background process.) See the section on what happens during an apply.
8. Keep an eye on the “ECS Instances” tab in the cluster’s UI. You should see the “Running tasks” on the draining instance(s) go down, and go up on the new instances.
   Figure: ECS Instances tab in the AWS web console
9. Once all the tasks have moved, the old instance(s) will terminate and Terraform will complete. Check a few URLs on staging to make sure that everything’s up-and-running.
10. Now that Atlantis’s apply has finished, you can merge the staging PR and repeat the process (steps 2-9) for the production cluster.
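The “Running tasks” numbers from the ECS Instances tab can also be read from a terminal. The following is a minimal sketch, assuming the AWS CLI is installed and has credentials for the account; the function name `show_instance_task_counts` is made up for this illustration, and the cluster name you pass in is whatever the cluster is called in your environment.

```shell
# Print one line per container instance: EC2 instance ID, status, running task count.
# Usage: show_instance_task_counts <cluster-name>
# (hypothetical helper name; not part of any existing tooling)
show_instance_task_counts() {
  sitc_cluster="$1"

  # Collect the ARNs of the container instances registered with the cluster.
  sitc_arns=$(aws ecs list-container-instances --cluster "$sitc_cluster" \
    --query 'containerInstanceArns[]' --output text)

  if [ -z "$sitc_arns" ] || [ "$sitc_arns" = "None" ]; then
    echo "no container instances found in $sitc_cluster" >&2
    return 1
  fi

  # $sitc_arns is intentionally unquoted so each ARN becomes its own argument.
  aws ecs describe-container-instances --cluster "$sitc_cluster" \
    --container-instances $sitc_arns \
    --query 'containerInstances[].[ec2InstanceId,status,runningTasksCount]' \
    --output text
}
```

Running it every few seconds during the apply should show the count on instances in `DRAINING` status falling while the count on the new `ACTIVE` instances rises.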
If you have Terraform installed on your local computer, you can run the update directly from there instead of through Atlantis.
1. Find the latest ID for the ECS-optimized AMI. You can do this on the Amazon ECS-optimized AMIs page.
2. Ensure your cloned copy of the `digital-terraform` repository is on the `production` branch, and that the branch is up to date with the origin on GitHub.
3. Create a new branch from the `production` branch.
4. In your preferred IDE, open the `/apps/clusters.tf` file and update the `instance_image_id` value for the `staging_cluster` module to the new AMI ID from step 1 above. Save/commit the file to the new branch (not directly to the `production` branch).
5. In a terminal/shell, from the `repo/apps/` folder, run the command: `terraform plan`
6. When the plan is done, inspect the output and expect to see changes to:
   - resource "aws_autoscaling_group" "instances"
   - resource "aws_cloudwatch_metric_alarm" "low_instance_alarm"
   - resource "aws_launch_configuration" "instances" (This last one will have the new AMI ID)

   Any other changes the plan identifies should be carefully investigated.
   Terraform may be proposing to make changes to the AWS environment you don't want, or at least are not expecting.
7. Once you are happy with the changes that Terraform will apply to the AWS environment, you can run the command: `terraform apply`
   See the section on what `terraform apply` does.
8. Keep an eye on the “ECS Instances” tab in the cluster’s UI. You should see the “Running tasks” on the draining instance(s) go down, and go up on the new instances.
   Figure: ECS Instances tab in the AWS web console
9. Once all the tasks have moved, the old instance(s) will terminate and Terraform will complete. Check a few URLs on staging to make sure that everything’s up-and-running.
10. Now that Terraform's apply is finished, you can repeat the process (steps 2-9) for the production cluster.
11. Finally, you should merge the changes in your new (local) branch into the local `production` branch, and then push your local `production` branch to the origin on GitHub.
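Two parts of the local procedure lend themselves to small shell helpers: looking up the AMI ID (step 1) and checking the plan for surprises (step 6). The sketch below makes several assumptions. It assumes the cluster uses the Amazon Linux 2 variant of the ECS-optimized AMI (AWS publishes the recommended ID under a public SSM parameter; other variants use a different path), and it assumes the plan output uses the `# <address> will be ...` / `must be ...` comment lines printed by Terraform 0.12 and later. Both function names are hypothetical.

```shell
# Look up the recommended ECS-optimized AMI ID from AWS's public SSM parameter.
# ASSUMPTION: Amazon Linux 2 variant; adjust the parameter path for other variants.
latest_ecs_ami_id() {
  aws ssm get-parameters \
    --names /aws/service/ecs/optimized-ami/amazon-linux-2/recommended/image_id \
    --query 'Parameters[0].Value' --output text
}

# Read `terraform plan -no-color` output on stdin and print every planned
# resource change that is NOT one of the three expected resources.
# Returns non-zero when something unexpected is found.
# ASSUMPTION: Terraform 0.12+ plan format ("# addr will be ..." / "# addr must be ...").
unexpected_plan_changes() {
  upc_found=$(grep -E '^[[:space:]]*# .+ (will be|must be) ' \
    | grep -Ev 'aws_(autoscaling_group|launch_configuration)\.instances[[:space:]]|aws_cloudwatch_metric_alarm\.low_instance_alarm[[:space:]]' \
    || true)
  if [ -n "$upc_found" ]; then
    printf '%s\n' "$upc_found"
    return 1
  fi
}

# Example, run from the repo's apps/ folder:
#   latest_ecs_ami_id
#   terraform plan -no-color | unexpected_plan_changes && echo "only expected changes"
```

The filter is deliberately strict: anything it does not recognise is printed for a human to review, so it supplements reading the plan rather than replacing it.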
{% hint style="success" %} After the production instances are fully up, check that they have roughly equal “Running tasks” numbers. ECS should schedule duplicate tasks on separate machines so that they are split across AZs. If you see a service has both of its tasks on the same instance you can run a force deployment to restart it. (See Restarting an ECS service) {% endhint %}
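As a rough illustration of the force deployment mentioned in the hint, the AWS CLI can restart a service with its current task definition. The function name and both arguments are placeholders for this sketch; substitute the real cluster and service names.

```shell
# Restart a service by forcing a new deployment of its current task definition.
# Usage: force_redeploy <cluster-name> <service-name>
# (hypothetical helper name; cluster and service names are supplied by the caller)
force_redeploy() {
  aws ecs update-service \
    --cluster "$1" \
    --service "$2" \
    --force-new-deployment \
    --query 'service.serviceName' --output text
}
```

ECS then starts replacement tasks and stops the old ones, which gives the scheduler a chance to spread the service's tasks across instances again.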
What are Atlantis and Terraform?
Terraform is a CLI utility that synchronizes AWS with scripts. In essence, it uses a series of scripts to detect and make changes to AWS. Terraform commands are run from a terminal session on a machine with the Terraform libraries installed.
Installing Terraform | See website | See documentation
Atlantis provides a GitHub provisioned wrapper for Terraform: it runs `terraform plan` and `terraform apply` commands from GitHub and posts the results back to GitHub.
See website | See documentation
Atlantis is a small application which the Digital team has installed on a very small serverless environment in AWS (Fargate). It runs in Fargate because it restarts the staging and production containers and therefore cannot run on any of the main EC2 instances.
What happens during a terraform/atlantis apply with an updated AMI?
When the AMI is updated, the `terraform apply` command will:
- create a new Launch Configuration for cluster instances (i.e. EC2 instances) that uses the new AMI,
- create a new Autoscaling Group that uses the Launch Configuration,
- trigger deletion of the old Autoscaling Group.
The instance-drain Lambda function will tell ECS to drain the tasks from the instances that are being shut down (terraform won’t delete the Autoscaling Group until its instances are fully terminated). ECS will automatically start those tasks up on the new instances that got created by the new Autoscaling Group.
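The Lambda function's own code is not shown on this page. Purely as an illustration of what "telling ECS to drain" amounts to, the equivalent request can be made by hand with the AWS CLI. The function name and arguments below are hypothetical, and this is a sketch of the underlying ECS call rather than the Lambda's actual implementation.

```shell
# Put one container instance into DRAINING so ECS stops placing tasks on it
# and moves its service tasks to other instances.
# Usage: drain_instance <cluster-name> <container-instance-id-or-arn>
# (hypothetical helper; illustrates the ECS call, not the Lambda's real code)
drain_instance() {
  aws ecs update-container-instances-state \
    --cluster "$1" \
    --container-instances "$2" \
    --status DRAINING \
    --query 'containerInstances[].[containerInstanceArn,status]' --output text
}
```

During a normal AMI update this never needs to be run manually, since the Lambda handles it for every instance that is being shut down.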