tidying up dockerfile RUN examples and text #14

Open: wants to merge 3 commits into base: main
13 changes: 6 additions & 7 deletions 05-modifying-containers.qmd
@@ -258,9 +258,9 @@ Now that you're familiar with the basics of Dockerfiles, let's dive into some more of the things you can do with them.

Now you are also familiar with `CMD`, which specifies what runs when the container starts

> **FROM** creates a layer from the another Docker image.
> **CMD** specifies what command to run within the container.
> **RUN** builds your application with make.
> **FROM** creates a layer from another Docker image.
> **CMD** specifies the default command to run when a container is started from an image.
> **RUN** executes commands during the build process of the Docker image.
> **COPY** adds files from your Docker client’s current directory.
Comment on lines -261 to 264
Collaborator (Author):


I'm not sure if this block is being formatted as intended. It currently appears as a code/quoted text block with each line one after the other. I wonder if some kind of bulleted list was intended?

Contributor:


It's a quote, so that's why I indented it, but I should make that clear.


Next let's use `RUN` to add a package to our image.
@@ -279,24 +279,23 @@
```
RUN Rscript -e "install.packages( \
'newpackagename'))"
```

To add an R package from Bioconductor, you can follow this kind of format:
To add an R package from Bioconductor, you can use this kind of format:

```
RUN Rscript -e "options(warn = 2); BiocManager::install( \
    c('limma', \
      'newpackagename'))"
```

To add a **Python package using pip**, you will need to add pip3 to install Python packages using this format. But first you'll need to make sure you have pip installed using:
To add a **Python package using pip**, you will first need to make sure you have pip installed using:

Install pip:
```
RUN apt-get update && apt-get install -y --no-install-recommends \
python3-pip
```

Then you can use pip install to install packages
Then you can use pip install to install packages with the following format:
```
RUN pip3 install \
"somepackage==0.1.0"
```
29 changes: 14 additions & 15 deletions 06-writing-dockerfiles.qmd
@@ -7,13 +7,13 @@ ottrpal::include_slide("https://docs.google.com/presentation/d/1T5Lfei2UVou9b0qa

Now that you're familiar with the basics of Dockerfiles and how to use them to build images, let's dive into some more of the things you can do with them.

`FROM` is one of the [main commands that a Dockerfile can take, as described by their documentation](https://docs.docker.com/develop/develop-images/dockerfile_best-practices/).
`FROM` is one of the [main commands that a Dockerfile can take, as described by their documentation](https://docs.docker.com/reference/dockerfile/#from).

Now you are also familiar with `CMD`, which specifies what runs when the container starts.

> **FROM** creates a layer from another Docker image.
> **CMD** specifies what command to run within the container.
> **RUN** builds your application with make.
> **CMD** specifies the default command to run when a container is started from an image.
> **RUN** executes commands during the build process of the Docker image.
> **COPY** adds files from your Docker client’s current directory.
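Putting those four instructions together, a minimal Dockerfile might look something like this. Note this is just an illustrative sketch: the base image tag, file names, and script below are hypothetical placeholders, not part of this course's image.

```
# FROM: start a layer from an existing image
FROM rocker/r-base:4.3.1

# RUN: executes a command while the image is being built
RUN Rscript -e "install.packages('jsonlite')"

# COPY: add a file from the build context into the image
COPY analysis.R /home/analysis.R

# CMD: the default command when a container starts from this image
CMD ["Rscript", "/home/analysis.R"]
```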

Next let's use `RUN` to add a package to our image.
@@ -32,46 +32,45 @@
```
RUN Rscript -e "install.packages( \
'newpackagename'))"
```

To add an R package from Bioconductor, you can follow this kind of format:
To add an R package from Bioconductor, you can use this kind of format:

```
RUN Rscript -e "options(warn = 2); BiocManager::install( \
    c('limma', \
      'newpackagename'))"
```
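If you want the same R package versions to be installed every time you rebuild, one option (an approach we suggest here, not the only way) is the `remotes` package, which can install a specific version from CRAN. The package name and version number below are placeholders:

```
# Pin an exact CRAN package version so rebuilds are reproducible
RUN Rscript -e "install.packages('remotes'); \
  remotes::install_version('newpackagename', version = '0.1.0')"
```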

To add a **Python package using pip**, you will need to add pip3 to install Python packages using this format. But first you'll need to make sure you have pip installed using:
To add a **Python package using pip**, you will first need to make sure you have pip installed using:

Install pip:
```
RUN apt-get update && apt-get install -y --no-install-recommends \
python3-pip
```

Then you can use pip install to install packages
Then you can use pip install to install packages with the following format:
```
RUN pip3 install \
"somepackage==0.1.0"
```
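After adding a `RUN` command like the ones above, you can rebuild the image and check that the installation worked. A sketch, assuming your Dockerfile is in the current directory and using a placeholder image name:

```
# Rebuild the image from the Dockerfile in this directory
docker build -t my-image .

# Start a throwaway container and confirm the package is importable
docker run --rm my-image python3 -c "import somepackage"
```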

There are so many things you can add to your Docker image. (Picture whatever software and packages you are using on your computer). We can only get you started for the feel of how to build a Dockerfile, and what you put on your Docker image will be up to you.
There are so many things you can add to your Docker image (picture whatever software and packages you are using on your computer). We can only get you started on how to build a Dockerfile. What you put on your Docker image will be up to you.

To figure out how to add something, a good strategy is to look for other Dockerfiles that might have the package you want installed and borrow their `RUN` command. Then try to re-build your Docker image with that added `RUN` command and see if it builds successfully.
To figure out how to add something, a good strategy is to look for other Dockerfiles that might have the package you want installed and borrow their `RUN` command. Then try to re-build your Docker image with that added `RUN` command and see if it builds successfully. Another strategy is to enter an interactive terminal session on your base image, work out the required commands for installing the missing tool or package, then add those `RUN` commands to your Dockerfile.

Make sure that whatever changes you make to your Dockerfile, that you add version control it and add it to your GitHub repository!
And lastly, make sure that whatever changes you make to your Dockerfile, you add them to your GitHub repository!

## Troubleshooting tips for building images

1. Look for a good base image to start with on your `FROM` Something that has a lot of what you need but not more software packages than you need.
1. Look for a good base image to start with on your `FROM` command. This should be an image that has a lot of what you need and not a lot of software packages that you don't need.
- If you know you want use `R` on your container then take a look at [the `rocker` images](https://hub.docker.com/u/rocker).
- If you know you want to use Jupyter notebooks on your container, go to the [Jupyter Project images](https://hub.docker.com/u/jupyter).
- If you are doing anything with bioinformatics software, [take a look at Biocontainers](https://biocontainers.pro/).
2. When adding packages, look for other Dockerfiles folks have written that have the same operating system aka usually Ubuntu, and copy their installation steps.
3. Use version numbers so if you rebuild the same versions will be installed and that won't be a moving target for you.
4. Should the installation steps fail, try to pinpoint what is the first part it is failing on. Look for if there's a message like "missing dependency" or something similar. It may mean you need to add another package in there before installing this package.
2. When adding packages, look for other Dockerfiles that folks have written with the same base operating system (e.g., Ubuntu), and copy their installation steps.
3. Specify version numbers for packages whenever possible so that when you rebuild the same versions will be installed and that won't be a moving target for you.
4. Should the installation steps fail, try to pinpoint what is the first part it is failing on. Look for a message like "missing dependency" or something similar. It may mean you need to add another package before installing this package.
5. Google your error messages. Look on StackOverflow. Post on StackOverflow.
6. If all else fails, can you just install a different software or a different version number of that software that can do the same functionality?
6. If all else fails, try installing a different software or a different version number of that software that can provide the same functionality.
7. If you change something in a base image or in a file that is copied over, you may need to use `--no-cache` so that everything really gets rebuilt from scratch.
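For tip 7, a rebuild that ignores the cache might look like this (the image name is a placeholder):

```
# Ignore cached layers so every build step is re-run
docker build --no-cache -t my-image .

# Optionally also re-pull the base image named in FROM
docker build --no-cache --pull -t my-image .
```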

### More learning
33 changes: 16 additions & 17 deletions 07-sharing-images.qmd
@@ -26,23 +26,23 @@ Some of these best practices are ethically and legally consequential, while othe

Sharing is crucial to community driven science.

Not sharing at all is not an option, this impedes the progress of research. BUT we do need to discuss *when*, *what*, and *who* of appropriate sharing. If you work with protected data types, like Protected Health Information (PHI) or Personally Identifiable Information (PII) and want to use your protected data with containers, that's great!
Not sharing at all is not an option; this impedes the progress of research. BUT we do need to discuss the *when*, *what*, and *who* of appropriate sharing. If you work with protected data types, like Protected Health Information (PHI) or Personally Identifiable Information (PII), and want to use your protected data with containers, that's great!

However, there are some very strong dos and don'ts associated with protected data.

If you are working with protected data (or are not sure if you are), we also **HIGHLY encourage you to talk with data privacy experts** and ensure that the practices you are using are appropriate and making sure they protect the individuals' whose data you have.
If you are working with protected data (or are not sure if you are), we **HIGHLY encourage you to talk with data privacy experts** to ensure that the practices you are using are appropriate and protect the individuals whose data you have.

Bottom line: **DO NOT put protected data like PII or PHI on publicly shared Docker images!**

The more layers of safety checks and cushions for human mistakes (which will happen), the better!

#### Alternatives:

You can use one or more of these alternatives just make sure you clear it with the proper channels like IRBs and security experts!
You can use one or more of these alternatives. Just make sure you clear it with the proper channels like IRBs and security experts!

- Do NOT put the data on the image. Store the data separately from the image. Don't even build the docker image near where those data are stored. You may be able to, from a secure location, run a Docker container and access the data through a volume assuming the data is not moved anywhere. Not only does storing data on an image make the image really big, but obviously in the case of protected data its just not a good idea.
- If for some reason you must put something protected on an image, your other option is you can push the image to a secure and protected location site where only IRB approved individuals have access to it. Amazon Web Services Container registry has options as does Microsoft Azure for example.
- If for some reason you must put something protected on an image, and you can't use a private registry, your other option is: **Don’t push the image anywhere** This obviously makes it harder to share methods, but it also could be possible you could share a version of the Dockerfile of this image that doesn't have protected data information on there and this Dockerfile could just be shared for methods purposes.
- Do NOT put the data on the image. Store the data separately from the image. Don't even build the Docker image near where those data are stored. You may be able to, from a secure location, run a Docker container and access the data through an attached volume, assuming the data is not moved anywhere. Not only does storing data on an image make the image really big, but in the case of protected data it's just not a good idea.
- If for some reason you must put something protected on an image, your other option is to push the image only to a secure and protected location where only IRB approved individuals have access to it. Amazon Web Services Container registry has options as does Microsoft Azure for example.
- If for some reason you must put something protected on an image, and you can't use a private registry, your other option is: **Don’t push the image anywhere**. This makes it harder to share methods. You must take extra steps to, for example, share a version of the Dockerfile for the image that doesn't have protected data information in it.
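The volume approach from the first alternative above can be sketched like this; the paths and image name here are hypothetical:

```
# Mount a secure data directory into the container at runtime;
# the data is read through the volume and never baked into the image
docker run -it -v /secure/location/data:/home/data my-image
```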

```{r, out.width = "100%", echo = FALSE}
ottrpal::include_slide("https://docs.google.com/presentation/d/1T5Lfei2UVou9b0qaUCrWXmkcIwAao-UcN4pHMPEE4CY/edit#slide=id.g30a4ed49e59_0_1276")
```
@@ -61,13 +61,13 @@ There's no limit on the number of images you can make! There can be a fine art t

## Version control your Dockerfiles

Keeping your Dockerfile stored only locally and untracked is likely to lead to headaches. Version control is always a good idea and containerization is no exception! To learn more about version control.
Keeping your Dockerfile stored only locally and untracked is likely to lead to headaches. Version control is always a good idea and containerization is no exception! To learn more about version control see our [Intro to Reproducibility in Cancer Informatics course](https://jhudatascience.org/Reproducibility_in_Cancer_Informatics/introduction.html) or [Advanced Reproducibility in Cancer Informatics course](https://jhudatascience.org/Adv_Reproducibility_in_Cancer_Informatics/introduction.html) sections on making your project open source with GitHub.

If you do decide to keep your Dockerfiles on GitHub, there are a lot of useful tools to help you manage your images there. The GitHub Actions marketplace, for example, has a lot of handy tools. [See our GitHub Actions course for more on this](https://hutchdatascience.org/GitHub_Automation_for_Scientists/).

## Versioning is always a good idea

Just like with software development, it's good to have a system to keep track of things as you develop. Container development can easily get out of hand. Especially if others are using your images you want to be clear about which versions of containers are in a more risky earlier "development" stage and which have been more vetted and ready for use.
Just like with software development, it's good to have a system to keep track of things as you develop. Container development can easily get out of hand. Especially if others are using your images, you want to be clear about which versions of containers are in a more risky earlier "development" stage and which are more vetted and ready for use.

Versioning tags can be whatever you'd like them to be. [Versioning number systems used elsewhere](https://en.wikipedia.org/wiki/Software_versioning#Schemes) like `major.minor.patch` are also used with images.

@@ -79,12 +79,11 @@
```
docker tag cool-new-image:2 username/cool-new-image:2

Other strategies for versioning can mirror git branch naming conventions, so `main` for the vetted version of the image and `dev` or `stage` for a version that's still being worked on but will probably eventually be released.
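That branch-style tagging might look like this (the image name and username are placeholders):

```
# Tag the work-in-progress build as dev
docker tag cool-new-image username/cool-new-image:dev

# Once vetted, tag the same build as main
docker tag cool-new-image username/cool-new-image:main
```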

There's no one size fits all for image tags, its really up to whatever you and your team decide works best for your project.
Regardless being intentional, consistent, and clearly documenting any system you choose to use are the main keys.
There's no one size fits all for image tags; it's really up to whatever you and your team decide works best for your project. Regardless, being intentional, consistent, and clearly documenting any system you choose to use are the main keys.

## Don't use random developers’ docker images

Images and containers can be difficult to have transparency into the build at times. And unfrotunately not everyone on the internet who makes images is trustworthy. To avoid security issues make sure to stick to trusted sources for your docker images. Either verified individuals or companies. Try not to wander too far off the beaten path. Better to make your own image from a trusted source's base image than to use a random strangers' image that promises to do it all.
Images and containers can be difficult to have transparency into at times. And unfortunately not everyone on the internet who makes images is trustworthy. To avoid security issues, make sure to stick to trusted sources for your docker images. Trust only verified individuals or companies. Try not to wander too far off the beaten path. It's better to make your own image from a trusted source's base image than to use a random stranger's image that promises to do it all.

## Summary of best practices

@@ -94,22 +93,22 @@ ottrpal::include_slide("https://docs.google.com/presentation/d/1T5Lfei2UVou9b0qa

## Container Registries

To share our image with others (or ourselves), we can push it to an online repository. There are a lot of options for container registries. Container registries are generally cross-compatible, meaning you can pull the image from just about anywhere if you have the right command and software. You can use different container registries for different purposes.
To share our image with others (or ourselves), we can push it to an online repository. There are a lot of options for container registries. Container registries are generally cross-compatible, meaning you can pull the image from just about anywhere if you have the right command and software. You can use different container registries for different purposes.

[This article has a nice summary of some of the most popular ones](https://octopus.com/blog/top-8-container-registries).

And here's a TL;DR of the most common registries:

- [Dockerhub](https://hub.docker.com/) – widely used, a default
- [Amazon Web Services Container Registry](https://aws.amazon.com/containers/) - options for keeping private
- [Github container registry](https://aws.amazon.com/containers/) - If you are using GitHub packages works with that nicely
- [Amazon Web Services Container Registry](https://aws.amazon.com/containers/) - includes options for keeping private
- [Github container registry](https://aws.amazon.com/containers/) - works nicely with GitHub Packages
- [Singularity](https://docs.sylabs.io/guides/3.5/user-guide/introduction.html) – if you need more robust security

We encourage you to consider what container registries work best for you specific project and team. Here's a starter list of considerations you may want to think of, roughly in the order of importance.
We encourage you to consider what container registries work best for your specific project and team. Here's a starter list of considerations you may want to think of, roughly in the order of importance.

1. If you have protected data and security concerns (like we discussed earlier in this chapter) you may need to pick a container registry that allows privacy and strong security.
2. Price -- not all container registries are free, but many of them aren't. Think about what kind of budget do you have for specific purpose. Paying is generally not a necessity, so don't pay for a container registry subscription unless you need to.
2. What tools are you already using? For example GitHub, Azure, and AWS have their own container registries, if you already are using these services you may consider using their associated registry. (Note GitHub actions works quite seamlessly with Dockerhub, so personally I haven't had a reason to use GitHub Container Registry but it is an option.)
2. Price -- not all container registries are free, but many of them are. What kind of budget do you have for this purpose? Paying is generally not a necessity, so don't pay for a container registry subscription unless you need to.
3. What tools are you already using? For example, GitHub, Azure, and AWS have their own container registries; if you are already using these services you may consider using their associated registry. (Note GitHub Actions works quite seamlessly with Dockerhub, so personally I haven't had a reason to use GitHub Container Registry but it is an option.)
4. Is there an industry standard? Where are your collaborators or those at your institution storing your images?

While there are lots of container registry options, for the purposes of this tutorial, we'll use Dockerhub. Dockerhub is one of the first container registries and still remains one of the largest. For most purposes, using Dockerhub will be just fine.
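Pushing an image to Dockerhub generally follows the tag-then-push pattern from earlier in this chapter (the username and image name are placeholders):

```
# Log in once with your Dockerhub credentials
docker login

# Tag the local image with your Dockerhub username, then push it
docker tag cool-new-image:2 username/cool-new-image:2
docker push username/cool-new-image:2
```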