
It is relatively simple to build a first image for your application’s container. Designing a reasonable “docker build” arrangement that maintains all your applications over time, however, is much more difficult.

Current Conditions

The application image has to be built by a Jenkins job and stored in Harbor.

A Docker project has to be a single Git repository, and the Dockerfile has to be a file in the root of the working directory obtained by checking out a branch of that repository.

Harbor will regularly download a new list of security issues and will quarantine images that have serious vulnerabilities. Currently, Projects are defined so that images with a High severity problem cannot be pulled, either to be run or to be fixed.

Base Images

Each line in a Dockerfile that changes something generates a layer. The content of a layer is the cumulative result of applying all changes from the bottom layer up to that layer. Each layer is identified by a unique SHA256 hash of its content, although Docker normally displays only the first 12 hex characters of the hash.

When Docker finishes processing the Dockerfile, the last layer generated is also called an image. In addition to its obscure hash name, it is common to create an alias (“tag”) that gives the image a friendly name that is easier to use. When you build the Dockerfile again, however, the friendly name may be reassigned to the new image; the old image then remains, but is known only by its unique content hash.

You can display all the layers in an image with the “docker history” command, which shows the hash of each layer and the Dockerfile line that created it.

The bottom layer will typically ADD a special file containing a root file system of some Linux distribution created by some vendor (Ubuntu, Debian, Alpine, etc.) and some environment variables and parameters telling Docker how to run that system. We cannot create such files, so the convention is to download a starting image already created by the vendor from Docker Hub.

For example, the Docker Hub image named “ubuntu:latest” (as of 3/16/2022) is

>docker pull ubuntu:latest
latest: Pulling from library/ubuntu
7c3b88808835: Pull complete
Digest: sha256:8ae9bafbb64f63a50caab98fd3a5e37b3eb837a3e0780b78e5218e63193961f9
>docker history ubuntu:latest
IMAGE          CREATED       CREATED BY                                      SIZE      COMMENT
2b4cba85892a   13 days ago   /bin/sh -c #(nop)  CMD ["bash"]                 0B
<missing>      13 days ago   /bin/sh -c #(nop) ADD file:8a50ad78a668527e9…   72.8MB

The last line of the “docker history” output shows that the special file someone got from Canonical has a SHA256 hash beginning with 8a50ad78a668527e9… and was 72.8 MB. This was turned into a Docker Hub image by adding a line telling Docker that it can run the image by starting the bash program.

The top layer has a hash beginning with 2b4cba85892a, which is also the identifier of the image; the friendly alias tag “ubuntu:latest” is just easier to remember. However, next week there may be a new “latest” image that has been updated, with different contents and a new hash.

>docker image ls ubuntu
REPOSITORY   TAG       IMAGE ID       CREATED        SIZE
ubuntu       21.10     305cdd15bb4f   13 days ago    77.4MB
ubuntu       latest    2b4cba85892a   13 days ago    72.8MB
ubuntu       <none>    64c59b1065b1   2 months ago   77.4MB
ubuntu       <none>    d13c942271d6   2 months ago   72.8MB

While the “latest” image is now 2b4cba85892a, it replaced an older image, stored 2 months ago, whose content hash is d13c942271d6. The old image remains stored in the system in case it was used to build other images that run applications. When all such applications have been updated to use the latest starting image, the old image can be deleted.
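Once no remaining image depends on it, the old image can be removed by its content hash, for example:

>docker image rm d13c942271d6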

Name and Content

A Dockerfile can contain ADD and COPY statements that copy files from the project directory to some location in the image. Docker only cares about and remembers the contents (represented by a hash of the contents), not the name of the input file. So the Dockerfile may contain the line:

COPY one.txt /var/tmp

but what shows up in “docker history” is
COPY file:cd49fd6bf375bcab4d23bcc5aa3a426ac923270c7991e1f201515406c3e11e9f in /var/tmp

In general, Docker doesn’t care about the source of data in a layer. You can change the source to a different file or URL, but if it has the same content there really hasn’t been a change.

Except for the date. In the image, the file you created will have the current timestamp in the destination directory. So even if you run the same Dockerfile twice and copy the same data, the layer you generate will generally have a different hash in each run, because the timestamp of the file in the destination directory is part of the data that goes into the image content hash.
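You can see this on a desktop by building the same project twice (a minimal sketch; “demo” is just an illustrative tag, and “--no-cache” forces the Builder to redo the work instead of reusing cached layers):

>docker build --no-cache -t demo:one .
>docker build --no-cache -t demo:two .
>docker image ls demo

The two builds will normally show different IMAGE IDs even though the Dockerfile and the copied files are identical.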

How Specific?

Generally, the best practice for production applications is to know exactly what is in them. However, at Yale production applications are loaded onto VMs that get updated once a month from Red Hat libraries. We trust Red Hat to update its system carefully. Fifty years ago applications ran on IBM mainframes that were updated once a month by a similar process. Today applications that run on a Windows system get monthly Patch Tuesday updates and application developers don’t track bug fixes.

However, we have to be more careful about the version of the OS we are running (Ubuntu 20.04 or 22.04), the version of Java we are running (Java 8 or 11), and the version of components like Tomcat we are running (Tomcat 8.5, 9, or 10). Upgrades to new versions can change behavior and cause problems for applications.

Generally these principles are already baked into the standard tag names assigned to images in Docker Hub. If you look at the standard images offered that include Debian, Java, and Tomcat, you will find a page that lists all the tags given to a specific supported image. For example:

  • 9.0.60-jdk11-openjdk-bullseye, 9.0-jdk11-openjdk-bullseye, 9-jdk11-openjdk-bullseye, 9.0.60-jdk11-openjdk, 9.0-jdk11-openjdk, 9-jdk11-openjdk, 9.0.60-jdk11, 9.0-jdk11, 9-jdk11, 9.0.60, 9.0, 9

This means that if you just ask for “tomcat:9” you get the image that is specifically tomcat:9.0.60-jdk11-openjdk-bullseye (Tomcat 9.0.60 on top of OpenJDK 11, running on the “bullseye” release of Debian, 11.2).

If you are starting to develop a new application and know you want to use Java 11 and Tomcat 9, this is the default image for those choices (because it carries the short aliases “9” and “9-jdk11”). However, once you put an application into production you don’t want things to change unnecessarily, so you might use the more specific alias “9.0.60-jdk11-openjdk-bullseye” and change that tag only when a reported security problem in Tomcat 9.0.60 is fixed by a later version.
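In Dockerfile terms, the two approaches look like this (a sketch using the tags listed above):

# During development: track the current default image for Tomcat 9 on Java 11
FROM tomcat:9-jdk11

# In production: pin the full tag and change it only deliberately
FROM tomcat:9.0.60-jdk11-openjdk-bullseye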

If you come back and rebuild this application after a few months, you may find that the new image has bug fixes to OpenJDK 11 and Debian bullseye. You may choose to simply accept such fixes, in the same way that you accept the monthly RHEL or Windows maintenance applied to an application running on a VM.

Alternatively, you may decide to control every single change made to a production application, but that may be an unreasonable burden.

Which Distribution?

Alpine is the leanest distribution, but the tomcat-alpine image is no longer being maintained. You can use it, but then you take on all of its maintenance yourself across several years and releases.

It used to be that Ubuntu was updated more quickly than Debian. However, in the last few years there has been an intense focus on reported vulnerabilities and security patches. In response, Debian created a separate repository for fixes to reported vulnerabilities and deploys patches to it as soon as possible. This does not, however, extend to non-security bugs, where Ubuntu may still be quicker to make fixes available.

If you want a Docker Hub standard image with Java and Tomcat pre-installed, you can choose between Debian (“bullseye”) and Ubuntu (“focal”, or 20.04, the most recent LTS release, soon to be replaced by 22.04).

Any additional comments from tracking Docker Hub releases and their ability to patch vulnerabilities will be added here as we gather experience.

FROM Behavior

Docker has a special conservative behavior when processing an ambiguous tag in the FROM statement, which is the first statement in a Dockerfile and sets the “base image” for this build. The first time you build a Dockerfile that begins with the statement

FROM ubuntu:latest

the Docker Engine doing the build downloads the image associated with that tag from Docker Hub and associates the name “ubuntu:latest” with it for all subsequent image builds until you specifically replace it. For a desktop build, add the parameter “--pull” to the “docker build” command to tell Docker to check whether a newer image is currently associated with the tag, download it, and make it the new “ubuntu:latest” for subsequent builds.

Yale’s Jenkins build process does not exactly use “--pull”, but it accomplishes the same thing using a different technique. So when you build an image with Jenkins you get the current image associated with a tag, and you generally want to add “--pull” to your desktop builds to match. If it is really important to start with a very old base image, use the 12-character unique hash name to be sure you get what you want.
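For example, on a desktop (a sketch; “myapp” is an illustrative tag name):

>docker build --pull -t myapp .

And to pin a build to an exact base image already in the local cache, the FROM statement can name the 12-character image ID from the earlier listing instead of a tag:

FROM 2b4cba85892a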

Harbor CVEs

Harbor regularly downloads a database of reported vulnerabilities and scans images to see whether they contain any of them. High severity problems must be fixed.

We have found that Ubuntu makes fixes to critical problems available as soon as they are announced. It is not always possible to patch Debian or Alpine.

Once the vendor package libraries contain a fix, problems can be corrected by doing a simple

apt update
apt upgrade -y

If you run these commands without specific package names, they patch everything. Alternatively, someone could try to build a list of just the packages needed to satisfy Harbor, but that is a massive amount of work. Since such a list is not specific to IAM, and not even specific to Yale, but is potentially different for each version of the base OS, creating it is clearly something that “somebody else should do”, if only there were someone doing it.
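For comparison, a targeted fix for a single finding would look like this (a sketch; “openssl” is only an illustrative package name, not an actual Harbor finding):

apt update
apt install -y --only-upgrade openssl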

At the Bottom and the Top

No matter what you decide, choosing a specific base image (impish-20220301 or focal-20220302, or impish next month) means that a specific set of patches has been applied to the original release for you by the vendor. On top of that “level set” you can choose to immediately add patches with an “apt upgrade” before you add software and applications.

However, if you have an application that has been working fine for months and Harbor reports a single critical vulnerability that can be fixed by upgrading a single package, you can do that “apt upgrade” either at the top layer (where almost nothing happens and the build is nearly instant) or at the bottom layer, where you then have to wait minutes for all your software to reinstall on top of the now-modified base.
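As a sketch (openjdk-11-jdk stands in for whatever software the image actually installs), the top-layer placement looks like this; moving the final RUN line up, directly under FROM, gives the bottom-layer placement and forces everything below it to rebuild:

FROM ubuntu:focal-20220302
RUN apt update && apt install -y openjdk-11-jdk
RUN apt update && apt upgrade -y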

The Build Cache

New images are built in the Docker Engine in a specific environment which I will call the Builder. There are two versions of the Builder: traditional Docker and a newer option called BuildKit. Both share the Image Cache, with the hash names and tags described above.

I have already noted one Builder feature that interacts with the Image Cache: if Jenkins were ever to do a “docker build --pull”, it would download any new version of the image whose tag/alias is specified in the FROM statement of the Dockerfile.

There is an entirely separate problem caused by the Builder’s desire to optimize its performance. It keeps a “statement cache” of layers generated by processing statements in all the Dockerfiles it has previously processed. As long as the statements in the current Dockerfile duplicate the beginning of some other Dockerfile it has already processed, it can avoid doing any work and simply reuse the saved layer generated when it processed the same sequence of statements in the previous Dockerfile. Specifically:

  1. When it processes the FROM statement, if “--pull” was not specified and a previous Dockerfile specified the same image tag name, then it already has an image in the cache, previously loaded, that it can reuse. Only if “--pull” is specified does it look for a new copy of the image on the original source registry.

  2. When it processes an ADD or COPY statement, it creates a hash from the contents of the source file and compares it to the saved hashes of identical ADD or COPY statements that appeared in other Dockerfiles that started with the exact same sequence of statements up to this line. If the content of the file has changed, then the new data has to be copied, and now we have a new layer and can stop optimizing.

  3. For all other statements, if the next statement matches an identical next statement in a previously processed Dockerfile, then we reuse the layer generated by that previous Dockerfile. If this statement is new, then we have a new layer and stop optimizing.

To see why this is useful, consider building a generic Java Web application on top of the current version of Java 11 and Tomcat 9. If you have to start with a Docker Hub image, then every such application begins with what should be identical statements: FROM ubuntu:xxx, patch it, install Java, install Tomcat, install other packages that everyone needs, add common Yale configuration data. Only after all this is done are there a few lines at the end of the Dockerfile to install Netid, or ODMT, or YBATU, or something else.

If we used Docker the way it is supposed to work, we would create our own base image with all this stuff already done. Then the application Dockerfile could have a FROM statement that names our custom Yale Tomcat image and just do the application-specific stuff. However, with this Builder optimization, every application Dockerfile could begin with the same 100 statements, and the Builder would recognize that all these Dockerfiles start the same way, essentially skip over all the duplicate statements, and just get to the different statements at the end of the file.
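As a sketch of the effect (the configuration and .war file names are illustrative), two application Dockerfiles that are identical except for their last line will share every cached layer above that line:

FROM tomcat:9.0.60-jdk11-openjdk-bullseye
RUN apt update && apt upgrade -y
COPY yale-config.xml /usr/local/tomcat/conf/
COPY netid.war /usr/local/tomcat/webapps/

A second Dockerfile ending in “COPY odmt.war /usr/local/tomcat/webapps/” instead would rebuild only that final layer.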

So if Yale Management doesn’t want us to create our own base images, we can accomplish essentially the same thing if we can design a way to manufacture Dockerfiles by appending a few application-specific lines to the end of an identical block of boilerplate code. Since there is nothing in Git or Docker that does this, we could add a step to the Jenkins pipeline that assembles the Dockerfile after the project is checked out from Git and before it is passed to “docker build”.
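That assembly step could be as simple as concatenating two files (a sketch; Dockerfile.boilerplate and Dockerfile.app are hypothetical names):

>cat Dockerfile.boilerplate Dockerfile.app > Dockerfile
>docker build -t myapp .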

Except for the “RUN apt update && apt upgrade -y”.

If this line appears in a Dockerfile that is subject to optimization, the Builder sees that it is identical to the same line processed in a Dockerfile that began with the same statements and was run a month ago, or a year ago. Since the line is identical, the Builder doesn’t actually run it; it reuses the layer it previously generated. This fails to apply any new patches.

There is no formal way to stop optimization, but the solution is obvious. Have Jenkins put today’s date in a file. COPY the file into the image. Then “RUN apt update && apt upgrade -y”. The content of this file changes each day, so the processing of the COPY statement sees that the content has changed and stops optimizing the Dockerfile. The next statement then always runs and patches the image.

However, if you do this at the beginning of the Dockerfile you lose all optimization. So it is better to arrange for this to be at the end of the Dockerfile.
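Putting the pieces together (a sketch; builddate.txt is an illustrative file name), Jenkins runs:

>date > builddate.txt

and the end of the generated Dockerfile contains:

COPY builddate.txt /tmp/builddate.txt
RUN apt update && apt upgrade -y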
