It is relatively simple to build a first image for your application’s container. Designing a reasonable “docker build” arrangement that can maintain all your applications over time, however, is much harder.

Current Conditions

The application image has to be built by a Jenkins job and stored in Harbor.

A Docker project has to be a single Git repository, and the Dockerfile has to be a file in the root of the working directory obtained by checking out a branch of that repository.

Harbor regularly downloads a new list of security issues and quarantines images that have serious vulnerabilities. Currently, Projects are defined so that images with a High Severity problem cannot be pulled, either to be run or to be fixed.

Base Images

An image is a sequence of layers.

Each layer is a set of changes to the next lower layer, but because that lower layer is itself a set of changes to the layer below it, the term “layer” ends up referring to the cumulative set of all changes up to some final change on top. Each layer is identified by a SHA256 hash/digest of its contents, which is globally unique.

At the bottom of the stack of changes there is a special file containing the root file system of some Linux distribution created by a vendor (Ubuntu, Debian, Alpine, etc.). Nobody except the vendors creates these files. In practice, the vendor creates a simple image from this special file and stores it in Docker Hub. We get that version of that OS by pulling the Docker Hub image by its tag.

For example, the Docker Hub image named “ubuntu:latest” (as of 3/16/2022) is

>docker pull ubuntu:latest
latest: Pulling from library/ubuntu
7c3b88808835: Pull complete
Digest: sha256:8ae9bafbb64f63a50caab98fd3a5e37b3eb837a3e0780b78e5218e63193961f9
>docker history ubuntu:latest
IMAGE          CREATED       CREATED BY                                      SIZE      COMMENT
2b4cba85892a   13 days ago   /bin/sh -c #(nop)  CMD ["bash"]                 0B
<missing>      13 days ago   /bin/sh -c #(nop) ADD file:8a50ad78a668527e9…   72.8MB

The last line of the history output shows that the special file Canonical provided has a SHA256 hash beginning with 8a50ad78a668527e9… and was 72.8 MB. The image was then created with a Dockerfile that added the line

CMD ["bash"]

The special file contains the filesystem, and the added CMD tells Docker that if the image is run, Docker should start the bash program. I do not know the internals of how the environment is set up so that there is a PATH to search to find the program, but obviously bash must be somewhere in the special file.
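
For reference, the Dockerfile behind this base image amounts to the two steps visible in the history output. This is only a sketch; the exact name of the root filesystem tarball is an assumption:

# Sketch of the vendor's base-image Dockerfile (tarball name is an assumption)
FROM scratch
ADD ubuntu-focal-rootfs.tar.gz /
CMD ["bash"]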

With the addition of the CMD, the second and now top layer has a SHA256 hash beginning with 2b4cba85892a. The hash value of the top layer of an image is also the hash value and internal unique name of the image. In most cases, Docker only displays and needs the first 12 characters of the much longer actual hash value because 12 characters is enough to be unique most of the time.

>docker image ls ubuntu
REPOSITORY   TAG       IMAGE ID       CREATED        SIZE
ubuntu       21.10     305cdd15bb4f   13 days ago    77.4MB
ubuntu       latest    2b4cba85892a   13 days ago    72.8MB
ubuntu       <none>    64c59b1065b1   2 months ago   77.4MB
ubuntu       <none>    d13c942271d6   2 months ago   72.8MB

You see the same 2b4cba85892a hash value in the “Image ID” column of the ubuntu:latest row. The two <none> rows are old images still in the Docker cache that used to be ubuntu:latest and ubuntu:21.10 before I pulled the new images from Docker Hub that are currently associated with these two tags. The old images have to be kept around to support any other images that were previously built using them.

When you run Docker on your own machine, you control the Docker Engine and have layers and images in your cache that came from previous Docker commands and Dockerfile image builds you ran. When you submit a Dockerfile project to be built by Jenkins, however, you have almost no control over which images the friendly tag names resolve to.

Now that I have pulled “ubuntu:latest” explicitly, that tag is associated in my Docker Engine and Dockerfile build environment with 2b4cba85892a. The tag will remain associated with that specific image until a new image is explicitly pulled, either by entering “docker pull ubuntu:latest” or by running “docker build --pull” on a Dockerfile with a “FROM ubuntu:latest”, at some point after a newer image has been stored in Docker Hub and tagged with the alias “ubuntu:latest”.
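
On your own machine, the two ways of refreshing the tag look like this (the application tag is just a placeholder):

# refresh the tag explicitly
docker pull ubuntu:latest

# or let the build refresh whatever tag the FROM statement names
# ("myapp:test" is a placeholder tag)
docker build --pull -t myapp:test .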

It is clear from the example above that ubuntu:latest is not ubuntu:21.10. The vendor has decided to associate “latest” with the most recently refreshed build of the current Long Term Support release, which in March 2022 happens to be 20.04 (“focal”). This image was last refreshed March 2, so the only unique and unchanging name for this specific base image is ubuntu:focal-20220302.

Next month there will be a new LTS version of Ubuntu (“jammy” or 22.04). At that point, “latest” will switch from pointing to 20.04 to pointing to the new 22.04. However, the version number itself is not sufficient to be unique, because every few months the vendor applies maintenance to all of its supported versions. As you can see above, although version 21.10 was released last October, it was refreshed 13 days ago. So to have a never-changing name for that version, you need to reference ubuntu:impish-20220301.

Unless you use a base OS image tag with a release and a date, you do not know what system level your application is running on. Worse, you have no reason to assume that if you build the application again the new and old images will run on the same OS version, nor that the version you used for unit testing on your desktop is the same version used when Jenkins builds its image and stores it in Harbor.
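
A minimal sketch of what that looks like in a Dockerfile, using the dated tag from the example above:

# Instead of "FROM ubuntu:latest" (which resolves to different images over time),
# pin the base to a release name plus a maintenance date
FROM ubuntu:focal-20220302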

Since there is no explicit “docker pull” and no Jenkins option to do a “docker build --pull”, the image associated with any ambiguous tag name will be the first such image downloaded by that Docker Engine. However, if more than one Jenkins worker VM exists, then each worker will have its own Engine with its own image cache, and the versions of the image in those caches may be different.

This hasn’t mattered up to this point because in the real world, all our applications are simple Java programs that will run on any version of any Linux. We don’t really care. However, people who set standards for TEST and PROD do care about controlling the exact environment of production applications, and they have allowed this only because they do not understand how it really works.

To meet expectations, all our Dockerfiles must be changed to chain back to a specific lowest level special file representing a well defined version and maintenance level of the base OS. At least we need to make sure that the version we run our first TEST on is the version that ends up in PROD.

Harbor CVEs

Harbor regularly downloads a database of reported vulnerabilities. It scans images to see if they contain any of these vulnerabilities. High severity problems need to be fixed.

We have found that Ubuntu makes fixes to critical problems available as soon as they are announced. It is not always possible to patch Debian or Alpine.

Once the vendor package libraries contain a fix, problems can be corrected by doing a simple

apt update
apt upgrade -y

If you run the commands without specific package names, they patch everything. Alternatively, someone could try to build a list of just the packages needed to satisfy Harbor, but that is a massive amount of work. Since this list is not specific to IAM, and not even specific to Yale, but is potentially different for different versions of the base OS, creating such a list is clearly something that “somebody else should do” if only there were someone doing it.

At the Bottom and the Top

No matter what you decide, choosing a specific base image (impish-20220301 or focal-20220302, or jammy next month) implies that a specific set of patches has been applied to the original release for you by the vendor. On top of that “level set” you can choose to immediately apply further package fixes with an “apt upgrade” before you add software and applications.
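
A minimal sketch of that bottom-layer arrangement, using the dated tag from the example above:

# Start from a specific, dated vendor level set
FROM ubuntu:impish-20220301
# Optionally bring every package up to the current fixes before installing
# Java, Tomcat, and the application on top
RUN apt update && apt upgrade -y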

However, if you have an application that has been working fine for months and Harbor reports a single critical vulnerability that can be fixed by upgrading a single package, you can do that “apt upgrade” either in the top layer (where almost nothing else happens and it builds instantly) or at the bottom layer, where you then have to wait minutes for all your software to reinstall on top of the now-modified base.
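
A hedged sketch of the top-layer fix; “openssl” here is only a stand-in for whatever package Harbor flagged:

# Near the end of the Dockerfile: upgrade only the flagged package
# ("openssl" is a placeholder)
RUN apt update && apt install -y --only-upgrade openssl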

The Build Cache

New images are built in the Docker Engine in a specific environment which I will call the Builder. There are two versions of the Builder: traditional Docker and a newer option called BuildKit. Whichever builder you choose shares the Image Cache, with the hash names and tags described above.

I have already noted one Builder feature that interacts with the Image Cache. If Jenkins were to ever do a “docker build --pull” it would download any new version of the image whose tag/alias is specified in the FROM statement of the Dockerfile.

There is an entirely separate problem caused by the Builder’s desire to optimize its performance. It keeps a “statement cache” history of layers generated by processing Statements in all the Dockerfiles it has previously processed. As long as the statements in the current Dockerfile duplicate the beginning of some other Dockerfile it already processed, it can avoid doing any work and simply reuse the saved layer generated when it processed the same sequence of statements in that previous Dockerfile. Specifically (a short sketch follows the numbered list):

  1. When it processes the FROM statement, if “--pull” was not specified and a previous Dockerfile specified the same image tag name, then it has a previously loaded image in the Cache that it can reuse. Only if “--pull” is specified does it look for a new copy of the image on the original source registry.

  2. If it is processing an ADD or COPY statement, it creates a hash from the contents of the source file and compares it to previously saved hash values of identical ADD or COPY statements that appeared in other Dockerfiles that started with the exact same sequence of statements up to this line. If the content of the file changed, then the new data has to be copied and now we have a new layer and can stop optimizing.

  3. For all other statements, if the next statement matches an identical next statement in a previously processed Dockerfile, then we reuse the layer generated by that previous Dockerfile. If this statement is new, then we have a new layer and stop optimizing.
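
A hedged illustration of how these three rules play out in a typical Dockerfile (the file and directory names are placeholders):

# Rule 1: reused from the image cache unless "--pull" is specified
FROM ubuntu:focal-20220302
# Rule 2: reused only while the content hash of the copied file is unchanged
COPY apache-tomcat.tar.gz /opt/
# Rule 3: reused because the statement text is identical to a previous build
RUN tar -xzf /opt/apache-tomcat.tar.gz -C /opt
# The first statement that differs from any previous Dockerfile; optimization stops here
COPY netid.war /opt/tomcat/webapps/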

To see why this is useful, consider building a generic Java Web application on top of the current version of Java 11 and Tomcat 9.5. If you have to start with a Docker Hub image, then every such application begins with what should be identical statements: FROM ubuntu:xxx, patch it, install Java, install Tomcat, install other packages that everyone needs, add common Yale configuration data, and only after all this is done are there a few lines at the end of the Dockerfile to install Netid, or ODMT, or YBATU, or something else.

If we used Docker the way it is supposed to work, we would create our own base image with all this stuff already done. Then the application Dockerfile could have a FROM statement that names our custom Yale Tomcat image and just does the application-specific stuff. However, with this Builder optimization, every application Dockerfile could begin with the same 100 statements, and the Builder would recognize that all these Dockerfiles start the same way, essentially skip over all the duplicate statements, and just get to the different statements at the end of the file.

So if Yale Management does not want us to create our own base images, we can do essentially the same thing if we can design a way to manufacture Dockerfiles by appending a few application-specific lines onto the end of an identical block of boilerplate code. Since there is nothing in Git or Docker that does this, we could add a step to the Jenkins pipeline to assemble the Dockerfile after the project is checked out from Git and before it is passed to “docker build”.
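
A hedged sketch of what that extra pipeline step could do (the file names, the shared boilerplate file, and the tag are assumptions, not an existing convention):

# After checkout: prepend the shared boilerplate to the application-specific fragment
# (file names here are placeholders)
cat Dockerfile.boilerplate Dockerfile.app > Dockerfile

# Then hand the assembled file to the normal build
docker build -t netid:candidate .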

Except for the “RUN apt update && apt upgrade -y”.

If this is done in a Dockerfile that is subject to optimization, the Builder sees that this line is identical to the same line that was processed in a Dockerfile that began with the same statements and was run a month ago, or a year ago. Since the line is identical, it does not actually run it and reuses the same layer it previously generated. This fails to apply any new patches.

There is no formal way to stop optimization, but the solution is obvious. Have Jenkins put today’s date in a file. COPY the file to the image. Then “RUN apt update && apt upgrade -y”. The content of this file changes each day. The processing of COPY checks the content and sees it has changed and stops optimizing the Dockerfile. Then the next statement always runs and patches the image.

However, if you do this at the beginning of the Dockerfile you lose all optimization. So it is better to arrange for this to be at the end of the Dockerfile.
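
A hedged sketch of the whole arrangement; the file name is an assumption:

# On the Jenkins side, just before the build:
#   date > builddate.txt
# Then, near the end of the Dockerfile:
COPY builddate.txt /tmp/builddate.txt
RUN apt update && apt upgrade -y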
