Docker Build Requirements

It is relatively simple to build a first image for your application’s container. It is much harder to design a “docker build” arrangement that remains reasonable as you maintain all your applications over time.

Current Conditions

The application image has to be built by a Jenkins job and stored in Harbor.

A Docker project has to be a single Git repository, and the Dockerfile has to be a file in the root of the working directory obtained by checking out a branch of that repository.

Harbor will regularly download a new list of security issues and will quarantine images that have serious vulnerabilities. Currently, Projects are defined so that images with a High Severity problem cannot be pulled, either to be run or to be fixed.

Base Images

Each line in a Dockerfile that changes something generates a layer. The content of a layer is the cumulative result of applying all changes from the bottom layer up to that layer. Each layer is identified by a unique SHA256 hash of its content, although Docker normally shows only the first 12 hex characters of the hash.

When you finish processing the Dockerfile, the last layer generated is also called an image. In addition to its obscure hash name, it is common to assign an alias (“tag”), a friendly name that is easier to use. When you build a new image using the same Dockerfile you may reassign the same alias/tag to the new image. The old image remains in the Docker cache and may remain in the network image repository server, but now it is known only by its original unique content hash.

You can display all the layers in an image with the “docker history” command, which shows the hash of each layer and the Dockerfile line that created it.

The bottom layer will typically ADD a special file containing a root file system of some Linux distribution created by some vendor (Ubuntu, Debian, Alpine, etc.) and some environment variables and parameters telling Docker how to run that system. Yale doesn’t build Linux images from scratch, so we must start any of our images by referencing one of the Docker Hub images that in turn includes one of these starting Linux distribution image files.

For example, the Docker Hub image named “ubuntu:latest” (as of 3/16/2022) is

>docker pull ubuntu:latest
latest: Pulling from library/ubuntu
7c3b88808835: Pull complete
Digest: sha256:8ae9bafbb64f63a50caab98fd3a5e37b3eb837a3e0780b78e5218e63193961f9
>docker history ubuntu:latest
IMAGE          CREATED       CREATED BY                                      SIZE      COMMENT
2b4cba85892a   13 days ago   /bin/sh -c #(nop)  CMD ["bash"]                 0B
<missing>      13 days ago   /bin/sh -c #(nop) ADD file:8a50ad78a668527e9…   72.8MB

On line 8 (the last line of the output) we learn that the special file from Canonical has a SHA256 hash beginning with 8a50ad78a668527e9… and is 72.8 MB in size. This was turned into a Docker Hub image by adding a line telling Docker that it can run the image by starting the bash program.

We see that the top layer has a hash beginning with 2b4cba85892a. That layer hash also becomes the unique hash identifier of the image, but to make the image easier to reference, the friendly alias “ubuntu:latest” is also assigned (temporarily) to this image. Next week there may be a new updated image and the “ubuntu:latest” alias will point to the new image, while the 2b4cba85892a identifier will continue to point to this older image.

>docker image ls ubuntu
REPOSITORY   TAG       IMAGE ID       CREATED        SIZE
ubuntu       21.10     305cdd15bb4f   13 days ago    77.4MB
ubuntu       latest    2b4cba85892a   13 days ago    72.8MB
ubuntu       <none>    64c59b1065b1   2 months ago   77.4MB
ubuntu       <none>    d13c942271d6   2 months ago   72.8MB

While the “latest” image is 2b4cba85892a, it replaces an older image stored 2 months ago that had a contents hash of d13c942271d6. The old image remains stored in the system in case it was used to build other images to run applications. When all such applications have been updated to use the latest starting image, then the old image can be deleted.

Name and Content

A Dockerfile can contain ADD and COPY statements that copy files in the project directory to some location in the image. Docker creates a hash of the contents of that file and remembers it. Docker does not care about the name of the file, or its path location, or its date and attributes. If you change the Dockerfile to reference a file with a different name and location but the same contents, then because the content hash has not changed, Docker regards the ADD or COPY statement as unchanged.

A Dockerfile may contain the line:

COPY one.txt /var/tmp

but what shows up in “docker history” is
COPY file:cd49fd6bf375bcab4d23bcc5aa3a426ac923270c7991e1f201515406c3e11e9f in /var/tmp

Note that although the contents of the file have not changed, copying it to an image may give the destination file today’s date. Because the date is part of the file system, this changes the hash of the layer and of the subsequent new image.
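
The idea that content, not name, determines the hash can be illustrated outside Docker with a short shell sketch. It assumes the GNU coreutils “sha256sum” command is available, and the file names are made up for the example. It is only an analogy for the content hash described above, not the exact computation Docker performs.

```shell
# Illustration only: hash file contents the way a content-addressed store would.
# Assumes GNU coreutils sha256sum; the file names are hypothetical.
workdir=$(mktemp -d)

printf 'same bytes\n' > "$workdir/one.txt"
printf 'same bytes\n' > "$workdir/renamed.txt"    # different name, same content
printf 'other bytes\n' > "$workdir/edited.txt"    # different content

content_hash() {
  # Only the bytes of the file are hashed; name and timestamps play no part.
  sha256sum "$1" | cut -d' ' -f1
}

hash_one=$(content_hash "$workdir/one.txt")
hash_renamed=$(content_hash "$workdir/renamed.txt")
hash_edited=$(content_hash "$workdir/edited.txt")

if [ "$hash_one" = "$hash_renamed" ]; then
  echo "one.txt and renamed.txt: same content hash"
fi
if [ "$hash_one" != "$hash_edited" ]; then
  echo "one.txt and edited.txt: different content hash"
fi
```

Renaming a file leaves the hash untouched, while changing a single byte produces a different hash.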

How Specific?

The best practice for production applications is to carefully control when they change and what changes are made to them.

However, going back 50 years to mainframe computers, it has always been necessary for system administrators to apply monthly maintenance to the operating systems on which applications run. You cannot afford to run systems with known vulnerabilities just because you are not ready to “change” a running production application.

Of course, if we change the application itself we must do appropriate testing. It is also necessary to test when we upgrade versions of the OS, Java, Tomcat, database, or other key components. However, if all that we do is to patch bugs in the system or libraries, then such maintenance has to be more routine.

How do we translate these considerations to the maintenance of images?

There is no simple answer. Red Hat doesn’t contribute to Docker Hub, but maintains its own OpenShift container system. Open source vendors provide maintenance, but that is not the same thing as a subscription to a production-oriented subsidiary of IBM.

Everyone knows you don’t base an application on a “latest” tag. If you select “tomcat:9.0-jdk11-openjdk-bullseye” as your base, you know future images will get Debian 11 (bullseye) and the most recent minor releases of OpenJDK 11 and Tomcat 9.0. You may be implicitly upgraded from Tomcat 9.0.59 to 9.0.60, but that upgrade will fix bugs and may address security vulnerabilities.

Using a more specific tag may prevent you from receiving critical patches. Using a less specific tag will eventually upgrade you to other versions of Debian, Java, or Tomcat when someone changes the defaults.

Which Distribution?

Alpine is the leanest distribution, but the tomcat-alpine image is no longer being maintained. You can use it, but then you have to apply all the maintenance to it yourself through several years and releases.

In previous years Ubuntu was updated more quickly than Debian. Now, however, Debian has a special source of updates for patches to security problems and makes them available immediately after the vulnerability is announced.

So now this is mostly a matter of personal choice.

FROM Behavior (--pull)

In each section of a Dockerfile, the FROM statement chooses a base image.

The work of a “docker build” command is performed in a “builder” component in the Docker Engine. By default, the builder is conservative: the first time it processes a FROM with a given alias tag, it saves the image it finds and reuses that image for every later FROM that names the same alias.

Specifically, if you process a Dockerfile with a “FROM tomcat:9.0-jdk11-openjdk-bullseye”, and that tag name has not been previously encountered, then the Docker Engine will download the image that at this moment is associated with that alias, will save it in its cache, and from now on will by default reuse that image for all subsequent Dockerfiles that have the same alias in their FROM statement.

To avoid this, the “docker build” command used by Yale Jenkins image build jobs typically specifies the “--pull” parameter. This causes Docker to check the base image’s source repository server for a newer version of the image than the one stored in the local Engine cache. If one is available, the newer image is downloaded and used.

Since this is the normal behavior of Jenkins, a developer should also specify “--pull” on any “docker build” command used in a development sandbox. That way the image used in the sandbox for unit testing will be as close as possible to the one built by Jenkins and used in the Yale cluster for final testing and final production deployment.

However, you should understand that running the same image build using the same Dockerfile a second time may pull a newer base image with additional maintenance installed if, by luck, maintenance was done to that image alias in Docker Hub.

Harbor CVEs

Harbor regularly downloads a database of reported vulnerabilities. It scans images to see if they contain any of these vulnerabilities. High severity problems need to be fixed.

It used to be that Ubuntu made fixes available more quickly than Debian. However, as vulnerability scanning became routine in Harbor and other image repositories, Debian responded by creating a special “security” package source where they place fixes as soon as vulnerabilities are announced.

Since the base image may have been built before the most recent problems were reported, it is a good idea to include a

RUN apt update && apt upgrade -y

in your Dockerfile. To make sure this works, you need to understand the Build Cache.

Build Cache

As was previously mentioned, each line in a Dockerfile generates a new layer stored in the Engine Cache and identified by a hash of the layer contents.

By default, the Builder optimizes its performance by saving the text of each Dockerfile line it processes and linking it to the hash of the layer generated by that line. In subsequent image builds it compares lines from the new Dockerfile to the saved lines from previous builds. If the current Dockerfile begins with a sequence of statements that match statements in a previously processed Dockerfile, then the Builder does not rerun those statements but simply reuses the layers they generated in the previous build.

We have already noted that there is a special rule for the base image in the FROM statement. Once a base image is loaded, it is reused in all subsequent builds until the “--pull” parameter is specified in the “docker build” command.

When an ADD or COPY statement is encountered, the Builder obtains a hash of the content of source files. Although the name of the source file may be the same in the Dockerfile, it does not regard the statements as being identical if the content of the source file is different from the content in the previous build.

However, this does not solve the problem when a RUN statement explicitly or implicitly references source files in some network server. In particular:

RUN apt update && apt upgrade -y

implicitly references the current packages provided by the image vendor. These packages will change whenever there is a fix to an important bug, but there is no way for the Builder to determine if new packages are available.
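
A toy model may make this limitation clearer. The sketch below is not Docker’s real algorithm. It assumes a cache key built from the parent key, the instruction text, and (for COPY) the content of the source file. The function name “cache_key” and the file names are hypothetical, and GNU “sha256sum” is assumed.

```shell
# Toy model of a build-cache key (an assumption for illustration, not Docker's code).
cache_key() {
  parent=$1        # key of the previous layer
  instruction=$2   # text of the Dockerfile line
  src=${3:-}       # optional source file for COPY/ADD
  {
    printf '%s\n%s\n' "$parent" "$instruction"
    if [ -n "$src" ]; then cat "$src"; fi
  } | sha256sum | cut -d' ' -f1
}

workdir=$(mktemp -d)
printf 'version 1\n' > "$workdir/app.conf"
base="tomcat:9.0-jdk11-openjdk-bullseye"

copy_v1=$(cache_key "$base" "COPY app.conf /etc/app.conf" "$workdir/app.conf")
run_first=$(cache_key "$base" "RUN apt update && apt upgrade -y")

# The source file changes, and so does the key for the COPY line.
printf 'version 2\n' > "$workdir/app.conf"
copy_v2=$(cache_key "$base" "COPY app.conf /etc/app.conf" "$workdir/app.conf")

# New packages may have been published since, but nothing the key depends on changed.
run_later=$(cache_key "$base" "RUN apt update && apt upgrade -y")

if [ "$copy_v1" != "$copy_v2" ]; then
  echo "COPY: key changed, layer would be rebuilt"
fi
if [ "$run_first" = "$run_later" ]; then
  echo "RUN: key unchanged, old layer would be reused"
fi
```

In this model the COPY line is rebuilt when its source file changes, but the RUN line keeps the same key no matter what the package servers now offer.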

To make sure that the latest data is used to build images, Jenkins also specifies the “--no-cache” parameter on a “docker build”. This parameter disables the Builder optimization and forces every image build to re-execute all the statements in the Dockerfile.

Again, since this is the default in Jenkins processing, it should be specified explicitly in the “docker build” command you run in your development sandbox to build images for initial testing.
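
A sandbox build that mirrors the Jenkins behavior described above might look like the following sketch. Both “--pull” and “--no-cache” are standard “docker build” options. The function name, the DOCKER variable (a hook for dry runs), and the image tag “myapp:sandbox” are hypothetical, and the actual Jenkins job may pass additional options.

```shell
# Sketch of a sandbox build that mirrors the Jenkins flags described above.
# DOCKER can be overridden (for example DOCKER=echo) to preview the command.
DOCKER=${DOCKER:-docker}

build_like_jenkins() {
  tag=$1             # image tag to assign, e.g. myapp:sandbox (hypothetical)
  context=${2:-.}    # build context; the Dockerfile is expected in its root
  # --pull     checks the registry for a newer base image
  # --no-cache re-executes every Dockerfile statement
  "$DOCKER" build --pull --no-cache -t "$tag" "$context"
}
```

For example, “build_like_jenkins myapp:sandbox .” runs the build, while “DOCKER=echo build_like_jenkins myapp:sandbox .” only prints the command it would run.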

Although this parameter forces each Dockerfile command to be run again, a command may generate exactly the same layer as before. In this case, if no new packages were added, the “apt upgrade” reapplies the same packages and ends up building the same layer as the previous Dockerfile build. The new layer then has the same content hash, so the Engine layer cache already has a copy and reuses it. This also applies when you “docker push” an image you built to an image registry like Harbor. Even though you reran each line in the Dockerfile, if you generated a layer with the same contents and hash as a previously generated layer, Docker will discover this and report that the layer already exists and therefore does not have to be pushed to the destination registry.

Howard Gilbert