...
Specifically, if you process a Dockerfile with "FROM tomcat:9.0-jdk11-openjdk-bullseye" and that tag name has not been previously encountered, the Docker Engine downloads the image currently associated with that alias, saves it in its cache, and from then on reuses that image by default for all subsequent Dockerfiles that name the same alias in their FROM statement. Meanwhile, tomorrow Docker Hub may get a new image with the same alias but with additional security patches applied.
To avoid this problem, the "docker build" commands used by Yale Jenkins image build jobs specify the "--pull" parameter. This parameter causes Docker to check the source network repository for an image associated with the alias on the FROM statement that is newer than the one stored in the local Engine cache. If one is available, the newer image is downloaded and used.
Since this is the normal behavior of Jenkins, a developer should also specify "--pull" on any "docker build" command used in a development sandbox. That way the image used in the sandbox for unit testing will be as close as possible to the one built by Jenkins and used in the Yale cluster for final testing and production deployment.
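A sketch of the sandbox command, assuming Docker is installed and using a made-up image name and tag (this is an illustration, not a prescribed invocation):

```
docker build --pull -t myapp:dev .
```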
However, you should understand that running the same image build from the same Dockerfile a second time may pull a newer base image with additional maintenance installed if, as luck would have it, maintenance was done to that image alias in Docker Hub in the meantime.
Harbor CVEs
Harbor regularly downloads a database of reported vulnerabilities. It scans images to see if they contain any of these vulnerabilities. High severity problems need to be fixed.
It used to be that Ubuntu made fixes to critical problems available as soon as they were announced, while it was not always possible to patch Debian or Alpine.
Once the vendor package libraries contain a fix, problems can be corrected by doing a simple
apt update
apt upgrade -y
If you run the commands without specific package names, they patch everything. Alternatively, someone could try to build a list of just the packages needed to satisfy Harbor, but that is a massive amount of work. Since such a list is not specific to IAM, nor even specific to Yale, but is potentially different for each version of the base OS, creating it is clearly something that "somebody else should do," if only there were someone doing it.
At the Bottom and the Top
No matter what you decide, choosing a specific base image (impish-20220301 or focal-20220302, or impish next month) means that a specific set of patches has already been applied to the original release for you by the vendor. On top of that "level set" you can choose to immediately add packages with an "apt upgrade" before you add software and applications.
However, if you have an application that has been working fine for months and Harbor reports a single critical vulnerability that can be fixed by upgrading a single package, you can do that "apt upgrade" either at the top layer (where almost nothing happens and the build completes instantly) or at the bottom layer, where you then have to wait minutes for all your software to reinstall on top of the now modified base.
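The top-layer approach can be sketched as a two-line Dockerfile. The image name here is hypothetical; substitute the existing application image you need to patch:

```dockerfile
# Sketch of a top-layer fix. "registry.example.yale.edu/iam/myapp:1.0"
# is a hypothetical name for the already-built application image.
FROM registry.example.yale.edu/iam/myapp:1.0

# Reapply vendor patches as the final (top) layer. Everything below
# this line is reused from the cache, so the rebuild is nearly instant.
RUN apt update && apt upgrade -y
```

Because the patch layer sits on top, nothing underneath has to be rebuilt; the trade-off is that the base layers still contain the old package versions, hidden beneath the upgrade layer.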
The Build Cache
New images are built in the Docker Engine by a component I will call the Builder. There are two versions of the Builder: the traditional Docker builder and a newer option called BuildKit. Whichever builder you choose shares the Image Cache, with the hash names and tags described above.
I have already noted one Builder feature that interacts with the Image Cache: when "docker build --pull" is specified, as it is in Jenkins jobs, the Builder downloads any new version of the image whose tag/alias appears in the FROM statement of the Dockerfile.
There is an entirely separate problem caused by the Builder's desire to optimize its performance. It keeps a "statement cache": a history of the layers generated by processing statements in all the Dockerfiles it has previously processed. As long as the statements in the current Dockerfile duplicate the beginning of some Dockerfile it has already processed, it can avoid doing any work and simply reuse the saved layer generated when it processed the same sequence of statements in the previous Dockerfile. Specifically:
When it processes the FROM statement, if "--pull" was not specified and a previous Dockerfile specified the same image tag name, then it reuses the previously loaded image already in the Cache. Only if "--pull" is specified does it look for a newer copy of the image on the original source registry.
If it is processing an ADD or COPY statement, it creates a hash from the contents of the source file and compares it to previously saved hash values of identical ADD or COPY statements that appeared in other Dockerfiles that started with the exact same sequence of statements up to this line. If the content of the file has changed, the new data has to be copied; we now have a new layer and optimization stops.
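The idea behind this check can be sketched with an ordinary checksum tool: the Builder hashes the source file's contents, not its name, so the same file name with new contents produces a different hash. The file names below are made up for illustration:

```shell
# Same file name, different contents => different content hash.
printf 'setting=1\n' > app.conf
sha256sum app.conf > hash-v1.txt      # hash of the first version

printf 'setting=2\n' > app.conf      # file name unchanged, contents changed
sha256sum app.conf > hash-v2.txt      # a different hash

cat hash-v1.txt hash-v2.txt
```

Because the two hashes differ, a COPY of the second version can never be mistaken for a replay of the first, and the statement cache is invalidated from that point on.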
For all other statements, if the next statement matches an identical next statement in a previously processed Dockerfile, then we reuse the layer generated by that previous Dockerfile. If this statement is new, then we have a new layer and stop optimizing.
To see why this is useful, consider building a generic Java Web application on top of the current version of Java 11 and Tomcat 9. If you have to start with a Docker Hub image, then every such application begins with what should be identical statements: FROM ubuntu:xxx, patch it, install Java, install Tomcat, install the other packages that everyone needs, add common Yale configuration data, and only after all this is done are there a few lines at the end of the Dockerfile to install Netid, or ODMT, or YBATU, or something else.
If we used Docker the way it is supposed to work, we would create our own base image with all this stuff already done. Then the application Dockerfile could have a FROM statement that names our custom Yale Tomcat image and just does the application-specific stuff. However, with this Builder optimization, every application Dockerfile could begin with the same 100 statements, and the Builder would recognize that all these Dockerfiles start the same way, essentially skip over all the duplicate statements, and just get to the different statements at the end of the file.
So if Yale Management doesn't want us to create our own base images, we can accomplish essentially the same thing by designing a way to manufacture Dockerfiles: append a few application-specific lines to the end of an identical block of boilerplate. Since there is nothing in Git or Docker that does this, we could add a step to the Jenkins pipeline to assemble the Dockerfile after the project is checked out from Git and before it is passed to "docker build".
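Such a pre-build assembly step could be as simple as concatenating two files. This is a sketch under stated assumptions: the file names "boilerplate.docker" and "app.docker", the base image tag, and the WAR path are all hypothetical:

```shell
# Hypothetical Jenkins pre-build step: assemble the final Dockerfile
# from a shared boilerplate block plus the application-specific tail.
printf 'FROM ubuntu:jammy\nRUN apt update && apt upgrade -y\n' > boilerplate.docker
printf 'COPY app.war /usr/local/tomcat/webapps/\n' > app.docker

# Every application's Dockerfile now begins with the identical
# boilerplate, so the Builder's statement cache skips over it.
cat boilerplate.docker app.docker > Dockerfile
cat Dockerfile
```

Because every generated Dockerfile begins with byte-for-byte identical statements, the Builder treats the boilerplate exactly as it would a shared base image.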
Except for the “RUN apt update && apt upgrade -y”.
If this line appears in a Dockerfile that is subject to optimization, the Builder sees that it is identical to the same line processed in a Dockerfile that began with the same statements and was run a month ago, or a year ago. Since the line is identical, the Builder does not actually run it and reuses the layer it previously generated. This fails to apply any new patches.
There is no formal way to stop optimization, but the solution is obvious. Have Jenkins put today's date in a file. COPY the file to the image. Then "RUN apt update && apt upgrade -y". The content of this file changes each day, so the processing of COPY checks the content, sees it has changed, and stops optimizing the Dockerfile. The next statement then always runs and patches the image.
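The Jenkins side of this trick is a one-line shell step run before "docker build". The file name "build-date.txt" is hypothetical:

```shell
# Hypothetical Jenkins pre-build step: record today's date in a file
# whose contents change daily, defeating the statement cache.
date +%F > build-date.txt
cat build-date.txt
```

The Dockerfile would then contain `COPY build-date.txt /tmp/build-date.txt` immediately followed by `RUN apt update && apt upgrade -y`; once the date changes, the COPY content hash changes and every statement after it is re-executed.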
However, if you do this at the beginning of the Dockerfile you lose all optimization, so it is better to arrange for it to be at the end of the Dockerfile.

Historically, Ubuntu made fixes available more quickly than Debian. However, as vulnerability scanning became routine in Harbor and other image repositories, Debian responded by creating a special "security" package source where it places fixes as soon as vulnerabilities are announced.
Since the base image may have been built before the most recent problems were reported, it is a good idea to include a
RUN apt update && apt upgrade -y
in your Dockerfile. To make sure this works, you need to understand the Build Cache.
Build Cache
As was previously mentioned, each line in a Dockerfile generates a new layer stored in the Engine Cache and identified by a hash of the layer contents.
By default, the Builder optimizes its performance by saving each line of the Dockerfile it processed, linking it to the hash of the layer generated by that line. In subsequent image builds it compares lines from the new Dockerfile to the saved lines from previous Dockerfile processing. If the current Dockerfile begins with a sequence of statements that match statements in a previously processed Dockerfile, then the Builder does not rerun the processing but simply reuses the layer generated by that statement in the previous build.
We have already noted that there is a special rule for the base image in the FROM statement. Once a base image is loaded, it is reused in all subsequent builds unless the "--pull" parameter is specified on the "docker build" command.
When an ADD or COPY statement is encountered, the Builder obtains a hash of the content of the source files. Although the name of the source file may be the same in the Dockerfile, the Builder does not regard the statements as identical if the content of the source file differs from the content in the previous build.
However, this does not solve the problem when a RUN statement explicitly or implicitly references source files in some network server. In particular:
RUN apt update && apt upgrade -y
implicitly references the current packages provided by the image vendor. These packages will change whenever there is a fix to an important bug, but there is no way for the Builder to determine if new packages are available.
To make sure that the latest data is used to build images, Jenkins also specifies the “--no-cache” parameter on a “docker build”. This parameter disables the Builder optimization and forces every image build to re-execute all the statements in the Dockerfile.
Again, since this is the default in Jenkins processing, it should be specified explicitly in the “docker build” command you run in your development sandbox to build images for initial testing.
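Putting the two parameters together, a sandbox build command mirroring the Jenkins defaults might look like this (the image name and tag are hypothetical; this is a sketch, not a prescribed invocation):

```
docker build --pull --no-cache -t myapp:dev .
```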
Although this parameter forces each Dockerfile command to be run again, the command may generate the exact same layer (in this case, if no new packages were added, the "apt upgrade" reapplies the same packages and builds the same layer as the previous Dockerfile build). The new layer then has the same content hash, so the Engine layer cache already has a copy and reuses it. The same applies when you "docker push" an image you built to an image registry like Harbor: even though you reran each line in the Dockerfile, if you generated a layer with the same contents and hash as a previously generated layer, Docker will discover this and report that the layer already exists and therefore did not have to be pushed to the destination registry.
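This deduplication rests on content addressing: identical bytes always hash to the same value. A small illustration (not the real layer format; the file names are made up):

```shell
# Two "layers" with identical contents produce the same content hash,
# which is how the Engine cache and a registry such as Harbor
# recognize that a rebuilt layer already exists.
printf 'identical layer contents\n' > layer-a
printf 'identical layer contents\n' > layer-b
sha256sum layer-a layer-b
```

Since the hash depends only on the bytes, it does not matter that the second layer was rebuilt by re-executing the Dockerfile statement; the result is recognized as already present.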