Tech Blog of Stephen Souness: June 2022

Tuesday 21 June 2022

Speeding up Software Builds for Continuous Integration

Downloading the Internet

Can you remember the last time you started out on a clean development environment and ran the build of some software using Maven or Gradle for dependency management? It takes ages to download all of the necessary third party libraries from one or more remote repositories, leading to expression like, "Just waiting for Maven to download the Internet".

Once your development environment has been used for building a few projects the range of dependencies that will need to be downloaded for other builds reduces down as the previously referenced onces will now be cached and found locally on your computer's hard drive.

What happens on the Continuous Integration environment?

Now consider what goes on when Jenkins or your other preferred Continuous Integration server comes to build your software. If it doesn't have a local copy of the libraries that have been referenced then it is going to pay the cost of that slow "download the Internet" process every single time that it comes to check out your latest changes and run a build.

What are the main costs involved here?

Developer time waiting on the build to complete before moving on to the next change
Data transfer charges for sourcing from external repositories

Cutting down costs - saving time

What options do we have available for reducing these costs?

Localise the artifact repository, acting as a pass-through cache
Or Pre-download the most common artifacts in a build container image

Option 1 would involve the selection and setup of an appropriate artifact repository manager such as Nexus or Artifactory. There's a reasonable chance that if your organisation happens to write your own reusable libraries then this will be already be in place for supporting the distribution of those artifacts anyway, so it may just be a matter of re-configuring the setup to support mirroring of external third party libraries sources from external repositories.

Option 2 may seem a bit counter-intuitive as it would go against the current trend of trying to minimise container sizes and to be generally useful it would need to contain a broader range of artifacts than any one project's build would require.

Keep it local

For both options the performance improvement comes down to locality of reference. The builds should be able to obtain most, if not all, dependencies without having to go beyond the organisation's private build environment's network - whether that be a Virtual Private Cloud or a data centre.

With this type of setup in place builds should be able to spend less time on initial setup, and be more focussed on compilation, running tests, and ultimately making the new known good version of the code available for use.

If you want to understand the potential time savings on offer here, just try temporarily moving the content of your local development environment's build cache away and see how long a build takes. For a typical Java microservice I would not be at all surprised if the build time doubles or even triples for having to obtain the build plugin libraries, the application's direct dependencies, and all of the transitive dependencies.

Monday 20 June 2022

Docker SBOM - Software Bill Of Materials

In an earlier post on this blog I was curious about comparing Docker images to try to track down the differences that might be causing performance problems. Since then I have had a play with the sbom Docker command for listing out what is included in the image.

Following the documentation at: https://docs.docker.com/engine/sbom/

Below is an example of the output of a run of a locally built app:

> docker sbom hello-world-alpine-jlink:latest

Syft v0.43.0
✔ Loaded image
✔ Parsed image
✔ Cataloged packages      [16 packages]

NAME                    VERSION       TYPE
alpine-baselayout       3.2.0-r20     apk
alpine-baselayout-data 3.2.0-r20     apk
alpine-keys             2.4-r1        apk
apk-tools               2.12.9-r3     apk
busybox                 1.35.0-r13    apk
ca-certificates-bundle 20211220-r0   apk
docker-comparison       1.0-SNAPSHOT java-archive
jrt-fs                  11.0.15       java-archive
libc-utils              0.7.2-r3      apk
libcrypto1.1            1.1.1o-r0     apk
libssl1.1               1.1.1o-r0     apk
musl                    1.2.3-r0      apk
musl-utils              1.2.3-r0      apk
scanelf                 1.3.4-r0      apk
ssl_client              1.35.0-r13    apk
zlib                    1.2.12-r1     apk

This is a much more detailed listing of the components that are included in the docker image than we would get from looking at the Dockerfile or image history, so I would recommend it as a way of checking what you are including in an image. The main feature request that I have is to separate the artifacts by type, though in this trivial example that is simple enough to do by just looking at the listing.

Tuesday 14 June 2022

The Importance of Segmenting Infrastructure

Kafka for Logging

I was recently poking around in the source code of a few technologies that I have been using for a few years when I came across KafkaLog4jAppender. It enables you to use Kafka as a place to capture application logs. The thing that caught my eye was the latest commit associated with that particular class, "KafkaLog4jAppender deadlocks when idempotence is enabled".

In the context of Kafka, idempotence is intended to enable the system to avoid producing duplicate records when a producer may need to retry sending events due to some - hopefully - intermittent connectivity problem between the producer and the receiving broker.

The unfortunate situation that arises here is that the Kafka client code itself uses Log4j, so it can result in the application being blocked from sending its logs via a Kafka topic because the Kafka client Producer gets deadlocked waiting on transaction state.

Kafka For Metrics - But Not For Kafka Metrics

This reminded me of a similar scenario where an organisation might choose to use Kafka as their mechanism for sending out notifications of metrics for their microservices and associated infrastructure. If Kafka happens to be part of the infrastructure that you are interested in being able to monitor, then you need to keep those resources isolated from the metrics Kafka - otherwise you run the risk of an incident impacting Kafka which prevents the metrics from being transmitted.

Keeping Things Separated

A real world example of keeping infrastructure isolated from itself can be seen in the way Confluent Cloud handles audit logs. I found it a little confusing at first, as the organisation that I was working for at the time only had Kafka clusters in a single region, but the audit logs were on completely separate infrastructure in another region and even another cloud provider.

Sometimes You're Using A Service Indirectly

A slightly different - but no less significant - example of the need for isolating resources can arise when a particular type of infrastructure is being used for different types of workload. Rather than having a "big bang" release of changes to all of the systems, a phased rollout approach can be taken. One of my earliest involvements with using AWS came shortly after their 2015 DynamoDB outage, which had a ripple out impact for a range of other AWS services because behind the scenes those other services were themselves utilising DynamoDB.

It's my understanding that AWS subsequently moved to isolating their internal services' DynamoDB resource from general consumers' DynamoDB infrastructure - but don't quote me on that.

Friday 10 June 2022

Docker Images - Size matters, But So Does Performance

Introduction

I recently went through the exercise of re-building a Docker image based on what was supposed to be a stable, well-known application codebase. Along the way I observed an unexpected performance issue.

The application contained within the Docker image was just a Java command line utility for parsing some yaml files to provision kafka resources on our hosted development clusters. The code had not been changed for several months, so this was supposed to just be a matter of setting up a local copy of the Docker image instead of pulling down a trusted third party's image from Dockerhub.

The application was bundled within a Docker contrainer whose Dockerfile was alongside the code, so it should have been a simple matter of using that to produce the image and pushing it to our own repo, and then pulling that down for our runtime use.

It's the same, so why's it different?

We had been running with the existing third party Docker image for several months, so there was a well established history of how long each stage of the deployment pipeline should typically take to run.

When the new Docker image ran it took noticeably longer to complete each stage. I don't have the exact figures in front of me, but can recall that it was in the order of double digit percentage of time slower - so a six minute build might now be taking longer than seven minutes.

Examining the Docker images

The third party's build process for the original Docker image wasn't available for examination, so to compare the Docker images we need to use something like

> docker history --no-trunc <full image name>

From this I was quickly able to establish that there were a couple of significant differences between the application's specified Dockerfile and the Dockerfile that would have been used for building the faster running established version:

The base image

CentOS Linux versus Alpine Linux

The Java runtime

Full Java SDK versus jlink with specific modules

Getting back up to speed

Since the purpose of this setup was to be a lift and shift of the existing setup, I adjusted the Dockerfile to involve CentOS Linux as its base image and adjusted it to use a full JDK instead of the clever jlink minimised Java runtime environment.

At this point we were where we wanted to be as our baseline for migrating off the third party Docker image. Our image has the same base OS and Java runtime and performs close enough to the same - without taking the double digit percentage of time longer than our starting point.

What was the issue?

While I was working on this particular setup there was a pressing deadline that I was not free to play around with tuning this setup and isolating whether the issue was due to the OS or the jlink runtime (or something else).

Based on what I have seen mentioned online, I suspect that there may have been some aspect of the application that involved heavy use of system calls that were not set up to run Java efficiently with Alpine's musl library. For now that it just a theory, and not something that I have managed to reproduce on a simplified locally built application.

If the runtime environment had involved inputs from external systems I would have been more motivated to try to keep us on Alpine to minimise the potential vulnerabilities as it tends to have fewer services and libriaries that tend to have CVEs representing potential security vulnerabilities.