Speeding up Software Builds for Continuous Integration

Downloading the Internet

Can you remember the last time you started out on a clean development environment and ran a build of some software that uses Maven or Gradle for dependency management? It takes ages to download all of the necessary third-party libraries from one or more remote repositories, leading to expressions like, "Just waiting for Maven to download the Internet".

Once your development environment has been used to build a few projects, the range of dependencies that needs to be downloaded for other builds shrinks, because previously referenced artifacts are already cached locally on your computer's hard drive.

What happens on the Continuous Integration environment?

Now consider what goes on when Jenkins, or your other preferred Continuous Integration server, comes to build your software. If it doesn't have a local copy of the libraries that are referenced, then it is going to pay the cost of that slow "download the Internet" process every single time it checks out your latest changes and runs a build.

What are the main costs involved here?

  • Developer time spent waiting for the build to complete before moving on to the next change
  • Data transfer charges for fetching artifacts from external repositories

Cutting down costs - saving time

What options do we have available for reducing these costs?

  1. Localise the artifact repository, so that it acts as a pass-through cache
  2. Pre-download the most common artifacts into a build container image

Option 1 would involve selecting and setting up an appropriate artifact repository manager such as Nexus or Artifactory. If your organisation writes its own reusable libraries then there's a reasonable chance this is already in place to distribute those artifacts, so it may just be a matter of re-configuring the setup to mirror third-party libraries sourced from external repositories.
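
To give a flavour of what that re-configuration can look like on the build side, here is a minimal sketch of a settings.gradle.kts that forces all dependency resolution through an internal pass-through repository. The host name and repository path are placeholders for whatever your own Nexus or Artifactory instance exposes; a Maven build would achieve the same effect with a <mirror> entry in settings.xml.

    // settings.gradle.kts - sketch only; the URL is a placeholder for your
    // organisation's Nexus or Artifactory instance.
    dependencyResolutionManagement {
        // Reject repositories declared by individual projects, so that every
        // dependency is resolved through the internal pass-through cache.
        repositoriesMode.set(RepositoriesMode.FAIL_ON_PROJECT_REPOS)
        repositories {
            // A proxy/group repository that mirrors Maven Central (and any other
            // external repositories) and caches whatever it fetches.
            maven(url = "https://nexus.internal.example.com/repository/maven-public/")
        }
    }

With something like this in place, the first build still downloads artifacts from the outside world, but subsequent builds - on any machine inside the network - are served from the mirror's cache.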

Option 2 may seem a bit counter-intuitive, as it goes against the current trend of minimising container image sizes, and to be generally useful the image would need to contain a broader range of artifacts than any one project's build requires.
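
One way to populate such an image - and this is just a sketch of the idea rather than an established recipe - is to copy one or more dummy projects into the image build and run a cache-warming task against them, so that the resulting layer contains a pre-populated Gradle cache (~/.gradle/caches). The warmDependencyCache task below is a hypothetical helper, not something Gradle provides out of the box:

    // build.gradle.kts of a dummy project whose only purpose is to pull down the
    // artifacts that should be baked into the build container image.
    tasks.register("warmDependencyCache") {
        doLast {
            configurations
                // Only the configurations that Gradle permits us to resolve directly.
                .filter { it.isCanBeResolved }
                // Resolving forces the artifacts to be downloaded into the local
                // Gradle cache, which then gets captured in the image layer.
                .forEach { it.resolve() }
        }
    }

Note that reaching into the project's configurations at execution time like this is not compatible with Gradle's configuration cache, which should be acceptable for a throwaway cache-warming build. A Maven-based equivalent could run something like mvn dependency:go-offline against the dummy project instead.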

Keep it local

For both options the performance improvement comes down to locality of reference. Builds should be able to obtain most, if not all, of their dependencies without going beyond the network of the organisation's private build environment - whether that is a Virtual Private Cloud or a data centre.

With this type of setup in place, builds should spend less time fetching dependencies and more time on compilation, running tests, and ultimately making the new known-good version of the code available for use.

If you want to understand the potential time savings on offer here, just try temporarily moving your local development environment's build cache out of the way (for Maven that is typically ~/.m2/repository, for Gradle ~/.gradle/caches) and see how long a build takes. For a typical Java microservice I would not be at all surprised if the build time doubles or even triples because it has to obtain the build plugin libraries, the application's direct dependencies, and all of the transitive dependencies.

Comments

  1. With the container option, is there an impact on the number of images that are generated to support changing dependency versions? i.e. you regularly update your dependencies within a project for needed features or, more importantly, for any required security patches. With a repo proxy (e.g. Nexus) the impact should be localised to the size of the new dependency. With an image, you'd have the layer supporting the dependencies changing constantly - so the size of that could vary depending on how you perform the update?

    Replies
    1. That's a fair point.
      The value of the container option would degrade over time if the container isn't regularly updated to pick up the most up-to-date versions of dependencies, including transitive dependencies.
      The build container image itself could be built on a regular schedule against one or more dummy projects specifying artifacts, so that something like Renovate can be applied to automatically detect available upgrades.
      (There's probably a less hacky way to achieve this, but this was the first thing that popped into my head).

    2. The image size could balloon out if it picked up dependencies that get frequent updates, so something like an AWS client library might result in multiple versions per week.

      If this got out of hand then I'd want to apply a policy to limit the range of older versions that are kept around.

