
When to avoid or allow upserts

Introduction 

A few recent posts on this blog have outlined how we could achieve version-aware upserting of data into various databases. In this post, let's consider situations where that might be an unsuitable approach.

An assumption about Id uniqueness

When we write an entity into a database, we expect that the attribute or attributes used to uniquely identify that entity can be trusted within the business domain. Let's consider a situation where that assumption has been known to fall down in real production systems.

Generation of a value to be used as the primary key in a relational database can seem to be a solved problem, given that we now have UUIDs that can be generated and passed around for use in our applications and services.

Some earlier implementations of UUID generation combined the MAC address of the machine's network device with the current time, to produce a value that would not clash with values generated on other machines.

There turned out to be limitations to the uniqueness guarantee when the processing speed and concurrency of the machine resulted in multiple "unique" values being generated at effectively the same time.

Another unfortunate source of colliding UUID values was a virtual machine that had been set up with the same MAC address as another machine, further increasing the risk of collisions.
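As a rough illustration of why a duplicated MAC address matters, the sketch below builds time-based (version 1) UUIDs by hand with Python's standard uuid module. The helper name v1_from_parts and the timestamp, clock sequence and MAC values are made up for the example. It is not how any particular generator is implemented; it only shows that identical inputs give identical identifiers.

```python
import uuid


def v1_from_parts(timestamp: int, clock_seq: int, node: int) -> uuid.UUID:
    """Assemble a version 1 UUID from its three inputs (hypothetical helper).

    timestamp: 60-bit count of 100 ns intervals since 1582-10-15
    clock_seq: 14-bit clock sequence
    node:      48-bit node id, traditionally the MAC address
    """
    time_low = timestamp & 0xFFFFFFFF
    time_mid = (timestamp >> 32) & 0xFFFF
    time_hi = (timestamp >> 48) & 0x0FFF
    clock_seq_low = clock_seq & 0xFF
    clock_seq_hi = (clock_seq >> 8) & 0x3F
    # version=1 makes the uuid module set the version and variant bits.
    return uuid.UUID(
        fields=(time_low, time_mid, time_hi, clock_seq_hi, clock_seq_low, node),
        version=1,
    )


# Arbitrary example values, not taken from any real machine.
SAME_TICK = 139_000_000_000_000_000
CLONED_MAC = 0x001122334455
OTHER_MAC = 0x00112233AABB

vm_a = v1_from_parts(SAME_TICK, 0, CLONED_MAC)
vm_b = v1_from_parts(SAME_TICK, 0, CLONED_MAC)  # a clone with the same MAC
vm_c = v1_from_parts(SAME_TICK, 0, OTHER_MAC)

print(vm_a == vm_b)  # same inputs, same "unique" id
print(vm_a == vm_c)  # a different node id avoids the clash
```

In practice the clock sequence is meant to reduce this risk, but it is only 14 bits wide and cannot help if two generators start from the same state.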

For situations where the identifier generator cannot be trusted, we should treat inserts and updates as clearly differentiated operations, so that a colliding identifier is reported as an error instead of silently overwriting an unrelated entity.
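One way to keep the two operations apart is sketched below using Python's built-in sqlite3 module. The entity table, its columns and the function names are assumptions made up for this example. The insert relies on the primary key constraint to reject a duplicate id, and the update insists that exactly one existing row at the expected version is changed.

```python
import sqlite3


def make_connection() -> sqlite3.Connection:
    """Create an in-memory database with a hypothetical entity table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE entity ("
        " id TEXT PRIMARY KEY,"
        " payload TEXT NOT NULL,"
        " version INTEGER NOT NULL)"
    )
    return conn


def create_entity(conn: sqlite3.Connection, entity_id: str, payload: str) -> None:
    """Insert only. A duplicate id raises sqlite3.IntegrityError."""
    conn.execute(
        "INSERT INTO entity (id, payload, version) VALUES (?, ?, 1)",
        (entity_id, payload),
    )


def update_entity(
    conn: sqlite3.Connection, entity_id: str, payload: str, expected_version: int
) -> None:
    """Update only. Fails if the row is missing or at a different version."""
    cursor = conn.execute(
        "UPDATE entity SET payload = ?, version = version + 1"
        " WHERE id = ? AND version = ?",
        (payload, entity_id, expected_version),
    )
    if cursor.rowcount != 1:
        raise LookupError(f"no entity {entity_id!r} at version {expected_version}")
```

With this split, a second create_entity call for an id that already exists surfaces as an error that can be investigated, which is the behaviour we want when identifiers might collide.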

Trusting a source of truth

It is common for alternative representations of data to exist downstream from an originating system. These may asynchronously apply some transformations or aggregations to produce an entity that is intended for reporting or any manner of follow-on business processing.

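In this situation the downstream system is far enough removed from the original data creation that there is little point in expecting inserts and updates to arrive in order. If the originating system is trusted as the source of truth for identifiers, a version-aware upsert is a reasonable fit: whichever event arrives first creates the row, and later arrivals only take effect when they carry a newer version.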
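A minimal sketch of that downstream case follows, again with Python's sqlite3 module. The report_entity table, the apply_event function and the integer version column are assumptions for illustration, and the ON CONFLICT clause needs SQLite 3.24 or later. Other databases have their own upsert syntax.

```python
import sqlite3

# Hypothetical downstream table; rows are keyed by the upstream id.
SCHEMA = (
    "CREATE TABLE report_entity ("
    " id TEXT PRIMARY KEY,"
    " payload TEXT NOT NULL,"
    " version INTEGER NOT NULL)"
)

# Insert when the id is new; otherwise overwrite only if the incoming
# version is newer than the stored one.
VERSION_AWARE_UPSERT = """
INSERT INTO report_entity (id, payload, version)
VALUES (?, ?, ?)
ON CONFLICT(id) DO UPDATE SET
    payload = excluded.payload,
    version = excluded.version
WHERE excluded.version > report_entity.version
"""


def apply_event(conn: sqlite3.Connection, entity_id: str, payload: str, version: int) -> None:
    """Apply one upstream event, whatever order it arrives in."""
    conn.execute(VERSION_AWARE_UPSERT, (entity_id, payload, version))


conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)

# The update (version 2) arrives before the original create (version 1).
apply_event(conn, "order-1", "shipped", 2)
apply_event(conn, "order-1", "created", 1)

row = conn.execute(
    "SELECT payload, version FROM report_entity WHERE id = ?", ("order-1",)
).fetchone()
print(row)
```

The late-arriving version 1 event is ignored, so the table ends up holding the version 2 data regardless of arrival order.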

One size does not fit all

I have made some broad generalizations, but the universal consideration applies: "It depends". You may find yourself needing a pipeline that can and must differentiate between creating and updating, in which case you can also expect to need firm control over the ordering of those events, perhaps involving Kafka with suitable partitioning and concurrency controls. That's a topic for another post.
