Hi there! Here is a brief overview of my struggles with an interesting (I think) race condition example. It has been taken from a real-life scenario; however, a different technology was initially used. But… the technology doesn’t matter (does it?) — what counts is the theory! (and the theory is the core of the below article 😃).
The User Story
So what it is all about — let's imagine such an algorithm:
- Get a password (from some external service)
- Make a request (authorised by the password from Step 1) to some other external service
- if it is the-time-for-change, then generate a new password and save it at the external service from Step 1
Let's assume (as for the time being) that the phrase if it is the-time-for-change means: always (the password time-to-live doesn’t matter at this stage) — taking this into account, the worker code could look (more or less) as below:
Quick note: The
sleep at line 4 doesn’t mimic anything — it increases the chances of messing up things!
Now the test environment. I will use Kubernetes to host the whole app, which contains the following:
- Redis — it is our storage used by both the backend service and the worker(s)
- backend service — it will not change during our tests (nor code or its configurations); in all tests, I will leverage only one replica of it; however, of course, its replicas number can be increased, and it can cause some interesting cases too, but I decided to ignore them now and focus entirely on workers. Moreover — this one backend service is used to provide the functionality of all external services mentioned by the User Story
- worker — the meat :) I’m going to perform tests for 1, 2 and 96 workers (as it turned out that the cluster used by me was not able to handle more)
If you are interested in code:
- backend service GitHub is lock-backend. The docker image can be pulled from the docker hub
- worker: GitHub: lock-worker / docker hub (versions have been appropriately tagged)
- lock-k8s — this repository contains the Kubernetes manifests if you want to prepare your testing environment
- lock-plot / https://pypi.org/project/lock-plot/ — application to generate the test plot (successes & failures) and report — it is run locally on the tester machine.
Additionally, I use (visible on screens) k9s to manage the Kubernetes cluster (so mainly — here — to change the worker's replicas number).
The Backend image version used in all the tests was
1.1.1. The default version for the Worker was
1.3.2 (in one test — Test07, to demonstrate one pitfall — the
1.3.1 worker’s version was used — there is a comment in the test descriptions about that, of course).
When running the app (the tests), it is essential to restart the Redis service before starting each new test (running the different tests on the same Redis instance can lead to unpredictable situations! — The app is not… advanced enough to behave appropriately when there is (any) data in Redis already).
There are three:
LOCK— describes the used lock type and can be equal to one of the below values:
TTL — it has some meaning only for the
rw lock, so this variable is ignored (equal to
0) for other tests
3. Number of workers — it is what Kubernetes does here for us (replicas!). Of course, you can run the whole app locally and run the workers in different terminal windows — that can make sense for the small number of workers, and it is what I was doing for testing purposes — but managing that way the 96 workers can be challenging!
OK, Let’s go!
Test 01: no lock, one worker
LOCK = no-lock
When the worker deployment (in Kubernetes context) is set up to run only one worker’s replica, everything works fine, as shown in the image below!
The different number of requests in the time unit visible on the diagram probably has several causes. First, it results from the
sleep function used within the code. Second, please notice that the Plot app is not perfectly accurate regarding the results frequency measurement — it tries to get the data for each step, making a network request which can have its latency etc. So the numbers should be correct, but don’t trust (in detail) the timeline visible on the diagram (the top-left corner) 😉
Test 02: no-lock, two workers
When only we increase the number of worker replicas to two…
The errors visible here are reported by the backend — each time the endpoint named
send-request is called, the backend checks if the token (password) used in that particular (sent) request is equal to the currently valid one.
So what happens here is the situation like this:
- worker 1 gets the password from the
- worker 1 sends a request to the
send-requestendpoint with the password from Step 1 — so far, so good
- worker 2 gets the password from the
- worker 1 generates the new password and updates the app by sending the request to the endpoint
- worker 2 — sends a request to the
send-requestendpoint with the password from Step 3 — and here we have an ERROR because the password has already been changed by worker 1 in Step 4 😟
And precisely, these errors — from Step 5 above — are reported on our diagram!
So when we have two workers working, the success of the
send-request operation is a kind of hazard… we will have to fix it, but first, let's look at what happens when there are 96 workers…
Test 03: no-lock, 96 workers
yeah… even if some successes are shown (first column in the top-right table) — they are from the stage (of this particular test) when the number of workers was not so high yet. Kubernetes won’t run 96 workers at once — there is some time needed to spawn them all — it is the slide visible in the diagram between the 40 and 50 cycle. The low number of workers still can make the app lucky enough to succeed occasionally… but not when 96 workers are running.
Ok, so let us fix it!
Let’s add the mutex to our code. It doesn’t make much sense to put two critical sections in our short sample, but I decided to make it that way for better reference for future improvement. So there are two critical sections:
- line 2–6: it is where the worker gets the password and sends the request signed by this password
- line 11–13: this section is responsible for changing the password
The idea here is to allow ONLY one process to be in any of these critical sections at the given time!
So, when one worker reaches line 3, the other will wait at lines 2 or 11. And when this particular process leaves the first critical section, then only one of all other processes will be allowed to continue its work — only one from all these workers waiting at line 2 or 11 will be allowed to acquire the lock and process with the code with inside the critic section — only one!
How does it work?
Test 04: mutex, two workers
LOCK = mutex
As you probably already know — for the one worker, the results are the same as when there was
no-lock used :)
Ok, so — two workers:
It works! Great, great, great! And for 96 workers…
Test 05: mutex, 96 workers
Hmmm… ok, still — no errors! But is there everything as we expected? Using the mutex, (it looks like) there is no difference if we are running one, two or 96 worker’s replicas — the bandwidth of the whole application is always the same… very similar to the “Test 01: no-lock, one worker” test results!
In the above screenshot, there is no sign (on the diagram) of the just done by Kubernetes change form 2 to 96 running worker’s replicas.
The mutex made the Kubernetes useless 😟 Can we do better?
Yes, we can!
Looking for a better solution (than mutex), I found this article on Wikipedia: https://en.wikipedia.org/wiki/Readers%E2%80%93writer_lock and decided to test the first proposed solution there — the RW lock, which uses two mutexes!
Let’s look at the code below. It is not ideal — this try/except block should probably be thought through yet (especially since it will block the whole app if the exception is thrown by the “do_request” method 🤔 ) — but, well… it does the work (generally):
So except for a bit different second critical section (because of the try/except), the only change is that now we distinguish between the reader and writer lock’s type (and thus, the two critical sections make more sense now).
Please notice that reader and writer are not just two different mutexes — there is a relation between them (described in quoted Wikipedia article), which can be checked in the code, e.g. https://github.com/lbacik/lock-worker/blob/main/worker/lock/redis_reader_lock.py.
Let’s test it!
Test 06: rw, 96 workers, ttl = 0
LOCK = rw
TTL = 0
With this kind of lock, the TTL becomes more critical, but first, let’s check how it works without any TTL defined (with TTL = 0):
YEP! The application bandwidth is much better (much more requests are processed in a given time!). The application also became scalable — the number of operations (in time unit) depends now on the number of workers (the first part of the diagram describes the number of operations for fewer workers). Great! But the chart doesn’t look… stable (or nice) — does it? So now is the time when we can go back to the phrase:
if it is the-time-for-change
from our algorithm description 😃 Let’s check what happens if we define the password's TTL (time to live).
For the time being, each worker always tries to change the password; let’s check what happens if we set the TTL of the password to 1 minute — so the password change step is going to be skipped (by the worker) if the password has already been changed (by any other worker) in less than 1 minute.
My first test showed some pitfalls…
Pitfalls (Test 07)
worker version: 1.3.1
LOCK = rw
TTL = 60
Adding the TTL to the algorithm has improved its bandwidth — you can see the number of successful operations in “time 100” is four times greater than with TTL = 0. But… why it's decreasing in time?.. 🤔
Of course, some periods of low algorithm activity are expected — that are the periods when the writer tries to change the password, but why the total number of operations didn’t return to its initial level after the very first such change?
The answer is that some workers were left hanging on the writer lock after each password change… And it is, of course, the implementation problem (not this kind of lock nature) 😅 It shows how fragile this game is — the next worker version solves this particular issue, but there are others (a bit more about that in the last paragraph).
Test 08: rw, 96 workers, ttl = 60 seconds
worker version: 1.3.2
LOCK = rw
TTL = 60
This version solves the problem from Test 07:
Nice! The application works great (without errors) and is horizontally scalable!
Wikipedia describes this kind of RW lock as the preferring reads one — the tests show that with the 96 workers alive and password TTL set to 1 minute, the password is changed not later than up to 4 seconds after the minute pass.
Note 1: the TTL here is a TTL managed on the worker side; the backend — in this case — doesn’t know anything about it; for the backend, every password is valid forever.
Note 2: It is not the only defined RW lock algorithm — but it is the only one tested by me so far.
Not bad, but still not perfect
Below are the results of Test 08 after 39 minutes after its start:
The number of operations is still on the same level, but as you can see, two failures are logged by the backend! Two errors vs ~102000 succeed — it is not bad, but they should not happen. It shows that the app / rw lock implementation is still imperfect (not ready to use in production environments). What could cause these problems is probably the cluster overload (and thus, maybe Kubernetes had to kill one of the pods because of some memory issues 🤔 — but I’m only guessing here); it indicates that decreasing (scaling down) the number of pods in this implementation is dangerous 😨 Yep, still a lot to do here!