Concurrency, Race Conditions and Concurrency Control

Almost any large-scale application involves some form of concurrency. But what is concurrency? What problems arise from concurrent code? And how can we handle them?

In this article, I will try to answer those questions using some examples to solidify those concepts and set you on a path to write your next concurrent application correctly.

What is Concurrency?

Concurrency, as the name suggests, means having multiple tasks running at the same time. For example, if you're implementing a web application, you can have multiple requests, say to insert a user into the database, being served at the same time, i.e. concurrent requests. This is basically how applications handle load and increase throughput.

Let's take a step back and discuss a day-to-day example to make this concept clearer. Say you go to a shop, buy your stuff, and now you want to check out and leave. If there is one cashier, everyone queues up for that single cashier. To scale this (make it work for a larger number of customers), the shop can hire a second cashier and serve twice as many customers in the same time frame, and so on. This is known as horizontal scaling in computer science (as opposed to vertical scaling: hiring a superhuman cashier who can serve twice as many customers all by herself).

When we have multiple workers doing the same job, like the two cashiers serving customers in the example above, we call that concurrency.

Concurrency Example

Now let's get back to computer science and check the following Go code snippet:

package main

import (
    "log"
    "time"
)

var balance int = 10

func buy(price int, tx string) bool {
    if balance < price {
        log.Printf("%s, refused due to insufficient funds", tx)

        return false
    }

    balance -= price
    log.Printf("%s, succeeded", tx)

    return true
}

func main() {
    go buy(7, "transaction 1")
    go buy(5, "transaction 2")

    time.Sleep(10 * time.Second)
}

The previous code implements a simple shopping functionality. We have a function called buy that buys a product given its price. The logic is simple: it checks whether the user has a sufficient balance; if not, it fails the transaction; if so, it updates the balance accordingly and the transaction succeeds.

An inexperienced programmer might not notice anything wrong with this code. As a matter of fact, when I ran it, it produced this output:

2022/06/19 23:00:00 transaction 1, succeeded
2022/06/19 23:00:00 transaction 2, refused due to insufficient funds

which makes sense: transaction 1 checked the balance, the purchase was allowed, and it succeeded; later, when transaction 2 checked the balance, the purchase was refused because of insufficient funds.

Race Conditions

To understand what race conditions are, let's now re-run the previous code. When I did so, I got this:

2022/06/19 23:00:00 transaction 2, refused due to insufficient funds
2022/06/19 23:00:00 transaction 1, succeeded

which also makes sense, since one transaction succeeded and the other failed. But take a closer look: transaction 2 failed before transaction 1 succeeded. How could this happen? The following has likely happened:

  1. Transaction 1 checked the balance, allowed the purchase and updated the balance, then its thread got suspended (yes, this can happen: the OS can suspend threads while they're running, schedule other threads to do some work, then resume the suspended threads).
  2. Transaction 2 checked the balance, found it insufficient and failed the transaction.
  3. Transaction 1 got resumed and it printed the success message.

This highlights the fact that threads can be suspended and resumed, and also that each thread executes the same code section separately, so each line of code can have multiple threads executing it at the same time.

This suspension and resumption can lead to completely different results, for example, I got the following output when I ran the code one more time:

2022/06/19 23:00:00 transaction 2, succeeded
2022/06/19 23:00:00 transaction 1, refused due to insufficient funds

In this case, transaction 2 managed to complete before transaction 1 checked the balance, which is a completely different result from the one we got in the two previous runs.

All the previous results are, at the end of the day, correct, since one transaction succeeded and the other failed. In more drastic cases, however, we might get an output like this:

2022/06/19 23:00:00 transaction 1, succeeded
2022/06/19 23:00:00 transaction 2, succeeded

which is completely wrong. How could this happen? Let's see:

  1. Transaction 1 checks balance, allows the purchase and gets suspended.
  2. Transaction 2 checks balance, allows the purchase and gets suspended.
  3. Transaction 1 updates the balance and prints succeeded.
  4. Transaction 2 updates the balance and prints succeeded.

Now the balance is negative and we allowed a transaction that should never have been allowed. This is when concurrency causes you pain. This is called a race condition: two threads race to access a shared resource (in this case, the user's balance), and the result is inconsistencies and wrong behavior (in this case, it's costing us money).

Concurrency Control

So now that we understand what concurrency is and how poorly written concurrent code can lead to race conditions, let's take a quick look at how we can solve them.

Single Threading

Single threading is a trivial solution, but let's explore it first and build on it. The problem in the previous code is that we had 2 transactions running concurrently on 2 threads. What if we eliminate threading altogether and just make everything single threaded, like so:

func main() {
    buy(7, "transaction 1")
    buy(5, "transaction 2")
}

Now the code is single threaded (everything is executed by the main thread) and it's also sequential, in the sense that each statement is executed to completion before the main thread goes on to execute the next statement. So in our example, transaction 1 will execute to completion and succeed, then transaction 2 will execute to completion and fail. We will always get the following output:

2022/06/19 23:00:00 transaction 1, succeeded
2022/06/19 23:00:00 transaction 2, refused due to insufficient funds

Concurrent but Thread Safe

Single threading the code of course solved our race conditions, but it limited our ability to scale. Now we are stuck with that single cashier forever, and as the number of customers grows, it will become infeasible to operate.

The idea of sequencing access to the shared resource (the user's balance) is still valid, though. We will try to make it so that only one transaction is accessing (reading and updating) the balance at any given time. The code section whose access we want to safeguard is commonly called a critical section.

A common way to protect critical sections from concurrent access is called locking. The idea of a lock is simple: it's like a lock and a key. If you have the key, you can get inside the critical section; if you don't, you have to wait until you get it.

Now let's assume we have a method called mutex.Lock() to acquire the lock and enter the critical section, and another method called mutex.Unlock() to release the lock when we are done with the critical section and want to allow other waiting transactions in. Our code will look like this:

package main

import (
    "log"
    "sync"
    "time"
)

var balance int = 10

func buy(price int, tx string, mu *sync.Mutex) bool {
    mu.Lock()
    defer mu.Unlock()

    if balance < price {
        log.Printf("%s, refused due to insufficient funds", tx)

        return false
    }

    balance -= price
    log.Printf("%s, succeeded", tx)

    return true
}

func main() {
    var mu sync.Mutex

    go buy(7, "transaction 1", &mu)
    go buy(5, "transaction 2", &mu)

    time.Sleep(10 * time.Second)
}

In this code snippet we create a lock called mu and pass it to the function buy. In the buy function, we acquire the lock by calling mu.Lock() and defer releasing it using defer mu.Unlock(). If you are not familiar with Go, defer executes its statement right before the function exits, so the unlock runs just before either return, whether the transaction fails or succeeds.

What this gives us is that only one transaction will be executing the critical section at a given time. For example:

  1. Transaction 1 acquires the lock and starts executing.
  2. Transaction 2 tries to acquire the lock, but fails because it's unavailable, so it waits.
  3. Transaction 1 completes and succeeds, then releases the lock.
  4. Transaction 2 can now acquire the lock and start executing; then it fails.

A scenario where transaction 2 succeeds and transaction 1 fails is also possible and valid (try to trace it yourself).

But now, how is that different from single threading? As a matter of fact, in this specific example, they are exactly the same, but in many other cases they will be different. Let's assume that our buy function is more complicated and does extra work before or after the section that buys stuff, for example:

func buy(price int, tx string) bool {
    if balance < price {
        log.Printf("%s, refused due to insufficient funds", tx)

        send_insufficient_funds_email()

        return false
    }

    balance -= price
    log.Printf("%s, succeeded", tx)

    generate_sales_report()
    update_inventory()

    return true
}

As we can see, it's the same function, but it does extra work like send_insufficient_funds_email() on transaction failure or generate_sales_report() and update_inventory() on transaction success. Assuming those functions are thread safe on their own (there is no problem in them running concurrently), it wouldn't make sense to lock the entire function, because then the lock will be held for a longer time (until all the other function calls are done, which might take a while).

In such cases, we can fine-grain our locking to cover only the critical section, as follows:

func buy(price int, tx string, mu *sync.Mutex) bool {
    mu.Lock()
    if balance < price {
        log.Printf("%s, refused due to insufficient funds", tx)
        mu.Unlock()

        send_insufficient_funds_email()

        return false
    }

    balance -= price
    log.Printf("%s, succeeded", tx)
    mu.Unlock()

    generate_sales_report()
    update_inventory()

    return true
}

As we can see, we only lock the part that causes race conditions, which is checking and updating the balance. I want to mention two very common mistakes here:

  1. Underlocking: sometimes, in the pursuit of making your locking as fine-grained as possible, you end up underlocking (locking less than you actually need). What if we unlock before we update the balance? This might make sense at first, because we would still be locking the part that checks the balance, so no two transactions would check it at the same time. But updating the balance is also part of the critical section: if we unlock before updating the balance, another transaction might check the now stale balance, which will cause problems. So fine-grain your locking for better performance, but be wary of underlocking.
  2. Deadlocks: deadlocks happen in many ways, but the one we are interested in here is when a lock is acquired but never released. A common mistake is to unlock after updating the balance but forget to unlock when the transaction fails. If you do this, the lock will stay acquired and no other transaction will ever be able to acquire it, essentially bringing your system to a halt.

Locking is just one way to implement thread safety; there are others. In fact, if you're writing Go, I recommend you use channels for this. It's a completely different paradigm for sharing resources but follows the same idea. I have written an article on this before, check lesson 1 here.

Concurrency Scope

The important remaining part of the concurrency dilemma is: what is our concurrency scope? Concurrency can exist at different scopes, for example:

  1. An application can be concurrent within the scope of a single process: we have a single process of the application running and it contains multiple threads. This is similar to the examples shown above.
  2. An application can be concurrent across multiple processes, and those processes can be either on the same machine or on different machines, in case we have replicas of the application running (like the cashier example).

In the previous example we used locks implemented by the sync package in Go, which only works for threads within the same process. If the same code is running on two different machines, we are back to the drawing board. So we really need to be aware of our concurrency scope in order to handle it correctly.

Now let's assume that we will run our code on multiple machines. How can we make locking work? Let's ask ourselves the inverted question: why didn't the lock work when we moved to a multi-machine setup? It's because the lock is visible only within the same process, so when we run on different machines, the statement mu.Lock() doesn't actually stop the threads on the other machine from accessing the critical section. So how can we make the lock visible to both machines?

We can move our locking logic to a centralized place that all machines can access. For example, we can use an in-memory store like Redis to do this. If you are not familiar with Redis, it's basically a key-value store, like a dictionary: you can set a key to some value and you can read those keys back (among a ton of other useful features).

Now let's start implementing a Redis lock, I will use Ruby for this:

def lock(key)
  return false if REDIS.get(key)

  REDIS.set(key, "true")

  true
end

def unlock(key)
  REDIS.del(key)
end

In the previous code, the lock method checks if the key indicating the lock has a value. If it does, then someone else must have locked it, so the method returns false. If not, it sets the value of the key to true and returns true. The unlock method just deletes the key from Redis (so it's unset and can be set again later).

This is a very simple lock implemented in Redis that can be seen and accessed by multiple machines connected to the same Redis instance. But wait, doesn't that code have the same race condition we have been talking about? Yes, it does. Consider this:

  1. Thread 1 checks the value of the key, finds it nil and gets suspended.
  2. Thread 2 checks the value of the key, finds it nil and acquires the lock.
  3. Thread 1 is resumed and acquires the lock as well.

Now both threads have acquired the lock, so the lock is useless.

Now let's ask ourselves, what are we missing here? The problem is that a thread can get suspended between the two operations of checking the lock value and acquiring the lock. If we had a way to make sure the thread will never get interrupted between those two steps, that would be ideal. Executing multiple instructions without being interrupted is called atomicity, so we need to make checking the lock and acquiring it atomic somehow.

Luckily, atomic operations like this exist all the way down to hardware instructions. Redis provides an option to the SET command called nx, which sets the value of a key only if it doesn't already exist, and because Redis executes commands one at a time, checking whether the key exists and setting it happen as a single atomic step.

So the modified code will look like this:

def lock(key)
  REDIS.set(key, "true", nx: true)
end

def unlock(key)
  REDIS.del(key)
end

The nx option also makes the SET command return false if it fails to set the key and true if it succeeds, which is exactly what our lock method should return.

This lock now works. There is one caveat, however: we should never leave room for deadlocks, so we need to think of edge cases. For example, what if the entire machine crashes after acquiring a lock and before releasing it? The lock will be held forever. A simple way to handle this with a Redis lock is to set a timeout after which the lock expires and is thus released. Redis has an ex option in the SET command that implements this expiry, like so:

def lock(key)
  REDIS.set(key, "true", nx: true, ex: 60) # expire after 1 minute
end

def unlock(key)
  REDIS.del(key)
end

This sets the key (acquires the lock) for only 1 minute, then it expires. Be sure to choose an appropriate expiry time based on your use case, because you don't want the lock to expire mid-execution and cause race conditions.

Conclusion and General Tips

I will try to summarize it all in some general tips:

  1. Whenever you write concurrent code, think of race conditions; basically, ask what will happen if two threads execute the same code at the same time.
  2. It helps to think in terms of suspension and resumption of threads: what if this thread executes this statement, then gets suspended, and another thread does something else before this thread gets resumed?
  3. Don't look at statements as lines. One line doesn't equal one instruction; for example, a line like read(val) > 10 is not one instruction, it actually reads the value and then compares, so a thread can get suspended between the two operations.
  4. When you're placing locks in your code, make sure to fine grain the locking range for better performance, but be sure to not underlock.
  5. Make sure that any code path that has a lock has a corresponding unlock. Check the branching in your code: for example, if you lock and then enter a switch ... case, each of the cases should have an unlock on its path to completion, otherwise you will end up with deadlocks.
  6. Think twice about your concurrency scope and handle it accordingly as explained.
  7. Never declare locks that can be left as locked forever, always add a safeguard unlock or an expiry.