CSE 422S: Studio 11

Build Your Own Locks


"Although it would be nice if every critical region consisted of code that did nothing more complicated than incrementing a variable, reality is much crueler."

—Robert Love, Linux Kernel Development, 3rd Ed., Chapter 10, p. 183

Locking primitives are important user space tools for concurrent and parallel programming. Two main types of locks exist: locks that spin while a thread waits to acquire the lock, and locks that put the waiting thread to sleep. Spinlocks consume CPU cycles while a thread waits, but are well suited to low-latency applications in which critical sections are very short. Sleep-based locks instead let waiting threads yield the CPU, which wastes fewer cycles but may add latency as threads sleep and wake back up.

In this studio, you will:

  1. Build a userspace spin lock, using atomic instructions
  2. Build a userspace sleep lock, using atomic instructions and futexes

Please complete the required exercises below.

As you work through these exercises, please record your answers, and when finished email your results to eng-cse422s@email.wustl.edu with the phrase Userspace Locks in the subject line.

Make sure that the name of each person who worked on these exercises is listed in the first answer, and make sure you number each of your responses so it is easy to match your responses with each exercise.


Required Exercises

  1. As the answer to the first exercise, please list the names of the people who worked together on this studio.

  2. Download the workload.c program. This program produces several seconds of work on each processor in a system. Build and run the program on your Raspberry Pi (see the comment in the program's code for the command line you will need to use to build it) and verify that it occupies all of your system's cores by using top and pressing 1.

    As the answer to this exercise, please explain why (based on the output of the top utility) you think the program is occupying all of the cores.

  3. Right now, all threads can execute the critical_section() function concurrently (or even simultaneously if they are on different cores). This is undesirable, since critical sections typically protect important shared data. First we will build a spin lock to protect access to the critical_section() function.

    Write an initially empty lock and unlock function. Each of these functions should take a pointer to a volatile integer.

    Note: Recall that the volatile specifier tells the compiler that a variable's value may change at any time, so accesses to it must not be optimized away. In this case, the compiler interprets the volatile int * declaration to mean "a pointer to an int that is volatile": the value pointed at by the pointer may change unexpectedly, not the pointer itself.

    To treat an integer like a lock, we need to define two values that represent the locked and unlocked states, respectively. Define these at the top of your program with preprocessor #define directives. Also create two integer variables that will hold these values, and initialize them accordingly.

    Finally, declare another integer variable, for the state of the lock itself, and initialize the variable to be in the unlocked state.

    Inside the parallel region of the program, insert calls to the lock and unlock functions before and after the critical section of the workload.c program, respectively, passing the address of the lock variable into each of those calls.

    As the answer to this exercise, please show your statements to declare the two values and the three variables.

  4. In order to implement the lock and unlock functions, we'll use GCC's built-in atomic operations. If we were working in C++, we could use C++11's std::atomic facilities. If we didn't have access to GCC, or if speed were absolutely critical, we could implement these with assembly instructions.

    The atomic built-in functions are documented here. For the spin lock we will use the function __atomic_compare_exchange(). The first three arguments determine the meaning of this function: ptr, expected, and desired (note that expected and desired are themselves passed by pointer). When called, this function atomically compares the contents of the location pointed to by ptr with the contents of the expected variable; if they are equal, it writes the value of desired into the location pointed to by ptr and returns true, and otherwise it returns false. The fourth argument chooses between a weak and a strong exchange, and the last two specify memory orderings; we'll just opt for a strong exchange with conservative orderings for this studio, as in:

    __atomic_compare_exchange( ptr, &expected, &desired, false, __ATOMIC_ACQ_REL, __ATOMIC_ACQUIRE )

    To implement the lock function, you should check for the unlocked state and write the value of the locked state. However, it's possible that this exchange will fail (for example, if another thread already holds the lock). Thus, your lock function should attempt to swap the state of the lock variable, and continue retrying until it succeeds. WARNING: When the exchange fails, the function overwrites the value of expected with the value it actually found! Please make sure to re-initialize expected before each retry.

    Implement the unlock function using the same __atomic_compare_exchange() function. However, since we only expect the unlock function to be called when we already hold the lock, it should succeed unless something is drastically wrong. If the call to __atomic_compare_exchange() fails, then rather than retrying the function, your code should print out an error message and return.

    As the answer to this exercise, please show your code for both the lock and unlock functions.

  5. Compile and run your program on your Raspberry Pi, and use each thread's finishing statement to verify that only one thread enters the critical section at a time.

    As the answer to this exercise, please explain why (based on the output you saw) the program only allows one thread at a time into the critical section.

  6. Now we will implement a sleep lock. The lock described above consumes processor time while it's waiting, because it continually retries the lock operation until it succeeds. To implement our sleep lock, we will replace this behavior with one where we try to acquire the lock, but if we fail, the thread sleeps until it is later woken up. To begin, make a copy of your program and delete the bodies of the lock() and unlock() functions, and the variables that held the locked and unlocked states (don't remove the lock variable or the statements that #define the locked and unlocked state values).

    The sleep and wakeup mechanism we will use for this second version of userspace synchronization is a system call named futex, which stands for "fast userspace mutex." The system call handles the mechanics of putting processes to sleep and waking them up, but userspace code must decide how and when to use this capability. Specifically, our futex-based lock will implement a semaphore on top of an integer. There are three states:

    Unlocked:                          1
    Locked:                            0
    At least one process is sleeping:  any negative number

    Since the futex is designed to implement a semaphore, this means that processes lock and unlock the futex by atomic increments and decrements. When a process claims the futex, it atomically decrements the integer by one. When a process releases the futex, it atomically increments the integer by one. If two processes never conflict, then the value of the futex integer will always be zero or one, and no process will ever have to sleep (and thus, you will never need to make a futex system call).

    However, if multiple processes try to lock the futex simultaneously, they will decrement the integer value to be negative. In this case, a process that gets some value less than zero will want to go to sleep, and the kernel then must become involved. The semantics and the particulars of this process are documented in the man pages produced by the following commands: man 2 futex and man 7 futex.

    Make sure that the #define values you have declared for the unlocked and locked state values are consistent with the semantics described above, and if not update them accordingly.

    As the answer to this exercise, please explain whether or not you needed to change the #define values in your code, and why or why not.

  7. Implement your lock function according to the following algorithm.

    1. Decrement the lock variable with ret_val = __atomic_sub_fetch( ptr, 1, __ATOMIC_ACQ_REL );
    2. Check to see if the return value is less than zero
    3. If yes, we need to sleep. Set the lock integer to -1 with __atomic_store_n( ptr, -1, __ATOMIC_RELEASE );
    4. Then invoke the system call: syscall( SYS_futex, ptr, FUTEX_WAIT, -1, NULL );
    5. Then go back to step 1
    6. If the return value wasn't less than zero, exit the lock() function

  8. Implement your unlock function according to the following algorithm.

    1. Increment the lock integer with ret_val = __atomic_add_fetch( ptr, 1, __ATOMIC_ACQ_REL );
    2. Check to see if the return value is one
    3. If yes, simply exit the unlock() function
    4. Otherwise, we need to wake up some sleeping thread. Set the lock integer to 1 with __atomic_store_n( ptr, 1, __ATOMIC_RELEASE );
    5. Then invoke the system call: syscall( SYS_futex, ptr, FUTEX_WAKE, INT_MAX );

    As the answer to this exercise, please show your implementations of the new lock and unlock functions.

  9. Run your program and verify that only one thread is able to access the critical section at a time.

    As the answer to this exercise, please explain why you think only one thread at a time can access the critical section.

  10. Trace both versions of your program with trace-cmd record -e sched_switch. Take a screenshot showing both behaviors.

    Notice that your spin lock does its synchronization entirely in userspace, while the futex lock sometimes requires the intervention of the kernel. As the answer to this exercise, please explain whether or not it would be possible to implement a sleep lock entirely in userspace (i.e., with no system calls), and why or why not.

Things to turn in