CSE 422S: Lab 2

Linux Scheduler Profiling


"There are two main reasons to use concurrency in an application: separation of concerns and performance. In fact I'd go so far as to say that they're pretty much the only reasons to use concurrency;"

—Anthony Williams, C++ Concurrency in Action, Chapter 1.2

As we have discussed previously, the Linux Kernel provides a variety of schedulers, each of which may be better (or worse) suited for different types of tasks. A good understanding of the semantics of the different schedulers, and which of them is better suited for different scenarios, can make a significant difference in the performance of a system.

As with the previous lab assignment, you are encouraged to work on this lab with one or two other people as a team, though you may complete the assignment individually if you prefer. Teams of more than 3 people are not allowed for this assignment.

In this lab, you will:

  1. Profile the SCHED_FIFO, SCHED_RR, and SCHED_NORMAL Linux schedulers.
  2. Use basic multi-threaded synchronization and concurrency techniques.
  3. Characterize/verify threading behavior under different schedulers, using tracing.

Readings and other resources

The following sources of information are likely to be useful as you work on this project.

Assignment

Please complete the exercises described below. As you work through them, take notes recording both your observations and your answers to the questions given below. When you are finished, write up a cohesive report (see below) and e-mail it, along with your code files, Makefile, etc., to eng-cse422s@email.wustl.edu with the phrase Lab2 in the subject line.

What you will turn in for this lab differs somewhat from previous lab and studio exercises. You will submit a cohesive report that unifies your observations and thoughts: rather than answering this lab's questions independently of each other, note your observations and answers as you go. As you move from one question to the next, consider both your new and your previous answers, and where appropriate revisit earlier questions and note the connections between them. This should help you synthesize a cohesive report when you are finished.

This lab is also meant to focus less on how things are implemented and more on what you learn and notice about the different scheduling classes. For more information see "What to turn in" below.

Note: the user-space program(s) for this lab can be written in any language of your choosing, as long as that language supports the necessary features. The lab's purpose is to cultivate and demonstrate (1) knowledge about different Linux schedulers and (2) the ability to think critically about (and discuss) their behaviors, rather than to demonstrate mastery of any particular programming language. That said, everything in this lab is fairly straightforward to do in C, and even more so if you use the C++11 standard library's threading and synchronization features. You are free to adopt algorithms, code fragments, etc. from other sources (such as Williams' book noted above), but if you do so you must (1) comment those portions of your code to clearly indicate where they came from, and also (2) discuss and cite what you've used (and the source from which it came) in your submitted report.

In this lab you will create a program that spawns a certain number of threads to be pinned on each core. These threads will then wait at a synchronization barrier until all of the threads have been successfully spawned and pinned. Once all threads have arrived at the barrier (which is used only this once), each will repeat the following operations: (safely) select the next number from a shared data structure, and cube that number repeatedly (for a given number of iterations). This activity of (1) safely selecting the next number (which should be protected from any data races) and (2) repeatedly cubing it (which is intended to define a basic unit of workload for the thread to perform) is repeated for a given number of rounds in each thread, giving each thread a sustained and configurable overall workload and some degree of contention with the other threads, through which the performance of each scheduler can be evaluated.

The program will take in five or more arguments indicating (1) the scheduling class to be used (SCHED_FIFO, SCHED_RR, or SCHED_NORMAL), (2) whether threads should consume CPU cycles or suspend their execution while they wait for each other (spin, sleep), (3) a positive number of rounds for each thread to perform overall, (4) a positive number of iterations of cubing the selected number that each thread will perform in each round, and (5+) one or more additional numbers that should be used to populate the data structure from which the threads will repeatedly obtain numbers. For example, a command line such as

./myprog SCHED_RR spin 100 1000 2 3 5 7 11

would use the round-robin real-time scheduler, threads would spin-wait in order to synchronize, and each thread would perform one hundred rounds of: obtaining one of the prime numbers in the range 2 through 11 inclusive and simply repeatedly computing the cube of that same number (not re-cubing the result of the previous iteration, which could easily introduce overflow and other representation issues we won't go into) 1000 times.
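One practical note about the cubing workload: an optimizing compiler may delete a loop whose result is never used, which would make the intended workload vanish. The following minimal C sketch (the names here are illustrative, not required) uses a volatile accumulator so the stores cannot be discarded:

    /* Volatile so the compiler cannot optimize the loop away. */
    volatile long sink;

    /* One round's worth of work: cube the same number `iterations` times. */
    void do_work(long n, long iterations)
    {
        for (long i = 0; i < iterations; i++)
            sink = n * n * n;   /* cube n itself each time; do not re-cube the result */
    }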

    NOTE: Some of these exercises/questions may freeze your Pi. Save your work often, and read ahead so that you are aware of where we expect such freezes may occur.

  1. Begin by creating a program in the language of your choice that reads in arguments from the command line in the following format:

    <program_name> <scheduler> <wait-strategy> <rounds> <iterations> <number>+

    The scheduler argument should be either "SCHED_RR", "SCHED_FIFO", or "SCHED_NORMAL" (note that SCHED_NORMAL is sometimes called SCHED_OTHER but we will use SCHED_NORMAL).

    The wait-strategy argument should be either "spin" or "sleep" indicating whether active waiting (e.g., via spin-locks) or passive waiting (e.g., via library features such as mutexes and condition variables) should be used to synchronize threads.

    The rounds argument gives the number of times each thread should select a new number from the data structure.

    The iterations argument gives the number of times within each round that the selected number should be cubed by the thread.

    One or more arguments should be given after that, indicating the values (number arguments) that should be read into the data structure (from which the threads will then select specific numbers to cube).
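    As one illustration, a minimal C sketch of this argument handling might look like the following (names such as numbers and count are purely illustrative, and error handling is omitted; note that the constant exposed by the C headers for SCHED_NORMAL is SCHED_OTHER):

        #include <sched.h>
        #include <stdio.h>
        #include <stdlib.h>
        #include <string.h>

        int main(int argc, char *argv[])
        {
            if (argc < 6) {
                fprintf(stderr, "usage: %s <scheduler> <wait-strategy> <rounds> <iterations> <number>+\n", argv[0]);
                return 1;
            }
            int policy = SCHED_OTHER;                   /* SCHED_NORMAL, in kernel terms */
            if (strcmp(argv[1], "SCHED_FIFO") == 0) policy = SCHED_FIFO;
            if (strcmp(argv[1], "SCHED_RR")   == 0) policy = SCHED_RR;
            int  spin       = (strcmp(argv[2], "spin") == 0);
            long rounds     = strtol(argv[3], NULL, 10);
            long iterations = strtol(argv[4], NULL, 10);
            int   count   = argc - 5;                   /* how many numbers were given */
            long *numbers = malloc(count * sizeof(long));
            for (int i = 0; i < count; i++)
                numbers[i] = strtol(argv[5 + i], NULL, 10);
            /* ... spawn, pin, and schedule the threads here ... */
            free(numbers);
            return 0;
        }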

  2. The program should read the provided number(s) into a data structure and spawn 2 threads per core. Pin these threads onto specific cores so that each core has exactly 2 threads pinned to it, and set each thread to use the scheduler given in the scheduler argument.
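    One way to do this (a sketch, assuming pthreads; error checking is omitted, and note that setting SCHED_FIFO or SCHED_RR requires root privileges, e.g., running under sudo) is to set a CPU affinity mask and scheduling parameters on each thread after creating it:

        #define _GNU_SOURCE
        #include <pthread.h>
        #include <sched.h>

        /* Pin thread t to a single core and apply the requested policy.
         * prio must be 0 for SCHED_OTHER and 1..99 for SCHED_FIFO/SCHED_RR. */
        static void pin_and_schedule(pthread_t t, int core, int policy, int prio)
        {
            cpu_set_t set;
            CPU_ZERO(&set);
            CPU_SET(core, &set);
            pthread_setaffinity_np(t, sizeof(set), &set);

            struct sched_param sp = { .sched_priority = prio };
            pthread_setschedparam(t, policy, &sp);
        }

    With two threads per core on the Pi 3's four cores, you might create eight threads and call something like pin_and_schedule(tid[i], i / 2, policy, prio) on each (the later exercises will then have you vary prio or nice values per thread).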

  3. Write a function for your spawned threads that reads a number from your data structure, cubes that number iterations times, and then selects another number. Each thread should perform this entire activity a total of rounds times.

    The data structure holding the numbers will be accessed by multiple threads at once, and should maintain (safely, i.e., avoiding race conditions for it) a variable (e.g., an index, counter, pointer, etc.) that keeps track of which number the next thread should read. Each time a number is read by a thread, that variable should advance to refer to the next number in the data structure (and after the last number is read should go back to the first number). You must allow concurrent access to this structure but avoid data races (particularly for that variable), e.g., using spin locks if the program's wait-strategy argument was "spin", or mutexes and condition variables if it was "sleep".
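    For instance, the "sleep" variant of this protected selection might look like the sketch below (a "spin" build could use a pthread_spinlock_t in exactly the same pattern); the variable and function names are illustrative only:

        #include <pthread.h>

        static long *numbers;        /* values read from the command line       */
        static int   count;          /* how many values were given              */
        static int   next_index = 0; /* which value the next thread should take */
        static pthread_mutex_t numbers_lock = PTHREAD_MUTEX_INITIALIZER;

        static long take_next_number(void)
        {
            pthread_mutex_lock(&numbers_lock);      /* error checking omitted */
            long n = numbers[next_index];
            next_index = (next_index + 1) % count;  /* wrap around after the last value */
            pthread_mutex_unlock(&numbers_lock);
            return n;
        }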

    Furthermore, it would defeat the purpose of the lab to allow certain threads to begin their (important :-) work of cubing integers while other threads were still being spawned and pinned. Therefore, create a way for threads to synchronize and wait (again using spin locks if the program's wait-strategy argument was "spin", or mutexes and condition variables if it was "sleep") until all threads are ready to begin their work. This is known as a thread barrier.
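    A minimal "sleep" barrier sketch, built from a mutex and a condition variable, is shown below; for the "spin" wait-strategy the same idea can be expressed by busy-waiting on an atomically updated arrival counter instead of calling pthread_cond_wait. (POSIX also provides pthread_barrier_t, but writing your own makes the spin-vs-sleep distinction explicit.) The names are illustrative:

        #include <pthread.h>

        static pthread_mutex_t barrier_mtx  = PTHREAD_MUTEX_INITIALIZER;
        static pthread_cond_t  barrier_cond = PTHREAD_COND_INITIALIZER;
        static int arrived = 0;
        static int expected;   /* set to the total thread count before spawning */

        /* Single-use barrier: each thread calls this once before starting work. */
        static void barrier_wait(void)
        {
            pthread_mutex_lock(&barrier_mtx);
            if (++arrived == expected)
                pthread_cond_broadcast(&barrier_cond);  /* last arrival wakes everyone */
            else
                while (arrived < expected)
                    pthread_cond_wait(&barrier_cond, &barrier_mtx);
            pthread_mutex_unlock(&barrier_mtx);
        }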

  4. Do each of the following exercises first with the wait-strategy argument set to "spin" and then again with it set to "sleep". When you answer each of the questions below, please discuss whether or not you saw any differences in behavior between the two strategies (and if you did, what those differences were and why you think they may have occurred).

    1. Run your program on your Raspberry Pi with SCHED_NORMAL (the default Linux scheduler) and use trace-cmd to verify that your threads correctly wait at the barrier. Question 1: How can you tell that your barrier worked?

    2. As you've learned from previous studios, SCHED_NORMAL uses (among other things) nice values to determine which threads run at any given point. If SCHED_NORMAL is the chosen scheduler, set a different nice value for each thread on a particular core. You can re-use those "nice" vs. "not-as-nice" values for the pairs of threads on each core, or you can vary them from core to core (as long as they differ on any given core). Experiment with different nice levels across threads and cores, and observe how they affect (1) the overall amount of time the program takes to run, and (2) the amount of time each thread spends on the CPU before the scheduler switches to the other thread. (A sketch of one way to set a per-thread nice value appears after this list.)

      How you obtain the timing information is up to you. Be creative. Possibilities include creating a kernel module that monitors which tasks are on the CPU and/or writing a script that extracts that information from a trace-cmd .dat file. Note and explain your observations. Please use appropriately large numbers of rounds and iterations, so that the scheduling behavior is clear (setting both to 1000 should suffice).

    3. Unlike SCHED_NORMAL, SCHED_RR and SCHED_FIFO do not use nice values. Instead they use fixed real-time priorities when making scheduling decisions. When one of these schedulers is chosen, give different real-time priorities to each thread on a core (again, feel free to re-use the same two values from core to core), creating a high-priority thread and a low-priority thread per core. Run your program a couple of different times with different priority values. Question 2: What happens?

    4. Don't be dismayed if the last exercise froze your Pi (it probably will have done so in at least some runs with spin-waiting, though that may depend on your approach to the last exercise). Question 3: Why might we have expected this with spin-waiting? To help you figure out why and when your Pi froze (if it did), you may want to place print statements (each followed by a flush statement, to force the message from the output buffer onto the terminal) in your code.

    5. In order to fix the above situation (if it occurred), you may want to consider doing one or more of the following: (i) determining that it is simply impossible to program correctly with multiple RT-priority threads running on the same processor at once while using spin locks (if so, please explain why); (ii) using separate barriers for the high-priority vs. low-priority tasks; (iii) increasing the number of threads per CPU to 4; (iv) decreasing the RT priority of the threads; (v) choosing a particular wait-strategy to always use (and if so, which one and why).

      Question 4: Which of the above may help to address the problem of your Pi (potentially) freezing, and how and why would it help? Evaluate your hypothesis by implementing the necessary change and running your program with both SCHED_RR and SCHED_FIFO, examining how their traces differ from what you saw previously, and considering and discussing why they might have done so.

    6. Now change the number of threads that are spawned to be 4 threads per core, and repeat the previous exercises. For SCHED_NORMAL, give two tasks the same "nice" nice level and two tasks the same "not-so-nice" nice level. For SCHED_FIFO and SCHED_RR, give two tasks the same high priority and two tasks the same low priority. As before, make note of the timing behavior both of the program overall and of individual threads. Run your program with each of the schedulers and note what happens. If there are any peculiarities, note them and hypothesize about what their causes may be and how to fix them, and implement those fixes if you so desire.
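For the SCHED_NORMAL exercises above, one way to give each thread its own nice value is for each thread to set its own nice level once it starts running. The sketch below assumes this approach (the function name is illustrative); on Linux, nice values are per-thread and are applied via the thread's kernel TID:

    #define _GNU_SOURCE
    #include <sys/resource.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    /* Each thread calls this on itself under SCHED_NORMAL; e.g., nice 0 for the
     * "nice" thread on a core and nice 10 for the "not-as-nice" one. */
    static void set_my_nice(int nice_value)
    {
        pid_t tid = syscall(SYS_gettid);             /* this thread's kernel TID */
        setpriority(PRIO_PROCESS, tid, nice_value);  /* lower value = larger CPU share */
    }

For the SCHED_FIFO and SCHED_RR exercises, the real-time priorities are instead supplied through pthread_setschedparam, as in the earlier pinning sketch.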


What to turn in: (1) all the code and compilation files used to implement and run your solution (including a Makefile if you used one, etc.); (2) a readme.txt file with the contents described next, and (3) other files (e.g., with screen-shots from Kernelshark) that enhance your report.

The first section of your readme.txt file should include:

  1. The name and number of the lab.
  2. The name and email address of everyone who worked together on this lab.
  3. Attribution of sources for any materials that were used in (or strongly influenced) your solution, e.g., Williams' thread barrier mentioned above, if your approach was based on it.
  4. Design decisions you made in creating your lab and their rationale, including your rationale for using the programming language you chose and for how you structured your code.
  5. Detailed answers to the highlighted questions asked above (questions 1 through 4), not necessarily in the order in which the questions were asked, but rather as a thoughtful synthesis of all questions asked above and the thoughts or observations you may have had along the way.
  6. Precisely which values you used for iterations and rounds at different points in the assignment, and how they may have affected your runs.
  7. Names of the files with interesting screen-shots you may have from Kernelshark along with what code you ran to generate them and discussion of why you find their results interesting.
  8. Any insights or questions you may have had while completing this assignment.
  9. Any suggestions you have for how to improve the assignment itself.
  10. The amount of time you spent on this assignment.
The second section of your readme.txt should include detailed instructions for how to:
  1. unzip or otherwise unpack your files,
  2. build your programs (including on which machines to do that), and
  3. run your programs on the Raspberry Pi 3 (including a list of all the different command lines used to generate your results).


Posted 8:30am Wednesday October 18th and updated 2:25pm on Friday October 20th and 10:00am Monday November 6th, 2017 by Chris Gill.

Changes since original posting: