*"Proper synchronization -- locking that is free of deadlocks, scalable, and clean -- requires design decisions from start through finish."*

—Robert Love, *Linux Kernel Development, 3rd Ed.*,
Chapter 9, p. 172.

In your previous lab assignment, you examined scheduling behavior of compute-bound user-space threads that did not need to do memory management or I/O. Those threads were mostly independent, with the exception of occasional contention for a single brief critical section (reading and updating a shared counter variable).

In this assignment, you will implement a kernel module in which kernel threads will run concurrently on your Raspberry Pi to cooperatively compute prime numbers. To do so, you will use kernel memory allocation and deallocation, synchronization, and logging.

**As in the previous lab assignment, you are encouraged to work
with one or two other people as a team on this lab, though you may complete it individually
if you prefer. Teams of more than 3 people are not allowed for this assignment.**

In this lab, you will:

- Implement a kernel module that uses concurrent (and to the extent possible, parallel) processing to compute all the prime numbers up to a specified upper bound.
- Again use basic multi-threaded synchronization and concurrency techniques, but in a more sophisticated configuration in which the threads will cooperate to complete a common task (computing prime numbers).
- Manage kernel memory dynamically within the kernel module.

- See LKD Chapters 9 and 10 for a discussion of kernel synchronization.
- See LKD Chapter 12 for a discussion of kernel memory management.
- See LKD pp. 338-348 (the discussion of modules, in Chapter 17 of the textbook) for information and examples for kernel module programming.
- See CSE 422S Studio 10 Kernel Synchronization for more about using lock-based and atomic synchronization techniques in kernel modules.

Please complete the exercises described below. As you work through them, please take notes regarding your observations,
and when you are finished please write up a cohesive report (as in the previous lab assignment)
and e-mail it along with your code files, graph files, etc. to
**eng-cse422s@email.wustl.edu** with the phrase **Lab3** in the subject line.

In this lab assignment, you will again submit a cohesive report that unifies your observations and thoughts. Since part of the assignment is to present graphs of results from trials you will run to assess performance, you may choose to use a non-text file format (e.g., .docx or .pdf) for your report if you'd like. For more information please see "What to turn in" below.

In this lab you will create a kernel module that will (1) allocate kernel memory for an array of integers from 2 up to a specified upper bound; (2) initialize that memory to contain those integers in ascending order; (3) spawn a specified number of kernel threads, which will then go through the array concurrently, "crossing out" numbers that are not primes by setting them to zero; and (4) once all the threads have finished, print out the remaining prime numbers along with several statistics about the module's performance.

Instead of pinning those threads to specific cores, in this lab you will allow them to run wherever
Linux puts them. As in your previous lab assignment, the threads will first wait at a synchronization
**barrier** until all other threads have been successfully spawned and have arrived at the barrier.
Once all threads have arrived at the barrier, they will each repeat the following operations:
(1) safely select the next prime number from an array of integers, and (2) "cross out" all of its larger multiples
in the array by setting their values to zero (thus marking them as non-prime).
In doing so, these threads will implement a cooperative, concurrent (and possibly parallel) version of the
Sieve of Eratosthenes algorithm.
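As a point of reference, the crossing-out step can be sketched in user-space C before any threads or kernel APIs are involved. The function name `sieve_single` and the layout (position `i` holds the value `i + 2`) are assumptions made for illustration, not requirements of the lab.

```c
#include <assert.h>
#include <stddef.h>

/* Fill nums[0..n-1] with 2..n+1, then zero every larger multiple of each
 * surviving number. Returns the number of cross-out writes, counting
 * repeated writes to the same element. */
unsigned long sieve_single(unsigned long *nums, size_t n)
{
    unsigned long crossed = 0;
    size_t i, j;

    assert(nums != NULL && n > 0);

    for (i = 0; i < n; i++)
        nums[i] = i + 2;              /* position i holds the value i + 2 */

    for (i = 0; i < n; i++) {
        unsigned long p = nums[i];

        if (p == 0)
            continue;                 /* already crossed out, so not prime */
        /* The value 2p sits at position i + p, 3p at i + 2p, and so on. */
        for (j = i + p; j < n; j += p) {
            nums[j] = 0;
            crossed++;
        }
    }
    return crossed;
}
```

For an upper bound of 10 the array has 9 entries, and this sketch performs 7 cross-out writes for 5 non-primes, because 6 and 10 are each crossed out twice.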

**Correct but inefficient processing:** even the single-threaded version of this algorithm has
some built-in inefficiencies, in that some non-prime elements in the array (e.g., 6, 10, 12, 14, etc.) may be
"crossed out" repeatedly when their different prime factors are visited; the multi-threaded version has an
inherent data race that similarly may allow a thread to perform futile processing of a non-prime element
(e.g., threads begin processing of 2, 3, and 4 concurrently even though 4 would have been marked earlier as
non-prime in the single-threaded version, and not processed when it was reached). Since these inefficiencies
impact only performance and not correctness of the algorithm, in this assignment you will record and analyze
them instead of trying to fix them (except optionally for extra credit as noted below).

Your kernel module should use the `module_param` macro to define two `ulong` module parameters, indicating (1) the
number of threads to be used (must be 1 or greater), and (2) an upper
bound on the range of primes to compute (must be 2 or greater). These
parameters should be named `num_threads` and `upper_bound`, respectively. Declare and provide default
values (of 1 and 10 respectively) for two static `unsigned long` variables to store the values for these module parameters,
so that in the event that your module is loaded without parameters,
the module will use a single thread to compute all primes in the range
from 2 to 10 inclusive.

In addition to these parameters, your kernel module should have at least the following global variables that are shared by all of its threads: (1) a pointer to an array of counter variables in which each thread will keep track of how many times it has "crossed out" a non-prime number, (2) a pointer to the array of integers, within which primes will be computed, (3) an integer for the position of the current number (prime in the single-threaded version) that is being processed, and (4) an atomic variable that indicates whether or not the computation of prime numbers has finished in each thread. In addition, your kernel module may have additional global variables to implement correct barrier synchronization (e.g., a counter for how many threads still need to synchronize), protection of any critical sections (e.g., a lock), etc.
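One way to picture the shared variables listed above, together with the safe-default setup that this assignment asks of `init()`, is the following user-space sketch. Everything named here (`struct sieve_state`, `sieve_state_setup`, `sieve_state_teardown`, the `-1` return value, and the `unsigned long` element types) is hypothetical; `malloc()` and `free()` stand in for `kmalloc()` with `GFP_KERNEL` and `kfree()`, and a C11 `atomic_int` stands in for the kernel's `atomic_t`.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

struct sieve_state {
    unsigned long *counters;  /* one cross-out counter per thread */
    unsigned long *nums;      /* the integers 2..upper_bound */
    size_t nums_len;          /* number of entries in nums */
    size_t cur_pos;           /* position of the number being processed */
    atomic_int finished;      /* nonzero once processing has completed */
};

/* Returns 0 on success, -1 on invalid parameters or allocation failure.
 * On any failure both pointers are left NULL, so teardown is harmless. */
int sieve_state_setup(struct sieve_state *s, unsigned long num_threads,
                      unsigned long upper_bound)
{
    size_t i;

    assert(s != NULL);
    s->counters = NULL;
    s->nums = NULL;
    s->nums_len = 0;
    s->cur_pos = 0;
    atomic_store(&s->finished, 1);    /* safe default: nothing is running */

    if (num_threads < 1 || upper_bound < 2)
        return -1;

    s->nums = malloc((upper_bound - 1) * sizeof(*s->nums));
    if (s->nums == NULL)
        return -1;

    s->counters = malloc(num_threads * sizeof(*s->counters));
    if (s->counters == NULL) {
        free(s->nums);                /* undo the first allocation */
        s->nums = NULL;
        return -1;
    }

    for (i = 0; i < num_threads; i++)
        s->counters[i] = 0;
    s->nums_len = upper_bound - 1;
    for (i = 0; i < s->nums_len; i++)
        s->nums[i] = i + 2;           /* position 0 holds the number 2 */
    atomic_store(&s->finished, 0);    /* processing has not completed */
    return 0;
}

void sieve_state_teardown(struct sieve_state *s)
{
    free(s->counters);                /* free(NULL) is a no-op */
    free(s->nums);
    s->counters = NULL;
    s->nums = NULL;
    s->nums_len = 0;
}
```

This sketch cleans up the first allocation itself when the second one fails; deferring that cleanup to the teardown path would be an equally valid design.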

- Write a function that all spawned threads will run, which takes a pointer to a counter variable that it will use
to track how many non-prime numbers it has crossed out, and then sequentially (i.e., within that same thread): (1) calls a
function that performs barrier synchronization with the other threads, and after the barrier function has completed then
(2) calls a function that repeatedly marks non-prime numbers in the array (and increments its counter each time it marks one)
until the entire array is processed, (3) calls the barrier function again, and (4) updates the atomic variable to
indicate that all threads have finished processing and then returns.
How you implement the barrier is up to you, but it must work correctly twice! One straightforward way to do this is to use spin locks with both a counter for how many threads still need to arrive at the barrier and an atomic variable for the possible states of barrier synchronization (no barriers completed, one barrier completed, two barriers completed, etc.), but then you must make sure the module initialization function and the barrier function set/reset these variables appropriately.

The prime computation function should repeatedly (1) safely (as a critical section protected by a mutex or spin lock) store the value of the current position variable in a local variable, and then advance the current position variable until it either reaches another non-zero value in the array or exceeds the last position in the array; (2) if the position in the local variable is beyond the last element in the array then the function should simply return; otherwise the function should (3) safely (as a critical section protected by a mutex or spin lock) go to each number in the array that is a larger multiple of the (prime) number at the position stored in the local variable, set each of those larger multiples to zero, and increment its counter each time it does that.
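The three steps above can be sketched in user-space C as follows. The helper names (`lock_acquire`, `lock_release`, `claim_position`, `cross_out_multiples`, `compute_primes`) are hypothetical, and the `atomic_flag` spin loop is an assumed stand-in for a kernel spin lock or mutex. Skipping a claimed position whose value has already been zeroed is a choice made in this sketch to keep the stride non-zero.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

static atomic_flag sieve_lock = ATOMIC_FLAG_INIT;

static void lock_acquire(void)
{
    while (atomic_flag_test_and_set(&sieve_lock))
        ;                               /* spin until the lock is free */
}

static void lock_release(void)
{
    atomic_flag_clear(&sieve_lock);
}

/* Step (1): take the current position, then advance the shared position
 * to the next non-zero entry or past the end of the array. */
size_t claim_position(const unsigned long *nums, size_t n, size_t *cur_pos)
{
    size_t mine;

    lock_acquire();
    mine = *cur_pos;
    if (*cur_pos < n) {
        do {
            (*cur_pos)++;
        } while (*cur_pos < n && nums[*cur_pos] == 0);
    }
    lock_release();
    return mine;                        /* mine >= n means nothing is left */
}

/* Step (3): zero every larger multiple of nums[pos], counting each write. */
void cross_out_multiples(unsigned long *nums, size_t n, size_t pos,
                         unsigned long *counter)
{
    unsigned long p;
    size_t j;

    assert(pos < n);
    lock_acquire();
    p = nums[pos];
    if (p != 0) {                       /* zero means another thread got here */
        for (j = pos + p; j < n; j += p) {
            nums[j] = 0;
            (*counter)++;
        }
    }
    lock_release();
}

/* The loop each thread runs between the two barriers. */
void compute_primes(unsigned long *nums, size_t n, size_t *cur_pos,
                    unsigned long *counter)
{
    for (;;) {
        size_t pos = claim_position(nums, n, cur_pos);

        if (pos >= n)
            return;                     /* step (2): array fully processed */
        cross_out_multiples(nums, n, pos, counter);
    }
}
```

Holding one lock for the whole cross-out pass is simple but serializes the threads, which is part of what the performance trials in this lab are meant to expose.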

**Note that since each thread has its own independent counter, there should be no need to protect the counter with locks, etc.**

- Your module's `init()` function should initialize all of its variables (besides the module parameters) so that if either module parameter has an invalid value (threads less than 1, or an upper bound less than 2) the module can safely do nothing, and the module's `exit()` function will not cause a memory leak, access violation, or other hazard. Specifically, the pointer to the array of integers should be initialized to 0, as should the variable that stores the size of the array, and the atomic variable that indicates whether or not processing has completed should be set to indicate that it has completed. Then, if either module parameter is invalid, the `init()` function should print an error message to the system log that shows the invalid parameter values and says what's wrong with either (or both) of them, and then the `init()` function should simply return.

Otherwise (if both module parameters are valid), the `init()` function should then allocate kernel memory, using `kmalloc()` with `GFP_KERNEL`, for an array of integers large enough to hold all integers from 2 up to (and including) the number in the upper bound parameter. If that fails, the `init()` function should print an appropriate message to the system log, make sure the pointer to the array of integers is 0, and then simply return.

Otherwise (if the previous allocation succeeded), the `init()` function should then allocate additional kernel memory, again using `kmalloc()` with `GFP_KERNEL`, for an array with individual unsigned integers that will be used as counters, with which each kernel thread will keep track of how many "cross out" operations it performs - the size of that array should be the number of threads, given in the `num_threads` module parameter. If that fails, the `init()` function should print an appropriate message to the system log and then (before returning) make sure the pointer to the array of counters is 0, and then either defer to the `exit()` function to do the following, or itself perform the following: deallocate the memory for the (previously successfully allocated) integer array using `kfree()`, and make sure the pointer to the array of integers is also 0.

Otherwise, the `init()` function should iterate through the array of per-thread counter variables, setting each one to 0, and then should iterate through the integer array, setting each successive location to the next integer from 2 up to (and including) the upper bound, and then also should set the value of the variable for the current position to the position that contains the number 2. The `init()` function then should set the atomic variable that indicates whether or not processing has completed, to indicate that processing has **not** completed.

Finally, the `init()` function should then spawn as many threads as are indicated by the `num_threads` module parameter, passing a pointer to a **different** element of the array of counters (for tracking how many "cross out" operations each thread performs) to each thread (as an argument to that thread's thread function).

- Add global time stamp variables and code to your module, to record the time at which each of the
following events occurs:
(1) module initialization begins (the `init()` function should record this time stamp value as the first thing it does, before doing anything else); (2) when all threads have reached the first barrier and are about to begin their work (the last thread to reach the first barrier should record this time stamp value before it or any other thread continues); and (3) when all threads have reached the second barrier after completing their work (the last thread to reach the second barrier should record this time stamp value before it or any other thread continues).

**Hint:** initializing the global time stamp variables to zero may help to simplify your barrier code since at each barrier a different time stamp variable will need to be recorded.

- The module's
`exit()` function should examine the atomic variable that indicates whether or not processing has completed, and if it has **not** completed should print an error message to the system log and then simply return without doing anything (presumably it's better to have a potential memory leak than to have an access violation or other more major issue). Similarly, if a memory allocation failure had occurred during the module's `init()` function, then depending on your design the module's `exit()` function may need to complete any remaining deallocation responsibilities that have not been handled by the `init()` function.

Otherwise (if initialization was successful and processing has completed) the `exit()` function should:

- iterate through the array and both count and print out (to the system log) all the non-zero numbers in it (i.e., the primes) in a nicely formatted style (e.g., 8 per line).
- print out (to the system log) how many primes were found, how many non-primes there were in the array (by subtraction from how many integers were in the array), and how many times numbers were unnecessarily crossed out (by summing up how many integers each thread crossed out and subtracting the number of non-primes from that total).
- print out (to the system log) the values of the module parameters (upper bound and number of threads).
- print out (to the system log) how long was spent setting up the module (from the initialization time stamp to the time stamp taken at the first barrier) and how long was spent processing primes (from the time stamp taken at the first barrier to the time stamp taken at the second barrier).
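The arithmetic behind these statistics can be sketched in user-space C as shown below. The names `struct sieve_stats`, `summarize`, and `print_primes` are hypothetical, and `printf()` stands in for writing to the system log; a kernel module would use `printk()` instead.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

struct sieve_stats {
    unsigned long primes;       /* non-zero entries left in the array */
    unsigned long non_primes;   /* array length minus the primes */
    unsigned long unnecessary;  /* total cross-outs minus the non-primes */
};

struct sieve_stats summarize(const unsigned long *nums, size_t n,
                             const unsigned long *counters, size_t num_threads)
{
    struct sieve_stats st = { 0, 0, 0 };
    unsigned long total = 0;
    size_t i;

    assert(nums != NULL && counters != NULL);

    for (i = 0; i < n; i++)
        if (nums[i] != 0)
            st.primes++;
    st.non_primes = n - st.primes;

    for (i = 0; i < num_threads; i++)
        total += counters[i];           /* sum the per-thread counters */
    st.unnecessary = total - st.non_primes;
    return st;
}

/* Print the surviving numbers, 8 per line. */
void print_primes(const unsigned long *nums, size_t n)
{
    size_t i, shown = 0;

    for (i = 0; i < n; i++) {
        if (nums[i] == 0)
            continue;
        shown++;
        printf("%lu%s", nums[i], (shown % 8 == 0) ? "\n" : " ");
    }
    if (shown % 8 != 0)
        printf("\n");                   /* finish a partial last line */
}
```

For an upper bound of 10 with per-thread counters of 4 and 3, this gives 4 primes, 5 non-primes, and 7 - 5 = 2 unnecessary cross-outs.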

After that, the `exit()` function should deallocate the memory for both arrays (of integers and counters, respectively) using `kfree()` twice, and then return.

- Compile your module and test it thoroughly on your Raspberry Pi to make sure it
(1) is functionally correct (terminates and finds all the prime numbers in a range, doesn't report any non-prime numbers
as being prime, etc.), and (2) is thread-safe (doesn't have meaningful data races, doesn't deadlock or crash, etc.
when run with multiple threads).
Then, make a copy of your module and in that copy: (1) change the type of element in the array of numbers to be sieved to `atomic_t` (if a different type was used previously to store integers), (2) remove the locks from the critical sections of the prime computation function, and (3) re-implement the prime computation function to use atomic operations such as `atomic_set()`, `atomic_add()`, `atomic_read()`, etc. to (safely) perform all reading and writing of elements in the array.

Also compile and test your new module that uses atomic operations, and when you are satisfied that it runs correctly, add a section titled "Module Design and Implementation" to your report, and in it describe how you implemented the features of this assignment, including locking versus atomic operations in the first and second modules.
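A user-space sketch of the lock-free variant might look like the following, with C11 `atomic_int`, `atomic_load()`, and `atomic_store()` standing in for the kernel's `atomic_t`, `atomic_read()`, and `atomic_set()`. Claiming the position with a compare-and-swap loop (comparable to the kernel's `atomic_cmpxchg()`) is an assumption of this sketch rather than something the assignment prescribes, and the function names are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Claim the current position without a lock: compute the next non-zero
 * position, then install it only if no other thread moved cur_pos first. */
size_t claim_position_atomic(atomic_int *nums, size_t n, atomic_size_t *cur_pos)
{
    size_t mine = atomic_load(cur_pos);

    for (;;) {
        size_t next = mine;

        if (mine >= n)
            return mine;                /* nothing left to claim */
        do {
            next++;
        } while (next < n && atomic_load(&nums[next]) == 0);
        if (atomic_compare_exchange_weak(cur_pos, &mine, next))
            return mine;
        /* The exchange failed and reloaded `mine`; try again. */
    }
}

/* Zero every larger multiple of the value at pos using atomic stores. */
void cross_out_atomic(atomic_int *nums, size_t n, size_t pos,
                      unsigned long *counter)
{
    int p;
    size_t j;

    assert(pos < n);
    p = atomic_load(&nums[pos]);
    if (p <= 0)
        return;                         /* already crossed out elsewhere */
    for (j = pos + (size_t)p; j < n; j += (size_t)p) {
        atomic_store(&nums[j], 0);
        (*counter)++;
    }
}
```

Without a lock around the whole pass, several threads can cross out overlapping multiples at the same time, so the count of unnecessary cross-outs may differ from the locked version.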

- On your Raspberry Pi run representative trials that explore how well the two different modules perform,
in terms of initialization and processing times as well as in terms of how many unnecessary "cross out" operations
were performed, for different numbers of threads and different upper bounds on the range of primes to compute.
Add a section titled "Module Performance" to your report, and in it summarize what trials you ran, and what trends you noticed in (1) the timing of the initialization and prime computation sections for each module and (2) the number of unnecessary "cross out" operations performed, for different numbers of threads and different upper bounds, in each module.

For each of the modules, please graph (e.g., using Excel or gnuplot) the timing results as follows: completion time (initialization time plus prime computation time) on the y-axis vs. the upper bound on the range of primes on the x-axis, with a separate curve ("data series" in Excel) for each different number of threads used - leaving out curves that are essentially identical.

For each of the modules, please graph (e.g., using Excel or gnuplot) the efficiency results as follows: number of unnecessary "cross out" operations on the y-axis vs. the upper bound on the range of primes on the x-axis, with a separate curve ("data series" in Excel) for each different number of threads used - leaving out curves that are essentially identical. In your report please at least give the names of the graph files you are turning in with those plots, and if you would like to use a non-text format please also feel free to include pictures of the plots with your discussion. At the end of that section please compare and contrast the trends you saw with the two different modules, and discuss briefly why any trends differed and why any other trends were similar between the modules.

**What to turn in:**
(1) source code files for your two modules (with locking versus with atomic operations),
(2) a report file with the contents described next,
(3) the graph files for your plots as noted above, and
(4) any other files (e.g., screen-shots or traces) that may enhance your report.

Your report file should include:

- The name and number of the lab.
- The name and email address of everyone who worked together on this lab.
- Attribution of sources for any materials that were used in (or strongly influenced) your solution.
- A Module Design and Implementation section, with a discussion of design decisions you made in creating your lab solution (and their rationale), including details about your approaches for using locking versus atomic operations and how you designed and implemented other features of the assignment.
- A Module Performance section, with a summary of trends you observed and what the data you collected show about the different modules' performance as described above.
- Names of the files with plots of the timing and efficiency results as described above.
- Names of any other files with interesting screen-shots or traces you may have collected along with how you generated them and a brief discussion of why you find their results interesting.
- Any insights or questions you may have had while completing this assignment.
- Any suggestions you have for how to improve the assignment itself.
- The approximate amount of time you spent on this assignment.

**EXTRA CREDIT**

Implement one of the following options (worth up to 5 percent of the value of the lab assignment):

- Make copies of both of your modules and modify those additional source code files to implement the
Sieve of Sundaram algorithm instead of
the Sieve of Eratosthenes algorithm.
Compile your additional modules, and on your Raspberry Pi run representative trials that show how well those new versions perform compared to the original versions, in terms of initialization and processing times as well as in terms of how many unnecessary "cross out" operations were performed, for different numbers of threads and different upper bounds on the range of primes to compute.

Add a section titled "Extra Credit" to your report, and in that section please document (1) how you designed and implemented this feature, (2) output results (and/or text or figures summarizing them), and (3) a brief analysis of what those results say about the strengths and weaknesses of the original versus new versions. Please also submit the additional source code files for the new modules with your report and other items noted above.
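For orientation, a single-threaded user-space sketch of the Sieve of Sundaram is shown below. The function name `sundaram_primes` and its output-buffer interface are hypothetical, and `calloc()` stands in for kernel allocation. The algorithm marks every index of the form i + j + 2ij, and each unmarked index m then corresponds to the odd prime 2m + 1.

```c
#include <assert.h>
#include <stdlib.h>

/* Writes up to out_cap primes <= upper_bound into out and returns how many
 * primes were found in total, or 0 if the scratch allocation failed. */
size_t sundaram_primes(unsigned long upper_bound, unsigned long *out,
                       size_t out_cap)
{
    unsigned long k, i, j, m;
    unsigned char *marked;
    size_t count = 0;

    assert(upper_bound >= 2 && out != NULL);

    k = (upper_bound - 1) / 2;          /* 2k + 1 is the largest odd <= bound */
    marked = calloc(k + 1, 1);
    if (marked == NULL)
        return 0;

    /* Mark i + j + 2ij for 1 <= i <= j: these give odd composites 2m + 1. */
    for (i = 1; i + i + 2 * i * i <= k; i++)
        for (j = i; i + j + 2 * i * j <= k; j++)
            marked[i + j + 2 * i * j] = 1;

    if (count < out_cap)
        out[count] = 2;                 /* 2 is the only even prime */
    count++;
    for (m = 1; m <= k; m++) {
        if (marked[m])
            continue;
        if (count < out_cap)
            out[count] = 2 * m + 1;
        count++;
    }
    free(marked);
    return count;
}
```

Note that the array here is indexed by m rather than by the integers themselves, so the notion of a "cross out" operation and the way work is divided among threads would both need to be redefined for this variant.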

- Make a copy of your module that locks critical sections of the prime computation function, and in that copy
implement finer-grained locking strategies such as taking a module parameter for how many locks to use to protect the array of integers (with a default value of 1), and then using modulus arithmetic to determine which critical sections of the prime computation
function will use which locks. For example, if there were two locks protecting the array of integers, then even-numbered positions in the array
(starting with position 0) would use one lock, and odd-numbered positions in the array would use the other. Using a prime number of locks that is greater than 2 can allow even greater concurrency since there is likely to be less contention for any given lock.
Compile your additional module, and on your Raspberry Pi run representative trials that illustrate how well this new version performs compared to both of the original versions (with a simple locking strategy versus with atomic operations), in terms of initialization and processing times for different numbers of threads, different upper bounds on the range of primes to compute, and different numbers of locks.

Add a section titled "Extra Credit" to your report, and in that section please document (1) how you designed and implemented this feature, (2) output results (and/or text or figures summarizing them), and (3) a brief analysis of what those results say about the strengths and weaknesses of the original versions versus this new one and about how the choice of the number of locks may affect performance. Please also submit the additional source code file for the new module with your report and other items noted above.
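The modulus mapping in the finer-grained locking option can be sketched as follows. Both function names (`lock_index_for`, `locks_touched`) and the cap of 64 locks are assumptions for illustration. The second helper counts how many distinct locks one cross-out pass would touch, which may help when reasoning about contention for a given lock count.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_LOCKS 64    /* arbitrary cap chosen for this sketch */

/* Which lock guards array position pos when num_locks locks are in use. */
size_t lock_index_for(size_t pos, size_t num_locks)
{
    assert(num_locks > 0);
    return pos % num_locks;
}

/* Number of distinct locks touched when crossing out positions
 * pos + step, pos + 2*step, ... below n. */
size_t locks_touched(size_t pos, size_t step, size_t n, size_t num_locks)
{
    unsigned char seen[MAX_LOCKS] = { 0 };
    size_t j, count = 0;

    assert(step > 0 && num_locks > 0 && num_locks <= MAX_LOCKS);

    for (j = pos + step; j < n; j += step) {
        size_t k = lock_index_for(j, num_locks);

        if (!seen[k]) {
            seen[k] = 1;
            count++;
        }
    }
    return count;
}
```

With two locks, the pass for the number 2 (position 0, stride 2) only ever touches lock 0, while with three locks the same pass cycles through all three. A stride that is a multiple of the lock count always lands on a single lock.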

Posted 8:40am and updated at 10:00am and 12:30pm on Monday November 20th 2017 by Chris Gill.

Changes since original posting:

- Added a link to CSE 422S Studio 10 Kernel Synchronization to the Readings and other resources section.
- Clarified that you are free to design whether the `init()` function or the `exit()` function is responsible for cleaning up the memory for the first allocation if the second allocation fails.
- Added further clarification and minor editorial corrections.