"Proper synchronization -- locking that is free of deadlocks, scalable, and clean -- requires design decisions from start through finish."
—Robert Love, Linux Kernel Development, 3rd Ed., Chapter 9, pp. 172.
In your previous lab assignment, you examined scheduling behavior of compute-bound user-space threads that did not need to do memory management or I/O. Those threads were mostly independent, with the exception of occasional contention for a single brief critical section (reading and updating a shared counter variable).
In this assignment, you will implement a kernel module in which kernel threads will run concurrently on your Raspberry Pi to cooperatively compute prime numbers. To do so, you will use kernel memory allocation and deallocation, synchronization, and logging.
As in the previous lab assignment, you are encouraged to work with one or two other people as a team, though you may complete this assignment individually if you prefer. Teams of more than 3 people are not allowed for this assignment.
In this lab, you will:
Please complete the exercises described below. As you work through them, please take notes on your observations, and when you are finished please write up a cohesive report (as in the previous lab assignment) and e-mail it, along with your code files, graph files, etc., to email@example.com with the phrase Lab3 in the subject line.
In this lab assignment, you will again submit a cohesive report that unifies your observations and thoughts. Since part of the assignment is to present graphs of results from the trials you will run to assess performance, you may use a non-text file format (e.g., .docx or .pdf) for your report if you'd like. For more information please see "What to turn in" below.
In this lab you will create a kernel module that will (1) allocate kernel memory for an array of integers from 2 up to a specified upper bound; (2) initialize that memory to contain those integers in ascending order; (3) spawn a specified number of kernel threads, which will then go through the array concurrently, "crossing out" numbers that are not primes by setting them to zero; and (4) once all the threads have finished, print out the remaining prime numbers along with several statistics about the module's performance.
Instead of pinning those threads to specific cores, in this lab you will allow them to run wherever Linux puts them. As in your previous lab assignment, the threads will first wait at a synchronization barrier until all other threads have been successfully spawned and have arrived at the barrier. Once all threads have arrived at the barrier for the first time, each thread will repeat the following operations: (1) safely select the next prime number from the array of integers, and (2) safely "cross out" all of its larger multiples in the array by setting their values to zero (thus marking them as non-prime). In doing so, these threads will implement a cooperative, concurrent (and possibly parallel) version of the Sieve of Eratosthenes algorithm.
Correct but inefficient processing: even the single-threaded version of this algorithm has some built-in inefficiencies, in that some non-prime elements in the array (e.g., 6, 10, 12, 14, etc.) may be "crossed out" repeatedly when their different prime factors are visited. The multi-threaded version has an inherent data race that similarly may allow a thread to perform futile processing of a non-prime element (e.g., threads may begin processing 2, 3, and 4 concurrently, even though in the single-threaded version 4 would already have been marked as non-prime and would not have been processed when it was reached). Since these inefficiencies impact only performance and not correctness of the algorithm, in this assignment you will record and analyze them instead of trying to fix them (except optionally for extra credit as noted below).
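To make the repeated "cross out" behavior concrete, here is a minimal user-space sketch of the single-threaded sieve that counts every cross-out, including redundant ones. It is only an illustration, not the required kernel module: it uses malloc() where a module would use kernel allocation, the function names are hypothetical, and the definition of "unnecessary" used here (total cross-outs minus the number of non-prime entries) is an assumption about how that statistic might be computed.

```c
#include <assert.h>
#include <stdlib.h>

/*
 * Single-threaded sieve over the integers 2..upper_bound.
 * Returns the total number of "cross out" operations, or -1 on invalid
 * input or allocation failure. If primes_out is non-NULL, it receives
 * the number of primes left in the array.
 */
long sieve_crossouts(unsigned long upper_bound, unsigned long *primes_out)
{
    unsigned long n, i, j, primes = 0;
    unsigned long *nums;
    long crossouts = 0;

    if (upper_bound < 2)
        return -1;
    n = upper_bound - 1;                 /* entries for 2..upper_bound */
    nums = malloc(n * sizeof(*nums));
    if (!nums)
        return -1;
    for (i = 0; i < n; i++)
        nums[i] = i + 2;                 /* position i holds the integer i + 2 */

    for (i = 0; i < n; i++) {
        unsigned long p = nums[i];

        if (p == 0)
            continue;                    /* already crossed out: not a prime */
        primes++;
        /* Cross out every larger multiple of p, even if it is already zero. */
        for (j = i + p; j < n; j += p) {
            assert(nums[j] == 0 || nums[j] % p == 0);
            nums[j] = 0;
            crossouts++;
        }
    }
    free(nums);
    if (primes_out)
        *primes_out = primes;
    return crossouts;
}

/* Cross-outs beyond one per non-prime entry are counted as unnecessary. */
long unnecessary_crossouts(unsigned long upper_bound)
{
    unsigned long primes;
    long total = sieve_crossouts(upper_bound, &primes);

    if (total < 0)
        return -1;
    return total - (long)((upper_bound - 1) - primes);
}
```

With the default upper bound of 10, this sketch performs 7 cross-outs for 5 non-prime entries (6 and 10 are each crossed out twice), so 2 of them are unnecessary under the assumed definition.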
Your kernel module should define two ulong module parameters, indicating (1) the number of threads to be used (must be 1 or greater), and (2) an upper bound on the range of primes to compute (must be 2 or greater). These parameters should be named num_threads and upper_bound, respectively. Declare and provide default values (of 1 and 10 respectively) for two static unsigned long variables to store the values for these module parameters, so that in the event that your module is loaded without parameters, the module will use a single thread to compute all primes in the range from 2 to 10 inclusive.
In addition to these parameters, your kernel module should have at least the following global variables that are shared by all of its threads: (1) a pointer to an array of counter variables in which each thread will keep track of how many times it has "crossed out" a non-prime number, (2) a pointer to the array of integers, within which primes will be computed, (3) an integer for the position of the current number (prime in the single-threaded version) that is being processed, and (4) an atomic variable that indicates whether or not the computation of prime numbers has finished in each thread. In addition, your kernel module may have additional global variables to implement correct barrier synchronization (e.g., a counter for how many threads still need to synchronize), protection of any critical sections (e.g., a lock), etc.
How you implement the barrier is up to you, but it must work correctly twice! One straightforward way to do this is to use a spin lock together with a counter for how many threads still need to arrive at the barrier and an atomic variable for the state of barrier synchronization (no barriers completed, one barrier completed, two barriers completed, etc.). If you take this approach, you must make sure the module initialization function and the barrier function set and reset these variables appropriately.
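One possible shape for such a reusable barrier is sketched below as a user-space model, under several assumptions: a C11 atomic_flag stands in for a kernel spin lock, the struct and function names are hypothetical, and waiting threads simply busy-wait. The key idea is that the last thread to arrive resets the counter and advances the "barriers completed" state, so the same code works for the second barrier.

```c
#include <assert.h>
#include <stdatomic.h>

struct barrier {
    atomic_flag lock;         /* stand-in for a kernel spin lock */
    unsigned int nthreads;    /* threads that take part in each round */
    unsigned int remaining;   /* threads that still need to arrive */
    atomic_uint completed;    /* number of barriers completed so far */
};

void barrier_init(struct barrier *b, unsigned int nthreads)
{
    atomic_flag_clear(&b->lock);
    b->nthreads = nthreads;
    b->remaining = nthreads;
    atomic_init(&b->completed, 0);
}

/*
 * Register one arrival. Returns the round the caller arrived in and sets
 * *last to 1 if the caller was the final thread to arrive in that round.
 */
unsigned int barrier_arrive(struct barrier *b, int *last)
{
    unsigned int round;

    while (atomic_flag_test_and_set(&b->lock))
        ;                                        /* spin until the lock is free */
    round = atomic_load(&b->completed);
    assert(b->remaining > 0);
    *last = (--b->remaining == 0);
    if (*last) {
        b->remaining = b->nthreads;              /* reset so the barrier works again */
        atomic_store(&b->completed, round + 1);  /* release the waiting threads */
    }
    atomic_flag_clear(&b->lock);
    return round;
}

/* Wait until every thread has arrived in the caller's round. */
void barrier_wait(struct barrier *b)
{
    int last;
    unsigned int round = barrier_arrive(b, &last);

    while (atomic_load(&b->completed) == round)
        ;                                        /* busy-wait for the last arrival */
}

/* Single-threaded walk-through: simulate nthreads (>= 1) arrivals, twice. */
unsigned int barrier_demo(unsigned int nthreads)
{
    struct barrier b;
    unsigned int r, t;
    int last = 0;

    barrier_init(&b, nthreads);
    for (r = 0; r < 2; r++) {
        for (t = 0; t + 1 < nthreads; t++) {
            unsigned int round = barrier_arrive(&b, &last);

            assert(round == r && !last);
        }
        barrier_wait(&b);                        /* final arrival returns at once */
        assert(atomic_load(&b.completed) == r + 1);
    }
    return atomic_load(&b.completed);
}
```

Because each waiter compares against the round number it observed on arrival, a thread that races ahead to the second barrier cannot be confused with a straggler still leaving the first one.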
The prime computation function should repeatedly (1) safely (as a critical section protected by a mutex or spin lock) store the value of the current position variable in a local variable, and then advance the current position variable until it either reaches another non-zero value in the array or exceeds the last position in the array; (2) if the position in the local variable is beyond the last position in the array, simply return; otherwise (3) safely (as a critical section protected by a mutex or spin lock) go to each number in the array that is a larger multiple of the (prime) number at the position stored in the local variable, set each of those larger multiples to zero, and increment the thread's counter each time it does that. Note that since each thread has its own independent counter, there should be no need to protect the counter with locks, etc.
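The three steps above can be sketched in user space as follows. This is a model under stated assumptions rather than module code: a C11 atomic_flag spin loop stands in for the kernel mutex or spin lock, the upper bound is fixed at 30, and all names are hypothetical. One design choice worth noting is that the sketch derives the number to sieve from the claimed position (position plus 2) instead of re-reading the array, since another thread may have zeroed that element in the meantime and a step of zero would never terminate.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define LOCKED_UPPER_BOUND 30
#define LOCKED_COUNT (LOCKED_UPPER_BOUND - 1)    /* entries for 2..30 */

static unsigned long locked_numbers[LOCKED_COUNT];
static size_t locked_pos;                        /* shared current position */
static atomic_flag locked_lock = ATOMIC_FLAG_INIT;

static void locked_acquire(void)
{
    while (atomic_flag_test_and_set(&locked_lock))
        ;                                        /* spin until the lock is free */
}

static void locked_release(void)
{
    atomic_flag_clear(&locked_lock);
}

void locked_reset(void)
{
    size_t i;

    for (i = 0; i < LOCKED_COUNT; i++)
        locked_numbers[i] = i + 2;
    locked_pos = 0;                              /* position of the number 2 */
}

/* Thread body: counter points at this thread's own cross-out counter. */
void locked_worker(unsigned int *counter)
{
    assert(counter != NULL);
    for (;;) {
        size_t pos, p, j;

        /* (1) Claim the current position, then advance the shared one. */
        locked_acquire();
        pos = locked_pos;
        if (pos < LOCKED_COUNT) {
            do {
                locked_pos++;
            } while (locked_pos < LOCKED_COUNT &&
                     locked_numbers[locked_pos] == 0);
        }
        locked_release();

        /* (2) Nothing left to claim: this thread is finished. */
        if (pos >= LOCKED_COUNT)
            return;

        /* (3) Cross out every larger multiple of the claimed number. */
        p = pos + 2;
        locked_acquire();
        for (j = pos + p; j < LOCKED_COUNT; j += p) {
            locked_numbers[j] = 0;
            (*counter)++;                        /* per-thread, so no lock needed */
        }
        locked_release();
    }
}

/* Run the worker once on the calling thread and return its counter. */
unsigned int locked_run_single(void)
{
    unsigned int counter = 0;

    locked_reset();
    locked_worker(&counter);
    return counter;
}

/* Count the entries that are still non-zero, i.e. the primes. */
size_t locked_count_primes(void)
{
    size_t i, primes = 0;

    for (i = 0; i < LOCKED_COUNT; i++)
        if (locked_numbers[i] != 0)
            primes++;
    return primes;
}
```

Run by a single thread, this model leaves the 10 primes up to 30 in the array after 33 cross-out operations.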
The module's init() function should initialize all of its variables (besides the module parameters) so that if either module parameter has an invalid value (threads less than 1, or an upper bound less than 2) the module can safely do nothing, and the module's exit() function will not cause a memory leak, access violation, or other hazard. Specifically, the pointer to the array of integers should be initialized to 0, as should the variable that stores the size of the array, and the atomic variable that indicates whether or not processing has completed should be set to indicate that it has completed. Then, if either module parameter is invalid, the init() function should print an error message to the system log that shows the invalid parameter values and says what's wrong with either (or both) of them, and then the init() function should simply return.
Otherwise (if both module parameters are valid), the init() function should then allocate kernel memory, using GFP_KERNEL, for an array of integers large enough to hold all integers from 2 up to (and including) the number in the upper bound parameter. If that fails, the init() function should print an appropriate message to the system log, make sure the pointer to the array of integers is 0, and then simply return.
Otherwise (if the previous allocation succeeded), the init() function should then allocate additional kernel memory, again using GFP_KERNEL, for an array of individual unsigned integers that will be used as counters, with which each kernel thread will keep track of how many "cross out" operations it performs. The size of that array should be the number of threads, given in the num_threads parameter. If that fails, the init() function should print an appropriate message to the system log and then (before returning) make sure the pointer to the array of counters is 0, and then either defer to the exit() function to do the following, or itself perform the following: deallocate the memory for the (previously successfully allocated) integer array using kfree(), and make sure the pointer to the array of integers is also 0.
Otherwise (if both allocations succeeded), the init() function should iterate through the array of per-thread counter variables, setting each one to 0, then iterate through the integer array, setting each successive location to the next integer from 2 up to (and including) the upper bound, and then set the variable for the current position to the position that contains the number 2. The init() function should then set the atomic variable that indicates whether or not processing has completed, to indicate that processing has not completed.
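The allocation and initialization sequence described above, including the cleanup path when the second allocation fails, might look roughly like the following user-space model. It is a sketch under assumptions: malloc() and free() stand in for kernel allocation with GFP_KERNEL and for kfree(), the allocator is passed in as a function pointer only so that a failure can be simulated, and every name here is hypothetical. In this variant the setup function itself releases the first array on failure, which is one of the two designs the text allows.

```c
#include <assert.h>
#include <stdlib.h>

struct sieve_state {
    unsigned long *numbers;    /* the integers 2..upper_bound */
    unsigned int *counters;    /* one cross-out counter per thread */
    size_t count;              /* number of entries in numbers */
    size_t current_pos;        /* position of the number being processed */
};

typedef void *(*alloc_fn)(size_t size);

int sieve_setup(struct sieve_state *s, unsigned long num_threads,
                unsigned long upper_bound, alloc_fn alloc)
{
    size_t i;

    assert(s != NULL && alloc != NULL);
    /* Safe defaults first, so a later teardown is always harmless. */
    s->numbers = NULL;
    s->counters = NULL;
    s->count = 0;
    s->current_pos = 0;

    if (num_threads < 1 || upper_bound < 2)
        return -1;

    s->numbers = alloc((upper_bound - 1) * sizeof(*s->numbers));
    if (!s->numbers)
        return -1;

    s->counters = alloc(num_threads * sizeof(*s->counters));
    if (!s->counters) {
        free(s->numbers);      /* undo the first allocation */
        s->numbers = NULL;
        return -1;
    }

    s->count = upper_bound - 1;
    for (i = 0; i < num_threads; i++)
        s->counters[i] = 0;
    for (i = 0; i < s->count; i++)
        s->numbers[i] = i + 2;
    s->current_pos = 0;        /* position that contains the number 2 */
    return 0;
}

void sieve_teardown(struct sieve_state *s)
{
    free(s->counters);         /* free(NULL) is a no-op, as is kfree(NULL) */
    free(s->numbers);
    s->counters = NULL;
    s->numbers = NULL;
    s->count = 0;
}

/* Test allocator: succeeds allocs_left times, then returns NULL. */
static int allocs_left;

static void *flaky_alloc(size_t size)
{
    if (allocs_left-- <= 0)
        return NULL;
    return malloc(size);
}

/* Returns 1 if setup fails cleanly when allocation number fail_at fails. */
int setup_fails_cleanly(int fail_at)
{
    struct sieve_state s;

    allocs_left = fail_at;
    if (sieve_setup(&s, 4, 10, flaky_alloc) == 0) {
        sieve_teardown(&s);
        return 0;
    }
    return s.numbers == NULL && s.counters == NULL && s.count == 0;
}

/* Returns 1 if a normal setup produces the expected contents. */
int setup_succeeds(void)
{
    struct sieve_state s;
    int ok;

    if (sieve_setup(&s, 4, 10, malloc) != 0)
        return 0;
    ok = s.count == 9 && s.numbers[0] == 2 && s.numbers[8] == 10 &&
         s.counters[3] == 0 && s.current_pos == 0;
    sieve_teardown(&s);
    return ok;
}
```

Setting both pointers to NULL before any validation means the teardown path never has to know how far setup got.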
The init() function should then spawn as many threads as are indicated by the num_threads module parameter, passing a pointer to a different element of the array of counters (for tracking how many "cross out" operations each thread performs) to each thread (as an argument to that thread's thread function).
The module should also record time stamps in global variables at three points: (1) when the module begins initialization (the module's init() function should record this time stamp value as the first thing it does, before doing anything else); (2) when all threads have reached the first barrier and are about to begin their work (the last thread to reach the first barrier should record this time stamp value before it or any other thread continues); and (3) when all threads have reached the second barrier after completing their work (the last thread to reach the second barrier should record this time stamp value before it or any other thread continues). Hint: initializing the global time stamp variables to zero may help to simplify your barrier code since at each barrier a different time stamp variable will need to be recorded.
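The hint about zero-initialized time stamps can be illustrated with a small user-space model. Everything in it is an assumption made for illustration: the names are hypothetical, time is passed in as a nanosecond count rather than read from a kernel time source such as ktime_get(), and the clock readings in the walk-through are made up. The point is that the barrier code can pick which time stamp to fill simply by checking which one is still zero.

```c
#include <assert.h>
#include <stdint.h>

struct timestamps {
    uint64_t module_start;     /* taken first thing during initialization */
    uint64_t first_barrier;    /* all threads are ready to start work */
    uint64_t second_barrier;   /* all threads have finished their work */
};

/*
 * Called by the last thread to arrive at a barrier, before any thread
 * continues. Zero means "not recorded yet", so the first call fills the
 * first-barrier time stamp and the second call fills the other one.
 */
void record_barrier_time(struct timestamps *ts, uint64_t now_ns)
{
    assert(now_ns != 0);       /* zero is reserved for "unset" */
    if (ts->first_barrier == 0)
        ts->first_barrier = now_ns;
    else if (ts->second_barrier == 0)
        ts->second_barrier = now_ns;
}

/* Time from the start of initialization until work begins. */
uint64_t init_time_ns(const struct timestamps *ts)
{
    return ts->first_barrier - ts->module_start;
}

/* Time spent computing primes, between the two barriers. */
uint64_t compute_time_ns(const struct timestamps *ts)
{
    return ts->second_barrier - ts->first_barrier;
}

/* Walk-through with made-up clock readings of 1000, 1500, and 4500 ns. */
uint64_t timing_demo(int want_compute)
{
    struct timestamps ts = { 0, 0, 0 };

    ts.module_start = 1000;
    record_barrier_time(&ts, 1500);   /* last arrival at the first barrier */
    record_barrier_time(&ts, 4500);   /* last arrival at the second barrier */
    return want_compute ? compute_time_ns(&ts) : init_time_ns(&ts);
}
```

The two differences correspond to the initialization time and the prime computation time that the report's graphs add together as completion time.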
The module's exit() function should examine the atomic variable that indicates whether or not processing has completed, and if it has not completed should print an error message to the system log and then simply return without doing anything (presumably it's better to have a potential memory leak than to have an access violation or other more major issue). Similarly, if a memory allocation failure occurred during the module's init() function, then depending on your design the module's exit() function may need to complete any remaining deallocation responsibilities that have not been handled by the init() function.
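Otherwise (if initialization was successful and processing has completed) the exit() function should print out the remaining prime numbers along with the statistics about the module's performance. After that, the exit() function should deallocate the memory for both arrays (of integers and counters, respectively) using kfree() twice, and then return.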
Then, make a copy of your module and in that copy: (1) change the type of element in the array of numbers to be sieved, to be atomic_t (if a different type was used previously to store integers), (2) remove the locks from the critical sections of the prime computation function, and (3) re-implement the prime computation function to use atomic operations such as atomic_read(), etc. to (safely) perform all reading and writing of elements in the array.
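A lock-free variant along these lines is sketched below in user-space C11, where atomic_int with atomic_load() and atomic_store() stands in for the kernel's atomic_t with atomic_read() and atomic_set(). The names and the fixed upper bound of 30 are hypothetical, and the way a position is claimed here (an atomic fetch-and-add on a shared index, skipping entries that are already zero) is just one possible design, not something the assignment prescribes.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define ATOMIC_UPPER_BOUND 30
#define ATOMIC_COUNT (ATOMIC_UPPER_BOUND - 1)    /* entries for 2..30 */

static atomic_int atomic_numbers[ATOMIC_COUNT];
static atomic_size_t atomic_next_pos;            /* next position to claim */

void atomic_sieve_reset(void)
{
    size_t i;

    for (i = 0; i < ATOMIC_COUNT; i++)
        atomic_store(&atomic_numbers[i], (int)i + 2);
    atomic_store(&atomic_next_pos, 0);
}

/* Thread body: counter points at this thread's own cross-out counter. */
void atomic_sieve_worker(unsigned int *counter)
{
    assert(counter != NULL);
    for (;;) {
        /* Claim a distinct position without taking any lock. */
        size_t pos = atomic_fetch_add(&atomic_next_pos, 1);
        size_t p, j;

        if (pos >= ATOMIC_COUNT)
            return;                              /* nothing left to claim */
        if (atomic_load(&atomic_numbers[pos]) == 0)
            continue;                            /* already crossed out */

        p = pos + 2;                             /* the number at this position */
        for (j = pos + p; j < ATOMIC_COUNT; j += p) {
            atomic_store(&atomic_numbers[j], 0);
            (*counter)++;                        /* per-thread, so not atomic */
        }
    }
}

/* Run the worker once on the calling thread and return its counter. */
unsigned int atomic_run_single(void)
{
    unsigned int counter = 0;

    atomic_sieve_reset();
    atomic_sieve_worker(&counter);
    return counter;
}

/* Count the entries that are still non-zero, i.e. the primes. */
size_t atomic_count_primes(void)
{
    size_t i, primes = 0;

    for (i = 0; i < ATOMIC_COUNT; i++)
        if (atomic_load(&atomic_numbers[i]) != 0)
            primes++;
    return primes;
}
```

With several threads, a position may be claimed before a smaller prime's thread has crossed it out, so some futile cross-outs can still happen; as noted earlier, that affects only the counts and not the final set of primes.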
Also compile and test your new module that uses atomic operations, and when you are satisfied that it runs correctly, add a section titled "Module Design and Implementation" to your report, and in it describe how you implemented the features of this assignment, including locking versus atomic operations in the first and second modules.
Add a section titled "Module Performance" to your report, and in it summarize what trials you ran, and what trends you noticed in (1) the timing of the initialization and prime computation sections for each module and (2) the number of unnecessary "cross out" operations performed, for different numbers of threads and different upper bounds, in each module.
For each of the modules, please graph (e.g., using Excel or gnuplot) the timing results as follows: completion time (initialization time plus prime computation time) on the y-axis vs. the upper bound on the range of primes on the x-axis, with a separate curve ("data series" in Excel) for each different number of threads used - leaving out curves that are essentially identical.
For each of the modules, please graph (e.g., using Excel or gnuplot) the efficiency results as follows: number of unnecessary "cross out" operations on the y-axis vs. the upper bound on the range of primes on the x-axis, with a separate curve ("data series" in Excel) for each different number of threads used - leaving out curves that are essentially identical. In your report please at least give the names of the graph files you are turning in with those plots, and if you would like to use a non-text format please also feel free to include pictures of the plots with your discussion. At the end of that section please compare and contrast the trends you saw with the two different modules, and discuss briefly why any trends differed and why any other trends were similar between the modules.
What to turn in: (1) source code files for your two modules (with locking versus with atomic operations), (2) a report file with the contents described next, (3) the graph files for your plots as noted above, and (4) any other files (e.g., screen-shots or traces) that may enhance your report.
Your report file should include:
Implement one of the following options (worth up to 5 percent of the value of the lab assignment):
Compile your additional modules, and on your Raspberry Pi run representative trials that show how well those new versions perform compared to the original versions, in terms of initialization and processing times as well as in terms of how many unnecessary "cross out" operations were performed, for different numbers of threads and different upper bounds on the range of primes to compute.
Add a section titled "Extra Credit" to your report, and in that section please document (1) how you designed and implemented this feature, (2) output results (and/or text or figures summarizing them), and (3) a brief analysis of what those results say about the strengths and weaknesses of the original versus new versions. Please also submit the additional source code files for the new modules with your report and other items noted above.
Compile your additional modules, and on your Raspberry Pi run representative trials that illustrate how well this new version performs compared to both of the original versions (with a simple locking strategy versus with atomic operations), in terms of initialization and processing times for different numbers of threads, different upper bounds on the range of primes to compute, and different numbers of locks.
Add a section titled "Extra Credit" to your report, and in that section please document (1) how you designed and implemented this feature, (2) output results (and/or text or figures summarizing them), and (3) a brief analysis of what those results say about the strengths and weaknesses of the original versions versus this new one and about how the choice of the number of locks may affect performance. Please also submit the additional source code file for the new module with your report and other items noted above.
Posted 8:40am and updated at 10:00am and 12:30pm on Monday November 20th 2017 by Chris Gill.
Changes since original posting:
Either the init() function or the exit() function is responsible for cleaning up the memory for the first allocation if the second allocation fails.