CSE 247 Module 9: Hash Tables

Lab

Abstract: We have seen hashing as a technique for obtaining expected constant-time operations on finding or inserting elements in a set. Hashing is also used to compute digital signatures for confirming the authenticity of messages.

In this assignment, you will apply a hashing scheme in pursuit of finding occurences of one string in another, using the Rabin–Karp algorithm.

That algorithm is discussed in Section 32.2 of our text, on page 990.

Introduction

Suppose we have two strings of interest:
target
is a relatively long string of length n.
query
is a relatively short string of length m.
We are interested in finding all occurrences of query in target. For example, consider the query aaba and the target aabaacaadaabaabaa. The last character of the query occurs within the target as shown below:
```a a b a a c a a d a a b a a b a a
|                 |     |
a a b a                  |     |
a a b a      |
a a b a
```
Notice that the second and third matches overlap. As discused in class, the most straightforward algorithm is as follows:
```   for each i in the interval [0, n)
match = true
for each j in the interval [0, m)
if target[i+j] ≠ query[j]
match = false
end if
end for
if match
// definite match of query starting at index i of target
end if
end for
```
The worst-case complexity of this approach is Θ(mn).

Rabin–Karp Algorithm

Instead of trying to match the query at each location in the target, the Rabin–Karp Algorithm computes a rolling hash of a window of m characters of the target. Near the beginning of the target, where there may not be m characters in the window, the value 0 is used for those entries outside of the target.

We will use in particular the following hash function for a window of size m in a character array, with the first character of the window at index start:

```private static int hashFor(char[] chars, int start, int m) {
int ans = 0;
for (int i=start; i < start+m; ++i) {
int val = 0;
//
// Within the bounds of the chars array?
//
if (i ≥ 0 && i < chars.length)
val = chars[i];                        // then use the character value there
ans = mod(
mod(31*ans, 511) + mod(val, 511),
511
);
}
return ans;
}
```
The above function is similar to how Java computes .hashCode() for a String, except that we confine the results to being strictly less than 511 using mod (the % operator). The following identities related to mod should be useful in this assignment:
• (a + b) mod p = ( (a mod p) + (b mod p) ) mod p
• (a × b) mod p = ( (a mod p) × (b mod p) ) mod p
You should investigate how large the intermediate results of addition and multiplication can get before the final mod p is applied.

Applied to our running example, the rolling hash values at the end of each window within our target string are as follows:
 a a b a a c a a d a a b a a b a a 97 38 254 306 214 430 337 153 73 368 92 224 306 214 429 306 214

• Note that at the end of each aaba substring, the hash value 306 is produced.
• Thus, each time the rolling hash produces 306, we probably have a match of aaba in the target.
• No other window produces the 306 hash value, so we were lucky in this example: the hash value 306 produced matches only for our query.
The Rabin–Karp (RK) algorithm is essentially as follows:
• Compute the RK hash for the query string of interest.
In our running example, that hash value is 306, which can be computed by invoking hashFor("aaba".toCharArray(), 0, 4)
• For each index i of the target string, compute the rolling hash of the target that begins at index i-m+1 and has length m.
• In the example, the value for the first a is computed at indices -3, -2, -1, and 0, for a window of length 4 that ends at the first index of the target. The numerical values for characters outside the window are 0, and the numerical value for a is its ASCII encoding of 97.
• Thus, the window for a (the first character in our target string) has hash value 97.
• Moving one character to the right encounters another a. Its hash value is at indices -2, -1, 0, and 1, which produces 31×97+97, which is 3104, but taken mod 511 is 38.
• Thus the window for a a (the first two characters in our target string) has value 38.
• However, if we compute the rolling hash by calling the hashFor method above, the complexity is still Θ(mn).
• Your task is to compute that rolling hash incrementally, as described below.

• Update your repository as usual.
• In the rabinkarp package of the labs source folder, find and open the RK class.
• The constructor accepts the value of m, which is the length of the query string in this assignment.
• The method int nextCh(char c) accepts a character supplied from outside, and returns the rolling KR hash of it and the previous m-1 characters.
You must complete the constructor and the nextCh method, subject to the constraints stated below.

In terms of our running example, the following calls after instantiation of a KR(4) object would return values as follows:

• nextCh('a') → 97
• nextCh('a') → 38
• nextCh('b') → 254
• nextCh('a')306
• nextCh('a') → 214
• nextCh('c') → 430

To receive full credit for this lab, your solution must:

• Produce the correct hash results using an incremental approach, taking constant (Θ(1)) time. This means your approach must take less than Ω(m) time.
• Suppose h is the hash value of a window of m characters.
• Suppose character c is about to leave that window, and character d is the next to be incorporated into the window.
• We can compute the new value of h using: h = (h×31 – 31m×c + d) mod 511
• Ensure that the above computation does not cause overflow, taking care to keep the numbers mod 511 wherever possible.
• Use space proportional only to m, which is the size of the sliding window.
• Implement a circular buffer as described in class, of size m, to keep track of the last m characters seen.
• Pass the unit tests.

Having trouble?

Post problems on piazza, but do not post code publicly there!

How to submit your work

• Make sure all of your work is commited in your repository.
Note that we will run your solution against the JUnit test as it was given to you. If you modify that test in some way, the results you see may not correspond with the testing necessary to receive credit.

You can always revert any changes you make to the testing code back to what we gave you in the repository using Team…Revert, but be sure not to do that to your source code as it will wipe away work you may not have yet committed.

• Your work will be graded as of the due date for this assignment.

• Make sure your code passes the tests as we originally gave them to you. It's those tests we will run to see if your code functions properly.
• Make sure you eliminate or disable any print statements you used for debugging, as they may slow down your code when we test and cause you to fail.
• Make sure you have placed ticker.tick() calls appropriately in your solution. Failure to do so will cost you points on this assignment.
• Make sure your solution's ticks behave in a manner consistent with the asymptotic complexity you should see for your solution.
• You must commit and push all of your work to your repository. It's best to do this from the top-most level of your repository, which bears your name and student ID.
You can tell that your code was pushed by logging into bitbucket.org, clicking on your repository, and seeing the push in the associated log traffic.

Because of this, you have no excuse for failing to push your code. Check and make sure it was pushed so your work is counted.

Work that is not pushed will receive no credit.

Last modified 14:14:22 CST 01 December 2016