CS 102 (Spring 2003)
Lab 2a: KWIC Index (Part a)

Authors: David Jurgens and Ron K. Cytron
Thanks to David Warner and ACM Student Chapter for the fortunes used in this lab.
Lab Assigned  Design Due (In class) 10 AM  (In Lab)  (In Lab)  Lab Due (In lab)
21 Jan  22 Jan  29 Jan  29 Jan


By the end of this lab, you should...

Before starting:

Helpful resources:


In CS101 you studied the Relation and Set ADTs. Relation implements a function or a map, and a Set is a duplicate-free collection of objects. In this lab you will use both of those ADTs but in the form of implementations already available in java.util.
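As a quick refresher, the sketch below shows the java.util counterparts in action: a Set silently drops duplicates, and a Map plays the role of a Relation by associating each key with a value. The class name AdtRefresher and its two methods are made up for illustration; they are not among the classes this lab asks you to write.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical refresher class; not part of the kwic package.
class AdtRefresher {

    // A Set keeps one copy of each word, no matter how often it appears.
    static Set<String> distinctWords(String text) {
        Set<String> words = new HashSet<>();
        for (String w : text.trim().split("\\s+")) {
            if (!w.isEmpty()) {
                words.add(w);
            }
        }
        return words;
    }

    // A Map acts as a function from each word to its number of occurrences.
    static Map<String, Integer> wordCounts(String text) {
        Map<String, Integer> counts = new HashMap<>();
        for (String w : text.trim().split("\\s+")) {
            if (!w.isEmpty()) {
                counts.merge(w, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        String text = "well that ends well";
        System.out.println(distinctWords(text).size());   // number of distinct words
        System.out.println(wordCounts(text).get("well")); // occurrences of "well"
    }
}
```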

Specifically, you will implement a KWIC (KeyWord In Context) object and place it and its supporting objects in a package called kwic.

Suppose you wanted to search a document for all phrases that include a given word, say "swordfish". There are two approaches that could be used in this endeavor:

  1. Once "swordfish" is supplied, a computer could search the document and return each phrase that contains that word. Every time a word is supplied, the entire document is scanned to find matching phrases.
  2. The document could be preprocessed offline, in anticipation of the need for search. When "swordfish" is supplied, the result is already computed and simply returned.
Which approach is best? In essence, the choice depends on the frequency of insertions and deletions to the document as compared with the frequency of lookups for its words.
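The two approaches can be contrasted in a short sketch. It is only an illustration under simplifying assumptions: the class SearchApproaches and all of its methods are hypothetical names, and words are found by lowercasing a phrase and splitting on anything that is not a letter, which may differ from the rules your kwic classes are expected to follow. The sample phrases are the ones used in the example later in this lab.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration; not part of the kwic package.
class SearchApproaches {

    // Assumed tokenization: lowercase, then split on runs of non-letters.
    static List<String> words(String phrase) {
        List<String> out = new ArrayList<>();
        for (String w : phrase.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (!w.isEmpty()) {
                out.add(w);
            }
        }
        return out;
    }

    // Approach 1: scan the whole document every time a word is supplied.
    static Set<String> scan(List<String> document, String word) {
        Set<String> hits = new LinkedHashSet<>();
        String target = word.toLowerCase(Locale.ROOT);
        for (String phrase : document) {
            if (words(phrase).contains(target)) {
                hits.add(phrase);
            }
        }
        return hits;
    }

    // Approach 2: preprocess once, building a map from word to phrases.
    static Map<String, Set<String>> buildIndex(List<String> document) {
        Map<String, Set<String>> index = new HashMap<>();
        for (String phrase : document) {
            for (String w : words(phrase)) {
                index.computeIfAbsent(w, k -> new LinkedHashSet<>()).add(phrase);
            }
        }
        return index;
    }

    // With the index built, a lookup no longer touches the document.
    static Set<String> lookup(Map<String, Set<String>> index, String word) {
        return index.getOrDefault(word.toLowerCase(Locale.ROOT),
                                  Collections.<String>emptySet());
    }

    public static void main(String[] args) {
        List<String> document = Arrays.asList(
            "Swordfish goes well with pasta; the pasta should not be overcooked.",
            "The password for entry to the castle is: \"swordfish\".",
            "All's well that ends well.");
        Map<String, Set<String>> index = buildIndex(document);
        // Both approaches should report the same phrases.
        System.out.println(scan(document, "swordfish"));
        System.out.println(lookup(index, "swordfish"));
    }
}
```

The scan repeats its work on every query, while the index pays its cost once up front and must be updated whenever phrases are inserted or deleted, which is the tradeoff described above.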

We shall assume that offline preprocessing pays off, and that it would be expensive to search the document each time a word is supplied. As an analogy, consider a search using Google. Imagine how slow it would be for Google to search the entire WWW each time you ask it to find a word. (Note: It takes about a month for Google to crawl the web currently!)

As an example for this lab, consider the following phrases:

The following table shows how phrases should be returned for words that might be supplied for KWIC:
Word        Set of Phrases
swordfish
  • Swordfish goes well with pasta; the pasta should not be overcooked.
  • The password for entry to the castle is: "swordfish".
well
  • Swordfish goes well with pasta; the pasta should not be overcooked.
  • All's well that ends well.
Notice that case and punctuation do not matter in matches, but that the returned phrases are exactly as they were entered.
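One way to get this behavior is to compare words in a canonical form while storing each phrase untouched. The sketch below is an assumption about how that might look, not the behavior of the supplied WordCanonical.java or of DefaultWordFilter: the class CanonicalSketch and its methods are hypothetical, and the rule used here (lowercase, keep only letters and digits) is just one plausible choice.

```java
import java.util.Locale;

// Hypothetical illustration; the lab's own classes may canonicalize differently.
class CanonicalSketch {

    // Assumed rule: lowercase the word and drop every non-alphanumeric character.
    static String canonical(String raw) {
        StringBuilder sb = new StringBuilder();
        for (char c : raw.toLowerCase(Locale.ROOT).toCharArray()) {
            if (Character.isLetterOrDigit(c)) {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    // True if any whitespace-separated token of the phrase matches the word
    // once both are reduced to canonical form. The phrase is never modified.
    static boolean phraseContains(String phrase, String word) {
        String target = canonical(word);
        if (target.isEmpty()) {
            return false;
        }
        for (String token : phrase.split("\\s+")) {
            if (canonical(token).equals(target)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String phrase = "The password for entry to the castle is: \"swordfish\".";
        // The quoted, punctuated token still matches a capitalized query.
        System.out.println(phraseContains(phrase, "Swordfish"));
        // The phrase itself is printed exactly as it was entered.
        System.out.println(phrase);
    }
}
```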

Suggested implementation:

  1. Make a directory for your Lab 2a stuff.
  2. Save the Demo.java file there.
  3. Make a kwic directory into which your classes will go.
  4. Save the WordCanonical.java file there.
  5. Type in stubs for the other classes found in the documentation.
  6. Complete and test the DefaultWordFilter class. You can just return the input string for now.
  7. Complete and test the Word class.
  8. Complete and test the Phrase class.
  9. Complete and test the KWIC class.

What To Turn In:

For every CS102 lab you turn in, you should fill in a cover sheet and staple it on the front of your lab. Attach a paper printout of the following:

  1. All classes written for this lab.
  2. Output from the class test Demo2a.java.
This must be turned in by the end of your lab section on the due date. Check that you have header information (name, email, date, and lab section) at the top of the file. Your printout will be graded only if you have demonstrated your lab.

If you need help printing, ask a TA or refer to the help homepage, which has detailed instructions for how to print from the labs.

Last modified 14:57:02 CST 29 January 2003 by Ron K. Cytron