## CSE 131 Module 3: Arrays

### Pascal's Triangle

Take a look at the Wikipedia page on Pascal's Triangle. We will implement this using a two-dimensional array, so it may be easier to imagine the entries left-justified rather than in triangular form:
```        column
0  1  2  3  4
row
0       1
1       1  1
2       1  2  1
3       1  3  3  1
4       1  4  6  4  1
.
.
.
```
Viewed as a two-dimensional array, the computation is specified as follows, for each entry at row r and column c:
• If c is 0, then the entry is 1.
• If c==r, then the entry is 1.
• If r<0 or c<0 or c>r, then the entry is 0 and is not shown.
• Everywhere else, the entry is computed as the sum of the entries at:
• r-1,c
• r-1,c-1

#### Procedure

1. Open the PascalsTriangle class, found in the arrays package of the extensions folder.
2. Insert code to obtain from the user the value N which is the number of rows you should compute of the triangle.
3. Instantiate the two-dimensional array needed to hold the results.
Java lets you do this as a ragged array, where each row can be of a different length.

You are welcome to instantiate the array that way, but it is simpler and equally acceptable to instantiate the array with the same number of columns in each row. Java syntax for that is much simpler. An example of this appears in SelfAvoidingWalk in your repositories.

4. Compute the triangle as a two-dimensional array and print the results left-justified as shown above.
5. For extra fun, try to print the triangle centered as shown on the Wikipedia page.
When you done with this extension, you must be cleared by the TA to receive credit.
• Fill in the form below with the relevant information
• Have a TA check your work
• The TA should check your work and then fill in his or her name
• Click OK while the TA watches

If you worked in a team using a group– repository:
• Your work must be committed in that repository for this demo.
• Each team member must complete this form and have the TA authenticate.
For example, if there are two of you, then the form must be submitted twice, so that each person's last name and 6 digit ID is captured with the repository name.
• The repository name must be included on each of the submissions.

group– Enter the name of your group repository here
 Last name 1

### Birthday Problem

Some number of people, say N, wander into a room. We are interested in computing the fraction of those people who are:
• are born in the same month
• are born on the same day, in the same or in different months
• are born on the exact same day (the same day in the same month)
Before proceeding, you might think a bit about the above statistics and hypothesize what values you will see in your simulation.

#### Procedure

To make things simple, let's assume that all months have 31 days.
1. Open the Birthday class, found in the arrays package of the extensions folder.
2. Insert code to obtain from the user the number of people N that will enter the room.
3. For each person, randomly generate a month and day on which that person was born.
4. Using a two-dimensional array that is indexed by month and by day, keep track of the number of people born on that month and day.
For example, perhaps you randomly decide the person was born in month 4 and day 28.
5. After processing all of the N people, iterate over your array to compute:
• For each month, the fraction (or percentage) of people born in the that month.
• The average of the 12 values you computed for the above.
• For each day, the fraction (or percentage) of people born on that day, whether in the same or in a different month.
• The average of the 31 values you computed for the above.
• The fraction (or percentage) of people born on exactly the same day.
Be careful! A 1 entry means just one person was born on that particular day. You are looking for the fraction of people who share a birthday with at least one other person.
When you done with this extension, you must be cleared by the TA to receive credit.
• Fill in the form below with the relevant information
• Have a TA check your work
• The TA should check your work and then fill in his or her name
• Click OK while the TA watches

If you worked in a team using a group– repository:
• Your work must be committed in that repository for this demo.
• Each team member must complete this form and have the TA authenticate.
For example, if there are two of you, then the form must be submitted twice, so that each person's last name and 6 digit ID is captured with the repository name.
• The repository name must be included on each of the submissions.

group– Enter the name of your group repository here
 Last name 1

### Biology Application—Computing the GC-Content of a Sequence

Computer science plays a large role in modern biology. BioJava is a framework for working with biological data. It provides a framework for using sequences and performing common biology functions. In this extension you will be reading in a DNA sequence and computing its GC-Content.

#### Notes

• You can find the Java files for this extension in the extensions source folder.
Locate the following files in the biojavagc package:
• GCContent.java is the file you complete. When you run it, analysis is performed on four genomes and the results are printed as output.
• GCContentTest.java is a unit test for your work. You will receive credit for this extension only if this test passes.
• You can do this extension by following the instructions below. However, you might want to take some time to familiarize yourself with the BioJava website. Of special importance is the BioJava Cookbook. The cookbook has information and examples for performing common tasks. You may also find the BioJava API helpful.

• You will be working with some DNA sequences, represented in the FASTA format, which you can find in the genomes folder of your workspace:
• NC_002677.fasta
• mysterya.fasta
• mysteryb.fasta
• mysteryc.fasta
As you might guess from their names, there is an air of mystery surrounding some of these genomes.

Consider a bacterium and a phage that might be hosted by that bacterium. It turns out that the DNA of the host and of the phage often need to be similar in terms of their GC content for the bacterium to play host to the phage:

Most bacteriophage and other bacteria are lower in GC content than Salmonella and its relatives, so invading DNA is an obvious target for H-NS. "It's like a primitive immune system," says Fang. "Reduce their expression, and the foreign genes can be tolerated."
From http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2064161/
In other words, if the GC content in the bacterium and phage is dissimilar, the bacterium may be immune to infection by the phage.

#### Procedure

1. Complete the method percentGC in GCContent.
Note: BioJava already has a function to get the GC-count (number of G and C nucleotides) of a sequence.

Do not use this function.

Instead, devise a way to iterate through the array of characters (representing nucleotides) yourself.

2. Run the unit test GCContentTest and make sure it passes.
3. Run GCContent as a Java Application and be prepared to answer questions by the TA at the demo.
When you done with this extension, you must be cleared by the TA to receive credit.
• Fill in the form below with the relevant information
• Have a TA check your work
• The TA should check your work and then fill in his or her name
• Click OK while the TA watches

If you worked in a team using a group– repository:
• Your work must be committed in that repository for this demo.
• Each team member must complete this form and have the TA authenticate.
For example, if there are two of you, then the form must be submitted twice, so that each person's last name and 6 digit ID is captured with the repository name.
• The repository name must be included on each of the submissions.

group– Enter the name of your group repository here
 Last name 1

### Biology Application - Finding the reading frame of a sequence

Sequences of nucleotides can be read as sequences of nucleotide triplets, known as codons. This means there are 3 different ways of reading a nucleotide sequence. One starts with the first nucleotide in the sequence, one starts with the second, and one starting with the third, as shown below:

Image from https://wikispaces.psu.edu/display/Biol230WFall09/Protein+Translation

• The sequences shown above are RNA. The corresponding DNA sequences would have a T everywhere you see a U.
• Reading frame 1 begins at the first nucleotide. Interpreting AGT using the genetic code, Serine would be generated as the first amino acid of the resulting protein, if this is the proper reading frame.
• However, if the reading frame should start one base later, then the first DNA triple is GTC, which encodes Valine.
• Finally, if the reading frame should start two bases later, then the first DNA triple is TCT, which encodes Serine. While this protein begins with the same amino acid as in reading frame 1, the next amino acid is different, as shown in the above figure.
• There is no other reading frame of interest. If we considered a reading frame that begins at the fourth nucleotide, that is the same as if the frame began at the first nucleotide.
How do we determine which reading frame is correct? Our approach is based on the following observations, which are somewhat simplified to suit our purposes. The interested student is encouraged to take Bio 2960 for a much more thorough and revealing treatment of this material.
• Protein translation is initiated only by a special sequence of 3 nucleotides called a start codon.
• Translation continues until a stop codon is encountered, which is also a specially designated sequence of 3 DNA nucleotides.
• If we scan random DNA, we expect each of the four bases (A, T, C, G) to occur with probability 1 in 4.
• If we look for two specific bases, one after the other, in random DNA, scanning pairs a time, we expect the two bases to occur with probability 1 in 16 (1 in 4 times 1 in 4).
• If we scan DNA looking for a specific triplet, moving 3 bases each time we look, we expect that given triplet to occur with probability 1 in 64 (1 in 4 times 1 in 4 times 1 in 4).

For example, we expect the start codon to occur with probability 1 in 64.

• Because there are three possible stop codons, the probability in random DNA that a given triplet is one of the three stop codons is 3 in 64, which is approximately 1 in 20.
• This means that if DNA is random, a protein sequence would be about 20 amino acids in length, once its start codon is found.
• It turns out that proteins are much longer than that, say at least 60 amino acids in length.
• If we start in the wrong reading frame, the DNA should appear to us as random, and we should obtain very short amino acid chains by translation.
Thus, the best reading frame for a sequence of DNA is the offset at which translation would produce the longest chain of amino acids.

Your task in this extension is to analyze a sequence of DNA to determine its best reading frame.

#### Notes

• You can find the Java files for this extension in the extensions source folder.
Locate the following files in the biofindframe package:
• FindTheFrame.java is the file you complete.
• FindTheFrameTest.java is a unit test for your work. As of this writing, this test is not yet available.
• You will be working with some DNA sequences, represented in the FASTA format, which you can find in the genomes folder of your workspace.

#### Procedure

1. The code given to you prompts the user for a genome DNA file, and then reads that file into a char array using methods provided by bioJava.

If you are uncertain about char arrays, this would be a good time to review the relevant material in your text and in the lecture notes.

2. Your work takes place in the method bestReadingFrame. Open that file now and take a look at what is provided.

There are comments in that file to direct your work, which consists of the steps described below.

You should read these instructions and the comments carefully before you begin. Solutions that fail to follow the instructions will receive no credit!
3. First, define 3 char arrays, one for each of the possible stop codons: ochre, amber, and opal.
4. Next, define a char array named methionine for the start codon.
The most common start codon is Methionine, which is encoded by the DNA sequence AUG. For the purposes of this extension, Methionine is the only start codon you will consider.

5. The rest of the code you write will attempt to read the DNA in each of the possible 3 reading frames. Find the best reading frame and return the index at which that best reading frame occurs. Thus, your method will return a value as follows:
• 0, if reading frame 1 is the right answer. That frame begins at index 0 of the dna array you are given.
• 1, if reading frame 2 is the right answer. That frame begins at index 1 of the dna array you are given.
• 2, if reading frame 3 is the right answer. That frame begins at index 2 of the dna array you are given.
Hint: Scan the DNA one base a time, looking for the longest coding sequence that begins at that base. A coding sequence begins with a start codon and ends at the next stop codon that is found by scanning triplets.

Based on the index at which the longest coding sequence occurs, compute the corresponding reading frame (0, 1, or 2). That is the value you want to return.

When you done with this extension, you must be cleared by the TA to receive credit.
• Fill in the form below with the relevant information
• Have a TA check your work
• The TA should check your work and then fill in his or her name
• Click OK while the TA watches

If you worked in a team using a group– repository:
• Your work must be committed in that repository for this demo.
• Each team member must complete this form and have the TA authenticate.
For example, if there are two of you, then the form must be submitted twice, so that each person's last name and 6 digit ID is captured with the repository name.
• The repository name must be included on each of the submissions.

group– Enter the name of your group repository here
 Last name 1