An Example: Text Classification

Each example is a text document.
The label is the type of document (e.g. articles I will find interesting).
We give an algorithm based on naive Bayes, which is very effective.

Two key decisions:
1.  how to convert a text document into attributes
2.  how to estimate the needed probabilities

Use a very simple way to represent the document:
Define an attribute for each word position.
The value of each attribute is the English word found in that position.
So the number of attributes varies from example to example.

Example: 1000 training documents that someone has classified: 700 classified as "dislikes" and 300 classified as "likes".
Suppose document 1 is "This is a very silly document"
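$$\begin{aligned}
V_{NB} &= \arg\max_{V_j \in \{\text{likes},\,\text{dislikes}\}} P(V_j) \prod_{i=1}^{6} P(a_i \mid V_j) \\
&= \arg\max_{V_j \in \{\text{likes},\,\text{dislikes}\}} P(V_j) \cdot P(a_1 = \text{this} \mid V_j) \cdot P(a_2 = \text{is} \mid V_j) \cdots P(a_6 = \text{document} \mid V_j)
\end{aligned}$$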
Note: the independence assumption, that the word in one position is independent of the words in the other positions
(given the label), clearly does not hold here.  Yet in practice it works quite well.

(HW3 paper option - Domingos and Pazzani, 1996 provide an interesting analysis of this phenomenon.)
Back to the example:
we need estimates for
P(Vj), P(ai=Wk|Vj)
P(Vj) is easy.  For example, P(likes) = .3 and P(dislikes) = .7

$$V_{NB} = \arg\max_{V_j \in V} P(V_j) \prod_{i=1}^{n} P(a_i \mid V_j)$$
Estimate P(Vj) as the fraction of all training documents that have label Vj.
ai is the ith word of the text and n is the number of word positions.
Estimating P(ai|Vj) is still problematic.  English has about 50,000 distinct words.  Suppose there are 2 target values and
100 text positions; then you would need to estimate (2)(100)(50000) = 10,000,000 terms.
Complexity is further reduced by making the reasonable assumption that the probability of encountering a
specific word is independent of its position.  That is, you assume:
P(ai=Wk|Vj) = P(am=Wk|Vj) for all i, j, k, m
So now (in the example above) you only need (2)(50000) = 100,000 estimates, which is large but manageable.

Finally, an m-estimate is used with uniform priors and m = size of the word vocabulary.
That is, P(Wk|Vj) = (nk + 1)/(n + |Vocabulary|)
where nk is the number of times Wk appears in the documents with label Vj
and n is the total number of word positions in the documents with label Vj
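
The formula above is a special case of the general m-estimate. As a short derivation, assuming the standard m-estimate form (the symbols p for the prior estimate and m for the equivalent sample size are introduced here and are not defined elsewhere in these notes):

```latex
\hat{P}(W_k \mid V_j) = \frac{n_k + m\,p}{n + m},
\qquad p = \frac{1}{|\mathrm{Vocabulary}|},\quad m = |\mathrm{Vocabulary}|
\;\Longrightarrow\;
\hat{P}(W_k \mid V_j) = \frac{n_k + 1}{n + |\mathrm{Vocabulary}|}
```

With a uniform prior, every word gets a nonzero estimate even if it never occurs in the documents with label Vj.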

Resulting algorithm:
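LearnNaiveBayes(Examples, V)  where V is the set of target values
1.  Vocabulary = set of all distinct words and other tokens that occur in Examples
2.  Calculate the P(Vj) and P(Wk|Vj) terms: for each target value Vj in V
   docs_j = the subset of Examples with label Vj
   P(Vj) = |docs_j| / |Examples|
   n = total number of word positions in docs_j
   for each word Wk in Vocabulary
      nk = number of times Wk occurs in docs_j
      P(Wk|Vj) = (nk + 1)/(n + |Vocabulary|)

To classify a new document Doc:
   positions = all word positions in Doc that contain tokens in Vocabulary
   return  VNB = argmax over Vj in V of  P(Vj) · PRODUCT over i in positions of P(ai|Vj)
Note: any words in Doc that do not appear in the training text are ignored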
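
The learning and classification steps can be sketched in Python. This is a minimal illustration, not the code used in the experiments below: the function names, the toy documents, and the use of log probabilities (to avoid underflow on long documents) are all assumptions made for the example.

```python
import math
from collections import Counter

def learn_naive_bayes(examples):
    """examples: list of (tokens, label) pairs. Returns (vocabulary, priors, cond)."""
    vocabulary = {w for tokens, _ in examples for w in tokens}
    labels = {label for _, label in examples}
    priors, cond = {}, {}
    for v in labels:
        docs_v = [tokens for tokens, label in examples if label == v]
        priors[v] = len(docs_v) / len(examples)          # P(Vj)
        counts = Counter(w for tokens in docs_v for w in tokens)
        n = sum(counts.values())                         # word positions with label v
        # m-estimate with uniform prior: (nk + 1) / (n + |Vocabulary|)
        cond[v] = {w: (counts[w] + 1) / (n + len(vocabulary)) for w in vocabulary}
    return vocabulary, priors, cond

def classify_naive_bayes(doc_tokens, vocabulary, priors, cond):
    # Words that never appeared in the training text are ignored.
    positions = [w for w in doc_tokens if w in vocabulary]
    def log_score(v):
        return math.log(priors[v]) + sum(math.log(cond[v][w]) for w in positions)
    return max(priors, key=log_score)

# Hypothetical toy training set (not from the notes).
examples = [
    ("a very silly document".split(), "dislikes"),
    ("this is silly".split(), "dislikes"),
    ("an interesting article".split(), "likes"),
]
model = learn_naive_bayes(examples)
print(classify_naive_bayes("this is a very silly document".split(), *model))
```

Summing logs instead of multiplying probabilities does not change the argmax, since the logarithm is monotonic.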

Experimental Results
20 usenet groups, 1000 articles from each group collected to give 20,000 examples
2/3 used for training, the rest used as the test set
random guessing would have accuracy of approx 5%
Naive Bayes achieved an accuracy of 89%
The only variation from the pseudocode we gave was that the 100 most frequent words were removed
(such as "the", "of", etc.) and any word occurring fewer than 3 times was also removed.  The resulting
vocabulary contained approx 38,500 words.

Newsreader (program for reading netnews that allows user to rate articles as he/she reads them)
16% articles interesting
59% of articles Newsreader recommended were interesting

Bayesian Belief Nets
Independence assumption
P(a1,...,an|Vj) = P(a1|Vj)·...·P(an|Vj)
made by Naive Bayes greatly reduces the complexity, but this assumption is often too strong.
Let's begin with an example:
Suppose you want to predict if there's a forest fire.  Suppose you observe 5 boolean attributes: storm, lightning,
campfire, thunder, and Bus Tour Group
(or more broadly you want to estimate the prob of one of the 2^5 = 32 possible examples)

Without any independence assumptions you would need to estimate 2^6 = 64 probabilities (and this is a toy example).
What does the Bayes net represent?  As an example let's look at Campfire.

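Campfire is conditionally independent of its non-descendants (Lightning, Thunder) given its parents (Storm, BusTourGroup):
Pr(Campfire|Storm,BusTourGroup,Lightning,Thunder) = Pr(Campfire|Storm,BusTourGroup)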
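$$\begin{aligned}
P(S,B,L,C,T,F) &= P(S) \cdot \frac{P(S,B)}{P(S)} \cdot \frac{P(S,B,L)}{P(S,B)} \cdot \frac{P(S,B,L,C)}{P(S,B,L)} \cdot \frac{P(S,B,L,C,T)}{P(S,B,L,C)} \cdot \frac{P(S,B,L,C,T,F)}{P(S,B,L,C,T)} \\
&= P(S) \cdot P(B \mid S) \cdot P(L \mid S,B) \cdot P(C \mid S,B,L) \cdot P(T \mid S,B,L,C) \cdot P(F \mid S,B,L,C,T)
\end{aligned}$$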

Conditional Independence Assumptions
P(B|S) = P(B)
P(L|S,B) = P(L|S)
P(C|S,B,L) = P(C|S,B)
P(T|S,B,L,C) = P(T|L)
P(F|S,B,L,C,T) = P(F|S,L,C)
Thus, P(S,B,L,C,T,F) = P(S)·P(B)·P(L|S)·P(C|S,B)·P(T|L)·P(F|S,L,C)

These probabilities come directly from the tables stored at the nodes.
Instead of the 64 probabilities in the joint distribution over the six variables, here we only need to estimate:
1+1+2+4+2+8 = 18 probabilities  (one for S, one for B, two for L, four for C, two for T, and eight for F)
From these 6 local distributions (and using the conditional independence assumptions) you can compute any of the
64 probabilities in the joint distribution and hence answer questions like the prob of a forest fire under certain conditions.
Q) What Bayes net would correspond to assuming all variables are independent?
A) no edges
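
The parameter count can be reproduced mechanically from the graph structure. A minimal sketch, assuming every variable is boolean so that a node with k parents needs 2^k probabilities; the function name and dictionary layout are illustrative only.

```python
def count_parameters(parents):
    """For boolean nodes: one probability per joint setting of a node's parents."""
    return {node: 2 ** len(ps) for node, ps in parents.items()}

# Parent sets read off the conditional independence assumptions above.
forest_fire_net = {
    "S": [],                 # P(S)
    "B": [],                 # P(B)
    "L": ["S"],              # P(L|S)
    "C": ["S", "B"],         # P(C|S,B)
    "T": ["L"],              # P(T|L)
    "F": ["S", "L", "C"],    # P(F|S,L,C)
}

counts = count_parameters(forest_fire_net)
print(counts, sum(counts.values()))

# The "no edges" network from the Q/A above: every variable is independent.
no_edges = {node: [] for node in forest_fire_net}
print(sum(count_parameters(no_edges).values()))
```

The first total is the 1+1+2+4+2+8 = 18 from the text; with no edges only one probability per variable is needed.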

Now Let's Generalize
A Bayesian Belief Net represents the joint prob dist for a set of variables by specifying a set of conditional independence
assumptions (represented by a directed acyclic graph) together with a set of local conditional probabilities (often called marginals).
For values y1,...,yn of Y1,...,Yn:
$$P(y_1,\ldots,y_n) = \prod_{i=1}^{n} P(y_i \mid Parents(Y_i))$$
where P(y1,...,yn) is shorthand for P(Y1=y1,...,Yn=yn)
and Parents(Yi) are the nodes with edges directly into Yi
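
As a small illustration of this product formula, the sketch below computes joint probabilities for a three-node chain Storm -> Lightning -> Thunder. The conditional probability values are hypothetical (the notes do not give tables for the forest fire network), and the data layout is just one convenient choice.

```python
from itertools import product

# parents[node] lists the nodes with edges directly into node.
parents = {"Storm": (), "Lightning": ("Storm",), "Thunder": ("Lightning",)}

# Hypothetical tables: cpt[node][parent_values] = P(node = True | parents).
cpt = {
    "Storm": {(): 0.1},
    "Lightning": {(True,): 0.6, (False,): 0.05},
    "Thunder": {(True,): 0.9, (False,): 0.02},
}

def joint(assignment):
    """P(y1,...,yn) = product over i of P(yi | Parents(Yi))."""
    p = 1.0
    for node, ps in parents.items():
        p_true = cpt[node][tuple(assignment[q] for q in ps)]
        p *= p_true if assignment[node] else 1.0 - p_true
    return p

nodes = list(parents)
total = sum(joint(dict(zip(nodes, vals)))
            for vals in product([True, False], repeat=len(nodes)))
print(joint({"Storm": True, "Lightning": True, "Thunder": True}))  # 0.1 * 0.6 * 0.9
print(total)  # the eight joint entries sum to 1 (up to rounding)
```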

Problem: Infer the prob dist for some variable (e.g. Forest Fire) given only observed values for a subset of the other variables.
(If Forest Fire had 5 possible values, you would be computing a prob dist with 5 components.)

Applet: lets you try other variations

We'll use this example (a fire alarm network whose variables include tampering, fire, alarm, smoke, and leaving):

Suppose you are given smoke = T.  How do the probabilities change?  That is, what are the posteriors?
By Bayes Thm:
$$P(fire \mid smoke) = \frac{P(smoke \mid fire) \cdot P(fire)}{P(smoke)}$$
where P(smoke) is the prior:
P(smoke) = P(smoke /\ fire) + P(smoke /\ !fire)
                = P(smoke|fire)·P(fire) + P(smoke|!fire)·P(!fire)
                = (.9)(.01) + (.01)(.99) = .0189

$$\text{Posterior for } P(fire) = \frac{(.9)(.01)}{.0189} \approx .4762$$
where .9 = P(smoke|fire), .01 = prior for P(fire), and .0189 = prior for P(smoke)
So the posterior for P(!fire) = 1-.4762 = .5238
Tampering is independent of fire (and hence of smoke), so the posterior for P(tampering) = .02, P(!tampering) = .98
P(alarm) = P(alarm /\ fire /\ tampering) + P(alarm /\ fire /\ !tampering) + P(alarm /\ !fire /\ tampering) + P(alarm /\ !fire /\ !tampering)
              = P(a|f,t)·P(f)·P(t)+P(a|f,!t)·P(f)·P(!t)+P(a|!f,t)·P(!f)·P(t)+P(a|!f,!t)·P(!f)·P(!t)
              = (.5)(.4762)(.02) + (.99)(.4762)(.98) + (.85)(.5238)(.02) + (.0001)(.5238)(.98)
              =~ .4757
P(leaving) = P(leaving|alarm)·P(alarm)+P(leaving|!alarm)·P(!alarm)
                = (.88)(.4757) + (.001)(.5243) = .4192
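
The arithmetic above can be checked with a few lines of Python. The table entries are the ones that appear in the calculation; the variable names are chosen for the example.

```python
# Table entries read off the worked example above.
p_fire, p_tamper = 0.01, 0.02
p_smoke_given = {True: 0.9, False: 0.01}                        # P(smoke | fire)
p_alarm_given = {(True, True): 0.5, (True, False): 0.99,
                 (False, True): 0.85, (False, False): 0.0001}   # P(alarm | fire, tampering)
p_leave_given = {True: 0.88, False: 0.001}                      # P(leaving | alarm)

# Prior for smoke, then Bayes' theorem for the posterior of fire given smoke = T.
p_smoke = p_smoke_given[True] * p_fire + p_smoke_given[False] * (1 - p_fire)
post_fire = p_smoke_given[True] * p_fire / p_smoke

# Posterior for alarm: sum over the four (fire, tampering) combinations.
post_alarm = sum(
    p_alarm_given[(f, t)]
    * (post_fire if f else 1 - post_fire)
    * (p_tamper if t else 1 - p_tamper)
    for f in (True, False) for t in (True, False)
)

# Posterior for leaving: condition on alarm.
post_leave = p_leave_given[True] * post_alarm + p_leave_given[False] * (1 - post_alarm)

print(round(p_smoke, 4), round(post_fire, 4), round(post_alarm, 4), round(post_leave, 4))
# 0.0189 0.4762 0.4757 0.4192
```

Carrying full precision through every step is what gives .4192 for leaving; multiplying the already-rounded .4757 by hand gives a value closer to .4191.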

smoke=T /\ alarm=T
for a 10 pt HW problem show work (as above) to obtain the last two columns and also give Prob if observation leaving=T

In fact, a Bayesian net can be used to compute the prob dist for any subset of network variables given the values
or distributions for any subset of the remaining vars

Exact inference of probabilities for an arbitrary Bayes net is NP-hard, so here we try to approximate them (even approximate
inference can be shown to be NP-hard, but in practice these heuristics work).

Suppose you want to compute:
  P(y|x) where X is the set of observed variables and Y is the set of variables deemed important for prediction or diagnosis
By Bayes's rule
$$P(y \mid x) = \frac{\sum_{s} P(y,x,s)}{\sum_{y,s} P(y,x,s)} = \frac{P(x \wedge y)}{P(x)}$$
where s ranges over all vars except those in X and Y
Complexity depends on # of parents
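
The sum-over-s formula can be implemented directly by brute-force enumeration of the joint distribution. The sketch below does this for the fire/smoke example, using the table values that appear in the worked calculation earlier in these notes; the function names are hypothetical, and the approach is only practical for tiny networks because the number of terms doubles with each unobserved boolean variable.

```python
from itertools import product

VARS = ["tampering", "fire", "alarm", "smoke", "leaving"]
ALARM = {(True, True): 0.5, (True, False): 0.99,
         (False, True): 0.85, (False, False): 0.0001}

def p_true(var, a):
    """P(var = True | parents of var) under assignment a."""
    if var == "tampering":
        return 0.02
    if var == "fire":
        return 0.01
    if var == "smoke":
        return 0.9 if a["fire"] else 0.01
    if var == "leaving":
        return 0.88 if a["alarm"] else 0.001
    return ALARM[(a["fire"], a["tampering"])]   # var == "alarm"

def joint(a):
    """Product of the local conditional probabilities for a full assignment."""
    p = 1.0
    for var in VARS:
        pt = p_true(var, a)
        p *= pt if a[var] else 1.0 - pt
    return p

def mass(fixed):
    """Sum the joint over every variable not mentioned in `fixed`."""
    free = [v for v in VARS if v not in fixed]
    return sum(joint({**fixed, **dict(zip(free, vals))})
               for vals in product([True, False], repeat=len(free)))

def query(y, x):
    """P(y | x) = sum_s P(y, x, s) / sum_{y,s} P(y, x, s)."""
    return mass({**x, **y}) / mass(x)

print(round(query({"fire": True}, {"smoke": True}), 4))
print(round(query({"leaving": True}, {"smoke": True}), 4))
```

Both queries agree with the step-by-step posteriors computed by hand for smoke = T.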

Learning Bayesian Belief Nets
If the network structure is given in advance and all variables are fully observable, then learning the conditional prob tables
is straightforward.  Just estimate the cond. prob table entries as we would for a naive Bayes classifier.

Consider when network structure is known but only some of the variable values are observable in the training data.
Problem is similar to learning weights for hidden units in a neural net.
(If network (viewed as undirected graph) is a tree then the problem can easily be solved as we did in the ex, but with
cycles it is harder and it's NP-Hard to find exact solution)

Objective function to maximize:
P(D|h)  where D is the training data and h is the hypothesis
By def this corresponds to searching for the maximum likelihood hyp for table entries

Gradient Ascent Training of Bayesian Nets (Russell et al.)
Maximize P(D|h) by following the gradient of ln P(D|h) wrt parameters that define the conditional prob tables
of the Bayesian network
Rule you end with is:
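1.  Update each Wijk:
$$W_{ijk} \leftarrow W_{ijk} + \eta \sum_{d \in D} \frac{P_h(y_{ij}, u_{ik} \mid d)}{W_{ijk}}$$
where Wijk is one entry in a conditional prob table, Wijk = P(Yi = yij | Ui = uik), and Ui is the set of parents of Yi.
For example, if Yi is Campfire:
eta is the learning rate
yij is the value of Campfire
uik is the value of <Storm,BusTourGroup>

2.  Renormalize the Wijk to ensure they are valid probability distributions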
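
To make the two steps concrete, here is a sketch of a single update on the smallest network with a hidden variable: a hidden boolean H with one observed boolean child X. Everything here is an assumption made for illustration: the network, the fixed prior on H, the starting table, and the data. The sketch shows only the mechanics of one step and makes no claim about convergence.

```python
import math

def posterior_joint(w, prior_h, d, j, k):
    """P_h(X=j, H=k | d) when example d records only the observed value of X."""
    if d != j:
        return 0.0
    evidence = sum(prior_h[kk] * w[d][kk] for kk in (0, 1))
    return prior_h[k] * w[d][k] / evidence

def gradient_step(w, prior_h, data, eta):
    """w[j][k] = P(X=j | H=k). Step 1: gradient update. Step 2: renormalize."""
    new_w = [[w[j][k] + eta * sum(posterior_joint(w, prior_h, d, j, k) for d in data) / w[j][k]
              for k in (0, 1)] for j in (0, 1)]
    for k in (0, 1):                       # each column must sum to 1 over j
        z = new_w[0][k] + new_w[1][k]
        for j in (0, 1):
            new_w[j][k] /= z
    return new_w

def log_likelihood(w, prior_h, data):
    """ln P(D|h), summing out the hidden variable H for each example."""
    return sum(math.log(sum(prior_h[k] * w[d][k] for k in (0, 1))) for d in data)

prior_h = [0.5, 0.5]                       # hypothetical fixed P(H)
w = [[0.5, 0.6], [0.5, 0.4]]               # hypothetical starting table
data = [1, 1, 1, 0]                        # hypothetical observed values of X

w_new = gradient_step(w, prior_h, data, eta=0.01)
print(w_new)
print(log_likelihood(w, prior_h, data), log_likelihood(w_new, prior_h, data))
```

On this data the step raises ln P(D|h), and after renormalization each column of the table still sums to 1.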

The EM Algorithm
General technique to use when only a subset of the relevant features are observable.  Can also apply when label is missing
on some exs (i.e. have unlabeled data along with some labeled data)

EM alg has been used to train Bayesian Belief Nets

Ex. Estimating the means of k Gaussians.  For now suppose k=2, and G1 and G2 are two Gaussians with the same variance.

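Get data as follows: each xi is generated by choosing one of the two Gaussians at random and then drawing xi from it.
Goal: find h = <µ1, µ2>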

Now consider the same problem but with k Gaussians (all with the same variance).  The goal is to output a hyp h = <µ1,...,µk>
that is a maximum likelihood hyp,
that is, h should maximize p(D|h)

For a moment suppose each example was <xi, zi1, zi2, ..., zik>, where zi1, zi2, ..., zik are indicators of which normal xi came from (zij = 1 if xi was generated by the jth normal, and 0 otherwise).

This is an easy problem.  For the mj examples from gaussian j you want
$$\mu_{ML} = \arg\min_{\mu_j} \sum_{i=1}^{m_j} (x_i - \mu_j)^2$$
where µj is the mean of the jth gaussian
It can be shown that the sum of squared errors is minimized by the sample mean
$$\mu_{ML} = \frac{1}{m_j} \sum_{i=1}^{m_j} x_i$$
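
The "it can be shown" step is a one-line calculus argument: set the derivative of the squared error with respect to the mean to zero.

```latex
\frac{d}{d\mu_j} \sum_{i=1}^{m_j} (x_i - \mu_j)^2
  = -2 \sum_{i=1}^{m_j} (x_i - \mu_j) = 0
\;\Longrightarrow\;
\mu_j = \frac{1}{m_j} \sum_{i=1}^{m_j} x_i
```

The second derivative is 2·mj > 0, so this stationary point is a minimum.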

But now suppose the indicator attributes zi1,...,zik are hidden, so you just see xi.
EM, the alg we are about to see, can be applied when you have hidden attributes.
EM searches for a maximum likelihood hyp by repeatedly re-estimating the expected values of the hidden variables given the
current hyp <µ1,...,µk>, then re-calculating the maximum likelihood hyp using these expected values for the hidden vars.

Let's look at EM for problem of estimating the two means
Note: in the "E" step, the current hypothesis is used to estimate the unobserved variables
          in the "M" step, the expected values of the unobserved variables are used to calculate an improved hyp.
Initialization: h = <µ1, µ2> where µ1 and µ2 are arbitrary initial values
Step 1: Calculate the expected value E[Zij] of each hidden variable Zij, assuming the current hyp h = <µ1, µ2> holds
Step 2: Calculate a new maximum likelihood hyp h' = <µ'1, µ'2>, assuming the value taken on by each hidden var Zij is its
expected value E[Zij] calculated in step 1.  Then replace h = <µ1, µ2> by the new hypothesis h' = <µ'1, µ'2> and iterate

Let's look at how these two steps are implemented (for our two means ex)
Step 1  Must estimate E[Zij], which is the prob that xi was generated by the jth normal
$$E[z_{ij}] = \frac{p(x = x_i \mid \mu = \mu_j)}{\sum_{n=1}^{2} p(x = x_i \mid \mu = \mu_n)} = \frac{e^{-\frac{1}{2\sigma^2}(x_i - \mu_j)^2}}{\sum_{n=1}^{2} e^{-\frac{1}{2\sigma^2}(x_i - \mu_n)^2}}$$
Compute this by plugging in the current values of <µ1, µ2> and the observed xi.
Step 2  You can show the maximum likelihood hyp in this case is given by the weighted sample mean (for further info here see 6.12.3):
$$\mu_j \leftarrow \frac{\sum_{i=1}^{m} E[z_{ij}]\, x_i}{\sum_{i=1}^{m} E[z_{ij}]}$$
This alg (in general) converges to a local maximum likelihood hyp.
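
The two steps translate almost line for line into code. The sketch below assumes a known shared sigma, a fixed number of iterations instead of a convergence test, and an M step that divides each weighted sum by the sum of the expected memberships (the usual weighted-mean form); the data and starting means are made up for the example.

```python
import math

def em_two_means(xs, mu, sigma=1.0, iterations=20):
    """EM for the means of two Gaussians with a shared, known sigma."""
    mu = list(mu)
    for _ in range(iterations):
        # E step: E[z_ij] = probability that x_i was generated by Gaussian j.
        # (No guard against underflow: points very far from both means would
        # make `total` zero. Fine for this toy data.)
        resp = []
        for x in xs:
            weights = [math.exp(-((x - m) ** 2) / (2 * sigma ** 2)) for m in mu]
            total = sum(weights)
            resp.append([w / total for w in weights])
        # M step: each new mean is the E[z_ij]-weighted average of the x_i.
        mu = [sum(r[j] * x for r, x in zip(resp, xs)) / sum(r[j] for r in resp)
              for j in range(2)]
    return mu

# Hypothetical data: two well-separated clusters around 0 and 10.
xs = [-1.0, 0.0, 1.0, 9.0, 10.0, 11.0]
mu = em_two_means(xs, mu=(1.0, 8.0))
print(mu)
```

Because the clusters are far apart relative to sigma, the expected memberships are almost 0 or 1 and the means settle near 0 and 10 within a few iterations.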

Sections 6.3-6.5 show how you can give Bayesian interpretations for:
First Order Hidden Markov Models
Can be represented as the following Bayesian belief net:

Hi are hidden state variables
Oi are observed variables

Note P(Ht+1|H1,...,Ht) = P(Ht+1|Ht)
That is, given state Ht, Ht+1 is independent of earlier states.
Joint distribution completely specified by:
P(H1) initial state prob
P(Ht+1|Ht) transition prob
P(Ot|Ht)  emission probabilities

left-to-right topology of HMM
each node represents a value of state var Ht

Represents the distribution of acoustic sequences associated with a unit of speech (e.g. phoneme, word)

For speech recognition bring in many levels of abstraction

Estimating probabilities: Use EM
To make predictions from input observations O1,...,On: use dynamic programming.  The subproblem is:
m[i,L] = probability of the most likely path from state i that produces outputs OL,...,On
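
A sketch of this dynamic program, filling the table m[i,L] from the last observation backwards. The two-state HMM parameters are hypothetical, indices are 0-based rather than 1-based, and the brute-force function exists only to check the recurrence on a short sequence.

```python
from itertools import product

def most_likely_path_prob(init, trans, emit, obs):
    """m[i][L] = prob. of the most likely path that is in state i at time L
    and produces obs[L], ..., obs[n-1]."""
    states = range(len(init))
    n = len(obs)
    m = [[0.0] * n for _ in states]
    for i in states:
        m[i][n - 1] = emit[i][obs[n - 1]]
    for L in range(n - 2, -1, -1):
        for i in states:
            m[i][L] = emit[i][obs[L]] * max(trans[i][j] * m[j][L + 1] for j in states)
    # Combine with the initial state probabilities P(H1).
    return max(init[i] * m[i][0] for i in states)

def brute_force(init, trans, emit, obs):
    """Enumerate every state path; exponential, used only as a check."""
    best = 0.0
    for path in product(range(len(init)), repeat=len(obs)):
        p = init[path[0]] * emit[path[0]][obs[0]]
        for t in range(1, len(obs)):
            p *= trans[path[t - 1]][path[t]] * emit[path[t]][obs[t]]
        best = max(best, p)
    return best

# Hypothetical 2-state HMM with 2 output symbols.
init = [0.6, 0.4]                      # P(H1)
trans = [[0.7, 0.3], [0.4, 0.6]]       # P(Ht+1 | Ht)
emit = [[0.9, 0.1], [0.2, 0.8]]        # P(Ot | Ht)
obs = [0, 1, 1, 0]

print(most_likely_path_prob(init, trans, emit, obs))
print(brute_force(init, trans, emit, obs))
```

The table has (number of states) × n entries and each takes one pass over the states to fill, so the cost is quadratic in the number of states and linear in n, compared with the exponential number of paths the brute-force check enumerates.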