An Example: Text Classification
each example is a text document
label is the type of document (e.g. articles I will find interesting)
We give an algorithm based on naive Bayes which is very effective
Two key decisions:
- how the text document is converted to attributes
- how to estimate the needed probabilities
Use a very simple way to represent the document:
Define an attribute for each word position.
The value of the attribute is the English word found in that position.
So the # of attributes per example varies
Ex: 1000 training documents that someone has classified:
- 700 classified as "dislikes"
- 300 classified as "likes"
Suppose document 1 is "This is a very silly document"
V_NB = argmax_{Vj in {likes,dislikes}} P(V_j) PI_{i=1 to 6} P(a_i|V_j)
     = argmax_{Vj in {likes,dislikes}} P(V_j)·P(a_1=this|V_j)·P(a_2=is|V_j)·...·P(a_6=document|V_j)
Note: the independence assumption that the word in one position is independent
of the words in the other positions clearly does not hold here. Yet in practice it works quite well
(HW3 paper option: Domingos and Pazzani, 1996 provide an interesting
analysis of this phenomenon.)
Back to the example: we need estimates for
P(V_j) and P(a_i=W_k|V_j)
P(V_j) is easy; for example, P(likes) = .3 and P(dislikes) = .7
V_NB = argmax_{Vj elementof V} P(V_j) PI_{i=1 to n} P(a_i|V_j)
for P(V_j), use the % of total documents with label V_j
a_i is the ith word of the text
Estimating P(a_i=W_k|V_j) is still problematic.
English has about 50,000 distinct words. Suppose 2 target values and
100 text positions; then you would need to estimate (2)(100)(50,000) = 10,000,000 terms
Complexity is further reduced by making the very reasonable assumption
that the probability of encountering a specific word is independent of position. That is, you assume:
P(a_i=W_k|V_j) = P(a_m=W_k|V_j) for all i, j, k, m
So now (in the ex above) you only need (2)(50,000) = 100,000 estimates,
which is large but manageable
Finally, an m-estimate is used with uniform priors and m = size of the word vocabulary
That is, P(W_k|V_j) = (n_k + 1)/(n + |Vocabulary|)
where n_k is the # of times W_k appears in documents with label V_j
and n is the total # of word positions in documents with label V_j
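For instance (numbers invented just to illustrate the formula, not taken from the example above): if W_k = "silly" appears n_k = 10 times across the "likes" documents, the "likes" documents contain n = 25,000 word positions in total, and |Vocabulary| = 50,000, then
P(silly|likes) = (10 + 1)/(25,000 + 50,000) = 11/75,000 =~ .000147
and a word never seen with "likes" still gets probability 1/75,000 rather than 0.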
Resulting algorithm:
LearnNaiveBayes(Examples, V)   where V is the set of target values
1. Vocabulary = set of all distinct words and other tokens that occur in Examples
2. Calculate the P(V_j) and P(W_k|V_j) terms by:
   for each V_j elementof V do
      docs_j = subset of Examples with label V_j
      P(V_j) = |docs_j| / |Examples|
      Text_j = document obtained by concatenating all documents in docs_j
      n = total # of words/tokens in Text_j
      for each word W_k in Vocabulary
         n_k = # of times W_k occurs in Text_j
         P(W_k|V_j) = (n_k + 1)/(n + |Vocabulary|)
ClassifyNaiveBayes(Doc)
   positions = all word positions in Doc that contain tokens in Vocabulary
   return V_NB = argmax_{Vj elementof V} P(V_j) PI_{i=1 to n} P(a_i|V_j)
   where n is the number of positions
Note: any words in Doc not in the training text are ignored
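Below is a minimal Python sketch of the two procedures above (my own code, not from the course materials). It works in log-probabilities to avoid underflow on long documents, which the pseudocode does not mention but is standard practice; function names and the toy training set are invented for illustration.

import math
from collections import Counter, defaultdict

def learn_naive_bayes(examples, labels):
    """examples: list of (list_of_tokens, label); labels: set of target values."""
    vocabulary = set(tok for doc, _ in examples for tok in doc)
    prior = {}                      # P(V_j)
    cond = defaultdict(dict)        # P(W_k | V_j)
    for v in labels:
        docs_j = [doc for doc, lab in examples if lab == v]
        prior[v] = len(docs_j) / len(examples)
        text_j = [tok for doc in docs_j for tok in doc]   # concatenate docs_j
        n = len(text_j)
        counts = Counter(text_j)
        for w in vocabulary:
            # m-estimate with uniform priors, m = |Vocabulary|
            cond[v][w] = (counts[w] + 1) / (n + len(vocabulary))
    return vocabulary, prior, cond

def classify_naive_bayes(doc, vocabulary, prior, cond):
    """Return argmax_Vj P(V_j) * prod_i P(a_i|V_j); words not in Vocabulary are ignored."""
    positions = [tok for tok in doc if tok in vocabulary]
    best_v, best_score = None, float("-inf")
    for v in prior:
        score = math.log(prior[v]) + sum(math.log(cond[v][tok]) for tok in positions)
        if score > best_score:
            best_v, best_score = v, score
    return best_v

# toy usage
train = [("this is a very silly document".split(), "dislikes"),
         ("a very interesting article about learning".split(), "likes")]
model = learn_naive_bayes(train, {"likes", "dislikes"})
print(classify_naive_bayes("another silly document".split(), *model))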
Experimental Results
20 usenet groups, 1000 articles from each group collected to give 20,000
examples
2/3 used for training, the rest used for the test set
random guessing would have accuracy of approx 5%
Naive Bayes achieved an accuracy of 89%
The only variation from the pseudocode we gave was that the 100 most frequent
words (such as "the", "of", etc.) were removed, and any word occurring fewer than 3 times
was also removed. The resulting vocabulary contained approx 38,500 words.
Newsreader (program for reading netnews that allows user to rate articles
as he/she reads them)
16% of all articles were interesting
59% of the articles Newsreader recommended were interesting
Bayesian Belief Nets
The independence assumption
P(a_1,...,a_n|V_j) = P(a_1|V_j)·...·P(a_n|V_j)
made by naive Bayes greatly reduces the complexity, but this assumption is often too strong.
Let's begin with an example:
Suppose you want to predict if there's a forest fire. Suppose you
observe 5 boolean attributes: Storm, Lightning, Campfire, Thunder, and BusTourGroup
(or more broadly, you want to estimate the prob of any one of the 2^5 = 32 possible combinations of attribute values)
Without any independence assumptions you would need to estimate 2^6 = 64
probabilities (the 6th variable being ForestFire), and this is a toy example
What does the Bayes net represent? As an example let's look at Campfire:
Pr(Campfire | other 5 attribs) = Pr(Campfire | Storm, BusTourGroup)
P(S,B,L,C,T,F) = P(S) · [P(S,B)/P(S)] · [P(S,B,L)/P(S,B)] · [P(S,B,L,C)/P(S,B,L)]
                 · [P(S,B,L,C,T)/P(S,B,L,C)] · [P(S,B,L,C,T,F)/P(S,B,L,C,T)]
               = P(S) · P(B|S) · P(L|S,B) · P(C|S,B,L) · P(T|S,B,L,C) · P(F|S,B,L,C,T)
Conditional Independence Assumptions
P(B|S) = P(B)
P(L|S,B) = P(L|S)
P(C|S,B,L) = P(C|S,B)
P(T|S,B,L,C) = P(T|L)
P(F|S,B,L,C,T) = P(F|S,L,C)
Thus, P(S,B,L,C,T,F) = P(S)·P(B)·P(L|S)·P(C|S,B)·P(T|L)·P(F|S,L,C)
These probabilities come directly from the tables stored at the nodes
Instead of the 64 probabilities in the full joint distribution, here we only need
to estimate:
1+1+2+4+2+8 = 18 probabilities (one table per node: S+B+L+C+T+F = 18)
From these 6 local conditional distributions (and using the conditional independence
assumptions) you can compute any of the 64 probabilities in the joint distribution, and hence answer questions
like the prob. of forest fire under certain conditions
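To make the saving concrete, here is a small Python sketch (my own code) that stores only the 18 numbers above and computes any of the 64 joint entries from the factored form. The CPT values are invented for illustration; the notes do not give them.

# P(S,B,L,C,T,F) = P(S)*P(B)*P(L|S)*P(C|S,B)*P(T|L)*P(F|S,L,C)

def p(prob_true, value):
    """Probability that a boolean variable takes `value`, given P(var = True)."""
    return prob_true if value else 1.0 - prob_true

# Invented CPTs (only the P(var=True | parents) entries need storing):
P_S = 0.1                                            # P(Storm)
P_B = 0.2                                            # P(BusTourGroup)
P_L = {True: 0.6, False: 0.05}                       # P(Lightning | Storm)
P_C = {(True, True): 0.1, (True, False): 0.02,       # P(Campfire | Storm, Bus)
       (False, True): 0.6, (False, False): 0.3}
P_T = {True: 0.9, False: 0.01}                       # P(Thunder | Lightning)
P_F = {(True, True, True): 0.5, (True, True, False): 0.4,   # P(Fire | S, L, C)
       (True, False, True): 0.3, (True, False, False): 0.1,
       (False, True, True): 0.2, (False, True, False): 0.1,
       (False, False, True): 0.05, (False, False, False): 0.001}

def joint(s, b, l, c, t, f):
    return (p(P_S, s) * p(P_B, b) * p(P_L[s], l) * p(P_C[(s, b)], c)
            * p(P_T[l], t) * p(P_F[(s, l, c)], f))

# e.g. probability of storm, no bus tour, lightning, no campfire, thunder, fire
print(joint(True, False, True, False, True, True))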
Q) What Bayes net would correspond to assuming all variables are independent?
A) no edges
Now Let's Generalize
A Bayesian belief net represents the joint prob dist for a set of variables
by specifying a set of conditional independence assumptions (represented by a directed acyclic graph)
together with a set of local conditional probabilities (the conditional probability table stored at each node)
For values y_1,...,y_n of Y_1,...,Y_n:
P(y_1,...,y_n) = PI_{i=1 to n} P(y_i|Parents(Y_i))
where P(y_1,...,y_n) is shorthand for P(Y_1=y_1,...,Y_n=y_n)
and Parents(Y_i) are the nodes with edges directly into Y_i
Inference
Problem: Infer prob dist for some variable (e.g. Forest Fire) given
only observed values for a subset of the other variables
(If Forest Fire had 5 possible values, we would be computing a prob dist with 5 components)
Applet: www.cs.ubc.ca/labs/lci/CIspace/bayes.html lets you try other variations
We'll use the fire-alarm example from the applet: tampering and fire are the parents of alarm, fire is the parent of smoke, and alarm is the parent of leaving
Suppose you are given smoke = T. How do the probabilities change? That is, what are the posteriors?
By Bayes' Thm: P(fire|smoke) = P(smoke|fire)·P(fire) / P(smoke)
(where P(smoke) is computed below from the priors)
P(smoke) = P(smoke /\ fire) + P(smoke /\ !fire)
         = P(smoke|fire)·P(fire) + P(smoke|!fire)·P(!fire)
         = (.9)(.01) + (.01)(.99) =~ .0189
Posterior for fire: P(fire|smoke) = (.9)(.01) / .0189 =~ .4762
where .9 = P(smoke|fire), .01 = the prior P(fire), and .0189 = P(smoke)
So the posterior for !fire is P(!fire|smoke) = 1 - .4762 = .5238
tampering and fire are independent (so observing smoke, which depends only on fire, tells us nothing
about tampering); hence the posteriors are P(tampering) = .02, P(!tampering) = .98
P(alarm) = P(alarm /\ fire /\ tampering) + P(alarm /\ fire /\ !tampering)
           + P(alarm /\ !fire /\ tampering) + P(alarm /\ !fire /\ !tampering)
         = P(a|f,t)·P(f)·P(t) + P(a|f,!t)·P(f)·P(!t) + P(a|!f,t)·P(!f)·P(t) + P(a|!f,!t)·P(!f)·P(!t)
           (using the posteriors given smoke = T for P(f), P(!f), P(t), P(!t))
         = (.5)(.4762)(.02) + (.99)(.4762)(.98) + (.85)(.5238)(.02) + (.0001)(.5238)(.98)
         =~ .4757
P(leaving) = P(leaving|alarm)·P(alarm) + P(leaving|!alarm)·P(!alarm)
           = (.88)(.4757) + (.001)(.5243) =~ .4192
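The hand calculation above can be checked mechanically. The sketch below (my own code) stores the CPT values quoted in the notes and computes each posterior given smoke=T by brute-force enumeration, i.e. by summing joint entries over all 32 worlds.

from itertools import product

P_t = 0.02                                        # P(tampering)
P_f = 0.01                                        # P(fire)
P_a = {(True, True): 0.5, (True, False): 0.99,    # P(alarm | fire, tampering)
       (False, True): 0.85, (False, False): 0.0001}
P_s = {True: 0.9, False: 0.01}                    # P(smoke | fire)
P_l = {True: 0.88, False: 0.001}                  # P(leaving | alarm)

def pr(p_true, val):
    return p_true if val else 1.0 - p_true

def joint(t, f, a, s, l):
    return (pr(P_t, t) * pr(P_f, f) * pr(P_a[(f, t)], a)
            * pr(P_s[f], s) * pr(P_l[a], l))

def posterior(query_var, evidence):
    """P(query_var = True | evidence), by summing the joint over all worlds."""
    names = ["tampering", "fire", "alarm", "smoke", "leaving"]
    num = den = 0.0
    for world in product([True, False], repeat=5):
        assignment = dict(zip(names, world))
        if any(assignment[k] != v for k, v in evidence.items()):
            continue
        p = joint(*world)
        den += p
        if assignment[query_var]:
            num += p
    return num / den

for var in ["tampering", "fire", "alarm", "smoke", "leaving"]:
    print(var, round(posterior(var, {"smoke": True}), 4))
# prints approximately .02, .4762, .4757, 1.0, .4192, matching the table below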
Summary of posteriors (columns: observation; rows: variable):

              smoke=T    alarm=T    smoke=T /\ alarm=T    priors
tampering     .02        .6334      .0287                 .02
fire          .4762      .3667      .9812                 .01
alarm         .4757      1.0        1.0                   .0267
smoke         1.0        .3364      1.0                   .0189
leaving       .4192      .88        .88                   .0245
For a 10 pt HW problem: show work (as above) to obtain the last two columns,
and also give the probabilities for the observation leaving=T
In fact, a Bayesian net can be used to compute the prob dist for any subset
of network variables given the values
or distributions for any subset of the remaining vars
Exact inference of probabilities for an arbitrary Bayes net is NP-hard.
So in practice we often approximate. (Even approximate inference can be shown to be NP-hard
in the worst case, but in practice the heuristics work well; a sampling sketch follows at the end of this subsection.)
Suppose you want to compute:
P(y|x), where x is the observed values of the variables X, and Y is the set of variables deemed important
for prediction or diagnosis
By Bayes' rule:
P(y|x) = P(x /\ y) / P(x) = SUM_s P(y,x,s) / SUM_{y,s} P(y,x,s)
where s ranges over all variables except those in X and Y
Complexity depends on # of parents
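The notes do not say which approximation heuristic is intended; one of the simplest is rejection sampling, sketched below (my own code) on the same fire-alarm network: sample complete worlds from the priors in topological order, keep only those consistent with the evidence, and estimate the posterior from the kept samples.

import random

P_a = {(True, True): 0.5, (True, False): 0.99,
       (False, True): 0.85, (False, False): 0.0001}

def sample_world():
    t = random.random() < 0.02                      # tampering
    f = random.random() < 0.01                      # fire
    a = random.random() < P_a[(f, t)]               # alarm | fire, tampering
    s = random.random() < (0.9 if f else 0.01)      # smoke | fire
    l = random.random() < (0.88 if a else 0.001)    # leaving | alarm
    return t, f, a, s, l

kept = fire_and_evidence = 0
for _ in range(1_000_000):
    t, f, a, s, l = sample_world()
    if s:                                           # evidence: smoke = True
        kept += 1
        fire_and_evidence += f
print(fire_and_evidence / kept)                     # approaches .4762 as the sample grows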
Learning Bayesian Belief Nets
If the network structure is given in advance and all variables are fully observable,
then learning the conditional prob tables is straightforward: just estimate the cond. prob table entries as
we would for a naive Bayes classifier
Now consider when the network structure is known but only some of the variable
values are observable in the training data.
The problem is similar to learning the weights for hidden units in a neural net.
(If the network (viewed as an undirected graph) is a tree, then the problem can easily be solved as we did in the ex, but with
cycles it is harder, and it's NP-hard to find an exact solution)
Objective function to maximize:
P(D|h), where D is the training data and h is the hypothesis
By definition this corresponds to searching for the maximum likelihood hyp for the table entries
Gradient Ascent Training of Bayesian Nets (Russell et al.)
Maximize P(D|h) by following the gradient of ln P(D|h) wrt the parameters that define the conditional prob tables
of the Bayesian network
The rule you end up with is:
1. w_ijk = w_ijk + eta · SUM_{d elementof D} P_h(y_ij, u_ik | d) / w_ijk
   where w_ijk is one entry in a conditional prob table (for example, the Campfire table),
   eta is the learning rate,
   y_ij is the value of the node (e.g. Campfire), and
   u_ik is the value of its parents (e.g. <Storm, BusTourGroup>)
2. renormalize the w_ijk to ensure they remain valid probability distributions
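A minimal sketch of steps 1 and 2 for a single CPT. It assumes some inference routine posterior_prob(j, k, d) that returns P_h(y_ij, u_ik | d); that routine, and the table layout w[j][k], are my own conventions for illustration, not something specified in the notes.

def gradient_ascent_step(w, training_data, posterior_prob, eta=0.01):
    """w[j][k] = P(Y_i = y_ij | U_i = u_ik); returns the updated, renormalized table."""
    # 1. gradient step on every entry
    new_w = {}
    for j in w:                    # values y_ij of the node (e.g. Campfire)
        new_w[j] = {}
        for k in w[j]:             # values u_ik of its parents (e.g. <Storm, BusTourGroup>)
            grad = sum(posterior_prob(j, k, d) / w[j][k] for d in training_data)
            new_w[j][k] = w[j][k] + eta * grad
    # 2. renormalize so that, for each parent value u_ik, sum_j w[j][k] = 1
    for k in next(iter(new_w.values())):
        total = sum(new_w[j][k] for j in new_w)
        for j in new_w:
            new_w[j][k] /= total
    return new_w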
The EM Algorithm
A general technique to use when only a subset of the relevant features is observable. It can also be applied when the label is missing
on some examples (i.e. you have unlabeled data along with some labeled data)
The EM alg has been used to train Bayesian belief nets
Ex: Estimating the means of k Gaussians. For now suppose k=2; G_1 and G_2 are two Gaussians with the same variance
Get data as follows:
- with prob 1/2 pick G_1 and with prob 1/2 pick G_2
- draw a random x based on the Gaussian selected
- only x is given (whether you picked G_1 or G_2 is a hidden variable)
Goal: find h = <µ_{1},µ_{2}>
Now consider the same problem but with k Gaussians (all with the same variance)
The goal is to output a hyp h = <µ_1,...,µ_k>
that is a maximum likelihood hyp,
that is, h should maximize P(D|h)
For a moment suppose each example was <x_i, z_i1, z_i2, ..., z_ik>, where z_i1, z_i2, ..., z_ik
are indicators of which Gaussian x_i came from
This is an easy problem. For the m_j examples from Gaussian j you want
µ_ML = argmin_{µj} SUM_{i=1 to m_j} (x_i - µ_j)^2
where µ_j is the mean of the jth Gaussian
It can be shown that the sum of squared errors is minimized by the sample mean:
µ_ML = (1/m_j) SUM_{i=1 to m_j} x_i
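A quick check of the "It can be shown" step (standard calculus, not spelled out in the notes): set the derivative with respect to µ_j to zero:
d/dµ_j SUM_{i=1 to m_j} (x_i - µ_j)^2 = -2 SUM_{i=1 to m_j} (x_i - µ_j) = 0
=> SUM_{i=1 to m_j} x_i = m_j·µ_j  =>  µ_j = (1/m_j) SUM_{i=1 to m_j} x_i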
But now suppose the z indicators are hidden, so you just see x_i
EM, the alg we are about to see, can be applied when you have hidden attributes like this
EM searches for a maximum likelihood hyp by repeatedly re-estimating the expected
values of the hidden variables given the current hyp <µ_1,...,µ_k>, then recalculating
the maximum likelihood hyp using these expected values for the hidden vars
Let's look at EM for problem of estimating the two means
Note: "E" is the current hypothesis used to estimate unobserved variables
"M" is the expected values for unobserved
variables to calculate an improved hyp.
Initialization: h = <µ_1,µ_2>, where µ_1 and µ_2 are arbitrary initial values
Step 1: Calculate the expected value E[Z_ij] of each hidden variable Z_ij, assuming the current hyp h = <µ_1,µ_2>
Step 2: Calculate a new maximum likelihood hyp h' = <µ'_1,µ'_2>, assuming the value taken on by each hidden var Z_ij is its
expected value E[Z_ij] calculated in step 1. Then replace h = <µ_1,µ_2> by the new hypothesis h' = <µ'_1,µ'_2> and iterate
Let's look at how these two steps are implemented (for our two means ex)
Step 1: We must estimate E[Z_ij], the prob that x_i was generated by the jth Gaussian:
E[Z_ij] = P(x=x_i | µ=µ_j) / SUM_{n=1 to 2} P(x=x_i | µ=µ_n)
        = e^{-(1/(2·sigma^2))·(x_i - µ_j)^2} / SUM_{n=1 to 2} e^{-(1/(2·sigma^2))·(x_i - µ_n)^2}
Compute this by plugging the current values of <µ_1,µ_2> and the observed x_i into the expression
Step 2: You can show that the maximum likelihood hyp in this case is given by the weighted sample mean
(for the derivation see section 6.12.3):
µ_j = SUM_{i=1 to m} E[Z_ij]·x_i / SUM_{i=1 to m} E[Z_ij]
This alg (in general) converges to a maximum likelihood hyp.
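A minimal Python sketch of this two-step loop for k = 2 (my own code; the shared variance, the initialization, and the toy data are invented for illustration):

import math, random

def em_two_means(xs, sigma=1.0, iterations=50):
    mu = [min(xs), max(xs)]                 # arbitrary initialization h = <mu1, mu2>
    for _ in range(iterations):
        # Step 1 (E): E[Z_ij] = prob that x_i came from Gaussian j under the current mu's
        E = []
        for x in xs:
            w = [math.exp(-(x - m) ** 2 / (2 * sigma ** 2)) for m in mu]
            E.append([wj / sum(w) for wj in w])
        # Step 2 (M): replace each mean by the weighted sample mean
        mu = [sum(E[i][j] * xs[i] for i in range(len(xs))) /
              sum(E[i][j] for i in range(len(xs)))
              for j in range(2)]
    return mu

# toy usage: samples drawn from N(0,1) and N(4,1), mixed 50/50
random.seed(0)
data = [random.gauss(0, 1) for _ in range(200)] + [random.gauss(4, 1) for _ in range(200)]
print(em_two_means(data))       # should approach [~0, ~4]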
Sections 6.3-6.5 show how you can give Bayesian interpretations for:
- version spaces
- using least-squared error (given that the noise obeys a normal dist)
- gradient search to maximize likelihood
First Order Hidden Markov Models
Can be represented as the following Bayesian belief net:
H_i are the hidden state variables
O_i are the observed variables
Note P(H_{t+1}|H_1,...,H_t) = P(H_{t+1}|H_t)
That is, given state H_t, H_{t+1} is independent of the earlier states.
The joint distribution is completely specified by:
P(H_1), the initial state probabilities
P(H_{t+1}|H_t), the transition probabilities
P(O_t|H_t), the emission probabilities
Left-to-right topology of an HMM:
each node represents a value of the state var H_t
represents the distribution of acoustic sequences associated with a unit of speech (e.g. phoneme, word)
For speech recognition, many levels of abstraction are brought in
Estimating probabilities: Use EM
To make predictions given input observations O_1,...,O_n: use dynamic programming. The subproblem is:
m[i,L] = probability of the most likely path that starts in state i at position L and produces output O_L,...,O_n
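A small Python sketch of this dynamic program (my own code; the two-state HMM and its probabilities are invented for illustration). It fills m[i][L] right to left and returns the probability of the most likely state sequence; recovering the path itself would additionally require storing back-pointers.

def viterbi_value(obs, initial, transition, emission):
    n = len(obs)
    states = list(initial)
    # base case: the last position only emits O_n
    m = {i: [0.0] * n for i in states}
    for i in states:
        m[i][n - 1] = emission[i][obs[n - 1]]
    # fill the table right to left
    for L in range(n - 2, -1, -1):
        for i in states:
            m[i][L] = emission[i][obs[L]] * max(
                transition[i][j] * m[j][L + 1] for j in states)
    # probability of the most likely full state sequence
    return max(initial[i] * m[i][0] for i in states)

# toy usage with invented probabilities
initial    = {"s1": 0.6, "s2": 0.4}                      # P(H_1)
transition = {"s1": {"s1": 0.7, "s2": 0.3},              # P(H_{t+1} | H_t)
              "s2": {"s1": 0.4, "s2": 0.6}}
emission   = {"s1": {"a": 0.9, "b": 0.1},                # P(O_t | H_t)
              "s2": {"a": 0.2, "b": 0.8}}
print(viterbi_value(["a", "b", "a"], initial, transition, emission))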