CS517 - Machine Learning - Spring 2008
This is the TA page for the course. For more information about the course,
please visit the course page.
Teaching Assistant: Yunpeng Xu
Office Hours: Tuesday/Thursday
Location: Jolley 509.
§ Introduction and Background
Suggested
Informal Comments:
§ Instance-Based Learning
k-nearest neighbors
Linear Regression
Suggested
Informal Comments:
K-Nearest Neighbors (KNN) is probably the most well-known machine learning
method. It is easy to understand, analyze, and implement. In addition, it
performs well in many real applications, which is supported by a theoretical
guarantee: in the limit of many training samples, the error rate of the
nearest-neighbor rule is at most twice the Bayes error rate. Personally, KNN and
SVM are the two methods I trust most. Therefore, don't forget to try KNN when
you want to solve a classification problem. At the very least, it will give you
a sense of the problem.
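To make the idea concrete, here is a minimal sketch of a KNN classifier in plain Python. The function name `knn_predict`, the Euclidean distance, the choice `k=3`, and the toy data are all assumptions made for illustration; they are not taken from the course material.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training point.
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Keep the k closest neighbors (sorted by distance only).
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # Majority vote over the labels of those neighbors.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical toy data: two small clusters in the plane.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (4.0, 4.2), (4.1, 3.9), (3.8, 4.0)]
labels = ["a", "a", "a", "b", "b", "b"]

print(knn_predict(points, labels, (1.0, 0.9)))
print(knn_predict(points, labels, (4.0, 4.0)))
```

In practice you would probably also scale the features and pick `k` by cross-validation, since both can change the result noticeably.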
§ Decision Tree
Informal Comments:
People like decision trees mainly because they can easily interpret the role of
each feature in the classification. This is important in many real applications
such as medical diagnosis and bioinformatics. However, the method frequently
suffers from one serious problem, overfitting, especially when no pruning is
performed. This means that the trained decision tree performs very well on the
training samples but poorly on new samples. To make it more robust, people tend
to combine decision trees with ensemble learning methods. Usually, an ensemble
of decision trees yields much better performance than a single tree. Professor
Zhang will discuss ensemble learning later in class.
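As a rough illustration of the ensemble idea, the sketch below uses bagging (bootstrap aggregation): several models are trained on resampled copies of the training set and their predictions are combined by majority vote. To keep it short, the base learner is a one-split tree (a decision stump) standing in for a full decision tree. The function names, the number of models, the seed, and the toy data are all hypothetical, and this is not necessarily the ensemble method that will be covered in class.

```python
import random
from collections import Counter

def train_stump(samples, labels):
    """Fit a one-split tree: choose the (feature, threshold) with the fewest training errors."""
    best = None
    for feature in range(len(samples[0])):
        for threshold in sorted({s[feature] for s in samples}):
            left = [y for s, y in zip(samples, labels) if s[feature] <= threshold]
            right = [y for s, y in zip(samples, labels) if s[feature] > threshold]
            # Each side predicts its majority label; an empty right side reuses the left label.
            left_label = Counter(left).most_common(1)[0][0]
            right_label = Counter(right).most_common(1)[0][0] if right else left_label
            errors = sum(y != left_label for y in left) + sum(y != right_label for y in right)
            if best is None or errors < best[0]:
                best = (errors, feature, threshold, left_label, right_label)
    return best[1:]

def stump_predict(stump, sample):
    feature, threshold, left_label, right_label = stump
    return left_label if sample[feature] <= threshold else right_label

def train_bagged_stumps(samples, labels, n_models=25, seed=0):
    """Train each stump on a bootstrap sample (drawn with replacement) of the training set."""
    rng = random.Random(seed)
    n = len(samples)
    models = []
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]
        models.append(train_stump([samples[i] for i in idx], [labels[i] for i in idx]))
    return models

def bagged_predict(models, sample):
    """Combine the individual predictions by majority vote."""
    votes = Counter(stump_predict(m, sample) for m in models)
    return votes.most_common(1)[0][0]

# Hypothetical toy data: two well-separated classes.
samples = [(1.0, 5.0), (2.0, 4.0), (1.5, 6.0), (6.0, 1.0), (7.0, 2.0), (6.5, 0.5)]
labels = [0, 0, 0, 1, 1, 1]
models = train_bagged_stumps(samples, labels)

print(bagged_predict(models, (0.5, 6.0)))
print(bagged_predict(models, (8.0, 0.0)))
```

The vote averages out the quirks of any single model, which is the intuition behind the robustness gain; with full, unpruned trees as base learners the effect is usually more visible than with stumps.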
Data Sets for Your Projects
Here are some of the most frequently used public benchmarks in the machine
learning field. You can download and use them in your course projects. You are
also encouraged to use any other data sets, but please discuss them with me
before you start your project.
1. UCI Machine Learning Repository
This repository contains 160 data sets from different fields and problems, such as iris, adult, and breast cancer. The data can be used for classification, clustering, regression, inference, etc. For more information about the data sets, please visit http://archive.ics.uci.edu/ml/.
2. AT&T Face Recognition Data
This data was collected by AT&T as a benchmark for face recognition. It includes images of 40 different people, with 10 images per person. It is desirable to align the images before using them; however, a simple PCA would also yield good results.
3. USPS Handwriting Recognition Data
This data was collected as a benchmark for handwritten postal address recognition. Each sample is a scanned image of a handwritten digit from 0 to 9. The goal is to find the correct category of the digit. As with the face data, you can apply PCA as a preprocessing step before you use it.
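For readers unfamiliar with PCA as a preprocessing step, here is a minimal sketch that finds only the first principal component, using power iteration on the sample covariance matrix, and projects a sample onto it. The function names and the toy data are hypothetical, and the sketch assumes that the covariance matrix is non-zero and that the all-ones start vector is not orthogonal to the leading direction. For real image data you would normally keep several components and use a numerical library instead.

```python
import math

def first_principal_component(data, n_iter=200):
    """Return (mean, direction) of the top principal component via power iteration."""
    n, d = len(data), len(data[0])
    mean = [sum(row[j] for row in data) / n for j in range(d)]
    centered = [[row[j] - mean[j] for j in range(d)] for row in data]
    # Sample covariance matrix (d x d).
    cov = [
        [sum(r[a] * r[b] for r in centered) / (n - 1) for b in range(d)]
        for a in range(d)
    ]
    # Repeatedly multiply by the covariance matrix and renormalize.
    v = [1.0] * d
    for _ in range(n_iter):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return mean, v

def project(sample, mean, direction):
    """Coordinate of a sample along the principal direction, after centering."""
    return sum((x - m) * u for x, m, u in zip(sample, mean, direction))

# Hypothetical toy data lying exactly on the line y = 2x.
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
mean, direction = first_principal_component(data)

print(direction)
print(project((1.0, 2.0), mean, direction))
```

The projected coordinates can then be fed to a classifier in place of the raw pixels, which reduces the dimensionality the classifier has to deal with.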
I will upload more interesting data sets as they become available.