CS517 - Machine Learning - Spring 2008

 

This is the TA page for the course. For more information about the course, please visit the course page.

 

Teaching Assistant: Yunpeng Xu

Office Hours: Tuesday/Thursday 7:00pm - 8:00pm

Location: Jolley 509.

 

 

§ Introduction and Background

Suggested Reading:

 

Informal Comments:

 

 

§ Instance Based Learning

k-nearest neighbors

 

Linear Regression

 

Suggested Reading:

 

Informal Comments:

K-nearest neighbors (KNN) is probably the best-known machine learning method. It is easy to understand, analyze, and implement. It also performs well in many real applications, and it comes with a theoretical guarantee: given unlimited training data, the error rate of the nearest-neighbor rule is at most twice the Bayes error rate. Personally, KNN and SVM are the two methods I trust most. So don't forget to try KNN when you have a classification problem to solve. At the very least, it will give you a sense of the problem.
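
To make the method concrete, here is a minimal sketch of a KNN classifier using only the Python standard library. It assumes Euclidean distance and a plain majority vote; the function name `knn_predict`, the default `k=3`, and the toy data are made up for illustration and are not part of the course material.

```python
from collections import Counter
import math


def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training point.
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Keep the k closest neighbors (sorting by distance only).
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # Majority vote over the labels of those neighbors.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Toy data (invented for this example): two well-separated clusters.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
labels = ["a", "a", "a", "b", "b", "b"]

print(knn_predict(points, labels, (1.1, 1.0)))  # query near the first cluster
print(knn_predict(points, labels, (5.1, 5.0)))  # query near the second cluster
```

On real data you would normally scale the features first and choose k by validation; both steps are left out here to keep the sketch short.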

 

§ Decision Tree

 

Informal Comments:

People like decision trees mainly because it is easy to interpret the role each feature plays in the classification. This matters in many real applications, such as medical diagnosis and bioinformatics.

 

However, decision trees frequently suffer from one serious problem, overfitting, especially when no pruning is performed. An overfitted tree performs very well on the training samples but poorly on new samples. To make trees more robust, people often combine them with ensemble learning methods. An ensemble of decision trees usually performs much better than a single tree. Professor Zhang will discuss ensemble learning later in class.
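
The overfitting effect can be seen on a tiny example. The sketch below is only an illustration under several assumptions: a single numeric feature, splits chosen by misclassification count, and a depth limit standing in for pruning (it is not a real pruning algorithm). The data are invented: the true rule is "label 1 if x > 5", and two training labels (at x = 3 and x = 8) are flipped to simulate noise.

```python
from collections import Counter


def majority(labels):
    """Most common label in a list."""
    return Counter(labels).most_common(1)[0][0]


def errors(labels):
    """Number of samples a majority-vote leaf would misclassify."""
    return len(labels) - max(Counter(labels).values())


def build_tree(samples, max_depth=None, depth=0):
    """Grow a tree on (x, y) pairs; a leaf is a label, a node is (threshold, left, right)."""
    labels = [y for _, y in samples]
    if len(set(labels)) == 1 or (max_depth is not None and depth >= max_depth):
        return majority(labels)
    xs = sorted({x for x, _ in samples})
    best = None
    # Try a threshold halfway between each pair of adjacent feature values.
    for lo, hi in zip(xs, xs[1:]):
        threshold = (lo + hi) / 2
        left = [(x, y) for x, y in samples if x <= threshold]
        right = [(x, y) for x, y in samples if x > threshold]
        cost = errors([y for _, y in left]) + errors([y for _, y in right])
        if best is None or cost < best[0]:
            best = (cost, threshold, left, right)
    if best is None:  # all x values identical: cannot split further
        return majority(labels)
    _, threshold, left, right = best
    return (threshold,
            build_tree(left, max_depth, depth + 1),
            build_tree(right, max_depth, depth + 1))


def predict(tree, x):
    """Walk from the root to a leaf and return its label."""
    while isinstance(tree, tuple):
        threshold, left, right = tree
        tree = left if x <= threshold else right
    return tree


def accuracy(tree, samples):
    return sum(predict(tree, x) == y for x, y in samples) / len(samples)


# Invented data: true rule is y = 1 when x > 5; labels at x=3 and x=8 are noisy.
train = [(1, 0), (2, 0), (3, 1), (4, 0), (5, 0), (6, 1), (7, 1), (8, 0), (9, 1), (10, 1)]
test = [(1.2, 0), (2.2, 0), (3.2, 0), (4.2, 0), (6.2, 1), (7.2, 1), (8.2, 1), (9.2, 1)]

full_tree = build_tree(train)             # grown until every leaf is pure
stump = build_tree(train, max_depth=1)    # a single split

print(accuracy(full_tree, train), accuracy(full_tree, test))  # 1.0 0.75
print(accuracy(stump, train), accuracy(stump, test))          # 0.8 1.0
```

The fully grown tree memorizes the two noisy labels, so it is perfect on the training set but wrong on the nearby test points. The single-split tree ignores the noise and generalizes better on this toy data.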

 

 

 

Data Sets for Your Projects

 

Here are some of the most frequently used public benchmarks in machine learning. You can download and use them in your course projects. You are also encouraged to use other data sets, but please discuss them with me before you start your project.

 

1. UCI Machine Learning Repository

This repository contains 160 data sets from different fields and problems, such as iris, adult, and breast cancer. The data can be used for classification, clustering, regression, inference, etc. For more information about the data sets, please visit http://archive.ics.uci.edu/ml/.

 

2. ORL Face Recognition Data

This data set was collected by AT&T as a benchmark for face recognition. It contains images of 40 different people, with 10 images per person. Ideally, the images should be aligned before use. However, a simple PCA also yields good results.

 

3. USPS Handwriting Recognition Data

This data set was collected as a benchmark for recognizing handwritten postal addresses. Each sample is a scanned image of a handwritten digit from 0 to 9. The goal is to assign each image to the correct digit class. As with the face data, you can apply PCA as a preprocessing step before using it.
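
For the PCA preprocessing mentioned for the face and digit data, here is a rough sketch of the idea, assuming each image has already been flattened into a list of numbers. To stay within the standard library it finds only the first principal component, using power iteration on the covariance matrix. In practice you would use a numerical library and keep several components. The function names and the 2-D toy data are made up for illustration.

```python
import math


def first_principal_component(rows, iterations=200):
    """Return (mean, unit direction) of the top principal component via power iteration."""
    n, d = len(rows), len(rows[0])
    # Center the data by subtracting the per-feature mean.
    mean = [sum(row[j] for row in rows) / n for j in range(d)]
    centered = [[row[j] - mean[j] for j in range(d)] for row in rows]
    # Sample covariance matrix (d x d).
    cov = [[sum(r[a] * r[b] for r in centered) / (n - 1) for b in range(d)]
           for a in range(d)]
    # Power iteration: repeatedly multiply by cov and renormalize.
    # Assumes the all-ones start vector is not orthogonal to the top component.
    v = [1.0] * d
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return mean, v


def project(rows, mean, direction):
    """Coordinate of each row along `direction` after centering."""
    return [sum((row[j] - mean[j]) * direction[j] for j in range(len(mean)))
            for row in rows]


# Toy 2-D data (invented): points lying close to the line y = x.
data = [(2.0, 1.9), (4.0, 4.1), (6.0, 5.9), (8.0, 8.1)]
mean, direction = first_principal_component(data)
scores = project(data, mean, direction)

print(direction)  # roughly the diagonal direction, both entries near 0.7
print(scores)     # one number per sample: the reduced representation
```

The projected scores (or the first few such coordinates, if more components are kept) can then be fed to a classifier such as KNN in place of the raw pixels.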

 

I will upload more interesting data sets as they become available.