CSE 515T Project notes
The main goal of the project is to give you hands-on experience applying Bayesian methods to a real-world dataset. Working with real-world data has many interesting (and potentially frustrating!) aspects that are difficult to convey without getting your hands dirty. The scope of the project is intended to be more than a homework problem, but less than a full-fledged research paper. We will read your proposals and suggest tweaks to bring them into the intended scope, if necessary.
Note that I don't necessarily expect your idea to "work." Part of research is trying out ideas, regardless of whether they're ultimately successful. If your idea does "work," I expect you to think about why it was successful. If not, I expect you to think about why not!
Below are some potential project ideas, but please absolutely feel free to come up with your own!
- This dataset from the UCI repository is quite interesting. The task is to predict the depth in the body (effectively, the depth along the spine) given the properties of a two-dimensional "slice" of the body. The hard part about this problem is that it's actually the output causing the input rather than the other way around. I have not had luck designing a good regression method for this data. Can you?
- Find a Bayesian interpretation of elastic net regularization, and compare this method for regression against "standard" Bayesian regression (with a Gaussian prior) on a dataset of your choosing.
- Probabilistic PCA is a Bayesian interpretation of the classical PCA algorithm for dimensionality reduction. Implement PPCA and compare its performance with other methods (such as "standard" PCA) on a dataset of your choosing.
- Bayesian optimization is quite popular here, and could be the basis of many projects!
- The squared exponential covariance is widely used for Gaussian process regression. It is probably used in 90+% of all GP publications. That said, it is widely believed to be "too smooth" for many real-world regression tasks. Compare the squared exponential covariance with the Matérn covariance on several datasets via Bayesian model selection. How often is the squared exponential the right choice?
- Latent Dirichlet allocation (LDA) is a Bayesian method for creating "topic models" of text documents. There are plenty of interesting text datasets available (some are listed below; DBpedia could be a good resource!). One idea would be to compare the behavior of LDA with other techniques, such as latent semantic analysis.
- This competition could have been won by a Bayesian!
- Kaggle competitions can be a great source of data in general.
- This website has a fantastic compilation of 100 interesting, relevant datasets from all sorts of application areas.
- The creators of libSVM have also compiled a great list of datasets, all in a standardized format. The libSVM codebase also includes libsvmread for reading these in MATLAB.
- The UCI Machine Learning Repository is a mainstay in machine-learning research. There is a wide range of datasets there from many different application areas and with many different properties (large, small, high-dimensional, low-dimensional, classification, regression, etc.). Note that many of these datasets are also included in the libSVM collection above, which may be more convenient due to the common format used.
- DBpedia is an amazing resource that automatically extracts structured data from Wikipedia. They have all sorts of data available for download in convenient formats. I have written a little tool to extract labeled graphs from DBpedia, but there is so much more you could do.
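
For the covariance-comparison idea above (squared exponential versus Matérn), a rough starting point might look like the sketch below. It computes the GP log marginal likelihood of the same data under each covariance, using only NumPy. Several things here are assumptions rather than part of the project description: the function names, the choice of the Matérn covariance with smoothness 5/2, the synthetic sine data, and above all the fixed hyperparameters. A proper Bayesian model comparison would also learn or integrate over the length scale, signal variance, and noise variance for each model.

```python
import numpy as np

def sq_dists(X):
    """Pairwise squared Euclidean distances between the rows of X."""
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.maximum(d2, 0.0)  # clip tiny negatives caused by round-off

def se_kernel(X, length_scale=1.0, signal_var=1.0):
    """Squared exponential covariance matrix for the rows of X."""
    return signal_var * np.exp(-0.5 * sq_dists(X) / length_scale**2)

def matern52_kernel(X, length_scale=1.0, signal_var=1.0):
    """Matern covariance matrix with smoothness 5/2 (an assumed choice)."""
    s = np.sqrt(5.0) * np.sqrt(sq_dists(X)) / length_scale
    return signal_var * (1.0 + s + s**2 / 3.0) * np.exp(-s)

def log_marginal_likelihood(K, y, noise_var=0.1):
    """log N(y; 0, K + noise_var * I), computed via a Cholesky factorization."""
    n = y.shape[0]
    L = np.linalg.cholesky(K + noise_var * np.eye(n))
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return (-0.5 * y @ alpha
            - np.sum(np.log(np.diag(L)))   # equals 0.5 * log det(K + noise)
            - 0.5 * n * np.log(2.0 * np.pi))

# Synthetic stand-in data; replace with a real dataset for the project.
rng = np.random.default_rng(0)
X = rng.uniform(-3.0, 3.0, size=(40, 1))
y = np.sin(2.0 * X[:, 0]) + 0.1 * rng.normal(size=40)

# Hyperparameters are held fixed here purely for illustration.
for name, kernel in [("squared exponential", se_kernel),
                     ("Matern 5/2", matern52_kernel)]:
    print(name, log_marginal_likelihood(kernel(X), y, noise_var=0.01))
```

The model with the larger log marginal likelihood is the one the data favor under these particular hyperparameters. Repeating the comparison across several datasets, with the hyperparameters handled properly, is where the real work of the project would lie.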