Possible Thesis Topics with Oliver Schulte (February 2012)
These topics are in the area of statistical-relational learning, basically machine learning for relational databases. They are presented in descending order of my interest (roughly), so the one I'm keenest on is first. But I'm interested in all these topics and possibly others too, like relating learning to decision-making, planning, and game-playing.
- Multiple Link Analysis. I'm planning to work on problems that involve multiple types of links. One example is social network data, where users can engage in different types of activities, like posting on a wall, sending messages, or writing testimonials. One of the problems I have already worked on is link strength prediction: trying to predict the strength of a friendship given observations of the communication activities between the two friends. Another example would be combining information from searches on YouTube with information from web searches, for instance to improve the search engine's guesses as to what may be important for you. I'm looking at Bayes nets with and without hidden features for this problem. A Bayes net with hidden features is essentially equivalent to a multiple matrix factorization problem.
- Inference with Bayes nets for relational data. I am working on a new approach to the difficult problem of how to do inference with Bayes nets when there are cyclic dependencies, as often happens with relational data. For example, suppose that the smoking of Jane predicts the smoking of Jack, which predicts the smoking of Cecile, which predicts the smoking of Jane, where Jack, Jane, and Cecile are all friends with each other. This is a great topic for a Ph.D. thesis, but requires mature math skills, specifically the ability to pursue mathematical conjectures and prove theorems.
- Link-based classification by combining probabilistic predictions from standard classifiers. This is a new way to upgrade standard classifier models for relational data. For instance, I would like to adapt decision trees and relevance vector machines for link-based classification. Relevance vector machines are a probabilistic version of support vector machines.
- Graphical Models for OLAP. On-line Analytical Processing (OLAP) is a mainstream tool for analyzing complex, highly structured data, and is widely used in the database industry. Hierarchies are an important part of the structure, like sales in Hamburg, which are part of sales in Germany, which are part of sales in Europe. I'm interested in developing and learning Bayes nets that can compactly represent statistical patterns at different levels of a hierarchy.
- Combining Bayes nets with recommendation systems. Nonnegative matrix factorization models are among the state-of-the-art methods for recommendation systems. They can be naturally represented as graphical models with latent variables. Typically the main focus is on building a latent variable analysis of the link/rating matrix. The methods we have developed so far deal with observed features, such as the gender, age, and profession of users. The idea is to combine our methods with latent variable analysis to obtain a model of the correlations between observed and latent features.
- Bayes nets for ontologies, the semantic web, and description logic. Ontological hierarchies are essential, widely used structures in knowledge representation. Adding hierarchical information to web pages is a key part of the semantic web. The formal foundation for this is typically description logic. I would like to extend Bayes nets for relational structures to Bayes nets with ontologies, in the spirit of Koller and Pfeffer's P-Classic system. A very nice system for representing ontologies is Protégé from Stanford and Manchester Universities. It has a plug-in for adding Bayes nets; a neat project would be to extend the plug-in so that it learns a Bayes net for a given T-box and A-box.
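
To make the cyclic-dependency problem from the inference topic concrete, here is a minimal sketch in Python. It is only an illustration, not the new approach mentioned in that topic. It encodes the Jane, Jack, and Cecile example as a ground dependency graph and uses a standard depth-first search to show that the graph contains a directed cycle, so it is not a valid Bayes net structure as it stands. The node names (`Smokes(Jane)` etc.) and the function `find_cycle` are assumptions made for this example.

```python
from typing import Dict, List, Optional

# Hypothetical ground dependency graph for the example in the text:
# an edge X -> Y means "X's smoking predicts Y's smoking".
predicts: Dict[str, List[str]] = {
    "Smokes(Jane)": ["Smokes(Jack)"],
    "Smokes(Jack)": ["Smokes(Cecile)"],
    "Smokes(Cecile)": ["Smokes(Jane)"],
}


def find_cycle(graph: Dict[str, List[str]]) -> Optional[List[str]]:
    """Return one directed cycle as a list of nodes, or None if the graph is acyclic."""
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the current path, finished
    colour = {node: WHITE for node in graph}
    for children in graph.values():
        for child in children:
            colour.setdefault(child, WHITE)

    path: List[str] = []

    def visit(node: str) -> Optional[List[str]]:
        colour[node] = GREY
        path.append(node)
        for child in graph.get(node, []):
            if colour[child] == GREY:
                # Back edge: the child is already on the current path.
                return path[path.index(child):] + [child]
            if colour[child] == WHITE:
                found = visit(child)
                if found is not None:
                    return found
        colour[node] = BLACK
        path.pop()
        return None

    for node in colour:
        if colour[node] == WHITE:
            found = visit(node)
            if found is not None:
                return found
    return None


cycle = find_cycle(predicts)
if cycle is not None:
    print("Cyclic dependency:", " -> ".join(cycle))
```

A Bayes net must be a directed acyclic graph, so a check like this only detects the problem; how to do inference once such a cycle arises is the open question of that topic.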