Richard Bergmair's Media Library

ML Lecture #4: Decision Trees & Issues in Representation

Decision trees are a highly generic concept class which can be used to fit a concept to almost any kind of data that might be thrown at a machine learner.

But this representational power comes at a price. Selecting a concept from a higher-dimensional concept class, i.e. from a more generic, more representationally powerful concept class, requires more information. And the output, even though it may fit the data well, may neither describe the data in a meaningful way nor generalize particularly well.

This video lecture introduces decision trees in detail, and discusses some of the tradeoffs around using a highly generic concept class like a decision tree.

We’ve mentioned on several occasions that, at PANOPTICOM, our solution to media monitoring is based on solving the machine learning problem which we believe to be at the core of any media monitoring application.

The machine learning approach to media monitoring actually taken at PANOPTICOM is more advanced than decision trees, but it’s nevertheless instructive to think about how decision trees might be used to approach the problem of media monitoring.

Decision trees work by classifying each data point according to a series of yes-no questions that can be asked about it. For example, given a media article, the kinds of questions that might be asked are: Is this a blog article? Was it written by a well-known blogger? Does it contain the keyword “regulatory”?
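To make the idea of a series of yes-no questions concrete, here is a minimal sketch in Python. The tree is written by hand, and the field names `is_blog`, `well_known_author` and `text`, as well as the particular arrangement of questions, are assumptions made up for illustration; they do not describe the system used at PANOPTICOM.

```python
from dataclasses import dataclass
from typing import Callable, Union


@dataclass
class Leaf:
    relevant: bool  # the decision reached at the end of a path


@dataclass
class Node:
    question: Callable[[dict], bool]  # a yes-no question about an article
    yes: Union["Node", Leaf]          # subtree followed when the answer is yes
    no: Union["Node", Leaf]           # subtree followed when the answer is no


def classify(tree, article):
    """Walk from the root to a leaf, answering one question per node."""
    while isinstance(tree, Node):
        tree = tree.yes if tree.question(article) else tree.no
    return tree.relevant


# Hypothetical hand-built tree using the three example questions.
tree = Node(
    question=lambda a: a["is_blog"],
    yes=Node(
        question=lambda a: a["well_known_author"],
        yes=Leaf(True),
        no=Node(
            question=lambda a: "regulatory" in a["text"].lower(),
            yes=Leaf(True),
            no=Leaf(False),
        ),
    ),
    no=Leaf(False),
)

article = {"is_blog": True, "well_known_author": False,
           "text": "New regulatory rules announced"}
print(classify(tree, article))
```

Each article follows exactly one path from the root to a leaf, so the answer to one question determines which question is asked next.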

Asking a series of questions like that is useful because the computer can observe how the answers affect the probability that the item will be relevant. For example, the computer may have observed that blog articles have a high probability of being relevant only if they come from a well-known blogger, and that blog articles from a well-known blogger have a higher probability of being relevant than tweets on Twitter. It might furthermore have observed that blog articles containing the word “regulatory” have a higher probability of being relevant to a specific client than the ones that don’t, and so on.
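One simple way to picture how such observations might be made is to count, in a labelled sample, how often items turn out to be relevant depending on the answer to a question. The sketch below does just that. The sample data, the field names and the function `relevance_rate` are all invented for illustration.

```python
def relevance_rate(sample, question):
    """Return the observed fraction of relevant items among those where the
    answer is yes, and among those where the answer is no.

    `sample` is a list of (item, is_relevant) pairs. A fraction is None
    when no item in the sample gives that answer.
    """
    yes = [label for item, label in sample if question(item)]
    no = [label for item, label in sample if not question(item)]

    def rate(labels):
        return sum(labels) / len(labels) if labels else None

    return rate(yes), rate(no)


# Hypothetical labelled sample: (article features, was it relevant?)
sample = [
    ({"is_blog": True, "well_known_author": True}, True),
    ({"is_blog": True, "well_known_author": True}, True),
    ({"is_blog": True, "well_known_author": False}, False),
    ({"is_blog": True, "well_known_author": False}, False),
    ({"is_blog": False, "well_known_author": False}, True),
    ({"is_blog": False, "well_known_author": False}, False),
    ({"is_blog": False, "well_known_author": False}, False),
    ({"is_blog": False, "well_known_author": False}, False),
]

print(relevance_rate(sample, lambda a: a["is_blog"]))  # (0.5, 0.25)

# Restricting attention to blog articles shows the effect of a second question.
blogs = [(a, r) for a, r in sample if a["is_blog"]]
print(relevance_rate(blogs, lambda a: a["well_known_author"]))  # (1.0, 0.0)
```

In this made-up sample, knowing that an item is a blog article shifts the observed relevance rate only moderately, while additionally knowing that it comes from a well-known blogger shifts it a great deal.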

In this video lecture, we also show how a decision tree consisting of useful questions to ask about data points can be constructed from a sample of data.
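As a rough illustration of what constructing a tree from a sample can look like, the sketch below uses a common greedy strategy: at each step, pick the question whose answers reduce the entropy of the relevance labels the most (the information gain), then repeat on each half of the sample. This is an assumption on our part about a typical approach, not necessarily the exact procedure presented in the video, and the sample data and feature names are invented.

```python
import math


def entropy(labels):
    """Binary entropy (in bits) of a list of True/False labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def split(sample, feature):
    """Partition the sample by the answer to one yes-no question."""
    yes = [(item, label) for item, label in sample if item[feature]]
    no = [(item, label) for item, label in sample if not item[feature]]
    return yes, no


def information_gain(sample, feature):
    """Reduction in label entropy achieved by asking about `feature`."""
    yes, no = split(sample, feature)
    labels = [label for _, label in sample]
    remainder = sum(
        len(part) / len(sample) * entropy([label for _, label in part])
        for part in (yes, no)
    )
    return entropy(labels) - remainder


def build_tree(sample, features):
    """Greedily build a tree; leaves are True/False majority decisions."""
    labels = [label for _, label in sample]
    majority = 2 * sum(labels) >= len(labels)
    if not features or entropy(labels) == 0.0:
        return majority
    best = max(features, key=lambda f: information_gain(sample, f))
    yes, no = split(sample, best)
    if not yes or not no:  # the question does not separate the sample
        return majority
    rest = [f for f in features if f != best]
    return {"ask": best,
            "yes": build_tree(yes, rest),
            "no": build_tree(no, rest)}


# Hypothetical labelled sample: (article features, was it relevant?)
sample = [
    ({"is_blog": True, "well_known_author": True}, True),
    ({"is_blog": True, "well_known_author": True}, True),
    ({"is_blog": True, "well_known_author": False}, False),
    ({"is_blog": True, "well_known_author": False}, False),
    ({"is_blog": False, "well_known_author": False}, False),
    ({"is_blog": False, "well_known_author": False}, False),
]

print(build_tree(sample, ["is_blog", "well_known_author"]))
```

On this tiny sample the question about the author separates relevant from irrelevant items perfectly, so it is chosen as the root and no further questions are needed. On realistic data the tree would grow deeper, which is where the tradeoffs between fit and generalization discussed above come into play.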