ML Lecture #6: Data Representation & Similarity Measures

Notions of similarity and dissimilarity, as well as closeness and
distance, are at the heart of the kinds of mathematical models that
enable machine learning.

Distance, in a geometric sense, would seem to be a rather rigid
concept. But to a mathematician there are, in fact, surprising degrees
of freedom in the choice of a distance measure, giving different
mathematical properties to the resulting geometric spaces.

This video lecture starts by introducing some alternatives to the
usual Euclidean distance, including Manhattan distance and, more
generally, the whole mathematical family of Minkowski distances, of
which both Euclidean and Manhattan distance are special cases.
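
As a minimal NumPy sketch (the function name is illustrative, not from
the lecture), the Minkowski distance of order p sums the p-th powers of
the absolute coordinate differences and takes the p-th root; p=1
recovers Manhattan distance and p=2 the Euclidean distance:

```python
import numpy as np

def minkowski_distance(x, y, p=2):
    """Minkowski distance of order p between two points.

    p=1 gives the Manhattan distance, p=2 the usual Euclidean distance.
    """
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return float(np.sum(np.abs(x - y) ** p) ** (1.0 / p))

a, b = [0.0, 0.0], [3.0, 4.0]
print(minkowski_distance(a, b, p=1))  # Manhattan: 7.0
print(minkowski_distance(a, b, p=2))  # Euclidean: 5.0
```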

It then introduces the Mahalanobis distance, which takes into account
the covariance structure of a statistical sample and measures the
distance between two points in a coordinate system that has been
appropriately decorrelated.
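
A minimal NumPy sketch, assuming the covariance matrix is estimated
from the sample itself and is invertible (function and variable names
are illustrative):

```python
import numpy as np

def mahalanobis_distance(x, y, sample):
    """Mahalanobis distance between x and y, using the covariance
    structure of a sample (one observation per row)."""
    cov_inv = np.linalg.inv(np.cov(sample, rowvar=False))
    diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    # Equivalent to Euclidean distance after decorrelating the axes.
    return float(np.sqrt(diff @ cov_inv @ diff))

rng = np.random.default_rng(0)
# A correlated two-dimensional sample.
sample = rng.multivariate_normal([0, 0], [[2.0, 1.5], [1.5, 2.0]], size=500)
print(mahalanobis_distance([0, 0], [1, 1], sample))
```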

We then move on to similarity and dissimilarity measures that have
found widespread application in information retrieval and natural
language processing: cosine similarity, the set overlap measures of
Dice and Jaccard, and string edit distances including Hamming distance,
Levenshtein distance, and Jaro-Winkler similarity.
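
As illustrative sketches (the function names and toy documents are
mine, not from the lecture): cosine similarity compares term vectors by
the angle between them, while Dice and Jaccard compare documents as
sets of tokens:

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors, e.g. term-frequency vectors."""
    u, v = np.asarray(u, dtype=float), np.asarray(v, dtype=float)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def jaccard(a, b):
    """Jaccard overlap of two sets: |A intersect B| / |A union B|."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

def dice(a, b):
    """Dice coefficient of two sets: 2|A intersect B| / (|A| + |B|)."""
    a, b = set(a), set(b)
    return 2 * len(a & b) / (len(a) + len(b))

doc1 = "the cat sat on the mat".split()
doc2 = "the cat lay on the rug".split()
print(jaccard(doc1, doc2))  # 3 shared of 7 distinct tokens: ~0.43
print(dice(doc1, doc2))     # 0.6
```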

String edit distances are useful in natural language processing
applications such as PANOPTICOM’s media monitoring infrastructure,
where they help deal with misspellings. For example, if the PANOPTICOM
machine learner has picked up “regulatory” as a keyword signalling
relevance, then, based on the low string edit distance between the
keyword and the token “regularoty”, it can recognize the token as a
misspelling of the keyword.
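
A compact dynamic-programming sketch of the Levenshtein distance (the
function name is mine, not part of PANOPTICOM’s API); on the example
above it returns 2, since “regularoty” differs from “regulatory” by two
substituted characters:

```python
def levenshtein(s, t):
    """Levenshtein distance: the minimum number of single-character
    insertions, deletions, and substitutions needed to turn s into t,
    computed row by row with dynamic programming."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

print(levenshtein("regulatory", "regularoty"))  # 2
```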