Want to take part in these discussions? Sign in if you have an account, or apply for one below
Vanilla 1.1.10 is a product of Lussumo. More Information: Documentation, Community Support.
a stub (though I did try my hand on a brief idea-section), for the moment mostly to provide a home for
Having worked with a machine learning group, I posted some ideas at the n-Café years ago. Let me extract some of that and see if anything is worth retaining:
Now, one of the most important developments in machine learning over the past decade has been the use of kernel methods. For example, in the support vector machine (SVM) approach to classification, the data space is mapped into a feature space, a reproducing kernel Hilbert space. A linear classifier is then chosen in this feature space which does the best job at separating points with different labels. This classifier corresponds to a nonlinear decision boundary in the original space. The ’Bayesian’ analogue employs Gaussian processes (GP).
One of the most important developments in machine learning over the past 15 years has been the introduction of kernel methods. A kernel on a set $X$ is a function $K: X \times X \to \mathbb{R}$, which is symmetric and positive definite, in the sense that for any $N \geq 1$ and any $x_1,..., x_N \in X$, the matrix $K_{i j} = K(x_i, x_j)$ is positive definite, i.e., $\sum_{i, j} c_i c_j K_{i j} \geq 0$ for all $c_1, ..., c_N \in \mathbb{R}$. (Complex-valued kernels are possible.)
Another way of looking at this situation is to reformulate it as a mapping $\phi : X \to H$, where $H$ is a reproducing kernel Hilbert space, a function space in which pointwise evaluation is a continuous linear functional.
The ’kernel’ and ’feature map to Hilbert space’ stories relate to each other as follows:
$K(x, .) = \phi (x)$.
The reproducing aspect of $H$ means that
$\langle f(.), K(x, .) \rangle = f(x)$,
and this is continuous. So we have
$K(x, y) = \langle \phi (x), \phi(y) \rangle$.
$H$ is the span of the set of functions $\{K(x, .)| x \in X\}$, and, under certain conditions, when we find the $f \in H$ for which a functional is minimised, it takes the form
$f(x) = \sum_{i = 1}^m c_i K(x_i, x)$.
Many linear algebra techniques in $H$ just involve the inner product, and so can be conducted as a form of nonlinear algebra back in $X$, using the kernel trick.
Consider binary classification where we look to form a classifier which accurately finds the labels in $\{\pm 1\}$ of a sample from a distribution over $X \times \{\pm 1\}$ on the basis of their $X$ values, given a set of labelled training data. The support vector machine approach to this task looks to find the hyperplane in $H$ which best separates the images of data points $\phi (x_i)$, so that those with the same label $(y_i)$ fall in the same half space. The control for overfitting comes from finding such a hyperplane that classifies the training data accurately with the largest perpendicular distance to the nearest of them (the support vectors). Note that there’s a ’Bayesian’ version of kernel methods which uses Gaussian processes.
The class of kernels on $X$ is closed under addition, multiplication by a positive scalar, multiplication, and pointwise limits. What else do we know about the structure on this class?
Kernels characterise the similarity of the input set. But how do they get chosen? Often a Gaussian radial-basis function (RBF) kernel is selected:
$K(x, y) = e^{-\|x - y\|^2 / \sigma^2}$.
But this can lead to an odd notion of similarity. Imagine handwritten digits, pixellated 30 $\times$ 30 so that their grey scale values form a vector in $\mathbb{R}^{900}$. The image of a ’1’ shifted a couple of pixels to the right will appear very different in this representation. The classifying algorithm is not fed any information to let it know that it ought to classify it as the same digit. If it is trained on sufficient data this might well not matter since there will be a training example ’close enough’ to allow for correct classification.
One way around the problem is to train a support vector machine with the given data, find the support vectors, then use translations of them as new ’virtual’ data. This does have the effect of producing a more accurate classifier, but is seen by some as ad hoc. Ideally, the expected invariances would be encapsulated in the kernel.
But why not try to learn the kernel? This could be done by defining a finite parametric family of kernels and optimising for some associated quantity (or forming some Bayesian posterior). But a more interesting suggestion is to imitate the non-parametric nature of kernel methods at the level of kernel selection itself. More on this next time.
We’ve had a slight whiff of the idea that groupoids and groupoidication might have something to do with the kernel methods we discussed last time. It would surely help if we understood better what a kernel is. First off, why is it called a ’kernel’?
This must relate to the kernels appearing in integral transforms,
$(T g)(u) = \int_T K(t, u) g(t) dt,$where $g$, a function on the $T$ domain, is mapped to $T g$, a function on the $U$ domain. Examples include those used in the Fourier and Laplace transforms. Do these kernels get viewed as kinds of matrix?
In our case these two domains are the same and $K$ is symmetric. Our trained classifier has the form:
$f(x) = \sum_{i = 1}^m c_i K(x_i, x)$So the $g(t)$ is $\sum_{i = 1}^m c_i \delta (x_i, x)$. The kernel has allowed the transform of a weighted set of points in $X$, or, more precisely, a weighted sum of delta functions at those points, to a function on $X$.
Perhaps we can get a little closer to the beginning of John’s ’Tale of groupoidification’ by considering the ’bag of words’ kernel. This applies to a set of documents. For each member of a set of words we count the number of its occurrences in the document. The map we called $\phi$ in the last post sends a document to a vector whose entries are the counts of the occurrences of some list of words. Then the bag of words kernel forms the dot product between these vectors. So
$K(x, y) = \sum_i (#occurrences of i th word in x) \cdot (#occurrences of i th word in y).$This has the look of a span of sets $S: X \rightarrow X$ (not sure how to do that broken arrow), where $S$ is the set of pairs of word tokens of the same type and their positions in a document. E.g., ([computer, word #56, document 42], [computer, word #91, document 65]) witnesses the presence of ’computer’ in documents 42 and 65. The size of the subset of $S$ above this pair of documents is the (42, 65) entry of the kernel matrix.
You choose your set of words to be informative, e.g., avoid ’and’ and ’the’. Note that this kernel literally treats your documents as bags of words, or better multisets. It doesn’t care about the word order. Classifiers can work effectively even after information is thrown away, just as in the case of the digit classifier we spoke about which would work as effectively if all the 30 $\times$ 30 images were scrambled by the same permutation of its 900 pixels.
It would be fun if we could find a machine learning kernel which involved spans of groupoids rather than spans of sets.
The use of kernel methods in machine learning often goes by the name nonparametric statistics. Sometimes this gets taken as though it’s the opposite of finite parametric statistics, but that’s not quite right. A typical piece of parametric statistics has a model like the two-parameter family of normal distributions, and the parameters of this model are then estimated from a sample.
What happens in the case of kernel methods in machine learning is that a model from a possibly infinite-dimensional family is specified by a function on the (finite) collection of training points. In other words you are necessarily restricted to a subfamily of models of dimension the size of the training set. So, in contrast to the usual parametric case, you are dealing with a data-dependent finite dimensional subfamily of models.
Now we can see why regularisation is important. Fitting a normal distribution is choosing a point in a two-dimensional manifold within the space of all distributions on $\mathbb{R}$. The choice is made of the member of the family whose first and second moments match those of the empirical distribution. Now in the nonparametric case, before we know the training data, we are dealing with an infinite-dimensional space dense within the space of all distributions. Where in the Normal distribution case we have two constraints to match the two dimensions of the family of models, now we are able to satisfy as many constraints as we like, and must end by matching the empirical distribution if we opt simply for the maximum likelihood model.
Instead of this so-called maximum likelihood estimation, what we can do instead is impose a bias so that we end up in our family by favouring ’smaller’ functions on the training data. An intuitive way to think of this is by considering the Gaussian radial basis function as kernel on a Euclidean space $X$:
$K(x, y) = (4 \pi t)^{-n/2}exp\left(-\frac{\|x - y\|^2}{4t}\right).$This is known as a ’heat kernel’ in Euclidean space. Gaussian process classification with this kernel amounts to selecting negative and positive ’temperatures’ on the training points, setting the temperature of the rest of $X$ to zero, and allowing heat to flow for a designated time. After this time, points in $X$ will be classified according to the sign of their temperature, the magnitude of the temperature determing the certainty of the classification.
Now if we’re allowed to place any temperature values at the training points, we can select very hot ones for those labelled ’$+1$’ and very cold ones for those labelled ’$-1$’, so that they’re still hot or cold after heat flow has taken place. This is a classic case of overfitting, and we should have little confidence that our estimates on test points will be accurate. If on the other hand we manage to select modest temperatures in such a way that after the heat flow has taken place the classifier is accurate on training data, we should have greater confidence.
added pointer to:
added references (here) on kernel methods applicable to persistence diagrams/barcodes for making topological data analysis amenable to “topological” machine learning:
1 to 5 of 5