Not signed in (Sign In)

Not signed in

Want to take part in these discussions? Sign in if you have an account, or apply for one below

  • Sign in using OpenID

Site Tag Cloud

2-category 2-category-theory abelian-categories adjoint algebra algebraic algebraic-geometry algebraic-topology analysis analytic-geometry arithmetic arithmetic-geometry book bundles calculus categorical categories category category-theory chern-weil-theory cohesion cohesive-homotopy-type-theory cohomology colimits combinatorics complex complex-geometry computable-mathematics computer-science constructive cosmology deformation-theory descent diagrams differential differential-cohomology differential-equations differential-geometry digraphs duality elliptic-cohomology enriched fibration foundation foundations functional-analysis functor gauge-theory gebra geometric-quantization geometry graph graphs gravity grothendieck group group-theory harmonic-analysis higher higher-algebra higher-category-theory higher-differential-geometry higher-geometry higher-lie-theory higher-topos-theory homological homological-algebra homotopy homotopy-theory homotopy-type-theory index-theory integration integration-theory k-theory lie-theory limits linear linear-algebra locale localization logic mathematics measure-theory modal modal-logic model model-category-theory monad monads monoidal monoidal-category-theory morphism motives motivic-cohomology nforum nlab noncommutative noncommutative-geometry number-theory of operads operator operator-algebra order-theory pages pasting philosophy physics pro-object probability probability-theory quantization quantum quantum-field quantum-field-theory quantum-mechanics quantum-physics quantum-theory question representation representation-theory riemannian-geometry scheme schemes set set-theory sheaf sheaves simplicial space spin-geometry stable-homotopy-theory stack string string-theory superalgebra supergeometry svg symplectic-geometry synthetic-differential-geometry terminology theory topology topos topos-theory tqft type type-theory universal variational-calculus

Vanilla 1.1.10 is a product of Lussumo. More Information: Documentation, Community Support.

Welcome to nForum
If you want to take part in these discussions either sign in now (if you have an account), apply for one now (if you don't).
    • CommentRowNumber1.
    • CommentAuthorUrs
    • CommentTimeApr 20th 2021
    • (edited Apr 20th 2021)

    a stub (though I did try my hand on a brief idea-section), for the moment mostly to provide a home for

    v1, current

    • CommentRowNumber2.
    • CommentAuthorDavid_Corfield
    • CommentTimeApr 20th 2021
    • (edited Apr 20th 2021)

    Having worked with a machine learning group, I posted some ideas at the n-Café years ago. Let me extract some of that and see if anything is worth retaining:

    Now, one of the most important developments in machine learning over the past decade has been the use of kernel methods. For example, in the support vector machine (SVM) approach to classification, the data space is mapped into a feature space, a reproducing kernel Hilbert space. A linear classifier is then chosen in this feature space which does the best job at separating points with different labels. This classifier corresponds to a nonlinear decision boundary in the original space. The ’Bayesian’ analogue employs Gaussian processes (GP).

    One of the most important developments in machine learning over the past 15 years has been the introduction of kernel methods. A kernel on a set XX is a function K:X×XK: X \times X \to \mathbb{R}, which is symmetric and positive definite, in the sense that for any N1N \geq 1 and any x 1,...,x NXx_1,..., x_N \in X, the matrix K ij=K(x i,x j)K_{i j} = K(x_i, x_j) is positive definite, i.e., i,jc ic jK ij0\sum_{i, j} c_i c_j K_{i j} \geq 0 for all c 1,...,c Nc_1, ..., c_N \in \mathbb{R}. (Complex-valued kernels are possible.)

    Another way of looking at this situation is to reformulate it as a mapping ϕ:XH\phi : X \to H, where HH is a reproducing kernel Hilbert space, a function space in which pointwise evaluation is a continuous linear functional.

    The ’kernel’ and ’feature map to Hilbert space’ stories relate to each other as follows:

    K(x,.)=ϕ(x)K(x, .) = \phi (x).

    The reproducing aspect of HH means that

    f(.),K(x,.)=f(x)\langle f(.), K(x, .) \rangle = f(x),

    and this is continuous. So we have

    K(x,y)=ϕ(x),ϕ(y)K(x, y) = \langle \phi (x), \phi(y) \rangle.

    HH is the span of the set of functions {K(x,.)|xX}\{K(x, .)| x \in X\}, and, under certain conditions, when we find the fHf \in H for which a functional is minimised, it takes the form

    f(x)= i=1 mc iK(x i,x)f(x) = \sum_{i = 1}^m c_i K(x_i, x).

    Many linear algebra techniques in HH just involve the inner product, and so can be conducted as a form of nonlinear algebra back in XX, using the kernel trick.

    Consider binary classification where we look to form a classifier which accurately finds the labels in {±1}\{\pm 1\} of a sample from a distribution over X×{±1}X \times \{\pm 1\} on the basis of their XX values, given a set of labelled training data. The support vector machine approach to this task looks to find the hyperplane in HH which best separates the images of data points ϕ(x i)\phi (x_i), so that those with the same label (y i)(y_i) fall in the same half space. The control for overfitting comes from finding such a hyperplane that classifies the training data accurately with the largest perpendicular distance to the nearest of them (the support vectors). Note that there’s a ’Bayesian’ version of kernel methods which uses Gaussian processes.

    The class of kernels on XX is closed under addition, multiplication by a positive scalar, multiplication, and pointwise limits. What else do we know about the structure on this class?

    Kernels characterise the similarity of the input set. But how do they get chosen? Often a Gaussian radial-basis function (RBF) kernel is selected:

    K(x,y)=e xy 2/σ 2K(x, y) = e^{-\|x - y\|^2 / \sigma^2}.

    But this can lead to an odd notion of similarity. Imagine handwritten digits, pixellated 30 ×\times 30 so that their grey scale values form a vector in 900\mathbb{R}^{900}. The image of a ’1’ shifted a couple of pixels to the right will appear very different in this representation. The classifying algorithm is not fed any information to let it know that it ought to classify it as the same digit. If it is trained on sufficient data this might well not matter since there will be a training example ’close enough’ to allow for correct classification.

    One way around the problem is to train a support vector machine with the given data, find the support vectors, then use translations of them as new ’virtual’ data. This does have the effect of producing a more accurate classifier, but is seen by some as ad hoc. Ideally, the expected invariances would be encapsulated in the kernel.

    But why not try to learn the kernel? This could be done by defining a finite parametric family of kernels and optimising for some associated quantity (or forming some Bayesian posterior). But a more interesting suggestion is to imitate the non-parametric nature of kernel methods at the level of kernel selection itself. More on this next time.

    We’ve had a slight whiff of the idea that groupoids and groupoidication might have something to do with the kernel methods we discussed last time. It would surely help if we understood better what a kernel is. First off, why is it called a ’kernel’?

    This must relate to the kernels appearing in integral transforms,

    (Tg)(u)= TK(t,u)g(t)dt, (T g)(u) = \int_T K(t, u) g(t) dt,

    where gg, a function on the TT domain, is mapped to TgT g, a function on the UU domain. Examples include those used in the Fourier and Laplace transforms. Do these kernels get viewed as kinds of matrix?

    In our case these two domains are the same and KK is symmetric. Our trained classifier has the form:

    f(x)= i=1 mc iK(x i,x) f(x) = \sum_{i = 1}^m c_i K(x_i, x)

    So the g(t)g(t) is i=1 mc iδ(x i,x)\sum_{i = 1}^m c_i \delta (x_i, x). The kernel has allowed the transform of a weighted set of points in XX, or, more precisely, a weighted sum of delta functions at those points, to a function on XX.

    Perhaps we can get a little closer to the beginning of John’s ’Tale of groupoidification’ by considering the ’bag of words’ kernel. This applies to a set of documents. For each member of a set of words we count the number of its occurrences in the document. The map we called ϕ\phi in the last post sends a document to a vector whose entries are the counts of the occurrences of some list of words. Then the bag of words kernel forms the dot product between these vectors. So

    K(x,y)= i(#occurrencesofithwordinx)(#occurrencesofithwordiny). K(x, y) = \sum_i (#occurrences of i th word in x) \cdot (#occurrences of i th word in y).

    This has the look of a span of sets S:XXS: X \rightarrow X (not sure how to do that broken arrow), where SS is the set of pairs of word tokens of the same type and their positions in a document. E.g., ([computer, word #56, document 42], [computer, word #91, document 65]) witnesses the presence of ’computer’ in documents 42 and 65. The size of the subset of SS above this pair of documents is the (42, 65) entry of the kernel matrix.

    You choose your set of words to be informative, e.g., avoid ’and’ and ’the’. Note that this kernel literally treats your documents as bags of words, or better multisets. It doesn’t care about the word order. Classifiers can work effectively even after information is thrown away, just as in the case of the digit classifier we spoke about which would work as effectively if all the 30 ×\times 30 images were scrambled by the same permutation of its 900 pixels.

    It would be fun if we could find a machine learning kernel which involved spans of groupoids rather than spans of sets.

    The use of kernel methods in machine learning often goes by the name nonparametric statistics. Sometimes this gets taken as though it’s the opposite of finite parametric statistics, but that’s not quite right. A typical piece of parametric statistics has a model like the two-parameter family of normal distributions, and the parameters of this model are then estimated from a sample.

    What happens in the case of kernel methods in machine learning is that a model from a possibly infinite-dimensional family is specified by a function on the (finite) collection of training points. In other words you are necessarily restricted to a subfamily of models of dimension the size of the training set. So, in contrast to the usual parametric case, you are dealing with a data-dependent finite dimensional subfamily of models.

    Now we can see why regularisation is important. Fitting a normal distribution is choosing a point in a two-dimensional manifold within the space of all distributions on \mathbb{R}. The choice is made of the member of the family whose first and second moments match those of the empirical distribution. Now in the nonparametric case, before we know the training data, we are dealing with an infinite-dimensional space dense within the space of all distributions. Where in the Normal distribution case we have two constraints to match the two dimensions of the family of models, now we are able to satisfy as many constraints as we like, and must end by matching the empirical distribution if we opt simply for the maximum likelihood model.

    Instead of this so-called maximum likelihood estimation, what we can do instead is impose a bias so that we end up in our family by favouring ’smaller’ functions on the training data. An intuitive way to think of this is by considering the Gaussian radial basis function as kernel on a Euclidean space XX:

    K(x,y)=(4πt) n/2exp(xy 24t). K(x, y) = (4 \pi t)^{-n/2}exp\left(-\frac{\|x - y\|^2}{4t}\right).

    This is known as a ’heat kernel’ in Euclidean space. Gaussian process classification with this kernel amounts to selecting negative and positive ’temperatures’ on the training points, setting the temperature of the rest of XX to zero, and allowing heat to flow for a designated time. After this time, points in XX will be classified according to the sign of their temperature, the magnitude of the temperature determing the certainty of the classification.

    Now if we’re allowed to place any temperature values at the training points, we can select very hot ones for those labelled ’+1+1’ and very cold ones for those labelled ’1-1’, so that they’re still hot or cold after heat flow has taken place. This is a classic case of overfitting, and we should have little confidence that our estimates on test points will be accurate. If on the other hand we manage to select modest temperatures in such a way that after the heat flow has taken place the classifier is accurate on training data, we should have greater confidence.

    • CommentRowNumber3.
    • CommentAuthorDavid_Corfield
    • CommentTimeApr 20th 2021

    Added something

    diff, v2, current

    • CommentRowNumber4.
    • CommentAuthorUrs
    • CommentTimeApr 22nd 2021

    added pointer to:

    • Thomas Hofmann, Bernhard Schölkopf, Alexander J. Smola, Kernel methods in machine learning, Annals of Statistics 2008, Vol. 36, No. 3, 1171-1220 (arXiv:math/0701907)

    diff, v4, current

    • CommentRowNumber5.
    • CommentAuthorUrs
    • CommentTimeJun 13th 2022

    added references (here) on kernel methods applicable to persistence diagrams/barcodes for making topological data analysis amenable to “topologicalmachine learning:

    diff, v8, current