not really.

the estimator is consistent and will

(asymptotically) have the correct number of modes.

(there are fewer modes than the number of components

of the density estimator)

—LW

Isn’t nonparametric density estimation using kernels a mixture model with the exact same multimodality problems? Actually, in density estimation the multimodality is maximal compared to mixture models, where the number of components can be less than n, giving a smoother model.
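To make that comparison concrete, here is a small sketch (synthetic data and hand-picked bandwidths, purely for illustration): a Gaussian KDE is literally an n-component mixture, and its mode count is at most n, shrinking as the bandwidth grows.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic sample with two well-separated groups.
x = np.concatenate([rng.normal(0, 1, 50), rng.normal(8, 1, 50)])

def kde(grid, data, h):
    """Gaussian kernel density estimate: a mixture with one component per data point."""
    z = (grid[:, None] - data[None, :]) / h
    return np.exp(-0.5 * z ** 2).sum(axis=1) / (len(data) * h * np.sqrt(2 * np.pi))

def count_modes(y):
    """Count strict local maxima of a curve sampled on a fine grid."""
    return int(np.sum((y[1:-1] > y[:-2]) & (y[1:-1] > y[2:])))

grid = np.linspace(-5, 13, 2000)
small_h = count_modes(kde(grid, x, 0.1))  # undersmoothed: many spurious modes
large_h = count_modes(kde(grid, x, 1.0))  # smoother: only the two real modes
print(small_h, large_h)
```

So the KDE does share the multimodality issue, but the bandwidth controls it directly, which is the sense in which the smoothed estimate can have fewer modes than components.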

Let’s go through the points. Infinite likelihood and multimodality of the likelihood: yeah, these are annoying. Still, you can often find a pretty good local optimum, and try to beat the clustering with another method! (Of course you need to know how to measure what’s good without using the model assumption.)
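A minimal sketch of that strategy (toy 1-D data, a hand-rolled two-component EM with a variance floor to stay off the degenerate boundary, and several random restarts to keep the best finite local optimum; all settings are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
# Toy data: two groups, so a good local optimum should put means near -2 and 3.
x = np.concatenate([rng.normal(-2, 1, 60), rng.normal(3, 1, 40)])

def em_two_gaussians(x, mu_init, n_iter=200, var_floor=1e-3):
    """EM for a two-component 1-D Gaussian mixture. The variance floor keeps
    the iterates away from the infinite-likelihood boundary (variance -> 0)."""
    mu = np.asarray(mu_init, dtype=float)
    var = np.array([1.0, 1.0])
    pi = np.array([0.5, 0.5])
    for _ in range(n_iter):
        # E step: responsibility of each component for each point.
        dens = pi * np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
        r = dens / dens.sum(axis=1, keepdims=True)
        # M step: re-estimate weights, means, variances.
        nk = r.sum(axis=0)
        pi = nk / len(x)
        mu = (r * x[:, None]).sum(axis=0) / nk
        var = np.maximum((r * (x[:, None] - mu) ** 2).sum(axis=0) / nk, var_floor)
    dens = pi * np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
    return np.log(dens.sum(axis=1)).sum(), mu

# Several random restarts; keep the best (finite) local optimum found.
fits = [em_two_gaussians(x, rng.normal(0, 3, size=2)) for _ in range(10)]
best_ll, best_mu = max(fits, key=lambda fit: fit[0])
print(np.sort(best_mu))
```

On data this clean the best restart recovers the two group means; judging the resulting clustering would then be done with a model-free criterion, as noted above.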

Multimodality of the density – well, first you need to decide whether a “cluster” for you is associated with a Gaussian component or with a density mode anyway. It’s application dependent and both make sense sometimes. If it’s the latter, you shouldn’t have started fitting a mixture model in the first place. If it’s the former, why bother?

Nonidentifiability – as long as it’s only the problems that you mention, they won’t usually hurt unless you’re a Bayesian (although for some mixtures there are worse), and neither will the improper prior problem. Well, to be honest, they may hurt a bit but by far not as much as the problems with the likelihood.

The nonintuitive group memberships are not nonintuitive to me, and how strong an argument is this kind of intuition anyway?

Here is another one. The BIC is believed to be a consistent estimator of the number of components (to my knowledge this is proved only in a quite restricted case, by Keribin, 2000). However, the truth is never precisely a Gaussian mixture, and as n goes to infinity it is in all likelihood best approximated by a bigger and bigger mixture. So consistency of the BIC implies that the estimated k diverges to infinity with n, and if you want to do clustering, that’s bad, because you would rather have a small number of not-exactly-Gaussian clusters. So consistency here is a bad thing, and the BIC is good, if at all, only for small to moderate n, never for large n, *because* it’s consistent.
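For what it’s worth, here is roughly what that BIC selection looks like in practice (a toy 1-D example with a hand-rolled EM; the data, variance floor, and restart counts are all invented for illustration). On nicely separated data with moderate n the criterion behaves; the worry above is about what happens as n grows with a non-Gaussian truth.

```python
import numpy as np

rng = np.random.default_rng(2)
x = np.concatenate([rng.normal(-3, 1, 80), rng.normal(4, 1, 80)])
n = len(x)

def gmm_loglik(x, k, seed, n_iter=300, var_floor=1e-3):
    """EM for a k-component 1-D Gaussian mixture; returns the final log-likelihood."""
    r0 = np.random.default_rng(seed)
    mu = r0.choice(x, size=k, replace=False)   # init means at random data points
    var = np.full(k, x.var())
    pi = np.full(k, 1.0 / k)
    for _ in range(n_iter):
        dens = pi * np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
        r = dens / np.maximum(dens.sum(axis=1, keepdims=True), 1e-300)
        nk = r.sum(axis=0) + 1e-12
        pi = nk / n
        mu = (r * x[:, None]).sum(axis=0) / nk
        var = np.maximum((r * (x[:, None] - mu) ** 2).sum(axis=0) / nk, var_floor)
    dens = pi * np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
    return np.log(np.maximum(dens.sum(axis=1), 1e-300)).sum()

def bic(ll, k):
    # Free parameters in 1-D: k means, k variances, k - 1 weights.
    return -2 * ll + (3 * k - 1) * np.log(n)

# Best of several EM restarts for each k, then pick the k minimizing the BIC.
scores = {k: bic(max(gmm_loglik(x, k, seed=s) for s in range(5)), k) for k in range(1, 6)}
best_k = min(scores, key=scores.get)
print(best_k, scores)
```

Here the penalty easily dominates the overfitting gain from extra components; the argument above is that this balance tips the wrong way for clustering purposes when n is large and the truth is not a finite Gaussian mixture.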

I just mean there are finite local maxima in the interior.

The largest of these has the usual properties.

I don’t see why this is a criticism of maximum likelihood

estimators. The MLE is just an estimator; there is no principle

that says “always use the MLE.”

If the MLE has good behavior, we use it.

If it doesn’t, we don’t.

Purple, not pink!

> how about a nonparametric density estimator?

Doesn’t that hurt interpretability compared to, say, a Gaussian mixture?

It seems to me that being given a mixture of Gaussians, and told that “this is THE right number of components, and the best local maximum of the likelihood function,” would carry more information than simply being given a nonparametric density estimate, even assuming its tuning parameters are also chosen to be perfect by some deity.

Of course I’m sure there are many nonparametric methods to get what the practitioner might hope to learn from the gaussian mixture, as well…

In “infinite likelihood”, can you clarify “This is not necessarily deadly; the infinities are at the boundary and you can use the largest (finite) maximum in the interior as an estimator”?

In general, a boundary can be approached by a sequence of interior points, meaning the sup of the likelihood over the interior is infinite. But I don’t think you are suggesting enumerating all (actual) local maxima and choosing the biggest one!
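The divergence itself is easy to exhibit numerically (a toy sketch; the data and the parameter path are entirely made up). Pin one mean to a data point and shrink that component’s variance: every point on the path is interior with finite likelihood, yet the log-likelihood grows without bound, so the sup over the interior is indeed infinite.

```python
import numpy as np

x = np.linspace(-2.0, 2.0, 30)  # toy data; any fixed sample works

def loglik(x, mu, var, pi):
    """Log-likelihood of a two-component 1-D Gaussian mixture."""
    dens = pi * np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
    return np.log(dens.sum(axis=1)).sum()

mu = np.array([x[0], 0.0])         # first mean pinned exactly to a data point
pi = np.array([0.5, 0.5])
sigmas = [1e-2, 1e-4, 1e-6, 1e-8]  # shrink the first component's scale toward 0
lls = [loglik(x, mu, np.array([s ** 2, 1.0]), pi) for s in sigmas]
print(lls)  # increases roughly by log(10**2) per step, without bound
```

The second component keeps every other point’s density bounded away from zero, so the spike on one observation buys an arbitrarily large likelihood at essentially no cost.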

And isn’t this property, more than anything, just a criticism of maximum likelihood? It seems that, in the mixture setting, it is a nonsensical optimization problem.

Excellent blog btw. I just wish it used more pink.

I believe that is correct

(but it has been a long time)

LW