Is Euclidean distance meaningful for high dimensional data?
July 10, 2018 / Ask Slater
The short answer is no. At high dimensions, Euclidean distance loses pretty much all meaning.
However, it’s not something that’s the fault of Euclidean distance in particular (though there are distance metrics that work better at high dimensions than Euclidean).
The main issue is something commonly referred to as the “Curse of Dimensionality”. It’s very unintuitive, but also a common and insidious issue that will plague anything you do in a high-dimensional space.
Let’s be clear though. By “high-d” we’re talking hundreds to thousands of dimensions for a dense vector (sparse vectors are a completely different topic). Basically once you get up to high-dimensionality, pairwise distance between all of your points approaches a constant. Not zero, not infinity, but a constant.
Now, there are several important caveats here, and quite frankly the curse of dimensionality isn’t something that we understand very well outside of toy examples.
First – this pattern starts to fall away if your different dimensions are correlated. If you can do a PCA or something similar to re-project into a lower-d space with a small amount of loss, then your distance metrics are probably still meaningful, though this varies case by case.
Second – this isn’t something as easy as “just use this other distance metric”. The critical problem here is sparsity, and the value of any distance metric at high-d. In a k-nn scenario it’s usually still the case that the relative distances between points have meaning, but just that the absolute distance have much less of it. A lot of modern manifold layout algorithms attempt to circumvent this problem by throwing out the distance and instead only considering narrow “neighborhoods” of nearest neighbors, though many approximate nearest neighbors solutions (such as barnes hut) become very ineffective at high-d. This is largely because the assumptions around the efficacy of linear sub-division of the underlying space fall away.
To address the second point there are interesting techniques like voronoi clustering that help to mitigate some of these issues.
In general it depends a lot on the use case, but if you’re using Euclidean distance in a space that has hundreds or thousands of independent variables, you should get very paranoid about your assumptions very quickly.