Is Euclidean distance meaningful for high dimensional data?

The short answer is no. At high dimensions, Euclidean distance loses pretty much all meaning.

However, it’s not something that’s the fault of Euclidean distance in particular (though there are distance metrics that work better at high dimensions than Euclidean).

The main issue is something commonly referred to as the “Curse of Dimensionality”. It’s very unintuitive, but also a common and insidious issue that will plague anything you do in a high-dimensional space.

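Let’s be clear though. By “high-d” we’re talking hundreds to thousands of dimensions for a dense vector (sparse vectors are a completely different topic). Basically, once you get up to high dimensionality, the pairwise distances between all of your points approach a constant. Not zero, not infinity, but a constant. More precisely, the spread of those distances becomes small relative to their typical size, so the nearest and farthest points end up almost equally far away.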
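To make that concrete, here is a minimal sketch (not from the original answer) that draws uniform random points and measures how much the pairwise Euclidean distances vary relative to the smallest one. The function name `relative_contrast`, the sample size of 200, and the choice of uniform data are all assumptions made for illustration; real data will behave differently.

```python
import numpy as np

def relative_contrast(n_points, n_dims, seed=0):
    """Return (max - min) / min over all pairwise Euclidean distances.

    Points are drawn uniformly from the unit hypercube, with every
    dimension independent (the worst case for this effect).
    """
    rng = np.random.default_rng(seed)
    X = rng.random((n_points, n_dims))
    # Squared distances from the Gram matrix: |x|^2 + |y|^2 - 2 x.y
    sq = (X ** 2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    # Keep each unordered pair once and guard against tiny negatives
    iu = np.triu_indices(n_points, k=1)
    d = np.sqrt(np.maximum(d2[iu], 0.0))
    return (d.max() - d.min()) / d.min()

for dims in (2, 10, 100, 1000):
    print(dims, round(relative_contrast(200, dims), 3))
```

In two dimensions the farthest pair is many times farther apart than the closest pair. By a thousand dimensions the contrast should be a small fraction of the distances themselves, which is the "approaches a constant" behavior described above.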

Now, there are several important caveats here, and quite frankly the curse of dimensionality isn’t something that we understand very well outside of toy examples.

First – this pattern starts to fall away if your different dimensions are correlated. If you can do a PCA or something similar to re-project into a lower-d space with a small amount of loss, then your distance metrics are probably still meaningful, though this varies case by case.
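As a rough illustration of this caveat, the sketch below compares two synthetic datasets with the same number of dimensions: one where every dimension is independent, and one where all 1000 dimensions are noisy linear mixes of just 5 hidden factors. Everything here (the sizes, the noise level, the helper names `pairwise_distances` and `spread`) is an assumption chosen for the example, and PCA is done with a plain SVD rather than any particular library's implementation.

```python
import numpy as np

def pairwise_distances(X):
    """Euclidean distance for every unordered pair of rows in X."""
    sq = (X ** 2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    iu = np.triu_indices(len(X), k=1)
    return np.sqrt(np.maximum(d2[iu], 0.0))

def spread(X):
    """Standard deviation of pairwise distances divided by their mean."""
    d = pairwise_distances(X)
    return d.std() / d.mean()

rng = np.random.default_rng(0)
n, high_d, latent_d = 200, 1000, 5

# Case 1: every dimension is independent noise.
independent = rng.normal(size=(n, high_d))

# Case 2: 1000 observed dimensions driven by 5 hidden factors plus a little noise.
latent = rng.normal(size=(n, latent_d))
mixing = rng.normal(size=(latent_d, high_d))
correlated = latent @ mixing + 0.01 * rng.normal(size=(n, high_d))

# PCA via SVD: fraction of variance captured by the first 5 components.
centered = correlated - correlated.mean(axis=0)
singular_values = np.linalg.svd(centered, compute_uv=False)
explained = singular_values ** 2 / (singular_values ** 2).sum()
top_components = explained[:latent_d].sum()

print("spread, independent dims:", round(spread(independent), 3))
print("spread, correlated dims: ", round(spread(correlated), 3))
print("variance in top 5 components:", round(top_components, 4))
```

The correlated dataset keeps a much wider spread of distances even though it nominally lives in 1000 dimensions, and almost all of its variance sits in a handful of principal components. That is the situation where re-projecting to a lower-d space is likely to leave your distances meaningful.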

Second – this isn’t something as easy as “just use this other distance metric”. The critical problem here is sparsity, and it undermines the value of any distance metric at high-d. In a k-nn scenario it’s usually still the case that the relative distances between points have meaning, but the absolute distances have much less of it. A lot of modern manifold layout algorithms attempt to circumvent this problem by throwing out the distances and instead only considering narrow “neighborhoods” of nearest neighbors, though many approximate nearest neighbor solutions that rely on space partitioning (such as Barnes-Hut) become very ineffective at high-d. This is largely because the assumptions around the efficacy of linear sub-division of the underlying space fall away.

To address the second point there are interesting techniques like Voronoi clustering that help to mitigate some of these issues.
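Here is a small sketch of the "relative versus absolute" point, using made-up data: two Gaussian clusters in 500 dimensions whose centers differ by 0.5 in every coordinate. The cluster sizes, the offset, and `k = 10` are arbitrary assumptions. The code keeps only the identities of each point's nearest neighbors, the way neighborhood-based methods do, and separately compares the average within-cluster and between-cluster distances.

```python
import numpy as np

rng = np.random.default_rng(0)
n_per_cluster, n_dims, k = 100, 500, 10

# Two synthetic clusters; the second is shifted by 0.5 in every dimension.
a = rng.normal(size=(n_per_cluster, n_dims))
b = rng.normal(size=(n_per_cluster, n_dims)) + 0.5
X = np.vstack([a, b])
labels = np.repeat([0, 1], n_per_cluster)

# Full pairwise Euclidean distance matrix (brute force, fine at this size).
sq = (X ** 2).sum(axis=1)
D = np.sqrt(np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0))
np.fill_diagonal(D, np.inf)  # a point is not its own neighbor

# Keep only WHO the k nearest neighbors are, not how far away they are.
neighbors = np.argsort(D, axis=1)[:, :k]
purity = (labels[neighbors] == labels[:, None]).mean()

# Absolute distances: average within a cluster vs. across clusters.
same = labels[:, None] == labels[None, :]
off_diag = ~np.eye(len(X), dtype=bool)
within_mean = D[same & off_diag].mean()
between_mean = D[~same].mean()

print("within-cluster mean distance: ", round(within_mean, 2))
print("between-cluster mean distance:", round(between_mean, 2))
print("fraction of k-nn in same cluster:", round(purity, 3))
```

In this toy setup the two average distances differ by only a few percent, so the raw numbers tell you very little on their own, yet the neighbor lists still land overwhelmingly in the right cluster. Whether that holds for your data depends on how much real structure there is, so treat this as an illustration rather than a guarantee.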

In general it depends a lot on the use case, but if you’re using Euclidean distance in a space that has hundreds or thousands of independent variables, you should get very paranoid about your assumptions very quickly.

View original question on Quora >

Follow Slater on Quora >>
