Marigold: Efficient k-means Clustering in High Dimensions

Kasper Overgaard Mortensen, Fatemeh Zardbani, Mohammad Ahsanul Haque, Steinn Ymir Agustsson, Davide Mottin, Philip Hofmann, Panagiotis Karras

VLDB 2023

On high-dimensional data, such as physics experiments, MARIGOLD is the scalable answer to compute k-means.

On high-dimensional data, such as physics experiments, MARIGOLD is the scalable answer to compute k-means.

TL;DR

Marigold delivers the exact result of Lloyd’s algorithm for k-means in at least one order of magnitude less time on high-dimensional data.

In this paper:

  • We propose MARIGOLD, an enhanced Lloyd’s algorithm for high-dimensional data.
  • We enhange Lloyds by introducing triangle inequality pruning and a novel stepwise method that approximates the centroid’s distances through compression.
  • We outperform existing Lloyd’s variants by up to an order of magnitude time on high-dimensional data.
  • We showcase our algorithm on an interesting physics application on Angle-Resolved Photoemission Spectroscopy (ARPES) in which the input are high-dimensional spectra of a solid’s electronic structure.