• Call for Social Sciences Papers
  • Sign-up for PNAS eTOC Alerts

Robust continuous clustering

  1. Vladlen Koltunb
  1. aDepartment of Electrical and Computer Engineering, University of Maryland, College Park, MD 20740;
  2. bIntel Labs, Santa Clara, CA 95054
  1. Edited by David L. Donoho, Stanford University, Stanford, CA, and approved August 7, 2017 (received for review January 13, 2017)

Significance

Clustering is a fundamental experimental procedure in data analysis. It is used in virtually all natural and social sciences and has played a central role in biology, astronomy, psychology, medicine, and chemistry. Despite the importance and ubiquity of clustering, existing algorithms suffer from a variety of drawbacks and no universal solution has emerged. We present a clustering algorithm that reliably achieves high accuracy across domains, handles high data dimensionality, and scales to large datasets. The algorithm optimizes a smooth global objective, using efficient numerical methods. Experiments demonstrate that our method outperforms state-of-the-art clustering algorithms by significant factors in multiple domains.

Abstract

Clustering is a fundamental procedure in the analysis of scientific data. It is used ubiquitously across the sciences. Despite decades of research, existing clustering algorithms have limited effectiveness in high dimensions and often require tuning parameters for different domains and datasets. We present a clustering algorithm that achieves high accuracy across multiple domains and scales efficiently to high dimensions and large datasets. The presented algorithm optimizes a smooth continuous objective, which is based on robust statistics and allows heavily mixed clusters to be untangled. The continuous nature of the objective also allows clustering to be integrated as a module in end-to-end feature learning pipelines. We demonstrate this by extending the algorithm to perform joint clustering and dimensionality reduction by efficiently optimizing a continuous global objective. The presented approach is evaluated on large datasets of faces, hand-written digits, objects, newswire articles, sensor readings from the Space Shuttle, and protein expression levels. Our method achieves high accuracy across all datasets, outperforming the best prior algorithm by a factor of 3 in average rank.

Footnotes

  • ?1To whom correspondence should be addressed. Email: sohilas{at}umd.edu.

Freely available online through the PNAS open access option.

Online Impact

                                      1. 99132880 2018-01-23
                                      2. 802899879 2018-01-23
                                      3. 295573878 2018-01-23
                                      4. 352668877 2018-01-23
                                      5. 984633876 2018-01-23
                                      6. 545928875 2018-01-23
                                      7. 976569874 2018-01-23
                                      8. 871324873 2018-01-23
                                      9. 263462872 2018-01-23
                                      10. 577161871 2018-01-23
                                      11. 255603870 2018-01-23
                                      12. 117346869 2018-01-23
                                      13. 90982868 2018-01-23
                                      14. 663415867 2018-01-23
                                      15. 793874866 2018-01-23
                                      16. 843582865 2018-01-23
                                      17. 864971864 2018-01-22
                                      18. 258841863 2018-01-22
                                      19. 957295862 2018-01-22
                                      20. 553518861 2018-01-22