Identifying Structure in Data: All you need to know about Dimensionality Reduction, Clustering and more

As large-scale models and datasets grow, techniques for data curation, training objective definition, and quality monitoring are essential. Mastering them ensures efficient dataset exploration, robust model development, and impactful decision-making in AI and computer vision workflows.

Dimensionality reduction tools such as t-SNE, UMAP, and h-NNE can reveal meaningful structure in visualizations and help draw insights from it. Clustering, on the other hand, organizes unstructured data into meaningful groups, aiding knowledge discovery, feature analysis, and retrieval-augmented generation.

These methods also help address learned feature biases, errors, and redundancies that affect model performance. Dimensionality reduction, when done right, can help identify outliers and irregular patterns within the data. Robust clustering supports scalable embedding pipelines, enabling efficient data curation and querying. From k-means to DBSCAN and hierarchical approaches like FINCH, selecting the right method is key: it means balancing scalability, noise sensitivity, and computational cost.
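The noise-sensitivity trade-off mentioned above can be illustrated with a minimal sketch. It is not taken from the tutorial materials: the toy points, the initial centers, and the `eps` and `min_samples` values are assumptions chosen for illustration, and both functions are simplified pure-Python versions of k-means (Lloyd's algorithm) and DBSCAN rather than library implementations.

```python
import math

def kmeans(points, init_centers, iters=10):
    """Simplified Lloyd's algorithm; every point is assigned to some cluster."""
    centers = [tuple(c) for c in init_centers]
    labels = []
    for _ in range(iters):
        # Assign each point to its nearest center.
        labels = [min(range(len(centers)), key=lambda c: math.dist(p, centers[c]))
                  for p in points]
        # Recompute each center as the mean of its members.
        new_centers = []
        for c in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centers.append(tuple(sum(x) / len(members) for x in zip(*members)))
            else:
                new_centers.append(centers[c])
        if new_centers == centers:
            break
        centers = new_centers
    return labels, centers

def dbscan(points, eps, min_samples):
    """Simplified DBSCAN; points in sparse regions are labeled -1 (noise)."""
    n = len(points)
    labels = [None] * n  # None = unvisited

    def region(i):
        # Indices of all points within eps of point i (including i itself).
        return [j for j in range(n) if math.dist(points[i], points[j]) <= eps]

    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        neighbors = region(i)
        if len(neighbors) < min_samples:
            labels[i] = -1  # provisionally noise
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = region(j)
            if len(reachable) >= min_samples:
                queue.extend(reachable)  # j is a core point, keep expanding
    return labels

# Hypothetical toy data: two tight groups plus one distant outlier.
group_a = [(0, 0), (0, 1), (1, 0), (1, 1)]
group_b = [(10, 10), (10, 11), (11, 10), (11, 11)]
points = group_a + group_b + [(30, 0)]

km_labels, km_centers = kmeans(points, init_centers=[points[0], points[4]])
db_labels = dbscan(points, eps=1.5, min_samples=3)

print("k-means labels:", km_labels)
print("k-means centers:", km_centers)
print("DBSCAN labels: ", db_labels)
```

In this toy setting, k-means absorbs the outlier into the second cluster and its centroid shifts away from the group, whereas the density-based rule marks the outlier as noise. The price is two extra parameters that have to be chosen for the data at hand.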

This tutorial provides an in-depth exploration of the current state of the art in data exploration, covering dimensionality reduction for data visualization and clustering methods, with a strong focus on their applications within computer vision. Attendees will gain a comprehensive understanding of both foundational and advanced techniques beyond classic methods like t-SNE and k-means. Through a blend of theoretical insights and hands-on applications, participants will learn how to effectively apply these methods to tasks such as big data analysis, representation learning, model development, pseudo-labeling, and data annotation.

Learning Outcomes

Grasp the core principles and techniques in dimensionality reduction and clustering.
Analyze and evaluate algorithmic trade-offs and explore alternatives to popular methods.
Learn practical strategies to apply these techniques for optimized results in computer vision.
Apply these methods in real-world computer vision applications.

Timeline

Time Slot	Speaker	Talk Title	Slides
13:00 - 13:15	Constantin Seibold	Welcome and Opening Remarks	[slides]
13:15 - 14:15	M. Saquib Sarfraz, Marios Koulakis	Clustering and its applications in modern computer vision	[slides]
14:15 - 15:15	Laurens van der Maaten	How to use and not use modern dimension reduction techniques for data visualization
15:15 - 15:30	Coffee Break ☕
15:30 - 16:30	Brandon Duderstadt	Scaling dimensionality reduction with NOMAD projection	[slides]
16:30 - 17:00	All Speakers	Q&A and Panel Discussion

Materials, Resources & Format

We're excited to offer you a wealth of materials and interactive resources to enhance your experience during our tutorial. In addition to hands-on exercises and interactive Jupyter notebooks demonstrating cutting-edge techniques in dimensionality reduction and clustering for computer vision, we've prepared a variety of supportive resources to inspire and engage you.

We warmly invite you to explore the following resources, which have been thoughtfully curated to spark your curiosity and support your learning journey:

Questionnaire: Share your insights and questions
Your responses will help us understand your background, interests, and specific challenges related to dimensionality reduction, clustering, and data curation in computer vision. This will allow us to tailor the tutorial and round table discussion to meet your needs.
Awesome GitHub Repositories:
- Dimensionality Reduction: Explore the repository
  This repository is a curated collection of resources, libraries, and research papers on various dimensionality reduction techniques. Whether you're looking for methods to improve data visualization or seeking practical implementations, you'll find a wealth of information to guide your projects.
- Clustering: Explore the repository
  Dive into an extensive collection of clustering techniques ranging from classic algorithms like k-means to advanced modern approaches. This repository offers detailed insights, implementation examples, and case studies to help you apply clustering effectively in computer vision and beyond.

We are thrilled to welcome researchers, practitioners, and students to this half-day in-person tutorial. We hope these resources will spark new ideas and enhance your exploration of the exciting world of computer vision and data analysis.

Speakers & Organizers

M. Saquib Sarfraz

Principal Lead AI & Deep Learning
Mercedes-Benz Tech Innovation

Laurens van der Maaten

Distinguished Research Scientist
Meta AI

Brandon Duderstadt

Founder & CEO
Nomic AI

Marios Koulakis

Machine Learning Scientist & Researcher
DeepHealth

Constantin Seibold

Research Group Lead
University Clinic Heidelberg

Selected Publications

Sarfraz, M. Saquib, Sharma, V., Stiefelhagen, R. "Efficient parameter-free clustering using first neighbor relations". IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2019.
Sarfraz, M. Saquib, Murray, N., Sharma, V., Diba, A., Van Gool, L., Stiefelhagen, R. "Temporally-weighted hierarchical clustering for unsupervised action segmentation". IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2021.
Sarfraz, M. Saquib, Koulakis, M., Seibold, C., Stiefelhagen, R. "Hierarchical nearest neighbor graph embedding for efficient dimensionality reduction". IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2022.
Van der Maaten, L. and Hinton, G. "Visualizing data using t-SNE". Journal of Machine Learning Research (JMLR) 2008.
Van der Maaten, L. "Barnes-Hut-SNE". International Conference on Learning Representations (ICLR) 2013.
Van der Maaten, L. "Accelerating t-SNE using Tree-Based Algorithms". Journal of Machine Learning Research (JMLR) 2014.
Anand, Y., Nussbaum, Z., Duderstadt, B., Schmidt, B., Mulyar, A. "GPT4All: Training an assistant-style chatbot with large scale data distillation from GPT-3.5-Turbo". GitHub 2023.
Nussbaum, Z., Morris, J.X., Duderstadt, B., Mulyar, A. "Nomic embed: Training a reproducible long context text embedder". arXiv preprint 2024.
Helm, H., Duderstadt, B., Park, Y., Priebe, C. "Tracking the perspectives of interacting language models". Empirical Methods in Natural Language Processing (EMNLP) 2024.

Contact & Bio Sketches

For further information or inquiries, please contact the organizers via the social links provided in the speakers’ section.
