Investigating the universality of embeddings

Advisor:
Barcza Bende
Course:
Independent Laboratory 1 - Health Care Engineering, MSc
Independent Laboratory 2 - Health Care Engineering, MSc
Independent Laboratory 1 - Control and Vision Systems main specialization, MSc
Independent Laboratory 1 - Visual Informatics main specialization, MSc
Independent Laboratory 2 - Control and Vision Systems main specialization, MSc
Independent Laboratory 2 - Visual Informatics main specialization, MSc
Independent Laboratory - Control Systems branch, Electrical Engineering BSc
Independent Laboratory - Software Development and Systems Design specialization, Computer Engineering BSc
Project Work for Mechatronics Engineering students
Number of students:
4
Continuation:
Bachelor's thesis / Master's thesis
TDK paper (Scientific Students' Association conference)
Description:

Modern Large Language Models (LLMs) encode their understanding of topics into dense vector embeddings. Traditionally, these vector spaces are model-specific and mutually incompatible. However, recent research (such as arXiv:2505.12540) demonstrates that these spaces share a "universal geometry": by exploiting it, embeddings can be translated from one model to another (e.g., via vec2vec), enabling interoperability without expensive retraining. This has significant implications for modular AI, RAG (retrieval-augmented generation) systems, and transfer learning.

In this project laboratory topic, the student's task is to understand the geometry of high-dimensional latent spaces and to implement methods that translate representations between different models in order to solve specific NLP problems.
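
As a first intuition for what "translating representations" means in practice, the simplest baseline is a linear map fitted by least squares on paired embeddings (the same texts embedded by both models). The sketch below illustrates only this hedged baseline; the vec2vec method referenced above learns its mapping without paired data, so treat every name, shape, and number here as an illustrative assumption.

```python
# Baseline sketch: least-squares linear map between two embedding spaces.
# Assumes paired embeddings of the same texts are available; all shapes
# and data below are synthetic placeholders.
import numpy as np

rng = np.random.default_rng(0)

n, d_src, d_tgt = 2000, 768, 1024          # hypothetical dimensions
X = rng.normal(size=(n, d_src))            # source-model embeddings
Y = rng.normal(size=(n, d_tgt))            # target-model embeddings of the same texts

# Fit W minimizing ||X @ W - Y||_F (ordinary least squares).
W, *_ = np.linalg.lstsq(X, Y, rcond=None)

# Translate a new source embedding into the target space, where it can
# be compared (e.g., by cosine similarity) with native target vectors.
x_new = rng.normal(size=(1, d_src))
y_hat = x_new @ W
```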

Within this field, the following tasks can be performed:

  • Cross-Model Mapping: Implementing and training transformation layers that map embeddings from a source model (e.g., BERT) to a target model (e.g., Llama) with minimal semantic loss (a training sketch follows this list).

  • Flexible RAG Systems: Developing retrieval architectures where the document index and user queries rely on different embedding models, bridged by latent space alignment.

  • Zero-Shot Classification Transfer: Training classifiers on embeddings from one architecture and deploying them on another after applying universal geometric transformations.

  • Latent Space Analysis: Using dimensionality reduction (t-SNE/UMAP) to visualize and quantify the geometric alignment between different model families (a visualization sketch follows this list).

  • Individual ideas: Unique applications of interoperable embeddings.
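
For the cross-model mapping task, the sketch below shows one possible trainable translation layer: a small MLP trained with a cosine objective on paired embeddings. The class name, layer sizes, and hyperparameters are hypothetical placeholders, not a prescribed architecture.

```python
# Sketch: a trainable embedding translator (source space -> target space).
# The supervised setting with paired embeddings is an assumption here;
# vec2vec-style unpaired training is a possible extension within the topic.
import torch
import torch.nn as nn

class EmbeddingTranslator(nn.Module):
    """Maps source-model embeddings into the target model's space."""
    def __init__(self, d_src: int, d_tgt: int, d_hidden: int = 2048):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_src, d_hidden),
            nn.GELU(),
            nn.Linear(d_hidden, d_tgt),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

def train_step(model, opt, x_src, y_tgt):
    # Cosine loss: pull translated vectors toward the paired target vectors.
    opt.zero_grad()
    y_hat = model(x_src)
    loss = 1.0 - nn.functional.cosine_similarity(y_hat, y_tgt, dim=-1).mean()
    loss.backward()
    opt.step()
    return loss.item()

# Toy usage with random tensors standing in for real paired embeddings.
model = EmbeddingTranslator(d_src=768, d_tgt=1024)
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
x, y = torch.randn(32, 768), torch.randn(32, 1024)
print(train_step(model, opt, x, y))
```

The same translator plugs directly into the two middle tasks above: in the flexible RAG setting it maps user queries into the space of the document index, and in the zero-shot transfer setting it maps a new model's embeddings into the space on which the classifier was originally trained.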
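
For the latent space analysis task, the sketch below shows one simple diagnostic: embed texts with two models (or one model plus a translated version of the other), project everything to 2-D with t-SNE, and color points by origin. With well-aligned spaces the two point clouds should overlap; the data here is synthetic and all parameters are illustrative.

```python
# Sketch: visual check of geometric alignment between two embedding sets.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
emb_a = rng.normal(size=(300, 768))        # model A embeddings (placeholder)
emb_b = rng.normal(size=(300, 768)) + 0.5  # model B embeddings after alignment (placeholder)

# Project both sets jointly so they share one 2-D coordinate system.
proj = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(
    np.vstack([emb_a, emb_b])
)

plt.scatter(proj[:300, 0], proj[:300, 1], s=5, label="model A")
plt.scatter(proj[300:, 0], proj[300:, 1], s=5, label="model B")
plt.legend()
plt.title("t-SNE of two embedding spaces after alignment")
plt.show()
```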

Requirements for the topic:

  • Knowledge of Python and PyTorch.

  • Understanding of vector algebra and neural network fundamentals.

Recommended for the topic:

  • Familiarity with NLP concepts (embeddings, cosine similarity).

  • Basic understanding of linear mapping or alignment techniques.