Investigating the universality of embeddings
Project Laboratory 2 - Biomedical engineering, MSc Bio.
Project Laboratory 1 - Control and vision systems, MSc Elec.
Project Laboratory 1 - Visual informatics, MSc IT.
Project Laboratory 2 - Control and vision systems, MSc Elec.
Project Laboratory 2 - Visual informatics, MSc IT.
Project Laboratory - Control systems study specialization, BSc Elec.
Project Laboratory - Software development study specialization, BSc IT.
Teamwork Project for Mechatronics Engineers
TDK thesis (Scientific Students' Association research paper)
Description: Modern Large Language Models (LLMs) encode their understanding of topics into dense vector embeddings. Traditionally, these vector spaces are model-specific and incompatible. However, recent research (such as arXiv:2505.12540) demonstrates that these spaces share a "universal geometry." By leveraging this, it is possible to transform embeddings from one model to another (e.g., via vec2vec), enabling interoperability without expensive retraining. This has significant implications for modular AI, RAG systems, and transfer learning.
In this project laboratory topic, the student's task is to understand the geometry of high-dimensional latent spaces and to implement methods that translate representations between different models in order to solve specific NLP problems.
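The core idea of translating representations can be sketched with a minimal example. The code below is a toy illustration, not the vec2vec method itself: it uses synthetic data as a stand-in for two models' embeddings and fits a plain least-squares linear map from the "source" space to the "target" space, then checks how well the mapped vectors align.

```python
import numpy as np

# Hypothetical setup: synthetic stand-ins for two models' embedding spaces.
# In a real experiment, paired sentence embeddings from two models
# (e.g. BERT and Llama) would replace these arrays.
rng = np.random.default_rng(0)
n, d_src, d_tgt = 500, 64, 32

# Generate source embeddings and targets related by an unknown linear map
# plus small noise, so a learned linear map can recover the relation.
src = rng.normal(size=(n, d_src))
true_map = rng.normal(size=(d_src, d_tgt))
tgt = src @ true_map + 0.01 * rng.normal(size=(n, d_tgt))

# Fit W minimizing ||src @ W - tgt||^2 (ordinary least squares).
W, *_ = np.linalg.lstsq(src, tgt, rcond=None)

# Evaluate: per-pair cosine similarity between mapped and true target vectors.
mapped = src @ W
cos = np.sum(mapped * tgt, axis=1) / (
    np.linalg.norm(mapped, axis=1) * np.linalg.norm(tgt, axis=1)
)
print(float(cos.mean()))
```

In practice, a small MLP or an orthogonality-constrained map is often used instead of unconstrained least squares; this sketch only shows the overall fit-and-evaluate loop.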
Within this field, the following tasks can be performed:
Cross-Model Mapping: Implementing and training transformation layers to map embeddings from a source model (e.g., BERT) to a target model (e.g., Llama) without semantic loss.
Flexible RAG Systems: Developing retrieval architectures where the document index and user queries rely on different embedding models, bridged by latent space alignment.
Zero-Shot Classification Transfer: Training classifiers on embeddings from one architecture and deploying them on another after applying universal geometric transformations.
Latent Space Analysis: Using dimensionality reduction (t-SNE/UMAP) to visualize and quantify the geometric alignment between different model families.
Individual ideas: Unique applications of interoperable embeddings.
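The zero-shot classification transfer task above can be illustrated with a small sketch. This is a toy example under stated assumptions: synthetic two-class data stands in for real embeddings, the "second model's" space is an unknown orthogonal rotation of the first, the alignment is solved with the orthogonal Procrustes solution on a few paired anchors, and a simple nearest-centroid rule stands in for a trained classifier.

```python
import numpy as np

# Hypothetical sketch: a classifier defined in model A's embedding space is
# reused on model B's embeddings after an orthogonal (Procrustes) alignment.
rng = np.random.default_rng(1)
d, n_per_class = 16, 100

# "Model A" space: two well-separated synthetic classes.
centers = np.stack([np.ones(d), -np.ones(d)])
a_emb = np.concatenate(
    [c + 0.3 * rng.normal(size=(n_per_class, d)) for c in centers]
)
labels = np.repeat([0, 1], n_per_class)

# "Model B" space: same semantics, rotated by an unknown orthogonal matrix.
Q_true, _ = np.linalg.qr(rng.normal(size=(d, d)))
b_emb = a_emb @ Q_true

# Align B -> A using a small set of paired anchor embeddings:
# Q_hat = argmin_Q ||B Q - A||_F over orthogonal Q (Procrustes, via SVD).
anchors = slice(0, 20)
U, _, Vt = np.linalg.svd(b_emb[anchors].T @ a_emb[anchors])
Q_hat = U @ Vt

# Nearest-centroid "classifier" trained in A's space, applied to aligned B.
preds = np.argmin(
    np.linalg.norm((b_emb @ Q_hat)[:, None, :] - centers[None], axis=2), axis=1
)
accuracy = float((preds == labels).mean())
print(accuracy)
```

Restricting the alignment to orthogonal maps preserves distances and angles, which is why a classifier fitted in one space can remain accurate in the other; with real model pairs the rotation assumption only holds approximately.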
Requirements for the topic:
Knowledge of Python and PyTorch.
Understanding of vector algebra and neural network fundamentals.
Recommended for the topic:
Familiarity with NLP concepts (embeddings, cosine similarity).
Basic understanding of linear mapping or alignment techniques.
