Real-Time Pose Action Recognition on Embedded Edge Devices
TDK thesis
AI vision capabilities are stronger than ever: Vision Language Models (VLMs) can understand and describe a wide range of images with high precision and detail. However, the generality of VLMs comes at the price of a very large parameter count, which makes them unreasonably computationally expensive for many downstream tasks. The ability to run such vision capabilities on edge devices has become important for robotics and autonomous vehicles. Since their datasets vary widely across scenarios, the models must be both highly capable and able to run at a reasonable inference frequency.
Within this topic, the following tasks can be performed:
- Literature Review & Technology Selection: Investigating state-of-the-art, lightweight pose estimation frameworks and edge-compatible machine learning tools.
- Data Processing Pipeline: Creating an automated pipeline to extract temporal sequences of skeletal keypoints or other compact spatial features from video datasets, bypassing the computational cost of processing raw image pixels.
- Sequence Modeling: Designing or adapting a lightweight neural network architecture capable of classifying temporal sequences (e.g., recognizing specific actions or intentions over time) with a minimal parameter count.
- Model Optimization: Exploring and applying model compression techniques (such as quantization or pruning) to shrink the model footprint and prepare it for embedded deployment.
- Edge Deployment & Benchmarking: Deploying the end-to-end pipeline onto a target embedded edge device and analyzing the trade-offs between classification accuracy, resource utilization, and real-time inference speed.
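To make the data-processing step above more concrete, the sketch below shows one possible shape of the pipeline: per-frame keypoint vectors (as produced by a pose estimator such as MediaPipe or MoveNet, not included here) are normalized for position and scale invariance, then sliced into fixed-length temporal windows for a sequence classifier. The keypoint layout and window/stride values are illustrative assumptions, not a prescribed design.

```python
from typing import List, Sequence

def normalize(kps: Sequence[float]) -> List[float]:
    """Center keypoints on the hip midpoint and scale by torso length,
    making the features invariant to image position and camera distance.
    Layout assumption (hypothetical): flat [x0, y0, x1, y1, ...] where
    keypoint 0 is the hip midpoint and keypoint 1 is the neck."""
    hx, hy = kps[0], kps[1]
    nx, ny = kps[2], kps[3]
    torso = max(((nx - hx) ** 2 + (ny - hy) ** 2) ** 0.5, 1e-6)
    out: List[float] = []
    for i in range(0, len(kps), 2):
        out.append((kps[i] - hx) / torso)
        out.append((kps[i + 1] - hy) / torso)
    return out

def make_windows(frames: List[Sequence[float]],
                 window: int = 30,
                 stride: int = 10) -> List[List[Sequence[float]]]:
    """Slice a stream of per-frame keypoint vectors into overlapping
    fixed-length windows, the input unit for temporal classification."""
    return [frames[i:i + window]
            for i in range(0, len(frames) - window + 1, stride)]
```

Working on these compact skeletal features instead of raw pixels is what keeps the downstream sequence model small enough for an embedded target.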
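For the model-optimization task, the core idea of post-training quantization can be illustrated without any framework: weights are mapped to 8-bit integers plus a single floating-point scale, roughly quartering the memory footprint versus float32. This is a minimal symmetric-quantization sketch for intuition only; in practice a toolchain such as PyTorch's quantization utilities or TensorFlow Lite would handle this per layer.

```python
from typing import List, Sequence, Tuple

def quantize_int8(weights: Sequence[float]) -> Tuple[List[int], float]:
    """Symmetric post-training quantization: map floats to int8 so that
    w ~= q * scale. Returns the integer codes and the shared scale."""
    max_abs = max(abs(w) for w in weights) or 1.0
    scale = max_abs / 127.0
    q = [max(-128, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q: Sequence[int], scale: float) -> List[float]:
    """Recover approximate float weights from the int8 codes."""
    return [v * scale for v in q]
```

The benchmarking task then quantifies what this approximation costs: the reconstruction error per weight is bounded by half the scale step, and the accuracy/latency trade-off on the target device decides whether int8 (or further pruning) is acceptable.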
Requirements for the topic:
- Knowledge of Python programming
- Good English communication skills
Recommended for the topic:
- Basic knowledge of machine learning/deep learning concepts.
- Basic familiarity with Linux environments or single-board embedded systems.
- Basic mathematical foundation (linear algebra and matrix operations).
