Fine-Grained Activity Classification in Assembly Based on Multi-Visual Modalities

Goals

  • Develop a Fine-Grained Activity Classification System: Create a system capable of recognizing and predicting detailed assembly activities with low inter-class variability in manufacturing environments.
  • Leverage Multi-Visual Modalities: Utilize both RGB frames and hand skeleton frames to capture comprehensive scene-level and temporal-level features for accurate activity recognition.
  • Achieve High Accuracy and Real-Time Performance: Design a two-stage neural network architecture that achieves over 99% accuracy on trimmed videos and over 91% accuracy on untrimmed, continuous-activity videos in real-time scenarios.

Key Findings

  • Custom Dataset Creation: Developed a dataset comprising 15 fine-grained activities specific to the assembly of a desktop carving machine, addressing the lack of manufacturing-specific fine-grained activity datasets.
  • Two-Stage Neural Network Architecture:
    • Stage 1: Feature Awareness Block that extracts scene-level features from RGB and hand skeleton frames using transfer learning with pre-trained VGG-16 models.
    • Stage 2: Recurrent Neural Network (LSTM) layers that learn temporal-level features from the extracted scene-level features.
  • Fusion Strategies: Implemented fusion-before-RNN and fusion-after-RNN mechanisms to effectively combine RGB and skeleton frame features, with fusion-after-RNN yielding superior performance (see the first sketch after this list).
  • Prediction Model: Designed a partial video observation method to predict upcoming activities, achieving over 97% accuracy from only the first 50% of an activity's frames.
  • Real-Time Continuous Activity Recognition: Developed a fusion recognition-prediction model that accurately recognizes continuous fine-grained activities in untrimmed videos with an average accuracy of 91.33%, operating faster than real time (see the sliding-window sketch after this list).
  • Performance Benchmarking: Demonstrated superior performance compared to state-of-the-art models on both the custom assembly dataset and the public UCF101 dataset, establishing a new baseline for continuous fine-grained activity recognition in assembly contexts.
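The two-stage design lends itself to a compact Keras implementation. Below is a minimal sketch of the fusion-after-RNN variant; the sequence length, input resolution, and LSTM width are illustrative assumptions rather than the paper's exact hyperparameters, while the 15-class softmax matches the custom dataset:

```python
from tensorflow.keras import layers, models, applications

SEQ_LEN, NUM_CLASSES = 16, 15  # assumed frames per clip; 15 assembly activities

def feature_awareness_block(name):
    """Stage 1: a frozen, pre-trained VGG-16 applied to every frame of one modality."""
    vgg = applications.VGG16(weights="imagenet", include_top=False, pooling="avg")
    vgg.trainable = False  # transfer learning: reuse ImageNet features as-is
    inp = layers.Input(shape=(SEQ_LEN, 224, 224, 3), name=name)
    feats = layers.TimeDistributed(vgg)(inp)  # -> (batch, SEQ_LEN, 512)
    return inp, feats

rgb_in, rgb_feats = feature_awareness_block("rgb_frames")
skel_in, skel_feats = feature_awareness_block("skeleton_frames")

# Stage 2: a separate LSTM learns temporal features for each modality ...
rgb_temporal = layers.LSTM(256)(rgb_feats)
skel_temporal = layers.LSTM(256)(skel_feats)

# ... and fusion-after-RNN concatenates the two temporal embeddings for classification.
fused = layers.Concatenate()([rgb_temporal, skel_temporal])
outputs = layers.Dense(NUM_CLASSES, activation="softmax")(fused)

model = models.Model(inputs=[rgb_in, skel_in], outputs=outputs)
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
```

The fusion-before-RNN alternative would instead concatenate `rgb_feats` and `skel_feats` frame by frame and pass the result through a single LSTM; per the findings above, the after-RNN variant performed better.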
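For the continuous, untrimmed setting, a sliding window over FIFO frame buffers lets the same trained model emit labels while an activity is still unfolding, which is also the mechanism behind early prediction from partial observations. A minimal sketch, assuming a hypothetical `preprocess` hook that returns an (RGB frame, skeleton frame) pair and illustrative window/stride sizes:

```python
from collections import deque
import numpy as np

WINDOW, STRIDE = 16, 4  # assumed: frames per window; frames between inferences

def classify_stream(frames, model, preprocess):
    """Yield (label, confidence) over a continuous, untrimmed frame stream."""
    # FIFO queues: appending a new frame automatically evicts the oldest one.
    rgb_buf = deque(maxlen=WINDOW)
    skel_buf = deque(maxlen=WINDOW)
    for i, frame in enumerate(frames):
        rgb, skel = preprocess(frame)  # hypothetical hook: RGB + skeleton frame
        rgb_buf.append(rgb)
        skel_buf.append(skel)
        if len(rgb_buf) == WINDOW and i % STRIDE == 0:
            x_rgb = np.stack(rgb_buf)[np.newaxis]   # (1, WINDOW, H, W, 3)
            x_skel = np.stack(skel_buf)[np.newaxis]
            probs = model.predict([x_rgb, x_skel], verbose=0)[0]
            yield int(np.argmax(probs)), float(probs.max())
```

When a new activity begins mid-window, the earliest inferences see only its opening frames; this is the partial-observation regime in which the prediction model above still reaches over 97% accuracy from 50% of the onset.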

Technologies Utilized

  • Machine Learning Models:
    • Convolutional Neural Networks (CNNs): VGG-16 for feature extraction.
    • Recurrent Neural Networks (RNNs): Long Short-Term Memory (LSTM) for temporal feature learning.
  • Data Processing:
    • Motion History Images (MHI): For capturing dynamic gesture features.
    • Hand Skeleton Extraction: Using MediaPipe for precise hand landmark detection (both pre-processing steps are sketched after this list).
  • Software Development: Python, TensorFlow, Keras.
  • Hardware and Frameworks:
    • Cameras: Logitech C920 for high-resolution video capture.
    • Computing: NVIDIA GeForce RTX 3090 for accelerated deep learning model training and inference.
  • Data Structures: FIFO queues for managing and processing video frame sequences.
  • Transfer Learning: Applied to pre-trained VGG-16 models to enhance feature extraction from limited assembly activity data.
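As a concrete illustration of the two data-processing steps listed above, the sketch below renders hand-skeleton frames with the MediaPipe Hands API and performs a simple NumPy motion-history-image update; the decay constant `tau` and motion threshold are illustrative assumptions, not values from the paper:

```python
import cv2
import mediapipe as mp
import numpy as np

hands = mp.solutions.hands.Hands(static_image_mode=False, max_num_hands=2)

def hand_skeleton_frame(bgr_frame):
    """Draw detected hand landmarks on a black canvas (the 'hand skeleton frame')."""
    result = hands.process(cv2.cvtColor(bgr_frame, cv2.COLOR_BGR2RGB))
    canvas = np.zeros_like(bgr_frame)
    if result.multi_hand_landmarks:
        for landmarks in result.multi_hand_landmarks:
            mp.solutions.drawing_utils.draw_landmarks(
                canvas, landmarks, mp.solutions.hands.HAND_CONNECTIONS)
    return canvas

def update_mhi(mhi, prev_gray, gray, tau=15, thresh=30):
    """One MHI step: pixels that moved are reset to tau, all others decay by one."""
    moved = cv2.absdiff(gray, prev_gray) > thresh
    return np.where(moved, tau, np.maximum(mhi.astype(np.int32) - 1, 0))
```

Feeding the per-frame skeleton canvases (or MHIs) into the second VGG-16 branch alongside the raw RGB frames yields the two input streams the model expects.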

Impact

This project advances the precision of activity recognition systems in manufacturing by enabling the classification of fine-grained assembly activities with high accuracy. The integration of multi-visual modalities (RGB and hand skeleton frames) and the development of a two-stage neural network architecture enhance the system’s ability to discern subtle differences between similar activities, leading to better resource allocation, quality control, and safety measures in smart factories. The real-time performance ensures that the system can be effectively deployed in dynamic manufacturing environments, contributing to increased productivity and reduced operational costs. Additionally, the creation and sharing of a specialized dataset facilitate further research and development in the field of fine-grained activity recognition.

Selected Publications

Please refer to the full publication for detailed references and further reading.

1. Chen, H., Zendehdel, N., Leu, M.C., et al. Fine-grained activity classification in assembly based on multi-visual modalities. J Intell Manuf 35, 2215–2233 (2024).