Vision Transformer-Based Multi-Class Classification for Simulated 6DoF Robot

GENERATING LARGE-SCALE SYNTHETIC DATASET AND CODING A VISION TRANSFORMER (VIT) MODEL FROM SCRATCH

Implementing a 6-DoF robotic arm in an industrial context involves careful planning, especially to ensure safety and efficient use of space. To tackle object recognition, we generate synthetic images with the Unity Perception package, which lets us build a diverse dataset by varying object characteristics and environmental conditions.
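In Unity Perception, this variation is driven by randomizers (lighting, pose, texture) that run inside the Unity editor in C#. As a language-agnostic illustration of the same idea, the sketch below (Python with NumPy assumed; the brightness range and image size are arbitrary choices, not values from the project) applies random lighting and orientation changes to a rendered frame:

```python
import numpy as np

rng = np.random.default_rng(0)

def randomize(image):
    """Return a randomly varied copy of one rendered frame.

    Illustrative stand-in for Perception randomizers, which in
    practice run inside Unity rather than in Python.
    """
    brightness = rng.uniform(0.6, 1.4)   # lighting variation
    quarter_turns = rng.integers(0, 4)   # orientation variation
    varied = np.clip(image * brightness, 0.0, 1.0)
    return np.rot90(varied, quarter_turns, axes=(0, 1))

# Placeholder for a rendered 64x64 RGB frame, values in [0, 1].
base_frame = rng.random((64, 64, 3))
samples = [randomize(base_frame) for _ in range(8)]
print(len(samples), samples[0].shape)
```

Each call produces a distinct variant of the same scene, which is how a single object model can yield many training images.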

We opted for a Vision Transformer (ViT) model for object classification. Training from scratch initially proved difficult given the limited dataset size. By adopting a pre-trained ViT model and applying transfer learning, we reached 100% accuracy on both the training and test sets.
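A from-scratch ViT classifier of the kind described above can be sketched as follows (a minimal sketch assuming PyTorch; the image size, embedding dimension, depth, and five-class head are illustrative, not the project's actual hyperparameters). The transfer-learning variant would instead load pretrained weights, for example torchvision's `vit_b_16`, freeze the backbone, and replace only the classification head.

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Split an image into patches and project each to an embedding."""
    def __init__(self, img_size, patch_size, in_ch, dim):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        self.proj = nn.Conv2d(in_ch, dim, kernel_size=patch_size, stride=patch_size)

    def forward(self, x):
        # (B, C, H, W) -> (B, num_patches, dim)
        return self.proj(x).flatten(2).transpose(1, 2)

class ViT(nn.Module):
    def __init__(self, img_size=64, patch_size=8, dim=192, depth=4,
                 heads=3, num_classes=5):
        super().__init__()
        self.patch_embed = PatchEmbed(img_size, patch_size, 3, dim)
        n = self.patch_embed.num_patches
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, n + 1, dim))
        layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, dim_feedforward=dim * 4,
            batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.norm = nn.LayerNorm(dim)
        self.head = nn.Linear(dim, num_classes)  # swap this head for transfer learning

    def forward(self, x):
        x = self.patch_embed(x)
        cls = self.cls_token.expand(x.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1) + self.pos_embed
        x = self.encoder(x)
        # Classify from the [CLS] token representation.
        return self.head(self.norm(x[:, 0]))

model = ViT()
logits = model(torch.randn(2, 3, 64, 64))  # batch of 2 dummy images
print(logits.shape)
```

Only the final `head` layer depends on the number of object classes, which is what makes swapping in a pretrained backbone straightforward.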

Testing the robotic arm and its AI component in a real-world setting poses logistical challenges, such as potential disruption of existing workflows. To address this, we suggest conducting simulations that mimic the environment the robot will actually operate in. This allows the robot's task execution to be refined without interrupting production, and it ensures that the AI's object classification capabilities are tuned to the specific conditions of the deployment site.