VLM-guided pick-and-place
Teaching machines to understand and act in physical environments. Edge inference, real-time adaptation, safety-critical systems.
We're connecting LLMs and vision models to physical robots: running inference at the edge, fusing sensor data in real time, and building systems that adapt when the environment changes.
The focus is practical: warehouse automation that handles unexpected obstacles, inspection systems that flag anomalies, and collaborative robots that respond to natural gestures and voice commands.
What we're building, testing, and learning.
Using Claude Sonnet vision to identify and describe objects, then translating that understanding into robotic arm movements. Testing on a 6-DOF arm with an RGB-D camera.
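As a rough illustration, the loop we're prototyping looks something like the sketch below. It uses the Anthropic Python SDK for the vision call; the camera wrapper, arm driver, and the exact Sonnet model id are placeholder assumptions, not our actual interfaces.

```python
# Minimal sketch of the VLM-guided pick pipeline. Assumes the Anthropic Python SDK,
# a hypothetical RGB-D camera wrapper, and a hypothetical 6-DOF arm driver.
import base64
import json

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment


def describe_scene(rgb_png_bytes: bytes) -> list[dict]:
    """Ask the VLM for graspable objects and their approximate pixel centers as JSON."""
    response = client.messages.create(
        model="claude-sonnet-4-20250514",  # assumption: substitute the Sonnet model you use
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": [
                {"type": "image", "source": {
                    "type": "base64", "media_type": "image/png",
                    "data": base64.b64encode(rgb_png_bytes).decode()}},
                {"type": "text", "text": (
                    "List graspable objects as JSON only: "
                    '[{"label": str, "u": int, "v": int}] where (u, v) is the pixel center.')},
            ],
        }],
    )
    return json.loads(response.content[0].text)


def pick(label: str, camera, arm) -> None:
    """Find the named object, deproject its pixel to 3D using depth, and command a pick."""
    rgb, depth = camera.capture()            # hypothetical RGB-D capture call
    objects = describe_scene(rgb)
    target = next(o for o in objects if o["label"] == label)
    xyz = camera.deproject(target["u"], target["v"], depth)  # pixel + depth -> camera frame
    arm.pick_at(camera.to_base_frame(xyz))   # hypothetical arm driver call
```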
Feeding obstacle descriptions to an LLM and having it generate trajectory waypoints in natural language, then parsing them into robot commands.
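A minimal sketch of that flow, with the usual caveats: the workspace limits, the waypoint JSON schema, and the model id are illustrative assumptions.

```python
# Sketch of the waypoint-generation experiment: describe obstacles, ask for waypoints,
# parse the JSON, and bounds-check everything before it becomes a command.
import json

import anthropic

client = anthropic.Anthropic()
WORKSPACE = {"x": (0.2, 0.7), "y": (-0.4, 0.4), "z": (0.05, 0.5)}  # metres, assumed limits


def plan_waypoints(obstacles: str, goal_xyz: tuple[float, float, float]) -> list[list[float]]:
    prompt = (
        f"Obstacles: {obstacles}\n"
        f"Move the gripper to {goal_xyz} (metres, robot base frame). "
        'Reply with JSON only: {"waypoints": [[x, y, z], ...]} avoiding the obstacles.'
    )
    response = client.messages.create(
        model="claude-sonnet-4-20250514",  # assumption
        max_tokens=512,
        messages=[{"role": "user", "content": prompt}],
    )
    waypoints = json.loads(response.content[0].text)["waypoints"]
    # Reject anything outside the reachable workspace before it ever reaches the arm.
    for x, y, z in waypoints:
        for value, (lo, hi) in zip((x, y, z), (WORKSPACE["x"], WORKSPACE["y"], WORKSPACE["z"])):
            if not lo <= value <= hi:
                raise ValueError(f"waypoint {(x, y, z)} leaves the workspace")
    return waypoints
```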
Training a small model to predict 'this movement is probably unsafe' from camera + proprioception data. Goal: a cheap safety net that catches edge cases.
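For flavour, a toy version of such a classifier in PyTorch; the architecture, input sizes, and veto threshold are illustrative assumptions, not a description of the trained model.

```python
# Tiny safety classifier sketch: fuse a downsampled camera frame with joint state
# into a single "probably unsafe" score used to veto a commanded movement.
import torch
import torch.nn as nn


class SafetyNet(nn.Module):
    def __init__(self, n_joints: int = 6):
        super().__init__()
        self.vision = nn.Sequential(            # tiny CNN over a 64x64 grayscale crop
            nn.Conv2d(1, 16, 5, stride=2), nn.ReLU(),
            nn.Conv2d(16, 32, 5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.head = nn.Sequential(              # proprioception = joint angles + velocities
            nn.Linear(32 + 2 * n_joints, 64), nn.ReLU(),
            nn.Linear(64, 1),
        )

    def forward(self, frame: torch.Tensor, joints: torch.Tensor) -> torch.Tensor:
        feats = self.vision(frame)
        return torch.sigmoid(self.head(torch.cat([feats, joints], dim=-1)))


# Usage: block the command if the predicted unsafe probability crosses a threshold.
model = SafetyNet()
frame = torch.randn(1, 1, 64, 64)      # stand-in camera crop
joints = torch.randn(1, 12)            # 6 angles + 6 velocities
if model(frame, joints).item() > 0.5:  # threshold is a tunable assumption
    print("veto: movement flagged as probably unsafe")
```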
Natural language commands → robot actions. 'Pick up the red thing next to the keyboard'-style interaction.
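One way to wire that up is with tool use, roughly as sketched below; the tool definition and the robot primitive it dispatches to are hypothetical.

```python
# Sketch of the natural-language command path using Anthropic tool use. The tool name,
# its schema, and robot.pick() are hypothetical placeholders for our robot primitives.
import anthropic

client = anthropic.Anthropic()
TOOLS = [{
    "name": "pick_object",
    "description": "Pick up an object identified by a short description of its appearance and location.",
    "input_schema": {
        "type": "object",
        "properties": {"description": {"type": "string"}},
        "required": ["description"],
    },
}]


def handle_command(utterance: str, robot) -> None:
    """Turn 'pick up the red thing next to the keyboard' into a call on the robot interface."""
    response = client.messages.create(
        model="claude-sonnet-4-20250514",  # assumption
        max_tokens=256,
        tools=TOOLS,
        messages=[{"role": "user", "content": utterance}],
    )
    for block in response.content:
        if block.type == "tool_use" and block.name == "pick_object":
            robot.pick(block.input["description"])   # hypothetical robot primitive
```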
Things we're still figuring out.
How do you maintain safety guarantees when an LLM is in the control loop?
What's the right abstraction level for LLM→robot communication?
Can we get VLM inference fast enough for real-time adaptation (under 100 ms)?
How much can sim-to-real transfer reduce the need for physical robot training?
Interested in this research? Have a related problem?
Let's talk: reach out to us at info@deepklarity.com.