4D reasoning from demonstration data for VLA

Vision-Language-Action (VLA) models are typically trained through imitation learning, which teaches policies to reproduce demonstrated actions but provides limited supervision about the conditions that define task success.

We propose a framework that automatically extracts executable 3D task verifiers from demonstrations and uses them to improve policy learning beyond imitation.

Given task instructions and demonstration trajectories, a vision-language model infers stage-wise success constraints grounded in reconstructed 3D scene geometry. Starting from a pretrained VLA, we generate counterfactual action sequences and imagine their consequences in a reconstructed 3D world without executing them on a robot. The inferred verifiers evaluate these imagined interactions, producing additional successful and failed trajectories for training.
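To make the idea of stage-wise success constraints concrete, here is a minimal Python sketch. It is not the framework's actual implementation. It assumes that a scene is a dictionary of 3D object positions and that a rollout is a list of such imagined scenes. The names `near`, `above`, `Stage`, `verify`, and `label_rollouts` are hypothetical. In the pipeline described above, the constraints would be inferred by a vision-language model and the scenes would come from the reconstructed 3D world; in this sketch both are hand-written.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Vec3 = Tuple[float, float, float]
Scene = Dict[str, Vec3]            # assumed state: object name -> (x, y, z) in metres
Predicate = Callable[[Scene], bool]


def near(a: str, b: str, tol: float) -> Predicate:
    """Predicate: objects a and b are within tol metres of each other."""
    def check(scene: Scene) -> bool:
        dist = sum((p - q) ** 2 for p, q in zip(scene[a], scene[b])) ** 0.5
        return dist <= tol
    return check


def above(a: str, b: str, min_gap: float) -> Predicate:
    """Predicate: object a is at least min_gap metres higher than object b."""
    def check(scene: Scene) -> bool:
        return scene[a][2] - scene[b][2] >= min_gap
    return check


@dataclass
class Stage:
    name: str
    constraints: List[Predicate]   # all must hold for the stage to count as reached


def verify(stages: List[Stage], rollout: List[Scene]) -> bool:
    """A rollout succeeds if every stage is reached, in order (one stage per step)."""
    idx = 0
    for scene in rollout:
        if idx < len(stages) and all(c(scene) for c in stages[idx].constraints):
            idx += 1
    return idx == len(stages)


def label_rollouts(stages: List[Stage], rollouts: List[List[Scene]]):
    """Split imagined rollouts into verifier-labelled successes and failures."""
    successes, failures = [], []
    for rollout in rollouts:
        (successes if verify(stages, rollout) else failures).append(rollout)
    return successes, failures


# Hypothetical "pick up the cube" task with two stages.
STAGES = [
    Stage("grasp", [near("gripper", "cube", 0.02)]),
    Stage("lift", [near("gripper", "cube", 0.02), above("cube", "table", 0.10)]),
]
START = {"gripper": (0.0, 0.0, 0.30), "cube": (0.2, 0.0, 0.02), "table": (0.0, 0.0, 0.0)}
GRASPED = {"gripper": (0.2, 0.0, 0.03), "cube": (0.2, 0.0, 0.02), "table": (0.0, 0.0, 0.0)}
LIFTED = {"gripper": (0.2, 0.0, 0.16), "cube": (0.2, 0.0, 0.15), "table": (0.0, 0.0, 0.0)}
DROPPED = {"gripper": (0.2, 0.0, 0.16), "cube": (0.2, 0.0, 0.02), "table": (0.0, 0.0, 0.0)}

successes, failures = label_rollouts(
    STAGES, [[START, GRASPED, LIFTED], [START, GRASPED, DROPPED]]
)
```

In this toy setup the first rollout reaches both stages and lands in `successes`, while the second leaves the cube on the table and lands in `failures`. Both labelled sets could then serve as additional training trajectories.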

Unlike reward-learning approaches, our method represents task knowledge as structured verification predicates rather than scalar rewards. We hypothesize that learning from verifier-guided imagined successes and failures improves task understanding, robustness, and generalization without requiring any additional robot interaction.
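One way to read the contrast between structured predicates and scalar rewards is sketched below in Python. This is an illustration under assumptions, not the authors' formulation: the drawer task, the state keys, the thresholds, and the names `PREDICATES`, `Verdict`, `run_verifier`, and `scalar_reward` are all hypothetical. The point is only that a predicate-based verdict can name the violated condition, whereas a binary scalar reward collapses different failures into the same number.

```python
from typing import Callable, Dict, NamedTuple, Optional

State = Dict[str, float]  # assumed flat summary of the final imagined scene


class Verdict(NamedTuple):
    success: bool
    failed: Optional[str]  # name of the first violated predicate, if any


# Hypothetical predicates for a "put the object in the open drawer" task.
PREDICATES: Dict[str, Callable[[State], bool]] = {
    "drawer_open": lambda s: s["drawer_opening_m"] >= 0.15,
    "object_inside": lambda s: s["object_in_drawer"] == 1.0,
}


def run_verifier(predicates: Dict[str, Callable[[State], bool]], state: State) -> Verdict:
    """Check predicates in order and report which one, if any, is violated."""
    for name, predicate in predicates.items():
        if not predicate(state):
            return Verdict(False, name)
    return Verdict(True, None)


def scalar_reward(state: State) -> float:
    """Binary scalar reward for comparison: 1.0 on success, 0.0 otherwise."""
    return 1.0 if run_verifier(PREDICATES, state).success else 0.0


drawer_shut = {"drawer_opening_m": 0.05, "object_in_drawer": 1.0}
object_missed = {"drawer_opening_m": 0.20, "object_in_drawer": 0.0}

# Both failures look identical to the scalar reward...
same_scalar = scalar_reward(drawer_shut) == scalar_reward(object_missed)
# ...but the structured verdicts point at different violated conditions.
verdict_a = run_verifier(PREDICATES, drawer_shut)
verdict_b = run_verifier(PREDICATES, object_missed)
```

Here `verdict_a.failed` is `"drawer_open"` and `verdict_b.failed` is `"object_inside"`, even though both states receive a scalar reward of 0.0.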

References

Tran, T., Nguyen, H. M. D., Tran, H.-C., Barz, M., Doan, K. D., Wattenhofer, R., Vien, N. A., Niepert, M., Sonntag, D., & Swoboda, P. (2025). How many tokens do 3D point cloud transformer architectures really need? In: The Thirty-Ninth Annual Conference on Neural Information Processing Systems (NeurIPS 2025), December 2-12, USA. Advances in Neural Information Processing Systems, 12/2025.

Contact

Tuan Tran (Tuan.Tran@dfki.de)

Published by Franziska Scheurer on June 22, 2026

