ENACT: Evaluating Embodied Cognition with World Modeling of Egocentric Interaction
ENACT evaluates embodied cognition through world modeling of egocentric interaction. Notably, its dataset pipeline is simple and scalable by design.
Dataset Viewer
Explore the ENACT dataset for embodied world modeling tasks.
Leaderboard
Performance comparison of different models on our embodied cognition benchmark. Dark highlighting indicates the best result within each category; light highlighting denotes the second-best.
| Model | Fwd L=3 | Fwd L=4 | Fwd L=5 | Fwd L=6 | Fwd L=7 | Fwd L=8 | Fwd L=9 | Fwd L=10 | Inv L=3 | Inv L=4 | Inv L=5 | Inv L=6 | Inv L=7 | Inv L=8 | Inv L=9 | Inv L=10 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Key Findings
Our research reveals critical insights into the spatial reasoning capabilities of VLMs through comprehensive forward and inverse world modeling tasks.
Findings from Forward/Inverse World Modeling Tasks
Key Takeaways
- Inverse world modeling consistently surpasses forward world modeling, and the margin grows as the horizon \(L\) increases.
- Accuracy declines steadily as the step length \(L\) increases, and all VLMs drop more sharply at long horizons.
- Humans demonstrate near-ceiling performance across all tested step lengths.
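The horizon trends above can be checked with a minimal tabulation of accuracy per step length. This is an illustrative sketch, not the benchmark's evaluation code; the `(task, L, correct)` record format is a hypothetical stand-in for per-sample results.

```python
# Hedged sketch: accuracy per horizon L for forward vs. inverse world
# modeling. The record format (task, L, correct) is an assumption for
# illustration, not ENACT's actual data schema.
from collections import defaultdict

def accuracy_by_horizon(records):
    """records: iterable of (task, L, correct), task in {'forward', 'inverse'}."""
    tally = defaultdict(lambda: [0, 0])  # (task, L) -> [num_correct, num_total]
    for task, L, correct in records:
        tally[(task, L)][0] += int(correct)
        tally[(task, L)][1] += 1
    return {key: correct / total for key, (correct, total) in tally.items()}
```

Comparing the resulting `('forward', L)` and `('inverse', L)` entries across `L` makes the forward/inverse gap and its growth with horizon directly visible.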
Findings from Probing Tasks
Key Takeaways
- GPT-5 mini and InternVL3.5-241B are robust to image realism variations on our tasks.
- Large apertures, fisheye lenses, and elevated camera heights greatly degrade model performance.
- GPT-5 mini and InternVL3.5-241B are robust to the robot's appearance.
- VLMs exhibit a significant right-handed bias, mirroring the prevalence of right-handedness in humans.
Performance Analysis
Probing experiment results with GPT-5 mini on ENACT. Heatmaps show p-values from two-tailed unpaired t-tests against the baseline on Pairwise Accuracy; \(p<0.05\) is considered significant, and darker red indicates greater significance. \(\Delta\) is the performance change from the baseline; a setting is worse than the baseline when the difference is significant and \(\Delta<0\). C.2 reports the model's performance on the left- and right-hand predicates, where Mixing is the proportion of ground-truth left (or right) cases predicted as the other hand. \(\pm\) denotes standard error.
Probing experiment results with InternVL3.5-241B-A28B on ENACT. Heatmaps show p-values from two-tailed unpaired t-tests against the baseline on Pairwise Accuracy; \(p<0.05\) is considered significant, and darker red indicates greater significance. \(\Delta\) is the performance change from the baseline; a setting is worse than the baseline when the difference is significant and \(\Delta<0\). C.2 reports the model's performance on the left- and right-hand predicates, where Mixing is the proportion of ground-truth left (or right) cases predicted as the other hand. Note that although fewer settings reach significance for InternVL3.5-241B-A28B than for GPT-5 mini, \(|\Delta|\) across unnatural camera configurations remains high (\(>0.05\)) in the settings that are significant for GPT-5 mini.
Image Realism
Key Takeaways
- GPT-5 mini and InternVL3.5-241B are robust to image realism variations on our tasks.
Camera Configurations
Key Takeaways
- Models perform best on images that resemble what humans typically see.
- Large apertures, fisheye lenses, and elevated camera heights greatly degrade model performance.
Camera FOV
Camera Height
Embodied Biases
Key Takeaways
- GPT-5 mini and InternVL3.5-241B are robust to the robot's appearance.
- VLMs exhibit a significant right-handed bias, mirroring the prevalence of right-handedness in humans.
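The Mixing rate used to quantify this handedness bias (defined in the figure captions above) can be sketched as a per-hand confusion proportion with its standard error. This is an illustrative sketch with hypothetical labels, not the benchmark's metric code.

```python
# Hedged sketch of the "Mixing" rate: among ground-truth cases of one hand,
# the fraction predicted as the other hand, with the standard error of that
# proportion. Label strings ("left"/"right") are an illustrative assumption.
import math

def mixing_rate(gt_hands, pred_hands, hand="left"):
    """Proportion of ground-truth `hand` cases predicted as the other hand."""
    cases = [(g, p) for g, p in zip(gt_hands, pred_hands) if g == hand]
    n = len(cases)
    mixed = sum(1 for g, p in cases if p != g)
    rate = mixed / n
    se = math.sqrt(rate * (1 - rate) / n)  # standard error of a proportion
    return rate, se
```

A right-handed bias would show up as a high Mixing rate for ground-truth left-hand cases (left predicted as right) alongside a low rate in the other direction.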
Robot Appearance
Citation
If you find our work useful in your research, please cite:
@article{enact2025,
  title={ENACT: Embodied Cognition through World Modeling from Egocentric Interaction},
  author={ENACT Team},
  year={2025}
}