ENACT

ENACT evaluates embodied cognition through world modeling from egocentric interaction. Notably, its dataset pipeline is simple and scalable by design.

Qineng Wang1*, Wenlong Huang2*, Yu Zhou3, Hang Yin2, Tianwei Bao1, Jianwen Lyu1, Weiyu Liu2

Ruohan Zhang2†, Jiajun Wu2†, Li Fei-Fei2†, Manling Li1†

*Equal contribution, †Equal advising

1Northwestern University, 2Stanford University, 3UCLA


Dataset Viewer

Explore the ENACT dataset for embodied world modeling tasks.


Leaderboard

Performance comparison of models on our embodied cognition benchmark. Dark highlighting marks the best result within each category, and light highlighting the second best.


Leaderboard structure: models are grouped into Proprietary Models, Open-Weight Models, and Human Performance, and each model is scored on Forward World Modeling and Inverse World Modeling at horizons \(L = 3\) through \(10\).

Key Findings

Our experiments on forward and inverse world modeling tasks reveal critical insights into the spatial reasoning capabilities of VLMs.

Findings from Forward/Inverse World Modeling Tasks

💡 Key Takeaways

  • Inverse world modeling consistently surpasses forward world modeling, and the margin grows as the horizon \(L\) increases.
  • Accuracy declines steadily with step length \(L\), and all VLMs drop more sharply at long horizons (see the sketch after this list).
  • Humans demonstrate near-ceiling performance across all tested step lengths.
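For concreteness, here is a minimal sketch of how accuracy-versus-horizon curves of this kind could be computed. The record fields (task, horizon, correct) are hypothetical stand-ins for illustration, not the released ENACT schema.

from collections import defaultdict

# Hypothetical per-question records: task is "forward" or "inverse",
# horizon is the step length L (3-10 in the leaderboard), and correct
# records whether the model chose the right option.
records = [
    {"task": "forward", "horizon": 3, "correct": True},
    {"task": "inverse", "horizon": 3, "correct": True},
    {"task": "forward", "horizon": 10, "correct": False},
    {"task": "inverse", "horizon": 10, "correct": True},
]

def accuracy_by_horizon(records):
    """Mean accuracy per (task, horizon) bucket."""
    buckets = defaultdict(list)
    for r in records:
        buckets[(r["task"], r["horizon"])].append(r["correct"])
    return {key: sum(vals) / len(vals) for key, vals in buckets.items()}

for (task, L), acc in sorted(accuracy_by_horizon(records).items()):
    print(f"{task:7s} L={L:2d}  accuracy={acc:.2f}")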

Findings from Probing Tasks

💡 Key Takeaways

  • GPT-5 mini and InternVL3.5-241B are robust to image realism variations on our tasks.
  • Large apertures, fisheye lenses, and elevated camera placements substantially degrade model performance.
  • GPT-5 mini and InternVL3.5-241B are robust to the robot's appearance.
  • VLMs exhibit a significant right-hand bias, mirroring the right-handedness prevalent in humans.

Performance Analysis

GPT-5 mini probing results

Probing experiment results with GPT-5 mini on ENACT. Heatmaps show two-tailed unpaired t-test results against the baseline, computed on Pairwise Accuracy; \(p<0.05\) is considered significant, and darker red means more significant. \(\Delta\) is the performance change from the baseline; if the test is significant and \(\Delta<0\), the setting is worse than the baseline. Panel C.2 reports performance on the robot's left- and right-hand predicates, where Mixing is the proportion of ground-truth left or right cases predicted as the other hand (i.e., mixing one hand into the other). \(\pm\) denotes standard error.
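As a rough sketch of the significance test described above, assuming per-sample Pairwise Accuracy outcomes are available as 0/1 arrays; the variable names and sample sizes are illustrative, not the authors' evaluation code.

import numpy as np
from scipy.stats import ttest_ind

# Hypothetical per-sample Pairwise Accuracy outcomes (1 = correct,
# 0 = incorrect) for the baseline and one probed setting.
rng = np.random.default_rng(0)
baseline = rng.binomial(1, 0.80, size=200).astype(float)
probed = rng.binomial(1, 0.72, size=200).astype(float)

# Two-tailed unpaired t-test against the baseline (two-sided by default).
t_stat, p_value = ttest_ind(probed, baseline)

delta = probed.mean() - baseline.mean()  # performance change from baseline
if p_value < 0.05 and delta < 0:
    verdict = "significantly worse than baseline"
else:
    verdict = "not significantly worse than baseline"
print(f"delta = {delta:+.3f}, p = {p_value:.4f} -> {verdict}")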

InternVL3.5-241B probing results

Probing experiment results with InternVL3.5-241B-A28B on ENACT. Heatmaps show two-tailed unpaired t-test p-values against the baseline, computed on Pairwise Accuracy; \(p<0.05\) is considered significant, and darker red means more significant. \(\Delta\) is the performance change from the baseline; if the test is significant and \(\Delta<0\), the setting is worse than the baseline. Panel C.2 reports performance on the robot's left- and right-hand predicates, where Mixing is the proportion of ground-truth left or right cases predicted as the other hand (i.e., mixing one hand into the other). Note that although the effects for InternVL3.5-241B-A28B reach significance less often than for GPT-5 mini, \(|\Delta|\) across the unnatural camera configurations remains high (\(>0.05\)) in the settings that are significant for GPT-5 mini.
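And a small sketch of the Mixing metric as the captions above define it (the share of ground-truth left/right cases predicted as the other hand). The label lists are made up, and reporting the binomial standard error of the proportion is our assumption, not necessarily how the paper computes its error bars.

import math

# Hypothetical ground-truth and predicted hand labels.
gt   = ["left", "left", "right", "right", "right", "left", "right"]
pred = ["left", "right", "right", "left", "right", "left", "right"]

def mixing_rate(gt, pred):
    """Proportion of left/right ground-truth cases predicted as the
    other hand, plus the standard error sqrt(p * (1 - p) / n)."""
    pairs = [(g, p) for g, p in zip(gt, pred) if g in ("left", "right")]
    n = len(pairs)
    mixed = sum(1 for g, p in pairs if p in ("left", "right") and p != g)
    p_hat = mixed / n
    se = math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat, se

p_hat, se = mixing_rate(gt, pred)
print(f"Mixing = {p_hat:.3f} +/- {se:.3f}")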

Image Realism

💡 Key Takeaways

  • GPT-5 mini and InternVL3.5-241B are robust to image realism variations on our tasks.
Qualitative examples under varied image realism: Forward and Inverse task sequences, each showing Current State, Next State 1, and Next State 2.

Camera Configurations

💡 Key Takeaways

  • Models perform best on images that resemble what humans typically see.
  • Large apertures, fisheye lenses, and elevated camera placements substantially degrade model performance.
Camera FOV: Forward and Inverse task sequences (Current State, Next State 1, Next State 2) under varied fields of view.
Camera Height: Forward and Inverse task sequences (Current State, Next State 1, Next State 2) under varied camera heights.

Embodied Biases

💡 Key Takeaways

  • GPT-5 mini and InternVL3.5-241B are robust to the robot's appearance.
  • VLMs exhibit a significant right-hand bias, mirroring the right-handedness prevalent in humans.
Robot Appearance: Forward and Inverse task sequences (Current State, Next State 1, Next State 2) with varied robot appearances.

Citation

If you find our work useful in your research, please cite:

@article{enact2025,
  title={ENACT: Embodied Cognition through World Modeling from Egocentric Interaction},
  author={ENACT Team},
  year={2025}
}