ENACT: Evaluating Embodied Cognition with World Modeling of Egocentric Interaction
ENACT evaluates embodied cognition through world modeling of egocentric interaction. Notably, its dataset pipeline is simple and scalable by design.
Dataset Viewer
Explore the ENACT dataset for embodied world modeling tasks.
Leaderboard
Performance comparison of different models on our embodied cognition benchmark. Dark highlighting indicates the best result within each category; light highlighting denotes the second-best.
| Model | Fwd L=3 | Fwd L=4 | Fwd L=5 | Fwd L=6 | Fwd L=7 | Fwd L=8 | Fwd L=9 | Fwd L=10 | Inv L=3 | Inv L=4 | Inv L=5 | Inv L=6 | Inv L=7 | Inv L=8 | Inv L=9 | Inv L=10 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Key Findings
Our research reveals critical insights into the spatial reasoning capabilities of VLMs through comprehensive forward and inverse world modeling tasks.
Findings from Forward/Inverse World Modeling Tasks
Key Takeaways
- Inverse world modeling consistently surpasses forward world modeling, and the margin grows as the horizon \(L\) increases.
- Accuracy declines steadily as the step length \(L\) increases, and all VLMs drop more sharply at long horizons.
- Humans demonstrate near-ceiling performance across all tested step lengths.
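The horizon trends above can be checked with a minimal tabulation of accuracy per step length. This is an illustrative sketch, not the benchmark's evaluation code; the `(task, L, correct)` record format is a hypothetical stand-in for per-sample results.

```python
# Hedged sketch: accuracy per horizon L for forward vs. inverse world
# modeling. The record format (task, L, correct) is an assumption for
# illustration, not ENACT's actual data schema.
from collections import defaultdict

def accuracy_by_horizon(records):
    """records: iterable of (task, L, correct), task in {'forward', 'inverse'}."""
    tally = defaultdict(lambda: [0, 0])  # (task, L) -> [num_correct, num_total]
    for task, L, correct in records:
        tally[(task, L)][0] += int(correct)
        tally[(task, L)][1] += 1
    return {key: correct / total for key, (correct, total) in tally.items()}
```

Comparing the resulting `('forward', L)` and `('inverse', L)` entries across `L` makes the forward/inverse gap and its growth with horizon directly visible.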
Findings from Probing Tasks
Key Takeaways
- GPT-5 mini and InternVL3.5-241B are robust to image realism variations on our tasks.
- Large apertures, fisheye lenses, and elevated camera heights greatly degrade model performance.
- GPT-5 mini and InternVL3.5-241B are robust to the robot's appearance.
- VLMs exhibit a significant right-handed bias, mirroring the prevalence of right-handedness in humans.
Performance Analysis
Probing experiment results with GPT-5 mini on ENACT. Heatmaps show p-values from two-tailed unpaired t-tests against the baseline on Pairwise Accuracy; \(p<0.05\) is considered significant, and darker red indicates greater significance. \(\Delta\) is the performance change from the baseline; a setting is worse than the baseline when the difference is significant and \(\Delta<0\). C.2 reports the model's performance on the left- and right-hand predicates, where Mixing is the proportion of ground-truth left (or right) cases predicted as the other hand. \(\pm\) denotes standard error.
Probing experiment results with InternVL3.5-241B-A28B on ENACT. Heatmaps show p-values from two-tailed unpaired t-tests against the baseline on Pairwise Accuracy; \(p<0.05\) is considered significant, and darker red indicates greater significance. \(\Delta\) is the performance change from the baseline; a setting is worse than the baseline when the difference is significant and \(\Delta<0\). C.2 reports the model's performance on the left- and right-hand predicates, where Mixing is the proportion of ground-truth left (or right) cases predicted as the other hand. Note that although fewer settings reach significance for InternVL3.5-241B-A28B than for GPT-5 mini, \(|\Delta|\) across unnatural camera configurations remains high (\(>0.05\)) in the settings that are significant for GPT-5 mini.
Image Realism
Key Takeaways
- GPT-5 mini and InternVL3.5-241B are robust to image realism variations on our tasks.
Camera Configurations
Key Takeaways
- Models perform best on images that resemble what humans typically see.
- Large apertures, fisheye lenses, and elevated camera heights greatly degrade model performance.
Camera FOV
Camera Height
Embodied Biases
Key Takeaways
- GPT-5 mini and InternVL3.5-241B are robust to the robot's appearance.
- VLMs exhibit a significant right-handed bias, mirroring the prevalence of right-handedness in humans.
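The Mixing rate used to quantify this handedness bias (defined in the figure captions above) can be sketched as a per-hand confusion proportion with its standard error. This is an illustrative sketch with hypothetical labels, not the benchmark's metric code.

```python
# Hedged sketch of the "Mixing" rate: among ground-truth cases of one hand,
# the fraction predicted as the other hand, with the standard error of that
# proportion. Label strings ("left"/"right") are an illustrative assumption.
import math

def mixing_rate(gt_hands, pred_hands, hand="left"):
    """Proportion of ground-truth `hand` cases predicted as the other hand."""
    cases = [(g, p) for g, p in zip(gt_hands, pred_hands) if g == hand]
    n = len(cases)
    mixed = sum(1 for g, p in cases if p != g)
    rate = mixed / n
    se = math.sqrt(rate * (1 - rate) / n)  # standard error of a proportion
    return rate, se
```

A right-handed bias would show up as a high Mixing rate for ground-truth left-hand cases (left predicted as right) alongside a low rate in the other direction.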
Robot Appearance
Citation
If you find our work useful in your research, please cite:
@article{enact2025,
  title={ENACT: Embodied Cognition through World Modeling from Egocentric Interaction},
  author={ENACT Team},
  year={2025}
}