doScenes Instructed Driving Challenge

Overview

The doScenes Instructed Driving Challenge evaluates whether natural language instructions can improve trajectory prediction for autonomous driving. Traditional models rely on past motion and scene context; doScenes adds human-authored descriptions of the intended maneuver, giving models privileged information about where the vehicle is meant to go.

Can models that incorporate text descriptions of intended maneuvers achieve better prediction accuracy than state-of-the-art history-only methods?

This challenge is built on doScenes, an instruction-augmented extension of nuScenes. Instructions cover maneuvers such as lane changes, turns, stopping, yielding, and speed adjustments, with multiple annotators providing diverse phrasings.

Task Definition

Given:

Visual scene input from multiple cameras
Historical trajectory information (past 2 seconds)
High-definition map data
Natural language instruction describing the intended maneuver

Participants must predict the ego vehicle’s future trajectory over the next 6 seconds. The key research question is whether models can effectively leverage the language instruction to improve prediction accuracy compared to baselines that use only visual, historical, and map information.
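As a rough illustration of the task interface, the sketch below models one sample and checks the shape of a predicted trajectory. The class name, field names, 2D (x, y) coordinate layout, and the 2 Hz sampling rate are all assumptions made for illustration (2 Hz matches the nuScenes keyframe rate, but this page does not state the rate or the official data format).

```python
from dataclasses import dataclass
from typing import Any, Dict

import numpy as np

SAMPLE_HZ = 2        # assumption: nuScenes keyframe rate; not stated on this page
HISTORY_SECONDS = 2  # from the task definition
FUTURE_SECONDS = 6   # from the task definition
HISTORY_STEPS = HISTORY_SECONDS * SAMPLE_HZ
FUTURE_STEPS = FUTURE_SECONDS * SAMPLE_HZ


@dataclass
class InstructedSample:
    """Hypothetical container for the four inputs listed in the task definition."""
    camera_images: Dict[str, np.ndarray]  # camera name -> image array
    history_xy: np.ndarray                # shape (HISTORY_STEPS, 2), past ego positions
    map_data: Any                         # HD map features, format left open
    instruction: str                      # natural language instruction


def check_prediction(pred_xy) -> np.ndarray:
    """Return the prediction as a float array, or raise if its shape is unexpected."""
    pred_xy = np.asarray(pred_xy, dtype=float)
    if pred_xy.shape != (FUTURE_STEPS, 2):
        raise ValueError(
            f"expected shape ({FUTURE_STEPS}, 2), got {pred_xy.shape}"
        )
    return pred_xy
```

Under these assumptions a model consumes an `InstructedSample` and returns one (x, y) position per future time step; a different sampling rate only changes `SAMPLE_HZ`.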

Evaluation Metrics

Submissions are evaluated using standard trajectory prediction metrics:

Average Displacement Error (ADE): Mean L2 distance between predicted and ground-truth positions across all time steps.
Final Displacement Error (FDE): L2 distance between predicted and ground-truth positions at the final time step.
Instruction Conditioning Gain (Delta ADE): Improvement in ADE relative to a history-only baseline, measuring the benefit of incorporating language instructions.

ADE reporting specification: ADE is computed per scene and then averaged across scenes. We report the Q97.5-filtered mean, which removes the 2.5% of scenes with the highest error so that catastrophic outliers do not dominate the average.
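The metric definitions above can be sketched with NumPy as follows. This is not the official evaluation script: the function names, the array layout (scenes, time steps, xy), and the use of `np.percentile` with a less-than-or-equal cutoff for the Q97.5 filter are assumptions, and the official scorer may handle ties or interpolation differently.

```python
import numpy as np


def displacement_errors(pred, gt):
    """Per-scene ADE and FDE.

    pred, gt: arrays of shape (num_scenes, num_steps, 2) holding (x, y) positions.
    Returns two arrays of shape (num_scenes,).
    """
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    dist = np.linalg.norm(pred - gt, axis=-1)  # L2 distance at every time step
    ade = dist.mean(axis=1)                    # mean over time steps
    fde = dist[:, -1]                          # distance at the final time step
    return ade, fde


def q975_filtered_mean(per_scene_errors, q=97.5):
    """Mean of per-scene errors after dropping scenes above the q-th percentile."""
    errors = np.asarray(per_scene_errors, dtype=float)
    threshold = np.percentile(errors, q)
    kept = errors[errors <= threshold]         # assumption: ties at the cutoff are kept
    return kept.mean()
```

A reported ADE under these assumptions would be `q975_filtered_mean(ade)`, where `ade` is the first value returned by `displacement_errors`.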

The Instruction Conditioning Gain is computed as:

Delta ADE = ADE_baseline − ADE_instruction (1)

A positive Delta ADE indicates that language instructions improve trajectory prediction relative to the history-only baseline; a negative value indicates that they degrade it.
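As a worked example of Equation (1), the snippet below applies it to the two baseline ADE values reported on this page's leaderboard (2.879 for history-only, 2.929 for instruction-conditioned). The function name is a hypothetical helper, not part of any official toolkit.

```python
def instruction_conditioning_gain(ade_baseline, ade_instruction):
    """Delta ADE = ADE_baseline - ADE_instruction; positive means language helps."""
    return ade_baseline - ade_instruction


# Baseline values taken from the leaderboard on this page.
gain = instruction_conditioning_gain(2.879, 2.929)
print(f"{gain:+.3f}")  # prints -0.050
```

The negative result matches the Delta ADE listed for the instruction baseline: in that published baseline, adding language slightly increased ADE.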

Evaluation Tracks

Participants may submit to one or more tracks:

Full Multimodal: Uses all available inputs (cameras, history, map, and language instruction).
Language + History: Uses only historical trajectory and language instruction (no visual input).
Ablation: Participants report results with and without language conditioning to demonstrate the instruction benefit.

Goals and Expected Outcomes

The doScenes Instructed Driving Challenge aims to:

Establish instruction-conditioned prediction as a research paradigm: Demonstrate that language instructions provide actionable information for trajectory prediction.
Benchmark vision-language models for driving: Evaluate how well current VLM architectures can ground natural language instructions in driving behavior.
Quantify the instruction benefit: Measure the gap between history-only and instruction-conditioned prediction, providing motivation for human-in-the-loop autonomy systems.
Build community: Foster collaboration between the computer vision, NLP, and autonomous driving research communities.

We hypothesize that instruction-conditioned models will show significant improvements over history-only baselines, particularly for:

Ambiguous scenarios where multiple future trajectories are plausible
Complex maneuvers requiring multi-step planning
Rare or unusual driving situations

Leaderboard

Test-set leaderboard for instruction-conditioned trajectory prediction. Baselines are listed first.

Rank	Team / Model	Track	ADE	FDE	Delta ADE	Submission Date
—	History-Only (no language)	Full Multimodal	2.879	---	---	---
—	Instruction (language)	Full Multimodal	2.929	---	-0.050	---
1	TBD
2	TBD
3	TBD

Baselines are from the paper: Martinez-Sanchez et al., Natural Language Instructions for Scene-Responsive Human-in-the-Loop Motion Planning in Autonomous Driving using Vision-Language-Action Models, arXiv:2602.04184, 2026.

Dataset & Data Splits

doScenes extends the nuScenes training and validation splits with human-authored instructions. Instructions are written by multiple annotators to capture natural variation in phrasing.

Instructions cover a variety of maneuvers including:

Lane changes (left/right)
Turns at intersections
Proceeding straight
Stopping and yielding
Speed adjustments

The diversity of annotators ensures that models must handle natural variation in instruction phrasing, rather than memorizing specific templates.

The test split is held out and labels are never released. Submit predictions to receive scores.

Dataset Statistics

Property	Value
Base dataset	nuScenes
Number of scenes	850
Instruction annotations	Human-authored
Annotators per scene	Multiple
Languages	English

Train / Validation / Test Splits

Split	Availability	Purpose	Notes
Training	Public	Model training	Includes instructions for all scenes
Validation	Public	Model selection and ablations	Use for tuning and reporting ablation results
Test	Held out	Final leaderboard scoring	Inputs only; labels never released

Test labels are not shared. Only prediction uploads are evaluated, and scores are returned within 24 hours.

Submission Instructions

Ready to submit? Use the official submission form.

Submission Checklist

Prediction CSV file in the required format
Public code repository link
Paper PDF link

Key Dates

Milestone	Date (AoE)	Details
Competition opens	February 11, 2026	Registration and dataset access
Final submission deadline	May 20, 2026	Leaderboard freeze
Workshop	June 22, 2026	Hybrid event + invited talks

Ethical Considerations

The doScenes dataset is built upon nuScenes, which uses public road data. All videos are anonymized with no identifiable faces or license plates. Language annotations are collected consensually and anonymously for research purposes. We encourage participants to consider the broader implications of instruction-conditioned autonomy, including:

Safety in deployment
Handling of adversarial or conflicting instructions
Accountability when human instructions influence autonomous behavior

Contact

Email: rossgreer@ucmerced.edu

Lab: https://mi3-lab.github.io/

Resources

Dataset and project page: https://github.com/rossgreer/doScenes

Citation

Roy et al., doScenes: An Autonomous Driving Dataset with Natural Language Instruction for Human Interaction and Vision-Language Navigation, arXiv:2412.05893, 2024. https://arxiv.org/abs/2412.05893

BibTeX

@misc{roy2024doscenesautonomousdrivingdataset,
      title={doScenes: An Autonomous Driving Dataset with Natural Language Instruction for Human Interaction and Vision-Language Navigation},
      author={Parthib Roy and Srinivasa Perisetla and Shashank Shriram and Harsha Krishnaswamy and Aryan Keskar and Ross Greer},
      year={2024},
      eprint={2412.05893},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2412.05893},
}

Martinez-Sanchez et al., Natural Language Instructions for Scene-Responsive Human-in-the-Loop Motion Planning in Autonomous Driving using Vision-Language-Action Models, arXiv:2602.04184, 2026. https://arxiv.org/abs/2602.04184

BibTeX

@misc{martinezsanchez2026naturallanguage,
  title={Natural Language Instructions for Scene-Responsive Human-in-the-Loop Motion Planning in Autonomous Driving using Vision-Language-Action Models},
  author={Angel Martinez-Sanchez and Parthib Roy and Ross Greer},
  year={2026},
  eprint={2602.04184},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2602.04184}
}

Related Work from the Authors

Greer et al., Vision and language: Novel Representations and Artificial intelligence for Driving Scene Safety Assessment and Autonomous Vehicle Planning, arXiv:2602.07680, 2026. https://arxiv.org/abs/2602.07680

BibTeX

@misc{greer2026visionlanguagenovelrepresentations,
      title={Vision and language: Novel Representations and Artificial intelligence for Driving Scene Safety Assessment and Autonomous Vehicle Planning},
      author={Ross Greer and Maitrayee Keskar and Angel Martinez-Sanchez and Parthib Roy and Shashank Shriram and Mohan Trivedi},
      year={2026},
      eprint={2602.07680},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2602.07680}
}

Greer et al., Looking and Listening Inside and Outside: Multimodal Artificial Intelligence Systems for Driver Safety Assessment and Intelligent Vehicle Decision-Making, arXiv:2602.07668, 2026. https://arxiv.org/abs/2602.07668

BibTeX

@misc{greer2026lookinglisteninginsideoutside,
      title={Looking and Listening Inside and Outside: Multimodal Artificial Intelligence Systems for Driver Safety Assessment and Intelligent Vehicle Decision-Making},
      author={Ross Greer and Laura Fleig and Maitrayee Keskar and Erika Maquiling and Giovanni Tapia Lopez and Angel Martinez-Sanchez and Parthib Roy and Jake Rattigan and Mira Sur and Alejandra Vidrio and Thomas Marcotte and Mohan Trivedi},
      year={2026},
      eprint={2602.07668},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2602.07668}
}