Overview

The doScenes Instructed Driving Challenge evaluates whether natural language instructions can improve trajectory prediction for autonomous driving. Traditional models rely only on past motion and scene context; doScenes adds human-authored intent descriptions, giving models privileged information about the intended maneuver.

Can models that incorporate text descriptions of intended maneuvers achieve better prediction accuracy than state-of-the-art history-only methods?

This challenge is built on doScenes, an instruction-augmented extension of nuScenes. Instructions cover maneuvers such as lane changes, turns, stopping, yielding, and speed adjustments, with multiple annotators providing diverse phrasings.

Task Definition

Given:

  • Visual scene input from multiple cameras
  • Historical trajectory information (past 2 seconds)
  • High-definition map data
  • Natural language instruction describing the intended maneuver

Participants must predict the ego vehicle’s future trajectory over the next 6 seconds. The key research question is whether models can effectively leverage the language instruction to improve prediction accuracy compared to baselines that use only visual, historical, and map information.
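As a rough illustration, the sketch below shows what a submission's prediction interface might look like. It assumes the common 2 Hz nuScenes keyframe rate (so 2 seconds of history correspond to 4 past poses and 6 seconds of future to 12 predicted poses); the names, shapes, and sampling rate are illustrative assumptions, not an official specification.

```python
# Illustrative sketch only -- not an official interface. Assumes 2 Hz keyframes:
# 2 s of history = 4 past poses, 6 s of future = 12 predicted poses.
from dataclasses import dataclass
from typing import Any, List
import numpy as np

@dataclass
class PredictionInput:
    camera_images: List[Any]   # multi-camera frames for the current keyframe
    history_xy: np.ndarray     # (4, 2) past ego x/y positions over the last 2 s
    map_features: Any          # HD-map representation of the model's choosing
    instruction: str           # e.g. "turn left at the next intersection"

def predict_future_trajectory(model, sample: PredictionInput) -> np.ndarray:
    """Return a (12, 2) array of future ego x/y positions covering the next 6 s."""
    trajectory = model(sample)
    assert trajectory.shape == (12, 2)
    return trajectory
```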

Evaluation Metrics

Submissions are evaluated using standard trajectory prediction metrics:

  • Average Displacement Error (ADE): Mean L2 distance between predicted and ground-truth positions across all time steps.
  • Final Displacement Error (FDE): L2 distance between predicted and ground-truth positions at the final time step.
  • Instruction Conditioning Gain (Delta ADE): Improvement in ADE relative to a history-only baseline, measuring the benefit of incorporating language instructions.

ADE reporting specification: ADE is reported as the Q97.5-filtered mean over scenes, i.e., the top 2.5% highest-error scenes are removed before averaging to prevent catastrophic outliers from dominating the score.

The Instruction Conditioning Gain is computed as:

Delta ADE = ADE_baseline − ADE_instruction

A positive Delta ADE indicates that language conditioning improves prediction accuracy relative to the history-only baseline.
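
For concreteness, a minimal sketch of these metrics is shown below, assuming predictions and ground truth are given as (num_scenes, num_steps, 2) arrays of x/y positions in metres. The array layout is an assumption for illustration; the official evaluation script is authoritative.

```python
# Minimal metric sketch -- the (num_scenes, num_steps, 2) array layout is assumed
# for illustration and may differ from the official evaluation script.
import numpy as np

def per_scene_ade(pred: np.ndarray, gt: np.ndarray) -> np.ndarray:
    """Mean L2 distance over all predicted time steps, one value per scene."""
    return np.linalg.norm(pred - gt, axis=-1).mean(axis=-1)

def per_scene_fde(pred: np.ndarray, gt: np.ndarray) -> np.ndarray:
    """L2 distance at the final predicted time step, one value per scene."""
    return np.linalg.norm(pred[:, -1] - gt[:, -1], axis=-1)

def q975_filtered_mean(errors: np.ndarray) -> float:
    """Mean after removing the top 2.5% highest-error scenes (Q97.5 filtering)."""
    cutoff = np.quantile(errors, 0.975)
    return float(errors[errors <= cutoff].mean())

def instruction_conditioning_gain(ade_baseline: float, ade_instruction: float) -> float:
    """Delta ADE = ADE_baseline - ADE_instruction; positive means language helps."""
    return ade_baseline - ade_instruction
```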

Evaluation Tracks

Participants may submit to one or more tracks:

  1. Full Multimodal: Uses all available inputs (cameras, history, map, and language instruction).
  2. Language + History: Uses only historical trajectory and language instruction (no visual input).
  3. Ablation: Participants report results with and without language conditioning to demonstrate the instruction benefit.

Goals and Expected Outcomes

The doScenes Instructed Driving Challenge aims to:

  1. Establish instruction-conditioned prediction as a research paradigm: Demonstrate that language instructions provide actionable information for trajectory prediction.
  2. Benchmark vision-language models for driving: Evaluate how well current VLM architectures can ground natural language instructions in driving behavior.
  3. Quantify the instruction benefit: Measure the gap between history-only and instruction-conditioned prediction, providing motivation for human-in-the-loop autonomy systems.
  4. Build community: Foster collaboration between the computer vision, NLP, and autonomous driving research communities.

We hypothesize that instruction-conditioned models will show significant improvements over history-only baselines, particularly for:

  • Ambiguous scenarios where multiple future trajectories are plausible
  • Complex maneuvers requiring multi-step planning
  • Rare or unusual driving situations

Leaderboard

Test-set leaderboard for instruction-conditioned trajectory prediction. Baselines are listed first.

Rank | Team / Model | Track | ADE | FDE | Delta ADE | Submission Date
--- | History-Only baseline (no language) | Full Multimodal | 2.879 | --- | --- | ---
--- | Instruction baseline (with language) | Full Multimodal | 2.929 | --- | -0.050 | ---
1 | TBD
2 | TBD
3 | TBD

Baselines are from the paper: Martinez-Sanchez et al., Natural Language Instructions for Scene-Responsive Human-in-the-Loop Motion Planning in Autonomous Driving using Vision-Language-Action Models, arXiv:2602.04184, 2026.

Dataset & Data Splits

doScenes extends the nuScenes v1.0-trainval and v1.0-test splits with human-authored instructions. Instructions are written by multiple annotators to capture natural variation in phrasing.

Instructions cover a variety of maneuvers including:

  • Lane changes (left/right)
  • Turns at intersections
  • Proceeding straight
  • Stopping and yielding
  • Speed adjustments

The diversity of annotators ensures that models must handle natural variation in instruction phrasing, rather than memorizing specific templates.

Dataset Statistics

Property Value
Base dataset nuScenes
Allowed nuScenes viewpoint Egocentric point of view only
Number of scenes 850
Instruction annotations Human-authored
Annotators per scene Multiple
Languages English

Dataset Splits and Evaluation Protocol

The doScenes benchmark uses the standard nuScenes dataset splits as the basis for training, validation, and website evaluation.

Split | Source | Scenes | Challenge Role | Protocol
Training / Validation | nuScenes v1.0-trainval split | 850 | Model training and local validation | Models may use this split for training and validation
Website Testing | nuScenes v1.0-test split | 150 | Evaluations on the website | Used for submitted predictions and website leaderboard evaluation

The doScenes annotations provided in the repository correspond to scenes from these splits.
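
As a starting point, the official split scene names can be enumerated with the standard nuscenes-devkit, as in the sketch below (this assumes the nuscenes-devkit package is installed); matching these scene names against the doScenes annotation files is left to each team's own dataloader.

```python
# Sketch: enumerate official nuScenes split scene names with the devkit.
# Assumes the nuscenes-devkit package is installed.
from nuscenes.utils.splits import create_splits_scenes

splits = create_splits_scenes()                    # dict: split name -> list of scene names
trainval_scenes = splits['train'] + splits['val']  # 850 scenes for training/validation
test_scenes = splits['test']                       # 150 scenes for website evaluation
print(len(trainval_scenes), len(test_scenes))      # 850 150
```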

Fair Evaluation Requirements

To ensure fair evaluation:

  • Participants may use the nuScenes v1.0-trainval split for training and validation.
  • The nuScenes v1.0-test split must be used only for website evaluation submissions.
  • Ground-truth future trajectories from the nuScenes v1.0-test split must not be used during training.
  • Models must use only egocentric point of view data from nuScenes.
  • Predictions must be generated for the nuScenes v1.0-test split and submitted for website evaluation.
  • Teams are expected to submit reproducible code that adheres to this protocol.

The challenge is governed by a reproducibility and fair-use policy. Any method that uses test-set future trajectories during training is considered out of protocol.

Required Inputs

All required inputs for prediction, including historical trajectories, sensor data, maps, and language instructions, can be obtained by combining the standard nuScenes dataset with the doScenes annotation files provided in the repository. Participants should use only the egocentric point of view data from nuScenes when preparing inputs for this challenge.
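
The sketch below illustrates one way to pair nuScenes scenes with doScenes instruction annotations. The NuScenes constructor is the standard devkit API; the annotation filename, format, and keys used here are assumptions for illustration only, so consult the reference dataloader in the doScenes repository for the actual schema.

```python
# Illustrative pairing of nuScenes scenes with doScenes instructions.
# The annotation file name and format below are hypothetical -- see the
# doScenes repository's reference dataloader for the real schema.
import json
from nuscenes.nuscenes import NuScenes

nusc = NuScenes(version='v1.0-trainval', dataroot='/data/nuscenes', verbose=False)

# Hypothetical annotation file: JSON mapping scene name -> instruction text.
with open('doscenes_annotations.json') as f:
    instructions = json.load(f)

paired = []
for scene in nusc.scene:
    text = instructions.get(scene['name'])
    if text is not None:
        # scene['first_sample_token'] can then be walked to assemble camera,
        # history, and map inputs from the egocentric viewpoint.
        paired.append((scene['token'], text))
```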

Submission Instructions

Ready to submit? Use the official submission form.

Submission Checklist

  • Prediction CSV file in the required format
  • Public code repository link
  • Paper PDF link
  • Submitted papers are encouraged to follow the CVPR paper format

Workshop Participation

Participation in the workshop is not required in order to complete the challenge. However, teams are encouraged to attend if possible. Details about online participation will be posted when available.

Key Dates

Milestone Date (AoE) Details
Competition opens February 11, 2026 Registration and dataset access
Final submission deadline May 20, 2026 Leaderboard freeze
Workshop June 22, 2026 Hybrid event + invited talks

FAQ

Do I need to attend the workshop to participate in the challenge?

No. Participation in the workshop is not required in order to complete the challenge, although teams are encouraged to attend if possible.

Which dataset splits should be used for training and evaluation?

For training and validation, participants can use the nuScenes v1.0-trainval split, which contains 850 scenes. For testing and evaluations on the website, participants should use the nuScenes v1.0-test split, which contains 150 scenes.

Can ground-truth future trajectories from the test split be used during training?

No. Ground-truth future trajectories from the nuScenes v1.0-test split must not be used during training.

What do I need to include with my submission?

Submissions should include a prediction CSV file in the required format, a public code repository link, and a paper PDF link. Submitted papers are encouraged to follow the CVPR paper format.

Where do the required inputs for prediction come from?

Required inputs can be assembled by combining the standard nuScenes dataset with the doScenes annotation files provided in the repository. Only egocentric point of view data from nuScenes should be used.

How do I get started with the doScenes dataset?

A reference dataloader is available inside the doScenes repository to help participants get started with loading the annotations together with the underlying nuScenes data.

Can I use additional datasets or pretrained models?

Yes. Participants may use additional datasets and pretrained models, including models derived from nuScenes, provided that no information from the nuScenes v1.0-test split is used during training in any form.

External or auxiliary data is permitted under the following conditions:

  • External datasets are allowed as long as they do not include test split data.
  • Pretrained models are allowed, but they must not be trained on the nuScenes v1.0-test split.
  • Auxiliary training data is permitted, provided the evaluation protocol is respected.

Note that some pretrained models, especially those trained on nuScenes, may have seen the test split during training; using such models would constitute a violation of the challenge protocol.

Participants are responsible for verifying and clearly documenting their data sources and training procedures.

Are resubmissions allowed?

Yes. Resubmissions are allowed, but they must be clearly identified in the submission so the most recent eligible submission can be evaluated correctly.

Ethical Considerations

The doScenes dataset is built upon nuScenes, which uses public road data. All videos are anonymized with no identifiable faces or license plates. Language annotations are collected consensually and anonymously for research purposes. We encourage participants to consider the broader implications of instruction-conditioned autonomy, including:

  • Safety in deployment
  • Handling of adversarial or conflicting instructions
  • Accountability when human instructions influence autonomous behavior

Contact

Email kng408@ucmerced.edu | rossgreer@ucmerced.edu

Resources

  • Dataset and project page: https://github.com/rossgreer/doScenes

Citation

Roy, P., Perisetla, S., Shriram, S., Krishnaswamy, H., Keskar, A., & Greer, R. (2025, November). doScenes: An Autonomous Driving Dataset with Natural Language Instruction for Human Interaction and Vision-Language Navigation. In 2025 IEEE 28th International Conference on Intelligent Transportation Systems (ITSC) (pp. 1651-1658). IEEE. https://arxiv.org/abs/2412.05893

BibTeX
@inproceedings{roy2025doscenes,
      title={doScenes: An Autonomous Driving Dataset with Natural Language Instruction for Human Interaction and Vision-Language Navigation},
      author={Parthib Roy and Srinivasa Perisetla and Shashank Shriram and Harsha Krishnaswamy and Aryan Keskar and Ross Greer},
      booktitle={2025 IEEE 28th International Conference on Intelligent Transportation Systems (ITSC)},
      pages={1651--1658},
      year={2025},
      organization={IEEE},
      url={https://arxiv.org/abs/2412.05893},
}

Martinez-Sanchez et al., Natural Language Instructions for Scene-Responsive Human-in-the-Loop Motion Planning in Autonomous Driving using Vision-Language-Action Models, to appear in IEEE Intelligent Vehicles Symposium 2026. arXiv:2602.04184, 2026. https://arxiv.org/abs/2602.04184

BibTeX
@inproceedings{martinezsanchez2026naturallanguage,
  title={Natural Language Instructions for Scene-Responsive Human-in-the-Loop Motion Planning in Autonomous Driving using Vision-Language-Action Models},
  author={Angel Martinez-Sanchez and Parthib Roy and Ross Greer},
  booktitle={2026 IEEE Intelligent Vehicles Symposium (IV)},
  year={2026},
  note={To appear},
  url={https://arxiv.org/abs/2602.04184}
}

Related Work from the Authors

Greer et al., Vision and language: Novel Representations and Artificial intelligence for Driving Scene Safety Assessment and Autonomous Vehicle Planning, to appear in Enhanced Safety of Vehicles Conference 2026. arXiv:2602.07680, 2026. https://arxiv.org/abs/2602.07680

BibTeX
@inproceedings{greer2026visionlanguagenovelrepresentations,
      title={Vision and language: Novel Representations and Artificial intelligence for Driving Scene Safety Assessment and Autonomous Vehicle Planning},
      author={Ross Greer and Maitrayee Keskar and Angel Martinez-Sanchez and Parthib Roy and Shashank Shriram and Mohan Trivedi},
      booktitle={Enhanced Safety of Vehicles Conference},
      year={2026},
      note={To appear},
      url={https://arxiv.org/abs/2602.07680}
}

Greer et al., Looking and Listening Inside and Outside: Multimodal Artificial Intelligence Systems for Driver Safety Assessment and Intelligent Vehicle Decision-Making, to appear in Enhanced Safety of Vehicles Conference 2026. arXiv:2602.07668, 2026. https://arxiv.org/abs/2602.07668

BibTeX
@inproceedings{greer2026lookinglisteninginsideoutside,
      title={Looking and Listening Inside and Outside: Multimodal Artificial Intelligence Systems for Driver Safety Assessment and Intelligent Vehicle Decision-Making},
      author={Ross Greer and Laura Fleig and Maitrayee Keskar and Erika Maquiling and Giovanni Tapia Lopez and Angel Martinez-Sanchez and Parthib Roy and Jake Rattigan and Mira Sur and Alejandra Vidrio and Thomas Marcotte and Mohan Trivedi},
      booktitle={Enhanced Safety of Vehicles Conference},
      year={2026},
      note={To appear},
      url={https://arxiv.org/abs/2602.07668}
}