Overview

The doScenes Instructed Driving Challenge evaluates whether natural language instructions can improve trajectory prediction for autonomous driving. Traditional trajectory prediction methods on datasets such as nuScenes rely on historical motion and scene context, yet human drivers often know their intended maneuver before executing it. The doScenes dataset augments nuScenes with human-authored natural language instructions describing upcoming driving maneuvers, enabling models to incorporate this privileged information.

Can models that incorporate text descriptions of intended maneuvers achieve better prediction accuracy than state-of-the-art history-only methods?

Trajectory prediction, the task of forecasting future vehicle positions from observed history and environmental context, is fundamental to autonomous driving. Large-scale real-world benchmarks such as nuScenes have driven rapid progress by providing multi-modal sensor data, high-definition maps, and annotations across 1,000 driving scenes collected from Boston and Singapore.

Despite this progress, a central assumption remains largely unchanged: future motion is predicted using only observed history and scene context. This is appropriate for predicting other agents, but it is an artificial constraint when predicting the ego vehicle, where the intended maneuver may be known before execution. Natural language instructions such as "turn left at the next intersection" expose this intent and can reduce uncertainty in otherwise ambiguous scenarios.

This challenge is built on doScenes, an instruction-augmented extension of nuScenes. All instructions are human-authored, with multiple annotators providing diverse phrasings for the same maneuver. This variation challenges models to condition on meaning rather than surface form.

Task Definition

For the Full Multimodal track, participants are given:

  • Visual scene input from permitted nuScenes cameras
  • Historical trajectory information (past 2 seconds)
  • High-definition map data
  • Natural language instruction describing the intended maneuver

Each sample provides 2 seconds of ground-truth history. Participants must generate one ego-vehicle trajectory for the next 6 seconds, sampled at 2 Hz for 12 future timesteps. Leaderboard evaluation uses single-shot prediction per sample; sliding-window evaluation is not used for official scoring.
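
As a small illustration of the horizon described above, the following sketch enumerates the future time offsets implied by a 6-second horizon sampled at 2 Hz. The constant and function names are illustrative assumptions and are not part of the doScenes repository.

```python
from typing import List

# Illustrative names (assumptions), values taken from the task definition.
HORIZON_S = 6.0       # prediction horizon in seconds
SAMPLE_RATE_HZ = 2.0  # sampling rate of the predicted trajectory


def future_timestamps(horizon_s: float = HORIZON_S,
                      rate_hz: float = SAMPLE_RATE_HZ) -> List[float]:
    """Return future time offsets in seconds, relative to prediction time t = 0."""
    num_steps = round(horizon_s * rate_hz)
    return [(k + 1) / rate_hz for k in range(num_steps)]


timestamps = future_timestamps()
print(len(timestamps), timestamps[0], timestamps[-1])  # prints: 12 0.5 6.0
```

The first predicted position corresponds to 0.5 seconds after the prediction time and the last to 6 seconds, which is the timestep used for FDE.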

Evaluation is open-loop: models receive ground-truth history at inference time, and predicted trajectories are not fed back into the model for rollout or closed-loop scoring. The key research question is whether models can effectively leverage the language instruction to improve prediction accuracy compared to baselines that use the same non-language inputs.

Evaluation Metrics

Submissions are evaluated using standard trajectory prediction metrics:

  • Average Displacement Error (ADE): Mean L2 distance between predicted and ground-truth positions across all 12 predicted timesteps.
  • Final Displacement Error (FDE): L2 distance between predicted and ground-truth positions at the final timestep, 6 seconds into the future.
  • Instruction Conditioning Gain (ΔADE): Improvement in ADE relative to the history-only baseline, measuring the benefit of incorporating language instructions.

ADE and FDE are computed per predicted trajectory, then averaged across all samples in the evaluated dataset. Each model must submit exactly one predicted trajectory per sample.
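
The metric definitions above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the organizers' evaluation script: the function names are hypothetical, trajectories are assumed to be sequences of (x, y) positions in meters, and predictions and ground truth are assumed to be dictionaries keyed by sample_token.

```python
import math
from typing import Dict, List, Sequence, Tuple

Point = Tuple[float, float]
Trajectory = Sequence[Point]


def displacement_errors(pred: Trajectory, gt: Trajectory) -> List[float]:
    """Per-timestep L2 distance between predicted and ground-truth positions."""
    if len(pred) != len(gt):
        raise ValueError("prediction and ground truth must have the same length")
    return [math.hypot(px - gx, py - gy) for (px, py), (gx, gy) in zip(pred, gt)]


def ade(pred: Trajectory, gt: Trajectory) -> float:
    """Average Displacement Error: mean L2 error over all timesteps."""
    errors = displacement_errors(pred, gt)
    return sum(errors) / len(errors)


def fde(pred: Trajectory, gt: Trajectory) -> float:
    """Final Displacement Error: L2 error at the last timestep."""
    return displacement_errors(pred, gt)[-1]


def dataset_metrics(preds: Dict[str, Trajectory],
                    gts: Dict[str, Trajectory]) -> Tuple[float, float]:
    """Compute ADE and FDE per trajectory, then average across samples."""
    tokens = sorted(gts)
    ades = [ade(preds[t], gts[t]) for t in tokens]
    fdes = [fde(preds[t], gts[t]) for t in tokens]
    return sum(ades) / len(ades), sum(fdes) / len(fdes)


# Toy data (not real nuScenes values): the lateral error grows by 0.1 m per step.
gt = [(float(k), 0.0) for k in range(1, 13)]
pred = [(float(k), 0.1 * k) for k in range(1, 13)]
mean_ade, mean_fde = dataset_metrics({"scene_a": pred}, {"scene_a": gt})
print(f"ADE={mean_ade:.2f} m, FDE={mean_fde:.2f} m")  # prints: ADE=0.65 m, FDE=1.20 m
```

Averaging per trajectory first and then across samples follows the order stated in the paragraph above.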

The Instruction Conditioning Gain is computed as:

\[
\Delta\mathrm{ADE} = \mathrm{ADE}_{\text{baseline}} - \mathrm{ADE}_{\text{instruction}} \tag{1}
\]

For official leaderboard reporting, the history-only baseline is defined per track: it uses the same non-language inputs and identical evaluation protocol except that no language instruction is provided. Participants must report ADE for the instruction-conditioned model, ADE for the baseline, and ΔADE.

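A positive ΔADE indicates that language improves trajectory prediction relative to the history-only baseline; a negative ΔADE indicates that the instruction-conditioned model performs worse than its baseline.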
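
As a worked example of Equation (1), the sketch below recomputes ΔADE from two rows of the leaderboard tables on this page. The function name is a hypothetical helper chosen for illustration.

```python
def instruction_conditioning_gain(ade_baseline: float, ade_instruction: float) -> float:
    """Delta ADE = ADE_baseline - ADE_instruction; positive means language helped."""
    return ade_baseline - ade_instruction


# NudgeVAD row of the Ablation leaderboard.
nudgevad_gain = instruction_conditioning_gain(ade_baseline=3.0770, ade_instruction=2.6774)
print(f"{nudgevad_gain:+.4f}")  # prints: +0.3996

# Paper baseline rows of the Full Multimodal leaderboard.
paper_gain = instruction_conditioning_gain(ade_baseline=7.1086, ade_instruction=7.2046)
print(f"{paper_gain:+.4f}")  # prints: -0.0960
```

The second case shows that the gain can be negative when adding language does not help the model.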

Evaluation Tracks

Participants may submit to one or more tracks:

  1. Full Multimodal: Uses cameras, 2 seconds of history, map data, and language instruction.
  2. Language + History: Uses 2 seconds of history and language instruction only; visual input is not allowed.
  3. Ablation: Participants must report both with-language and without-language results under the same protocol to demonstrate the instruction benefit.

Goals and Expected Outcomes

The doScenes Instructed Driving Challenge aims to:

  1. Establish instruction-conditioned prediction as a research paradigm: Demonstrate that language instructions provide actionable information for trajectory prediction.
  2. Benchmark vision-language models for driving: Evaluate how well current VLM architectures can ground natural language instructions in driving behavior.
  3. Quantify the instruction benefit: Measure the gap between history-only and instruction-conditioned prediction, providing motivation for human-in-the-loop autonomy systems.
  4. Build community: Foster collaboration between the computer vision, NLP, and autonomous driving research communities.

Leaderboard

Test-set leaderboard for instruction-conditioned trajectory prediction. Baselines are listed first.

Leaderboard protocol update: The nuScenes v1.0-test split contains approximately 150 source scenes. The current doScenes instruction-conditioned leaderboard evaluates the 127 test scenes with non-empty doScenes language instructions. The remaining 23 scenes have blank instruction fields and are not used for instruction-conditioned leaderboard scoring. Submitted files that include all 150 scenes are filtered to the 127 language-available scenes for scoring.
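
The filtering step described above might look like the following sketch. It is not the organizers' scoring code: the function name, the row representation (dictionaries with a sample_token key), and the instruction lookup keyed by scene token are all assumptions made for illustration.

```python
from typing import Dict, Iterable, List


def filter_language_available(rows: Iterable[Dict[str, str]],
                              instructions: Dict[str, str]) -> List[Dict[str, str]]:
    """Keep only submission rows whose scene has a non-empty instruction.

    `instructions` maps a scene token to its instruction text (assumed layout);
    blank or whitespace-only instructions count as empty.
    """
    return [row for row in rows
            if instructions.get(row["sample_token"], "").strip()]


# Toy tokens and instructions, not real doScenes data.
rows = [{"sample_token": "scene_a"}, {"sample_token": "scene_b"}, {"sample_token": "scene_c"}]
instructions = {
    "scene_a": "Turn left at the next intersection.",
    "scene_b": "",    # blank instruction field
    "scene_c": "  ",  # whitespace only, treated as blank here
}
kept = filter_language_available(rows, instructions)
print([row["sample_token"] for row in kept])  # prints: ['scene_a']
```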

Full Multimodal

Rank | Team / Model | ADE | FDE | ΔADE | Status | Submission Date
--- | History-Only Baseline (no language) | 7.1086 | 13.8567 | --- | Paper baseline | ---
--- | Instruction (language) | 7.2046 | 13.9224 | -0.0960 | Paper baseline | ---
1 | MIC Lab | 2.0913 | 5.5946 | --- | VERIFIED | May 14, 2026
2 | NudgeVAD | 2.6774 | 5.5673 | +0.3996 | VERIFIED | May 20, 2026
3 | UCF UrbanITY Lab | 2.7577 | 6.3753 | --- | VERIFIED | May 21, 2026
4 | why | 3.3309 | 8.1090 | +0.0213 | SELF-REPORTED | May 14, 2026

Language + History

Rank | Team / Model | ADE | FDE | ΔADE | Status | Submission Date
1 | why | 2.9207 | 6.7470 | --- | VERIFIED | May 13, 2026
2 | UCF UrbanITY Lab | 3.1421 | 7.2102 | +0.0039 | VERIFIED | May 11, 2026
3 | TJNU_PRCV | 3.2097 | 7.7726 | --- | VERIFIED | May 18, 2026
4 | sztudk | 3.2107 | 7.2955 | --- | VERIFIED | May 17, 2026
5 | 1 | 3.3000 | 7.4900 | +0.0400 | SELF-REPORTED | May 12, 2026
6 | vitality | 3.7511 | 8.9083 | --- | DIAGNOSTIC | May 6, 2026

Ablation

Rank | Team / Model | Instruction ADE | Baseline ADE | ΔADE | Status | Submission Date
1 | NudgeVAD | 2.6774 | 3.0770 | +0.3996 | VERIFIED | May 20, 2026
2 | UCF UrbanITY Lab (Language + History ablation) | 3.1421 | 3.1460 | +0.0039 | VERIFIED | May 20, 2026
3 | why | 3.3522 | 3.4828 | +0.1306 | SELF-REPORTED | May 13, 2026
4 | TJNU_PRCV | 3.5484 | 4.3001 | +0.7517 | VERIFIED | May 21, 2026

Leaderboard scores use single-shot, open-loop evaluation with one 6-second trajectory per sample. ADE reports ade_6s, and FDE reports the displacement error at the final 6-second timestep.

The full nuScenes v1.0-test split contains approximately 150 scenes, of which 23 have blank doScenes instruction fields and are excluded from instruction-conditioned leaderboard scoring. Submitted files that contain all 150 scene rows are filtered to the 127 language-available scenes before evaluation.

Row status labels:

  • VERIFIED: instruction-conditioned rows that were locally evaluated on the 127 nuScenes v1.0-test scenes with non-empty doScenes instructions.
  • SELF-REPORTED: rows taken from submitted papers where the local CSV could not be verified for that track.
  • DIAGNOSTIC: rows that required non-official duplicate handling.

For teams with multiple files, the displayed row uses the organizer-selected best available submission for the applicable track, except that Ablation uses matched with-language and without-language files. Non-verified results are provisional and will undergo further organizer review before being treated as final verified leaderboard results.

Dataset & Data Splits

doScenes extends nuScenes with human-authored instruction annotations. Public challenge development uses the nuScenes v1.0-trainval split, while website leaderboard evaluation uses the nuScenes v1.0-test split. Instructions are written by multiple annotators to capture natural variation in phrasing.

The nuScenes v1.0-test split contains approximately 150 scenes. In the current doScenes test annotations, 127 of those scenes have non-empty language instructions and form the official instruction-conditioned leaderboard evaluation set. An instruction field may be blank when no instruction interaction is needed to cause the observed action, such as waiting at a red light or continuing with traffic. The 23 blank-instruction scenes remain part of the nuScenes test split, but are not used for instruction-conditioned leaderboard scoring.

Instructions cover a variety of maneuvers including:

  • Lane changes (left/right)
  • Turns at intersections
  • Proceeding straight
  • Stopping and yielding
  • Speed adjustments

The diversity of annotators ensures that models must handle natural variation in instruction phrasing, rather than memorizing specific templates. For leaderboard evaluation, each test scene corresponds to one trajectory prediction from the first scene segment, matched using the scene's scene_token.

Dataset Statistics

Property | Value
Base dataset | nuScenes
Allowed nuScenes data | Past or current nuScenes sensor, map, or metadata input, subject to track rules and no future leakage
Train/validation scenes | ~850
Test scenes | ~150
Instruction annotations | Human-authored
Annotators per scene | Multiple
Languages | English
Sampling rate | 2 Hz

Annotation Statistics

Category | Count | Percentage
Total annotations | 3,924 | ---
Unique scenes | 854 | ---
Average annotations per scene | 4.6 | ---
Dynamic-referential (d) | 444 | 11.3%
Static-referential (s) | 787 | 20.1%
Mixed dynamic + static (ds) | 192 | 4.9%
Non-referential (None) | 2,501 | 63.7%

doScenes provides controlled evaluation of instruction-conditioned trajectory prediction across grounded and non-grounded language. Dynamic-referential instructions are grounded in moving agents or temporal interactions, such as following the car ahead. Static-referential instructions are grounded in scene elements such as lanes, signs, or intersections. Mixed instructions combine both dynamic and static references, while non-referential instructions describe motion without explicitly grounding to scene entities, such as continuing straight or slowing down.
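
The percentages and the per-scene average in the annotation table follow directly from the category counts. The short sketch below reproduces that arithmetic; the variable names are illustrative, and the counts are copied from the table.

```python
# Category counts copied from the Annotation Statistics table.
category_counts = {
    "Dynamic-referential (d)": 444,
    "Static-referential (s)": 787,
    "Mixed dynamic + static (ds)": 192,
    "Non-referential (None)": 2501,
}
unique_scenes = 854

total_annotations = sum(category_counts.values())
shares = {name: round(100 * count / total_annotations, 1)
          for name, count in category_counts.items()}
annotations_per_scene = round(total_annotations / unique_scenes, 1)

print(total_annotations)      # prints: 3924
print(annotations_per_scene)  # prints: 4.6
for name, share in shares.items():
    print(f"{name}: {share}%")
```

Roughly 36% of annotations reference at least one scene entity, while the remaining 63.7% are non-referential.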

Annotation statistics describe doScenes instruction annotations and may not exactly match the rounded nuScenes split counts shown above.

Dataset Splits and Evaluation Protocol

The doScenes benchmark uses the standard nuScenes dataset splits as the basis for training, validation, and website evaluation.

Split | Source | Scenes | Challenge Role | Protocol
Training / Validation | nuScenes v1.0-trainval split | ~850 | Model training and local validation | Models may use this split for training, validation, and method development
Website Testing | nuScenes v1.0-test split | ~150 source scenes; 127 with non-empty doScenes instructions | Evaluations on the website | Instruction-conditioned leaderboard scoring uses the 127 language-available scenes; test ground truth is not publicly available

Dataloader Update

The doScenes repository includes an updated dataloader that provides the correct split for history, anchor, and future trajectory points. For training and validation, participants may switch the dataloader version to v1.0-trainval. For evaluation, the challenge uses the first segment of each scene from the test set.

The dataloader returns a scene_token for each scene. Use this value as the sample_token in submission.csv, since each evaluation scene corresponds to a single trajectory.

Official Evaluation Protocol

  • Each sample uses 2 seconds of ground-truth history and a 6-second prediction horizon sampled at 2 Hz.
  • Leaderboard evaluation uses single-shot, open-loop prediction per sample.
  • Instruction-conditioned leaderboard evaluation uses the first segment of each language-available doScenes test scene from the nuScenes v1.0-test split.
  • Sliding-window evaluation is not used for official scoring.
  • Models must generate exactly one trajectory per evaluation sample, matched by sample_token.
  • The current official instruction-conditioned evaluation set contains 127 test scenes with non-empty doScenes instructions.
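
Before submitting, it can help to check a file against the protocol points listed above. The following is a hedged sketch of such a local check, not an official validator: the function name, the row representation (dictionaries as produced by csv.DictReader), and the set of expected tokens are assumptions. Extra rows for blank-instruction scenes are not flagged, since the leaderboard filters them out.

```python
import math
from typing import Dict, Iterable, List, Set

NUM_STEPS = 12  # 6-second horizon at 2 Hz
COORD_COLUMNS = [f"{axis}{k}" for k in range(1, NUM_STEPS + 1) for axis in ("x", "y")]


def validate_submission(rows: Iterable[Dict[str, str]],
                        expected_tokens: Set[str]) -> List[str]:
    """Return a list of problems; an empty list means these checks passed."""
    problems: List[str] = []
    seen: Set[str] = set()
    for row in rows:
        token = row.get("sample_token", "")
        if token in seen:
            problems.append(f"duplicate sample_token: {token}")
        seen.add(token)
        for col in COORD_COLUMNS:
            try:
                value = float(row[col])
            except (KeyError, TypeError, ValueError):
                problems.append(f"{token}: missing or non-numeric {col}")
                continue
            if not math.isfinite(value):
                problems.append(f"{token}: non-finite {col}")
    # Every expected evaluation sample needs exactly one trajectory.
    for token in sorted(expected_tokens - seen):
        problems.append(f"missing sample_token: {token}")
    return problems


# Toy row with placeholder token and coordinates.
good_row = {"sample_token": "scene_a", "instruction": "turn left",
            **{col: "0.0" for col in COORD_COLUMNS}}
print(validate_submission([good_row], {"scene_a"}))  # prints: []
print(validate_submission([good_row], {"scene_a", "scene_b"}))
```

The second call reports the missing scene_b row, which is the kind of gap that would otherwise only surface at scoring time.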

Fair Evaluation Requirements

To ensure fair evaluation:

  • Participants may use the nuScenes v1.0-trainval split for training and validation.
  • The nuScenes v1.0-test split may be used only to generate website evaluation submissions, not for training, validation, model selection, or tuning.
  • Ground-truth future trajectories from the nuScenes v1.0-test split must not be used for training, validation, tuning, model selection, or submission debugging.
  • Test-set ground truth is not publicly available and must not be inferred, reconstructed, or used through any external source.
  • Participants may use non-egocentric nuScenes data when it is allowed by the selected track, available at or before the prediction time, and does not reveal future information.
  • Predictions must be generated for the nuScenes v1.0-test split and submitted for website evaluation.
  • Teams are expected to submit reproducible code that adheres to this protocol.

The challenge is governed by a reproducibility and fair-use policy. Any method that uses test-set future trajectories for training, validation, tuning, model selection, or submission debugging is considered out of protocol.

Required Inputs

All required inputs for prediction, including historical trajectories, sensor data, maps, and language instructions, can be obtained by combining the standard nuScenes dataset with the doScenes annotation files provided in the repository. Participants may use egocentric or non-egocentric nuScenes inputs when allowed by the selected track, provided the inputs are available at or before the prediction time and do not leak future information.

Submission Instructions

Ready to submit? Use the official submission form.

Submission CSV Format

Prediction files must be named submission.csv. Each evaluation sample must have exactly one corresponding prediction row, and predictions are matched using sample_token.

The required CSV header is:

sample_token,instruction,x1,y1,x2,y2,...,x12,y12

The sample_token uniquely identifies each evaluation instance. When using the updated doScenes dataloader, use the returned scene_token as the sample_token, since each evaluation scene corresponds to a single trajectory.
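
A file with the header above could be produced with Python's standard csv module, as in the sketch below. The function name and the in-memory prediction layout are assumptions. The instruction column is assumed here to hold the instruction text the model was conditioned on, which the format description does not spell out. The token and trajectory are placeholders.

```python
import csv
import io
from typing import Dict, Sequence, TextIO, Tuple

NUM_STEPS = 12
HEADER = ["sample_token", "instruction"] + [
    f"{axis}{k}" for k in range(1, NUM_STEPS + 1) for axis in ("x", "y")
]

# Assumed layout: sample_token -> (instruction text, 12 (x, y) points in meters).
Predictions = Dict[str, Tuple[str, Sequence[Tuple[float, float]]]]


def write_submission(out: TextIO, predictions: Predictions) -> None:
    """Write one row per sample_token in the required column order."""
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(HEADER)
    for token, (instruction, trajectory) in predictions.items():
        if len(trajectory) != NUM_STEPS:
            raise ValueError(f"{token}: expected {NUM_STEPS} points, got {len(trajectory)}")
        flat = [f"{value:.4f}" for point in trajectory for value in point]
        writer.writerow([token, instruction, *flat])


# Placeholder prediction: straight ahead at a constant 2 m per step.
predictions = {
    "scene_token_placeholder": ("Turn left, then stop.",
                                [(2.0 * k, 0.0) for k in range(1, 13)]),
}
buffer = io.StringIO()
write_submission(buffer, predictions)
print(buffer.getvalue().splitlines()[0])
```

To write an actual file, the same function can be called with open("submission.csv", "w", newline="") in place of the in-memory buffer. Using csv.writer means instructions that contain commas are quoted automatically.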

Coordinates (x_t, y_t) represent the predicted 2D ego-vehicle position at each future timestep. The prediction horizon is 6 seconds at 2 Hz, for 12 total timesteps. All coordinates must be expressed in meters in the ego vehicle coordinate frame at prediction time t = 0, where the x-axis points forward and the y-axis points left.
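
If a model predicts positions in a global map frame, they need to be converted into the ego frame described above. The following sketch shows one way to do that under explicit assumptions: the ego position and yaw at t = 0 are already known, and yaw is the heading angle in radians measured counterclockwise from the global x-axis. Extracting the yaw from a nuScenes pose is outside the scope of this sketch, and the function name is hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def to_ego_frame(points_global: Sequence[Point],
                 ego_xy: Point,
                 ego_yaw: float) -> List[Point]:
    """Express global (x, y) points in the ego frame at t = 0.

    Output convention: x points forward, y points left, units are meters.
    """
    cos_yaw, sin_yaw = math.cos(ego_yaw), math.sin(ego_yaw)
    result = []
    for gx, gy in points_global:
        dx, dy = gx - ego_xy[0], gy - ego_xy[1]
        # Rotate the offset by -yaw so the heading direction becomes +x.
        x_ego = cos_yaw * dx + sin_yaw * dy
        y_ego = -sin_yaw * dx + cos_yaw * dy
        result.append((x_ego, y_ego))
    return result


# Toy check: ego at (10, 5), heading along the global +y axis.
ahead, left = to_ego_frame([(10.0, 8.0), (7.0, 5.0)],
                           ego_xy=(10.0, 5.0), ego_yaw=math.pi / 2)
print(f"ahead: ({ahead[0]:.1f}, {ahead[1]:.1f})")
print(f"left:  ({left[0]:.1f}, {left[1]:.1f})")
```

In the toy check, a point 3 m ahead of the vehicle maps to roughly (3, 0), and a point 3 m to its left maps to roughly (0, 3), matching the forward-x, left-y convention.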

Submission Checklist

  • submission.csv with one row per evaluation sample and columns sample_token,instruction,x1,y1,x2,y2,...,x12,y12.
  • ADE for the instruction-conditioned model, ADE for the baseline, and ΔADE.
  • Public code repository link
  • Paper PDF link
  • Submitted papers are encouraged to follow the CVPR paper format

Workshop Participation

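The doScenes challenge runs across two workshops:

  • DriveX Workshop at CVPR 2026 (June 3–4, 2026, Denver, CO)
  • VL-IIV Workshop at IEEE IV 2026 (June 22, 2026, Detroit, MI)

Participation in either workshop is not required to complete the challenge, but teams are encouraged to attend if possible. Details about online participation will be posted once available.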

Key Dates

Milestone | Date (AoE) | Details
Competition opens | February 11, 2026 | Registration and dataset access
Leaderboard freeze (CVPR) | May 20, 2026 | DriveX Workshop at CVPR 2026 (Jun 3–4, Denver)
Leaderboard freeze (IV) | June 15, 2026 | VL-IIV Workshop at IEEE IV 2026 (Jun 22, Detroit)
CVPR DriveX Workshop | June 3–4, 2026 | Denver, CO; hybrid event + invited talks
IV VL-IIV Workshop | June 22, 2026 | Detroit, MI; hybrid event + invited talks

FAQ

What is the official leaderboard evaluation protocol?

Leaderboard evaluation uses single-shot, open-loop prediction on the first segment of each language-available doScenes test scene from the nuScenes v1.0-test split. See the Official Evaluation Protocol section for the full setup.

Do I need to attend the workshop to participate in the challenge?

No. Participation in the workshop is not required in order to complete the challenge, although teams are encouraged to attend if possible.

Which dataset splits should be used for training and evaluation?

For training and validation, participants can use the nuScenes v1.0-trainval split, which contains approximately 850 scenes. For testing and evaluations on the website, participants should use the nuScenes v1.0-test split, which contains approximately 150 source scenes. The current doScenes instruction-conditioned leaderboard uses the 127 test scenes with non-empty language instructions. The test split may be used only to generate leaderboard submissions; test-set ground truth is not publicly available, and no training, validation, model selection, or tuning is allowed on the test split.

Can ground-truth future trajectories from the test split be used?

No. Ground-truth future trajectories from the nuScenes v1.0-test split must not be used for training, validation, tuning, model selection, submission debugging, or any other part of method development.

What do I need to include with my submission?

Submissions should include a prediction CSV file named submission.csv, a public code repository link, and a paper PDF link. Participants must also report instruction-conditioned ADE, baseline ADE, and ΔADE.

Where do the required inputs for prediction come from?

Required inputs can be assembled by combining the standard nuScenes dataset with the doScenes annotation files provided in the repository. Participants may use egocentric or non-egocentric nuScenes inputs when allowed by the selected track, as long as those inputs are available at or before the prediction time and do not reveal future information. Full Multimodal submissions may use cameras, history, map, and language; Language + History submissions may use history and language only; Ablation submissions must include both with-language and without-language results.

Which coordinate system should submissions use?

Use egocentric coordinates in meters, expressed in the ego vehicle frame at prediction time t = 0, with the x-axis pointing forward and the y-axis pointing left. The full convention is given in the Submission CSV Format section.

How are multiple instructions for one scene handled?

For leaderboard evaluation, each language-available scene corresponds to a single trajectory from the first segment of the test-set scene. When using the dataloader, use the returned scene_token as the sample_token and submit one predicted trajectory for each evaluation scene.

How do I get started with the doScenes dataset?

Use the updated dataloader in the doScenes repository. It provides the history, anchor, and future trajectory split and returns the scene_token needed for submission rows.

Can I use additional datasets or pretrained models?

Yes. Participants may use additional datasets and pretrained models, including models derived from nuScenes, provided that no information from the nuScenes v1.0-test split is used for training, validation, tuning, model selection, or submission debugging.

External or auxiliary data is permitted under the following conditions:

  • External datasets are allowed as long as they do not include test split data.
  • Pretrained models are allowed, but they must not use nuScenes v1.0-test information for training, validation, tuning, model selection, or submission debugging.
  • Auxiliary training data is permitted, provided the evaluation protocol is respected.

Some pretrained models, especially those trained on nuScenes, may have used the test split during development. This would constitute a violation of the challenge protocol if test-set information influenced training, validation, tuning, model selection, or submission debugging.

Participants are responsible for verifying and clearly documenting their data sources and training procedures.

Are resubmissions allowed?

Yes. Resubmissions are allowed, but they must be clearly identified in the submission so the most recent eligible submission can be evaluated correctly.

Ethical Considerations

The doScenes dataset is built upon nuScenes, which uses public road data. All videos are anonymized with no identifiable faces or license plates. Language annotations are collected consensually and anonymously for research purposes.

Participants should consider the following questions:

  • What should a system do when an instruction conflicts with traffic laws?
  • Should a model defer to scene context when language is underspecified?
  • How robust is the system to misleading or adversarial instructions?

Language must not override safety constraints. Models developed for this challenge should prioritize regulatory compliance, interpretability of instruction influence, and robustness to adversarial phrasing.

Concluding Remarks

The doScenes Instructed Driving Challenge introduces a trajectory prediction paradigm that leverages natural language descriptions of intended maneuvers to improve forecasting accuracy. By augmenting the established nuScenes benchmark with human-authored instructions, the challenge enables research into instruction-conditioned prediction models that can outperform history-only baselines.

Contact

Email: kng408@ucmerced.edu | rossgreer@ucmerced.edu

Resources

  • Dataset and project page: https://github.com/rossgreer/doScenes

Citation

Roy, P., Perisetla, S., Shriram, S., Krishnaswamy, H., Keskar, A., & Greer, R. (2025, November). doScenes: An Autonomous Driving Dataset with Natural Language Instruction for Human Interaction and Vision-Language Navigation. In 2025 IEEE 28th International Conference on Intelligent Transportation Systems (ITSC) (pp. 1651-1658). IEEE. https://arxiv.org/abs/2412.05893

BibTeX
@inproceedings{roy2025doscenes,
      title={doScenes: An Autonomous Driving Dataset with Natural Language Instruction for Human Interaction and Vision-Language Navigation},
      author={Parthib Roy and Srinivasa Perisetla and Shashank Shriram and Harsha Krishnaswamy and Aryan Keskar and Ross Greer},
      booktitle={2025 IEEE 28th International Conference on Intelligent Transportation Systems (ITSC)},
      pages={1651--1658},
      year={2025},
      organization={IEEE},
      url={https://arxiv.org/abs/2412.05893},
}

Martinez-Sanchez et al., Natural Language Instructions for Scene-Responsive Human-in-the-Loop Motion Planning in Autonomous Driving using Vision-Language-Action Models, to appear in IEEE Intelligent Vehicles Symposium 2026. arXiv:2602.04184, 2026. https://arxiv.org/abs/2602.04184

BibTeX
@inproceedings{martinezsanchez2026naturallanguage,
  title={Natural Language Instructions for Scene-Responsive Human-in-the-Loop Motion Planning in Autonomous Driving using Vision-Language-Action Models},
  author={Angel Martinez-Sanchez and Parthib Roy and Ross Greer},
  booktitle={2026 IEEE Intelligent Vehicles Symposium (IV)},
  year={2026},
  note={To appear},
  url={https://arxiv.org/abs/2602.04184}
}

Related Work from the Authors

Greer et al., Vision and language: Novel Representations and Artificial intelligence for Driving Scene Safety Assessment and Autonomous Vehicle Planning, to appear in Enhanced Safety of Vehicles Conference 2026. arXiv:2602.07680, 2026. https://arxiv.org/abs/2602.07680

BibTeX
@inproceedings{greer2026visionlanguagenovelrepresentations,
      title={Vision and language: Novel Representations and Artificial intelligence for Driving Scene Safety Assessment and Autonomous Vehicle Planning},
      author={Ross Greer and Maitrayee Keskar and Angel Martinez-Sanchez and Parthib Roy and Shashank Shriram and Mohan Trivedi},
      booktitle={Enhanced Safety of Vehicles Conference},
      year={2026},
      note={To appear},
      url={https://arxiv.org/abs/2602.07680}
}

Greer et al., Looking and Listening Inside and Outside: Multimodal Artificial Intelligence Systems for Driver Safety Assessment and Intelligent Vehicle Decision-Making, to appear in Enhanced Safety of Vehicles Conference 2026. arXiv:2602.07668, 2026. https://arxiv.org/abs/2602.07668

BibTeX
@inproceedings{greer2026lookinglisteninginsideoutside,
      title={Looking and Listening Inside and Outside: Multimodal Artificial Intelligence Systems for Driver Safety Assessment and Intelligent Vehicle Decision-Making},
      author={Ross Greer and Laura Fleig and Maitrayee Keskar and Erika Maquiling and Giovanni Tapia Lopez and Angel Martinez-Sanchez and Parthib Roy and Jake Rattigan and Mira Sur and Alejandra Vidrio and Thomas Marcotte and Mohan Trivedi},
      booktitle={Enhanced Safety of Vehicles Conference},
      year={2026},
      note={To appear},
      url={https://arxiv.org/abs/2602.07668}
}