2nd Edition — IEEE Intelligent Vehicles Symposium 2026
Vision, Language, and Multimodal Human Instructions
for Interactive Intelligent Vehicles
13:30–17:30 • LaSalle B • IEEE IV 2026 • VL-IIV 2026
About the Workshop
The Vision, Language, and Multimodal Human Instructions for Interactive Intelligent Vehicles (VL-IIV 2026) workshop explores the intersection of computer vision, language understanding, and multimodal reasoning for human-in-the-loop autonomous driving. The workshop focuses on systems and datasets that allow vehicles to perceive, interpret, and respond to visual and linguistic instructions.
Interactive autonomous systems capable of interpreting multimodal human instructions are critical to the next generation of safe and trustworthy transportation. This workshop promotes human-centered autonomy, reducing risks from fully unsupervised systems while enhancing transparency and user control.
Topics
We welcome contributions with a strong focus on — but not limited to — the following topics:
Speakers
doScenes Instructed Driving Challenge
VL-IIV 2026 hosts the doScenes Instructed Driving Challenge
The challenge evaluates how well vision-language models predict trajectories conditioned on human driving instructions. The dataset contains scene-level captions, driver intent labels, and natural-language instructions for upcoming maneuvers — all human-generated and labeled by multiple annotators, creating a diverse set of descriptors mapping to the same maneuver.
Participants predict the vehicle's future trajectory conditioned on any combination of (1) visual scene input (multi-camera), (2) language instruction, and (3) scene context (history + map). Submissions are evaluated on displacement error, visualization, and explainability.
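As a rough illustration of the displacement-error criterion named above, the sketch below computes two metrics commonly used in trajectory prediction: average displacement error (ADE) and final displacement error (FDE). This page does not say which displacement metric the challenge uses, so the choice of ADE/FDE, the function name `displacement_errors`, and the sample waypoints are all assumptions for illustration, not the official scoring code.

```python
import math


def displacement_errors(predicted, ground_truth):
    """Return (ADE, FDE) for two equal-length lists of (x, y) waypoints.

    Hypothetical helper: the challenge's actual metric definition may differ.
    """
    if len(predicted) != len(ground_truth) or not predicted:
        raise ValueError("trajectories must be non-empty and the same length")
    # Euclidean distance between matching waypoints at each future timestep.
    distances = [math.dist(p, g) for p, g in zip(predicted, ground_truth)]
    ade = sum(distances) / len(distances)  # mean error over all timesteps
    fde = distances[-1]                    # error at the final timestep only
    return ade, fde


# Illustrative 3-step trajectories in metres (made-up values, not challenge data).
predicted = [(1.0, 0.0), (2.0, 0.0), (3.0, 1.0)]
ground_truth = [(1.0, 0.0), (2.0, 1.0), (3.0, 3.0)]
ade, fde = displacement_errors(predicted, ground_truth)
print(f"ADE = {ade:.2f} m, FDE = {fde:.2f} m")
```

ADE summarizes how closely the whole predicted path tracks the reference, while FDE isolates the endpoint, which matters when an instruction specifies where the maneuver should finish.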
Schedule
Half-day workshop, 13:30–17:30 — LaSalle B.
| Time | Type | Title | Presenters | Duration |
|---|---|---|---|---|
| 13:30 | Welcome | Opening Remarks | Ross Greer & Mohan Trivedi | 15 min |
| 13:45 | Invited Talk | Beyond Visual Question Answering: Context-Grounded LVLMs for Safer Transportation Perception | Abhijit Sarkar, PhD (Virginia Tech Transportation Institute) | 40 min |
| 14:25 | Invited Talk | TBA | Krzysztof Czarnecki (University of Waterloo) | 40 min |
| 15:05 | Break | Coffee Break | | 15 min |
| 15:20 | Invited Talk | TBA | Mustafa Bal (NomadicML) | 40 min |
| 16:00 | Challenge | doScenes Instructed Driving Challenge: Overview | Angel Martinez, Parthib Roy | 10 min |
| 16:10 | Challenge | Full Multimodal Track: 1st Place Presentation | TBA | 10 min |
| 16:20 | Challenge | Language + History Track: 1st Place Presentation | TBA | 10 min |
| 16:30 | Challenge | Ablation Track: 1st Place Presentation | TBA | 10 min |
| 16:40 | Oral Papers | INSIGHT: Enhancing Autonomous Driving Safety through Vision-Language Models on Context-Aware Hazard Detection and Reasoning | Dianwei Chen, Zifan Zhang, Lei Cheng, Yuchen Liu, Xianfeng Yang | 25 min |
| 17:05 | Closing | Closing Remarks & Awards | Organizers | 25 min |

Abstract (13:45 invited talk, Abhijit Sarkar): Large Vision-Language Models have created new opportunities for transportation scene understanding by allowing researchers and practitioners to query complex traffic scenes using natural language. However, transportation perception is not simply an image-understanding problem. Safety-relevant reasoning often depends on contextual information not fully captured by a single RGB frame, including temporal motion, 3D spatial structure, road-user interactions, environmental conditions, and visibility constraints. This talk presents recent work on context-grounded LVLMs for transportation through two complementary directions. The first is 3D spatial grounding, where 2D visual inputs are augmented with LiDAR-, stereo-, or tracking-derived information so that LVLMs can reason about depth, object position, motion, and vehicle-to-vehicle interaction. This moves beyond visual description toward spatially informed traffic-scene reasoning, which is critical for safety questions such as lane-change feasibility, vulnerable-road-user presence, and interaction risk. The second is concept grounding, where LVLMs are encouraged to reason through human-interpretable evidence rather than directly predicting a final label, making outputs more interpretable, auditable, and robust for safety-critical transportation applications.