2nd Edition — IEEE Intelligent Vehicles Symposium 2026

Vision, Language, and Multimodal Human Instructions
for Interactive Intelligent Vehicles

13:30–17:30  •  LaSalle B  •  IEEE IV 2026  •  VL-IIV 2026

About the Workshop

The Vision, Language, and Multimodal Human Instructions for Interactive Intelligent Vehicles (VL-IIV 2026) workshop explores the intersection of computer vision, language understanding, and multimodal reasoning for human-in-the-loop autonomous driving. The workshop focuses on systems and datasets that allow vehicles to perceive, interpret, and respond to visual and linguistic instructions.

Interactive autonomous systems capable of interpreting multimodal human instructions are critical to the next generation of safe and trustworthy transportation. This workshop promotes human-centered autonomy, reducing risks from fully unsupervised systems while enhancing transparency and user control.

Topics

We welcome contributions on topics including, but not limited to, the following:

Human-in-the-loop and instructed autonomy
Representation learning and foundation models for embodied, instruction-conditioned behavior
Multimodal learning and grounding (gesture, speech, gaze)
Multi-agent interactions
Vision-language models for driving and robotics
Scene understanding for control transitions
Safety, trust, explainability, and transparency in human-interactive AV systems
Datasets, benchmarks, and evaluation metrics for interactive autonomy
Generative and contrastive modeling for multimodal control

Speakers

Krzysztof Czarnecki
University of Waterloo
Abhijit Sarkar, PhD
Virginia Tech Transportation Institute
Mustafa Bal
NomadicML

doScenes Instructed Driving Challenge

VL-IIV 2026 hosts the doScenes Instructed Driving Challenge

The challenge evaluates how well vision-language models predict trajectories conditioned on human driving instructions. The dataset contains scene-level captions, driver intent labels, and natural-language instructions for upcoming maneuvers — all human-generated and labeled by multiple annotators, creating a diverse set of descriptors mapping to the same maneuver.

Participants predict the vehicle's future trajectory conditioned on any combination of (1) visual scene input (multi-camera), (2) language instruction, and (3) scene context (history + map). Submissions are evaluated on displacement error, visualization, and explainability.
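For readers unfamiliar with displacement error, the sketch below shows one common way it is computed for trajectory prediction: average displacement error (ADE) over all waypoints and final displacement error (FDE) at the last waypoint. This is an illustrative assumption, not the challenge's official scoring code; the function name `displacement_errors`, the 2D `(x, y)` waypoint format, and the sample trajectories are all hypothetical, and the challenge may use a different variant or horizon.

```python
import math


def displacement_errors(predicted, ground_truth):
    """Return (ADE, FDE) for two equal-length lists of (x, y) waypoints.

    Hypothetical helper for illustration; not the official challenge metric.
    """
    if not predicted or len(predicted) != len(ground_truth):
        raise ValueError("trajectories must be non-empty and of equal length")

    # Euclidean distance between each predicted and ground-truth waypoint.
    distances = [
        math.hypot(px - gx, py - gy)
        for (px, py), (gx, gy) in zip(predicted, ground_truth)
    ]

    ade = sum(distances) / len(distances)  # mean error over the horizon
    fde = distances[-1]                    # error at the final waypoint
    return ade, fde


# Made-up example: the prediction goes straight while the ground truth drifts.
pred = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
truth = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]
ade, fde = displacement_errors(pred, truth)
print(f"ADE = {ade:.2f}, FDE = {fde:.2f}")
```

Lower values are better for both quantities. FDE isolates how far off the prediction is at the end of the horizon, while ADE summarizes the error along the whole path.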


Schedule

Half-day workshop, 13:30–17:30 — LaSalle B.

13:30 Welcome
Opening Remarks
Ross Greer & Mohan Trivedi
15 min
13:45 Invited Talk
Beyond Visual Question Answering: Context-Grounded LVLMs for Safer Transportation Perception
Abhijit Sarkar, PhD — Virginia Tech Transportation Institute
Abstract

Large Vision-Language Models have created new opportunities for transportation scene understanding by allowing researchers and practitioners to query complex traffic scenes using natural language. However, transportation perception is not simply an image-understanding problem. Safety-relevant reasoning often depends on contextual information not fully captured by a single RGB frame, including temporal motion, 3D spatial structure, road-user interactions, environmental conditions, and visibility constraints.

This talk presents recent work on context-grounded LVLMs for transportation through two complementary directions. The first is 3D spatial grounding, where 2D visual inputs are augmented with LiDAR-, stereo-, or tracking-derived information so that LVLMs can reason about depth, object position, motion, and vehicle-to-vehicle interaction — moving beyond visual description toward spatially informed traffic-scene reasoning critical for safety questions such as lane-change feasibility, vulnerable-road-user presence, and interaction risk. The second is concept grounding, where LVLMs are encouraged to reason through human-interpretable evidence rather than directly predicting a final label, making outputs more interpretable, auditable, and robust for safety-critical transportation applications.

40 min
14:25 Invited Talk
TBA
Krzysztof Czarnecki — University of Waterloo
40 min
15:05 Break
Coffee Break
15 min
15:20 Invited Talk
TBA
Mustafa Bal — NomadicML
40 min
16:00 Challenge
doScenes Instructed Driving Challenge — Overview
Angel Martinez, Parthib Roy
10 min
16:10 Challenge
Full Multimodal Track — 1st Place Presentation
TBA
10 min
16:20 Challenge
Language + History Track — 1st Place Presentation
TBA
10 min
16:30 Challenge
Ablation Track — 1st Place Presentation
TBA
10 min
16:40 Oral Papers
INSIGHT: Enhancing Autonomous Driving Safety through Vision-Language Models on Context-Aware Hazard Detection and Reasoning
Dianwei Chen, Zifan Zhang, Lei Cheng, Yuchen Liu, Xianfeng Yang
25 min
17:05 Closing
Closing Remarks & Awards
Organizers
25 min

Organizers

Lead Organizers

Prof. Ross Greer
University of California, Merced
Prof. Mohan Trivedi
University of California, San Diego

Organizing Committee & Challenge Leads

Max Ronecker
TU Graz
Walter Zimmer
University of Sydney / UCLA
Rui Song
UCLA
Kianna Ng
UC Merced
Angel Martinez
UC Merced
Maitrayee Keskar
UC San Diego
Anas Saeed
Bonsai Robotics
Erika Maquiling
UC Merced
Edmund Chao
UCLA
Giovanni Tapia Lopez
UC Merced
Marcus Blennemann
UC San Diego
Parthib Roy
UC Merced
Afnan Alofi
Princess Nourah bint Abdulrahman University

Contact

For inquiries, please contact rossgreer@ucmerced.edu.