VL-IIV 2026 Workshop

About the Workshop

The Vision, Language, and Multimodal Human Instructions for Interactive Intelligent Vehicles (VL-IIV 2026) workshop explores the intersection of computer vision, language understanding, and multimodal reasoning for human-in-the-loop autonomous driving. The workshop focuses on systems and datasets that allow vehicles to perceive, interpret, and respond to visual and linguistic instructions.

Interactive autonomous systems capable of interpreting multimodal human instructions are critical to the next generation of safe and trustworthy transportation. This workshop promotes human-centered autonomy, reducing risks from fully unsupervised systems while enhancing transparency and user control.

Topics

We welcome contributions on topics including, but not limited to, the following:

Human-in-the-loop and instructed autonomy

Representation learning and foundation models for embodied, instruction-conditioned behavior

Multimodal learning and grounding (gesture, speech, gaze)

Multi-agent interactions

Vision-language models for driving and robotics

Scene understanding for control transitions

Safety, trust, explainability, and transparency in human-interactive AV systems

Datasets, benchmarks, and evaluation metrics for interactive autonomy

Generative and contrastive modeling for multimodal control

Speakers

Krzysztof Czarnecki

University of Waterloo

Abhijit Sarkar, PhD

Virginia Tech Transportation Institute

Mustafa Bal

NomadicML

doScenes Instructed Driving Challenge

VL-IIV 2026 hosts the doScenes Instructed Driving Challenge

The challenge evaluates how well vision-language models predict trajectories conditioned on human driving instructions. The dataset contains scene-level captions, driver intent labels, and natural-language instructions for upcoming maneuvers. All of these are human-generated and labeled by multiple annotators, so the same maneuver is paired with a diverse set of descriptors.

Participants predict the vehicle's future trajectory conditioned on any combination of (1) visual scene input (multi-camera), (2) language instruction, and (3) scene context (history + map). Submissions are evaluated using displacement error, visualization, and explainability.
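As a rough illustration of the displacement-error criterion, the sketch below computes two metrics commonly used in trajectory prediction: average displacement error (ADE) and final displacement error (FDE). This is an assumption about what "displacement error" may look like in practice, not the challenge's official scoring code. The function name `displacement_errors`, the 2D waypoint format, and the choice of ADE and FDE are all hypothetical; consult the challenge details for the actual definition.

```python
import math
from typing import Sequence, Tuple

# Hypothetical waypoint format: (x, y) positions in meters at matching timesteps.
Point = Tuple[float, float]


def displacement_errors(pred: Sequence[Point], gt: Sequence[Point]) -> Tuple[float, float]:
    """Return (ADE, FDE) between a predicted and a ground-truth trajectory.

    ADE is the mean Euclidean distance over all timesteps.
    FDE is the Euclidean distance at the final timestep.
    """
    if len(pred) != len(gt) or len(pred) == 0:
        raise ValueError("trajectories must be non-empty and of equal length")

    # Per-timestep Euclidean distance between predicted and true positions.
    dists = [math.hypot(px - gx, py - gy) for (px, py), (gx, gy) in zip(pred, gt)]

    ade = sum(dists) / len(dists)
    fde = dists[-1]
    return ade, fde


if __name__ == "__main__":
    # Toy example: the prediction drifts laterally by 0 m, 1 m, then 2 m.
    predicted = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
    ground_truth = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]
    ade, fde = displacement_errors(predicted, ground_truth)
    print(f"ADE = {ade:.2f} m, FDE = {fde:.2f} m")
```

Lower values are better for both metrics. ADE summarizes how closely the whole predicted path follows the driven path, while FDE isolates the error at the end of the prediction horizon.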


Schedule

Half-day workshop, 13:30–17:30 — LaSalle B.

Time	Session	Title	Presenter(s)	Duration
13:30	Welcome	Opening Remarks	Ross Greer & Mohan Trivedi	15 min
13:45	Invited Talk	Beyond Visual Question Answering: Context-Grounded LVLMs for Safer Transportation Perception	Abhijit Sarkar, PhD (Virginia Tech Transportation Institute)	40 min
14:25	Invited Talk	TBA	Krzysztof Czarnecki (University of Waterloo)	40 min
15:05	Break	Coffee Break		15 min
15:20	Invited Talk	TBA	Mustafa Bal (NomadicML)	40 min
16:00	Challenge	doScenes Instructed Driving Challenge: Overview	Angel Martinez, Parthib Roy	10 min
16:10	Challenge	Full Multimodal Track: 1st Place Presentation	TBA	10 min
16:20	Challenge	Language + History Track: 1st Place Presentation	TBA	10 min
16:30	Challenge	Ablation Track: 1st Place Presentation	TBA	10 min
16:40	Oral Papers	INSIGHT: Enhancing Autonomous Driving Safety through Vision-Language Models on Context-Aware Hazard Detection and Reasoning	Dianwei Chen, Zifan Zhang, Lei Cheng, Yuchen Liu, Xianfeng Yang	25 min
17:05	Closing	Closing Remarks & Awards	Organizers	25 min

Abstract (13:45 Invited Talk, Abhijit Sarkar, PhD)

Large Vision-Language Models have created new opportunities for transportation scene understanding by allowing researchers and practitioners to query complex traffic scenes using natural language. However, transportation perception is not simply an image-understanding problem. Safety-relevant reasoning often depends on contextual information not fully captured by a single RGB frame, including temporal motion, 3D spatial structure, road-user interactions, environmental conditions, and visibility constraints. This talk presents recent work on context-grounded LVLMs for transportation through two complementary directions.

The first is 3D spatial grounding, where 2D visual inputs are augmented with LiDAR-, stereo-, or tracking-derived information so that LVLMs can reason about depth, object position, motion, and vehicle-to-vehicle interaction. This moves beyond visual description toward spatially informed traffic-scene reasoning, which is critical for safety questions such as lane-change feasibility, vulnerable-road-user presence, and interaction risk.

The second is concept grounding, where LVLMs are encouraged to reason through human-interpretable evidence rather than directly predicting a final label, making outputs more interpretable, auditable, and robust for safety-critical transportation applications.

Organizers

Lead Organizers

Prof. Ross Greer

University of California, Merced

Prof. Mohan Trivedi

University of California, San Diego

Organizing Committee & Challenge Leads

Max Ronecker

TU Graz

Walter Zimmer

University of Sydney / UCLA

Rui Song

UCLA

Kianna Ng

UC Merced

Angel Martinez

UC Merced

Maitrayee Keskar

UC San Diego

Anas Saeed

Bonsai Robotics

Erika Maquiling

UC Merced

Edmund Chao

UCLA

Giovanni Tapia Lopez

UC Merced

Marcus Blennemann

UC San Diego

Parthib Roy

UC Merced

Afnan Alofi

Princess Nourah bint Abdulrahman University

Contact

For inquiries, please contact rossgreer@ucmerced.edu.
