VideoFDB: Evaluating Full-Duplex Vision-Speech Capabilities in Agents (research.nvidia.com)

0 points | 2 hours ago

🤖 AI Summary

VideoFDB is a benchmark for evaluating full-duplex audio-visual conversational agents. Earlier benchmarks covered either speech alone or turn-based interaction, and so missed the fluid, overlapping dynamics of human conversation, including nonverbal cues such as gaze and gestures. VideoFDB uses 237 real video call clips and a detailed evaluation rubric to examine how well agents handle these nuances. It reveals systematic shortcomings in current models, which often default to audio-only engines or misinterpret visual input as mere captions.

Tests of several leading agents highlight two primary failure modes: models frequently misidentify visual input, focusing on appearance rather than engaging in dialogue, and they struggle to integrate visual and audio outputs cohesively. The result is significant delay in interaction, with latency reported as far exceeding human conversational timing. For instance, humans are reported to achieve 90% turn-on-response alignment with a median latency of 1.4 seconds, while even the best-performing models reach only 73%, with latencies over 700 milliseconds. The findings point to a need for end-to-end models that incorporate both verbal and nonverbal aspects of communication.
