🤖 AI Summary
In a paper accepted to the Learning from Time Series for Health workshop at NeurIPS 2025, researchers demonstrate a practical "late fusion" approach that uses large language models (LLMs) to combine modality-specific outputs for activity recognition from audio and motion time-series. Using a curated subset of Ego4D spanning household and sports activities, they feed predictions or summaries from separate audio and motion models into an LLM and perform 12-class classification in zero- and one-shot settings. The LLM-based fusion yielded F1 scores well above chance without any task-specific training, showing that LLMs can meaningfully reason about and reconcile complementary sensor streams even with minimal labeled examples.
This work is significant because it offers a deployment-friendly alternative to training a joint multimodal embedding: late fusion via LLMs can enable multimodal temporal applications where aligned multimodal training data are scarce, and it avoids the additional memory and compute cost of building bespoke multimodal models. Practically, this suggests new pipelines in which lightweight sensor-specific models produce interpretable outputs that an LLM fuses via prompting, enabling fast prototyping and transfer across contexts. Caveats remain, including LLM inference latency and cost and robustness to noisy sensor outputs, but the results point to LLMs as a flexible fusion layer for time-series multimodal tasks.
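To make that fusion pattern concrete, here is a minimal sketch of prompt-based late fusion, assuming hypothetical per-modality classifier outputs and using the OpenAI chat completions client as the LLM backend. The label set, model name, prompt wording, and function names are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of LLM-based late fusion for activity recognition.
# The label set, per-modality outputs, and prompt wording are illustrative
# assumptions, not the paper's exact setup.
from openai import OpenAI

LABELS = [
    "cooking", "cleaning", "gardening", "laundry", "basketball", "soccer",
    "cycling", "running", "weightlifting", "yoga", "walking", "watching_tv",
]  # stand-in 12-class label set

def fuse_with_llm(audio_summary: str, motion_summary: str, example: str | None = None) -> str:
    """Ask an LLM to reconcile audio and motion model outputs into one activity label."""
    prompt = (
        "You are fusing outputs from two sensor-specific models to recognize "
        "a person's activity.\n"
        f"Audio model output: {audio_summary}\n"
        f"Motion model output: {motion_summary}\n"
        f"Choose exactly one label from: {', '.join(LABELS)}.\n"
        "Answer with the label only."
    )
    if example:  # optional one-shot demonstration prepended for the one-shot setting
        prompt = example + "\n\n" + prompt

    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return resp.choices[0].message.content.strip()

# Example call with hypothetical upstream model outputs:
# fuse_with_llm("clanging pans, running water (audio model: 'cooking', p=0.7)",
#               "low-intensity arm movement, mostly stationary")
```

The point of the sketch is that the fusion layer is just a prompt: swapping in different sensor models or label sets requires no retraining, which is what makes this style of late fusion attractive for fast prototyping.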