🤖 AI Summary
Researchers released a lightweight open-source toolkit to automatically surface low-quality robot demonstration episodes in large teleoperation datasets (Open X-Embodiment). Because robot data is costly to collect, the team argues that "data understanding" is a crucial first step: even small fractions of noisy episodes—e.g., dark or blurry camera frames, idling teleoperation, collisions, or actuator saturation—can substantially degrade policy learning. In controlled corruption experiments, they show that injecting realistic visual and motion noise into just 20% of episodes slows learning (models need many more optimization steps to reach the same training loss), quantifying the "garbage in, garbage out" effect in robot policy training.
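The corruption protocol described above can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the function names (`darken`, `inject_idle`, `corrupt_fraction`), the darkening factor, and the episode dictionary layout are all assumptions for the sake of the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def darken(frames, factor=0.2):
    """Scale pixel intensities down to simulate a dark camera feed.
    (The factor 0.2 is an illustrative choice, not from the paper.)"""
    return (frames.astype(np.float64) * factor).astype(np.uint8)

def inject_idle(joints, start, length):
    """Freeze the joint trajectory for `length` steps to simulate
    idling teleoperation (a short idle window)."""
    out = joints.copy()
    out[start:start + length] = out[start]
    return out

def corrupt_fraction(episodes, frac=0.2):
    """Darken the frames of a random `frac` of episodes, mirroring the
    'corrupt 20% of episodes' setup; returns the corrupted indices."""
    n = len(episodes)
    bad = rng.choice(n, size=max(1, int(frac * n)), replace=False)
    for i in bad:
        episodes[i]["frames"] = darken(episodes[i]["frames"])
    return set(bad.tolist())
```

With corruptions injected this way, the ground-truth set of bad episodes is known, which is what makes the precision/recall validation of the detector possible.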
Technically, the tool scores episodes by uniformly sampling 10 frames and computing a per-episode visual score that penalizes blur (low variance of the Laplacian) and darkness (mean intensity < 50). Motion-quality scores flag likely collisions via joint-space acceleration spikes using robust median-based thresholds, estimate path efficiency as the ratio of straight-line distance to actual joint-space path length, detect actuator saturation by checking action/state divergence, and measure idle time as the fraction of steps with near-zero velocity. They validate detection (precision/recall) on Stanford HYDRA and demonstrate with artificial corruptions (darkening, blur, short idle windows, single-step acceleration outliers) that filtering low-quality episodes materially improves training efficiency and final behavior.
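The scoring heuristics above can be sketched with plain numpy. This is an illustrative reconstruction under stated assumptions: the brightness cutoff of 50 is from the summary, but the blur threshold, the MAD multiplier `k`, the saturation tolerance, the idle-velocity epsilon, and all function names are hypothetical defaults, not the toolkit's actual values.

```python
import numpy as np

# 3x3 Laplacian kernel for the blur measure (variance of the Laplacian).
LAPLACIAN = np.array([[0, 1, 0], [1, -4, 1], [0, 1, 0]], dtype=np.float64)

def laplacian_variance(gray):
    """Variance of the Laplacian response; low values indicate blur."""
    h, w = gray.shape
    out = np.zeros((h - 2, w - 2))
    for dy in range(3):
        for dx in range(3):
            out += LAPLACIAN[dy, dx] * gray[dy:dy + h - 2, dx:dx + w - 2]
    return out.var()

def visual_score(frames, blur_thresh=100.0, dark_thresh=50.0):
    """Uniformly sample 10 frames; penalize blur and low mean brightness."""
    idx = np.linspace(0, len(frames) - 1, num=min(10, len(frames)), dtype=int)
    penalties = 0
    for i in idx:
        gray = frames[i].astype(np.float64)
        if gray.ndim == 3:          # collapse RGB to grayscale if needed
            gray = gray.mean(axis=2)
        if laplacian_variance(gray) < blur_thresh:
            penalties += 1
        if gray.mean() < dark_thresh:
            penalties += 1
    return 1.0 - penalties / (2 * len(idx))

def collision_flags(joints, k=6.0):
    """Flag joint-space acceleration spikes using a robust
    median/MAD threshold; large spikes suggest collisions."""
    mag = np.linalg.norm(np.diff(joints, n=2, axis=0), axis=1)
    med = np.median(mag)
    mad = np.median(np.abs(mag - med)) + 1e-9
    return mag > med + k * mad

def path_efficiency(joints):
    """Straight-line joint-space distance over actual path length (<= 1)."""
    path = np.linalg.norm(np.diff(joints, axis=0), axis=1).sum()
    direct = np.linalg.norm(joints[-1] - joints[0])
    return direct / path if path > 0 else 1.0

def idle_fraction(joints, eps=1e-3):
    """Fraction of steps with near-zero joint velocity."""
    vel = np.linalg.norm(np.diff(joints, axis=0), axis=1)
    return float((vel < eps).mean())

def saturation_fraction(actions, states, tol=0.1):
    """Fraction of steps where the commanded action diverges from the
    achieved state change -- a proxy for actuator saturation."""
    err = np.linalg.norm(actions[:-1] - np.diff(states, axis=0), axis=1)
    return float((err > tol).mean())
```

A straight, constantly moving trajectory scores path efficiency 1.0 and idle fraction 0.0; an episode with a single-step jump in joint position trips the collision flag, which mirrors the "single-step acceleration outlier" corruption used in the paper's validation.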