🤖 AI Summary
CodeLens.AI is a community-driven benchmarking service that runs six leading code-generation models (GPT-5, Claude Opus 4.1, Claude Sonnet 4.5, Grok 4, Gemini 2.5 Pro, and o3) against real developer-submitted coding tasks. A user pastes a problem; the platform auto-detects the domain, task type, and language, runs all six models in parallel, and has the current champion model provide an initial AI judgment. Community members then cast a required vote with a short explanation; votes update a live leaderboard and build a dataset of real-world LLM performance. Results are delivered by email, typically within 24 hours, and no payment or signup is required.
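To make the flow concrete, here is a minimal TypeScript sketch of the pipeline described above. The six model names come from the summary; every type, function, and stub implementation is a hypothetical stand-in for illustration, not CodeLens.AI's actual API.

```typescript
// Hypothetical sketch of the CodeLens.AI-style evaluation flow.
// All names below are assumptions for illustration only.

type ModelId =
  | "gpt-5" | "claude-opus-4.1" | "claude-sonnet-4.5"
  | "grok-4" | "gemini-2.5-pro" | "o3";

interface TaskMeta { domain: string; taskType: string; language: string; }
interface ModelOutput { model: ModelId; answer: string; }

const MODELS: ModelId[] = [
  "gpt-5", "claude-opus-4.1", "claude-sonnet-4.5",
  "grok-4", "gemini-2.5-pro", "o3",
];

// Stub: classify the pasted problem (domain, task type, language).
async function detectTaskMeta(problem: string): Promise<TaskMeta> {
  return { domain: "backend", taskType: "refactoring", language: "TypeScript" };
}

// Stub: run one model on the task; a real system would call the vendor API here.
async function runModel(model: ModelId, problem: string, meta: TaskMeta): Promise<ModelOutput> {
  return { model, answer: `<output of ${model} for a ${meta.taskType} task>` };
}

// Stub: the current leaderboard champion issues the initial AI judgment,
// which required community votes later confirm or overturn.
async function championJudge(outputs: ModelOutput[]): Promise<ModelId> {
  return outputs[0].model;
}

async function evaluateSubmission(problem: string) {
  const meta = await detectTaskMeta(problem);

  // All six models are run in parallel on the same task.
  const outputs = await Promise.all(MODELS.map((m) => runModel(m, problem, meta)));

  // Initial AI judgment precedes the community voting step.
  const aiPick = await championJudge(outputs);

  return { meta, outputs, aiPick };
}

evaluateSubmission("Refactor this handler to avoid the N+1 query").then((r) =>
  console.log(`AI judge picked ${r.aiPick} out of ${r.outputs.length} outputs`),
);
```

The sketch only captures the shape of the workflow (auto-detection, parallel runs, AI pre-judgment); the vote collection, leaderboard update, and email delivery steps would sit on top of it.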
For the AI/ML community this matters because it emphasizes task-specific, human-validated comparisons rather than synthetic or vendor-curated metrics. The required qualitative comments capture contextual strengths and failure modes (security, refactoring, architecture, etc.), making the benchmark more actionable for practitioners choosing models for particular code tasks. Technical implications include rapid side-by-side output comparisons, crowdsourced ground truth, and a growing, task-oriented dataset that could influence model selection and future evaluation standards. Reliance on a single "champion" AI judge for initial scoring could introduce bias, which the community voting step aims to mitigate.