🤖 AI Summary
Jeff Dean’s talk surveys the practical and technical challenges of building web-scale information retrieval systems, using Google’s search evolution (circa 1997–2009) as a case study. He frames retrieval as a blend of science and large-scale engineering in which several interacting dimensions (number of indexed docs, queries/sec, index freshness, query latency, per-doc metadata and scoring complexity) multiply together to determine difficulty and cost. Over roughly a decade Google’s footprint grew by ≈100× in documents and ≈1,000× in queries/day, with ≈3× more index data kept per document, while index update latency dropped from months to minutes and average query latency fell from under 1s to about 0.2s, forcing repeated re-architecture rather than one-off optimizations.
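
To make the multiplicative framing concrete, here is a toy back-of-the-envelope model (my own illustration, not a formula from the talk): with a fixed architecture, raw serving work would scale roughly with the product of corpus growth, query-volume growth, and per-document work.

```python
# Toy cost model (an illustrative assumption, not from the talk): if the
# architecture stays fixed, raw serving work scales roughly with the product
# of corpus growth, query-volume growth, and per-document index/scoring growth.

def relative_cost(doc_growth: float, query_growth: float, per_doc_growth: float) -> float:
    """Naive cost multiplier relative to the original system."""
    return doc_growth * query_growth * per_doc_growth

# Rough growth factors quoted in the summary: ~100x docs, ~1,000x queries/day, ~3x per-doc data.
print(relative_cost(100, 1_000, 3))  # 300000.0 -> why tuning alone could not keep up
```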
Key technical takeaways: partitioning by document id (doc-shards) was chosen because it keeps per-doc metadata local to each shard and limits network traffic, even though every shard must be consulted for every query; caching (30–60% hit rates) dramatically reduces resource needs but causes latency spikes when an index update invalidates the cache; and robust index-update and disk-layout strategies (rolling copies, careful placement of hot data on the faster zones of each disk) are essential for availability. Significant gains came from compact, CPU-efficient encodings (moving from simple byte-aligned postings to varints, Gamma/Rice/Golomb codes, and block-based formats with skip tables) that reduce disk seeks and bandwidth. The talk underscores that designs must anticipate orders-of-magnitude growth and plan for staged rewrites as requirements shift.
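
As a minimal sketch of the encoding idea (assuming nothing about Google’s actual on-disk format), the snippet below delta-encodes a sorted posting list and stores each gap as a byte-aligned varint, the first step on the path the summary describes before bit-level Gamma/Rice/Golomb codes and block formats with skip tables:

```python
# Minimal sketch: delta-gap + varint encoding of a posting list.
# Not Google's format; just the byte-aligned baseline the talk improves on.

def encode_varint(n: int) -> bytes:
    """Encode a non-negative int as a base-128 varint (7 bits per byte)."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)

def encode_postings(doc_ids: list[int]) -> bytes:
    """Delta-encode sorted doc ids, then varint-encode each gap."""
    out, prev = bytearray(), 0
    for doc_id in doc_ids:
        out += encode_varint(doc_id - prev)  # small gaps take 1-2 bytes instead of 4+
        prev = doc_id
    return bytes(out)

def decode_postings(data: bytes) -> list[int]:
    """Inverse of encode_postings: rebuild absolute doc ids from varint gaps."""
    doc_ids, doc_id, gap, shift = [], 0, 0, 0
    for byte in data:
        gap |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7          # continuation byte: keep accumulating this gap
        else:
            doc_id += gap       # gap complete: add to running doc id
            doc_ids.append(doc_id)
            gap, shift = 0, 0
    return doc_ids

postings = [1003, 1011, 1012, 1300, 5000]
blob = encode_postings(postings)
assert decode_postings(blob) == postings
print(len(blob), "bytes vs", 4 * len(postings), "for fixed 32-bit doc ids")
```

The same gap sequence is what the bit-level Gamma/Rice/Golomb codes and block-based formats compress further; skip tables then let a query jump over runs of gaps without decoding them, which is where the disk-seek and bandwidth savings come from.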