🤖 AI Summary
Graze has made two archived Bluesky/ATProto datasets publicly accessible via AWS S3 (Requester Pays) for researchers, developers, and archivists. The turbostream (graze-turbo-01) is a long-running, metadata-hydrated slice of the Bluesky firehose, available from 2025-04-21. The megastream (graze-mega-02) is the same stream enriched with ML inferences, available from 2025-09-09. Both are published as compressed SQLite snapshots (jetstream_YYYYMMDD_HHMMSS.db.zip and mega/mega_jetstream_...) containing raw events plus hydrated references (profiles, mentions, parent/quoted posts); megastream snapshots also include video transcriptions and per-record inference vectors. Access requires an AWS account, a configured CLI or credentials, and the Requester Pays flag (e.g., aws s3 cp --request-payer requester). Anonymous access is not supported, and standard S3 transfer fees apply.
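As a rough illustration of the access pattern described above, the following Python sketch downloads one snapshot with boto3 and lists the tables inside it. The bucket name and file-name pattern come from the summary. Everything else is an assumption: that snapshots sit at the bucket root, that the archive contains a single .db file, and the helper names themselves (snapshot_key, download_snapshot, list_tables) are hypothetical. The table layout is not documented here, so the sketch only enumerates tables rather than querying them.

```python
import sqlite3
import tempfile
import zipfile
from datetime import datetime
from pathlib import Path

BUCKET = "graze-turbo-01"  # turbostream bucket named in the summary


def snapshot_key(ts: datetime) -> str:
    """Build a key following the jetstream_YYYYMMDD_HHMMSS.db.zip pattern.

    Assumption: keys live at the bucket root with no prefix.
    """
    return f"jetstream_{ts:%Y%m%d_%H%M%S}.db.zip"


def download_snapshot(key: str, dest: Path, bucket: str = BUCKET) -> Path:
    """Download one object; the caller's AWS account pays for the transfer."""
    import boto3  # imported lazily so the other helpers work without boto3

    s3 = boto3.client("s3")
    # RequestPayer must be sent with the request, or S3 rejects it.
    s3.download_file(
        bucket, key, str(dest), ExtraArgs={"RequestPayer": "requester"}
    )
    return dest


def list_tables(zip_path: Path) -> list[str]:
    """Extract the SQLite file from the archive and return its table names."""
    with tempfile.TemporaryDirectory() as tmp:
        with zipfile.ZipFile(zip_path) as zf:
            # Assumption: the archive holds exactly one .db file.
            db_name = next(n for n in zf.namelist() if n.endswith(".db"))
            db_path = zf.extract(db_name, tmp)
        conn = sqlite3.connect(db_path)
        try:
            rows = conn.execute(
                "SELECT name FROM sqlite_master "
                "WHERE type = 'table' ORDER BY name"
            ).fetchall()
        finally:
            conn.close()
    return [row[0] for row in rows]
```

The exact snapshot timestamps are not given here, so in practice one would list the bucket first (also with the Requester Pays option set) and pick a real key rather than constructing one blindly.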
Megastream attaches probabilistic ML signals, each scored from 0 to 1, to every post: language detection (20+ languages), content-moderation flags (violence, hate, self-harm, sexual content, harassment), sentiment, 20+ topic categories, 28-emotion detection, toxicity sub-scores, financial sentiment, marketing/spam detection, and text embeddings from multiple models for semantic search. Together these support reproducible research in moderation, NLP and embedding benchmarks, social-graph analysis, and content discovery, as well as the construction of fine-tuning datasets. Users should budget for data-transfer costs and pass the Requester Pays option on every request when scripting downloads (CLI or boto3).
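To make the use of these signals concrete, here is a minimal Python sketch that filters posts by a 0-to-1 score and then ranks the remainder by cosine similarity to a query embedding. The record shape is an assumption for illustration only: the field names "toxicity" and "embedding" are hypothetical, and the actual megastream column names and vector dimensions are not given in this summary.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a zero vector has no direction to compare
    return dot / (norm_a * norm_b)


def filter_by_score(posts, field, max_score):
    """Keep posts whose probabilistic score for `field` is at most max_score."""
    return [p for p in posts if p[field] <= max_score]


def rank_by_similarity(query_vec, posts, k):
    """Return the k posts whose embeddings are closest to query_vec."""
    ranked = sorted(
        posts,
        key=lambda p: cosine_similarity(query_vec, p["embedding"]),
        reverse=True,
    )
    return ranked[:k]


# Toy records with hypothetical field names and 2-dimensional embeddings.
posts = [
    {"id": "a", "embedding": [1.0, 0.0], "toxicity": 0.1},
    {"id": "b", "embedding": [0.0, 1.0], "toxicity": 0.9},
    {"id": "c", "embedding": [1.0, 1.0], "toxicity": 0.2},
]

safe = filter_by_score(posts, "toxicity", max_score=0.5)
top = rank_by_similarity([1.0, 0.0], safe, k=2)
print([p["id"] for p in top])  # ['a', 'c']
```

A linear scan like this is fine for a single snapshot's worth of experimentation; at larger scale a vector index would be the usual replacement.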