Make capability measurable — with real data-science discipline.

01

Who it is for

For people who can measure what others assume: benchmarking, agent and model diagnosis, judgement pipelines. You work close to frontier AI, use Arc's evaluation substrate (Lancet), and run a full professional data-science process end to end (collect, validate, preprocess, EDA, pipeline, test, eval), often producing a corpus, benchmark, or dataset that resists gaming.

02

What it trains

  • Running a full data-science process: collect → validate → preprocess → EDA → pipeline → test → eval
  • Designing benchmarks that resist gaming
  • Diagnosing where a system actually fails
  • Working with Arc's evaluation substrate (Lancet)
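
The process named in the first bullet can be sketched in a few lines. This is a toy illustration only, not Arc's or Lancet's actual tooling: every function name, the sample records, and the exact-match metric are assumptions made up for the example.

```python
from statistics import mean

def collect():
    # Hypothetical raw records: (prompt, model_answer, reference_answer).
    return [
        ("2+2", "4", "4"),
        ("Capital of France", "Paris", "paris"),
        ("3*3", "6", "9"),
        ("", "n/a", "n/a"),   # malformed: empty prompt
        ("2+2", "4", "4"),    # exact duplicate
    ]

def validate(records):
    # Keep only records in which every field is non-empty.
    return [r for r in records if all(field.strip() for field in r)]

def preprocess(records):
    # Normalise case and whitespace, then de-duplicate, preserving order.
    seen, cleaned = set(), []
    for record in records:
        normalised = tuple(field.strip().lower() for field in record)
        if normalised not in seen:
            seen.add(normalised)
            cleaned.append(normalised)
    return cleaned

def explore(records):
    # Minimal EDA: record count and mean prompt length in characters.
    return {
        "n": len(records),
        "mean_prompt_len": mean(len(prompt) for prompt, _, _ in records),
    }

def evaluate(records):
    # Exact-match accuracy (an assumed metric, chosen for simplicity).
    return mean(1.0 if answer == ref else 0.0 for _, answer, ref in records)

def run_pipeline():
    # collect -> validate -> preprocess -> EDA -> test -> eval
    records = preprocess(validate(collect()))
    summary = explore(records)
    assert summary["n"] > 0, "test stage: pipeline produced no usable records"
    return summary, evaluate(records)

if __name__ == "__main__":
    print(run_pipeline())
```

Keeping each stage a separate function means any one of them can be tested, or swapped out, without touching the rest.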

03

Example missions

  • Design a benchmark for a capability that lacks one
  • Build an agent-failure diagnosis harness
  • Stand up a judgement pipeline with quality controls
  • Audit a system's claims against measured evidence
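
As one example of a quality control for a judgement pipeline, agreement between two judges can be checked with Cohen's kappa, which corrects raw agreement for chance. This is a minimal sketch: the labels are invented, and nothing here reflects how Arc or Lancet implements such a check.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two raters over the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both raters must label the same non-empty item set")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if each rater labelled at random with their own
    # label frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if expected == 1.0:
        # Both raters used one identical label throughout.
        return 1.0
    return (observed - expected) / (1 - expected)

if __name__ == "__main__":
    # Hypothetical verdicts from two judges on four items.
    judge_1 = ["pass", "pass", "fail", "fail"]
    judge_2 = ["pass", "fail", "fail", "fail"]
    print(cohens_kappa(judge_1, judge_2))  # 0.5
```

A low kappa usually means the rubric is ambiguous rather than that one judge is wrong, so it is worth checking before trusting any scores built on those judgements.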

04

What you leave with

  • A benchmark, corpus, or dataset
  • A diagnosis others can trust
  • An evaluation pipeline

05

How a mission works

Arc shows you a few real projects it judges you ready for, and you choose the one that draws you. From there the work is mission-based and asynchronous: a clear brief in, a concrete artifact out. You investigate, decide, and return with evidence, and Arc evaluates the outcome, not the motion. Expect the start to be hard, with unfamiliar tools and an unfamiliar problem space; that crossing is the point.

06

What it is not

  • Not a course or a bootcamp — the work is real, and harder
  • Not employment, salary, a title, or a guaranteed role — a cultivation path, not a job
  • Not metric theatre

07

Selection

  • Recognised through real work, by invitation — not an application
  • Rigour under ambiguity; you measure honestly, including failure