High-Level Flow

  1. SDK calls POST /v0/jobs.
  2. API validates auth and workload guardrails.
  3. Job is persisted and queued.
  4. Fleet controller scales workers on demand.
  5. Worker executes simulation and uploads artifacts.
  6. Billing settles to Hardlight ledger.
  7. Customer downloads artifacts via presigned URLs.
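The job flow above can be sketched as a minimal in-memory model. All names here (Job, submit_job, the guardrail field, the artifact URL shape) are illustrative assumptions, not the platform's real SDK or schema; auth validation and presigned-URL generation are elided.

```python
from dataclasses import dataclass, field
from collections import deque

@dataclass
class Job:
    job_id: str
    workload: dict
    status: str = "queued"
    artifacts: list = field(default_factory=list)

queue = deque()   # stands in for the production queue
ledger = []       # stands in for the billing ledger

def submit_job(job_id, workload, max_gpus=8):
    # Step 2: validate workload guardrails (auth check elided).
    if workload.get("gpus", 0) > max_gpus:
        raise ValueError("workload exceeds guardrails")
    # Step 3: persist and queue the job.
    job = Job(job_id, workload)
    queue.append(job)
    return job

def run_next_worker():
    # Step 5: worker executes the simulation and uploads artifacts.
    job = queue.popleft()
    job.status = "succeeded"
    job.artifacts.append(f"s3://bucket/{job.job_id}/result.bin")
    # Step 6: billing settles to the ledger.
    ledger.append((job.job_id, "settled"))
    return job

job = submit_job("job-1", {"gpus": 2})
done = run_next_worker()
```

The real control plane separates these steps across services; the sketch only fixes the ordering and state transitions.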

Managed RL Flow

  1. Customer creates run via POST /v1/training-runs.
  2. Orchestrator creates rollout batches per iteration.
  3. Rollout artifacts are grouped into a dataset manifest.
  4. Trainer GPU task consumes manifest and emits checkpoint + metrics.
  5. Next iteration uses prior checkpoint lineage.
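The iteration loop above can be sketched as follows; the manifest shape, checkpoint names, and lineage records are hypothetical, not the orchestrator's real data model.

```python
def run_training(iterations=3, rollouts_per_iter=4):
    checkpoint = None  # iteration 1 starts without a prior checkpoint
    lineage = []
    for i in range(1, iterations + 1):
        # Steps 2-3: create rollout batches, group artifacts into a manifest.
        manifest = {
            "iteration": i,
            "rollouts": [f"rollout-{i}-{r}" for r in range(rollouts_per_iter)],
        }
        # Step 4: trainer consumes the manifest and emits a checkpoint.
        parent = checkpoint
        checkpoint = f"ckpt-{i}"
        # Step 5: record lineage so the next iteration builds on this one.
        lineage.append({"checkpoint": checkpoint, "parent": parent,
                        "manifest": manifest})
    return lineage

lineage = run_training()
```

The key invariant is the lineage chain: each checkpoint's parent is the previous iteration's checkpoint, so a run can be audited or resumed from any point.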

Components

  • API control plane
  • Worker runtime
  • Runtime store (Postgres in production)
  • Queue (SQS in production)
  • Managed artifact storage
  • Autoscaling worker fleet (ASG)
  • Autoscaling trainer GPU fleet (ASG)
  • Billing ledger integration
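One way to picture the component list is as a per-environment backend map. The production backends (Postgres, SQS, ASG) come from the list above; the "local" column and the helper are purely hypothetical.

```python
# Hypothetical wiring table: which backend serves each component per environment.
BACKENDS = {
    "prod": {
        "runtime_store": "postgres",
        "queue": "sqs",
        "artifact_storage": "managed",
        "worker_fleet": "asg",
        "trainer_fleet": "asg",
    },
    "local": {  # illustrative dev stand-ins, not documented behavior
        "runtime_store": "sqlite",
        "queue": "in-memory",
        "artifact_storage": "local-disk",
        "worker_fleet": "subprocess",
        "trainer_fleet": "subprocess",
    },
}

def backend(env, component):
    # Look up the backend serving a component in a given environment.
    return BACKENDS[env][component]
```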

Scaling Behavior

  • 10k submitted jobs do not mean 10k EC2 instances.
  • Capacity follows queue depth and active workload.
  • Workers scale up/down automatically based on demand.
  • Trainer fleet scaling is API-controlled and task-class aware.
  • Ghost-running protection prevents stale trainer task state from pinning GPU capacity.
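The scaling rules above can be sketched as two small functions, assuming capacity follows queue depth plus active workload and that trainer tasks heartbeat a lease. The jobs-per-worker ratio, fleet ceiling, and lease duration are illustrative numbers, not production values.

```python
import math

def desired_workers(queue_depth, active_jobs, jobs_per_worker=4,
                    min_workers=0, max_workers=500):
    # Capacity tracks demand, not submissions: 10k queued jobs with
    # 4 jobs per worker asks for 2,500 workers, clamped to the fleet
    # ceiling, never 10k instances.
    need = math.ceil((queue_depth + active_jobs) / jobs_per_worker)
    return max(min_workers, min(max_workers, need))

def live_trainer_tasks(tasks, now, lease_seconds=120):
    # Ghost-running protection: a task whose heartbeat lease has expired
    # no longer counts toward, or pins, GPU capacity.
    return [t for t in tasks if now - t["last_heartbeat"] <= lease_seconds]

capacity = desired_workers(queue_depth=10_000, active_jobs=0)
live = live_trainer_tasks(
    [{"last_heartbeat": 100}, {"last_heartbeat": 0}], now=150)
```

Scaling down uses the same function: as the queue drains, `desired_workers` falls, and the fleet controller releases the excess instances.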