High-Level Flow
- SDK calls POST /v0/jobs.
- API validates auth and workload guardrails.
- Job is persisted and queued.
- Fleet controller scales workers on demand.
- Worker executes simulation and uploads artifacts.
- Billing settles to Hardlight ledger.
- Customer downloads artifacts via presigned URLs.
- Customer creates run via POST /v1/training-runs.
- Orchestrator creates rollout batches per iteration.
- Rollout artifacts are grouped into a dataset manifest.
- Trainer GPU task consumes manifest and emits checkpoint + metrics.
- Next iteration uses prior checkpoint lineage.
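The training-run steps above can be sketched as a small in-memory loop. This is only an illustration under stated assumptions: `Checkpoint`, `run_rollout_batch`, `build_manifest`, `train`, and `run_training` are hypothetical names, the artifact key layout is invented, and nothing here reflects the real orchestrator or trainer interfaces.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class Checkpoint:
    """Hypothetical checkpoint record; `parent` links to the prior iteration."""
    iteration: int
    parent: Optional["Checkpoint"]

    def lineage(self) -> List[int]:
        """Return iteration numbers from the oldest ancestor to this checkpoint."""
        chain: List[int] = []
        node: Optional["Checkpoint"] = self
        while node is not None:
            chain.append(node.iteration)
            node = node.parent
        return chain[::-1]


def run_rollout_batch(iteration: int, batch_index: int) -> str:
    # Stand-in for a worker running one rollout batch and uploading an artifact.
    # The key format is an assumption made for this sketch.
    return f"rollouts/iter-{iteration}/batch-{batch_index}.bin"


def build_manifest(iteration: int, artifact_keys: List[str]) -> Dict:
    # Group the rollout artifacts of one iteration into a dataset manifest.
    return {"iteration": iteration, "artifacts": sorted(artifact_keys)}


def train(manifest: Dict, prior: Optional[Checkpoint]) -> Tuple[Checkpoint, Dict]:
    # Stand-in for the trainer GPU task: consumes a manifest,
    # emits a checkpoint that points at the prior one, plus metrics.
    checkpoint = Checkpoint(iteration=manifest["iteration"], parent=prior)
    metrics = {"num_artifacts": len(manifest["artifacts"])}
    return checkpoint, metrics


def run_training(iterations: int, batches_per_iteration: int) -> Optional[Checkpoint]:
    checkpoint: Optional[Checkpoint] = None
    for it in range(iterations):
        keys = [run_rollout_batch(it, b) for b in range(batches_per_iteration)]
        manifest = build_manifest(it, keys)
        # Each iteration starts from the checkpoint the previous one produced.
        checkpoint, _metrics = train(manifest, checkpoint)
    return checkpoint
```

Calling `run_training(3, 2)` returns the last checkpoint, and its `lineage()` walks back through every earlier iteration, which is the property the "prior checkpoint lineage" step relies on.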
Components
- API control plane
- Worker runtime
- Runtime store (Postgres in production)
- Queue (SQS in production)
- Managed artifact storage
- Autoscaling worker fleet (ASG)
- Autoscaling trainer GPU fleet (ASG)
- Billing ledger integration
Scaling Behavior
- Submitting 10k jobs does not launch 10k EC2 instances.
- Capacity follows queue depth and active workload.
- Workers scale up/down automatically based on demand.
- Trainer fleet scaling is API-controlled and task-class aware.
- Ghost-running protection prevents stale trainer task state from pinning GPU capacity.
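One way to picture the scaling rules above is a capacity calculation driven by queue depth and active work, plus a staleness filter for trainer tasks. The sketch below is an assumption, not the actual policy: `desired_workers`, `TrainerTask`, `live_trainer_tasks`, the per-worker job ratio, the fleet cap, and the heartbeat timeout are all hypothetical.

```python
import math
from dataclasses import dataclass
from typing import List


def desired_workers(queued: int, running: int, jobs_per_worker: int,
                    min_workers: int, max_workers: int) -> int:
    """Size the fleet from queue depth and active workload, clamped to bounds."""
    if jobs_per_worker <= 0:
        raise ValueError("jobs_per_worker must be positive")
    needed = math.ceil((queued + running) / jobs_per_worker)
    return max(min_workers, min(max_workers, needed))


@dataclass
class TrainerTask:
    task_id: str
    state: str             # e.g. "running" or "finished" (assumed states)
    last_heartbeat: float  # seconds on the same clock as `now`


def live_trainer_tasks(tasks: List[TrainerTask], now: float,
                       heartbeat_timeout: float) -> List[TrainerTask]:
    """Keep only running tasks with a recent heartbeat.

    A task that still says "running" but has stopped reporting is treated
    as a ghost and is not counted when deciding how many GPUs to hold.
    """
    return [
        t for t in tasks
        if t.state == "running" and now - t.last_heartbeat <= heartbeat_timeout
    ]
```

With an assumed cap of 200 workers and 4 jobs per worker, `desired_workers(10_000, 0, 4, 0, 200)` yields 200 rather than 10,000, and an empty queue with nothing running yields the minimum. Counting only `live_trainer_tasks` when sizing the GPU fleet lets capacity scale down even if a stale task record is never cleaned up.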