Architecture Overview - Hardsim Docs

High-Level Flow

SDK calls POST /v0/jobs.
API validates auth and workload guardrails.
Job is persisted and queued.
Fleet controller scales workers on demand.
Worker executes simulation and uploads artifacts.
Billing settles to Hardlight ledger.
Customer downloads artifacts via presigned URLs.
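
The steps above can be read as a linear job lifecycle. The following is a minimal sketch in Python, not Hardsim's actual implementation: the `JobState` names and the `advance` and `run_to_completion` helpers are hypothetical, chosen only to mirror the listed steps. It ignores failures and retries, and it treats fleet scaling and artifact download as events outside the job's own state.

```python
from enum import Enum


class JobState(Enum):
    # Hypothetical state names; the source lists the steps, not the states.
    SUBMITTED = "submitted"  # SDK called POST /v0/jobs
    VALIDATED = "validated"  # auth and workload guardrails passed
    QUEUED = "queued"        # job persisted and placed on the queue
    RUNNING = "running"      # a worker is executing the simulation
    UPLOADED = "uploaded"    # artifacts uploaded by the worker
    SETTLED = "settled"      # billing settled to the ledger


# In this simplified, failure-free view each state has exactly one successor.
ORDER = list(JobState)


def advance(state: JobState) -> JobState:
    """Return the next lifecycle state, or raise if the job is terminal."""
    idx = ORDER.index(state)
    if idx == len(ORDER) - 1:
        raise ValueError(f"{state.name} is terminal")
    return ORDER[idx + 1]


def run_to_completion(state: JobState = JobState.SUBMITTED) -> list:
    """Walk a job through every remaining state and return the path taken."""
    path = [state]
    while path[-1] is not ORDER[-1]:
        path.append(advance(path[-1]))
    return path
```

Under these assumptions, artifacts become downloadable via presigned URLs once a job reaches the final state.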

Managed RL flow:

Customer creates run via POST /v1/training-runs.
Orchestrator creates rollout batches per iteration.
Rollout artifacts are grouped into a dataset manifest.
Trainer GPU task consumes manifest and emits checkpoint + metrics.
Next iteration uses prior checkpoint lineage.
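
The managed RL loop above can be sketched as follows. This is an illustrative model only: `Checkpoint`, `collect_rollouts`, `build_manifest`, `train`, and `run_training` are hypothetical names, and the rollout and training steps are stand-ins that do no real work. The point is the data flow, in which each iteration's checkpoint records its predecessor so lineage can be traced back.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Checkpoint:
    iteration: int
    parent: Optional["Checkpoint"]  # lineage link to the prior checkpoint


def collect_rollouts(iteration: int, batches: int) -> list:
    # Stand-in for the orchestrator's rollout batches; returns fake artifact keys.
    return [f"iter-{iteration}/batch-{b}" for b in range(batches)]


def build_manifest(artifacts: list) -> dict:
    # Group the iteration's rollout artifacts into one dataset manifest.
    return {"artifacts": sorted(artifacts), "count": len(artifacts)}


def train(manifest: dict, prior: Optional[Checkpoint], iteration: int):
    # Stand-in for the trainer GPU task: emits a checkpoint plus metrics.
    checkpoint = Checkpoint(iteration=iteration, parent=prior)
    metrics = {"iteration": iteration, "samples": manifest["count"]}
    return checkpoint, metrics


def run_training(iterations: int, batches_per_iteration: int) -> Optional[Checkpoint]:
    checkpoint = None
    for i in range(iterations):
        manifest = build_manifest(collect_rollouts(i, batches_per_iteration))
        # The next iteration starts from the checkpoint just produced.
        checkpoint, _metrics = train(manifest, checkpoint, i)
    return checkpoint


def lineage(checkpoint: Optional[Checkpoint]) -> list:
    """Return iteration numbers from the given checkpoint back to the first."""
    chain = []
    while checkpoint is not None:
        chain.append(checkpoint.iteration)
        checkpoint = checkpoint.parent
    return chain
```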

Components

API control plane
Worker runtime
Runtime store (Postgres in production)
Queue (SQS in production)
Managed artifact storage
Autoscaling worker fleet (ASG)
Autoscaling trainer GPU fleet (ASG)
Billing ledger integration

Scaling Behavior

Submitting 10k jobs does not launch 10k EC2 instances.
Capacity follows queue depth and active workload.
Workers scale up/down automatically based on demand.
Trainer fleet scaling is API-controlled and task-class aware.
Ghost-running protection prevents stale trainer task state from pinning GPU capacity.
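
To make the scaling behavior concrete, here is a rough Python sketch under stated assumptions. The source does not describe the actual scaling policy or how ghost-running protection is implemented, so everything here is hypothetical: `desired_workers`, `stale_task_ids`, the `jobs_per_worker` and `max_workers` parameters, and the heartbeat-timeout approach to detecting stale trainer tasks.

```python
import math


def desired_workers(queue_depth: int, active_jobs: int,
                    jobs_per_worker: int, max_workers: int) -> int:
    """Target fleet size from queue depth plus active workload, capped.

    jobs_per_worker and max_workers are assumed tuning knobs, not
    documented Hardsim settings.
    """
    if jobs_per_worker <= 0:
        raise ValueError("jobs_per_worker must be positive")
    demand = queue_depth + active_jobs
    return min(max_workers, math.ceil(demand / jobs_per_worker))


def stale_task_ids(last_heartbeat: dict, now: float, timeout_s: float) -> list:
    """Return ids of tasks whose last heartbeat is older than timeout_s.

    One possible way to keep stale task state from pinning GPU capacity:
    tasks reported here would be marked dead so their capacity is released.
    """
    return sorted(
        task_id for task_id, seen_at in last_heartbeat.items()
        if now - seen_at > timeout_s
    )
```

With these illustrative numbers, `desired_workers(10_000, 0, 8, 200)` is capped at 200 workers rather than one instance per job, and the fleet target drops to 0 when the queue is empty and nothing is running.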
