
Goal

Create a managed training run (/v1/training-runs) that:
  • dispatches rollout jobs
  • builds an iteration dataset manifest
  • runs a trainer container
  • emits metrics.json and chained checkpoints
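The loop those four bullets describe can be sketched in a few lines of Python. This is illustrative pseudocode only: every function name here is a hypothetical placeholder, not an SDK or API call.

```python
# Illustrative sketch of the managed training loop described above.
# dispatch_rollouts / build_manifest / run_trainer are hypothetical callables.

def run_managed_loop(max_iterations, dispatch_rollouts, build_manifest,
                     run_trainer, checkpoint_in=None):
    """Run up to max_iterations of rollout -> manifest -> train -> checkpoint."""
    checkpoints = []
    for i in range(max_iterations):
        episodes = dispatch_rollouts(i)                 # dispatch rollout jobs
        manifest = build_manifest(i, episodes)          # iteration dataset manifest
        metrics, checkpoint_out = run_trainer(manifest, checkpoint_in)  # trainer container
        checkpoints.append(checkpoint_out)              # chained checkpoints
        checkpoint_in = checkpoint_out                  # next iteration resumes from here
    return checkpoints
```

The key property is the last line of the loop body: each iteration's output checkpoint becomes the next iteration's input, which is what produces the chained lineage you see under /checkpoints.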

Prerequisites

  • A service API key (hls_live_*) with training scopes.
  • Docker installed locally to build your trainer image.
  • API environment configured:
export HARDSIM_API_URL=https://api-sim.hardlightsim.com
export HARDSIM_API_KEY=hls_live_your_service_key

1) Build and Push Test Trainer Image

Use the reference image in this repo:
  • examples/managed_training/smoke_trainer/Dockerfile
  • examples/managed_training/smoke_trainer/train_smoke.py
Build locally:
docker build -t hardsim-trainer-smoke:0.1.1 -f examples/managed_training/smoke_trainer/Dockerfile examples/managed_training/smoke_trainer
Push to your registry (GHCR/ECR) and use that image URI in trainer_spec.image_uri.
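For GHCR, the tag-and-push step looks roughly like the following. The `acme-robotics` namespace and the `GHCR_USER`/`GHCR_TOKEN` variables are assumptions; substitute your own org and credentials.

```shell
#!/bin/sh
# Example only: tag and push the smoke trainer image to GHCR.
# The registry namespace below is an assumption; substitute your own.
REGISTRY=ghcr.io/acme-robotics
IMAGE=hardsim-trainer-smoke
TAG=0.1.1
IMAGE_URI="$REGISTRY/$IMAGE:$TAG"

# Skip the docker calls gracefully on machines without docker installed.
if command -v docker >/dev/null 2>&1; then
  echo "$GHCR_TOKEN" | docker login ghcr.io -u "$GHCR_USER" --password-stdin
  docker tag "$IMAGE:$TAG" "$IMAGE_URI"
  docker push "$IMAGE_URI"
fi
echo "Use $IMAGE_URI in trainer_spec.image_uri"
```

Whatever URI you push is the value to put in trainer_spec.image_uri in the next step.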

2) Submit Training Run

curl -X POST "$HARDSIM_API_URL/v1/training-runs" \
  -H "Authorization: Bearer $HARDSIM_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "project_id": "proj_demo",
    "name": "managed-loop-smoke",
    "rollout_template": {
      "rollout_parallelism": 2,
      "episodes_per_iteration": 2,
      "scene_id": "franka_cabinet_oige_v1",
      "num_envs": 8,
      "steps": 128,
      "control": {"task_mode": "fr3_pick_lift_block_v1", "fix_base": true}
    },
    "trainer_spec": {
      "image_uri": "ghcr.io/acme-robotics/hardsim-trainer-smoke:0.1.1",
      "entrypoint": "python /app/train_smoke.py",
      "args_template": [
        "--dataset-manifest-path", "{dataset_manifest_path}",
        "--output-dir", "{output_dir}",
        "--checkpoint-in-uri", "{checkpoint_in_uri}"
      ],
      "env": {
        "HARDSIM_RUN_ID": "{run_id}",
        "HARDSIM_ITERATION_ID": "{iteration_id}"
      },
      "resources": {"gpu": 1, "cpu": 1, "memory_gb": 2}
    },
    "loop_policy": {
      "max_iterations": 3,
      "max_wallclock_hours": 2,
      "stop_on_iteration_failure": true,
      "rollout_retry_limit": 1,
      "trainer_retry_limit": 1
    },
    "success_criteria": {"patience_iterations": 2},
    "budget_policy": {"max_credits": 50}
  }'
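The same submission from Python, using only the standard library. The payload mirrors the curl call above; `submit_training_run` is a local helper sketched here, not an SDK function.

```python
import json
import os
import urllib.request

def build_training_run_payload():
    """Build the /v1/training-runs request body used in the curl example above."""
    return {
        "project_id": "proj_demo",
        "name": "managed-loop-smoke",
        "rollout_template": {
            "rollout_parallelism": 2,
            "episodes_per_iteration": 2,
            "scene_id": "franka_cabinet_oige_v1",
            "num_envs": 8,
            "steps": 128,
            "control": {"task_mode": "fr3_pick_lift_block_v1", "fix_base": True},
        },
        "trainer_spec": {
            "image_uri": "ghcr.io/acme-robotics/hardsim-trainer-smoke:0.1.1",
            "entrypoint": "python /app/train_smoke.py",
            "args_template": [
                "--dataset-manifest-path", "{dataset_manifest_path}",
                "--output-dir", "{output_dir}",
                "--checkpoint-in-uri", "{checkpoint_in_uri}",
            ],
            "env": {
                "HARDSIM_RUN_ID": "{run_id}",
                "HARDSIM_ITERATION_ID": "{iteration_id}",
            },
            "resources": {"gpu": 1, "cpu": 1, "memory_gb": 2},
        },
        "loop_policy": {
            "max_iterations": 3,
            "max_wallclock_hours": 2,
            "stop_on_iteration_failure": True,
            "rollout_retry_limit": 1,
            "trainer_retry_limit": 1,
        },
        "success_criteria": {"patience_iterations": 2},
        "budget_policy": {"max_credits": 50},
    }

def submit_training_run(payload):
    """POST the payload, reading HARDSIM_API_URL / HARDSIM_API_KEY from the env."""
    req = urllib.request.Request(
        os.environ["HARDSIM_API_URL"] + "/v1/training-runs",
        data=json.dumps(payload).encode(),
        headers={
            "Authorization": "Bearer " + os.environ["HARDSIM_API_KEY"],
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# To submit for real: print(submit_training_run(build_training_run_payload()))
```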
Notes:
  • Managed training is GPU-only by default: trainer_spec.resources.gpu must be greater than 0. If gpu is omitted or set to 0, the API returns HTTP 400.
  • Trainer fleet autoscaling is driven by the count of active runnable tasks, and includes startup recovery and ghost-running protection.
  • You can pass checkpoint_init_asset_id instead of a raw checkpoint URI.
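The {…} placeholders in args_template and env are filled in by the service once per iteration. The real renderer runs server-side; the sketch below only illustrates the substitution behavior, and the manifest, output, and checkpoint paths in it are made-up example values.

```python
def render_template(items, context):
    """Fill {placeholder} slots in a list of template strings with per-iteration values."""
    return [s.format(**context) for s in items]

args_template = [
    "--dataset-manifest-path", "{dataset_manifest_path}",
    "--output-dir", "{output_dir}",
    "--checkpoint-in-uri", "{checkpoint_in_uri}",
]
context = {  # example values only; the service supplies the real ones
    "dataset_manifest_path": "/data/iter_0/manifest.json",
    "output_dir": "/outputs/iter_0",
    "checkpoint_in_uri": "s3://bucket/run/ckpt_prev.pt",
}
print(render_template(args_template, context))
```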

3) Monitor

Use these endpoints:
  • GET /v1/training-runs/{run_id}
  • GET /v1/training-runs/{run_id}/iterations
  • GET /v1/training-runs/{run_id}/checkpoints
  • GET /v1/training-runs/{run_id}/events
  • GET /v1/training-runs/{run_id}/checkpoints/{checkpoint_id}/download-url (checkpoint download link)
Monitoring from the Python SDK:
import hardsim as hs

client = hs.HardsimClient.from_env()

# Manual status checks anytime:
print(client.get_training_run(run_id))
print(client.list_training_events(run_id, limit=20))

# Live watch stream (full watch if timeout is large enough):
client.watch_training_run(run_id, poll_interval_s=5.0, timeout_s=24*3600)
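If you are not using the SDK, the watch behavior is easy to approximate with a polling loop. A sketch: `fetch_status` is any callable that returns the run's current status string (for example, a wrapper around GET /v1/training-runs/{run_id}), and the terminal state names are assumptions for illustration.

```python
import time

# Assumption: example terminal state names, not the API's documented set.
TERMINAL_STATES = {"succeeded", "failed", "cancelled"}

def poll_until_terminal(fetch_status, poll_interval_s=5.0, timeout_s=24 * 3600):
    """Poll fetch_status() until it returns a terminal state or the timeout elapses."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        status = fetch_status()
        if status in TERMINAL_STATES:
            return status
        time.sleep(poll_interval_s)
    raise TimeoutError("run did not reach a terminal state in time")
```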

Expected Success Contract

Each completed iteration must produce:
  • trainer output metrics.json
  • one checkpoint (checkpoint.pt or checkpoint_out_uri.txt)
  • one checkpoint lineage entry in /checkpoints
checkpoint_init_asset_id is optional. Add it only when you want to start from an existing checkpoint asset.
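A trainer can self-check this contract before exiting. A minimal sketch, assuming the output filenames listed above land directly in the trainer's output directory:

```python
from pathlib import Path

def check_iteration_outputs(output_dir):
    """Verify the per-iteration success contract: metrics.json plus one checkpoint."""
    out = Path(output_dir)
    if not (out / "metrics.json").is_file():
        return False
    # Either an inline checkpoint file or a pointer file holding a remote URI.
    return (out / "checkpoint.pt").is_file() or (out / "checkpoint_out_uri.txt").is_file()
```

Failing fast inside the trainer gives a clearer error than letting the control plane reject the iteration after upload.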