
Goal

Create a managed training run (POST /v1/training-runs) that:
  • dispatches rollout jobs
  • builds an iteration dataset manifest
  • runs a trainer container
  • emits metrics.json and chained checkpoints

1) Build and Push Test Trainer Image

Use the reference image in this repo:
  • examples/managed_training/smoke_trainer/Dockerfile
  • examples/managed_training/smoke_trainer/train_smoke.py
Build locally:
docker build -t hardsim-trainer-smoke:0.1.1 -f examples/managed_training/smoke_trainer/Dockerfile examples/managed_training/smoke_trainer
Push to your registry (GHCR/ECR). See docs/trainer-smoke-image.md for exact commands.

2) Configure Trainer Runner

On the trainer-runner host (GPU worker class):
export HARDSIM_TRAINER_RUNNER_ENABLED=true
export HARDSIM_TRAINER_EXEC_MODE=docker
export HARDSIM_TRAINER_DOCKER_PULL_IMAGE=true
export HARDSIM_TRAINER_WORKER_CLASS=gpu
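Before launching the runner, it can help to fail fast if any of the variables above are missing. A minimal sanity-check sketch (the variable names come from this step; the check itself is not part of the product):

```shell
# Set the runner configuration (as in step 2).
export HARDSIM_TRAINER_RUNNER_ENABLED=true
export HARDSIM_TRAINER_EXEC_MODE=docker
export HARDSIM_TRAINER_DOCKER_PULL_IMAGE=true
export HARDSIM_TRAINER_WORKER_CLASS=gpu

# Fail fast if any required variable is unset or empty.
for var in HARDSIM_TRAINER_RUNNER_ENABLED HARDSIM_TRAINER_EXEC_MODE \
           HARDSIM_TRAINER_DOCKER_PULL_IMAGE HARDSIM_TRAINER_WORKER_CLASS; do
  eval "val=\${$var}"
  if [ -z "$val" ]; then
    echo "missing: $var" >&2
    exit 1
  fi
done
echo "runner env ok"
```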

3) Submit Training Run

curl -X POST "$HARDSIM_API_URL/v1/training-runs" \
  -H "Authorization: Bearer $HARDSIM_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "project_id": "proj_demo",
    "name": "managed-loop-smoke",
    "rollout_template": {
      "rollout_parallelism": 2,
      "episodes_per_iteration": 2,
      "scene_id": "franka_cabinet_oige_v1",
      "num_envs": 8,
      "steps": 128,
      "control": {"task_mode": "fr3_pick_lift_block_v1", "fix_base": true}
    },
    "trainer_spec": {
      "image_uri": "ghcr.io/<org-or-user>/hardsim-trainer-smoke:0.1.1",
      "entrypoint": "python /app/train_smoke.py",
      "args_template": [
        "--dataset-manifest-path", "{dataset_manifest_path}",
        "--output-dir", "{output_dir}",
        "--checkpoint-in-uri", "{checkpoint_in_uri}"
      ],
      "env": {
        "HARDSIM_RUN_ID": "{run_id}",
        "HARDSIM_ITERATION_ID": "{iteration_id}"
      },
      "resources": {"gpu": 1, "cpu": 1, "memory_gb": 2}
    },
    "checkpoint_init_asset_id": "asset_checkpoint_init_v1",
    "loop_policy": {
      "max_iterations": 3,
      "max_wallclock_hours": 2,
      "stop_on_iteration_failure": true,
      "rollout_retry_limit": 1,
      "trainer_retry_limit": 1
    },
    "success_criteria": {"patience_iterations": 2},
    "budget_policy": {"max_credits": 50}
  }'
Notes:
  • Managed training is GPU-only by default: trainer_spec.resources.gpu must be > 0.
  • If gpu is omitted or set to 0, the API returns HTTP 400.
  • Trainer fleet scaling is driven by active runnable tasks and includes startup recovery and ghost-running protection.
  • You can pass checkpoint_init_asset_id instead of a raw checkpoint URI.
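To drive the monitoring endpoints you need the new run's ID. A sketch of capturing it from the create-run response; the response shape shown here is an assumption, so check it against your API's actual schema:

```shell
# Normally this would come from the POST above, e.g.:
#   response=$(curl -s -X POST "$HARDSIM_API_URL/v1/training-runs" ...)
# A canned response is used here so the snippet is self-contained.
response='{"run_id":"run_abc123","status":"pending"}'

# Portable extraction with sed (use `jq -r .run_id` if jq is available).
run_id=$(printf '%s' "$response" | sed -n 's/.*"run_id":"\([^"]*\)".*/\1/p')
echo "$run_id"
```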

4) Monitor

Use:
  • GET /v1/training-runs/{run_id}
  • GET /v1/training-runs/{run_id}/iterations
  • GET /v1/training-runs/{run_id}/checkpoints
  • GET /v1/training-runs/{run_id}/events
  • GET /v1/training-runs/{run_id}/checkpoints/{checkpoint_id}/download-url (checkpoint download link)
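A status-polling sketch built on GET /v1/training-runs/{run_id}. The status field and its values ("completed", "failed", "cancelled") are assumptions, not confirmed API behavior; adjust to the real schema:

```shell
# A live check would look like:
#   response=$(curl -s -H "Authorization: Bearer $HARDSIM_API_KEY" \
#     "$HARDSIM_API_URL/v1/training-runs/$run_id")
# A canned response is parsed here so the snippet is self-contained.
response='{"run_id":"run_abc123","status":"completed","iterations_completed":3}'
status=$(printf '%s' "$response" | sed -n 's/.*"status":"\([^"]*\)".*/\1/p')

case "$status" in
  completed)        echo "run finished" ;;
  failed|cancelled) echo "run did not succeed: $status" ;;
  *)                echo "still in progress: $status" ;;
esac
```

Wrap the curl + case in a loop with a sleep to poll until the run leaves its in-progress state.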

Expected Success Contract

Each completed iteration must produce:
  • trainer output metrics.json
  • one checkpoint (checkpoint.pt or checkpoint_out_uri.txt)
  • one checkpoint lineage entry in /checkpoints
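The per-iteration contract above can be checked locally with a short sketch. The flat output-directory layout (metrics.json and the checkpoint file side by side) is an assumption; the artifact names come from the contract:

```shell
# Build a sample iteration output directory that satisfies the contract.
outdir=$(mktemp -d)
echo '{"loss": 0.1}' > "$outdir/metrics.json"
: > "$outdir/checkpoint.pt"

# Verify the two required artifacts: metrics.json plus exactly one
# checkpoint form (checkpoint.pt or checkpoint_out_uri.txt).
check_iteration() {
  dir=$1
  [ -f "$dir/metrics.json" ] || { echo "missing metrics.json"; return 1; }
  if [ -f "$dir/checkpoint.pt" ] || [ -f "$dir/checkpoint_out_uri.txt" ]; then
    echo "ok"
  else
    echo "missing checkpoint"
    return 1
  fi
}

check_iteration "$outdir"
```

The lineage entry in /checkpoints is recorded server-side, so it is verified via the GET endpoints in step 4 rather than on disk.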