# Command-line usage guide for `minimax.train`
Parsing command-line arguments is handled by [`Parsnip`](parsnip.md).
You can quickly generate batches of training commands from a JSON configuration file using [`minimax.config.make_cmd`](make_cmd.md).
## General arguments
| Argument | Description |
| ----------------------- | ---------------------------------------------------------------------------------------------------- |
| `seed` | Random seed, should be unique per experimental run |
| `agent_rl_algo` | Base RL algorithm used for training (e.g. PPO) |
| `n_total_updates` | Total number of updates for the training run |
| `train_runner` | Which training runner to use, e.g. `dr`, `plr`, or `paired` |
| `n_devices` | Number of devices over which to shard the environment batch dimension |
| `n_students` | Number of students in the autocurriculum |
| `n_parallel` | Number of parallel environments |
| `n_eval` | Number of parallel trials per environment (environment batch dimension is then `n_parallel*n_eval`) |
| `n_rollout_steps` | Number of steps per rollout (used for each update cycle) |
| `lr` | Learning rate |
| `lr_final` | Final learning rate, based on linear schedule. Defaults to `None`, corresponding to no schedule. |
| `lr_anneal_steps` | Number of steps over which to linearly anneal from `lr` to `lr_final` |
| `student_value_coef` | Value loss coefficient |
| `student_entropy_coef` | Entropy bonus coefficient |
| `student_unroll_update` | Unroll multi-gradient updates this many times (can lead to speed ups) |
| `max_grad_norm` | Clip gradients beyond this magnitude |
| `adam_eps` | Value of $`\epsilon`$ numerical stability constant for Adam |
| `discount` | Discount factor $`\gamma`$ for the student's RL optimization |
| `n_unroll_rollout` | Unroll rollout scans this many times (can lead to speed ups) |
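
As a rough sketch of how these arguments combine, a minimal domain-randomization run might look like the command below. The `python -m minimax.train` entry point and all values are illustrative placeholders; arguments left out are assumed to fall back to their defaults.

```bash
# Sketch: single student trained with PPO under domain randomization (dr).
# All values are placeholders; adjust to your setup.
python -m minimax.train \
  --seed=1 \
  --agent_rl_algo=ppo \
  --train_runner=dr \
  --n_total_updates=30000 \
  --n_devices=1 \
  --n_students=1 \
  --n_parallel=32 \
  --n_eval=1 \
  --n_rollout_steps=256 \
  --lr=3e-4 \
  --student_value_coef=0.5 \
  --student_entropy_coef=0.01 \
  --max_grad_norm=0.5 \
  --discount=0.995
```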
## Logging arguments
| Argument | Description |
| ------------------- | -------------------------------------------------------- |
| `verbose`           | Print training statistics to stdout if `True`             |
| `track_env_metrics` | Track per rollout batch environment metrics if `True` |
| `log_dir` | Path to directory storing all experiment folders |
| `xpid` | Unique name for experiment folder, stored in `--log_dir` |
| `log_interval` | Log training statistics every this many rollout cycles |
| `wandb_base_url` | Base API URL if logging with `wandb` |
| `wandb_api_key` | API key for `wandb` |
| `wandb_entity` | `wandb` entity associated with the experiment run |
| `wandb_project` | `wandb` project for the experiment run |
| `wandb_group` | `wandb` group for the experiment run |
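
For instance, local logging plus optional `wandb` tracking could be configured as in the sketch below. The experiment id, paths, and `wandb` values are placeholders, and boolean flags are assumed to accept `True`/`False` values.

```bash
# Sketch: write logs under --log_dir/--xpid and mirror them to wandb.
python -m minimax.train \
  --seed=1 \
  --train_runner=dr \
  --verbose=True \
  --track_env_metrics=True \
  --log_dir=/path/to/logs \
  --xpid=dr_maze_seed1 \
  --log_interval=10 \
  --wandb_entity=my_entity \
  --wandb_project=my_project \
  --wandb_group=dr_baseline
```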
## Checkpointing arguments
| Argument | Description |
| ---------------------- | ----------------------------------------------------------------------------- |
| `checkpoint_interval`  | Save the latest `checkpoint.pkl` every this many rollout cycles               |
| `from_last_checkpoint` | Begin training from latest `checkpoint.pkl`, if any, in the experiment folder |
| `archive_interval` | Save an additional checkpoint for models trained per this many rollout cycles |
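
A hedged example of how these flags might be combined (the intervals and paths are placeholders):

```bash
# Sketch: checkpoint every 25 rollout cycles, archive every 1000 cycles,
# and resume from the latest checkpoint.pkl in the experiment folder if one exists.
python -m minimax.train \
  --seed=1 \
  --train_runner=dr \
  --log_dir=/path/to/logs \
  --xpid=dr_maze_seed1 \
  --checkpoint_interval=25 \
  --archive_interval=1000 \
  --from_last_checkpoint=True
```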
## Evaluation arguments
| Argument | Description |
| ----------------- | -------------------------------------------------------------------- |
| `test_env_names`  | Comma-separated list of test environment names                       |
| `test_n_episodes` | Average test results over this many episodes per test environment |
| `test_agent_idxs` | Test agents at these indices (csv of indices or `*` for all indices) |
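
As a sketch, evaluation during training might be configured as follows. The environment names are placeholders; see the environment docs for valid names.

```bash
# Sketch: evaluate all agents on two test environments, averaging over 10 episodes each.
# Replace the <test_env_*> placeholders with valid environment names.
python -m minimax.train \
  --seed=1 \
  --train_runner=dr \
  --test_env_names="<test_env_1>,<test_env_2>" \
  --test_n_episodes=10 \
  --test_agent_idxs='*'
```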
## PPO arguments
These arguments activate when `--agent_rl_algo=ppo`.
| Argument | Description |
| ----------------------------- | ----------------------------------------------------------- |
| `student_ppo_n_epochs`        | Number of PPO epochs per update cycle                        |
| `student_ppo_n_minibatches` | Number of minibatches per PPO epoch |
| `student_ppo_clip_eps` | Clip coefficient for PPO |
| `student_ppo_clip_value_loss` | Perform value clipping if `True` |
| `gae_lambda` | Lambda discount factor for Generalized Advantage Estimation |
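
A sketch of typical PPO settings for the student (all values are illustrative only):

```bash
# Sketch: PPO student with 5 epochs of 4 minibatches per update cycle.
python -m minimax.train \
  --seed=1 \
  --agent_rl_algo=ppo \
  --train_runner=dr \
  --student_ppo_n_epochs=5 \
  --student_ppo_n_minibatches=4 \
  --student_ppo_clip_eps=0.2 \
  --student_ppo_clip_value_loss=True \
  --gae_lambda=0.95
```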
## PAIRED arguments
The arguments in this section activate when `--train_runner=paired`.
| Argument | Description |
| ------------------------- | --------------------------------------------------------------------- |
| `teacher_lr` | Learning rate |
| `teacher_lr_final` | Anneal learning rate to this value (defaults to `teacher_lr`) |
| `teacher_lr_anneal_steps` | Number of steps over which to linearly anneal from `teacher_lr` to `teacher_lr_final` |
| `teacher_discount` | Discount factor, $`\gamma`$ |
| `teacher_value_loss_coef` | Value loss coefficient |
| `teacher_entropy_coef` | Entropy bonus coefficient |
| `teacher_n_unroll_update` | Unroll multi-gradient updates this many times (can lead to speed ups) |
| `ued_score` | Name of UED objective, e.g. `relative_regret` |
These PPO-specific arguments for teacher optimization further activate when `--agent_rl_algo=ppo`.
| Argument | Description |
| ----------------------------- | ----------------------------------------------------------- |
| `teacher_ppo_n_epochs` | Number of PPO epochs per update cycle |
| `teacher_ppo_n_minibatches` | Number of minibatches per PPO epoch |
| `teacher_ppo_clip_eps` | Clip coefficient for PPO |
| `teacher_ppo_clip_value_loss` | Perform value clipping if `True` |
| `teacher_gae_lambda` | Lambda discount factor for Generalized Advantage Estimation |
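
Putting the two tables together, a PAIRED run with a PPO teacher might be sketched as below; all values are placeholders.

```bash
# Sketch: PAIRED trains a protagonist and an antagonist student, hence two students.
python -m minimax.train \
  --seed=1 \
  --agent_rl_algo=ppo \
  --train_runner=paired \
  --n_students=2 \
  --ued_score=relative_regret \
  --teacher_lr=3e-4 \
  --teacher_discount=0.995 \
  --teacher_value_loss_coef=0.5 \
  --teacher_entropy_coef=0.01 \
  --teacher_ppo_n_epochs=5 \
  --teacher_ppo_n_minibatches=1 \
  --teacher_ppo_clip_eps=0.2 \
  --teacher_gae_lambda=0.98
```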
## PLR arguments
The arguments in this section activate when `--train_runner=plr`.
| Argument | Description |
| ----------------------------- | ------------------------------------------------------------------------------------------------------------- |
| `ued_score` | Name of UED objective (aka PLR scoring function) |
| `plr_replay_prob` | Replay probability |
| `plr_buffer_size` | Size of level replay buffer |
| `plr_staleness_coef` | Staleness coefficient |
| `plr_temp` | Score distribution temperature |
| `plr_use_score_ranks` | Use rank-based prioritization (rather than proportional) |
| `plr_min_fill_ratio` | Only replay once level replay buffer is filled above this ratio |
| `plr_use_robust_plr` | Use robust PLR (i.e. only update policy on replay levels) |
| `plr_force_unique` | Force level replay buffer members to be unique |
| `plr_use_parallel_eval` | Use Parallel PLR or Parallel ACCEL (if `plr_mutation_fn` is set) |
| `plr_mutation_fn` | If set, PLR becomes ACCEL. Use `'default'` for default mutation operator per environment. |
| `plr_n_mutations` | Number of applications of `plr_mutation_fn` per mutation cycle. |
| `plr_mutation_criterion` | How replay levels are selected for mutation (e.g. `batch`, `easy`, `hard`). |
| `plr_mutation_subsample_size` | Number of replay levels selected for mutation according to the criterion (ignored if using `batch` criterion) |
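
As a sketch, a robust PLR run and its ACCEL variant might look like the following. All values are placeholders, and the UED scoring function is omitted here; see `ued_score` above.

```bash
# Sketch: robust PLR with a 4000-level buffer and rank-based prioritization.
python -m minimax.train \
  --seed=1 \
  --train_runner=plr \
  --plr_replay_prob=0.5 \
  --plr_buffer_size=4000 \
  --plr_staleness_coef=0.3 \
  --plr_temp=0.1 \
  --plr_use_score_ranks=True \
  --plr_min_fill_ratio=0.5 \
  --plr_use_robust_plr=True \
  --plr_force_unique=True

# Setting a mutation function turns PLR into ACCEL.
python -m minimax.train \
  --seed=1 \
  --train_runner=plr \
  --plr_use_robust_plr=True \
  --plr_mutation_fn=default \
  --plr_n_mutations=10 \
  --plr_mutation_criterion=batch
```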
## Environment-specific arguments
### Maze
See the [`AMaze`](envs/maze.md) docs for details on how to specify [training](envs/maze.md#student-environment), [evaluation](envs/maze.md#student-environment), and [teacher-specific](envs/maze.md#teacher-environment) environment parameters via the command line.