# Command-line usage guide for `minimax.train`

Parsing command-line arguments is handled by [`Parsnip`](parsnip.md). You can quickly generate batches of training commands from a JSON configuration file using [`minimax.config.make_cmd`](make_cmd.md).

## General arguments

| Argument | Description |
| --- | --- |
| `seed` | Random seed, should be unique per experimental run |
| `agent_rl_algo` | Base RL algorithm used for training (e.g. PPO) |
| `n_total_updates` | Total number of updates for the training run |
| `train_runner` | Which training runner to use, e.g. `dr`, `plr`, or `paired` |
| `n_devices` | Number of devices over which to shard the environment batch dimension |
| `n_students` | Number of students in the autocurriculum |
| `n_parallel` | Number of parallel environments |
| `n_eval` | Number of parallel trials per environment (environment batch dimension is then `n_parallel*n_eval`) |
| `n_rollout_steps` | Number of steps per rollout (used for each update cycle) |
| `lr` | Learning rate |
| `lr_final` | Final learning rate, based on a linear schedule. Defaults to `None`, corresponding to no schedule. |
| `lr_anneal_steps` | Number of steps over which to linearly anneal from `lr` to `lr_final` |
| `student_value_coef` | Value loss coefficient |
| `student_entropy_coef` | Entropy bonus coefficient |
| `student_unroll_update` | Unroll multi-gradient updates this many times (can lead to speed ups) |
| `max_grad_norm` | Clip gradients beyond this magnitude |
| `adam_eps` | Value of the $`\epsilon`$ numerical stability constant for Adam |
| `discount` | Discount factor $`\gamma`$ for the student's RL optimization |
| `n_unroll_rollout` | Unroll rollout scans this many times (can lead to speed ups) |

## Logging arguments

| Argument | Description |
| --- | --- |
| `verbose` | Enable verbose logging output if `True` |
| `track_env_metrics` | Track per-rollout-batch environment metrics if `True` |
| `log_dir` | Path to the directory storing all experiment folders |
| `xpid` | Unique name for the experiment folder, stored in `--log_dir` |
| `log_interval` | Log training statistics every this many rollout cycles |
| `wandb_base_url` | Base API URL if logging with `wandb` |
| `wandb_api_key` | API key for `wandb` |
| `wandb_entity` | `wandb` entity associated with the experiment run |
| `wandb_project` | `wandb` project for the experiment run |
| `wandb_group` | `wandb` group for the experiment run |

## Checkpointing arguments

| Argument | Description |
| --- | --- |
| `checkpoint_interval` | Save a checkpoint every this many rollout cycles |
| `from_last_checkpoint` | Begin training from the latest `checkpoint.pkl`, if any, in the experiment folder |
| `archive_interval` | Save an additional, separately archived checkpoint every this many rollout cycles |

## Evaluation arguments

| Argument | Description |
| --- | --- |
| `test_env_names` | Names of test environments to evaluate on (csv of environment names) |
| `test_n_episodes` | Average test results over this many episodes per test environment |
| `test_agent_idxs` | Test agents at these indices (csv of indices or `*` for all indices) |
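For illustration, the sketch below shows how a basic domain-randomization (`dr`) run might be launched using the arguments above. It assumes the trainer is invoked as `python -m minimax.train`; the flag values are arbitrary examples, and environment-specific arguments (see the Maze section below) are omitted.

```bash
# Sketch: a basic DR training run (flag values are illustrative only).
# Environment-specific arguments (e.g. for Maze) must be appended as well.
python -m minimax.train \
  --seed=1 \
  --agent_rl_algo=ppo \
  --train_runner=dr \
  --n_total_updates=30000 \
  --n_devices=1 \
  --n_students=1 \
  --n_parallel=32 \
  --n_eval=1 \
  --n_rollout_steps=256 \
  --lr=3e-4 \
  --discount=0.995 \
  --student_value_coef=0.5 \
  --student_entropy_coef=1e-3 \
  --max_grad_norm=0.5 \
  --adam_eps=1e-5 \
  --log_dir=/tmp/minimax_logs \
  --xpid=dr_example \
  --log_interval=10 \
  --checkpoint_interval=100 \
  --test_n_episodes=10
```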
## PPO arguments

These arguments activate when `--agent_rl_algo=ppo`.

| Argument | Description |
| --- | --- |
| `student_ppo_n_epochs` | Number of PPO epochs per update cycle |
| `student_ppo_n_minibatches` | Number of minibatches per PPO epoch |
| `student_ppo_clip_eps` | Clip coefficient for PPO |
| `student_ppo_clip_value_loss` | Perform value clipping if `True` |
| `gae_lambda` | Lambda discount factor for Generalized Advantage Estimation |

## PAIRED arguments

The arguments in this section activate when `--train_runner=paired`.

| Argument | Description |
| --- | --- |
| `teacher_lr` | Teacher learning rate |
| `teacher_lr_final` | Anneal the teacher learning rate to this value (defaults to `teacher_lr`) |
| `teacher_lr_anneal_steps` | Number of steps over which to linearly anneal from `teacher_lr` to `teacher_lr_final` |
| `teacher_discount` | Discount factor $`\gamma`$ for the teacher's RL optimization |
| `teacher_value_loss_coef` | Value loss coefficient |
| `teacher_entropy_coef` | Entropy bonus coefficient |
| `teacher_n_unroll_update` | Unroll multi-gradient updates this many times (can lead to speed ups) |
| `ued_score` | Name of UED objective, e.g. `relative_regret` |

These PPO-specific arguments for teacher optimization further activate when `--agent_rl_algo=ppo`.

| Argument | Description |
| --- | --- |
| `teacher_ppo_n_epochs` | Number of PPO epochs per update cycle |
| `teacher_ppo_n_minibatches` | Number of minibatches per PPO epoch |
| `teacher_ppo_clip_eps` | Clip coefficient for PPO |
| `teacher_ppo_clip_value_loss` | Perform value clipping if `True` |
| `teacher_gae_lambda` | Lambda discount factor for Generalized Advantage Estimation |
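As a further sketch, a PAIRED run layers the teacher arguments above on top of the student arguments. Because PAIRED pits a protagonist and an antagonist student against the teacher, `--n_students=2` is used here. The entry-point invocation and all values are assumptions for illustration, most general arguments from the first example are omitted for brevity, and environment-specific arguments are again left out.

```bash
# Sketch: a PAIRED run with PPO students and a PPO teacher (values are illustrative).
python -m minimax.train \
  --seed=1 \
  --agent_rl_algo=ppo \
  --train_runner=paired \
  --n_students=2 \
  --ued_score=relative_regret \
  --lr=3e-4 \
  --student_ppo_n_epochs=5 \
  --student_ppo_n_minibatches=1 \
  --student_ppo_clip_eps=0.2 \
  --gae_lambda=0.98 \
  --teacher_lr=3e-4 \
  --teacher_discount=0.995 \
  --teacher_value_loss_coef=0.5 \
  --teacher_entropy_coef=1e-3 \
  --teacher_ppo_n_epochs=5 \
  --teacher_ppo_n_minibatches=1 \
  --teacher_ppo_clip_eps=0.2 \
  --teacher_gae_lambda=0.98 \
  --log_dir=/tmp/minimax_logs \
  --xpid=paired_example
```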
## PLR arguments

The arguments in this section activate when `--train_runner=plr`.

| Argument | Description |
| --- | --- |
| `ued_score` | Name of UED objective (aka the PLR scoring function) |
| `plr_replay_prob` | Replay probability |
| `plr_buffer_size` | Size of the level replay buffer |
| `plr_staleness_coef` | Staleness coefficient |
| `plr_temp` | Score distribution temperature |
| `plr_use_score_ranks` | Use rank-based prioritization (rather than proportional) |
| `plr_min_fill_ratio` | Only replay once the level replay buffer is filled above this ratio |
| `plr_use_robust_plr` | Use robust PLR (i.e. only update the policy on replay levels) |
| `plr_force_unique` | Force level replay buffer members to be unique |
| `plr_use_parallel_eval` | Use Parallel PLR or Parallel ACCEL (if `plr_mutation_fn` is set) |
| `plr_mutation_fn` | If set, PLR becomes ACCEL. Use `'default'` for the default mutation operator per environment |
| `plr_n_mutations` | Number of applications of `plr_mutation_fn` per mutation cycle |
| `plr_mutation_criterion` | How replay levels are selected for mutation (e.g. `batch`, `easy`, `hard`) |
| `plr_mutation_subsample_size` | Number of replay levels selected for mutation according to the criterion (ignored if using the `batch` criterion) |

## Environment-specific arguments

### Maze

See the [`AMaze`](envs/maze.md) docs for details on how to specify [training](envs/maze.md#student-environment), [evaluation](envs/maze.md#student-environment), and [teacher-specific](envs/maze.md#teacher-environment) environment parameters via the command line.
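Finally, as noted at the top of this guide, batches of training commands can be generated from a JSON configuration file via `minimax.config.make_cmd`. The invocation below is only a sketch: the `--config` flag name and the config path are assumptions, so consult the [`minimax.config.make_cmd`](make_cmd.md) docs for the exact interface.

```bash
# Sketch: generate a batch of `minimax.train` commands from a JSON config.
# The --config flag name and the path shown are assumptions; see the make_cmd docs.
python -m minimax.config.make_cmd --config path/to/experiment.json > train_cmds.txt

# Each generated line is then a full training command that can be launched
# directly, e.g. via xargs or a job scheduler.
```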