# Command-line usage guide for `minimax.train`

Parsing command-line arguments is handled by Parsnip.

You can quickly generate batches of training commands from a JSON configuration file using `minimax.config.make_cmd`.
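For example, assuming a configuration file at `config/my_experiment.json` and a `--config` flag (both illustrative; check the minimax docs for the exact interface):

```bash
# Hypothetical sketch: expand a JSON config into a batch of
# training commands, saved to a shell script.
python -m minimax.config.make_cmd --config config/my_experiment.json > train_cmds.sh
```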

## General arguments

| Argument | Description |
| - | - |
| `seed` | Random seed; should be unique per experimental run |
| `agent_rl_algo` | Base RL algorithm used for training (e.g. `ppo`) |
| `n_total_updates` | Total number of updates for the training run |
| `train_runner` | Which training runner to use, e.g. `dr`, `plr`, or `paired` |
| `n_devices` | Number of devices over which to shard the environment batch dimension |
| `n_students` | Number of students in the autocurriculum |
| `n_parallel` | Number of parallel environments |
| `n_eval` | Number of parallel trials per environment (the environment batch dimension is then `n_parallel*n_eval`) |
| `n_rollout_steps` | Number of steps per rollout (used for each update cycle) |
| `lr` | Learning rate |
| `lr_final` | Final learning rate, based on a linear schedule. Defaults to `None`, corresponding to no schedule. |
| `lr_anneal_steps` | Number of steps over which to linearly anneal from `lr` to `lr_final` |
| `student_value_coef` | Value loss coefficient |
| `student_entropy_coef` | Entropy bonus coefficient |
| `student_unroll_update` | Unroll multi-gradient updates this many times (can lead to speed-ups) |
| `max_grad_norm` | Clip gradients beyond this magnitude |
| `adam_eps` | Value of the `\epsilon` numerical stability constant for Adam |
| `discount` | Discount factor `\gamma` for the student's RL optimization |
| `n_unroll_rollout` | Unroll rollout scans this many times (can lead to speed-ups) |
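As an illustration, a minimal domain-randomization run using these arguments might look like the following sketch. All values are placeholders, not recommended settings:

```bash
# Illustrative invocation; hyperparameter values are placeholders.
python -m minimax.train \
    --seed=1 \
    --agent_rl_algo=ppo \
    --train_runner=dr \
    --n_total_updates=30000 \
    --n_devices=1 \
    --n_students=1 \
    --n_parallel=32 \
    --n_eval=1 \
    --n_rollout_steps=256 \
    --lr=3e-4 \
    --max_grad_norm=0.5 \
    --adam_eps=1e-5 \
    --discount=0.995
```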

## Logging arguments

| Argument | Description |
| - | - |
| `verbose` | Print training logs to stdout if True |
| `track_env_metrics` | Track per-rollout-batch environment metrics if True |
| `log_dir` | Path to the directory storing all experiment folders |
| `xpid` | Unique name for the experiment folder, stored inside `--log_dir` |
| `log_interval` | Log training statistics every this many rollout cycles |
| `wandb_base_url` | Base API URL if logging with wandb |
| `wandb_api_key` | API key for wandb |
| `wandb_entity` | wandb entity associated with the experiment run |
| `wandb_project` | wandb project for the experiment run |
| `wandb_group` | wandb group for the experiment run |
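Appended to a training command, the logging arguments might look like the sketch below. The wandb values are placeholders (the wandb flags can be omitted when logging locally), and booleans are written as `--flag=True`, which assumes Parsnip accepts explicit boolean values:

```bash
# Illustrative logging flags; wandb values are placeholders.
python -m minimax.train \
    --seed=1 \
    --train_runner=dr \
    --verbose=True \
    --track_env_metrics=True \
    --log_dir=logs \
    --xpid=dr-maze-seed1 \
    --log_interval=10 \
    --wandb_entity=my-entity \
    --wandb_project=my-project \
    --wandb_group=dr-baselines
```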

## Checkpointing arguments

| Argument | Description |
| - | - |
| `checkpoint_interval` | Save a checkpoint every this many rollout cycles |
| `from_last_checkpoint` | Begin training from the latest `checkpoint.pkl`, if any, in the experiment folder |
| `archive_interval` | Save an additional archived checkpoint every this many rollout cycles |
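For example, to write a rolling checkpoint every 10 rollout cycles, archive a separate snapshot every 100, and resume automatically after a preemption (values illustrative):

```bash
# Illustrative checkpointing flags for a preemptible job.
python -m minimax.train \
    --log_dir=logs \
    --xpid=dr-maze-seed1 \
    --checkpoint_interval=10 \
    --archive_interval=100 \
    --from_last_checkpoint=True
```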

## Evaluation arguments

| Argument | Description |
| - | - |
| `test_env_names` | Comma-separated list of test environment names |
| `test_n_episodes` | Average test results over this many episodes per test environment |
| `test_agent_idxs` | Test agents at these indices (csv of indices, or `*` for all indices) |
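For instance, the sketch below evaluates all students on two test environments, averaging over 10 episodes each. The environment names are placeholders, and `*` is quoted to keep the shell from glob-expanding it:

```bash
# Illustrative evaluation flags; environment names are placeholders.
python -m minimax.train \
    --test_env_names=Maze-SixteenRooms,Maze-Labyrinth \
    --test_n_episodes=10 \
    --test_agent_idxs='*'
```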

## PPO arguments

These arguments activate when `--agent_rl_algo=ppo`.

| Argument | Description |
| - | - |
| `student_ppo_n_epochs` | Number of PPO epochs per update cycle |
| `student_ppo_n_minibatches` | Number of minibatches per PPO epoch |
| `student_ppo_clip_eps` | Clip coefficient for PPO |
| `student_ppo_clip_value_loss` | Perform value clipping if True |
| `gae_lambda` | Lambda discount factor for Generalized Advantage Estimation |
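A PPO-specific block of flags for the student might then look like this (values illustrative):

```bash
# Illustrative student PPO hyperparameters.
python -m minimax.train \
    --agent_rl_algo=ppo \
    --student_ppo_n_epochs=5 \
    --student_ppo_n_minibatches=1 \
    --student_ppo_clip_eps=0.2 \
    --student_ppo_clip_value_loss=True \
    --gae_lambda=0.98
```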

## PAIRED arguments

The arguments in this section activate when `--train_runner=paired`.

| Argument | Description |
| - | - |
| `teacher_lr` | Learning rate for the teacher |
| `teacher_lr_final` | Anneal the teacher's learning rate to this value (defaults to `teacher_lr`) |
| `teacher_lr_anneal_steps` | Number of steps over which to linearly anneal from `teacher_lr` to `teacher_lr_final` |
| `teacher_discount` | Discount factor `\gamma` for the teacher |
| `teacher_value_loss_coef` | Value loss coefficient |
| `teacher_entropy_coef` | Entropy bonus coefficient |
| `teacher_n_unroll_update` | Unroll multi-gradient updates this many times (can lead to speed-ups) |
| `ued_score` | Name of the UED objective, e.g. `relative_regret` |

These PPO-specific arguments for teacher optimization further activate when `--agent_rl_algo=ppo`.

| Argument | Description |
| - | - |
| `teacher_ppo_n_epochs` | Number of PPO epochs per update cycle |
| `teacher_ppo_n_minibatches` | Number of minibatches per PPO epoch |
| `teacher_ppo_clip_eps` | Clip coefficient for PPO |
| `teacher_ppo_clip_value_loss` | Perform value clipping if True |
| `teacher_gae_lambda` | Lambda discount factor for Generalized Advantage Estimation |
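Putting the two tables together, a PAIRED run with a PPO-optimized teacher might include the flags below. All values are illustrative, and `--n_students=2` (one protagonist, one antagonist) is an assumption about the PAIRED setup rather than something this page specifies:

```bash
# Illustrative PAIRED configuration with a PPO teacher.
# Assumes two students: a protagonist and an antagonist.
python -m minimax.train \
    --train_runner=paired \
    --agent_rl_algo=ppo \
    --n_students=2 \
    --ued_score=relative_regret \
    --teacher_lr=3e-4 \
    --teacher_discount=0.995 \
    --teacher_value_loss_coef=0.5 \
    --teacher_entropy_coef=0.05 \
    --teacher_ppo_n_epochs=5 \
    --teacher_ppo_n_minibatches=1 \
    --teacher_ppo_clip_eps=0.2 \
    --teacher_gae_lambda=0.98
```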

## PLR arguments

The arguments in this section activate when `--train_runner=plr`.

| Argument | Description |
| - | - |
| `ued_score` | Name of the UED objective (aka the PLR scoring function) |
| `plr_replay_prob` | Replay probability |
| `plr_buffer_size` | Size of the level replay buffer |
| `plr_staleness_coef` | Staleness coefficient |
| `plr_temp` | Score distribution temperature |
| `plr_use_score_ranks` | Use rank-based (rather than proportional) prioritization |
| `plr_min_fill_ratio` | Only replay once the level replay buffer is filled above this ratio |
| `plr_use_robust_plr` | Use robust PLR (i.e. only update the policy on replay levels) |
| `plr_force_unique` | Force level replay buffer members to be unique |
| `plr_use_parallel_eval` | Use Parallel PLR, or Parallel ACCEL if `plr_mutation_fn` is set |
| `plr_mutation_fn` | If set, PLR becomes ACCEL. Use `default` for the default mutation operator of each environment. |
| `plr_n_mutations` | Number of applications of `plr_mutation_fn` per mutation cycle |
| `plr_mutation_criterion` | How replay levels are selected for mutation (e.g. `batch`, `easy`, `hard`) |
| `plr_mutation_subsample_size` | Number of replay levels selected for mutation according to the criterion (ignored when using the `batch` criterion) |
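As a sketch, the following is a robust PLR run that becomes ACCEL because `plr_mutation_fn` is set. All values are illustrative, and the `value_l1` scoring-function name is an assumption; substitute a `ued_score` supported by your setup:

```bash
# Illustrative robust-PLR configuration; setting --plr_mutation_fn
# switches the runner from PLR to ACCEL.
python -m minimax.train \
    --train_runner=plr \
    --ued_score=value_l1 \
    --plr_replay_prob=0.8 \
    --plr_buffer_size=4000 \
    --plr_staleness_coef=0.3 \
    --plr_temp=0.1 \
    --plr_use_score_ranks=True \
    --plr_min_fill_ratio=0.5 \
    --plr_use_robust_plr=True \
    --plr_force_unique=True \
    --plr_mutation_fn=default \
    --plr_n_mutations=20 \
    --plr_mutation_criterion=batch
```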

## Environment-specific arguments

### Maze

See the AMaze docs for details on how to specify training, evaluation, and teacher-specific environment parameters via the command line.