# Command-line usage guide for `minimax.train`

Parsing command-line arguments is handled by Parsnip.

You can quickly generate batches of training commands from a JSON configuration file using `minimax.config.make_cmd`.
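For example, assuming a configuration file at `config/my_experiment.json` and a `--config` flag (both illustrative; check the minimax docs for the exact interface):

```bash
# Hypothetical sketch: expand a JSON config into a batch of
# training commands, saved to a shell script.
python -m minimax.config.make_cmd --config config/my_experiment.json > train_cmds.sh
```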

## General arguments

| Argument | Description |
| - | - |
| `seed` | Random seed; should be unique per experimental run |
| `agent_rl_algo` | Base RL algorithm used for training (e.g. `ppo`) |
| `n_total_updates` | Total number of updates for the training run |
| `train_runner` | Which training runner to use, e.g. `dr`, `plr`, or `paired` |
| `n_devices` | Number of devices over which to shard the environment batch dimension |
| `n_students` | Number of students in the autocurriculum |
| `n_parallel` | Number of parallel environments |
| `n_eval` | Number of parallel trials per environment (the environment batch dimension is then `n_parallel*n_eval`) |
| `n_rollout_steps` | Number of steps per rollout (used for each update cycle) |
| `lr` | Learning rate |
| `lr_final` | Final learning rate, based on a linear schedule. Defaults to `None`, corresponding to no schedule. |
| `lr_anneal_steps` | Number of steps over which to linearly anneal from `lr` to `lr_final` |
| `student_value_coef` | Value loss coefficient |
| `student_entropy_coef` | Entropy bonus coefficient |
| `student_unroll_update` | Unroll multi-gradient updates this many times (can lead to speed-ups) |
| `max_grad_norm` | Clip gradients beyond this magnitude |
| `adam_eps` | Value of the `\epsilon` numerical stability constant for Adam |
| `discount` | Discount factor `\gamma` for the student's RL optimization |
| `n_unroll_rollout` | Unroll rollout scans this many times (can lead to speed-ups) |
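As an illustration, a minimal domain-randomization run using these arguments might look like the following sketch. All values are placeholders, not recommended settings:

```bash
# Illustrative invocation; hyperparameter values are placeholders.
python -m minimax.train \
    --seed=1 \
    --agent_rl_algo=ppo \
    --train_runner=dr \
    --n_total_updates=30000 \
    --n_devices=1 \
    --n_students=1 \
    --n_parallel=32 \
    --n_eval=1 \
    --n_rollout_steps=256 \
    --lr=3e-4 \
    --max_grad_norm=0.5 \
    --adam_eps=1e-5 \
    --discount=0.995
```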

## Logging arguments

| Argument | Description |
| - | - |
| `verbose` | Print training logs to stdout if True |
| `track_env_metrics` | Track per-rollout-batch environment metrics if True |
| `log_dir` | Path to the directory storing all experiment folders |
| `xpid` | Unique name for the experiment folder, stored inside `--log_dir` |
| `log_interval` | Log training statistics every this many rollout cycles |
| `wandb_base_url` | Base API URL if logging with wandb |
| `wandb_api_key` | API key for wandb |
| `wandb_entity` | wandb entity associated with the experiment run |
| `wandb_project` | wandb project for the experiment run |
| `wandb_group` | wandb group for the experiment run |
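Appended to a training command, the logging arguments might look like the sketch below. The wandb values are placeholders (the wandb flags can be omitted when logging locally), and booleans are written as `--flag=True`, which assumes Parsnip accepts explicit boolean values:

```bash
# Illustrative logging flags; wandb values are placeholders.
python -m minimax.train \
    --seed=1 \
    --train_runner=dr \
    --verbose=True \
    --track_env_metrics=True \
    --log_dir=logs \
    --xpid=dr-maze-seed1 \
    --log_interval=10 \
    --wandb_entity=my-entity \
    --wandb_project=my-project \
    --wandb_group=dr-baselines
```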

## Checkpointing arguments

| Argument | Description |
| - | - |
| `checkpoint_interval` | Save a checkpoint every this many rollout cycles |
| `from_last_checkpoint` | Begin training from the latest `checkpoint.pkl`, if any, in the experiment folder |
| `archive_interval` | Save an additional archived checkpoint every this many rollout cycles |
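For example, to write a rolling checkpoint every 10 rollout cycles, archive a separate snapshot every 100, and resume automatically after a preemption (values illustrative):

```bash
# Illustrative checkpointing flags for a preemptible job.
python -m minimax.train \
    --log_dir=logs \
    --xpid=dr-maze-seed1 \
    --checkpoint_interval=10 \
    --archive_interval=100 \
    --from_last_checkpoint=True
```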

## Evaluation arguments

| Argument | Description |
| - | - |
| `test_env_names` | Comma-separated list of test environment names |
| `test_n_episodes` | Average test results over this many episodes per test environment |
| `test_agent_idxs` | Test agents at these indices (csv of indices, or `*` for all indices) |
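For instance, the sketch below evaluates all students on two test environments, averaging over 10 episodes each. The environment names are placeholders, and `*` is quoted to keep the shell from glob-expanding it:

```bash
# Illustrative evaluation flags; environment names are placeholders.
python -m minimax.train \
    --test_env_names=Maze-SixteenRooms,Maze-Labyrinth \
    --test_n_episodes=10 \
    --test_agent_idxs='*'
```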

## PPO arguments

These arguments activate when `--agent_rl_algo=ppo`.

| Argument | Description |
| - | - |
| `student_ppo_n_epochs` | Number of PPO epochs per update cycle |
| `student_ppo_n_minibatches` | Number of minibatches per PPO epoch |
| `student_ppo_clip_eps` | Clip coefficient for PPO |
| `student_ppo_clip_value_loss` | Perform value clipping if True |
| `gae_lambda` | Lambda discount factor for Generalized Advantage Estimation |
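A PPO-specific block of flags for the student might then look like this (values illustrative):

```bash
# Illustrative student PPO hyperparameters.
python -m minimax.train \
    --agent_rl_algo=ppo \
    --student_ppo_n_epochs=5 \
    --student_ppo_n_minibatches=1 \
    --student_ppo_clip_eps=0.2 \
    --student_ppo_clip_value_loss=True \
    --gae_lambda=0.98
```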

## PAIRED arguments

The arguments in this section activate when `--train_runner=paired`.

| Argument | Description |
| - | - |
| `teacher_lr` | Learning rate for the teacher |
| `teacher_lr_final` | Anneal the teacher's learning rate to this value (defaults to `teacher_lr`) |
| `teacher_lr_anneal_steps` | Number of steps over which to linearly anneal from `teacher_lr` to `teacher_lr_final` |
| `teacher_discount` | Discount factor `\gamma` for the teacher |
| `teacher_value_loss_coef` | Value loss coefficient |
| `teacher_entropy_coef` | Entropy bonus coefficient |
| `teacher_n_unroll_update` | Unroll multi-gradient updates this many times (can lead to speed-ups) |
| `ued_score` | Name of the UED objective, e.g. `relative_regret` |

These PPO-specific arguments for teacher optimization further activate when `--agent_rl_algo=ppo`.

| Argument | Description |
| - | - |
| `teacher_ppo_n_epochs` | Number of PPO epochs per update cycle |
| `teacher_ppo_n_minibatches` | Number of minibatches per PPO epoch |
| `teacher_ppo_clip_eps` | Clip coefficient for PPO |
| `teacher_ppo_clip_value_loss` | Perform value clipping if True |
| `teacher_gae_lambda` | Lambda discount factor for Generalized Advantage Estimation |
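Putting the two tables together, a PAIRED run with a PPO-optimized teacher might include the flags below. All values are illustrative, and `--n_students=2` (one protagonist, one antagonist) is an assumption about the PAIRED setup rather than something this page specifies:

```bash
# Illustrative PAIRED configuration with a PPO teacher.
# Assumes two students: a protagonist and an antagonist.
python -m minimax.train \
    --train_runner=paired \
    --agent_rl_algo=ppo \
    --n_students=2 \
    --ued_score=relative_regret \
    --teacher_lr=3e-4 \
    --teacher_discount=0.995 \
    --teacher_value_loss_coef=0.5 \
    --teacher_entropy_coef=0.05 \
    --teacher_ppo_n_epochs=5 \
    --teacher_ppo_n_minibatches=1 \
    --teacher_ppo_clip_eps=0.2 \
    --teacher_gae_lambda=0.98
```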

## PLR arguments

The arguments in this section activate when `--train_runner=plr`.

| Argument | Description |
| - | - |
| `ued_score` | Name of the UED objective (aka the PLR scoring function) |
| `plr_replay_prob` | Replay probability |
| `plr_buffer_size` | Size of the level replay buffer |
| `plr_staleness_coef` | Staleness coefficient |
| `plr_temp` | Score distribution temperature |
| `plr_use_score_ranks` | Use rank-based (rather than proportional) prioritization |
| `plr_min_fill_ratio` | Only replay once the level replay buffer is filled above this ratio |
| `plr_use_robust_plr` | Use robust PLR (i.e. only update the policy on replay levels) |
| `plr_force_unique` | Force level replay buffer members to be unique |
| `plr_use_parallel_eval` | Use Parallel PLR, or Parallel ACCEL if `plr_mutation_fn` is set |
| `plr_mutation_fn` | If set, PLR becomes ACCEL. Use `default` for the default mutation operator of each environment. |
| `plr_n_mutations` | Number of applications of `plr_mutation_fn` per mutation cycle |
| `plr_mutation_criterion` | How replay levels are selected for mutation (e.g. `batch`, `easy`, `hard`) |
| `plr_mutation_subsample_size` | Number of replay levels selected for mutation according to the criterion (ignored when using the `batch` criterion) |
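As a sketch, the following is a robust PLR run that becomes ACCEL because `plr_mutation_fn` is set. All values are illustrative, and the `value_l1` scoring-function name is an assumption; substitute a `ued_score` supported by your setup:

```bash
# Illustrative robust-PLR configuration; setting --plr_mutation_fn
# switches the runner from PLR to ACCEL.
python -m minimax.train \
    --train_runner=plr \
    --ued_score=value_l1 \
    --plr_replay_prob=0.8 \
    --plr_buffer_size=4000 \
    --plr_staleness_coef=0.3 \
    --plr_temp=0.1 \
    --plr_use_score_ranks=True \
    --plr_min_fill_ratio=0.5 \
    --plr_use_robust_plr=True \
    --plr_force_unique=True \
    --plr_mutation_fn=default \
    --plr_n_mutations=20 \
    --plr_mutation_criterion=batch
```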

## Environment-specific arguments

### Maze

See the AMaze docs for details on how to specify training, evaluation, and teacher-specific environment parameters via the command line.