Constantin Ruhdorfer 2024-06-25 16:22:33 +02:00
commit a291702af9
216 changed files with 39249 additions and 0 deletions

docs/envs/maze.md (new file, +126 lines)
# `AMaze`
## 🧭 Partially-observable navigation in procedural mazes.
![Maze Overview](../images/env_maze_overview.png)
The `AMaze` environment reproduces the MiniGrid-based, partially-observable maze navigation environments featured in prior work. Specifically, `AMaze` provides feature parity with the previous reference implementation of the maze environment in [facebookresearch/dcd](https://github.com/facebookresearch/dcd).
## Student environment
View source: [`envs/maze/maze.py`](../../src/minimax/envs/maze/maze.py)
### Static EnvParams
The table below summarizes the configurable static environment parameters of `AMaze`. The parameters that can be provided via `minimax.train` by default are denoted in the table below. Their corresponding command-line argument is the name of the parameter, preceded by the prefix `maze`, e.g. `maze_n_walls` for specifying `n_walls`.
Similarly, evaluation parameters can be specified via the prefix `maze_eval`, e.g. `maze_eval_see_agent` for specifying `see_agent`. Currently, `minimax.train` only accepts `maze_eval_see_agent` and `maze_eval_normalize_obs`.
Note that `AMaze` treats `height` and `width` as parameterizing only the portion of the maze grid that can vary, and thus excludes the 1-tile wall border surrounding each maze instance. Thus, a 15x15 maze in the prior `MiniGrid`-based implementation corresponds to an `AMaze` parameterization with `height=13` and `width=13`.
| Parameter | Description| Command-line support |
| - | - | - |
| `height` | Height of maze | ✅ |
| `width` | Width of maze | ✅ |
| `n_walls` | Number of walls to place per maze | ✅ |
| `agent_view_size` | Size of forward-facing partial observation seen by agent | ✅ |
| `replace_wall_pos` | Wall positions are sampled with replacement if `True` | ✅ |
| `see_agent` | Agent sees itself in its partial observation if `True` | ✅ |
| `normalize_obs`| Scale observation values to [0,1] if `True`| ✅ |
| `sample_n_walls` | Sample # walls placed between [0, `n_walls`] if `True` | ✅ |
| `obs_agent_pos` | Include `agent_pos` in the partial observation | ✅ |
| `max_episode_steps` | Maximum # steps per episode | ✅ |
| `singleton_seed` | Fix the random seed to this value, making the environment a singleton | |
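For example, a domain-randomization run (`--train_runner dr`) overriding a few of these defaults could be launched as follows. This is a sketch: the values are illustrative, and the remaining required training arguments (see [train_args.md](../train_args.md)) are omitted.

```bash
# Only maze-specific flags shown; see train_args.md for the rest.
python -m minimax.train \
    --train_runner dr \
    --maze_height 13 \
    --maze_width 13 \
    --maze_n_walls 60 \
    --maze_eval_see_agent False
```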
### State space
| Variable | Description|
| - | - |
| `agent_pos` | Agent's (x,y) position |
| `agent_dir` | Agent's orientation vector |
| `agent_dir_idx` | Agent's orientation enum |
| `goal_pos` | Goal (x,y) position |
| `wall_map` | H x W bool tensor, `True` in wall positions |
| `maze_map` | Full maze map with all objects for rendering |
| `time` | Time step |
| `terminal` | `True` iff episode is done |
### Observation space
| Variable | Description|
| - | - |
| `image`| Partial observation seen by agent |
| `agent_dir` | Agent's orientation enum |
| `agent_pos` | Agent's (x,y) position (not included by default) |
### Action space
| Action index | Description|
| - | - |
| `0` | Left |
| `1` | Right |
| `2` | Forward |
| `3` | Pick up |
| `4` | Drop |
| `5` | Toggle |
| `6` | Done |
Note that the navigation environments only use actions `0` through `2`; all actions are nevertheless included for parity with the original `MiniGrid`-based environments.
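As a rough sketch of how such a functional environment is typically driven, the loop below assumes a gymnax-style interface (`reset` and `step` taking a PRNG key). These signatures are assumptions, not the confirmed `AMaze` API; consult [`envs/maze/maze.py`](../../src/minimax/envs/maze/maze.py) for the actual entry points.

```python
import jax

def random_rollout(env, params, rng, n_steps=16):
    """Drive a maze env with uniformly random navigation actions.

    The reset/step signatures used here are assumed (gymnax-style)
    and may differ from the actual AMaze interface.
    """
    rng, rng_reset = jax.random.split(rng)
    obs, state = env.reset(rng_reset, params)
    for _ in range(n_steps):
        rng, rng_act, rng_step = jax.random.split(rng, 3)
        # Navigation tasks only use actions 0-2 (left, right, forward).
        action = jax.random.randint(rng_act, (), 0, 3)
        obs, state, reward, done, _ = env.step(rng_step, state, action, params)
    return state
```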
## Teacher environment
View source: [`envs/maze/maze_ued.py`](../../src/minimax/envs/maze/maze_ued.py)
To support autocurricula generated by a co-adapting teacher policy (e.g. PAIRED), `AMaze` includes `UEDMaze`, which implements the teacher's MDP for designing `Maze` instances. By design, a pair of `Maze` and `UEDMaze` objects (corresponding to a specific setting of `EnvParams`) can be wrapped into a `UEDEnvironment` object for use in a training runner (see `PAIREDRunner` for an example).
The parameters that can be provided via `minimax.train` by default are denoted in the table below. Their corresponding command-line argument is the name of the parameter, preceded by the prefix `maze_ued`, e.g. `maze_ued_n_walls` for specifying `n_walls`. Note that when the corresponding `maze_*` and `maze_ued_*` arguments conflict, those specified in `maze_*` take precedence.
### Static EnvParams
| Variable | Description| Command-line support |
| - | - | - |
| `height` | Height of maze | ✅ |
| `width` | Width of maze | ✅ |
| `n_walls` | Wall budget | ✅ |
| `noise_dim` | Size of noise vector in the observation | ✅ |
| `replace_wall_pos` | If `True`, placing an object over an existing wall replaces it. Otherwise, the object is placed in a random unused position. | ✅ |
| `fixed_n_wall_steps` | First `n_walls` actions are wall positions if `True`. Otherwise, the first action only determines the fraction of wall budget to use. | ✅ |
| `first_wall_pos_sets_budget` | First wall position also determines the fraction of the wall budget to use (rather than using a separate first action for this purpose) | ✅ |
| `set_agent_dir` | If `True`, the action in an extra last time step determines the agent's initial orientation index | ✅ |
| `normalize_obs` | Scale observation values to [0,1] if `True` | ✅ |
### State space
| Variable | Description|
| - | - |
| `encoding` | A 1D vector encoding the running action sequence of the teacher |
| `time` | Current time step |
| `terminal` | `True` if the episode is done |
### Observation space
| Variable | Description|
| - | - |
| `image` | Full `maze_map` of the maze instance under construction |
| `time` | Time step |
| `noise` | A noise vector sampled from Uniform(0,1) |
### Action space
The action space corresponds to integers in [0,`height*width`]. Each action selects a wall location in the flattened maze grid, with the exception of the actions in the final two time steps, which select the goal position and the agent's starting position, respectively. This interpretation of the action sequence can change based on the specific configuration of `EnvParams` (a decoding sketch follows the list below):
- If `params.fixed_n_wall_steps=False`, the first action determines the fraction of the wall budget to use in the current episode.
- If `params.set_agent_dir=True`, an additional step is appended to the episode, in which the action determines the agent's initial orientation index.
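To make the encoding concrete, the helper below decodes a completed teacher action sequence into wall, goal, and agent positions. It is purely illustrative (not part of `AMaze`) and assumes the default setting `fixed_n_wall_steps=True` with `set_agent_dir=False`.

```python
def decode_teacher_actions(actions, height, width):
    """Illustrative decoding of a teacher action sequence.

    Assumes fixed_n_wall_steps=True and set_agent_dir=False, so all but
    the final two actions are wall placements, and the last two actions
    select the goal and agent start positions, respectively.
    """
    def to_xy(flat_idx):
        # Flattened grid index -> (x, y) position.
        y, x = divmod(flat_idx, width)
        return x, y

    wall_pos = [to_xy(a) for a in actions[:-2]]
    goal_pos = to_xy(actions[-2])
    agent_pos = to_xy(actions[-1])
    return wall_pos, goal_pos, agent_pos

# Example: a 3x3 maze with a 2-wall budget.
walls, goal, agent = decode_teacher_actions([0, 4, 8, 2], height=3, width=3)
```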
## OOD test environments
The `AMaze` module includes the set of OOD, human-designed environments used in previous studies to test zero-shot transfer (see the figure above for a summary of these environments). Several of these environments are procedurally generated:
- `Maze-SmallCorridor`
- `Maze-LargeCorridor`
- `Maze-FourRooms`
- `Maze-Crossing`
- `Maze-PerfectMaze*`
The OOD maze environments are defined in [`envs/maze/maze_ood.py`](../../src/minimax/envs/maze/maze_ood.py). They each subclass `Maze` and support customization via the `EnvParams` configuration, e.g. changing the default `height` or `width` values to generate larger or smaller instances.

docs/envs/overcooked.md (new file, +110 lines)
# `OvercookedUED`
## 🍳 Cooperative cooking in procedural layouts.
![Overcooked Overview](../images/Training6x9SmallStylised.png)
The `OvercookedUED` environment reproduces the classic Overcooked environment described by Carroll et al. (https://github.com/HumanCompatibleAI/overcooked_ai), while adding parallelisation across layouts and the ability for a teacher agent to design layouts.
Observation and action spaces are consistent with the original implementation and are thus omitted here.
The student environment builds on the JaxMARL project: https://github.com/FLAIROx/JaxMARL.
## Student environment
View source: [`envs/overcooked_proc/overcooked.py`](../../src/minimax/envs/overcooked_proc/overcooked.py)
### Static EnvParams
Similar to the `AMaze` environment, the parameters of the environment are described below, and interaction with these parameters is fundamentally the same (see the example invocation after the table below).
Most parameters are supported on the command line, as noted in the table.
| Parameter | Description| Command-line support |
| - | - | - |
| `height` | Height of Overcooked layout | ✅ |
| `width` | Width of Overcooked layout | ✅ |
| `h_min` | Minimum height of Overcooked layout | - |
| `w_min` | Minimum width of Overcooked layout | - |
| `n_walls` | Number of walls to place per Overcooked layout | ✅ |
| `replace_wall_pos` | Wall positions are sampled with replacement if `True` | ✅ |
| `normalize_obs`| Scale observation values to [0,1] if `True`| ✅ |
| `sample_n_walls` | Sample # walls placed between [0, `n_walls`] if `True` | ✅ |
| `max_steps` | Steps in Overcooked until termination | ✅ |
| `max_episode_steps` | Same as `max_steps` for consistency | ✅ |
| `singleton_seed` | Fix the random seed to this value, making the environment a singleton | |
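Assuming the same prefixing convention as `AMaze` (an `overcooked_` prefix is the natural analogue, though the registered prefix should be confirmed in the source), a training invocation might look like the following sketch:

```bash
# Only Overcooked-specific flags shown; the prefix is an assumption.
python -m minimax.train \
    --train_runner dr \
    --overcooked_height 6 \
    --overcooked_width 9 \
    --overcooked_n_walls 15 \
    --overcooked_max_steps 400
```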
### State space
| Variable | Description|
| - | - |
| `agent_pos` | Agent's (x,y) position |
| `agent_dir` | Agent's orientation vector |
| `agent_dir_idx` | Agent's orientation enum |
| `agent_inv` | Each agent's inventory |
| `goal_pos` | Positions of serving locations |
| `pot_pos` | Positions of pots |
| `wall_map` | H x W bool tensor, `True` in wall positions |
| `maze_map` | H x W x 3 map of the full layout |
| `bowl_pile_pos` | Positions of bowl piles |
| `onion_pile_pos` | Positions of onion piles |
| `time` | Number of steps taken |
| `terminal` | `True` iff episode is done |
## Teacher environment
View source: [`envs/overcooked_proc/overcooked_ued.py`](../../src/minimax/envs/overcooked_proc/overcooked_ued.py)
As with `AMaze`, we document the teacher environment below: `UEDOvercooked` implements the teacher's MDP for designing Overcooked layouts, configured by environment parameters analogous to those described above.
### Static EnvParams
| Variable | Description| Command-line support |
| - | - | - |
| `height` | Height of layout | ✅ |
| `width` | Width of layout | ✅ |
| `n_walls` | Wall budget | ✅ |
| `noise_dim` | Size of noise vector in the observation | ✅ |
| `replace_wall_pos` | If `True`, placing an object over an existing wall replaces it. Otherwise, the object is placed in a random unused position. | ✅ |
| `fixed_n_wall_steps` | First `n_walls` actions are wall positions if `True`. Otherwise, the first action only determines the fraction of wall budget to use. | ✅ |
| `first_wall_pos_sets_budget` | First wall position also determines the fraction of wall budget to use (rather than using a separate first action to separately determine this fraction) | ✅ |
| `use_seq_actions` | Whether to use sequential actions (always `True`) | ✅ |
| `normalize_obs` | Scale observation values to [0,1] if `True` | ✅ |
| `sample_n_walls` | Sample # walls placed between [0, `n_walls`] if `True` | ✅ |
| `max_steps` | See above | ✅ |
| `singleton_seed` | See above | ✅ |
| `max_episode_steps` | See above | ✅ |
### State space
| Variable | Description|
| - | - |
| `encoding` | A 1D vector encoding the running action sequence of the teacher |
| `time` | Current time step |
| `terminal` | `True` if the episode is done |
### Observation space
| Variable | Description|
| - | - |
| `image` | Full `maze_map` of the Overcooked instance under construction: H x W x 3 |
| `time` | Time step |
| `noise` | A noise vector sampled from Uniform(0,1) |
### Action space
As in `AMaze`, the action space corresponds to integers in [0,`height*width`]. Each action selects a wall location in the flattened maze grid, with the exception of the last few actions, which place objects in the environment (enumerated by `SequentialActions` below). This interpretation of the action sequence can change based on the specific configuration of `EnvParams`:
- If `params.fixed_n_wall_steps=False`, the first action determines the fraction of the wall budget to use in the current episode.
- If `params.set_agent_dir=True`, an additional step is appended to the episode, in which the action determines the agent's initial orientation index.
The actions are:
```python
from enum import IntEnum

class SequentialActions(IntEnum):
    skip = 0
    wall = 1
    goal = 2
    agent = 3
    onion = 4
    soup = 5
    bowls = 6
```
## OOD test environments
We include the original five layouts, plus additional ones, for OOD testing in [`envs/overcooked_proc/overcooked_ood.py`](../../src/minimax/envs/overcooked_proc/overcooked_ood.py).

docs/evaluate_args.md (new file, +37 lines)
# Command-line usage guide for `minimax.evaluate`
You can evaluate student agent checkpoints using `minimax.evaluate` as follows:
```bash
python -m minimax.evaluate \
--seed 1 \
--log_dir <absolute path log directory> \
--xpid_prefix <select checkpoints with xpids matching this prefix> \
--env_names <csv string of test environment names> \
--n_episodes <number of trials per test environment> \
--results_path <path to results folder> \
--results_fname <filename of output results csv>
```
Some behaviors of `minimax.evaluate` to be aware of:
- This command will search `log_dir` for all experiment directories with names matching `xpid_prefix` and evaluate the checkpoint named `<checkpoint_name>.pkl`.
- `minimax.evaluate` assumes xpid values end with a unique index, so that they match the regex `.*_[0-9]+$`.
- The results will be averaged over all such checkpoints (at most one checkpoint per matching experiment folder). The `--xpid_prefix` argument is useful for evaluating checkpoints that correspond to the same experimental configuration run with different training seeds (and thus share an xpid prefix, e.g. `<xpid_prefix>_0`, `<xpid_prefix>_1`, `<xpid_prefix>_2`).
If you would like to evaluate a checkpoint for only a single experiment, specify the full experiment directory name using `--xpid` instead of using `--xpid_prefix`.
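For instance, to average over the checkpoints of several training seeds sharing the (hypothetical) prefix `maze-plr`, one might run:

```bash
python -m minimax.evaluate \
    --seed 1 \
    --log_dir /absolute/path/to/logs \
    --xpid_prefix maze-plr \
    --env_names "Maze-FourRooms,Maze-LargeCorridor" \
    --n_episodes 100 \
    --results_path results/ \
    --results_fname maze_plr_results.csv
```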
## All command-line arguments
| Argument | Description |
| ----------------- | -------------------------------------------------------------------------------------------------------------------------------- |
| `seed` | Random seed for evaluation |
| `log_dir` | Directory containing experiment folders |
| `xpid` | Name of experiment folder, i.e. the experiment ID |
| `xpid_prefix` | Evaluate and average results over checkpoints for experiments with experiment IDs matching this prefix (ignores `--xpid` if set) |
| `checkpoint_name` | Name of checkpoint to evaluate (in each matching experiment folder) |
| `env_names` | Comma-separated list of test environment names |
| `n_episodes` | Number of evaluation episodes per test environment |
| `agent_idxs` | Indices of student agents to evaluate (csv of indices or `*` for all indices) |
| `results_path` | Path to the results folder |
| `results_fname` | Filename of the output results CSV |
| `render_mode` | If set, renders the evaluation episode. Requires disabling JIT. Use `'ipython'` if rendering inside an IPython notebook. |

docs/make_cmd.md (new file, +28 lines)
# Generating commands
The `minimax.config.make_cmd` module enables generating batches of commands from a JSON configuration file, e.g. for running array jobs with Slurm. The JSON should adhere to the following format:
- Each key is a valid command-line argument for `minimax.train`.
- Each value is a list of values for the corresponding command-line argument. Commands are generated for each combination of command-line argument values.
- Boolean values should be specified as 'True' or 'False'.
- If a value is specified as `null`, the associated command-line argument is not included in the generated command (and thus would take on the default value specified when defining the argument parser).
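For illustration, a minimal configuration sketch might look like the following. The argument names come from the [`minimax.train` docs](train_args.md); the values are made up.

```json
{
    "seed": [1, 2, 3],
    "train_runner": ["plr"],
    "n_total_updates": [30000],
    "lr": [0.0001, 0.0003],
    "plr_replay_prob": [0.8],
    "track_env_metrics": ["True"],
    "lr_final": [null]
}
```

This file would generate 3 × 2 = 6 commands, one per combination of `seed` and `lr`; `lr_final` would be omitted from each generated command.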
You can try it out by running the following command in your project root directory:
```
python -m minimax.config.make_cmd --config maze/plr
```
The above command will create a directory called `config` in the calling directory with a subdirectory `config/maze` containing configuration files for several autocurriculum methods.
By default, `minimax.config.make_cmd` searches for configuration files inside `config`. You can create your own JSON configuration files within `config`. If your JSON configuration is located at `config/path/to/my/json`, then you can generate commands with it by calling `minimax.config.make_cmd --config path/to/my/json`.
## Configuring `wandb`
If your configuration includes the argument `wandb_project`, then `minimax.config.make_cmd` will look for a JSON dictionary with your credentials at `config/wandb.json`. The expected format of this JSON file is
```json
{
"base_url": <URL for wandb API endpoint, e.g. https://api.wandb.ai>,
"api_key": <Your wandb API key>
}
```

docs/parsnip.md (new file, +131 lines)
# `Parsnip`
## 🥕 `argparse` with conditional argument groups.
As `minimax.train` is the single point-of-entry for training, its command-line arguments can grow quickly in number with each additional autocurriculum method supported in `minimax`. This complexity arises for several reasons:
- New components in the form of training runners, environments, agents, and models may require additional arguments
- New components may require existing arguments shared with previous components
- New components may overload the meaning of existing arguments used by other components
We make use of a custom module called `Parsnip` to help manage the complexity of specifying and parsing command-line arguments. `Parsnip` supports creating named argument groups, which makes it possible to add new arguments while explicitly separating them into namespaces. Each argument group results in its own kwargs dictionary when parsed.
`Parsnip` directly builds on `argparse` by adding the notion of a "subparser". Here, a subparser is simply an `argparse` parser responsible for a named argument group. Subparsers enable some useful behavior:
- Arguments can be added to the top-level `Parsnip` parser or to a subparser.
- Each subparser is initialized with a `name` for its corresponding argument group. All arguments under this subparser will be contained in a nested kwarg dictionary under the key equal to `name`.
- Each subparser can be initialized with an optional `prefix`, in which case all command-line arguments added to the subparser will be prepended with the value of `prefix` (see example below), thus creating a namespace for the corresponding argument group.
- Subparsers can be added conditionally, based on the specific value of a top-level argument (with support for the wildcard `*`).
- After parsing, `Parsnip` produces a kwargs dictionary containing a key:value pair for each top-level argument, plus a nested kwargs dictionary per argument group under a key of the form `<dest>_args` (where `dest` defaults to the subparser's `name`; see `crop_args` in the example below), containing the parsed arguments managed by each active subparser.
Other than these details, `Parsnip`'s interface remains identical to that of `argparse`.
## A minimal example
In this example, we assume the parser is used inside a script called `run.py`.
```python
from util.parsnip import Parsnip

# Create a new Parsnip parser
parser = Parsnip()

# Add some top-level arguments (same as argparse)
parser.add_argument(
    '--name',
    type=str,
    help='Name of my farm.')
parser.add_argument(
    '--kind',
    type=str,
    choices=['apple', 'radish'],
    help='What kind of farm I run.')
parser.add_argument(
    '--n_acres',
    type=int,
    help='Size of my farm in acres.')

# Create a nested argument group with a prefix
crop_subparser = parser.add_subparser(name='crop', prefix='crop')
crop_subparser.add_argument(
    '--n_acres',
    type=int,
    help='Size of land for growing crops, in acres.')

# Create a conditional argument group, activated when --kind=radish
radish_subparser = parser.add_subparser(
    name='radish',
    prefix='radish',
    dependency={'kind': 'radish'},
    dest='crop')
radish_subparser.add_argument(
    '--is_pickled',
    type=str2bool,  # str2bool is assumed importable alongside Parsnip
    default=False,
    help='Whether my farm produces pickled radish.')

# Create another conditional argument group, activated when --kind=apple
apple_subparser = parser.add_subparser(
    name='apple',
    prefix='apple',
    dependency={'kind': 'apple'},
    dest='crop')
apple_subparser.add_argument(
    '--kind',
    type=str,
    choices=['fuji', 'mcintosh'],
    default='fuji',
    help='What kind of apples my farm grows.')

args = parser.parse_args()
```
Then running this command
```bash
python run.py \
--name 'Radelicious Farms' \
--kind radish \
--n_acres 200 \
--crop_n_acres 150 \
--radish_is_pickled True
```
would produce this kwargs dictionary:
```python
{
'name': 'Radelicious Farms',
'kind': 'radish',
'n_acres': 200,
'crop_args': {
'n_acres': 150,
'is_pickled': True
}
}
```
Notice how the `prefix` for each subparser is prepended to each argument name added to that subparser (e.g. `n_acres` becomes `crop_n_acres`, and `is_pickled` becomes `radish_is_pickled`). Also notice how the `radish_is_pickled` argument becomes active, as its activation is conditioned on `kind=radish`, as specified when defining `radish_subparser`.
Likewise, running this command
```bash
python run.py \
--name 'Appledores Farms' \
--kind apple \
--n_acres 200 \
--crop_n_acres 150 \
--apple_kind fuji
```
results in this kwargs dictionary:
```python
{
'name': 'Appledores Farms',
'kind': 'apple',
'n_acres': 200,
'crop_args': {
'n_acres': 150,
'kind': 'fuji'
}
}
```

docs/train_args.md (new file, +125 lines)
# Command-line usage guide for `minimax.train`
Parsing command-line arguments is handled by [`Parsnip`](parsnip.md).
You can quickly generate batches of training commands from a JSON configuration file using [`minimax.config.make_cmd`](make_cmd.md).
## General arguments
| Argument | Description |
| ----------------------- | ---------------------------------------------------------------------------------------------------- |
| `seed` | Random seed, should be unique per experimental run |
| `agent_rl_algo` | Base RL algorithm used for training (e.g. PPO) |
| `n_total_updates` | Total number of updates for the training run |
| `train_runner` | Which training runner to use, e.g. `dr`, `plr`, or `paired` |
| `n_devices` | Number of devices over which to shard the environment batch dimension |
| `n_students` | Number of students in the autocurriculum |
| `n_parallel` | Number of parallel environments |
| `n_eval` | Number of parallel trials per environment (environment batch dimension is then `n_parallel*n_eval`) |
| `n_rollout_steps` | Number of steps per rollout (used for each update cycle) |
| `lr` | Learning rate |
| `lr_final` | Final learning rate, based on linear schedule. Defaults to `None`, corresponding to no schedule. |
| `lr_anneal_steps` | Number of steps over which to linearly anneal from `lr` to `lr_final` |
| `student_value_coef` | Value loss coefficient |
| `student_entropy_coef` | Entropy bonus coefficient |
| `student_unroll_update` | Unroll multi-gradient updates this many times (can lead to speed ups) |
| `max_grad_norm` | Clip gradients beyond this magnitude |
| `adam_eps` | Value of $`\epsilon`$ numerical stability constant for Adam |
| `discount` | Discount factor $`\gamma`$ for the student's RL optimization |
| `n_unroll_rollout` | Unroll rollout scans this many times (can lead to speed ups) |
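For concreteness, a minimal invocation combining several of these arguments might look like the following sketch (all values, including the `xpid`, are placeholders):

```bash
python -m minimax.train \
    --seed 1 \
    --agent_rl_algo ppo \
    --train_runner dr \
    --n_total_updates 30000 \
    --n_parallel 32 \
    --n_eval 1 \
    --n_rollout_steps 256 \
    --lr 3e-4 \
    --log_dir ~/logs/minimax \
    --xpid maze-dr-test
```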
## Logging arguments
| Argument | Description |
| ------------------- | -------------------------------------------------------- |
| `verbose` | Enable verbose logging if `True` |
| `track_env_metrics` | Track per rollout batch environment metrics if `True` |
| `log_dir` | Path to directory storing all experiment folders |
| `xpid` | Unique name for experiment folder, stored in `--log_dir` |
| `log_interval` | Log training statistics every this many rollout cycles |
| `wandb_base_url` | Base API URL if logging with `wandb` |
| `wandb_api_key` | API key for `wandb` |
| `wandb_entity` | `wandb` entity associated with the experiment run |
| `wandb_project` | `wandb` project for the experiment run |
| `wandb_group` | `wandb` group for the experiment run |
## Checkpointing arguments
| Argument | Description |
| ---------------------- | ----------------------------------------------------------------------------- |
| `checkpoint_interval` | Checkpoint the training state every this many rollout cycles |
| `from_last_checkpoint` | Begin training from latest `checkpoint.pkl`, if any, in the experiment folder |
| `archive_interval` | Save an additional checkpoint for models trained per this many rollout cycles |
## Evaluation arguments
| Argument | Description |
| ----------------- | -------------------------------------------------------------------- |
| `test_env_names` | Comma-separated list of test environment names |
| `test_n_episodes` | Average test results over this many episodes per test environment |
| `test_agent_idxs` | Test agents at these indices (csv of indices or `*` for all indices) |
## PPO arguments
These arguments activate when `--agent_rl_algo=ppo`.
| Argument | Description |
| ----------------------------- | ----------------------------------------------------------- |
| `student_ppo_n_epochs` | Number of PPO epochs per update cycle |
| `student_ppo_n_minibatches` | Number of minibatches per PPO epoch |
| `student_ppo_clip_eps` | Clip coefficient for PPO |
| `student_ppo_clip_value_loss` | Perform value clipping if `True` |
| `gae_lambda` | Lambda discount factor for Generalized Advantage Estimation |
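For reference, `gae_lambda` is the $`\lambda`$ in the Generalized Advantage Estimator, which geometrically weights TD errors using the discount $`\gamma`$ set by `discount`:

```math
\hat{A}_t = \sum_{l=0}^{T-t-1} (\gamma\lambda)^l\,\delta_{t+l},
\qquad \delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)
```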
## PAIRED arguments
The arguments in this section activate when `--train_runner=paired`.
| Argument | Description |
| ------------------------- | --------------------------------------------------------------------- |
| `teacher_lr` | Learning rate |
| `teacher_lr_final` | Anneal learning rate to this value (defaults to `teacher_lr`) |
| `teacher_lr_anneal_steps` | Number of steps over which to linearly anneal from `teacher_lr` to `teacher_lr_final` |
| `teacher_discount` | Discount factor, $`\gamma`$ |
| `teacher_value_loss_coef` | Value loss coefficient |
| `teacher_entropy_coef` | Entropy bonus coefficient |
| `teacher_n_unroll_update` | Unroll multi-gradient updates this many times (can lead to speed ups) |
| `ued_score` | Name of UED objective, e.g. `relative_regret` |
These PPO-specific arguments for teacher optimization further activate when `--agent_rl_algo=ppo`.
| Argument | Description |
| ----------------------------- | ----------------------------------------------------------- |
| `teacher_ppo_n_epochs` | Number of PPO epochs per update cycle |
| `teacher_ppo_n_minibatches` | Number of minibatches per PPO epoch |
| `teacher_ppo_clip_eps` | Clip coefficient for PPO |
| `teacher_ppo_clip_value_loss` | Perform value clipping if `True` |
| `teacher_gae_lambda` | Lambda discount factor for Generalized Advantage Estimation |
## PLR arguments
The arguments in this section activate when `--train_runner=plr`.
| Argument | Description |
| ----------------------------- | ------------------------------------------------------------------------------------------------------------- |
| `ued_score` | Name of UED objective (aka PLR scoring function) |
| `plr_replay_prob` | Replay probability |
| `plr_buffer_size` | Size of level replay buffer |
| `plr_staleness_coef` | Staleness coefficient |
| `plr_temp` | Score distribution temperature |
| `plr_use_score_ranks` | Use rank-based prioritization (rather than proportional) |
| `plr_min_fill_ratio` | Only replay once level replay buffer is filled above this ratio |
| `plr_use_robust_plr` | Use robust PLR (i.e. only update policy on replay levels) |
| `plr_force_unique` | Force level replay buffer members to be unique |
| `plr_use_parallel_eval` | Use Parallel PLR or Parallel ACCEL (if `plr_mutation_fn` is set) |
| `plr_mutation_fn` | If set, PLR becomes ACCEL. Use `'default'` for default mutation operator per environment. |
| `plr_n_mutations` | Number of applications of `plr_mutation_fn` per mutation cycle. |
| `plr_mutation_criterion` | How replay levels are selected for mutation (e.g. `batch`, `easy`, `hard`). |
| `plr_mutation_subsample_size` | Number of replay levels selected for mutation according to the criterion (ignored if using `batch` criterion) |
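As an illustration (all values are placeholders), the flag combination below enables robust PLR with rank-based prioritization; setting `plr_mutation_fn` further turns the run into ACCEL:

```bash
# Placeholder values; combine with the general arguments above.
python -m minimax.train \
    --train_runner plr \
    --plr_replay_prob 0.8 \
    --plr_buffer_size 4000 \
    --plr_staleness_coef 0.3 \
    --plr_temp 0.1 \
    --plr_use_score_ranks True \
    --plr_mutation_fn default \
    --plr_n_mutations 10
```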
## Environment-specific arguments
### Maze
See the [`AMaze`](envs/maze.md) docs for details on how to specify [training](envs/maze.md#student-environment), [evaluation](envs/maze.md#student-environment), and [teacher-specific](envs/maze.md#teacher-environment) environment parameters via the command line.