---
name: trl
description: Reference for the TRL (Transformer Reinforcement Learning) library codebase. Use proactively before reading or editing any file under `trl/` so you have the intended contracts and invariants in mind, not just what the current code says. Covers trainer hierarchy (SFT, DPO, GRPO, KTO), shared utility functions (selective_log_softmax, decode_and_strip_padding, padding helpers), configuration system, model wrappers, and how data flows through any TRL trainer.
---

# TRL Library Reference

## Package Structure

TRL is organized around a trainer hierarchy that extends Hugging Face `transformers.Trainer`.

```
trl/
├── trainer/
│   ├── grpo_trainer.py          # GRPOTrainer
│   ├── grpo_config.py           # GRPOConfig
│   ├── sft_trainer.py           # SFTTrainer (supervised fine-tuning)
│   ├── dpo_trainer.py           # DPOTrainer (direct preference optimization)
│   ├── kto_trainer.py           # KTOTrainer (Kahneman-Tversky optimization)
│   ├── online_dpo_trainer.py    # OnlineDPOTrainer
│   ├── utils.py                 # Shared utilities (log probs, decoding, padding)
│   └── ...
├── models/
│   └── modeling_value_head.py   # Value head for PPO-style trainers
├── data_utils.py
├── commands/                    # CLI entry points
└── ...
```

## Trainer Hierarchy

All TRL trainers extend `transformers.Trainer`:

```
transformers.Trainer
├── SFTTrainer          # Supervised fine-tuning
├── DPOTrainer          # Direct preference optimization
├── GRPOTrainer         # Group relative policy optimization
├── KTOTrainer          # Kahneman-Tversky optimization
└── OnlineDPOTrainer    # Online DPO
```

Each trainer overrides `compute_loss` with its specific objective, and RL-based trainers (GRPO, OnlineDPO) additionally override `training_step` to add a generation phase before the optimization step.

## Shared Utility Functions (trainer/utils.py)

These utilities are used across multiple trainers. Read the source before modifying; the contracts below are what callers rely on.

### `selective_log_softmax(logits, index)`

Memory-efficient per-token log-probability.
Equivalent in value to `F.log_softmax(logits, dim=-1).gather(...)` at the selected token positions, but avoids materializing the full vocab-sized tensor.

**Contract:**

- Input: `logits [B, T, V]`, `index [B, T]`
- Output: `log_probs [B, T]`, each entry a valid log-probability (i.e. non-positive)
- Must agree with `F.log_softmax` to within numerical tolerance on the same inputs

### `decode_and_strip_padding(input_ids, tokenizer)`

Converts a batch of token ID tensors into the cleaned text strings that the reward function will score.

**Contract:**

- Input: `input_ids [B, T]`, tokenizer
- Output: `list[str]` of length `B`
- Strips padding and decoder artefacts
- Handles any reasoning-block conventions the library supports; the exact policy for complete, incomplete, and absent reasoning markers is defined in the implementation

### Other Utilities

- `pad` / `pad_to_length`: Pad tensors to equal or specific lengths
- Various tokenizer helpers for batch processing

## Configuration System

All TRL configs extend `transformers.TrainingArguments`. Each trainer adds its own fields:

| Config | Trainer | Key fields |
|--------|---------|------------|
| `SFTConfig` | SFTTrainer | `max_seq_length`, `packing`, `dataset_text_field` |
| `DPOConfig` | DPOTrainer | `beta`, `loss_type`, `reference_free` |
| `GRPOConfig` | GRPOTrainer | `num_generations`, `beta`, `epsilon`, `reward_functions` |
| `KTOConfig` | KTOTrainer | `beta`, `desirable_weight`, `undesirable_weight` |

## Available References

| File | Contents | When to load |
|------|----------|-------------|
| `references/trl-codebase.md` | Module-by-module guide to TRL source: detailed breakdown of each trainer, model wrappers, data utilities, and CLI commands | When navigating unfamiliar parts of TRL beyond the trainer layer, or when you need details about a specific non-GRPO trainer |
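
The `selective_log_softmax` contract under Shared Utility Functions rests on one identity: the log-probability of a selected token equals its logit minus the log-sum-exp over the vocabulary. The sketch below illustrates that identity in plain Python on nested lists, so it runs without PyTorch. It is not TRL's implementation, and the function names `reference_log_probs` and `selective_log_probs_sketch` are hypothetical helpers introduced only for this example.

```python
import math


def _logsumexp(values):
    """Numerically stable log(sum(exp(v))) for one vocab-sized row."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def reference_log_probs(logits, index):
    """Reference path: build the full log-softmax row, then gather one entry."""
    out = []
    for seq_logits, seq_index in zip(logits, index):
        row = []
        for token_logits, token_id in zip(seq_logits, seq_index):
            lse = _logsumexp(token_logits)
            full = [v - lse for v in token_logits]  # vocab-sized intermediate
            row.append(full[token_id])
        out.append(row)
    return out


def selective_log_probs_sketch(logits, index):
    """Selective path: selected logit minus logsumexp, no vocab-sized output."""
    out = []
    for seq_logits, seq_index in zip(logits, index):
        row = []
        for token_logits, token_id in zip(seq_logits, seq_index):
            row.append(token_logits[token_id] - _logsumexp(token_logits))
        out.append(row)
    return out


# Toy batch: B=2 sequences, T=2 positions, V=3 vocabulary entries.
logits = [
    [[2.0, 0.5, -1.0], [0.0, 0.0, 0.0]],
    [[-3.0, 4.0, 1.0], [10.0, -10.0, 0.0]],
]
index = [[0, 2], [1, 1]]

selected = selective_log_probs_sketch(logits, index)
reference = reference_log_probs(logits, index)

# Check the three contract points: shape [B, T], non-positive, agreement.
for sel_row, ref_row, idx_row in zip(selected, reference, index):
    assert len(sel_row) == len(idx_row)
    for s, r in zip(sel_row, ref_row):
        assert s <= 0.0
        assert math.isclose(s, r, rel_tol=1e-9, abs_tol=1e-12)
```

When editing the real utility, the same three checks (output shape, non-positivity, agreement with the `F.log_softmax` path) are a reasonable starting point for a regression test on tensors.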