[launcher] Add elastic and fault-tolerant training support (#8286)

Signed-off-by: Butui Hu <hot123tea123@gmail.com>
2025-06-05 16:40:03 +08:00
parent 69c9e379d5
commit 1a33d65a56
3 changed files with 66 additions and 18 deletions
--- a/examples/README.md
+++ b/examples/README.md
@@ -165,6 +165,14 @@ FORCE_TORCHRUN=1 NNODES=2 NODE_RANK=0 MASTER_ADDR=192.168.0.1 MASTER_PORT=29500
 FORCE_TORCHRUN=1 NNODES=2 NODE_RANK=1 MASTER_ADDR=192.168.0.1 MASTER_PORT=29500 llamafactory-cli train examples/train_full/llama3_full_sft.yaml
 ```

+### Elastic and Fault-Tolerant Supervised Fine-Tuning on Multiple Nodes
+
+To launch an elastic job that retries up to `MAX_RESTARTS` times on failure, run the following command on each participating node, using at least `MIN_NNODES` and at most `MAX_NNODES` nodes. Set `RDZV_ID` to a unique job ID shared by all nodes participating in the job. See also the [torchrun documentation](https://docs.pytorch.org/docs/stable/elastic/run.html).
+
+```bash
+FORCE_TORCHRUN=1 MIN_NNODES=1 MAX_NNODES=3 MAX_RESTARTS=3 RDZV_ID=llamafactory MASTER_ADDR=192.168.0.1 MASTER_PORT=29500 llamafactory-cli train examples/train_full/llama3_full_sft.yaml
+```
+
 #### Multimodal Supervised Fine-Tuning

 ```bash