support badam for all stages

Former-commit-id: 7a1380646119bfe6855f73dd90570defcea05281
Author: hiyouga
Date: 2024-04-16 17:44:48 +08:00
Parent: 42084e08ae
Commit: a4167fd925
9 changed files with 61 additions and 28 deletions


@@ -46,7 +46,7 @@ Choose your path:
 - **Various models**: LLaMA, Mistral, Mixtral-MoE, Qwen, Yi, Gemma, Baichuan, ChatGLM, Phi, etc.
 - **Integrated methods**: (Continuous) pre-training, supervised fine-tuning, reward modeling, PPO, DPO and ORPO.
 - **Scalable resources**: 32-bit full-tuning, 16-bit freeze-tuning, 16-bit LoRA and 2/4/8-bit QLoRA via AQLM/AWQ/GPTQ/LLM.int8.
-- **Advanced algorithms**: GaLore, DoRA, LongLoRA, LLaMA Pro, LoRA+, LoftQ and Agent tuning.
+- **Advanced algorithms**: GaLore, BAdam, DoRA, LongLoRA, LLaMA Pro, LoRA+, LoftQ and Agent tuning.
 - **Practical tricks**: FlashAttention-2, Unsloth, RoPE scaling, NEFTune and rsLoRA.
 - **Experiment monitors**: LlamaBoard, TensorBoard, Wandb, MLflow, etc.
 - **Faster inference**: OpenAI-style API, Gradio UI and CLI with vLLM worker.
@@ -68,14 +68,16 @@ Compared to ChatGLM's [P-Tuning](https://github.com/THUDM/ChatGLM2-6B/tree/main/
 ## Changelog
+[24/04/16] We supported **[BAdam](https://arxiv.org/abs/2404.02827)**. See `examples/extras/badam` for usage.
 [24/04/16] We supported **[unsloth](https://github.com/unslothai/unsloth)**'s long-sequence training (Llama-2-7B-56k within 24GB). It achieves **117%** speed and **50%** memory compared with FlashAttention-2, more benchmarks can be found in [this page](https://github.com/hiyouga/LLaMA-Factory/wiki/Performance-comparison).
 [24/03/31] We supported **[ORPO](https://arxiv.org/abs/2403.07691)**. See `examples/lora_single_gpu` for usage.
-[24/03/21] Our paper "[LlamaFactory: Unified Efficient Fine-Tuning of 100+ Language Models](https://arxiv.org/abs/2403.13372)" is available at arXiv!
 <details><summary>Full Changelog</summary>
+[24/03/21] Our paper "[LlamaFactory: Unified Efficient Fine-Tuning of 100+ Language Models](https://arxiv.org/abs/2403.13372)" is available at arXiv!
 [24/03/20] We supported **FSDP+QLoRA** that fine-tunes a 70B model on 2x24GB GPUs. See `examples/extras/fsdp_qlora` for usage.
 [24/03/13] We supported **[LoRA+](https://arxiv.org/abs/2402.12354)**. See `examples/extras/loraplus` for usage.
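
The new BAdam changelog entry above points to `examples/extras/badam` in the repository for the supported configuration. As a rough, self-contained sketch of the underlying idea from the linked paper (block-coordinate descent with Adam), not this repository's implementation, the snippet below keeps only one block of layers trainable at a time and rotates the active block on a fixed interval; the model, data, learning rate and 50-step switch interval are placeholder assumptions.

```python
import torch
from torch import nn

# Toy stand-ins: any model partitioned into parameter blocks would do.
model = nn.Sequential(*[nn.Linear(64, 64) for _ in range(8)])
blocks = [list(layer.parameters()) for layer in model]  # one block per layer
switch_interval = 50  # assumed block-switching frequency

active = 0
optimizer = None
for step in range(200):
    if step % switch_interval == 0:
        # Freeze everything, then unfreeze only the active block and give it
        # a fresh Adam state (block-coordinate descent, as in the BAdam paper).
        for p in model.parameters():
            p.requires_grad_(False)
        for p in blocks[active]:
            p.requires_grad_(True)
        optimizer = torch.optim.AdamW(blocks[active], lr=1e-4)
        active = (active + 1) % len(blocks)

    x = torch.randn(4, 64)            # placeholder batch
    loss = model(x).pow(2).mean()     # placeholder loss
    loss.backward()                   # gradients flow only into the active block
    optimizer.step()
    optimizer.zero_grad()
```

Because only the active block carries optimizer state at any time, the Adam moment buffers scale with the block size rather than the full model, which is why BAdam lands in the same memory row as LoRA and GaLore in the table below.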
@@ -278,16 +280,15 @@ huggingface-cli login
 \* *estimated*
-| Method | Bits | 7B | 13B | 30B | 70B | 8x7B |
-| ------ | ---- | ----- | ----- | ----- | ------ | ------ |
-| Full | AMP | 120GB | 240GB | 600GB | 1200GB | 900GB |
-| Full | 16 | 60GB | 120GB | 300GB | 600GB | 400GB |
-| GaLore | 16 | 16GB | 32GB | 64GB | 160GB | 120GB |
-| Freeze | 16 | 20GB | 40GB | 80GB | 200GB | 160GB |
-| LoRA | 16 | 16GB | 32GB | 64GB | 160GB | 120GB |
-| QLoRA | 8 | 10GB | 20GB | 40GB | 80GB | 60GB |
-| QLoRA | 4 | 6GB | 12GB | 24GB | 48GB | 30GB |
-| QLoRA | 2 | 4GB | 8GB | 16GB | 24GB | 18GB |
+| Method | Bits | 7B | 13B | 30B | 70B | 8x7B |
+| ----------------- | ---- | ----- | ----- | ----- | ------ | ------ |
+| Full | AMP | 120GB | 240GB | 600GB | 1200GB | 900GB |
+| Full | 16 | 60GB | 120GB | 300GB | 600GB | 400GB |
+| Freeze | 16 | 20GB | 40GB | 80GB | 200GB | 160GB |
+| LoRA/GaLore/BAdam | 16 | 16GB | 32GB | 64GB | 160GB | 120GB |
+| QLoRA | 8 | 10GB | 20GB | 40GB | 80GB | 60GB |
+| QLoRA | 4 | 6GB | 12GB | 24GB | 48GB | 30GB |
+| QLoRA | 2 | 4GB | 8GB | 16GB | 24GB | 18GB |
 ## Getting Started