Compare commits

194 Commits

```
7468f2535c 38e4f22605 2bc2fe7b5e 6d0140d8a0 7856f98965 e25ddef08c 95a4589bbf 566d71b7a9
6030a4a720 5dc0cb94d4 325dafcbb0 1a8a8b8651 61a495cb1e 75866aa020 9e4fda326d 1131ddfaff
9f437b5c43 0cc03d3f05 04fc2f78bf 3ac333fc6a a246ac1914 48ceac845c b1986a06b9 43d134ba29
1348f7d860 f6530222f7 a74a7585e0 5bf0cca2b8 755b6511ff 35621c6089 38b59664e6 933a084999
c1510d19c7 2074cf99fb b12176d818 117b67ea30 03e20bb5c6 0c4a1381a4 9e14501edb 1dc963caa6
85726c91ce 40211db275 e7f13098c6 61eb3a3d46 be0a807e8c 52d402e2a9 c5a46f9113 00e17a377c
9abd83adb1 f0d2afcf90 1aba442bcd d764cd8736 526111a303 b8364046df 1f617c6e08 a6858a36c0
6198121923 b0efebf853 fbd0584391 50224b09cc 32dcc5a491 9408366a36 f0e564beaa 14b75a0b93
59e6ebf039 dc540dfaa8 587e65e442 a916688723 3336422760 04423b916f bf8d2f8eda 2a5d02fd0f
ea550ed9e0 02665cd42b 0c6a94e66d ebd6bc2604 daab85e3e6 769d81a83d ac2a401b1d bb53c18153
04e0fe9147 39f75c7001 7f99cb1817 c555b2cce3 2eba1c6851 edeed55664 92248f9cb2 c548ad5e69
a57d839e1d d88a34bc79 60cbc9d0e5 d5005e766f 4d0753cffe 1cf0f11840 052e8b2cc6 8963e89633
935ee0a023 5ed234ca63 04884a0911 c7af26a9e3 d8073488be 6fc2d7e063 e93c7cdb80 c32d6c8250
757158da63 ffdacaa618 e194efab10 772fc2eac7 ed020579dc 096869c7b6 c6873211e9 623ee1bd88
aabe90343e 764cfb506d 249ad56075 46f99ff277 73f4513c84 3c91e86268 42473ec150 6a4e4b9c5b
9a784fb4f3 43fd80a1aa e6ab1a57ea 282edb9161 dff77004f2 6c1b4aec75 7814db1b42 c9ed3fc3a4
9ee416a8fc 4f9a47c026 3fcb1c6d09 7c492864e9 7ff8a064f3 c635bbe465 4881f4e631 c631799f5d
48846676d8 f37d481c5d 5d7d8bd55c 8ed1463236 43b2ede0f8 2f095e2017 9b55bb964c 9b97b23ce7
53ab28533e 940c00e7ae 18cfd5f349 6169df1c52 d46c2bbcba 48d4364586 8042c66a76 3879d79b89
e416cecf62 81fcb80466 bf812fbe40 1e6fb6c8aa 5d0c95bd02 7cd2417002 16851d66e5 056d2d956a
9a69cadab3 3de642bffd 286b9d9849 cef1ede826 5007566588 e93fb3cc6c 7578209735 67f02f75d0
73d9dfc7ab 6b407092d9 3168abc0a1 46ee267cfc a10bead9b5 3553e301dd 02b838b9b0 b1de6d1025
bc67872218 0229fffde5 3555b87363 2dca53962e f4f71f2797 77ab9457ed 4fa53b6282 790b73586b
9c29c2a172 863960d33e 330e5381b4 5bb411fdb8 59a9a5994e 5306a71b42 3eafa2dd9e 88fddb879d
71491825bf 30855b924a
```
**.dockerignore** (new file, 11 lines)

```diff
@@ -0,0 +1,11 @@
+.vscode
+.git
+.github
+.venv
+cache
+data
+examples
+.dockerignore
+.gitattributes
+.gitignore
+Dockerfile
```
**.github/PULL_REQUEST_TEMPLATE.md** (2 changes, vendored)

```diff
@@ -4,4 +4,4 @@ Fixes # (issue)
 
 ## Before submitting
 
-- [ ] Did you read the [contributor guideline](/CONTRIBUTING.md)?
+- [ ] Did you read the [contributor guideline](https://github.com/hiyouga/LLaMA-Factory/blob/main/.github/CONTRIBUTING.md)?
```
**SECURITY.md → .github/SECURITY.md** (renamed, 2 changes, vendored)

```diff
@@ -1,6 +1,6 @@
 # Reporting Security Issues
 
-To report a security issue, please use the GitHub Security Advisory ["Report a Vulnerability"](https://github.com/electron/electron/security/advisories/new) tab.
+To report a security issue, please use the GitHub Security Advisory ["Report a Vulnerability"](https://github.com/hiyouga/LLaMA-Factory/security/advisories/new) tab.
 
 We will send a response indicating the next steps in handling your report. After the initial reply to your report, the security team will keep you informed of the progress towards a fix and full announcement, and may ask for additional information or guidance.
```
**CITATION.cff** (new file, 37 lines)

```yaml
cff-version: 1.2.0
date-released: 2024-03
message: "If you use this software, please cite it as below."
authors:
- family-names: "Zheng"
  given-names: "Yaowei"
- family-names: "Zhang"
  given-names: "Richong"
- family-names: "Zhang"
  given-names: "Junhao"
- family-names: "Ye"
  given-names: "Yanhan"
- family-names: "Luo"
  given-names: "Zheyan"
- family-names: "Ma"
  given-names: "Yongqiang"
title: "LlamaFactory: Unified Efficient Fine-Tuning of 100+ Language Models"
url: "https://arxiv.org/abs/2403.13372"
preferred-citation:
  type: article
  authors:
  - family-names: "Zheng"
    given-names: "Yaowei"
  - family-names: "Zhang"
    given-names: "Richong"
  - family-names: "Zhang"
    given-names: "Junhao"
  - family-names: "Ye"
    given-names: "Yanhan"
  - family-names: "Luo"
    given-names: "Zheyan"
  - family-names: "Ma"
    given-names: "Yongqiang"
  journal: "arXiv preprint arXiv:2403.13372"
  title: "LlamaFactory: Unified Efficient Fine-Tuning of 100+ Language Models"
  url: "https://arxiv.org/abs/2403.13372"
  year: 2024
```
**Dockerfile** (new file, 14 lines)

```dockerfile
FROM nvcr.io/nvidia/pytorch:24.01-py3

WORKDIR /app

COPY requirements.txt /app/
RUN pip install -r requirements.txt

COPY . /app/
RUN pip install -e .[deepspeed,metrics,bitsandbytes,qwen]

VOLUME [ "/root/.cache/huggingface/", "/app/data", "/app/output" ]
EXPOSE 7860

CMD [ "python", "src/train_web.py" ]
```
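The image bundles the repository into `/app`, declares volumes for the Hugging Face cache, data and output directories, and starts the LLaMA Board web UI on the exposed port. Below is a minimal sketch of building and running it; the tag name and port mapping are illustrative, and a fuller invocation with mounted volumes appears in the README changes further down.

```bash
# Build the image from the repository root (tag name is arbitrary).
docker build -t llama-factory:latest .

# Start the container; the CMD launches src/train_web.py, which serves on the exposed port 7860.
docker run --gpus=all -p 7860:7860 llama-factory:latest
```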
**Makefile** (6 changes)

```diff
@@ -1,11 +1,11 @@
 .PHONY: quality style
 
-check_dirs := src tests
+check_dirs := scripts src tests
 
 quality:
-	ruff $(check_dirs)
+	ruff check $(check_dirs)
 	ruff format --check $(check_dirs)
 
 style:
-	ruff $(check_dirs) --fix
+	ruff check $(check_dirs) --fix
 	ruff format $(check_dirs)
```
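Both targets shell out to `ruff`, so the linter has to be installed first (presumably via `pip install ruff` or the `quality` extra listed in the README changes below). A quick sketch of exercising them:

```bash
pip install ruff   # the tool both targets invoke
make quality       # ruff check + ruff format --check over scripts, src and tests
make style         # the same scope in fix mode: autofix lint issues, then reformat in place
```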
**README.md** (527 changes)

```diff
@@ -5,23 +5,26 @@
 [](https://github.com/hiyouga/LLaMA-Factory/commits/main)
 [](https://pypi.org/project/llmtuner/)
 [](https://pypi.org/project/llmtuner/)
-[](#projects-using-llama-factory)
+[](#projects-using-llama-factory)
 [](https://github.com/hiyouga/LLaMA-Factory/pulls)
 [](https://discord.gg/rKfvV9r9FK)
-[](https://huggingface.co/spaces/hiyouga/LLaMA-Board)
-[](https://modelscope.cn/studios/hiyouga/LLaMA-Board)
+[](https://twitter.com/llamafactory_ai)
+[](https://huggingface.co/spaces/hiyouga/LLaMA-Board)
+[](https://modelscope.cn/studios/hiyouga/LLaMA-Board)
+[](https://colab.research.google.com/drive/1eRTPn37ltBbYsISy9Aw2NuI2Aq5CQrD9?usp=sharing)
 
 👋 Join our [WeChat](assets/wechat.jpg).
 
 \[ English | [中文](README_zh.md) \]
 
-## LLaMA Board: A One-stop Web UI for Getting Started with LLaMA Factory
+**Fine-tuning a large language model can be easy as...**
 
-Preview LLaMA Board at **[🤗 Spaces](https://huggingface.co/spaces/hiyouga/LLaMA-Board)** and **[ModelScope](https://modelscope.cn/studios/hiyouga/LLaMA-Board)**, or launch it locally with `CUDA_VISIBLE_DEVICES=0 python src/train_web.py`.
+https://github.com/hiyouga/LLaMA-Factory/assets/16256802/9840a653-7e9c-41c8-ae89-7ace5698baf6
 
-Here is an example of altering the self-cognition of an instruction-tuned language model within 10 minutes on a single GPU.
+Choose your path:
 
-https://github.com/hiyouga/LLaMA-Factory/assets/16256802/6ba60acc-e2e2-4bec-b846-2d88920d5ba1
+- **Colab**: https://colab.research.google.com/drive/1eRTPn37ltBbYsISy9Aw2NuI2Aq5CQrD9?usp=sharing
+- **Local machine**: Please refer to [usage](#getting-started)
 
 ## Table of Contents
```
```diff
@@ -41,15 +44,16 @@ https://github.com/hiyouga/LLaMA-Factory/assets/16256802/6ba60acc-e2e2-4bec-b846
 ## Features
 
 - **Various models**: LLaMA, Mistral, Mixtral-MoE, Qwen, Yi, Gemma, Baichuan, ChatGLM, Phi, etc.
-- **Integrated methods**: (Continuous) pre-training, supervised fine-tuning, reward modeling, PPO and DPO.
-- **Scalable resources**: 32-bit full-tuning, 16-bit freeze-tuning, 16-bit LoRA, 2/4/8-bit QLoRA via AQLM/AWQ/GPTQ/LLM.int8.
-- **Advanced algorithms**: DoRA, LongLoRA, LLaMA Pro, LoftQ, agent tuning.
-- **Practical tricks**: FlashAttention-2, Unsloth, RoPE scaling, NEFTune, rsLoRA.
+- **Integrated methods**: (Continuous) pre-training, supervised fine-tuning, reward modeling, PPO, DPO and ORPO.
+- **Scalable resources**: 32-bit full-tuning, 16-bit freeze-tuning, 16-bit LoRA and 2/4/8-bit QLoRA via AQLM/AWQ/GPTQ/LLM.int8.
+- **Advanced algorithms**: GaLore, DoRA, LongLoRA, LLaMA Pro, LoRA+, LoftQ and Agent tuning.
+- **Practical tricks**: FlashAttention-2, Unsloth, RoPE scaling, NEFTune and rsLoRA.
 - **Experiment monitors**: LlamaBoard, TensorBoard, Wandb, MLflow, etc.
 - **Faster inference**: OpenAI-style API, Gradio UI and CLI with vLLM worker.
 
 ## Benchmark
 
-Compared to ChatGLM's [P-Tuning](https://github.com/THUDM/ChatGLM2-6B/tree/main/ptuning), LLaMA-Factory's LoRA tuning offers up to **3.7 times faster** training speed with a better Rouge score on the advertising text generation task. By leveraging 4-bit quantization technique, LLaMA-Factory's QLoRA further improves the efficiency regarding the GPU memory.
+Compared to ChatGLM's [P-Tuning](https://github.com/THUDM/ChatGLM2-6B/tree/main/ptuning), LLaMA Factory's LoRA tuning offers up to **3.7 times faster** training speed with a better Rouge score on the advertising text generation task. By leveraging 4-bit quantization technique, LLaMA Factory's QLoRA further improves the efficiency regarding the GPU memory.
 
 
```
```diff
@@ -58,23 +62,35 @@ Compared to ChatGLM's [P-Tuning](https://github.com/THUDM/ChatGLM2-6B/tree/main/
 - **Training Speed**: the number of training samples processed per second during the training. (bs=4, cutoff_len=1024)
 - **Rouge Score**: Rouge-2 score on the development set of the [advertising text generation](https://aclanthology.org/D19-1321.pdf) task. (bs=4, cutoff_len=1024)
 - **GPU Memory**: Peak GPU memory usage in 4-bit quantized training. (bs=1, cutoff_len=1024)
-- We adopt `pre_seq_len=128` for ChatGLM's P-Tuning and `lora_rank=32` for LLaMA-Factory's LoRA tuning.
+- We adopt `pre_seq_len=128` for ChatGLM's P-Tuning and `lora_rank=32` for LLaMA Factory's LoRA tuning.
 
 </details>
 
 ## Changelog
 
-[24/02/28] We supported weight-decomposed LoRA (**[DoRA](https://arxiv.org/abs/2402.09353)**). Try `--use_dora` to activate DoRA training.
+[24/03/31] We supported **[ORPO](https://arxiv.org/abs/2403.07691)**. See `examples/lora_single_gpu` for usage.
 
-[24/02/15] We supported **block expansion** proposed by [LLaMA Pro](https://github.com/TencentARC/LLaMA-Pro). See `tests/llama_pro.py` for usage.
+[24/03/21] Our paper "[LlamaFactory: Unified Efficient Fine-Tuning of 100+ Language Models](https://arxiv.org/abs/2403.13372)" is available at arXiv!
 
-[24/02/05] Qwen1.5 (Qwen2 beta version) series models are supported in LLaMA-Factory. Check this [blog post](https://qwenlm.github.io/blog/qwen1.5/) for details.
+[24/03/20] We supported **FSDP+QLoRA** that fine-tunes a 70B model on 2x24GB GPUs. See `examples/extras/fsdp_qlora` for usage.
 
 <details><summary>Full Changelog</summary>
 
+[24/03/13] We supported **[LoRA+](https://arxiv.org/abs/2402.12354)**. See `examples/extras/loraplus` for usage.
+
+[24/03/07] We supported gradient low-rank projection (**[GaLore](https://arxiv.org/abs/2403.03507)**) algorithm. See `examples/extras/galore` for usage.
+
+[24/03/07] We integrated **[vLLM](https://github.com/vllm-project/vllm)** for faster and concurrent inference. Try `--infer_backend vllm` to enjoy **270%** inference speed. (LoRA is not yet supported, merge it first.)
+
+[24/02/28] We supported weight-decomposed LoRA (**[DoRA](https://arxiv.org/abs/2402.09353)**). Try `--use_dora` to activate DoRA training.
+
+[24/02/15] We supported **block expansion** proposed by [LLaMA Pro](https://github.com/TencentARC/LLaMA-Pro). See `examples/extras/llama_pro` for usage.
+
+[24/02/05] Qwen1.5 (Qwen2 beta version) series models are supported in LLaMA-Factory. Check this [blog post](https://qwenlm.github.io/blog/qwen1.5/) for details.
+
 [24/01/18] We supported **agent tuning** for most models, equipping model with tool using abilities by fine-tuning with `--dataset glaive_toolcall`.
 
-[23/12/23] We supported **[unsloth](https://github.com/unslothai/unsloth)**'s implementation to boost LoRA tuning for the LLaMA, Mistral and Yi models. Try `--use_unsloth` argument to activate unsloth patch. It achieves 1.7x speed in our benchmark, check [this page](https://github.com/hiyouga/LLaMA-Factory/wiki/Performance-comparison) for details.
+[23/12/23] We supported **[unsloth](https://github.com/unslothai/unsloth)**'s implementation to boost LoRA tuning for the LLaMA, Mistral and Yi models. Try `--use_unsloth` argument to activate unsloth patch. It achieves **170%** speed in our benchmark, check [this page](https://github.com/hiyouga/LLaMA-Factory/wiki/Performance-comparison) for details.
 
 [23/12/12] We supported fine-tuning the latest MoE model **[Mixtral 8x7B](https://huggingface.co/mistralai/Mixtral-8x7B-v0.1)** in our framework. See hardware requirement [here](#hardware-requirement).
```
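Several entries above name command-line switches without showing them in context. Below is a hypothetical single-GPU run combining two of them, `--dataset glaive_toolcall` from the 24/01/18 agent-tuning entry and `--use_dora` from the 24/02/28 DoRA entry; every other value is a placeholder, and `--use_unsloth` or `--infer_backend vllm` would slot in the same way:

```bash
CUDA_VISIBLE_DEVICES=0 python src/train_bash.py \
    --stage sft \
    --do_train \
    --model_name_or_path path_to_llama_model \
    --dataset glaive_toolcall \
    --finetuning_type lora \
    --use_dora \
    --output_dir path_to_sft_checkpoint
```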
```diff
@@ -122,13 +138,14 @@ Compared to ChatGLM's [P-Tuning](https://github.com/THUDM/ChatGLM2-6B/tree/main/
 | [InternLM2](https://huggingface.co/internlm) | 7B/20B | wqkv | intern2 |
 | [LLaMA](https://github.com/facebookresearch/llama) | 7B/13B/33B/65B | q_proj,v_proj | - |
 | [LLaMA-2](https://huggingface.co/meta-llama) | 7B/13B/70B | q_proj,v_proj | llama2 |
-| [Mistral](https://huggingface.co/mistralai) | 7B | q_proj,v_proj | mistral |
-| [Mixtral](https://huggingface.co/mistralai) | 8x7B | q_proj,v_proj | mistral |
+| [Mistral/Mixtral](https://huggingface.co/mistralai) | 7B/8x7B | q_proj,v_proj | mistral |
 | [OLMo](https://huggingface.co/allenai) | 1B/7B | att_proj | olmo |
 | [Phi-1.5/2](https://huggingface.co/microsoft) | 1.3B/2.7B | q_proj,v_proj | - |
 | [Qwen](https://huggingface.co/Qwen) | 1.8B/7B/14B/72B | c_attn | qwen |
 | [Qwen1.5](https://huggingface.co/Qwen) | 0.5B/1.8B/4B/7B/14B/72B | q_proj,v_proj | qwen |
+| [Qwen1.5 (MoE)](https://huggingface.co/Qwen) | 0.5B/1.8B/4B/7B/14B/32B/72B | q_proj,v_proj | qwen |
 | [StarCoder2](https://huggingface.co/bigcode) | 3B/7B/15B | q_proj,v_proj | - |
 | [XVERSE](https://huggingface.co/xverse) | 7B/13B/65B | q_proj,v_proj | xverse |
-| [Yi](https://huggingface.co/01-ai) | 6B/34B | q_proj,v_proj | yi |
+| [Yi](https://huggingface.co/01-ai) | 6B/9B/34B | q_proj,v_proj | yi |
 | [Yuan](https://huggingface.co/IEITYuan) | 2B/51B/102B | q_proj,v_proj | yuan |
 
 > [!NOTE]
```
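The table's last two columns map directly onto training flags used elsewhere in this README: the default module feeds `--lora_target` and the template name feeds `--template`. An illustrative pairing for the Mistral row follows; only those two flag values come from the table, while the model ID and remaining arguments are placeholders:

```bash
CUDA_VISIBLE_DEVICES=0 python src/train_bash.py \
    --stage sft \
    --do_train \
    --model_name_or_path mistralai/Mistral-7B-v0.1 \
    --template mistral \
    --finetuning_type lora \
    --lora_target q_proj,v_proj \
    --dataset alpaca_gpt4_en \
    --output_dir path_to_sft_checkpoint
```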
```diff
@@ -138,6 +155,8 @@ Compared to ChatGLM's [P-Tuning](https://github.com/THUDM/ChatGLM2-6B/tree/main/
 
 Please refer to [constants.py](src/llmtuner/extras/constants.py) for a full list of models we supported.
 
+You also can add a custom chat template to [template.py](src/llmtuner/data/template.py).
+
 ## Supported Training Approaches
 
 | Approach | Full-tuning | Freeze-tuning | LoRA | QLoRA |
```
```diff
@@ -147,9 +166,7 @@ Please refer to [constants.py](src/llmtuner/extras/constants.py) for a full list
 | Reward Modeling | :white_check_mark: | :white_check_mark: | :white_check_mark: | :white_check_mark: |
 | PPO Training | :white_check_mark: | :white_check_mark: | :white_check_mark: | :white_check_mark: |
 | DPO Training | :white_check_mark: | :white_check_mark: | :white_check_mark: | :white_check_mark: |
-
-> [!NOTE]
-> Use `--quantization_bit 4` argument to enable QLoRA.
+| ORPO Training | :white_check_mark: | :white_check_mark: | :white_check_mark: | :white_check_mark: |
 
 ## Provided Datasets
```
```diff
@@ -204,6 +221,7 @@ Please refer to [constants.py](src/llmtuner/extras/constants.py) for a full list
 - [LMSYS Chat 1M (en)](https://huggingface.co/datasets/lmsys/lmsys-chat-1m)
 - [Evol Instruct V2 (en)](https://huggingface.co/datasets/WizardLM/WizardLM_evol_instruct_V2_196k)
 - [Glaive Function Calling V2 (en)](https://huggingface.co/datasets/glaiveai/glaive-function-calling-v2)
+- [Cosmopedia (en)](https://huggingface.co/datasets/HuggingFaceTB/cosmopedia)
 - [Open Assistant (de)](https://huggingface.co/datasets/mayflowergmbh/oasst_de)
 - [Dolly 15k (de)](https://huggingface.co/datasets/mayflowergmbh/dolly-15k_de)
 - [Alpaca GPT4 (de)](https://huggingface.co/datasets/mayflowergmbh/alpaca-gpt4_de)
```
````diff
@@ -221,13 +239,12 @@ Please refer to [constants.py](src/llmtuner/extras/constants.py) for a full list
 - [HH-RLHF (en)](https://huggingface.co/datasets/Anthropic/hh-rlhf)
 - [Open Assistant (multilingual)](https://huggingface.co/datasets/OpenAssistant/oasst1)
 - [GPT-4 Generated Data (en&zh)](https://github.com/Instruction-Tuning-with-GPT-4/GPT-4-LLM)
-- [Orca DPO (en)](https://huggingface.co/datasets/Intel/orca_dpo_pairs)
 - [Nectar (en)](https://huggingface.co/datasets/berkeley-nest/Nectar)
 - [Orca DPO (de)](https://huggingface.co/datasets/mayflowergmbh/intel_orca_dpo_pairs_de)
 
 </details>
 
 Please refer to [data/README.md](data/README.md) for details.
 
 Some datasets require confirmation before using them, so we recommend logging in with your Hugging Face account using these commands.
 
 ```bash
````
````diff
@@ -240,386 +257,144 @@ huggingface-cli login
 
 | Mandatory | Minimum | Recommend |
 | ------------ | ------- | --------- |
 | python | 3.8 | 3.10 |
-| torch | 1.13.1 | 2.2.1 |
-| transformers | 4.37.2 | 4.38.1 |
-| datasets | 2.14.3 | 2.17.1 |
-| accelerate | 0.27.2 | 0.27.2 |
-| peft | 0.9.0 | 0.9.0 |
-| trl | 0.7.11 | 0.7.11 |
+| torch | 1.13.1 | 2.2.0 |
+| transformers | 4.37.2 | 4.39.3 |
+| datasets | 2.14.3 | 2.18.0 |
+| accelerate | 0.27.2 | 0.28.0 |
+| peft | 0.9.0 | 0.10.0 |
+| trl | 0.8.1 | 0.8.1 |
 
 | Optional | Minimum | Recommend |
 | ------------ | ------- | --------- |
 | CUDA | 11.6 | 12.2 |
-| deepspeed | 0.10.0 | 0.13.4 |
-| bitsandbytes | 0.39.0 | 0.41.3 |
-| flash-attn | 2.3.0 | 2.5.5 |
+| deepspeed | 0.10.0 | 0.14.0 |
+| bitsandbytes | 0.39.0 | 0.43.0 |
+| flash-attn | 2.3.0 | 2.5.6 |
 
 ### Hardware Requirement
 
 \* *estimated*
 
-| Method | Bits | 7B | 13B | 30B | 65B | 8x7B |
+| Method | Bits | 7B | 13B | 30B | 70B | 8x7B |
 | ------ | ---- | ----- | ----- | ----- | ------ | ------ |
-| Full | 16 | 160GB | 320GB | 600GB | 1200GB | 900GB |
-| Freeze | 16 | 20GB | 40GB | 120GB | 240GB | 200GB |
-| LoRA | 16 | 16GB | 32GB | 80GB | 160GB | 120GB |
-| QLoRA | 8 | 10GB | 16GB | 40GB | 80GB | 80GB |
-| QLoRA | 4 | 6GB | 12GB | 24GB | 48GB | 32GB |
+| Full | AMP | 120GB | 240GB | 600GB | 1200GB | 900GB |
+| Full | 16 | 60GB | 120GB | 300GB | 600GB | 400GB |
+| GaLore | 16 | 16GB | 32GB | 64GB | 160GB | 120GB |
+| Freeze | 16 | 20GB | 40GB | 80GB | 200GB | 160GB |
+| LoRA | 16 | 16GB | 32GB | 64GB | 160GB | 120GB |
+| QLoRA | 8 | 10GB | 20GB | 40GB | 80GB | 60GB |
+| QLoRA | 4 | 6GB | 12GB | 24GB | 48GB | 30GB |
+| QLoRA | 2 | 4GB | 8GB | 16GB | 24GB | 18GB |
````
````diff
 ## Getting Started
 
-### Data Preparation (optional)
+### Data Preparation
 
-Please refer to [data/README.md](data/README.md) for checking the details about the format of dataset files. You can either use a single `.json` file or a [dataset loading script](https://huggingface.co/docs/datasets/dataset_script) with multiple files to create a custom dataset.
+Please refer to [data/README.md](data/README.md) for checking the details about the format of dataset files. You can either use datasets on HuggingFace / ModelScope hub or load the dataset in local disk.
 
 > [!NOTE]
-> Please update `data/dataset_info.json` to use your custom dataset. About the format of this file, please refer to `data/README.md`.
+> Please update `data/dataset_info.json` to use your custom dataset.
 
-### Dependence Installation (optional)
+### Dependence Installation
 
 ```bash
 git clone https://github.com/hiyouga/LLaMA-Factory.git
 conda create -n llama_factory python=3.10
 conda activate llama_factory
 cd LLaMA-Factory
-pip install -r requirements.txt
+pip install -e .[metrics]
 ```
 
-If you want to enable the quantized LoRA (QLoRA) on the Windows platform, you will be required to install a pre-built version of `bitsandbytes` library, which supports CUDA 11.1 to 12.2.
+Extra dependencies available: deepspeed, metrics, unsloth, galore, vllm, bitsandbytes, gptq, awq, aqlm, qwen, modelscope, quality
+
+<details><summary>For Windows users</summary>
+
+If you want to enable the quantized LoRA (QLoRA) on the Windows platform, you will be required to install a pre-built version of `bitsandbytes` library, which supports CUDA 11.1 to 12.2, please select the appropriate [release version](https://github.com/jllllll/bitsandbytes-windows-webui/releases/tag/wheels) based on your CUDA version.
 
 ```bash
-pip install https://github.com/jllllll/bitsandbytes-windows-webui/releases/download/wheels/bitsandbytes-0.40.0-py3-none-win_amd64.whl
+pip install https://github.com/jllllll/bitsandbytes-windows-webui/releases/download/wheels/bitsandbytes-0.41.2.post2-py3-none-win_amd64.whl
 ```
 
 To enable FlashAttention-2 on the Windows platform, you need to install the precompiled `flash-attn` library, which supports CUDA 12.1 to 12.2. Please download the corresponding version from [flash-attention](https://github.com/bdashore3/flash-attention/releases) based on your requirements.
 
-### Use ModelScope Hub (optional)
+</details>
 
-If you have trouble with downloading models and datasets from Hugging Face, you can use LLaMA-Factory together with ModelScope in the following manner.
+### LLaMA Board GUI
+
+> [!IMPORTANT]
+> LLaMA Board GUI only supports training on a single GPU, please use [CLI](#command-line-interface) for distributed training.
+
+#### Use local environment
+
+```bash
+export CUDA_VISIBLE_DEVICES=0 # `set CUDA_VISIBLE_DEVICES=0` for Windows
+python src/train_web.py # or python -m llmtuner.webui.interface
+```
+
+#### Use Docker
+
+```bash
+docker build -f ./Dockerfile -t llama-factory:latest .
+docker run --gpus=all \
+    -v ./hf_cache:/root/.cache/huggingface/ \
+    -v ./data:/app/data \
+    -v ./output:/app/output \
+    -e CUDA_VISIBLE_DEVICES=0 \
+    -p 7860:7860 \
+    --shm-size 16G \
+    --name llama_factory \
+    -d llama-factory:latest
+```
````
````diff
+#### Use Docker Compose
+
+```bash
+docker compose -f ./docker-compose.yml up -d
+```
+
+<details><summary>Details about volume</summary>
+
+- hf_cache: Utilize Hugging Face cache on the host machine. Reassignable if a cache already exists in a different directory.
+- data: Place datasets on this dir of the host machine so that they can be selected on LLaMA Board GUI.
+- output: Set export dir to this location so that the merged result can be accessed directly on the host machine.
+
+</details>
+
+### Command Line Interface
+
+See [examples/README.md](examples/README.md) for usage.
+
 Use `python src/train_bash.py -h` to display arguments description.
 
+### Deploy with OpenAI-style API and vLLM
+
+```bash
+CUDA_VISIBLE_DEVICES=0,1 API_PORT=8000 python src/api_demo.py \
+    --model_name_or_path mistralai/Mistral-7B-Instruct-v0.2 \
+    --template mistral \
+    --infer_backend vllm \
+    --vllm_enforce_eager
+```
````
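Once the server above is running, it can be queried like any OpenAI-compatible endpoint. The `/v1/chat/completions` path and payload shape below follow the usual OpenAI convention that the section title implies; treat them as assumptions rather than something this diff spells out:

```bash
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
          "model": "mistralai/Mistral-7B-Instruct-v0.2",
          "messages": [{"role": "user", "content": "Hello!"}]
        }'
```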
````diff
+### Use ModelScope Hub
+
+If you have trouble with downloading models and datasets from Hugging Face, you can use ModelScope.
+
 ```bash
 export USE_MODELSCOPE_HUB=1 # `set USE_MODELSCOPE_HUB=1` for Windows
 ```
 
-Then you can train the corresponding model by specifying a model ID of the ModelScope Hub. (find a full list of model IDs at [ModelScope Hub](https://modelscope.cn/models))
-
-```bash
-CUDA_VISIBLE_DEVICES=0 python src/train_bash.py \
-    --model_name_or_path modelscope/Llama-2-7b-ms \
-    ... # arguments (same as above)
-```
-
-LLaMA Board also supports using the models and datasets on the ModelScope Hub.
-
-```bash
-CUDA_VISIBLE_DEVICES=0 USE_MODELSCOPE_HUB=1 python src/train_web.py
-```
````
````diff
-### Train on a single GPU
-
-> [!IMPORTANT]
-> If you want to train models on multiple GPUs, please refer to [Distributed Training](#distributed-training).
-
-#### Pre-Training
-
-```bash
-CUDA_VISIBLE_DEVICES=0 python src/train_bash.py \
-    --stage pt \
-    --do_train \
-    --model_name_or_path path_to_llama_model \
-    --dataset wiki_demo \
-    --finetuning_type lora \
-    --lora_target q_proj,v_proj \
-    --output_dir path_to_pt_checkpoint \
-    --overwrite_cache \
-    --per_device_train_batch_size 4 \
-    --gradient_accumulation_steps 4 \
-    --lr_scheduler_type cosine \
-    --logging_steps 10 \
-    --save_steps 1000 \
-    --learning_rate 5e-5 \
-    --num_train_epochs 3.0 \
-    --plot_loss \
-    --fp16
-```
-
-#### Supervised Fine-Tuning
-
-```bash
-CUDA_VISIBLE_DEVICES=0 python src/train_bash.py \
-    --stage sft \
-    --do_train \
-    --model_name_or_path path_to_llama_model \
-    --dataset alpaca_gpt4_en \
-    --template default \
-    --finetuning_type lora \
-    --lora_target q_proj,v_proj \
-    --output_dir path_to_sft_checkpoint \
-    --overwrite_cache \
-    --per_device_train_batch_size 4 \
-    --gradient_accumulation_steps 4 \
-    --lr_scheduler_type cosine \
-    --logging_steps 10 \
-    --save_steps 1000 \
-    --learning_rate 5e-5 \
-    --num_train_epochs 3.0 \
-    --plot_loss \
-    --fp16
-```
-
-#### Reward Modeling
-
-```bash
-CUDA_VISIBLE_DEVICES=0 python src/train_bash.py \
-    --stage rm \
-    --do_train \
-    --model_name_or_path path_to_llama_model \
-    --adapter_name_or_path path_to_sft_checkpoint \
-    --create_new_adapter \
-    --dataset comparison_gpt4_en \
-    --template default \
-    --finetuning_type lora \
-    --lora_target q_proj,v_proj \
-    --output_dir path_to_rm_checkpoint \
-    --per_device_train_batch_size 2 \
-    --gradient_accumulation_steps 4 \
-    --lr_scheduler_type cosine \
-    --logging_steps 10 \
-    --save_steps 1000 \
-    --learning_rate 1e-6 \
-    --num_train_epochs 1.0 \
-    --plot_loss \
-    --fp16
-```
-
-#### PPO Training
-
-```bash
-CUDA_VISIBLE_DEVICES=0 python src/train_bash.py \
-    --stage ppo \
-    --do_train \
-    --model_name_or_path path_to_llama_model \
-    --adapter_name_or_path path_to_sft_checkpoint \
-    --create_new_adapter \
-    --dataset alpaca_gpt4_en \
-    --template default \
-    --finetuning_type lora \
-    --lora_target q_proj,v_proj \
-    --reward_model path_to_rm_checkpoint \
-    --output_dir path_to_ppo_checkpoint \
-    --per_device_train_batch_size 2 \
-    --gradient_accumulation_steps 4 \
-    --lr_scheduler_type cosine \
-    --top_k 0 \
-    --top_p 0.9 \
-    --logging_steps 10 \
-    --save_steps 1000 \
-    --learning_rate 1e-5 \
-    --num_train_epochs 1.0 \
-    --plot_loss \
-    --fp16
-```
-
-> [!TIP]
-> Use `--adapter_name_or_path path_to_sft_checkpoint,path_to_ppo_checkpoint` to infer the fine-tuned model.
-
-> [!WARNING]
-> Use `--per_device_train_batch_size=1` for LLaMA-2 models in fp16 PPO training.
-
-#### DPO Training
-
-```bash
-CUDA_VISIBLE_DEVICES=0 python src/train_bash.py \
-    --stage dpo \
-    --do_train \
-    --model_name_or_path path_to_llama_model \
-    --adapter_name_or_path path_to_sft_checkpoint \
-    --create_new_adapter \
-    --dataset comparison_gpt4_en \
-    --template default \
-    --finetuning_type lora \
-    --lora_target q_proj,v_proj \
-    --output_dir path_to_dpo_checkpoint \
-    --per_device_train_batch_size 2 \
-    --gradient_accumulation_steps 4 \
-    --lr_scheduler_type cosine \
-    --logging_steps 10 \
-    --save_steps 1000 \
-    --learning_rate 1e-5 \
-    --num_train_epochs 1.0 \
-    --plot_loss \
-    --fp16
-```
-
-> [!TIP]
-> Use `--adapter_name_or_path path_to_sft_checkpoint,path_to_dpo_checkpoint` to infer the fine-tuned model.
````
````diff
-### Distributed Training
-
-#### Use Huggingface Accelerate
-
-```bash
-accelerate config # configure the environment
-accelerate launch src/train_bash.py # arguments (same as above)
-```
-
-<details><summary>Example config for LoRA training</summary>
-
-```yaml
-compute_environment: LOCAL_MACHINE
-debug: false
-distributed_type: MULTI_GPU
-downcast_bf16: 'no'
-gpu_ids: all
-machine_rank: 0
-main_training_function: main
-mixed_precision: fp16
-num_machines: 1
-num_processes: 4
-rdzv_backend: static
-same_network: true
-tpu_env: []
-tpu_use_cluster: false
-tpu_use_sudo: false
-use_cpu: false
-```
-
-</details>
-
-#### Use DeepSpeed
-
-```bash
-deepspeed --num_gpus 8 --master_port=9901 src/train_bash.py \
-    --deepspeed ds_config.json \
-    ... # arguments (same as above)
-```
-
-<details><summary>Example config for full-parameter training with DeepSpeed ZeRO-2</summary>
-
-```json
-{
-  "train_batch_size": "auto",
-  "train_micro_batch_size_per_gpu": "auto",
-  "gradient_accumulation_steps": "auto",
-  "gradient_clipping": "auto",
-  "zero_allow_untested_optimizer": true,
-  "fp16": {
-    "enabled": "auto",
-    "loss_scale": 0,
-    "initial_scale_power": 16,
-    "loss_scale_window": 1000,
-    "hysteresis": 2,
-    "min_loss_scale": 1
-  },
-  "zero_optimization": {
-    "stage": 2,
-    "allgather_partitions": true,
-    "allgather_bucket_size": 5e8,
-    "reduce_scatter": true,
-    "reduce_bucket_size": 5e8,
-    "overlap_comm": false,
-    "contiguous_gradients": true
-  }
-}
-```
-
-</details>
````
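Rather than answering `accelerate config` interactively, the example YAML above can be saved to a file and handed to the launcher; `--config_file` is a standard Accelerate flag, and the file name here is illustrative:

```bash
# Save the YAML above as accelerate_config.yaml, then:
accelerate launch --config_file accelerate_config.yaml src/train_bash.py \
    ... # arguments (same as above)
```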
````diff
-### Merge LoRA weights and export model
-
-```bash
-python src/export_model.py \
-    --model_name_or_path path_to_llama_model \
-    --adapter_name_or_path path_to_checkpoint \
-    --template default \
-    --finetuning_type lora \
-    --export_dir path_to_export \
-    --export_size 2 \
-    --export_legacy_format False
-```
-
-> [!WARNING]
-> Merging LoRA weights into a quantized model is not supported.
-
-> [!TIP]
-> Use `--export_quantization_bit 4` and `--export_quantization_dataset data/c4_demo.json` to quantize the model after merging the LoRA weights.
-
-### Inference with OpenAI-style API
-
-```bash
-python src/api_demo.py \
-    --model_name_or_path path_to_llama_model \
-    --adapter_name_or_path path_to_checkpoint \
-    --template default \
-    --finetuning_type lora
-```
-
-> [!TIP]
-> Visit `http://localhost:8000/docs` for API documentation.
-
-### Inference with command line
-
-```bash
-python src/cli_demo.py \
-    --model_name_or_path path_to_llama_model \
-    --adapter_name_or_path path_to_checkpoint \
-    --template default \
-    --finetuning_type lora
-```
-
-### Inference with web browser
-
-```bash
-python src/web_demo.py \
-    --model_name_or_path path_to_llama_model \
-    --adapter_name_or_path path_to_checkpoint \
-    --template default \
-    --finetuning_type lora
-```
-
-### Evaluation
-
-```bash
-CUDA_VISIBLE_DEVICES=0 python src/evaluate.py \
-    --model_name_or_path path_to_llama_model \
-    --adapter_name_or_path path_to_checkpoint \
-    --template vanilla \
-    --finetuning_type lora \
-    --task mmlu \
-    --split test \
-    --lang en \
-    --n_shot 5 \
-    --batch_size 4
-```
-
-### Predict
-
-```bash
-CUDA_VISIBLE_DEVICES=0 python src/train_bash.py \
-    --stage sft \
-    --do_predict \
-    --model_name_or_path path_to_llama_model \
-    --adapter_name_or_path path_to_checkpoint \
-    --dataset alpaca_gpt4_en \
-    --template default \
-    --finetuning_type lora \
-    --output_dir path_to_predict_result \
-    --per_device_eval_batch_size 1 \
-    --max_samples 100 \
-    --predict_with_generate \
-    --fp16
-```
-
-> [!WARNING]
-> Use `--per_device_train_batch_size=1` for LLaMA-2 models in fp16 predict.
-
-> [!TIP]
-> We recommend using `--per_device_eval_batch_size=1` and `--max_target_length 128` at 4/8-bit predict.
+Train the model by specifying a model ID of the ModelScope Hub as the `--model_name_or_path`. You can find a full list of model IDs at [ModelScope Hub](https://modelscope.cn/models), e.g., `modelscope/Llama-2-7b-ms`.
 
 ## Projects using LLaMA Factory
 
+If you have a project that should be incorporated, please contact via email or create a pull request.
+
+<details><summary>Click to show</summary>
+
 1. Wang et al. ESRL: Efficient Sampling-based Reinforcement Learning for Sequence Generation. 2023. [[arxiv]](https://arxiv.org/abs/2308.02223)
 1. Yu et al. Open, Closed, or Small Language Models for Text Classification? 2023. [[arxiv]](https://arxiv.org/abs/2308.10092)
 1. Wang et al. UbiPhysio: Support Daily Functioning, Fitness, and Rehabilitation with Action Understanding and Feedback in Natural Language. 2023. [[arxiv]](https://arxiv.org/abs/2308.10526)
 1. Luceri et al. Leveraging Large Language Models to Detect Influence Campaigns in Social Media. 2023. [[arxiv]](https://arxiv.org/abs/2311.07816)
 1. Zhang et al. Alleviating Hallucinations of Large Language Models through Induced Hallucinations. 2023. [[arxiv]](https://arxiv.org/abs/2312.15710)
 1. Wang et al. Know Your Needs Better: Towards Structured Understanding of Marketer Demands with Analogical Reasoning Augmented LLMs. 2024. [[arxiv]](https://arxiv.org/abs/2401.04319)
````
````diff
@@ -634,37 +409,43 @@ CUDA_VISIBLE_DEVICES=0 python src/train_bash.py \
 1. Cao et al. Head-wise Shareable Attention for Large Language Models. 2024. [[arxiv]](https://arxiv.org/abs/2402.11819)
 1. Zhang et al. Enhancing Multilingual Capabilities of Large Language Models through Self-Distillation from Resource-Rich Languages. 2024. [[arxiv]](https://arxiv.org/abs/2402.12204)
 1. Kim et al. Efficient and Effective Vocabulary Expansion Towards Multilingual Large Language Models. 2024. [[arxiv]](https://arxiv.org/abs/2402.14714)
 1. Yu et al. KIEval: A Knowledge-grounded Interactive Evaluation Framework for Large Language Models. 2024. [[arxiv]](https://arxiv.org/abs/2402.15043)
+1. Huang et al. Key-Point-Driven Data Synthesis with its Enhancement on Mathematical Reasoning. 2024. [[arxiv]](https://arxiv.org/abs/2403.02333)
+1. Duan et al. Negating Negatives: Alignment without Human Positive Samples via Distributional Dispreference Optimization. 2024. [[arxiv]](https://arxiv.org/abs/2403.03419)
+1. Xie and Schwertfeger. Empowering Robotics with Large Language Models: osmAG Map Comprehension with LLMs. 2024. [[arxiv]](https://arxiv.org/abs/2403.08228)
+1. Weller et al. FollowIR: Evaluating and Teaching Information Retrieval Models to Follow Instructions. 2024. [[arxiv]](https://arxiv.org/abs/2403.15246)
+1. Hongbin Na. CBT-LLM: A Chinese Large Language Model for Cognitive Behavioral Therapy-based Mental Health Question Answering. 2024. [[arxiv]](https://arxiv.org/abs/2403.16008)
 1. **[StarWhisper](https://github.com/Yu-Yang-Li/StarWhisper)**: A large language model for Astronomy, based on ChatGLM2-6B and Qwen-14B.
 1. **[DISC-LawLLM](https://github.com/FudanDISC/DISC-LawLLM)**: A large language model specialized in Chinese legal domain, based on Baichuan-13B, is capable of retrieving and reasoning on legal knowledge.
 1. **[Sunsimiao](https://github.com/thomas-yanxin/Sunsimiao)**: A large language model specialized in Chinese medical domain, based on Baichuan-7B and ChatGLM-6B.
 1. **[CareGPT](https://github.com/WangRongsheng/CareGPT)**: A series of large language models for Chinese medical domain, based on LLaMA2-7B and Baichuan-13B.
 1. **[MachineMindset](https://github.com/PKU-YuanGroup/Machine-Mindset/)**: A series of MBTI Personality large language models, capable of giving any LLM 16 different personality types based on different datasets and training methods.
 
-> [!TIP]
-> If you have a project that should be incorporated, please contact via email or create a pull request.
+</details>
 
 ## License
 
 This repository is licensed under the [Apache-2.0 License](LICENSE).
 
-Please follow the model licenses to use the corresponding model weights: [Baichuan2](https://huggingface.co/baichuan-inc/Baichuan2-7B-Base/blob/main/Community%20License%20for%20Baichuan%202%20Model.pdf) / [BLOOM](https://huggingface.co/spaces/bigscience/license) / [ChatGLM3](https://github.com/THUDM/ChatGLM3/blob/main/MODEL_LICENSE) / [DeepSeek](https://github.com/deepseek-ai/DeepSeek-LLM/blob/main/LICENSE-MODEL) / [Falcon](https://huggingface.co/tiiuae/falcon-180B/blob/main/LICENSE.txt) / [Gemma](https://ai.google.dev/gemma/terms) / [InternLM2](https://github.com/InternLM/InternLM#license) / [LLaMA](https://github.com/facebookresearch/llama/blob/main/MODEL_CARD.md) / [LLaMA-2](https://ai.meta.com/llama/license/) / [Mistral](LICENSE) / [Phi-1.5/2](https://huggingface.co/microsoft/phi-1_5/resolve/main/Research%20License.docx) / [Qwen](https://github.com/QwenLM/Qwen/blob/main/Tongyi%20Qianwen%20LICENSE%20AGREEMENT) / [XVERSE](https://github.com/xverse-ai/XVERSE-13B/blob/main/MODEL_LICENSE.pdf) / [Yi](https://huggingface.co/01-ai/Yi-6B/blob/main/LICENSE) / [Yuan](https://github.com/IEIT-Yuan/Yuan-2.0/blob/main/LICENSE-Yuan)
+Please follow the model licenses to use the corresponding model weights: [Baichuan2](https://huggingface.co/baichuan-inc/Baichuan2-7B-Base/blob/main/Community%20License%20for%20Baichuan%202%20Model.pdf) / [BLOOM](https://huggingface.co/spaces/bigscience/license) / [ChatGLM3](https://github.com/THUDM/ChatGLM3/blob/main/MODEL_LICENSE) / [DeepSeek](https://github.com/deepseek-ai/DeepSeek-LLM/blob/main/LICENSE-MODEL) / [Falcon](https://huggingface.co/tiiuae/falcon-180B/blob/main/LICENSE.txt) / [Gemma](https://ai.google.dev/gemma/terms) / [InternLM2](https://github.com/InternLM/InternLM#license) / [LLaMA](https://github.com/facebookresearch/llama/blob/main/MODEL_CARD.md) / [LLaMA-2](https://ai.meta.com/llama/license/) / [Mistral](LICENSE) / [OLMo](LICENSE) / [Phi-1.5/2](https://huggingface.co/microsoft/phi-1_5/resolve/main/Research%20License.docx) / [Qwen](https://github.com/QwenLM/Qwen/blob/main/Tongyi%20Qianwen%20LICENSE%20AGREEMENT) / [StarCoder2](https://huggingface.co/spaces/bigcode/bigcode-model-license-agreement) / [XVERSE](https://github.com/xverse-ai/XVERSE-13B/blob/main/MODEL_LICENSE.pdf) / [Yi](https://huggingface.co/01-ai/Yi-6B/blob/main/LICENSE) / [Yuan](https://github.com/IEIT-Yuan/Yuan-2.0/blob/main/LICENSE-Yuan)
 
 ## Citation
 
 If this work is helpful, please kindly cite as:
 
 ```bibtex
-@Misc{llama-factory,
-  title = {LLaMA Factory},
-  author = {hiyouga},
-  howpublished = {\url{https://github.com/hiyouga/LLaMA-Factory}},
-  year = {2023}
+@article{zheng2024llamafactory,
+  title={LlamaFactory: Unified Efficient Fine-Tuning of 100+ Language Models},
+  author={Yaowei Zheng and Richong Zhang and Junhao Zhang and Yanhan Ye and Zheyan Luo and Yongqiang Ma},
+  journal={arXiv preprint arXiv:2403.13372},
+  year={2024},
+  url={http://arxiv.org/abs/2403.13372}
 }
 ```
 
 ## Acknowledgement
 
-This repo benefits from [PEFT](https://github.com/huggingface/peft), [QLoRA](https://github.com/artidoro/qlora) and [FastChat](https://github.com/lm-sys/FastChat). Thanks for their wonderful works.
+This repo benefits from [PEFT](https://github.com/huggingface/peft), [TRL](https://github.com/huggingface/trl), [QLoRA](https://github.com/artidoro/qlora) and [FastChat](https://github.com/lm-sys/FastChat). Thanks for their wonderful works.
 
 ## Star History
````
**README_zh.md** (521 changes)

```diff
@@ -5,23 +5,26 @@
 [](https://github.com/hiyouga/LLaMA-Factory/commits/main)
 [](https://pypi.org/project/llmtuner/)
 [](https://pypi.org/project/llmtuner/)
-[](#使用了-llama-factory-的项目)
+[](#使用了-llama-factory-的项目)
 [](https://github.com/hiyouga/LLaMA-Factory/pulls)
 [](https://discord.gg/rKfvV9r9FK)
-[](https://huggingface.co/spaces/hiyouga/LLaMA-Board)
-[](https://modelscope.cn/studios/hiyouga/LLaMA-Board)
+[](https://twitter.com/llamafactory_ai)
+[](https://huggingface.co/spaces/hiyouga/LLaMA-Board)
+[](https://modelscope.cn/studios/hiyouga/LLaMA-Board)
+[](https://colab.research.google.com/drive/1eRTPn37ltBbYsISy9Aw2NuI2Aq5CQrD9?usp=sharing)
 
 👋 Join our [WeChat group](assets/wechat.jpg).
 
 \[ [English](README.md) | 中文 \]
 
-## LLaMA Board: Get started with LLaMA Factory through a one-stop web UI
+**Fine-tuning a large model can be as easy as this...**
 
-Preview LLaMA Board at **[🤗 Spaces](https://huggingface.co/spaces/hiyouga/LLaMA-Board)** or **[ModelScope](https://modelscope.cn/studios/hiyouga/LLaMA-Board)**, or launch it locally with `CUDA_VISIBLE_DEVICES=0 python src/train_web.py`.
+https://github.com/hiyouga/LLaMA-Factory/assets/16256802/ec36a9dd-37f4-4f72-81bd-d76c6d0a6594
 
-Here is an example of changing the self-cognition of a conversational large language model on a single GPU within 10 minutes.
+Choose your path:
 
-https://github.com/hiyouga/LLaMA-Factory/assets/16256802/6ba60acc-e2e2-4bec-b846-2d88920d5ba1
+- **Colab**: https://colab.research.google.com/drive/1eRTPn37ltBbYsISy9Aw2NuI2Aq5CQrD9?usp=sharing
+- **Local machine**: see [How to Use](#如何使用)
 
 ## Table of Contents
```
```diff
@@ -41,15 +44,16 @@ https://github.com/hiyouga/LLaMA-Factory/assets/16256802/6ba60acc-e2e2-4bec-b846
 ## Project Features
 
 - **Various models**: LLaMA, Mistral, Mixtral-MoE, Qwen, Yi, Gemma, Baichuan, ChatGLM, Phi, and more.
-- **Integrated methods**: (continuous) pre-training, supervised fine-tuning, reward model training, PPO training and DPO training.
+- **Integrated methods**: (continuous) pre-training, supervised fine-tuning, reward model training, PPO training, DPO training and ORPO training.
 - **Multiple precisions**: 32-bit full-parameter tuning, 16-bit freeze-tuning, 16-bit LoRA tuning and 2/4/8-bit QLoRA tuning via AQLM/AWQ/GPTQ/LLM.int8.
-- **Advanced algorithms**: DoRA, LongLoRA, LLaMA Pro, LoftQ and Agent tuning.
+- **Advanced algorithms**: GaLore, DoRA, LongLoRA, LLaMA Pro, LoRA+, LoftQ and Agent tuning.
 - **Practical tricks**: FlashAttention-2, Unsloth, RoPE scaling, NEFTune and rsLoRA.
 - **Experiment monitors**: LlamaBoard, TensorBoard, Wandb, MLflow, and more.
 - **Fast inference**: vLLM-based OpenAI-style API, browser UI and command line interface.
 
 ## Benchmark
 
-Compared with ChatGLM's official [P-Tuning](https://github.com/THUDM/ChatGLM2-6B/tree/main/ptuning), LLaMA-Factory's LoRA tuning delivers a **3.7x** speedup and a higher Rouge score on the advertising text generation task. Combined with 4-bit quantization, LLaMA-Factory's QLoRA further reduces GPU memory consumption.
+Compared with ChatGLM's official [P-Tuning](https://github.com/THUDM/ChatGLM2-6B/tree/main/ptuning), LLaMA Factory's LoRA tuning delivers a **3.7x** speedup and a higher Rouge score on the advertising text generation task. Combined with 4-bit quantization, LLaMA Factory's QLoRA further reduces GPU memory consumption.
 
 
```
```diff
@@ -58,23 +62,35 @@ https://github.com/hiyouga/LLaMA-Factory/assets/16256802/6ba60acc-e2e2-4bec-b846
 - **Training Speed**: number of training samples processed per second. (batch size=4, cutoff length=1024)
 - **Rouge Score**: Rouge-2 score on the validation set of the [advertising text generation](https://aclanthology.org/D19-1321.pdf) task. (batch size=4, cutoff length=1024)
 - **GPU Memory**: peak GPU memory usage in 4-bit quantized training. (batch size=1, cutoff length=1024)
-- We adopt `pre_seq_len=128` for ChatGLM's P-Tuning and `lora_rank=32` for LLaMA-Factory's LoRA tuning.
+- We adopt `pre_seq_len=128` for ChatGLM's P-Tuning and `lora_rank=32` for LLaMA Factory's LoRA tuning.
 
 </details>
 
 ## Changelog
 
-[24/02/28] We supported **[DoRA](https://arxiv.org/abs/2402.09353)** fine-tuning. Use the `--use_dora` argument to enable DoRA.
+[24/03/31] We supported **[ORPO](https://arxiv.org/abs/2403.07691)**. See `examples/lora_single_gpu` for usage.
 
-[24/02/15] We supported the **block expansion** method proposed by [LLaMA Pro](https://github.com/TencentARC/LLaMA-Pro). See `tests/llama_pro.py` for usage.
+[24/03/21] Our paper "[LlamaFactory: Unified Efficient Fine-Tuning of 100+ Language Models](https://arxiv.org/abs/2403.13372)" is available on arXiv!
 
-[24/02/05] Fine-tuning support for the Qwen1.5 (Qwen2 beta) model series has been implemented in LLaMA-Factory. See this [blog post](https://qwenlm.github.io/zh/blog/qwen1.5/) for details.
+[24/03/20] We supported **FSDP+QLoRA**, which fine-tunes a 70B model on 2x24GB GPUs. See `examples/extras/fsdp_qlora` for usage.
 
 <details><summary>Full Changelog</summary>
 
+[24/03/13] We supported **[LoRA+](https://arxiv.org/abs/2402.12354)**. See `examples/extras/loraplus` for usage.
+
+[24/03/07] We supported the gradient low-rank projection (**[GaLore](https://arxiv.org/abs/2403.03507)**) algorithm. See `examples/extras/galore` for usage.
+
+[24/03/07] We integrated **[vLLM](https://github.com/vllm-project/vllm)** for fast concurrent inference. Use `--infer_backend vllm` for **270%** inference speed. (LoRA is not yet supported; merge the weights first.)
+
+[24/02/28] We supported **[DoRA](https://arxiv.org/abs/2402.09353)** fine-tuning. Use the `--use_dora` argument to enable DoRA.
+
+[24/02/15] We supported the **block expansion** method proposed by [LLaMA Pro](https://github.com/TencentARC/LLaMA-Pro). See `examples/extras/llama_pro` for usage.
+
+[24/02/05] Fine-tuning support for the Qwen1.5 (Qwen2 beta) model series has been implemented in LLaMA-Factory. See this [blog post](https://qwenlm.github.io/zh/blog/qwen1.5/) for details.
+
 [24/01/18] We implemented **Agent tuning** for most models; fine-tuning with `--dataset glaive_toolcall` equips a model with tool-calling abilities.
 
-[23/12/23] We supported **[unsloth](https://github.com/unslothai/unsloth)**'s LoRA training acceleration for the LLaMA, Mistral and Yi models. Use the `--use_unsloth` argument to enable the unsloth patch. It delivers 1.7x training speed; see [this page](https://github.com/hiyouga/LLaMA-Factory/wiki/Performance-comparison) for details.
+[23/12/23] We supported **[unsloth](https://github.com/unslothai/unsloth)**'s LoRA training acceleration for the LLaMA, Mistral and Yi models. Use the `--use_unsloth` argument to enable the unsloth patch. It delivers **170%** training speed; see [this page](https://github.com/hiyouga/LLaMA-Factory/wiki/Performance-comparison) for details.
 
 [23/12/12] We supported fine-tuning the latest MoE model **[Mixtral 8x7B](https://huggingface.co/mistralai/Mixtral-8x7B-v0.1)**. See the hardware requirements [here](#硬件依赖).
```
```diff
@@ -122,13 +138,14 @@ https://github.com/hiyouga/LLaMA-Factory/assets/16256802/6ba60acc-e2e2-4bec-b846
 | [InternLM2](https://huggingface.co/internlm) | 7B/20B | wqkv | intern2 |
 | [LLaMA](https://github.com/facebookresearch/llama) | 7B/13B/33B/65B | q_proj,v_proj | - |
 | [LLaMA-2](https://huggingface.co/meta-llama) | 7B/13B/70B | q_proj,v_proj | llama2 |
-| [Mistral](https://huggingface.co/mistralai) | 7B | q_proj,v_proj | mistral |
-| [Mixtral](https://huggingface.co/mistralai) | 8x7B | q_proj,v_proj | mistral |
+| [Mistral/Mixtral](https://huggingface.co/mistralai) | 7B/8x7B | q_proj,v_proj | mistral |
 | [OLMo](https://huggingface.co/allenai) | 1B/7B | att_proj | olmo |
 | [Phi-1.5/2](https://huggingface.co/microsoft) | 1.3B/2.7B | q_proj,v_proj | - |
 | [Qwen](https://huggingface.co/Qwen) | 1.8B/7B/14B/72B | c_attn | qwen |
 | [Qwen1.5](https://huggingface.co/Qwen) | 0.5B/1.8B/4B/7B/14B/72B | q_proj,v_proj | qwen |
+| [Qwen1.5 (MoE)](https://huggingface.co/Qwen) | 0.5B/1.8B/4B/7B/14B/32B/72B | q_proj,v_proj | qwen |
 | [StarCoder2](https://huggingface.co/bigcode) | 3B/7B/15B | q_proj,v_proj | - |
 | [XVERSE](https://huggingface.co/xverse) | 7B/13B/65B | q_proj,v_proj | xverse |
-| [Yi](https://huggingface.co/01-ai) | 6B/34B | q_proj,v_proj | yi |
+| [Yi](https://huggingface.co/01-ai) | 6B/9B/34B | q_proj,v_proj | yi |
 | [Yuan](https://huggingface.co/IEITYuan) | 2B/51B/102B | q_proj,v_proj | yuan |
 
 > [!NOTE]
```
```diff
@@ -138,6 +155,8 @@ https://github.com/hiyouga/LLaMA-Factory/assets/16256802/6ba60acc-e2e2-4bec-b846
 
 Please refer to [constants.py](src/llmtuner/extras/constants.py) for the full list of models supported by the project.
 
+You can also add your own chat template in [template.py](src/llmtuner/data/template.py).
+
 ## Training Approaches
 
 | Approach | Full-parameter | Freeze-tuning | LoRA | QLoRA |
```
```diff
@@ -147,9 +166,7 @@ https://github.com/hiyouga/LLaMA-Factory/assets/16256802/6ba60acc-e2e2-4bec-b846
 | Reward Modeling | :white_check_mark: | :white_check_mark: | :white_check_mark: | :white_check_mark: |
 | PPO Training | :white_check_mark: | :white_check_mark: | :white_check_mark: | :white_check_mark: |
 | DPO Training | :white_check_mark: | :white_check_mark: | :white_check_mark: | :white_check_mark: |
-
-> [!NOTE]
-> Use the `--quantization_bit 4` argument to enable QLoRA training.
+| ORPO Training | :white_check_mark: | :white_check_mark: | :white_check_mark: | :white_check_mark: |
 
 ## Datasets
```
```diff
@@ -204,6 +221,7 @@ https://github.com/hiyouga/LLaMA-Factory/assets/16256802/6ba60acc-e2e2-4bec-b846
 - [LMSYS Chat 1M (en)](https://huggingface.co/datasets/lmsys/lmsys-chat-1m)
 - [Evol Instruct V2 (en)](https://huggingface.co/datasets/WizardLM/WizardLM_evol_instruct_V2_196k)
 - [Glaive Function Calling V2 (en)](https://huggingface.co/datasets/glaiveai/glaive-function-calling-v2)
+- [Cosmopedia (en)](https://huggingface.co/datasets/HuggingFaceTB/cosmopedia)
 - [Open Assistant (de)](https://huggingface.co/datasets/mayflowergmbh/oasst_de)
 - [Dolly 15k (de)](https://huggingface.co/datasets/mayflowergmbh/dolly-15k_de)
 - [Alpaca GPT4 (de)](https://huggingface.co/datasets/mayflowergmbh/alpaca-gpt4_de)
```
````diff
@@ -221,13 +239,12 @@ https://github.com/hiyouga/LLaMA-Factory/assets/16256802/6ba60acc-e2e2-4bec-b846
 - [HH-RLHF (en)](https://huggingface.co/datasets/Anthropic/hh-rlhf)
 - [Open Assistant (multilingual)](https://huggingface.co/datasets/OpenAssistant/oasst1)
 - [GPT-4 Generated Data (en&zh)](https://github.com/Instruction-Tuning-with-GPT-4/GPT-4-LLM)
-- [Orca DPO (en)](https://huggingface.co/datasets/Intel/orca_dpo_pairs)
 - [Nectar (en)](https://huggingface.co/datasets/berkeley-nest/Nectar)
 - [Orca DPO (de)](https://huggingface.co/datasets/mayflowergmbh/intel_orca_dpo_pairs_de)
 
 </details>
 
 Please refer to [data/README_zh.md](data/README_zh.md) for usage.
 
 Some datasets require confirmation before use, so we recommend logging in to your Hugging Face account with the commands below.
 
 ```bash
````
@@ -240,60 +257,126 @@ huggingface-cli login
|
||||
| 必需项 | 至少 | 推荐 |
|
||||
| ------------ | ------- | --------- |
|
||||
| python | 3.8 | 3.10 |
|
||||
| torch | 1.13.1 | 2.2.1 |
|
||||
| transformers | 4.37.2 | 4.38.1 |
|
||||
| datasets | 2.14.3 | 2.17.1 |
|
||||
| accelerate | 0.27.2 | 0.27.2 |
|
||||
| peft | 0.9.0 | 0.9.0 |
|
||||
| trl | 0.7.11 | 0.7.11 |
|
||||
| torch | 1.13.1 | 2.2.0 |
|
||||
| transformers | 4.37.2 | 4.39.3 |
|
||||
| datasets | 2.14.3 | 2.18.0 |
|
||||
| accelerate | 0.27.2 | 0.28.0 |
|
||||
| peft | 0.9.0 | 0.10.0 |
|
||||
| trl | 0.8.1 | 0.8.1 |
|
||||
|
||||
| 可选项 | 至少 | 推荐 |
|
||||
| ------------ | ------- | --------- |
|
||||
| CUDA | 11.6 | 12.2 |
|
||||
| deepspeed | 0.10.0 | 0.13.4 |
|
||||
| bitsandbytes | 0.39.0 | 0.41.3 |
|
||||
| flash-attn | 2.3.0 | 2.5.5 |
|
||||
| deepspeed | 0.10.0 | 0.14.0 |
|
||||
| bitsandbytes | 0.39.0 | 0.43.0 |
|
||||
| flash-attn | 2.3.0 | 2.5.6 |
|
||||
|
||||
### 硬件依赖
|
||||
|
||||
\* *估算值*
|
||||
|
||||
| 训练方法 | 精度 | 7B | 13B | 30B | 65B | 8x7B |
|
||||
| 训练方法 | 精度 | 7B | 13B | 30B | 70B | 8x7B |
|
||||
| ------- | ---- | ----- | ----- | ----- | ------ | ------ |
|
||||
| 全参数 | 16 | 160GB | 320GB | 600GB | 1200GB | 900GB |
|
||||
| 部分参数 | 16 | 20GB | 40GB | 120GB | 240GB | 200GB |
|
||||
| LoRA | 16 | 16GB | 32GB | 80GB | 160GB | 120GB |
|
||||
| QLoRA | 8 | 10GB | 16GB | 40GB | 80GB | 80GB |
|
||||
| QLoRA | 4 | 6GB | 12GB | 24GB | 48GB | 32GB |
|
||||
| 全参数 | AMP | 120GB | 240GB | 600GB | 1200GB | 900GB |
|
||||
| 全参数 | 16 | 60GB | 120GB | 300GB | 600GB | 400GB |
|
||||
| GaLore | 16 | 16GB | 32GB | 64GB | 160GB | 120GB |
|
||||
| 部分参数 | 16 | 20GB | 40GB | 80GB | 200GB | 160GB |
|
||||
| LoRA | 16 | 16GB | 32GB | 64GB | 160GB | 120GB |
|
||||
| QLoRA | 8 | 10GB | 20GB | 40GB | 80GB | 60GB |
|
||||
| QLoRA | 4 | 6GB | 12GB | 24GB | 48GB | 30GB |
|
||||
| QLoRA | 2 | 4GB | 8GB | 16GB | 24GB | 18GB |
|
||||
|
||||
## 如何使用
|
||||
|
||||
### 数据准备(可跳过)
|
||||
### 数据准备
|
||||
|
||||
关于数据集文件的格式,请参考 [data/README_zh.md](data/README_zh.md) 的内容。构建自定义数据集时,既可以使用单个 `.json` 文件,也可以使用一个[数据加载脚本](https://huggingface.co/docs/datasets/dataset_script)和多个文件。
|
||||
关于数据集文件的格式,请参考 [data/README_zh.md](data/README_zh.md) 的内容。你可以使用 HuggingFace / ModelScope 上的数据集或加载本地数据集。
|
||||
|
||||
> [!NOTE]
|
||||
> 使用自定义数据集时,请更新 `data/dataset_info.json` 文件,该文件的格式请参考 `data/README_zh.md`。
|
||||
> 使用自定义数据集时,请更新 `data/dataset_info.json` 文件。
|
||||
|
||||
### 环境搭建(可跳过)
|
||||
### 安装依赖
|
||||
|
||||
```bash
|
||||
git clone https://github.com/hiyouga/LLaMA-Factory.git
|
||||
conda create -n llama_factory python=3.10
|
||||
conda activate llama_factory
|
||||
cd LLaMA-Factory
|
||||
pip install -r requirements.txt
|
||||
pip install -e .[metrics]
|
||||
```
|
||||
|
||||
如果要在 Windows 平台上开启量化 LoRA(QLoRA),需要安装预编译的 `bitsandbytes` 库, 支持 CUDA 11.1 到 12.2。
|
||||
可选的额外依赖项:deepspeed、metrics、unsloth、galore、vllm、bitsandbytes、gptq、awq、aqlm、qwen、modelscope、quality
<details><summary>For Windows users</summary>

To enable QLoRA on Windows, you need to install a pre-built `bitsandbytes` wheel that supports CUDA 11.1 to 12.2. Please choose the appropriate [release](https://github.com/jllllll/bitsandbytes-windows-webui/releases/tag/wheels) for your CUDA version.

```bash
pip install https://github.com/jllllll/bitsandbytes-windows-webui/releases/download/wheels/bitsandbytes-0.40.0-py3-none-win_amd64.whl
pip install https://github.com/jllllll/bitsandbytes-windows-webui/releases/download/wheels/bitsandbytes-0.41.2.post2-py3-none-win_amd64.whl
```

To enable FlashAttention-2 on Windows, you need to install a pre-built `flash-attn` wheel that supports CUDA 12.1 to 12.2. Please download the matching version from [flash-attention](https://github.com/bdashore3/flash-attention/releases).

### Use ModelScope Hub (optional)

</details>
### LLaMA Board GUI

> [!IMPORTANT]
> The LLaMA Board GUI currently only supports training on a single GPU; please use the [command line interface](#command-line-interface) for distributed training.

#### Use local environment

```bash
export CUDA_VISIBLE_DEVICES=0 # use `set CUDA_VISIBLE_DEVICES=0` on Windows
python src/train_web.py # or python -m llmtuner.webui.interface
```
#### Use Docker

```bash
docker build -f ./Dockerfile -t llama-factory:latest .
docker run --gpus=all \
    -v ./hf_cache:/root/.cache/huggingface/ \
    -v ./data:/app/data \
    -v ./output:/app/output \
    -e CUDA_VISIBLE_DEVICES=0 \
    -p 7860:7860 \
    --shm-size 16G \
    --name llama_factory \
    -d llama-factory:latest
```
#### Use Docker Compose

```bash
docker compose -f ./docker-compose.yml up -d
```

<details><summary>Details about volumes</summary>

- hf_cache: mounts the host machine's Hugging Face cache directory; it can be changed to another directory.
- data: the host directory in which datasets are stored.
- output: set the export directory to this path so that exported models can be accessed on the host machine.

</details>
### Command Line Interface

See [examples/README_zh.md](examples/README_zh.md) for usage.

Run `python src/train_bash.py -h` to display the argument documentation.

### Deploy with OpenAI-style API and vLLM

```bash
CUDA_VISIBLE_DEVICES=0,1 API_PORT=8000 python src/api_demo.py \
    --model_name_or_path mistralai/Mistral-7B-Instruct-v0.2 \
    --template mistral \
    --infer_backend vllm \
    --vllm_enforce_eager
```
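Since the server exposes an OpenAI-style API, a request along the following lines should work once it is up (the endpoint path follows the OpenAI convention; the `model` value is illustrative):

```bash
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "Mistral-7B-Instruct-v0.2", "messages": [{"role": "user", "content": "Hello!"}]}'
```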
### Use ModelScope Hub

If you have trouble downloading models and datasets from Hugging Face, you can use ModelScope as follows.

@@ -301,325 +384,17 @@ pip install https://github.com/jllllll/bitsandbytes-windows-webui/releases/downl

```bash
export USE_MODELSCOPE_HUB=1 # use `set USE_MODELSCOPE_HUB=1` on Windows
```

Then you can train the corresponding model by specifying its name. (Browse all available models on the [ModelScope Hub](https://modelscope.cn/models).)

```bash
CUDA_VISIBLE_DEVICES=0 python src/train_bash.py \
    --model_name_or_path modelscope/Llama-2-7b-ms \
    ... # arguments (same as above)
```

LLaMA Board also supports downloading models and datasets from ModelScope.

```bash
CUDA_VISIBLE_DEVICES=0 USE_MODELSCOPE_HUB=1 python src/train_web.py
```
### Train on a Single GPU

> [!IMPORTANT]
> If you want to train models on multiple GPUs, please refer to [Distributed Training on Multiple GPUs](#distributed-training-on-multiple-gpus).

#### Pre-Training
```bash
CUDA_VISIBLE_DEVICES=0 python src/train_bash.py \
    --stage pt \
    --do_train \
    --model_name_or_path path_to_llama_model \
    --dataset wiki_demo \
    --finetuning_type lora \
    --lora_target q_proj,v_proj \
    --output_dir path_to_pt_checkpoint \
    --overwrite_cache \
    --per_device_train_batch_size 4 \
    --gradient_accumulation_steps 4 \
    --lr_scheduler_type cosine \
    --logging_steps 10 \
    --save_steps 1000 \
    --learning_rate 5e-5 \
    --num_train_epochs 3.0 \
    --plot_loss \
    --fp16
```
#### Supervised Fine-Tuning
```bash
CUDA_VISIBLE_DEVICES=0 python src/train_bash.py \
    --stage sft \
    --do_train \
    --model_name_or_path path_to_llama_model \
    --dataset alpaca_gpt4_zh \
    --template default \
    --finetuning_type lora \
    --lora_target q_proj,v_proj \
    --output_dir path_to_sft_checkpoint \
    --overwrite_cache \
    --per_device_train_batch_size 4 \
    --gradient_accumulation_steps 4 \
    --lr_scheduler_type cosine \
    --logging_steps 10 \
    --save_steps 1000 \
    --learning_rate 5e-5 \
    --num_train_epochs 3.0 \
    --plot_loss \
    --fp16
```
#### Reward Modeling
```bash
CUDA_VISIBLE_DEVICES=0 python src/train_bash.py \
    --stage rm \
    --do_train \
    --model_name_or_path path_to_llama_model \
    --adapter_name_or_path path_to_sft_checkpoint \
    --create_new_adapter \
    --dataset comparison_gpt4_zh \
    --template default \
    --finetuning_type lora \
    --lora_target q_proj,v_proj \
    --output_dir path_to_rm_checkpoint \
    --per_device_train_batch_size 2 \
    --gradient_accumulation_steps 4 \
    --lr_scheduler_type cosine \
    --logging_steps 10 \
    --save_steps 1000 \
    --learning_rate 1e-6 \
    --num_train_epochs 1.0 \
    --plot_loss \
    --fp16
```
#### PPO Training
```bash
CUDA_VISIBLE_DEVICES=0 python src/train_bash.py \
    --stage ppo \
    --do_train \
    --model_name_or_path path_to_llama_model \
    --adapter_name_or_path path_to_sft_checkpoint \
    --create_new_adapter \
    --dataset alpaca_gpt4_zh \
    --template default \
    --finetuning_type lora \
    --lora_target q_proj,v_proj \
    --reward_model path_to_rm_checkpoint \
    --output_dir path_to_ppo_checkpoint \
    --per_device_train_batch_size 2 \
    --gradient_accumulation_steps 4 \
    --lr_scheduler_type cosine \
    --top_k 0 \
    --top_p 0.9 \
    --logging_steps 10 \
    --save_steps 1000 \
    --learning_rate 1e-5 \
    --num_train_epochs 1.0 \
    --plot_loss \
    --fp16
```
> [!TIP]
> Use `--adapter_name_or_path path_to_sft_checkpoint,path_to_ppo_checkpoint` to run inference with the fine-tuned model, as sketched below.
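A minimal sketch applying this tip with the CLI demo described later in this README (the checkpoint paths are placeholders, as elsewhere in this document):

```bash
python src/cli_demo.py \
    --model_name_or_path path_to_llama_model \
    --adapter_name_or_path path_to_sft_checkpoint,path_to_ppo_checkpoint \
    --template default \
    --finetuning_type lora
```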
> [!WARNING]
> Use `--per_device_train_batch_size=1` for LLaMA-2 models in fp16 PPO training.

#### DPO Training
```bash
CUDA_VISIBLE_DEVICES=0 python src/train_bash.py \
    --stage dpo \
    --do_train \
    --model_name_or_path path_to_llama_model \
    --adapter_name_or_path path_to_sft_checkpoint \
    --create_new_adapter \
    --dataset comparison_gpt4_zh \
    --template default \
    --finetuning_type lora \
    --lora_target q_proj,v_proj \
    --output_dir path_to_dpo_checkpoint \
    --per_device_train_batch_size 2 \
    --gradient_accumulation_steps 4 \
    --lr_scheduler_type cosine \
    --logging_steps 10 \
    --save_steps 1000 \
    --learning_rate 1e-5 \
    --num_train_epochs 1.0 \
    --plot_loss \
    --fp16
```
> [!TIP]
> Use `--adapter_name_or_path path_to_sft_checkpoint,path_to_dpo_checkpoint` to run inference with the fine-tuned model.

### Distributed Training on Multiple GPUs
#### Use Hugging Face Accelerate

```bash
accelerate config # configure the distributed environment first
accelerate launch src/train_bash.py # arguments (same as above)
```
<details><summary>Example Accelerate config for LoRA training</summary>

```yaml
compute_environment: LOCAL_MACHINE
debug: false
distributed_type: MULTI_GPU
downcast_bf16: 'no'
gpu_ids: all
machine_rank: 0
main_training_function: main
mixed_precision: fp16
num_machines: 1
num_processes: 4
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
```

</details>
#### Use DeepSpeed

```bash
deepspeed --num_gpus 8 --master_port=9901 src/train_bash.py \
    --deepspeed ds_config.json \
    ... # arguments (same as above)
```
<details><summary>Example DeepSpeed config for full-parameter training with ZeRO-2</summary>

```json
{
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "zero_allow_untested_optimizer": true,
  "fp16": {
    "enabled": "auto",
    "loss_scale": 0,
    "initial_scale_power": 16,
    "loss_scale_window": 1000,
    "hysteresis": 2,
    "min_loss_scale": 1
  },
  "zero_optimization": {
    "stage": 2,
    "allgather_partitions": true,
    "allgather_bucket_size": 5e8,
    "reduce_scatter": true,
    "reduce_bucket_size": 5e8,
    "overlap_comm": false,
    "contiguous_gradients": true
  }
}
```

</details>
### Merge LoRA Weights and Export Model

```bash
python src/export_model.py \
    --model_name_or_path path_to_llama_model \
    --adapter_name_or_path path_to_checkpoint \
    --template default \
    --finetuning_type lora \
    --export_dir path_to_export \
    --export_size 2 \
    --export_legacy_format False
```
> [!WARNING]
> Merging and exporting the LoRA weights of a quantized model is not yet supported.

> [!TIP]
> After merging the LoRA weights, you can quantize the exported model again with `--export_quantization_bit 4` and `--export_quantization_dataset data/c4_demo.json`, as sketched below.
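A sketch of that second pass, reusing the export command above (the directory names are placeholders):

```bash
python src/export_model.py \
    --model_name_or_path path_to_export \
    --template default \
    --export_dir path_to_export_int4 \
    --export_quantization_bit 4 \
    --export_quantization_dataset data/c4_demo.json \
    --export_size 2 \
    --export_legacy_format False
```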
### Inference with OpenAI-style API

```bash
python src/api_demo.py \
    --model_name_or_path path_to_llama_model \
    --adapter_name_or_path path_to_checkpoint \
    --template default \
    --finetuning_type lora
```
> [!TIP]
> Visit `http://localhost:8000/docs` for the API documentation.

### Inference with Command Line
```bash
python src/cli_demo.py \
    --model_name_or_path path_to_llama_model \
    --adapter_name_or_path path_to_checkpoint \
    --template default \
    --finetuning_type lora
```
### Inference with Web Browser
```bash
python src/web_demo.py \
    --model_name_or_path path_to_llama_model \
    --adapter_name_or_path path_to_checkpoint \
    --template default \
    --finetuning_type lora
```
### Model Evaluation
```bash
CUDA_VISIBLE_DEVICES=0 python src/evaluate.py \
    --model_name_or_path path_to_llama_model \
    --adapter_name_or_path path_to_checkpoint \
    --template vanilla \
    --finetuning_type lora \
    --task ceval \
    --split validation \
    --lang zh \
    --n_shot 5 \
    --batch_size 4
```
### Model Prediction
```bash
CUDA_VISIBLE_DEVICES=0 python src/train_bash.py \
    --stage sft \
    --do_predict \
    --model_name_or_path path_to_llama_model \
    --adapter_name_or_path path_to_checkpoint \
    --dataset alpaca_gpt4_zh \
    --template default \
    --finetuning_type lora \
    --output_dir path_to_predict_result \
    --per_device_eval_batch_size 1 \
    --max_samples 100 \
    --predict_with_generate \
    --fp16
```
> [!WARNING]
> Use `--per_device_eval_batch_size=1` for LLaMA-2 models in fp16 prediction.

> [!TIP]
> We recommend using `--per_device_eval_batch_size=1` and `--max_target_length 128` when predicting with quantized models, as sketched below.
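A sketch combining those flags with the prediction command above (the quantized model path is a hypothetical placeholder):

```bash
CUDA_VISIBLE_DEVICES=0 python src/train_bash.py \
    --stage sft \
    --do_predict \
    --model_name_or_path path_to_quantized_model \
    --dataset alpaca_gpt4_zh \
    --template default \
    --output_dir path_to_predict_result \
    --per_device_eval_batch_size 1 \
    --max_target_length 128 \
    --predict_with_generate
```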
Set `--model_name_or_path` to a model ID to load the corresponding model, e.g. `modelscope/Llama-2-7b-ms`. Browse all available models on the [ModelScope Hub](https://modelscope.cn/models).

## Projects Using LLaMA Factory

If you have a project that you would like to add to the list below, please contact us by email or create a PR.

<details><summary>Click to show</summary>
1. Wang et al. ESRL: Efficient Sampling-based Reinforcement Learning for Sequence Generation. 2023. [[arxiv]](https://arxiv.org/abs/2308.02223)
1. Yu et al. Open, Closed, or Small Language Models for Text Classification? 2023. [[arxiv]](https://arxiv.org/abs/2308.10092)
1. Wang et al. UbiPhysio: Support Daily Functioning, Fitness, and Rehabilitation with Action Understanding and Feedback in Natural Language. 2023. [[arxiv]](https://arxiv.org/abs/2308.10526)
1. Luceri et al. Leveraging Large Language Models to Detect Influence Campaigns in Social Media. 2023. [[arxiv]](https://arxiv.org/abs/2311.07816)
1. Zhang et al. Alleviating Hallucinations of Large Language Models through Induced Hallucinations. 2023. [[arxiv]](https://arxiv.org/abs/2312.15710)
1. Wang et al. Know Your Needs Better: Towards Structured Understanding of Marketer Demands with Analogical Reasoning Augmented LLMs. 2024. [[arxiv]](https://arxiv.org/abs/2401.04319)

@@ -634,37 +409,43 @@ CUDA_VISIBLE_DEVICES=0 python src/train_bash.py \

1. Cao et al. Head-wise Shareable Attention for Large Language Models. 2024. [[arxiv]](https://arxiv.org/abs/2402.11819)
1. Zhang et al. Enhancing Multilingual Capabilities of Large Language Models through Self-Distillation from Resource-Rich Languages. 2024. [[arxiv]](https://arxiv.org/abs/2402.12204)
1. Kim et al. Efficient and Effective Vocabulary Expansion Towards Multilingual Large Language Models. 2024. [[arxiv]](https://arxiv.org/abs/2402.14714)
1. Yu et al. KIEval: A Knowledge-grounded Interactive Evaluation Framework for Large Language Models. 2024. [[arxiv]](https://arxiv.org/abs/2402.15043)
1. Huang et al. Key-Point-Driven Data Synthesis with its Enhancement on Mathematical Reasoning. 2024. [[arxiv]](https://arxiv.org/abs/2403.02333)
1. Duan et al. Negating Negatives: Alignment without Human Positive Samples via Distributional Dispreference Optimization. 2024. [[arxiv]](https://arxiv.org/abs/2403.03419)
1. Xie and Schwertfeger. Empowering Robotics with Large Language Models: osmAG Map Comprehension with LLMs. 2024. [[arxiv]](https://arxiv.org/abs/2403.08228)
1. Weller et al. FollowIR: Evaluating and Teaching Information Retrieval Models to Follow Instructions. 2024. [[arxiv]](https://arxiv.org/abs/2403.15246)
1. Hongbin Na. CBT-LLM: A Chinese Large Language Model for Cognitive Behavioral Therapy-based Mental Health Question Answering. 2024. [[arxiv]](https://arxiv.org/abs/2403.16008)
1. **[StarWhisper](https://github.com/Yu-Yang-Li/StarWhisper)**: StarWhisper, a large language model for astronomy, fine-tuned on astronomical data based on ChatGLM2-6B and Qwen-14B.
1. **[DISC-LawLLM](https://github.com/FudanDISC/DISC-LawLLM)**: DISC-LawLLM, a large language model for the Chinese legal domain, fine-tuned from Baichuan-13B, with legal reasoning and knowledge retrieval capabilities.
1. **[Sunsimiao](https://github.com/thomas-yanxin/Sunsimiao)**: Sunsimiao, a Chinese medical large language model, fine-tuned on Chinese medical data based on Baichuan-7B and ChatGLM-6B.
1. **[CareGPT](https://github.com/WangRongsheng/CareGPT)**: CareGPT, a medical large language model project, fine-tuned on Chinese medical data based on LLaMA2-7B and Baichuan-13B.
1. **[MachineMindset](https://github.com/PKU-YuanGroup/Machine-Mindset/)**: a project on MBTI personality models, giving any LLM one of the 16 personality types through dedicated datasets and training methods.

> [!TIP]
> If you have a project that you would like to add to the list above, please contact us by email or create a PR.

</details>
## License

The code in this repository is open-sourced under the [Apache-2.0](LICENSE) license.

Please follow the corresponding model licenses when using the model weights: [Baichuan2](https://huggingface.co/baichuan-inc/Baichuan2-7B-Base/blob/main/Community%20License%20for%20Baichuan%202%20Model.pdf) / [BLOOM](https://huggingface.co/spaces/bigscience/license) / [ChatGLM3](https://github.com/THUDM/ChatGLM3/blob/main/MODEL_LICENSE) / [DeepSeek](https://github.com/deepseek-ai/DeepSeek-LLM/blob/main/LICENSE-MODEL) / [Falcon](https://huggingface.co/tiiuae/falcon-180B/blob/main/LICENSE.txt) / [Gemma](https://ai.google.dev/gemma/terms) / [InternLM2](https://github.com/InternLM/InternLM#license) / [LLaMA](https://github.com/facebookresearch/llama/blob/main/MODEL_CARD.md) / [LLaMA-2](https://ai.meta.com/llama/license/) / [Mistral](LICENSE) / [Phi-1.5/2](https://huggingface.co/microsoft/phi-1_5/resolve/main/Research%20License.docx) / [Qwen](https://github.com/QwenLM/Qwen/blob/main/Tongyi%20Qianwen%20LICENSE%20AGREEMENT) / [XVERSE](https://github.com/xverse-ai/XVERSE-13B/blob/main/MODEL_LICENSE.pdf) / [Yi](https://huggingface.co/01-ai/Yi-6B/blob/main/LICENSE) / [Yuan](https://github.com/IEIT-Yuan/Yuan-2.0/blob/main/LICENSE-Yuan)
Please follow the corresponding model licenses when using the model weights: [Baichuan2](https://huggingface.co/baichuan-inc/Baichuan2-7B-Base/blob/main/Community%20License%20for%20Baichuan%202%20Model.pdf) / [BLOOM](https://huggingface.co/spaces/bigscience/license) / [ChatGLM3](https://github.com/THUDM/ChatGLM3/blob/main/MODEL_LICENSE) / [DeepSeek](https://github.com/deepseek-ai/DeepSeek-LLM/blob/main/LICENSE-MODEL) / [Falcon](https://huggingface.co/tiiuae/falcon-180B/blob/main/LICENSE.txt) / [Gemma](https://ai.google.dev/gemma/terms) / [InternLM2](https://github.com/InternLM/InternLM#license) / [LLaMA](https://github.com/facebookresearch/llama/blob/main/MODEL_CARD.md) / [LLaMA-2](https://ai.meta.com/llama/license/) / [Mistral](LICENSE) / [OLMo](LICENSE) / [Phi-1.5/2](https://huggingface.co/microsoft/phi-1_5/resolve/main/Research%20License.docx) / [Qwen](https://github.com/QwenLM/Qwen/blob/main/Tongyi%20Qianwen%20LICENSE%20AGREEMENT) / [StarCoder2](https://huggingface.co/spaces/bigcode/bigcode-model-license-agreement) / [XVERSE](https://github.com/xverse-ai/XVERSE-13B/blob/main/MODEL_LICENSE.pdf) / [Yi](https://huggingface.co/01-ai/Yi-6B/blob/main/LICENSE) / [Yuan](https://github.com/IEIT-Yuan/Yuan-2.0/blob/main/LICENSE-Yuan)
## Citation

If this work is helpful, please kindly cite it as:

```bibtex
@Misc{llama-factory,
  title = {LLaMA Factory},
  author = {hiyouga},
  howpublished = {\url{https://github.com/hiyouga/LLaMA-Factory}},
  year = {2023}
@article{zheng2024llamafactory,
  title={LlamaFactory: Unified Efficient Fine-Tuning of 100+ Language Models},
  author={Yaowei Zheng and Richong Zhang and Junhao Zhang and Yanhan Ye and Zheyan Luo and Yongqiang Ma},
  journal={arXiv preprint arXiv:2403.13372},
  year={2024},
  url={http://arxiv.org/abs/2403.13372}
}
```
## Acknowledgement

This repo benefits from [PEFT](https://github.com/huggingface/peft), [QLoRA](https://github.com/artidoro/qlora) and [FastChat](https://github.com/lm-sys/FastChat). Thanks for their wonderful works.
This repo benefits from [PEFT](https://github.com/huggingface/peft), [TRL](https://github.com/huggingface/trl), [QLoRA](https://github.com/artidoro/qlora) and [FastChat](https://github.com/lm-sys/FastChat). Thanks for their wonderful works.

## Star History
@@ -34,6 +34,8 @@ If you are using a custom dataset, please provide your dataset definition in the

Given above, you can use the custom dataset via specifying `--dataset dataset_name`.

----

Currently we support dataset in **alpaca** or **sharegpt** format, the dataset in alpaca format should follow the below format:

```json

@@ -84,6 +86,10 @@ For the preference datasets, the `response` column should be a string list whose

}
```

Remember to set `"ranking": true` for the preference datasets.
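As a hedged illustration pulling these pieces together (the hunk header above notes that the `response` column of a preference dataset should be a string list; here `output` is assumed to be the field mapped to that column, and the concrete values are hypothetical), one record might look like:

```json
{
  "instruction": "Which answer is better?",
  "input": "",
  "output": ["The chosen answer.", "The rejected answer."]
}
```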
----

The dataset in sharegpt format should follow the below format:

```json
@@ -34,6 +34,8 @@

After adding it, you can use the custom dataset by specifying `--dataset dataset_name`.

----

The project currently supports datasets in two formats, **alpaca** and **sharegpt**; a dataset in alpaca format is organized as follows:

```json

@@ -84,6 +86,10 @@

}
```

For preference datasets, additionally set `"ranking": true`.

----

A dataset in sharegpt format is organized as follows:

```json
@@ -1,7 +1,10 @@

import os
import json
import datasets


_HF_ENDPOINT = os.getenv("HF_ENDPOINT", "https://huggingface.co")

_DESCRIPTION = "BELLE multiturn chat dataset."

_CITATION = """\

@@ -13,9 +16,9 @@ _CITATION = """\

}
"""

_HOMEPAGE = "https://huggingface.co/datasets/BelleGroup/multiturn_chat_0.8M"
_HOMEPAGE = "{}/datasets/BelleGroup/multiturn_chat_0.8M".format(_HF_ENDPOINT)
_LICENSE = "gpl-3.0"
_URL = "https://huggingface.co/datasets/BelleGroup/multiturn_chat_0.8M/resolve/main/multiturn_chat_0.8M.json"
_URL = "{}/datasets/BelleGroup/multiturn_chat_0.8M/resolve/main/multiturn_chat_0.8M.json".format(_HF_ENDPOINT)


class BelleMultiturn(datasets.GeneratorBasedBuilder):
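These loader scripts now read `HF_ENDPOINT` before building their URLs, so pointing downloads at an alternative endpoint becomes a one-line environment change; a sketch (the mirror URL is illustrative only):

```bash
export HF_ENDPOINT=https://hf-mirror.com
```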
@@ -1,6 +1,6 @@

import json
import datasets
from typing import Any, Dict, List
from typing import Any, Dict, Generator, List, Tuple


_DESCRIPTION = "An example of dataset."

@@ -40,7 +40,7 @@ class ExampleDataset(datasets.GeneratorBasedBuilder):

            )
        ]

    def _generate_examples(self, filepath: str) -> Dict[int, Dict[str, Any]]:
    def _generate_examples(self, filepath: str) -> Generator[Tuple[int, Dict[str, Any]], None, None]:
        example_dataset = json.load(open(filepath, "r", encoding="utf-8"))
        for key, example in enumerate(example_dataset):
            yield key, example

@@ -1,13 +1,14 @@

import os
import json
import datasets
from typing import List


_HF_ENDPOINT = os.getenv("HF_ENDPOINT", "https://huggingface.co")
_DESCRIPTION = "Human preference data about helpfulness and harmlessness."
_CITATION = ""
_HOMEPAGE = "https://huggingface.co/datasets/Anthropic/hh-rlhf"
_HOMEPAGE = "{}/datasets/Anthropic/hh-rlhf".format(_HF_ENDPOINT)
_LICENSE = "mit"
_URL = "https://huggingface.co/datasets/Anthropic/hh-rlhf/resolve/main/"
_URL = "{}/datasets/Anthropic/hh-rlhf/resolve/main/".format(_HF_ENDPOINT)
_URLS = {
    "train": [
        _URL + "harmless-base/train.jsonl.gz",
1
data/orca_rlhf.json.REMOVED.git-id
Normal file

@@ -0,0 +1 @@

736bcedea2b24a1414765c6d69cbdafaea839f3c
@@ -1,7 +1,9 @@

import os
import json
import datasets
from typing import List

_HF_ENDPOINT = os.getenv("HF_ENDPOINT", "https://huggingface.co")

_DESCRIPTION = "UltraChat: Large-scale, Informative, and Diverse Multi-round Dialogue Data."

@@ -16,9 +18,9 @@ _CITATION = """\

}
"""

_HOMEPAGE = "https://huggingface.co/datasets/stingning/ultrachat"
_HOMEPAGE = "{}/datasets/stingning/ultrachat".format(_HF_ENDPOINT)
_LICENSE = "cc-by-nc-4.0"
_BASE_DATA_URL = "https://huggingface.co/datasets/stingning/ultrachat/resolve/main/train_{idx}.jsonl"
_BASE_DATA_URL = "{}/datasets/stingning/ultrachat/resolve/main/train_{{idx}}.jsonl".format(_HF_ENDPOINT)


class UltraChat(datasets.GeneratorBasedBuilder):
25
docker-compose.yml
Normal file

@@ -0,0 +1,25 @@

version: '3.8'

services:
  llama-factory:
    build:
      dockerfile: Dockerfile
      context: .
    container_name: llama_factory
    volumes:
      - ./hf_cache:/root/.cache/huggingface/
      - ./data:/app/data
      - ./output:/app/output
    environment:
      - CUDA_VISIBLE_DEVICES=0
    ports:
      - "7860:7860"
    ipc: host
    deploy:
      resources:
        reservations:
          devices:
          - driver: nvidia
            count: "all"
            capabilities: [gpu]
    restart: unless-stopped
43
examples/README.md
Normal file

@@ -0,0 +1,43 @@

We provide diverse examples about fine-tuning LLMs.

```
examples/
├── lora_single_gpu/
│   ├── pretrain.sh: Do pre-training
│   ├── sft.sh: Do supervised fine-tuning
│   ├── reward.sh: Do reward modeling
│   ├── ppo.sh: Do PPO training
│   ├── dpo.sh: Do DPO training
│   ├── orpo.sh: Do ORPO training
│   ├── prepare.sh: Save tokenized dataset
│   └── predict.sh: Do batch predict
├── qlora_single_gpu/
│   ├── bitsandbytes.sh: Fine-tune 4/8-bit BNB models
│   ├── gptq.sh: Fine-tune 4/8-bit GPTQ models
│   ├── awq.sh: Fine-tune 4-bit AWQ models
│   └── aqlm.sh: Fine-tune 2-bit AQLM models
├── lora_multi_gpu/
│   ├── single_node.sh: Fine-tune model with Accelerate on single node
│   └── multi_node.sh: Fine-tune model with Accelerate on multiple nodes
├── full_multi_gpu/
│   ├── single_node.sh: Fine-tune model with DeepSpeed on single node
│   └── multi_node.sh: Fine-tune model with DeepSpeed on multiple nodes
├── merge_lora/
│   ├── merge.sh: Merge LoRA weights into the pre-trained models
│   └── quantize.sh: Quantize fine-tuned model with AutoGPTQ
├── inference/
│   ├── cli_demo.sh: Launch a command line interface
│   ├── api_demo.sh: Launch an OpenAI-style API
│   ├── web_demo.sh: Launch a web interface
│   └── evaluate.sh: Evaluate model on the MMLU benchmark
└── extras/
    ├── galore/
    │   └── sft.sh: Fine-tune model with GaLore
    ├── loraplus/
    │   └── sft.sh: Fine-tune model with LoRA+
    ├── llama_pro/
    │   ├── expand.sh: Expand layers in the model
    │   └── sft.sh: Fine-tune expanded model
    └── fsdp_qlora/
        └── sft.sh: Fine-tune quantized model with FSDP
```
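Since every script references the repository through relative paths (e.g. `../../src/train_bash.py`), running one from its own directory should work; a minimal sketch:

```bash
cd examples/lora_single_gpu
bash sft.sh
```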
43
examples/README_zh.md
Normal file

@@ -0,0 +1,43 @@

We provide a diverse set of example scripts.

```
examples/
├── lora_single_gpu/
│   ├── pretrain.sh: Run pre-training
│   ├── sft.sh: Run supervised fine-tuning
│   ├── reward.sh: Run reward modeling
│   ├── ppo.sh: Run PPO training
│   ├── dpo.sh: Run DPO training
│   ├── orpo.sh: Run ORPO training
│   ├── prepare.sh: Save the preprocessed dataset
│   └── predict.sh: Run batch prediction
├── qlora_single_gpu/
│   ├── bitsandbytes.sh: Fine-tune 4/8-bit BNB models
│   ├── gptq.sh: Fine-tune 4/8-bit GPTQ models
│   ├── awq.sh: Fine-tune 4-bit AWQ models
│   └── aqlm.sh: Fine-tune 2-bit AQLM models
├── lora_multi_gpu/
│   ├── single_node.sh: Single-node training with Accelerate
│   └── multi_node.sh: Multi-node training with Accelerate
├── full_multi_gpu/
│   ├── single_node.sh: Single-node training with DeepSpeed
│   └── multi_node.sh: Multi-node training with DeepSpeed
├── merge_lora/
│   ├── merge.sh: Merge LoRA weights into the pre-trained model
│   └── quantize.sh: Quantize the model with AutoGPTQ
├── inference/
│   ├── cli_demo.sh: Launch the command-line inference interface
│   ├── api_demo.sh: Launch an OpenAI-style API
│   ├── web_demo.sh: Launch the web inference interface
│   └── evaluate.sh: Evaluate the model on the MMLU dataset
└── extras/
    ├── galore/
    │   └── sft.sh: Train a model with GaLore
    ├── loraplus/
    │   └── sft.sh: Train a model with LoRA+
    ├── llama_pro/
    │   ├── expand.sh: Expand layers in the model
    │   └── sft.sh: Train the expanded model
    └── fsdp_qlora/
        └── sft.sh: Fine-tune a quantized model with FSDP
```
25
examples/accelerate/fsdp_config.yaml
Normal file

@@ -0,0 +1,25 @@

compute_environment: LOCAL_MACHINE
debug: false
distributed_type: FSDP
downcast_bf16: 'no'
fsdp_config:
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_backward_prefetch: BACKWARD_PRE
  fsdp_cpu_ram_efficient_loading: true
  fsdp_forward_prefetch: false
  fsdp_offload_params: true
  fsdp_sharding_strategy: FULL_SHARD
  fsdp_state_dict_type: FULL_STATE_DICT
  fsdp_sync_module_states: true
  fsdp_use_orig_params: false
machine_rank: 0
main_training_function: main
mixed_precision: fp16
num_machines: 1 # the number of nodes
num_processes: 2 # the number of GPUs in all nodes
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
18
examples/accelerate/master_config.yaml
Normal file

@@ -0,0 +1,18 @@

compute_environment: LOCAL_MACHINE
debug: false
distributed_type: MULTI_GPU
downcast_bf16: 'no'
gpu_ids: all
machine_rank: 0
main_process_ip: 192.168.0.1
main_process_port: 29555
main_training_function: main
mixed_precision: fp16
num_machines: 2 # the number of nodes
num_processes: 16 # the number of GPUs in all nodes
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
@@ -6,8 +6,8 @@ gpu_ids: all

machine_rank: 0
main_training_function: main
mixed_precision: fp16
num_machines: 1
num_processes: 4
num_machines: 1 # the number of nodes
num_processes: 4 # the number of GPUs in all nodes
rdzv_backend: static
same_network: true
tpu_env: []
18
examples/accelerate/slave_config.yaml
Normal file

@@ -0,0 +1,18 @@

compute_environment: LOCAL_MACHINE
debug: false
distributed_type: MULTI_GPU
downcast_bf16: 'no'
gpu_ids: all
machine_rank: 1
main_process_ip: 192.168.0.1
main_process_port: 29555
main_training_function: main
mixed_precision: fp16
num_machines: 2 # the number of nodes
num_processes: 16 # the number of GPUs in all nodes
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
40
examples/extras/fsdp_qlora/sft.sh
Normal file

@@ -0,0 +1,40 @@

#!/bin/bash

pip install "transformers>=4.39.1"
pip install "accelerate>=0.28.0"
pip install "bitsandbytes>=0.43.0"

CUDA_VISIBLE_DEVICES=0,1 accelerate launch \
    --config_file ../../accelerate/fsdp_config.yaml \
    ../../../src/train_bash.py \
    --stage sft \
    --do_train \
    --model_name_or_path meta-llama/Llama-2-70b-hf \
    --dataset alpaca_gpt4_en,glaive_toolcall \
    --dataset_dir ../../../data \
    --template default \
    --finetuning_type lora \
    --lora_target q_proj,v_proj \
    --output_dir ../../../saves/LLaMA2-70B/lora/sft \
    --overwrite_cache \
    --overwrite_output_dir \
    --cutoff_len 1024 \
    --preprocessing_num_workers 16 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 4 \
    --lr_scheduler_type cosine \
    --logging_steps 10 \
    --warmup_steps 20 \
    --save_steps 100 \
    --eval_steps 100 \
    --evaluation_strategy steps \
    --load_best_model_at_end \
    --learning_rate 5e-5 \
    --num_train_epochs 3.0 \
    --max_samples 3000 \
    --val_size 0.1 \
    --ddp_timeout 180000000 \
    --quantization_bit 4 \
    --plot_loss \
    --fp16
35
examples/extras/galore/sft.sh
Normal file

@@ -0,0 +1,35 @@

#!/bin/bash

CUDA_VISIBLE_DEVICES=0 python ../../../src/train_bash.py \
    --stage sft \
    --do_train \
    --model_name_or_path meta-llama/Llama-2-7b-hf \
    --dataset alpaca_gpt4_en,glaive_toolcall \
    --dataset_dir ../../../data \
    --template default \
    --finetuning_type full \
    --use_galore \
    --galore_layerwise \
    --galore_target mlp,self_attn \
    --galore_rank 128 \
    --output_dir ../../../saves/LLaMA2-7B/galore/sft \
    --overwrite_cache \
    --overwrite_output_dir \
    --cutoff_len 1024 \
    --preprocessing_num_workers 16 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 1 \
    --lr_scheduler_type cosine \
    --logging_steps 10 \
    --warmup_steps 20 \
    --save_steps 100 \
    --eval_steps 100 \
    --evaluation_strategy steps \
    --load_best_model_at_end \
    --learning_rate 5e-5 \
    --num_train_epochs 3.0 \
    --max_samples 3000 \
    --val_size 0.1 \
    --plot_loss \
    --pure_bf16
6
examples/extras/llama_pro/expand.sh
Normal file

@@ -0,0 +1,6 @@

#!/bin/bash

python ../../../scripts/llama_pro.py \
    --model_name_or_path meta-llama/Llama-2-7b-hf \
    --output_dir ../../../models/llama2-7b-pro \
    --num_expand 8
34
examples/extras/llama_pro/sft.sh
Normal file

@@ -0,0 +1,34 @@

#!/bin/bash

CUDA_VISIBLE_DEVICES=0 python ../../../src/train_bash.py \
    --stage sft \
    --do_train \
    --model_name_or_path ../../../models/llama2-7b-pro \
    --dataset alpaca_gpt4_en,glaive_toolcall \
    --dataset_dir ../../../data \
    --template default \
    --finetuning_type freeze \
    --name_module_trainable all \
    --num_layer_trainable 8 \
    --use_llama_pro \
    --output_dir ../../../saves/LLaMA2-7B-Pro/lora/sft \
    --overwrite_cache \
    --overwrite_output_dir \
    --cutoff_len 1024 \
    --preprocessing_num_workers 16 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 8 \
    --lr_scheduler_type cosine \
    --logging_steps 10 \
    --warmup_steps 20 \
    --save_steps 100 \
    --eval_steps 100 \
    --evaluation_strategy steps \
    --load_best_model_at_end \
    --learning_rate 5e-5 \
    --num_train_epochs 3.0 \
    --max_samples 3000 \
    --val_size 0.1 \
    --plot_loss \
    --fp16
33
examples/extras/loraplus/sft.sh
Normal file

@@ -0,0 +1,33 @@

#!/bin/bash

CUDA_VISIBLE_DEVICES=0 python ../../src/train_bash.py \
    --stage sft \
    --do_train \
    --model_name_or_path meta-llama/Llama-2-7b-hf \
    --dataset alpaca_gpt4_en,glaive_toolcall \
    --dataset_dir ../../data \
    --template default \
    --finetuning_type lora \
    --lora_target q_proj,v_proj \
    --output_dir ../../saves/LLaMA2-7B/loraplus/sft \
    --overwrite_cache \
    --overwrite_output_dir \
    --cutoff_len 1024 \
    --preprocessing_num_workers 16 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 8 \
    --lr_scheduler_type cosine \
    --logging_steps 10 \
    --warmup_steps 20 \
    --save_steps 100 \
    --eval_steps 100 \
    --evaluation_strategy steps \
    --load_best_model_at_end \
    --learning_rate 5e-5 \
    --num_train_epochs 3.0 \
    --max_samples 3000 \
    --val_size 0.1 \
    --plot_loss \
    --fp16 \
    --loraplus_lr_ratio 16.0
38
examples/full_multi_gpu/multi_node.sh
Normal file

@@ -0,0 +1,38 @@

#!/bin/bash

python -m torch.distributed.run \
    --nproc_per_node $NPROC_PER_NODE \
    --nnodes $NNODES \
    --node_rank $RANK \
    --master_addr $MASTER_ADDR \
    --master_port $MASTER_PORT \
    ../../src/train_bash.py \
    --deepspeed ../deepspeed/ds_z3_config.json \
    --stage sft \
    --do_train \
    --model_name_or_path meta-llama/Llama-2-7b-hf \
    --dataset alpaca_gpt4_en,glaive_toolcall \
    --dataset_dir ../../data \
    --template default \
    --finetuning_type full \
    --output_dir ../../saves/LLaMA2-7B/full/sft \
    --overwrite_cache \
    --overwrite_output_dir \
    --cutoff_len 1024 \
    --preprocessing_num_workers 16 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 2 \
    --lr_scheduler_type cosine \
    --logging_steps 10 \
    --warmup_steps 20 \
    --save_steps 100 \
    --eval_steps 100 \
    --evaluation_strategy steps \
    --learning_rate 5e-5 \
    --num_train_epochs 3.0 \
    --max_samples 3000 \
    --val_size 0.1 \
    --ddp_timeout 180000000 \
    --plot_loss \
    --fp16
@@ -1,11 +1,11 @@

#!/bin/bash

deepspeed --num_gpus 4 ../../src/train_bash.py \
    --deepspeed ds_z3_config.json \
    --deepspeed ../deepspeed/ds_z3_config.json \
    --stage sft \
    --do_train \
    --model_name_or_path meta-llama/Llama-2-7b-hf \
    --dataset alpaca_gpt4_en \
    --dataset alpaca_gpt4_en,glaive_toolcall \
    --dataset_dir ../../data \
    --template default \
    --finetuning_type full \

@@ -13,11 +13,13 @@ deepspeed --num_gpus 4 ../../src/train_bash.py \

    --overwrite_cache \
    --overwrite_output_dir \
    --cutoff_len 1024 \
    --preprocessing_num_workers 16 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 2 \
    --lr_scheduler_type cosine \
    --logging_steps 10 \
    --warmup_steps 20 \
    --save_steps 100 \
    --eval_steps 100 \
    --evaluation_strategy steps \

@@ -25,5 +27,6 @@ deepspeed --num_gpus 4 ../../src/train_bash.py \

    --num_train_epochs 3.0 \
    --max_samples 3000 \
    --val_size 0.1 \
    --ddp_timeout 180000000 \
    --plot_loss \
    --fp16
7
examples/inference/api_demo.sh
Normal file

@@ -0,0 +1,7 @@

#!/bin/bash

CUDA_VISIBLE_DEVICES=0 API_PORT=8000 python ../../src/api_demo.py \
    --model_name_or_path meta-llama/Llama-2-7b-hf \
    --adapter_name_or_path ../../saves/LLaMA2-7B/lora/sft \
    --template default \
    --finetuning_type lora
7
examples/inference/cli_demo.sh
Normal file

@@ -0,0 +1,7 @@

#!/bin/bash

CUDA_VISIBLE_DEVICES=0 python ../../src/cli_demo.py \
    --model_name_or_path meta-llama/Llama-2-7b-hf \
    --adapter_name_or_path ../../saves/LLaMA2-7B/lora/sft \
    --template default \
    --finetuning_type lora
12
examples/inference/evaluate.sh
Normal file

@@ -0,0 +1,12 @@

#!/bin/bash

CUDA_VISIBLE_DEVICES=0 python ../../src/evaluate.py \
    --model_name_or_path meta-llama/Llama-2-7b-hf \
    --adapter_name_or_path ../../saves/LLaMA2-7B/lora/sft \
    --template vanilla \
    --finetuning_type lora \
    --task mmlu \
    --split test \
    --lang en \
    --n_shot 5 \
    --batch_size 4
7
examples/inference/web_demo.sh
Normal file

@@ -0,0 +1,7 @@

#!/bin/bash

CUDA_VISIBLE_DEVICES=0 python ../../src/web_demo.py \
    --model_name_or_path meta-llama/Llama-2-7b-hf \
    --adapter_name_or_path ../../saves/LLaMA2-7B/lora/sft \
    --template default \
    --finetuning_type lora
@@ -1,6 +1,8 @@

#!/bin/bash

CUDA_VISIBLE_DEVICES=0,1,2,3 accelerate launch --config_file config.yaml ../../src/train_bash.py \
CUDA_VISIBLE_DEVICES=0,1,2,3 accelerate launch \
    --config_file ../accelerate/master_config.yaml \
    ../../src/train_bash.py \
    --stage sft \
    --do_train \
    --model_name_or_path meta-llama/Llama-2-7b-hf \

@@ -13,11 +15,13 @@ CUDA_VISIBLE_DEVICES=0,1,2,3 accelerate launch --config_file config.yaml ../../s

    --overwrite_cache \
    --overwrite_output_dir \
    --cutoff_len 1024 \
    --preprocessing_num_workers 16 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 2 \
    --lr_scheduler_type cosine \
    --logging_steps 10 \
    --warmup_steps 20 \
    --save_steps 100 \
    --eval_steps 100 \
    --evaluation_strategy steps \

@@ -26,5 +30,6 @@ CUDA_VISIBLE_DEVICES=0,1,2,3 accelerate launch --config_file config.yaml ../../s

    --num_train_epochs 3.0 \
    --max_samples 3000 \
    --val_size 0.1 \
    --ddp_timeout 180000000 \
    --plot_loss \
    --fp16
35
examples/lora_multi_gpu/single_node.sh
Normal file

@@ -0,0 +1,35 @@

#!/bin/bash

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 accelerate launch \
    --config_file ../accelerate/single_config.yaml \
    ../../src/train_bash.py \
    --stage sft \
    --do_train \
    --model_name_or_path meta-llama/Llama-2-7b-hf \
    --dataset alpaca_gpt4_en,glaive_toolcall \
    --dataset_dir ../../data \
    --template default \
    --finetuning_type lora \
    --lora_target q_proj,v_proj \
    --output_dir ../../saves/LLaMA2-7B/lora/sft \
    --overwrite_cache \
    --overwrite_output_dir \
    --cutoff_len 1024 \
    --preprocessing_num_workers 16 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 2 \
    --lr_scheduler_type cosine \
    --logging_steps 10 \
    --warmup_steps 20 \
    --save_steps 100 \
    --eval_steps 100 \
    --evaluation_strategy steps \
    --load_best_model_at_end \
    --learning_rate 5e-5 \
    --num_train_epochs 3.0 \
    --max_samples 3000 \
    --val_size 0.1 \
    --ddp_timeout 180000000 \
    --plot_loss \
    --fp16
@@ -6,7 +6,7 @@ CUDA_VISIBLE_DEVICES=0 python ../../src/train_bash.py \

    --model_name_or_path meta-llama/Llama-2-7b-hf \
    --adapter_name_or_path ../../saves/LLaMA2-7B/lora/sft \
    --create_new_adapter \
    --dataset comparison_gpt4_en \
    --dataset orca_rlhf \
    --dataset_dir ../../data \
    --template default \
    --finetuning_type lora \

@@ -15,11 +15,13 @@ CUDA_VISIBLE_DEVICES=0 python ../../src/train_bash.py \

    --overwrite_cache \
    --overwrite_output_dir \
    --cutoff_len 1024 \
    --preprocessing_num_workers 16 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 8 \
    --lr_scheduler_type cosine \
    --logging_steps 10 \
    --warmup_steps 20 \
    --save_steps 100 \
    --eval_steps 100 \
    --evaluation_strategy steps \
32
examples/lora_single_gpu/orpo.sh
Normal file

@@ -0,0 +1,32 @@

#!/bin/bash

CUDA_VISIBLE_DEVICES=0 python ../../src/train_bash.py \
    --stage orpo \
    --do_train \
    --model_name_or_path meta-llama/Llama-2-7b-hf \
    --dataset orca_rlhf \
    --dataset_dir ../../data \
    --template default \
    --finetuning_type lora \
    --lora_target q_proj,v_proj \
    --output_dir ../../saves/LLaMA2-7B/lora/orpo \
    --overwrite_cache \
    --overwrite_output_dir \
    --cutoff_len 1024 \
    --preprocessing_num_workers 16 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 8 \
    --lr_scheduler_type cosine \
    --logging_steps 10 \
    --warmup_steps 20 \
    --save_steps 100 \
    --eval_steps 100 \
    --evaluation_strategy steps \
    --load_best_model_at_end \
    --learning_rate 1e-5 \
    --num_train_epochs 1.0 \
    --max_samples 1000 \
    --val_size 0.1 \
    --plot_loss \
    --fp16
@@ -16,6 +16,7 @@ CUDA_VISIBLE_DEVICES=0 python ../../src/train_bash.py \

    --overwrite_cache \
    --overwrite_output_dir \
    --cutoff_len 512 \
    --preprocessing_num_workers 16 \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 8 \
    --lr_scheduler_type cosine \
@@ -13,6 +13,7 @@ CUDA_VISIBLE_DEVICES=0 python ../../src/train_bash.py \

    --overwrite_cache \
    --overwrite_output_dir \
    --cutoff_len 1024 \
    --preprocessing_num_workers 16 \
    --per_device_eval_batch_size 1 \
    --max_samples 20 \
    --predict_with_generate
18
examples/lora_single_gpu/prepare.sh
Normal file

@@ -0,0 +1,18 @@

#!/bin/bash

CUDA_VISIBLE_DEVICES= python ../../src/train_bash.py \
    --stage sft \
    --do_train \
    --model_name_or_path meta-llama/Llama-2-7b-hf \
    --dataset alpaca_gpt4_en,glaive_toolcall \
    --dataset_dir ../../data \
    --template default \
    --finetuning_type lora \
    --lora_target q_proj,v_proj \
    --output_dir ../../saves/LLaMA2-7B/lora/sft \
    --overwrite_cache \
    --overwrite_output_dir \
    --cutoff_len 1024 \
    --preprocessing_num_workers 16 \
    --max_samples 3000 \
    --tokenized_path ../../saves/datasets/sft
@@ -12,11 +12,13 @@ CUDA_VISIBLE_DEVICES=0 python ../../src/train_bash.py \

    --overwrite_cache \
    --overwrite_output_dir \
    --cutoff_len 1024 \
    --preprocessing_num_workers 16 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 8 \
    --lr_scheduler_type cosine \
    --logging_steps 10 \
    --warmup_steps 20 \
    --save_steps 100 \
    --eval_steps 100 \
    --evaluation_strategy steps \
@@ -6,7 +6,7 @@ CUDA_VISIBLE_DEVICES=0 python ../../src/train_bash.py \

    --model_name_or_path meta-llama/Llama-2-7b-hf \
    --adapter_name_or_path ../../saves/LLaMA2-7B/lora/sft \
    --create_new_adapter \
    --dataset comparison_gpt4_en \
    --dataset orca_rlhf \
    --dataset_dir ../../data \
    --template default \
    --finetuning_type lora \

@@ -15,11 +15,13 @@ CUDA_VISIBLE_DEVICES=0 python ../../src/train_bash.py \

    --overwrite_cache \
    --overwrite_output_dir \
    --cutoff_len 1024 \
    --preprocessing_num_workers 16 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 8 \
    --lr_scheduler_type cosine \
    --logging_steps 10 \
    --warmup_steps 20 \
    --save_steps 100 \
    --eval_steps 100 \
    --evaluation_strategy steps \
@@ -13,11 +13,13 @@ CUDA_VISIBLE_DEVICES=0 python ../../src/train_bash.py \

    --overwrite_cache \
    --overwrite_output_dir \
    --cutoff_len 1024 \
    --preprocessing_num_workers 16 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 8 \
    --lr_scheduler_type cosine \
    --logging_steps 10 \
    --warmup_steps 20 \
    --save_steps 100 \
    --eval_steps 100 \
    --evaluation_strategy steps \
11
examples/merge_lora/merge.sh
Normal file

@@ -0,0 +1,11 @@

#!/bin/bash
# DO NOT use quantized model or quantization_bit when merging lora weights

CUDA_VISIBLE_DEVICES= python ../../src/export_model.py \
    --model_name_or_path meta-llama/Llama-2-7b-hf \
    --adapter_name_or_path ../../saves/LLaMA2-7B/lora/sft \
    --template default \
    --finetuning_type lora \
    --export_dir ../../models/llama2-7b-sft \
    --export_size 2 \
    --export_legacy_format False
10
examples/merge_lora/quantize.sh
Normal file

@@ -0,0 +1,10 @@

#!/bin/bash

CUDA_VISIBLE_DEVICES=0 python ../../src/export_model.py \
    --model_name_or_path ../../models/llama2-7b-sft \
    --template default \
    --export_dir ../../models/llama2-7b-sft-int4 \
    --export_quantization_bit 4 \
    --export_quantization_dataset ../../data/c4_demo.json \
    --export_size 2 \
    --export_legacy_format False
@@ -28,5 +28,6 @@ known-third-party = [

[tool.ruff.format]
quote-style = "double"
indent-style = "space"
docstring-code-format = true
skip-magic-trailing-comma = false
line-ending = "auto"
@@ -2,18 +2,16 @@ torch>=1.13.1

transformers>=4.37.2
datasets>=2.14.3
accelerate>=0.27.2
peft>=0.9.0
trl>=0.7.11
gradio>=3.38.0,<4.0.0
peft>=0.10.0
trl>=0.8.1
gradio>=4.0.0,<=4.21.0
scipy
einops
sentencepiece
protobuf
jieba
rouge-chinese
nltk
uvicorn
pydantic
fastapi
sse-starlette
matplotlib
fire
@@ -15,7 +15,7 @@ from transformers import DataCollatorForLanguageModeling, DataCollatorForSeq2Seq

from llmtuner.data import get_dataset
from llmtuner.extras.constants import IGNORE_INDEX
from llmtuner.hparams import get_train_args
from llmtuner.model import load_model_and_tokenizer
from llmtuner.model import load_tokenizer


BASE_LR = 3e-4  # 1.5e-4 for 30B-70B models

@@ -32,7 +32,7 @@ def calculate_lr(

    cutoff_len: Optional[int] = 1024,  # i.e. maximum input length during training
    is_mistral: Optional[bool] = False,  # mistral model uses a smaller learning rate,
):
    model_args, data_args, training_args, finetuning_args, _ = get_train_args(
    model_args, data_args, training_args, _, _ = get_train_args(
        dict(
            stage=stage,
            model_name_or_path=model_name_or_path,

@@ -44,8 +44,8 @@ def calculate_lr(

            overwrite_cache=True,
        )
    )
    _, tokenizer = load_model_and_tokenizer(model_args, finetuning_args, is_trainable=False, add_valuehead=False)
    trainset = get_dataset(tokenizer, model_args, data_args, training_args, stage=stage)
    tokenizer = load_tokenizer(model_args)
    trainset = get_dataset(tokenizer, model_args, data_args, training_args, stage)
    if stage == "pt":
        data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)
    elif stage == "sft":
@@ -10,7 +10,7 @@ from tqdm import tqdm

from llmtuner.data import get_dataset
from llmtuner.hparams import get_train_args
from llmtuner.model import load_model_and_tokenizer
from llmtuner.model import load_tokenizer


def length_cdf(

@@ -20,7 +20,7 @@ def length_cdf(

    template: Optional[str] = "default",
    interval: Optional[int] = 1000,
):
    model_args, data_args, training_args, finetuning_args, _ = get_train_args(
    model_args, data_args, training_args, _, _ = get_train_args(
        dict(
            stage="sft",
            model_name_or_path=model_name_or_path,

@@ -32,7 +32,7 @@ def length_cdf(

            overwrite_cache=True,
        )
    )
    _, tokenizer = load_model_and_tokenizer(model_args, finetuning_args, is_trainable=False, add_valuehead=False)
    tokenizer = load_tokenizer(model_args)
    trainset = get_dataset(tokenizer, model_args, data_args, training_args, stage="sft")
    total_num = len(trainset)
    length_dict = defaultdict(int)
28
setup.py

@@ -1,13 +1,14 @@

import os
import re
from setuptools import setup, find_packages

from setuptools import find_packages, setup


def get_version():
    with open(os.path.join("src", "llmtuner", "__init__.py"), "r", encoding="utf-8") as f:
        file_content = f.read()
        pattern = r"{0}\W*=\W*\"([^\"]+)\"".format("__version__")
        version, = re.findall(pattern, file_content)
        (version,) = re.findall(pattern, file_content)
        return version

@@ -18,8 +19,23 @@ def get_requires():

    return lines


def main():
extra_require = {
    "deepspeed": ["deepspeed>=0.10.0"],
    "metrics": ["nltk", "jieba", "rouge-chinese"],
    "unsloth": ["torch==2.2.0", "unsloth[cu121-ampere-torch220]"],
    "galore": ["galore-torch"],
    "vllm": ["vllm>=0.3.3"],
    "bitsandbytes": ["bitsandbytes>=0.39.0"],
    "gptq": ["optimum>=1.16.0", "auto-gptq>=0.5.0"],
    "awq": ["autoawq"],
    "aqlm": ["aqlm[gpu]>=1.1.0"],
    "qwen": ["tiktoken", "transformers_stream_generator"],
    "modelscope": ["modelscope"],
    "quality": ["ruff"],
}


def main():
    setup(
        name="llmtuner",
        version=get_version(),

@@ -35,8 +51,9 @@ def main():

        packages=find_packages("src"),
        python_requires=">=3.8.0",
        install_requires=get_requires(),
        extras_require=extra_require,
        classifiers=[
            "Development Status :: 3 - Alpha",
            "Development Status :: 4 - Beta",
            "Intended Audience :: Developers",
            "Intended Audience :: Education",
            "Intended Audience :: Science/Research",

@@ -46,8 +63,9 @@ def main():

            "Programming Language :: Python :: 3.8",
            "Programming Language :: Python :: 3.9",
            "Programming Language :: Python :: 3.10",
            "Programming Language :: Python :: 3.11",
            "Topic :: Scientific/Engineering :: Artificial Intelligence",
        ]
        ],
    )
@@ -2,8 +2,7 @@ from llmtuner import Evaluator


def main():
    evaluator = Evaluator()
    evaluator.eval()
    Evaluator().eval()


if __name__ == "__main__":
@@ -7,5 +7,5 @@ from .train import export_model, run_exp

from .webui import create_ui, create_web_demo


__version__ = "0.5.3"
__version__ = "0.6.2"
__all__ = ["create_app", "ChatModel", "Evaluator", "export_model", "run_exp", "create_ui", "create_web_demo"]
@@ -1,4 +1,3 @@

import asyncio
import json
import os
from contextlib import asynccontextmanager

@@ -73,7 +72,6 @@ def create_app(chat_model: "ChatModel") -> "FastAPI":

        allow_headers=["*"],
    )

    semaphore = asyncio.Semaphore(int(os.environ.get("MAX_CONCURRENT", 1)))
    role_mapping = {
        Role.USER: DataRole.USER.value,
        Role.ASSISTANT: DataRole.ASSISTANT.value,

@@ -89,7 +87,7 @@ def create_app(chat_model: "ChatModel") -> "FastAPI":

    @app.post("/v1/chat/completions", response_model=ChatCompletionResponse, status_code=status.HTTP_200_OK)
    async def create_chat_completion(request: ChatCompletionRequest):
        if not chat_model.can_generate:
        if not chat_model.engine.can_generate:
            raise HTTPException(status_code=status.HTTP_405_METHOD_NOT_ALLOWED, detail="Not allowed")

        if len(request.messages) == 0:

@@ -110,31 +108,32 @@ def create_app(chat_model: "ChatModel") -> "FastAPI":

            elif i % 2 == 1 and message.role not in [Role.ASSISTANT, Role.FUNCTION]:
                raise HTTPException(status_code=status.HTTP_400_BAD_REQUEST, detail="Invalid role")

            input_messages.append({"role": role_mapping[message.role], "content": message.content})
            if message.role == Role.ASSISTANT and isinstance(message.tool_calls, list) and len(message.tool_calls):
                name = message.tool_calls[0].function.name
                arguments = message.tool_calls[0].function.arguments
                content = json.dumps({"name": name, "argument": arguments}, ensure_ascii=False)
                input_messages.append({"role": role_mapping[Role.FUNCTION], "content": content})
            else:
                input_messages.append({"role": role_mapping[message.role], "content": message.content})

        tool_list = request.tools
        if isinstance(tool_list, list) and len(tool_list):
            try:
                tools = json.dumps([tool["function"] for tool in tool_list], ensure_ascii=False)
                tools = json.dumps([dictify(tool.function) for tool in tool_list], ensure_ascii=False)
            except Exception:
                raise HTTPException(status_code=status.HTTP_400_BAD_REQUEST, detail="Invalid tools")
        else:
            tools = ""

        async with semaphore:
            loop = asyncio.get_running_loop()
            return await loop.run_in_executor(None, chat_completion, input_messages, system, tools, request)

    def chat_completion(messages: Sequence[Dict[str, str]], system: str, tools: str, request: ChatCompletionRequest):
        if request.stream:
            if tools:
                raise HTTPException(status_code=status.HTTP_400_BAD_REQUEST, detail="Cannot stream function calls.")

            generate = stream_chat_completion(messages, system, tools, request)
            generate = stream_chat_completion(input_messages, system, tools, request)
            return EventSourceResponse(generate, media_type="text/event-stream")

        responses = chat_model.chat(
            messages,
        responses = await chat_model.achat(
            input_messages,
            system,
            tools,
            do_sample=request.do_sample,

@@ -148,7 +147,7 @@ def create_app(chat_model: "ChatModel") -> "FastAPI":

        choices = []
        for i, response in enumerate(responses):
            if tools:
                result = chat_model.template.format_tools.extract(response.response_text)
                result = chat_model.engine.template.format_tools.extract(response.response_text)
            else:
                result = response.response_text

@@ -177,7 +176,7 @@ def create_app(chat_model: "ChatModel") -> "FastAPI":

        return ChatCompletionResponse(model=request.model, choices=choices, usage=usage)

    def stream_chat_completion(
    async def stream_chat_completion(
        messages: Sequence[Dict[str, str]], system: str, tools: str, request: ChatCompletionRequest
    ):
        choice_data = ChatCompletionResponseStreamChoice(

@@ -186,7 +185,7 @@ def create_app(chat_model: "ChatModel") -> "FastAPI":

        chunk = ChatCompletionStreamResponse(model=request.model, choices=[choice_data])
        yield jsonify(chunk)

        for new_text in chat_model.stream_chat(
        async for new_token in chat_model.astream_chat(
            messages,
            system,
            tools,

@@ -195,11 +194,11 @@ def create_app(chat_model: "ChatModel") -> "FastAPI":

            top_p=request.top_p,
            max_new_tokens=request.max_tokens,
        ):
            if len(new_text) == 0:
            if len(new_token) == 0:
                continue

            choice_data = ChatCompletionResponseStreamChoice(
                index=0, delta=ChatCompletionMessage(content=new_text), finish_reason=None
                index=0, delta=ChatCompletionMessage(content=new_token), finish_reason=None
            )
            chunk = ChatCompletionStreamResponse(model=request.model, choices=[choice_data])
            yield jsonify(chunk)

@@ -213,18 +212,13 @@ def create_app(chat_model: "ChatModel") -> "FastAPI":

    @app.post("/v1/score/evaluation", response_model=ScoreEvaluationResponse, status_code=status.HTTP_200_OK)
    async def create_score_evaluation(request: ScoreEvaluationRequest):
        if chat_model.can_generate:
        if chat_model.engine.can_generate:
            raise HTTPException(status_code=status.HTTP_405_METHOD_NOT_ALLOWED, detail="Not allowed")

        if len(request.messages) == 0:
            raise HTTPException(status_code=status.HTTP_400_BAD_REQUEST, detail="Invalid request")

        async with semaphore:
            loop = asyncio.get_running_loop()
            return await loop.run_in_executor(None, get_score, request)

    def get_score(request: ScoreEvaluationRequest):
        scores = chat_model.get_scores(request.messages, max_length=request.max_length)
        scores = await chat_model.aget_scores(request.messages, max_length=request.max_length)
        return ScoreEvaluationResponse(model=request.model, scores=scores)

    return app
src/llmtuner/api/protocol.py
@@ -1,6 +1,6 @@
import time
from enum import Enum, unique
from typing import List, Optional
from typing import Any, Dict, List, Optional

from pydantic import BaseModel, Field
from typing_extensions import Literal
@@ -39,6 +39,17 @@ class Function(BaseModel):
    arguments: str


class FunctionDefinition(BaseModel):
    name: str
    description: str
    parameters: Dict[str, Any]


class FunctionAvailable(BaseModel):
    type: Literal["function", "code_interpreter"] = "function"
    function: Optional[FunctionDefinition] = None


class FunctionCall(BaseModel):
    id: Literal["call_default"] = "call_default"
    type: Literal["function"] = "function"
@@ -47,7 +58,8 @@ class FunctionCall(BaseModel):

class ChatMessage(BaseModel):
    role: Role
    content: str
    content: Optional[str] = None
    tool_calls: Optional[List[FunctionCall]] = None


class ChatCompletionMessage(BaseModel):
@@ -59,7 +71,7 @@ class ChatCompletionMessage(BaseModel):
class ChatCompletionRequest(BaseModel):
    model: str
    messages: List[ChatMessage]
    tools: Optional[list] = []
    tools: Optional[List[FunctionAvailable]] = None
    do_sample: bool = True
    temperature: Optional[float] = None
    top_p: Optional[float] = None

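For reference, a request against the updated endpoint might look like the sketch below. The payload shape follows the `ChatCompletionRequest` / `FunctionAvailable` models in this changeset, but the host, port, model name, and the `get_weather` tool are illustrative assumptions, not values taken from the diff.

```python
# Illustrative only: endpoint path matches the diff above; host, port,
# model name ("test"), and the get_weather tool are assumptions.
import requests

payload = {
    "model": "test",
    "messages": [{"role": "user", "content": "What is the weather in Paris?"}],
    "tools": [
        {
            "type": "function",
            "function": {
                "name": "get_weather",  # hypothetical tool for demonstration
                "description": "Query current weather by city.",
                "parameters": {
                    "type": "object",
                    "properties": {"city": {"type": "string"}},
                    "required": ["city"],
                },
            },
        }
    ],
}
resp = requests.post("http://localhost:8000/v1/chat/completions", json=payload, timeout=60)
print(resp.json())
```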
src/llmtuner/chat/__init__.py
@@ -1,4 +1,5 @@
from .base_engine import BaseEngine
from .chat_model import ChatModel


__all__ = ["ChatModel"]
__all__ = ["BaseEngine", "ChatModel"]

src/llmtuner/chat/base_engine.py (new file, 69 lines)
@@ -0,0 +1,69 @@
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import TYPE_CHECKING, Any, AsyncGenerator, Dict, List, Literal, Optional, Sequence, Union


if TYPE_CHECKING:
    from transformers import PreTrainedModel, PreTrainedTokenizer

    from ..data import Template
    from ..extras.packages import is_vllm_available
    from ..hparams import DataArguments, FinetuningArguments, GeneratingArguments, ModelArguments

    if is_vllm_available():
        from vllm import AsyncLLMEngine


@dataclass
class Response:
    response_text: str
    response_length: int
    prompt_length: int
    finish_reason: Literal["stop", "length"]


class BaseEngine(ABC):
    model: Union["PreTrainedModel", "AsyncLLMEngine"]
    tokenizer: "PreTrainedTokenizer"
    can_generate: bool
    template: "Template"
    generating_args: Dict[str, Any]

    @abstractmethod
    def __init__(
        self,
        model_args: "ModelArguments",
        data_args: "DataArguments",
        finetuning_args: "FinetuningArguments",
        generating_args: "GeneratingArguments",
    ) -> None: ...

    @abstractmethod
    async def start(
        self,
    ) -> None: ...

    @abstractmethod
    async def chat(
        self,
        messages: Sequence[Dict[str, str]],
        system: Optional[str] = None,
        tools: Optional[str] = None,
        **input_kwargs,
    ) -> List["Response"]: ...

    @abstractmethod
    async def stream_chat(
        self,
        messages: Sequence[Dict[str, str]],
        system: Optional[str] = None,
        tools: Optional[str] = None,
        **input_kwargs,
    ) -> AsyncGenerator[str, None]: ...

    @abstractmethod
    async def get_scores(
        self,
        batch_input: List[str],
        **input_kwargs,
    ) -> List[float]: ...

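Any additional backend would subclass this interface. The snippet below is a hypothetical "echo" engine, not part of the changeset, sketching the minimum surface a `BaseEngine` implementation must cover:

```python
# Hypothetical illustration only: a toy engine satisfying the BaseEngine
# contract from the new file above. Not part of this changeset.
from llmtuner.chat.base_engine import BaseEngine, Response


class EchoEngine(BaseEngine):
    def __init__(self, model_args, data_args, finetuning_args, generating_args) -> None:
        self.can_generate = True
        self.generating_args = {}

    async def start(self) -> None:
        pass  # nothing to warm up

    async def chat(self, messages, system=None, tools=None, **input_kwargs):
        text = messages[-1]["content"]  # echo the last user turn
        return [Response(response_text=text, response_length=0, prompt_length=0, finish_reason="stop")]

    async def stream_chat(self, messages, system=None, tools=None, **input_kwargs):
        for char in messages[-1]["content"]:
            yield char

    async def get_scores(self, batch_input, **input_kwargs):
        raise NotImplementedError("EchoEngine cannot score.")
```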
src/llmtuner/chat/chat_model.py
@@ -1,124 +1,55 @@
from dataclasses import dataclass
import asyncio
from threading import Thread
from typing import Any, Dict, Generator, List, Literal, Optional, Sequence, Tuple
from typing import TYPE_CHECKING, Any, AsyncGenerator, Dict, Generator, List, Optional, Sequence

import torch
from transformers import GenerationConfig, TextIteratorStreamer

from ..data import get_template_and_fix_tokenizer
from ..extras.misc import get_logits_processor
from ..hparams import get_infer_args
from ..model import dispatch_model, load_model_and_tokenizer
from .hf_engine import HuggingfaceEngine
from .vllm_engine import VllmEngine


@dataclass
class Response:
    response_text: str
    response_length: int
    prompt_length: int
    finish_reason: Literal["stop", "length"]
if TYPE_CHECKING:
    from .base_engine import BaseEngine, Response


def _start_background_loop(loop: asyncio.AbstractEventLoop) -> None:
    asyncio.set_event_loop(loop)
    loop.run_forever()


class ChatModel:
    def __init__(self, args: Optional[Dict[str, Any]] = None) -> None:
        model_args, data_args, finetuning_args, self.generating_args = get_infer_args(args)
        self.can_generate = finetuning_args.stage == "sft"
        self.model, self.tokenizer = load_model_and_tokenizer(
            model_args, finetuning_args, is_trainable=False, add_valuehead=(not self.can_generate)
        )
        self.tokenizer.padding_side = "left" if self.can_generate else "right"
        self.model = dispatch_model(self.model)
        self.template = get_template_and_fix_tokenizer(self.tokenizer, data_args.template)
        model_args, data_args, finetuning_args, generating_args = get_infer_args(args)
        if model_args.infer_backend == "huggingface":
            self.engine: "BaseEngine" = HuggingfaceEngine(model_args, data_args, finetuning_args, generating_args)
        elif model_args.infer_backend == "vllm":
            self.engine: "BaseEngine" = VllmEngine(model_args, data_args, finetuning_args, generating_args)
        else:
            raise NotImplementedError("Unknown backend: {}".format(model_args.infer_backend))

    def _process_args(
        self,
        messages: Sequence[Dict[str, str]],
        system: Optional[str] = None,
        tools: Optional[str] = None,
        **input_kwargs,
    ) -> Tuple[Dict[str, Any], int]:
        paired_messages = messages + [{"role": "assistant", "content": ""}]
        prompt, _ = self.template.encode_oneturn(
            tokenizer=self.tokenizer, messages=paired_messages, system=system, tools=tools
        )
        prompt_length = len(prompt)
        input_ids = torch.tensor([prompt], device=self.model.device)
        self._loop = asyncio.new_event_loop()
        self._thread = Thread(target=_start_background_loop, args=(self._loop,), daemon=True)
        self._thread.start()
        asyncio.run_coroutine_threadsafe(self.engine.start(), self._loop)

        do_sample = input_kwargs.pop("do_sample", None)
        temperature = input_kwargs.pop("temperature", None)
        top_p = input_kwargs.pop("top_p", None)
        top_k = input_kwargs.pop("top_k", None)
        num_return_sequences = input_kwargs.pop("num_return_sequences", None)
        repetition_penalty = input_kwargs.pop("repetition_penalty", None)
        max_length = input_kwargs.pop("max_length", None)
        max_new_tokens = input_kwargs.pop("max_new_tokens", None)

        generating_args = self.generating_args.to_dict()
        generating_args.update(
            dict(
                do_sample=do_sample if do_sample is not None else generating_args["do_sample"],
                temperature=temperature or generating_args["temperature"],
                top_p=top_p or generating_args["top_p"],
                top_k=top_k or generating_args["top_k"],
                num_return_sequences=num_return_sequences or 1,
                repetition_penalty=repetition_penalty or generating_args["repetition_penalty"],
                eos_token_id=[self.tokenizer.eos_token_id] + self.tokenizer.additional_special_tokens_ids,
                pad_token_id=self.tokenizer.pad_token_id,
            )
        )

        if isinstance(num_return_sequences, int) and num_return_sequences > 1:
            generating_args["do_sample"] = True

        if max_length:
            generating_args.pop("max_new_tokens", None)
            generating_args["max_length"] = max_length

        if max_new_tokens:
            generating_args.pop("max_length", None)
            generating_args["max_new_tokens"] = max_new_tokens

        gen_kwargs = dict(
            inputs=input_ids,
            generation_config=GenerationConfig(**generating_args),
            logits_processor=get_logits_processor(),
        )

        return gen_kwargs, prompt_length

    @torch.inference_mode()
    def chat(
        self,
        messages: Sequence[Dict[str, str]],
        system: Optional[str] = None,
        tools: Optional[str] = None,
        **input_kwargs,
    ) -> List[Response]:
        if not self.can_generate:
            raise ValueError("The current model does not support `chat`.")
    ) -> List["Response"]:
        task = asyncio.run_coroutine_threadsafe(self.achat(messages, system, tools, **input_kwargs), self._loop)
        return task.result()

        gen_kwargs, prompt_length = self._process_args(messages, system, tools, **input_kwargs)
        generate_output = self.model.generate(**gen_kwargs)
        response_ids = generate_output[:, prompt_length:]
        response = self.tokenizer.batch_decode(
            response_ids, skip_special_tokens=True, clean_up_tokenization_spaces=True
        )
        results = []
        for i in range(len(response)):
            eos_index = (response_ids[i] == self.tokenizer.eos_token_id).nonzero()
            response_length = (eos_index[0].item() + 1) if len(eos_index) else len(response_ids[i])
            results.append(
                Response(
                    response_text=response[i],
                    response_length=response_length,
                    prompt_length=prompt_length,
                    finish_reason="stop" if len(eos_index) else "length",
                )
            )
    async def achat(
        self,
        messages: Sequence[Dict[str, str]],
        system: Optional[str] = None,
        tools: Optional[str] = None,
        **input_kwargs,
    ) -> List["Response"]:
        return await self.engine.chat(messages, system, tools, **input_kwargs)

        return results

    @torch.inference_mode()
    def stream_chat(
        self,
        messages: Sequence[Dict[str, str]],
@@ -126,44 +57,35 @@ class ChatModel:
        tools: Optional[str] = None,
        **input_kwargs,
    ) -> Generator[str, None, None]:
        if not self.can_generate:
            raise ValueError("The current model does not support `stream_chat`.")
        generator = self.astream_chat(messages, system, tools, **input_kwargs)
        while True:
            try:
                task = asyncio.run_coroutine_threadsafe(generator.__anext__(), self._loop)
                yield task.result()
            except StopAsyncIteration:
                break

        gen_kwargs, _ = self._process_args(messages, system, tools, **input_kwargs)
        streamer = TextIteratorStreamer(self.tokenizer, timeout=60.0, skip_prompt=True, skip_special_tokens=True)
        gen_kwargs["streamer"] = streamer
    async def astream_chat(
        self,
        messages: Sequence[Dict[str, str]],
        system: Optional[str] = None,
        tools: Optional[str] = None,
        **input_kwargs,
    ) -> AsyncGenerator[str, None]:
        async for new_token in self.engine.stream_chat(messages, system, tools, **input_kwargs):
            yield new_token

        thread = Thread(target=self.model.generate, kwargs=gen_kwargs)
        thread.start()
    def get_scores(
        self,
        batch_input: List[str],
        **input_kwargs,
    ) -> List[float]:
        task = asyncio.run_coroutine_threadsafe(self.aget_scores(batch_input, **input_kwargs), self._loop)
        return task.result()

        yield from streamer

    @torch.inference_mode()
    def get_scores(self, batch_input: List[str], **input_kwargs) -> List[float]:
        if self.can_generate:
            raise ValueError("Cannot get scores using an auto-regressive model.")

        max_length = input_kwargs.pop("max_length", None)
        device = getattr(self.model.pretrained_model, "device", "cuda")
        inputs = self.tokenizer(
            batch_input,
            padding=True,
            truncation=True,
            max_length=max_length or getattr(self.model.config, "max_position_embeddings", 1024),
            return_tensors="pt",
            add_special_tokens=True,
        ).to(device)

        input_ids: torch.Tensor = inputs["input_ids"]
        _, _, values = self.model(**inputs, output_hidden_states=True, return_dict=True)

        if getattr(self.model.config, "model_type", None) == "chatglm":
            values = torch.transpose(values, 0, 1)

        scores = []
        for i in range(input_ids.size(0)):
            end_indexes = (input_ids[i] != self.tokenizer.pad_token_id).nonzero()
            end_index = end_indexes[-1].item() if len(end_indexes) else 0
            scores.append(values[i, end_index].nan_to_num().item())

        return scores
    async def aget_scores(
        self,
        batch_input: List[str],
        **input_kwargs,
    ) -> List[float]:
        return await self.engine.get_scores(batch_input, **input_kwargs)

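Taken together, the refactor keeps the synchronous `chat`/`stream_chat` surface while routing everything through the async engine on a background event loop. A minimal usage sketch follows; the model path and template name are placeholders, not values from this changeset:

```python
# Minimal sketch, assuming a local model path and template name; both are
# placeholders, not taken from this changeset.
from llmtuner.chat import ChatModel

chat_model = ChatModel(dict(
    model_name_or_path="path/to/model",  # placeholder
    template="vanilla",
))

# Synchronous call: internally submitted to the background event loop
# via asyncio.run_coroutine_threadsafe, as in the diff above.
responses = chat_model.chat([{"role": "user", "content": "Hello!"}])
print(responses[0].response_text)

# Streaming wraps the async generator the same way, yielding tokens.
for token in chat_model.stream_chat([{"role": "user", "content": "Hello!"}]):
    print(token, end="", flush=True)
```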
src/llmtuner/chat/hf_engine.py (new file, 264 lines)
@@ -0,0 +1,264 @@
import asyncio
import concurrent.futures
import os
from threading import Thread
from typing import TYPE_CHECKING, Any, AsyncGenerator, Callable, Dict, List, Optional, Sequence, Tuple

import torch
from transformers import GenerationConfig, TextIteratorStreamer

from ..data import get_template_and_fix_tokenizer
from ..extras.misc import get_logits_processor
from ..model import load_model, load_tokenizer
from .base_engine import BaseEngine, Response


if TYPE_CHECKING:
    from transformers import PreTrainedModel, PreTrainedTokenizer
    from trl import PreTrainedModelWrapper

    from ..data import Template
    from ..hparams import DataArguments, FinetuningArguments, GeneratingArguments, ModelArguments


class HuggingfaceEngine(BaseEngine):
    def __init__(
        self,
        model_args: "ModelArguments",
        data_args: "DataArguments",
        finetuning_args: "FinetuningArguments",
        generating_args: "GeneratingArguments",
    ) -> None:
        self.can_generate = finetuning_args.stage == "sft"
        self.tokenizer = load_tokenizer(model_args)
        self.tokenizer.padding_side = "left" if self.can_generate else "right"
        self.template = get_template_and_fix_tokenizer(self.tokenizer, data_args.template)
        self.model = load_model(
            self.tokenizer, model_args, finetuning_args, is_trainable=False, add_valuehead=(not self.can_generate)
        )  # must after fixing tokenizer to resize vocab
        self.generating_args = generating_args.to_dict()

    @staticmethod
    def _process_args(
        model: "PreTrainedModel",
        tokenizer: "PreTrainedTokenizer",
        template: "Template",
        generating_args: Dict[str, Any],
        messages: Sequence[Dict[str, str]],
        system: Optional[str] = None,
        tools: Optional[str] = None,
        input_kwargs: Optional[Dict[str, Any]] = {},
    ) -> Tuple[Dict[str, Any], int]:
        paired_messages = messages + [{"role": "assistant", "content": ""}]
        prompt_ids, _ = template.encode_oneturn(
            tokenizer=tokenizer, messages=paired_messages, system=system, tools=tools
        )
        prompt_length = len(prompt_ids)
        inputs = torch.tensor([prompt_ids], device=model.device)

        do_sample = input_kwargs.pop("do_sample", None)
        temperature = input_kwargs.pop("temperature", None)
        top_p = input_kwargs.pop("top_p", None)
        top_k = input_kwargs.pop("top_k", None)
        num_return_sequences = input_kwargs.pop("num_return_sequences", None)
        repetition_penalty = input_kwargs.pop("repetition_penalty", None)
        max_length = input_kwargs.pop("max_length", None)
        max_new_tokens = input_kwargs.pop("max_new_tokens", None)

        generating_args.update(
            dict(
                do_sample=do_sample if do_sample is not None else generating_args["do_sample"],
                temperature=temperature or generating_args["temperature"],
                top_p=top_p or generating_args["top_p"],
                top_k=top_k or generating_args["top_k"],
                num_return_sequences=num_return_sequences or 1,
                repetition_penalty=repetition_penalty or generating_args["repetition_penalty"],
                eos_token_id=[tokenizer.eos_token_id] + tokenizer.additional_special_tokens_ids,
                pad_token_id=tokenizer.pad_token_id,
            )
        )

        if isinstance(num_return_sequences, int) and num_return_sequences > 1:
            generating_args["do_sample"] = True

        if max_length:
            generating_args.pop("max_new_tokens", None)
            generating_args["max_length"] = max_length

        if max_new_tokens:
            generating_args.pop("max_length", None)
            generating_args["max_new_tokens"] = max_new_tokens

        gen_kwargs = dict(
            inputs=inputs,
            generation_config=GenerationConfig(**generating_args),
            logits_processor=get_logits_processor(),
        )

        return gen_kwargs, prompt_length

    @staticmethod
    @torch.inference_mode()
    def _chat(
        model: "PreTrainedModel",
        tokenizer: "PreTrainedTokenizer",
        template: "Template",
        generating_args: Dict[str, Any],
        messages: Sequence[Dict[str, str]],
        system: Optional[str] = None,
        tools: Optional[str] = None,
        input_kwargs: Optional[Dict[str, Any]] = {},
    ) -> List["Response"]:
        gen_kwargs, prompt_length = HuggingfaceEngine._process_args(
            model, tokenizer, template, generating_args, messages, system, tools, input_kwargs
        )
        generate_output = model.generate(**gen_kwargs)
        response_ids = generate_output[:, prompt_length:]
        response = tokenizer.batch_decode(response_ids, skip_special_tokens=True, clean_up_tokenization_spaces=True)
        results = []
        for i in range(len(response)):
            eos_index = (response_ids[i] == tokenizer.eos_token_id).nonzero()
            response_length = (eos_index[0].item() + 1) if len(eos_index) else len(response_ids[i])
            results.append(
                Response(
                    response_text=response[i],
                    response_length=response_length,
                    prompt_length=prompt_length,
                    finish_reason="stop" if len(eos_index) else "length",
                )
            )

        return results

    @staticmethod
    @torch.inference_mode()
    def _stream_chat(
        model: "PreTrainedModel",
        tokenizer: "PreTrainedTokenizer",
        template: "Template",
        generating_args: Dict[str, Any],
        messages: Sequence[Dict[str, str]],
        system: Optional[str] = None,
        tools: Optional[str] = None,
        input_kwargs: Optional[Dict[str, Any]] = {},
    ) -> Callable[[], str]:
        gen_kwargs, _ = HuggingfaceEngine._process_args(
            model, tokenizer, template, generating_args, messages, system, tools, input_kwargs
        )
        streamer = TextIteratorStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
        gen_kwargs["streamer"] = streamer
        thread = Thread(target=model.generate, kwargs=gen_kwargs, daemon=True)
        thread.start()

        def stream():
            try:
                return streamer.__next__()
            except StopIteration:
                raise StopAsyncIteration()

        return stream

    @staticmethod
    @torch.inference_mode()
    def _get_scores(
        model: "PreTrainedModelWrapper",
        tokenizer: "PreTrainedTokenizer",
        batch_input: List[str],
        input_kwargs: Optional[Dict[str, Any]] = {},
    ) -> List[float]:
        max_length = input_kwargs.pop("max_length", None)
        device = getattr(model.pretrained_model, "device", "cuda")
        inputs = tokenizer(
            batch_input,
            padding=True,
            truncation=True,
            max_length=max_length or getattr(model.config, "max_position_embeddings", 1024),
            return_tensors="pt",
            add_special_tokens=True,
        ).to(device)

        input_ids: torch.Tensor = inputs["input_ids"]
        _, _, values = model(**inputs, output_hidden_states=True, return_dict=True)

        if getattr(model.config, "model_type", None) == "chatglm":
            values = torch.transpose(values, 0, 1)

        scores = []
        for i in range(input_ids.size(0)):
            end_indexes = (input_ids[i] != tokenizer.pad_token_id).nonzero()
            end_index = end_indexes[-1].item() if len(end_indexes) else 0
            scores.append(values[i, end_index].nan_to_num().item())

        return scores

    async def start(self) -> None:
        self._semaphore = asyncio.Semaphore(int(os.environ.get("MAX_CONCURRENT", 1)))

    async def chat(
        self,
        messages: Sequence[Dict[str, str]],
        system: Optional[str] = None,
        tools: Optional[str] = None,
        **input_kwargs,
    ) -> List["Response"]:
        if not self.can_generate:
            raise ValueError("The current model does not support `chat`.")

        loop = asyncio.get_running_loop()
        input_args = (
            self.model,
            self.tokenizer,
            self.template,
            self.generating_args,
            messages,
            system,
            tools,
            input_kwargs,
        )
        async with self._semaphore:
            with concurrent.futures.ThreadPoolExecutor() as pool:
                return await loop.run_in_executor(pool, self._chat, *input_args)

    async def stream_chat(
        self,
        messages: Sequence[Dict[str, str]],
        system: Optional[str] = None,
        tools: Optional[str] = None,
        **input_kwargs,
    ) -> AsyncGenerator[str, None]:
        if not self.can_generate:
            raise ValueError("The current model does not support `stream_chat`.")

        loop = asyncio.get_running_loop()
        input_args = (
            self.model,
            self.tokenizer,
            self.template,
            self.generating_args,
            messages,
            system,
            tools,
            input_kwargs,
        )
        async with self._semaphore:
            with concurrent.futures.ThreadPoolExecutor() as pool:
                stream = self._stream_chat(*input_args)
                while True:
                    try:
                        yield await loop.run_in_executor(pool, stream)
                    except StopAsyncIteration:
                        break

    async def get_scores(
        self,
        batch_input: List[str],
        **input_kwargs,
    ) -> List[float]:
        if self.can_generate:
            raise ValueError("Cannot get scores using an auto-regressive model.")

        loop = asyncio.get_running_loop()
        input_args = (self.model, self.tokenizer, batch_input, input_kwargs)
        async with self._semaphore:
            with concurrent.futures.ThreadPoolExecutor() as pool:
                return await loop.run_in_executor(pool, self._get_scores, *input_args)

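The `_stream_chat` helper returns a plain callable that raises `StopAsyncIteration` when the streamer is exhausted, which is what lets the async `stream_chat` drive a blocking `TextIteratorStreamer` from the event loop without stalling it. Stripped of the engine specifics, the pattern looks roughly like this (generic sketch, not code from the changeset):

```python
# Generic sketch of the pattern used above: pump a blocking iterator from an
# event loop by running one next() call per executor hop. Illustrative only.
import asyncio
import concurrent.futures
from typing import AsyncGenerator, Iterator


async def drive_blocking_iterator(it: Iterator[str]) -> AsyncGenerator[str, None]:
    loop = asyncio.get_running_loop()

    def step() -> str:
        try:
            return next(it)  # blocks in the worker thread, not in the loop
        except StopIteration:
            raise StopAsyncIteration()  # translated so the async side can catch it

    with concurrent.futures.ThreadPoolExecutor() as pool:
        while True:
            try:
                yield await loop.run_in_executor(pool, step)
            except StopAsyncIteration:
                break


async def main() -> None:
    async for item in drive_blocking_iterator(iter(["a", "b", "c"])):
        print(item)


asyncio.run(main())
```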
src/llmtuner/chat/vllm_engine.py (new file, 149 lines)
@@ -0,0 +1,149 @@
import uuid
from typing import TYPE_CHECKING, AsyncGenerator, AsyncIterator, Dict, List, Optional, Sequence

from transformers.utils.versions import require_version

from ..data import get_template_and_fix_tokenizer
from ..extras.misc import get_device_count
from ..extras.packages import is_vllm_available
from ..model import load_tokenizer
from .base_engine import BaseEngine, Response


if is_vllm_available():
    from vllm import AsyncEngineArgs, AsyncLLMEngine, RequestOutput, SamplingParams

if TYPE_CHECKING:
    from ..hparams import DataArguments, FinetuningArguments, GeneratingArguments, ModelArguments


class VllmEngine(BaseEngine):
    def __init__(
        self,
        model_args: "ModelArguments",
        data_args: "DataArguments",
        finetuning_args: "FinetuningArguments",
        generating_args: "GeneratingArguments",
    ) -> None:
        require_version("vllm>=0.3.3", "To fix: pip install vllm>=0.3.3")
        self.can_generate = finetuning_args.stage == "sft"
        engine_args = AsyncEngineArgs(
            model=model_args.model_name_or_path,
            trust_remote_code=True,
            max_model_len=model_args.vllm_maxlen,
            tensor_parallel_size=get_device_count() or 1,
            gpu_memory_utilization=model_args.vllm_gpu_util,
            disable_log_stats=True,
            disable_log_requests=True,
            enforce_eager=model_args.vllm_enforce_eager,
        )
        self.model = AsyncLLMEngine.from_engine_args(engine_args)
        self.tokenizer = load_tokenizer(model_args)
        self.tokenizer.padding_side = "left"
        self.template = get_template_and_fix_tokenizer(self.tokenizer, data_args.template)
        self.generating_args = generating_args.to_dict()

    async def _generate(
        self,
        messages: Sequence[Dict[str, str]],
        system: Optional[str] = None,
        tools: Optional[str] = None,
        **input_kwargs,
    ) -> AsyncIterator["RequestOutput"]:
        request_id = "chatcmpl-{}".format(uuid.uuid4().hex)
        paired_messages = messages + [{"role": "assistant", "content": ""}]
        prompt_ids, _ = self.template.encode_oneturn(
            tokenizer=self.tokenizer, messages=paired_messages, system=system, tools=tools
        )
        prompt_length = len(prompt_ids)

        temperature = input_kwargs.pop("temperature", None)
        top_p = input_kwargs.pop("top_p", None)
        top_k = input_kwargs.pop("top_k", None)
        num_return_sequences = input_kwargs.pop("num_return_sequences", None)
        repetition_penalty = input_kwargs.pop("repetition_penalty", None)
        max_length = input_kwargs.pop("max_length", None)
        max_new_tokens = input_kwargs.pop("max_new_tokens", None)

        generating_args = self.generating_args.copy()
        generating_args.update(
            dict(
                temperature=temperature or generating_args["temperature"],
                top_p=top_p or generating_args["top_p"],
                top_k=top_k or generating_args["top_k"],
                num_return_sequences=num_return_sequences or 1,
                repetition_penalty=repetition_penalty or generating_args["repetition_penalty"],
            )
        )

        if max_length:
            generating_args["max_new_tokens"] = max_length - prompt_length

        if max_new_tokens:
            generating_args["max_new_tokens"] = max_new_tokens

        sampling_params = SamplingParams(
            n=generating_args["num_return_sequences"],
            repetition_penalty=generating_args["repetition_penalty"],
            temperature=generating_args["temperature"],
            top_p=generating_args["top_p"],
            top_k=generating_args["top_k"],
            use_beam_search=generating_args["num_beams"] > 1,
            length_penalty=generating_args["length_penalty"],
            stop_token_ids=[self.tokenizer.eos_token_id] + self.tokenizer.additional_special_tokens_ids,
            max_tokens=generating_args["max_new_tokens"],
            skip_special_tokens=True,
        )
        result_generator = self.model.generate(
            prompt=None, sampling_params=sampling_params, request_id=request_id, prompt_token_ids=prompt_ids
        )
        return result_generator

    async def start(self) -> None:
        pass

    async def chat(
        self,
        messages: Sequence[Dict[str, str]],
        system: Optional[str] = None,
        tools: Optional[str] = None,
        **input_kwargs,
    ) -> List["Response"]:
        final_output = None
        generator = await self._generate(messages, system, tools, **input_kwargs)
        async for request_output in generator:
            final_output = request_output

        results = []
        for output in final_output.outputs:
            results.append(
                Response(
                    response_text=output.text,
                    response_length=len(output.token_ids),
                    prompt_length=len(final_output.prompt_token_ids),
                    finish_reason=output.finish_reason,
                )
            )

        return results

    async def stream_chat(
        self,
        messages: Sequence[Dict[str, str]],
        system: Optional[str] = None,
        tools: Optional[str] = None,
        **input_kwargs,
    ) -> AsyncGenerator[str, None]:
        generated_text = ""
        generator = await self._generate(messages, system, tools, **input_kwargs)
        async for result in generator:
            delta_text = result.outputs[0].text[len(generated_text) :]
            generated_text = result.outputs[0].text
            yield delta_text

    async def get_scores(
        self,
        batch_input: List[str],
        **input_kwargs,
    ) -> List[float]:
        raise NotImplementedError("vLLM engine does not support get_scores.")

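vLLM reports the full text generated so far on every iteration, so `stream_chat` derives each incremental chunk by slicing off what was already emitted. A toy run of that delta computation, with invented cumulative strings:

```python
# Toy illustration of the delta slicing in stream_chat above; the cumulative
# strings are invented for demonstration.
cumulative_outputs = ["Hel", "Hello", "Hello, wor", "Hello, world!"]

generated_text = ""
for text in cumulative_outputs:
    delta_text = text[len(generated_text):]
    generated_text = text
    print(repr(delta_text))  # 'Hel', 'lo', ', wor', 'ld!'
```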
src/llmtuner/data/__init__.py
@@ -1,6 +1,15 @@
from .collator import PairwiseDataCollatorWithPadding
from .loader import get_dataset
from .template import get_template_and_fix_tokenizer, templates
from .template import Template, get_template_and_fix_tokenizer, templates
from .utils import Role, split_dataset


__all__ = ["get_dataset", "get_template_and_fix_tokenizer", "templates", "Role", "split_dataset"]
__all__ = [
    "PairwiseDataCollatorWithPadding",
    "get_dataset",
    "Template",
    "get_template_and_fix_tokenizer",
    "templates",
    "Role",
    "split_dataset",
]

src/llmtuner/data/collator.py
@@ -6,12 +6,15 @@ from transformers import DataCollatorForSeq2Seq


@dataclass
class DPODataCollatorWithPadding(DataCollatorForSeq2Seq):
class PairwiseDataCollatorWithPadding(DataCollatorForSeq2Seq):
    r"""
    Data collator for pairwise data.
    """

    def _pad_labels(self, batch: torch.Tensor, positions: List[Tuple[int, int]]) -> torch.Tensor:
        r"""
        Masks out the input ids except for the responses.
        """
        padded_labels = []
        for feature, (prompt_len, answer_len) in zip(batch, positions):
            if self.tokenizer.padding_side == "left":
@@ -43,12 +46,6 @@ class DPODataCollatorWithPadding(DataCollatorForSeq2Seq):
            )
            label_positions.append((prompt_len, answer_len))

        batch = self.tokenizer.pad(
            concatenated_features,
            padding=self.padding,
            max_length=self.max_length,
            pad_to_multiple_of=self.pad_to_multiple_of,
            return_tensors=self.return_tensors,
        )
        batch = super().__call__(concatenated_features)
        batch["labels"] = self._pad_labels(batch["input_ids"], label_positions)
        return batch

src/llmtuner/data/formatter.py
@@ -2,7 +2,7 @@ import json
import re
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import Any, Dict, List, Literal, Sequence, Set, Tuple, Union
from typing import Any, Dict, List, Literal, Optional, Sequence, Set, Tuple, Union


SLOTS = Sequence[Union[str, Set[str], Dict[str, str]]]
@@ -72,11 +72,10 @@ def default_tool_extractor(content: str) -> Union[str, Tuple[str, str]]:
@dataclass
class Formatter(ABC):
    slots: SLOTS = field(default_factory=list)
    tool_format: Literal["default"] = "default"
    tool_format: Optional[Literal["default"]] = None

    @abstractmethod
    def apply(self, **kwargs) -> SLOTS:
        ...
    def apply(self, **kwargs) -> SLOTS: ...

    def extract(self, content: str) -> Union[str, Tuple[str, str]]:
        raise NotImplementedError
@@ -84,12 +83,30 @@ class Formatter(ABC):

@dataclass
class EmptyFormatter(Formatter):
    def __post_init__(self):
        has_placeholder = False
        for slot in filter(lambda s: isinstance(s, str), self.slots):
            if re.search(r"\{\{[a-zA-Z_][a-zA-Z0-9_]*\}\}", slot):
                has_placeholder = True

        if has_placeholder:
            raise ValueError("Empty formatter should not contain any placeholder.")

    def apply(self, **kwargs) -> SLOTS:
        return self.slots


@dataclass
class StringFormatter(Formatter):
    def __post_init__(self):
        has_placeholder = False
        for slot in filter(lambda s: isinstance(s, str), self.slots):
            if re.search(r"\{\{[a-zA-Z_][a-zA-Z0-9_]*\}\}", slot):
                has_placeholder = True

        if not has_placeholder:
            raise ValueError("A placeholder is required in the string formatter.")

    def apply(self, **kwargs) -> SLOTS:
        elements = []
        for slot in self.slots:
@@ -110,6 +127,17 @@ class StringFormatter(Formatter):

@dataclass
class FunctionFormatter(Formatter):
    def __post_init__(self):
        has_name, has_args = False, False
        for slot in filter(lambda s: isinstance(s, str), self.slots):
            if "{{name}}" in slot:
                has_name = True
            if "{{arguments}}" in slot:
                has_args = True

        if not has_name or not has_args:
            raise ValueError("Name and arguments placeholders are required in the function formatter.")

    def apply(self, **kwargs) -> SLOTS:
        content = kwargs.pop("content")
        try:
@@ -134,6 +162,10 @@ class FunctionFormatter(Formatter):

@dataclass
class ToolFormatter(Formatter):
    def __post_init__(self):
        if self.tool_format is None:
            raise ValueError("Tool format was not found.")

    def apply(self, **kwargs) -> SLOTS:
        content = kwargs.pop("content")
        try:

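The new `__post_init__` hooks turn template mistakes into immediate construction-time errors rather than silent formatting bugs. For instance, based directly on the validation logic above:

```python
# Illustrative snippet: the __post_init__ validation above rejects a
# StringFormatter without a {{...}} placeholder at construction time.
from llmtuner.data.formatter import StringFormatter

StringFormatter(slots=["User: {{content}}\nAssistant: "])  # ok, has a placeholder

try:
    StringFormatter(slots=["User: \nAssistant: "])  # no placeholder
except ValueError as err:
    print(err)  # "A placeholder is required in the string formatter."
```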
src/llmtuner/data/loader.py
@@ -1,16 +1,17 @@
import inspect
import os
from typing import TYPE_CHECKING, List, Literal, Union
from typing import TYPE_CHECKING, Literal, Union

from datasets import concatenate_datasets, interleave_datasets, load_dataset, load_from_disk
from datasets import load_dataset, load_from_disk

from ..extras.constants import FILEEXT2TYPE
from ..extras.logging import get_logger
from ..extras.misc import has_tokenized_data
from .aligner import align_dataset
from .parser import get_dataset_list
from .preprocess import get_preprocess_and_print_func
from .template import get_template_and_fix_tokenizer
from .utils import checksum
from .utils import checksum, merge_dataset


if TYPE_CHECKING:
@@ -29,7 +30,7 @@ def load_single_dataset(
    dataset_attr: "DatasetAttr",
    model_args: "ModelArguments",
    data_args: "DataArguments",
):
) -> Union["Dataset", "IterableDataset"]:
    logger.info("Loading dataset {}...".format(dataset_attr))
    data_path, data_name, data_dir, data_files = None, None, None, None
    if dataset_attr.load_from in ["hf_hub", "ms_hub"]:
@@ -44,7 +45,7 @@ def load_single_dataset(

    elif dataset_attr.load_from == "file":
        data_files = []
        local_path: str = os.path.join(data_args.dataset_dir, dataset_attr.dataset_name)
        local_path = os.path.join(data_args.dataset_dir, dataset_attr.dataset_name)
        if os.path.isdir(local_path):  # is directory
            for file_name in os.listdir(local_path):
                data_files.append(os.path.join(local_path, file_name))
@@ -80,7 +81,9 @@ def load_single_dataset(
                cache_dir=cache_dir,
                token=model_args.ms_hub_token,
                use_streaming=(data_args.streaming and (dataset_attr.load_from != "file")),
            ).to_hf_dataset()
            )
            if isinstance(dataset, MsDataset):
                dataset = dataset.to_hf_dataset()
        except ImportError:
            raise ImportError("Please install modelscope via `pip install modelscope -U`")
    else:
@@ -111,54 +114,36 @@ def load_single_dataset(
    return align_dataset(dataset, dataset_attr, data_args)


def merge_dataset(
    all_datasets: List[Union["Dataset", "IterableDataset"]],
    data_args: "DataArguments",
    training_args: "Seq2SeqTrainingArguments",
) -> Union["Dataset", "IterableDataset"]:
    if len(all_datasets) == 1:
        return all_datasets[0]
    elif data_args.mix_strategy == "concat":
        if data_args.streaming:
            logger.warning("The samples between different datasets will not be mixed in streaming mode.")
        return concatenate_datasets(all_datasets)
    elif data_args.mix_strategy.startswith("interleave"):
        if not data_args.streaming:
            logger.warning("We recommend using `mix_strategy=concat` in non-streaming mode.")
        return interleave_datasets(
            datasets=all_datasets,
            probabilities=data_args.interleave_probs,
            seed=training_args.seed,
            stopping_strategy="first_exhausted" if data_args.mix_strategy.endswith("under") else "all_exhausted",
        )
    else:
        raise ValueError("Unknown mixing strategy.")


def get_dataset(
    tokenizer: "PreTrainedTokenizer",
    model_args: "ModelArguments",
    data_args: "DataArguments",
    training_args: "Seq2SeqTrainingArguments",
    stage: Literal["pt", "sft", "rm", "ppo"],
    # split: Optional[str] = "train", # TODO: add split
) -> Union["Dataset", "IterableDataset"]:
    template = get_template_and_fix_tokenizer(tokenizer, data_args.template)
    if data_args.train_on_prompt and template.efficient_eos:
        raise ValueError("Current template does not support `train_on_prompt`.")

    # Load from cache
    if data_args.cache_path is not None:
        if os.path.exists(data_args.cache_path):
    # Load tokenized dataset
    if data_args.tokenized_path is not None:
        if has_tokenized_data(data_args.tokenized_path):
            logger.warning("Loading dataset from disk will ignore other data arguments.")
            dataset = load_from_disk(data_args.cache_path)
            dataset = load_from_disk(data_args.tokenized_path)
            logger.info("Loaded tokenized dataset from {}.".format(data_args.tokenized_path))
            if data_args.streaming:
                dataset = dataset.to_iterable_dataset()
            return dataset

        if data_args.streaming:
            raise ValueError("Turn off `streaming` when saving dataset to disk.")

    with training_args.main_process_first(desc="load dataset"):
        all_datasets = []
        for dataset_attr in get_dataset_list(data_args):
            if (stage == "rm" and dataset_attr.ranking is False) or (stage != "rm" and dataset_attr.ranking is True):
                raise ValueError("The dataset is not applicable in the current training stage.")

            all_datasets.append(load_single_dataset(dataset_attr, model_args, data_args))
        dataset = merge_dataset(all_datasets, data_args, training_args)

@@ -177,10 +162,13 @@ def get_dataset(

        dataset = dataset.map(preprocess_func, batched=True, remove_columns=column_names, **kwargs)

        if data_args.cache_path is not None and not os.path.exists(data_args.cache_path):
        if data_args.tokenized_path is not None:
            if training_args.should_save:
                dataset.save_to_disk(data_args.cache_path)
                logger.info("Dataset cache saved at {}.".format(data_args.cache_path))
                dataset.save_to_disk(data_args.tokenized_path)
                logger.info("Tokenized dataset saved at {}.".format(data_args.tokenized_path))
                logger.info("Please restart the training with `--tokenized_path {}`.".format(data_args.tokenized_path))

            exit(0)

        if training_args.should_log:
            try:

src/llmtuner/data/parser.py
@@ -19,13 +19,13 @@ class DatasetAttr:

    """ basic configs """
    load_from: Literal["hf_hub", "ms_hub", "script", "file"]
    dataset_name: Optional[str] = None
    dataset_name: str
    """ extra configs """
    file_sha1: Optional[str] = None
    subset: Optional[str] = None
    folder: Optional[str] = None
    ranking: Optional[bool] = False
    formatting: Optional[Literal["alpaca", "sharegpt"]] = "alpaca"
    ranking: bool = False
    formatting: Literal["alpaca", "sharegpt"] = "alpaca"
    """ columns """
    system: Optional[str] = None
    """ columns for the alpaca format """
@@ -53,22 +53,35 @@ class DatasetAttr:


def get_dataset_list(data_args: "DataArguments") -> List["DatasetAttr"]:
    dataset_names = [ds.strip() for ds in data_args.dataset.split(",")] if data_args.dataset is not None else []
    try:
        with open(os.path.join(data_args.dataset_dir, DATA_CONFIG), "r") as f:
            dataset_info = json.load(f)
    except Exception as err:
        if data_args.dataset is not None:
            raise ValueError(
                "Cannot open {} due to {}.".format(os.path.join(data_args.dataset_dir, DATA_CONFIG), str(err))
            )
    if data_args.dataset is not None:
        dataset_names = [ds.strip() for ds in data_args.dataset.split(",")]
    else:
        dataset_names = []

    if data_args.dataset_dir == "ONLINE":
        dataset_info = None
    else:
        try:
            with open(os.path.join(data_args.dataset_dir, DATA_CONFIG), "r") as f:
                dataset_info = json.load(f)
        except Exception as err:
            if len(dataset_names) != 0:
                raise ValueError(
                    "Cannot open {} due to {}.".format(os.path.join(data_args.dataset_dir, DATA_CONFIG), str(err))
                )
            dataset_info = None

    if data_args.interleave_probs is not None:
        data_args.interleave_probs = [float(prob.strip()) for prob in data_args.interleave_probs.split(",")]

    dataset_list: List[DatasetAttr] = []
    for name in dataset_names:
        if dataset_info is None:
            load_from = "ms_hub" if use_modelscope() else "hf_hub"
            dataset_attr = DatasetAttr(load_from, dataset_name=name)
            dataset_list.append(dataset_attr)
            continue

        if name not in dataset_info:
            raise ValueError("Undefined dataset {} in {}.".format(name, DATA_CONFIG))

src/llmtuner/data/preprocess.py
@@ -21,19 +21,28 @@ logger = get_logger(__name__)
def preprocess_pretrain_dataset(
    examples: Dict[str, List[Any]], tokenizer: "PreTrainedTokenizer", data_args: "DataArguments"
) -> Dict[str, List[List[int]]]:
    # build grouped texts with format `X1 X2 X3 ...`
    # build grouped texts with format `X1 X2 X3 ...` if packing is enabled
    text_examples = [messages[0]["content"] + tokenizer.eos_token for messages in examples["prompt"]]
    tokenized_examples = tokenizer(text_examples, add_special_tokens=False)
    concatenated_examples = {k: list(chain(*tokenized_examples[k])) for k in tokenized_examples.keys()}
    total_length = len(concatenated_examples[list(concatenated_examples.keys())[0]])
    block_size = data_args.cutoff_len
    # we drop the small remainder, and if the total_length < block_size, we exclude this batch
    total_length = (total_length // block_size) * block_size
    # split by chunks of cutoff_len
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated_examples.items()
    }

    if not data_args.packing:
        if data_args.template == "gemma":
            text_examples = [tokenizer.bos_token + example for example in text_examples]

        result = tokenizer(text_examples, add_special_tokens=False, max_length=data_args.cutoff_len)
    else:
        tokenized_examples = tokenizer(text_examples, add_special_tokens=False)
        concatenated_examples = {k: list(chain(*tokenized_examples[k])) for k in tokenized_examples.keys()}
        total_length = len(concatenated_examples[list(concatenated_examples.keys())[0]])
        block_size = data_args.cutoff_len
        total_length = (total_length // block_size) * block_size
        result = {
            k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
            for k, t in concatenated_examples.items()
        }
        if data_args.template == "gemma":
            for i in range(len(result["input_ids"])):
                result["input_ids"][i][0] = tokenizer.bos_token_id

    return result


@@ -245,7 +254,7 @@ def get_preprocess_and_print_func(
        preprocess_func = partial(preprocess_pretrain_dataset, tokenizer=tokenizer, data_args=data_args)
        print_function = partial(print_unsupervised_dataset_example, tokenizer=tokenizer)
    elif stage == "sft" and not training_args.predict_with_generate:
        if data_args.sft_packing:
        if data_args.packing:
            preprocess_func = partial(
                preprocess_packed_supervised_dataset, tokenizer=tokenizer, template=template, data_args=data_args
            )

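When packing is enabled, all tokenized examples are concatenated and re-cut into fixed `cutoff_len` blocks, dropping the tail remainder. A toy run of that chunking, with invented token ids and a small block size:

```python
# Toy illustration of the packing branch above, with invented token ids;
# real runs use data_args.cutoff_len as the block size.
from itertools import chain

tokenized = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]   # three tokenized examples
concatenated = list(chain(*tokenized))          # [1, 2, 3, 4, 5, 6, 7, 8, 9]

block_size = 4
total_length = (len(concatenated) // block_size) * block_size  # 8: remainder dropped

blocks = [concatenated[i : i + block_size] for i in range(0, total_length, block_size)]
print(blocks)  # [[1, 2, 3, 4], [5, 6, 7, 8]] -- token 9 is discarded
```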
src/llmtuner/data/template.py
@@ -9,7 +9,7 @@ from .utils import Role, infer_max_len
if TYPE_CHECKING:
    from transformers import PreTrainedTokenizer

    from .formatter import Formatter
    from .formatter import SLOTS, Formatter


logger = get_logger(__name__)
@@ -36,8 +36,8 @@ class Template:
        messages: List[Dict[str, str]],
        system: Optional[str] = None,
        tools: Optional[str] = None,
        cutoff_len: Optional[int] = 1_000_000,
        reserved_label_len: Optional[int] = 1,
        cutoff_len: int = 1_000_000,
        reserved_label_len: int = 1,
    ) -> Tuple[List[int], List[int]]:
        r"""
        Returns a single pair of token ids representing prompt and response respectively.
@@ -56,8 +56,8 @@ class Template:
        messages: List[Dict[str, str]],
        system: Optional[str] = None,
        tools: Optional[str] = None,
        cutoff_len: Optional[int] = 1_000_000,
        reserved_label_len: Optional[int] = 1,
        cutoff_len: int = 1_000_000,
        reserved_label_len: int = 1,
    ) -> Sequence[Tuple[List[int], List[int]]]:
        r"""
        Returns multiple pairs of token ids representing prompts and responses respectively.
@@ -207,12 +207,38 @@ def _register_template(
    format_observation: Optional["Formatter"] = None,
    format_tools: Optional["Formatter"] = None,
    format_separator: Optional["Formatter"] = None,
    default_system: Optional[str] = "",
    stop_words: Optional[List[str]] = [],
    efficient_eos: Optional[bool] = False,
    replace_eos: Optional[bool] = False,
    force_system: Optional[bool] = False,
    default_system: str = "",
    stop_words: List[str] = [],
    efficient_eos: bool = False,
    replace_eos: bool = False,
    force_system: bool = False,
) -> None:
    r"""
    Registers a chat template.

    To add the following chat template:
    ```
    [HUMAN]:
    user prompt here
    [AI]:
    model response here

    [HUMAN]:
    user prompt here
    [AI]:
    model response here
    ```

    The corresponding code should be:
    ```
    _register_template(
        name="custom",
        format_user=StringFormatter(slots=["[HUMAN]:\n{{content}}\n[AI]:\n"]),
        format_separator=EmptyFormatter(slots=["\n\n"]),
        efficient_eos=True,
    )
    ```
    """
    eos_slots = [] if efficient_eos else [{"eos_token"}]
    template_class = Llama2Template if name.startswith("llama2") else Template
    default_user_formatter = StringFormatter(slots=["{{content}}"])
@@ -238,18 +264,80 @@ def _register_template(

def _add_or_replace_eos_token(tokenizer: "PreTrainedTokenizer", eos_token: str) -> None:
    is_added = tokenizer.eos_token_id is None
    is_oov = eos_token not in tokenizer.get_vocab()
    tokenizer.add_special_tokens({"eos_token": eos_token})
    num_added_tokens = tokenizer.add_special_tokens({"eos_token": eos_token})

    if is_added:
        logger.info("Add eos token: {}".format(tokenizer.eos_token))
    else:
        logger.info("Replace eos token: {}".format(tokenizer.eos_token))

    if is_oov:
    if num_added_tokens > 0:
        logger.warning("New tokens have been added, make sure `resize_vocab` is True.")


def _jinja_escape(content: str) -> str:
    return content.replace("\n", r"\n").replace("'", r"\'")


def _convert_slots_to_jinja(slots: "SLOTS", tokenizer: "PreTrainedTokenizer", placeholder: str = "content") -> str:
    slot_items = []
    for slot in slots:
        if isinstance(slot, str):
            slot_pieces = slot.split("{{content}}")
            if slot_pieces[0]:
                slot_items.append("'" + _jinja_escape(slot_pieces[0]) + "'")
            if len(slot_pieces) > 1:
                slot_items.append(placeholder)
                if slot_pieces[1]:
                    slot_items.append("'" + _jinja_escape(slot_pieces[1]) + "'")
        elif isinstance(slot, set):
            if "bos_token" in slot:
                slot_items.append("'" + tokenizer.bos_token + "'")
            elif "eos_token" in slot:  # do not use {{ eos_token }} since it may be replaced
                slot_items.append("'" + tokenizer.eos_token + "'")
        elif isinstance(slot, dict):
            raise ValueError("Dict is not supported.")

    return " + ".join(slot_items)


def _get_jinja_template(template: "Template", tokenizer: "PreTrainedTokenizer") -> str:
    jinja_template = ""

    if template.default_system:
        jinja_template += "{% set system_message = '" + _jinja_escape(template.default_system) + "' %}"

    jinja_template += (
        "{% if messages[0]['role'] == 'system' %}" "{% set system_message = messages[0]['content'] %}" "{% endif %}"
    )

    system_message = _convert_slots_to_jinja(template.format_system.apply(), tokenizer, placeholder="system_message")
    if isinstance(template, Llama2Template):
        pass
    elif template.force_system:
        jinja_template += "{{ " + system_message + " }}"
    else:
        jinja_template += "{% if system_message is defined %}{{ " + system_message + " }}{% endif %}"

    jinja_template += "{% for message in messages %}"
    jinja_template += "{% set content = message['content'] %}"
    if isinstance(template, Llama2Template):
        jinja_template += "{% if loop.index0 == 0 and system_message is defined %}"
        jinja_template += "{% set content = " + system_message + " + message['content'] %}"
        jinja_template += "{% endif %}"
    jinja_template += "{% if message['role'] == 'user' %}"
    user_message = _convert_slots_to_jinja(template.format_user.apply(), tokenizer)
    jinja_template += "{{ " + user_message + " }}"
    jinja_template += "{% elif message['role'] == 'assistant' %}"
    assistant_message = _convert_slots_to_jinja(
        template.format_assistant.apply() + template.format_separator.apply(), tokenizer
    )
    jinja_template += "{{ " + assistant_message + " }}"
    jinja_template += "{% endif %}"
    jinja_template += "{% endfor %}"
    return jinja_template


def get_template_and_fix_tokenizer(
    tokenizer: "PreTrainedTokenizer",
    name: Optional[str] = None,
@@ -277,10 +365,17 @@ def get_template_and_fix_tokenizer(
        logger.info("Add pad token: {}".format(tokenizer.pad_token))

    if stop_words:
        tokenizer.add_special_tokens(
        num_added_tokens = tokenizer.add_special_tokens(
            dict(additional_special_tokens=stop_words), replace_additional_special_tokens=False
        )
        logger.info("Add {} to stop words.".format(",".join(stop_words)))
        if num_added_tokens > 0:
            logger.warning("New tokens have been added, make sure `resize_vocab` is True.")

    try:
        tokenizer.chat_template = _get_jinja_template(template, tokenizer)
    except ValueError:
        logger.info("Cannot add this chat template to tokenizer.")

    return template

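Because the registered template is now mirrored into `tokenizer.chat_template`, the tokenizer can render conversations on its own via the standard Transformers API. A sketch, where the model path and template name are placeholders:

```python
# Sketch: once get_template_and_fix_tokenizer has set tokenizer.chat_template,
# the standard Transformers API renders conversations; the model path and
# template name below are placeholders, not values from this changeset.
from transformers import AutoTokenizer

from llmtuner.data import get_template_and_fix_tokenizer

tokenizer = AutoTokenizer.from_pretrained("path/to/model")  # placeholder
get_template_and_fix_tokenizer(tokenizer, name="vanilla")

messages = [
    {"role": "user", "content": "Hi!"},
    {"role": "assistant", "content": "Hello, how can I help?"},
]
print(tokenizer.apply_chat_template(messages, tokenize=False))
```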
@@ -326,7 +421,7 @@ _register_template(

_register_template(
    name="baichuan2",
    format_user=StringFormatter(slots=[{"token": "<reserved_106>"}, "{{content}}", {"token": "<reserved_107>"}]),
    format_user=StringFormatter(slots=["<reserved_106>{{content}}<reserved_107>"]),
    efficient_eos=True,
)

@@ -346,6 +441,18 @@ _register_template(
)


_register_template(
    name="breeze",
    format_user=StringFormatter(slots=["[INST] {{content}} [/INST] "]),
    format_system=StringFormatter(slots=[{"bos_token"}, "{{content}}"]),
    default_system=(
        "You are a helpful AI assistant built by MediaTek Research. "
        "The user you are helping speaks Traditional Chinese and comes from Taiwan."
    ),
    efficient_eos=True,
)


_register_template(
    name="chatglm2",
    format_user=StringFormatter(slots=["[Round {{idx}}]\n\n问:{{content}}\n\n答:"]),

@@ -360,7 +467,7 @@ _register_template(
    name="chatglm3",
    format_user=StringFormatter(slots=[{"token": "<|user|>"}, "\n", "{{content}}", {"token": "<|assistant|>"}]),
    format_assistant=StringFormatter(slots=["\n", "{{content}}"]),
    format_system=StringFormatter(slots=[{"token": "[gMASK]"}, {"token": "sop"}]),
    format_system=StringFormatter(slots=[{"token": "[gMASK]"}, {"token": "sop"}, "{{content}}"]),
    format_function=FunctionFormatter(slots=["{{name}}\n{{arguments}}"]),
    format_observation=StringFormatter(
        slots=[{"token": "<|observation|>"}, "\n", "{{content}}", {"token": "<|assistant|>"}]

@@ -439,7 +546,7 @@ _register_template(
    name="deepseekcoder",
    format_user=StringFormatter(slots=["### Instruction:\n{{content}}\n### Response:"]),
    format_assistant=StringFormatter(slots=["\n", "{{content}}"]),
    format_separator=EmptyFormatter(slots=["\n", {"token": "<|EOT|>"}, "\n"]),
    format_separator=EmptyFormatter(slots=["\n<|EOT|>\n"]),
    default_system=(
        "You are an AI programming assistant, utilizing the Deepseek Coder model, "
        "developed by Deepseek Company, and you only answer questions related to computer science. "

@@ -536,6 +643,15 @@ _register_template(
)


_register_template(
    name="olmo",
    format_user=StringFormatter(slots=["<|user|>\n{{content}}<|assistant|>"]),
    format_assistant=StringFormatter(slots=["{{content}}", {"eos_token"}]),
    format_system=StringFormatter(slots=[{"eos_token"}, "{{content}}"]),
    force_system=True,
)


_register_template(
    name="openchat",
    format_user=StringFormatter(slots=["GPT4 Correct User: {{content}}", {"eos_token"}, "GPT4 Correct Assistant:"]),

@@ -574,10 +690,8 @@ _register_template(

_register_template(
    name="starchat",
    format_user=StringFormatter(
        slots=[{"token": "<|user|>"}, "\n{{content}}", {"token": "<|end|>"}, "\n", {"token": "<|assistant|>"}]
    ),
    format_system=StringFormatter(slots=[{"token": "<|system|>"}, "\n{{content}}", {"token": "<|end|>"}, "\n"]),
    format_user=StringFormatter(slots=["<|user|>\n{{content}}<|end|>\n<|assistant|>"]),
    format_system=StringFormatter(slots=["<|system|>\n{{content}}<|end|>\n"]),
    format_separator=EmptyFormatter(slots=["\n"]),
    stop_words=["<|end|>"],
    replace_eos=True,

@@ -587,6 +701,8 @@ _register_template(

_register_template(
    name="vanilla",
    format_separator=EmptyFormatter(slots=["\n"]),
    efficient_eos=True,
)

@@ -658,6 +774,7 @@ _register_template(

_register_template(
    name="zephyr",
    format_user=StringFormatter(slots=["<|user|>\n{{content}}", {"eos_token"}, "<|assistant|>"]),
    format_assistant=StringFormatter(slots=["\n{{content}}", {"eos_token"}]),
    format_system=StringFormatter(slots=["<|system|>\n{{content}}", {"eos_token"}]),
    default_system="You are a friendly chatbot who always responds in the style of a pirate",
)

@@ -665,6 +782,6 @@ _register_template(

_register_template(
    name="ziya",
    format_user=StringFormatter(slots=[{"token": "<human>"}, ":{{content}}\n", {"token": "<bot>"}, ":"]),
    format_user=StringFormatter(slots=["<human>:{{content}}\n<bot>:"]),
    format_separator=EmptyFormatter(slots=["\n"]),
)
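A recurring change in these registrations is the collapse of `{"token": ...}` slots into plain strings. The sketch below is a stand-in for the repo's formatter (not its real implementation); it shows that both spellings of the `baichuan2` user turn render the same surface string:

```python
def render_user_slots(slots, content: str) -> str:
    # stand-in renderer: strings carry a {{content}} placeholder,
    # dict slots insert a literal special token
    pieces = []
    for slot in slots:
        if isinstance(slot, str):
            pieces.append(slot.replace("{{content}}", content))
        elif isinstance(slot, dict) and "token" in slot:
            pieces.append(slot["token"])
    return "".join(pieces)

old_slots = [{"token": "<reserved_106>"}, "{{content}}", {"token": "<reserved_107>"}]
new_slots = ["<reserved_106>{{content}}<reserved_107>"]
assert render_user_slots(old_slots, "hi") == render_user_slots(new_slots, "hi")
```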
@@ -2,12 +2,14 @@ import hashlib
from enum import Enum, unique
from typing import TYPE_CHECKING, Dict, List, Optional, Tuple, Union

from datasets import concatenate_datasets, interleave_datasets

from ..extras.logging import get_logger


if TYPE_CHECKING:
    from datasets import Dataset, IterableDataset
    from transformers import TrainingArguments
    from transformers import Seq2SeqTrainingArguments

    from llmtuner.hparams import DataArguments

@@ -42,12 +44,36 @@ def checksum(data_files: List[str], file_sha1: Optional[str] = None) -> None:

def infer_max_len(source_len: int, target_len: int, max_len: int, reserved_label_len: int) -> Tuple[int, int]:
    max_target_len = int(max_len * (target_len / (source_len + target_len)))
    max_target_len = max(max_target_len, reserved_label_len)
    max_source_len = max_len - max_target_len
    max_source_len = max_len - min(max_target_len, target_len)
    return max_source_len, max_target_len
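The change to `infer_max_len` stops the length budget from over-reserving room for short targets: the amount subtracted from the source budget is now capped at the target's real length. A worked example with made-up lengths:

```python
def infer_max_len(source_len, target_len, max_len, reserved_label_len):
    max_target_len = int(max_len * (target_len / (source_len + target_len)))
    max_target_len = max(max_target_len, reserved_label_len)
    max_source_len = max_len - min(max_target_len, target_len)  # the fixed line
    return max_source_len, max_target_len

# With a 10-token target and reserved_label_len=64, the old formula gave the
# source only 1024 - 64 = 960 tokens; the fix subtracts min(64, 10) instead:
print(infer_max_len(source_len=2000, target_len=10, max_len=1024, reserved_label_len=64))
# -> (1014, 64)
```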
def merge_dataset(
    all_datasets: List[Union["Dataset", "IterableDataset"]],
    data_args: "DataArguments",
    training_args: "Seq2SeqTrainingArguments",
) -> Union["Dataset", "IterableDataset"]:
    if len(all_datasets) == 1:
        return all_datasets[0]
    elif data_args.mix_strategy == "concat":
        if data_args.streaming:
            logger.warning("The samples between different datasets will not be mixed in streaming mode.")
        return concatenate_datasets(all_datasets)
    elif data_args.mix_strategy.startswith("interleave"):
        if not data_args.streaming:
            logger.warning("We recommend using `mix_strategy=concat` in non-streaming mode.")
        return interleave_datasets(
            datasets=all_datasets,
            probabilities=data_args.interleave_probs,
            seed=training_args.seed,
            stopping_strategy="first_exhausted" if data_args.mix_strategy.endswith("under") else "all_exhausted",
        )
    else:
        raise ValueError("Unknown mixing strategy.")


def split_dataset(
    dataset: Union["Dataset", "IterableDataset"], data_args: "DataArguments", training_args: "TrainingArguments"
    dataset: Union["Dataset", "IterableDataset"], data_args: "DataArguments", training_args: "Seq2SeqTrainingArguments"
) -> Dict[str, "Dataset"]:
    if training_args.do_train:
        if data_args.val_size > 1e-6:  # Split the dataset
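A hedged usage sketch of the two mixing families handled by `merge_dataset`, using toy in-memory datasets (the column name `text` is illustrative):

```python
from datasets import Dataset, concatenate_datasets, interleave_datasets

ds_a = Dataset.from_dict({"text": ["a1", "a2", "a3"]})
ds_b = Dataset.from_dict({"text": ["b1"]})

# mix_strategy="concat": simple concatenation, order preserved
print(concatenate_datasets([ds_a, ds_b])["text"])

# mix_strategy="interleave_under": stop once the smallest dataset is exhausted
under = interleave_datasets([ds_a, ds_b], probabilities=[0.5, 0.5], seed=42,
                            stopping_strategy="first_exhausted")
# mix_strategy="interleave_over": oversample until the largest is exhausted
over = interleave_datasets([ds_a, ds_b], probabilities=[0.5, 0.5], seed=42,
                           stopping_strategy="all_exhausted")
print(len(under), len(over))
```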
@@ -14,17 +14,17 @@ from transformers.utils import cached_file
from ..data import get_template_and_fix_tokenizer
from ..extras.constants import CHOICES, SUBJECTS
from ..hparams import get_eval_args
from ..model import dispatch_model, load_model_and_tokenizer
from ..model import load_model, load_tokenizer
from .template import get_eval_template


class Evaluator:
    def __init__(self, args: Optional[Dict[str, Any]] = None) -> None:
        self.model_args, self.data_args, self.eval_args, finetuning_args = get_eval_args(args)
        self.model, self.tokenizer = load_model_and_tokenizer(self.model_args, finetuning_args)
        self.tokenizer = load_tokenizer(self.model_args)
        self.tokenizer.padding_side = "right"  # avoid overflow issue in batched inference for llama2
        self.model = dispatch_model(self.model)
        self.template = get_template_and_fix_tokenizer(self.tokenizer, self.data_args.template)
        self.model = load_model(self.tokenizer, self.model_args, finetuning_args)
        self.eval_template = get_eval_template(self.eval_args.lang)
        self.choice_inputs = [
            self.tokenizer.encode(self.eval_template.prefix + ch, add_special_tokens=False)[-1] for ch in CHOICES
@@ -1,14 +1,10 @@
from dataclasses import dataclass
from typing import TYPE_CHECKING, Dict, List, Tuple
from typing import Dict, List, Sequence, Tuple

from ..data import Role
from ..extras.constants import CHOICES


if TYPE_CHECKING:
    from datasets import Dataset


@dataclass
class EvalTemplate:
    system: str

@@ -16,22 +12,29 @@ class EvalTemplate:
    answer: str
    prefix: str

    def parse_example(self, example: Dict[str, str]) -> Tuple[str, str]:
    def _parse_example(self, example: Dict[str, str]) -> Tuple[str, str]:
        r"""
        input: a dict with keys {"question", "A", "B", "C", "D", "answer"}
        output: a tuple of (prompt, response)
        """
        candidates = [self.choice.format(choice=ch, content=example[ch]) for ch in CHOICES if ch in example]
        return "".join([example["question"]] + candidates + [self.answer]), example["answer"]

    def format_example(
        self, target_data: Dict[str, str], support_set: "Dataset", subject_name: str
        self, target_data: Dict[str, str], support_set: Sequence[Dict[str, str]], subject_name: str
    ) -> List[Dict[str, str]]:
        r"""
        Converts dataset examples to messages.
        """
        messages = []
        for k in range(len(support_set)):
            prompt, response = self.parse_example(support_set[k])
            messages.append({"role": Role.USER, "content": prompt})
            messages.append({"role": Role.ASSISTANT, "content": response})
            prompt, response = self._parse_example(support_set[k])
            messages.append({"role": Role.USER.value, "content": prompt})
            messages.append({"role": Role.ASSISTANT.value, "content": response})

        prompt, response = self.parse_example(target_data)
        messages.append({"role": Role.USER, "content": prompt})
        messages.append({"role": Role.ASSISTANT, "content": response})
        prompt, response = self._parse_example(target_data)
        messages.append({"role": Role.USER.value, "content": prompt})
        messages.append({"role": Role.ASSISTANT.value, "content": response})
        messages[0]["content"] = self.system.format(subject=subject_name) + messages[0]["content"]
        return messages

@@ -39,7 +42,7 @@ class EvalTemplate:
eval_templates: Dict[str, "EvalTemplate"] = {}


def register_eval_template(name: str, system: str, choice: str, answer: str, prefix: str) -> None:
def _register_eval_template(name: str, system: str, choice: str, answer: str, prefix: str) -> None:
    eval_templates[name] = EvalTemplate(system=system, choice=choice, answer=answer, prefix=prefix)


@@ -49,7 +52,7 @@ def get_eval_template(name: str) -> "EvalTemplate":
    return eval_template


register_eval_template(
_register_eval_template(
    name="en",
    system="The following are multiple choice questions (with answers) about {subject}.\n\n",
    choice="\n{choice}. {content}",

@@ -58,10 +61,10 @@ register_eval_template(
)


register_eval_template(
_register_eval_template(
    name="zh",
    system="以下是中国关于{subject}考试的单项选择题,请选出其中的正确答案。\n\n",
    choice="\n{choice}. {content}",
    answer="\n答案:",
    prefix="\n",
    prefix=" ",
)
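For context on how `_parse_example` assembles a few-shot item, the sketch below mirrors the `en` template registered above; the `answer` string is assumed to be `"\nAnswer:"`, since the hunk cuts off before that field:

```python
CHOICES = ["A", "B", "C", "D"]
choice_tmpl = "\n{choice}. {content}"
answer_tmpl = "\nAnswer:"  # assumed; the "en" answer field is not shown in this hunk

example = {"question": "2 + 2 = ?", "A": "3", "B": "4", "C": "5", "D": "22", "answer": "B"}
candidates = [choice_tmpl.format(choice=ch, content=example[ch]) for ch in CHOICES if ch in example]
prompt = "".join([example["question"]] + candidates + [answer_tmpl])
print(prompt)             # the question, four lettered options, then "Answer:"
print(example["answer"])  # the gold label used as the assistant response
```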
@@ -58,9 +58,17 @@ class LogCallback(TrainerCallback):
        self.in_training = True
        self.start_time = time.time()
        self.max_steps = state.max_steps
        if os.path.exists(os.path.join(args.output_dir, LOG_FILE_NAME)) and args.overwrite_output_dir:
            logger.warning("Previous log file in this folder will be deleted.")
            os.remove(os.path.join(args.output_dir, LOG_FILE_NAME))

        if args.save_on_each_node:
            if not state.is_local_process_zero:
                return
        else:
            if not state.is_world_process_zero:
                return

        if os.path.exists(os.path.join(args.output_dir, LOG_FILE_NAME)) and args.overwrite_output_dir:
            logger.warning("Previous log file in this folder will be deleted.")
            os.remove(os.path.join(args.output_dir, LOG_FILE_NAME))

    def on_train_end(self, args: "TrainingArguments", state: "TrainerState", control: "TrainerControl", **kwargs):
        r"""

@@ -112,8 +120,12 @@ class LogCallback(TrainerCallback):
        r"""
        Event called after logging the last logs.
        """
        if not state.is_local_process_zero:
            return
        if args.save_on_each_node:
            if not state.is_local_process_zero:
                return
        else:
            if not state.is_world_process_zero:
                return

        logs = dict(
            current_steps=self.cur_steps,

@@ -122,6 +134,7 @@ class LogCallback(TrainerCallback):
            eval_loss=state.log_history[-1].get("eval_loss", None),
            predict_loss=state.log_history[-1].get("predict_loss", None),
            reward=state.log_history[-1].get("reward", None),
            accuracy=state.log_history[-1].get("rewards/accuracies", None),
            learning_rate=state.log_history[-1].get("learning_rate", None),
            epoch=state.log_history[-1].get("epoch", None),
            percentage=round(self.cur_steps / self.max_steps * 100, 2) if self.max_steps != 0 else 100,
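The guard repeated in both callbacks reduces to one rule; a compact restatement (the function name is ours, not the repo's):

```python
def should_write_logs(save_on_each_node: bool, is_local_zero: bool, is_world_zero: bool) -> bool:
    # when checkpoints are saved on every node, rank 0 of each node writes;
    # otherwise only the single global rank-0 process does
    return is_local_zero if save_on_each_node else is_world_zero

assert should_write_logs(True, is_local_zero=True, is_world_zero=False)       # per-node writer
assert not should_write_logs(False, is_local_zero=True, is_world_zero=False)  # only global rank 0
```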
@@ -39,9 +39,12 @@ TRAINING_STAGES = {
    "Reward Modeling": "rm",
    "PPO": "ppo",
    "DPO": "dpo",
    "ORPO": "orpo",
    "Pre-Training": "pt",
}

STAGES_USE_PAIR_DATA = ["rm", "dpo", "orpo"]

V_HEAD_WEIGHTS_NAME = "value_head.bin"

V_HEAD_SAFE_WEIGHTS_NAME = "value_head.safetensors"

@@ -167,6 +170,19 @@ register_model_group(
)


register_model_group(
    models={
        "Breeze-7B": {
            DownloadSource.DEFAULT: "MediaTek-Research/Breeze-7B-Base-v1_0",
        },
        "Breeze-7B-Chat": {
            DownloadSource.DEFAULT: "MediaTek-Research/Breeze-7B-Instruct-v1_0",
        },
    },
    template="breeze",
)


register_model_group(
    models={
        "ChatGLM2-6B-Chat": {

@@ -460,14 +476,18 @@ register_model_group(

register_model_group(
    models={
        "Mistral-7B": {
        "Mistral-7B-v0.1": {
            DownloadSource.DEFAULT: "mistralai/Mistral-7B-v0.1",
            DownloadSource.MODELSCOPE: "AI-ModelScope/Mistral-7B-v0.1",
        },
        "Mistral-7B-Chat": {
        "Mistral-7B-v0.1-Chat": {
            DownloadSource.DEFAULT: "mistralai/Mistral-7B-Instruct-v0.1",
            DownloadSource.MODELSCOPE: "AI-ModelScope/Mistral-7B-Instruct-v0.1",
        },
        "Mistral-7B-v0.2": {
            DownloadSource.DEFAULT: "alpindale/Mistral-7B-v0.2-hf",
            DownloadSource.MODELSCOPE: "AI-ModelScope/Mistral-7B-v0.2-hf",
        },
        "Mistral-7B-v0.2-Chat": {
            DownloadSource.DEFAULT: "mistralai/Mistral-7B-Instruct-v0.2",
            DownloadSource.MODELSCOPE: "AI-ModelScope/Mistral-7B-Instruct-v0.2",

@@ -492,6 +512,24 @@ register_model_group(
)


register_model_group(
    models={
        "OLMo-1B": {
            DownloadSource.DEFAULT: "allenai/OLMo-1B",
        },
        "OLMo-7B": {
            DownloadSource.DEFAULT: "allenai/OLMo-7B",
            DownloadSource.MODELSCOPE: "AI-ModelScope/OLMo-7B",
        },
        "OLMo-7B-Chat": {
            DownloadSource.DEFAULT: "allenai/OLMo-7B-Instruct",
        },
    },
    module="att_proj",
    template="olmo",
)


register_model_group(
    models={
        "OpenChat3.5-7B-Chat": {

@@ -638,10 +676,18 @@ register_model_group(
            DownloadSource.DEFAULT: "Qwen/Qwen1.5-14B",
            DownloadSource.MODELSCOPE: "qwen/Qwen1.5-14B",
        },
        "Qwen1.5-32B": {
            DownloadSource.DEFAULT: "Qwen/Qwen1.5-32B",
            DownloadSource.MODELSCOPE: "qwen/Qwen1.5-32B",
        },
        "Qwen1.5-72B": {
            DownloadSource.DEFAULT: "Qwen/Qwen1.5-72B",
            DownloadSource.MODELSCOPE: "qwen/Qwen1.5-72B",
        },
        "Qwen1.5-MoE-A2.7B": {
            DownloadSource.DEFAULT: "Qwen/Qwen1.5-MoE-A2.7B",
            DownloadSource.MODELSCOPE: "qwen/Qwen1.5-MoE-A2.7B",
        },
        "Qwen1.5-0.5B-Chat": {
            DownloadSource.DEFAULT: "Qwen/Qwen1.5-0.5B-Chat",
            DownloadSource.MODELSCOPE: "qwen/Qwen1.5-0.5B-Chat",

@@ -662,57 +708,73 @@ register_model_group(
            DownloadSource.DEFAULT: "Qwen/Qwen1.5-14B-Chat",
            DownloadSource.MODELSCOPE: "qwen/Qwen1.5-14B-Chat",
        },
        "Qwen1.5-32B-Chat": {
            DownloadSource.DEFAULT: "Qwen/Qwen1.5-32B-Chat",
            DownloadSource.MODELSCOPE: "qwen/Qwen1.5-32B-Chat",
        },
        "Qwen1.5-72B-Chat": {
            DownloadSource.DEFAULT: "Qwen/Qwen1.5-72B-Chat",
            DownloadSource.MODELSCOPE: "qwen/Qwen1.5-72B-Chat",
        },
        "Qwen1.5-MoE-A2.7B-Chat": {
            DownloadSource.DEFAULT: "Qwen/Qwen1.5-MoE-A2.7B-Chat",
            DownloadSource.MODELSCOPE: "qwen/Qwen1.5-MoE-A2.7B-Chat",
        },
        "Qwen1.5-0.5B-int8-Chat": {
            DownloadSource.DEFAULT: "Qwen/Qwen1.5-0.5B-Chat-GPTQ-Int8",
            DownloadSource.MODELSCOPE: "qwen/Qwen1.5-0.5B-Chat-GPTQ-Int8",
        },
        "Qwen1.5-0.5B-int4-Chat": {
            DownloadSource.DEFAULT: "Qwen/Qwen1.5-0.5B-Chat-GPTQ-Int4",
            DownloadSource.MODELSCOPE: "qwen/Qwen1.5-0.5B-Chat-GPTQ-Int4",
            DownloadSource.DEFAULT: "Qwen/Qwen1.5-0.5B-Chat-AWQ",
            DownloadSource.MODELSCOPE: "qwen/Qwen1.5-0.5B-Chat-AWQ",
        },
        "Qwen1.5-1.8B-int8-Chat": {
            DownloadSource.DEFAULT: "Qwen/Qwen1.5-1.8B-Chat-GPTQ-Int8",
            DownloadSource.MODELSCOPE: "qwen/Qwen1.5-1.8B-Chat-GPTQ-Int8",
        },
        "Qwen1.5-1.8B-int4-Chat": {
            DownloadSource.DEFAULT: "Qwen/Qwen1.5-1.8B-Chat-GPTQ-Int4",
            DownloadSource.MODELSCOPE: "qwen/Qwen1.5-1.8B-Chat-GPTQ-Int4",
            DownloadSource.DEFAULT: "Qwen/Qwen1.5-1.8B-Chat-AWQ",
            DownloadSource.MODELSCOPE: "qwen/Qwen1.5-1.8B-Chat-AWQ",
        },
        "Qwen1.5-4B-int8-Chat": {
            DownloadSource.DEFAULT: "Qwen/Qwen1.5-4B-Chat-GPTQ-Int8",
            DownloadSource.MODELSCOPE: "qwen/Qwen1.5-4B-Chat-GPTQ-Int8",
        },
        "Qwen1.5-4B-int4-Chat": {
            DownloadSource.DEFAULT: "Qwen/Qwen1.5-4B-Chat-GPTQ-Int4",
            DownloadSource.MODELSCOPE: "qwen/Qwen1.5-4B-Chat-GPTQ-Int4",
            DownloadSource.DEFAULT: "Qwen/Qwen1.5-4B-Chat-AWQ",
            DownloadSource.MODELSCOPE: "qwen/Qwen1.5-4B-Chat-AWQ",
        },
        "Qwen1.5-7B-int8-Chat": {
            DownloadSource.DEFAULT: "Qwen/Qwen1.5-7B-Chat-GPTQ-Int8",
            DownloadSource.MODELSCOPE: "qwen/Qwen1.5-7B-Chat-GPTQ-Int8",
        },
        "Qwen1.5-7B-int4-Chat": {
            DownloadSource.DEFAULT: "Qwen/Qwen1.5-7B-Chat-GPTQ-Int4",
            DownloadSource.MODELSCOPE: "qwen/Qwen1.5-7B-Chat-GPTQ-Int4",
            DownloadSource.DEFAULT: "Qwen/Qwen1.5-7B-Chat-AWQ",
            DownloadSource.MODELSCOPE: "qwen/Qwen1.5-7B-Chat-AWQ",
        },
        "Qwen1.5-14B-int8-Chat": {
            DownloadSource.DEFAULT: "Qwen/Qwen1.5-14B-Chat-GPTQ-Int8",
            DownloadSource.MODELSCOPE: "qwen/Qwen1.5-14B-Chat-GPTQ-Int8",
        },
        "Qwen1.5-14B-int4-Chat": {
            DownloadSource.DEFAULT: "Qwen/Qwen1.5-14B-Chat-GPTQ-Int4",
            DownloadSource.MODELSCOPE: "qwen/Qwen1.5-14B-Chat-GPTQ-Int4",
            DownloadSource.DEFAULT: "Qwen/Qwen1.5-14B-Chat-AWQ",
            DownloadSource.MODELSCOPE: "qwen/Qwen1.5-14B-Chat-AWQ",
        },
        "Qwen1.5-32B-int4-Chat": {
            DownloadSource.DEFAULT: "Qwen/Qwen1.5-32B-Chat-AWQ",
            DownloadSource.MODELSCOPE: "qwen/Qwen1.5-32B-Chat-AWQ",
        },
        "Qwen1.5-72B-int8-Chat": {
            DownloadSource.DEFAULT: "Qwen/Qwen1.5-72B-Chat-GPTQ-Int8",
            DownloadSource.MODELSCOPE: "qwen/Qwen1.5-72B-Chat-GPTQ-Int8",
        },
        "Qwen1.5-72B-int4-Chat": {
            DownloadSource.DEFAULT: "Qwen/Qwen1.5-72B-Chat-GPTQ-Int4",
            DownloadSource.MODELSCOPE: "qwen/Qwen1.5-72B-Chat-GPTQ-Int4",
            DownloadSource.DEFAULT: "Qwen/Qwen1.5-72B-Chat-AWQ",
            DownloadSource.MODELSCOPE: "qwen/Qwen1.5-72B-Chat-AWQ",
        },
        "Qwen1.5-MoE-A2.7B-int4-Chat": {
            DownloadSource.DEFAULT: "Qwen/Qwen1.5-MoE-A2.7B-Chat-GPTQ-Int4",
            DownloadSource.MODELSCOPE: "qwen/Qwen1.5-MoE-A2.7B-Chat-GPTQ-Int4",
        },
    },
    template="qwen",

@@ -743,6 +805,21 @@ register_model_group(
)


register_model_group(
    models={
        "StarCoder2-3B": {
            DownloadSource.DEFAULT: "bigcode/starcoder2-3b",
        },
        "StarCoder2-7B": {
            DownloadSource.DEFAULT: "bigcode/starcoder2-7b",
        },
        "StarCoder2-15B": {
            DownloadSource.DEFAULT: "bigcode/starcoder2-15b",
        },
    }
)


register_model_group(
    models={
        "Vicuna1.5-7B-Chat": {

@@ -833,6 +910,10 @@ register_model_group(
            DownloadSource.DEFAULT: "01-ai/Yi-6B",
            DownloadSource.MODELSCOPE: "01ai/Yi-6B",
        },
        "Yi-9B": {
            DownloadSource.DEFAULT: "01-ai/Yi-9B",
            DownloadSource.MODELSCOPE: "01ai/Yi-9B",
        },
        "Yi-34B": {
            DownloadSource.DEFAULT: "01-ai/Yi-34B",
            DownloadSource.MODELSCOPE: "01ai/Yi-34B",
@@ -14,6 +14,7 @@ from transformers.utils import (
    is_torch_npu_available,
    is_torch_xpu_available,
)
from transformers.utils.versions import require_version

from .constants import V_HEAD_SAFE_WEIGHTS_NAME, V_HEAD_WEIGHTS_NAME
from .logging import get_logger

@@ -56,6 +57,18 @@ class AverageMeter:
        self.avg = self.sum / self.count


def check_dependencies() -> None:
    if int(os.environ.get("DISABLE_VERSION_CHECK", "0")):
        logger.warning("Version checking has been disabled, may lead to unexpected behaviors.")
    else:
        require_version("transformers>=4.37.2", "To fix: pip install transformers>=4.37.2")
        require_version("datasets>=2.14.3", "To fix: pip install datasets>=2.14.3")
        require_version("accelerate>=0.27.2", "To fix: pip install accelerate>=0.27.2")
        require_version("peft>=0.10.0", "To fix: pip install peft>=0.10.0")
        require_version("trl>=0.8.1", "To fix: pip install trl>=0.8.1")
        require_version("gradio>=4.0.0,<=4.21.0", "To fix: pip install gradio==4.21.0")


def count_parameters(model: torch.nn.Module) -> Tuple[int, int]:
    r"""
    Returns the number of trainable parameters and number of all parameters in the model.

@@ -69,7 +82,12 @@ def count_parameters(model: torch.nn.Module) -> Tuple[int, int]:

        # Due to the design of 4bit linear layers from bitsandbytes, multiply the number of parameters by 2
        if param.__class__.__name__ == "Params4bit":
            num_params = num_params * 2
            if hasattr(param, "quant_storage") and hasattr(param.quant_storage, "itemsize"):
                num_bytes = param.quant_storage.itemsize
            else:
                num_bytes = 1

            num_params = num_params * 2 * num_bytes

        all_param += num_params
        if param.requires_grad:
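The parameter count for bitsandbytes 4-bit layers now also accounts for the storage dtype. A small sketch of the arithmetic (the function name is ours, for illustration):

```python
def effective_num_params(num_elements: int, quant_storage_itemsize: int = 1) -> int:
    # two 4-bit weights are packed per stored byte, and each storage element
    # spans `itemsize` bytes, so the visible element count is scaled by both
    return num_elements * 2 * quant_storage_itemsize

assert effective_num_params(1024, 1) == 2048  # uint8 storage: 2 weights per element
assert effective_num_params(1024, 2) == 4096  # 16-bit storage: 4 weights per element
```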
@@ -145,6 +163,12 @@ def get_current_device() -> torch.device:


def get_device_count() -> int:
    r"""
    Gets the number of available GPU devices.
    """
    if not torch.cuda.is_available():
        return 0

    return torch.cuda.device_count()


@@ -169,6 +193,13 @@ def infer_optim_dtype(model_dtype: torch.dtype) -> torch.dtype:
    return torch.float32


def has_tokenized_data(path: os.PathLike) -> bool:
    r"""
    Checks if the path has a tokenized dataset.
    """
    return os.path.isdir(path) and len(os.listdir(path)) > 0


def torch_gc() -> None:
    r"""
    Collects GPU memory.

@@ -179,17 +210,15 @@ def torch_gc() -> None:
    torch.cuda.ipc_collect()


def try_download_model_from_ms(model_args: "ModelArguments") -> None:
def try_download_model_from_ms(model_args: "ModelArguments") -> str:
    if not use_modelscope() or os.path.exists(model_args.model_name_or_path):
        return
        return model_args.model_name_or_path

    try:
        from modelscope import snapshot_download

        revision = "master" if model_args.model_revision == "main" else model_args.model_revision
        model_args.model_name_or_path = snapshot_download(
            model_args.model_name_or_path, revision=revision, cache_dir=model_args.cache_dir
        )
        return snapshot_download(model_args.model_name_or_path, revision=revision, cache_dir=model_args.cache_dir)
    except ImportError:
        raise ImportError("Please install modelscope via `pip install modelscope -U`")
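`try_download_model_from_ms` now returns a usable path instead of mutating `model_args` in place. The stub below mirrors the control flow under that new contract; the `use_modelscope` flag and the cache path are stand-ins, not the repo's real values:

```python
import os

def try_download(name_or_path: str, use_modelscope: bool) -> str:
    if not use_modelscope or os.path.exists(name_or_path):
        return name_or_path  # already local, or ModelScope disabled
    # the real helper calls modelscope.snapshot_download(...) here and
    # returns the downloaded snapshot directory
    return "/cache/modelscope/" + name_or_path  # illustrative path only

# callers now assign the result instead of relying on a side effect:
model_path = try_download("meta-llama/Llama-2-7b-hf", use_modelscope=False)
print(model_path)
```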
@@ -21,6 +21,10 @@ def is_flash_attn2_available():
    return _is_package_available("flash_attn") and _get_package_version("flash_attn").startswith("2")


def is_galore_available():
    return _is_package_available("galore_torch")


def is_jieba_available():
    return _is_package_available("jieba")

@@ -51,3 +55,7 @@ def is_unsloth_available():

def is_uvicorn_available():
    return _is_package_available("uvicorn")


def is_vllm_available():
    return _is_package_available("vllm")
@@ -11,12 +11,14 @@ from transformers.models.llama.modeling_llama import (
    repeat_kv,
)
from transformers.utils import logging
from transformers.utils.versions import require_version


logger = logging.get_logger(__name__)


# Modified from: https://github.com/huggingface/transformers/blob/main/src/transformers/models/llama/modeling_llama.py
# Modified from:
# https://github.com/huggingface/transformers/blob/v4.39.1/src/transformers/models/llama/modeling_llama.py
def llama_torch_attn_forward(
    self: "LlamaAttention",
    hidden_states: torch.Tensor,

@@ -24,6 +26,7 @@ def llama_torch_attn_forward(
    position_ids: Optional[torch.LongTensor] = None,
    past_key_value: Optional["Cache"] = None,
    output_attentions: bool = False,
    cache_position: Optional[torch.LongTensor] = None,
    **kwargs,
) -> Tuple[torch.Tensor, Optional[torch.Tensor], Optional[Tuple[torch.Tensor]]]:
    bsz, q_len, _ = hidden_states.size()

@@ -36,15 +39,12 @@ def llama_torch_attn_forward(
    key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(1, 2)
    value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(1, 2)

    kv_seq_len = key_states.shape[-2]
    if past_key_value is not None:
        kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx)

    cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len)
    query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids)
    past_key_value = getattr(self, "past_key_value", past_key_value)
    cos, sin = self.rotary_emb(value_states, position_ids)
    query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin)

    if past_key_value is not None:
        cache_kwargs = {"sin": sin, "cos": cos}  # Specific to RoPE models
        cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position}
        key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs)

    key_states = repeat_kv(key_states, self.num_key_value_groups)

@@ -96,14 +96,16 @@ def llama_torch_attn_forward(
    return attn_output, attn_weights, past_key_value


# Modified from: https://github.com/huggingface/transformers/blob/main/src/transformers/models/llama/modeling_llama.py
# Modified from:
# https://github.com/huggingface/transformers/blob/v4.39.1/src/transformers/models/llama/modeling_llama.py
def llama_flash_attn_forward(
    self: "LlamaFlashAttention2",
    hidden_states: torch.Tensor,
    attention_mask: Optional[torch.Tensor] = None,
    position_ids: Optional[torch.LongTensor] = None,
    past_key_value: Optional[Tuple[torch.Tensor]] = None,
    past_key_value: Optional["Cache"] = None,
    output_attentions: bool = False,
    cache_position: Optional[torch.LongTensor] = None,
    **kwargs,
) -> Tuple[torch.Tensor, Optional[torch.Tensor], Optional[Tuple[torch.Tensor]]]:
    # LlamaFlashAttention2 attention does not support output_attentions

@@ -120,15 +122,13 @@ def llama_flash_attn_forward(
    key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(1, 2)
    value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(1, 2)

    kv_seq_len = key_states.shape[-2]
    if past_key_value is not None:
        kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx)
    cos, sin = self.rotary_emb(value_states, position_ids)
    query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin)

    cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len)
    query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids)
    past_key_value = getattr(self, "past_key_value", past_key_value)

    if past_key_value is not None:
        cache_kwargs = {"sin": sin, "cos": cos}  # Specific to RoPE models
        cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position}
        key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs)

    key_states = repeat_kv(key_states, self.num_key_value_groups)

@@ -193,5 +193,6 @@ def llama_flash_attn_forward(


def apply_llama_patch() -> None:
    require_version("transformers==4.39.3", "To fix: pip install transformers==4.39.3")
    LlamaAttention.forward = llama_torch_attn_forward
    LlamaFlashAttention2.forward = llama_flash_attn_forward
@@ -1,38 +0,0 @@
import torch
import torch.nn.functional as F
from transformers.models.mixtral.modeling_mixtral import MixtralBLockSparseTop2MLP, MixtralSparseMoeBlock


def mlp_forward(self: "MixtralBLockSparseTop2MLP", hidden_states: torch.Tensor) -> torch.Tensor:
    current_hidden_states = self.act_fn(self.w1(hidden_states)) * self.w3(hidden_states)
    current_hidden_states = self.w2(current_hidden_states)
    return current_hidden_states


# Modified from: https://huggingface.co/deepseek-ai/deepseek-moe-16b-base/blob/main/modeling_deepseek.py
def moe_forward(self: "MixtralSparseMoeBlock", hidden_states: torch.Tensor) -> torch.Tensor:
    batch_size, sequence_length, hidden_dim = hidden_states.shape
    hidden_states = hidden_states.view(-1, hidden_dim)
    # router_logits: (batch * sequence_length, n_experts)
    router_logits = self.gate(hidden_states)

    routing_weights = F.softmax(router_logits, dim=1, dtype=torch.float)
    topk_weight, topk_idx = torch.topk(routing_weights, self.top_k, dim=-1, sorted=False)
    topk_weight /= topk_weight.sum(dim=-1, keepdim=True)
    # we cast back to the input dtype
    topk_weight = topk_weight.to(hidden_states.dtype)

    hidden_states = hidden_states.repeat_interleave(self.top_k, dim=0)
    y = torch.empty_like(hidden_states)
    flat_topk_idx = topk_idx.view(-1)
    for i in range(self.num_experts):
        expert = self.experts[i]
        y[flat_topk_idx == i] = expert(hidden_states[flat_topk_idx == i])
    y = (y.view(*topk_weight.shape, -1) * topk_weight.unsqueeze(-1)).sum(dim=1)
    final_hidden_states = y.reshape(batch_size, sequence_length, hidden_dim)
    return final_hidden_states, router_logits


def patch_mixtral_replace_moe_impl() -> None:
    MixtralBLockSparseTop2MLP.forward = mlp_forward
    MixtralSparseMoeBlock.forward = moe_forward
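For intuition about the routing math in the (now deleted) `moe_forward`, here is the top-2 selection on a single made-up token with four experts:

```python
import torch
import torch.nn.functional as F

router_logits = torch.tensor([[2.0, 1.0, 0.5, -1.0]])  # (tokens, n_experts), toy values
routing_weights = F.softmax(router_logits, dim=1, dtype=torch.float)
topk_weight, topk_idx = torch.topk(routing_weights, k=2, dim=-1, sorted=False)
# renormalize over just the two selected experts
topk_weight = topk_weight / topk_weight.sum(dim=-1, keepdim=True)
print(topk_idx, topk_weight)  # experts 0 and 1 carry roughly 73% / 27% of the token
```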
@@ -1,7 +1,7 @@
import json
import math
import os
from typing import List, Optional
from typing import List

from transformers.trainer import TRAINER_STATE_NAME

@@ -30,7 +30,7 @@ def smooth(scalars: List[float]) -> List[float]:
    return smoothed


def plot_loss(save_dictionary: os.PathLike, keys: Optional[List[str]] = ["loss"]) -> None:
def plot_loss(save_dictionary: os.PathLike, keys: List[str] = ["loss"]) -> None:
    with open(os.path.join(save_dictionary, TRAINER_STATE_NAME), "r", encoding="utf-8") as f:
        data = json.load(f)

@@ -46,11 +46,12 @@ def plot_loss(save_dictionary: os.PathLike, keys: Optional[List[str]] = ["loss"]
            continue

        plt.figure()
        plt.plot(steps, metrics, alpha=0.4, label="original")
        plt.plot(steps, smooth(metrics), label="smoothed")
        plt.plot(steps, metrics, color="#1f77b4", alpha=0.4, label="original")
        plt.plot(steps, smooth(metrics), color="#1f77b4", label="smoothed")
        plt.title("training {} of {}".format(key, save_dictionary))
        plt.xlabel("step")
        plt.ylabel(key)
        plt.legend()
        plt.savefig(os.path.join(save_dictionary, "training_{}.png".format(key)), format="png", dpi=100)
        print("Figure saved:", os.path.join(save_dictionary, "training_{}.png".format(key)))
        figure_path = os.path.join(save_dictionary, "training_{}.png".format(key.replace("/", "_")))
        plt.savefig(figure_path, format="png", dpi=100)
        print("Figure saved at:", figure_path)
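The filename sanitization matters because metric keys such as `rewards/accuracies` contain a slash, which would otherwise be treated as a directory separator when saving the figure:

```python
key = "rewards/accuracies"
print("training_{}.png".format(key))                    # training_rewards/accuracies.png (broken path)
print("training_{}.png".format(key.replace("/", "_")))  # training_rewards_accuracies.png
```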
@@ -16,35 +16,35 @@ class DataArguments:
        default=None,
        metadata={"help": "The name of provided dataset(s) to use. Use commas to separate multiple datasets."},
    )
    dataset_dir: Optional[str] = field(
    dataset_dir: str = field(
        default="data",
        metadata={"help": "Path to the folder containing the datasets."},
    )
    split: Optional[str] = field(
    split: str = field(
        default="train",
        metadata={"help": "Which dataset split to use for training and evaluation."},
    )
    cutoff_len: Optional[int] = field(
    cutoff_len: int = field(
        default=1024,
        metadata={"help": "The cutoff length of the model inputs after tokenization."},
    )
    reserved_label_len: Optional[int] = field(
    reserved_label_len: int = field(
        default=1,
        metadata={"help": "The minimum cutoff length reserved for label after tokenization."},
    )
    train_on_prompt: Optional[bool] = field(
    train_on_prompt: bool = field(
        default=False,
        metadata={"help": "Whether to disable the mask on the prompt or not."},
    )
    streaming: Optional[bool] = field(
    streaming: bool = field(
        default=False,
        metadata={"help": "Enable dataset streaming."},
    )
    buffer_size: Optional[int] = field(
    buffer_size: int = field(
        default=16384,
        metadata={"help": "Size of the buffer to randomly sample examples from in dataset streaming."},
    )
    mix_strategy: Optional[Literal["concat", "interleave_under", "interleave_over"]] = field(
    mix_strategy: Literal["concat", "interleave_under", "interleave_over"] = field(
        default="concat",
        metadata={"help": "Strategy to use in dataset mixing (concat/interleave) (undersampling/oversampling)."},
    )

@@ -52,13 +52,13 @@ class DataArguments:
        default=None,
        metadata={"help": "Probabilities to sample data from datasets. Use commas to separate multiple datasets."},
    )
    overwrite_cache: Optional[bool] = field(
    overwrite_cache: bool = field(
        default=False,
        metadata={"help": "Overwrite the cached training and evaluation sets."},
    )
    preprocessing_num_workers: Optional[int] = field(
        default=None,
        metadata={"help": "The number of processes to use for the preprocessing."},
        metadata={"help": "The number of processes to use for the pre-processing."},
    )
    max_samples: Optional[int] = field(
        default=None,

@@ -68,23 +68,25 @@ class DataArguments:
        default=None,
        metadata={"help": "Number of beams to use for evaluation. This argument will be passed to `model.generate`"},
    )
    ignore_pad_token_for_loss: Optional[bool] = field(
    ignore_pad_token_for_loss: bool = field(
        default=True,
        metadata={
            "help": "Whether or not to ignore the tokens corresponding to padded labels in the loss computation."
        },
    )
    val_size: Optional[float] = field(
        default=0,
    val_size: float = field(
        default=0.0,
        metadata={"help": "Size of the development set, should be an integer or a float in range `[0,1)`."},
    )
    sft_packing: Optional[bool] = field(
        default=False,
        metadata={"help": "Packing the questions and answers in the supervised fine-tuning stage."},
    )
    cache_path: Optional[str] = field(
    packing: Optional[bool] = field(
        default=None,
        metadata={"help": "Path to save or load the preprocessed datasets."},
        metadata={
            "help": "Whether or not to pack the sequences in training. Will automatically enable in pre-training."
        },
    )
    tokenized_path: Optional[str] = field(
        default=None,
        metadata={"help": "Path to save or load the tokenized datasets."},
    )

    def __post_init__(self):
@@ -14,23 +14,23 @@ class EvaluationArguments:
    task: str = field(
        metadata={"help": "Name of the evaluation task."},
    )
    task_dir: Optional[str] = field(
    task_dir: str = field(
        default="evaluation",
        metadata={"help": "Path to the folder containing the evaluation datasets."},
    )
    batch_size: Optional[int] = field(
    batch_size: int = field(
        default=4,
        metadata={"help": "The batch size per GPU for evaluation."},
    )
    seed: Optional[int] = field(
    seed: int = field(
        default=42,
        metadata={"help": "Random seed to be used with data loaders."},
    )
    lang: Optional[Literal["en", "zh"]] = field(
    lang: Literal["en", "zh"] = field(
        default="en",
        metadata={"help": "Language used at evaluation."},
    )
    n_shot: Optional[int] = field(
    n_shot: int = field(
        default=5,
        metadata={"help": "Number of examplars for few-shot learning."},
    )

@@ -38,7 +38,7 @@ class EvaluationArguments:
        default=None,
        metadata={"help": "Path to save the evaluation results."},
    )
    download_mode: Optional[DownloadMode] = field(
    download_mode: DownloadMode = field(
        default=DownloadMode.REUSE_DATASET_IF_EXISTS,
        metadata={"help": "Download mode used for the evaluation datasets."},
    )
@@ -9,8 +9,8 @@ class FreezeArguments:
    Arguments pertaining to the freeze (partial-parameter) training.
    """

    name_module_trainable: Optional[str] = field(
        default=None,
    name_module_trainable: str = field(
        default="all",
        metadata={
            "help": """Name of trainable modules for partial-parameter (freeze) fine-tuning. \
                  Use commas to separate multiple modules. \

@@ -22,8 +22,8 @@ class FreezeArguments:
                  Others choices: the same as LLaMA."""
        },
    )
    num_layer_trainable: Optional[int] = field(
        default=3,
    num_layer_trainable: int = field(
        default=2,
        metadata={"help": "The number of trainable layers for partial-parameter (freeze) fine-tuning."},
    )

@@ -44,20 +44,20 @@ class LoraArguments:
        default=None,
        metadata={"help": "The scale factor for LoRA fine-tuning (default: lora_rank * 2)."},
    )
    lora_dropout: Optional[float] = field(
    lora_dropout: float = field(
        default=0.0,
        metadata={"help": "Dropout rate for the LoRA fine-tuning."},
    )
    lora_rank: Optional[int] = field(
    lora_rank: int = field(
        default=8,
        metadata={"help": "The intrinsic dimension for LoRA fine-tuning."},
    )
    lora_target: Optional[str] = field(
        default=None,
    lora_target: str = field(
        default="all",
        metadata={
            "help": """Name(s) of target modules to apply LoRA. \
                  Use commas to separate multiple modules. \
                  Use "all" to specify all the available modules. \
                  Use "all" to specify all the linear modules. \
                  LLaMA choices: ["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"], \
                  BLOOM & Falcon & ChatGLM choices: ["query_key_value", "dense", "dense_h_to_4h", "dense_4h_to_h"], \
                  Baichuan choices: ["W_pack", "o_proj", "gate_proj", "up_proj", "down_proj"], \

@@ -66,18 +66,23 @@ class LoraArguments:
                  Others choices: the same as LLaMA."""
        },
    )
    lora_bf16_mode: Optional[bool] = field(
        default=False,
        metadata={"help": "Whether or not to train lora adapters in bf16 precision."},
    loraplus_lr_ratio: Optional[float] = field(
        default=None,
        metadata={"help": "LoRA plus learning rate ratio (lr_B / lr_A)."},
    )
    use_rslora: Optional[bool] = field(
    loraplus_lr_embedding: float = field(
        default=1e-6,
        metadata={"help": "LoRA plus learning rate for lora embedding layers."},
    )
    use_rslora: bool = field(
        default=False,
        metadata={"help": "Whether or not to use the rank stabilization scaling factor for LoRA layer."},
    )
    use_dora: Optional[bool] = field(
        default=False, metadata={"help": "Whether or not to use the weight-decomposed lora method (DoRA)."}
    use_dora: bool = field(
        default=False,
        metadata={"help": "Whether or not to use the weight-decomposed lora method (DoRA)."},
    )
    create_new_adapter: Optional[bool] = field(
    create_new_adapter: bool = field(
        default=False,
        metadata={"help": "Whether or not to create a new adapter with randomly initialized weight."},
    )

@@ -89,39 +94,43 @@ class RLHFArguments:
    Arguments pertaining to the PPO and DPO training.
    """

    dpo_beta: Optional[float] = field(
    dpo_beta: float = field(
        default=0.1,
        metadata={"help": "The beta parameter for the DPO loss."},
    )
    dpo_loss: Optional[Literal["sigmoid", "hinge", "ipo", "kto_pair"]] = field(
    dpo_loss: Literal["sigmoid", "hinge", "ipo", "kto_pair"] = field(
        default="sigmoid",
        metadata={"help": "The type of DPO loss to use."},
    )
    dpo_ftx: Optional[float] = field(
        default=0,
    dpo_label_smoothing: float = field(
        default=0.0,
        metadata={"help": "The robust DPO label smoothing parameter in cDPO that should be between 0 and 0.5."},
    )
    dpo_ftx: float = field(
        default=0.0,
        metadata={"help": "The supervised fine-tuning loss coefficient in DPO training."},
    )
    ppo_buffer_size: Optional[int] = field(
    orpo_beta: float = field(
        default=0.1,
        metadata={"help": "The beta (lambda) parameter in ORPO loss representing the weight of the SFT loss."},
    )
    ppo_buffer_size: int = field(
        default=1,
        metadata={"help": "The number of mini-batches to make experience buffer in a PPO optimization step."},
    )
    ppo_epochs: Optional[int] = field(
    ppo_epochs: int = field(
        default=4,
        metadata={"help": "The number of epochs to perform in a PPO optimization step."},
    )
    ppo_logger: Optional[str] = field(
        default=None,
        metadata={"help": 'Log with either "wandb" or "tensorboard" in PPO training.'},
    )
    ppo_score_norm: Optional[bool] = field(
    ppo_score_norm: bool = field(
        default=False,
        metadata={"help": "Use score normalization in PPO training."},
    )
    ppo_target: Optional[float] = field(
    ppo_target: float = field(
        default=6.0,
        metadata={"help": "Target KL value for adaptive KL control in PPO training."},
    )
    ppo_whiten_rewards: Optional[bool] = field(
    ppo_whiten_rewards: bool = field(
        default=False,
        metadata={"help": "Whiten the rewards before compute advantages in PPO training."},
    )

@@ -149,35 +158,74 @@ class RLHFArguments:
        default=None,
        metadata={"help": "The number of bits to quantize the reward model."},
    )
    reward_model_type: Optional[Literal["lora", "full", "api"]] = field(
    reward_model_type: Literal["lora", "full", "api"] = field(
        default="lora",
        metadata={"help": "The type of the reward model in PPO training. Lora model only supports lora training."},
    )


@dataclass
class FinetuningArguments(FreezeArguments, LoraArguments, RLHFArguments):
class GaloreArguments:
    r"""
    Arguments pertaining to the GaLore algorithm.
    """

    use_galore: bool = field(
        default=False,
        metadata={"help": "Whether or not to use gradient low-Rank projection."},
    )
    galore_target: str = field(
        default="all",
        metadata={
            "help": """Name(s) of modules to apply GaLore. Use commas to separate multiple modules. \
                  Use "all" to specify all the linear modules."""
        },
    )
    galore_rank: int = field(
        default=16,
        metadata={"help": "The rank of GaLore gradients."},
    )
    galore_update_interval: int = field(
        default=200,
        metadata={"help": "Number of steps to update the GaLore projection."},
    )
    galore_scale: float = field(
        default=0.25,
        metadata={"help": "GaLore scaling coefficient."},
    )
    galore_proj_type: Literal["std", "reverse_std", "right", "left", "full"] = field(
        default="std",
        metadata={"help": "Type of GaLore projection."},
    )
    galore_layerwise: bool = field(
        default=False,
        metadata={"help": "Whether or not to enable layer-wise update to further save memory."},
    )


@dataclass
class FinetuningArguments(FreezeArguments, LoraArguments, RLHFArguments, GaloreArguments):
    r"""
    Arguments pertaining to which techniques we are going to fine-tuning with.
    """

    stage: Optional[Literal["pt", "sft", "rm", "ppo", "dpo"]] = field(
    pure_bf16: bool = field(
        default=False,
        metadata={"help": "Whether or not to train model in purely bf16 precision (without AMP)."},
    )
    stage: Literal["pt", "sft", "rm", "ppo", "dpo", "orpo"] = field(
        default="sft",
        metadata={"help": "Which stage will be performed in training."},
    )
    finetuning_type: Optional[Literal["lora", "freeze", "full"]] = field(
    finetuning_type: Literal["lora", "freeze", "full"] = field(
        default="lora",
        metadata={"help": "Which fine-tuning method to use."},
    )
    use_llama_pro: Optional[bool] = field(
    use_llama_pro: bool = field(
        default=False,
        metadata={"help": "Whether or not to make only the parameters in the expanded blocks trainable."},
    )
    disable_version_checking: Optional[bool] = field(
        default=False,
        metadata={"help": "Whether or not to disable version checking."},
    )
    plot_loss: Optional[bool] = field(
    plot_loss: bool = field(
        default=False,
        metadata={"help": "Whether or not to save the training loss curves."},
    )

@@ -192,6 +240,7 @@ class FinetuningArguments(FreezeArguments, LoraArguments, RLHFArguments):
        self.lora_alpha = self.lora_alpha or self.lora_rank * 2
        self.lora_target = split_arg(self.lora_target)
        self.additional_target = split_arg(self.additional_target)
        self.galore_target = split_arg(self.galore_target)

        assert self.finetuning_type in ["lora", "freeze", "full"], "Invalid fine-tuning method."
        assert self.ref_model_quantization_bit in [None, 8, 4], "We only accept 4-bit or 8-bit quantization."

@@ -203,9 +252,15 @@ class FinetuningArguments(FreezeArguments, LoraArguments, RLHFArguments):
        if self.stage == "ppo" and self.reward_model_type == "lora" and self.finetuning_type != "lora":
            raise ValueError("`reward_model_type` cannot be lora for Freeze/Full PPO training.")

        if self.stage == "dpo" and self.dpo_loss != "sigmoid" and self.dpo_label_smoothing > 1e-6:
            raise ValueError("`dpo_label_smoothing` is only valid for sigmoid loss function.")

        if self.use_llama_pro and self.finetuning_type == "full":
            raise ValueError("`use_llama_pro` is only valid for the Freeze or LoRA method.")

        if self.use_galore and self.finetuning_type == "lora":
            raise ValueError("Cannot use LoRA with GaLore together.")

    def save_to_json(self, json_path: str):
        r"""Saves the content of this instance in JSON format inside `json_path`."""
        json_string = json.dumps(asdict(self), indent=2, sort_keys=True) + "\n"
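The new `__post_init__` guards turn invalid flag combinations into early errors. A hedged sketch of triggering them; the import path is assumed from the package layout rather than shown in this diff:

```python
from llmtuner.hparams import FinetuningArguments  # assumed import path

try:
    FinetuningArguments(stage="dpo", dpo_loss="ipo", dpo_label_smoothing=0.1)
except ValueError as err:
    print(err)  # `dpo_label_smoothing` is only valid for sigmoid loss function.

try:
    FinetuningArguments(finetuning_type="lora", use_galore=True)
except ValueError as err:
    print(err)  # Cannot use LoRA with GaLore together.
```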
@@ -1,5 +1,5 @@
from dataclasses import asdict, dataclass, field
from typing import Any, Dict, Optional
from typing import Any, Dict


@dataclass

@@ -8,41 +8,41 @@ class GeneratingArguments:
    Arguments pertaining to specify the decoding parameters.
    """

    do_sample: Optional[bool] = field(
    do_sample: bool = field(
        default=True,
        metadata={"help": "Whether or not to use sampling, use greedy decoding otherwise."},
    )
    temperature: Optional[float] = field(
    temperature: float = field(
        default=0.95,
        metadata={"help": "The value used to modulate the next token probabilities."},
    )
    top_p: Optional[float] = field(
    top_p: float = field(
        default=0.7,
        metadata={
            "help": "The smallest set of most probable tokens with probabilities that add up to top_p or higher are kept."
        },
    )
    top_k: Optional[int] = field(
    top_k: int = field(
        default=50,
        metadata={"help": "The number of highest probability vocabulary tokens to keep for top-k filtering."},
    )
    num_beams: Optional[int] = field(
    num_beams: int = field(
        default=1,
        metadata={"help": "Number of beams for beam search. 1 means no beam search."},
    )
    max_length: Optional[int] = field(
    max_length: int = field(
        default=512,
        metadata={"help": "The maximum length the generated tokens can have. It can be overridden by max_new_tokens."},
    )
    max_new_tokens: Optional[int] = field(
    max_new_tokens: int = field(
        default=512,
        metadata={"help": "The maximum numbers of tokens to generate, ignoring the number of tokens in the prompt."},
    )
    repetition_penalty: Optional[float] = field(
    repetition_penalty: float = field(
        default=1.0,
        metadata={"help": "The parameter for repetition penalty. 1.0 means no penalty."},
    )
    length_penalty: Optional[float] = field(
    length_penalty: float = field(
        default=1.0,
        metadata={"help": "Exponential penalty to the length that is used with beam-based generation."},
    )
@@ -5,7 +5,7 @@ from typing import Any, Dict, Literal, Optional
@dataclass
class ModelArguments:
    r"""
    Arguments pertaining to which model/config/tokenizer we are going to fine-tune.
    Arguments pertaining to which model/config/tokenizer we are going to fine-tune or infer.
    """

    model_name_or_path: str = field(

@@ -21,62 +21,98 @@ class ModelArguments:
        default=None,
        metadata={"help": "Where to store the pre-trained models downloaded from huggingface.co or modelscope.cn."},
    )
    use_fast_tokenizer: Optional[bool] = field(
    use_fast_tokenizer: bool = field(
        default=False,
        metadata={"help": "Whether or not to use one of the fast tokenizer (backed by the tokenizers library)."},
    )
    resize_vocab: Optional[bool] = field(
    resize_vocab: bool = field(
        default=False,
        metadata={"help": "Whether or not to resize the tokenizer vocab and the embedding layers."},
    )
    split_special_tokens: Optional[bool] = field(
    split_special_tokens: bool = field(
        default=False,
        metadata={"help": "Whether or not the special tokens should be split during the tokenization process."},
    )
    model_revision: Optional[str] = field(
    model_revision: str = field(
        default="main",
        metadata={"help": "The specific model version to use (can be a branch name, tag name or commit id)."},
    )
    low_cpu_mem_usage: bool = field(
        default=True,
        metadata={"help": "Whether or not to use memory-efficient model loading."},
    )
    quantization_bit: Optional[int] = field(
        default=None,
        metadata={"help": "The number of bits to quantize the model."},
        metadata={"help": "The number of bits to quantize the model using bitsandbytes."},
    )
    quantization_type: Optional[Literal["fp4", "nf4"]] = field(
    quantization_type: Literal["fp4", "nf4"] = field(
        default="nf4",
        metadata={"help": "Quantization data type to use in int4 training."},
    )
    double_quantization: Optional[bool] = field(
    double_quantization: bool = field(
        default=True,
        metadata={"help": "Whether or not to use double quantization in int4 training."},
    )
    quantization_device_map: Optional[Literal["auto"]] = field(
        default=None,
        metadata={"help": "Device map used for loading the 4-bit quantized model, needs bitsandbytes>=0.43.0."},
    )
    rope_scaling: Optional[Literal["linear", "dynamic"]] = field(
        default=None,
        metadata={"help": "Which scaling strategy should be adopted for the RoPE embeddings."},
    )
    flash_attn: Optional[bool] = field(
    flash_attn: bool = field(
        default=False,
        metadata={"help": "Enable FlashAttention-2 for faster training."},
    )
    shift_attn: Optional[bool] = field(
    shift_attn: bool = field(
        default=False,
        metadata={"help": "Enable shift short attention (S^2-Attn) proposed by LongLoRA."},
    )
    use_unsloth: Optional[bool] = field(
    use_unsloth: bool = field(
        default=False,
        metadata={"help": "Whether or not to use unsloth's optimization for the LoRA training."},
    )
    disable_gradient_checkpointing: Optional[bool] = field(
    moe_aux_loss_coef: Optional[float] = field(
        default=None,
        metadata={"help": "Coefficient of the auxiliary router loss in mixture-of-experts model."},
    )
    disable_gradient_checkpointing: bool = field(
        default=False,
        metadata={"help": "Whether or not to disable gradient checkpointing."},
    )
    upcast_layernorm: Optional[bool] = field(
    upcast_layernorm: bool = field(
        default=False,
        metadata={"help": "Whether or not to upcast the layernorm weights in fp32."},
    )
    upcast_lmhead_output: Optional[bool] = field(
    upcast_lmhead_output: bool = field(
        default=False,
        metadata={"help": "Whether or not to upcast the output of lm_head in fp32."},
    )
    infer_backend: Literal["huggingface", "vllm"] = field(
        default="huggingface",
        metadata={"help": "Backend engine used at inference."},
    )
    vllm_maxlen: int = field(
        default=2048,
        metadata={"help": "Maximum input length of the vLLM engine."},
    )
    vllm_gpu_util: float = field(
        default=0.9,
        metadata={"help": "The fraction of GPU memory in (0,1) to be used for the vLLM engine."},
    )
    vllm_enforce_eager: bool = field(
        default=False,
        metadata={"help": "Whether or not to disable CUDA graph in the vLLM engine."},
    )
    offload_folder: str = field(
        default="offload",
        metadata={"help": "Path to offload model weights."},
    )
    use_cache: bool = field(
        default=True,
        metadata={"help": "Whether or not to use KV cache in generation."},
    )
    hf_hub_token: Optional[str] = field(
        default=None,
        metadata={"help": "Auth token to log in with Hugging Face Hub."},

@@ -89,7 +125,7 @@ class ModelArguments:
        default=None,
        metadata={"help": "Path to the directory to save the exported model."},
    )
    export_size: Optional[int] = field(
    export_size: int = field(
        default=1,
        metadata={"help": "The file shard size (in GB) of the exported model."},
    )

@@ -101,15 +137,15 @@ class ModelArguments:
        default=None,
        metadata={"help": "Path to the dataset or dataset name to use in quantizing the exported model."},
    )
    export_quantization_nsamples: Optional[int] = field(
    export_quantization_nsamples: int = field(
        default=128,
        metadata={"help": "The number of samples used for quantization."},
    )
    export_quantization_maxlen: Optional[int] = field(
    export_quantization_maxlen: int = field(
        default=1024,
        metadata={"help": "The maximum length of the model inputs used for quantization."},
    )
    export_legacy_format: Optional[bool] = field(
    export_legacy_format: bool = field(
        default=False,
        metadata={"help": "Whether or not to save the `.bin` files instead of `.safetensors`."},
    )

@@ -117,13 +153,14 @@ class ModelArguments:
        default=None,
        metadata={"help": "The name of the repository if push the model to the Hugging Face hub."},
    )
    print_param_status: Optional[bool] = field(
    print_param_status: bool = field(
        default=False,
        metadata={"help": "For debugging purposes, print the status of the parameters in the model."},
    )

    def __post_init__(self):
        self.compute_dtype = None
        self.device_map = None
        self.model_max_length = None

        if self.split_special_tokens and self.use_fast_tokenizer:
@@ -7,9 +7,10 @@ import torch
import transformers
from transformers import HfArgumentParser, Seq2SeqTrainingArguments
from transformers.trainer_utils import get_last_checkpoint
from transformers.utils.versions import require_version
from transformers.utils import is_torch_bf16_gpu_available

from ..extras.logging import get_logger
from ..extras.misc import check_dependencies
from ..extras.packages import is_unsloth_available
from .data_args import DataArguments
from .evaluation_args import EvaluationArguments
@@ -21,6 +22,9 @@ from .model_args import ModelArguments
logger = get_logger(__name__)

check_dependencies()

_TRAIN_ARGS = [ModelArguments, DataArguments, Seq2SeqTrainingArguments, FinetuningArguments, GeneratingArguments]
_TRAIN_CLS = Tuple[ModelArguments, DataArguments, Seq2SeqTrainingArguments, FinetuningArguments, GeneratingArguments]
_INFER_ARGS = [ModelArguments, DataArguments, FinetuningArguments, GeneratingArguments]
@@ -29,17 +33,6 @@ _EVAL_ARGS = [ModelArguments, DataArguments, EvaluationArguments, FinetuningArgu
_EVAL_CLS = Tuple[ModelArguments, DataArguments, EvaluationArguments, FinetuningArguments]

def _check_dependencies(disabled: bool) -> None:
if disabled:
logger.warning("Version checking has been disabled; this may lead to unexpected behaviors.")
else:
require_version("transformers>=4.37.2", "To fix: pip install transformers>=4.37.2")
require_version("datasets>=2.14.3", "To fix: pip install datasets>=2.14.3")
require_version("accelerate>=0.27.2", "To fix: pip install accelerate>=0.27.2")
require_version("peft>=0.9.0", "To fix: pip install peft>=0.9.0")
require_version("trl>=0.7.11", "To fix: pip install trl>=0.7.11")

def _parse_args(parser: "HfArgumentParser", args: Optional[Dict[str, Any]] = None) -> Tuple[Any]:
if args is not None:
return parser.parse_dict(args)
@@ -67,6 +60,9 @@ def _set_transformers_logging(log_level: Optional[int] = logging.INFO) -> None:

def _verify_model_args(model_args: "ModelArguments", finetuning_args: "FinetuningArguments") -> None:
if model_args.adapter_name_or_path is not None and finetuning_args.finetuning_type != "lora":
raise ValueError("Adapter is only valid for the LoRA method.")

if model_args.quantization_bit is not None:
if finetuning_args.finetuning_type != "lora":
raise ValueError("Quantization is only compatible with the LoRA method.")
@@ -77,9 +73,6 @@ def _verify_model_args(model_args: "ModelArguments", finetuning_args: "Finetunin
if model_args.adapter_name_or_path is not None and len(model_args.adapter_name_or_path) != 1:
raise ValueError("Quantized model only accepts a single adapter. Merge them first.")

if model_args.adapter_name_or_path is not None and finetuning_args.finetuning_type != "lora":
raise ValueError("Adapter is only valid for the LoRA method.")

def _parse_train_args(args: Optional[Dict[str, Any]] = None) -> _TRAIN_CLS:
parser = HfArgumentParser(_TRAIN_ARGS)
@@ -125,34 +118,46 @@ def get_train_args(args: Optional[Dict[str, Any]] = None) -> _TRAIN_CLS:
if finetuning_args.stage == "ppo" and finetuning_args.reward_model_type == "lora" and model_args.use_unsloth:
raise ValueError("Unsloth does not support lora reward model.")

if (
finetuning_args.stage == "ppo"
and training_args.report_to
and training_args.report_to[0] not in ["wandb", "tensorboard"]
):
raise ValueError("PPO only accepts wandb or tensorboard logger.")

if training_args.max_steps == -1 and data_args.streaming:
raise ValueError("Please specify `max_steps` in streaming mode.")

if training_args.do_train and training_args.predict_with_generate:
raise ValueError("`predict_with_generate` cannot be set as True while training.")

if (
training_args.do_train
and finetuning_args.finetuning_type == "freeze"
and finetuning_args.name_module_trainable is None
):
raise ValueError("Please specify `name_module_trainable` in Freeze training.")

if training_args.do_train and finetuning_args.finetuning_type == "lora" and finetuning_args.lora_target is None:
raise ValueError("Please specify `lora_target` in LoRA training.")

if training_args.do_train and model_args.use_unsloth and not is_unsloth_available:
if training_args.do_train and model_args.use_unsloth and not is_unsloth_available():
raise ValueError("Unsloth was not installed: https://github.com/unslothai/unsloth")

if finetuning_args.use_dora:
if model_args.quantization_bit is not None:
raise ValueError("DoRA does not support quantization.")
if finetuning_args.use_dora and model_args.use_unsloth:
raise ValueError("Unsloth does not support DoRA.")

if model_args.use_unsloth:
raise ValueError("Unsloth does not support DoRA.")
if finetuning_args.pure_bf16:
if not is_torch_bf16_gpu_available():
raise ValueError("This device does not support `pure_bf16`.")

if training_args.fp16 or training_args.bf16:
raise ValueError("Turn off mixed precision training when using `pure_bf16`.")

if (
finetuning_args.use_galore
and finetuning_args.galore_layerwise
and training_args.parallel_mode.value == "distributed"
):
raise ValueError("Distributed training does not support layer-wise GaLore.")

if finetuning_args.use_galore and training_args.deepspeed is not None:
raise ValueError("GaLore is incompatible with DeepSpeed.")

if model_args.infer_backend == "vllm":
raise ValueError("vLLM backend is only available for API, CLI and Web.")

_verify_model_args(model_args, finetuning_args)
_check_dependencies(disabled=finetuning_args.disable_version_checking)

if (
training_args.do_train
@@ -168,6 +173,9 @@ def get_train_args(args: Optional[Dict[str, Any]] = None) -> _TRAIN_CLS:
if training_args.do_train and (not training_args.fp16) and (not training_args.bf16):
logger.warning("We recommend enabling mixed precision training.")

if training_args.do_train and finetuning_args.use_galore and not finetuning_args.pure_bf16:
logger.warning("Using GaLore with mixed precision training may significantly increase GPU memory usage.")

if (not training_args.do_train) and model_args.quantization_bit is not None:
logger.warning("Evaluating model in 4/8-bit mode may cause lower scores.")

@@ -176,14 +184,12 @@ def get_train_args(args: Optional[Dict[str, Any]] = None) -> _TRAIN_CLS:

# Post-process training arguments
if (
training_args.local_rank != -1
training_args.parallel_mode.value == "distributed"
and training_args.ddp_find_unused_parameters is None
and finetuning_args.finetuning_type == "lora"
):
logger.warning("`ddp_find_unused_parameters` needs to be set as False for LoRA in DDP training.")
training_args_dict = training_args.to_dict()
training_args_dict.update(dict(ddp_find_unused_parameters=False))
training_args = Seq2SeqTrainingArguments(**training_args_dict)
training_args.ddp_find_unused_parameters = False

if finetuning_args.stage in ["rm", "ppo"] and finetuning_args.finetuning_type in ["full", "freeze"]:
can_resume_from_checkpoint = False
@@ -205,9 +211,7 @@ def get_train_args(args: Optional[Dict[str, Any]] = None) -> _TRAIN_CLS:
raise ValueError("Output directory already exists and is not empty. Please set `overwrite_output_dir`.")

if last_checkpoint is not None:
training_args_dict = training_args.to_dict()
training_args_dict.update(dict(resume_from_checkpoint=last_checkpoint))
training_args = Seq2SeqTrainingArguments(**training_args_dict)
training_args.resume_from_checkpoint = last_checkpoint
logger.info(
"Resuming training from {}. Change `output_dir` or use `overwrite_output_dir` to avoid.".format(
training_args.resume_from_checkpoint
@@ -226,18 +230,21 @@ def get_train_args(args: Optional[Dict[str, Any]] = None) -> _TRAIN_CLS:
)

# Post-process model arguments
model_args.compute_dtype = (
torch.bfloat16 if training_args.bf16 else (torch.float16 if training_args.fp16 else None)
)
if training_args.bf16 or finetuning_args.pure_bf16:
model_args.compute_dtype = torch.bfloat16
elif training_args.fp16:
model_args.compute_dtype = torch.float16

model_args.model_max_length = data_args.cutoff_len
data_args.packing = data_args.packing if data_args.packing is not None else finetuning_args.stage == "pt"

# Log on each process the small summary:
logger.info(
"Process rank: {}, device: {}, n_gpu: {}\n distributed training: {}, compute dtype: {}".format(
"Process rank: {}, device: {}, n_gpu: {}, distributed training: {}, compute dtype: {}".format(
training_args.local_rank,
training_args.device,
training_args.n_gpu,
bool(training_args.local_rank != -1),
training_args.parallel_mode.value == "distributed",
str(model_args.compute_dtype),
)
)
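A standalone sketch of the compute-dtype precedence this hunk introduces (the flag names mirror Seq2SeqTrainingArguments.bf16/fp16 and FinetuningArguments.pure_bf16; the helper itself is illustrative, not code from the diff): pure_bf16 now selects bfloat16 even when mixed precision is off.

import torch
from typing import Optional

def pick_compute_dtype(bf16: bool, fp16: bool, pure_bf16: bool) -> Optional[torch.dtype]:
    if bf16 or pure_bf16:
        return torch.bfloat16
    if fp16:
        return torch.float16
    return None  # left unset; inferred from the checkpoint dtype later

assert pick_compute_dtype(False, False, True) is torch.bfloat16
assert pick_compute_dtype(False, True, False) is torch.float16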
@@ -251,12 +258,27 @@ def get_infer_args(args: Optional[Dict[str, Any]] = None) -> _INFER_CLS:
model_args, data_args, finetuning_args, generating_args = _parse_infer_args(args)

_set_transformers_logging()
_verify_model_args(model_args, finetuning_args)
_check_dependencies(disabled=finetuning_args.disable_version_checking)

if data_args.template is None:
raise ValueError("Please specify which `template` to use.")

if model_args.infer_backend == "vllm":
if finetuning_args.stage != "sft":
raise ValueError("vLLM engine only supports auto-regressive models.")

if model_args.adapter_name_or_path is not None:
raise ValueError("vLLM engine does not support LoRA adapters. Merge them first.")

if model_args.quantization_bit is not None:
raise ValueError("vLLM engine does not support quantization.")

if model_args.rope_scaling is not None:
raise ValueError("vLLM engine does not support RoPE scaling.")

_verify_model_args(model_args, finetuning_args)

model_args.device_map = "auto"

return model_args, data_args, finetuning_args, generating_args

@@ -264,12 +286,17 @@ def get_eval_args(args: Optional[Dict[str, Any]] = None) -> _EVAL_CLS:
model_args, data_args, eval_args, finetuning_args = _parse_eval_args(args)

_set_transformers_logging()
_verify_model_args(model_args, finetuning_args)
_check_dependencies(disabled=finetuning_args.disable_version_checking)

if data_args.template is None:
raise ValueError("Please specify which `template` to use.")

if model_args.infer_backend == "vllm":
raise ValueError("vLLM backend is only available for API, CLI and Web.")

_verify_model_args(model_args, finetuning_args)

model_args.device_map = "auto"

transformers.set_seed(eval_args.seed)

return model_args, data_args, eval_args, finetuning_args

@@ -1,5 +1,10 @@
from .loader import load_model_and_tokenizer
from .utils import dispatch_model, load_valuehead_params
from .loader import load_model, load_tokenizer
from .utils import find_all_linear_modules, load_valuehead_params

__all__ = ["load_model_and_tokenizer", "dispatch_model", "load_valuehead_params"]
__all__ = [
"load_model",
"load_tokenizer",
"load_valuehead_params",
"find_all_linear_modules",
]

@@ -5,7 +5,7 @@ from peft import LoraConfig, LoraModel, PeftModel, TaskType, get_peft_model
from transformers.integrations import is_deepspeed_zero3_enabled

from ..extras.logging import get_logger
from .utils import find_all_linear_modules, find_expanded_modules
from .utils import QuantizationMethod, find_all_linear_modules, find_expanded_modules

if TYPE_CHECKING:
@@ -34,7 +34,8 @@ def init_adapter(

if finetuning_args.finetuning_type == "full" and is_trainable:
logger.info("Fine-tuning method: Full")
model = model.float()
if not finetuning_args.pure_bf16:
model = model.float()

if finetuning_args.finetuning_type == "freeze" and is_trainable:
logger.info("Fine-tuning method: Freeze")
@@ -78,7 +79,8 @@ def init_adapter(

for name, param in model.named_parameters():
if any(trainable_layer in name for trainable_layer in trainable_layers):
param.data = param.data.to(torch.float32)
if not finetuning_args.pure_bf16:
param.data = param.data.to(torch.float32)
else:
param.requires_grad_(False)

@@ -105,14 +107,18 @@ def init_adapter(
adapter_to_merge = model_args.adapter_name_or_path

for adapter in adapter_to_merge:
model: "LoraModel" = PeftModel.from_pretrained(model, adapter)
model: "LoraModel" = PeftModel.from_pretrained(
model, adapter, offload_folder=model_args.offload_folder
)
model = model.merge_and_unload()

if len(adapter_to_merge) > 0:
logger.info("Merged {} adapter(s).".format(len(adapter_to_merge)))

if adapter_to_resume is not None: # resume lora training
model = PeftModel.from_pretrained(model, adapter_to_resume, is_trainable=is_trainable)
model = PeftModel.from_pretrained(
model, adapter_to_resume, is_trainable=is_trainable, offload_folder=model_args.offload_folder
)

if is_trainable and adapter_to_resume is None: # create new lora weights while training
if len(finetuning_args.lora_target) == 1 and finetuning_args.lora_target[0] == "all":
@@ -123,9 +129,9 @@ def init_adapter(
if finetuning_args.use_llama_pro:
target_modules = find_expanded_modules(model, target_modules, finetuning_args.num_layer_trainable)

if finetuning_args.use_dora:
if getattr(model, "quantization_method", None):
raise ValueError("DoRA is currently not compatible with quantized models.")
if finetuning_args.use_dora and getattr(model, "quantization_method", None) is not None:
if getattr(model, "quantization_method", None) != QuantizationMethod.BITS_AND_BYTES:
raise ValueError("DoRA is not compatible with PTQ-quantized models.")

peft_kwargs = {
"r": finetuning_args.lora_rank,
@@ -150,8 +156,9 @@ def init_adapter(
)
model = get_peft_model(model, lora_config)

for param in filter(lambda p: p.requires_grad, model.parameters()):
param.data = param.data.to(torch.bfloat16 if finetuning_args.lora_bf16_mode else torch.float32)
if not finetuning_args.pure_bf16:
for param in filter(lambda p: p.requires_grad, model.parameters()):
param.data = param.data.to(torch.float32)

if model_args.adapter_name_or_path is not None:
logger.info("Loaded adapter(s): {}".format(",".join(model_args.adapter_name_or_path)))
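A hedged sketch of the merge path this hunk changes: each adapter is loaded onto the base model and folded into its weights, and the new offload_folder argument gives accelerate a place to spill weights that do not fit in memory. The model and adapter ids below are placeholders, not values from this diff.

import torch
from transformers import AutoModelForCausalLM
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("some-org/some-base-model", torch_dtype=torch.float16)
peft_model = PeftModel.from_pretrained(base, "some-org/some-lora-adapter", offload_folder="offload")
merged = peft_model.merge_and_unload()  # a plain model with the LoRA deltas baked into its weights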
@@ -1,7 +1,6 @@
from typing import TYPE_CHECKING, Optional, Tuple
from typing import TYPE_CHECKING, Any, Dict

from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer
from transformers.integrations import is_deepspeed_zero3_enabled
from trl import AutoModelForCausalLMWithValueHead

from ..extras.logging import get_logger
@@ -20,38 +19,47 @@ if TYPE_CHECKING:
logger = get_logger(__name__)

def load_model_and_tokenizer(
model_args: "ModelArguments",
finetuning_args: "FinetuningArguments",
is_trainable: Optional[bool] = False,
add_valuehead: Optional[bool] = False,
) -> Tuple["PreTrainedModel", "PreTrainedTokenizer"]:
r"""
Loads pretrained model and tokenizer.

Support both training and inference.
"""

try_download_model_from_ms(model_args)

config_kwargs = {
def _get_init_kwargs(model_args: "ModelArguments") -> Dict[str, Any]:
model_args.model_name_or_path = try_download_model_from_ms(model_args)
return {
"trust_remote_code": True,
"cache_dir": model_args.cache_dir,
"revision": model_args.model_revision,
"token": model_args.hf_hub_token,
}

def load_tokenizer(model_args: "ModelArguments") -> "PreTrainedTokenizer":
r"""
Loads pretrained tokenizer. Must be called before load_model.

Note: this performs an in-place operation on model_args.
"""
init_kwargs = _get_init_kwargs(model_args)
tokenizer = AutoTokenizer.from_pretrained(
model_args.model_name_or_path,
use_fast=model_args.use_fast_tokenizer,
split_special_tokens=model_args.split_special_tokens,
padding_side="right",
**config_kwargs,
**init_kwargs,
)
patch_tokenizer(tokenizer)
return tokenizer
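A usage sketch of the two-step API that replaces load_model_and_tokenizer (the model id is a placeholder, and FinetuningArguments is assumed to be default-constructible): the tokenizer must be created first because load_model needs it for config patching and quantization-dataset tokenization.

from llmtuner.model import load_model, load_tokenizer
from llmtuner.hparams import FinetuningArguments, ModelArguments

model_args = ModelArguments(model_name_or_path="some-org/some-base-model")
finetuning_args = FinetuningArguments()
tokenizer = load_tokenizer(model_args)  # also resolves model_name_or_path in place
model = load_model(tokenizer, model_args, finetuning_args, is_trainable=False)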
config = AutoConfig.from_pretrained(model_args.model_name_or_path, **config_kwargs)
patch_config(config, tokenizer, model_args, config_kwargs, is_trainable)

def load_model(
tokenizer: "PreTrainedTokenizer",
model_args: "ModelArguments",
finetuning_args: "FinetuningArguments",
is_trainable: bool = False,
add_valuehead: bool = False,
) -> "PreTrainedModel":
r"""
Loads pretrained model. Must be called after load_tokenizer.
"""
init_kwargs = _get_init_kwargs(model_args)
config = AutoConfig.from_pretrained(model_args.model_name_or_path, **init_kwargs)
patch_config(config, tokenizer, model_args, init_kwargs, is_trainable)

model = None
if is_trainable and model_args.use_unsloth:
@@ -77,13 +85,7 @@ def load_model(
logger.warning("Unsloth does not support loading adapters.")

if model is None:
model = AutoModelForCausalLM.from_pretrained(
model_args.model_name_or_path,
config=config,
torch_dtype=model_args.compute_dtype,
low_cpu_mem_usage=(not is_deepspeed_zero3_enabled()),
**config_kwargs,
)
model = AutoModelForCausalLM.from_pretrained(model_args.model_name_or_path, config=config, **init_kwargs)

patch_model(model, tokenizer, model_args, is_trainable)
register_autoclass(config, model, tokenizer)
@@ -106,20 +108,18 @@ def load_model(

if not is_trainable:
model.requires_grad_(False)
model = model.to(model_args.compute_dtype) if not getattr(model, "quantization_method", None) else model
model.eval()
else:
model.train()

trainable_params, all_param = count_parameters(model)
logger.info(
"trainable params: {:d} || all params: {:d} || trainable%: {:.4f}".format(
if is_trainable:
param_stats = "trainable params: {:d} || all params: {:d} || trainable%: {:.4f}".format(
trainable_params, all_param, 100 * trainable_params / all_param
)
)

if not is_trainable:
logger.info("It IS expected that the trainable params is 0 if you are using the model for inference only.")
else:
param_stats = "all params: {:d}".format(all_param)
logger.info(param_stats)

if model_args.print_param_status:
for name, param in model.named_parameters():
@@ -129,4 +129,4 @@ def load_model(
)
)

return model, tokenizer
return model

@@ -3,7 +3,7 @@ import os
import random
from contextlib import nullcontext
from types import MethodType
from typing import TYPE_CHECKING, Any, Dict, List, Optional, Tuple
from typing import TYPE_CHECKING, Any, Dict, List, Tuple

import torch
from datasets import load_dataset
@@ -17,7 +17,7 @@ from ..extras.logging import get_logger
from ..extras.misc import get_current_device, infer_optim_dtype
from ..extras.packages import is_flash_attn2_available
from ..extras.patches.llama_patch import apply_llama_patch
from ..extras.patches.mixtral_patch import patch_mixtral_replace_moe_impl
from .utils import QuantizationMethod, add_z3_leaf_module

if TYPE_CHECKING:
@@ -31,6 +31,172 @@ logger = get_logger(__name__)
SUPPORTED_CLASS_FOR_S2ATTN = ["llama"]

def _get_quantization_dataset(tokenizer: "PreTrainedTokenizer", model_args: "ModelArguments") -> List[str]:
r"""
Inspired by: https://github.com/huggingface/optimum/blob/v1.16.0/optimum/gptq/data.py#L133
TODO: remove tokenizer.decode() https://github.com/huggingface/optimum/pull/1600
"""
if os.path.isfile(model_args.export_quantization_dataset):
data_path = FILEEXT2TYPE.get(model_args.export_quantization_dataset.split(".")[-1], None)
data_files = model_args.export_quantization_dataset
else:
data_path = model_args.export_quantization_dataset
data_files = None

dataset = load_dataset(path=data_path, data_files=data_files, split="train", cache_dir=model_args.cache_dir)
maxlen = model_args.export_quantization_maxlen

samples = []
for _ in range(model_args.export_quantization_nsamples):
while True:
sample_idx = random.randint(0, len(dataset) - 1)
sample: Dict[str, torch.Tensor] = tokenizer(dataset[sample_idx]["text"], return_tensors="pt")
if sample["input_ids"].size(1) >= maxlen:
break # TODO: fix large maxlen

word_idx = random.randint(0, sample["input_ids"].size(1) - maxlen - 1)
input_ids = sample["input_ids"][:, word_idx : word_idx + maxlen]
samples.append(tokenizer.decode(input_ids[0].tolist(), skip_special_tokens=True))

return samples
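Not part of the diff: a bounded variant of the sampling loop above, illustrating the TODO it carries. If no document in the calibration set reaches maxlen tokens, the original while True never breaks; capping the retries makes that failure explicit instead of hanging.

import random

def pick_long_sample(lengths, maxlen, max_tries=100):
    # lengths stands in for the tokenized sample lengths of the dataset
    for _ in range(max_tries):
        idx = random.randint(0, len(lengths) - 1)
        if lengths[idx] >= maxlen:
            return idx
    raise ValueError("No sample with at least {} tokens found.".format(maxlen))

assert pick_long_sample([10, 2048, 30], 1024) == 1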
def _configure_attn_implementation(
config: "PretrainedConfig", model_args: "ModelArguments", init_kwargs: Dict[str, Any]
) -> None:
if model_args.flash_attn:
if not is_flash_attn2_available():
logger.warning("FlashAttention2 is not installed.")
return

logger.info("Using FlashAttention-2 for faster training and inference.")
if getattr(config, "model_type", None) == "internlm2": # special case for custom models
setattr(config, "attn_implementation", "flash_attention_2")
else:
init_kwargs["attn_implementation"] = "flash_attention_2"
else:
init_kwargs["attn_implementation"] = "eager"

def _configure_rope(config: "PretrainedConfig", model_args: "ModelArguments", is_trainable: bool) -> None:
if model_args.rope_scaling is None:
return

if not hasattr(config, "rope_scaling"):
logger.warning("Current model does not support RoPE scaling.")
return

if is_trainable:
if model_args.rope_scaling == "dynamic":
logger.warning(
"Dynamic NTK scaling may not work well with fine-tuning. "
"See: https://github.com/huggingface/transformers/pull/24653"
)

current_max_length = getattr(config, "max_position_embeddings", None)
if current_max_length and model_args.model_max_length > current_max_length:
scaling_factor = float(math.ceil(model_args.model_max_length / current_max_length))
else:
logger.warning("Input length is smaller than max length. Consider increasing the input length.")
scaling_factor = 1.0
else:
scaling_factor = 2.0

setattr(config, "rope_scaling", {"type": model_args.rope_scaling, "factor": scaling_factor})
logger.info(
"Using {} scaling strategy and setting scaling factor to {}".format(model_args.rope_scaling, scaling_factor)
)
def _configure_longlora(config: "PretrainedConfig", model_args: "ModelArguments", is_trainable: bool) -> None:
if not is_trainable or not model_args.shift_attn:
return

if getattr(config, "model_type", None) in SUPPORTED_CLASS_FOR_S2ATTN:
setattr(config, "group_size_ratio", 0.25)
apply_llama_patch()
logger.info("Using shift short attention with group_size_ratio=1/4.")
else:
logger.warning("Current model does not support shift short attention.")

def _configure_quantization(
config: "PretrainedConfig",
tokenizer: "PreTrainedTokenizer",
model_args: "ModelArguments",
init_kwargs: Dict[str, Any],
) -> None:
r"""
Priority: PTQ-quantized (training) > AutoGPTQ (export) > Bitsandbytes (training)
"""
if getattr(config, "quantization_config", None): # ptq
if is_deepspeed_zero3_enabled():
raise ValueError("DeepSpeed ZeRO-3 is incompatible with quantized models.")

init_kwargs["device_map"] = {"": get_current_device()}
quantization_config: Dict[str, Any] = getattr(config, "quantization_config", None)
quant_method = quantization_config.get("quant_method", "")

if quant_method == QuantizationMethod.GPTQ:
require_version("auto_gptq>=0.5.0", "To fix: pip install auto_gptq>=0.5.0")
quantization_config["use_exllama"] = False # disable exllama

if quant_method == QuantizationMethod.AWQ:
require_version("autoawq", "To fix: pip install autoawq")

if quant_method == QuantizationMethod.AQLM:
require_version("transformers>=4.39.0", "To fix: pip install transformers>=4.39.0")
require_version("aqlm>=1.1.0", "To fix: pip install aqlm[gpu]>=1.1.0")
quantization_config["bits"] = 2

quant_bits = quantization_config.get("bits", "?")
logger.info("Loading {}-bit {}-quantized model.".format(quant_bits, quant_method.upper()))

elif model_args.export_quantization_bit is not None: # auto-gptq
require_version("optimum>=1.16.0", "To fix: pip install optimum>=1.16.0")
require_version("auto_gptq>=0.5.0", "To fix: pip install auto_gptq>=0.5.0")
from accelerate.utils import get_max_memory

if getattr(config, "model_type", None) == "chatglm":
raise ValueError("ChatGLM model is not supported.")

init_kwargs["quantization_config"] = GPTQConfig(
bits=model_args.export_quantization_bit,
tokenizer=tokenizer,
dataset=_get_quantization_dataset(tokenizer, model_args),
)
init_kwargs["device_map"] = "auto"
init_kwargs["max_memory"] = get_max_memory()
logger.info("Quantizing model to {} bit.".format(model_args.export_quantization_bit))

elif model_args.quantization_bit is not None: # bnb
if model_args.quantization_bit == 8:
require_version("bitsandbytes>=0.37.0", "To fix: pip install bitsandbytes>=0.37.0")
init_kwargs["quantization_config"] = BitsAndBytesConfig(load_in_8bit=True)

elif model_args.quantization_bit == 4:
require_version("bitsandbytes>=0.39.0", "To fix: pip install bitsandbytes>=0.39.0")
init_kwargs["quantization_config"] = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_compute_dtype=model_args.compute_dtype,
bnb_4bit_use_double_quant=model_args.double_quantization,
bnb_4bit_quant_type=model_args.quantization_type,
bnb_4bit_quant_storage=model_args.compute_dtype, # crucial for fsdp qlora
)

if is_deepspeed_zero3_enabled() or model_args.quantization_device_map == "auto":
if model_args.quantization_bit != 4:
raise ValueError("Only 4-bit quantized model can use auto device map.")

require_version("transformers>=4.39.0", "To fix: pip install transformers>=4.39.0")
require_version("accelerate>=0.28.0", "To fix: pip install accelerate>=0.28.0")
require_version("bitsandbytes>=0.43.0", "To fix: pip install bitsandbytes>=0.43.0")
else:
init_kwargs["device_map"] = {"": get_current_device()}

logger.info("Quantizing model to {} bit.".format(model_args.quantization_bit))
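A hedged, standalone sketch of the bitsandbytes branch above: a 4-bit NF4 QLoRA-style load. The model id is a placeholder; bnb_4bit_quant_storage is the kwarg this diff adds for FSDP+QLoRA and, per the version pins above, needs a recent transformers release.

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_quant_storage=torch.bfloat16,  # keep storage dtype uniform for FSDP sharding
)
model = AutoModelForCausalLM.from_pretrained(
    "some-org/some-base-model", quantization_config=bnb_config, device_map={"": 0}
)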
def _noisy_mean_initialization(embed_weight: torch.Tensor, num_new_tokens: int):
embedding_dim = embed_weight.size(1)
avg_weight = embed_weight[:-num_new_tokens].mean(dim=0, keepdim=True)
@@ -72,151 +238,14 @@ def _resize_embedding_layer(model: "PreTrainedModel", tokenizer: "PreTrainedToke
logger.info("Resized token embeddings from {} to {}.".format(current_embedding_size, new_embedding_size))

def _get_quantization_dataset(tokenizer: "PreTrainedTokenizer", model_args: "ModelArguments") -> List[str]:
r"""
Inspired by: https://github.com/huggingface/optimum/blob/v1.16.0/optimum/gptq/data.py#L133
TODO: remove tokenizer.decode() https://github.com/huggingface/optimum/pull/1600
"""
if os.path.isfile(model_args.export_quantization_dataset):
data_path = FILEEXT2TYPE.get(model_args.export_quantization_dataset.split(".")[-1], None)
data_files = model_args.export_quantization_dataset
else:
data_path = model_args.export_quantization_dataset
data_files = None

dataset = load_dataset(path=data_path, data_files=data_files, split="train", cache_dir=model_args.cache_dir)
maxlen = model_args.export_quantization_maxlen

samples = []
for _ in range(model_args.export_quantization_nsamples):
while True:
sample_idx = random.randint(0, len(dataset) - 1)
sample: Dict[str, torch.Tensor] = tokenizer(dataset[sample_idx]["text"], return_tensors="pt")
if sample["input_ids"].size(1) >= maxlen:
break # TODO: fix large maxlen

word_idx = random.randint(0, sample["input_ids"].size(1) - maxlen - 1)
input_ids = sample["input_ids"][:, word_idx : word_idx + maxlen]
samples.append(tokenizer.decode(input_ids[0].tolist(), skip_special_tokens=True))

return samples

def _configure_attn_implementation(model_args: "ModelArguments", config_kwargs: Dict[str, Any]) -> None:
if model_args.flash_attn:
if is_flash_attn2_available():
config_kwargs["attn_implementation"] = "flash_attention_2"
logger.info("Using FlashAttention-2 for faster training and inference.")
else:
logger.warning("FlashAttention2 is not installed.")
config_kwargs["attn_implementation"] = None
else:
config_kwargs["attn_implementation"] = "eager"

def _configure_rope(config: "PretrainedConfig", model_args: "ModelArguments", is_trainable: bool) -> None:
if not hasattr(config, "rope_scaling"):
logger.warning("Current model does not support RoPE scaling.")
return

if is_trainable:
if model_args.rope_scaling == "dynamic":
logger.warning(
"Dynamic NTK scaling may not work well with fine-tuning. "
"See: https://github.com/huggingface/transformers/pull/24653"
)

current_max_length = getattr(config, "max_position_embeddings", None)
if current_max_length and model_args.model_max_length > current_max_length:
scaling_factor = float(math.ceil(model_args.model_max_length / current_max_length))
else:
logger.warning("Input length is smaller than max length. Consider increasing the input length.")
scaling_factor = 1.0
else:
scaling_factor = 2.0

setattr(config, "rope_scaling", {"type": model_args.rope_scaling, "factor": scaling_factor})
logger.info(
"Using {} scaling strategy and setting scaling factor to {}".format(model_args.rope_scaling, scaling_factor)
)

def _configure_longlora(config: "PretrainedConfig") -> None:
if getattr(config, "model_type", None) in SUPPORTED_CLASS_FOR_S2ATTN:
setattr(config, "group_size_ratio", 0.25)
apply_llama_patch()
logger.info("Using shift short attention with group_size_ratio=1/4.")
else:
logger.warning("Current model does not support shift short attention.")

def _configure_quantization(
config: "PretrainedConfig",
tokenizer: "PreTrainedTokenizer",
model_args: "ModelArguments",
config_kwargs: Dict[str, Any],
) -> None:
r"""
Priority: PTQ-quantized (training) > AutoGPTQ (export) > Bitsandbytes (training)
"""
if getattr(config, "quantization_config", None): # gptq
if is_deepspeed_zero3_enabled():
raise ValueError("DeepSpeed ZeRO-3 is incompatible with quantization.")

config_kwargs["device_map"] = {"": get_current_device()}
quantization_config: Dict[str, Any] = getattr(config, "quantization_config", None)
if quantization_config.get("quant_method", None) == "gptq" and quantization_config.get("bits", -1) == 4:
quantization_config["use_exllama"] = False # disable exllama

if quantization_config.get("quant_method", None) == "aqlm":
quantization_config["bits"] = 2

logger.info(
"Loading {}-bit {}-quantized model.".format(
quantization_config.get("bits", "?"), quantization_config.get("quant_method", None)
)
)

elif model_args.export_quantization_bit is not None: # auto-gptq
require_version("optimum>=1.16.0", "To fix: pip install optimum>=1.16.0")
require_version("auto_gptq>=0.5.0", "To fix: pip install auto_gptq>=0.5.0")
from accelerate.utils import get_max_memory

if getattr(config, "model_type", None) == "chatglm":
raise ValueError("ChatGLM model is not supported.")

config_kwargs["quantization_config"] = GPTQConfig(
bits=model_args.export_quantization_bit,
tokenizer=tokenizer,
dataset=_get_quantization_dataset(tokenizer, model_args),
)
config_kwargs["device_map"] = "auto"
config_kwargs["max_memory"] = get_max_memory()
logger.info("Quantizing model to {} bit.".format(model_args.export_quantization_bit))

elif model_args.quantization_bit is not None: # bnb
if is_deepspeed_zero3_enabled():
raise ValueError("DeepSpeed ZeRO-3 is incompatible with quantization.")

if model_args.quantization_bit == 8:
require_version("bitsandbytes>=0.37.0", "To fix: pip install bitsandbytes>=0.37.0")
config_kwargs["quantization_config"] = BitsAndBytesConfig(load_in_8bit=True)

elif model_args.quantization_bit == 4:
require_version("bitsandbytes>=0.39.0", "To fix: pip install bitsandbytes>=0.39.0")
config_kwargs["quantization_config"] = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_compute_dtype=model_args.compute_dtype,
bnb_4bit_use_double_quant=model_args.double_quantization,
bnb_4bit_quant_type=model_args.quantization_type,
)

config_kwargs["device_map"] = {"": get_current_device()}
logger.info("Quantizing model to {} bit.".format(model_args.quantization_bit))
def _fp32_forward_post_hook(
module: "torch.nn.Module", args: Tuple["torch.Tensor"], output: "torch.Tensor"
) -> "torch.Tensor":
return output.to(torch.float32)

def _prepare_model_for_training(
model: "PreTrainedModel", model_args: "ModelArguments", output_layer_name: Optional[str] = "lm_head"
model: "PreTrainedModel", model_args: "ModelArguments", output_layer_name: str = "lm_head"
) -> None:
r"""
Includes:
@@ -226,10 +255,10 @@ def _prepare_model_for_training(
Inspired by: https://github.com/huggingface/peft/blob/v0.7.1/src/peft/utils/other.py#L72
"""
if model_args.upcast_layernorm:
logger.info("Upcasting layernorm weights in float32.")
for name, param in model.named_parameters():
if param.ndim == 1 and any(ln_name in name for ln_name in LAYERNORM_NAMES):
param.data = param.data.to(torch.float32)
logger.info("Upcasting layernorm weights in float32.")

if not model_args.disable_gradient_checkpointing:
if not getattr(model, "supports_gradient_checkpointing", False):
@@ -239,17 +268,14 @@ def _prepare_model_for_training(
# According to: https://github.com/huggingface/transformers/issues/28339
model.gradient_checkpointing_enable(gradient_checkpointing_kwargs={"use_reentrant": True})
model.enable_input_require_grads()
model.config.use_cache = False # turn off when gradient checkpointing is enabled
setattr(model.config, "use_cache", False) # turn off when gradient checkpointing is enabled
logger.info("Gradient checkpointing enabled.")

if hasattr(model, output_layer_name) and model_args.upcast_lmhead_output:

def fp32_forward_post_hook(module: torch.nn.Module, args: Tuple[torch.Tensor], output: torch.Tensor):
return output.to(torch.float32)

logger.info("Upcasting lm_head outputs in float32.")
output_layer = getattr(model, output_layer_name)
if isinstance(output_layer, torch.nn.Linear) and output_layer.weight.dtype != torch.float32:
output_layer.register_forward_hook(fp32_forward_post_hook)
output_layer.register_forward_hook(_fp32_forward_post_hook)
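A minimal, self-contained demonstration (not from the diff) of the fp32 hook mechanism used above: a forward hook may return a replacement output, so casting there upcasts the lm_head logits without touching its weights. Requires a PyTorch build with bfloat16 CPU support.

import torch

def fp32_hook(module, args, output):
    # returning a tensor from a forward hook replaces the module output
    return output.to(torch.float32)

layer = torch.nn.Linear(4, 2).to(torch.bfloat16)
layer.register_forward_hook(fp32_hook)
logits = layer(torch.randn(1, 4, dtype=torch.bfloat16))
assert logits.dtype == torch.float32  # weights stay bfloat16, outputs are fp32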
def patch_tokenizer(tokenizer: "PreTrainedTokenizer") -> None:
@@ -261,34 +287,64 @@ def patch_config(
config: "PretrainedConfig",
tokenizer: "PreTrainedTokenizer",
model_args: "ModelArguments",
config_kwargs: Dict[str, Any],
init_kwargs: Dict[str, Any],
is_trainable: bool,
) -> None:
if model_args.compute_dtype is None: # priority: bf16 > fp16 > fp32
model_args.compute_dtype = infer_optim_dtype(model_dtype=getattr(config, "torch_dtype", None))

_configure_attn_implementation(config, model_args, init_kwargs)
_configure_rope(config, model_args, is_trainable)
_configure_longlora(config, model_args, is_trainable)
_configure_quantization(config, tokenizer, model_args, init_kwargs)

if model_args.use_cache and not is_trainable:
setattr(config, "use_cache", True)
logger.info("Using KV cache for faster generation.")

if model_args.moe_aux_loss_coef is not None:
if getattr(config, "model_type", None) in ["mixtral", "qwen2_moe"]:
setattr(config, "router_aux_loss_coef", model_args.moe_aux_loss_coef)
elif getattr(config, "model_type", None) == "deepseek":
setattr(config, "aux_loss_alpha", model_args.moe_aux_loss_coef)

if getattr(config, "model_type", None) == "qwen":
setattr(config, "use_flash_attn", model_args.flash_attn)
for dtype_name, dtype in [("fp16", torch.float16), ("bf16", torch.bfloat16), ("fp32", torch.float32)]:
setattr(config, dtype_name, model_args.compute_dtype == dtype)

_configure_attn_implementation(model_args, config_kwargs)
if getattr(config, "model_type", None) == "qwen2" and is_trainable and model_args.flash_attn:
setattr(config, "use_cache", False) # qwen2 does not support use_cache when using flashattn

if model_args.rope_scaling is not None:
_configure_rope(config, model_args, is_trainable)
if getattr(config, "model_type", None) == "qwen2_moe" and is_trainable:
setattr(config, "output_router_logits", True)

if is_trainable and model_args.shift_attn:
_configure_longlora(config)
init_kwargs["torch_dtype"] = model_args.compute_dtype
if not is_deepspeed_zero3_enabled():
init_kwargs["low_cpu_mem_usage"] = model_args.low_cpu_mem_usage
if init_kwargs["low_cpu_mem_usage"]:
if "device_map" not in init_kwargs:
init_kwargs["device_map"] = model_args.device_map or {"": get_current_device()}

_configure_quantization(config, tokenizer, model_args, config_kwargs)
if init_kwargs["device_map"] == "auto":
init_kwargs["offload_folder"] = model_args.offload_folder

def patch_model(
model: "PreTrainedModel", tokenizer: "PreTrainedTokenizer", model_args: "ModelArguments", is_trainable: bool
) -> None:
gen_config = model.generation_config # check and fix generation config
if not gen_config.do_sample and (
(gen_config.temperature is not None and gen_config.temperature != 1.0)
or (gen_config.top_p is not None and gen_config.top_p != 1.0)
or (gen_config.typical_p is not None and gen_config.typical_p != 1.0)
):
gen_config.do_sample = True

if "GenerationMixin" not in str(model.generate.__func__):
model.generate = MethodType(PreTrainedModel.generate, model)

if getattr(model.config, "model_type", None) == "chatglm":
if is_trainable and getattr(model.config, "model_type", None) == "chatglm":
setattr(model, "lm_head", model.transformer.output_layer)
setattr(model, "_keys_to_ignore_on_save", ["lm_head.weight"])

@@ -298,15 +354,15 @@ def patch_model(
if is_trainable:
_prepare_model_for_training(model, model_args)

if getattr(model.config, "model_type", None) == "mixtral" and is_deepspeed_zero3_enabled():
require_version("deepspeed>=0.13.0", "To fix: pip install deepspeed>=0.13.0")
from deepspeed.utils import set_z3_leaf_modules # type: ignore
if getattr(model.config, "model_type", None) == "mixtral":
from transformers.models.mixtral.modeling_mixtral import MixtralSparseMoeBlock

set_z3_leaf_modules(model, [MixtralSparseMoeBlock])
add_z3_leaf_module(model, MixtralSparseMoeBlock)

if is_trainable:
patch_mixtral_replace_moe_impl()
if getattr(model.config, "model_type", None) == "qwen2moe":
from transformers.models.qwen2_moe.modeling_qwen2_moe import Qwen2MoeSparseMoeBlock

add_z3_leaf_module(model, Qwen2MoeSparseMoeBlock)

try:
model.add_model_tags(["llama-factory"])

@@ -1,13 +1,14 @@
import inspect
from enum import Enum, unique
from typing import TYPE_CHECKING, Dict, List

import torch
from transformers import PreTrainedModel
from transformers.integrations import is_deepspeed_zero3_enabled
from transformers.utils import cached_file
from transformers.utils.versions import require_version

from ..extras.constants import V_HEAD_SAFE_WEIGHTS_NAME, V_HEAD_WEIGHTS_NAME
from ..extras.logging import get_logger
from ..extras.misc import get_current_device

if TYPE_CHECKING:
@@ -19,44 +20,38 @@ if TYPE_CHECKING:
logger = get_logger(__name__)

def dispatch_model(model: "PreTrainedModel") -> "PreTrainedModel":
@unique
class QuantizationMethod(str, Enum):
r"""
Dispatches a pre-trained model to GPUs with balanced memory when GPUs are available.
Borrowed from: https://github.com/huggingface/transformers/blob/v4.36.2/src/transformers/modeling_utils.py#L3570
Borrowed from `transformers.utils.quantization_config.QuantizationMethod`.
"""
if getattr(model, "quantization_method", None): # already set on current device
return model

if (
torch.cuda.device_count() > 1
and isinstance(model, PreTrainedModel)
and model._no_split_modules is not None
and model.config.model_type != "chatglm"
):
from accelerate import dispatch_model
from accelerate.utils import get_balanced_memory, infer_auto_device_map
BITS_AND_BYTES = "bitsandbytes"
GPTQ = "gptq"
AWQ = "awq"
AQLM = "aqlm"
QUANTO = "quanto"

kwargs = {"dtype": model.dtype, "no_split_module_classes": model._get_no_split_modules("auto")}
max_memory = get_balanced_memory(model, **kwargs)
# Make sure tied weights are tied before creating the device map.
model.tie_weights()
device_map = infer_auto_device_map(model, max_memory=max_memory, **kwargs)
device_map_kwargs = {"device_map": device_map, "offload_dir": "offload"}
if "skip_keys" in inspect.signature(dispatch_model).parameters:
device_map_kwargs["skip_keys"] = model._skip_keys_device_placement
return dispatch_model(model, **device_map_kwargs)
else:
return model.to(device=get_current_device())

def add_z3_leaf_module(model: "PreTrainedModel", module: "torch.nn.Module") -> None:
r"""
Sets module as a leaf module to skip partitioning in deepspeed zero3.
"""
if is_deepspeed_zero3_enabled():
require_version("deepspeed>=0.13.0", "To fix: pip install deepspeed>=0.13.0")
from deepspeed.utils import set_z3_leaf_modules # type: ignore

set_z3_leaf_modules(model, [module])

def find_all_linear_modules(model: "PreTrainedModel") -> List[str]:
r"""
Finds all available modules to apply lora.
Finds all available modules to apply lora or galore.
"""
quantization_method = getattr(model, "quantization_method", None)
if quantization_method is None:
linear_cls = torch.nn.Linear
elif quantization_method == "bitsandbytes":
elif quantization_method == QuantizationMethod.BITS_AND_BYTES:
import bitsandbytes as bnb

linear_cls = bnb.nn.Linear4bit if getattr(model, "is_loaded_in_4bit", False) else bnb.nn.Linear8bitLt
@@ -66,6 +61,8 @@ def find_all_linear_modules(model: "PreTrainedModel") -> List[str]:
output_layer_names = ["lm_head"]
if model.config.model_type == "chatglm":
output_layer_names.append("output_layer")
elif model.config.model_type == "internlm2":
output_layer_names.append("output")

module_names = set()
for name, module in model.named_modules():

@@ -8,21 +8,22 @@ from trl import DPOTrainer
from trl.trainer.utils import disable_dropout_in_model

from ...extras.constants import IGNORE_INDEX
from ..utils import create_custom_optimzer, create_custom_scheduler

if TYPE_CHECKING:
from transformers import PreTrainedModel

from ...hparams import FinetuningArguments

class CustomDPOTrainer(DPOTrainer):
def __init__(
self,
beta: float,
loss_type: Literal["sigmoid", "hinge", "ipo", "kto_pair"],
ftx_gamma: float,
model: Union["PreTrainedModel", torch.nn.Module],
ref_model: Optional[Union["PreTrainedModel", torch.nn.Module]] = None,
disable_dropout: Optional[bool] = True,
ref_model: Optional[Union["PreTrainedModel", torch.nn.Module]],
finetuning_args: "FinetuningArguments",
disable_dropout: bool = True,
**kwargs,
):
if disable_dropout:
@@ -30,6 +31,7 @@ class CustomDPOTrainer(DPOTrainer):
if ref_model is not None:
disable_dropout_in_model(ref_model)

self.finetuning_args = finetuning_args
self.reference_free = False
self.use_dpo_data_collator = True # hack to avoid warning
self.generate_during_eval = False # disable at evaluation
@@ -42,10 +44,10 @@ class CustomDPOTrainer(DPOTrainer):
self._peft_has_been_casted_to_bf16 = False

self.ref_model = ref_model
self.beta = beta
self.label_smoothing = 0
self.loss_type = loss_type
self.ftx_gamma = ftx_gamma
self.beta = finetuning_args.dpo_beta
self.label_smoothing = finetuning_args.dpo_label_smoothing
self.loss_type = finetuning_args.dpo_loss
self.ftx_gamma = finetuning_args.dpo_ftx
self._stored_metrics = defaultdict(lambda: defaultdict(list))

Trainer.__init__(self, model=model, **kwargs)
@@ -61,7 +63,18 @@ class CustomDPOTrainer(DPOTrainer):
else:
self.ref_model = self.accelerator.prepare_model(self.ref_model, evaluation_mode=True)

def sft_loss(self, chosen_logits: torch.FloatTensor, chosen_labels: torch.LongTensor) -> torch.Tensor:
def create_optimizer(self) -> "torch.optim.Optimizer":
if self.optimizer is None:
self.optimizer = create_custom_optimzer(self.model, self.args, self.finetuning_args)
return super().create_optimizer()

def create_scheduler(
self, num_training_steps: int, optimizer: Optional["torch.optim.Optimizer"] = None
) -> "torch.optim.lr_scheduler.LRScheduler":
create_custom_scheduler(self.args, num_training_steps, optimizer)
return super().create_scheduler(num_training_steps, optimizer)

def sft_loss(self, chosen_logits: "torch.FloatTensor", chosen_labels: "torch.LongTensor") -> "torch.Tensor":
r"""
Computes supervised cross-entropy loss of given labels under the given logits.

@@ -72,18 +85,27 @@ class CustomDPOTrainer(DPOTrainer):
return -all_logps

def concatenated_forward(
self, model: "PreTrainedModel", batch: Dict[str, torch.Tensor]
) -> Tuple[torch.FloatTensor, torch.FloatTensor, torch.FloatTensor, torch.FloatTensor]:
self, model: "PreTrainedModel", batch: Dict[str, "torch.Tensor"]
) -> Tuple["torch.Tensor", "torch.Tensor", "torch.Tensor", "torch.Tensor"]:
r"""
Computes the sum log probabilities of the labels under the given logits if loss_type != IPO.

Otherwise the average log probabilities.
"""
batch_copied = BatchEncoding({k: v.detach().clone() for k, v in batch.items()}) # avoid error

all_logits = model(
input_ids=batch_copied["input_ids"], attention_mask=batch_copied["attention_mask"], return_dict=True
all_logits: "torch.Tensor" = model(
input_ids=batch_copied["input_ids"],
attention_mask=batch_copied["attention_mask"],
return_dict=True,
use_cache=False,
).logits.to(torch.float32)

all_logps = self.get_batch_logps(
all_logits,
batch["labels"],
average_log_prob=False,
logits=all_logits,
labels=batch_copied["labels"],
average_log_prob=(self.loss_type == "ipo"),
is_encoder_decoder=self.is_encoder_decoder,
label_pad_token_id=self.label_pad_token_id,
)
batch_size = batch["input_ids"].size(0) // 2
@@ -94,9 +116,9 @@ class CustomDPOTrainer(DPOTrainer):
def get_batch_loss_metrics(
self,
model: "PreTrainedModel",
batch: Dict[str, torch.Tensor],
train_eval: Optional[Literal["train", "eval"]] = "train",
) -> Tuple[torch.Tensor, Dict[str, torch.Tensor]]:
batch: Dict[str, "torch.Tensor"],
train_eval: Literal["train", "eval"] = "train",
) -> Tuple["torch.Tensor", Dict[str, "torch.Tensor"]]:
r"""
Computes the DPO loss and other metrics for the given batch of inputs for train or test.
"""
@@ -137,13 +159,13 @@ class CustomDPOTrainer(DPOTrainer):
reward_accuracies = (chosen_rewards > rejected_rewards).float()

prefix = "eval_" if train_eval == "eval" else ""
metrics[f"{prefix}rewards/chosen"] = chosen_rewards.cpu().mean()
metrics[f"{prefix}rewards/rejected"] = rejected_rewards.cpu().mean()
metrics[f"{prefix}rewards/accuracies"] = reward_accuracies.cpu().mean()
metrics[f"{prefix}rewards/margins"] = (chosen_rewards - rejected_rewards).cpu().mean()
metrics[f"{prefix}logps/rejected"] = policy_rejected_logps.detach().cpu().mean()
metrics[f"{prefix}logps/chosen"] = policy_chosen_logps.detach().cpu().mean()
metrics[f"{prefix}logits/rejected"] = policy_rejected_logits.detach().cpu().mean()
metrics[f"{prefix}logits/chosen"] = policy_chosen_logits.detach().cpu().mean()
metrics["{}rewards/chosen".format(prefix)] = chosen_rewards.cpu().mean()
metrics["{}rewards/rejected".format(prefix)] = rejected_rewards.cpu().mean()
metrics["{}rewards/accuracies".format(prefix)] = reward_accuracies.cpu().mean()
metrics["{}rewards/margins".format(prefix)] = (chosen_rewards - rejected_rewards).cpu().mean()
metrics["{}logps/rejected".format(prefix)] = policy_rejected_logps.detach().cpu().mean()
metrics["{}logps/chosen".format(prefix)] = policy_chosen_logps.detach().cpu().mean()
metrics["{}logits/rejected".format(prefix)] = policy_rejected_logits.detach().cpu().mean()
metrics["{}logits/chosen".format(prefix)] = policy_chosen_logits.detach().cpu().mean()

return losses.mean(), metrics
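For reference, a standalone computation of the standard sigmoid-variant DPO loss whose rewards and margins are logged above (beta and the log-probabilities are illustrative values, not from this diff): each side's reward is beta times the policy-minus-reference log-probability.

import torch
import torch.nn.functional as F

beta = 0.1
policy_chosen, policy_rejected = torch.tensor([-12.0]), torch.tensor([-15.0])
ref_chosen, ref_rejected = torch.tensor([-13.0]), torch.tensor([-14.0])

chosen_rewards = beta * (policy_chosen - ref_chosen)
rejected_rewards = beta * (policy_rejected - ref_rejected)
loss = -F.logsigmoid(chosen_rewards - rejected_rewards)  # the "sigmoid" loss_type
margin = chosen_rewards - rejected_rewards  # logged as rewards/margins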
|
||||
@@ -2,20 +2,17 @@
|
||||
|
||||
from typing import TYPE_CHECKING, List, Optional
|
||||
|
||||
from transformers import Seq2SeqTrainingArguments
|
||||
|
||||
from ...data import get_dataset, split_dataset
|
||||
from ...data import PairwiseDataCollatorWithPadding, get_dataset, split_dataset
|
||||
from ...extras.constants import IGNORE_INDEX
|
||||
from ...extras.ploting import plot_loss
|
||||
from ...hparams import ModelArguments
|
||||
from ...model import load_model_and_tokenizer
|
||||
from ...train.dpo.collator import DPODataCollatorWithPadding
|
||||
from ...train.dpo.trainer import CustomDPOTrainer
|
||||
from ...train.utils import create_modelcard_and_push, create_ref_model
|
||||
from ...model import load_model, load_tokenizer
|
||||
from ..utils import create_modelcard_and_push, create_ref_model
|
||||
from .trainer import CustomDPOTrainer
|
||||
|
||||
|
||||
if TYPE_CHECKING:
|
||||
from transformers import TrainerCallback
|
||||
from transformers import Seq2SeqTrainingArguments, TrainerCallback
|
||||
|
||||
from ...hparams import DataArguments, FinetuningArguments
|
||||
|
||||
@@ -27,9 +24,11 @@ def run_dpo(
|
||||
finetuning_args: "FinetuningArguments",
|
||||
callbacks: Optional[List["TrainerCallback"]] = None,
|
||||
):
|
||||
model, tokenizer = load_model_and_tokenizer(model_args, finetuning_args, training_args.do_train)
|
||||
tokenizer = load_tokenizer(model_args)
|
||||
dataset = get_dataset(tokenizer, model_args, data_args, training_args, stage="rm")
|
||||
data_collator = DPODataCollatorWithPadding(
|
||||
model = load_model(tokenizer, model_args, finetuning_args, training_args.do_train)
|
||||
|
||||
data_collator = PairwiseDataCollatorWithPadding(
|
||||
tokenizer=tokenizer,
|
||||
pad_to_multiple_of=8,
|
||||
label_pad_token_id=IGNORE_INDEX if data_args.ignore_pad_token_for_loss else tokenizer.pad_token_id,
|
||||
@@ -42,18 +41,14 @@ def run_dpo(
|
||||
ref_model = create_ref_model(model_args, finetuning_args)
|
||||
|
||||
# Update arguments
|
||||
training_args_dict = training_args.to_dict()
|
||||
training_args_dict.update(dict(remove_unused_columns=False)) # important for pairwise dataset
|
||||
training_args = Seq2SeqTrainingArguments(**training_args_dict)
|
||||
training_args.remove_unused_columns = False # important for pairwise dataset
|
||||
|
||||
# Initialize our Trainer
|
||||
trainer = CustomDPOTrainer(
|
||||
beta=finetuning_args.dpo_beta,
|
||||
loss_type=finetuning_args.dpo_loss,
|
||||
ftx_gamma=finetuning_args.dpo_ftx,
|
||||
model=model,
|
||||
ref_model=ref_model,
|
||||
args=training_args,
|
||||
finetuning_args=finetuning_args,
|
||||
tokenizer=tokenizer,
|
||||
data_collator=data_collator,
|
||||
callbacks=callbacks,
|
||||
@@ -68,7 +63,7 @@ def run_dpo(
         trainer.save_metrics("train", train_result.metrics)
         trainer.save_state()
         if trainer.is_world_process_zero() and finetuning_args.plot_loss:
-            plot_loss(training_args.output_dir, keys=["loss", "eval_loss"])
+            plot_loss(training_args.output_dir, keys=["loss", "eval_loss", "rewards/accuracies"])

     # Evaluation
     if training_args.do_eval:
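One detail worth calling out in the `run_dpo` hunks above: `remove_unused_columns` must be `False` because `Trainer` otherwise drops any dataset column that does not match the model's `forward` signature, which would discard the paired chosen/rejected fields the pairwise collator consumes. The new code simply flips the flag in place instead of round-tripping the arguments through `to_dict()` and rebuilding `Seq2SeqTrainingArguments`. A minimal sketch of the in-place pattern (the `output_dir` value is illustrative):

```python
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(output_dir="dpo_out")
training_args.remove_unused_columns = False  # keep pairwise columns for the collator
assert training_args.remove_unused_columns is False
```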
src/llmtuner/train/orpo/__init__.py (Normal file, 4 lines)
@@ -0,0 +1,4 @@
from .workflow import run_orpo


__all__ = ["run_orpo"]
src/llmtuner/train/orpo/trainer.py (Normal file, 122 lines)
@@ -0,0 +1,122 @@
from collections import defaultdict
from typing import TYPE_CHECKING, Dict, Literal, Optional, Tuple, Union

import torch
import torch.nn.functional as F
from transformers import Trainer
from trl import DPOTrainer
from trl.trainer.utils import disable_dropout_in_model

from ...extras.constants import IGNORE_INDEX
from ..utils import create_custom_optimzer, create_custom_scheduler


if TYPE_CHECKING:
    from transformers import PreTrainedModel

    from ...hparams import FinetuningArguments


class CustomORPOTrainer(DPOTrainer):
    def __init__(
        self,
        model: Union["PreTrainedModel", "torch.nn.Module"],
        finetuning_args: "FinetuningArguments",
        disable_dropout: bool = True,
        **kwargs,
    ):
        if disable_dropout:
            disable_dropout_in_model(model)

        self.finetuning_args = finetuning_args
        self.reference_free = False
        self.use_dpo_data_collator = True  # hack to avoid warning
        self.generate_during_eval = False  # disable at evaluation
        self.label_pad_token_id = IGNORE_INDEX
        self.padding_value = 0
        self.is_encoder_decoder = model.config.is_encoder_decoder
        self.precompute_ref_log_probs = False
        self._precomputed_train_ref_log_probs = False
        self._precomputed_eval_ref_log_probs = False
        self._peft_has_been_casted_to_bf16 = False

        self.beta = finetuning_args.orpo_beta
        self._stored_metrics = defaultdict(lambda: defaultdict(list))

        Trainer.__init__(self, model=model, **kwargs)

    def create_optimizer(self) -> "torch.optim.Optimizer":
        if self.optimizer is None:
            self.optimizer = create_custom_optimzer(self.model, self.args, self.finetuning_args)
        return super().create_optimizer()

    def create_scheduler(
        self, num_training_steps: int, optimizer: Optional["torch.optim.Optimizer"] = None
    ) -> "torch.optim.lr_scheduler.LRScheduler":
        create_custom_scheduler(self.args, num_training_steps, optimizer)
        return super().create_scheduler(num_training_steps, optimizer)

    def odds_ratio_loss(self, chosen_logps: "torch.Tensor", rejected_logps: "torch.Tensor") -> "torch.Tensor":
        r"""
        Computes ORPO's odds ratio (OR) loss.
        """
        log_odds = (chosen_logps - rejected_logps) - (
            torch.log1p(-torch.exp(chosen_logps)) - torch.log1p(-torch.exp(rejected_logps))
        )
        odds_ratio_loss = -F.logsigmoid(log_odds)
        return odds_ratio_loss

    def concatenated_forward(
        self, model: "PreTrainedModel", batch: Dict[str, "torch.Tensor"]
    ) -> Tuple["torch.Tensor", "torch.Tensor", "torch.Tensor", "torch.Tensor"]:
        r"""
        Computes the average log probabilities of the labels under the given logits.
        """
        all_logits: "torch.Tensor" = model(
            input_ids=batch["input_ids"], attention_mask=batch["attention_mask"], return_dict=True, use_cache=False
        ).logits.to(torch.float32)

        all_logps = self.get_batch_logps(
            logits=all_logits,
            labels=batch["labels"],
            average_log_prob=True,
            is_encoder_decoder=self.is_encoder_decoder,
            label_pad_token_id=self.label_pad_token_id,
        )
        batch_size = batch["input_ids"].size(0) // 2
        chosen_logps, rejected_logps = all_logps.split(batch_size, dim=0)
        chosen_logits, rejected_logits = all_logits.split(batch_size, dim=0)
        return chosen_logps, rejected_logps, chosen_logits, rejected_logits

    def get_batch_loss_metrics(
        self,
        model: "PreTrainedModel",
        batch: Dict[str, "torch.Tensor"],
        train_eval: Literal["train", "eval"] = "train",
    ) -> Tuple["torch.Tensor", Dict[str, "torch.Tensor"]]:
        r"""
        Computes the ORPO loss and other metrics for the given batch of inputs for train or test.
        """
        metrics = {}
        chosen_logps, rejected_logps, chosen_logits, rejected_logits = self.concatenated_forward(model, batch)
        sft_loss = -chosen_logps
        odds_ratio_loss = self.odds_ratio_loss(chosen_logps, rejected_logps)
        batch_loss = (sft_loss + self.beta * odds_ratio_loss).mean()

        chosen_rewards = self.beta * chosen_logps.detach()
        rejected_rewards = self.beta * rejected_logps.detach()
        reward_accuracies = (chosen_rewards > rejected_rewards).float()

        prefix = "eval_" if train_eval == "eval" else ""
        metrics["{}rewards/chosen".format(prefix)] = chosen_rewards.cpu().mean()
        metrics["{}rewards/rejected".format(prefix)] = rejected_rewards.cpu().mean()
        metrics["{}rewards/accuracies".format(prefix)] = reward_accuracies.cpu().mean()
        metrics["{}rewards/margins".format(prefix)] = (chosen_rewards - rejected_rewards).cpu().mean()
        metrics["{}logps/rejected".format(prefix)] = rejected_logps.detach().cpu().mean()
        metrics["{}logps/chosen".format(prefix)] = chosen_logps.detach().cpu().mean()
        metrics["{}logits/rejected".format(prefix)] = rejected_logits.detach().cpu().mean()
        metrics["{}logits/chosen".format(prefix)] = chosen_logits.detach().cpu().mean()
        metrics["{}sft_loss".format(prefix)] = sft_loss.detach().cpu().mean()
        metrics["{}odds_ratio_loss".format(prefix)] = odds_ratio_loss.detach().cpu().mean()

        return batch_loss, metrics
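A note for readers scanning the new trainer: because `concatenated_forward` requests length-averaged log probabilities (`average_log_prob=True`), each entry of `chosen_logps` and `rejected_logps` is the log of a probability strictly inside (0, 1), which is what keeps `torch.log1p(-torch.exp(...))` finite. In that notation, the loss assembled in `get_batch_loss_metrics` is the following (our rendering of the ORPO objective for orientation, not text from the patch):

```latex
% odds of a response under the policy, using the averaged probability p
\mathrm{odds}_\theta(y \mid x) = \frac{p_\theta(y \mid x)}{1 - p_\theta(y \mid x)}

% log odds ratio between chosen (y_w) and rejected (y_l), computed via log1p(-exp(logp))
\log \mathrm{OR} = \log \frac{\mathrm{odds}_\theta(y_w \mid x)}{\mathrm{odds}_\theta(y_l \mid x)}
= \bigl(\log p_w - \log p_l\bigr) - \bigl(\log(1 - p_w) - \log(1 - p_l)\bigr)

% per-example training loss: sft_loss + beta * odds_ratio_loss
\mathcal{L}_{\mathrm{ORPO}} = -\log p_w + \beta \cdot \bigl(-\log \sigma(\log \mathrm{OR})\bigr)
```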
src/llmtuner/train/orpo/workflow.py (Normal file, 68 lines)
@@ -0,0 +1,68 @@
# Inspired by: https://github.com/huggingface/trl/blob/main/examples/research_projects/stack_llama_2/scripts/dpo_llama2.py

from typing import TYPE_CHECKING, List, Optional

from ...data import PairwiseDataCollatorWithPadding, get_dataset, split_dataset
from ...extras.constants import IGNORE_INDEX
from ...extras.ploting import plot_loss
from ...hparams import ModelArguments
from ...model import load_model, load_tokenizer
from ..utils import create_modelcard_and_push
from .trainer import CustomORPOTrainer


if TYPE_CHECKING:
    from transformers import Seq2SeqTrainingArguments, TrainerCallback

    from ...hparams import DataArguments, FinetuningArguments


def run_orpo(
    model_args: "ModelArguments",
    data_args: "DataArguments",
    training_args: "Seq2SeqTrainingArguments",
    finetuning_args: "FinetuningArguments",
    callbacks: Optional[List["TrainerCallback"]] = None,
):
    tokenizer = load_tokenizer(model_args)
    dataset = get_dataset(tokenizer, model_args, data_args, training_args, stage="rm")
    model = load_model(tokenizer, model_args, finetuning_args, training_args.do_train)

    data_collator = PairwiseDataCollatorWithPadding(
        tokenizer=tokenizer,
        pad_to_multiple_of=8,
        label_pad_token_id=IGNORE_INDEX if data_args.ignore_pad_token_for_loss else tokenizer.pad_token_id,
    )

    # Update arguments
    training_args.remove_unused_columns = False  # important for pairwise dataset

    # Initialize our Trainer
    trainer = CustomORPOTrainer(
        model=model,
        args=training_args,
        finetuning_args=finetuning_args,
        tokenizer=tokenizer,
        data_collator=data_collator,
        callbacks=callbacks,
        **split_dataset(dataset, data_args, training_args),
    )

    # Training
    if training_args.do_train:
        train_result = trainer.train(resume_from_checkpoint=training_args.resume_from_checkpoint)
        trainer.save_model()
        trainer.log_metrics("train", train_result.metrics)
        trainer.save_metrics("train", train_result.metrics)
        trainer.save_state()
        if trainer.is_world_process_zero() and finetuning_args.plot_loss:
            plot_loss(training_args.output_dir, keys=["loss", "eval_loss", "rewards/accuracies", "sft_loss"])

    # Evaluation
    if training_args.do_eval:
        metrics = trainer.evaluate(metric_key_prefix="eval")
        trainer.log_metrics("eval", metrics)
        trainer.save_metrics("eval", metrics)

    # Create model card
    create_modelcard_and_push(trainer, model_args, data_args, training_args, finetuning_args)
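Worth noting from the workflow: unlike `run_dpo`, `run_orpo` never builds a reference model, since the odds-ratio objective is reference-free, yet the dataset is still prepared with `stage="rm"` because the pairwise chosen/rejected format is identical. A hedged sketch of driving the new workflow programmatically; `get_train_args`, the `stage="orpo"` value, and the option names are assumptions inferred from the surrounding codebase, not confirmed by this diff:

```python
# Hypothetical driver for the new ORPO stage; names and values are assumptions.
from llmtuner.hparams import get_train_args  # assumed export, as for other stages
from llmtuner.train.orpo import run_orpo

model_args, data_args, training_args, finetuning_args, _ = get_train_args(
    dict(
        stage="orpo",  # assumed new value accepted by the argument parser
        model_name_or_path="meta-llama/Llama-2-7b-hf",
        dataset="dpo_en_demo",  # any pairwise (chosen/rejected) dataset
        template="default",
        finetuning_type="lora",
        output_dir="orpo_out",
        do_train=True,
    )
)
run_orpo(model_args, data_args, training_args, finetuning_args)
```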
@@ -6,20 +6,23 @@ from typing import TYPE_CHECKING, Dict, List, Optional, Tuple
 import torch
 from tqdm import tqdm
 from transformers import GenerationConfig, Trainer, TrainerControl, TrainerState
+from transformers.optimization import get_scheduler
 from transformers.trainer_pt_utils import remove_dummy_checkpoint
 from transformers.trainer_utils import PREFIX_CHECKPOINT_DIR
 from transformers.utils import SAFE_WEIGHTS_NAME, WEIGHTS_NAME
-from trl import PPOTrainer
+from trl import PPOConfig, PPOTrainer
 from trl.core import PPODecorators, logprobs_from_logits

 from ...extras.callbacks import FixValueHeadModelCallback, LogCallback
 from ...extras.logging import get_logger
-from ...extras.misc import AverageMeter, count_parameters, get_logits_processor
+from ...extras.misc import AverageMeter, count_parameters, get_current_device, get_logits_processor
 from ..utils import create_custom_optimzer, create_custom_scheduler
 from .utils import dump_layernorm, get_rewards_from_server, replace_model, restore_layernorm


 if TYPE_CHECKING:
-    from transformers import Seq2SeqTrainingArguments, TrainerCallback
+    from datasets import Dataset
+    from transformers import DataCollatorWithPadding, PreTrainedTokenizer, Seq2SeqTrainingArguments, TrainerCallback
     from trl import AutoModelForCausalLMWithValueHead

     from ...hparams import FinetuningArguments, GeneratingArguments, ModelArguments
@@ -40,15 +43,59 @@ class CustomPPOTrainer(PPOTrainer, Trainer):
         finetuning_args: "FinetuningArguments",
         generating_args: "GeneratingArguments",
         callbacks: List["TrainerCallback"],
-        reward_model: "AutoModelForCausalLMWithValueHead",
-        **kwargs,
+        model: "AutoModelForCausalLMWithValueHead",
+        reward_model: Optional["AutoModelForCausalLMWithValueHead"],
+        ref_model: Optional["AutoModelForCausalLMWithValueHead"],
+        tokenizer: "PreTrainedTokenizer",
+        dataset: "Dataset",
+        data_collator: "DataCollatorWithPadding",
     ):
-        PPOTrainer.__init__(self, **kwargs)
+        backward_batch_size = training_args.per_device_train_batch_size * training_args.gradient_accumulation_steps
+        ppo_config = PPOConfig(
+            model_name=model_args.model_name_or_path,
+            learning_rate=training_args.learning_rate,
+            mini_batch_size=training_args.per_device_train_batch_size,
+            batch_size=backward_batch_size * finetuning_args.ppo_buffer_size,
+            gradient_accumulation_steps=training_args.gradient_accumulation_steps,
+            ppo_epochs=finetuning_args.ppo_epochs,
+            max_grad_norm=training_args.max_grad_norm,
+            seed=training_args.seed,
+            optimize_device_cache=True,
+            target=finetuning_args.ppo_target,
+            use_score_scaling=finetuning_args.ppo_score_norm,
+            use_score_norm=finetuning_args.ppo_score_norm,
+            whiten_rewards=finetuning_args.ppo_whiten_rewards,
+            accelerator_kwargs={"step_scheduler_with_optimizer": False},
+            log_with=training_args.report_to[0] if training_args.report_to else None,
+            project_kwargs={"logging_dir": training_args.logging_dir},
+        )
+
+        # Create optimizer and scheduler
+        if training_args.max_steps > 0:
+            num_training_steps = training_args.max_steps
+        else:
+            total_train_batch_size = backward_batch_size * finetuning_args.ppo_buffer_size * training_args.world_size
+            num_training_steps = training_args.num_train_epochs * math.ceil(len(dataset) / total_train_batch_size)
+
+        optimizer = self.create_optimizer(model, training_args, finetuning_args)
+        scheduler = self.create_scheduler(training_args, num_training_steps, optimizer)
+
+        PPOTrainer.__init__(
+            self,
+            config=ppo_config,
+            model=model,
+            ref_model=ref_model,
+            tokenizer=tokenizer,
+            dataset=dataset,
+            data_collator=data_collator,
+            lr_scheduler=scheduler,
+        )

         self.args = training_args
         self.model_args = model_args
         self.finetuning_args = finetuning_args
         self.reward_model = reward_model
+        self.current_device = get_current_device()  # patch for deepspeed training

         self.generation_config = GenerationConfig(
             pad_token_id=self.tokenizer.pad_token_id,
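The batch bookkeeping in the new constructor is easy to misread, so here is the arithmetic with illustrative numbers (only the formulas mirror the code above; the values are made up):

```python
import math

# Illustrative values, not defaults from the patch.
per_device_train_batch_size = 4
gradient_accumulation_steps = 8
ppo_buffer_size = 2  # finetuning_args.ppo_buffer_size
world_size = 2       # number of distributed processes
num_train_epochs, dataset_len = 3, 10_000

# samples consumed by one optimizer update on one device
backward_batch_size = per_device_train_batch_size * gradient_accumulation_steps  # 32

# rollout buffer each PPO step optimizes over (per device)
ppo_batch_size = backward_batch_size * ppo_buffer_size  # 64

# rollouts gathered across all processes per PPO step
total_train_batch_size = ppo_batch_size * world_size  # 128

num_training_steps = num_train_epochs * math.ceil(dataset_len / total_train_batch_size)
print(num_training_steps)  # 3 * 79 = 237
```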
@@ -204,6 +251,44 @@ class CustomPPOTrainer(PPOTrainer, Trainer):
             self.args, self.state, self.control, model=self.accelerator.unwrap_model(self.model)
         )

+    def create_optimizer(
+        self,
+        model: "AutoModelForCausalLMWithValueHead",
+        training_args: "Seq2SeqTrainingArguments",
+        finetuning_args: "FinetuningArguments",
+    ) -> "torch.optim.Optimizer":
+        optimizer = create_custom_optimzer(model, training_args, finetuning_args)
+        if optimizer is None:
+            decay_params, nodecay_params = [], []
+            decay_param_names = self.get_decay_parameter_names(model)
+            for name, param in model.named_parameters():
+                if param.requires_grad:
+                    if name in decay_param_names:
+                        decay_params.append(param)
+                    else:
+                        nodecay_params.append(param)
+
+            optim_class, optim_kwargs = Trainer.get_optimizer_cls_and_kwargs(training_args)
+            param_groups = [
+                dict(params=nodecay_params),
+                dict(params=decay_params, weight_decay=training_args.weight_decay),
+            ]
+            optimizer = optim_class(param_groups, **optim_kwargs)
+
+        return optimizer
+
+    def create_scheduler(
+        self, training_args: "Seq2SeqTrainingArguments", num_training_steps: int, optimizer: "torch.optim.Optimizer"
+    ) -> "torch.optim.lr_scheduler.LRScheduler":
+        create_custom_scheduler(training_args, num_training_steps, optimizer)
+        lr_scheduler = get_scheduler(
+            training_args.lr_scheduler_type,
+            optimizer=optimizer,
+            num_warmup_steps=training_args.get_warmup_steps(num_training_steps),
+            num_training_steps=num_training_steps,
+        )
+        return lr_scheduler
+
     @torch.no_grad()
     def get_inputs(self, batch: Dict[str, torch.Tensor]) -> Tuple[List[torch.Tensor], List[torch.Tensor]]:
         r"""
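The fallback branch of `create_optimizer` reproduces the usual decay / no-decay split that `transformers.Trainer` applies: biases and normalization weights are exempted from weight decay. A self-contained sketch of the same idea on a toy module (names and hyperparameters here are illustrative, not from the patch):

```python
import torch
from torch import nn

model = nn.Sequential(nn.Linear(4, 4), nn.LayerNorm(4))

# Norm weights and biases conventionally skip weight decay, mirroring
# the role of Trainer.get_decay_parameter_names in the patch.
nodecay_names = {n for n, _ in model.named_parameters() if n.endswith("bias")}
for mod_name, mod in model.named_modules():
    if isinstance(mod, nn.LayerNorm):
        nodecay_names.update(n for n, _ in mod.named_parameters(prefix=mod_name))

decay_params = [p for n, p in model.named_parameters() if n not in nodecay_names]
nodecay_params = [p for n, p in model.named_parameters() if n in nodecay_names]

optimizer = torch.optim.AdamW(
    [
        {"params": nodecay_params, "weight_decay": 0.0},  # norms and biases: no decay
        {"params": decay_params, "weight_decay": 0.01},   # matrix weights: decayed
    ],
    lr=1e-3,
)
```

Setting `weight_decay` explicitly on both groups avoids silently inheriting the optimizer's default decay for the no-decay group.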
@@ -268,7 +353,7 @@
         batch = self.prepare_model_inputs(queries, responses)

         with torch.cuda.amp.autocast(dtype=self.model_args.compute_dtype):  # support bf16
-            _, _, values = reward_model(**batch, output_hidden_states=True, return_dict=True)
+            _, _, values = reward_model(**batch, output_hidden_states=True, return_dict=True, use_cache=False)

         if getattr(unwrapped_model.config, "model_type", None) == "chatglm":  # assume same architecture
             values = torch.transpose(values, 0, 1)
@@ -291,7 +376,7 @@
         queries: torch.Tensor,
         responses: torch.Tensor,
         model_inputs: dict,
-        return_logits: Optional[bool] = False,
+        return_logits: bool = False,
         response_masks: Optional[torch.Tensor] = None,
     ):
         r"""
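The last two hunks are small hygiene fixes: passing `use_cache=False` when scoring full sequences with the reward model (no generation happens there, so building a KV cache only wastes memory), and tightening `return_logits: Optional[bool] = False` to `return_logits: bool = False`, since `None` was never a meaningful value. The typing point in one minimal example (illustrative, not from the patch):

```python
from typing import Optional

def old(flag: Optional[bool] = False) -> bool:
    # Optional[bool] admits None, forcing callers to reason about three states
    return bool(flag)

def new(flag: bool = False) -> bool:
    # a plain bool default documents the two real states
    return flag

assert old(None) is False and new(True) is True
```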
Some files were not shown because too many files have changed in this diff.