release v0.7.1

Former-commit-id: a4f8adb021b6218d624303b51cd5e93ffa3111a1
fix #3694
2024-05-16 00:57:16 +08:00 · 2024-05-16 00:35:28 +08:00 · 2024-05-15 23:05:02 +08:00 · 2024-05-15 22:58:19 +08:00 · 2024-05-15 20:02:41 +08:00 · 2024-05-15 19:25:48 +08:00
150 changed files with 3497 additions and 1979 deletions
--- a/2
+++ b/2
@@ -11,4 +11,4 @@ RUN pip install -e .[deepspeed,metrics,bitsandbytes,qwen]
 VOLUME [ "/root/.cache/huggingface/", "/app/data", "/app/output" ]
 EXPOSE 7860
-CMD [ "python", "src/train_web.py" ]
+CMD [ "llamafactory-cli", "webui" ]
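Reviewer note: the hunk above replaces a script path with a CLI subcommand, and the README hunks make the same swap for `python src/api_demo.py` (now `llamafactory-cli api`). The sketch below is a hypothetical migration helper, not part of LLaMA Factory; it only knows the two replacements visible in this diff and passes any remaining arguments through unchanged, which may not match how the new CLI expects its YAML config.

```python
# Hypothetical helper for rewriting legacy launch commands.
# The mapping lists only the two replacements that appear in this diff.
LEGACY_TO_CLI = {
    "src/train_web.py": ["llamafactory-cli", "webui"],
    "src/api_demo.py": ["llamafactory-cli", "api"],
}


def migrate_command(argv):
    """Rewrite ['python', '<legacy script>', ...] to the CLI form.

    Commands that do not match a known legacy script are returned unchanged.
    """
    if len(argv) >= 2 and argv[0] == "python" and argv[1] in LEGACY_TO_CLI:
        return LEGACY_TO_CLI[argv[1]] + list(argv[2:])
    return list(argv)


if __name__ == "__main__":
    print(migrate_command(["python", "src/train_web.py"]))
```

Such a helper could be run over Dockerfiles or shell scripts that still call the old entry points; the argument lists would still need a manual check against `llamafactory-cli help`.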
--- a/README.md
+++ b/README.md
@@ -5,7 +5,7 @@
 [![GitHub last commit](https://img.shields.io/github/last-commit/hiyouga/LLaMA-Factory)](https://github.com/hiyouga/LLaMA-Factory/commits/main)
 [![PyPI](https://img.shields.io/pypi/v/llmtuner)](https://pypi.org/project/llmtuner/)
 [![Downloads](https://static.pepy.tech/badge/llmtuner)](https://pypi.org/project/llmtuner/)
-[![Citation](https://img.shields.io/badge/citation-34-green)](#projects-using-llama-factory)
+[![Citation](https://img.shields.io/badge/citation-44-green)](#projects-using-llama-factory)
 [![GitHub pull request](https://img.shields.io/badge/PRs-welcome-blue)](https://github.com/hiyouga/LLaMA-Factory/pulls)
 [![Discord](https://dcbadge.vercel.app/api/server/rKfvV9r9FK?compact=true&style=flat)](https://discord.gg/rKfvV9r9FK)
 [![Twitter](https://img.shields.io/twitter/follow/llamafactory_ai)](https://twitter.com/llamafactory_ai)
@@ -13,6 +13,8 @@
 [![Studios](https://img.shields.io/badge/ModelScope-Open%20in%20Studios-blue)](https://modelscope.cn/studios/hiyouga/LLaMA-Board)
 [![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1eRTPn37ltBbYsISy9Aw2NuI2Aq5CQrD9?usp=sharing)
 [![GitHub Tread](https://trendshift.io/api/badge/repositories/4535)](https://trendshift.io/repositories/4535)
 👋 Join our [WeChat](assets/wechat.jpg).
 \[ English | [中文](README_zh.md) \]
@@ -68,57 +70,61 @@ Compared to ChatGLM's [P-Tuning](https://github.com/THUDM/ChatGLM2-6B/tree/main/
 ## Changelog
-[24/04/26] We supported fine-tuning the **LLaVA-1.5** multimodal LLMs. See `examples/lora_single_gpu/sft_mllm.sh` for usage.
+[24/05/14] We supported training and inference on the Ascend NPU devices. Check [installation](#installation) section for details.
-[24/04/22] We provided a **[Colab notebook](https://colab.research.google.com/drive/1eRTPn37ltBbYsISy9Aw2NuI2Aq5CQrD9?usp=sharing)** for fine-tuning the Llama-3 model on a free T4 GPU. Two Llama-3-derived models fine-tuned using LLaMA Factory are available at Hugging Face, check [Llama3-8B-Chinese-Chat](https://huggingface.co/shenzhi-wang/Llama3-8B-Chinese-Chat) and [Llama3-Chinese](https://huggingface.co/zhichen/Llama3-Chinese) for details.
+[24/05/13] We supported fine-tuning the **Yi-1.5** series models.
-[24/04/21] We supported **[Mixture-of-Depths](https://arxiv.org/abs/2404.02258)** according to [AstraMindAI's implementation](https://github.com/astramind-ai/Mixture-of-depths). See `examples/extras/mod` for usage.
+[24/04/26] We supported fine-tuning the **LLaVA-1.5** multimodal LLMs. See [examples](examples/README.md) for usage.
-[24/04/16] We supported **[BAdam](https://arxiv.org/abs/2404.02827)**. See `examples/extras/badam` for usage.
-[24/04/16] We supported **[unsloth](https://github.com/unslothai/unsloth)**'s long-sequence training (Llama-2-7B-56k within 24GB). It achieves **117%** speed and **50%** memory compared with FlashAttention-2, more benchmarks can be found in [this page](https://github.com/hiyouga/LLaMA-Factory/wiki/Performance-comparison).
 <details><summary>Full Changelog</summary>
-[24/03/31] We supported **[ORPO](https://arxiv.org/abs/2403.07691)**. See `examples/lora_single_gpu` for usage.
+[24/04/22] We provided a **[Colab notebook](https://colab.research.google.com/drive/1eRTPn37ltBbYsISy9Aw2NuI2Aq5CQrD9?usp=sharing)** for fine-tuning the Llama-3 model on a free T4 GPU. Two Llama-3-derived models fine-tuned using LLaMA Factory are available at Hugging Face, check [Llama3-8B-Chinese-Chat](https://huggingface.co/shenzhi-wang/Llama3-8B-Chinese-Chat) and [Llama3-Chinese](https://huggingface.co/zhichen/Llama3-Chinese) for details.
+[24/04/21] We supported **[Mixture-of-Depths](https://arxiv.org/abs/2404.02258)** according to [AstraMindAI's implementation](https://github.com/astramind-ai/Mixture-of-depths). See [examples](examples/README.md) for usage.
+[24/04/16] We supported **[BAdam](https://arxiv.org/abs/2404.02827)**. See [examples](examples/README.md) for usage.
+[24/04/16] We supported **[unsloth](https://github.com/unslothai/unsloth)**'s long-sequence training (Llama-2-7B-56k within 24GB). It achieves **117%** speed and **50%** memory compared with FlashAttention-2, more benchmarks can be found in [this page](https://github.com/hiyouga/LLaMA-Factory/wiki/Performance-comparison).
+[24/03/31] We supported **[ORPO](https://arxiv.org/abs/2403.07691)**. See [examples](examples/README.md) for usage.
 [24/03/21] Our paper "[LlamaFactory: Unified Efficient Fine-Tuning of 100+ Language Models](https://arxiv.org/abs/2403.13372)" is available at arXiv!
-[24/03/20] We supported **FSDP+QLoRA** that fine-tunes a 70B model on 2x24GB GPUs. See `examples/extras/fsdp_qlora` for usage.
+[24/03/20] We supported **FSDP+QLoRA** that fine-tunes a 70B model on 2x24GB GPUs. See [examples](examples/README.md) for usage.
-[24/03/13] We supported **[LoRA+](https://arxiv.org/abs/2402.12354)**. See `examples/extras/loraplus` for usage.
+[24/03/13] We supported **[LoRA+](https://arxiv.org/abs/2402.12354)**. See [examples](examples/README.md) for usage.
-[24/03/07] We supported gradient low-rank projection (**[GaLore](https://arxiv.org/abs/2403.03507)**) algorithm. See `examples/extras/galore` for usage.
+[24/03/07] We supported gradient low-rank projection (**[GaLore](https://arxiv.org/abs/2403.03507)**) algorithm. See [examples](examples/README.md) for usage.
-[24/03/07] We integrated **[vLLM](https://github.com/vllm-project/vllm)** for faster and concurrent inference. Try `--infer_backend vllm` to enjoy **270%** inference speed. (LoRA is not yet supported, merge it first.)
+[24/03/07] We integrated **[vLLM](https://github.com/vllm-project/vllm)** for faster and concurrent inference. Try `infer_backend: vllm` to enjoy **270%** inference speed.
-[24/02/28] We supported weight-decomposed LoRA (**[DoRA](https://arxiv.org/abs/2402.09353)**). Try `--use_dora` to activate DoRA training.
+[24/02/28] We supported weight-decomposed LoRA (**[DoRA](https://arxiv.org/abs/2402.09353)**). Try `use_dora: true` to activate DoRA training.
-[24/02/15] We supported **block expansion** proposed by [LLaMA Pro](https://github.com/TencentARC/LLaMA-Pro). See `examples/extras/llama_pro` for usage.
+[24/02/15] We supported **block expansion** proposed by [LLaMA Pro](https://github.com/TencentARC/LLaMA-Pro). See [examples](examples/README.md) for usage.
 [24/02/05] Qwen1.5 (Qwen2 beta version) series models are supported in LLaMA-Factory. Check this [blog post](https://qwenlm.github.io/blog/qwen1.5/) for details.
-[24/01/18] We supported **agent tuning** for most models, equipping model with tool using abilities by fine-tuning with `--dataset glaive_toolcall`.
+[24/01/18] We supported **agent tuning** for most models, equipping model with tool using abilities by fine-tuning with `dataset: glaive_toolcall`.
-[23/12/23] We supported **[unsloth](https://github.com/unslothai/unsloth)**'s implementation to boost LoRA tuning for the LLaMA, Mistral and Yi models. Try `--use_unsloth` argument to activate unsloth patch. It achieves **170%** speed in our benchmark, check [this page](https://github.com/hiyouga/LLaMA-Factory/wiki/Performance-comparison) for details.
+[23/12/23] We supported **[unsloth](https://github.com/unslothai/unsloth)**'s implementation to boost LoRA tuning for the LLaMA, Mistral and Yi models. Try `use_unsloth: true` argument to activate unsloth patch. It achieves **170%** speed in our benchmark, check [this page](https://github.com/hiyouga/LLaMA-Factory/wiki/Performance-comparison) for details.
 [23/12/12] We supported fine-tuning the latest MoE model **[Mixtral 8x7B](https://huggingface.co/mistralai/Mixtral-8x7B-v0.1)** in our framework. See hardware requirement [here](#hardware-requirement).
-[23/12/01] We supported downloading pre-trained models and datasets from the **[ModelScope Hub](https://modelscope.cn/models)** for Chinese mainland users. See [this tutorial](#use-modelscope-hub-optional) for usage.
+[23/12/01] We supported downloading pre-trained models and datasets from the **[ModelScope Hub](https://modelscope.cn/models)** for Chinese mainland users. See [this tutorial](#download-from-modelscope-hub) for usage.
-[23/10/21] We supported **[NEFTune](https://arxiv.org/abs/2310.05914)** trick for fine-tuning. Try `--neftune_noise_alpha` argument to activate NEFTune, e.g., `--neftune_noise_alpha 5`.
+[23/10/21] We supported **[NEFTune](https://arxiv.org/abs/2310.05914)** trick for fine-tuning. Try `neftune_noise_alpha: 5` argument to activate NEFTune.
-[23/09/27] We supported **$S^2$-Attn** proposed by [LongLoRA](https://github.com/dvlab-research/LongLoRA) for the LLaMA models. Try `--shift_attn` argument to enable shift short attention.
+[23/09/27] We supported **$S^2$-Attn** proposed by [LongLoRA](https://github.com/dvlab-research/LongLoRA) for the LLaMA models. Try `shift_attn: true` argument to enable shift short attention.
-[23/09/23] We integrated MMLU, C-Eval and CMMLU benchmarks in this repo. See [this example](#evaluation) to evaluate your models.
+[23/09/23] We integrated MMLU, C-Eval and CMMLU benchmarks in this repo. See [examples](examples/README.md) for usage.
-[23/09/10] We supported **[FlashAttention-2](https://github.com/Dao-AILab/flash-attention)**. Try `--flash_attn fa2` argument to enable FlashAttention-2 if you are using RTX4090, A100 or H100 GPUs.
+[23/09/10] We supported **[FlashAttention-2](https://github.com/Dao-AILab/flash-attention)**. Try `flash_attn: fa2` argument to enable FlashAttention-2 if you are using RTX4090, A100 or H100 GPUs.
-[23/08/12] We supported **RoPE scaling** to extend the context length of the LLaMA models. Try `--rope_scaling linear` argument in training and `--rope_scaling dynamic` argument at inference to extrapolate the position embeddings.
+[23/08/12] We supported **RoPE scaling** to extend the context length of the LLaMA models. Try `rope_scaling: linear` argument in training and `rope_scaling: dynamic` argument at inference to extrapolate the position embeddings.
-[23/08/11] We supported **[DPO training](https://arxiv.org/abs/2305.18290)** for instruction-tuned models. See [this example](#dpo-training) to train your models.
+[23/08/11] We supported **[DPO training](https://arxiv.org/abs/2305.18290)** for instruction-tuned models. See [examples](examples/README.md) for usage.
-[23/07/31] We supported **dataset streaming**. Try `--streaming` and `--max_steps 10000` arguments to load your dataset in streaming mode.
+[23/07/31] We supported **dataset streaming**. Try `streaming: true` and `max_steps: 10000` arguments to load your dataset in streaming mode.
 [23/07/29] We released two instruction-tuned 13B models at Hugging Face. See these Hugging Face Repos ([LLaMA-2](https://huggingface.co/hiyouga/Llama-2-Chinese-13b-chat) / [Baichuan](https://huggingface.co/hiyouga/Baichuan-13B-sft)) for details.
@@ -130,7 +136,7 @@ Compared to ChatGLM's [P-Tuning](https://github.com/THUDM/ChatGLM2-6B/tree/main/
 [23/06/22] We aligned the [demo API](src/api_demo.py) with the [OpenAI's](https://platform.openai.com/docs/api-reference/chat) format where you can insert the fine-tuned model in **arbitrary ChatGPT-based applications**.
-[23/06/03] We supported quantized training and inference (aka **[QLoRA](https://github.com/artidoro/qlora)**). Try `--quantization_bit 4/8` argument to work with quantized models.
+[23/06/03] We supported quantized training and inference (aka **[QLoRA](https://github.com/artidoro/qlora)**). See [examples](examples/README.md) for usage.
 </details>
@@ -143,7 +149,7 @@ Compared to ChatGLM's [P-Tuning](https://github.com/THUDM/ChatGLM2-6B/tree/main/
 | [BLOOMZ](https://huggingface.co/bigscience)              | 560M/1.1B/1.7B/3B/7.1B/176B      | query_key_value   | -         |
 | [ChatGLM3](https://huggingface.co/THUDM)                 | 6B                               | query_key_value   | chatglm3  |
 | [Command-R](https://huggingface.co/CohereForAI)          | 35B/104B                         | q_proj,v_proj     | cohere    |
-| [DeepSeek (MoE)](https://huggingface.co/deepseek-ai)     | 7B/16B/67B                       | q_proj,v_proj     | deepseek  |
+| [DeepSeek (MoE)](https://huggingface.co/deepseek-ai)     | 7B/16B/67B/236B                  | q_proj,v_proj     | deepseek  |
 | [Falcon](https://huggingface.co/tiiuae)                  | 7B/40B/180B                      | query_key_value   | falcon    |
 | [Gemma/CodeGemma](https://huggingface.co/google)         | 2B/7B                            | q_proj,v_proj     | gemma     |
 | [InternLM2](https://huggingface.co/internlm)             | 7B/20B                           | wqkv              | intern2   |
@@ -159,7 +165,8 @@ Compared to ChatGLM's [P-Tuning](https://github.com/THUDM/ChatGLM2-6B/tree/main/
 | [Qwen1.5 (Code/MoE)](https://huggingface.co/Qwen)        | 0.5B/1.8B/4B/7B/14B/32B/72B/110B | q_proj,v_proj     | qwen      |
 | [StarCoder2](https://huggingface.co/bigcode)             | 3B/7B/15B                        | q_proj,v_proj     | -         |
 | [XVERSE](https://huggingface.co/xverse)                  | 7B/13B/65B                       | q_proj,v_proj     | xverse    |
-| [Yi](https://huggingface.co/01-ai)                       | 6B/9B/34B                        | q_proj,v_proj     | yi        |
+| [Yi (1/1.5)](https://huggingface.co/01-ai)               | 6B/9B/34B                        | q_proj,v_proj     | yi        |
 | [Yi-VL](https://huggingface.co/01-ai)                    | 6B/34B                           | q_proj,v_proj     | yi_vl     |
 | [Yuan](https://huggingface.co/IEITYuan)                  | 2B/51B/102B                      | q_proj,v_proj     | yuan      |
 > [!NOTE]
@@ -205,8 +212,8 @@ You also can add a custom chat template to [template.py](src/llmtuner/data/templ
 - [Stanford Alpaca (en)](https://github.com/tatsu-lab/stanford_alpaca)
 - [Stanford Alpaca (zh)](https://github.com/ymcui/Chinese-LLaMA-Alpaca)
 - [Alpaca GPT4 (en&zh)](https://github.com/Instruction-Tuning-with-GPT-4/GPT-4-LLM)
-- [Self Cognition (zh)](data/self_cognition.json)
+- [Identity (en&zh)](data/identity.json)
-- [Open Assistant (multilingual)](https://huggingface.co/datasets/OpenAssistant/oasst1)
+- [Open Assistant (zh)](https://huggingface.co/datasets/OpenAssistant/oasst1)
 - [ShareGPT (zh)](https://huggingface.co/datasets/QingyiSi/Alpaca-CoT/tree/main/Chinese-instruction-collection)
 - [Guanaco Dataset (multilingual)](https://huggingface.co/datasets/JosephusCheung/GuanacoDataset)
 - [BELLE 2M (zh)](https://huggingface.co/datasets/BelleGroup/train_2M_CN)
@@ -254,11 +261,11 @@ You also can add a custom chat template to [template.py](src/llmtuner/data/templ
 <details><summary>Preference datasets</summary>
 - [HH-RLHF (en)](https://huggingface.co/datasets/Anthropic/hh-rlhf)
 - [Open Assistant (multilingual)](https://huggingface.co/datasets/OpenAssistant/oasst1)
 - [GPT-4 Generated Data (en&zh)](https://github.com/Instruction-Tuning-with-GPT-4/GPT-4-LLM)
 - [Orca DPO (en)](https://huggingface.co/datasets/Intel/orca_dpo_pairs)
 - [Nectar (en)](https://huggingface.co/datasets/berkeley-nest/Nectar)
 - [DPO mixed (en&zh)](https://huggingface.co/datasets/hiyouga/DPO-En-Zh-20k)
 - [Open Assistant (zh)](https://huggingface.co/datasets/OpenAssistant/oasst1)
 - [Orca DPO (de)](https://huggingface.co/datasets/mayflowergmbh/intel_orca_dpo_pairs_de)
 </details>
@@ -276,18 +283,19 @@ huggingface-cli login
 | ------------ | ------- | --------- |
 | python       | 3.8     | 3.10      |
 | torch        | 1.13.1  | 2.2.0     |
-| transformers | 4.37.2  | 4.39.3    |
+| transformers | 4.37.2  | 4.40.1    |
-| datasets     | 2.14.3  | 2.18.0    |
+| datasets     | 2.14.3  | 2.19.1    |
-| accelerate   | 0.27.2  | 0.28.0    |
+| accelerate   | 0.27.2  | 0.30.0    |
 | peft         | 0.9.0   | 0.10.0    |
-| trl          | 0.8.1   | 0.8.1     |
+| trl          | 0.8.1   | 0.8.6     |
 | Optional     | Minimum | Recommend |
 | ------------ | ------- | --------- |
 | CUDA         | 11.6    | 12.2      |
 | deepspeed    | 0.10.0  | 0.14.0    |
-| bitsandbytes | 0.39.0  | 0.43.0    |
+| bitsandbytes | 0.39.0  | 0.43.1    |
-| flash-attn   | 2.3.0   | 2.5.6     |
+| vllm         | 0.4.0   | 0.4.2     |
+| flash-attn   | 2.3.0   | 2.5.8     |
 ### Hardware Requirement
@@ -305,28 +313,25 @@ huggingface-cli login
 ## Getting Started
-### Data Preparation
+### Installation
-Please refer to [data/README.md](data/README.md) for checking the details about the format of dataset files. You can either use datasets on HuggingFace / ModelScope hub or load the dataset in local disk.
+> [!IMPORTANT]
-
+> Installation is mandatory.
-> [!NOTE]
-> Please update `data/dataset_info.json` to use your custom dataset.
-### Dependence Installation
 ```bash
 git clone https://github.com/hiyouga/LLaMA-Factory.git
 conda create -n llama_factory python=3.10
 conda activate llama_factory
 cd LLaMA-Factory
-pip install -e .[metrics]
+pip install -e .[torch,metrics]
 ```
-Extra dependencies available: deepspeed, metrics, galore, badam, vllm, bitsandbytes, gptq, awq, aqlm, qwen, modelscope, quality
+Extra dependencies available: torch, metrics, deepspeed, bitsandbytes, vllm, galore, badam, gptq, awq, aqlm, qwen, modelscope, quality
 > [!TIP]
 > Use `pip install --no-deps -e .` to resolve package conflicts.
 <details><summary>For Windows users</summary>
-If you want to enable the quantized LoRA (QLoRA) on the Windows platform, you will be required to install a pre-built version of `bitsandbytes` library, which supports CUDA 11.1 to 12.2, please select the appropriate [release version](https://github.com/jllllll/bitsandbytes-windows-webui/releases/tag/wheels) based on your CUDA version.
+If you want to enable the quantized LoRA (QLoRA) on the Windows platform, you need to install a pre-built version of `bitsandbytes` library, which supports CUDA 11.1 to 12.2, please select the appropriate [release version](https://github.com/jllllll/bitsandbytes-windows-webui/releases/tag/wheels) based on your CUDA version.
 ```bash
 pip install https://github.com/jllllll/bitsandbytes-windows-webui/releases/download/wheels/bitsandbytes-0.41.2.post2-py3-none-win_amd64.whl
@@ -336,25 +341,73 @@ To enable FlashAttention-2 on the Windows platform, you need to install the prec
 </details>
-### Train with LLaMA Board GUI (powered by [Gradio](https://github.com/gradio-app/gradio))
+<details><summary>For Ascend NPU users</summary>
+To utilize Ascend NPU devices for (distributed) training and inference, you need to install the **[torch-npu](https://gitee.com/ascend/pytorch)** library and the **[Ascend CANN Kernels](https://www.hiascend.com/developer/download/community/result?module=cann)**.
+| Requirement  | Minimum | Recommend |
+| ------------ | ------- | --------- |
+| CANN         | 8.0.RC1 | 8.0.RC1   |
+| torch        | 2.2.0   | 2.2.0     |
+| torch-npu    | 2.2.0   | 2.2.0     |
+| deepspeed    | 0.13.2  | 0.13.2    |
+Docker image:
+- 32GB: [Download page](http://mirrors.cn-central-221.ovaijisuan.com/detail/130.html)
+- 64GB: Coming soon
+Remember to use `ASCEND_RT_VISIBLE_DEVICES` instead of `CUDA_VISIBLE_DEVICES` to specify the device to use.
+If you cannot infer model on NPU devices, try setting `do_sample: false` in the configurations.
+</details>
+### Data Preparation
+Please refer to [data/README.md](data/README.md) for checking the details about the format of dataset files. You can either use datasets on HuggingFace / ModelScope hub or load the dataset in local disk.
+> [!NOTE]
+> Please update `data/dataset_info.json` to use your custom dataset.
+### Quickstart
+Use the following 3 commands to run LoRA **fine-tuning**, **inference** and **merging** of the Llama3-8B-Instruct model, respectively.
+```bash
+CUDA_VISIBLE_DEVICES=0 llamafactory-cli train examples/lora_single_gpu/llama3_lora_sft.yaml
+CUDA_VISIBLE_DEVICES=0 llamafactory-cli chat examples/inference/llama3_lora_sft.yaml
+CUDA_VISIBLE_DEVICES=0 llamafactory-cli export examples/merge_lora/llama3_lora_sft.yaml
+```
+See [examples/README.md](examples/README.md) for advanced usage (including distributed training).
+> [!TIP]
+> Use `llamafactory-cli help` to show help information.
+### Fine-Tuning with LLaMA Board GUI (powered by [Gradio](https://github.com/gradio-app/gradio))
 > [!IMPORTANT]
-> LLaMA Board GUI only supports training on a single GPU, please use [CLI](#command-line-interface) for distributed training.
+> LLaMA Board GUI only supports training on a single GPU.
 #### Use local environment
 ```bash
-export CUDA_VISIBLE_DEVICES=0 # `set CUDA_VISIBLE_DEVICES=0` for Windows
+CUDA_VISIBLE_DEVICES=0 GRADIO_SHARE=1 llamafactory-cli webui
-export GRADIO_SERVER_PORT=7860 # `set GRADIO_SERVER_PORT=7860` for Windows
-python src/train_web.py # or python -m llmtuner.webui.interface
 ```
-<details><summary>For Alibaba Cloud users</summary>
+<details><summary>For Alibaba Cloud PAI or AutoDL users</summary>
-If you encountered display problems in LLaMA Board on Alibaba Cloud, try using the following command to set environment variables before starting LLaMA Board:
+If you encountered display problems in LLaMA Board on Alibaba Cloud PAI, try using the following command to set environment variables before starting LLaMA Board:
 ```bash
-export GRADIO_ROOT_PATH=/${JUPYTER_NAME}/proxy/7860/
+export GRADIO_SERVER_PORT=7860 GRADIO_ROOT_PATH=/${JUPYTER_NAME}/proxy/7860/
 ```
+If you are using AutoDL, please install a specific version of Gradio:
+```bash
+pip install gradio==4.10.0
+```
 </details>
@@ -388,20 +441,10 @@ docker compose -f ./docker-compose.yml up -d
 </details>
-### Train with Command Line Interface
-See [examples/README.md](examples/README.md) for usage.
-Use `python src/train_bash.py -h` to display arguments description.
 ### Deploy with OpenAI-style API and vLLM
 ```bash
-CUDA_VISIBLE_DEVICES=0,1 API_PORT=8000 python src/api_demo.py \
+CUDA_VISIBLE_DEVICES=0,1 API_PORT=8000 llamafactory-cli api examples/inference/llama3_vllm.yaml
-    --model_name_or_path meta-llama/Meta-Llama-3-8B-Instruct \
-    --template llama3 \
-    --infer_backend vllm \
-    --vllm_enforce_eager
 ```
 ### Download from ModelScope Hub
@@ -441,6 +484,7 @@ If you have a project that should be incorporated, please contact via email or c
 1. Huang et al. Key-Point-Driven Data Synthesis with its Enhancement on Mathematical Reasoning. 2024. [[arxiv]](https://arxiv.org/abs/2403.02333)
 1. Duan et al. Negating Negatives: Alignment without Human Positive Samples via Distributional Dispreference Optimization. 2024. [[arxiv]](https://arxiv.org/abs/2403.03419)
 1. Xie and Schwertfeger. Empowering Robotics with Large Language Models: osmAG Map Comprehension with LLMs. 2024. [[arxiv]](https://arxiv.org/abs/2403.08228)
 1. Wu et al. Large Language Models are Parallel Multilingual Learners. 2024. [[arxiv]](https://arxiv.org/abs/2403.09073)
 1. Zhang et al. EDT: Improving Large Language Models' Generation by Entropy-based Dynamic Temperature Sampling. 2024. [[arxiv]](https://arxiv.org/abs/2403.14541)
 1. Weller et al. FollowIR: Evaluating and Teaching Information Retrieval Models to Follow Instructions. 2024. [[arxiv]](https://arxiv.org/abs/2403.15246)
 1. Hongbin Na. CBT-LLM: A Chinese Large Language Model for Cognitive Behavioral Therapy-based Mental Health Question Answering. 2024. [[arxiv]](https://arxiv.org/abs/2403.16008)
@@ -448,12 +492,21 @@ If you have a project that should be incorporated, please contact via email or c
 1. Liu et al. Extensive Self-Contrast Enables Feedback-Free Language Model Alignment. 2024. [[arxiv]](https://arxiv.org/abs/2404.00604)
 1. Luo et al. BAdam: A Memory Efficient Full Parameter Training Method for Large Language Models. 2024. [[arxiv]](https://arxiv.org/abs/2404.02827)
 1. Du et al. Chinese Tiny LLM: Pretraining a Chinese-Centric Large Language Model. 2024. [[arxiv]](https://arxiv.org/abs/2404.04167)
 1. Ma et al. Parameter Efficient Quasi-Orthogonal Fine-Tuning via Givens Rotation. 2024. [[arxiv]](https://arxiv.org/abs/2404.04316)
 1. Liu et al. Dynamic Generation of Personalities with Large Language Models. 2024. [[arxiv]](https://arxiv.org/abs/2404.07084)
 1. Shang et al. How Far Have We Gone in Stripped Binary Code Understanding Using Large Language Models. 2024. [[arxiv]](https://arxiv.org/abs/2404.09836)
 1. Huang et al. LLMTune: Accelerate Database Knob Tuning with Large Language Models. 2024. [[arxiv]](https://arxiv.org/abs/2404.11581)
 1. Deng et al. Text-Tuple-Table: Towards Information Integration in Text-to-Table Generation via Global Tuple Extraction. 2024. [[arxiv]](https://arxiv.org/abs/2404.14215)
 1. Acikgoz et al. Hippocrates: An Open-Source Framework for Advancing Large Language Models in Healthcare. 2024. [[arxiv]](https://arxiv.org/abs/2404.16621)
 1. Zhang et al. Small Language Models Need Strong Verifiers to Self-Correct Reasoning. 2024. [[arxiv]](https://arxiv.org/abs/2404.17140)
 1. Zhou et al. FREB-TQA: A Fine-Grained Robustness Evaluation Benchmark for Table Question Answering. 2024. [[arxiv]](https://arxiv.org/abs/2404.18585)
 1. **[StarWhisper](https://github.com/Yu-Yang-Li/StarWhisper)**: A large language model for Astronomy, based on ChatGLM2-6B and Qwen-14B.
 1. **[DISC-LawLLM](https://github.com/FudanDISC/DISC-LawLLM)**: A large language model specialized in Chinese legal domain, based on Baichuan-13B, is capable of retrieving and reasoning on legal knowledge.
 1. **[Sunsimiao](https://github.com/thomas-yanxin/Sunsimiao)**: A large language model specialized in Chinese medical domain, based on Baichuan-7B and ChatGLM-6B.
 1. **[CareGPT](https://github.com/WangRongsheng/CareGPT)**: A series of large language models for Chinese medical domain, based on LLaMA2-7B and Baichuan-13B.
 1. **[MachineMindset](https://github.com/PKU-YuanGroup/Machine-Mindset/)**: A series of MBTI Personality large language models, capable of giving any LLM 16 different personality types based on different datasets and training methods.
 1. **[Luminia-13B-v3](https://huggingface.co/Nekochu/Luminia-13B-v3)**: A large language model specialized in generate metadata for stable diffusion. [[🤗Demo]](https://huggingface.co/spaces/Nekochu/Luminia-13B_SD_Prompt)
 1. **[Chinese-LLaVA-Med](https://github.com/BUAADreamer/Chinese-LLaVA-Med)**: A multimodal large language model specialized in Chinese medical domain, based on LLaVA-1.5-7B.
 </details>
@@ -461,7 +514,7 @@ If you have a project that should be incorporated, please contact via email or c
 This repository is licensed under the [Apache-2.0 License](LICENSE).
-Please follow the model licenses to use the corresponding model weights: [Baichuan2](https://huggingface.co/baichuan-inc/Baichuan2-7B-Base/blob/main/Community%20License%20for%20Baichuan%202%20Model.pdf) / [BLOOM](https://huggingface.co/spaces/bigscience/license) / [ChatGLM3](https://github.com/THUDM/ChatGLM3/blob/main/MODEL_LICENSE) / [Command-R](https://cohere.com/c4ai-cc-by-nc-license) / [DeepSeek](https://github.com/deepseek-ai/DeepSeek-LLM/blob/main/LICENSE-MODEL) / [Falcon](https://huggingface.co/tiiuae/falcon-180B/blob/main/LICENSE.txt) / [Gemma](https://ai.google.dev/gemma/terms) / [InternLM2](https://github.com/InternLM/InternLM#license) / [LLaMA](https://github.com/facebookresearch/llama/blob/main/MODEL_CARD.md) / [LLaMA-2/LLaVA-1.5](https://ai.meta.com/llama/license/) / [LLaMA-3](https://llama.meta.com/llama3/license/) / [Mistral](LICENSE) / [OLMo](LICENSE) / [Phi-1.5/2](https://huggingface.co/microsoft/phi-1_5/resolve/main/Research%20License.docx) / [Phi-3](https://huggingface.co/microsoft/Phi-3-mini-4k-instruct/blob/main/LICENSE) / [Qwen](https://github.com/QwenLM/Qwen/blob/main/Tongyi%20Qianwen%20LICENSE%20AGREEMENT) / [StarCoder2](https://huggingface.co/spaces/bigcode/bigcode-model-license-agreement) / [XVERSE](https://github.com/xverse-ai/XVERSE-13B/blob/main/MODEL_LICENSE.pdf) / [Yi](https://huggingface.co/01-ai/Yi-6B/blob/main/LICENSE) / [Yuan](https://github.com/IEIT-Yuan/Yuan-2.0/blob/main/LICENSE-Yuan)
+Please follow the model licenses to use the corresponding model weights: [Baichuan2](https://huggingface.co/baichuan-inc/Baichuan2-7B-Base/blob/main/Community%20License%20for%20Baichuan%202%20Model.pdf) / [BLOOM](https://huggingface.co/spaces/bigscience/license) / [ChatGLM3](https://github.com/THUDM/ChatGLM3/blob/main/MODEL_LICENSE) / [Command-R](https://cohere.com/c4ai-cc-by-nc-license) / [DeepSeek](https://github.com/deepseek-ai/DeepSeek-LLM/blob/main/LICENSE-MODEL) / [Falcon](https://huggingface.co/tiiuae/falcon-180B/blob/main/LICENSE.txt) / [Gemma](https://ai.google.dev/gemma/terms) / [InternLM2](https://github.com/InternLM/InternLM#license) / [LLaMA](https://github.com/facebookresearch/llama/blob/main/MODEL_CARD.md) / [LLaMA-2 (LLaVA-1.5)](https://ai.meta.com/llama/license/) / [LLaMA-3](https://llama.meta.com/llama3/license/) / [Mistral](LICENSE) / [OLMo](LICENSE) / [Phi-1.5/2](https://huggingface.co/microsoft/phi-1_5/resolve/main/Research%20License.docx) / [Phi-3](https://huggingface.co/microsoft/Phi-3-mini-4k-instruct/blob/main/LICENSE) / [Qwen](https://github.com/QwenLM/Qwen/blob/main/Tongyi%20Qianwen%20LICENSE%20AGREEMENT) / [StarCoder2](https://huggingface.co/spaces/bigcode/bigcode-model-license-agreement) / [XVERSE](https://github.com/xverse-ai/XVERSE-13B/blob/main/MODEL_LICENSE.pdf) / [Yi](https://huggingface.co/01-ai/Yi-6B/blob/main/LICENSE) / [Yi-1.5](LICENSE) / [Yuan](https://github.com/IEIT-Yuan/Yuan-2.0/blob/main/LICENSE-Yuan)
 ## Citation
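Reviewer note: throughout the changelog hunks above, CLI flags such as `--use_dora` and `--neftune_noise_alpha 5` are rewritten as YAML-style keys (`use_dora: true`, `neftune_noise_alpha: 5`). The sketch below illustrates that mapping for existing flag lists. The function names `flags_to_config` and `to_yaml_lines` are hypothetical and not part of LLaMA Factory; the output is hand-rendered `key: value` text, so values that need YAML quoting are not handled.

```python
def flags_to_config(argv):
    """Convert ['--key', 'value', '--switch'] into a dict.

    A flag followed by another flag (or by nothing) is treated as a
    boolean switch and mapped to True.
    """
    config = {}
    i = 0
    while i < len(argv):
        token = argv[i]
        if not token.startswith("--"):
            raise ValueError(f"expected a --flag, got {token!r}")
        key = token[2:]
        has_value = i + 1 < len(argv) and not argv[i + 1].startswith("--")
        config[key] = argv[i + 1] if has_value else True
        i += 2 if has_value else 1
    return config


def to_yaml_lines(config):
    """Render a flat dict as 'key: value' lines, with lowercase booleans."""
    lines = []
    for key, value in config.items():
        text = str(value).lower() if isinstance(value, bool) else str(value)
        lines.append(f"{key}: {text}")
    return lines


if __name__ == "__main__":
    legacy = ["--infer_backend", "vllm", "--use_dora",
              "--neftune_noise_alpha", "5", "--streaming", "--max_steps", "10000"]
    print("\n".join(to_yaml_lines(flags_to_config(legacy))))
```

The printed lines could be pasted into a config file passed to `llamafactory-cli train`, assuming the keys match those accepted by the installed version.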
--- a/README_zh.md
+++ b/README_zh.md
@@ -5,7 +5,7 @@
 [![GitHub last commit](https://img.shields.io/github/last-commit/hiyouga/LLaMA-Factory)](https://github.com/hiyouga/LLaMA-Factory/commits/main)
 [![PyPI](https://img.shields.io/pypi/v/llmtuner)](https://pypi.org/project/llmtuner/)
 [![Downloads](https://static.pepy.tech/badge/llmtuner)](https://pypi.org/project/llmtuner/)
-[![Citation](https://img.shields.io/badge/citation-34-green)](#使用了-llama-factory-的项目)
+[![Citation](https://img.shields.io/badge/citation-44-green)](#使用了-llama-factory-的项目)
 [![GitHub pull request](https://img.shields.io/badge/PRs-welcome-blue)](https://github.com/hiyouga/LLaMA-Factory/pulls)
 [![Discord](https://dcbadge.vercel.app/api/server/rKfvV9r9FK?compact=true&style=flat)](https://discord.gg/rKfvV9r9FK)
 [![Twitter](https://img.shields.io/twitter/follow/llamafactory_ai)](https://twitter.com/llamafactory_ai)
@@ -13,6 +13,8 @@
 [![Studios](https://img.shields.io/badge/ModelScope-Open%20in%20Studios-blue)](https://modelscope.cn/studios/hiyouga/LLaMA-Board)
 [![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1d5KQtbemerlSDSxZIfAaWXhKr30QypiK?usp=sharing)
 [![GitHub Tread](https://trendshift.io/api/badge/repositories/4535)](https://trendshift.io/repositories/4535)
 👋 Join our [WeChat group](assets/wechat.jpg).
 \[ [English](README.md) | 中文 \]
@@ -68,57 +70,61 @@ https://github.com/hiyouga/LLaMA-Factory/assets/16256802/ec36a9dd-37f4-4f72-81bd
 ## Changelog
-[24/04/26] We supported fine-tuning the **LLaVA-1.5** multimodal LLMs. See `examples/lora_single_gpu/sft_mllm.sh` for usage.
+[24/05/14] We supported training and inference on the Ascend NPU devices. Check the [installation](#安装-llama-factory) section for details.
-[24/04/22] We provided a **[Colab notebook](https://colab.research.google.com/drive/1d5KQtbemerlSDSxZIfAaWXhKr30QypiK?usp=sharing)** for fine-tuning the Llama-3 model on a free T4 GPU. The Hugging Face community has released two Llama-3 models fine-tuned with LLaMA Factory, see [Llama3-8B-Chinese-Chat](https://huggingface.co/shenzhi-wang/Llama3-8B-Chinese-Chat) and [Llama3-Chinese](https://huggingface.co/zhichen/Llama3-Chinese) for details.
+[24/05/13] We supported fine-tuning the Yi-1.5 series models.
-[24/04/21] We supported **[Mixture-of-Depths training](https://arxiv.org/abs/2404.02258)** based on [AstraMindAI's repository](https://github.com/astramind-ai/Mixture-of-depths). See `examples/extras/mod` for usage.
+[24/04/26] We supported fine-tuning the **LLaVA-1.5** multimodal LLMs. See [examples](examples/README_zh.md) for usage.
-[24/04/16] We supported **[BAdam](https://arxiv.org/abs/2404.02827)**. See `examples/extras/badam` for usage.
-[24/04/16] We supported **[unsloth](https://github.com/unslothai/unsloth)**'s long-sequence training (Llama-2-7B-56k can be trained within 24GB). Compared with FlashAttention-2, it delivers **117%** training speed and **50%** memory savings. More benchmarks can be found on [this page](https://github.com/hiyouga/LLaMA-Factory/wiki/Performance-comparison).
 <details><summary>Full Changelog</summary>
-[24/03/31] We supported **[ORPO](https://arxiv.org/abs/2403.07691)**. See `examples/lora_single_gpu` for usage.
+[24/04/22] We provided a **[Colab notebook](https://colab.research.google.com/drive/1d5KQtbemerlSDSxZIfAaWXhKr30QypiK?usp=sharing)** for fine-tuning the Llama-3 model on a free T4 GPU. The Hugging Face community has released two Llama-3 models fine-tuned with LLaMA Factory, see [Llama3-8B-Chinese-Chat](https://huggingface.co/shenzhi-wang/Llama3-8B-Chinese-Chat) and [Llama3-Chinese](https://huggingface.co/zhichen/Llama3-Chinese) for details.
+[24/04/21] We supported **[Mixture-of-Depths training](https://arxiv.org/abs/2404.02258)** based on [AstraMindAI's repository](https://github.com/astramind-ai/Mixture-of-depths). See [examples](examples/README_zh.md) for usage.
+[24/04/16] We supported **[BAdam](https://arxiv.org/abs/2404.02827)**. See [examples](examples/README_zh.md) for usage.
+[24/04/16] We supported **[unsloth](https://github.com/unslothai/unsloth)**'s long-sequence training (Llama-2-7B-56k can be trained within 24GB). Compared with FlashAttention-2, it delivers **117%** training speed and **50%** memory savings. More benchmarks can be found on [this page](https://github.com/hiyouga/LLaMA-Factory/wiki/Performance-comparison).
+[24/03/31] We supported **[ORPO](https://arxiv.org/abs/2403.07691)**. See [examples](examples/README_zh.md) for usage.
 [24/03/21] Our paper "[LlamaFactory: Unified Efficient Fine-Tuning of 100+ Language Models](https://arxiv.org/abs/2403.13372)" is available at arXiv!
-[24/03/20] We supported **FSDP+QLoRA**, which can fine-tune a 70B model on 2x24GB GPUs. See `examples/extras/fsdp_qlora` for usage.
+[24/03/20] We supported **FSDP+QLoRA**, which can fine-tune a 70B model on 2x24GB GPUs. See [examples](examples/README_zh.md) for usage.
-[24/03/13] 我们支持了 **[LoRA+](https://arxiv.org/abs/2402.12354)**。详细用法请参照 `examples/extras/loraplus`。
+[24/03/13] 我们支持了 **[LoRA+](https://arxiv.org/abs/2402.12354)**。详细用法请参照 [examples](examples/README_zh.md)。
-[24/03/07] 我们支持了梯度低秩投影（**[GaLore](https://arxiv.org/abs/2403.03507)**）算法。详细用法请参照 `examples/extras/galore`。
+[24/03/07] 我们支持了梯度低秩投影（**[GaLore](https://arxiv.org/abs/2403.03507)**）算法。详细用法请参照 [examples](examples/README_zh.md)。
-[24/03/07] 我们集成了 **[vLLM](https://github.com/vllm-project/vllm)** 以实现极速并发推理。请使用 `--infer_backend vllm` 来获得 **270%** 的推理速度。（尚不支持 LoRA，请先合并权重。）
+[24/03/07] 我们集成了 **[vLLM](https://github.com/vllm-project/vllm)** 以实现极速并发推理。请使用 `infer_backend: vllm` 来获得 **270%** 的推理速度。
-[24/02/28] 我们支持了 **[DoRA](https://arxiv.org/abs/2402.09353)** 微调。请使用 `--use_dora` 参数进行 DoRA 微调。
+[24/02/28] 我们支持了 **[DoRA](https://arxiv.org/abs/2402.09353)** 微调。请使用 `use_dora: true` 参数进行 DoRA 微调。
-[24/02/15] 我们支持了 [LLaMA Pro](https://github.com/TencentARC/LLaMA-Pro) 提出的**块扩展**方法。详细用法请参照 `examples/extras/llama_pro`。
+[24/02/15] 我们支持了 [LLaMA Pro](https://github.com/TencentARC/LLaMA-Pro) 提出的**块扩展**方法。详细用法请参照 [examples](examples/README_zh.md)。
 [24/02/05] Qwen1.5（Qwen2 测试版）系列模型已在 LLaMA-Factory 中实现微调支持。详情请查阅该[博客页面](https://qwenlm.github.io/zh/blog/qwen1.5/)。
-[24/01/18] 我们针对绝大多数模型实现了 **Agent 微调**，微调时指定 `--dataset glaive_toolcall` 即可使模型获得工具调用能力。
+[24/01/18] 我们针对绝大多数模型实现了 **Agent 微调**，微调时指定 `dataset: glaive_toolcall` 即可使模型获得工具调用能力。
-[23/12/23] 我们针对 LLaMA, Mistral 和 Yi 模型支持了 **[unsloth](https://github.com/unslothai/unsloth)** 的 LoRA 训练加速。请使用 `--use_unsloth` 参数启用 unsloth 优化。该方法可提供 **170%** 的训练速度，详情请查阅[此页面](https://github.com/hiyouga/LLaMA-Factory/wiki/Performance-comparison)。
+[23/12/23] 我们针对 LLaMA, Mistral 和 Yi 模型支持了 **[unsloth](https://github.com/unslothai/unsloth)** 的 LoRA 训练加速。请使用 `use_unsloth: true` 参数启用 unsloth 优化。该方法可提供 **170%** 的训练速度，详情请查阅[此页面](https://github.com/hiyouga/LLaMA-Factory/wiki/Performance-comparison)。
 [23/12/12] 我们支持了微调最新的混合专家模型 **[Mixtral 8x7B](https://huggingface.co/mistralai/Mixtral-8x7B-v0.1)**。硬件需求请查阅[此处](#硬件依赖)。
-[23/12/01] 我们支持了从 **[魔搭社区](https://modelscope.cn/models)** 下载预训练模型和数据集。详细用法请参照 [此教程](#使用魔搭社区可跳过)。
+[23/12/01] 我们支持了从 **[魔搭社区](https://modelscope.cn/models)** 下载预训练模型和数据集。详细用法请参照 [此教程](#从魔搭社区下载)。
-[23/10/21] 我们支持了 **[NEFTune](https://arxiv.org/abs/2310.05914)** 训练技巧。请使用 `--neftune_noise_alpha` 参数启用 NEFTune，例如 `--neftune_noise_alpha 5`。
+[23/10/21] 我们支持了 **[NEFTune](https://arxiv.org/abs/2310.05914)** 训练技巧。请使用 `neftune_noise_alpha: 5` 参数启用 NEFTune。
-[23/09/27] 我们针对 LLaMA 模型支持了 [LongLoRA](https://github.com/dvlab-research/LongLoRA) 提出的 **$S^2$-Attn**。请使用 `--shift_attn` 参数以启用该功能。
+[23/09/27] 我们针对 LLaMA 模型支持了 [LongLoRA](https://github.com/dvlab-research/LongLoRA) 提出的 **$S^2$-Attn**。请使用 `shift_attn: true` 参数以启用该功能。
-[23/09/23] 我们在项目中集成了 MMLU、C-Eval 和 CMMLU 评估集。使用方法请参阅[此示例](#模型评估)。
+[23/09/23] 我们在项目中集成了 MMLU、C-Eval 和 CMMLU 评估集。详细用法请参照 [examples](examples/README_zh.md)。
-[23/09/10] 我们支持了 **[FlashAttention-2](https://github.com/Dao-AILab/flash-attention)**。如果您使用的是 RTX4090、A100 或 H100 GPU，请使用 `--flash_attn fa2` 参数以启用 FlashAttention-2。
+[23/09/10] 我们支持了 **[FlashAttention-2](https://github.com/Dao-AILab/flash-attention)**。如果您使用的是 RTX4090、A100 或 H100 GPU，请使用 `flash_attn: fa2` 参数以启用 FlashAttention-2。
-[23/08/12] 我们支持了 **RoPE 插值**来扩展 LLaMA 模型的上下文长度。请使用 `--rope_scaling linear` 参数训练模型或使用 `--rope_scaling dynamic` 参数评估模型。
+[23/08/12] 我们支持了 **RoPE 插值**来扩展 LLaMA 模型的上下文长度。请使用 `rope_scaling: linear` 参数训练模型或使用 `rope_scaling: dynamic` 参数评估模型。
-[23/08/11] 我们支持了指令模型的 **[DPO 训练](https://arxiv.org/abs/2305.18290)**。使用方法请参阅[此示例](#dpo-训练)。
+[23/08/11] 我们支持了指令模型的 **[DPO 训练](https://arxiv.org/abs/2305.18290)**。详细用法请参照 [examples](examples/README_zh.md)。
-[23/07/31] 我们支持了**数据流式加载**。请使用 `--streaming` 和 `--max_steps 10000` 参数来流式加载数据集。
+[23/07/31] 我们支持了**数据流式加载**。请使用 `streaming: true` 和 `max_steps: 10000` 参数来流式加载数据集。
 [23/07/29] 我们在 Hugging Face 发布了两个 13B 指令微调模型。详细内容请查阅我们的 Hugging Face 项目（[LLaMA-2](https://huggingface.co/hiyouga/Llama-2-Chinese-13b-chat) / [Baichuan](https://huggingface.co/hiyouga/Baichuan-13B-sft)）。
@@ -130,7 +136,7 @@ https://github.com/hiyouga/LLaMA-Factory/assets/16256802/ec36a9dd-37f4-4f72-81bd
 [23/06/22] 我们对齐了[示例 API](src/api_demo.py) 与 [OpenAI API](https://platform.openai.com/docs/api-reference/chat) 的格式，您可以将微调模型接入**任意基于 ChatGPT 的应用**中。
-[23/06/03] 我们实现了 4 比特的 LoRA 训练（也称 **[QLoRA](https://github.com/artidoro/qlora)**）。请使用 `--quantization_bit 4` 参数进行 4 比特量化微调。
+[23/06/03] 我们实现了 4 比特的 LoRA 训练（也称 **[QLoRA](https://github.com/artidoro/qlora)**）。详细用法请参照 [examples](examples/README_zh.md)。
 </details>
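Review note: the changelog entries above rewrite legacy command-line flags (`--use_dora`, `--neftune_noise_alpha 5`, `--flash_attn fa2`) as YAML-style config keys (`use_dora: true`, `neftune_noise_alpha: 5`, `flash_attn: fa2`). The sketch below shows one way that mapping could be done mechanically. The helper names `cli_args_to_config` and `to_yaml_lines` are hypothetical and are not part of LLaMA Factory; the sketch assumes flat scalar options only and that a flag without a value means `true`.

```python
from typing import Dict, List, Union

Value = Union[bool, int, float, str]


def _parse_scalar(text: str) -> Value:
    """Interpret a CLI value as int, then float, otherwise keep it as a string."""
    for cast in (int, float):
        try:
            return cast(text)
        except ValueError:
            continue
    return text


def cli_args_to_config(argv: List[str]) -> Dict[str, Value]:
    """Map `--key value` pairs to config entries; a bare `--key` becomes True.

    Hypothetical helper for illustration, not a LLaMA Factory API.
    """
    config: Dict[str, Value] = {}
    i = 0
    while i < len(argv):
        token = argv[i]
        if not token.startswith("--"):
            raise ValueError(f"unexpected token: {token!r}")
        key = token[2:]
        # A value is present when the next token exists and is not another flag.
        has_value = i + 1 < len(argv) and not argv[i + 1].startswith("--")
        config[key] = _parse_scalar(argv[i + 1]) if has_value else True
        i += 2 if has_value else 1
    return config


def to_yaml_lines(config: Dict[str, Value]) -> List[str]:
    """Render flat scalar entries as `key: value` lines (booleans in lowercase)."""
    lines = []
    for key, value in config.items():
        rendered = str(value).lower() if isinstance(value, bool) else str(value)
        lines.append(f"{key}: {rendered}")
    return lines


if __name__ == "__main__":
    legacy = ["--use_dora", "--neftune_noise_alpha", "5", "--flash_attn", "fa2"]
    print("\n".join(to_yaml_lines(cli_args_to_config(legacy))))
```

For anything beyond flat scalars (lists, nested options, quoting), a real YAML library would be the safer choice; this sketch only covers the cases that appear in the changelog.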
@@ -143,7 +149,7 @@ https://github.com/hiyouga/LLaMA-Factory/assets/16256802/ec36a9dd-37f4-4f72-81bd
 | [BLOOMZ](https://huggingface.co/bigscience)              | 560M/1.1B/1.7B/3B/7.1B/176B      | query_key_value   | -         |
 | [ChatGLM3](https://huggingface.co/THUDM)                 | 6B                               | query_key_value   | chatglm3  |
 | [Command-R](https://huggingface.co/CohereForAI)          | 35B/104B                         | q_proj,v_proj     | cohere    |
-| [DeepSeek (MoE)](https://huggingface.co/deepseek-ai)     | 7B/16B/67B                       | q_proj,v_proj     | deepseek  |
+| [DeepSeek (MoE)](https://huggingface.co/deepseek-ai)     | 7B/16B/67B/236B                  | q_proj,v_proj     | deepseek  |
 | [Falcon](https://huggingface.co/tiiuae)                  | 7B/40B/180B                      | query_key_value   | falcon    |
 | [Gemma/CodeGemma](https://huggingface.co/google)         | 2B/7B                            | q_proj,v_proj     | gemma     |
 | [InternLM2](https://huggingface.co/internlm)             | 7B/20B                           | wqkv              | intern2   |
@@ -159,11 +165,12 @@ https://github.com/hiyouga/LLaMA-Factory/assets/16256802/ec36a9dd-37f4-4f72-81bd
 | [Qwen1.5 (Code/MoE)](https://huggingface.co/Qwen)        | 0.5B/1.8B/4B/7B/14B/32B/72B/110B | q_proj,v_proj     | qwen      |
 | [StarCoder2](https://huggingface.co/bigcode)             | 3B/7B/15B                        | q_proj,v_proj     | -         |
 | [XVERSE](https://huggingface.co/xverse)                  | 7B/13B/65B                       | q_proj,v_proj     | xverse    |
-| [Yi](https://huggingface.co/01-ai)                       | 6B/9B/34B                        | q_proj,v_proj     | yi        |
+| [Yi (1/1.5)](https://huggingface.co/01-ai)               | 6B/9B/34B                        | q_proj,v_proj     | yi        |
 | [Yi-VL](https://huggingface.co/01-ai)                    | 6B/34B                           | q_proj,v_proj     | yi_vl     |
 | [Yuan](https://huggingface.co/IEITYuan)                  | 2B/51B/102B                      | q_proj,v_proj     | yuan      |
 > [!NOTE]
-> **默认模块**应作为 `--lora_target` 参数的默认值，可使用 `--lora_target all` 参数指定全部模块以得到更好的效果。
+> **默认模块**应作为 `--lora_target` 参数的默认值，可使用 `--lora_target all` 参数指定全部模块以取得更好的效果。
 >
 > 对于所有“基座”（Base）模型，`--template` 参数可以是 `default`, `alpaca`, `vicuna` 等任意值。但“对话”（Instruct/Chat）模型请务必使用**对应的模板**。
 >
@@ -205,8 +212,8 @@ https://github.com/hiyouga/LLaMA-Factory/assets/16256802/ec36a9dd-37f4-4f72-81bd
 - [Stanford Alpaca (en)](https://github.com/tatsu-lab/stanford_alpaca)
 - [Stanford Alpaca (zh)](https://github.com/ymcui/Chinese-LLaMA-Alpaca)
 - [Alpaca GPT4 (en&zh)](https://github.com/Instruction-Tuning-with-GPT-4/GPT-4-LLM)
-- [Self Cognition (zh)](data/self_cognition.json)
+- [Identity (en&zh)](data/identity.json)
-- [Open Assistant (multilingual)](https://huggingface.co/datasets/OpenAssistant/oasst1)
+- [Open Assistant (zh)](https://huggingface.co/datasets/OpenAssistant/oasst1)
 - [ShareGPT (zh)](https://huggingface.co/datasets/QingyiSi/Alpaca-CoT/tree/main/Chinese-instruction-collection)
 - [Guanaco Dataset (multilingual)](https://huggingface.co/datasets/JosephusCheung/GuanacoDataset)
 - [BELLE 2M (zh)](https://huggingface.co/datasets/BelleGroup/train_2M_CN)
@@ -254,11 +261,11 @@ https://github.com/hiyouga/LLaMA-Factory/assets/16256802/ec36a9dd-37f4-4f72-81bd
 <details><summary>偏好数据集</summary>
 - [HH-RLHF (en)](https://huggingface.co/datasets/Anthropic/hh-rlhf)
 - [Open Assistant (multilingual)](https://huggingface.co/datasets/OpenAssistant/oasst1)
 - [GPT-4 Generated Data (en&zh)](https://github.com/Instruction-Tuning-with-GPT-4/GPT-4-LLM)
 - [Orca DPO (en)](https://huggingface.co/datasets/Intel/orca_dpo_pairs)
 - [Nectar (en)](https://huggingface.co/datasets/berkeley-nest/Nectar)
 - [DPO mixed (en&zh)](https://huggingface.co/datasets/hiyouga/DPO-En-Zh-20k)
 - [Open Assistant (zh)](https://huggingface.co/datasets/OpenAssistant/oasst1)
 - [Orca DPO (de)](https://huggingface.co/datasets/mayflowergmbh/intel_orca_dpo_pairs_de)
 </details>
@@ -276,18 +283,19 @@ huggingface-cli login
 | ------------ | ------- | --------- |
 | python       | 3.8     | 3.10      |
 | torch        | 1.13.1  | 2.2.0     |
-| transformers | 4.37.2  | 4.39.3    |
+| transformers | 4.37.2  | 4.40.1    |
-| datasets     | 2.14.3  | 2.18.0    |
+| datasets     | 2.14.3  | 2.19.1    |
-| accelerate   | 0.27.2  | 0.28.0    |
+| accelerate   | 0.27.2  | 0.30.0    |
 | peft         | 0.9.0   | 0.10.0    |
-| trl          | 0.8.1   | 0.8.1     |
+| trl          | 0.8.1   | 0.8.6     |
 | 可选项       | 至少     | 推荐      |
 | ------------ | ------- | --------- |
 | CUDA         | 11.6    | 12.2      |
 | deepspeed    | 0.10.0  | 0.14.0    |
-| bitsandbytes | 0.39.0  | 0.43.0    |
+| bitsandbytes | 0.39.0  | 0.43.1    |
-| flash-attn   | 2.3.0   | 2.5.6     |
+| vllm         | 0.4.0   | 0.4.2     |
 | flash-attn   | 2.3.0   | 2.5.8     |
 ### 硬件依赖
@@ -305,24 +313,21 @@ huggingface-cli login
 ## 如何使用
-### 数据准备
+### 安装 LLaMA Factory
-关于数据集文件的格式，请参考 [data/README_zh.md](data/README_zh.md) 的内容。你可以使用 HuggingFace / ModelScope 上的数据集或加载本地数据集。
+> [!IMPORTANT]
-
+> 此步骤为必需。
 > [!NOTE]
 > 使用自定义数据集时，请更新 `data/dataset_info.json` 文件。
 ### 安装依赖
 ```bash
 git clone https://github.com/hiyouga/LLaMA-Factory.git
 conda create -n llama_factory python=3.10
 conda activate llama_factory
 cd LLaMA-Factory
-pip install -e .[metrics]
+pip install -e .[torch,metrics]
 ```
-可选的额外依赖项：deepspeed、metrics、galore、badam、vllm、bitsandbytes、gptq、awq、aqlm、qwen、modelscope、quality
+可选的额外依赖项：torch、metrics、deepspeed、bitsandbytes、vllm、galore、badam、gptq、awq、aqlm、qwen、modelscope、quality
 > [!TIP]
 > 遇到包冲突时，可使用 `pip install --no-deps -e .` 解决。
 <details><summary>Windows 用户指南</summary>
@@ -336,25 +341,73 @@ pip install https://github.com/jllllll/bitsandbytes-windows-webui/releases/downl
 </details>
-### 利用 LLaMA Board 可视化界面训练（由 [Gradio](https://github.com/gradio-app/gradio) 驱动）
+<details><summary>昇腾 NPU 用户指南</summary>
 如果使用昇腾 NPU 设备进行（分布式）训练或推理，需要安装 **[torch-npu](https://gitee.com/ascend/pytorch)** 库和 **[Ascend CANN Kernels](https://www.hiascend.com/developer/download/community/result?module=cann)**。
 | 依赖项       | 至少     | 推荐      |
 | ------------ | ------- | --------- |
 | CANN         | 8.0.RC1 | 8.0.RC1   |
 | torch        | 2.2.0   | 2.2.0     |
 | torch-npu    | 2.2.0   | 2.2.0     |
 | deepspeed    | 0.13.2  | 0.13.2    |
 Docker 镜像：
 - 32GB：[下载地址](http://mirrors.cn-central-221.ovaijisuan.com/detail/130.html)
 - 64GB：敬请期待
 请记得使用 `ASCEND_RT_VISIBLE_DEVICES` 而非 `CUDA_VISIBLE_DEVICES` 来指定您使用的设备。
 如果遇到无法正常推理的情况，请尝试设置 `do_sample: false`。
 </details>
 ### 数据准备
 关于数据集文件的格式，请参考 [data/README_zh.md](data/README_zh.md) 的内容。你可以使用 HuggingFace / ModelScope 上的数据集或加载本地数据集。
 > [!NOTE]
 > 使用自定义数据集时，请更新 `data/dataset_info.json` 文件。
 ### 快速开始
 下面三行命令分别对 Llama3-8B-Instruct 模型进行 LoRA **微调**、**推理**和**合并**。
 ```bash
 CUDA_VISIBLE_DEVICES=0 llamafactory-cli train examples/lora_single_gpu/llama3_lora_sft.yaml
 CUDA_VISIBLE_DEVICES=0 llamafactory-cli chat examples/inference/llama3_lora_sft.yaml
 CUDA_VISIBLE_DEVICES=0 llamafactory-cli export examples/merge_lora/llama3_lora_sft.yaml
 ```
 高级用法请参考 [examples/README_zh.md](examples/README_zh.md)（包括多 GPU 微调）。
 > [!TIP]
 > 使用 `llamafactory-cli help` 显示帮助信息。
 ### LLaMA Board 可视化微调（由 [Gradio](https://github.com/gradio-app/gradio) 驱动）
 > [!IMPORTANT]
-> LLaMA Board 可视化界面目前仅支持单 GPU 训练，请使用[命令行接口](#命令行接口)来进行多 GPU 分布式训练。
+> LLaMA Board 可视化界面目前仅支持单 GPU 训练。
 #### 使用本地环境
 ```bash
-export CUDA_VISIBLE_DEVICES=0 # Windows 使用 `set CUDA_VISIBLE_DEVICES=0`
-export GRADIO_SERVER_PORT=7860 # Windows 使用 `set GRADIO_SERVER_PORT=7860`
-python src/train_web.py # 或 python -m llmtuner.webui.interface
+CUDA_VISIBLE_DEVICES=0 GRADIO_SHARE=1 llamafactory-cli webui
 ```
-<details><summary>阿里云用户指南</summary>
+<details><summary>阿里云 PAI 和 AutoDL 用户指南</summary>
-如果您在阿里云上使用 LLaMA Board 时遇到显示问题，请尝试在启动前使用以下命令设置环境变量：
+如果您在阿里云 PAI 上使用 LLaMA Board 时遇到显示问题，请尝试在启动前使用以下命令设置环境变量：
 ```bash
-export GRADIO_ROOT_PATH=/${JUPYTER_NAME}/proxy/7860/
+export GRADIO_SERVER_PORT=7860 GRADIO_ROOT_PATH=/${JUPYTER_NAME}/proxy/7860/
 ```
 如果您正在使用 AutoDL，请安装下述 Gradio 版本：
 ```bash
 pip install gradio==4.10.0
 ```
 </details>
@@ -388,20 +441,10 @@ docker compose -f ./docker-compose.yml up -d
 </details>
-### 利用命令行接口训练
-使用方法请参考 [examples/README_zh.md](examples/README_zh.md)。
-您可以执行 `python src/train_bash.py -h` 来查看参数文档。
 ### 利用 vLLM 部署 OpenAI API
 ```bash
-CUDA_VISIBLE_DEVICES=0,1 API_PORT=8000 python src/api_demo.py \
-    --model_name_or_path meta-llama/Meta-Llama-3-8B-Instruct \
-    --template llama3 \
-    --infer_backend vllm \
-    --vllm_enforce_eager
+CUDA_VISIBLE_DEVICES=0,1 API_PORT=8000 llamafactory-cli api examples/inference/llama3_vllm.yaml
 ```
 ### 从魔搭社区下载
@@ -441,6 +484,7 @@ export USE_MODELSCOPE_HUB=1 # Windows 使用 `set USE_MODELSCOPE_HUB=1`
 1. Huang et al. Key-Point-Driven Data Synthesis with its Enhancement on Mathematical Reasoning. 2024. [[arxiv]](https://arxiv.org/abs/2403.02333)
 1. Duan et al. Negating Negatives: Alignment without Human Positive Samples via Distributional Dispreference Optimization. 2024. [[arxiv]](https://arxiv.org/abs/2403.03419)
 1. Xie and Schwertfeger. Empowering Robotics with Large Language Models: osmAG Map Comprehension with LLMs. 2024. [[arxiv]](https://arxiv.org/abs/2403.08228)
 1. Wu et al. Large Language Models are Parallel Multilingual Learners. 2024. [[arxiv]](https://arxiv.org/abs/2403.09073)
 1. Zhang et al. EDT: Improving Large Language Models' Generation by Entropy-based Dynamic Temperature Sampling. 2024. [[arxiv]](https://arxiv.org/abs/2403.14541)
 1. Weller et al. FollowIR: Evaluating and Teaching Information Retrieval Models to Follow Instructions. 2024. [[arxiv]](https://arxiv.org/abs/2403.15246)
 1. Hongbin Na. CBT-LLM: A Chinese Large Language Model for Cognitive Behavioral Therapy-based Mental Health Question Answering. 2024. [[arxiv]](https://arxiv.org/abs/2403.16008)
@@ -448,12 +492,21 @@ export USE_MODELSCOPE_HUB=1 # Windows 使用 `set USE_MODELSCOPE_HUB=1`
 1. Liu et al. Extensive Self-Contrast Enables Feedback-Free Language Model Alignment. 2024. [[arxiv]](https://arxiv.org/abs/2404.00604)
 1. Luo et al. BAdam: A Memory Efficient Full Parameter Training Method for Large Language Models. 2024. [[arxiv]](https://arxiv.org/abs/2404.02827)
 1. Du et al. Chinese Tiny LLM: Pretraining a Chinese-Centric Large Language Model. 2024. [[arxiv]](https://arxiv.org/abs/2404.04167)
 1. Ma et al. Parameter Efficient Quasi-Orthogonal Fine-Tuning via Givens Rotation. 2024. [[arxiv]](https://arxiv.org/abs/2404.04316)
 1. Liu et al. Dynamic Generation of Personalities with Large Language Models. 2024. [[arxiv]](https://arxiv.org/abs/2404.07084)
 1. Shang et al. How Far Have We Gone in Stripped Binary Code Understanding Using Large Language Models. 2024. [[arxiv]](https://arxiv.org/abs/2404.09836)
 1. Huang et al. LLMTune: Accelerate Database Knob Tuning with Large Language Models. 2024. [[arxiv]](https://arxiv.org/abs/2404.11581)
 1. Deng et al. Text-Tuple-Table: Towards Information Integration in Text-to-Table Generation via Global Tuple Extraction. 2024. [[arxiv]](https://arxiv.org/abs/2404.14215)
 1. Acikgoz et al. Hippocrates: An Open-Source Framework for Advancing Large Language Models in Healthcare. 2024. [[arxiv]](https://arxiv.org/abs/2404.16621)
 1. Zhang et al. Small Language Models Need Strong Verifiers to Self-Correct Reasoning. 2024. [[arxiv]](https://arxiv.org/abs/2404.17140)
 1. Zhou et al. FREB-TQA: A Fine-Grained Robustness Evaluation Benchmark for Table Question Answering. 2024. [[arxiv]](https://arxiv.org/abs/2404.18585)
 1. **[StarWhisper](https://github.com/Yu-Yang-Li/StarWhisper)**: 天文大模型 StarWhisper，基于 ChatGLM2-6B 和 Qwen-14B 在天文数据上微调而得。
 1. **[DISC-LawLLM](https://github.com/FudanDISC/DISC-LawLLM)**: 中文法律领域大模型 DISC-LawLLM，基于 Baichuan-13B 微调而得，具有法律推理和知识检索能力。
 1. **[Sunsimiao](https://github.com/thomas-yanxin/Sunsimiao)**: 孙思邈中文医疗大模型 Sumsimiao，基于 Baichuan-7B 和 ChatGLM-6B 在中文医疗数据上微调而得。
 1. **[CareGPT](https://github.com/WangRongsheng/CareGPT)**: 医疗大模型项目 CareGPT，基于 LLaMA2-7B 和 Baichuan-13B 在中文医疗数据上微调而得。
 1. **[MachineMindset](https://github.com/PKU-YuanGroup/Machine-Mindset/)**：MBTI性格大模型项目，根据数据集与训练方式让任意 LLM 拥有 16 个不同的性格类型。
 1. **[Luminia-13B-v3](https://huggingface.co/Nekochu/Luminia-13B-v3)**：一个用于生成 Stable Diffusion 提示词的大型语言模型。[[🤗Demo]](https://huggingface.co/spaces/Nekochu/Luminia-13B_SD_Prompt)
 1. **[Chinese-LLaVA-Med](https://github.com/BUAADreamer/Chinese-LLaVA-Med)**：中文多模态医学大模型，基于 LLaVA-1.5-7B 在中文多模态医疗数据上微调而得。
 </details>
@@ -461,7 +514,7 @@ export USE_MODELSCOPE_HUB=1 # Windows 使用 `set USE_MODELSCOPE_HUB=1`
 本仓库的代码依照 [Apache-2.0](LICENSE) 协议开源。
-使用模型权重时，请遵循对应的模型协议：[Baichuan2](https://huggingface.co/baichuan-inc/Baichuan2-7B-Base/blob/main/Community%20License%20for%20Baichuan%202%20Model.pdf) / [BLOOM](https://huggingface.co/spaces/bigscience/license) / [ChatGLM3](https://github.com/THUDM/ChatGLM3/blob/main/MODEL_LICENSE) / [Command-R](https://cohere.com/c4ai-cc-by-nc-license) / [DeepSeek](https://github.com/deepseek-ai/DeepSeek-LLM/blob/main/LICENSE-MODEL) / [Falcon](https://huggingface.co/tiiuae/falcon-180B/blob/main/LICENSE.txt) / [Gemma](https://ai.google.dev/gemma/terms) / [InternLM2](https://github.com/InternLM/InternLM#license) / [LLaMA](https://github.com/facebookresearch/llama/blob/main/MODEL_CARD.md) / [LLaMA-2/LLaVA-1.5](https://ai.meta.com/llama/license/) / [LLaMA-3](https://llama.meta.com/llama3/license/) / [Mistral](LICENSE) / [OLMo](LICENSE) / [Phi-1.5/2](https://huggingface.co/microsoft/phi-1_5/resolve/main/Research%20License.docx) / [Phi-3](https://huggingface.co/microsoft/Phi-3-mini-4k-instruct/blob/main/LICENSE) / [Qwen](https://github.com/QwenLM/Qwen/blob/main/Tongyi%20Qianwen%20LICENSE%20AGREEMENT) / [StarCoder2](https://huggingface.co/spaces/bigcode/bigcode-model-license-agreement) / [XVERSE](https://github.com/xverse-ai/XVERSE-13B/blob/main/MODEL_LICENSE.pdf) / [Yi](https://huggingface.co/01-ai/Yi-6B/blob/main/LICENSE) / [Yuan](https://github.com/IEIT-Yuan/Yuan-2.0/blob/main/LICENSE-Yuan)
+使用模型权重时，请遵循对应的模型协议：[Baichuan2](https://huggingface.co/baichuan-inc/Baichuan2-7B-Base/blob/main/Community%20License%20for%20Baichuan%202%20Model.pdf) / [BLOOM](https://huggingface.co/spaces/bigscience/license) / [ChatGLM3](https://github.com/THUDM/ChatGLM3/blob/main/MODEL_LICENSE) / [Command-R](https://cohere.com/c4ai-cc-by-nc-license) / [DeepSeek](https://github.com/deepseek-ai/DeepSeek-LLM/blob/main/LICENSE-MODEL) / [Falcon](https://huggingface.co/tiiuae/falcon-180B/blob/main/LICENSE.txt) / [Gemma](https://ai.google.dev/gemma/terms) / [InternLM2](https://github.com/InternLM/InternLM#license) / [LLaMA](https://github.com/facebookresearch/llama/blob/main/MODEL_CARD.md) / [LLaMA-2 (LLaVA-1.5)](https://ai.meta.com/llama/license/) / [LLaMA-3](https://llama.meta.com/llama3/license/) / [Mistral](LICENSE) / [OLMo](LICENSE) / [Phi-1.5/2](https://huggingface.co/microsoft/phi-1_5/resolve/main/Research%20License.docx) / [Phi-3](https://huggingface.co/microsoft/Phi-3-mini-4k-instruct/blob/main/LICENSE) / [Qwen](https://github.com/QwenLM/Qwen/blob/main/Tongyi%20Qianwen%20LICENSE%20AGREEMENT) / [StarCoder2](https://huggingface.co/spaces/bigcode/bigcode-model-license-agreement) / [XVERSE](https://github.com/xverse-ai/XVERSE-13B/blob/main/MODEL_LICENSE.pdf) / [Yi](https://huggingface.co/01-ai/Yi-6B/blob/main/LICENSE) / [Yi-1.5](LICENSE) / [Yuan](https://github.com/IEIT-Yuan/Yuan-2.0/blob/main/LICENSE-Yuan)
 ## 引用
--- a/data/README.md
+++ b/data/README.md
@@ -1,4 +1,4 @@
-If you are using a custom dataset, please provide your dataset definition in the following format in `dataset_info.json`.
+If you are using a custom dataset, please add your **dataset description** to `dataset_info.json` according to the following format. We also provide several examples in the next section.
 ```json
 "dataset_name": {
@@ -33,7 +33,7 @@ If you are using a custom dataset, please provide your dataset definition in the
 }
 ```
-Given above, you can use the custom dataset via specifying `--dataset dataset_name`.
+After that, you can load the custom dataset by specifying `--dataset dataset_name`.
 ----
@@ -54,10 +54,11 @@ Currently we support dataset in **alpaca** or **sharegpt** format, the dataset i
 ]
 ```
-Regarding the above dataset, the `columns` in `dataset_info.json` should be:
+Regarding the above dataset, the description in `dataset_info.json` should be:
 ```json
 "dataset_name": {
  "file_name": "data.json",
  "columns": {
    "prompt": "instruction",
    "query": "input",
@@ -70,28 +71,60 @@ Regarding the above dataset, the `columns` in `dataset_info.json` should be:
 The `query` column will be concatenated with the `prompt` column and used as the user prompt, then the user prompt would be `prompt\nquery`. The `response` column represents the model response.
-The `system` column will be used as the system prompt. The `history` column is a list consisting string tuples representing prompt-response pairs in the history. Note that the responses in the history **will also be used for training**.
+The `system` column will be used as the system prompt. The `history` column is a list of string tuples representing prompt-response pairs in the history. Note that the responses in the history **will also be used for training** in supervised fine-tuning.
-For the pre-training datasets, only the `prompt` column will be used for training.
+For the **pre-training datasets**, only the `prompt` column will be used for training, for example:
-For the preference datasets, the `response` column should be a string list whose length is 2, with the preferred answers appearing first, for example:
 ```json
-{
+[
-  "instruction": "user instruction",
+  {"text": "document"},
-  "input": "user input",
+  {"text": "document"}
-  "output": [
+]
-    "chosen answer",
+```
-    "rejected answer"
+
-  ]
+Regarding the above dataset, the description in `dataset_info.json` should be:
 ```json
 "dataset_name": {
  "file_name": "data.json",
  "columns": {
    "prompt": "text"
  }
 }
 ```
-Remember to set `"ranking": true` for the preference datasets.
+For the **preference datasets**, the `response` column should be a string list whose length is 2, with the preferred answers appearing first, for example:
 ```json
 [
  {
    "instruction": "user instruction",
    "input": "user input",
    "output": [
      "chosen answer",
      "rejected answer"
    ]
  }
 ]
 ```
 Regarding the above dataset, the description in `dataset_info.json` should be:
 ```json
 "dataset_name": {
  "file_name": "data.json",
  "ranking": true,
  "columns": {
    "prompt": "instruction",
    "query": "input",
    "response": "output"
  }
 }
 ```
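Review note: the alpaca-format rules above (user prompt built as `prompt\nquery`, preference `response` being a two-item list with the preferred answer first) can be sketched as follows. The function names `build_user_prompt` and `split_preference` are hypothetical and do not come from the LLaMA Factory code base, and skipping the newline when the query is empty is an assumption rather than documented behavior.

```python
from typing import Any, Dict, Tuple


def build_user_prompt(record: Dict[str, Any], columns: Dict[str, str]) -> str:
    """Join the prompt and query columns as `prompt\\nquery`.

    Assumption: when the query column is missing or empty, only the prompt is used.
    """
    prompt = record[columns["prompt"]]
    query = record.get(columns.get("query", ""), "")
    return f"{prompt}\n{query}" if query else prompt


def split_preference(record: Dict[str, Any], columns: Dict[str, str]) -> Tuple[str, str]:
    """Return (chosen, rejected) from a preference record, preferred answer first."""
    response = record[columns["response"]]
    is_pair = (
        isinstance(response, list)
        and len(response) == 2
        and all(isinstance(item, str) for item in response)
    )
    if not is_pair:
        raise ValueError("preference response must be a list of exactly two strings")
    chosen, rejected = response
    return chosen, rejected


if __name__ == "__main__":
    # Column mapping mirrors the `columns` entry of the dataset description.
    columns = {"prompt": "instruction", "query": "input", "response": "output"}
    record = {
        "instruction": "user instruction",
        "input": "user input",
        "output": ["chosen answer", "rejected answer"],
    }
    print(build_user_prompt(record, columns))
    print(split_preference(record, columns))
```

A check like `split_preference` is a cheap way to catch malformed preference records before training starts, since a wrong list length would otherwise surface much later.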
 ----
-The dataset in sharegpt format should follow the below format:
+The dataset in **sharegpt** format should follow the below format:
 ```json
 [
@@ -112,10 +145,12 @@ The dataset in sharegpt format should follow the below format:
 ]
 ```
-Regarding the above dataset, the `columns` in `dataset_info.json` should be:
+Regarding the above dataset, the description in `dataset_info.json` should be:
 ```json
 "dataset_name": {
  "file_name": "data.json",
  "formatting": "sharegpt",
  "columns": {
    "messages": "conversations",
    "system": "system",
@@ -132,4 +167,46 @@ Regarding the above dataset, the `columns` in `dataset_info.json` should be:
 where the `messages` column should be a list following the `u/a/u/a/u/a` order.
-Pre-training datasets and preference datasets are incompatible with the sharegpt format yet.
+We also support datasets in the **openai** format:
 ```json
 [
  {
    "messages": [
      {
        "role": "system",
        "content": "system prompt (optional)"
      },
      {
        "role": "user",
        "content": "user instruction"
      },
      {
        "role": "assistant",
        "content": "model response"
      }
    ]
  }
 ]
 ```
 Regarding the above dataset, the description in `dataset_info.json` should be:
 ```json
 "dataset_name": {
  "file_name": "data.json",
  "formatting": "sharegpt",
  "columns": {
    "messages": "messages"
  },
  "tags": {
    "role_tag": "role",
    "content_tag": "content",
    "user_tag": "user",
    "assistant_tag": "assistant",
    "system_tag": "system"
  }
 }
 ```
 Pre-training datasets and preference datasets are **not yet compatible** with the sharegpt format.
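Review note: the **openai** layout above is read through the sharegpt formatting by way of the `tags` mapping, and the `messages` list is expected to follow the `u/a/u/a` order after an optional system message. A minimal sketch of that check is given below. `split_conversation` is a hypothetical helper, not the loader used by LLaMA Factory, and treating the system message as allowed only in the first position is an assumption.

```python
from typing import Dict, List, Optional, Tuple

# Tag names copied from the dataset description shown above.
DEFAULT_TAGS: Dict[str, str] = {
    "role_tag": "role",
    "content_tag": "content",
    "user_tag": "user",
    "assistant_tag": "assistant",
    "system_tag": "system",
}


def split_conversation(
    messages: List[Dict[str, str]],
    tags: Optional[Dict[str, str]] = None,
) -> Tuple[Optional[str], List[Tuple[str, str]]]:
    """Return (system_prompt, [(user, assistant), ...]) or raise on a bad turn order."""
    tags = tags or DEFAULT_TAGS
    role_key, content_key = tags["role_tag"], tags["content_tag"]

    turns = list(messages)
    system = None
    # Assumption: a system message may only appear as the first message.
    if turns and turns[0][role_key] == tags["system_tag"]:
        system = turns[0][content_key]
        turns = turns[1:]

    if len(turns) == 0 or len(turns) % 2 != 0:
        raise ValueError("expected a non-empty, even number of user/assistant messages")

    pairs = []
    for i in range(0, len(turns), 2):
        user, assistant = turns[i], turns[i + 1]
        if user[role_key] != tags["user_tag"] or assistant[role_key] != tags["assistant_tag"]:
            raise ValueError(f"messages {i} and {i + 1} break the user/assistant order")
        pairs.append((user[content_key], assistant[content_key]))
    return system, pairs


if __name__ == "__main__":
    example = [
        {"role": "system", "content": "system prompt (optional)"},
        {"role": "user", "content": "user instruction"},
        {"role": "assistant", "content": "model response"},
    ]
    print(split_conversation(example))
```

Passing a different `tags` dictionary (for example `from`/`value` keys with `human`/`gpt` roles) would let the same check run on other sharegpt-style files, under the same assumptions.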
--- a/data/README_zh.md
+++ b/data/README_zh.md
@@ -1,4 +1,4 @@
-如果您使用自定义数据集，请务必在 `dataset_info.json` 文件中按照以下格式提供数据集定义。
+如果您使用自定义数据集，请务必按照以下格式在 `dataset_info.json` 文件中添加**数据集描述**。我们在下面也提供了一些例子。
 ```json
 "数据集名称": {
@@ -33,7 +33,7 @@
 }
 ```
-添加后可通过指定 `--dataset 数据集名称` 参数使用自定义数据集。
+然后，可通过使用 `--dataset 数据集名称` 参数加载自定义数据集。
 ----
@@ -54,10 +54,11 @@
 ]
 ```
-对于上述格式的数据，`dataset_info.json` 中的 `columns` 应为：
+对于上述格式的数据，`dataset_info.json` 中的描述应为：
 ```json
 "数据集名称": {
  "file_name": "data.json",
  "columns": {
    "prompt": "instruction",
    "query": "input",
@@ -70,28 +71,60 @@
 其中 `query` 列对应的内容会与 `prompt` 列对应的内容拼接后作为用户指令，即用户指令为 `prompt\nquery`。`response` 列对应的内容为模型回答。
-`system` 列对应的内容将被作为系统提示词。`history` 列是由多个字符串二元组构成的列表，分别代表历史消息中每轮的指令和回答。注意历史消息中的回答**也会被用于训练**。
+`system` 列对应的内容将被作为系统提示词。`history` 列是由多个字符串二元组构成的列表，分别代表历史消息中每轮的指令和回答。注意在指令监督学习时，历史消息中的回答**也会被用于训练**。
-对于预训练数据集，仅 `prompt` 列中的内容会用于模型训练。
+对于**预训练数据集**，仅 `prompt` 列中的内容会用于模型训练，例如：
-对于偏好数据集，`response` 列应当是一个长度为 2 的字符串列表，排在前面的代表更优的回答，例如：
 ```json
-{
+[
-  "instruction": "用户指令",
+  {"text": "document"},
-  "input": "用户输入",
+  {"text": "document"}
-  "output": [
+]
-    "优质回答",
+```
-    "劣质回答"
+
-  ]
+对于上述格式的数据，`dataset_info.json` 中的描述应为：
 ```json
 "数据集名称": {
  "file_name": "data.json",
  "columns": {
    "prompt": "text"
  }
 }
 ```
-添加偏好数据集需要额外指定 `"ranking": true`。
+对于**偏好数据集**，`response` 列应当是一个长度为 2 的字符串列表，排在前面的代表更优的回答，例如：
 ```json
 [
  {
    "instruction": "用户指令",
    "input": "用户输入",
    "output": [
      "优质回答",
      "劣质回答"
    ]
  }
 ]
 ```
 对于上述格式的数据，`dataset_info.json` 中的描述应为：
 ```json
 "数据集名称": {
  "file_name": "data.json",
  "ranking": true,
  "columns": {
    "prompt": "instruction",
    "query": "input",
    "response": "output"
  }
 }
 ```
 ----
-而 sharegpt 格式的数据集按照以下方式组织：
+而 **sharegpt** 格式的数据集按照以下方式组织：
 ```json
 [
@@ -112,10 +145,12 @@
 ]
 ```
-对于上述格式的数据，`dataset_info.json` 中的 `columns` 应为：
+对于上述格式的数据，`dataset_info.json` 中的描述应为：
 ```json
 "数据集名称": {
  "file_name": "data.json",
  "formatting": "sharegpt",
  "columns": {
    "messages": "conversations",
    "system": "system",
@@ -132,4 +167,46 @@
 其中 `messages` 列应当是一个列表，且符合 `用户/模型/用户/模型/用户/模型` 的顺序。
-预训练数据集和偏好数据集尚不支持 sharegpt 格式。
+我们同样支持 **openai** 格式的数据集：
 ```json
 [
  {
    "messages": [
      {
        "role": "system",
        "content": "系统提示词（选填）"
      },
      {
        "role": "user",
        "content": "用户指令"
      },
      {
        "role": "assistant",
        "content": "模型回答"
      }
    ]
  }
 ]
 ```
 对于上述格式的数据，`dataset_info.json` 中的描述应为：
 ```json
 "数据集名称": {
  "file_name": "data.json",
  "formatting": "sharegpt",
  "columns": {
    "messages": "messages"
  },
  "tags": {
    "role_tag": "role",
    "content_tag": "content",
    "user_tag": "user",
    "assistant_tag": "assistant",
    "system_tag": "system"
  }
 }
 ```
 预训练数据集和偏好数据集**尚不支持** sharegpt 格式。
--- a/data/oaast_rm.json.REMOVED.git-id
+++ b/data/oaast_rm.json.REMOVED.git-id
@@ -1 +0,0 @@
-274079ea921762be356de85b18f13fa60b7ba8cb
--- a/data/oaast_sft.json.REMOVED.git-id
+++ b/data/oaast_sft.json.REMOVED.git-id
@@ -1 +0,0 @@
-57fd080be5bffe4153fe3ee26a175e3d56da30f3
--- a/evaluation/ceval/ceval.py
+++ b/evaluation/ceval/ceval.py
@@ -19,7 +19,7 @@ import pandas as pd
 _CITATION = """\
@article{huang2023ceval,
-  title={C-Eval: A Multi-Level Multi-Discipline Chinese Evaluation Suite for Foundation Models}, 
+  title={C-Eval: A Multi-Level Multi-Discipline Chinese Evaluation Suite for Foundation Models},
  author={Huang, Yuzhen and Bai, Yuzhuo and Zhu, Zhihao and Zhang, Junlei and Zhang, Jinghan and Su, Tangjun and Liu, Junteng and Lv, Chuancheng and Zhang, Yikai and Lei, Jiayi and Fu, Yao and Sun, Maosong and He, Junxian},
  journal={arXiv preprint arXiv:2305.08322},
  year={2023}
@@ -133,25 +133,19 @@ class Ceval(datasets.GeneratorBasedBuilder):
            datasets.SplitGenerator(
                name=datasets.Split.TEST,
                gen_kwargs={
-                    "filepath": os.path.join(
-                        data_dir, "test", f"{task_name}_test.csv"
-                    ),
+                    "filepath": os.path.join(data_dir, "test", f"{task_name}_test.csv"),
                },
            ),
            datasets.SplitGenerator(
                name=datasets.Split.VALIDATION,
                gen_kwargs={
-                    "filepath": os.path.join(
-                        data_dir, "val", f"{task_name}_val.csv"
-                    ),
+                    "filepath": os.path.join(data_dir, "val", f"{task_name}_val.csv"),
                },
            ),
            datasets.SplitGenerator(
                name=datasets.Split.TRAIN,
                gen_kwargs={
-                    "filepath": os.path.join(
-                        data_dir, "dev", f"{task_name}_dev.csv"
-                    ),
+                    "filepath": os.path.join(data_dir, "dev", f"{task_name}_dev.csv"),
                },
            ),
        ]
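Review note: the three split generators above differ only in the subdirectory and file suffix, and the TRAIN split reads from the `dev` directory. A table-driven sketch of the same path construction is shown below. `SPLIT_DIRS` and `split_filepaths` are hypothetical names used for illustration and are not part of `ceval.py`.

```python
import os
from typing import Dict

# Split name -> subdirectory/suffix, matching the three generators in the hunk above.
SPLIT_DIRS: Dict[str, str] = {"test": "test", "validation": "val", "train": "dev"}


def split_filepaths(data_dir: str, task_name: str) -> Dict[str, str]:
    """Build the CSV path of each split for one task."""
    return {
        split: os.path.join(data_dir, subdir, f"{task_name}_{subdir}.csv")
        for split, subdir in SPLIT_DIRS.items()
    }


if __name__ == "__main__":
    for split, path in split_filepaths("ceval", "computer_network").items():
        print(split, path)
```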
--- a/evaluation/cmmlu/cmmlu.py
+++ b/evaluation/cmmlu/cmmlu.py
@@ -37,73 +37,73 @@ _LICENSE = "Creative Commons Attribution-NonCommercial-ShareAlike 4.0 Internatio
 _URL = "cmmlu.zip"
 task_list = [
-     'agronomy',
+    "agronomy",
-     'anatomy',
+    "anatomy",
-     'ancient_chinese',
+    "ancient_chinese",
-     'arts',
+    "arts",
-     'astronomy',
+    "astronomy",
-     'business_ethics',
+    "business_ethics",
-     'chinese_civil_service_exam',
+    "chinese_civil_service_exam",
-     'chinese_driving_rule',
+    "chinese_driving_rule",
-     'chinese_food_culture',
+    "chinese_food_culture",
-     'chinese_foreign_policy',
+    "chinese_foreign_policy",
-     'chinese_history',
+    "chinese_history",
-     'chinese_literature',
+    "chinese_literature",
-     'chinese_teacher_qualification',
+    "chinese_teacher_qualification",
-     'clinical_knowledge',
+    "clinical_knowledge",
-     'college_actuarial_science',
+    "college_actuarial_science",
-     'college_education',
+    "college_education",
-     'college_engineering_hydrology',
+    "college_engineering_hydrology",
-     'college_law',
+    "college_law",
-     'college_mathematics',
+    "college_mathematics",
-     'college_medical_statistics',
+    "college_medical_statistics",
-     'college_medicine',
+    "college_medicine",
-     'computer_science',
+    "computer_science",
-     'computer_security',
+    "computer_security",
-     'conceptual_physics',
+    "conceptual_physics",
-     'construction_project_management',
+    "construction_project_management",
-     'economics',
+    "economics",
-     'education',
+    "education",
-     'electrical_engineering',
+    "electrical_engineering",
-     'elementary_chinese',
+    "elementary_chinese",
-     'elementary_commonsense',
+    "elementary_commonsense",
-     'elementary_information_and_technology',
+    "elementary_information_and_technology",
-     'elementary_mathematics',
+    "elementary_mathematics",
-     'ethnology',
+    "ethnology",
-     'food_science',
+    "food_science",
-     'genetics',
+    "genetics",
-     'global_facts',
+    "global_facts",
-     'high_school_biology',
+    "high_school_biology",
-     'high_school_chemistry',
+    "high_school_chemistry",
-     'high_school_geography',
+    "high_school_geography",
-     'high_school_mathematics',
+    "high_school_mathematics",
-     'high_school_physics',
+    "high_school_physics",
-     'high_school_politics',
+    "high_school_politics",
-     'human_sexuality',
+    "human_sexuality",
-     'international_law',
+    "international_law",
-     'journalism',
+    "journalism",
-     'jurisprudence',
+    "jurisprudence",
-     'legal_and_moral_basis',
+    "legal_and_moral_basis",
-     'logical',
+    "logical",
-     'machine_learning',
+    "machine_learning",
-     'management',
+    "management",
-     'marketing',
+    "marketing",
-     'marxist_theory',
+    "marxist_theory",
-     'modern_chinese',
+    "modern_chinese",
-     'nutrition',
+    "nutrition",
-     'philosophy',
+    "philosophy",
-     'professional_accounting',
+    "professional_accounting",
-     'professional_law',
+    "professional_law",
-     'professional_medicine',
+    "professional_medicine",
-     'professional_psychology',
+    "professional_psychology",
-     'public_relations',
+    "public_relations",
-     'security_study',
+    "security_study",
-     'sociology',
+    "sociology",
-     'sports_science',
+    "sports_science",
-     'traditional_chinese_medicine',
+    "traditional_chinese_medicine",
-     'virology',
+    "virology",
-     'world_history',
+    "world_history",
-     'world_religions',
+    "world_religions",
 ]
--- a/evaluation/mmlu/mmlu.py
+++ b/evaluation/mmlu/mmlu.py
@@ -136,25 +136,19 @@ class MMLU(datasets.GeneratorBasedBuilder):
            datasets.SplitGenerator(
                name=datasets.Split.TEST,
                gen_kwargs={
-                    "filepath": os.path.join(
-                        data_dir, "data", "test", f"{task_name}_test.csv"
-                    ),
+                    "filepath": os.path.join(data_dir, "data", "test", f"{task_name}_test.csv"),
                },
            ),
            datasets.SplitGenerator(
                name=datasets.Split.VALIDATION,
                gen_kwargs={
-                    "filepath": os.path.join(
-                        data_dir, "data", "val", f"{task_name}_val.csv"
-                    ),
+                    "filepath": os.path.join(data_dir, "data", "val", f"{task_name}_val.csv"),
                },
            ),
            datasets.SplitGenerator(
                name=datasets.Split.TRAIN,
                gen_kwargs={
-                    "filepath": os.path.join(
-                        data_dir, "data", "dev", f"{task_name}_dev.csv"
-                    ),
+                    "filepath": os.path.join(data_dir, "data", "dev", f"{task_name}_dev.csv"),
                },
            ),
        ]
--- a/examples/README.md
+++ b/examples/README.md
@@ -1,50 +1,229 @@
 We provide a variety of examples for fine-tuning LLMs.
 Make sure to execute these commands in the `LLaMA-Factory` directory.
 ## Table of Contents
 - [LoRA Fine-Tuning on a Single GPU](#lora-fine-tuning-on-a-single-gpu)
 - [QLoRA Fine-Tuning on a Single GPU](#qlora-fine-tuning-on-a-single-gpu)
 - [LoRA Fine-Tuning on Multiple GPUs](#lora-fine-tuning-on-multiple-gpus)
 - [LoRA Fine-Tuning on Multiple NPUs](#lora-fine-tuning-on-multiple-npus)
 - [Full-Parameter Fine-Tuning on Multiple GPUs](#full-parameter-fine-tuning-on-multiple-gpus)
 - [Merging LoRA Adapters and Quantization](#merging-lora-adapters-and-quantization)
 - [Inferring LoRA Fine-Tuned Models](#inferring-lora-fine-tuned-models)
 - [Extras](#extras)
 ## Examples
 ### LoRA Fine-Tuning on a Single GPU
 #### (Continuous) Pre-Training
 ```bash
 CUDA_VISIBLE_DEVICES=0 llamafactory-cli train examples/lora_single_gpu/llama3_lora_pretrain.yaml
 ```
-examples/
+
-├── lora_single_gpu/
+#### Supervised Fine-Tuning
-│   ├── pretrain.sh: Do continuous pre-training using LoRA
+
-│   ├── sft.sh: Do supervised fine-tuning using LoRA
+```bash
-│   ├── reward.sh: Do reward modeling using LoRA
+CUDA_VISIBLE_DEVICES=0 llamafactory-cli train examples/lora_single_gpu/llama3_lora_sft.yaml
-│   ├── ppo.sh: Do PPO training using LoRA
+```
-│   ├── dpo.sh: Do DPO training using LoRA
+
-│   ├── orpo.sh: Do ORPO training using LoRA
+#### Multimodal Supervised Fine-Tuning
-│   ├── sft_mllm.sh: Do supervised fine-tuning on multimodal data using LoRA
+
-│   ├── prepare.sh: Save tokenized dataset
+```bash
-│   └── predict.sh: Do batch predict and compute BLEU and ROUGE scores after LoRA tuning
+CUDA_VISIBLE_DEVICES=0 llamafactory-cli train examples/lora_single_gpu/llava1_5_lora_sft.yaml
-├── qlora_single_gpu/
+```
-│   ├── bitsandbytes.sh: Fine-tune 4/8-bit BNB models using QLoRA
+
-│   ├── gptq.sh: Fine-tune 4/8-bit GPTQ models using QLoRA
+#### Reward Modeling
-│   ├── awq.sh: Fine-tune 4-bit AWQ models using QLoRA
+
-│   └── aqlm.sh: Fine-tune 2-bit AQLM models using QLoRA
+```bash
-├── lora_multi_gpu/
+CUDA_VISIBLE_DEVICES=0 llamafactory-cli train examples/lora_single_gpu/llama3_lora_reward.yaml
-│   ├── single_node.sh: Fine-tune model with Accelerate on single node using LoRA
+```
-│   ├── multi_node.sh: Fine-tune model with Accelerate on multiple nodes using LoRA
+
-│   └── ds_zero3.sh: Fine-tune model with DeepSpeed ZeRO-3 using LoRA (weight sharding)
+#### PPO Training
-├── full_multi_gpu/
+
-│   ├── single_node.sh: Full fine-tune model with DeepSpeed on single node
+```bash
-│   ├── multi_node.sh: Full fine-tune model with DeepSpeed on multiple nodes
+CUDA_VISIBLE_DEVICES=0 llamafactory-cli train examples/lora_single_gpu/llama3_lora_ppo.yaml
-│   └── predict.sh: Do parallel batch predict and compute BLEU and ROUGE scores after full tuning
+```
-├── merge_lora/
+
-│   ├── merge.sh: Merge LoRA weights into the pre-trained models
+#### DPO Training
-│   └── quantize.sh: Quantize the fine-tuned model with AutoGPTQ
+
-├── inference/
+```bash
-│   ├── cli_demo.sh: Chat with fine-tuned model in the CLI with LoRA adapters
+CUDA_VISIBLE_DEVICES=0 llamafactory-cli train examples/lora_single_gpu/llama3_lora_dpo.yaml
-│   ├── api_demo.sh: Chat with fine-tuned model in an OpenAI-style API with LoRA adapters
+```
-│   ├── web_demo.sh: Chat with fine-tuned model in the Web browser with LoRA adapters
+
-│   └── evaluate.sh: Evaluate model on the MMLU/CMMLU/C-Eval benchmarks with LoRA adapters
+#### ORPO Training
-└── extras/
+
-    ├── galore/
+```bash
-    │   └── sft.sh: Fine-tune model with GaLore
+CUDA_VISIBLE_DEVICES=0 llamafactory-cli train examples/lora_single_gpu/llama3_lora_orpo.yaml
-    ├── badam/
+```
-    │   └── sft.sh: Fine-tune model with BAdam
+
-    ├── loraplus/
+#### Preprocess Dataset
-    │   └── sft.sh: Fine-tune model using LoRA+
+
-    ├── mod/
+Preprocessing ahead of time is useful for large datasets. Set `tokenized_path` in the config to load the preprocessed dataset.
-    │   └── sft.sh: Fine-tune model using Mixture-of-Depths
+
-    ├── llama_pro/
+```bash
-    │   ├── expand.sh: Expand layers in the model
+CUDA_VISIBLE_DEVICES=0 llamafactory-cli train examples/lora_single_gpu/llama3_preprocess.yaml
-    │   └── sft.sh: Fine-tune the expanded model
+```
-    └── fsdp_qlora/
+
-        └── sft.sh: Fine-tune quantized model with FSDP+QLoRA
+#### Evaluating on MMLU/CMMLU/C-Eval Benchmarks
 ```bash
 CUDA_VISIBLE_DEVICES=0 llamafactory-cli eval examples/lora_single_gpu/llama3_lora_eval.yaml
 ```
 #### Batch Predicting and Computing BLEU and ROUGE Scores
 ```bash
 CUDA_VISIBLE_DEVICES=0 llamafactory-cli train examples/lora_single_gpu/llama3_lora_predict.yaml
 ```
 ### QLoRA Fine-Tuning on a Single GPU
 #### Supervised Fine-Tuning with 4/8-bit Bitsandbytes Quantization (Recommended)
 ```bash
 CUDA_VISIBLE_DEVICES=0 llamafactory-cli train examples/qlora_single_gpu/llama3_lora_sft_bitsandbytes.yaml
 ```
 #### Supervised Fine-Tuning with 4/8-bit GPTQ Quantization
 ```bash
 CUDA_VISIBLE_DEVICES=0 llamafactory-cli train examples/qlora_single_gpu/llama3_lora_sft_gptq.yaml
 ```
 #### Supervised Fine-Tuning with 4-bit AWQ Quantization
 ```bash
 CUDA_VISIBLE_DEVICES=0 llamafactory-cli train examples/qlora_single_gpu/llama3_lora_sft_awq.yaml
 ```
 #### Supervised Fine-Tuning with 2-bit AQLM Quantization
 ```bash
 CUDA_VISIBLE_DEVICES=0 llamafactory-cli train examples/qlora_single_gpu/llama3_lora_sft_aqlm.yaml
 ```
 ### LoRA Fine-Tuning on Multiple GPUs
 #### Supervised Fine-Tuning with Accelerate on Single Node
 ```bash
 bash examples/lora_multi_gpu/single_node.sh
 ```
 #### Supervised Fine-Tuning with Accelerate on Multiple Nodes
 ```bash
 bash examples/lora_multi_gpu/multi_node.sh
 ```
 #### Supervised Fine-Tuning with DeepSpeed ZeRO-3 (Weight Sharding)
 ```bash
 bash examples/lora_multi_gpu/ds_zero3.sh
 ```
 ### LoRA Fine-Tuning on Multiple NPUs
 #### Supervised Fine-Tuning with DeepSpeed ZeRO-0
 ```bash
 bash examples/lora_multi_npu/ds_zero0.sh
 ```
 ### Full-Parameter Fine-Tuning on Multiple GPUs
 #### Supervised Fine-Tuning with DeepSpeed on Single Node
 ```bash
 bash examples/full_multi_gpu/single_node.sh
 ```
 #### Supervised Fine-Tuning with DeepSpeed on Multiple Nodes
 ```bash
 bash examples/full_multi_gpu/multi_node.sh
 ```
 #### Batch Predicting and Computing BLEU and ROUGE Scores
 ```bash
 bash examples/full_multi_gpu/predict.sh
 ```
 ### Merging LoRA Adapters and Quantization
 #### Merge LoRA Adapters
 Note: DO NOT use a quantized model or the `quantization_bit` argument when merging LoRA adapters.
 ```bash
 CUDA_VISIBLE_DEVICES=0 llamafactory-cli export examples/merge_lora/llama3_lora_sft.yaml
 ```
 #### Quantizing Model using AutoGPTQ
 ```bash
 CUDA_VISIBLE_DEVICES=0 llamafactory-cli export examples/merge_lora/llama3_gptq.yaml
 ```
 ### Inferring LoRA Fine-Tuned Models
 #### Use CLI
 ```bash
 CUDA_VISIBLE_DEVICES=0 llamafactory-cli chat examples/merge_lora/llama3_lora_sft.yaml
 ```
 #### Use Web UI
 ```bash
 CUDA_VISIBLE_DEVICES=0 llamafactory-cli webchat examples/merge_lora/llama3_lora_sft.yaml
 ```
 #### Launch OpenAI-style API
 ```bash
 CUDA_VISIBLE_DEVICES=0 llamafactory-cli api examples/merge_lora/llama3_lora_sft.yaml
 ```
 ### Extras
 #### Full-Parameter Fine-Tuning using GaLore
 ```bash
 CUDA_VISIBLE_DEVICES=0 llamafactory-cli train examples/extras/galore/llama3_full_sft.yaml
 ```
 #### Full-Parameter Fine-Tuning using BAdam
 ```bash
 CUDA_VISIBLE_DEVICES=0 llamafactory-cli train examples/extras/badam/llama3_full_sft.yaml
 ```
 #### LoRA+ Fine-Tuning
 ```bash
 CUDA_VISIBLE_DEVICES=0 llamafactory-cli train examples/extras/loraplus/llama3_lora_sft.yaml
 ```
 #### Mixture-of-Depths Fine-Tuning
 ```bash
 CUDA_VISIBLE_DEVICES=0 llamafactory-cli train examples/extras/mod/llama3_full_sft.yaml
 ```
 #### LLaMA-Pro Fine-Tuning
 ```bash
 bash examples/extras/llama_pro/expand.sh
 CUDA_VISIBLE_DEVICES=0 llamafactory-cli train examples/extras/llama_pro/llama3_freeze_sft.yaml
 ```
 #### FSDP+QLoRA Fine-Tuning
 ```bash
 bash examples/extras/fsdp_qlora/single_node.sh
 ```
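Most of the commands above share one shape: `CUDA_VISIBLE_DEVICES=<ids> llamafactory-cli <action> <config.yaml>`. As a minimal sketch, assuming `llamafactory-cli` is on `PATH`, the same invocation can be scripted from Python. The function names and the `dry_run` flag are hypothetical conveniences and are not part of LLaMA-Factory.

```python
import os
import subprocess

# Subcommands that appear in the README examples above.
KNOWN_ACTIONS = {"train", "eval", "export", "chat", "webchat", "api"}


def build_command(action: str, config_path: str) -> list:
    """Build the argv list for `llamafactory-cli <action> <config>`."""
    if action not in KNOWN_ACTIONS:
        raise ValueError(f"unexpected action: {action}")
    return ["llamafactory-cli", action, config_path]


def launch(action: str, config_path: str, gpu_ids: str = "0", dry_run: bool = True) -> list:
    """Run the command with CUDA_VISIBLE_DEVICES set; only print it when dry_run is True."""
    cmd = build_command(action, config_path)
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=gpu_ids)
    if dry_run:
        print(f"CUDA_VISIBLE_DEVICES={gpu_ids}", " ".join(cmd))
    else:
        # check=True raises CalledProcessError if the CLI exits with a non-zero status.
        subprocess.run(cmd, env=env, check=True)
    return cmd


if __name__ == "__main__":
    launch("train", "examples/lora_single_gpu/llama3_lora_sft.yaml")
```

Passing `dry_run=False` executes the command; the default only prints it, which is convenient for checking paths before starting a long run.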
--- a/examples/README_zh.md
+++ b/examples/README_zh.md
@@ -1,50 +1,229 @@
 We provide a variety of example scripts for fine-tuning large models.
 Make sure to run the commands below from the `LLaMA-Factory` directory.
 ## Table of Contents
 - [Single-GPU LoRA Fine-Tuning](#single-gpu-lora-fine-tuning)
 - [Single-GPU QLoRA Fine-Tuning](#single-gpu-qlora-fine-tuning)
 - [Multi-GPU LoRA Fine-Tuning](#multi-gpu-lora-fine-tuning)
 - [Multi-NPU LoRA Fine-Tuning](#multi-npu-lora-fine-tuning)
 - [Multi-GPU Full-Parameter Fine-Tuning](#multi-gpu-full-parameter-fine-tuning)
 - [Merging LoRA Adapters and Model Quantization](#merging-lora-adapters-and-model-quantization)
 - [Inference with LoRA Models](#inference-with-lora-models)
 - [Miscellaneous](#miscellaneous)
 ## Examples
 ### Single-GPU LoRA Fine-Tuning
 #### (Incremental) Pre-Training
 ```bash
 CUDA_VISIBLE_DEVICES=0 llamafactory-cli train examples/lora_single_gpu/llama3_lora_pretrain.yaml
 ```
-examples/
+
-├── lora_single_gpu/
+#### Supervised Instruction Fine-Tuning
-│   ├── pretrain.sh: Incremental pre-training with LoRA
+
-│   ├── sft.sh: Supervised instruction fine-tuning with LoRA
+```bash
-│   ├── reward.sh: Reward model training with LoRA
+CUDA_VISIBLE_DEVICES=0 llamafactory-cli train examples/lora_single_gpu/llama3_lora_sft.yaml
-│   ├── ppo.sh: PPO training with LoRA
+```
-│   ├── dpo.sh: DPO training with LoRA
+
-│   ├── orpo.sh: ORPO training with LoRA
+#### Multimodal Supervised Instruction Fine-Tuning
-│   ├── sft_mllm.sh: Multimodal supervised instruction fine-tuning with LoRA
+
-│   ├── prepare.sh: Save the preprocessed dataset
+```bash
-│   └── predict.sh: Batch prediction with LoRA and computing BLEU and ROUGE scores
+CUDA_VISIBLE_DEVICES=0 llamafactory-cli train examples/lora_single_gpu/llava1_5_lora_sft.yaml
-├── qlora_single_gpu/
+```
-│   ├── bitsandbytes.sh: Fine-tune 4/8-bit BNB models with QLoRA
+
-│   ├── gptq.sh: Fine-tune 4/8-bit GPTQ models with QLoRA
+#### Reward Model Training
-│   ├── awq.sh: Fine-tune 4-bit AWQ models with QLoRA
+
-│   └── aqlm.sh: Fine-tune 2-bit AQLM models with QLoRA
+```bash
-├── lora_multi_gpu/
+CUDA_VISIBLE_DEVICES=0 llamafactory-cli train examples/lora_single_gpu/llama3_lora_reward.yaml
-│   ├── single_node.sh: Single-node LoRA training with Accelerate
+```
-│   ├── multi_node.sh: Multi-node LoRA training with Accelerate
+
-│   └── ds_zero3.sh: LoRA training with DeepSpeed ZeRO-3 (weight sharding)
+#### PPO Training
-├── full_multi_gpu/
+
-│   ├── single_node.sh: Single-node full-parameter training with DeepSpeed
+```bash
-│   ├── multi_node.sh: Multi-node full-parameter training with DeepSpeed
+CUDA_VISIBLE_DEVICES=0 llamafactory-cli train examples/lora_single_gpu/llama3_lora_ppo.yaml
-│   └── predict.sh: Multi-GPU batch prediction after full-parameter training and computing BLEU and ROUGE scores
+```
-├── merge_lora/
+
-│   ├── merge.sh: Merge LoRA weights into the pre-trained model
+#### DPO Training
-│   └── quantize.sh: Quantize the fine-tuned model with AutoGPTQ
+
-├── inference/
+```bash
-│   ├── cli_demo.sh: Launch the command-line inference interface for the LoRA model
+CUDA_VISIBLE_DEVICES=0 llamafactory-cli train examples/lora_single_gpu/llama3_lora_dpo.yaml
-│   ├── api_demo.sh: Launch an OpenAI-style API for the LoRA model
+```
-│   ├── web_demo.sh: Launch the browser inference interface for the LoRA model
+
-│   └── evaluate.sh: Evaluate the LoRA model on the MMLU/CMMLU/C-Eval datasets
+#### ORPO Training
-└── extras/
+
-    ├── galore/
+```bash
-    │   └── sft.sh: Train the model with GaLore
+CUDA_VISIBLE_DEVICES=0 llamafactory-cli train examples/lora_single_gpu/llama3_lora_orpo.yaml
-    ├── badam/
+```
-    │   └── sft.sh: Train the model with BAdam
+
-    ├── loraplus/
+#### Preprocessing the Dataset
-    │   └── sft.sh: Train the model with LoRA+
+
-    ├── mod/
+This is helpful for large datasets. Set `tokenized_path` in the config to load the preprocessed dataset.
-    │   └── sft.sh: Train the model with Mixture-of-Depths
+
-    ├── llama_pro/
+```bash
-    │   ├── expand.sh: Expand the layers in the model
+CUDA_VISIBLE_DEVICES=0 llamafactory-cli train examples/lora_single_gpu/llama3_preprocess.yaml
-    │   └── sft.sh: Train the expanded model
+```
-    └── fsdp_qlora/
+
-        └── sft.sh: Fine-tune the quantized model with FSDP+QLoRA
+#### Evaluating on MMLU/CMMLU/C-Eval
 ```bash
 CUDA_VISIBLE_DEVICES=0 llamafactory-cli eval examples/lora_single_gpu/llama3_lora_eval.yaml
 ```
 #### Batch Prediction and Computing BLEU and ROUGE Scores
 ```bash
 CUDA_VISIBLE_DEVICES=0 llamafactory-cli train examples/lora_single_gpu/llama3_lora_predict.yaml
 ```
 ### Single-GPU QLoRA Fine-Tuning
 #### Supervised Instruction Fine-Tuning with 4/8-bit Bitsandbytes Quantization (Recommended)
 ```bash
 CUDA_VISIBLE_DEVICES=0 llamafactory-cli train examples/qlora_single_gpu/llama3_lora_sft_bitsandbytes.yaml
 ```
 #### Supervised Instruction Fine-Tuning with 4/8-bit GPTQ Quantization
 ```bash
 CUDA_VISIBLE_DEVICES=0 llamafactory-cli train examples/qlora_single_gpu/llama3_lora_sft_gptq.yaml
 ```
 #### Supervised Instruction Fine-Tuning with 4-bit AWQ Quantization
 ```bash
 CUDA_VISIBLE_DEVICES=0 llamafactory-cli train examples/qlora_single_gpu/llama3_lora_sft_awq.yaml
 ```
 #### Supervised Instruction Fine-Tuning with 2-bit AQLM Quantization
 ```bash
 CUDA_VISIBLE_DEVICES=0 llamafactory-cli train examples/qlora_single_gpu/llama3_lora_sft_aqlm.yaml
 ```
 ### Multi-GPU LoRA Fine-Tuning
 #### Single-Node Training with Accelerate
 ```bash
 bash examples/lora_multi_gpu/single_node.sh
 ```
 #### Multi-Node Training with Accelerate
 ```bash
 bash examples/lora_multi_gpu/multi_node.sh
 ```
 #### Distributing GPU Memory Evenly with DeepSpeed ZeRO-3
 ```bash
 bash examples/lora_multi_gpu/ds_zero3.sh
 ```
 ### Multi-NPU LoRA Fine-Tuning
 #### Training with DeepSpeed ZeRO-0
 ```bash
 bash examples/lora_multi_npu/ds_zero0.sh
 ```
 ### Multi-GPU Full-Parameter Fine-Tuning
 #### Single-Node Training with DeepSpeed
 ```bash
 bash examples/full_multi_gpu/single_node.sh
 ```
 #### Multi-Node Training with DeepSpeed
 ```bash
 bash examples/full_multi_gpu/multi_node.sh
 ```
 #### Batch Prediction and Computing BLEU and ROUGE Scores
 ```bash
 bash examples/full_multi_gpu/predict.sh
 ```
 ### Merging LoRA Adapters and Model Quantization
 #### Merging LoRA Adapters
 Note: Do not use a quantized model or the `quantization_bit` parameter when merging LoRA adapters.
 ```bash
 CUDA_VISIBLE_DEVICES=0 llamafactory-cli export examples/merge_lora/llama3_lora_sft.yaml
 ```
 #### Quantizing the Model with AutoGPTQ
 ```bash
 CUDA_VISIBLE_DEVICES=0 llamafactory-cli export examples/merge_lora/llama3_gptq.yaml
 ```
 ### Inference with LoRA Models
 #### Using the Command-Line Interface
 ```bash
 CUDA_VISIBLE_DEVICES=0 llamafactory-cli chat examples/merge_lora/llama3_lora_sft.yaml
 ```
 #### Using the Browser Interface
 ```bash
 CUDA_VISIBLE_DEVICES=0 llamafactory-cli webchat examples/merge_lora/llama3_lora_sft.yaml
 ```
 #### Launching an OpenAI-Style API
 ```bash
 CUDA_VISIBLE_DEVICES=0 llamafactory-cli api examples/merge_lora/llama3_lora_sft.yaml
 ```
 ### Miscellaneous
 #### Full-Parameter Training with GaLore
 ```bash
 CUDA_VISIBLE_DEVICES=0 llamafactory-cli train examples/extras/galore/llama3_full_sft.yaml
 ```
 #### Full-Parameter Training with BAdam
 ```bash
 CUDA_VISIBLE_DEVICES=0 llamafactory-cli train examples/extras/badam/llama3_full_sft.yaml
 ```
 #### LoRA+ Fine-Tuning
 ```bash
 CUDA_VISIBLE_DEVICES=0 llamafactory-cli train examples/extras/loraplus/llama3_lora_sft.yaml
 ```
 #### Mixture-of-Depths Fine-Tuning
 ```bash
 CUDA_VISIBLE_DEVICES=0 llamafactory-cli train examples/extras/mod/llama3_full_sft.yaml
 ```
 #### LLaMA-Pro Fine-Tuning
 ```bash
 bash examples/extras/llama_pro/expand.sh
 CUDA_VISIBLE_DEVICES=0 llamafactory-cli train examples/extras/llama_pro/llama3_freeze_sft.yaml
 ```
 #### FSDP+QLoRA Fine-Tuning
 ```bash
 bash examples/extras/fsdp_qlora/single_node.sh
 ```
--- a/examples/extras/badam/llama3_lora_sft.yaml
+++ b/examples/extras/badam/llama3_lora_sft.yaml
@@ -0,0 +1,41 @@
 # model
 model_name_or_path: meta-llama/Meta-Llama-3-8B-Instruct
 # method
 stage: sft
 do_train: true
 finetuning_type: full
 use_badam: true
 badam_switch_mode: descending
 badam_switch_interval: 50
 badam_verbose: 2
 # dataset
 dataset: identity,alpaca_gpt4_en
 template: llama3
 cutoff_len: 1024
 max_samples: 1000
 overwrite_cache: true
 preprocessing_num_workers: 16
 # output
 output_dir: saves/llama3-8b/full/sft
 logging_steps: 10
 save_steps: 500
 plot_loss: true
 overwrite_output_dir: true
 # train
 per_device_train_batch_size: 1
 gradient_accumulation_steps: 8
 learning_rate: 0.0001
 num_train_epochs: 3.0
 lr_scheduler_type: cosine
 warmup_ratio: 0.1
 pure_bf16: true
 # eval
 val_size: 0.1
 per_device_eval_batch_size: 1
 evaluation_strategy: steps
 eval_steps: 500
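The BAdam settings in this config (`badam_switch_mode: descending`, `badam_switch_interval: 50`) describe a block-coordinate schedule: one block of parameters is trained at a time, and the active block changes every 50 optimizer steps. The sketch below illustrates that schedule under stated assumptions. The block count of 32 is hypothetical, and the reading of "descending" as "start from the last block and move toward the first" is an assumption about BAdam's semantics rather than something this diff shows.

```python
def active_block(step: int, num_blocks: int, switch_interval: int) -> int:
    """Return the index of the block being trained at a given optimizer step.

    Assumes a "descending" order: the last block is trained first, then the
    one before it, wrapping around after the first block has been trained.
    """
    if num_blocks <= 0 or switch_interval <= 0:
        raise ValueError("num_blocks and switch_interval must be positive")
    # Number of block switches that have happened so far.
    switches = step // switch_interval
    return num_blocks - 1 - (switches % num_blocks)


if __name__ == "__main__":
    # 32 blocks is an illustrative value, not taken from the config.
    for step in (0, 49, 50, 100, 1600):
        print(step, active_block(step, num_blocks=32, switch_interval=50))
```

With these numbers, a full pass over all blocks takes 32 × 50 = 1600 optimizer steps.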
--- a/examples/extras/badam/sft.sh
+++ b/examples/extras/badam/sft.sh
@@ -1,35 +0,0 @@
 #!/bin/bash
 CUDA_VISIBLE_DEVICES=0 python ../../../src/train_bash.py \
    --stage sft \
    --do_train \
    --model_name_or_path meta-llama/Llama-2-7b-hf \
    --dataset alpaca_gpt4_en,glaive_toolcall \
    --dataset_dir ../../../data \
    --template default \
    --finetuning_type full \
    --use_badam \
    --badam_switch_mode descending \
    --badam_switch_block_every 50 \
    --badam_verbose 2 \
    --output_dir ../../../saves/LLaMA2-7B/badam/sft \
    --overwrite_cache \
    --overwrite_output_dir \
    --cutoff_len 1024 \
    --preprocessing_num_workers 16 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 8 \
    --lr_scheduler_type cosine \
    --logging_steps 10 \
    --warmup_steps 20 \
    --save_steps 100 \
    --eval_steps 100 \
    --evaluation_strategy steps \
    --load_best_model_at_end \
    --learning_rate 5e-5 \
    --num_train_epochs 3.0 \
    --max_samples 3000 \
    --val_size 0.1 \
    --plot_loss \
    --pure_bf16
--- a/examples/extras/fsdp_qlora/llama3_lora_sft.yaml
+++ b/examples/extras/fsdp_qlora/llama3_lora_sft.yaml
@@ -0,0 +1,42 @@
 # model
 model_name_or_path: meta-llama/Meta-Llama-3-8B-Instruct
 quantization_bit: 4
 # method
 stage: sft
 do_train: true
 finetuning_type: lora
 lora_target: q_proj,v_proj
 # ddp
 ddp_timeout: 180000000
 # dataset
 dataset: identity,alpaca_gpt4_en
 template: llama3
 cutoff_len: 1024
 max_samples: 1000
 overwrite_cache: true
 preprocessing_num_workers: 16
 # output
 output_dir: saves/llama3-8b/lora/sft
 logging_steps: 10
 save_steps: 500
 plot_loss: true
 overwrite_output_dir: true
 # train
 per_device_train_batch_size: 1
 gradient_accumulation_steps: 8
 learning_rate: 0.0001
 num_train_epochs: 3.0
 lr_scheduler_type: cosine
 warmup_ratio: 0.1
 fp16: true
 # eval
 val_size: 0.1
 per_device_eval_batch_size: 1
 evaluation_strategy: steps
 eval_steps: 500
--- a/examples/extras/fsdp_qlora/sft.sh
+++ b/examples/extras/fsdp_qlora/sft.sh
@@ -1,41 +0,0 @@
 #!/bin/bash
 # DO NOT use GPTQ/AWQ model in FSDP+QLoRA
 pip install "transformers>=4.39.1"
 pip install "accelerate>=0.28.0"
 pip install "bitsandbytes>=0.43.0"
 CUDA_VISIBLE_DEVICES=0,1 accelerate launch \
    --config_file ../../accelerate/fsdp_config.yaml \
    ../../../src/train_bash.py \
    --stage sft \
    --do_train \
    --model_name_or_path meta-llama/Llama-2-70b-hf \
    --dataset alpaca_gpt4_en,glaive_toolcall \
    --dataset_dir ../../../data \
    --template default \
    --finetuning_type lora \
    --lora_target q_proj,v_proj \
    --output_dir ../../../saves/LLaMA2-70B/lora/sft \
    --overwrite_cache \
    --overwrite_output_dir \
    --cutoff_len 1024 \
    --preprocessing_num_workers 16 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 4 \
    --lr_scheduler_type cosine \
    --logging_steps 10 \
    --warmup_steps 20 \
    --save_steps 100 \
    --eval_steps 100 \
    --evaluation_strategy steps \
    --load_best_model_at_end \
    --learning_rate 5e-5 \
    --num_train_epochs 3.0 \
    --max_samples 3000 \
    --val_size 0.1 \
    --ddp_timeout 180000000 \
    --quantization_bit 4 \
    --plot_loss \
    --fp16
--- a/examples/extras/fsdp_qlora/single_node.sh
+++ b/examples/extras/fsdp_qlora/single_node.sh
@@ -0,0 +1,10 @@
 #!/bin/bash
 # DO NOT use GPTQ/AWQ model in FSDP+QLoRA
 pip install "transformers>=4.39.1"
 pip install "accelerate>=0.28.0"
 pip install "bitsandbytes>=0.43.0"
 CUDA_VISIBLE_DEVICES=0,1 accelerate launch \
    --config_file examples/accelerate/fsdp_config.yaml \
    src/train.py examples/extras/fsdp_qlora/llama3_lora_sft.yaml
--- a/examples/extras/galore/llama3_full_sft.yaml
+++ b/examples/extras/galore/llama3_full_sft.yaml
@@ -0,0 +1,42 @@
 # model
 model_name_or_path: meta-llama/Meta-Llama-3-8B-Instruct
 # method
 stage: sft
 do_train: true
 finetuning_type: full
 use_galore: true
 galore_layerwise: true
 galore_target: mlp,self_attn
 galore_rank: 128
 galore_scale: 2.0
 # dataset
 dataset: identity,alpaca_gpt4_en
 template: llama3
 cutoff_len: 1024
 max_samples: 1000
 overwrite_cache: true
 preprocessing_num_workers: 16
 # output
 output_dir: saves/llama3-8b/full/sft
 logging_steps: 10
 save_steps: 500
 plot_loss: true
 overwrite_output_dir: true
 # train
 per_device_train_batch_size: 1
 gradient_accumulation_steps: 1
 learning_rate: 0.0001
 num_train_epochs: 3.0
 lr_scheduler_type: cosine
 warmup_ratio: 0.1
 pure_bf16: true
 # eval
 val_size: 0.1
 per_device_eval_batch_size: 1
 evaluation_strategy: steps
 eval_steps: 500
--- a/examples/extras/galore/sft.sh
+++ b/examples/extras/galore/sft.sh
@@ -1,36 +0,0 @@
 #!/bin/bash
 CUDA_VISIBLE_DEVICES=0 python ../../../src/train_bash.py \
    --stage sft \
    --do_train \
    --model_name_or_path meta-llama/Llama-2-7b-hf \
    --dataset alpaca_gpt4_en,glaive_toolcall \
    --dataset_dir ../../../data \
    --template default \
    --finetuning_type full \
    --use_galore \
    --galore_layerwise \
    --galore_target mlp,self_attn \
    --galore_rank 128 \
    --galore_scale 2.0 \
    --output_dir ../../../saves/LLaMA2-7B/galore/sft \
    --overwrite_cache \
    --overwrite_output_dir \
    --cutoff_len 1024 \
    --preprocessing_num_workers 16 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 1 \
    --lr_scheduler_type cosine \
    --logging_steps 10 \
    --warmup_steps 20 \
    --save_steps 100 \
    --eval_steps 100 \
    --evaluation_strategy steps \
    --load_best_model_at_end \
    --learning_rate 5e-5 \
    --num_train_epochs 3.0 \
    --max_samples 3000 \
    --val_size 0.1 \
    --plot_loss \
    --pure_bf16
--- a/examples/extras/llama_pro/expand.sh
+++ b/examples/extras/llama_pro/expand.sh
@@ -1,6 +1,6 @@
 #!/bin/bash
-python ../../../scripts/llama_pro.py \
+python scripts/llama_pro.py \
-    --model_name_or_path meta-llama/Llama-2-7b-hf \
+    --model_name_or_path meta-llama/Meta-Llama-3-8B-Instruct \
-    --output_dir ../../../models/llama2-7b-pro \
+    --output_dir models/llama3-8b-instruct-pro \
    --num_expand 8
--- a/examples/extras/llama_pro/llama3_freeze_sft.yaml
+++ b/examples/extras/llama_pro/llama3_freeze_sft.yaml
@@ -0,0 +1,40 @@
 # model
 model_name_or_path: models/llama3-8b-instruct-pro
 # method
 stage: sft
 do_train: true
 finetuning_type: freeze
 freeze_trainable_layers: 8
 freeze_trainable_modules: all
 use_llama_pro: true
 # dataset
 dataset: identity,alpaca_gpt4_en
 template: llama3
 cutoff_len: 1024
 max_samples: 1000
 overwrite_cache: true
 preprocessing_num_workers: 16
 # output
 output_dir: saves/llama3-8b-instruct-pro/freeze/sft
 logging_steps: 10
 save_steps: 500
 plot_loss: true
 overwrite_output_dir: true
 # train
 per_device_train_batch_size: 1
 gradient_accumulation_steps: 8
 learning_rate: 0.0001
 num_train_epochs: 3.0
 lr_scheduler_type: cosine
 warmup_ratio: 0.1
 fp16: true
 # eval
 val_size: 0.1
 per_device_eval_batch_size: 1
 evaluation_strategy: steps
 eval_steps: 500
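`expand.sh` passes `--num_expand 8`, and the config above then trains 8 layers (`freeze_trainable_layers: 8`) with `use_llama_pro: true`. The sketch below shows one way the expanded layer layout could look. It assumes the new blocks are interleaved evenly among the original ones, as the LLaMA Pro paper describes, and that the base model has 32 decoder layers. Neither `scripts/llama_pro.py` nor the layer count appears in this diff, so both are assumptions, and the function name is hypothetical.

```python
def expanded_layout(num_layers: int, num_expand: int) -> list:
    """Return a list of ("copied", original_index) or ("new", None) entries.

    One new block is appended after every group of
    num_layers // num_expand original layers.
    """
    if num_expand <= 0 or num_layers % num_expand != 0:
        raise ValueError("num_layers must be divisible by num_expand")
    group_size = num_layers // num_expand
    layout = []
    for i in range(num_layers):
        layout.append(("copied", i))
        if (i + 1) % group_size == 0:
            layout.append(("new", None))
    return layout


if __name__ == "__main__":
    layout = expanded_layout(num_layers=32, num_expand=8)
    new_positions = [idx for idx, (kind, _) in enumerate(layout) if kind == "new"]
    print(len(layout), new_positions)
```

Under these assumptions the expanded model has 40 layers, and the 8 trainable layers in the freeze config would correspond to the 8 newly added blocks.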
--- a/examples/extras/llama_pro/sft.sh
+++ b/examples/extras/llama_pro/sft.sh
@@ -1,34 +0,0 @@
 #!/bin/bash
 CUDA_VISIBLE_DEVICES=0 python ../../../src/train_bash.py \
    --stage sft \
    --do_train \
    --model_name_or_path ../../../models/llama2-7b-pro \
    --dataset alpaca_gpt4_en,glaive_toolcall \
    --dataset_dir ../../../data \
    --template default \
    --finetuning_type freeze \
    --name_module_trainable all \
    --num_layer_trainable 8 \
    --use_llama_pro \
    --output_dir ../../../saves/LLaMA2-7B-Pro/lora/sft \
    --overwrite_cache \
    --overwrite_output_dir \
    --cutoff_len 1024 \
    --preprocessing_num_workers 16 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 8 \
    --lr_scheduler_type cosine \
    --logging_steps 10 \
    --warmup_steps 20 \
    --save_steps 100 \
    --eval_steps 100 \
    --evaluation_strategy steps \
    --load_best_model_at_end \
    --learning_rate 5e-5 \
    --num_train_epochs 3.0 \
    --max_samples 3000 \
    --val_size 0.1 \
    --plot_loss \
    --fp16
--- a/examples/extras/loraplus/llama3_lora_sft.yaml
+++ b/examples/extras/loraplus/llama3_lora_sft.yaml
@@ -0,0 +1,39 @@
 # model
 model_name_or_path: meta-llama/Meta-Llama-3-8B-Instruct
 # method
 stage: sft
 do_train: true
 finetuning_type: lora
 lora_target: q_proj,v_proj
 loraplus_lr_ratio: 16.0
 # dataset
 dataset: identity,alpaca_gpt4_en
 template: llama3
 cutoff_len: 1024
 max_samples: 1000
 overwrite_cache: true
 preprocessing_num_workers: 16
 # output
 output_dir: saves/llama3-8b/lora/sft
 logging_steps: 10
 save_steps: 500
 plot_loss: true
 overwrite_output_dir: true
 # train
 per_device_train_batch_size: 1
 gradient_accumulation_steps: 8
 learning_rate: 0.0001
 num_train_epochs: 3.0
 lr_scheduler_type: cosine
 warmup_ratio: 0.1
 fp16: true
 # eval
 val_size: 0.1
 per_device_eval_batch_size: 1
 evaluation_strategy: steps
 eval_steps: 500
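The only LoRA+ specific setting in this config is `loraplus_lr_ratio: 16.0`. LoRA+ trains the LoRA `B` matrices with a larger learning rate than the `A` matrices, and the ratio is the multiplier between the two. A minimal arithmetic sketch, assuming the `A` matrices use the base `learning_rate` (the function name and dictionary keys are hypothetical):

```python
def loraplus_learning_rates(base_lr: float, lr_ratio: float) -> dict:
    """Return the per-group learning rates implied by a LoRA+ ratio."""
    if base_lr <= 0 or lr_ratio <= 0:
        raise ValueError("base_lr and lr_ratio must be positive")
    return {
        "lora_A": base_lr,             # A matrices keep the base rate
        "lora_B": base_lr * lr_ratio,  # B matrices are scaled by the ratio
    }


if __name__ == "__main__":
    # Values taken from the config above: learning_rate 0.0001, ratio 16.0.
    print(loraplus_learning_rates(base_lr=0.0001, lr_ratio=16.0))
```

With the values from the config, the `B` matrices would train at roughly 0.0016 while the `A` matrices stay at 0.0001.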
--- a/examples/extras/loraplus/sft.sh
+++ b/examples/extras/loraplus/sft.sh
@@ -1,33 +0,0 @@
 #!/bin/bash
 CUDA_VISIBLE_DEVICES=0 python ../../src/train_bash.py \
    --stage sft \
    --do_train \
    --model_name_or_path meta-llama/Llama-2-7b-hf \
    --dataset alpaca_gpt4_en,glaive_toolcall \
    --dataset_dir ../../data \
    --template default \
    --finetuning_type lora \
    --lora_target q_proj,v_proj \
    --loraplus_lr_ratio 16.0 \
    --output_dir ../../saves/LLaMA2-7B/loraplus/sft \
    --overwrite_cache \
    --overwrite_output_dir \
    --cutoff_len 1024 \
    --preprocessing_num_workers 16 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 8 \
    --lr_scheduler_type cosine \
    --logging_steps 10 \
    --warmup_steps 20 \
    --save_steps 100 \
    --eval_steps 100 \
    --evaluation_strategy steps \
    --load_best_model_at_end \
    --learning_rate 5e-5 \
    --num_train_epochs 3.0 \
    --max_samples 3000 \
    --val_size 0.1 \
    --plot_loss \
    --fp16
--- a/examples/extras/mod/llama3_full_sft.yaml
+++ b/examples/extras/mod/llama3_full_sft.yaml
@@ -0,0 +1,39 @@
 # model
 model_name_or_path: meta-llama/Meta-Llama-3-8B-Instruct
 # method
 stage: sft
 do_train: true
 finetuning_type: full
 mixture_of_depths: convert
 # dataset
 dataset: identity,alpaca_gpt4_en
 template: llama3
 cutoff_len: 1024
 max_samples: 1000
 overwrite_cache: true
 preprocessing_num_workers: 16
 # output
 output_dir: saves/llama3-8b-mod/full/sft
 logging_steps: 10
 save_steps: 500
 plot_loss: true
 overwrite_output_dir: true
 # train
 per_device_train_batch_size: 1
 gradient_accumulation_steps: 8
 optim: paged_adamw_8bit
 learning_rate: 0.0001
 num_train_epochs: 3.0
 lr_scheduler_type: cosine
 warmup_ratio: 0.1
 pure_bf16: true
 # eval
 val_size: 0.1
 per_device_eval_batch_size: 1
 evaluation_strategy: steps
 eval_steps: 500
--- a/examples/extras/mod/sft.sh
+++ b/examples/extras/mod/sft.sh
@@ -1,33 +0,0 @@
 #!/bin/bash
 CUDA_VISIBLE_DEVICES=0 python ../../../src/train_bash.py \
    --stage sft \
    --do_train \
    --model_name_or_path meta-llama/Llama-2-7b-hf \
    --dataset alpaca_gpt4_en,glaive_toolcall \
    --dataset_dir ../../../data \
    --template default \
    --finetuning_type full \
    --mixture_of_depths convert \
    --output_dir ../../../saves/LLaMA2-7B/mod/sft \
    --overwrite_cache \
    --overwrite_output_dir \
    --cutoff_len 1024 \
    --preprocessing_num_workers 16 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 8 \
    --optim paged_adamw_8bit \
    --lr_scheduler_type cosine \
    --logging_steps 10 \
    --warmup_steps 20 \
    --save_steps 100 \
    --eval_steps 100 \
    --evaluation_strategy steps \
    --load_best_model_at_end \
    --learning_rate 5e-5 \
    --num_train_epochs 3.0 \
    --max_samples 3000 \
    --val_size 0.1 \
    --plot_loss \
    --pure_bf16
--- a/examples/full_multi_gpu/llama3_full_predict.yaml
+++ b/examples/full_multi_gpu/llama3_full_predict.yaml
@@ -0,0 +1,23 @@
 # model
 model_name_or_path: saves/llama3-8b/full/sft
 # method
 stage: sft
 do_predict: true
 finetuning_type: full
 # dataset
 dataset: identity,alpaca_gpt4_en
 template: llama3
 cutoff_len: 1024
 max_samples: 50
 overwrite_cache: true
 preprocessing_num_workers: 16
 # output
 output_dir: saves/llama3-8b/full/predict
 overwrite_output_dir: true
 # eval
 per_device_eval_batch_size: 1
 predict_with_generate: true
--- a/examples/full_multi_gpu/llama3_full_sft.yaml
+++ b/examples/full_multi_gpu/llama3_full_sft.yaml
@@ -0,0 +1,41 @@
 # model
 model_name_or_path: meta-llama/Meta-Llama-3-8B-Instruct
 # method
 stage: sft
 do_train: true
 finetuning_type: full
 # ddp
 ddp_timeout: 180000000
 deepspeed: examples/deepspeed/ds_z3_config.json
 # dataset
 dataset: identity,alpaca_gpt4_en
 template: llama3
 cutoff_len: 1024
 max_samples: 1000
 overwrite_cache: true
 preprocessing_num_workers: 16
 # output
 output_dir: saves/llama3-8b/full/sft
 logging_steps: 10
 save_steps: 500
 plot_loss: true
 overwrite_output_dir: true
 # train
 per_device_train_batch_size: 1
 gradient_accumulation_steps: 2
 learning_rate: 0.0001
 num_train_epochs: 3.0
 lr_scheduler_type: cosine
 warmup_ratio: 0.1
 fp16: true
 # eval
 val_size: 0.1
 per_device_eval_batch_size: 1
 evaluation_strategy: steps
 eval_steps: 500
--- a/examples/full_multi_gpu/multi_node.sh
+++ b/examples/full_multi_gpu/multi_node.sh
@@ -1,38 +1,15 @@
 #!/bin/bash
-python -m torch.distributed.run \
+NPROC_PER_NODE=4
 NNODES=2
 RANK=0
 MASTER_ADDR=192.168.0.1
 MASTER_PORT=29500
 CUDA_VISIBLE_DEVICES=0,1,2,3 torchrun \
    --nproc_per_node $NPROC_PER_NODE \
    --nnodes $NNODES \
    --node_rank $RANK \
    --master_addr $MASTER_ADDR \
    --master_port $MASTER_PORT \
-    ../../src/train_bash.py \
+    src/train.py examples/full_multi_gpu/llama3_full_sft.yaml
-    --deepspeed ../deepspeed/ds_z3_config.json \
-    --stage sft \
-    --do_train \
-    --model_name_or_path meta-llama/Llama-2-7b-hf \
-    --dataset alpaca_gpt4_en,glaive_toolcall \
-    --dataset_dir ../../data \
-    --template default \
-    --finetuning_type full \
-    --output_dir ../../saves/LLaMA2-7B/full/sft \
-    --overwrite_cache \
-    --overwrite_output_dir \
-    --cutoff_len 1024 \
-    --preprocessing_num_workers 16 \
-    --per_device_train_batch_size 1 \
-    --per_device_eval_batch_size 1 \
-    --gradient_accumulation_steps 2 \
-    --lr_scheduler_type cosine \
-    --logging_steps 10 \
-    --warmup_steps 20 \
-    --save_steps 100 \
-    --eval_steps 100 \
-    --evaluation_strategy steps \
-    --learning_rate 5e-5 \
-    --num_train_epochs 3.0 \
-    --max_samples 3000 \
-    --val_size 0.1 \
-    --ddp_timeout 180000000 \
-    --plot_loss \
-    --fp16
--- a/examples/full_multi_gpu/predict.sh
+++ b/examples/full_multi_gpu/predict.sh
@@ -1,20 +1,5 @@
 #!/bin/bash
 CUDA_VISIBLE_DEVICES=0,1,2,3 accelerate launch \
-    --config_file ../accelerate/single_config.yaml \
+    --config_file examples/accelerate/single_config.yaml \
-    ../../src/train_bash.py \
+    src/train.py examples/full_multi_gpu/llama3_full_predict.yaml
-    --stage sft \
-    --do_predict \
-    --model_name_or_path ../../saves/LLaMA2-7B/full/sft \
-    --dataset alpaca_gpt4_en,glaive_toolcall \
-    --dataset_dir ../../data \
-    --template default \
-    --finetuning_type full \
-    --output_dir ../../saves/LLaMA2-7B/full/predict \
-    --overwrite_cache \
-    --overwrite_output_dir \
-    --cutoff_len 1024 \
-    --preprocessing_num_workers 16 \
-    --per_device_eval_batch_size 1 \
-    --max_samples 20 \
-    --predict_with_generate
--- a/examples/full_multi_gpu/single_node.sh
+++ b/examples/full_multi_gpu/single_node.sh
@@ -1,32 +1,15 @@
 #!/bin/bash
-deepspeed --num_gpus 4 ../../src/train_bash.py \
+NPROC_PER_NODE=4
-    --deepspeed ../deepspeed/ds_z3_config.json \
+NNODES=1
-    --stage sft \
+RANK=0
-    --do_train \
+MASTER_ADDR=127.0.0.1
-    --model_name_or_path meta-llama/Llama-2-7b-hf \
+MASTER_PORT=29500
-    --dataset alpaca_gpt4_en,glaive_toolcall \
+
-    --dataset_dir ../../data \
+CUDA_VISIBLE_DEVICES=0,1,2,3 torchrun \
-    --template default \
+    --nproc_per_node $NPROC_PER_NODE \
-    --finetuning_type full \
+    --nnodes $NNODES \
-    --output_dir ../../saves/LLaMA2-7B/full/sft \
+    --node_rank $RANK \
-    --overwrite_cache \
+    --master_addr $MASTER_ADDR \
-    --overwrite_output_dir \
+    --master_port $MASTER_PORT \
-    --cutoff_len 1024 \
+    src/train.py examples/full_multi_gpu/llama3_full_sft.yaml
-    --preprocessing_num_workers 16 \
-    --per_device_train_batch_size 1 \
-    --per_device_eval_batch_size 1 \
-    --gradient_accumulation_steps 2 \
-    --lr_scheduler_type cosine \
-    --logging_steps 10 \
-    --warmup_steps 20 \
-    --save_steps 100 \
-    --eval_steps 100 \
-    --evaluation_strategy steps \
-    --learning_rate 5e-5 \
-    --num_train_epochs 3.0 \
-    --max_samples 3000 \
-    --val_size 0.1 \
-    --ddp_timeout 180000000 \
-    --plot_loss \
-    --fp16
--- a/examples/inference/api_demo.sh
+++ b/examples/inference/api_demo.sh
@@ -1,7 +0,0 @@
 #!/bin/bash
 CUDA_VISIBLE_DEVICES=0 API_PORT=8000 python ../../src/api_demo.py \
    --model_name_or_path meta-llama/Llama-2-7b-hf \
    --adapter_name_or_path ../../saves/LLaMA2-7B/lora/sft \
    --template default \
    --finetuning_type lora
--- a/examples/inference/cli_demo.sh
+++ b/examples/inference/cli_demo.sh
@@ -1,7 +0,0 @@
 #!/bin/bash
 CUDA_VISIBLE_DEVICES=0 python ../../src/cli_demo.py \
    --model_name_or_path meta-llama/Llama-2-7b-hf \
    --adapter_name_or_path ../../saves/LLaMA2-7B/lora/sft \
    --template default \
    --finetuning_type lora
--- a/examples/inference/evaluate.sh
+++ b/examples/inference/evaluate.sh
@@ -1,12 +0,0 @@
 #!/bin/bash
 CUDA_VISIBLE_DEVICES=0 python ../../src/evaluate.py \
    --model_name_or_path meta-llama/Llama-2-7b-hf \
    --adapter_name_or_path ../../saves/LLaMA2-7B/lora/sft \
    --template fewshot \
    --finetuning_type lora \
    --task mmlu \
    --split test \
    --lang en \
    --n_shot 5 \
    --batch_size 4
--- a/examples/inference/llama3.yaml
+++ b/examples/inference/llama3.yaml
@@ -0,0 +1,2 @@
 model_name_or_path: meta-llama/Meta-Llama-3-8B-Instruct
 template: llama3
--- a/examples/inference/llama3_lora_sft.yaml
+++ b/examples/inference/llama3_lora_sft.yaml
@@ -0,0 +1,4 @@
 model_name_or_path: meta-llama/Meta-Llama-3-8B-Instruct
 adapter_name_or_path: saves/llama3-8b/lora/sft
 template: llama3
 finetuning_type: lora
--- a/examples/inference/llama3_vllm.yaml
+++ b/examples/inference/llama3_vllm.yaml
@@ -0,0 +1,4 @@
 model_name_or_path: meta-llama/Meta-Llama-3-8B-Instruct
 template: llama3
 infer_backend: vllm
 vllm_enforce_eager: true
--- a/examples/inference/web_demo.sh
+++ b/examples/inference/web_demo.sh
@@ -1,8 +0,0 @@
 #!/bin/bash
 # add `--visual_inputs True` to load MLLM
 CUDA_VISIBLE_DEVICES=0 python ../../src/web_demo.py \
    --model_name_or_path meta-llama/Llama-2-7b-hf \
    --adapter_name_or_path ../../saves/LLaMA2-7B/lora/sft \
    --template default \
    --finetuning_type lora
--- a/examples/lora_multi_gpu/ds_zero3.sh
+++ b/examples/lora_multi_gpu/ds_zero3.sh
@@ -1,33 +1,15 @@
 #!/bin/bash
-deepspeed --num_gpus 4 ../../src/train_bash.py \
+NPROC_PER_NODE=4
-    --deepspeed ../deepspeed/ds_z3_config.json \
+NNODES=1
-    --stage sft \
+RANK=0
-    --do_train \
+MASTER_ADDR=127.0.0.1
-    --model_name_or_path meta-llama/Llama-2-7b-hf \
+MASTER_PORT=29500
-    --dataset alpaca_gpt4_en,glaive_toolcall \
+
-    --dataset_dir ../../data \
+CUDA_VISIBLE_DEVICES=0,1,2,3 torchrun \
-    --template default \
+    --nproc_per_node $NPROC_PER_NODE \
-    --finetuning_type lora \
+    --nnodes $NNODES \
-    --lora_target q_proj,v_proj \
+    --node_rank $RANK \
-    --output_dir ../../saves/LLaMA2-7B/lora/sft \
+    --master_addr $MASTER_ADDR \
-    --overwrite_cache \
+    --master_port $MASTER_PORT \
-    --overwrite_output_dir \
+    src/train.py examples/lora_multi_gpu/llama3_lora_sft_ds.yaml
-    --cutoff_len 1024 \
-    --preprocessing_num_workers 16 \
-    --per_device_train_batch_size 1 \
-    --per_device_eval_batch_size 1 \
-    --gradient_accumulation_steps 2 \
-    --lr_scheduler_type cosine \
-    --logging_steps 10 \
-    --warmup_steps 20 \
-    --save_steps 100 \
-    --eval_steps 100 \
-    --evaluation_strategy steps \
-    --learning_rate 5e-5 \
-    --num_train_epochs 3.0 \
-    --max_samples 3000 \
-    --val_size 0.1 \
-    --ddp_timeout 180000000 \
-    --plot_loss \
-    --fp16
--- a/examples/lora_multi_gpu/llama3_lora_sft.yaml
+++ b/examples/lora_multi_gpu/llama3_lora_sft.yaml
@@ -0,0 +1,41 @@
 # model
 model_name_or_path: meta-llama/Meta-Llama-3-8B-Instruct
 # method
 stage: sft
 do_train: true
 finetuning_type: lora
 lora_target: q_proj,v_proj
 # ddp
 ddp_timeout: 180000000
 # dataset
 dataset: identity,alpaca_gpt4_en
 template: llama3
 cutoff_len: 1024
 max_samples: 1000
 overwrite_cache: true
 preprocessing_num_workers: 16
 # output
 output_dir: saves/llama3-8b/lora/sft
 logging_steps: 10
 save_steps: 500
 plot_loss: true
 overwrite_output_dir: true
 # train
 per_device_train_batch_size: 1
 gradient_accumulation_steps: 2
 learning_rate: 0.0001
 num_train_epochs: 3.0
 lr_scheduler_type: cosine
 warmup_ratio: 0.1
 fp16: true
 # eval
 val_size: 0.1
 per_device_eval_batch_size: 1
 evaluation_strategy: steps
 eval_steps: 500
--- a/examples/lora_multi_gpu/llama3_lora_sft_ds.yaml
+++ b/examples/lora_multi_gpu/llama3_lora_sft_ds.yaml
@@ -0,0 +1,42 @@
 # model
 model_name_or_path: meta-llama/Meta-Llama-3-8B-Instruct
 # method
 stage: sft
 do_train: true
 finetuning_type: lora
 lora_target: q_proj,v_proj
 # ddp
 ddp_timeout: 180000000
 deepspeed: examples/deepspeed/ds_z3_config.json
 # dataset
 dataset: identity,alpaca_gpt4_en
 template: llama3
 cutoff_len: 1024
 max_samples: 1000
 overwrite_cache: true
 preprocessing_num_workers: 16
 # output
 output_dir: saves/llama3-8b/lora/sft
 logging_steps: 10
 save_steps: 500
 plot_loss: true
 overwrite_output_dir: true
 # train
 per_device_train_batch_size: 1
 gradient_accumulation_steps: 2
 learning_rate: 0.0001
 num_train_epochs: 3.0
 lr_scheduler_type: cosine
 warmup_ratio: 0.1
 fp16: true
 # eval
 val_size: 0.1
 per_device_eval_batch_size: 1
 evaluation_strategy: steps
 eval_steps: 500
--- a/examples/lora_multi_gpu/multi_node.sh
+++ b/examples/lora_multi_gpu/multi_node.sh
@@ -2,35 +2,5 @@
 # also launch it on slave machine using slave_config.yaml
 CUDA_VISIBLE_DEVICES=0,1,2,3 accelerate launch \
-    --config_file ../accelerate/master_config.yaml \
+    --config_file examples/accelerate/master_config.yaml \
-    ../../src/train_bash.py \
+    src/train.py examples/lora_multi_gpu/llama3_lora_sft.yaml
-    --stage sft \
-    --do_train \
-    --model_name_or_path meta-llama/Llama-2-7b-hf \
-    --dataset alpaca_gpt4_en,glaive_toolcall \
-    --dataset_dir ../../data \
-    --template default \
-    --finetuning_type lora \
-    --lora_target q_proj,v_proj \
-    --output_dir ../../saves/LLaMA2-7B/lora/sft \
-    --overwrite_cache \
-    --overwrite_output_dir \
-    --cutoff_len 1024 \
-    --preprocessing_num_workers 16 \
-    --per_device_train_batch_size 1 \
-    --per_device_eval_batch_size 1 \
-    --gradient_accumulation_steps 2 \
-    --lr_scheduler_type cosine \
-    --logging_steps 10 \
-    --warmup_steps 20 \
-    --save_steps 100 \
-    --eval_steps 100 \
-    --evaluation_strategy steps \
-    --load_best_model_at_end \
-    --learning_rate 5e-5 \
-    --num_train_epochs 3.0 \
-    --max_samples 3000 \
-    --val_size 0.1 \
-    --ddp_timeout 180000000 \
-    --plot_loss \
-    --fp16
--- a/examples/lora_multi_gpu/single_node.sh
+++ b/examples/lora_multi_gpu/single_node.sh
@@ -1,35 +1,5 @@
 #!/bin/bash
 CUDA_VISIBLE_DEVICES=0,1,2,3 accelerate launch \
-    --config_file ../accelerate/single_config.yaml \
+    --config_file examples/accelerate/single_config.yaml \
-    ../../src/train_bash.py \
+    src/train.py examples/lora_multi_gpu/llama3_lora_sft.yaml
    --stage sft \
    --do_train \
    --model_name_or_path meta-llama/Llama-2-7b-hf \
    --dataset alpaca_gpt4_en,glaive_toolcall \
    --dataset_dir ../../data \
    --template default \
    --finetuning_type lora \
    --lora_target q_proj,v_proj \
    --output_dir ../../saves/LLaMA2-7B/lora/sft \
    --overwrite_cache \
    --overwrite_output_dir \
    --cutoff_len 1024 \
    --preprocessing_num_workers 16 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 2 \
    --lr_scheduler_type cosine \
    --logging_steps 10 \
    --warmup_steps 20 \
    --save_steps 100 \
    --eval_steps 100 \
    --evaluation_strategy steps \
    --load_best_model_at_end \
    --learning_rate 5e-5 \
    --num_train_epochs 3.0 \
    --max_samples 3000 \
    --val_size 0.1 \
    --ddp_timeout 180000000 \
    --plot_loss \
    --fp16
--- a/examples/lora_multi_npu/ds_zero0.sh
+++ b/examples/lora_multi_npu/ds_zero0.sh
@@ -0,0 +1,15 @@
 #!/bin/bash
 NPROC_PER_NODE=4
 NNODES=1
 RANK=0
 MASTER_ADDR=127.0.0.1
 MASTER_PORT=29500
 ASCEND_RT_VISIBLE_DEVICES=0,1,2,3 torchrun \
    --nproc_per_node $NPROC_PER_NODE \
    --nnodes $NNODES \
    --node_rank $RANK \
    --master_addr $MASTER_ADDR \
    --master_port $MASTER_PORT \
    src/train.py examples/lora_multi_npu/llama3_lora_sft_ds.yaml
--- a/examples/lora_multi_npu/llama3_lora_sft_ds.yaml
+++ b/examples/lora_multi_npu/llama3_lora_sft_ds.yaml
@@ -0,0 +1,42 @@
 # model
 model_name_or_path: meta-llama/Meta-Llama-3-8B-Instruct
 # method
 stage: sft
 do_train: true
 finetuning_type: lora
 lora_target: q_proj,v_proj
 # ddp
 ddp_timeout: 180000000
 deepspeed: examples/deepspeed/ds_z0_config.json
 # dataset
 dataset: identity,alpaca_gpt4_en
 template: llama3
 cutoff_len: 1024
 max_samples: 1000
 overwrite_cache: true
 preprocessing_num_workers: 16
 # output
 output_dir: saves/llama3-8b/lora/sft
 logging_steps: 10
 save_steps: 500
 plot_loss: true
 overwrite_output_dir: true
 # train
 per_device_train_batch_size: 1
 gradient_accumulation_steps: 2
 learning_rate: 0.0001
 num_train_epochs: 3.0
 lr_scheduler_type: cosine
 warmup_ratio: 0.1
 fp16: true
 # eval
 val_size: 0.1
 per_device_eval_batch_size: 1
 evaluation_strategy: steps
 eval_steps: 500
--- a/examples/lora_single_gpu/dpo.sh
+++ b/examples/lora_single_gpu/dpo.sh
@@ -1,35 +0,0 @@
 #!/bin/bash
 CUDA_VISIBLE_DEVICES=0 python ../../src/train_bash.py \
    --stage dpo \
    --do_train \
    --model_name_or_path meta-llama/Llama-2-7b-hf \
    --adapter_name_or_path ../../saves/LLaMA2-7B/lora/sft \
    --create_new_adapter \
    --dataset orca_rlhf \
    --dataset_dir ../../data \
    --template default \
    --finetuning_type lora \
    --lora_target q_proj,v_proj \
    --output_dir ../../saves/LLaMA2-7B/lora/dpo \
    --overwrite_cache \
    --overwrite_output_dir \
    --cutoff_len 1024 \
    --preprocessing_num_workers 16 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 8 \
    --lr_scheduler_type cosine \
    --logging_steps 10 \
    --warmup_steps 20 \
    --save_steps 100 \
    --eval_steps 100 \
    --evaluation_strategy steps \
    --load_best_model_at_end \
    --learning_rate 1e-5 \
    --num_train_epochs 1.0 \
    --max_samples 1000 \
    --val_size 0.1 \
    --dpo_ftx 1.0 \
    --plot_loss \
    --fp16
--- a/examples/lora_single_gpu/llama3_lora_dpo.yaml
+++ b/examples/lora_single_gpu/llama3_lora_dpo.yaml
@@ -0,0 +1,39 @@
 # model
 model_name_or_path: meta-llama/Meta-Llama-3-8B-Instruct
 # method
 stage: dpo
 do_train: true
 finetuning_type: lora
 lora_target: q_proj,v_proj
 dpo_ftx: 1.0
 # dataset
 dataset: orca_rlhf
 template: llama3
 cutoff_len: 1024
 max_samples: 1000
 overwrite_cache: true
 preprocessing_num_workers: 16
 # output
 output_dir: saves/llama3-8b/lora/dpo
 logging_steps: 10
 save_steps: 500
 plot_loss: true
 overwrite_output_dir: true
 # train
 per_device_train_batch_size: 1
 gradient_accumulation_steps: 8
 learning_rate: 0.00001
 num_train_epochs: 3.0
 lr_scheduler_type: cosine
 warmup_ratio: 0.1
 fp16: true
 # eval
 val_size: 0.1
 per_device_eval_batch_size: 1
 evaluation_strategy: steps
 eval_steps: 500
--- a/examples/lora_single_gpu/llama3_lora_eval.yaml
+++ b/examples/lora_single_gpu/llama3_lora_eval.yaml
@@ -0,0 +1,19 @@
 # model
 model_name_or_path: meta-llama/Meta-Llama-3-8B-Instruct
 adapter_name_or_path: saves/llama3-8b/lora/sft
 # method
 finetuning_type: lora
 # dataset
 task: mmlu
 split: test
 template: fewshot
 lang: en
 n_shot: 5
 # output
 save_dir: saves/llama3-8b/lora/eval
 # eval
 batch_size: 4
--- a/examples/lora_single_gpu/llama3_lora_orpo.yaml
+++ b/examples/lora_single_gpu/llama3_lora_orpo.yaml
@@ -0,0 +1,38 @@
 # model
 model_name_or_path: meta-llama/Meta-Llama-3-8B-Instruct
 # method
 stage: orpo
 do_train: true
 finetuning_type: lora
 lora_target: q_proj,v_proj
 # dataset
 dataset: orca_rlhf
 template: llama3
 cutoff_len: 1024
 max_samples: 1000
 overwrite_cache: true
 preprocessing_num_workers: 16
 # output
 output_dir: saves/llama3-8b/lora/orpo
 logging_steps: 10
 save_steps: 500
 plot_loss: true
 overwrite_output_dir: true
 # train
 per_device_train_batch_size: 1
 gradient_accumulation_steps: 8
 learning_rate: 0.00001
 num_train_epochs: 3.0
 lr_scheduler_type: cosine
 warmup_ratio: 0.1
 fp16: true
 # eval
 val_size: 0.1
 per_device_eval_batch_size: 1
 evaluation_strategy: steps
 eval_steps: 500
--- a/examples/lora_single_gpu/llama3_lora_ppo.yaml
+++ b/examples/lora_single_gpu/llama3_lora_ppo.yaml
@@ -0,0 +1,38 @@
 # model
 model_name_or_path: meta-llama/Meta-Llama-3-8B-Instruct
 reward_model: saves/llama3-8b/lora/reward
 # method
 stage: ppo
 do_train: true
 finetuning_type: lora
 lora_target: q_proj,v_proj
 # dataset
 dataset: identity,alpaca_gpt4_en
 template: llama3
 cutoff_len: 1024
 max_samples: 1000
 overwrite_cache: true
 preprocessing_num_workers: 16
 # output
 output_dir: saves/llama3-8b/lora/ppo
 logging_steps: 10
 save_steps: 500
 plot_loss: true
 overwrite_output_dir: true
 # train
 per_device_train_batch_size: 1
 gradient_accumulation_steps: 8
 learning_rate: 0.00001
 num_train_epochs: 3.0
 lr_scheduler_type: cosine
 warmup_ratio: 0.1
 fp16: true
 # generate
 max_new_tokens: 512
 top_k: 0
 top_p: 0.9
--- a/examples/lora_single_gpu/llama3_lora_predict.yaml
+++ b/examples/lora_single_gpu/llama3_lora_predict.yaml
@@ -0,0 +1,24 @@
 # model
 model_name_or_path: meta-llama/Meta-Llama-3-8B-Instruct
 adapter_name_or_path: saves/llama3-8b/lora/sft
 # method
 stage: sft
 do_predict: true
 finetuning_type: lora
 # dataset
 dataset: identity,alpaca_gpt4_en
 template: llama3
 cutoff_len: 1024
 max_samples: 50
 overwrite_cache: true
 preprocessing_num_workers: 16
 # output
 output_dir: saves/llama3-8b/lora/predict
 overwrite_output_dir: true
 # eval
 per_device_eval_batch_size: 1
 predict_with_generate: true
--- a/examples/lora_single_gpu/llama3_lora_pretrain.yaml
+++ b/examples/lora_single_gpu/llama3_lora_pretrain.yaml
@@ -0,0 +1,37 @@
 # model
 model_name_or_path: meta-llama/Meta-Llama-3-8B-Instruct
 # method
 stage: pt
 do_train: true
 finetuning_type: lora
 lora_target: q_proj,v_proj
 # dataset
 dataset: c4_demo
 cutoff_len: 1024
 max_samples: 1000
 overwrite_cache: true
 preprocessing_num_workers: 16
 # output
 output_dir: saves/llama3-8b/lora/pretrain
 logging_steps: 10
 save_steps: 500
 plot_loss: true
 overwrite_output_dir: true
 # train
 per_device_train_batch_size: 1
 gradient_accumulation_steps: 8
 learning_rate: 0.0001
 num_train_epochs: 3.0
 lr_scheduler_type: cosine
 warmup_ratio: 0.1
 fp16: true
 # eval
 val_size: 0.1
 per_device_eval_batch_size: 1
 evaluation_strategy: steps
 eval_steps: 500
--- a/examples/lora_single_gpu/llama3_lora_reward.yaml
+++ b/examples/lora_single_gpu/llama3_lora_reward.yaml
@@ -0,0 +1,38 @@
 # model
 model_name_or_path: meta-llama/Meta-Llama-3-8B-Instruct
 # method
 stage: rm
 do_train: true
 finetuning_type: lora
 lora_target: q_proj,v_proj
 # dataset
 dataset: orca_rlhf
 template: llama3
 cutoff_len: 1024
 max_samples: 1000
 overwrite_cache: true
 preprocessing_num_workers: 16
 # output
 output_dir: saves/llama3-8b/lora/reward
 logging_steps: 10
 save_steps: 500
 plot_loss: true
 overwrite_output_dir: true
 # train
 per_device_train_batch_size: 1
 gradient_accumulation_steps: 8
 learning_rate: 0.00001
 num_train_epochs: 3.0
 lr_scheduler_type: cosine
 warmup_steps: 0.1
 fp16: true
 # eval
 val_size: 0.1
 per_device_eval_batch_size: 1
 evaluation_strategy: steps
 eval_steps: 500
--- a/examples/lora_single_gpu/llama3_lora_sft.yaml
+++ b/examples/lora_single_gpu/llama3_lora_sft.yaml
@@ -0,0 +1,38 @@
 # model
 model_name_or_path: meta-llama/Meta-Llama-3-8B-Instruct
 # method
 stage: sft
 do_train: true
 finetuning_type: lora
 lora_target: q_proj,v_proj
 # dataset
 dataset: identity,alpaca_gpt4_en
 template: llama3
 cutoff_len: 1024
 max_samples: 1000
 overwrite_cache: true
 preprocessing_num_workers: 16
 # output
 output_dir: saves/llama3-8b/lora/sft
 logging_steps: 10
 save_steps: 500
 plot_loss: true
 overwrite_output_dir: true
 # train
 per_device_train_batch_size: 1
 gradient_accumulation_steps: 8
 learning_rate: 0.0001
 num_train_epochs: 3.0
 lr_scheduler_type: cosine
 warmup_ratio: 0.1
 fp16: true
 # eval
 val_size: 0.1
 per_device_eval_batch_size: 1
 evaluation_strategy: steps
 eval_steps: 500
--- a/examples/lora_single_gpu/llama3_preprocess.yaml
+++ b/examples/lora_single_gpu/llama3_preprocess.yaml
@@ -0,0 +1,21 @@
 # model
 model_name_or_path: meta-llama/Meta-Llama-3-8B-Instruct
 # method
 stage: sft
 do_train: true
 finetuning_type: lora
 lora_target: q_proj,v_proj
 # dataset
 dataset: identity,alpaca_gpt4_en
 template: llama3
 cutoff_len: 1024
 max_samples: 1000
 overwrite_cache: true
 preprocessing_num_workers: 16
 tokenized_path: saves/llama3-8b/dataset/sft
 # output
 output_dir: saves/llama3-8b/lora/sft
 overwrite_output_dir: true
--- a/examples/lora_single_gpu/llava1_5_lora_sft.yaml
+++ b/examples/lora_single_gpu/llava1_5_lora_sft.yaml
@@ -0,0 +1,39 @@
 # model
 model_name_or_path: llava-hf/llava-1.5-7b-hf
 visual_inputs: true
 # method
 stage: sft
 do_train: true
 finetuning_type: lora
 lora_target: q_proj,v_proj
 # dataset
 dataset: mllm_demo
 template: vicuna
 cutoff_len: 1024
 max_samples: 1000
 overwrite_cache: true
 preprocessing_num_workers: 16
 # output
 output_dir: saves/llava1_5-7b/lora/sft
 logging_steps: 10
 save_steps: 500
 plot_loss: true
 overwrite_output_dir: true
 # train
 per_device_train_batch_size: 1
 gradient_accumulation_steps: 8
 learning_rate: 0.0001
 num_train_epochs: 3.0
 lr_scheduler_type: cosine
 warmup_ratio: 0.1
 fp16: true
 # eval
 val_size: 0.1
 per_device_eval_batch_size: 1
 evaluation_strategy: steps
 eval_steps: 500
--- a/examples/lora_single_gpu/orpo.sh
+++ b/examples/lora_single_gpu/orpo.sh
@@ -1,32 +0,0 @@
 #!/bin/bash
 CUDA_VISIBLE_DEVICES=0 python ../../src/train_bash.py \
    --stage orpo \
    --do_train \
    --model_name_or_path meta-llama/Llama-2-7b-hf \
    --dataset orca_rlhf \
    --dataset_dir ../../data \
    --template default \
    --finetuning_type lora \
    --lora_target q_proj,v_proj \
    --output_dir ../../saves/LLaMA2-7B/lora/orpo \
    --overwrite_cache \
    --overwrite_output_dir \
    --cutoff_len 1024 \
    --preprocessing_num_workers 16 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 8 \
    --lr_scheduler_type cosine \
    --logging_steps 10 \
    --warmup_steps 20 \
    --save_steps 100 \
    --eval_steps 100 \
    --evaluation_strategy steps \
    --load_best_model_at_end \
    --learning_rate 1e-5 \
    --num_train_epochs 1.0 \
    --max_samples 1000 \
    --val_size 0.1 \
    --plot_loss \
    --fp16
--- a/examples/lora_single_gpu/ppo.sh
+++ b/examples/lora_single_gpu/ppo.sh
@@ -1,32 +0,0 @@
 #!/bin/bash
 CUDA_VISIBLE_DEVICES=0 python ../../src/train_bash.py \
    --stage ppo \
    --do_train \
    --model_name_or_path meta-llama/Llama-2-7b-hf \
    --adapter_name_or_path ../../saves/LLaMA2-7B/lora/sft \
    --create_new_adapter \
    --dataset alpaca_gpt4_en \
    --dataset_dir ../../data \
    --template default \
    --finetuning_type lora \
    --lora_target q_proj,v_proj \
    --reward_model ../../saves/LLaMA2-7B/lora/reward \
    --output_dir ../../saves/LLaMA2-7B/lora/ppo \
    --overwrite_cache \
    --overwrite_output_dir \
    --cutoff_len 512 \
    --preprocessing_num_workers 16 \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 8 \
    --lr_scheduler_type cosine \
    --logging_steps 10 \
    --save_steps 100 \
    --learning_rate 1e-5 \
    --num_train_epochs 1.0 \
    --max_samples 1000 \
    --top_k 0 \
    --top_p 0.9 \
    --max_new_tokens 256 \
    --plot_loss \
    --fp16
--- a/examples/lora_single_gpu/predict.sh
+++ b/examples/lora_single_gpu/predict.sh
@@ -1,19 +0,0 @@
 #!/bin/bash
 CUDA_VISIBLE_DEVICES=0 python ../../src/train_bash.py \
    --stage sft \
    --do_predict \
    --model_name_or_path meta-llama/Llama-2-7b-hf \
    --adapter_name_or_path ../../saves/LLaMA2-7B/lora/sft,../../saves/LLaMA2-7B/lora/dpo \
    --dataset alpaca_gpt4_en,glaive_toolcall \
    --dataset_dir ../../data \
    --template default \
    --finetuning_type lora \
    --output_dir ../../saves/LLaMA2-7B/lora/predict \
    --overwrite_cache \
    --overwrite_output_dir \
    --cutoff_len 1024 \
    --preprocessing_num_workers 16 \
    --per_device_eval_batch_size 1 \
    --max_samples 20 \
    --predict_with_generate
--- a/examples/lora_single_gpu/prepare.sh
+++ b/examples/lora_single_gpu/prepare.sh
@@ -1,18 +0,0 @@
 #!/bin/bash
 CUDA_VISIBLE_DEVICES= python ../../src/train_bash.py \
    --stage sft \
    --do_train \
    --model_name_or_path meta-llama/Llama-2-7b-hf \
    --dataset alpaca_gpt4_en,glaive_toolcall \
    --dataset_dir ../../data \
    --template default \
    --finetuning_type lora \
    --lora_target q_proj,v_proj \
    --output_dir ../../saves/LLaMA2-7B/lora/sft \
    --overwrite_cache \
    --overwrite_output_dir \
    --cutoff_len 1024 \
    --preprocessing_num_workers 16 \
    --max_samples 3000 \
    --tokenized_path ../../saves/datasets/sft
--- a/examples/lora_single_gpu/pretrain.sh
+++ b/examples/lora_single_gpu/pretrain.sh
@@ -1,31 +0,0 @@
 #!/bin/bash
 CUDA_VISIBLE_DEVICES=0 python ../../src/train_bash.py \
    --stage pt \
    --do_train \
    --model_name_or_path meta-llama/Llama-2-7b-hf \
    --dataset c4_demo \
    --dataset_dir ../../data \
    --finetuning_type lora \
    --lora_target q_proj,v_proj \
    --output_dir ../../saves/LLaMA2-7B/lora/pretrain \
    --overwrite_cache \
    --overwrite_output_dir \
    --cutoff_len 1024 \
    --preprocessing_num_workers 16 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 8 \
    --lr_scheduler_type cosine \
    --logging_steps 10 \
    --warmup_steps 20 \
    --save_steps 100 \
    --eval_steps 100 \
    --evaluation_strategy steps \
    --load_best_model_at_end \
    --learning_rate 5e-5 \
    --num_train_epochs 3.0 \
    --max_samples 10000 \
    --val_size 0.1 \
    --plot_loss \
    --fp16
--- a/examples/lora_single_gpu/reward.sh
+++ b/examples/lora_single_gpu/reward.sh
@@ -1,33 +0,0 @@
 #!/bin/bash
 CUDA_VISIBLE_DEVICES=0 python ../../src/train_bash.py \
    --stage rm \
    --do_train \
    --model_name_or_path meta-llama/Llama-2-7b-hf \
    --adapter_name_or_path ../../saves/LLaMA2-7B/lora/sft \
    --create_new_adapter \
    --dataset orca_rlhf \
    --dataset_dir ../../data \
    --template default \
    --finetuning_type lora \
    --lora_target q_proj,v_proj \
    --output_dir ../../saves/LLaMA2-7B/lora/reward \
    --overwrite_cache \
    --overwrite_output_dir \
    --cutoff_len 1024 \
    --preprocessing_num_workers 16 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 8 \
    --lr_scheduler_type cosine \
    --logging_steps 10 \
    --warmup_steps 20 \
    --save_steps 100 \
    --eval_steps 100 \
    --evaluation_strategy steps \
    --learning_rate 1e-5 \
    --num_train_epochs 1.0 \
    --max_samples 5000 \
    --val_size 0.1 \
    --plot_loss \
    --fp16
--- a/examples/lora_single_gpu/sft.sh
+++ b/examples/lora_single_gpu/sft.sh
@@ -1,32 +0,0 @@
 #!/bin/bash
 CUDA_VISIBLE_DEVICES=0 python ../../src/train_bash.py \
    --stage sft \
    --do_train \
    --model_name_or_path meta-llama/Llama-2-7b-hf \
    --dataset alpaca_gpt4_en,glaive_toolcall \
    --dataset_dir ../../data \
    --template default \
    --finetuning_type lora \
    --lora_target q_proj,v_proj \
    --output_dir ../../saves/LLaMA2-7B/lora/sft \
    --overwrite_cache \
    --overwrite_output_dir \
    --cutoff_len 1024 \
    --preprocessing_num_workers 16 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 8 \
    --lr_scheduler_type cosine \
    --logging_steps 10 \
    --warmup_steps 20 \
    --save_steps 100 \
    --eval_steps 100 \
    --evaluation_strategy steps \
    --load_best_model_at_end \
    --learning_rate 5e-5 \
    --num_train_epochs 3.0 \
    --max_samples 3000 \
    --val_size 0.1 \
    --plot_loss \
    --fp16
--- a/examples/lora_single_gpu/sft_mllm.sh
+++ b/examples/lora_single_gpu/sft_mllm.sh
@@ -1,33 +0,0 @@
 #!/bin/bash
 CUDA_VISIBLE_DEVICES=0 python ../../src/train_bash.py \
    --stage sft \
    --do_train \
    --model_name_or_path llava-hf/llava-1.5-7b-hf \
    --visual_inputs \
    --dataset mllm_demo \
    --dataset_dir ../../data \
    --template vicuna \
    --finetuning_type lora \
    --lora_target q_proj,v_proj \
    --output_dir ../../saves/LLaMA2-7B/lora/sft_mllm \
    --overwrite_cache \
    --overwrite_output_dir \
    --cutoff_len 1024 \
    --preprocessing_num_workers 16 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 8 \
    --lr_scheduler_type cosine \
    --logging_steps 10 \
    --warmup_steps 20 \
    --save_steps 100 \
    --eval_steps 100 \
    --evaluation_strategy steps \
    --load_best_model_at_end \
    --learning_rate 5e-5 \
    --num_train_epochs 100.0 \
    --max_samples 3000 \
    --val_size 0.1 \
    --plot_loss \
    --fp16
--- a/examples/merge_lora/llama3_gptq.yaml
+++ b/examples/merge_lora/llama3_gptq.yaml
@@ -0,0 +1,11 @@
 # model
 model_name_or_path: meta-llama/Meta-Llama-3-8B-Instruct
 template: llama3
 # export
 export_dir: models/llama3_gptq
 export_quantization_bit: 4
 export_quantization_dataset: data/c4_demo.json
 export_size: 2
 export_device: cpu
 export_legacy_format: false
--- a/examples/merge_lora/llama3_lora_sft.yaml
+++ b/examples/merge_lora/llama3_lora_sft.yaml
@@ -0,0 +1,13 @@
 # Note: DO NOT use quantized model or quantization_bit when merging lora adapters
 # model
 model_name_or_path: meta-llama/Meta-Llama-3-8B-Instruct
 adapter_name_or_path: saves/llama3-8b/lora/sft
 template: llama3
 finetuning_type: lora
 # export
 export_dir: models/llama3_lora_sft
 export_size: 2
 export_device: cpu
 export_legacy_format: false
--- a/examples/merge_lora/merge.sh
+++ b/examples/merge_lora/merge.sh
@@ -1,12 +0,0 @@
 #!/bin/bash
 # DO NOT use quantized model or quantization_bit when merging lora weights
 CUDA_VISIBLE_DEVICES=0 python ../../src/export_model.py \
    --model_name_or_path meta-llama/Llama-2-7b-hf \
    --adapter_name_or_path ../../saves/LLaMA2-7B/lora/sft \
    --template default \
    --finetuning_type lora \
    --export_dir ../../models/llama2-7b-sft \
    --export_size 2 \
    --export_device cpu \
    --export_legacy_format False
--- a/examples/merge_lora/quantize.sh
+++ b/examples/merge_lora/quantize.sh
@@ -1,11 +0,0 @@
 #!/bin/bash
 # NEED TO run `merge.sh` before using this script
 CUDA_VISIBLE_DEVICES=0 python ../../src/export_model.py \
    --model_name_or_path ../../models/llama2-7b-sft \
    --template default \
    --export_dir ../../models/llama2-7b-sft-int4 \
    --export_quantization_bit 4 \
    --export_quantization_dataset ../../data/c4_demo.json \
    --export_size 2 \
    --export_legacy_format False
--- a/examples/qlora_single_gpu/aqlm.sh
+++ b/examples/qlora_single_gpu/aqlm.sh
@@ -1,30 +0,0 @@
 #!/bin/bash
 CUDA_VISIBLE_DEVICES=0 python ../../src/train_bash.py \
    --stage sft \
    --do_train \
    --model_name_or_path BlackSamorez/Llama-2-7b-AQLM-2Bit-1x16-hf \
    --dataset alpaca_gpt4_en,glaive_toolcall \
    --dataset_dir ../../data \
    --template default \
    --finetuning_type lora \
    --lora_target q_proj,v_proj \
    --output_dir ../../saves/LLaMA2-7B/lora/sft \
    --overwrite_cache \
    --overwrite_output_dir \
    --cutoff_len 1024 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 8 \
    --lr_scheduler_type cosine \
    --logging_steps 10 \
    --save_steps 100 \
    --eval_steps 100 \
    --evaluation_strategy steps \
    --load_best_model_at_end \
    --learning_rate 5e-5 \
    --num_train_epochs 3.0 \
    --max_samples 3000 \
    --val_size 0.1 \
    --plot_loss \
    --fp16
--- a/examples/qlora_single_gpu/awq.sh
+++ b/examples/qlora_single_gpu/awq.sh
@@ -1,30 +0,0 @@
 #!/bin/bash
 CUDA_VISIBLE_DEVICES=0 python ../../src/train_bash.py \
    --stage sft \
    --do_train \
    --model_name_or_path TheBloke/Llama-2-7B-AWQ \
    --dataset alpaca_gpt4_en,glaive_toolcall \
    --dataset_dir ../../data \
    --template default \
    --finetuning_type lora \
    --lora_target q_proj,v_proj \
    --output_dir ../../saves/LLaMA2-7B/lora/sft \
    --overwrite_cache \
    --overwrite_output_dir \
    --cutoff_len 1024 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 8 \
    --lr_scheduler_type cosine \
    --logging_steps 10 \
    --save_steps 100 \
    --eval_steps 100 \
    --evaluation_strategy steps \
    --load_best_model_at_end \
    --learning_rate 5e-5 \
    --num_train_epochs 3.0 \
    --max_samples 3000 \
    --val_size 0.1 \
    --plot_loss \
    --fp16
--- a/examples/qlora_single_gpu/bitsandbytes.sh
+++ b/examples/qlora_single_gpu/bitsandbytes.sh
@@ -1,31 +0,0 @@
 #!/bin/bash
 CUDA_VISIBLE_DEVICES=0 python ../../src/train_bash.py \
    --stage sft \
    --do_train \
    --model_name_or_path meta-llama/Llama-2-7b-hf \
    --dataset alpaca_gpt4_en,glaive_toolcall \
    --dataset_dir ../../data \
    --template default \
    --finetuning_type lora \
    --lora_target q_proj,v_proj \
    --output_dir ../../saves/LLaMA2-7B/lora/sft \
    --overwrite_cache \
    --overwrite_output_dir \
    --cutoff_len 1024 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 8 \
    --lr_scheduler_type cosine \
    --logging_steps 10 \
    --save_steps 100 \
    --eval_steps 100 \
    --evaluation_strategy steps \
    --load_best_model_at_end \
    --learning_rate 5e-5 \
    --num_train_epochs 3.0 \
    --max_samples 3000 \
    --val_size 0.1 \
    --quantization_bit 4 \
    --plot_loss \
    --fp16
--- a/examples/qlora_single_gpu/gptq.sh
+++ b/examples/qlora_single_gpu/gptq.sh
@@ -1,30 +0,0 @@
 #!/bin/bash
 CUDA_VISIBLE_DEVICES=0 python ../../src/train_bash.py \
    --stage sft \
    --do_train \
    --model_name_or_path TheBloke/Llama-2-7B-GPTQ \
    --dataset alpaca_gpt4_en,glaive_toolcall \
    --dataset_dir ../../data \
    --template default \
    --finetuning_type lora \
    --lora_target q_proj,v_proj \
    --output_dir ../../saves/LLaMA2-7B/lora/sft \
    --overwrite_cache \
    --overwrite_output_dir \
    --cutoff_len 1024 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 8 \
    --lr_scheduler_type cosine \
    --logging_steps 10 \
    --save_steps 100 \
    --eval_steps 100 \
    --evaluation_strategy steps \
    --load_best_model_at_end \
    --learning_rate 5e-5 \
    --num_train_epochs 3.0 \
    --max_samples 3000 \
    --val_size 0.1 \
    --plot_loss \
    --fp16
--- a/examples/qlora_single_gpu/llama3_lora_sft_aqlm.yaml
+++ b/examples/qlora_single_gpu/llama3_lora_sft_aqlm.yaml
@@ -0,0 +1,38 @@
 # model
 model_name_or_path: ISTA-DASLab/Meta-Llama-3-8B-Instruct-AQLM-2Bit-1x16
 # method
 stage: sft
 do_train: true
 finetuning_type: lora
 lora_target: q_proj,v_proj
 # dataset
 dataset: identity,alpaca_gpt4_en
 template: llama3
 cutoff_len: 1024
 max_samples: 1000
 overwrite_cache: true
 preprocessing_num_workers: 16
 # output
 output_dir: saves/llama3-8b/lora/sft
 logging_steps: 10
 save_steps: 500
 plot_loss: true
 overwrite_output_dir: true
 # train
 per_device_train_batch_size: 1
 gradient_accumulation_steps: 8
 learning_rate: 0.0001
 num_train_epochs: 3.0
 lr_scheduler_type: cosine
 warmup_ratio: 0.1
 fp16: true
 # eval
 val_size: 0.1
 per_device_eval_batch_size: 1
 evaluation_strategy: steps
 eval_steps: 500
--- a/examples/qlora_single_gpu/llama3_lora_sft_awq.yaml
+++ b/examples/qlora_single_gpu/llama3_lora_sft_awq.yaml
@@ -0,0 +1,38 @@
 # model
 model_name_or_path: TechxGenus/Meta-Llama-3-8B-Instruct-AWQ
 # method
 stage: sft
 do_train: true
 finetuning_type: lora
 lora_target: q_proj,v_proj
 # dataset
 dataset: identity,alpaca_gpt4_en
 template: llama3
 cutoff_len: 1024
 max_samples: 1000
 overwrite_cache: true
 preprocessing_num_workers: 16
 # output
 output_dir: saves/llama3-8b/lora/sft
 logging_steps: 10
 save_steps: 500
 plot_loss: true
 overwrite_output_dir: true
 # train
 per_device_train_batch_size: 1
 gradient_accumulation_steps: 8
 learning_rate: 0.0001
 num_train_epochs: 3.0
 lr_scheduler_type: cosine
 warmup_ratio: 0.1
 fp16: true
 # eval
 val_size: 0.1
 per_device_eval_batch_size: 1
 evaluation_strategy: steps
 eval_steps: 500
--- a/examples/qlora_single_gpu/llama3_lora_sft_bitsandbytes.yaml
+++ b/examples/qlora_single_gpu/llama3_lora_sft_bitsandbytes.yaml
@@ -0,0 +1,39 @@
 # model
 model_name_or_path: meta-llama/Meta-Llama-3-8B-Instruct
 quantization_bit: 4
 # method
 stage: sft
 do_train: true
 finetuning_type: lora
 lora_target: q_proj,v_proj
 # dataset
 dataset: identity,alpaca_gpt4_en
 template: llama3
 cutoff_len: 1024
 max_samples: 1000
 overwrite_cache: true
 preprocessing_num_workers: 16
 # output
 output_dir: saves/llama3-8b/lora/sft
 logging_steps: 10
 save_steps: 500
 plot_loss: true
 overwrite_output_dir: true
 # train
 per_device_train_batch_size: 1
 gradient_accumulation_steps: 8
 learning_rate: 0.0001
 num_train_epochs: 3.0
 lr_scheduler_type: cosine
 warmup_ratio: 0.1
 fp16: true
 # eval
 val_size: 0.1
 per_device_eval_batch_size: 1
 evaluation_strategy: steps
 eval_steps: 500
--- a/examples/qlora_single_gpu/llama3_lora_sft_gptq.yaml
+++ b/examples/qlora_single_gpu/llama3_lora_sft_gptq.yaml
@@ -0,0 +1,38 @@
 # model
 model_name_or_path: TechxGenus/Meta-Llama-3-8B-Instruct-GPTQ
 # method
 stage: sft
 do_train: true
 finetuning_type: lora
 lora_target: q_proj,v_proj
 # dataset
 dataset: identity,alpaca_gpt4_en
 template: llama3
 cutoff_len: 1024
 max_samples: 1000
 overwrite_cache: true
 preprocessing_num_workers: 16
 # output
 output_dir: saves/llama3-8b/lora/sft
 logging_steps: 10
 save_steps: 500
 plot_loss: true
 overwrite_output_dir: true
 # train
 per_device_train_batch_size: 1
 gradient_accumulation_steps: 8
 learning_rate: 0.0001
 num_train_epochs: 3.0
 lr_scheduler_type: cosine
 warmup_ratio: 0.1
 fp16: true
 # eval
 val_size: 0.1
 per_device_eval_batch_size: 1
 evaluation_strategy: steps
 eval_steps: 500
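The four QLoRA configs above share the same batch settings. As a rough sketch of how they combine (using the definition given in the `cal_lr.py` comment: batch size * gradient accumulation * world size), the effective batch size can be computed as follows. The helper name `effective_batch_size` is hypothetical and not part of the repository.

```python
def effective_batch_size(
    per_device_train_batch_size: int,
    gradient_accumulation_steps: int,
    world_size: int = 1,
) -> int:
    """Number of samples contributing to one optimizer step (hypothetical helper)."""
    return per_device_train_batch_size * gradient_accumulation_steps * world_size


# Values taken from the single-GPU YAML examples: batch size 1, accumulation 8.
print(effective_batch_size(1, 8))
```

With the single-GPU settings shown, each optimizer step therefore sees 8 samples.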
--- a/requirements.txt
+++ b/requirements.txt
@@ -1,4 +1,3 @@
-torch>=1.13.1
 transformers>=4.37.2
 datasets>=2.14.3
 accelerate>=0.27.2
@@ -13,6 +12,7 @@ uvicorn
 pydantic
 fastapi
 sse-starlette
-matplotlib
+matplotlib>=3.7.0
 fire
 packaging
 pyyaml
--- a/scripts/cal_flops.py
+++ b/scripts/cal_flops.py
@@ -3,24 +3,22 @@
 # Usage: python cal_flops.py --model_name_or_path path_to_model --batch_size 1 --seq_length 512
 # Inspired by: https://www.deepspeed.ai/tutorials/flops-profiler/
 from typing import Optional
 import fire
 import torch
 from deepspeed.accelerator import get_accelerator  # type: ignore
 from deepspeed.profiling.flops_profiler import get_model_profile  # type: ignore
-from llmtuner import ChatModel
+from llmtuner.chat import ChatModel
 def calculate_flops(
    model_name_or_path: str,
-    batch_size: Optional[int] = 1,
+    batch_size: int = 1,
-    seq_length: Optional[int] = 256,
+    seq_length: int = 256,
-    flash_attn: Optional[bool] = False,
+    flash_attn: str = "auto",
 ):
    with get_accelerator().device(0):
-        chat_model = ChatModel(dict(model_name_or_path=model_name_or_path, template="vanilla", flash_attn=flash_attn))
+        chat_model = ChatModel(dict(model_name_or_path=model_name_or_path, template="empty", flash_attn=flash_attn))
        fake_input = torch.ones((batch_size, seq_length), dtype=torch.long, device=chat_model.model.device)
        input_dict = {"input_ids": fake_input, "labels": fake_input.clone()}
        flops, macs, params = get_model_profile(chat_model.model, kwargs=input_dict, print_profile=True, detailed=True)
--- a/scripts/cal_lr.py
+++ b/scripts/cal_lr.py
@@ -4,7 +4,7 @@
 # Inspired by: https://github.com/imoneoi/openchat/blob/master/ochat/training_deepspeed/train.py
 import math
-from typing import Optional
+from typing import Literal
 import fire
 import torch
@@ -25,12 +25,12 @@ BASE_BS = 4_000_000  # from llama paper
 def calculate_lr(
    model_name_or_path: str,
    batch_size: int,  # total batch size, namely (batch size * gradient accumulation * world size)
-    stage: Optional[str] = "sft",
+    stage: Literal["pt", "sft"] = "sft",
-    dataset: Optional[str] = "alpaca_en",
+    dataset: str = "alpaca_en",
-    dataset_dir: Optional[str] = "data",
+    dataset_dir: str = "data",
-    template: Optional[str] = "default",
+    template: str = "default",
-    cutoff_len: Optional[int] = 1024,  # i.e. maximum input length during training
+    cutoff_len: int = 1024,  # i.e. maximum input length during training
-    is_mistral: Optional[bool] = False,  # mistral model uses a smaller learning rate,
+    is_mistral: bool = False,  # mistral model uses a smaller learning rate,
 ):
    model_args, data_args, training_args, _, _ = get_train_args(
        dict(
@@ -54,9 +54,7 @@ def calculate_lr(
    else:
        raise NotImplementedError
-    dataloader = DataLoader(
-        dataset=trainset, batch_size=batch_size, shuffle=True, collate_fn=data_collator, pin_memory=True
-    )
+    dataloader = DataLoader(trainset, batch_size, shuffle=False, collate_fn=data_collator, pin_memory=True)
    valid_tokens, total_tokens = 0, 0
    for batch in tqdm(dataloader):
        valid_tokens += torch.sum(batch["labels"] != IGNORE_INDEX).item()
--- a/scripts/cal_ppl.py
+++ b/scripts/cal_ppl.py
@@ -0,0 +1,116 @@
 # coding=utf-8
 # Calculates the ppl on the dataset of the pre-trained models.
 # Usage: python cal_ppl.py --model_name_or_path path_to_model --save_name ppl.json
 import json
 from dataclasses import dataclass
 from typing import Any, Dict, Literal, Optional, Sequence
 import fire
 import torch
 from torch.utils.data import DataLoader
 from tqdm import tqdm
 from transformers import DataCollatorForLanguageModeling, DataCollatorForSeq2Seq
 from llmtuner.data import get_dataset
 from llmtuner.extras.constants import IGNORE_INDEX
 from llmtuner.hparams import get_train_args
 from llmtuner.model import load_model, load_tokenizer
 @dataclass
 class PairwiseDataCollatorWithPadding(DataCollatorForSeq2Seq):
    r"""
    Data collator for pairwise data.
    """
    train_on_prompt: bool = False
    def __call__(self, features: Sequence[Dict[str, Any]]) -> Dict[str, torch.Tensor]:
        r"""
        Pads batched data to the longest sequence in the batch.
        We generate 2 * n examples where the first n examples represent chosen examples and
        the last n examples represent rejected examples.
        """
        chosen_features = []
        for feature in features:
            prompt_len, answer_len = len(feature["prompt_ids"]), len(feature["chosen_ids"])
            input_ids = feature["prompt_ids"] + feature["chosen_ids"]
            attention_mask = [1] * (prompt_len + answer_len)
            labels = input_ids if self.train_on_prompt else [IGNORE_INDEX] * prompt_len + feature["chosen_ids"]
            chosen_features.append({"input_ids": input_ids, "attention_mask": attention_mask, "labels": labels})
        return super().__call__(chosen_features)
 def cal_ppl(
    model_name_or_path: str,
    save_name: str,
    batch_size: int = 4,
    stage: Literal["pt", "sft", "rm"] = "sft",
    dataset: str = "alpaca_en",
    dataset_dir: str = "data",
    template: str = "default",
    cutoff_len: int = 1024,
    max_samples: Optional[int] = None,
    train_on_prompt: bool = False,
 ):
    model_args, data_args, training_args, finetuning_args, _ = get_train_args(
        dict(
            stage=stage,
            model_name_or_path=model_name_or_path,
            dataset=dataset,
            dataset_dir=dataset_dir,
            template=template,
            cutoff_len=cutoff_len,
            max_samples=max_samples,
            train_on_prompt=train_on_prompt,
            output_dir="dummy_dir",
            overwrite_cache=True,
        )
    )
    tokenizer_module = load_tokenizer(model_args)
    tokenizer = tokenizer_module["tokenizer"]
    trainset = get_dataset(model_args, data_args, training_args, stage, **tokenizer_module)
    model = load_model(tokenizer, model_args, finetuning_args, is_trainable=False)
    if stage == "pt":
        data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)
    elif stage == "sft":
        data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, label_pad_token_id=IGNORE_INDEX)
    elif stage == "rm":
        data_collator = PairwiseDataCollatorWithPadding(
            tokenizer=tokenizer, label_pad_token_id=IGNORE_INDEX, train_on_prompt=train_on_prompt
        )
    else:
        raise NotImplementedError
    dataloader = DataLoader(trainset, batch_size, shuffle=False, collate_fn=data_collator, pin_memory=True)
    criterion = torch.nn.CrossEntropyLoss(reduction="none")
    total_ppl = 0
    perplexities = []
    batch: Dict[str, "torch.Tensor"]
    with torch.no_grad():
        for batch in tqdm(dataloader):
            batch = batch.to(model.device)
            outputs = model(**batch)
            shift_logits: "torch.Tensor" = outputs["logits"][..., :-1, :]
            shift_labels: "torch.Tensor" = batch["labels"][..., 1:]
            loss_mask = shift_labels != IGNORE_INDEX
            flatten_logits = shift_logits.contiguous().view(shift_labels.size(0) * shift_labels.size(1), -1)
            flatten_labels = shift_labels.contiguous().view(-1)
            token_logps: "torch.Tensor" = criterion(flatten_logits, flatten_labels)
            token_logps = token_logps.contiguous().view(shift_logits.size(0), -1)
            sentence_logps = (token_logps * loss_mask).sum(-1) / loss_mask.sum(-1)
            total_ppl += sentence_logps.exp().sum().item()
            perplexities.extend(sentence_logps.exp().tolist())
    with open(save_name, "w", encoding="utf-8") as f:
        json.dump(perplexities, f, indent=2)
    print("Average perplexity is {:.2f}".format(total_ppl / len(perplexities)))
    print("Perplexities have been saved at {}.".format(save_name))
 if __name__ == "__main__":
    fire.Fire(cal_ppl)
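The core of `cal_ppl.py` is a masked average: per-token cross-entropy values are averaged over supervised positions only, then exponentiated. A minimal pure-Python sketch of that step for one sequence is given below. It assumes `IGNORE_INDEX` is -100 (the usual Hugging Face convention; the actual value lives in `llmtuner.extras.constants` and is not shown in this diff), and the function name `sequence_perplexity` is hypothetical.

```python
import math
from typing import List

IGNORE_INDEX = -100  # assumption: the common Hugging Face ignore label


def sequence_perplexity(token_nll: List[float], labels: List[int]) -> float:
    """exp(mean negative log-likelihood) over positions whose label is not ignored."""
    kept = [nll for nll, label in zip(token_nll, labels) if label != IGNORE_INDEX]
    if not kept:
        # The script divides by the mask sum directly; this sketch fails loudly instead.
        raise ValueError("sequence has no supervised tokens")
    return math.exp(sum(kept) / len(kept))


# The third position is masked out, so only the first two losses are averaged.
ppl = sequence_perplexity([math.log(2), math.log(8), 5.0], [11, 42, IGNORE_INDEX])
print(round(ppl, 6))
```

Unlike the script, which would yield NaN for a fully masked sequence, this sketch raises an error in that case.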
--- a/scripts/length_cdf.py
+++ b/scripts/length_cdf.py
@@ -3,7 +3,6 @@
 # Usage: python length_cdf.py --model_name_or_path path_to_model --dataset alpaca_en --template default
 from collections import defaultdict
-from typing import Optional
 import fire
 from tqdm import tqdm
@@ -15,10 +14,10 @@ from llmtuner.model import load_tokenizer
 def length_cdf(
    model_name_or_path: str,
-    dataset: Optional[str] = "alpaca_en",
+    dataset: str = "alpaca_en",
-    dataset_dir: Optional[str] = "data",
+    dataset_dir: str = "data",
-    template: Optional[str] = "default",
+    template: str = "default",
-    interval: Optional[int] = 1000,
+    interval: int = 1000,
 ):
    model_args, data_args, training_args, _, _ = get_train_args(
        dict(
--- a/scripts/llama_pro.py
+++ b/scripts/llama_pro.py
@@ -1,5 +1,5 @@
 # coding=utf-8
-# Performs block expansion for LLaMA, Mistral or Qwen1.5 models.
+# Performs block expansion for LLaMA, Mistral, Qwen1.5 or Yi models.
 # Usage: python llama_pro.py --model_name_or_path meta-llama/Llama-2-7b-hf --output_dir llama2_pro --num_expand 8
 # Inspired by: https://github.com/TencentARC/LLaMA-Pro/blob/main/scripts/block_expansion.py
@@ -106,8 +106,7 @@ def block_expansion(
    print("Fine-tune this model with:")
    print("  --model_name_or_path {} \\".format(output_dir))
    print("  --finetuning_type freeze \\")
-    print("  --name_module_trainable all \\")
-    print("  --num_layer_trainable {} \\".format(num_expand))
+    print("  --freeze_trainable_layers {} \\".format(num_expand))
    print("  --use_llama_pro")
--- a/setup.py
+++ b/setup.py
@@ -5,9 +5,9 @@ from setuptools import find_packages, setup
 def get_version():
-    with open(os.path.join("src", "llmtuner", "__init__.py"), "r", encoding="utf-8") as f:
+    with open(os.path.join("src", "llmtuner", "cli.py"), "r", encoding="utf-8") as f:
        file_content = f.read()
-        pattern = r"{0}\W*=\W*\"([^\"]+)\"".format("__version__")
+        pattern = r"{}\W*=\W*\"([^\"]+)\"".format("VERSION")
        (version,) = re.findall(pattern, file_content)
        return version
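The version lookup now reads `VERSION` from `cli.py` rather than `__version__` from `__init__.py`. A small standalone sketch of how the regular expression in `get_version` behaves is shown below; the sample file content is hypothetical, and `extract_version` is an illustrative wrapper rather than a function in `setup.py`.

```python
import re


def extract_version(file_content: str, name: str = "VERSION") -> str:
    """Return the quoted string assigned to `name`, e.g. VERSION = "x.y.z"."""
    pattern = r"{}\W*=\W*\"([^\"]+)\"".format(name)
    # The single-element unpacking raises ValueError unless there is exactly one match.
    (version,) = re.findall(pattern, file_content)
    return version


sample = 'VERSION = "0.7.1"\n'  # hypothetical file content
print(extract_version(sample))
```

The tuple unpacking is a deliberate guard: zero or multiple assignments to the name fail fast instead of silently picking one.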
@@ -20,12 +20,13 @@ def get_requires():
 extra_require = {
-    "deepspeed": ["deepspeed>=0.10.0"],
+    "torch": ["torch>=1.13.1"],
    "metrics": ["nltk", "jieba", "rouge-chinese"],
+    "deepspeed": ["deepspeed>=0.10.0,<=0.14.0"],
+    "bitsandbytes": ["bitsandbytes>=0.39.0"],
+    "vllm": ["vllm>=0.4.0"],
    "galore": ["galore-torch"],
    "badam": ["badam"],
-    "vllm": ["vllm>=0.4.0"],
-    "bitsandbytes": ["bitsandbytes>=0.39.0"],
    "gptq": ["optimum>=1.16.0", "auto-gptq>=0.5.0"],
    "awq": ["autoawq"],
    "aqlm": ["aqlm[gpu]>=1.1.0"],
@@ -52,6 +53,7 @@ def main():
        python_requires=">=3.8.0",
        install_requires=get_requires(),
        extras_require=extra_require,
+        entry_points={"console_scripts": ["llamafactory-cli = llmtuner.cli:main"]},
        classifiers=[
            "Development Status :: 4 - Beta",
            "Intended Audience :: Developers",
--- a/src/api.py
+++ b/src/api.py
@@ -0,0 +1,19 @@
 import os
 import uvicorn
 from llmtuner.api.app import create_app
 from llmtuner.chat import ChatModel
 def main():
    chat_model = ChatModel()
    app = create_app(chat_model)
    api_host = os.environ.get("API_HOST", "0.0.0.0")
    api_port = int(os.environ.get("API_PORT", "8000"))
    print("Visit http://localhost:{}/docs for API document.".format(api_port))
    uvicorn.run(app, host=api_host, port=api_port)
 if __name__ == "__main__":
    main()
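The new `src/api.py` reads both the bind host and the port from the environment, with string defaults. The same lookup can be sketched as a small testable function; `resolve_bind_address` is a hypothetical name, and passing the environment as a mapping is an illustrative choice, not how the script is written.

```python
import os
from typing import Mapping, Tuple


def resolve_bind_address(env: Mapping[str, str] = os.environ) -> Tuple[str, int]:
    """Read API_HOST / API_PORT with the same defaults as the script above."""
    host = env.get("API_HOST", "0.0.0.0")
    port = int(env.get("API_PORT", "8000"))  # environment values are always strings
    return host, port


print(resolve_bind_address({"API_HOST": "127.0.0.1", "API_PORT": "9000"}))
```

Setting `API_HOST=127.0.0.1` restricts the server to local connections, while the default `0.0.0.0` listens on all interfaces.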
--- a/src/api_demo.py
+++ b/src/api_demo.py
@@ -1,16 +0,0 @@
 import os
 import uvicorn
 from llmtuner import ChatModel, create_app
 def main():
    chat_model = ChatModel()
    app = create_app(chat_model)
    print("Visit http://localhost:{}/docs for API document.".format(os.environ.get("API_PORT", 8000)))
    uvicorn.run(app, host="0.0.0.0", port=int(os.environ.get("API_PORT", 8000)), workers=1)
 if __name__ == "__main__":
    main()
--- a/src/cli_demo.py
+++ b/src/cli_demo.py
@@ -1,49 +0,0 @@
 from llmtuner import ChatModel
 from llmtuner.extras.misc import torch_gc
 try:
    import platform
    if platform.system() != "Windows":
        import readline  # noqa: F401
 except ImportError:
    print("Install `readline` for a better experience.")
 def main():
    chat_model = ChatModel()
    messages = []
    print("Welcome to the CLI application, use `clear` to remove the history, use `exit` to exit the application.")
    while True:
        try:
            query = input("\nUser: ")
        except UnicodeDecodeError:
            print("Detected decoding error at the inputs, please set the terminal encoding to utf-8.")
            continue
        except Exception:
            raise
        if query.strip() == "exit":
            break
        if query.strip() == "clear":
            messages = []
            torch_gc()
            print("History has been removed.")
            continue
        messages.append({"role": "user", "content": query})
        print("Assistant: ", end="", flush=True)
        response = ""
        for new_text in chat_model.stream_chat(messages):
            print(new_text, end="", flush=True)
            response += new_text
        print()
        messages.append({"role": "assistant", "content": response})
 if __name__ == "__main__":
    main()
--- a/src/evaluate.py
+++ b/src/evaluate.py
@@ -1,9 +0,0 @@
 from llmtuner import Evaluator
 def main():
    Evaluator().eval()
 if __name__ == "__main__":
    main()
--- a/src/export_model.py
+++ b/src/export_model.py
@@ -1,9 +0,0 @@
 from llmtuner import export_model
 def main():
    export_model()
 if __name__ == "__main__":
    main()
--- a/src/llmtuner/__init__.py
+++ b/src/llmtuner/__init__.py
@@ -1,11 +1,6 @@
 # Level: api, webui > chat, eval, train > data, model > extras, hparams
-from .api import create_app
+from .cli import VERSION
-from .chat import ChatModel
-from .eval import Evaluator
-from .train import export_model, run_exp
-from .webui import create_ui, create_web_demo
-__version__ = "0.7.0"
-__all__ = ["create_app", "ChatModel", "Evaluator", "export_model", "run_exp", "create_ui", "create_web_demo"]
+__version__ = VERSION
--- a/src/llmtuner/api/__init__.py
+++ b/src/llmtuner/api/__init__.py
@@ -1,4 +0,0 @@
 from .app import create_app
 __all__ = ["create_app"]
--- a/src/llmtuner/api/app.py
+++ b/src/llmtuner/api/app.py
@@ -1,36 +1,31 @@
 import json
 import os
 from contextlib import asynccontextmanager
-from typing import Any, Dict, Sequence
+from typing import Optional
-from pydantic import BaseModel
+from typing_extensions import Annotated
 from ..chat import ChatModel
 from ..data import Role as DataRole
 from ..extras.misc import torch_gc
-from ..extras.packages import is_fastapi_availble, is_starlette_available, is_uvicorn_available
+from ..extras.packages import is_fastapi_available, is_starlette_available, is_uvicorn_available
 from .chat import (
    create_chat_completion_response,
    create_score_evaluation_response,
    create_stream_chat_completion_response,
 )
 from .protocol import (
    ChatCompletionMessage,
    ChatCompletionRequest,
    ChatCompletionResponse,
    ChatCompletionResponseChoice,
    ChatCompletionResponseStreamChoice,
    ChatCompletionResponseUsage,
    ChatCompletionStreamResponse,
    Finish,
    Function,
    FunctionCall,
    ModelCard,
    ModelList,
    Role,
    ScoreEvaluationRequest,
    ScoreEvaluationResponse,
 )
-if is_fastapi_availble():
+if is_fastapi_available():
-    from fastapi import FastAPI, HTTPException, status
+    from fastapi import Depends, FastAPI, HTTPException, status
    from fastapi.middleware.cors import CORSMiddleware
    from fastapi.security.http import HTTPAuthorizationCredentials, HTTPBearer
 if is_starlette_available():
@@ -47,23 +42,8 @@ async def lifespan(app: "FastAPI"):  # collects GPU memory
    torch_gc()
-def dictify(data: "BaseModel") -> Dict[str, Any]:
-    try:  # pydantic v2
-        return data.model_dump(exclude_unset=True)
-    except AttributeError:  # pydantic v1
-        return data.dict(exclude_unset=True)
-def jsonify(data: "BaseModel") -> str:
-    try:  # pydantic v2
-        return json.dumps(data.model_dump(exclude_unset=True), ensure_ascii=False)
-    except AttributeError:  # pydantic v1
-        return data.json(exclude_unset=True, ensure_ascii=False)
 def create_app(chat_model: "ChatModel") -> "FastAPI":
    app = FastAPI(lifespan=lifespan)
    app.add_middleware(
        CORSMiddleware,
        allow_origins=["*"],
@@ -71,160 +51,58 @@ def create_app(chat_model: "ChatModel") -> "FastAPI":
        allow_methods=["*"],
        allow_headers=["*"],
    )
    api_key = os.environ.get("API_KEY")
    security = HTTPBearer(auto_error=False)
-    role_mapping = {
-        Role.USER: DataRole.USER.value,
-        Role.ASSISTANT: DataRole.ASSISTANT.value,
-        Role.SYSTEM: DataRole.SYSTEM.value,
-        Role.FUNCTION: DataRole.FUNCTION.value,
-        Role.TOOL: DataRole.OBSERVATION.value,
-    }
+    async def verify_api_key(auth: Annotated[Optional[HTTPAuthorizationCredentials], Depends(security)]):
+        if api_key and (auth is None or auth.credentials != api_key):
+            raise HTTPException(status_code=status.HTTP_401_UNAUTHORIZED, detail="Invalid API key.")
-    @app.get("/v1/models", response_model=ModelList)
+    @app.get(
+        "/v1/models",
+        response_model=ModelList,
+        status_code=status.HTTP_200_OK,
+        dependencies=[Depends(verify_api_key)],
+    )
    async def list_models():
        model_card = ModelCard(id="gpt-3.5-turbo")
        return ModelList(data=[model_card])
-    @app.post("/v1/chat/completions", response_model=ChatCompletionResponse, status_code=status.HTTP_200_OK)
+    @app.post(
+        "/v1/chat/completions",
+        response_model=ChatCompletionResponse,
+        status_code=status.HTTP_200_OK,
+        dependencies=[Depends(verify_api_key)],
+    )
    async def create_chat_completion(request: ChatCompletionRequest):
        if not chat_model.engine.can_generate:
            raise HTTPException(status_code=status.HTTP_405_METHOD_NOT_ALLOWED, detail="Not allowed")
-        if len(request.messages) == 0:
-            raise HTTPException(status_code=status.HTTP_400_BAD_REQUEST, detail="Invalid length")
-        if request.messages[0].role == Role.SYSTEM:
-            system = request.messages.pop(0).content
-        else:
-            system = ""
-        if len(request.messages) % 2 == 0:
-            raise HTTPException(status_code=status.HTTP_400_BAD_REQUEST, detail="Only supports u/a/u/a/u...")
-        input_messages = []
-        for i, message in enumerate(request.messages):
-            if i % 2 == 0 and message.role not in [Role.USER, Role.TOOL]:
-                raise HTTPException(status_code=status.HTTP_400_BAD_REQUEST, detail="Invalid role")
-            elif i % 2 == 1 and message.role not in [Role.ASSISTANT, Role.FUNCTION]:
-                raise HTTPException(status_code=status.HTTP_400_BAD_REQUEST, detail="Invalid role")
-            if message.role == Role.ASSISTANT and isinstance(message.tool_calls, list) and len(message.tool_calls):
-                name = message.tool_calls[0].function.name
-                arguments = message.tool_calls[0].function.arguments
-                content = json.dumps({"name": name, "argument": arguments}, ensure_ascii=False)
-                input_messages.append({"role": role_mapping[Role.FUNCTION], "content": content})
-            else:
-                input_messages.append({"role": role_mapping[message.role], "content": message.content})
-        tool_list = request.tools
-        if isinstance(tool_list, list) and len(tool_list):
-            try:
-                tools = json.dumps([dictify(tool.function) for tool in tool_list], ensure_ascii=False)
-            except Exception:
-                raise HTTPException(status_code=status.HTTP_400_BAD_REQUEST, detail="Invalid tools")
-        else:
-            tools = ""
        if request.stream:
-            if tools:
-                raise HTTPException(status_code=status.HTTP_400_BAD_REQUEST, detail="Cannot stream function calls.")
-            generate = stream_chat_completion(input_messages, system, tools, request)
+            generate = create_stream_chat_completion_response(request, chat_model)
            return EventSourceResponse(generate, media_type="text/event-stream")
+        else:
+            return await create_chat_completion_response(request, chat_model)
-        responses = await chat_model.achat(
-            input_messages,
-            system,
-            tools,
-            do_sample=request.do_sample,
-            temperature=request.temperature,
-            top_p=request.top_p,
-            max_new_tokens=request.max_tokens,
-            num_return_sequences=request.n,
-        )
-        prompt_length, response_length = 0, 0
-        choices = []
-        for i, response in enumerate(responses):
-            if tools:
-                result = chat_model.engine.template.format_tools.extract(response.response_text)
-            else:
-                result = response.response_text
-            if isinstance(result, tuple):
-                name, arguments = result
-                function = Function(name=name, arguments=arguments)
-                response_message = ChatCompletionMessage(
-                    role=Role.ASSISTANT, tool_calls=[FunctionCall(function=function)]
-                )
-                finish_reason = Finish.TOOL
-            else:
-                response_message = ChatCompletionMessage(role=Role.ASSISTANT, content=result)
-                finish_reason = Finish.STOP if response.finish_reason == "stop" else Finish.LENGTH
-            choices.append(
-                ChatCompletionResponseChoice(index=i, message=response_message, finish_reason=finish_reason)
-            )
-            prompt_length = response.prompt_length
-            response_length += response.response_length
-        usage = ChatCompletionResponseUsage(
-            prompt_tokens=prompt_length,
-            completion_tokens=response_length,
-            total_tokens=prompt_length + response_length,
-        )
-        return ChatCompletionResponse(model=request.model, choices=choices, usage=usage)
-    async def stream_chat_completion(
-        messages: Sequence[Dict[str, str]], system: str, tools: str, request: ChatCompletionRequest
-    ):
-        choice_data = ChatCompletionResponseStreamChoice(
-            index=0, delta=ChatCompletionMessage(role=Role.ASSISTANT, content=""), finish_reason=None
-        )
-        chunk = ChatCompletionStreamResponse(model=request.model, choices=[choice_data])
-        yield jsonify(chunk)
-        async for new_token in chat_model.astream_chat(
-            messages,
-            system,
-            tools,
-            do_sample=request.do_sample,
-            temperature=request.temperature,
-            top_p=request.top_p,
-            max_new_tokens=request.max_tokens,
-        ):
-            if len(new_token) == 0:
-                continue
-            choice_data = ChatCompletionResponseStreamChoice(
-                index=0, delta=ChatCompletionMessage(content=new_token), finish_reason=None
-            )
-            chunk = ChatCompletionStreamResponse(model=request.model, choices=[choice_data])
-            yield jsonify(chunk)
-        choice_data = ChatCompletionResponseStreamChoice(
-            index=0, delta=ChatCompletionMessage(), finish_reason=Finish.STOP
-        )
-        chunk = ChatCompletionStreamResponse(model=request.model, choices=[choice_data])
-        yield jsonify(chunk)
-        yield "[DONE]"
-    @app.post("/v1/score/evaluation", response_model=ScoreEvaluationResponse, status_code=status.HTTP_200_OK)
+    @app.post(
+        "/v1/score/evaluation",
+        response_model=ScoreEvaluationResponse,
+        status_code=status.HTTP_200_OK,
+        dependencies=[Depends(verify_api_key)],
+    )
    async def create_score_evaluation(request: ScoreEvaluationRequest):
        if chat_model.engine.can_generate:
            raise HTTPException(status_code=status.HTTP_405_METHOD_NOT_ALLOWED, detail="Not allowed")
-        if len(request.messages) == 0:
-            raise HTTPException(status_code=status.HTTP_400_BAD_REQUEST, detail="Invalid request")
-        scores = await chat_model.aget_scores(request.messages, max_length=request.max_length)
-        return ScoreEvaluationResponse(model=request.model, scores=scores)
+        return await create_score_evaluation_response(request, chat_model)
    return app
-if __name__ == "__main__":
+def run_api() -> None:
    chat_model = ChatModel()
    app = create_app(chat_model)
-    uvicorn.run(app, host="0.0.0.0", port=int(os.environ.get("API_PORT", 8000)), workers=1)
+    api_host = os.environ.get("API_HOST", "0.0.0.0")
+    api_port = int(os.environ.get("API_PORT", "8000"))
+    print("Visit http://localhost:{}/docs for API document.".format(api_port))
+    uvicorn.run(app, host=api_host, port=api_port)
--- a/src/llmtuner/api/chat.py
+++ b/src/llmtuner/api/chat.py
@@ -0,0 +1,186 @@
 import json
 import uuid
 from typing import TYPE_CHECKING, AsyncGenerator, Dict, List, Optional, Tuple
 from ..data import Role as DataRole
 from ..extras.logging import get_logger
 from ..extras.packages import is_fastapi_available
 from .common import dictify, jsonify
 from .protocol import (
    ChatCompletionMessage,
    ChatCompletionResponse,
    ChatCompletionResponseChoice,
    ChatCompletionResponseUsage,
    ChatCompletionStreamResponse,
    ChatCompletionStreamResponseChoice,
    Finish,
    Function,
    FunctionCall,
    Role,
    ScoreEvaluationResponse,
 )
 if is_fastapi_available():
    from fastapi import HTTPException, status
 if TYPE_CHECKING:
    from ..chat import ChatModel
    from .protocol import ChatCompletionRequest, ScoreEvaluationRequest
 logger = get_logger(__name__)
 ROLE_MAPPING = {
    Role.USER: DataRole.USER.value,
    Role.ASSISTANT: DataRole.ASSISTANT.value,
    Role.SYSTEM: DataRole.SYSTEM.value,
    Role.FUNCTION: DataRole.FUNCTION.value,
    Role.TOOL: DataRole.OBSERVATION.value,
 }
 def _process_request(request: "ChatCompletionRequest") -> Tuple[List[Dict[str, str]], str, str]:
    logger.info("==== request ====\n{}".format(json.dumps(dictify(request), indent=2, ensure_ascii=False)))
    if len(request.messages) == 0:
        raise HTTPException(status_code=status.HTTP_400_BAD_REQUEST, detail="Invalid length")
    if request.messages[0].role == Role.SYSTEM:
        system = request.messages.pop(0).content
    else:
        system = ""
    if len(request.messages) % 2 == 0:
        raise HTTPException(status_code=status.HTTP_400_BAD_REQUEST, detail="Only supports u/a/u/a/u...")
    input_messages = []
    for i, message in enumerate(request.messages):
        if i % 2 == 0 and message.role not in [Role.USER, Role.TOOL]:
            raise HTTPException(status_code=status.HTTP_400_BAD_REQUEST, detail="Invalid role")
        elif i % 2 == 1 and message.role not in [Role.ASSISTANT, Role.FUNCTION]:
            raise HTTPException(status_code=status.HTTP_400_BAD_REQUEST, detail="Invalid role")
        if message.role == Role.ASSISTANT and isinstance(message.tool_calls, list) and len(message.tool_calls):
            name = message.tool_calls[0].function.name
            arguments = message.tool_calls[0].function.arguments
            content = json.dumps({"name": name, "argument": arguments}, ensure_ascii=False)
            input_messages.append({"role": ROLE_MAPPING[Role.FUNCTION], "content": content})
        else:
            input_messages.append({"role": ROLE_MAPPING[message.role], "content": message.content})
    tool_list = request.tools
    if isinstance(tool_list, list) and len(tool_list):
        try:
            tools = json.dumps([dictify(tool.function) for tool in tool_list], ensure_ascii=False)
        except Exception:
            raise HTTPException(status_code=status.HTTP_400_BAD_REQUEST, detail="Invalid tools")
    else:
        tools = ""
    return input_messages, system, tools
 def _create_stream_chat_completion_chunk(
    completion_id: str,
    model: str,
    delta: "ChatCompletionMessage",
    index: Optional[int] = 0,
    finish_reason: Optional["Finish"] = None,
 ) -> str:
    choice_data = ChatCompletionStreamResponseChoice(index=index, delta=delta, finish_reason=finish_reason)
    chunk = ChatCompletionStreamResponse(id=completion_id, model=model, choices=[choice_data])
    return jsonify(chunk)
 async def create_chat_completion_response(
    request: "ChatCompletionRequest", chat_model: "ChatModel"
 ) -> "ChatCompletionResponse":
    completion_id = "chatcmpl-{}".format(uuid.uuid4().hex)
    input_messages, system, tools = _process_request(request)
    responses = await chat_model.achat(
        input_messages,
        system,
        tools,
        do_sample=request.do_sample,
        temperature=request.temperature,
        top_p=request.top_p,
        max_new_tokens=request.max_tokens,
        num_return_sequences=request.n,
        stop=request.stop,
    )
    prompt_length, response_length = 0, 0
    choices = []
    for i, response in enumerate(responses):
        if tools:
            result = chat_model.engine.template.format_tools.extract(response.response_text)
        else:
            result = response.response_text
        if isinstance(result, tuple):
            name, arguments = result
            function = Function(name=name, arguments=arguments)
            tool_call = FunctionCall(id="call_{}".format(uuid.uuid4().hex), function=function)
            response_message = ChatCompletionMessage(role=Role.ASSISTANT, tool_calls=[tool_call])
            finish_reason = Finish.TOOL
        else:
            response_message = ChatCompletionMessage(role=Role.ASSISTANT, content=result)
            finish_reason = Finish.STOP if response.finish_reason == "stop" else Finish.LENGTH
        choices.append(ChatCompletionResponseChoice(index=i, message=response_message, finish_reason=finish_reason))
        prompt_length = response.prompt_length
        response_length += response.response_length
    usage = ChatCompletionResponseUsage(
        prompt_tokens=prompt_length,
        completion_tokens=response_length,
        total_tokens=prompt_length + response_length,
    )
    return ChatCompletionResponse(id=completion_id, model=request.model, choices=choices, usage=usage)
 async def create_stream_chat_completion_response(
    request: "ChatCompletionRequest", chat_model: "ChatModel"
 ) -> AsyncGenerator[str, None]:
    completion_id = "chatcmpl-{}".format(uuid.uuid4().hex)
    input_messages, system, tools = _process_request(request)
    if tools:
        raise HTTPException(status_code=status.HTTP_400_BAD_REQUEST, detail="Cannot stream function calls.")
    if request.n > 1:
        raise HTTPException(status_code=status.HTTP_400_BAD_REQUEST, detail="Cannot stream multiple responses.")
    yield _create_stream_chat_completion_chunk(
        completion_id=completion_id, model=request.model, delta=ChatCompletionMessage(role=Role.ASSISTANT, content="")
    )
    async for new_token in chat_model.astream_chat(
        input_messages,
        system,
        tools,
        do_sample=request.do_sample,
        temperature=request.temperature,
        top_p=request.top_p,
        max_new_tokens=request.max_tokens,
        stop=request.stop,
    ):
        if len(new_token) != 0:
            yield _create_stream_chat_completion_chunk(
                completion_id=completion_id, model=request.model, delta=ChatCompletionMessage(content=new_token)
            )
    yield _create_stream_chat_completion_chunk(
        completion_id=completion_id, model=request.model, delta=ChatCompletionMessage(), finish_reason=Finish.STOP
    )
    yield "[DONE]"
 async def create_score_evaluation_response(
    request: "ScoreEvaluationRequest", chat_model: "ChatModel"
 ) -> "ScoreEvaluationResponse":
    if len(request.messages) == 0:
        raise HTTPException(status_code=status.HTTP_400_BAD_REQUEST, detail="Invalid request")
    scores = await chat_model.aget_scores(request.messages, max_length=request.max_length)
    return ScoreEvaluationResponse(model=request.model, scores=scores)
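Review note: the message checks in `_process_request` above (optional leading system message, odd message count, strict user/assistant alternation) can be sketched on plain dicts as follows. This is a minimal illustration only; `split_and_validate`, the role sets, and the use of `ValueError` in place of `HTTPException` are hypothetical stand-ins, not part of the repository.

```python
from typing import Dict, List, Tuple

# Roles accepted at even and odd positions (assumed names, mirroring the Role enum values).
USER_LIKE = {"user", "tool"}
ASSISTANT_LIKE = {"assistant", "function"}


def split_and_validate(messages: List[Dict[str, str]]) -> Tuple[str, List[Dict[str, str]]]:
    """Hypothetical stand-in for the checks in _process_request, using plain dicts."""
    if len(messages) == 0:
        raise ValueError("Invalid length")

    messages = list(messages)  # copy so the caller's list is not mutated
    if messages[0]["role"] == "system":
        system = messages.pop(0)["content"]
    else:
        system = ""

    # After removing the system message, the conversation must end on a user turn.
    if len(messages) % 2 == 0:
        raise ValueError("Only supports u/a/u/a/u...")

    for i, message in enumerate(messages):
        expected = USER_LIKE if i % 2 == 0 else ASSISTANT_LIKE
        if message["role"] not in expected:
            raise ValueError("Invalid role")

    return system, messages
```

Unlike the diff, which pops the system message from `request.messages` in place, the sketch copies the list first; that is a design choice of the example, not a claim about the original code.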
--- a/src/llmtuner/api/common.py
+++ b/src/llmtuner/api/common.py
@@ -0,0 +1,20 @@
 import json
 from typing import TYPE_CHECKING, Any, Dict
 if TYPE_CHECKING:
    from pydantic import BaseModel
 def dictify(data: "BaseModel") -> Dict[str, Any]:
    try:  # pydantic v2
        return data.model_dump(exclude_unset=True)
    except AttributeError:  # pydantic v1
        return data.dict(exclude_unset=True)
 def jsonify(data: "BaseModel") -> str:
    try:  # pydantic v2
        return json.dumps(data.model_dump(exclude_unset=True), ensure_ascii=False)
    except AttributeError:  # pydantic v1
        return data.json(exclude_unset=True, ensure_ascii=False)
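Review note: `dictify` and `jsonify` rely on a try/except fallback to support both pydantic major versions. The pattern can be exercised without pydantic installed by using two stand-in classes; `V2Like` and `V1Like` below are hypothetical and only mimic the method names (`model_dump` and `dict`).

```python
from typing import Any, Dict


class V2Like:
    """Stand-in exposing the pydantic v2 style method name."""

    def model_dump(self, exclude_unset: bool = False) -> Dict[str, Any]:
        return {"id": "v2"}


class V1Like:
    """Stand-in exposing only the pydantic v1 style method name."""

    def dict(self, exclude_unset: bool = False) -> Dict[str, Any]:
        return {"id": "v1"}


def dictify(data: Any) -> Dict[str, Any]:
    try:  # v2-style object
        return data.model_dump(exclude_unset=True)
    except AttributeError:  # v1-style object has no model_dump
        return data.dict(exclude_unset=True)


print(dictify(V2Like()))
print(dictify(V1Like()))
```

One caveat of this pattern: an `AttributeError` raised inside `model_dump` itself would also trigger the fallback, so a `hasattr` check may be a stricter alternative.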
--- a/src/llmtuner/api/protocol.py
+++ b/src/llmtuner/api/protocol.py
@@ -1,6 +1,6 @@
 import time
 from enum import Enum, unique
-from typing import Any, Dict, List, Optional
+from typing import Any, Dict, List, Optional, Union
 from pydantic import BaseModel, Field
 from typing_extensions import Literal
@@ -51,7 +51,7 @@ class FunctionAvailable(BaseModel):
 class FunctionCall(BaseModel):
-    id: Literal["call_default"] = "call_default"
+    id: str
    type: Literal["function"] = "function"
    function: Function
@@ -77,6 +77,7 @@ class ChatCompletionRequest(BaseModel):
    top_p: Optional[float] = None
    n: int = 1
    max_tokens: Optional[int] = None
+    stop: Optional[Union[str, List[str]]] = None
    stream: bool = False
@@ -86,7 +87,7 @@ class ChatCompletionResponseChoice(BaseModel):
    finish_reason: Finish
-class ChatCompletionResponseStreamChoice(BaseModel):
+class ChatCompletionStreamResponseChoice(BaseModel):
    index: int
    delta: ChatCompletionMessage
    finish_reason: Optional[Finish] = None
@@ -99,7 +100,7 @@ class ChatCompletionResponseUsage(BaseModel):
 class ChatCompletionResponse(BaseModel):
-    id: Literal["chatcmpl-default"] = "chatcmpl-default"
+    id: str
    object: Literal["chat.completion"] = "chat.completion"
    created: int = Field(default_factory=lambda: int(time.time()))
    model: str
@@ -108,11 +109,11 @@ class ChatCompletionResponse(BaseModel):
 class ChatCompletionStreamResponse(BaseModel):
-    id: Literal["chatcmpl-default"] = "chatcmpl-default"
+    id: str
    object: Literal["chat.completion.chunk"] = "chat.completion.chunk"
    created: int = Field(default_factory=lambda: int(time.time()))
    model: str
-    choices: List[ChatCompletionResponseStreamChoice]
+    choices: List[ChatCompletionStreamResponseChoice]
 class ScoreEvaluationRequest(BaseModel):
@@ -122,7 +123,7 @@ class ScoreEvaluationRequest(BaseModel):
 class ScoreEvaluationResponse(BaseModel):
-    id: Literal["scoreeval-default"] = "scoreeval-default"
+    id: str
    object: Literal["score.evaluation"] = "score.evaluation"
    model: str
    scores: List[float]
--- a/src/llmtuner/chat/chat_model.py
+++ b/src/llmtuner/chat/chat_model.py
@@ -2,6 +2,7 @@ import asyncio
 from threading import Thread
 from typing import TYPE_CHECKING, Any, AsyncGenerator, Dict, Generator, List, Optional, Sequence
+from ..extras.misc import torch_gc
 from ..hparams import get_infer_args
 from .hf_engine import HuggingfaceEngine
 from .vllm_engine import VllmEngine
@@ -95,3 +96,45 @@ class ChatModel:
        **input_kwargs,
    ) -> List[float]:
        return await self.engine.get_scores(batch_input, **input_kwargs)
 def run_chat() -> None:
    try:
        import platform
        if platform.system() != "Windows":
            import readline  # noqa: F401
    except ImportError:
        print("Install `readline` for a better experience.")
    chat_model = ChatModel()
    messages = []
    print("Welcome to the CLI application, use `clear` to remove the history, use `exit` to exit the application.")
    while True:
        try:
            query = input("\nUser: ")
        except UnicodeDecodeError:
            print("Detected decoding error at the inputs, please set the terminal encoding to utf-8.")
            continue
        except Exception:
            raise
        if query.strip() == "exit":
            break
        if query.strip() == "clear":
            messages = []
            torch_gc()
            print("History has been removed.")
            continue
        messages.append({"role": "user", "content": query})
        print("Assistant: ", end="", flush=True)
        response = ""
        for new_text in chat_model.stream_chat(messages):
            print(new_text, end="", flush=True)
            response += new_text
        print()
        messages.append({"role": "assistant", "content": response})
--- a/src/llmtuner/chat/hf_engine.py
+++ b/src/llmtuner/chat/hf_engine.py
@@ -65,23 +65,30 @@ class HuggingfaceEngine(BaseEngine):
        prompt_length = len(prompt_ids)
        inputs = torch.tensor([prompt_ids], device=model.device)
-        do_sample = input_kwargs.pop("do_sample", None)
+        do_sample = input_kwargs.pop("do_sample", generating_args["do_sample"])
-        temperature = input_kwargs.pop("temperature", None)
+        temperature = input_kwargs.pop("temperature", generating_args["temperature"])
-        top_p = input_kwargs.pop("top_p", None)
+        top_p = input_kwargs.pop("top_p", generating_args["top_p"])
-        top_k = input_kwargs.pop("top_k", None)
+        top_k = input_kwargs.pop("top_k", generating_args["top_k"])
-        num_return_sequences = input_kwargs.pop("num_return_sequences", None)
+        num_return_sequences = input_kwargs.pop("num_return_sequences", 1)
-        repetition_penalty = input_kwargs.pop("repetition_penalty", None)
+        repetition_penalty = input_kwargs.pop("repetition_penalty", generating_args["repetition_penalty"])
+        length_penalty = input_kwargs.pop("length_penalty", generating_args["length_penalty"])
        max_length = input_kwargs.pop("max_length", None)
        max_new_tokens = input_kwargs.pop("max_new_tokens", None)
+        stop = input_kwargs.pop("stop", None)
+        if stop is not None:
+            raise ValueError("Stop parameter is not supported in Huggingface engine yet.")
        generating_args = generating_args.copy()
        generating_args.update(
            dict(
-                do_sample=do_sample if do_sample is not None else generating_args["do_sample"],
+                do_sample=do_sample,
-                temperature=temperature or generating_args["temperature"],
+                temperature=temperature,
-                top_p=top_p or generating_args["top_p"],
+                top_p=top_p,
-                top_k=top_k or generating_args["top_k"],
+                top_k=top_k,
-                num_return_sequences=num_return_sequences or 1,
+                num_return_sequences=num_return_sequences,
-                repetition_penalty=repetition_penalty or generating_args["repetition_penalty"],
+                repetition_penalty=repetition_penalty,
+                length_penalty=length_penalty,
                eos_token_id=[tokenizer.eos_token_id] + tokenizer.additional_special_tokens_ids,
                pad_token_id=tokenizer.pad_token_id,
            )
@@ -90,6 +97,10 @@ class HuggingfaceEngine(BaseEngine):
        if isinstance(num_return_sequences, int) and num_return_sequences > 1:
            generating_args["do_sample"] = True
+        if not generating_args["do_sample"]:
+            generating_args.pop("temperature", None)
+            generating_args.pop("top_p", None)
        if max_length:
            generating_args.pop("max_new_tokens", None)
            generating_args["max_length"] = max_length
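Review note: this hunk replaces `value or default` with `input_kwargs.pop(key, default)`. The two forms behave differently for falsy and for explicitly passed `None` values, which the following sketch makes concrete. The `DEFAULTS` values and both helper names are hypothetical and chosen only for illustration.

```python
from typing import Any, Dict

# Hypothetical defaults standing in for generating_args.
DEFAULTS: Dict[str, Any] = {"temperature": 0.95, "top_k": 50}


def resolve_with_or(input_kwargs: Dict[str, Any], key: str) -> Any:
    """Old style: any falsy value (0, 0.0, None) falls back to the default."""
    value = input_kwargs.pop(key, None)
    return value or DEFAULTS[key]


def resolve_with_pop_default(input_kwargs: Dict[str, Any], key: str) -> Any:
    """New style: the default is used only when the key is absent."""
    return input_kwargs.pop(key, DEFAULTS[key])


print(resolve_with_or({"temperature": 0.0}, "temperature"))
print(resolve_with_pop_default({"temperature": 0.0}, "temperature"))
print(resolve_with_pop_default({"temperature": None}, "temperature"))
```

With the new style, an explicit `temperature=0.0` is respected, but a key that is present with value `None` is passed through unchanged. Callers that forward optional request fields may therefore want to drop `None` entries first.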
--- a/src/llmtuner/chat/vllm_engine.py
+++ b/src/llmtuner/chat/vllm_engine.py
@@ -2,9 +2,11 @@ import uuid
 from typing import TYPE_CHECKING, AsyncGenerator, AsyncIterator, Dict, List, Optional, Sequence
 from ..data import get_template_and_fix_tokenizer
+from ..extras.logging import get_logger
 from ..extras.misc import get_device_count, infer_optim_dtype
 from ..extras.packages import is_vllm_available
 from ..model import load_config, load_tokenizer
+from ..model.utils.visual import LlavaMultiModalProjectorForYiVLForVLLM
 from .base_engine import BaseEngine, Response
@@ -22,6 +24,9 @@ if TYPE_CHECKING:
    from ..hparams import DataArguments, FinetuningArguments, GeneratingArguments, ModelArguments
+logger = get_logger(__name__)
 class VllmEngine(BaseEngine):
    def __init__(
        self,
@@ -57,13 +62,19 @@ class VllmEngine(BaseEngine):
        }
        if model_args.visual_inputs:
-            # TODO: auto derive from config
+            image_size = config.vision_config.image_size
-            # https://github.com/vllm-project/vllm/pull/3042#issuecomment-1984893549
+            patch_size = config.vision_config.patch_size
-            self.image_feature_size = 576
+            self.image_feature_size = (image_size // patch_size) ** 2
            engine_args["image_input_type"] = "pixel_values"
            engine_args["image_token_id"] = self.tokenizer.convert_tokens_to_ids("<image>")
-            engine_args["image_input_shape"] = "1,3,336,336"
+            engine_args["image_input_shape"] = "1,3,{},{}".format(image_size, image_size)
            engine_args["image_feature_size"] = self.image_feature_size
+            if getattr(config, "is_yi_vl_derived_model", None):
+                # bug in vllm 0.4.2, see: https://github.com/vllm-project/vllm/pull/4828
+                import vllm.model_executor.models.llava
+                logger.info("Detected Yi-VL model, applying projector patch.")
+                vllm.model_executor.models.llava.LlavaMultiModalProjector = LlavaMultiModalProjectorForYiVLForVLLM
        self.model = AsyncLLMEngine.from_engine_args(AsyncEngineArgs(**engine_args))
        if model_args.adapter_name_or_path is not None:
@@ -89,41 +100,35 @@ class VllmEngine(BaseEngine):
        )
        prompt_length = len(prompt_ids)
-        temperature = input_kwargs.pop("temperature", None)
-        top_p = input_kwargs.pop("top_p", None)
-        top_k = input_kwargs.pop("top_k", None)
-        num_return_sequences = input_kwargs.pop("num_return_sequences", None)
-        repetition_penalty = input_kwargs.pop("repetition_penalty", None)
+        use_beam_search = self.generating_args["num_beams"] > 1
+        temperature = input_kwargs.pop("temperature", self.generating_args["temperature"])
+        top_p = input_kwargs.pop("top_p", self.generating_args["top_p"])
+        top_k = input_kwargs.pop("top_k", self.generating_args["top_k"])
+        num_return_sequences = input_kwargs.pop("num_return_sequences", 1)
+        repetition_penalty = input_kwargs.pop("repetition_penalty", self.generating_args["repetition_penalty"])
+        length_penalty = input_kwargs.pop("length_penalty", self.generating_args["length_penalty"])
        max_length = input_kwargs.pop("max_length", None)
        max_new_tokens = input_kwargs.pop("max_new_tokens", None)
        stop = input_kwargs.pop("stop", None)
-        generating_args = self.generating_args.copy()
-        generating_args.update(
-            dict(
-                temperature=temperature or generating_args["temperature"],
-                top_p=top_p or generating_args["top_p"],
-                top_k=top_k or generating_args["top_k"],
-                num_return_sequences=num_return_sequences or 1,
-                repetition_penalty=repetition_penalty or generating_args["repetition_penalty"],
-            )
-        )
+        max_tokens = self.generating_args["max_new_tokens"] or self.generating_args["max_length"]
        if max_length:
-            generating_args["max_new_tokens"] = max_length - prompt_length
+            max_tokens = max_length - prompt_length if max_length > prompt_length else 1
        if max_new_tokens:
-            generating_args["max_new_tokens"] = max_new_tokens
+            max_tokens = max_new_tokens
        sampling_params = SamplingParams(
-            n=generating_args["num_return_sequences"],
+            n=num_return_sequences,
-            repetition_penalty=generating_args["repetition_penalty"],
+            repetition_penalty=repetition_penalty,
-            temperature=generating_args["temperature"],
+            temperature=temperature,
-            top_p=generating_args["top_p"],
+            top_p=top_p,
-            top_k=generating_args["top_k"],
+            top_k=top_k,
-            use_beam_search=generating_args["num_beams"] > 1,
+            use_beam_search=use_beam_search,
-            length_penalty=generating_args["length_penalty"],
+            length_penalty=length_penalty,
            stop=stop,
            stop_token_ids=[self.tokenizer.eos_token_id] + self.tokenizer.additional_special_tokens_ids,
-            max_tokens=generating_args["max_new_tokens"],
+            max_tokens=max_tokens,
            skip_special_tokens=True,
        )
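Review note: two small calculations in this file can be checked in isolation. The first derives the image feature size from the vision config instead of hard-coding 576; the second resolves `max_tokens` with `max_new_tokens` taking precedence over `max_length`. Both helper names are hypothetical, and the patch size of 14 is an assumption chosen because it reproduces the previously hard-coded 576 for a 336-pixel input.

```python
from typing import Optional


def image_feature_size(image_size: int, patch_size: int) -> int:
    """Number of patch tokens for a square image split into square patches."""
    return (image_size // patch_size) ** 2


def resolve_max_tokens(
    default_max_new_tokens: Optional[int],
    default_max_length: int,
    prompt_length: int,
    max_length: Optional[int] = None,
    max_new_tokens: Optional[int] = None,
) -> int:
    """Mirror of the precedence used in the hunk above (sketch, not the original function)."""
    max_tokens = default_max_new_tokens or default_max_length
    if max_length:
        # Never request fewer than one token, even if the prompt already exceeds max_length.
        max_tokens = max_length - prompt_length if max_length > prompt_length else 1
    if max_new_tokens:
        max_tokens = max_new_tokens
    return max_tokens


print(image_feature_size(336, 14))
print(resolve_max_tokens(512, 1024, prompt_length=100, max_length=300))
```

The guard on `max_length > prompt_length` is what the new `+` line adds over the old `max_length - prompt_length`, which could go negative for long prompts.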
--- a/src/llmtuner/cli.py
+++ b/src/llmtuner/cli.py
@@ -0,0 +1,75 @@
 import sys
 from enum import Enum, unique
 from .api.app import run_api
 from .chat.chat_model import run_chat
 from .eval.evaluator import run_eval
 from .train.tuner import export_model, run_exp
 from .webui.interface import run_web_demo, run_web_ui
 USAGE = (
    "-" * 70
    + "\n"
    + "| Usage:                                                             |\n"
    + "|   llamafactory-cli api -h: launch an OpenAI-style API server       |\n"
    + "|   llamafactory-cli chat -h: launch a chat interface in CLI         |\n"
    + "|   llamafactory-cli eval -h: evaluate models                        |\n"
    + "|   llamafactory-cli export -h: merge LoRA adapters and export model |\n"
    + "|   llamafactory-cli train -h: train models                          |\n"
    + "|   llamafactory-cli webchat -h: launch a chat interface in Web UI   |\n"
    + "|   llamafactory-cli webui: launch LlamaBoard                        |\n"
    + "|   llamafactory-cli version: show version info                      |\n"
    + "-" * 70
 )
 VERSION = "0.7.1"
 WELCOME = (
    "-" * 58
    + "\n"
    + "| Welcome to LLaMA Factory, version {}".format(VERSION)
    + " " * (21 - len(VERSION))
    + "|\n|"
    + " " * 56
    + "|\n"
    + "| Project page: https://github.com/hiyouga/LLaMA-Factory |\n"
    + "-" * 58
 )
@unique
 class Command(str, Enum):
    API = "api"
    CHAT = "chat"
    EVAL = "eval"
    EXPORT = "export"
    TRAIN = "train"
    WEBDEMO = "webchat"
    WEBUI = "webui"
    VER = "version"
    HELP = "help"
 def main():
    command = sys.argv.pop(1)
    if command == Command.API:
        run_api()
    elif command == Command.CHAT:
        run_chat()
    elif command == Command.EVAL:
        run_eval()
    elif command == Command.EXPORT:
        export_model()
    elif command == Command.TRAIN:
        run_exp()
    elif command == Command.WEBDEMO:
        run_web_demo()
    elif command == Command.WEBUI:
        run_web_ui()
    elif command == Command.VER:
        print(WELCOME)
    elif command == Command.HELP:
        print(USAGE)
    else:
        raise NotImplementedError("Unknown command: {}".format(command))
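Review note: `main()` calls `sys.argv.pop(1)` unconditionally, so running the entry point with no subcommand raises `IndexError` before any usage text is shown. One possible guard, sketched here with a reduced `Command` enum and a hypothetical `dispatch` helper that takes a handler table (neither is in the diff), falls back to `help`:

```python
from enum import Enum, unique
from typing import Callable, Dict, List


@unique
class Command(str, Enum):
    CHAT = "chat"
    VER = "version"
    HELP = "help"


def dispatch(argv: List[str], handlers: Dict[str, Callable[[], str]]) -> str:
    """Pop the subcommand from argv and run its handler (illustrative sketch)."""
    # Fall back to "help" when no subcommand is given instead of raising IndexError.
    command = argv.pop(1) if len(argv) > 1 else Command.HELP.value
    if command not in handlers:
        raise NotImplementedError("Unknown command: {}".format(command))
    return handlers[command]()


# Handlers are keyed by the enum *values* (plain strings).
handlers = {
    Command.VER.value: lambda: "0.7.1",
    Command.HELP.value: lambda: "usage",
}

print(dispatch(["llamafactory-cli"], handlers))
print(dispatch(["llamafactory-cli", "version"], handlers))
```

Popping the subcommand keeps the remaining arguments in place for the downstream argument parser, which matches the behaviour of the original `main()`.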
--- a/src/llmtuner/data/loader.py
+++ b/src/llmtuner/data/loader.py
@@ -11,7 +11,7 @@ from .aligner import align_dataset
 from .parser import get_dataset_list
 from .preprocess import get_preprocess_and_print_func
 from .template import get_template_and_fix_tokenizer
-from .utils import checksum, merge_dataset
+from .utils import merge_dataset
 if TYPE_CHECKING:
@@ -61,8 +61,6 @@ def load_single_dataset(
        if data_path is None:
            raise ValueError("File extension must be txt, csv, json or jsonl.")
-        checksum(data_files, dataset_attr.file_sha1)
    else:
        raise NotImplementedError
--- a/src/llmtuner/data/parser.py
+++ b/src/llmtuner/data/parser.py
@@ -21,7 +21,6 @@ class DatasetAttr:
    load_from: Literal["hf_hub", "ms_hub", "script", "file"]
    dataset_name: str
    """ extra configs """
-    file_sha1: Optional[str] = None
    subset: Optional[str] = None
    folder: Optional[str] = None
    ranking: bool = False
@@ -99,7 +98,6 @@ def get_dataset_list(data_args: "DataArguments") -> List["DatasetAttr"]:
        else:
            dataset_attr = DatasetAttr("file", dataset_name=dataset_info[name]["file_name"])
-        dataset_attr.set_attr("file_sha1", dataset_info[name])
        dataset_attr.set_attr("subset", dataset_info[name])
        dataset_attr.set_attr("folder", dataset_info[name])
        dataset_attr.set_attr("ranking", dataset_info[name], default=False)
Author	SHA1	Message	Date
hiyouga	b2949b88e9	release v0.7.1 Former-commit-id: a4f8adb021b6218d624303b51cd5e93ffa3111a1	2024-05-16 00:57:16 +08:00
hiyouga	538c79fd8f	fix #3694 Former-commit-id: 3d1b818cb6a77b7603724fbeb756b468aa74e7ea	2024-05-16 00:35:28 +08:00
hiyouga	437cc20be6	fix #3606 https://github.com/huggingface/peft/pull/1706 Former-commit-id: bf2783e1b6bc207375974c48736d6f82dd293f02	2024-05-15 23:05:02 +08:00
hiyouga	2ac972d6e7	add Yi-VL-34B model Former-commit-id: 8b3d8a7e3bd8dff27cc72edba1b8a042f6d1929c	2024-05-15 22:58:19 +08:00
hiyouga	4d7f0fbb7a	add yi-vl 6b model Former-commit-id: 35f4041b13a593a6cf1ec6686fa18b38911ad6a4	2024-05-15 20:02:41 +08:00
hiyouga	40e3d3fbdd	fix yi vl vllm infer Former-commit-id: de54e5d7ec06dd7c20ec82c9ff032fc16cd50244	2024-05-15 19:25:48 +08:00
hiyouga	096677b989	add NPU docker images Former-commit-id: 3b3257962c52f5d1f15ce245fee402c5baddb774	2024-05-15 19:20:11 +08:00
hoshi-hiyouga	7940b968ae	Merge pull request #3748 from BUAADreamer/main Add MLLM YI-VL and save processor config during training Former-commit-id: 1d3cbd24ccea63d36c27725cdc5ecd02b460b0ed	2024-05-15 16:40:54 +08:00
hoshi-hiyouga	36a4224bf5	Update visual.py Former-commit-id: f5f13a995c64fc374ad05e26cde8efa6651aefa1	2024-05-15 16:39:57 +08:00
hiyouga	d4d36e157c	fix fsdp model loading Former-commit-id: fc6fe23cc9ae4a920a17e8268a85c1aa4ad16d3b	2024-05-15 16:32:28 +08:00
hoshi-hiyouga	c4f5e49d0d	Update patcher.py Former-commit-id: 4c31a21f2106adcdad100119bad83ecaef0be3f3	2024-05-15 15:37:07 +08:00
hoshi-hiyouga	8e518d6c62	Update template.py Former-commit-id: a13022166ba691c03f4fea7e9e2927fa446cf681	2024-05-15 14:20:39 +08:00
hoshi-hiyouga	79165100e5	Update trainer.py Former-commit-id: dd767b20635bb549ce14f9556e1c4fb44b3662c5	2024-05-15 14:13:26 +08:00
hoshi-hiyouga	fc82acbbd8	Update workflow.py Former-commit-id: 97cfb44bced18b721166ccb5f260098645fc5318	2024-05-15 14:13:01 +08:00
BUAADreamer	aead3ca8e5	rm extra import Former-commit-id: 031215019e3d7727b1c7cc87a44e1cf1eb2853ec	2024-05-15 12:48:18 +08:00
BUAADreamer	b12679ad59	cast dtype in mm_proj Former-commit-id: e0ab22648fe8b65055b5986258cc2800438dc60c	2024-05-15 11:22:15 +08:00
BUAADreamer	8061cb5671	modify style Former-commit-id: 823af88c3201412da7ef734d34198424e09b2d51	2024-05-15 10:18:10 +08:00
BUAADreamer	0a7e5f2f57	Merge branch 'main' of https://github.com/BUAADreamer/LLaMA-Factory Former-commit-id: ce5cb0f897eebe32a1c2c0a78fe1b0267e4b6d9d	2024-05-15 09:54:21 +08:00
BUAADreamer	812d2c25a7	Merge branch 'hiyouga:main' into main Former-commit-id: a4795c2f5328e0cfc657409f5774819e3defc006	2024-05-15 09:54:14 +08:00
BUAADreamer	51795e8db1	add yivl and save processor to model_dir Former-commit-id: ae72f745cb4f7713c3b835d11202aec19c3c5093	2024-05-15 09:54:00 +08:00
hiyouga	2c011060b1	fix bug in vllm engine Former-commit-id: 38f02a2c5b52cba6908c2d3c2a455677f8574faf	2024-05-15 02:17:54 +08:00
hiyouga	a8c7531250	fix gen args Former-commit-id: d79f91f87106ba1bc3c0ea08da5898aad59566a7	2024-05-15 01:49:05 +08:00
hiyouga	88c34d26a8	fix examples Former-commit-id: 910ffaf46e3dde87d2dbb48b82a59a9898a90847	2024-05-15 00:26:10 +08:00
hiyouga	12d666a63c	update examples Former-commit-id: 09269c59427e8a007c1c1b6f9d2014b4c0d0a328	2024-05-15 00:05:17 +08:00
hiyouga	304a2efec8	update readme Former-commit-id: 568cc1d33c3d202e6430b68e0bcb2772aa6b0aa2	2024-05-14 23:57:08 +08:00
hiyouga	322331df51	update readme Former-commit-id: f315a545d85a661746ad304b5a688d1fad9eaea1	2024-05-14 23:55:49 +08:00
hiyouga	ba0da83031	add npu examples Former-commit-id: 0f21e68e2dbd84c820d66d5c6d980004efc51d51	2024-05-14 23:32:53 +08:00
hoshi-hiyouga	0a82e15e7c	Merge pull request #3584 from zhou-wjjw/main Enhancing Ascend 910A Training Efficiency in LlamaFactory with NPU Former-commit-id: 310cf017a5ec24af8f5cf3af298760dd4150f9f2	2024-05-14 22:18:37 +08:00
hiyouga	6670b36c49	use robust envs Former-commit-id: f3e194c3b3c40a3e6c3c5397ec0d859e6db614b5	2024-05-14 21:36:42 +08:00
hoshi-hiyouga	7a1d13aae2	Update train.py Former-commit-id: da1e6f0d9c2eff64f92da1f6ada3aa44ef6d6a7e	2024-05-14 20:47:52 +08:00
hoshi-hiyouga	86a048128b	Apply suggestions from code review Co-authored-by: Huazhong Ji <hzji210@gmail.com> Former-commit-id: abef48c17ee795eae984fcc89019c2c4859108c1	2024-05-14 20:44:21 +08:00
hoshi-hiyouga	fe1a3b1367	Apply suggestions from code review Co-authored-by: Huazhong Ji <hzji210@gmail.com> Former-commit-id: a435e5a0bdd7268c4f1204f99f289ee0b36fd930	2024-05-14 20:44:04 +08:00
hiyouga	84ff56c3a0	fix #3728 Former-commit-id: ea3e32a27f7f7dce75a708f8a6f376b5d3e8059a	2024-05-14 20:37:21 +08:00
BUAADreamer	483ed64b43	modify yi-vl template Former-commit-id: f113975b425e70bed2588ca55a2c62594fbf2283	2024-05-14 16:45:28 +08:00
BUAADreamer	dd4619e9f3	add support for Yi-VL Former-commit-id: d7834ca92d3048949caa48f8635cfbcea2c85771	2024-05-14 14:03:19 +08:00
BUAADreamer	905815d878	Merge branch 'main' of https://github.com/BUAADreamer/LLaMA-Factory Former-commit-id: e82f527ea583a7e99a25a06c7fe7b03c1dc2ebb9	2024-05-13 23:28:52 +08:00
BUAADreamer	ba72e08901	add yi-vl Former-commit-id: 891b25cb3d709ea82182ca90496034360e1cd5d8	2024-05-13 23:28:28 +08:00
hiyouga	e4972c8fc4	update examples Former-commit-id: 779603055ae9216ff549f5285caac8c0c0a1e9fb	2024-05-13 20:39:36 +08:00
hiyouga	5f5f948806	fix #3724 Former-commit-id: 62f5999d79834d6cbc4129eda387a317665d6099	2024-05-13 20:09:09 +08:00
hiyouga	2892e5d42a	fix #3702 Former-commit-id: 55755786f21050b9efc127c391509ba5d9ea8982	2024-05-13 18:24:35 +08:00
hoshi-hiyouga	542a5d15ef	Merge pull request #3655 from Tendo33/main 1.Change the name of is_fastapi_available function 2. Added the log of printing requests when deploying using vllm Former-commit-id: 28c75448eed9d472e96285737a66ac0d20280e13	2024-05-13 18:05:50 +08:00
hiyouga	b1c791fb0d	support Yi 1.5 Former-commit-id: e580823676cbb83ddb9a0f685992e6054ae5ffaa	2024-05-13 16:51:20 +08:00
Tendo33	7589123465	ruff check scripts src tests --fix Former-commit-id: da5277b6a1cff40d59df8f1835d9514b2a51be34	2024-05-13 09:40:33 +08:00
Sun Jinfeng	f94b54b776	Merge branch 'hiyouga:main' into main Former-commit-id: 014acaa7845b7ac2876596d216b1be369a8e9311	2024-05-13 09:29:58 +08:00
hiyouga	1e1b8899f5	lint Former-commit-id: cb72eb6ab24615ce492ca2945f29daa34c0c52d4	2024-05-12 01:28:51 +08:00
hiyouga	7b02c83399	fix #3658 Former-commit-id: 37799a62d4431d1d8c02fee6c23d607a65723c1a	2024-05-12 01:25:16 +08:00
hiyouga	8f1ba07b30	remove checksum and fix ui args Former-commit-id: 0cfdeb1d30efb63211434bc4656bceb59e666289	2024-05-12 01:10:30 +08:00
hoshi-hiyouga	1ce400bddf	Merge pull request #3654 from betapeanut/main Remove Redundant Environment Variable Usage Former-commit-id: aa57a2a183eef822973d7e5d7c7bc80a42167482	2024-05-12 00:49:00 +08:00
hiyouga	6bc0ec63c7	update readme Former-commit-id: d57ca8a865b46588f65b2cc15073c5fcc4e4cebc	2024-05-12 00:33:49 +08:00
hiyouga	25d316b1a0	fix #3674 Former-commit-id: 6bad2eafef75ec697477e1f2ce739006042fb4c7	2024-05-12 00:03:59 +08:00
hiyouga	2bcd5b2b73	fix llava config Former-commit-id: b13d032325e45d401a9dbc64d4c73e308eff3288	2024-05-12 00:02:49 +08:00
hoshi-hiyouga	436afcba57	Merge pull request #3651 from BUAADreamer/main add some mllm features and try to incorporate Chinese-LLaVA-Med project Former-commit-id: 143d311d4a82e1fa9b6d4ad98b0db5b02f3572c4	2024-05-11 23:59:08 +08:00
hoshi-hiyouga	db47c53486	Update loader.py Former-commit-id: 2fc12790414677bb82736208fb9547640780af2e	2024-05-11 23:58:47 +08:00
hoshi-hiyouga	4efe56fd68	Update model_args.py Former-commit-id: c4114add4c42c1d7723f7270451a6c9fc656ecd1	2024-05-11 23:57:05 +08:00
hoshi-hiyouga	d54313fcf9	Update patcher.py Former-commit-id: 2c88d394d29c6e98ac3a6860848855722614ca52	2024-05-11 23:56:40 +08:00
hoshi-hiyouga	382f096475	Update tuner.py Former-commit-id: ccd1eb2c0992f75440c0e1c5cd3f02d03aacb085	2024-05-11 23:55:59 +08:00
hoshi-hiyouga	0ccc76392e	Update tuner.py Former-commit-id: 22afcbdb25160583e5ece28fad0585c7bc70f41a	2024-05-11 23:54:53 +08:00
hoshi-hiyouga	e2cfcb0a5f	Update README_zh.md Former-commit-id: 1a205478403b5852fac0aa8418cdb8995fbe40e3	2024-05-11 22:44:51 +08:00
hoshi-hiyouga	b530a798c1	Update README.md Former-commit-id: d24c83bb30e2829ba78db90c4c4975788f2eed25	2024-05-11 22:43:04 +08:00
BUAADreamer	fdf38b70a0	Merge branch 'main' of https://github.com/BUAADreamer/LLaMA-Factory Former-commit-id: 50cc5cf93d50c42cfcf5047bcd9b5c7959d503ae	2024-05-11 13:11:10 +08:00
BUAADreamer	1a78b675be	add full parameter finetuning of mllm Former-commit-id: f90c1da5636ac3cb8112c5081a3b56b09a17fcf8	2024-05-11 13:11:00 +08:00
kkkl	9b1008912c	Update constants.py Fix the download issue of the Phi3 model Former-commit-id: 8978e80914ac6db1ed1b79641b20c84087dd4341	2024-05-11 00:22:40 +08:00
BUAADreamer	18241f4ed8	Merge branch 'hiyouga:main' into main Former-commit-id: 0dd072703508f68fd4ee51b6648d0c7642a4cc93	2024-05-10 20:34:41 +08:00
hiyouga	223bbd9930	resolve python 3.8 package Former-commit-id: 5eee4ec7016846356715a4fa1ad58e3cbb1cac6e	2024-05-09 16:52:27 +08:00
Tendo33	9dadff90bb	1. Rename the is_fastapi_available function 2. Log incoming requests when deploying with vllm Former-commit-id: 530d4f5d51c13c71d99de5fe2d23805b0aa875a2	2024-05-09 14:28:01 +08:00
BUAADreamer	827a929f1d	add push processor to hub Former-commit-id: 7a05a965311edfdfafa57af8342875860d341f27	2024-05-09 14:05:19 +08:00
BUAADreamer	e508519e0a	add mllm processor save and Chinese-LLaVA-Med show Former-commit-id: 110c49fbf79fe0625f091e63746bfabde00add99	2024-05-09 13:53:39 +08:00
BUAADreamer	47892418ad	Merge branch 'hiyouga:main' into main Former-commit-id: 1f3163509ecd05902ea216a905b4ca15ddd3696f	2024-05-09 13:45:43 +08:00
cocktailpeanut	2aeae4b88b	yet another removal of unnecessary environment variables Former-commit-id: a07726028f0287de28e4751672b27efe0efc6477	2024-05-09 01:33:20 -04:00
cocktailpeanut	c213f2a9a9	more removal of unnecessary environment variables Former-commit-id: 59ef1a6e0d81585a6c010143d05fcfae26d40c00	2024-05-09 01:32:00 -04:00
cocktailpeanut	333f4a69bb	remove unnecessary environment variable usage Former-commit-id: 4be1d832cb269a07987f5cab5d5f949e269087da	2024-05-09 01:26:15 -04:00
BUAADreamer	172600d432	add mllm export Former-commit-id: ce4770d33f6761d3b1d60661efcb0be34a036154	2024-05-08 22:50:42 +08:00
hiyouga	4ce4172c87	fix #3625 Former-commit-id: 8c0f5d1db29862277d84aa128b424b7d0f2b187f	2024-05-08 17:12:56 +08:00
hiyouga	400ae144a4	add llama3 chinese chat Former-commit-id: ee3e5920f2f28567259693cb106e884a90cb02a2	2024-05-08 17:10:03 +08:00
hiyouga	0a1b6ca5a7	add deepseek moe 236B Former-commit-id: 30c10e2dc41b5d64191a91ad2d61f3b5c440b1d5	2024-05-08 16:37:54 +08:00
BUAADreamer	05ef89cfcc	modify export model Former-commit-id: c7051edae4ce23f85daf204a2aaac134b1f29c3d	2024-05-08 10:36:36 +08:00
hiyouga	6d9d8b92ca	update readme Former-commit-id: bcc3d3b95609555e5e9a4deb68e65391c5b465bd	2024-05-07 22:17:04 +08:00
hiyouga	3f7f1daa33	remove big file Former-commit-id: 8a05242787f810ec25d1b33358257d2867c45497	2024-05-07 22:14:06 +08:00
hiyouga	8061e92d07	update readme Former-commit-id: ecefcb2e891e75d37df5ebfc616cfdb2106bcfd6	2024-05-07 21:17:31 +08:00
hiyouga	0c811a7653	update readme Former-commit-id: 730ea71584debc5784d68eeadceb42f7e827447f	2024-05-07 19:03:47 +08:00
hiyouga	f6ac3796ca	fix #3560 Former-commit-id: ea69cbe903a301df1bcc4b63cdc5bd4c6e3a8255	2024-05-07 19:03:35 +08:00
hoshi-hiyouga	c1394e7dfc	Merge pull request #3601 from Katehuuh/main Add contribution Luminia Former-commit-id: 53bef571c445111f49bcc8a5d49afc2872f754ae	2024-05-07 18:01:48 +08:00
hiyouga	ebab655683	fix #3602 Former-commit-id: 1518b45490606ea200482da4737113c46985e8c5	2024-05-07 17:50:27 +08:00
hoshi-hiyouga	3d74f21738	Merge pull request #3604 from gaussian8/main fix: split Dockerfile's CMD Former-commit-id: 1d6e6956ca45d3cb7de213c4a641b98a35af5896	2024-05-07 16:53:23 +08:00
junwooo.lee	8493753fab	fix: split Dockerfile's CMD Former-commit-id: d8032550c7e084648fbf24da5abbac6432b54f26	2024-05-07 15:09:48 +09:00
Katehuuh	0f626a2145	Update README_zh.md Add Projects Nekochu/Luminia-13B-v3 Former-commit-id: 88d01e831bd511daec30a94817f06e07b8406b18	2024-05-07 06:28:48 +02:00
Katehuuh	5100c290c4	Update README.md Add Projects Nekochu/Luminia-13B-v3 Former-commit-id: 3d2cd743c2c8830e8b131d1192f1549fa557762d	2024-05-07 06:23:36 +02:00
hiyouga	4bde37e7c8	update readme Former-commit-id: 3fdc72b9aad9e129f74417cbbf25e841d28e3737	2024-05-07 06:19:29 +08:00
hiyouga	e3b3a722de	fix stop param Former-commit-id: f0a850c25211b72eddbb357c81679db9b0930d44	2024-05-07 00:41:04 +08:00
hoshi-hiyouga	b9e167e6ca	Merge pull request #3527 from zhaonx/dev "add support for vllm api stop parameter" Former-commit-id: e7d436403af6ac4c6a33cf36411098a0b0fefce2	2024-05-07 00:37:49 +08:00
hoshi-hiyouga	1ebd1e50e7	Update vllm_engine.py Former-commit-id: fa2410de07150a82082ab5b88baf56aa891db870	2024-05-07 00:37:05 +08:00
hoshi-hiyouga	14316f6583	Update generating_args.py Former-commit-id: 714957ba0159919a89fc1659a7a7b4b6bd82eead	2024-05-07 00:28:16 +08:00
hoshi-hiyouga	8e4ab2f7d0	Update generating_args.py Former-commit-id: 7a9fb56786f4c40856211009656a983be1e42cb7	2024-05-07 00:27:56 +08:00
hiyouga	196068fa19	update readme Former-commit-id: 1c67708291195825e8356d5862d22cbee9566233	2024-05-06 23:34:59 +08:00
hiyouga	da2295f8c8	fix gradio args Former-commit-id: 7767c1ad4b2b638b558f941ba1f0d05d4a049507	2024-05-06 23:33:06 +08:00
hoshi-hiyouga	ab0741b5a6	Merge pull request #3596 from hiyouga/dev_doc Add CLI document Former-commit-id: 2b08c51500592f092b9596517e787081453ecbb5	2024-05-06 23:10:38 +08:00
hiyouga	6aec446940	update examples Former-commit-id: cca50b627c85e0a777717d609377406cc7fd579f	2024-05-06 23:07:55 +08:00
hiyouga	50c71dd29f	update example docs Former-commit-id: 102cd42768d9eb2cf1219309a25b41e26149067e	2024-05-06 22:51:02 +08:00
hiyouga	5c9da798b5	update docs Former-commit-id: a4a2e94241bea6f96590f6cb8ca8b5cddee1917e	2024-05-06 21:47:00 +08:00
zhouwei	3d1b0e1864	The training efficiency of the Ascend 910A has been significantly enhanced, leveraging the full computational power of the NPU (Neural Processing Unit) and the capabilities of torch_npu, a PyTorch library optimized for NPUs. This improvement has resulted in a remarkable tenfold increase in efficiency. Former-commit-id: 90980b626d3408b3e2ee32a02456c20881318be7	2024-05-06 13:29:59 +08:00
zhaonx96	45becd2a45	"add stop parameter in chat.py" Former-commit-id: e529bf5bc14c72558d26f73c42076eaa9684205c	2024-05-06 10:10:00 +08:00
zhaonx96	8f1197de7e	Merge branch 'main' of https://github.com/zhaonx/LLaMA-Factory into dev Former-commit-id: ec1f834905e241277fdd3f764c70eede97e9ff40	2024-05-06 10:09:00 +08:00
hoshi-hiyouga	25de4ce56a	Merge pull request #3578 from pha123661/main Fix badam example argument Former-commit-id: d6edf3d91e5d20f48938e02d96d2193ed3d50181	2024-05-05 23:41:58 +08:00
Oscar	d0597897bf	Fix badam example outdated argument Former-commit-id: 29aa188cc774cb72367f706f1cd4c07bc5a9f241	2024-05-05 23:35:19 +08:00
hiyouga	4674f3baa7	add version and help to cli Former-commit-id: f762f2215169b9fe55564d5600b758ddc66f9c9c	2024-05-05 02:44:35 +08:00
hiyouga	2f5f6722cf	fix eval scripts Former-commit-id: fc3743d0b82c28fbff1170761139e4fa5d2a8939	2024-05-05 00:53:07 +08:00
hiyouga	7ef3788ff4	update webui Former-commit-id: 17a53d25cdadd2df70a8afa0488f75bbf1918b89	2024-05-05 00:17:54 +08:00
hiyouga	f9aa74715a	update scripts Former-commit-id: 1c07648c4bb4bb0c46bc0240547b46bd2835dce1	2024-05-04 23:05:17 +08:00
hiyouga	9b187b274c	add avg ppl Former-commit-id: 40caeb6f0fdf76a1e2c9ca3761299d087fc643e0	2024-05-04 22:35:31 +08:00
hiyouga	68ed89f351	update ppl script Former-commit-id: 07606fa4ab303f088170a569c1f86141a1b496c5	2024-05-04 22:13:14 +08:00
hiyouga	342d7da8d7	add cal_ppl script Former-commit-id: 947068c11c0be00db2cecddb2c5842a0d6e2c321	2024-05-04 22:02:25 +08:00
hiyouga	6eda42eb7c	update readme Former-commit-id: eaf83847ef6d89d8b70429138e73b04fd2aa3ef8	2024-05-04 17:01:21 +08:00
hiyouga	e9fe8815be	remove empty stream response Former-commit-id: 070d0da928b1e974a094279a2782201016d2a3ab	2024-05-04 16:13:52 +08:00
hiyouga	9381fecca7	fix async stream api response Former-commit-id: d70bbcae6513e50aa6094f2d98c4aa5c6641ea02	2024-05-04 16:11:18 +08:00
hiyouga	efa9140577	update api and support abort eval in webui Former-commit-id: 8661bed68812e9ded9439e8a821b1d7716bc797b	2024-05-04 15:59:15 +08:00
hiyouga	b1b18b2c5a	update readme Former-commit-id: 5061f7196a3278af5ebce77249d9c3c0f8a55b34	2024-05-04 00:43:53 +08:00
hiyouga	37bcbf72b4	update readme and webui launch Former-commit-id: c66ffa57323ef6ea78a9b75ec5122d9ea25fd420	2024-05-04 00:43:02 +08:00
hiyouga	99125c8825	update readme Former-commit-id: 012e5b9625682a628a0b7fb5879097be7166c7be	2024-05-04 00:31:02 +08:00
hiyouga	182b974786	fix eval in webui Former-commit-id: 774ef2bf5823d68b9cc254a676f5adb4af533d75	2024-05-04 00:19:19 +08:00
hiyouga	7a4a6a5522	fix webui resume Former-commit-id: c2f6582ddd365bb64b72e8057cc4ecd7884d2480	2024-05-03 23:15:19 +08:00
hiyouga	2383e5440c	fix slow op in dpo/orpo trainer Former-commit-id: 38cad0896ea0516de6d4b2759ec9d45ee67d339b	2024-05-03 23:06:52 +08:00
hiyouga	1fea91736a	fix callback log multigpu #3559 Former-commit-id: 1f105f1551b12675ca7d339ef5f91333f0371987	2024-05-03 21:24:27 +08:00
hiyouga	09d9fb28f9	enable tqdm in webui Former-commit-id: 1737bff64799047a5b715fd979b4c038ae213bb3	2024-05-03 04:42:50 +08:00
hiyouga	57c6eabf83	fix gen_args Former-commit-id: c3e2f4f07b7fb3b1d7d2b44451660f082a467aed	2024-05-03 04:24:50 +08:00
hiyouga	33d440b577	fix colab gradio Former-commit-id: 26179a29d3400d1fea155e325a79473a8bc12f04	2024-05-03 03:54:46 +08:00
hiyouga	ce8200ad98	update webui and add CLIs Former-commit-id: 1368dda22ab875914c9dd86ee5146a4f6a4736ad	2024-05-03 02:58:23 +08:00
hiyouga	2cedb59bee	Update prepare.sh Former-commit-id: 5928b869251a984a085289ca6861a9731dc5b910	2024-05-02 17:16:02 +08:00
hiyouga	dd0b85580e	fix badam configs Former-commit-id: 8a4e6a4c65a9a42e6501b0d3ce81d6220c287454	2024-05-02 02:47:04 +08:00
hoshi-hiyouga	cd4dad846b	Merge pull request #3487 from codemayq/main support BAdam in WebUI Former-commit-id: 6eada1a2844a2b2c8aad599ebfcc35b376c938ea	2024-05-02 02:38:01 +08:00
hoshi-hiyouga	a11a04a24f	Update train.py Former-commit-id: 16f0d0056967872e02969fdd842a381f9484af8a	2024-05-02 02:21:27 +08:00
hoshi-hiyouga	eb99999ca8	Update README_zh.md Former-commit-id: 1c673d89faca3160627009fcd0a4aa39138570c0	2024-05-02 02:14:55 +08:00
hoshi-hiyouga	ea58cf111e	Update README.md Former-commit-id: 4fb43b0c9aa48242126252ad755a2a1683b38d6a	2024-05-02 02:13:46 +08:00
zhaonx	2d95127c33	"add support for vllm api stop parameter" Former-commit-id: b9f21fa639b66db09c79404d885661c96bdf9395	2024-04-30 17:17:09 +08:00
Lao	57fcdca336	Update README_zh.md Former-commit-id: bacc8588dc7b0b43c240189ecf4336bedc299357	2024-04-28 23:31:37 +08:00
khazic	3d88589c0f	Upgrade the second sharegpt format Former-commit-id: 057f992a666b029d207a3dc7dfc353f9abcf8316	2024-04-28 14:30:05 +08:00
khazic	dfd153cc81	added the second sharegpt format Former-commit-id: 6d140ac98a78ecc0a713842bb917dc8eb14450cb	2024-04-28 14:27:45 +08:00
codingma	7641a214d8	support BAdam in WebUI Former-commit-id: 1247154dd7d5eba5d11c4bb8504bf551ab49eb72	2024-04-28 11:31:34 +08:00
@@ -0,0 +1,2 @@
+model_name_or_path: meta-llama/Meta-Llama-3-8B-Instruct
+template: llama3
@@ -1,4 +0,0 @@
-from .app import create_app
-
-
-__all__ = ["create_app"]