[model] support audio (#6701)

* support qwen2_audio

* improve code

* lint

* fix

* fix

* fix

---------

Co-authored-by: hiyouga <hiyouga@buaa.edu.cn>
Former-commit-id: 5eacb5629e4d7733cd992a63747a1335f2c6a929
Author: Zhangchi Feng
Date: 2025-02-05 04:59:09 +08:00 (committed by GitHub)
Commit: 8f401e37f8 (parent: 9feb78e7b4)
35 changed files with 675 additions and 213 deletions


@@ -24,6 +24,7 @@ Currently we support datasets in **alpaca** and **sharegpt** format.
"tools": "the column name in the dataset containing the tool description. (default: None)",
"images": "the column name in the dataset containing the image inputs. (default: None)",
"videos": "the column name in the dataset containing the videos inputs. (default: None)",
"audios": "the column name in the dataset containing the audios inputs. (default: None)",
"chosen": "the column name in the dataset containing the chosen answers. (default: None)",
"rejected": "the column name in the dataset containing the rejected answers. (default: None)",
"kto_tag": "the column name in the dataset containing the kto tags. (default: None)"
@@ -150,6 +151,10 @@ An additional column `images` is required. Please refer to the [sharegpt](#share
An additional column `videos` is required. Please refer to the [sharegpt](#sharegpt-format) format for details.
### Multimodal Audio Dataset
An additional column `audios` is required. Please refer to the [sharegpt](#sharegpt-format) format for details.
## Sharegpt Format
### Supervised Fine-Tuning Dataset
@@ -296,7 +301,7 @@ Regarding the above dataset, the *dataset description* in `dataset_info.json` sh
- [Example dataset](mllm_demo.json)
Multimodal image datasets require an `images` column containing the paths to the input images.
The number of images should be identical to the number of `<image>` tokens in the conversations.
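This one-to-one constraint between placeholder tokens and media paths can be checked up front. A minimal sketch of such a check follows; the helper name `check_media_counts` is illustrative and not part of the project's API:

```python
def check_media_counts(example, media_key="images", token="<image>"):
    """Return True when the number of placeholder tokens across all
    conversation turns equals the number of media paths supplied."""
    token_count = sum(
        turn["value"].count(token) for turn in example["conversations"]
    )
    return token_count == len(example.get(media_key, []))

# A sharegpt-style sample mirroring the examples in this document.
sample = {
    "conversations": [
        {"from": "human", "value": "<image>human instruction"},
        {"from": "gpt", "value": "model response"},
    ],
    "images": ["image path (required)"],
}
```

The same helper applies to video and audio samples by passing `media_key="audios", token="<audio>"`, since all three modalities follow the same placeholder convention.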
@@ -374,6 +379,47 @@ Regarding the above dataset, the *dataset description* in `dataset_info.json` sh
}
```
### Multimodal Audio Dataset
- [Example dataset](mllm_audio_demo.json)
Multimodal audio datasets require an `audios` column containing the paths to the input audio files.
The number of audio files should be identical to the number of `<audio>` tokens in the conversations.
```json
[
{
"conversations": [
{
"from": "human",
"value": "<audio>human instruction"
},
{
"from": "gpt",
"value": "model response"
}
],
"audios": [
"audio path (required)"
]
}
]
```
Regarding the above dataset, the *dataset description* in `dataset_info.json` should be:
```json
"dataset_name": {
"file_name": "data.json",
"formatting": "sharegpt",
"columns": {
"messages": "conversations",
"audios": "audios"
}
}
```
### OpenAI Format
The openai format is simply a special case of the sharegpt format, where the first message may be a system prompt.