update data readme

Former-commit-id: 81adb153b7d0b30e6cd50c9bf4ca1ccf17458611
2024-09-05 04:25:27 +08:00
parent 72222d1598
commit 4d35ace75e
2 changed files with 72 additions and 4 deletions
--- a/data/README.md
+++ b/data/README.md
@@ -23,6 +23,7 @@ Currently we support datasets in **alpaca** and **sharegpt** format.
    "system": "the column name in the dataset containing the system prompts. (default: None)",
    "tools": "the column name in the dataset containing the tool description. (default: None)",
    "images": "the column name in the dataset containing the image inputs. (default: None)",
+    "videos": "the column name in the dataset containing the video inputs. (default: None)",
    "chosen": "the column name in the dataset containing the chosen answers. (default: None)",
    "rejected": "the column name in the dataset containing the rejected answers. (default: None)",
    "kto_tag": "the column name in the dataset containing the kto tags. (default: None)"
@@ -168,11 +169,11 @@ Regarding the above dataset, the *dataset description* in `dataset_info.json` sh
 }
 ```

-### Multimodal Dataset
+### Multimodal Image Dataset

 - [Example dataset](mllm_demo.json)

-Multimodal datasets require a `images` column containing the paths to the input images.
+Multimodal image datasets require an `images` column containing the paths to the input images.

 ```json
 [
@@ -201,6 +202,39 @@ Regarding the above dataset, the *dataset description* in `dataset_info.json` sh
 }
 ```

+### Multimodal Video Dataset
+
+- [Example dataset](mllm_demo_video.json)
+
+Multimodal video datasets require a `videos` column containing the paths to the input videos.
+
+```json
+[
+  {
+    "instruction": "human instruction (required)",
+    "input": "human input (optional)",
+    "output": "model response (required)",
+    "videos": [
+      "video path (required)"
+    ]
+  }
+]
+```
+
+Regarding the above dataset, the *dataset description* in `dataset_info.json` should be:
+
+```json
+"dataset_name": {
+  "file_name": "data.json",
+  "columns": {
+    "prompt": "instruction",
+    "query": "input",
+    "response": "output",
+    "videos": "videos"
+  }
+}
+```
+
 ## Sharegpt Format

 ### Supervised Fine-Tuning Dataset