Fix args in readme (#438)
* fix commands in readme, using new arg format
* fix typo
* add required -i flag to chat_eval example runs
committed by GitHub
parent bdcc030ffa
commit d4ea28d4e2
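For context on the "new arg format": the README commands switch from underscore-style flags (`--device_batch_size`) to dash-separated flags (`--device-batch-size`). A minimal, hypothetical argparse sketch of that convention follows; it is an illustration only, not nanochat's actual argument parser:

```python
# Hypothetical sketch of dash-style CLI flags; not nanochat's actual parser.
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--depth", type=int, default=20)
# Dash-separated on the command line...
parser.add_argument("--device-batch-size", type=int, default=32)

args = parser.parse_args(["--depth=26", "--device-batch-size=16"])
# ...but argparse converts dashes to underscores for the attribute name:
print(args.depth, args.device_batch_size)  # 26 16
```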
@@ -82,10 +82,10 @@ That said, to give a sense, the example changes needed for the [speedrun.sh](spe
python -m nanochat.dataset -n 450 &
...
# use --depth to increase model size. to not oom, halve device batch size 32 -> 16:
-torchrun --standalone --nproc_per_node=8 -m scripts.base_train -- --depth=26 --device_batch_size=16
+torchrun --standalone --nproc_per_node=8 -m scripts.base_train -- --depth=26 --device-batch-size=16
...
# make sure to use the same later during midtraining:
-torchrun --standalone --nproc_per_node=8 -m scripts.mid_train -- --device_batch_size=16
+torchrun --standalone --nproc_per_node=8 -m scripts.mid_train -- --device-batch-size=16
```

That's it! The biggest thing to pay attention to is making sure you have enough data shards to train on (the code will loop and do more epochs over the same training set otherwise, decreasing learning speed a bit), and managing your memory/VRAM, primarily by decreasing the `device_batch_size` until things fit (the scripts automatically compensate by increasing the number of gradient accumulation loops, simply turning parallel compute to sequential compute).
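To make the "automatic compensation" concrete, here is a minimal sketch of the arithmetic involved, assuming a fixed total batch size per optimizer step; the names below (`total_batch_size`, `grad_accum_steps`) are illustrative, not necessarily nanochat's actual variables:

```python
# Minimal sketch of gradient accumulation compensation; illustrative names,
# not nanochat's actual code.

def grad_accum_steps(total_batch_size: int, device_batch_size: int, world_size: int) -> int:
    """Sequential micro-batches per optimizer step so that
    device_batch_size * world_size * steps == total_batch_size."""
    per_micro_step = device_batch_size * world_size
    assert total_batch_size % per_micro_step == 0, "sizes must divide evenly"
    return total_batch_size // per_micro_step

# Halving --device-batch-size 32 -> 16 on 8 GPUs doubles the number of
# accumulation loops, so the effective batch size (and the optimization
# trajectory) is unchanged; only wall-clock time per step goes up.
print(grad_accum_steps(512, 32, 8))  # -> 2
print(grad_accum_steps(512, 16, 8))  # -> 4
```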