update log with a bunch of attempts

dev/LOG.md
@@ -4,6 +4,22 @@ A running summary documenting some experiments and findings. Started ~Jan 7 2026

---

## 2026-01-16: Modded-nanogpt Ideas Sweep (Mostly Negative)

Tested several architectural ideas from modded-nanogpt to see if they transfer to nanochat. None of them helped (minimal sketches of two of these ideas follow the table):

| Idea | Result | Notes |
|------|--------|-------|
| Half-truncated RoPE | No improvement | Only first half of head dims get RoPE (base 1024, linspace). Second half "stationary". |
| Asymmetric softcap | Slightly worse | `23 * sigmoid((x+5)/7.5)` vs our symmetric `15 * tanh(x/15)`. May only help with FP8. |
| Smear gate | Negligible | Blend each token with predecessor via learned gate. Tiny improvement not worth n_embd² params. |
| Backout | No improvement | Save activations at ~60% through network, subtract scaled version at end. |
| Skip connection | Slightly worse | Save at layer ~25%, add at layer ~50%. Also +2GB memory from storing activations. |
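
For reference, a minimal sketch of two of the ideas above: the symmetric vs asymmetric logit softcap (formulas as in the table) and the half-truncated RoPE frequency schedule. The helper names, shape conventions and defaults here are illustrative assumptions, not the code that was benchmarked:

```python
import torch

# Logit softcaps: our symmetric tanh cap vs. the asymmetric sigmoid cap
# that was tried here (formulas as listed in the table above).
def softcap_symmetric(x: torch.Tensor) -> torch.Tensor:
    return 15 * torch.tanh(x / 15)             # odd function, saturates at +/-15

def softcap_asymmetric(x: torch.Tensor) -> torch.Tensor:
    return 23 * torch.sigmoid((x + 5) / 7.5)   # range (0, 23), midpoint at x = -5

# Half-truncated RoPE: only the first half of the head dims get a nonzero
# rotary frequency (base 1024, linspace over the exponent); the second half
# gets frequency 0 and therefore stays "stationary" (no rotation).
def half_truncated_rope_freqs(head_dim: int, base: float = 1024.0) -> torch.Tensor:
    n_pairs = head_dim // 2                    # RoPE rotates dims in pairs
    active = n_pairs // 2                      # only half of the pairs rotate
    freqs = (1.0 / base) ** torch.linspace(0, 1, active)
    return torch.cat([freqs, torch.zeros(n_pairs - active)])
```

Turning these frequencies into the usual cos/sin tables and applying them to queries and keys is unchanged from standard RoPE; the zero frequencies simply leave the second half of each head untouched by position.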

Value Embeddings do show promise. I need a more elaborate exploration of a few related ideas, which I leave for tomorrow.

---

## 2026-01-15: Olmo pretraining mix (Negative result)

I attempted to train on the Olmo 3 pretraining dataset [allenai/dolma3_mix-6T](https://huggingface.co/datasets/allenai/dolma3_mix-6T) instead of FineWeb-edu. I ran into a number of [errors and issues](https://huggingface.co/datasets/allenai/dolma3_mix-6T/discussions/2) while downloading and processing the dataset, and then noticed some quality problems (e.g. some documents are extremely short, like just "5"). I managed to work around these with some sensible hacks (e.g. rejecting documents shorter than 100 characters; a sketch follows), otherwise tried to process the dataset exactly as FineWeb, re-trained the tokenizer, and trained a d16 model. The CORE score decreased from 15.5 to 13.8, i.e. the result is quite a bit worse.
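
For illustration, the length-filter hack amounts to something like this sketch (the real preprocessing pipeline differs, and the `train` split and `text` field names are assumptions about the dataset schema):

```python
from datasets import load_dataset

# Stream the Olmo 3 mix and drop degenerate ultra-short documents
# (e.g. documents whose entire text is just "5") before tokenization.
MIN_CHARS = 100  # reject documents shorter than this

ds = load_dataset("allenai/dolma3_mix-6T", split="train", streaming=True)
ds = ds.filter(lambda doc: len(doc.get("text") or "") >= MIN_CHARS)
```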

@@ -12,6 +28,8 @@ I am still looking to try the [DCLM dataset](https://arxiv.org/abs/2406.11794),

Classifying this as a negative result and reverting back to FineWeb-edu for now.

---

## 2026-01-13: Varlen Attention (Negative Result)
Attempted to prevent attention from "leaking" across document boundaries using Flash Attention's `flash_attn_varlen_func`, similar to modded-nanogpt's approach.
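
Roughly, the mechanism is: flatten the batch to `(total_tokens, n_heads, head_dim)`, compute cumulative per-document offsets, and pass them to `flash_attn_varlen_func` so attention never crosses a document boundary. The sketch below is illustrative, not the actual patch; in particular it assumes each document begins with a BOS token id:

```python
import torch
from flash_attn import flash_attn_varlen_func

def doc_isolated_attention(q, k, v, input_ids, bos_token_id):
    # q, k, v: (total_tokens, n_heads, head_dim), flattened across the batch
    # input_ids: (total_tokens,), with a BOS token at the start of each document
    device = input_ids.device
    # Document start positions -> cumulative sequence lengths [0, ..., total_tokens]
    starts = (input_ids == bos_token_id).nonzero(as_tuple=True)[0]
    starts = torch.unique(torch.cat([torch.zeros(1, dtype=torch.long, device=device), starts]))
    cu_seqlens = torch.cat([starts, torch.tensor([input_ids.numel()], device=device)]).to(torch.int32)
    max_seqlen = int((cu_seqlens[1:] - cu_seqlens[:-1]).max())
    # Each document attends only within itself (causally); no cross-document leakage.
    return flash_attn_varlen_func(
        q, k, v,
        cu_seqlens_q=cu_seqlens, cu_seqlens_k=cu_seqlens,
        max_seqlen_q=max_seqlen, max_seqlen_k=max_seqlen,
        causal=True,
    )
```

`cu_seqlens` is the int32 offsets tensor the varlen kernel expects, so no padding or explicit attention mask needs to be materialized.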