Mirror of https://github.com/karpathy/nanochat.git, synced 2026-01-30 04:22:02 +00:00
Add alternating window size patterns for the GPT layers, following GPT-3. Experimented a bit and found the pattern SSSL (3 short, 1 long, alternating) to work well. This is now the new default, and the flops vs. bpb plots look quite a bit better.
dev/LOG.md (+16)
@@ -4,6 +4,22 @@ A running summary documenting some experiments and findings. Started ~Jan 7 2026
---
## 2026-01-11: Sliding Window Attention
Added configurable sliding window attention, inspired by GPT-3's alternating short/long pattern.
**Pattern string configuration:**
- New `--window_pattern` CLI arg and `GPTConfig.window_pattern` field
- Pattern is tiled across layers (e.g., `SSSL` for 20 layers → `SSSLSSSLSSSLSSSLSSSL`); see the tiling sketch after this list
- Final layer always forced to L (full context) regardless of pattern
- Short window = `sequence_len // 2`
- Long window = `sequence_len` (full context)
- All previous models were effectively all-`L`; checkpoint loading fills in this param for old checkpoints (see `_patch_missing_config_keys`)
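
For concreteness, a minimal sketch of how a pattern string could be expanded into per-layer window sizes under the rules above (the function name `expand_window_pattern` is illustrative, not necessarily how nanochat implements it):

```python
def expand_window_pattern(pattern: str, n_layer: int, sequence_len: int) -> list[int]:
    """Tile a pattern like 'SSSL' across n_layer layers and map each character
    to a window size: S -> sequence_len // 2, L -> sequence_len (full context)."""
    # Tile the pattern so it covers every layer, then truncate to n_layer.
    tiled = (pattern * ((n_layer + len(pattern) - 1) // len(pattern)))[:n_layer]
    # The final layer is always forced to L regardless of the pattern.
    tiled = tiled[:-1] + "L"
    window = {"S": sequence_len // 2, "L": sequence_len}
    return [window[c] for c in tiled]

# e.g. 20 layers at sequence_len=2048 with the default 'SSSL' pattern:
# [1024, 1024, 1024, 2048, 1024, 1024, 1024, 2048, ...]
print(expand_window_pattern("SSSL", 20, 2048))
```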
Quick experiments showed that `SSSL` (every 4th layer long) works well, giving a good balance between compute savings and model quality. This is now the default.
---
## 2026-01-11: Flash Attention 3 Integration
Replaced PyTorch's `scaled_dot_product_attention` (FA2) with Flash Attention 3 for training and inference.
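
For orientation, a rough sketch of the swap at an attention call site. The `scaled_dot_product_attention` half is the stock PyTorch API; the flash-attn half assumes the FA2-style `flash_attn_func` calling convention ((batch, seqlen, heads, head_dim) layout, `causal` and `window_size` kwargs), which FA3's `flash_attn_interface` largely mirrors. The exact FA3 import path and return convention depend on the installed flash-attn build, so treat this as an assumption rather than nanochat's actual code:

```python
import torch
import torch.nn.functional as F
from flash_attn import flash_attn_func  # assumption: FA2-style interface as a stand-in for FA3

B, H, T, D = 2, 8, 1024, 64
q = torch.randn(B, H, T, D, device="cuda", dtype=torch.bfloat16)
k = torch.randn(B, H, T, D, device="cuda", dtype=torch.bfloat16)
v = torch.randn(B, H, T, D, device="cuda", dtype=torch.bfloat16)

# Before: PyTorch SDPA, which takes (B, H, T, D) and dispatches to FA2 when it can.
y_sdpa = F.scaled_dot_product_attention(q, k, v, is_causal=True)

# After: calling the flash-attn kernel directly, which expects (B, T, H, D).
window = 1024 // 2  # e.g. a "short" layer: sequence_len // 2
y_fa = flash_attn_func(
    q.transpose(1, 2), k.transpose(1, 2), v.transpose(1, 2),
    causal=True,
    # flash-attn's (left, right) window: each query sees itself and up to window-1 past tokens
    window_size=(window - 1, 0),
)
y_fa = y_fa.transpose(1, 2)  # back to (B, H, T, D) to match the SDPA output layout
```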