add alternating window size patterns for the GPT layers, following GPT-3. Experimented a bit and found the pattern SSSL (3 short, 1 long, alternating) to work well. This is now the new default, and the plots look quite a bit better on flops vs. bpb

Andrej Karpathy
2026-01-11 21:49:54 +00:00
parent 2ff7d51252
commit fbc1484e8c
4 changed files with 83 additions and 13 deletions


@@ -4,6 +4,22 @@ A running summary documenting some experiments and findings. Started ~Jan 7 2026
---
## 2026-01-11: Sliding Window Attention
Added configurable sliding window attention, inspired by GPT-3's alternating short/long pattern.
**Pattern string configuration:**
- New `--window_pattern` CLI arg and `GPTConfig.window_pattern` field
- Pattern is tiled across layers (e.g., `SSSL` for 20 layers → `SSSLSSSLSSSLSSSLSSSL`)
- Final layer always forced to L (full context) regardless of pattern
- Short window = `sequence_len // 2`
- Long window = `sequence_len` (full context)
- All previous models were effectively all-`L` (full context in every layer); checkpoint loading is modified to backfill this param for old models, see `_patch_missing_config_keys`
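Putting the rules above together, a minimal sketch of the pattern expansion (an illustrative helper; name and exact logic are assumptions, not necessarily the repo's code):

```python
def expand_window_pattern(pattern: str, n_layer: int, sequence_len: int) -> list[int]:
    """Tile a pattern like 'SSSL' across n_layer layers and map each
    character to a window size (sketch, not the repo's actual helper)."""
    assert pattern and set(pattern) <= {"S", "L"}
    # Tile the pattern to cover all layers, e.g. "SSSL" over 20 layers -> "SSSLSSSL..."
    tiled = (pattern * n_layer)[:n_layer]
    # S = half context, L = full context
    sizes = [sequence_len // 2 if c == "S" else sequence_len for c in tiled]
    sizes[-1] = sequence_len  # final layer is always forced to full context
    return sizes

# e.g. 8 layers at sequence_len=2048 -> [1024, 1024, 1024, 2048, 1024, 1024, 1024, 2048]
print(expand_window_pattern("SSSL", 8, 2048))
```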
Quick experiments showed `SSSL` (every 4th layer long) works well, striking a good balance between compute savings and model quality. This is now the default.
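For the checkpoint backfill mentioned in the last bullet above, a guess at the shape of the patch step (the actual `_patch_missing_config_keys` in the repo may well differ):

```python
def _patch_missing_config_keys(config_kwargs: dict) -> dict:
    # Sketch/assumption: checkpoints saved before this change have no
    # window_pattern key; those models attended with full context in every
    # layer, which is exactly what an all-"L" pattern reproduces.
    config_kwargs.setdefault("window_pattern", "L")
    return config_kwargs
```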
---
## 2026-01-11: Flash Attention 3 Integration
Replaced PyTorch's `scaled_dot_product_attention` (FA2) with Flash Attention 3 for training and inference.
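Since the fused kernel replaces `scaled_dot_product_attention`, a plain-SDPA reference of causal sliding-window attention is handy for sanity-checking outputs. A minimal sketch under the window convention above (illustrative only, not code from the repo):

```python
import torch
import torch.nn.functional as F

def sliding_window_sdpa_ref(q, k, v, window: int):
    """Reference: token t attends to keys in [t - window + 1, t].
    q, k, v are (B, H, T, D); in the boolean mask, True = may attend."""
    T = q.size(-2)
    idx = torch.arange(T, device=q.device)
    causal = idx[None, :] <= idx[:, None]          # key index <= query index
    banded = idx[:, None] - idx[None, :] < window  # key within the window
    return F.scaled_dot_product_attention(q, k, v, attn_mask=causal & banded)

q = k = v = torch.randn(1, 4, 128, 32)
out = sliding_window_sdpa_ref(q, k, v, window=64)  # "short" window reference
```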