delete grad_clip. It appears to not be necessary at all: not only was it buggy (the clipping happened per GPU, before grad synchronization), but it costs ~2% MFU, and it doesn't even help. I tried deleting it a while ago and back then it did help, so I'm guessing that some hyperparameter tuning since then has obviated the reason for it.
dev/LOG.md (new file, +23 lines)

@@ -0,0 +1,23 @@
# Experiment Log

A running summary documenting some experiments and findings. Started ~Jan 7 2026.

---

## 2026-01-08: exp_grad_clip - Gradient Clipping

**Hypothesis:** Gradient clipping may be unnecessary overhead. Tested L2 norm clipping at various thresholds (0.25, 0.5, 1.0, 2.0) and elementwise clipping.
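
For concreteness, the two families of variants tested map onto standard PyTorch utilities. A minimal toy sketch (not the experiment code itself; the `nn.Linear` model is just there to produce gradients):

```python
import torch
import torch.nn as nn

# Toy setup purely to have some gradients to clip.
model = nn.Linear(16, 16)
loss = model(torch.randn(8, 16)).square().mean()
loss.backward()

# Variant 1: L2-norm clipping. Rescales the full gradient vector if its global
# L2 norm exceeds the threshold (0.25 / 0.5 / 1.0 / 2.0 were the values tested).
# The call returns the total norm as measured *before* any rescaling.
total_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
print(f"pre-clip grad norm: {total_norm.item():.4f}")

# Variant 2: elementwise clipping (shown on the same toy grads for illustration).
# Clamps each gradient element into [-clip_value, +clip_value], ignoring the norm.
torch.nn.utils.clip_grad_value_(model.parameters(), clip_value=1.0)
```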

**Results:**
- No benefit at any scale tested (d12, d20)
- All variants within noise (~0.9827 val_bpb)
- Grad norm never exceeds 1.0 naturally, so clipping is always inactive (see the logging sketch after this list)
- Clipping adds ~2% time overhead from the all-reduce
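
One way to check the "never exceeds 1.0" observation is to log the gradient norm each step without clipping anything. A minimal sketch (hypothetical helper name; note that without DDP this measures only the current rank's local norm, since gradient sync happens later inside the optimizers):

```python
import torch

def grad_norm(params) -> float:
    """Global L2 norm over all parameter gradients, computed without modifying them."""
    norms = [p.grad.detach().norm() for p in params if p.grad is not None]
    if not norms:
        return 0.0
    return torch.norm(torch.stack(norms)).item()

# In the training loop, after loss.backward() and before optimizer.step():
#   print(f"step {step}: grad norm = {grad_norm(model.parameters()):.4f}")
# If this stays comfortably below 1.0, a clip threshold of 1.0 never fires.
```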

**Bug Found:** The original implementation clipped local gradients before sync. Since this codebase doesn't use DDP (gradient sync happens inside the optimizers), each rank was clipping based on its own local norm. Fixed on the branch with a proper distributed all-reduce.
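
The fix has to aggregate the norm across ranks before deciding how much to scale, so that every rank applies the same factor. A sketch of that general pattern (hypothetical helper, not necessarily the exact code on the branch; the right reduction semantics depend on how the optimizers average gradients):

```python
import torch
import torch.distributed as dist

def clip_grad_norm_distributed(params, max_norm: float) -> torch.Tensor:
    """Clip against a gradient norm that all ranks agree on.

    Each rank contributes its local sum of squared gradient elements; an
    all-reduce turns that into one global quantity, so every rank applies
    the same scaling factor (the bug was that each rank clipped against
    its own local norm instead).
    """
    params = [p for p in params if p.grad is not None]
    assert params, "no gradients to clip"
    sq_sum = torch.zeros((), device=params[0].grad.device)
    for p in params:
        sq_sum += p.grad.detach().float().pow(2).sum()
    if dist.is_available() and dist.is_initialized():
        dist.all_reduce(sq_sum, op=dist.ReduceOp.SUM)  # this all-reduce is the ~2% overhead
    total_norm = sq_sum.sqrt()
    scale = (max_norm / (total_norm + 1e-6)).clamp(max=1.0)
    for p in params:
        p.grad.detach().mul_(scale.to(p.grad.dtype))
    return total_norm
```

Gating the call behind the `--grad_clip` value (only clipping when it is > 0.0) keeps the default path free of both the extra all-reduce and the rescaling.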

**Observation:** modded-nanogpt does not appear to clip either right now.

**Recommendation:** Disable by default (`--grad_clip=0.0`). The code naturally produces well-behaved gradients.

---