Compare commits

3 Commits

| Author | SHA1 | Message | Date |
|--------|------|---------|------|
| Andrej Karpathy | `067daa7758` | small fix cpu script ty PR #474 | 2026-01-30 02:11:25 +00:00 |
| Andrej Karpathy | `6a341f2ecf` | contiguous views and single HtoD transfer for inputs/targets much cleaner | 2026-01-30 00:23:01 +00:00 |
| Andrej Karpathy | `ebd4d9bbf5` | tried muonh, appealing but didn't work out of the box | 2026-01-29 19:01:36 +00:00 |
3 changed files with 40 additions and 8 deletions

View File

@@ -4,6 +4,27 @@ A running summary documenting some experiments and findings. Started ~Jan 7 2026
---
## 2026-01-29: Hyperball/MuonH Experiments (Negative Result)
Explored Hyperball optimization from [this post](https://psychedelic-sunstone-851.notion.site/Fantastic-Pretraining-Optimizers-and-Where-to-Find-Them-2-1-Hyperball-Optimization-2e924306e6f280e7a5ffee00eb40a0dd) (saved to `knowledge/muonh.md`). It constrains each weight to a sphere of radius R (its initial norm): `W_{t+1} = R · Normalize(W_t - η·R · Normalize(u_t))`. Had to change a number of details in a branch, e.g. not using zero init for our projections (otherwise the initial norm would be zero), keeping track of the initial norm, and adjusting Muon -> MuonH for the update.
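A minimal sketch of the update rule (hypothetical helper, not the branch's actual code; assumes `u` is the Muon-orthogonalized update for a matrix parameter `W`, `R` is `W`'s Frobenius norm at init, and Normalize divides by the Frobenius norm):

```python
import torch

def muonh_update(W: torch.Tensor, u: torch.Tensor, R: float, lr: float) -> torch.Tensor:
    # W_{t+1} = R * Normalize(W_t - lr * R * Normalize(u_t))
    u_hat = u / (u.norm() + 1e-8)                # Normalize(u_t)
    W_step = W - lr * R * u_hat                  # R-scaled step in the descent direction
    return R * W_step / (W_step.norm() + 1e-8)   # retract back onto the sphere of radius R
```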
Experiments on d12:
| Experiment | Result |
|------------|--------|
| MuonH for matrix params | Worse than baseline |
| MuonH + LR sweep (2.5e-3 to 1e-2) | Still worse |
| Added learnable RMSNorm scales (paper says γ preserves expressivity) | Still worse |
| Various RMSNorm init tweaks, e.g. 0 at init to residual | Still worse |
| AdamH for lm_head (paper recommends this) | Broken - loss plateaus (see below) |
| AdamH + learnable output scales | Still worse |
Could not outperform the baseline implementation. The article doesn't go into much detail on exactly how AdamH is applied to `lm_head`. The classifier layer has to be able to grow in magnitude to make increasingly confident predictions over time. Tried a sensible version with an added 0-D learnable scalar, and also RMSNorms with per-channel learnable scales both pre and post the resnet blocks.
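For concreteness, a minimal sketch of the 0-D learnable scalar idea (hypothetical module, not the branch's actual code; names and init values are illustrative):

```python
import torch
import torch.nn as nn

class ScaledLMHead(nn.Module):
    """lm_head whose weight is meant to stay on a fixed-radius sphere (AdamH-style);
    a learnable 0-D scalar lets the logits grow in magnitude anyway."""
    def __init__(self, n_embd: int, vocab_size: int):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(vocab_size, n_embd) * 0.02)
        self.scale = nn.Parameter(torch.ones(()))  # 0-D learnable scalar

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.scale * (x @ self.weight.t())
```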
**Result:** This was not an out-of-the-box win for nanochat, even after a few hours of mild tuning and debugging. The idea itself is intuitively appealing; might come back around to try harder later.
---
## 2026-01-28: Reverted Bigram Hash Embeddings
Removed bigram embeddings (engram-lite) from the codebase. At larger scale (d25), the improvement was tiny and disappeared entirely when measured in wall-clock time. It also bloated VRAM usage. The extra parameters and complexity aren't justified.

View File

@@ -154,6 +154,16 @@ def tokenizing_distributed_data_loader_with_state_bos_bestfit(
     for tokens in token_lists:
         doc_buffer.append(tokens)
+    # Pre-allocate buffers once: layout is [inputs (B*T) | targets (B*T)]
+    # This gives us contiguous views and a single HtoD transfer
+    use_cuda = device == "cuda"
+    cpu_buffer = torch.empty(2 * B * T, dtype=torch.long, pin_memory=use_cuda) # staging area (CPU)
+    gpu_buffer = torch.empty(2 * B * T, dtype=torch.long, device=device) # on-device buffer
+    cpu_inputs = cpu_buffer[:B * T].view(B, T) # a few views into these buffers just for convenience
+    cpu_targets = cpu_buffer[B * T:].view(B, T)
+    inputs = gpu_buffer[:B * T].view(B, T)
+    targets = gpu_buffer[B * T:].view(B, T)
     while True:
         rows = []
         for _ in range(B):
@@ -185,13 +195,16 @@ def tokenizing_distributed_data_loader_with_state_bos_bestfit(
             rows.append(row[:row_capacity])
-        use_cuda = device == "cuda"
-        batch_tensor = torch.tensor(rows, dtype=torch.long, pin_memory=use_cuda)
-        inputs = batch_tensor[:, :-1].to(device=device, non_blocking=use_cuda)
-        targets = batch_tensor[:, 1:].to(device=device, non_blocking=use_cuda)
+        # Convert rows to tensor and copy slices to pinned buffer (CPU work)
+        row_data = torch.tensor(rows, dtype=torch.long) # [B, T+1], temporary
+        cpu_inputs.copy_(row_data[:, :-1])
+        cpu_targets.copy_(row_data[:, 1:])
-        yield inputs, targets, {"pq_idx": pq_idx, "rg_idx": rg_idx, "epoch": epoch}
+        state_dict = {"pq_idx": pq_idx, "rg_idx": rg_idx, "epoch": epoch}
+        # Single HtoD copy into persistent GPU buffer and yield
+        gpu_buffer.copy_(cpu_buffer, non_blocking=use_cuda)
+        yield inputs, targets, state_dict

 def tokenizing_distributed_data_loader_bos_bestfit(*args, **kwargs):
     """Helper that omits state_dict from yields."""

View File

@@ -4,15 +4,13 @@
 # This script was last updated/tuned on Jan 17, 2026.
 # Run as:
-# bash dev/cpu_demo_run.sh
+# bash runs/runcpu.sh
 # NOTE: Training LLMs requires GPU compute and $$$. You will not get far on your Macbook.
 # Think of this run as an educational/fun demo, not something you should expect to work well.
-# (This is why I hide this script away in dev/)
 # You may also want to run this script manually, one by one, copy-pasting commands into your terminal.
 # all the setup stuff
 export OMP_NUM_THREADS=1
 export NANOCHAT_BASE_DIR="$HOME/.cache/nanochat"
 mkdir -p $NANOCHAT_BASE_DIR
 command -v uv &> /dev/null || curl -LsSf https://astral.sh/uv/install.sh | sh