Compare commits

3 Commits

| Author | SHA1 | Message | Date |
|--------|------|---------|------|
| Andrej Karpathy | `067daa7758` | small fix cpu script ty PR #474 | 2026-01-30 02:11:25 +00:00 |
| Andrej Karpathy | `6a341f2ecf` | contiguous views and single HtoD transfer for inputs/targets much cleaner | 2026-01-30 00:23:01 +00:00 |
| Andrej Karpathy | `ebd4d9bbf5` | tried muonh, appealing but didn't work out of the box | 2026-01-29 19:01:36 +00:00 |
3 changed files with 40 additions and 8 deletions

View File

@@ -4,6 +4,27 @@ A running summary documenting some experiments and findings. Started ~Jan 7 2026
---
## 2026-01-29: Hyperball/MuonH Experiments (Negative Result)
Explored Hyperball optimization from [this post](https://psychedelic-sunstone-851.notion.site/Fantastic-Pretraining-Optimizers-and-Where-to-Find-Them-2-1-Hyperball-Optimization-2e924306e6f280e7a5ffee00eb40a0dd) (saved to `knowledge/muonh.md`). It constrains each weight to a sphere of radius R (its initial norm): `W_{t+1} = R · Normalize(W_t - η·R · Normalize(u_t))`. Had to change a number of details in a branch, e.g. not using zero init for our projections (otherwise the initial norm would be zero), keeping track of the initial norm, and adjusting Muon -> MuonH for the update.
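A minimal sketch of the update rule (hypothetical helper, not the branch's actual code; assumes `u` is the Muon-orthogonalized update for a matrix parameter `W`, `R` is `W`'s Frobenius norm at init, and Normalize divides by the Frobenius norm):

```python
import torch

def muonh_update(W: torch.Tensor, u: torch.Tensor, R: float, lr: float) -> torch.Tensor:
    # W_{t+1} = R * Normalize(W_t - lr * R * Normalize(u_t))
    u_hat = u / (u.norm() + 1e-8)                # Normalize(u_t)
    W_step = W - lr * R * u_hat                  # R-scaled step in the descent direction
    return R * W_step / (W_step.norm() + 1e-8)   # retract back onto the sphere of radius R
```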
Experiments on d12:
| Experiment | Result |
|------------|--------|
| MuonH for matrix params | Worse than baseline |
| MuonH + LR sweep (2.5e-3 to 1e-2) | Still worse |
| Added learnable RMSNorm scales (paper says γ preserves expressivity) | Still worse |
| Various RMSNorm init tweaks, e.g. 0 at init to residual | Still worse |
| AdamH for lm_head (paper recommends this) | Broken - loss plateaus (see below) |
| AdamH + learnable output scales | Still worse |
Could not outperform the baseline implementation. The article doesn't go into much detail on exactly how AdamH is applied to `lm_head`. The classifier layer has to be able to grow in magnitude to make increasingly confident predictions over time. Tried a sensible version with an added 0-D learnable scalar, and also RMSNorms with per-channel learnable scales both pre and post the resnet blocks.
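For concreteness, a minimal sketch of the 0-D learnable scalar idea (hypothetical module, not the branch's actual code; names and init values are illustrative):

```python
import torch
import torch.nn as nn

class ScaledLMHead(nn.Module):
    """lm_head whose weight is meant to stay on a fixed-radius sphere (AdamH-style);
    a learnable 0-D scalar lets the logits grow in magnitude anyway."""
    def __init__(self, n_embd: int, vocab_size: int):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(vocab_size, n_embd) * 0.02)
        self.scale = nn.Parameter(torch.ones(()))  # 0-D learnable scalar

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.scale * (x @ self.weight.t())
```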
**Result:** This was not an out-of-the-box win for nanochat, even after a few hours of mild tuning and debugging. The idea itself is intuitively appealing; might come back around to try harder later.
---
## 2026-01-28: Reverted Bigram Hash Embeddings
Removed bigram embeddings (engram-lite) from the codebase. At larger scale (d25), the improvement was tiny and disappeared entirely when measured in wall-clock time. It also bloated VRAM usage. The extra parameters and complexity aren't justified.

View File

@@ -154,6 +154,16 @@ def tokenizing_distributed_data_loader_with_state_bos_bestfit(
     for tokens in token_lists:
         doc_buffer.append(tokens)
+    # Pre-allocate buffers once: layout is [inputs (B*T) | targets (B*T)]
+    # This gives us contiguous views and a single HtoD transfer
+    use_cuda = device == "cuda"
+    cpu_buffer = torch.empty(2 * B * T, dtype=torch.long, pin_memory=use_cuda) # staging area (CPU)
+    gpu_buffer = torch.empty(2 * B * T, dtype=torch.long, device=device) # on-device buffer
+    cpu_inputs = cpu_buffer[:B * T].view(B, T) # a few views into these buffers just for convenience
+    cpu_targets = cpu_buffer[B * T:].view(B, T)
+    inputs = gpu_buffer[:B * T].view(B, T)
+    targets = gpu_buffer[B * T:].view(B, T)
     while True:
         rows = []
         for _ in range(B):
@@ -185,13 +195,16 @@ def tokenizing_distributed_data_loader_with_state_bos_bestfit(
             rows.append(row[:row_capacity])
-        use_cuda = device == "cuda"
-        batch_tensor = torch.tensor(rows, dtype=torch.long, pin_memory=use_cuda)
-        inputs = batch_tensor[:, :-1].to(device=device, non_blocking=use_cuda)
-        targets = batch_tensor[:, 1:].to(device=device, non_blocking=use_cuda)
+        # Convert rows to tensor and copy slices to pinned buffer (CPU work)
+        row_data = torch.tensor(rows, dtype=torch.long) # [B, T+1], temporary
+        cpu_inputs.copy_(row_data[:, :-1])
+        cpu_targets.copy_(row_data[:, 1:])
-        yield inputs, targets, {"pq_idx": pq_idx, "rg_idx": rg_idx, "epoch": epoch}
+        state_dict = {"pq_idx": pq_idx, "rg_idx": rg_idx, "epoch": epoch}
+        # Single HtoD copy into persistent GPU buffer and yield
+        gpu_buffer.copy_(cpu_buffer, non_blocking=use_cuda)
+        yield inputs, targets, state_dict

 def tokenizing_distributed_data_loader_bos_bestfit(*args, **kwargs):
     """Helper that omits state_dict from yields."""

View File

@@ -4,15 +4,13 @@
 # This script was last updated/tuned on Jan 17, 2026.
 # Run as:
-# bash dev/cpu_demo_run.sh
+# bash runs/runcpu.sh
 # NOTE: Training LLMs requires GPU compute and $$$. You will not get far on your Macbook.
 # Think of this run as an educational/fun demo, not something you should expect to work well.
-# (This is why I hide this script away in dev/)
 # You may also want to run this script manually, one by one, copy-pasting commands into your terminal.
 # all the setup stuff
 export OMP_NUM_THREADS=1
 export NANOCHAT_BASE_DIR="$HOME/.cache/nanochat"
 mkdir -p $NANOCHAT_BASE_DIR
 command -v uv &> /dev/null || curl -LsSf https://astral.sh/uv/install.sh | sh