Commit Graph

229 Commits

Author SHA1 Message Date
Andrej Karpathy
7312ec9898 fix buggy midtrain and update all kwargs to be idiomatic: argparse flags use dashes, variables use underscores. The underscore-style flags were just a remnant of the previous Configurator object. This is the right way. 2026-01-13 22:45:27 +00:00
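For reference, argparse itself implements this convention: dashes in a long flag become underscores in the parsed attribute name. A minimal sketch (flag names are illustrative):

```python
import argparse

# argparse converts dashes in long flags to underscores in the attribute
# name, so `--weight-decay` on the CLI becomes `args.weight_decay` in code.
parser = argparse.ArgumentParser()
parser.add_argument("--weight-decay", type=float, default=0.0)
parser.add_argument("--num-iterations", type=int, default=1000)
args = parser.parse_args(["--weight-decay", "0.1"])
print(args.weight_decay)    # 0.1
print(args.num_iterations)  # 1000
```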
Andrej Karpathy
3b50b77ed3 fix base_loss to report correct loss by switching the dataloader to the new default 2026-01-13 22:09:36 +00:00
Andrej Karpathy
f92efce169 add negative result about not allowing attention across BOS tokens. A lot more code complexity for basically no gain in performance 2026-01-13 21:33:54 +00:00
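A hedged sketch of the idea that was rejected here, assuming the usual document-masking formulation; all names are illustrative, not the actual code:

```python
import torch

def document_mask(tokens: torch.Tensor, bos_id: int) -> torch.Tensor:
    # tokens: (T,) -> bool mask (T, T), True = query may attend to key.
    # On top of causality, forbid attention across BOS boundaries so that
    # each BOS-delimited document only attends to itself.
    T = tokens.shape[0]
    doc_id = (tokens == bos_id).cumsum(0)          # document index per token
    causal = torch.tril(torch.ones(T, T, dtype=torch.bool))
    same_doc = doc_id[:, None] == doc_id[None, :]  # block-diagonal by doc
    return causal & same_doc
```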
Andrej Karpathy
43c29dd9d5 Big DataLoader refactor: BOS-aligned dataloaders with epoch tracking for pre/mid-training
The new DataLoader ensures that every token sequence in train/val batches has a BOS token
at the beginning. Therefore, no token streams start abruptly in the middle of a document,
which could be confusing for the model. Note that this changes the loss scale because there
are fewer confusing tokens in the train/val batches. The main downside is that we now waste
about 35% of tokens due to cropping. This is ok because we have a lot of data. See dev/LOG.md
entry for this change for a lot more information.
2026-01-13 20:05:47 +00:00
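A minimal sketch of the batching idea described above, with hypothetical names; the real DataLoader is more involved (sharding, epoch tracking):

```python
import torch

def bos_aligned_rows(docs, seq_len, bos_id):
    # Each yielded row starts at a BOS token, so no row begins mid-document.
    # Tokens beyond seq_len+1 are simply dropped -- the cropping waste
    # mentioned above. docs: iterable of already-tokenized documents.
    for doc in docs:
        row = ([bos_id] + doc)[:seq_len + 1]  # +1: inputs/targets overlap
        if len(row) < 2:
            continue
        x = torch.tensor(row[:-1])  # inputs
        y = torch.tensor(row[1:])   # next-token targets
        yield x, y
```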
Andrej Karpathy
23985413aa adjust the comment on the regex pattern per a recent experiment; see dev/LOG.md 2026-01-13 17:50:39 +00:00
Andrej Karpathy
64b48d0e5c validated that \p{N}{1,2} is the correct number of digits to group up to in the regex pattern of the GPT-4 tokenizer (2 down from 3), leading to the best val_bpb for 32K vocabs 2026-01-13 17:45:06 +00:00
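To illustrate, here is the cl100k-style split pattern (quoted from memory, so treat it as an approximation) with the digit group narrowed from {1,3} to {1,2}; the `regex` module is needed for `\p{...}` classes:

```python
import regex as re  # the `regex` module supports \p{...} classes

# cl100k-style pattern with the digit group narrowed from {1,3} to {1,2}:
PATTERN = r"""'(?i:[sdmt]|ll|ve|re)|[^\r\n\p{L}\p{N}]?+\p{L}+|\p{N}{1,2}| ?[^\s\p{L}\p{N}]++[\r\n]*|\s*[\r\n]|\s+(?!\S)|\s+"""
print(re.findall(PATTERN, "trained on 32000 docs in 2026"))
# digit runs now split into chunks of at most 2: '32000' -> '32', '00', '0'
```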
Andrej Karpathy
238353c998 document my struggle with fp8 integration yesterday, it's not working like i thought it would and i suffered. one day i will return to continue the fight. 2026-01-13 17:14:29 +00:00
Andrej Karpathy
4610a838a1 record negative result on MTP 2026-01-12 05:23:47 +00:00
Andrej Karpathy
21608ec51e allow base_loss to report the loss of any arbitrary huggingface model, similar to base_eval. Had to change the dataloader to be a lot better and just take a tokenizer instead of loading the nanochat one. Much better this way anyway. 2026-01-12 03:10:13 +00:00
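A minimal sketch of what this enables, using the standard transformers API (model name illustrative): score any HF causal LM with a mean next-token loss:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "gpt2"  # any HF causal LM works in principle
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name).eval()

ids = tok("The quick brown fox jumps over the lazy dog.",
          return_tensors="pt").input_ids
with torch.no_grad():
    # labels=input_ids makes HF shift internally and return the mean
    # next-token cross-entropy over the sequence
    loss = model(input_ids=ids, labels=ids).loss
print(f"mean NLL per token: {loss.item():.3f}")
```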
Andrej Karpathy
aa95fb2e03 make miniseries more generic, easier to run, and less hard-coded 2026-01-12 02:54:35 +00:00
Andrej Karpathy
b33e394528 oops actually make SSSL the default window pattern 2026-01-11 21:50:35 +00:00
Andrej Karpathy
fbc1484e8c add alternating window size patterns for the GPT layers, following GPT-3. Experimented a bit and found the pattern SSSL to work well - 3 short, 1 long alternating. This is now the new default and the plots look quite a bit better on flops vs. bpb 2026-01-11 21:49:54 +00:00
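A sketch of how such a pattern can map onto layers; the window sizes below are made up for illustration, only the SSSL cycling comes from the commit:

```python
def layer_window_sizes(num_layers, short=512, long=2048, pattern="SSSL"):
    # Cycle the pattern across layers: SSSL = 3 short windows, 1 long.
    sizes = {"S": short, "L": long}
    return [sizes[pattern[i % len(pattern)]] for i in range(num_layers)]

print(layer_window_sizes(12))
# [512, 512, 512, 2048, 512, 512, 512, 2048, 512, 512, 512, 2048]
```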
Andrej Karpathy
2ff7d51252 integrate Flash Attention 3. +9% tok_per_sec for d12 out of the box, even with ctx as low as 2048, nice. Also sets us up to tune the window sizes, which is huge 2026-01-11 20:33:19 +00:00
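For flavor, the flash-attn Python API exposes sliding windows via a `window_size` argument; the exact import path varies across FA releases, so treat this sketch as an assumption rather than the repo's actual call:

```python
import torch
from flash_attn import flash_attn_func  # import path varies by FA release

B, T, H, D = 1, 2048, 8, 64  # batch, seq len, heads, head dim
q = torch.randn(B, T, H, D, device="cuda", dtype=torch.bfloat16)
k, v = torch.randn_like(q), torch.randn_like(q)

# causal attention restricted to a sliding window of 512 tokens back;
# window_size=(-1, -1) would mean full (unwindowed) attention
out = flash_attn_func(q, k, v, causal=True, window_size=(512, 0))
```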
Andrej Karpathy
201d705957 recover the ability to load old checkpoints by patching the lambdas if they don't exist in checkpoints 2026-01-11 20:13:12 +00:00
Andrej Karpathy
aa530cdad5 Add learnable lambdas that gate the residual connection and a skip connection to the input embeddings, solid bump to val_bpb 2026-01-11 18:47:35 +00:00
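A hypothetical sketch of the mechanism as described, not the actual module: two learnable scalars, one gating the residual stream and one gating a skip from the input embeddings x0:

```python
import torch
import torch.nn as nn

class GatedBlock(nn.Module):
    # Two learnable scalars: one gates the residual stream, one gates a
    # skip connection from the input embeddings x0.
    def __init__(self, block: nn.Module):
        super().__init__()
        self.block = block
        self.resid_lambda = nn.Parameter(torch.ones(1))  # residual gate
        self.x0_lambda = nn.Parameter(torch.zeros(1))    # x0 skip gate

    def forward(self, x, x0):
        return self.resid_lambda * x + self.x0_lambda * x0 + self.block(x)
```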
Andrej Karpathy
2c4473dd1b Big Muon optimizer changes inspired by the latest modded-nanogpt. Added Polar Express, Adafactor-style variance reduction, cautious weight decay, and a linear schedule that ramps weight decay down to zero. Tuned optimum weight decay for multiple model sizes (d8, d12, d16, d20) and found a scaling law with optimum wd ∝ 1/channels^2, which is now the default in code. --weight_decay of base_train is now on by default and configured optimally according to all of these experiments. Solid bump to val_bpb observed as a result of these changes. 2026-01-11 16:56:59 +00:00
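The scaling law part can be made concrete. Only the 1/channels^2 shape comes from the commit; the reference constants and the channels = 64·depth mapping below are placeholder assumptions:

```python
def default_weight_decay(channels, ref_channels=768, ref_wd=0.1):
    # optimum wd ~ 1/channels^2, anchored at a placeholder reference point
    return ref_wd * (ref_channels / channels) ** 2

for d, ch in [(8, 512), (12, 768), (16, 1024), (20, 1280)]:
    print(f"d{d} (channels={ch}): wd ~= {default_weight_decay(ch):.4f}")
```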
Andrej Karpathy
f5a0ea4d3f take out these gitignore dirs 2026-01-08 18:18:42 +00:00
Andrej Karpathy
4ddc803797 fix a slight adamw bug. This chunk was originally copy-pasted from modded-nanogpt, which still seems to have the bug 2026-01-08 18:18:42 +00:00
Sofie Van Landeghem
a1ccb3dc0b remove rust compilation as rustbpe is now installed from separate package (#416) 2026-01-08 06:18:37 -08:00
Andrej Karpathy
061f83c152 delete grad_clip. appears to not be necessary at all. not only was it buggy because the clipping happened per gpu before grad synchronization, but it costs ~2% MFU, and it also doesn't even help. I tried deleting it a while ago and back then it did help. So I'm guessing that some hyperparameter tuning obviated the reason for it since then 2026-01-08 02:16:50 +00:00
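For reference, the ordering bug described here: with manual gradient syncing, clipping has to happen after gradients are averaged across ranks, otherwise each GPU clips a different local gradient. A sketch of the correct order (the commit instead deletes clipping entirely):

```python
import torch
import torch.distributed as dist

def sync_then_clip(params, max_norm):
    # Correct order: average gradients across ranks FIRST, then clip the
    # synchronized gradient. Clipping before the all_reduce means every
    # GPU clips a different local gradient and ranks silently diverge.
    params = list(params)
    for p in params:
        if p.grad is not None:
            dist.all_reduce(p.grad, op=dist.ReduceOp.AVG)
    torch.nn.utils.clip_grad_norm_(params, max_norm)
```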
Andrej Karpathy
e8c30c3b19 add notebook used for scaling laws analysis 2026-01-07 22:28:53 +00:00
Andrej Karpathy
3af4dcf6ee also add the scaling_laws.sh script in case it's a useful reference 2026-01-07 22:25:13 +00:00
Andrej Karpathy
4cc605b940 quick pointer to miniseries post in readme for now 2026-01-07 22:14:21 +00:00
Andrej Karpathy
ccf4b7f9bf nudge hyperparameters of the base script with the results of the sweeps and miniseries. vocab size down to 32K. D:N ratio from 20 to 8. add miniseries script 2026-01-07 22:11:59 +00:00
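D:N here is the tokens-to-parameters ratio, so the change means training on fewer tokens per parameter. A worked example with made-up numbers:

```python
# D:N = training tokens per model parameter. With made-up numbers:
def train_tokens(num_params, dn_ratio=8):
    return dn_ratio * num_params

print(f"{train_tokens(560e6):.2e}")  # 4.48e+09 tokens (vs 1.12e+10 at 20:1)
```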
Adria Blancafort
1b5de29e71 Fix undefined variable in chat_rl after recent refactor
* Fix undefined variable

* Remove unused import 're' from chat_rl.py
2026-01-07 09:08:57 -08:00
Andrej Karpathy
ae0bf52529 tune hyperparameters based on overnight sweeps. warmdown_ratio is the biggest free win, increasing 0.2 -> 0.4, and embedding lr can be larger bumping 0.2 -> 0.3 2026-01-05 18:57:46 +00:00
Andrej Karpathy
eec0c79563 also add matplotlib dep so that we can have jupyter notebooks 2026-01-05 18:41:09 +00:00
Andrej Karpathy
54e59c38ad add notebook on deriving the CORE estimates for the GPT-3 miniseries. 2026-01-05 18:40:28 +00:00
Andrej Karpathy
9d4c9b786d many small fixes to base_train: reporting ETA, allowing some additional kwarg flexibility, making sure we don't crash when e.g. depth = 11 - we now calculate the closest num_heads that works 2026-01-05 00:38:09 +00:00
Andrej Karpathy
962b6bfba3 alright add transformers as a dep of the repo because it should be easy to evaluate the CORE score of HF models. Not super happy about it but i tried it and the uv.lock doesn't get bloated as much as i expected 2026-01-04 20:37:28 +00:00
Andrej Karpathy
ed2082fbc4 sane secrets management 2026-01-04 19:29:22 +00:00
Andrej Karpathy
eb7bbc1b66 delete the configurator in favor of argparse and clean up a lot of kwarg details to make them more consistent across all scripts 2026-01-04 19:14:23 +00:00
Andrej Karpathy
507d54224a fix small bug where this would break if git stage has deleted files 2026-01-04 19:11:43 +00:00
Andrej Karpathy
9c60dfb64c bump nanochat to use the latest stable pytorch, 2.9.1. Re-run the environment setup to update your local environment if you git pull. 2026-01-04 18:36:36 +00:00
Andrej Karpathy
be56d29b87 simplify redundant if/elif in bloat metrics
🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-04 01:40:42 +00:00
Andrej Karpathy
ee79f29fbd replace files-to-prompt with git ls-files for bloat metrics
files-to-prompt was including untracked files (knowledge/, dev scripts, etc.) which inflated the bloat metrics. now we use git ls-files to only count tracked source files, which is more accurate and removes an external dependency.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-04 01:38:15 +00:00
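A sketch of the approach, with an illustrative extension filter (the actual metric may count differently):

```python
import subprocess

def bloat_metrics():
    # count only git-tracked files so untracked scratch dirs can't inflate
    # the numbers; the extension filter here is illustrative
    files = subprocess.check_output(["git", "ls-files"], text=True).split()
    files = [f for f in files if f.endswith((".py", ".rs", ".md", ".html"))]
    lines = sum(len(open(f, encoding="utf-8").readlines()) for f in files)
    return len(files), lines

n_files, n_lines = bloat_metrics()
print(f"{n_files} tracked files, {n_lines} lines")
```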
Andrej Karpathy
da8b7ea4cb also delete the rustbpe test code; it now lives in the separate rustbpe repo 2026-01-04 01:23:34 +00:00
Andrej Karpathy
aa42f40e66 delete the inline rustbpe project. it was ugly to have a project within a project, and rustbpe is now nicely a separate repo on my github karpathy/rustbpe and it's on pypi etc., so we just add it as a dependency to uv. i think it is appropriate that this is a separate repo because 1) it doesn't have too many knobs, other than the ones that are exposed - the regex pattern and vocab size and 2) all of its complexity is not algorithmic (it's equivalent to minbpe), instead it is efficiency-related, so it is ok to hide relatively speaking 2026-01-04 01:23:34 +00:00
Andrej Karpathy
48abd7d85f simplify, clarify and slightly tune model initialization. should be very slightly better possibly, but certainly a lot clearer 2026-01-01 21:15:09 +00:00
Paweł Krefta
10231dfb40 Fix conversation scroll to bottom on some browsers + remove duplicated padding (#348) 2025-12-31 13:03:22 -08:00
helloaidank
389d019a0b small change to doc string at top of tok_train.py (#402) 2025-12-31 12:57:26 -08:00
Hossein-Lakzaei
8c89661465 Update README to match current d34 demo (#314) (#381)
* Update README: switch hosted model description from d32 to d34 per discussion #314

* link to discussion thread

* parameter in quotes

---------

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2025-12-30 10:17:11 +01:00
Andrej Karpathy
8f979a8bda fix: sample first token independently for each row in multi-sample generation
Previously, when generating multiple samples (num_samples > 1), the first
token after prefill was sampled once and broadcast to all rows, causing
all samples to start identically. Now the prefill logits are expanded to
num_samples and sampled independently for each row.

Also simplified the generation loop by moving the forward pass to the end
of the loop, eliminating the first_iteration flag and if/else branching.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-28 04:52:13 +00:00
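The core of the fix, sketched with illustrative names: torch.multinomial on a 2D tensor samples each row independently, so expanding the prefill logits to num_samples rows decorrelates the first tokens:

```python
import torch

def sample_first_tokens(prefill_logits, num_samples, temperature=1.0):
    # prefill_logits: (1, vocab_size) from the shared prompt prefill.
    # Expand to one row per sample, then sample each row independently;
    # torch.multinomial draws per-row for a 2D input.
    logits = prefill_logits.expand(num_samples, -1) / temperature
    probs = torch.softmax(logits, dim=-1)
    return torch.multinomial(probs, num_samples=1)  # (num_samples, 1)
```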
Dipesh Babu
2f2d7ab80c fix: safe DDP cleanup (check initialized PG, not just env) (#256) 2025-12-27 20:27:40 -08:00
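The safe pattern in question, using real torch.distributed calls: env vars like RANK can be set even when no process group was ever created, so check the process group itself:

```python
import torch.distributed as dist

def cleanup_ddp():
    # env vars like RANK can be set even if no process group was ever
    # created, so check the process group itself before destroying it
    if dist.is_available() and dist.is_initialized():
        dist.destroy_process_group()
```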
Andrej Karpathy
91d76cc690 Replace speedup assertion with warning in batch_encode test
Performance varies by machine and load, making hard assertions flaky.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-28 04:10:49 +00:00
Andrej
7a8769a40c Merge pull request #383 from barisozmen/master
3x faster rust encode (`batch_encode`) (12 LoC + 2 tests)
2025-12-27 20:06:57 -08:00
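A hypothetical usage sketch; only the method name batch_encode comes from the PR title, the signature and surrounding code are assumptions:

```python
def encode_corpus(tok, docs):
    # tok: a trained rustbpe tokenizer (construction omitted). Only the
    # name batch_encode comes from the PR title; the signature shown here
    # is an assumption for illustration.
    ids_batched = tok.batch_encode(docs)        # one call into Rust
    ids_looped = [tok.encode(d) for d in docs]  # ~3x slower per-doc loop
    assert ids_batched == ids_looped
    return ids_batched
```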
Andrej
088726aa7d clean up model_tag handling across scripts a bit more. 2025-12-27 20:01:09 -08:00
Andrej Karpathy
2874eda59a update to new os env var to get rid of deprecation warning 2025-12-28 03:32:46 +00:00
Andrej Karpathy
e1770a3061 remove spurious cast, gets compiled away anyway but it's confusing people 2025-12-27 23:07:48 +00:00
Andrej Karpathy
49389ecaa8 fix tf32 warning for deprecated api use 2025-12-27 22:03:06 +00:00
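One plausible shape of the fix (an assumption, the commit doesn't say which API it switched to): replace the deprecated allow_tf32 toggle with the precision-level API:

```python
import torch

# assumed shape of the fix: replace the deprecated
#   torch.backends.cuda.matmul.allow_tf32 = True
# toggle with the precision-level API
torch.set_float32_matmul_precision("high")  # allow TF32 for fp32 matmuls
```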