Commit Graph

252 Commits

Author SHA1 Message Date
Andrej Karpathy
54e59c38ad add notebook on deriving the CORE estimates for the GPT-3 miniseries. 2026-01-05 18:40:28 +00:00
Andrej Karpathy
9d4c9b786d many small fixes to base_train: reporting ETA, allowing some additional kwarg flexibility, making sure we don't crash when e.g. depth = 11 - we now calculate the closest num_heads that works 2026-01-05 00:38:09 +00:00
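A minimal sketch of the "closest num_heads that works" idea from this commit, assuming the constraint is that the head count must divide the model's embedding dimension evenly (the real constraint set in base_train may include more, e.g. head-dim sizing); the helper name is hypothetical:

```python
def closest_num_heads(model_dim: int, target: int) -> int:
    """Pick the head count nearest to `target` that divides model_dim
    evenly, so an awkward depth (e.g. depth = 11) doesn't crash.
    Ties break toward the smaller head count."""
    divisors = [h for h in range(1, model_dim + 1) if model_dim % h == 0]
    return min(divisors, key=lambda h: (abs(h - target), h))

# e.g. a model_dim of 704 (= 11 * 64) with a desired 12 heads:
# divisors near 12 are 11 and 16, so 11 is chosen.
```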
Andrej Karpathy
962b6bfba3 alright add transformers as a dep of the repo because it should be easy to evaluate the CORE score of HF models. Not super happy about it but i tried it and the uv.lock doesn't get bloated as much as i expected 2026-01-04 20:37:28 +00:00
Andrej Karpathy
ed2082fbc4 sane secrets management 2026-01-04 19:29:22 +00:00
Andrej Karpathy
eb7bbc1b66 delete the configurator in favor of argparse and clean up a lot of kwarg details to make them more consistent across all scripts 2026-01-04 19:14:23 +00:00
Andrej Karpathy
507d54224a fix small bug where this would break if git stage has deleted files 2026-01-04 19:11:43 +00:00
Andrej Karpathy
9c60dfb64c bump nanochat to use the latest stable pytorch, which is 2.9.1. Run e.g. to re-update your local environment if you git pull 2026-01-04 18:36:36 +00:00
Andrej Karpathy
be56d29b87 simplify redundant if/elif in bloat metrics
🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-04 01:40:42 +00:00
Andrej Karpathy
ee79f29fbd replace files-to-prompt with git ls-files for bloat metrics
files-to-prompt was including untracked files (knowledge/, dev scripts, etc.) which inflated the bloat metrics. now we use git ls-files to only count tracked source files, which is more accurate and removes an external dependency.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-04 01:38:15 +00:00
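The switch described above can be sketched as follows: parse `git ls-files` output so that only tracked paths feed the bloat metrics. The function name is hypothetical and the repo's actual metric code may filter further (e.g. by extension):

```python
def tracked_files(ls_output: str) -> list[str]:
    """Parse `git ls-files` output (one tracked path per line) into a
    list, dropping blank lines. Untracked dirs (knowledge/, dev
    scripts, etc.) never appear in this output, so they can't inflate
    the counts."""
    return [p for p in ls_output.splitlines() if p]

# Driven in practice by the real command, e.g.:
#   out = subprocess.run(["git", "ls-files"], capture_output=True, text=True)
#   paths = tracked_files(out.stdout)
```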
Andrej Karpathy
da8b7ea4cb also delete the rustbpe test code, this now lives in rustbpe repo that is separate 2026-01-04 01:23:34 +00:00
Andrej Karpathy
aa42f40e66 delete the inline rustbpe project. it was ugly to have a project within a project, and rustbpe is now nicely a separate repo on my github karpathy/rustbpe and it's on pypi etc., so we just add it as a dependency to uv. i think it is appropriate that this is a separate repo because 1) it doesn't have too many knobs, other than the ones that are exposed - the regex pattern and vocab size and 2) all of its complexity is not algorithmic (it's equivalent to minbpe), instead it is efficiency-related, so it is ok to hide relatively speaking 2026-01-03 23:55:28 +00:00
Andrej Karpathy
48abd7d85f simplify, clarify and slightly tune model initialization. should be very slightly better possibly, but certainly a lot clearer 2026-01-01 21:15:09 +00:00
Paweł Krefta
10231dfb40 Fix conversation scroll to bottom on some browsers + remove duplicated padding (#348) 2025-12-31 13:03:22 -08:00
helloaidank
389d019a0b small change to doc string at top of tok_train.py (#402) 2025-12-31 12:57:26 -08:00
Hossein-Lakzaei
8c89661465 Update README to match current d34 demo (#314) (#381)
* Update README: switch hosted model description from d32 to d34 per discussion #314

* link to discussion thread

* parameter in quotes

---------

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2025-12-30 10:17:11 +01:00
Andrej Karpathy
8f979a8bda fix: sample first token independently for each row in multi-sample generation
Previously, when generating multiple samples (num_samples > 1), the first
token after prefill was sampled once and broadcast to all rows, causing
all samples to start identically. Now the prefill logits are expanded to
num_samples and sampled independently for each row.

Also simplified the generation loop by moving the forward pass to the end
of the loop, eliminating the first_iteration flag and if/else branching.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-28 04:52:13 +00:00
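A toy sketch of the fix above using plain lists in place of tensors: the single prefill logits row is expanded to `num_samples` rows and each row is sampled independently, rather than sampling once and broadcasting. Both function names are illustrative stand-ins for the real sampler:

```python
import math
import random

def sample_from(logits, rng):
    """Toy sampler: draws an index weighted by exp(logit). Stands in
    for the real temperature/top-k sampler."""
    weights = [math.exp(l) for l in logits]
    return rng.choices(range(len(logits)), weights=weights)[0]

def first_tokens(prefill_logits, num_samples, seed=0):
    """Expand the one prefill logits row to num_samples rows, then
    sample each row independently, so samples diverge from token one."""
    rng = random.Random(seed)
    rows = [list(prefill_logits) for _ in range(num_samples)]  # expand
    return [sample_from(row, rng) for row in rows]             # per-row draw
```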
Dipesh Babu
2f2d7ab80c fix: safe DDP cleanup (check initialized PG, not just env) (#256) 2025-12-27 20:27:40 -08:00
Andrej Karpathy
91d76cc690 Replace speedup assertion with warning in batch_encode test
Performance varies by machine and load, making hard assertions flaky.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-28 04:10:49 +00:00
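The assertion-to-warning change can be sketched like this (function name and the 2x threshold are illustrative, not the test's actual values):

```python
import warnings

def check_speedup(t_single: float, t_batch: float, min_speedup: float = 2.0) -> float:
    """Warn instead of assert: wall-clock timings vary with machine and
    load, so a hard assertion on speedup makes the test flaky, while a
    warning still surfaces a regression."""
    speedup = t_single / t_batch
    if speedup < min_speedup:
        warnings.warn(f"batch_encode speedup only {speedup:.2f}x "
                      f"(expected >= {min_speedup}x)")
    return speedup
```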
Andrej
7a8769a40c Merge pull request #383 from barisozmen/master
3x faster rust encode (`batch_encode`) (12 LoC + 2 tests)
2025-12-27 20:06:57 -08:00
Andrej
088726aa7d clean up model_tag handling across scripts a bit more. 2025-12-27 20:01:09 -08:00
Andrej Karpathy
2874eda59a update to new os env var to get rid of deprecation warning 2025-12-28 03:32:46 +00:00
Andrej Karpathy
e1770a3061 remove spurious cast, gets compiled away anyway but it's confusing people 2025-12-27 23:07:48 +00:00
Andrej Karpathy
49389ecaa8 fix tf32 warning for deprecated api use 2025-12-27 22:03:06 +00:00
DU Wenjie
ea4229851b bugfix 2025-12-26 19:02:12 +08:00
DU Wenjie
7840049189 bugfix keep same args style in scripts/base_eval.py 2025-12-26 17:29:08 +08:00
Andrej
bc51da8bac pad vocab size to 64 for DDP optimizers and efficiency 2025-12-23 09:13:31 -08:00
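The padding in this commit amounts to rounding the vocab size up to the next multiple of 64; a one-line sketch (helper name hypothetical):

```python
def pad_vocab_size(vocab_size: int, multiple: int = 64) -> int:
    """Round vocab size up to the nearest multiple of 64, which keeps
    the embedding/unembedding shapes evenly divisible for DDP optimizer
    sharding and friendly to GPU kernels."""
    return ((vocab_size + multiple - 1) // multiple) * multiple
```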
duwenjie
92c6654b95 bugfix save and load ckpt from model_tag dir 2025-12-21 15:07:04 +08:00
Barış Özmen
790f3be65c add rust batch encode as a faster option over encode 2025-12-18 19:17:59 +03:00
Matěj Kripner
d314e96aa2 formatting 2025-12-09 12:48:46 +01:00
Matěj Kripner
bbc57da7d5 slightly nicer error message 2025-12-09 12:46:48 +01:00
Matěj Kripner
f1bf69d562 feat: pad vocab size to 64 for DDP optimizers and efficiency 2025-12-09 12:38:18 +01:00
Andrej
d5759400f9 fixing two typos in comments 2025-12-08 20:03:08 -08:00
Andrej
e72c3299df fix random.seed() footgun bug for SpellingBee data generation 2025-12-08 19:58:45 -08:00
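The `random.seed()` footgun referenced above: seeding the module-level RNG mutates global state for every other consumer. A common fix, and presumably the shape of this one, is a private `random.Random` instance; this sketch is hypothetical and the actual SpellingBee generator differs:

```python
import random

def make_spelling_examples(words, seed=42, n=3):
    """Generate examples reproducibly via a private Random instance,
    instead of calling random.seed() and perturbing the global RNG
    that unrelated code may also depend on."""
    rng = random.Random(seed)
    return [rng.choice(words) for _ in range(n)]
```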
Andrej
7931e0903a rename checkpoint_dir to checkpoints_dir for consistency. 2025-12-08 18:32:12 -08:00
Andrej
849d95ae1f remove unnecessary check to make the logic in CausalSelfAttention.forward() clearer 2025-12-08 18:30:37 -08:00
Andrej
39cccc527f small bugfix make mid_train script work even with a tiny number of iterations 2025-12-08 18:27:32 -08:00
Andrej
8b1cecaa95 Apply suggestion from @svlandeg for nicer looking comparison
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2025-12-08 18:27:06 -08:00
Andrej
58f3e84e01 clean up train/val loader in sft for consistency with mid/base 2025-12-08 18:23:57 -08:00
Andrej
1b2a675c88 Improve KV cache code readability 2025-12-08 18:19:05 -08:00
Andrej
d75e6ed711 Fix script comment to reference correct file 2025-12-08 18:16:42 -08:00
Andrej
72a7cf2bc4 Fix distributed Parquet dataloader resume for multi-epoch training 2025-12-08 18:15:02 -08:00
Andrej Karpathy
bffdb2ef91 group common code to make things neater in gpt logit computation 2025-12-09 02:01:05 +00:00
Andrej
cbf30c842c apply float32 cast before logits softcapping so the tanh is in fp32. torch compile fuses this correctly with no extra memory costs. 2025-12-08 14:17:43 -08:00
Andrej Karpathy
90442de35f fix bug where any rank has to be able to create checkpoint_dir if saving optim 2025-12-08 20:45:19 +00:00
Andrej
2fd0440355 fix: missing val_bpb on resume 2025-12-08 12:35:08 -08:00
sunyujun03
01ea71be39 Fix distributed Parquet dataloader resume for multi-epoch training 2025-12-08 00:10:19 -06:00
KimYeongHyeon
a8847a0f83 Fix script comment to reference correct file 2025-12-02 10:46:20 +09:00
deepbuilder
06677c30e0 Refactor dimension validation for KV cache 2025-11-28 15:22:18 -05:00
deepbuilder
a770dcef2e Fix kv_cache indexing to explicitly include head dimension 2025-11-28 15:00:14 -05:00
spjosyula
16788eed3c fix(model): apply float32 cast before logits softcapping
This change ensures that the logits softcapping operation (tanh) is performed in float32 precision rather than bfloat16. Previously, the code cast to float32 after the tanh operation, which meant the non-linearity was computed in bfloat16 precision.
2025-11-23 20:12:09 +05:30
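The softcap itself is `cap * tanh(logits / cap)`, squashing logits into (-cap, cap); the commit's point is to cast to float32 before the tanh so the non-linearity runs in full precision. A scalar sketch (the model applies this to a whole logits tensor, and the cap value here is illustrative):

```python
import math

def softcap(logit: float, cap: float = 15.0) -> float:
    """Logit softcapping: cap * tanh(logit / cap). Monotonic, identity-
    like near zero, saturating toward +/-cap for large inputs. In the
    model, the input is cast to float32 *before* this tanh."""
    return cap * math.tanh(logit / cap)
```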