Commit Graph

  • 067daa7758 small fix cpu script ty PR #474 master Andrej Karpathy 2026-01-30 02:11:25 +00:00
  • 6a341f2ecf contiguous views and single HtoD transfer for inputs/targets much cleaner Andrej Karpathy 2026-01-30 00:23:01 +00:00
  • ebd4d9bbf5 tried muonh, appealing but didn't work out of the box Andrej Karpathy 2026-01-29 19:01:36 +00:00
  • 41bb2eac32 Combine AdamW and Muon into single MuonAdamW optimizer, cleaner, ty @chrisjmccormick for idea/help Andrej Karpathy 2026-01-29 00:50:50 +00:00
  • 64a651a63c include .claude is ok Andrej Karpathy 2026-01-29 00:35:02 +00:00
  • 65df0de42b add arxiv reading skill Andrej Karpathy 2026-01-29 00:34:24 +00:00
  • 74554be3b5 revert engram, not seeing an improvement at larger scale Andrej Karpathy 2026-01-28 20:07:39 +00:00
  • d5418ea5a1 Fix link to DeepSeek Engram paper (#470) Sofie Van Landeghem 2026-01-28 17:31:44 +01:00
  • c88bbf8133 Merge branch 'engram' Andrej Karpathy 2026-01-27 22:33:16 +00:00
  • c8d93beed2 add engram-lite, add log, tune scaling laws analysis scripts Andrej Karpathy 2026-01-27 22:31:17 +00:00
  • 8630d32be4 quick fix to not OOM main speedrun script Andrej Karpathy 2026-01-26 22:31:42 +00:00
  • 59e36cc727 first version of engram following modded nanogpt style Andrej Karpathy 2026-01-25 18:59:51 +00:00
  • 85b3e95e09 320 experiments just to tune the adam beta1 of x0 a little bit up from 0.8 to 0.96 Andrej Karpathy 2026-01-25 00:03:55 +00:00
  • 6a477eedbd fix: pass device_type to compute_init in engine.__main__ (#451) xiayan0118 2026-01-19 17:19:51 -08:00
  • 63bb5831e2 something i've wanted to do for a while - move all .sh runs to their own directory so they don't pollute root dir Andrej Karpathy 2026-01-18 15:27:41 +00:00
  • a91743c168 Merge branch 've' Andrej Karpathy 2026-01-18 15:14:39 +00:00
  • d58fcd9d73 log for jan 17 Andrej Karpathy 2026-01-18 03:01:13 +00:00
  • babde18ce1 small tweaks Andrej Karpathy 2026-01-18 03:00:38 +00:00
  • cf5c9e5b8e resolve a crash for odd depths because FA3 needs head_dim % 8 == 0 Andrej Karpathy 2026-01-18 00:07:08 +00:00
  • 413e91aa0f optimal ratio is now around 4 Andrej Karpathy 2026-01-17 23:51:09 +00:00
  • e7ed2082b8 update the default GPTConfig kwargs otherwise they are confusing Andrej Karpathy 2026-01-17 21:16:46 +00:00
  • f9a7e0f111 update the CPU/MPS script to give reasonable results. The model can at least answer that Paris is the capital of France and knows that the sky is blue, for about 40 minutes of training on my macbook. Also fixed a bug that existed due to KVCache bfloat16 dtype assumption karpathy 2026-01-17 12:27:30 -08:00
  • f5425245f9 more GPU types from PR 147 thanks @Qubitium Andrej Karpathy 2026-01-17 03:22:20 +00:00
  • 2955650327 add detection of device to report more correct mfu for bf16 Andrej Karpathy 2026-01-17 03:16:12 +00:00
  • 77a46902e4 Fix WANDB_RUN parameter passing in runcpu.sh (#407) Yury Kirpichev 2026-01-16 18:59:44 -08:00
  • bbc4413c58 Add high value engine tests for core invariants (33 LoC) (#396) Barış Özmen 2026-01-17 05:59:12 +03:00
  • f42ae9e901 fix condition to perform bpb evaluation (#324) Nitish Pandey 2026-01-17 08:26:43 +05:30
  • e1dafc510f Reduce token waste in BOS bestfit by cropping shortest doc (#445) Yamahammer 2026-01-16 21:50:34 -05:00
  • 6460dc6382 tweaks to readme a bit Andrej Karpathy 2026-01-17 02:28:31 +00:00
  • 1933e85046 brief update to log Andrej Karpathy 2026-01-17 00:25:50 +00:00
  • 3b95d4fd39 allow label for scaling laws script Andrej Karpathy 2026-01-17 00:23:30 +00:00
  • e85db6b4a4 alternating design Andrej Karpathy 2026-01-16 23:52:12 +00:00
  • 9a88194c3f simply one VE per layer, works best Andrej Karpathy 2026-01-16 22:08:52 +00:00
  • 0b58d70e99 full ve version works very well Andrej Karpathy 2026-01-16 21:16:47 +00:00
  • e3f58b838e ranked version Andrej Karpathy 2026-01-16 20:59:42 +00:00
  • 184d4c12b1 also add to log about the FA3 changes Andrej Karpathy 2026-01-16 18:25:04 +00:00
  • b62a5bc44a naturally i failed to include the actual code in the previous commit facepalm Andrej Karpathy 2026-01-16 17:39:41 +00:00
  • 8203efa919 implement flash attention 3 fallback to pytorch sdpa by touching as few lines of code as possible in main files and keeping all implementation to a single file. add tests. add helpful warning messages for the user. Andrej Karpathy 2026-01-16 17:37:51 +00:00
  • 50413d2d67 typo in comments: change "GAPO" to "DAPO" Haoyu Wang 2026-01-16 01:03:42 -05:00
  • fbf2bbea25 update log with a bunch of attempts Andrej Karpathy 2026-01-16 02:21:17 +00:00
  • 747ed4491f add negative result on olmo3 pretraining mix Andrej Karpathy 2026-01-16 00:43:54 +00:00
  • 7d1700c521 add zstd lib Andrej Karpathy 2026-01-16 00:40:59 +00:00
  • d4ea28d4e2 Fix args in readme (#438) Sofie Van Landeghem 2026-01-16 01:26:38 +01:00
  • bdcc030ffa oops legacy spurious line now Andrej Karpathy 2026-01-15 23:32:20 +00:00
  • 22a71aa3d3 fuse adamw into a single torch compiled kernel similar to muon. it's about 1.7X faster, but overall it's so tiny that it's not making a major dent Andrej Karpathy 2026-01-15 23:30:44 +00:00
  • 255f8b9af6 cleanly separate cpu and gpu sections Andrej Karpathy 2026-01-15 23:30:11 +00:00
  • 6bb92403d5 changes and optimizations to muon, making it more efficient and simpler/cleaner a bit Andrej Karpathy 2026-01-15 03:20:48 +00:00
  • 3142ca1a28 minor helpful message Andrej Karpathy 2026-01-15 03:20:21 +00:00
  • 7312ec9898 fix buggy midtrain and update all kwargs to be idiomatic. that is, argparse uses dashes, variables use underscores. the underscores are just a remnant of the previous Configurator object. This is the right way Andrej Karpathy 2026-01-13 22:45:27 +00:00
  • 3b50b77ed3 fix base_loss to report correct loss by switching the dataloader to the new default Andrej Karpathy 2026-01-13 22:09:36 +00:00
  • f92efce169 add negative result about not allowing attention across BOS tokens. A lot more code complexity for basically no gain in performance Andrej Karpathy 2026-01-13 21:33:54 +00:00
  • 43c29dd9d5 Big DataLoader refactor: BOS-aligned dataloaders with epoch tracking for pre/mid-training Andrej Karpathy 2026-01-13 20:05:47 +00:00
  • 23985413aa adjust the comment on the regex pattern per recent experiment see dev/LOG.md Andrej Karpathy 2026-01-13 17:50:39 +00:00
  • 64b48d0e5c validated that \p{N}{1,2} is the correct number of digits to group up to in the regex pattern of the GPT-4 tokenizer (2 down from 3), leading to the best val_bpb for 32K vocabs Andrej Karpathy 2026-01-13 17:45:06 +00:00
  • 238353c998 document my struggle with fp8 integration yesterday, it's not working like i thought it would and i suffered. one day i will return to continue the fight. Andrej Karpathy 2026-01-13 17:14:29 +00:00
  • 69b1ed245e also add base_train change example for how to swap LinearFP8 fp8_attempt_fail Andrej Karpathy 2026-01-13 17:08:10 +00:00
  • a6382a6ce8 saving these two attempts Andrej Karpathy 2026-01-13 17:05:09 +00:00
  • 4610a838a1 record negative result on MTP Andrej Karpathy 2026-01-12 05:23:47 +00:00
  • 21608ec51e allow base_loss to report the loss of any arbitrary huggingface model similar to base_eval. had to change dataloader to be a lot better and just take tokenizer, not load the nanochat one. much better this way anyway Andrej Karpathy 2026-01-12 03:10:13 +00:00
  • aa95fb2e03 make miniseries more generic and easier to run and less hard coded Andrej Karpathy 2026-01-12 02:54:35 +00:00
  • b33e394528 oops actually make SSSL the default window pattern Andrej Karpathy 2026-01-11 21:50:35 +00:00
  • fbc1484e8c add alternating window size patterns for the GPT layers, following GPT-3. Experimented a bit and found the pattern SSSL to work well - 3 short, 1 long alternating. This is now the new default and the plots look quite a bit better on flops vs. bpb Andrej Karpathy 2026-01-11 21:49:54 +00:00
  • 2ff7d51252 integrate Flash Attention 3. +9% tok_per_sec for d12 with ctx even as low as 2048 out of the box nice. also, ready to tune windows huge Andrej Karpathy 2026-01-11 20:33:19 +00:00
  • 201d705957 recover the ability to load old checkpoints by patching the lambdas if they don't exist in checkpoints Andrej Karpathy 2026-01-11 20:13:12 +00:00
  • aa530cdad5 Add learnable lambdas that gate the residual connection and a skip connection to the input embeddings, solid bump to val_bpb Andrej Karpathy 2026-01-11 18:47:35 +00:00
  • 2c4473dd1b Big Muon optimizer changes inspired by latest of modded-nanogpt. Added Polar Express, Adafactor-style variance reduction, cautious weight decay, schedule weight decay linearly to ramp down to zero. Tuned optimum weight decay for multiple model sizes d8, d12, d16, d20 and found a scaling law with optimum wd \propto 1/channels^2, including it as default into code. --weight_decay of base_train is now default on and configured optimally according to all of these experiments. Solid bump to val_bpb observed as a result of these changes. Andrej Karpathy 2026-01-11 16:56:59 +00:00
  • f5a0ea4d3f take out these gitignore dirs Andrej Karpathy 2026-01-08 18:18:39 +00:00
  • 4ddc803797 fix adamw slight bug. this chunk was copy pasted originally from modded-nanogpt, which still seems to have the bug Andrej Karpathy 2026-01-08 18:18:22 +00:00
  • a1ccb3dc0b remove rust compilation as rustbpe is now installed from separate package (#416) Sofie Van Landeghem 2026-01-08 15:18:37 +01:00
  • 061f83c152 delete grad_clip. appears to not be necessary at all. not only was it buggy because the clipping happened per gpu before grad synchronization, but it costs ~2% MFU, and it also doesn't even help. I tried deleting it a while ago and back then it did help. So I'm guessing that some hyperparameter tuning obviated the reason for it since then Andrej Karpathy 2026-01-08 02:16:50 +00:00
  • e8c30c3b19 add notebook used for scaling laws analysis Andrej Karpathy 2026-01-07 22:28:53 +00:00
  • 3af4dcf6ee also add scaling_laws.sh script if it's a useful reference Andrej Karpathy 2026-01-07 22:25:13 +00:00
  • 4cc605b940 quick pointer to miniseries post in readme for now Andrej Karpathy 2026-01-07 22:14:21 +00:00
  • ccf4b7f9bf nudge hyperparameters of the base script with the results of the sweeps and miniseries. vocab size down to 32K. D:N ratio from 20 to 8. add miniseries script Andrej Karpathy 2026-01-07 22:11:52 +00:00
  • 1b5de29e71 Fix undefined variable in chat_rl after recent refactor Adria Blancafort 2026-01-07 18:08:57 +01:00
  • ae0bf52529 tune hyperparameters based on overnight sweeps. warmdown_ratio is the biggest free win, increasing 0.2 -> 0.4, and embedding lr can be larger bumping 0.2 -> 0.3 Andrej Karpathy 2026-01-05 18:57:46 +00:00
  • eec0c79563 also add matplotlib dep so that we can have jupyter notebooks Andrej Karpathy 2026-01-05 18:41:09 +00:00
  • 54e59c38ad add notebook on deriving the CORE estimates for the GPT-3 miniseries. Andrej Karpathy 2026-01-05 18:40:28 +00:00
  • 9d4c9b786d many small fixes to base_train: reporting ETA, allowing some additional kwarg flexibility, making sure we don't crash when e.g. depth = 11 - we now calculate the closest num_heads that works Andrej Karpathy 2026-01-05 00:38:09 +00:00
  • 962b6bfba3 alright add transformers as a dep of the repo because it should be easy to evaluate the CORE score of HF models. Not super happy about it but i tried it and the uv.lock doesn't get bloated as much as i expected Andrej Karpathy 2026-01-04 20:37:28 +00:00
  • ed2082fbc4 sane secrets management Andrej Karpathy 2026-01-04 19:29:22 +00:00
  • eb7bbc1b66 delete the configurator in favor of argparse and clean up a lot of kwarg details to make them more consistent across all scripts Andrej Karpathy 2026-01-04 19:14:23 +00:00
  • 507d54224a fix small bug where this would break if git stage has deleted files Andrej Karpathy 2026-01-04 19:11:43 +00:00
  • 9c60dfb64c bump nanochat to use the latest stable pytorch that is 2.9.1. Run e.g. to re-update your local environment if you git pull Andrej Karpathy 2026-01-04 18:36:36 +00:00
  • be56d29b87 simplify redundant if/elif in bloat metrics Andrej Karpathy 2026-01-04 01:40:42 +00:00
  • ee79f29fbd replace files-to-prompt with git ls-files for bloat metrics Andrej Karpathy 2026-01-04 01:38:15 +00:00
  • da8b7ea4cb also delete the rustbpe test code, this now lives in rustbpe repo that is separate Andrej Karpathy 2026-01-04 01:23:34 +00:00
  • aa42f40e66 delete the inline rustbpe project. it was ugly to have a project within project and rustbpe is now nicely a separate repo on my github karpathy/rustbpe and it's on pypi etc., so we just add it as a dependency to uv. i think it is appropriate that this is a separate repo because 1) it doesn't have too many knobs, other than the ones that are exposed - the regex pattern and vocab size and 2) all of its complexity is not algorithmic (it's equivalent to minbpe), instead it is efficiency-related, so it is ok to hide relatively speaking Andrej Karpathy 2026-01-03 23:55:28 +00:00
  • 48abd7d85f simplify, clarify and slightly tune model initialization. should be very slightly better possibly, but certainly a lot clearer Andrej Karpathy 2026-01-01 21:14:26 +00:00
  • 10231dfb40 Fix conversation scroll to bottom on some browsers + remove duplicated padding (#348) Paweł Krefta 2025-12-31 22:03:22 +01:00
  • 389d019a0b small change to doc string at top of tok_train.py (#402) helloaidank 2025-12-31 20:57:26 +00:00
  • 8c89661465 Update README to match current d34 demo (#314) (#381) Hossein-Lakzaei 2025-12-30 12:47:11 +03:30
  • 8f979a8bda fix: sample first token independently for each row in multi-sample generation Andrej Karpathy 2025-12-28 04:52:13 +00:00
  • 2f2d7ab80c fix: safe DDP cleanup (check initialized PG, not just env) (#256) Dipesh Babu 2025-12-27 23:27:40 -05:00
  • 91d76cc690 Replace speedup assertion with warning in batch_encode test Andrej Karpathy 2025-12-28 04:10:49 +00:00
  • 7a8769a40c Merge pull request #383 from barisozmen/master Andrej 2025-12-27 20:06:57 -08:00
  • 088726aa7d clean up model_tag handling across scripts a bit more. Andrej 2025-12-27 20:01:09 -08:00
  • 2874eda59a update to new os env var to get rid of deprecation warning Andrej Karpathy 2025-12-28 03:32:46 +00:00
  • e1770a3061 remove spurious cast, gets compiled away anyway but it's confusing people Andrej Karpathy 2025-12-27 23:07:48 +00:00
  • 49389ecaa8 fix tf32 warning for deprecated api use Andrej Karpathy 2025-12-27 22:03:06 +00:00
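
The crash fixed in cf5c9e5b8e comes from FA3's requirement that head_dim be a multiple of 8. A minimal sketch of that compatibility check (the helper name and the concrete dims are illustrative, not nanochat's actual code):

```python
def fa3_compatible_head_dim(model_dim: int, num_heads: int) -> bool:
    """FA3 kernels require head_dim % 8 == 0 (head_dim must also divide model_dim)."""
    if model_dim % num_heads != 0:
        return False
    head_dim = model_dim // num_heads
    return head_dim % 8 == 0

assert fa3_compatible_head_dim(768, 6)       # head_dim = 128, fine
assert not fa3_compatible_head_dim(840, 6)   # head_dim = 140, 140 % 8 != 0
```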
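
The fallback described in 8203efa919 (FA3 if available, otherwise PyTorch SDPA) boils down to a try-import dispatch. A sketch of that idea; the package name `flash_attn_interface` is an assumption about the FA3 distribution, not verified from the commit:

```python
import warnings

# prefer the FA3 kernel when its package is importable, otherwise fall back
# to PyTorch's scaled_dot_product_attention (sdpa)
try:
    from flash_attn_interface import flash_attn_func  # noqa: F401  (assumed name)
    HAS_FA3 = True
except ImportError:
    HAS_FA3 = False
    warnings.warn("flash-attn 3 not available, falling back to torch sdpa")

def attention_backend() -> str:
    """Report which attention implementation the model will use."""
    return "fa3" if HAS_FA3 else "sdpa"
```

Keeping the dispatch in one file, as the commit message says, means the main model code only ever calls the chosen backend and never touches the import logic.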
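
Commit 64b48d0e5c validates grouping digit runs up to 2 (`\p{N}{1,2}`) instead of 3 in the tokenizer's pre-split regex. The stdlib `re` module lacks `\p{N}`, so this illustration uses `[0-9]` as an ASCII stand-in (the third-party `regex` package is needed for the real Unicode property class):

```python
import re

# how a run of digits gets pre-split before BPE, for max group sizes 2 vs 3
split_2 = re.compile(r"[0-9]{1,2}")
split_3 = re.compile(r"[0-9]{1,3}")

assert split_2.findall("20260113") == ["20", "26", "01", "13"]
assert split_3.findall("20260113") == ["202", "601", "13"]
```

Smaller groups mean more tokens per number but a much smaller space of numeric tokens, which per the commit gave the best val_bpb at a 32K vocab.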
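
The SSSL default from fbc1484e8c (3 short windows, 1 long, tiled across layers) is just a repeating pattern lookup. A sketch; the concrete window sizes below are illustrative assumptions:

```python
def window_pattern(num_layers: int, short: int, long: int, pattern: str = "SSSL"):
    """Tile a short/long attention-window pattern across layers, GPT-3 style."""
    sizes = {"S": short, "L": long}
    return [sizes[pattern[i % len(pattern)]] for i in range(num_layers)]

# e.g. an 8-layer model with illustrative 1024/4096-token windows
assert window_pattern(8, 1024, 4096) == [1024, 1024, 1024, 4096,
                                         1024, 1024, 1024, 4096]
```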
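
Commit 2c4473dd1b folds the tuned scaling law (optimum weight decay ∝ 1/channels²) into the defaults. A sketch of how such a default could be computed from one tuned reference point; the reference pair (768 channels, wd 0.1) is an illustrative assumption, not the commit's actual numbers:

```python
def default_weight_decay(channels: int,
                         reference_channels: int = 768,
                         reference_wd: float = 0.1) -> float:
    """Scale a tuned reference weight decay by the 1/channels**2 law."""
    return reference_wd * (reference_channels / channels) ** 2

assert default_weight_decay(768) == 0.1
# doubling channels quarters the optimal weight decay
assert abs(default_weight_decay(1536) - 0.025) < 1e-12
```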
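
The bug called out in 061f83c152 (clipping per GPU before grad synchronization) matters because global-norm clipping is only meaningful after gradients are averaged across ranks. A torch-free toy with two "ranks" shows the two orders disagree:

```python
import math

def global_norm(grads):
    return math.sqrt(sum(g * g for g in grads))

def clip(grads, max_norm):
    """Scale grads so their norm is at most max_norm."""
    norm = global_norm(grads)
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grads]

def average(a, b):
    """Stand-in for the DDP all-reduce (mean) of gradients."""
    return [(x + y) / 2 for x, y in zip(a, b)]

rank0, rank1 = [3.0, 4.0], [0.0, 0.0]   # norms 5.0 and 0.0

# buggy order: clip per rank, then all-reduce
buggy = average(clip(rank0, 1.0), clip(rank1, 1.0))
# correct order: all-reduce, then clip by the true global norm
correct = clip(average(rank0, rank1), 1.0)

assert buggy != correct
```

The buggy order shrinks rank0's contribution before averaging, so the synced gradient is not what global clipping would have produced.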
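
The depth-11 fix in 9d4c9b786d ("calculate the closest num_heads that works") amounts to picking a head count that divides the model width with a head_dim near some target. A sketch under the assumption that the target head_dim is 128; nanochat's actual selection logic may differ:

```python
def closest_num_heads(model_dim: int, target_head_dim: int = 128) -> int:
    """Among divisors of model_dim, pick the head count whose head_dim
    is closest to the target (so no depth choice can crash the split)."""
    divisors = [h for h in range(1, model_dim + 1) if model_dim % h == 0]
    return min(divisors, key=lambda h: abs(model_dim // h - target_head_dim))

assert closest_num_heads(768) == 6   # head_dim exactly 128
assert closest_num_heads(704) == 8   # depth-11-style width: head_dim 88 is closest
```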