karpathy/nanochat
find_last_step crashes on filenames with extra underscores (e.g., model_000200_backup.pt)
Summary
Context: The find_last_step function in checkpoint_manager.py is used to automatically detect the latest checkpoint when loading models, which happens in all inference and training resumption scripts throughout the codebase.
Bug: The function incorrectly parses checkpoint filenames by taking the substring after the last underscore instead of after the first underscore, causing it to crash when checkpoint files have additional underscores in their names.
Actual vs. expected: When a file like model_000200_backup.pt exists in the checkpoint directory, the function tries to parse “backup” as an integer instead of “000200”, resulting in a ValueError.
Impact: The bug causes application crashes when users create backup checkpoint files (e.g., model_000200_backup.pt) or when checkpoint files contain extra underscores for any reason, preventing model loading across all scripts that auto-detect the latest checkpoint.
Code with bug
Example
Given checkpoint files:
model_000100.pt
model_000200.pt
model_000200_backup.pt
When processing model_000200_backup.pt:
os.path.basename(f) → "model_000200_backup.pt"
.split("_")[-1] → "backup.pt"
.split(".")[0] → "backup"
int("backup") → raises ValueError: invalid literal for int() with base 10: 'backup'
Expected behavior: extract "000200" (the part immediately after "model_") and parse it as an integer.
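The failure can be reproduced in isolation with the parsing chain from the steps above. This is a minimal sketch, not the code from checkpoint_manager.py: the helper name parse_step_buggy and the ckpt/ paths are hypothetical and exist only for illustration.

```python
import os

def parse_step_buggy(path):
    # Hypothetical helper mirroring the chain walked through above:
    # keep the text after the LAST underscore, drop the extension, parse as int.
    name = os.path.basename(path)
    return int(name.split("_")[-1].split(".")[0])

# Illustrative paths; the directory name is an assumption.
files = [
    "ckpt/model_000100.pt",
    "ckpt/model_000200.pt",
    "ckpt/model_000200_backup.pt",
]

for f in files:
    try:
        print(f, "->", parse_step_buggy(f))
    except ValueError as e:
        # Only the file with the extra underscore reaches this branch.
        print(f, "-> ValueError:", e)
```

The first two files parse to 100 and 200; the third raises because the last underscore-separated field is "backup.pt" rather than the step number.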
Recommended fix
Change the parsing to target the step immediately after model_ or use a regex to match only valid files.
Option 1 – index the part after the first underscore:
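A minimal sketch of Option 1, assuming the function takes a checkpoint directory, globs for model_*.pt, and raises FileNotFoundError when nothing matches. The signature, the glob pattern, and the error message are assumptions; only the change from index [-1] to index [1] comes from the description above.

```python
import glob
import os

def find_last_step(checkpoint_dir):
    # Assumed discovery step: every file that looks like model_*.pt.
    checkpoint_files = glob.glob(os.path.join(checkpoint_dir, "model_*.pt"))
    if not checkpoint_files:
        raise FileNotFoundError(f"No checkpoints found in {checkpoint_dir}")
    # split("_")[1] is the field right after "model";
    # split(".")[0] strips ".pt" when there is no further underscore.
    return max(
        int(os.path.basename(f).split("_")[1].split(".")[0])
        for f in checkpoint_files
    )
```

With this change model_000200_backup.pt parses as 200 instead of raising. Two caveats: a backup file still counts toward the maximum, so the returned step may not have a matching model_<step>.pt, and a name such as model_final.pt would still raise ValueError.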
Option 2 – regex (more robust; ignores malformed files like backups):
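A minimal sketch of Option 2, assuming valid checkpoints are named exactly model_<digits>.pt. The function signature, the use of os.listdir, and the FileNotFoundError are assumptions rather than the project's actual code.

```python
import os
import re

# Accept only names of the form model_<digits>.pt, nothing before or after.
_STEP_RE = re.compile(r"model_(\d+)\.pt")

def find_last_step(checkpoint_dir):
    steps = []
    for name in os.listdir(checkpoint_dir):
        m = _STEP_RE.fullmatch(name)
        if m:
            steps.append(int(m.group(1)))
        # Anything else (e.g. model_000200_backup.pt) is skipped silently.
    if not steps:
        raise FileNotFoundError(f"No checkpoints found in {checkpoint_dir}")
    return max(steps)
```

Because non-matching names are ignored rather than parsed, a backup file can neither crash the lookup nor be reported as the latest step.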