karpathy/nanochat

find_last_step crashes on filenames with extra underscores (e.g., model_000200_backup.pt)

Summary

  • Context: The find_last_step function in checkpoint_manager.py is used to automatically detect the latest checkpoint when loading models, which happens in all inference and training resumption scripts throughout the codebase.

  • Bug: The function incorrectly parses checkpoint filenames by taking the substring after the last underscore instead of after the first underscore, causing it to crash when checkpoint files have additional underscores in their names.

  • Actual vs. expected: When a file like model_000200_backup.pt exists in the checkpoint directory, the function extracts "backup" instead of "000200" and passes it to int(), which raises a ValueError. The expected result is that the step 200 is parsed and the latest checkpoint is found.

  • Impact: A single stray file such as model_000200_backup.pt in the checkpoint directory is enough to make find_last_step raise, which blocks model loading in every script that auto-detects the latest checkpoint.

Code with bug

def find_last_step(checkpoint_dir):
    # Look into checkpoint_dir and find model_<step>.pt with the highest step
    checkpoint_files = glob.glob(
        os.path.join(checkpoint_dir, "model_*.pt")
    )

    if not checkpoint_files:
        raise FileNotFoundError(
            f"No checkpoints found in {checkpoint_dir}"
        )

    # BUG 🔴: Using [-1] takes the last underscore-delimited part,
    #          which fails for files like "model_000200_backup.pt"
    last_step = int(
        max(
            os.path.basename(f)
            .split("_")[-1]
            .split(".")[0]
            for f in checkpoint_files
        )
    )

    return last_step

Example

Given checkpoint files:

  • model_000100.pt

  • model_000200.pt

  • model_000200_backup.pt

When processing model_000200_backup.pt:

  • os.path.basename(f) → "model_000200_backup.pt"

  • .split("_")[-1] → "backup.pt"

  • .split(".")[0] → "backup"

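  • max(...) over "000100", "000200" and "backup" → "backup", because letters sort after digits in string comparison

  • int("backup") → raises ValueError: invalid literal for int() with base 10: 'backup'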

Expected behavior: extract "000200" (the part immediately after "model_") and parse it as an integer.
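
The trace above can be reproduced without touching the filesystem. This is a minimal sketch, not code from the repository: parse_step_buggy is a hypothetical helper that copies only the parsing expression from find_last_step, and the "ckpt/" directory prefix is made up for illustration. Note that int() is applied once, to whatever max() returns.

```python
import os


def parse_step_buggy(checkpoint_files):
    # Same expression as in find_last_step: take the last "_"-delimited
    # part of the basename, drop the extension, then int() the maximum.
    return int(
        max(
            os.path.basename(f).split("_")[-1].split(".")[0]
            for f in checkpoint_files
        )
    )


files = [
    "ckpt/model_000100.pt",
    "ckpt/model_000200.pt",
    "ckpt/model_000200_backup.pt",
]

# With only well-formed names, the parser finds the latest step.
clean_result = parse_step_buggy(files[:2])

# With the backup file included, max() returns "backup" and int() raises.
try:
    parse_step_buggy(files)
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(clean_result)
print(error_message)
```

The first call succeeds and the second fails, which shows that the extra underscore in the backup file name is what triggers the crash.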

Recommended fix

Change the parsing to target the step immediately after model_ or use a regex to match only valid files.

Option 1 – index the part after the first underscore:

last_step = int(
    max(
        os.path.basename(f)
            .split("_")[1]
            .split(".")[0]
        for f in checkpoint_files
    )
)

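# FIX 🟢: Using [1] takes the part immediately after "model_",
#         which is the step number for names of the form
#         model_<step>.pt or model_<step>_<suffix>.pt
# Note:   max() still compares strings, so this relies on
#         zero-padded step numbers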
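
A quick check of Option 1 on a few file lists may help when choosing between the two options. This is only a sketch: parse_step_option1 is a hypothetical helper wrapping the expression above, and the extra file names (model_000300_backup.pt, model_final.pt) are invented for illustration.

```python
import os


def parse_step_option1(checkpoint_files):
    # Option 1: take the part immediately after "model_".
    return int(
        max(
            os.path.basename(f).split("_")[1].split(".")[0]
            for f in checkpoint_files
        )
    )


# The backup file from the example no longer crashes the parser.
fixed = parse_step_option1(
    ["model_000100.pt", "model_000200.pt", "model_000200_backup.pt"]
)

# Caveat 1: a backup at a newer step is still counted, even though
# model_000300.pt itself is not in the list.
counts_backup = parse_step_option1(
    ["model_000200.pt", "model_000300_backup.pt"]
)

# Caveat 2: a name with no numeric step still raises ValueError.
try:
    parse_step_option1(["model_000200.pt", "model_final.pt"])
    still_raises = False
except ValueError:
    still_raises = True

print(fixed, counts_backup, still_raises)
```

Under these assumptions Option 1 fixes the reported crash but treats backups as real checkpoints and still fails on non-numeric names, which is the gap the regex approach is meant to close.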

Option 2 – regex (more robust; ignores malformed files like backups):

import re

steps = []

for f in checkpoint_files:
    match = re.fullmatch(
        r"model_(\d+)\.pt",
        os.path.basename(f),
    )
    if match:
        steps.append(int(match.group(1)))

if not steps:
    raise ValueError(
        f"No valid checkpoint files found in {checkpoint_dir}"
    )

last_step = max(steps)

# FIX 🟢: Regex explicitly matches the expected format
#         and ignores malformed files
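
Putting Option 2 back into the function gives something like the sketch below. It is one possible shape for the fix, not the repository's actual patch: CHECKPOINT_RE is a hypothetical module-level name, the sketch uses re.fullmatch so the whole basename must match the pattern, and the temporary directory exists only to exercise the three file names from the example.

```python
import glob
import os
import re
import tempfile

# Hypothetical pattern name: "model_", digits only, then ".pt" and nothing else.
CHECKPOINT_RE = re.compile(r"model_(\d+)\.pt")


def find_last_step(checkpoint_dir):
    # Look into checkpoint_dir and find model_<step>.pt with the highest step.
    checkpoint_files = glob.glob(os.path.join(checkpoint_dir, "model_*.pt"))
    if not checkpoint_files:
        raise FileNotFoundError(f"No checkpoints found in {checkpoint_dir}")

    steps = []
    for f in checkpoint_files:
        match = CHECKPOINT_RE.fullmatch(os.path.basename(f))
        if match:
            steps.append(int(match.group(1)))

    if not steps:
        raise ValueError(f"No valid checkpoint files found in {checkpoint_dir}")

    # Compare integers rather than strings, so ordering does not
    # depend on zero padding.
    return max(steps)


# Exercise the function on the file names from the example.
with tempfile.TemporaryDirectory() as tmp:
    for name in ["model_000100.pt", "model_000200.pt", "model_000200_backup.pt"]:
        open(os.path.join(tmp, name), "w").close()
    last_step = find_last_step(tmp)

print(last_step)
```

Here the backup file is skipped rather than parsed, so the function returns the step of the latest well-formed checkpoint.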