karpathy/nanochat
Task slicing: stop not validated → len() misreports; iteration raises IndexError
Summary
Context: SmolTalk is a Task implementation that loads conversational training data from HuggingFace and supports dataset slicing via the base Task class (start, stop, step parameters).
Bug: SmolTalk, like all other Task subclasses, does not validate the stop parameter against the actual dataset size, allowing users to create Task instances with out-of-bounds stop values.
Actual vs. expected: When stop exceeds the dataset size, len() reports the invalid stop value, but accessing indices beyond the actual dataset raises an IndexError. Expected behavior would be to either validate the stop parameter during initialization or clamp it to the dataset size.
Impact: Training or evaluation jobs crash with IndexError when iterating over Task instances created with invalid stop values, wasting compute time and causing confusing failures that appear during data loading rather than at initialization.
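The clamping alternative mentioned under "Actual vs. expected" could be sketched with Python's built-in slice.indices(), which clips slice bounds to a given length. The helper name clamp_slice and the toy size of 10 are illustrative assumptions, not part of nanochat:

```python
def clamp_slice(start, stop, step, size):
    """Clamp (start, stop, step) to a dataset of `size` rows.

    Hypothetical helper for illustration; relies on the built-in
    slice.indices(), which clips bounds to the given length.
    """
    return slice(start, stop, step).indices(size)


# An out-of-bounds stop is clipped to the dataset size.
clamped = clamp_slice(0, 25, 1, 10)

# The number of reachable rows follows from the clamped bounds.
num_rows = len(range(*clamped))

# stop=None (no explicit stop) resolves to the full dataset.
full = clamp_slice(0, None, 1, 10)
```

Clamping keeps a job running but silently yields fewer rows than requested, so raising an error at construction time may be the safer default when the caller expects an exact count.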
Code with bug
tasks/smoltalk.py:
tasks/common.py (base Task class):
Example
Repro with an out-of-bounds stop on the test split (24K rows):
Output:
This shows len() reflects the invalid stop while item access beyond the actual dataset crashes.
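The same mismatch can be illustrated without downloading the dataset. The sketch below uses a hypothetical SlicedTask class backed by an in-memory list; it is a simplified stand-in written for this illustration, not the code in tasks/common.py:

```python
class SlicedTask:
    """Hypothetical stand-in for a sliced task (not the nanochat class)."""

    def __init__(self, rows, start=0, stop=None, step=1):
        self.rows = rows
        self.start = start
        self.stop = stop  # assumption: never checked against len(rows)
        self.step = step

    def __len__(self):
        # Length is derived from the requested stop, not the real row count.
        stop = len(self.rows) if self.stop is None else self.stop
        return (stop - self.start + self.step - 1) // self.step

    def __getitem__(self, index):
        # Maps the logical index to a physical row; fails past the last row.
        return self.rows[self.start + index * self.step]


task = SlicedTask(rows=list(range(10)), stop=25)
reported = len(task)  # based on stop=25, although only 10 rows exist

try:
    task[10]  # first index past the real data
    crashed = False
except IndexError:
    crashed = True
```

Here the reported length is derived from stop alone, so any loop over range(len(task)) eventually reaches an index that has no backing row.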
Recommended fix
Add a validation method to the base class that can be called once the subclass has loaded its dataset, then invoke it from each subclass:
And in tasks/smoltalk.py:
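As a self-contained illustration of this pattern, the sketch below uses hypothetical names (ValidatedTask, validate_slice, ToySmolTalk) and an in-memory list in place of the HuggingFace dataset. It shows one possible shape of the fix under those assumptions, not the actual patch:

```python
class ValidatedTask:
    """Hypothetical base class; names are assumptions for illustration."""

    def __init__(self, start=0, stop=None, step=1):
        self.start, self.stop, self.step = start, stop, step

    def num_examples(self):
        raise NotImplementedError

    def validate_slice(self):
        # Must run after the subclass has loaded its data,
        # because it needs the real dataset size.
        size = self.num_examples()
        if self.stop is not None and self.stop > size:
            raise ValueError(
                f"stop={self.stop} exceeds dataset size {size}"
            )


class ToySmolTalk(ValidatedTask):
    """Stand-in subclass: a list replaces the real dataset load."""

    def __init__(self, rows, **kwargs):
        super().__init__(**kwargs)
        self.rows = rows        # placeholder for loading the dataset
        self.validate_slice()   # fail fast, at construction time

    def num_examples(self):
        return len(self.rows)


ok = ToySmolTalk(list(range(10)), stop=10)  # in bounds, accepted

try:
    ToySmolTalk(list(range(10)), stop=11)   # out of bounds
    rejected = False
except ValueError:
    rejected = True
```

Validating in the subclass constructor moves the failure from data loading to initialization, which addresses the impact described in the summary.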