Reduce token waste in BOS bestfit by cropping shortest doc (#445)

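When no document in the buffer fits the remaining row space, crop the
shortest document instead of the first one. Every candidate is longer
than the remaining space, so the tokens cut from the cropped document
are discarded; choosing the shortest candidate keeps that loss minimal.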

Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
Yamahammer
2026-01-16 21:50:34 -05:00
committed by GitHub
parent 6460dc6382
commit e1dafc510f

@@ -178,8 +178,9 @@ def tokenizing_distributed_data_loader_with_state_bos_bestfit(
doc = doc_buffer.pop(best_idx)
row.extend(doc)
else:
-# No doc fits - crop first doc to fill remaining
-doc = doc_buffer.pop(0)
+# No doc fits - crop shortest in buffer to fill remaining and minimize waste
+shortest_idx = min(range(len(doc_buffer)), key=lambda i: len(doc_buffer[i]))
+doc = doc_buffer.pop(shortest_idx)
row.extend(doc[:remaining])
rows.append(row[:row_capacity])
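
The effect of the change can be sketched in isolation. The snippet below is a minimal illustration, not the loader itself: `crop_to_fill`, `use_shortest`, and the sample document lengths are hypothetical names and values chosen for this example. It assumes, as in the hunk above, that the popped document is cropped to `remaining` tokens and the rest of it is dropped.

```python
def crop_to_fill(doc_buffer, remaining, use_shortest):
    """Pop one document and crop it to `remaining` tokens.

    Returns (kept_tokens, discarded_count). Hypothetical helper that
    mirrors the else-branch of the diff; it is not part of the loader.
    """
    if use_shortest:
        # New behavior: pick the shortest document in the buffer.
        idx = min(range(len(doc_buffer)), key=lambda i: len(doc_buffer[i]))
    else:
        # Old behavior: always pick the first document.
        idx = 0
    doc = doc_buffer.pop(idx)
    kept = doc[:remaining]
    discarded = max(len(doc) - remaining, 0)
    return kept, discarded


# Made-up buffer: three documents of 900, 40, and 300 tokens.
docs = [[1] * 900, [2] * 40, [3] * 300]
remaining = 25  # no document fits, so one has to be cropped

# Copy the buffer for each call because crop_to_fill pops from it.
kept_first, waste_first = crop_to_fill([d[:] for d in docs], remaining, use_shortest=False)
kept_shortest, waste_shortest = crop_to_fill([d[:] for d in docs], remaining, use_shortest=True)

print(waste_first, waste_shortest)  # 875 15
```

Both strategies fill the row with the same number of tokens; only the size of the dropped tail differs, which is why the shortest document is the cheaper one to crop.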