feat(cli): add --batch-size parameter to prevent context leaking #2055
Hellnight2005 wants to merge 12 commits into lingodotdev:main from
Conversation
…loss (CodeRabbit review)
📝 Walkthrough

Adds a

Changes
Sequence Diagram

```mermaid
sequenceDiagram
  actor CLI as CLI User
  participant Cmd as "run / i18n Command"
  participant Setup as "setup.ts"
  participant Localizer as "createLocalizer"
  participant Explicit as "createExplicitLocalizer"
  participant Processor as "createProcessor / Translator"
  participant AI as "AI SDK (generateText)"
  CLI->>Cmd: invoke with --batch-size N
  Cmd->>Setup: ctx.flags.batchSize = N
  Setup->>Localizer: createLocalizer(..., batchSize: N)
  Localizer->>Explicit: createExplicitLocalizer(provider, batchSize: N)
  Cmd->>Processor: createProcessor(..., batchSize: N)
  Processor->>Processor: translator.extractPayloadChunks(data, batchSize: N)
  loop per chunk
    Processor->>AI: generateText(chunk)
    AI-->>Processor: response
    Processor->>Processor: parse & accumulate chunk result
    Processor->>Cmd: onProgress(percent, chunk, partialResult)
  end
  Processor->>Explicit: merged final result
  Explicit->>CLI: return final translations
```
Estimated Code Review Effort
🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly Related PRs
Suggested Reviewers
Poem
🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)
✅ Passed checks (3 passed)
Actionable comments posted: 4
🧹 Nitpick comments (2)
packages/cli/src/cli/processor/basic.ts (1)
114-115: Minor optimization: move `effectiveBatchSize` calculation outside the loop.

The `effectiveBatchSize` value doesn't change per iteration since it only depends on `batchSize` and `payloadEntries.length`. Computing it once before the loop would be cleaner.

♻️ Proposed refactor

```diff
  const payloadEntries = Object.entries(payload);
+ const effectiveBatchSize =
+   batchSize && batchSize > 0 ? batchSize : payloadEntries.length || 1;
  for (let i = 0; i < payloadEntries.length; i++) {
    const [key, value] = payloadEntries[i];
    currentChunk[key] = value;
    currentChunkItemCount++;
    const currentChunkSize = countWordsInRecord(currentChunk);
-   const effectiveBatchSize =
-     batchSize && batchSize > 0 ? batchSize : payloadEntries.length || 1;
    if (
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@packages/cli/src/cli/processor/basic.ts` around lines 114 - 115, The calculation of effectiveBatchSize inside the loop is redundant; compute it once before iterating. Move the expression const effectiveBatchSize = batchSize && batchSize > 0 ? batchSize : payloadEntries.length || 1; out of the loop that processes payloadEntries so the loop uses the precomputed effectiveBatchSize value instead of recalculating it each iteration.

packages/cli/src/cli/cmd/i18n.ts (1)
93-97: Consider aligning the description with the `run` command.

The `run` command's `--batch-size` description mentions "(not applicable when using lingo.dev provider)", but this description omits that detail. While the `i18n` command is deprecated, consistency would help users understand the limitation.

📝 Suggested description update

```diff
  .option(
    "--batch-size <number>",
-   "Number of translations to process in a single batch",
+   "Number of translations to process in a single batch (not applicable when using lingo.dev provider)",
    parseInt,
  )
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@packages/cli/src/cli/cmd/i18n.ts` around lines 93 - 97, Update the "--batch-size <number>" option description in the i18n command to match the run command by appending the limitation text "(not applicable when using lingo.dev provider)"; locate the .option call for "--batch-size <number>" in the i18n command (packages/cli/src/cli/cmd/i18n.ts) and change the second argument string to include the parenthetical note so the help text is consistent.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@packages/cli/src/cli/localizer/explicit.ts`:
- Around line 296-317: The current branch that handles string responses to
populate finalResult silently drops data when result?.data is a truthy
non-object string that lacks "{" and "}" (so index/lastIndex are -1); update the
else branch after computing index/lastIndex to handle that case by either
logging a warning and including the snippet or throwing an Error so translations
aren't silently lost: modify the block that references result.data, index,
lastIndex, jsonrepair and finalResult to add an else that calls
processLogger.warn or console.warn with a clear message and the first 200 chars
of result.data (or throw new Error(`Unparsable response: ...`)) and ensure any
thrown error includes the original result/data for debugging; keep the existing
try/catch for jsonrepair/JSON.parse as-is.
- Around line 286-291: The error messages incorrectly mention "Lingo.dev" while
this code path handles BYOK providers (e.g., OpenAI, Anthropic); update the
console.error and thrown Error (the lines using console.error(...`Response
snippet: ${snippet}`) and throw new Error(...`Failed to parse response from
Lingo.dev: ${e2}`)) to reference the actual configured provider (or a generic
"configured provider"/providerName variable) and include the existing error (e2)
and snippet variables so the message reads like "Failed to parse response from
configured provider <providerName>: <error> (Snippet: <snippet>)".
- Around line 337-340: The function signature for extractPayloadChunks
incorrectly types the payload parameter as Record<string, string> even though
LocalizerData.processableData contains arrays, nested objects and plural forms;
change the parameter type to Record<string, any> (or a more specific union
matching LocalizerData.processableData) so it aligns with countWordsInRecord's
handling of nested structures and avoids type errors; update the
extractPayloadChunks declaration and any related references to use the broader
type while keeping the internal logic unchanged.
In `@packages/cli/src/cli/processor/basic.ts`:
- Around line 92-96: The JSDoc for extractPayloadChunks is stale — it claims
"default: 25" but the implementation uses payloadEntries.length when batchSize
is omitted (so all keys are processed in one batch). Update the function's JSDoc
to accurately describe the default behavior: that if batchSize is not provided,
the function will use the full payload length (i.e., create a single batch
containing all keys), and keep references to the parameters payload and
batchSize in the description.
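The default behavior the comment describes could be documented and implemented like this (a simplified sketch, not the actual `basic.ts` code; word counting and the `currentChunkSize` check are omitted):

```typescript
/**
 * Splits a flat payload into chunks for per-request translation.
 *
 * @param payload   Key/value pairs to translate.
 * @param batchSize Maximum entries per chunk. If omitted (or <= 0), the full
 *                  payload length is used, i.e. all keys go in one batch.
 */
function extractPayloadChunks(
  payload: Record<string, string>,
  batchSize?: number,
): Record<string, string>[] {
  const entries = Object.entries(payload);
  // Same default logic as the reviewed code: fall back to the payload length.
  const effectiveBatchSize =
    batchSize && batchSize > 0 ? batchSize : entries.length || 1;
  const chunks: Record<string, string>[] = [];
  for (let i = 0; i < entries.length; i += effectiveBatchSize) {
    chunks.push(Object.fromEntries(entries.slice(i, i + effectiveBatchSize)));
  }
  return chunks;
}
```

Note the hoisted `effectiveBatchSize`, matching the nitpick above: it depends only on loop-invariant values.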
---
Nitpick comments:
In `@packages/cli/src/cli/cmd/i18n.ts`:
- Around line 93-97: Update the "--batch-size <number>" option description in
the i18n command to match the run command by appending the limitation text "(not
applicable when using lingo.dev provider)"; locate the .option call for
"--batch-size <number>" in the i18n command (packages/cli/src/cli/cmd/i18n.ts)
and change the second argument string to include the parenthetical note so the
help text is consistent.
In `@packages/cli/src/cli/processor/basic.ts`:
- Around line 114-115: The calculation of effectiveBatchSize inside the loop is
redundant; compute it once before iterating. Move the expression const
effectiveBatchSize = batchSize && batchSize > 0 ? batchSize :
payloadEntries.length || 1; out of the loop that processes payloadEntries so the
loop uses the precomputed effectiveBatchSize value instead of recalculating it
each iteration.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 7aaddc5f-6b94-4510-b8df-8802fb6bf23e
📒 Files selected for processing (12)
.changeset/add-batch-size.md
packages/cli/src/cli/cmd/i18n.ts
packages/cli/src/cli/cmd/run/_types.ts
packages/cli/src/cli/cmd/run/index.ts
packages/cli/src/cli/cmd/run/setup.ts
packages/cli/src/cli/localizer/explicit.ts
packages/cli/src/cli/localizer/index.ts
packages/cli/src/cli/localizer/pseudo.ts
packages/cli/src/cli/processor/basic.spec.ts
packages/cli/src/cli/processor/basic.ts
packages/cli/src/cli/processor/index.ts
packages/react/src/client/index.ts
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
packages/cli/src/cli/processor/basic.ts (1)
98-127: ⚠️ Potential issue | 🟠 Major

`--batch-size` still counts top-level objects, not logical translation entries.

This helper now explicitly accepts nested payloads, but it still batches on `Object.entries(payload)`. A namespace like `{ common: { save: "...", cancel: "..." } }` is treated as one item, so `--batch-size 1` can still send multiple keys together and miss the per-key isolation this flag is meant to provide. Split on logical translation entries instead of only top-level properties.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@packages/cli/src/cli/processor/basic.ts` around lines 98 - 127, extractPayloadChunks currently counts only top-level keys so nested translation entries (e.g., {common:{save:...,cancel:...}}) are treated as one item; change extractPayloadChunks to iterate logical translation entries instead: first flatten the payload into an array of entries representing each translation key path and value (e.g., ["common.save", value] or path array), then build chunks by adding one logical entry at a time and reconstructing the nested object shape when pushing currentChunk; keep the same batching logic (use countWordsInRecord to measure size and effectiveBatchSize for item counting) but increment currentChunkItemCount per logical entry rather than per top-level property, and add a small helper (or inline logic) to set nested values into currentChunk when adding an entry so output chunks preserve the original nested structure.
🧹 Nitpick comments (1)
packages/cli/src/cli/processor/basic.ts (1)
5-8: Keep `batchSize` out of the AI SDK call settings.

`batchSize` is internal batching state, but the same `settings` object is still reused later for `generateText(...)`. That leaks a CLI-only flag into provider request options and makes this path depend on adapters tolerating unknown fields. Split `batchSize` from the actual model settings before building the request.

Suggested change

```diff
 export function createBasicTranslator(
   model: LanguageModel,
   systemPrompt: string,
   settings: ModelSettings = {},
 ) {
+  const { batchSize, ...modelSettings } = settings;
+
   return async (input: LocalizerInput, onProgress: LocalizerProgressFn) => {
     const chunks = extractPayloadChunks(
       input.processableData,
-      settings.batchSize,
+      batchSize,
     );
@@
       const response = await generateText({
         model,
-        ...settings,
+        ...modelSettings,
         messages: [
```

Also applies to: 16-18
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@packages/cli/src/cli/processor/basic.ts` around lines 5 - 8, The ModelSettings type currently includes batchSize which is internal and is being passed through into provider calls; update the code so you split out batchSize from the settings object before building SDK request options: keep ModelSettings as-is for typing if desired but when preparing the call (where you pass settings into generateText / the provider request) destructure like `const { batchSize, ...modelSettings } = settings` and use modelSettings for the AI SDK/generateText call and batchSize only for local batching logic; ensure any place that forwards `settings` to the provider (e.g., the generateText invocation) uses the cleaned modelSettings to avoid leaking the CLI-only flag.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@packages/cli/src/cli/localizer/explicit.ts`:
- Around line 199-202: The current chunking leaks sibling hints because
extractPayloadChunks splits only on top-level entries so chunkHints carries
entire top-level subtrees; update the chunking logic (extractPayloadChunks and
the code that builds chunkHints) to operate at the leaf/key-path level: when
batching, enumerate leaf key paths from processableData and create chunks of
those full paths respecting params.batchSize, and when building chunkHints
prune/copy only the hint nodes that correspond to the exact leaf key paths
included in that chunk (not the entire top-level subtree). Ensure functions
referenced (extractPayloadChunks, any chunkHints builder code that consumes
processableData) use key-path traversal rather than top-level splitting so
--batch-size 1 isolates sibling keys and prevents hint bleed.
- Around line 278-324: The code currently only inspects result.data and drops
top-level parsed payloads; change the logic to first normalize the parsed output
into a single variable (e.g., rawData = result?.data ?? result) and then run the
existing object/string handling against rawData instead of result.data so plain
objects or top-level JSON strings aren't ignored; update all subsequent
references (the object branch, string branch, error messages and places using
result.data) to use rawData, keeping parseModelResponse, finalResult and
jsonrepair behavior intact and preserving the same error/snippet reporting that
includes params.id.
---
Outside diff comments:
In `@packages/cli/src/cli/processor/basic.ts`:
- Around line 98-127: extractPayloadChunks currently counts only top-level keys
so nested translation entries (e.g., {common:{save:...,cancel:...}}) are treated
as one item; change extractPayloadChunks to iterate logical translation entries
instead: first flatten the payload into an array of entries representing each
translation key path and value (e.g., ["common.save", value] or path array),
then build chunks by adding one logical entry at a time and reconstructing the
nested object shape when pushing currentChunk; keep the same batching logic (use
countWordsInRecord to measure size and effectiveBatchSize for item counting) but
increment currentChunkItemCount per logical entry rather than per top-level
property, and add a small helper (or inline logic) to set nested values into
currentChunk when adding an entry so output chunks preserve the original nested
structure.
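The leaf-level chunking described in the prompt above can be sketched as follows (helper names are illustrative, not the actual implementation; the word-count sizing from `countWordsInRecord` is left out):

```typescript
type Tree = { [key: string]: string | Tree };

// Flatten a nested payload into ["common.save", "Save"]-style leaf entries.
function flattenEntries(node: Tree, prefix = ""): [string, string][] {
  const out: [string, string][] = [];
  for (const [key, value] of Object.entries(node)) {
    const path = prefix ? `${prefix}.${key}` : key;
    if (typeof value === "string") out.push([path, value]);
    else out.push(...flattenEntries(value, path));
  }
  return out;
}

// Rebuild the nested shape when adding one leaf entry to a chunk.
function setNested(target: Tree, path: string, value: string): void {
  const parts = path.split(".");
  let node = target;
  for (const part of parts.slice(0, -1)) {
    node = (node[part] ??= {}) as Tree;
  }
  node[parts[parts.length - 1]] = value;
}

// Chunk by logical translation entries, not top-level keys.
function chunkByLeaves(payload: Tree, batchSize: number): Tree[] {
  const leaves = flattenEntries(payload);
  const chunks: Tree[] = [];
  for (let i = 0; i < leaves.length; i += batchSize) {
    const chunk: Tree = {};
    for (const [path, value] of leaves.slice(i, i + batchSize)) {
      setNested(chunk, path, value);
    }
    chunks.push(chunk);
  }
  return chunks;
}
```

With this shape, `--batch-size 1` isolates each leaf key even inside a shared namespace, while chunks still mirror the original nesting.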
---
Nitpick comments:
In `@packages/cli/src/cli/processor/basic.ts`:
- Around line 5-8: The ModelSettings type currently includes batchSize which is
internal and is being passed through into provider calls; update the code so you
split out batchSize from the settings object before building SDK request
options: keep ModelSettings as-is for typing if desired but when preparing the
call (where you pass settings into generateText / the provider request)
destructure like `const { batchSize, ...modelSettings } = settings` and use
modelSettings for the AI SDK/generateText call and batchSize only for local
batching logic; ensure any place that forwards `settings` to the provider (e.g.,
the generateText invocation) uses the cleaned modelSettings to avoid leaking the
CLI-only flag.
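The split described above can be sketched like this (a minimal illustration; `ModelSettings` here is a stand-in for the real type, and the provider call is not shown):

```typescript
interface ModelSettings {
  temperature?: number;
  batchSize?: number; // CLI-only; must not reach the provider
}

// Strip the CLI-only flag before forwarding settings to the AI SDK call,
// keeping batchSize available for local chunking logic only.
function splitSettings(settings: ModelSettings) {
  const { batchSize, ...modelSettings } = settings;
  return { modelSettings, batchSize };
}
```

The `modelSettings` rest object is what would be spread into the `generateText` request, so unknown-field tolerance in provider adapters is no longer load-bearing.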
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 0194db5f-69f6-420b-9b1d-0f02f2f07e99
📒 Files selected for processing (2)
packages/cli/src/cli/localizer/explicit.ts
packages/cli/src/cli/processor/basic.ts
```ts
const chunks = extractPayloadChunks(
  input.processableData,
  params.batchSize,
);
```
Nested locale trees can still leak hints across sibling keys.
extractPayloadChunks() only splits on top-level entries, and chunkHints then carries the matching top-level hint subtree into the same request. For nested processableData, --batch-size 1 still batches all leaf keys under one namespace together, so sibling hints can continue to bleed across translations.
Also applies to: 250-259, 344-374
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@packages/cli/src/cli/localizer/explicit.ts` around lines 199 - 202, The
current chunking leaks sibling hints because extractPayloadChunks splits only on
top-level entries so chunkHints carries entire top-level subtrees; update the
chunking logic (extractPayloadChunks and the code that builds chunkHints) to
operate at the leaf/key-path level: when batching, enumerate leaf key paths from
processableData and create chunks of those full paths respecting
params.batchSize, and when building chunkHints prune/copy only the hint nodes
that correspond to the exact leaf key paths included in that chunk (not the
entire top-level subtree). Ensure functions referenced (extractPayloadChunks,
any chunkHints builder code that consumes processableData) use key-path
traversal rather than top-level splitting so --batch-size 1 isolates sibling
keys and prevents hint bleed.
```diff
 let result: any;
 try {
   result = parseModelResponse(response.text);
 } catch (e2) {
   const snippet =
     response.text.length > 500
       ? `${response.text.slice(0, 500)}…`
       : response.text;
   console.error(
     `Failed to parse response from ${params.id}. Response snippet: ${snippet}`,
   );
   throw new Error(
     `Failed to parse response from ${params.id}: ${e2} (Snippet: ${snippet})`,
   );
 }
 let finalResult: Record<string, any> = {};

-// Handle both object and string responses
-if (typeof result.data === "object" && result.data !== null) {
-  return result.data;
+// Handle both object and string responses
+if (typeof result?.data === "object" && result.data !== null) {
+  finalResult = result.data;
+} else if (result?.data) {
+  // Handle string responses - extract and repair JSON
+  const index = result.data.indexOf("{");
+  const lastIndex = result.data.lastIndexOf("}");
+  if (index !== -1 && lastIndex !== -1) {
+    try {
+      const trimmed = result.data.slice(index, lastIndex + 1);
+      const repaired = jsonrepair(trimmed);
+      const parsed = JSON.parse(repaired);
+      finalResult = parsed.data || parsed || {};
+    } catch (e) {
+      console.error(
+        `Failed to parse nested JSON response. Snippet: ${result.data.slice(0, 100)}...`,
+      );
+      throw new Error(
+        `Failed to parse nested JSON response: ${e} (Snippet: ${result.data.slice(0, 100)}...)`,
+      );
+    }
+  } else {
+    console.error(
+      `Unexpected response format - no JSON object found. Snippet: ${String(result.data).slice(0, 100)}...`,
+    );
+    throw new Error(
+      `Unexpected response format from ${params.id} - no JSON object found in response`,
+    );
+  }
+}
```
Handle top-level parsed payloads before assuming a data wrapper.
This branch only succeeds when the parsed response is stored under result.data. If parseModelResponse() returns a plain object or a top-level JSON string, finalResult stays {} and the whole chunk is silently dropped. Normalize result itself into rawData first, then run the existing object/string handling on that value.
Suggested change

```diff
-  let finalResult: Record<string, any> = {};
-
-  // Handle both object and string responses
-  if (typeof result?.data === "object" && result.data !== null) {
-    finalResult = result.data;
-  } else if (result?.data) {
+  let finalResult: Record<string, any> = {};
+  const rawData =
+    typeof result === "object" && result !== null && "data" in result
+      ? result.data
+      : result;
+
+  if (typeof rawData === "object" && rawData !== null) {
+    finalResult = rawData as Record<string, any>;
+  } else if (typeof rawData === "string" && rawData) {
     // Handle string responses - extract and repair JSON
-    const index = result.data.indexOf("{");
-    const lastIndex = result.data.lastIndexOf("}");
+    const index = rawData.indexOf("{");
+    const lastIndex = rawData.lastIndexOf("}");
     if (index !== -1 && lastIndex !== -1) {
       try {
-        const trimmed = result.data.slice(index, lastIndex + 1);
+        const trimmed = rawData.slice(index, lastIndex + 1);
         const repaired = jsonrepair(trimmed);
         const parsed = JSON.parse(repaired);
         finalResult = parsed.data || parsed || {};
       } catch (e) {
         console.error(
-          `Failed to parse nested JSON response. Snippet: ${result.data.slice(0, 100)}...`,
+          `Failed to parse nested JSON response. Snippet: ${rawData.slice(0, 100)}...`,
         );
         throw new Error(
-          `Failed to parse nested JSON response: ${e} (Snippet: ${result.data.slice(0, 100)}...)`,
+          `Failed to parse nested JSON response: ${e} (Snippet: ${rawData.slice(0, 100)}...)`,
         );
       }
     } else {
       console.error(
-        `Unexpected response format - no JSON object found. Snippet: ${String(result.data).slice(0, 100)}...`,
+        `Unexpected response format - no JSON object found. Snippet: ${String(rawData).slice(0, 100)}...`,
       );
       throw new Error(
         `Unexpected response format from ${params.id} - no JSON object found in response`,
       );
     }
   }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@packages/cli/src/cli/localizer/explicit.ts` around lines 278 - 324, The code
currently only inspects result.data and drops top-level parsed payloads; change
the logic to first normalize the parsed output into a single variable (e.g.,
rawData = result?.data ?? result) and then run the existing object/string
handling against rawData instead of result.data so plain objects or top-level
JSON strings aren't ignored; update all subsequent references (the object
branch, string branch, error messages and places using result.data) to use
rawData, keeping parseModelResponse, finalResult and jsonrepair behavior intact
and preserving the same error/snippet reporting that includes params.id.
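The normalization step this prompt asks for can be sketched in isolation (simplified; the `jsonrepair` repair pass and `params.id` reporting are omitted):

```typescript
// Accept both wrapped ({ data: ... }) and top-level parsed responses,
// so a plain object or a bare JSON string is no longer silently dropped.
function normalizeParsed(result: unknown): Record<string, unknown> {
  const rawData =
    typeof result === "object" && result !== null && "data" in result
      ? (result as { data: unknown }).data
      : result;

  if (typeof rawData === "object" && rawData !== null) {
    return rawData as Record<string, unknown>;
  }
  if (typeof rawData === "string") {
    const start = rawData.indexOf("{");
    const end = rawData.lastIndexOf("}");
    if (start !== -1 && end !== -1) {
      return JSON.parse(rawData.slice(start, end + 1));
    }
  }
  throw new Error(`Unparsable response: ${String(rawData).slice(0, 100)}`);
}
```

Throwing (rather than returning `{}`) on the final fallthrough is what keeps a whole chunk of translations from vanishing without a trace.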
Summary

Implements the `--batch-size` parameter for CLI commands to control the number of translation strings sent to the LLM per request, preventing context leaking of translator hints between unrelated keys.

Changes

- Added a `--batch-size <number>` flag to the `run` and `i18n` CLI commands.
- Updated `createBasicTranslator` in `src/cli/processor/basic.ts` to break down and process inputs into configurable chunks.
- Handled `JSON.parse` exceptions gracefully using `jsonrepair` in the AI feedback loop.

Testing
Business logic tests added:

- Chunking behavior when `batchSize` is strictly set to 1.
- Existing tests in `packages/cli` succeed.
Required for UI/UX changes:
Checklist
Closes #1733
Summary by CodeRabbit
New Features

- `--batch-size` option to `run` and `i18n` commands for configurable batch processing

Tests