fix(flows): resume long-running tools after matching responses#5072

Open
zeel2104 wants to merge 3 commits into google:main from zeel2104:fix-5064-long-running-resume

@zeel2104

Link to Issue or Description of Change

1. Link to an existing issue (if applicable):

2. Or, if no issue exists, describe the change:

Problem:
LongRunningFunctionTool resume could fail in resumable flows when streaming was enabled. There were two compounding issues:

  1. Resume logic could still treat an invocation as paused even after a matching functionResponse had already resolved the long-running tool call.
  2. Streaming partial and final model events could end up with different ADK-generated function call IDs, causing resume-time lookup of the original function call event to fail.

Solution:
This change fixes both sides of the resume path:

  1. It adds unresolved-pause detection that checks long-running tool calls against all matching functionResponse IDs in the current invocation branch before deciding to stop execution.
  2. It preserves previously assigned function call IDs across streaming partial and final model response events so the client-visible ID remains stable and matches the event persisted in session history.
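The first point can be illustrated with a small sketch. This is not the ADK implementation: `Event`, `FunctionResponse`, and `has_unresolved_long_running_calls` are hypothetical stand-ins defined only for this example, assuming each event exposes its long-running tool call IDs and its function responses.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class FunctionResponse:
    """Hypothetical stand-in for a functionResponse part."""
    id: Optional[str]


@dataclass
class Event:
    """Hypothetical stand-in for one event in the invocation branch."""
    author: str
    long_running_tool_ids: Optional[set] = None
    function_responses: list = field(default_factory=list)


def has_unresolved_long_running_calls(events: list) -> bool:
    """Return True if some long-running call ID has no matching response ID."""
    # Gather every response ID seen anywhere in the branch.
    response_ids = {
        response.id
        for event in events
        for response in event.function_responses
        if response.id
    }
    # Gather every long-running call ID, tolerating events that have none.
    pending_ids = set()
    for event in events:
        pending_ids |= event.long_running_tool_ids or set()
    return bool(pending_ids - response_ids)


paused = [Event(author="agent", long_running_tool_ids={"call-1"})]
resumed = paused + [
    Event(author="user", function_responses=[FunctionResponse(id="call-1")])
]
```

Under these assumptions, `paused` still reports an unresolved call, while `resumed` does not, so execution would be allowed to continue.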

Testing Plan

I verified the change with targeted unit tests covering both parts of the fix:

  • The resumable invocation logic now continues when a long-running tool call has a matching functionResponse.
  • Streaming finalization preserves function call IDs across partial and final events.

Commands run

python -m pytest tests\unittests\agents\test_invocation_context.py tests\unittests\flows\llm_flows\test_base_llm_flow.py -q --basetemp=.pytest_tmp
python -m pytest tests\unittests\agents\test_invocation_context.py -q -k "has_unresolved_long_running_tool_calls" --basetemp=.pytest_tmp
python -m pytest tests\unittests\flows\llm_flows\test_base_llm_flow.py -q -k "preserves_function_call_ids" --basetemp=.pytest_tmp

Unit Tests:

  • I have added or updated unit tests for my change.
  • All unit tests pass locally.

Passed locally:

python -m pytest tests\unittests\agents\test_invocation_context.py tests\unittests\flows\llm_flows\test_base_llm_flow.py -q --basetemp=.pytest_tmp

**Manual End-to-End (E2E) Tests:**

Manual E2E tests were not run. I validated the behavior through targeted unit tests that cover the unresolved pause check and streaming function call ID stability, but I did not run a full end-to-end resumable streaming flow through the CLI/web/API with a live client resume sequence.


### Checklist

- [x] I have read the [CONTRIBUTING.md](https://github.com/google/adk-python/blob/main/CONTRIBUTING.md) document.
- [x] I have performed a self-review of my own code.
- [x] I have commented my code, particularly in hard-to-understand areas.
- [x] I have added tests that prove my fix is effective or that my feature works.
- [x] New and existing unit tests pass locally with my changes.
- [x] I have manually tested my changes end-to-end.
- [x] Any dependent changes have been merged and published in downstream modules.

### Additional context

This fix is intentionally narrow:

  • Resume behavior now remains paused only when a long-running tool call is still unresolved.
  • Streaming function call IDs are preserved across partial and final events so resume routing remains stable.

The local test setup on Windows required installing the package with uv pip install -e . plus the pytest dependencies directly, because the full test extra currently pulls in a dependency chain that includes lancedb, which does not have a compatible wheel in this environment.

@google-cla

google-cla bot commented Mar 30, 2026

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.

@rohityan rohityan added the core [Component] This issue is related to the core interface and implementation label Apr 1, 2026
@rohityan
Collaborator

rohityan commented Apr 1, 2026

Hi @zeel2104, thank you for your contribution! We appreciate you taking the time to submit this pull request.
Could you please fix the failing tests so that we can proceed with the review?


@tottenjordan tottenjordan left a comment

Tested this against our production interactive_creative agent (3 sequential LongRunningFunctionTool checkpoints with streaming enabled, ResumabilityConfig(is_resumable=True)).

Verification

  • Installed PR branch (zeel2104/adk-python@fix-5064-long-running-resume) — installs as google-adk==1.28.0
  • All 4 patches confirmed present:
    1. InvocationContext.has_unresolved_long_running_tool_calls() replaces old events[-2:] check
    2. preserve_existing_function_call_ids() added to functions.py
    3. _finalize_model_response_event calls preserve_existing_function_call_ids before populate_client_function_call_id
    4. LlmAgent._run_async_impl uses has_unresolved_long_running_tool_calls
  • 3 upstream unit tests pass: test_has_unresolved_long_running_tool_calls_with_matching_response, test_has_unresolved_long_running_tool_calls_without_matching_response, test_finalize_model_response_event_preserves_function_call_ids
  • 125 downstream project tests pass with no regressions

Review of the fix

Bug 1 fix: Clean approach. Collecting all functionResponse IDs across the event list and checking against long_running_tool_ids is correct and handles the general case (not just the last 2 events). Using the same method in both llm_agent.py and base_llm_flow.py eliminates the inconsistency.

Bug 2 fix: preserve_existing_function_call_ids correctly carries forward IDs by matching on function name and only filling in missing IDs (if current_function_call.id: continue). Calling it before populate_client_function_call_id ensures existing IDs are preserved and only truly new calls get fresh UUIDs.
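To make the ordering argument concrete, here is a minimal sketch of the "preserve first, then populate" idea. It is not the code from functions.py: `FunctionCall`, `preserve_ids`, and `populate_missing_ids` are hypothetical stand-ins, the `call-` prefix is made up, and handing out earlier IDs in order of appearance for repeated function names is an assumption of this example.

```python
import uuid
from dataclasses import dataclass
from typing import Optional


@dataclass
class FunctionCall:
    """Hypothetical stand-in for a model function call part."""
    name: str
    id: Optional[str] = None


def preserve_ids(previous: list, current: list) -> None:
    """Copy IDs from earlier calls onto ID-less current calls of the same name."""
    # Queue the earlier IDs per function name, in order of appearance.
    available = {}
    for call in previous:
        if call.id:
            available.setdefault(call.name, []).append(call.id)
    for call in current:
        if call.id:
            continue  # Never overwrite an ID that is already set.
        ids = available.get(call.name)
        if ids:
            call.id = ids.pop(0)


def populate_missing_ids(calls: list) -> None:
    """Assign a fresh ID to every call that still lacks one."""
    for call in calls:
        if not call.id:
            call.id = f"call-{uuid.uuid4()}"


# The partial streaming event gets its ID first.
partial = [FunctionCall(name="request_approval")]
populate_missing_ids(partial)

# The final event repeats the same call and adds a new one.
final = [FunctionCall(name="request_approval"), FunctionCall(name="lookup")]
preserve_ids(partial, final)
populate_missing_ids(final)
```

In this sketch the repeated `request_approval` call keeps the ID that the partial event exposed, and only `lookup` receives a new one. Reversing the two calls on `final` would give both entries fresh IDs, which is the mismatch described in the PR.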

LGTM — this fixes both issues cleanly with minimal surface area. Thanks @zeel2104!

tottenjordan added a commit to tottenjordan/adk-python that referenced this pull request Apr 1, 2026
1. pyink formatting: collapse method signature to single line
2. Only count author='user' function_responses as resolutions — agent-
   generated auto-responses from LongRunningFunctionTool should not
   resolve the pause, only actual user resume responses should
3. Add null guard on event.long_running_tool_ids to fix mypy type error

All 5158 unit tests pass with these changes.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@tottenjordan

@zeel2104 I dug into the 3 CI failures and opened a PR against your branch with fixes: zeel2104#1

All 3 issues are in has_unresolved_long_running_tool_calls() in invocation_context.py:

1. pyink: the method signature needs to be on one line per pyink rules.

2. mypy: event.long_running_tool_ids is set[str] | None and needs a null guard before the in check.

3. 5 test failures: the function_response_ids comprehension counts ALL function_responses as resolutions, but when a LongRunningFunctionTool executes, it auto-generates a function_response with author=<agent_name>. This makes the method think the long-running call is already resolved, so the agent doesn't pause.

Fix: filter to event.author == 'user' so only actual user resume responses count as resolutions:

function_response_ids = {
    function_response.id
    for event in events
    for function_response in event.get_function_responses()
    if function_response.id and event.author == 'user'
}
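For anyone who wants to see the effect of the author filter in isolation, here is a self-contained sketch. The `Event` and `FunctionResponse` classes are hypothetical stand-ins that mimic only the attributes used above, and `has_unresolved` is an illustrative helper, not the method in invocation_context.py.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class FunctionResponse:
    """Hypothetical stand-in for a function response part."""
    id: Optional[str]


@dataclass
class Event:
    """Hypothetical stand-in exposing only the fields used in this sketch."""
    author: str
    long_running_tool_ids: Optional[set] = None
    responses: list = field(default_factory=list)

    def get_function_responses(self) -> list:
        return self.responses


def has_unresolved(events: list) -> bool:
    """Return True while a long-running call lacks a user-authored response."""
    function_response_ids = {
        function_response.id
        for event in events
        for function_response in event.get_function_responses()
        if function_response.id and event.author == 'user'
    }
    return any(
        tool_id not in function_response_ids
        for event in events
        if event.long_running_tool_ids  # Null guard: the field may be None.
        for tool_id in event.long_running_tool_ids
    )


call = Event(author='my_agent', long_running_tool_ids={'call-1'})
auto = Event(author='my_agent', responses=[FunctionResponse(id='call-1')])
user = Event(author='user', responses=[FunctionResponse(id='call-1')])
```

With these stand-ins, `[call, auto]` still counts as unresolved because the only response came from the agent, while `[call, auto, user]` counts as resolved.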

After these fixes: 5,158 tests pass (including the 5 previously failing pause/resume tests + your 3 new tests).

@tottenjordan

@zeel2104 The CLA check is failing because my commit had a Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> trailer — the CLA bot requires all authors/co-authors to have signed the CLA.

I've force-pushed an amended commit to my branch (tottenjordan:fix-5064-long-running-resume-ci-fixes) with the co-author line removed. To pick up the clean version, you can reset and re-merge:

git reset --hard a7b2b73  # back to before the merge
git pull https://github.com/tottenjordan/adk-python.git fix-5064-long-running-resume-ci-fixes
git push --force

Or just squash everything when merging to main — that would also resolve the CLA issue.

@zeel2104 zeel2104 force-pushed the fix-5064-long-running-resume branch from c1af36b to d361c12 Compare April 1, 2026 23:10
@zeel2104
Author

zeel2104 commented Apr 1, 2026

@tottenjordan
Updated the branch to remove the CLA-blocking co-author trailer and pulled in the clean CI follow-up fix. The branch is ready for maintainer workflow approval and review.

Labels

core [Component] This issue is related to the core interface and implementation

Development

Successfully merging this pull request may close these issues.

LongRunningFunctionTool resume fails: unresolved pause check + streaming ID mismatch

3 participants