perf: Batch SQLite INSERTs for indexing pipeline #230

Open
KRRT7 wants to merge 4 commits into microsoft:main from KRRT7:perf/batch-inserts

Conversation

Contributor

@KRRT7 KRRT7 commented Apr 10, 2026

Stack: 3/4 — depends on #229. Merge #231, #229, then this PR.


  • Add add_terms_batch and add_properties_batch to ITermToSemanticRefIndex and IPropertyToSemanticRefIndex interfaces
  • SQLite backend uses executemany instead of individual cursor.execute() calls (~1000+ calls per indexing batch reduced to 2-3)
  • Restructure add_metadata_to_index_from_list and add_to_property_index to collect all data first (pure functions), then batch-insert
  • Memory backend implements batch methods as loops for interface compatibility
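The change from per-row `cursor.execute()` calls to a single `executemany()` call can be sketched with the standard `sqlite3` module. This is only an illustration under assumptions: the table name, column names, and function names below are hypothetical and are not taken from the PR.

```python
import sqlite3

# Hypothetical table and column names, for illustration only.
SCHEMA = "CREATE TABLE SemanticRefIndex (term TEXT NOT NULL, semref_id INTEGER NOT NULL)"
INSERT = "INSERT INTO SemanticRefIndex (term, semref_id) VALUES (?, ?)"


def add_terms_one_by_one(conn: sqlite3.Connection, rows: list[tuple[str, int]]) -> None:
    """Per-item path: one cursor.execute() call per row."""
    cursor = conn.cursor()
    for row in rows:
        cursor.execute(INSERT, row)


def add_terms_batch(conn: sqlite3.Connection, rows: list[tuple[str, int]]) -> None:
    """Batched path: one executemany() call covers every row."""
    if not rows:
        return
    conn.cursor().executemany(INSERT, rows)


def fetch_all(conn: sqlite3.Connection) -> list[tuple[str, int]]:
    """Read back every row in a stable order for comparison."""
    return conn.execute(
        "SELECT term, semref_id FROM SemanticRefIndex ORDER BY semref_id"
    ).fetchall()


rows = [(f"term{i}", i) for i in range(1000)]

per_item_db = sqlite3.connect(":memory:")
per_item_db.execute(SCHEMA)
add_terms_one_by_one(per_item_db, rows)

batched_db = sqlite3.connect(":memory:")
batched_db.execute(SCHEMA)
add_terms_batch(batched_db, rows)

# Both paths should leave the table in the same state.
same_contents = fetch_all(per_item_db) == fetch_all(batched_db)
```

Both paths store identical rows; the batched one simply crosses the Python-to-SQLite call boundary once instead of once per row.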

Benchmark

Azure Standard_D2s_v5 -- 2 vCPU, 8 GiB RAM, Python 3.13

Indexing Pipeline (pytest-async-benchmark pedantic, 20 rounds, 3 warmup)

Only the hot path (add_messages_with_indexing) is timed -- DB creation, storage init, and teardown are excluded.

| Benchmark | Before (min) | After (min) | Speedup |
|---|---|---|---|
| add_messages_with_indexing (200 msgs) | 28.8 ms | 25.0 ms | 1.16x |
| add_messages_with_indexing (50 msgs) | 7.8 ms | 6.7 ms | 1.16x |
| VTT ingest (40 msgs) | 6.9 ms | 6.1 ms | 1.14x |

Consistent ~14-16% improvement -- executemany amortizes per-call overhead.
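For readers who want a feel for the measurement style without installing the plugin, here is a rough stand-alone sketch of "time only the hot path, report the minimum over several rounds after warmup". It does not use pytest-async-benchmark, the schema and helper names are made up, and the numbers it prints will not match the table above.

```python
import sqlite3
import time
from collections.abc import Callable

INSERT = "INSERT INTO terms (term, semref_id) VALUES (?, ?)"  # hypothetical schema
ROWS = [(f"term{i}", i) for i in range(1000)]


def fresh_db() -> sqlite3.Connection:
    """Setup step: create an empty in-memory database (never timed)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE terms (term TEXT, semref_id INTEGER)")
    return conn


def insert_per_row(conn: sqlite3.Connection) -> None:
    for row in ROWS:
        conn.execute(INSERT, row)


def insert_batched(conn: sqlite3.Connection) -> None:
    conn.executemany(INSERT, ROWS)


def min_time(
    hot_path: Callable[[sqlite3.Connection], None],
    rounds: int = 20,
    warmup: int = 3,
) -> float:
    """Return the minimum hot-path time in seconds, excluding setup and teardown."""
    timings: list[float] = []
    for i in range(warmup + rounds):
        conn = fresh_db()  # setup: not timed
        start = time.perf_counter()
        hot_path(conn)
        elapsed = time.perf_counter() - start
        conn.close()  # teardown: not timed
        if i >= warmup:  # discard warmup rounds
            timings.append(elapsed)
    return min(timings)


per_row_s = min_time(insert_per_row)
batched_s = min_time(insert_batched)
print(f"per-row: {per_row_s * 1e3:.2f} ms, batched: {batched_s * 1e3:.2f} ms")
```

Taking the minimum rather than the mean tends to reduce the influence of scheduler noise on a small 2 vCPU machine, which is presumably why the table reports min values.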

Reproduce the benchmark locally

Save the benchmark file below as tests/benchmarks/test_benchmark_indexing.py, then:

pip install 'pytest-async-benchmark @ git+https://github.com/KRRT7/pytest-async-benchmark.git@feat/pedantic-mode' pytest-asyncio

# Run on main
git checkout main
python -m pytest tests/benchmarks/test_benchmark_indexing.py -v -s

# Run on this branch
git checkout perf/batch-inserts
python -m pytest tests/benchmarks/test_benchmark_indexing.py -v -s

Generated by codeflash optimization agent

Collaborator

gvanrossum left a comment

Thanks for this! It requires more review time than I have right now, so I'll keep it open until I have more time.

KRRT7 added 3 commits April 10, 2026 15:49

  • Add add_terms_batch / add_properties_batch to the index interfaces with executemany-based SQLite implementations. Restructure add_metadata_to_index_from_list and add_to_property_index to collect all items first, then batch-insert via extend() and the new batch methods. Eliminates ~1000 individual INSERT round-trips during indexing.
  • Rename _collect_{facet,entity,action}_{terms,properties} to drop the leading underscore in propindex.py and semrefindex.py.
  • Change list to Sequence in add_terms_batch and add_properties_batch interfaces and implementations to satisfy covariance. Add missing add_terms_batch to FakeTermIndex in conftest.py.
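The list-to-Sequence change and the loop-based memory backend can be illustrated roughly as follows. `TermIndex` and `MemoryTermIndex` are hypothetical stand-ins, not the project's actual classes, and the element type `tuple[str, int]` is an assumption.

```python
import asyncio
from collections.abc import Sequence
from typing import Protocol


class TermIndex(Protocol):
    """Hypothetical stand-in for the term index interface."""

    async def add_term(self, term: str, ordinal: int) -> None: ...

    # Sequence is read-only and covariant in its element type, so a type
    # checker accepts lists, tuples, and sequences of narrower element types.
    # A list parameter is invariant and would reject some of those callers.
    async def add_terms_batch(self, terms: Sequence[tuple[str, int]]) -> None: ...


class MemoryTermIndex:
    """In-memory backend: the batch method is just a loop over add_term."""

    def __init__(self) -> None:
        self.postings: dict[str, list[int]] = {}

    async def add_term(self, term: str, ordinal: int) -> None:
        self.postings.setdefault(term, []).append(ordinal)

    async def add_terms_batch(self, terms: Sequence[tuple[str, int]]) -> None:
        for term, ordinal in terms:
            await self.add_term(term, ordinal)


async def index_terms(index: TermIndex) -> None:
    await index.add_terms_batch([("alice", 0), ("bob", 1)])  # a list works
    await index.add_terms_batch((("alice", 2),))  # so does a tuple


memory_index = MemoryTermIndex()
asyncio.run(index_terms(memory_index))
```

The memory backend gains nothing in speed from the batch method; it exists so that callers can use one code path regardless of backend.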
@KRRT7 KRRT7 force-pushed the perf/batch-inserts branch from 4030379 to 82ba650 Compare April 10, 2026 20:50
@bmerkle bmerkle self-assigned this Apr 11, 2026
Collaborator

bmerkle left a comment

Hi @KRRT7,
I was asked by @gvanrossum to review this PR.
Please find some comments attached.

There are also some pre-existing issues in these files that were not introduced by this PR; I would suggest we cover those in a future, separate PR.
I have only mentioned the code duplication, which could possibly also be fixed in this PR.

Please let me know what you think.

knowledge_validator: KnowledgeValidator | None = None,
) -> None:
"""Extract metadata knowledge from a list of messages starting at ordinal."""
next_ordinal = await semantic_refs.size()
Collaborator

I think there is a possible bug: the new implementation of add_metadata_to_index_from_list drops inverse_actions.

The original per-item functions (add_knowledge_to_index, add_knowledge_to_semantic_ref_index) process knowledge_response.inverse_actions in addition to actions.

The new batched add_metadata_to_index_from_list (semrefindex.py:631-675) only iterates over entities, actions, and topics — inverse_actions are silently skipped. This is a correctness regression: any inverse actions in messages will no longer be indexed.

Maybe we should also extend the test cases to cover all the knowledge schemas.
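A parity test between the per-item path and the batched collector is one way such a regression could be caught. The sketch below uses a deliberately simplified, hypothetical `Knowledge` type with plain strings; the real knowledge response types are richer, and the `include_inverse_actions` flag exists only to simulate the suspected bug.

```python
from dataclasses import dataclass, field


@dataclass
class Knowledge:
    """Hypothetical, simplified stand-in for a knowledge response."""

    entities: list[str] = field(default_factory=list)
    actions: list[str] = field(default_factory=list)
    inverse_actions: list[str] = field(default_factory=list)
    topics: list[str] = field(default_factory=list)


def terms_per_item(k: Knowledge) -> list[str]:
    """Reference behavior: every category, including inverse_actions."""
    return [*k.entities, *k.actions, *k.inverse_actions, *k.topics]


def collect_terms(k: Knowledge, include_inverse_actions: bool = True) -> list[str]:
    """Batched collector; the flag simulates the suspected regression when False."""
    terms = [*k.entities, *k.actions]
    if include_inverse_actions:
        terms.extend(k.inverse_actions)
    terms.extend(k.topics)
    return terms


sample = Knowledge(
    entities=["alice"],
    actions=["sent"],
    inverse_actions=["received"],
    topics=["email"],
)

parity_ok = sorted(collect_terms(sample)) == sorted(terms_per_item(sample))
regression_detected = sorted(
    collect_terms(sample, include_inverse_actions=False)
) != sorted(terms_per_item(sample))
```

The important detail is that the test fixture populates every category; a fixture with an empty `inverse_actions` list would pass even with the bug present.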

return props


def collect_action_properties(
Collaborator

Potential bug: collect_action_properties drops subject_entity_facet

The old add_action_properties_to_index does not index action.subject_entity_facet (correct, since only the semrefindex adds facet terms). The new collect_action_properties in propindex.py doesn't either, so this is fine.
On the semrefindex side, collect_action_terms does include action.subject_entity_facet via collect_facet_terms, which matches the old add_action_to_index.

)


async def add_metadata_to_index[TMessage: IMessage](
Collaborator

add_metadata_to_index is not batched

Only add_metadata_to_index_from_list was converted to batch mode. The older add_metadata_to_index (which takes AsyncIterable[TMessage], semrefindex.py:547-580) still uses single-item inserts. This is inconsistent — callers using the iterator-based path won't benefit from the optimization. This may be intentional (harder to batch an async iterator), but it's a missed optimization.
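If batching the iterator-based path is ever wanted, one possible approach is to chunk the async iterator and hand each chunk to the list-based function. The helper below is a hypothetical sketch; `batched_async` and the demo generator are not part of the codebase.

```python
import asyncio
from collections.abc import AsyncIterable, AsyncIterator
from typing import TypeVar

T = TypeVar("T")


async def batched_async(source: AsyncIterable[T], size: int) -> AsyncIterator[list[T]]:
    """Yield lists of up to `size` items drawn from an async iterable."""
    if size < 1:
        raise ValueError("size must be at least 1")
    batch: list[T] = []
    async for item in source:
        batch.append(item)
        if len(batch) >= size:
            yield batch
            batch = []
    if batch:  # flush the final, possibly short, batch
        yield batch


async def messages() -> AsyncIterator[str]:
    """Demo source standing in for an AsyncIterable of messages."""
    for i in range(5):
        yield f"msg{i}"


async def demo() -> list[list[str]]:
    # Each batch could be passed to a list-based, batch-inserting function.
    return [batch async for batch in batched_async(messages(), size=2)]


batches = asyncio.run(demo())
```

The batch size bounds memory use, which matters when the iterator is large or unbounded; that trade-off may be part of why the iterator path was left alone.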

) -> None:
if not properties:
return
from ...storage.memory.propindex import (
Collaborator

Don't put imports inside functions.
These should be top-level imports, consistent with the coding guidelines.

See AGENTS.md.

Contributor Author

Hi, I think there are times when imports should go inside functions; one example is https://github.com/microsoft/typeagent-py/pull/229/changes
However, I'll need to double-check whether it really matters for this PR.

  • As an aside, I'd like to take a look at your agentic rules in a separate PR to see how I could optimize them for your needs as well.

return terms


async def add_metadata_to_index_from_list[TMessage: IMessage](
Collaborator

add_metadata_to_index_from_list also doesn't process inverse_actions from knowledge_response

)


async def add_entity_to_index(
Collaborator

We have massive code duplication. This did not originate in this PR, but IMO we should fix it.

There are two parallel sets of functions that do the same thing with minor variations:

add_entity_to_index (line 154) vs add_entity (line 199)
add_action_to_index (line 486) vs add_action (line 327)
add_topic_to_index (line 468) vs add_topic (line 277)
add_facet at line 243 is shared but called differently
add_knowledge_to_index (line 520) vs add_knowledge_to_semantic_ref_index (line 420)

The _to_index variants use text_range_from_location while the other set uses text_range_from_message_chunk and supports terms_added.

Now the PR adds a third path (collect_*_terms + batch). Every behavioral change must be synchronized across all three.
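One possible way to reduce the duplication is to make the pure collector the single source of truth and have both the per-item and the batched entry points call it. The sketch below is illustrative only: `Entity`, `Facet`, `RecordingIndex`, and `add_entities_batch` are hypothetical, simplified types and names, and the real functions also differ in how they build text ranges.

```python
import asyncio
from dataclasses import dataclass, field


@dataclass
class Facet:
    name: str
    value: str


@dataclass
class Entity:
    name: str
    types: list[str] = field(default_factory=list)
    facets: list[Facet] = field(default_factory=list)


def collect_entity_terms(entity: Entity, ordinal: int) -> list[tuple[str, int]]:
    """Pure function: the one place that decides which terms an entity yields."""
    terms = [(entity.name, ordinal)]
    terms.extend((t, ordinal) for t in entity.types)
    for facet in entity.facets:
        terms.append((facet.name, ordinal))
        terms.append((facet.value, ordinal))
    return terms


class RecordingIndex:
    """Toy index that records every (term, ordinal) pair it receives."""

    def __init__(self) -> None:
        self.terms: list[tuple[str, int]] = []

    async def add_term(self, term: str, ordinal: int) -> None:
        self.terms.append((term, ordinal))

    async def add_terms_batch(self, terms: list[tuple[str, int]]) -> None:
        self.terms.extend(terms)


async def add_entity_to_index(entity: Entity, ordinal: int, index: RecordingIndex) -> None:
    """Per-item path, expressed in terms of the shared collector."""
    for term, term_ordinal in collect_entity_terms(entity, ordinal):
        await index.add_term(term, term_ordinal)


async def add_entities_batch(items: list[tuple[Entity, int]], index: RecordingIndex) -> None:
    """Batched path, using the same collector."""
    terms: list[tuple[str, int]] = []
    for entity, ordinal in items:
        terms.extend(collect_entity_terms(entity, ordinal))
    await index.add_terms_batch(terms)


entity = Entity("Alice", ["person"], [Facet("role", "engineer")])
per_item_index, batch_index = RecordingIndex(), RecordingIndex()
asyncio.run(add_entity_to_index(entity, 7, per_item_index))
asyncio.run(add_entities_batch([(entity, 7)], batch_index))
```

With this shape, a change such as indexing a new field is made once in the collector and reaches every path automatically.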

await semantic_ref_index.add_term(topic.text, ref_ordinal)


async def add_action_to_index(
Collaborator

please see comment on line R154

return bool(entity.name)


async def add_topic_to_index(
Collaborator

please see comment on line R154

await add_facet(facet, semantic_ref_ordinal, semantic_ref_index)


async def add_facet(
Collaborator

please see comment on line R154

await add_facet(action.subject_entity_facet, ref_ordinal, semantic_ref_index)


async def add_knowledge_to_index(
Collaborator

please see comment on line R154

Contributor Author

KRRT7 commented Apr 11, 2026

> Hi @KRRT7
> I was asked by @gvanrossum to do a review of this PR

Thanks! I'll take a closer look at your reviews this afternoon.


3 participants