perf: Batch SQLite INSERTs for indexing pipeline by KRRT7 · Pull Request #230 · microsoft/typeagent-py

KRRT7 · 2026-04-10T07:06:47Z

Stack: 3/4 — depends on #229. Merge #231, #229, then this PR.

Add add_terms_batch and add_properties_batch to ITermToSemanticRefIndex and IPropertyToSemanticRefIndex interfaces
SQLite backend uses executemany instead of individual cursor.execute() calls (~1000+ calls per indexing batch reduced to 2-3)
Restructure add_metadata_to_index_from_list and add_to_property_index to collect all data first (pure functions), then batch-insert
Memory backend implements batch methods as loops for interface compatibility
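For readers unfamiliar with the pattern, here is a minimal sketch of the difference between per-row `cursor.execute()` calls and a single `executemany` call, using only the standard-library `sqlite3` module. The table name, column names, and function names are assumptions for illustration and do not reflect the actual typeagent-py schema or API.

```python
import sqlite3

# Hypothetical schema for illustration only; the real tables may differ.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE SemanticRefIndex (term TEXT NOT NULL, semref_id INTEGER NOT NULL)"
)

INSERT_SQL = "INSERT INTO SemanticRefIndex (term, semref_id) VALUES (?, ?)"


def add_terms_one_by_one(conn: sqlite3.Connection, rows: list[tuple[str, int]]) -> None:
    """Old shape: one execute() call per row."""
    cursor = conn.cursor()
    for term, semref_id in rows:
        cursor.execute(INSERT_SQL, (term, semref_id))


def add_terms_batched(conn: sqlite3.Connection, rows: list[tuple[str, int]]) -> None:
    """New shape: a single executemany() call for the whole batch."""
    if not rows:
        return  # nothing to insert, skip the call entirely
    conn.cursor().executemany(INSERT_SQL, rows)


rows = [("alice", 0), ("bob", 0), ("meeting", 1)]
add_terms_batched(conn, rows)
conn.commit()

count = conn.execute("SELECT COUNT(*) FROM SemanticRefIndex").fetchone()[0]
print(count)
```

Both functions insert the same rows; the batched version simply hands the whole list to SQLite in one call instead of crossing the Python/SQLite boundary once per row.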

Benchmark

Azure Standard_D2s_v5 -- 2 vCPU, 8 GiB RAM, Python 3.13

Indexing Pipeline (pytest-async-benchmark pedantic, 20 rounds, 3 warmup)

Only the hot path (add_messages_with_indexing) is timed -- DB creation, storage init, and teardown are excluded.

Benchmark	Before (min)	After (min)	Speedup
`add_messages_with_indexing` (200 msgs)	28.8 ms	25.0 ms	1.16x
`add_messages_with_indexing` (50 msgs)	7.8 ms	6.7 ms	1.16x
VTT ingest (40 msgs)	6.9 ms	6.1 ms	1.14x

Consistent ~14-16% improvement -- executemany amortizes per-call overhead.

Reproduce the benchmark locally

Save the benchmark file below as tests/benchmarks/test_benchmark_indexing.py, then:

pip install 'pytest-async-benchmark @ git+https://github.com/KRRT7/pytest-async-benchmark.git@feat/pedantic-mode' pytest-asyncio

# Run on main
git checkout main
python -m pytest tests/benchmarks/test_benchmark_indexing.py -v -s

# Run on this branch
git checkout perf/batch-inserts
python -m pytest tests/benchmarks/test_benchmark_indexing.py -v -s

Generated by codeflash optimization agent

gvanrossum

Thanks for this! It requires more review time than I have right now, so I'll keep it open until I have more time.

Add add_terms_batch / add_properties_batch to the index interfaces with executemany-based SQLite implementations. Restructure add_metadata_to_index_from_list and add_to_property_index to collect all items first, then batch-insert via extend() and the new batch methods. Eliminates ~1000 individual INSERT round-trips during indexing.

Rename _collect_{facet,entity,action}_{terms,properties} to drop the leading underscore in propindex.py and semrefindex.py.

Change list to Sequence in add_terms_batch and add_properties_batch interfaces and implementations to satisfy covariance. Add missing add_terms_batch to FakeTermIndex in conftest.py.

bmerkle

Hi @KRRT7
I was asked by @gvanrossum to review this PR.
Please find some comments attached.

There are also some pre-existing issues in these files that were not introduced by this PR; I would suggest that we cover those in a future, separate PR.
I have only mentioned the code duplication, which could possibly also be fixed in this PR.

Please let me know what you think.

bmerkle · 2026-04-11T18:53:28Z

src/typeagent/storage/memory/semrefindex.py

    knowledge_validator: KnowledgeValidator | None = None,
 ) -> None:
    """Extract metadata knowledge from a list of messages starting at ordinal."""
+    next_ordinal = await semantic_refs.size()


I think there is a possible bug: the new implementation of add_metadata_to_index_from_list drops inverse_actions.

The original per-item functions (add_knowledge_to_index, add_knowledge_to_index) process knowledge_response.inverse_actions in addition to actions.

The new batched add_metadata_to_index_from_list (semrefindex.py:631-675) only iterates over entities, actions, and topics — inverse_actions are silently skipped. This is a correctness regression: any inverse actions in messages will no longer be indexed.

Maybe we should also extend the test cases to cover all the knowledge schemas.
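A regression test for this could look roughly like the sketch below. The `Action` and `KnowledgeResponse` dataclasses and the `collect_terms` helper are simplified, hypothetical stand-ins and not the real typeagent-py types; the point is only that the collector iterates over `inverse_actions` as well as `actions`, and that a test asserts on it.

```python
from dataclasses import dataclass, field


@dataclass
class Action:
    """Simplified stand-in for an extracted action."""
    verbs: list[str]


@dataclass
class KnowledgeResponse:
    """Simplified stand-in for the extracted knowledge of one message."""
    entities: list[str] = field(default_factory=list)
    actions: list[Action] = field(default_factory=list)
    inverse_actions: list[Action] = field(default_factory=list)
    topics: list[str] = field(default_factory=list)


def collect_terms(knowledge: KnowledgeResponse) -> list[str]:
    """Pure collector: gather every term to index, including inverse actions."""
    terms = list(knowledge.entities)
    # Iterate over both action lists so inverse actions are not silently skipped.
    for action in [*knowledge.actions, *knowledge.inverse_actions]:
        terms.append(" ".join(action.verbs))
    terms.extend(knowledge.topics)
    return terms


response = KnowledgeResponse(
    entities=["Alice"],
    actions=[Action(["sent"])],
    inverse_actions=[Action(["received"])],
    topics=["email"],
)
terms = collect_terms(response)
print(terms)
```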

bmerkle · 2026-04-11T18:55:54Z

src/typeagent/storage/memory/propindex.py

+    return props
+
+
+def collect_action_properties(


potential BUG: collect_action_properties drops subject_entity_facet

The old add_action_properties_to_index does not index action.subject_entity_facet (correct — only the semrefindex adds facet terms). But the new collect_action_properties in propindex.py also doesn't — so this is fine.
However, comparing the semrefindex side: collect_action_terms does include action.subject_entity_facet via collect_facet_terms, which matches the old add_action_to_index.

bmerkle · 2026-04-11T18:59:33Z

src/typeagent/storage/memory/semrefindex.py

        )


 async def add_metadata_to_index[TMessage: IMessage](


add_metadata_to_index is not batched

Only add_metadata_to_index_from_list was converted to batch mode. The older add_metadata_to_index (which takes AsyncIterable[TMessage], semrefindex.py:547-580) still uses single-item inserts. This is inconsistent — callers using the iterator-based path won't benefit from the optimization. This may be intentional (harder to batch an async iterator), but it's a missed optimization.
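If batching the iterator-based path is wanted later, one possible approach is to group the async iterator into fixed-size chunks and hand each chunk to the list-based function. The sketch below assumes a hypothetical `chunked` helper and a toy `messages` generator; neither exists in the PR.

```python
import asyncio
from collections.abc import AsyncIterable, AsyncIterator
from typing import TypeVar

T = TypeVar("T")


async def chunked(source: AsyncIterable[T], size: int) -> AsyncIterator[list[T]]:
    """Group items from an async iterable into lists of at most `size` items."""
    chunk: list[T] = []
    async for item in source:
        chunk.append(item)
        if len(chunk) >= size:
            yield chunk
            chunk = []
    if chunk:
        yield chunk  # flush the final, possibly shorter, chunk


async def messages() -> AsyncIterator[str]:
    """Toy message source standing in for AsyncIterable[TMessage]."""
    for i in range(5):
        yield f"msg-{i}"


async def main() -> list[list[str]]:
    # Each chunk could be passed to a list-based batch function.
    return [chunk async for chunk in chunked(messages(), 2)]


chunks = asyncio.run(main())
print(chunks)
```

The chunk size trades memory for fewer database calls; the right value would need to be measured rather than guessed.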

bmerkle · 2026-04-11T19:02:55Z

src/typeagent/storage/sqlite/propindex.py

+    ) -> None:
+        if not properties:
+            return
+        from ...storage.memory.propindex import (


Don't put imports inside functions
these should be top-level imports, consistent with the coding guidelines

see AGENTS.md

Hi, I think there are times when imports should go inside functions; one such example is https://github.com/microsoft/typeagent-py/pull/229/changes
However, I'll need to double-check whether it really matters for this PR.

As an aside, I wanted to take a look at your agentic rules in a separate PR to see how I could optimize them for your needs as well.

bmerkle · 2026-04-11T19:04:18Z

src/typeagent/storage/memory/semrefindex.py

+    return terms
+
+
 async def add_metadata_to_index_from_list[TMessage: IMessage](


add_metadata_to_index_from_list also doesn't process inverse_actions from knowledge_response

bmerkle · 2026-04-11T19:13:57Z

src/typeagent/storage/memory/semrefindex.py

        )


 async def add_entity_to_index(


We have massive code duplication. It did not originate in this PR, but IMO we should fix it.

There are two parallel sets of functions that do the same thing with minor variations:

add_entity_to_index (line 154) vs add_entity (line 199)
add_action_to_index (line 486) vs add_action (line 327)
add_topic_to_index (line 468) vs add_topic (line 277)
add_facet at line 243 is shared but called differently
add_knowledge_to_index (line 520) vs add_knowledge_to_semantic_ref_index (line 420)

The _to_index variants use text_range_from_location while the other set uses text_range_from_message_chunk and supports terms_added.

Now the PR adds a third path (collect_*_terms + batch). Every behavioral change must be synchronized across all three.
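One way to reduce the synchronization burden would be to make the pure collector the single source of truth and have both the per-item path and the batch path delegate to it. The sketch below uses a hypothetical in-memory index and simplified entity tuples, not the real typeagent-py classes.

```python
import asyncio
from collections.abc import Sequence


class MemoryTermIndex:
    """Hypothetical in-memory index; the batch method is a loop over add_term."""

    def __init__(self) -> None:
        self.entries: list[tuple[str, int]] = []

    async def add_term(self, term: str, ordinal: int) -> None:
        self.entries.append((term, ordinal))

    async def add_terms_batch(self, terms: Sequence[tuple[str, int]]) -> None:
        for term, ordinal in terms:
            await self.add_term(term, ordinal)


def collect_entity_terms(name: str, types: Sequence[str]) -> list[str]:
    """Single source of truth for which terms an entity contributes."""
    return [name, *types]


async def add_entity_single(index, name, types, ordinal) -> None:
    # Per-item path: delegates to the shared collector.
    for term in collect_entity_terms(name, types):
        await index.add_term(term, ordinal)


async def add_entities_batch(index, entities, start_ordinal) -> None:
    # Batch path: same collector, one batched call.
    rows = [
        (term, start_ordinal + i)
        for i, (name, types) in enumerate(entities)
        for term in collect_entity_terms(name, types)
    ]
    await index.add_terms_batch(rows)


async def main():
    single, batch = MemoryTermIndex(), MemoryTermIndex()
    entities = [("Alice", ["person"]), ("Contoso", ["company"])]
    for i, (name, types) in enumerate(entities):
        await add_entity_single(single, name, types, i)
    await add_entities_batch(batch, entities, 0)
    return single.entries, batch.entries


single_entries, batch_entries = asyncio.run(main())
print(single_entries == batch_entries)
```

With this shape, a behavioral change to what gets indexed happens in one function, and a parity test like the comparison above guards against the paths drifting apart.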

bmerkle · 2026-04-11T19:14:44Z

src/typeagent/storage/memory/semrefindex.py

    await semantic_ref_index.add_term(topic.text, ref_ordinal)


 async def add_action_to_index(


please see comment on line R154

bmerkle · 2026-04-11T19:15:01Z

src/typeagent/storage/memory/semrefindex.py

    return bool(entity.name)


 async def add_topic_to_index(


please see comment on line R154

bmerkle · 2026-04-11T19:15:24Z

src/typeagent/storage/memory/semrefindex.py

            await add_facet(facet, semantic_ref_ordinal, semantic_ref_index)


 async def add_facet(


please see comment on line R154

bmerkle · 2026-04-11T19:15:37Z

src/typeagent/storage/memory/semrefindex.py

    await add_facet(action.subject_entity_facet, ref_ordinal, semantic_ref_index)


 async def add_knowledge_to_index(


please see comment on line R154

KRRT7 · 2026-04-11T19:34:05Z

> Hi @KRRT7
> I was asked by @gvanrossum to do a review of this PR

Thanks! I'll take a closer look at your reviews this afternoon.

KRRT7 force-pushed the perf/batch-inserts branch from 19520f3 to e7e804e on April 10, 2026 10:21

This was referenced Apr 10, 2026

Fix parse_azure_endpoint passing query string to AsyncAzureOpenAI #231

Merged

perf: Batch metadata query to avoid N+1 across 5 call sites #232

Open

gvanrossum reviewed Apr 10, 2026


KRRT7 added 3 commits April 10, 2026 15:49

Remove underscore prefix from collect helper functions

fcc7c23


Fix pyright errors: use Sequence for batch method signatures

82ba650


KRRT7 force-pushed the perf/batch-inserts branch from 4030379 to 82ba650 on April 10, 2026 20:50

Merge branch 'main' into perf/batch-inserts

544912a

bmerkle self-assigned this Apr 11, 2026

bmerkle reviewed Apr 11, 2026
