perf: Batch SQLite INSERTs for indexing pipeline#230
perf: Batch SQLite INSERTs for indexing pipeline#230KRRT7 wants to merge 4 commits intomicrosoft:mainfrom
Conversation
19520f3 to
e7e804e
Compare
gvanrossum
left a comment
There was a problem hiding this comment.
Thanks for this! It requires more review time than I have right now, so I'll keep it open until I have more time.
Add add_terms_batch / add_properties_batch to the index interfaces with executemany-based SQLite implementations. Restructure add_metadata_to_index_from_list and add_to_property_index to collect all items first, then batch-insert via extend() and the new batch methods. Eliminates ~1000 individual INSERT round-trips during indexing.
Rename _collect_{facet,entity,action}_{terms,properties} to drop the
leading underscore in propindex.py and semrefindex.py.
Change list to Sequence in add_terms_batch and add_properties_batch interfaces and implementations to satisfy covariance. Add missing add_terms_batch to FakeTermIndex in conftest.py.
4030379 to
82ba650
Compare
bmerkle
left a comment
There was a problem hiding this comment.
Hi @KRRT7
I was asked by @gvanrossum to do a review of this PR
Please find attached some comments.
There are also some pre-existing issues in these files, so things which have not be introduced by this PR, but i would suggest that we cover those in a future, seperate PR.
I have only mentioned the code duplicate, which could possibly be fixed also in this PR.
please let me know what you think.
| knowledge_validator: KnowledgeValidator | None = None, | ||
| ) -> None: | ||
| """Extract metadata knowledge from a list of messages starting at ordinal.""" | ||
| next_ordinal = await semantic_refs.size() |
There was a problem hiding this comment.
I think there is a possible bug because the new implementation add_metadata_to_index_from_list drops inverse_actions
The original per-item functions (add_knowledge_to_index, add_knowledge_to_index) process knowledge_response.inverse_actions in addition to actions.
The new batched add_metadata_to_index_from_list (semrefindex.py:631-675) only iterates over entities, actions, and topics — inverse_actions are silently skipped. This is a correctness regression: any inverse actions in messages will no longer be indexed.
Maybe we also should extend the testcases to cover all the knowlege_schemas.
| return props | ||
|
|
||
|
|
||
| def collect_action_properties( |
There was a problem hiding this comment.
potential BUG: collect_action_properties drops subject_entity_facet
The old add_action_properties_to_index does not index action.subject_entity_facet (correct — only the semrefindex adds facet terms). But the new collect_action_properties in propindex.py also doesn't — so this is fine.
However, comparing the semrefindex side: collect_action_terms does include action.subject_entity_facet via collect_facet_terms, which matches the old add_action_to_index.
| ) | ||
|
|
||
|
|
||
| async def add_metadata_to_index[TMessage: IMessage]( |
There was a problem hiding this comment.
add_metadata_to_index is not batched
Only add_metadata_to_index_from_list was converted to batch mode. The older add_metadata_to_index (which takes AsyncIterable[TMessage], semrefindex.py:547-580) still uses single-item inserts. This is inconsistent — callers using the iterator-based path won't benefit from the optimization. This may be intentional (harder to batch an async iterator), but it's a missed optimization.
| ) -> None: | ||
| if not properties: | ||
| return | ||
| from ...storage.memory.propindex import ( |
There was a problem hiding this comment.
Don't put imports inside functions
these should be top-level imports, consistent with the coding guidelines
see AGENTS.md
There was a problem hiding this comment.
Hi, I think that there are times when imports should go inside functions, one such example is https://github.com/microsoft/typeagent-py/pull/229/changes
however, I'll need to double check if it really matters for this PR.
- as an aside I wanted to take a look to your agentic rules in a separate PR to see how I could optimize them as well for your needs.
| return terms | ||
|
|
||
|
|
||
| async def add_metadata_to_index_from_list[TMessage: IMessage]( |
There was a problem hiding this comment.
add_metadata_to_index_from_list also doesn't process inverse_actions from knowledge_response
| ) | ||
|
|
||
|
|
||
| async def add_entity_to_index( |
There was a problem hiding this comment.
We have massive code duplication. This is not origin in this PR but IMO we should fix this.
There are two parallel sets of functions that do the same thing with minor variations:
add_entity_to_index (line 154) vs add_entity (line 199)
add_action_to_index (line 486) vs add_action (line 327)
add_topic_to_index (line 468) vs add_topic (line 277)
add_facet at line 243 is shared but called differently
add_knowledge_to_index (line 520) vs add_knowledge_to_semantic_ref_index (line 420)
The _to_index variants use text_range_from_location while the other set uses text_range_from_message_chunk and supports terms_added.
Now the PR adds a third path (collect_*_terms + batch). Every behavioral change must be synchronized across all three
| await semantic_ref_index.add_term(topic.text, ref_ordinal) | ||
|
|
||
|
|
||
| async def add_action_to_index( |
There was a problem hiding this comment.
please see comment on line R154
| return bool(entity.name) | ||
|
|
||
|
|
||
| async def add_topic_to_index( |
There was a problem hiding this comment.
please see comment on line R154
| await add_facet(facet, semantic_ref_ordinal, semantic_ref_index) | ||
|
|
||
|
|
||
| async def add_facet( |
There was a problem hiding this comment.
please see comment on line R154
| await add_facet(action.subject_entity_facet, ref_ordinal, semantic_ref_index) | ||
|
|
||
|
|
||
| async def add_knowledge_to_index( |
There was a problem hiding this comment.
please see comment on line R154
Thanks! I'll take a closer look to your reviews this afternoon. |
Stack: 3/4 — depends on #229. Merge #231, #229, then this PR.
add_terms_batchandadd_properties_batchtoITermToSemanticRefIndexandIPropertyToSemanticRefIndexinterfacesexecutemanyinstead of individualcursor.execute()calls (~1000+ calls per indexing batch reduced to 2-3)add_metadata_to_index_from_listandadd_to_property_indexto collect all data first (pure functions), then batch-insertBenchmark
Azure Standard_D2s_v5 -- 2 vCPU, 8 GiB RAM, Python 3.13
Indexing Pipeline (pytest-async-benchmark pedantic, 20 rounds, 3 warmup)
Only the hot path (
add_messages_with_indexing) is timed -- DB creation, storage init, and teardown are excluded.add_messages_with_indexing(200 msgs)add_messages_with_indexing(50 msgs)Consistent ~14-16% improvement --
executemanyamortizes per-call overhead.Reproduce the benchmark locally
Save the benchmark file below as
tests/benchmarks/test_benchmark_indexing.py, then:Generated by codeflash optimization agent