TTFT feedback loop for voice agent context management.
Other libraries compress blindly. voice-budget measures TTFT before and after, auto-tunes, and rolls back if compression hurts.
```python
import asyncio
from voice_budget import wrap

async def main():
    managed = wrap(your_llm, target_ms=800)
    response = await managed(messages)  # measures, compresses, verifies

asyncio.run(main())
```

```bash
pip install voice-budget

# With semantic compression (recommended):
pip install "voice-budget[semantic]"
```

Dependencies: numpy and tiktoken only. No GPU. No cloud API.
Use voice-budget with any framework:
```python
import asyncio
from voice_budget import wrap

async def my_llm(messages, **kwargs):
    resp = await openai_client.chat.completions.create(
        model="gpt-4o", messages=messages, **kwargs
    )
    return resp.choices[0].message.content

async def voice_loop():
    managed = wrap(my_llm, target_ms=800, verbose=True)
    messages = [{"role": "system", "content": "You are a voice assistant."}]
    while True:
        messages.append({"role": "user", "content": await get_user_speech()})
        response = await managed(messages)
        messages.append({"role": "assistant", "content": response})

asyncio.run(voice_loop())
```

Note for Pipecat users: the provided `VoiceBudgetProcessor` in `pipecat_integration.py` is a blueprint. To integrate it into a full Pipecat pipeline, make sure it inherits from `pipecat.processors.frame_processor.FrameProcessor` and wires up the `push_frame` and `process_frame` methods so frames keep flowing down the pipeline.
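The wiring described above can be sketched as follows. The `Frame` and `FrameProcessor` classes below are simplified stand-ins so the snippet is self-contained; in a real integration you would subclass `pipecat.processors.frame_processor.FrameProcessor` instead, and the compression step here is a placeholder:

```python
import asyncio


class Frame:  # stand-in for pipecat's Frame
    def __init__(self, payload):
        self.payload = payload


class FrameProcessor:  # stand-in for pipecat's FrameProcessor
    def __init__(self):
        self._next = None

    def link(self, nxt):
        self._next = nxt

    async def push_frame(self, frame, direction="downstream"):
        if self._next is not None:
            await self._next.process_frame(frame, direction)

    async def process_frame(self, frame, direction="downstream"):
        await self.push_frame(frame, direction)


class BudgetProcessor(FrameProcessor):
    """Compress context frames, then pass everything downstream."""

    async def process_frame(self, frame, direction="downstream"):
        if isinstance(frame.payload, list):  # e.g. an LLM message context
            frame.payload = frame.payload[-4:]  # placeholder compression
        await self.push_frame(frame, direction)  # keep the pipeline flowing
```

The key point is the last line: a processor that compresses but forgets to call `push_frame` silently stalls the pipeline.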
```python
from pipecat.pipeline.pipeline import Pipeline
from voice_budget.pipecat_integration import VoiceBudgetProcessor

budget = VoiceBudgetProcessor(target_ms=800, verbose=True)

pipeline = Pipeline([
    transport.input(), stt, context_aggregator.user(),
    budget,  # ← insert before LLM
    llm, tts, transport.output(), context_aggregator.assistant(),
])
```

Use `VoiceBudgetAgent` to wrap your LiveKit agent's LLM calls:
```python
from voice_budget import VoiceBudgetAgent

budget = VoiceBudgetAgent(
    target_ms=800,
    token_budget=2000,
    model="gpt-4o",
    use_semantic=True,
    verbose=True,
)

async def on_message(message: str, messages: list):
    # Compress context and measure TTFT
    response = await budget.process_messages(
        messages=messages,
        llm_fn=your_llm_function,
    )

    # Streaming LLMs return an async iterator; non-streaming calls return text.
    if hasattr(response, "__aiter__"):
        chunks = []
        async for chunk in response:
            chunks.append(chunk)
        response_text = "".join(chunks)
    else:
        response_text = response

    messages.append({"role": "assistant", "content": response_text})
    return response_text

# Access stats and reports
stats = budget.stats()
report = budget.report()
```

```
Turn 1:  TTFT=480ms  tokens=120  ✓ under budget
Turn 8:  TTFT=920ms  tokens=980  ↑ P95 > 800ms → sliding_window → 980→420 tokens
Turn 9:  TTFT=490ms  tokens=420  ✓ compression helped (delta=430ms)
Turn 14: TTFT=850ms  tokens=720  ↑ P95 > 800ms → semantic_trim → 720→350 tokens
Turn 15: TTFT=460ms  tokens=350  ✓ compression helped
```
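The measure → compress → verify → rollback cycle behind that log can be sketched roughly as below. The function names are illustrative, not voice-budget internals, and TTFT is approximated here as whole-call latency with a blocking `llm_fn`:

```python
import time


def p95(samples):
    """Nearest-rank 95th percentile over the rolling window."""
    ordered = sorted(samples)
    idx = min(len(ordered) - 1, int(0.95 * len(ordered)))
    return ordered[idx]


def run_turn(messages, llm_fn, ttft_window, target_ms, compress_fn):
    """One turn of the feedback loop: measure, compress when over
    budget, and roll back if compression did not help."""
    if ttft_window and p95(ttft_window) > target_ms:
        compressed = compress_fn(messages)
    else:
        compressed = messages

    start = time.monotonic()
    reply = llm_fn(compressed)
    ttft_ms = (time.monotonic() - start) * 1000
    ttft_window.append(ttft_ms)

    # Verify: if compression still missed the target, fall back to the
    # uncompressed history for the next turn (rollback).
    rolled_back = compressed is not messages and ttft_ms > target_ms
    return reply, (messages if rolled_back else compressed), rolled_back
```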
| Strategy | Cost | When used |
|---|---|---|
| `sliding_window` | Free | First attempt: drop the oldest turns |
| `semantic_trim` | ~5ms (local embeddings) | If sliding window is not enough |
| `summarise_tail` | 1 LLM call | If semantic trim is not enough (opt-in) |
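The free `sliding_window` strategy amounts to dropping the oldest non-system turns until the history fits the token budget. A rough sketch, using a naive word count as a stand-in for the tiktoken counts voice-budget actually uses:

```python
def sliding_window(messages, token_budget, count_tokens=None):
    """Drop the oldest non-system turns until the history fits the budget.

    `count_tokens` defaults to a naive word count here; the real
    library counts tokens with tiktoken.
    """
    if count_tokens is None:
        count_tokens = lambda msgs: sum(len(m["content"].split()) for m in msgs)

    system = [m for m in messages if m["role"] == "system"]
    turns = [m for m in messages if m["role"] != "system"]
    while turns and count_tokens(system + turns) > token_budget:
        turns.pop(0)  # oldest turn goes first; the system prompt survives
    return system + turns
```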
```python
from voice_budget import VoiceBudget

budget = VoiceBudget(
    llm_fn=your_llm,
    target_ms=800,            # TTFT budget in ms (P95)
    model="gpt-4o",           # for tiktoken token counting
    window_size=20,           # rolling window for statistics
    token_budget=2000,        # target token count after compression
    use_semantic=True,        # semantic trim (needs sentence-transformers)
    use_summarise=False,      # LLM-based summarisation (costs 1 LLM call)
    verbose=True,             # print compression decisions
    on_compression=callback,  # called after each compression event
    on_budget_violation=cb,   # called when P95 > target_ms
)

s = budget.stats()
print(s.p50_ms, s.p95_ms, s.jitter_ms)
budget.print_report()
```

```
============================================================
voice-budget Report
============================================================
Total turns: 47
Current P50 TTFT: 510ms
Current P95 TTFT: 780ms
Target: 800ms
Budget met: ✓
Compressions: 3
Helpful: 3
Harmful (rolled back): 0
Total tokens saved: 1,840
Strategies used: sliding_window, semantic_trim
============================================================
```
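The rolling statistics behind `stats()` can be approximated as below. This is a sketch: the real return shape and the exact jitter definition may differ, and jitter is taken here as the population standard deviation of the window:

```python
import statistics
from collections import deque


class RollingTTFT:
    """Rolling TTFT window, roughly mirroring the reported stats."""

    def __init__(self, window_size=20):
        self.samples = deque(maxlen=window_size)  # oldest samples age out

    def add(self, ttft_ms):
        self.samples.append(ttft_ms)

    def percentile(self, p):
        # Nearest-rank percentile over the current window
        ordered = sorted(self.samples)
        idx = min(len(ordered) - 1, int(p / 100 * len(ordered)))
        return ordered[idx]

    def summary(self):
        return {
            "p50_ms": self.percentile(50),
            "p95_ms": self.percentile(95),
            "jitter_ms": statistics.pstdev(self.samples),
        }
```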
| Tool | TTFT-aware? | Feedback loop? | Auto-tune? |
|---|---|---|---|
| context-compressor | ✗ | ✗ | ✗ |
| reme-ai | ✗ | ✗ | ✗ |
| Pipecat compaction | ✗ | ✗ | ✗ |
| LangChain SummaryMemory | ✗ | ✗ | ✗ |
| voice-budget | ✓ | ✓ | ✓ |
Issues and PRs welcome. See CONTRIBUTING.md.
MIT
When you publish a new release, follow these steps so CI can build and publish to PyPI automatically:
- Bump the version in two places:
  - `pyproject.toml` (the `version` field)
  - `voice_budget/__init__.py` (the `__version__` string)
- Run the test and lint suite locally:

  ```bash
  # Run unit tests
  pytest tests/ -v

  # Optional: run ruff if installed
  ruff check voice_budget/
  ```

- Commit the version bump and push to the remote repository:

  ```bash
  git add pyproject.toml voice_budget/__init__.py
  git commit -m "chore(release): bump version x.y.z"
  git push origin HEAD
  ```

- Create a git tag and push it (GitHub Actions will publish on tags that start with `v`):

  ```bash
  # Create an annotated tag
  git tag -a vX.Y.Z -m "Release vX.Y.Z"

  # Push the tag
  git push origin vX.Y.Z
  ```

- CI (GitHub Actions) will run tests/lint and, on tag pushes, build and publish to PyPI using the `PYPI_API_TOKEN` secret. Make sure the repository has this secret configured in Settings → Secrets → Actions as `PYPI_API_TOKEN` before pushing tags.

Notes:

- Use semantic versioning (MAJOR.MINOR.PATCH) for tags (for example `v0.2.1`).
- If a tag already exists and you truly need to move it, coordinate with maintainers: force-updating tags that are already published to PyPI is discouraged.
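A half-done bump (one file updated, the other forgotten) is easy to catch before tagging. The helper below is hypothetical, not part of the repo, and its regexes assume the conventional formatting of both files:

```python
import re


def versions_match(pyproject_text: str, init_text: str) -> bool:
    """Compare pyproject's `version` field with `__version__`.

    Both arguments are the raw file contents.
    """
    py = re.search(r'^version\s*=\s*"([^"]+)"', pyproject_text, re.M)
    init = re.search(r'^__version__\s*=\s*"([^"]+)"', init_text, re.M)
    return bool(py and init) and py.group(1) == init.group(1)
```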