# Medha
Stop paying LLMs to answer the same question twice.
Medha is an async Python semantic cache library for AI Text-to-Query systems. It sits between your application and the LLM, matching incoming questions by semantic intent and returning pre-cached SQL, Cypher, or GraphQL queries without an LLM call.
LLMs regenerate the same structured queries thousands of times per day. Medha intercepts these repeated patterns before the LLM call and returns cached results in under a millisecond. It works with SQL, Cypher, GraphQL, or any structured query language: drop it in front of any LLM and watch your inference costs fall.
## Features
- **Waterfall Search**: Five progressive tiers, from a sub-millisecond L1 cache to full semantic similarity. Each tier activates only when the previous one misses, keeping average latency minimal.
- **9 Vector Backends**: An in-memory backend for development, plus Qdrant, pgvector, Elasticsearch, VectorChord, Chroma, Weaviate, Redis Stack, Azure AI Search, and LanceDB for production. See the configuration sketch after this list.
- **4 Embedding Providers**: Local ONNX via FastEmbed (no API key, no cost), or the cloud providers OpenAI, Cohere, and Gemini. Swap providers without losing cached data.
- **Production Ready**: Per-entry TTL, bulk ingestion, pattern and exact invalidation, hit-rate metrics, latency percentiles, and Prometheus-compatible observability. See the illustrative sketch after this list.
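Backend and embedder selection both go through the constructor shown in the quickstart below. Here is a minimal sketch of moving from the development setup to a production one; the `backend_type` value for Qdrant, the connection field, and the cloud adapter import path are assumptions modeled on the quickstart's naming, not the documented API:

```python
from medha import Medha, Settings
from medha.embeddings.fastembed_adapter import FastEmbedAdapter

# Development: in-memory backend, local ONNX embeddings, no API keys.
dev_cache = Medha(
    "demo",
    embedder=FastEmbedAdapter(),
    settings=Settings(backend_type="memory"),
)

# Production (sketch): the backend_type value and backend_url field are
# assumptions mirroring the quickstart's Settings usage; check the backend docs.
prod_settings = Settings(
    backend_type="qdrant",                # assumed identifier for the Qdrant backend
    backend_url="http://localhost:6333",  # hypothetical connection setting
)

# Cloud embeddings (sketch): the module and class below mirror the FastEmbed
# import path but are assumptions; check the embeddings docs for real names.
# from medha.embeddings.openai_adapter import OpenAIAdapter
# embedder = OpenAIAdapter(api_key=os.environ["OPENAI_API_KEY"])
```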
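The production features are listed here without code, so the following is purely illustrative: every method name and parameter below is a hypothetical stand-in for the real API.

```python
# All names below are hypothetical illustrations of the feature list,
# not the documented API.
await cache.store(
    "How many active users do we have?",
    "SELECT COUNT(*) FROM users WHERE active = true",
    ttl=3600,  # per-entry TTL in seconds (assumed kwarg)
)
await cache.invalidate(pattern="*active users*")  # pattern invalidation (assumed method)
stats = await cache.metrics()                     # hit rate, latency percentiles (assumed)
```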
## Quickstart
```python
import asyncio

from medha import Medha, Settings
from medha.embeddings.fastembed_adapter import FastEmbedAdapter


async def main():
    embedder = FastEmbedAdapter()  # local ONNX, no API key
    settings = Settings(backend_type="memory")

    async with Medha("demo", embedder=embedder, settings=settings) as cache:
        await cache.store(
            "How many active users do we have?",
            "SELECT COUNT(*) FROM users WHERE active = true",
        )

        hit = await cache.search("Count of active users")
        print(hit.generated_query)  # SELECT COUNT(*) FROM users WHERE active = true
        print(hit.strategy)  # SearchStrategy.SEMANTIC_MATCH
        print(f"Confidence: {hit.confidence:.0%}")  # e.g. Confidence: 94%


asyncio.run(main())
```
## Architecture
Medha uses a five-tier waterfall search. Each tier is only reached when the previous one misses:
```mermaid
graph TD
    Q[Question] --> L1[L1 Cache]
    L1 -->|HIT| R[Return Result]
    L1 -->|MISS| TM[Template Match]
    TM -->|HIT| R
    TM -->|MISS| EV["Exact Vector<br/>score ≥ 0.99"]
    EV -->|HIT| R
    EV -->|MISS| SM["Semantic Match<br/>score ≥ 0.85"]
    SM -->|HIT| R
    SM -->|MISS| FZ["Fuzzy Match<br/>optional"]
    FZ -->|HIT| R
    FZ -->|MISS| LLM[Call LLM]
    LLM --> ST[Store in Medha]
    ST --> R
```
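The bottom of the diagram is the application's responsibility: on a full miss you call the LLM yourself and store the result, so the next structurally similar question hits a cache tier. A minimal sketch of that loop, assuming `search` returns `None` on a miss (an assumption; the quickstart only shows the hit path) and using a hypothetical `generate_query_with_llm` in place of your LLM client:

```python
async def text_to_query(cache, question: str) -> str:
    hit = await cache.search(question)
    if hit is not None:
        # Served by one of the five tiers above; no LLM call made.
        return hit.generated_query

    # Full waterfall miss: pay for one LLM call, then cache the result.
    query = await generate_query_with_llm(question)  # hypothetical LLM client
    await cache.store(question, query)
    return query
```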
## Performance
The hit rate is defined as:

$$
\text{hit rate} = \frac{\text{cache hits}}{\text{total queries}}
$$
Even a 50% hit rate halves LLM costs for repetitive query workloads. In practice, production Text-to-Query systems see hit rates of 70–90% because users repeatedly ask structurally equivalent questions.
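As a back-of-the-envelope illustration of that claim (all numbers below are made up for the example):

```python
queries_per_day = 10_000  # illustrative workload
price_per_call = 0.002    # illustrative $ per LLM call
hit_rate = 0.80           # within the 70-90% range above

llm_calls = queries_per_day * (1 - hit_rate)  # 2,000 calls reach the LLM
daily_cost = llm_calls * price_per_call       # $4.00 instead of $20.00
print(f"{llm_calls:.0f} LLM calls/day, ${daily_cost:.2f} instead of "
      f"${queries_per_day * price_per_call:.2f}")
```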
## Installation
See Getting Started for installation options, environment variables, and a full walkthrough.