
TEI for Production: Embeddings and Cross-Encoder Reranking

TL;DR: Hindsight uses two inference-heavy operations: embeddings for vector search and a cross-encoder for reranking. Both default to local SentenceTransformers, and both can be offloaded to TEI (Text Embeddings Inference) for production. Two separate TEI instances, two separate configurations.


Two Inference Bottlenecks

Hindsight uses two model inference operations:

  1. Embeddings (bge-small-en-v1.5): Generated during retain() for every extracted fact. Also used on recall() for the query vector, but that’s a single embedding per query.
  2. Cross-encoder reranking (ms-marco-MiniLM-L-6-v2): Runs on recall() to rerank candidates after RRF fusion.

The heavy embedding load is on retain() - if you’re storing a conversation with 20 extracted facts, that’s 20 embeddings generated. On recall(), you only embed the query once.

Both default to local SentenceTransformers. Both become bottlenecks at scale.

Why Local Inference Doesn’t Scale

Hindsight defaults to local inference using SentenceTransformers. This works fine for development - no external dependencies, quick setup, runs on CPU.

The problems start at scale. In my experience, local embeddings start showing latency issues around 100+ concurrent users, especially during retain() operations that process multiple text chunks per call.

Enter TEI

Text Embeddings Inference is HuggingFace’s production embedding server. It’s written in Rust and designed specifically for high-throughput inference.

Key features:

  - Written in Rust with token-based dynamic batching, so concurrent requests get packed into efficient batches
  - Optimized inference kernels (Flash Attention, cuBLASLt) on GPU
  - Small images, fast startup, Safetensors weight loading
  - Production-ready observability: Prometheus metrics and OpenTelemetry tracing

The deployment model is simple: TEI runs as a separate service, and Hindsight calls it over HTTP.

Setting Up TEI for Embeddings

Docker (GPU)

docker run --gpus all -p 8080:80 \
  -v $PWD/data:/data \
  ghcr.io/huggingface/text-embeddings-inference:1.8 \
  --model-id BAAI/bge-small-en-v1.5

The -v flag mounts a local directory for model caching - avoids re-downloading weights on every restart.
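
Once the container is up, a quick smoke test against TEI's /embed endpoint confirms the model is loaded and producing vectors of the expected size. A minimal sketch - the input string is arbitrary, and jq is only used to count dimensions:

curl -s http://localhost:8080/embed \
  -H "Content-Type: application/json" \
  -d '{"inputs": "hello from hindsight"}' | jq '.[0] | length'
# Expect 384 for bge-small-en-v1.5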

Docker (CPU)

docker run -p 8080:80 \
  -v $PWD/data:/data \
  ghcr.io/huggingface/text-embeddings-inference:cpu-1.6 \
  --model-id BAAI/bge-small-en-v1.5

The CPU version is slower, but it's useful for environments without GPU access.

Connecting Hindsight

Two environment variables:

HINDSIGHT_API_EMBEDDINGS_PROVIDER=tei
HINDSIGHT_API_EMBEDDINGS_TEI_URL=http://localhost:8080

Hindsight will now route all embedding calls to TEI.

Setting Up TEI for Cross-Encoder Reranking

The cross-encoder runs after RRF fusion to rerank candidates. It’s a separate model, separate TEI instance.

Docker

docker run --gpus all -p 8081:80 \
  -v $PWD/data:/data \
  ghcr.io/huggingface/text-embeddings-inference:1.8 \
  --model-id cross-encoder/ms-marco-MiniLM-L-6-v2

Note the different port (8081) - you need both instances running.
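
To sanity-check this instance, TEI's /rerank endpoint takes a query plus candidate texts and returns a relevance score per candidate. The query and texts below are made-up examples:

curl -s http://localhost:8081/rerank \
  -H "Content-Type: application/json" \
  -d '{"query": "what coffee does the user drink?", "texts": ["User prefers oat milk lattes", "User lives in Berlin"]}' | jq .
# Returns one {"index": ..., "score": ...} entry per candidate text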

Connecting Hindsight

HINDSIGHT_API_RERANKER_PROVIDER=tei
HINDSIGHT_API_RERANKER_TEI_URL=http://localhost:8081

Alternative Models

The default ms-marco-MiniLM-L-6-v2 is optimized for speed. For better accuracy at the cost of latency, larger rerankers such as cross-encoder/ms-marco-MiniLM-L-12-v2 or BAAI/bge-reranker-base are common swaps - just change the --model-id and keep the rest of the configuration the same.

Unlike embeddings, cross-encoders don’t have a dimension constraint - they output relevance scores, not vectors.

The 384-Dimension Constraint

This is critical: Hindsight requires exactly 384-dimensional vectors. The database schema is fixed to this size.

bge-small-en-v1.5 outputs 384 dimensions, which is why it’s the default. If you want a different model, verify dimensions first:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("your-model-name")
test_embedding = model.encode("test")
print(len(test_embedding))  # Must be 384

Models that work:

  - BAAI/bge-small-en-v1.5 (the default)
  - sentence-transformers/all-MiniLM-L6-v2
  - sentence-transformers/all-MiniLM-L12-v2
  - intfloat/e5-small-v2

Models that don't work without modification:

  - BAAI/bge-base-en-v1.5 (768 dimensions)
  - BAAI/bge-large-en-v1.5 (1024 dimensions)
  - sentence-transformers/all-mpnet-base-v2 (768 dimensions)

Trying to use a non-384d model will fail at database insertion.
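
If you're evaluating several candidates, a throwaway shell loop saves repeating the check by hand. The model names below are just examples, and each one is downloaded on first run:

for m in sentence-transformers/all-MiniLM-L6-v2 BAAI/bge-base-en-v1.5; do
  python -c "from sentence_transformers import SentenceTransformer as ST; print('$m:', len(ST('$m').encode('test')))"
done
# Prints 384 for the first model and 768 for the second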

Performance Tuning

TEI exposes several parameters for tuning throughput:

Batch Size

--max-batch-tokens 16384

Controls total tokens per batch. Higher values = better GPU utilization but more memory. Start with the default and increase if you have GPU headroom.

Concurrent Requests

--max-concurrent-requests 512

How many requests TEI queues before rejecting. Set based on expected peak load.

Tokenization Workers

--tokenization-workers 4

CPU cores dedicated to tokenization. Defaults to auto-detect, but explicit setting helps in containerized environments where core detection can be wrong.

Production Configuration Example

Here’s what I run in production - two TEI instances, one for embeddings, one for reranking:

# Embeddings TEI
docker run --gpus all -p 8080:80 \
  -v /var/lib/tei/embeddings:/data \
  --restart unless-stopped \
  --name tei-embeddings \
  ghcr.io/huggingface/text-embeddings-inference:1.8 \
  --model-id BAAI/bge-small-en-v1.5 \
  --max-batch-tokens 32768 \
  --max-concurrent-requests 1024 \
  --json-output \
  --prometheus-port 9000

# Reranker TEI
docker run --gpus all -p 8081:80 \
  -v /var/lib/tei/reranker:/data \
  --restart unless-stopped \
  --name tei-reranker \
  ghcr.io/huggingface/text-embeddings-inference:1.8 \
  --model-id cross-encoder/ms-marco-MiniLM-L-6-v2 \
  --max-batch-tokens 16384 \
  --max-concurrent-requests 512 \
  --json-output \
  --prometheus-port 9001

Then configure Hindsight:

# Embeddings
HINDSIGHT_API_EMBEDDINGS_PROVIDER=tei
HINDSIGHT_API_EMBEDDINGS_TEI_URL=http://localhost:8080

# Reranker
HINDSIGHT_API_RERANKER_PROVIDER=tei
HINDSIGHT_API_RERANKER_TEI_URL=http://localhost:8081

Key choices:

  - Higher --max-batch-tokens and --max-concurrent-requests on the embeddings instance, since retain() generates several embeddings per call
  - --restart unless-stopped so both containers survive reboots
  - --json-output for structured logs
  - Separate Prometheus ports (9000 and 9001) so each instance can be scraped independently

Health Check

TEI exposes /health for liveness probes:

curl http://localhost:8080/health

Returns 200 when ready to serve.
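
In a startup script, a small polling loop can gate anything that depends on the two instances. A minimal sketch, assuming the ports from the setup above:

for url in http://localhost:8080 http://localhost:8081; do
  until curl -sf "$url/health" > /dev/null; do
    echo "waiting for $url ..."
    sleep 2
  done
done
echo "both TEI instances are ready"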

Kubernetes Deployment

For Kubernetes, deploy both TEI instances as separate Deployments:

# Embeddings TEI
apiVersion: apps/v1
kind: Deployment
metadata:
  name: tei-embeddings
spec:
  replicas: 2
  selector:
    matchLabels:
      app: tei-embeddings
  template:
    metadata:
      labels:
        app: tei-embeddings
    spec:
      containers:
      - name: tei
        image: ghcr.io/huggingface/text-embeddings-inference:1.8
        args: ["--model-id", "BAAI/bge-small-en-v1.5", "--port", "80"]
        ports:
        - containerPort: 80
        resources:
          limits:
            nvidia.com/gpu: 1
        readinessProbe:
          httpGet:
            path: /health
            port: 80
---
apiVersion: v1
kind: Service
metadata:
  name: tei-embeddings
spec:
  selector:
    app: tei-embeddings
  ports:
  - port: 8080
    targetPort: 80
---
# Reranker TEI
apiVersion: apps/v1
kind: Deployment
metadata:
  name: tei-reranker
spec:
  replicas: 2
  selector:
    matchLabels:
      app: tei-reranker
  template:
    metadata:
      labels:
        app: tei-reranker
    spec:
      containers:
      - name: tei
        image: ghcr.io/huggingface/text-embeddings-inference:1.8
        args: ["--model-id", "cross-encoder/ms-marco-MiniLM-L-6-v2", "--port", "80"]
        ports:
        - containerPort: 80
        resources:
          limits:
            nvidia.com/gpu: 1
        readinessProbe:
          httpGet:
            path: /health
            port: 80
---
apiVersion: v1
kind: Service
metadata:
  name: tei-reranker
spec:
  selector:
    app: tei-reranker
  ports:
  - port: 8080
    targetPort: 80

Then configure Hindsight:

HINDSIGHT_API_EMBEDDINGS_TEI_URL=http://tei-embeddings:8080
HINDSIGHT_API_RERANKER_TEI_URL=http://tei-reranker:8080

Scale each independently based on load patterns - embeddings are called heavily on retain() (multiple per call), reranking only on recall().
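
For example, if retain() traffic outgrows the embedding tier, scale it without touching the reranker (replica counts here are illustrative):

kubectl scale deployment/tei-embeddings --replicas=4
kubectl scale deployment/tei-reranker --replicas=2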

Monitoring

TEI exports Prometheus metrics on port 9000 by default. Key metrics to watch:

  - Queue size - how many requests are waiting to be batched
  - Batch size - how many sequences actually go into each inference pass
  - Request latency - end-to-end time per embed or rerank call

If queue size grows while batch size stays low, you're likely CPU-bound on tokenization. If batch size is high but latency is too, you need more GPU memory or replicas.
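
Metric names vary a bit between TEI versions, so the simplest starting point is to list what your build exports. This assumes the Prometheus port is reachable from where you run curl (e.g. published with -p 9000:9000 in the docker run above):

curl -s http://localhost:9000/metrics | grep -E 'queue|batch|duration' | head -20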

When to Switch

I think local inference is fine for:

  - Local development and testing
  - Small deployments with a handful of concurrent users
  - Environments where running extra services isn't worth the operational overhead

Switch to TEI when:

  - Concurrency grows to the point where local embeddings add noticeable latency (for me, around 100+ concurrent users)
  - retain() latency matters, since every call embeds each extracted fact
  - You want to scale embedding and reranking capacity independently of the Hindsight API


TEI adds operational complexity - two extra services to manage. But it removes both inference bottlenecks. For production workloads, the separation of concerns is worth it: Hindsight handles memory logic, TEI handles the model inference.

Hindsight documentation | GitHub

#ai #agents #memory #hindsight #embeddings #reranking #production #infrastructure
