TEI for Production: Embeddings and Cross-Encoder Reranking
TL;DR: Hindsight runs two inference-heavy operations: embeddings for vector search and a cross-encoder for reranking. Both default to local SentenceTransformers, and both can be offloaded to TEI (Text Embeddings Inference) for production. Two separate TEI instances, two separate configurations.
Two Inference Bottlenecks
Hindsight uses two model inference operations:
- Embeddings (bge-small-en-v1.5): Generated during retain() for every extracted fact. Also used on recall() for the query vector, but that's a single embedding per query.
- Cross-encoder reranking (ms-marco-MiniLM-L-6-v2): Runs on recall() to rerank candidates after RRF fusion.
The heavy embedding load is on retain() - if you’re storing a conversation with 20 extracted facts, that’s 20 embeddings generated. On recall(), you only embed the query once.
Both default to local SentenceTransformers. Both become bottlenecks at scale.
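To make the asymmetry concrete, here is a rough sketch of what the local default does on each call path, using SentenceTransformers directly. This is an illustration of the workload, not Hindsight's internal code, and the facts and candidates are made up:

from sentence_transformers import CrossEncoder, SentenceTransformer

embedder = SentenceTransformer("BAAI/bge-small-en-v1.5")
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

# retain(): every extracted fact gets its own embedding.
facts = [f"fact {i} extracted from the conversation" for i in range(20)]
fact_vectors = embedder.encode(facts)                        # shape (20, 384)

# recall(): one query embedding, then a cross-encoder score per candidate.
query = "What did the user say about deadlines?"
query_vector = embedder.encode(query)                        # shape (384,)
candidates = ["candidate memory A", "candidate memory B", "candidate memory C"]
scores = reranker.predict([(query, c) for c in candidates])
print(fact_vectors.shape, query_vector.shape, scores)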
Why Local Inference Doesn’t Scale
Hindsight defaults to local inference using SentenceTransformers. This works fine for development - no external dependencies, quick setup, runs on CPU.
The problems start at scale:
- Resource contention: Embedding computation competes with your API for CPU/memory
- No batching optimization: Each request processes independently
- Can’t scale independently: Need more embedding throughput? You have to scale the entire Hindsight instance
- No GPU utilization: Local provider runs on CPU only
In my experience, local embeddings start showing latency issues around 100+ concurrent users, especially during retain() operations that process multiple text chunks.
Enter TEI
Text Embeddings Inference is HuggingFace’s production embedding server. It’s written in Rust and designed specifically for high-throughput inference.
Key features:
- Dynamic batching: Groups incoming requests by token count, not request count. Better hardware utilization.
- Flash Attention + cuBLAS: Optimized transformer kernels for NVIDIA GPUs
- gRPC API: Lower latency than HTTP for high-frequency calls
- Prometheus metrics + OpenTelemetry: Production observability out of the box
The deployment model is simple: TEI runs as a separate service, Hindsight calls it over HTTP.
Setting Up TEI for Embeddings
Docker (GPU)
docker run --gpus all -p 8080:80 \
  -v $PWD/data:/data \
  ghcr.io/huggingface/text-embeddings-inference:1.8 \
  --model-id BAAI/bge-small-en-v1.5

The -v flag mounts a local directory for model caching - avoids re-downloading weights on every restart.
Docker (CPU)
docker run -p 8080:80 \
  -v $PWD/data:/data \
  ghcr.io/huggingface/text-embeddings-inference:cpu-1.6 \
  --model-id BAAI/bge-small-en-v1.5

The CPU version is slower but useful for environments without GPU access.
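Once a TEI container is up, it's worth a quick smoke test before wiring up Hindsight. TEI serves a POST /embed endpoint that takes a JSON body with an inputs field; a minimal check (the expected output in the comment assumes the default bge-small-en-v1.5 model):

import requests

resp = requests.post(
    "http://localhost:8080/embed",
    json={"inputs": "smoke test sentence"},
    timeout=10,
)
resp.raise_for_status()
vectors = resp.json()                  # one embedding per input string
print(len(vectors), len(vectors[0]))   # expect: 1 384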
Connecting Hindsight
Two environment variables:
HINDSIGHT_API_EMBEDDINGS_PROVIDER=tei
HINDSIGHT_API_EMBEDDINGS_TEI_URL=http://localhost:8080

Hindsight will now route all embedding calls to TEI.
Setting Up TEI for Cross-Encoder Reranking
The cross-encoder runs after RRF fusion to rerank candidates. It’s a separate model, separate TEI instance.
Docker
docker run --gpus all -p 8081:80 \
  -v $PWD/data:/data \
  ghcr.io/huggingface/text-embeddings-inference:1.8 \
  --model-id cross-encoder/ms-marco-MiniLM-L-6-v2

Note the different port (8081) - you need both instances running.
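Rerankers use a different endpoint: POST /rerank takes a query plus a list of texts and returns per-text relevance scores. A quick smoke test against the instance above (the printed output is illustrative; actual scores will differ):

import requests

resp = requests.post(
    "http://localhost:8081/rerank",
    json={
        "query": "what is the project deadline?",
        "texts": ["we agreed to ship by Friday", "lunch today was pasta"],
    },
    timeout=10,
)
resp.raise_for_status()
print(resp.json())  # e.g. [{"index": 0, "score": 0.93}, {"index": 1, "score": 0.002}]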
Connecting Hindsight
HINDSIGHT_API_RERANKER_PROVIDER=tei
HINDSIGHT_API_RERANKER_TEI_URL=http://localhost:8081

Alternative Models
The default ms-marco-MiniLM-L-6-v2 is optimized for speed. For better accuracy at the cost of latency:
- cross-encoder/ms-marco-MiniLM-L-12-v2 - deeper model, better ranking
- cross-encoder/mmarco-mMiniLMv2-L12-H384-v1 - multilingual support
Unlike embeddings, cross-encoders don’t have a dimension constraint - they output relevance scores, not vectors.
The 384-Dimension Constraint
This is critical: Hindsight requires exactly 384-dimensional vectors. The database schema is fixed to this size.
bge-small-en-v1.5 outputs 384 dimensions, which is why it’s the default. If you want a different model, verify dimensions first:
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("your-model-name")
test_embedding = model.encode("test")
print(len(test_embedding))  # Must be 384

Models that work:
- BAAI/bge-small-en-v1.5 (384d)
- sentence-transformers/all-MiniLM-L6-v2 (384d)
Models that don’t work without modification:
- BAAI/bge-large-en-v1.5 (1024d)
- text-embedding-3-small (1536d)
Trying to use a non-384d model will fail at database insertion.
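The same check is worth running against the TEI instance itself before pointing Hindsight at a non-default model, since the served model is what actually has to produce 384-dimensional vectors. A small sketch, assuming the embeddings instance from above on port 8080:

import requests

resp = requests.post(
    "http://localhost:8080/embed",
    json={"inputs": "dimension check"},
    timeout=10,
)
resp.raise_for_status()
dim = len(resp.json()[0])
assert dim == 384, f"Hindsight's schema expects 384 dimensions, got {dim}"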
Performance Tuning
TEI exposes several parameters for tuning throughput:
Batch Size
--max-batch-tokens 16384

Controls total tokens per batch. Higher values = better GPU utilization but more memory. Start with the default and increase if you have GPU headroom.
Concurrent Requests
--max-concurrent-requests 512

How many requests TEI queues before rejecting. Set based on expected peak load.
Tokenization Workers
--tokenization-workers 4

CPU cores dedicated to tokenization. Defaults to auto-detect, but an explicit setting helps in containerized environments where core detection can be wrong.
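These flags interact, so measure rather than guess. A crude way to compare settings is to fire a burst of concurrent /embed requests and watch latency (and the TEI metrics described below) while varying --max-batch-tokens. A rough sketch, not a proper benchmark - the request count and concurrency are arbitrary:

import time
from concurrent.futures import ThreadPoolExecutor

import requests

TEI_URL = "http://localhost:8080/embed"

def embed(i: int) -> float:
    """Send one embedding request and return its latency in seconds."""
    start = time.perf_counter()
    requests.post(TEI_URL, json={"inputs": f"synthetic document {i}"}, timeout=30).raise_for_status()
    return time.perf_counter() - start

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=64) as pool:
    latencies = sorted(pool.map(embed, range(200)))
total = time.perf_counter() - start

print(f"throughput: {200 / total:.1f} req/s")
print(f"p95 latency: {latencies[int(0.95 * len(latencies))]:.3f}s")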
Production Configuration Example
Here’s what I run in production - two TEI instances, one for embeddings, one for reranking:
# Embeddings TEI
docker run --gpus all -p 8080:80 \
  -v /var/lib/tei/embeddings:/data \
  --restart unless-stopped \
  --name tei-embeddings \
  ghcr.io/huggingface/text-embeddings-inference:1.8 \
  --model-id BAAI/bge-small-en-v1.5 \
  --max-batch-tokens 32768 \
  --max-concurrent-requests 1024 \
  --json-output \
  --prometheus-port 9000

# Reranker TEI
docker run --gpus all -p 8081:80 \
  -v /var/lib/tei/reranker:/data \
  --restart unless-stopped \
  --name tei-reranker \
  ghcr.io/huggingface/text-embeddings-inference:1.8 \
  --model-id cross-encoder/ms-marco-MiniLM-L-6-v2 \
  --max-batch-tokens 16384 \
  --max-concurrent-requests 512 \
  --json-output \
  --prometheus-port 9001

Then configure Hindsight:
# Embeddings
HINDSIGHT_API_EMBEDDINGS_PROVIDER=tei
HINDSIGHT_API_EMBEDDINGS_TEI_URL=http://localhost:8080

# Reranker
HINDSIGHT_API_RERANKER_PROVIDER=tei
HINDSIGHT_API_RERANKER_TEI_URL=http://localhost:8081

Key choices:
- Separate Prometheus ports (9000, 9001) for independent monitoring
- Lower batch tokens for the reranker - cross-encoder query-document pairs are longer than single embedding inputs
- --json-output: Structured logs for log aggregators
Health Check
TEI exposes /health for liveness probes:
curl http://localhost:8080/health

Returns 200 when ready to serve.
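In startup scripts it helps to block until both instances report healthy before starting Hindsight. A small sketch using the ports from the production example above:

import time

import requests

def wait_for_tei(url: str, timeout: float = 120.0) -> None:
    """Poll TEI's /health endpoint until it returns 200 or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            if requests.get(f"{url}/health", timeout=2).status_code == 200:
                return
        except requests.ConnectionError:
            pass
        time.sleep(1)
    raise RuntimeError(f"TEI at {url} not healthy after {timeout}s")

wait_for_tei("http://localhost:8080")  # embeddings
wait_for_tei("http://localhost:8081")  # reranker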
Kubernetes Deployment
For Kubernetes, deploy both TEI instances as separate Deployments:
# Embeddings TEI
apiVersion: apps/v1
kind: Deployment
metadata:
  name: tei-embeddings
spec:
  replicas: 2
  selector:
    matchLabels:
      app: tei-embeddings
  template:
    metadata:
      labels:
        app: tei-embeddings
    spec:
      containers:
      - name: tei
        image: ghcr.io/huggingface/text-embeddings-inference:1.8
        args: ["--model-id", "BAAI/bge-small-en-v1.5", "--port", "80"]
        ports:
        - containerPort: 80
        resources:
          limits:
            nvidia.com/gpu: 1
        readinessProbe:
          httpGet:
            path: /health
            port: 80
---
apiVersion: v1
kind: Service
metadata:
  name: tei-embeddings
spec:
  selector:
    app: tei-embeddings
  ports:
  - port: 8080
    targetPort: 80
---
# Reranker TEI
apiVersion: apps/v1
kind: Deployment
metadata:
  name: tei-reranker
spec:
  replicas: 2
  selector:
    matchLabels:
      app: tei-reranker
  template:
    metadata:
      labels:
        app: tei-reranker
    spec:
      containers:
      - name: tei
        image: ghcr.io/huggingface/text-embeddings-inference:1.8
        args: ["--model-id", "cross-encoder/ms-marco-MiniLM-L-6-v2", "--port", "80"]
        ports:
        - containerPort: 80
        resources:
          limits:
            nvidia.com/gpu: 1
        readinessProbe:
          httpGet:
            path: /health
            port: 80
---
apiVersion: v1
kind: Service
metadata:
  name: tei-reranker
spec:
  selector:
    app: tei-reranker
  ports:
  - port: 8080
    targetPort: 80

Then configure Hindsight:
HINDSIGHT_API_EMBEDDINGS_TEI_URL=http://tei-embeddings:8080
HINDSIGHT_API_RERANKER_TEI_URL=http://tei-reranker:8080

Scale each independently based on load patterns - embeddings are called heavily on retain() (multiple per call), reranking only on recall().
Monitoring
TEI exports Prometheus metrics on port 9000 by default. Key metrics to watch:
- te_request_duration_seconds: Embedding latency histogram
- te_batch_size: Actual batch sizes being processed
- te_queue_size: Pending requests in queue
If the queue grows while batch sizes stay low, you're likely CPU-bound on tokenization. If batch sizes are high but latency is still high, you need more GPU memory or more replicas.
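If nothing is scraping these yet, you can spot-check the raw metrics endpoints directly. A quick sketch using the Prometheus ports from the production example (9000/9001) and the standard /metrics path; exact metric names can vary between TEI versions:

import requests

for name, port in [("embeddings", 9000), ("reranker", 9001)]:
    metrics = requests.get(f"http://localhost:{port}/metrics", timeout=5).text
    interesting = [
        line for line in metrics.splitlines()
        if line.startswith(("te_queue_size", "te_batch_size", "te_request_duration_seconds"))
    ]
    print(f"--- {name} ---")
    print("\n".join(interesting[:10]))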
When to Switch
I think local inference is fine for:
- Development and testing
- Single-user applications
- Low-frequency recall() calls (< 10/second)
Switch to TEI when:
- Latency matters at scale
- You have multiple Hindsight instances sharing inference
- You want GPU acceleration
- You need independent scaling of embedding/reranking throughput
TEI adds operational complexity - two extra services to manage. But it removes both inference bottlenecks. For production workloads, the separation of concerns is worth it: Hindsight handles memory logic, TEI handles the model inference.
#ai #agents #memory #hindsight #embeddings #reranking #production #infrastructure