Accuracy gap (8B vs 70B with RAG): 3 percentage points
Cost reduction: 88% cheaper
Latency (8b_rag): 333ms
Building a customer service chatbot for FAQ handling raises an immediate cost question: do you need a frontier model? A 70B-parameter model costs roughly 9× more per token than an 8B model. If a smaller model with retrieval can match its accuracy, the engineering choice is obvious. The goal of this benchmark was to quantify that gap rather than assume it.
The hypothesis
For a customer service chatbot answering FAQs, using a frontier model felt wasteful. The question: can a smaller model with RAG match a larger model's accuracy at a fraction of the cost? I ran a controlled benchmark across four conditions (8B and 70B models, each with and without RAG) against 100 FAQ items from a fictional Indonesian e-commerce company, TokoFiktif.
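The four conditions form a 2×2 grid of model size and retrieval. A minimal sketch of how that grid could be enumerated follows; the condition names match the labels used in the results table, while the dictionary layout and the helper `condition_name` are illustrative assumptions rather than the benchmark's actual code.

```python
from itertools import product

MODEL_SIZES = ["8b", "70b"]   # the two model sizes compared
RAG_MODES = [True, False]     # with and without retrieval


def condition_name(size: str, use_rag: bool) -> str:
    """Build a label such as '8b_rag' or '70b_no_rag' (hypothetical helper)."""
    return f"{size}_{'rag' if use_rag else 'no_rag'}"


# Every combination of model size and retrieval mode: 2 x 2 = 4 conditions.
CONDITIONS = [
    {"name": condition_name(size, use_rag), "model_size": size, "use_rag": use_rag}
    for size, use_rag in product(MODEL_SIZES, RAG_MODES)
]

for condition in CONDITIONS:
    print(condition["name"])
```

Running every FAQ item through all four conditions keeps the question set fixed, so any difference in accuracy can be attributed to model size or retrieval.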
Choosing a neutral evaluator
Accuracy was judged by qwen/qwen3.6-27b via Groq. Choosing a judge from a different model family than the ones being evaluated matters, because a model grading outputs from its own family introduces bias. The Qwen judge is capable enough to distinguish correct from incorrect answers reliably.
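The judging step can be sketched as a prompt template plus a verdict parser. This is only an illustration under stated assumptions: the prompt wording, the CORRECT/INCORRECT output convention, and the `judge` callable (a stand-in for a real API call to the evaluator model) are all hypothetical and not taken from the benchmark.

```python
from typing import Callable

# Hypothetical judge prompt; the benchmark's real wording is not shown in this article.
JUDGE_TEMPLATE = (
    "You are grading a customer service answer.\n"
    "Question: {question}\n"
    "Reference answer: {reference}\n"
    "Candidate answer: {candidate}\n"
    "Reply with exactly one word: CORRECT or INCORRECT."
)


def build_judge_prompt(question: str, reference: str, candidate: str) -> str:
    """Fill the template for one FAQ item."""
    return JUDGE_TEMPLATE.format(
        question=question, reference=reference, candidate=candidate
    )


def parse_verdict(raw: str) -> bool:
    """True only if the reply starts with CORRECT ('INCORRECT' does not match)."""
    return raw.strip().upper().startswith("CORRECT")


def accuracy(items: list[dict], judge: Callable[[str], str]) -> float:
    """Fraction of items the judge marks correct. `judge` maps a prompt to a reply."""
    verdicts = [parse_verdict(judge(build_judge_prompt(**item))) for item in items]
    return sum(verdicts) / len(verdicts)
```

Passing the judge in as a plain callable keeps the scoring logic testable with a stub before any API calls are made.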
Vector search with BAAI/bge-m3
FAQ items were chunked, embedded with BAAI/bge-m3, and stored in a local Qdrant instance. At query time, the top-k most semantically similar FAQ chunks were retrieved and injected into the prompt as context. bge-m3 was chosen for its strong multilingual performance, which is relevant because the dataset is Indonesian.
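The retrieve-then-inject step can be illustrated without any external services. The sketch below is a simplification under stated assumptions: it uses tiny hand-written vectors in place of bge-m3 embeddings, a brute-force cosine ranking in place of Qdrant, and a hypothetical prompt layout. The sample chunks are invented for the example.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec: list[float], chunks: list[dict], k: int = 3) -> list[dict]:
    """Return the k chunks whose vectors are most similar to the query vector."""
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, c["vector"]), reverse=True)
    return ranked[:k]


def build_prompt(question: str, retrieved: list[dict]) -> str:
    """Inject retrieved chunks as context ahead of the question (assumed layout)."""
    context = "\n".join(f"- {c['text']}" for c in retrieved)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )


# Toy 3-dimensional vectors standing in for real embeddings (illustrative only).
FAQ_CHUNKS = [
    {"text": "Shipping takes 2-4 business days.", "vector": [1.0, 0.0, 0.0]},
    {"text": "Returns are accepted within 7 days.", "vector": [0.0, 1.0, 0.0]},
    {"text": "Express shipping is available.", "vector": [0.9, 0.1, 0.0]},
]

hits = top_k([1.0, 0.0, 0.0], FAQ_CHUNKS, k=2)
print(build_prompt("How long does shipping take?", hits))
```

In a real pipeline the query would be embedded with the same model as the chunks, and the vector store would perform the similarity search.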
What the numbers show
Without RAG, both models fail badly: 17% for 8B, 35% for 70B. The models are guessing. With RAG, accuracy jumps to 97% (8B) and 100% (70B), a gap of 3 percentage points. But 8b_rag is 88% cheaper ($0.0023 vs $0.0203) and has 41% lower latency (333ms vs 564ms). The conclusion is clear: RAG is the dominant factor, not model size.
| Config | RAG | Accuracy | Latency | Cost |
|---|---|---|---|---|
| 70b_rag | ✓ | 100% | 564ms | $0.0203 |
| 8b_rag | ✓ | 97% | 333ms | $0.0023 |
| 70b_no_rag | ✗ | 35% | 421ms | $0.0119 |
| 8b_no_rag | ✗ | 17% | 261ms | $0.0014 |
100 FAQ items · evaluated by qwen/qwen3.6-27b · models via Groq free tier
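The headline figures can be recomputed directly from the table. The sketch below copies the rounded values shown above into a dictionary (the key names are illustrative) and derives the cost reduction, latency reduction, and accuracy differences.

```python
# Rounded values copied from the results table above.
RESULTS = {
    "70b_rag":    {"accuracy_pct": 100, "latency_ms": 564, "cost_usd": 0.0203},
    "8b_rag":     {"accuracy_pct": 97,  "latency_ms": 333, "cost_usd": 0.0023},
    "70b_no_rag": {"accuracy_pct": 35,  "latency_ms": 421, "cost_usd": 0.0119},
    "8b_no_rag":  {"accuracy_pct": 17,  "latency_ms": 261, "cost_usd": 0.0014},
}


def pct_reduction(baseline: float, value: float) -> float:
    """Percentage by which `value` is lower than `baseline`."""
    return (baseline - value) / baseline * 100


big, small = RESULTS["70b_rag"], RESULTS["8b_rag"]

cost_reduction = pct_reduction(big["cost_usd"], small["cost_usd"])
latency_reduction = pct_reduction(big["latency_ms"], small["latency_ms"])
accuracy_gap_points = big["accuracy_pct"] - small["accuracy_pct"]

# Accuracy gained by adding RAG to each model, in percentage points.
rag_gain_8b = RESULTS["8b_rag"]["accuracy_pct"] - RESULTS["8b_no_rag"]["accuracy_pct"]
rag_gain_70b = RESULTS["70b_rag"]["accuracy_pct"] - RESULTS["70b_no_rag"]["accuracy_pct"]

print(f"cost reduction:    {cost_reduction:.1f}%")
print(f"latency reduction: {latency_reduction:.1f}%")
print(f"accuracy gap:      {accuracy_gap_points} points")
print(f"RAG gain:          8B +{rag_gain_8b} points, 70B +{rag_gain_70b} points")
```

On the rounded table values this gives a cost reduction of about 88.7% and a latency reduction of about 41.0%. Adding RAG is worth 80 points for the 8B model and 65 points for the 70B model, while moving from 8B to 70B with RAG in place is worth only 3 points.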
RAG is the dominant variable — not model size. Without it, even the 70B model answers correctly only 35% of the time. With it, the 8B model reaches 97% accuracy at 88% lower cost and 41% lower latency. For constrained FAQ use cases, reaching for a bigger model first is the wrong instinct.
The production implementation of 8b_rag is running at tokofiktif.rival.my.id. Ask it anything about TokoFiktif — shipping policy, returns, seller registration.
Source: faq-rag-benchmark · tokofiktif