Case Study · Personal Project

Why You Don’t Need a Bigger Model

AI Engineering · RAG · Python · Qdrant · Groq

Accuracy gap (8B vs 70B with RAG): 3 percentage points

Cost reduction: 88% cheaper

Latency (8b_rag): 333ms

The Problem

Building a customer service chatbot for FAQ handling raises an immediate cost question: do you need a frontier model? A 70B-parameter model costs roughly 9× more per token than an 8B model. If a smaller model with retrieval can match its accuracy, the engineering choice is obvious. The goal of this benchmark was to quantify that gap rather than assume it.

What I Did

1. The hypothesis

For a customer service chatbot answering FAQs, using a frontier model felt wasteful. The question: can a smaller model with RAG match a larger model's accuracy at a fraction of the cost? I ran a controlled benchmark across four conditions (8B and 70B models, each with and without RAG) against 100 FAQ items from a fictional Indonesian e-commerce company, TokoFiktif.
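As a minimal sketch of that 2×2 design, the four conditions can be generated from the two variables being crossed. The config names match those in the results table; the dictionary layout and the `config_name` helper are assumptions for illustration, not the benchmark's actual code.

```python
from itertools import product

MODELS = ["8b", "70b"]        # model sizes compared in the benchmark
RAG_OPTIONS = [True, False]   # with and without retrieved context


def config_name(model: str, use_rag: bool) -> str:
    """Build a label such as '8b_rag' or '70b_no_rag' (hypothetical helper)."""
    return f"{model}_{'rag' if use_rag else 'no_rag'}"


# Cross every model with every RAG setting: 2 x 2 = 4 conditions.
CONFIGS = [
    {"name": config_name(model, use_rag), "model": model, "use_rag": use_rag}
    for model, use_rag in product(MODELS, RAG_OPTIONS)
]

for cfg in CONFIGS:
    print(cfg["name"])
```

Running every FAQ item through each entry in `CONFIGS` keeps the question set fixed, so differences in accuracy can be attributed to model size or retrieval rather than to the data.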

2. Choosing a neutral evaluator

Accuracy was judged by qwen/qwen3.6-27b via Groq. Choosing a judge from a different model family than the ones being evaluated matters: a model grading outputs from its own family can favor them, which biases the scores. The Qwen judge is capable enough to distinguish correct from incorrect answers reliably.
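The judging step can be sketched as a prompt builder plus a strict verdict parser. The prompt wording, the one-word CORRECT/INCORRECT protocol, and all function names below are assumptions for illustration; the source does not show the actual judge prompt, and the call to the judge model itself is left out.

```python
# Hypothetical grading prompt; the real benchmark's wording is not shown in the source.
JUDGE_TEMPLATE = """You are grading a customer service answer.
Question: {question}
Reference answer: {reference}
Candidate answer: {candidate}
Reply with exactly one word: CORRECT or INCORRECT."""


def build_judge_prompt(question: str, reference: str, candidate: str) -> str:
    """Fill the grading template for one FAQ item."""
    return JUDGE_TEMPLATE.format(
        question=question, reference=reference, candidate=candidate
    )


def parse_verdict(reply: str) -> bool:
    """Map the judge's reply to True/False, rejecting anything ambiguous."""
    words = reply.strip().upper().split()
    first = words[0].strip(".,:!") if words else ""
    if first == "CORRECT":
        return True
    if first == "INCORRECT":
        return False
    raise ValueError(f"Unparseable judge reply: {reply!r}")


def accuracy(verdicts: list[bool]) -> float:
    """Share of items the judge marked correct."""
    return sum(verdicts) / len(verdicts)
```

Raising on an unparseable reply, rather than counting it as incorrect, keeps judge formatting failures from silently lowering a model's score.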

3. Vector search with BAAI/bge-m3

FAQ items were chunked, embedded with BAAI/bge-m3, and stored in a local Qdrant instance. At query time, the top-k most semantically similar FAQ chunks were retrieved and injected into the prompt as context. bge-m3 was chosen for its strong multilingual performance, which matters because the dataset is in Indonesian.
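The retrieve-then-inject flow can be illustrated without any external services. This sketch swaps in hand-written 3-dimensional vectors and a brute-force cosine ranking in place of bge-m3 embeddings and Qdrant, so it shows the idea only. The FAQ texts, vectors, and prompt wording are made-up placeholders, not data from the benchmark.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec: list[float], chunks: list[dict], k: int = 2) -> list[dict]:
    """Return the k chunks most similar to the query vector."""
    ranked = sorted(
        chunks, key=lambda c: cosine(query_vec, c["vector"]), reverse=True
    )
    return ranked[:k]


def build_prompt(question: str, retrieved: list[dict]) -> str:
    """Inject the retrieved chunks into the prompt as context."""
    context = "\n".join(f"- {c['text']}" for c in retrieved)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )


# Placeholder chunks with toy vectors standing in for real embeddings.
CHUNKS = [
    {"text": "Returns are accepted within 7 days.", "vector": [0.9, 0.1, 0.0]},
    {"text": "Standard shipping takes 2 to 4 days.", "vector": [0.1, 0.9, 0.0]},
    {"text": "Sellers register via the dashboard.", "vector": [0.0, 0.1, 0.9]},
]

query_vec = [0.8, 0.2, 0.1]  # toy embedding of a returns-related question
hits = top_k(query_vec, CHUNKS, k=1)
print(build_prompt("How do I return an item?", hits))
```

In a real pipeline the vectors would come from the embedding model and the ranking from the vector database; the prompt assembly step stays essentially the same.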

4. What the numbers show

Without RAG, both models fail badly: 17% accuracy for 8B and 35% for 70B. The models are guessing. With RAG, accuracy jumps to 97% (8B) and 100% (70B), a gap of 3 percentage points. But 8b_rag is 88% cheaper ($0.0023 vs $0.0203) and 41% faster (333ms vs 564ms). The conclusion is clear: RAG is the dominant factor, not model size.
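The comparisons above can be re-derived from the reported figures. This is a small arithmetic check using only the published, rounded numbers, so results may differ slightly from percentages computed on unrounded data. The `RESULTS` layout and the `relative_reduction` helper are illustrative assumptions.

```python
# Reported figures: accuracy in %, latency in ms, cost in USD.
RESULTS = {
    "70b_rag":    {"accuracy": 100, "latency_ms": 564, "cost_usd": 0.0203},
    "8b_rag":     {"accuracy": 97,  "latency_ms": 333, "cost_usd": 0.0023},
    "70b_no_rag": {"accuracy": 35,  "latency_ms": 421, "cost_usd": 0.0119},
    "8b_no_rag":  {"accuracy": 17,  "latency_ms": 261, "cost_usd": 0.0014},
}


def relative_reduction(baseline: float, value: float) -> float:
    """Fractional saving of `value` relative to `baseline`."""
    return (baseline - value) / baseline


big, small = RESULTS["70b_rag"], RESULTS["8b_rag"]
cost_cut = relative_reduction(big["cost_usd"], small["cost_usd"])
latency_cut = relative_reduction(big["latency_ms"], small["latency_ms"])

# Effect of adding RAG vs. effect of model size, in percentage points.
rag_gain_8b = small["accuracy"] - RESULTS["8b_no_rag"]["accuracy"]
rag_gain_70b = big["accuracy"] - RESULTS["70b_no_rag"]["accuracy"]
size_gap_with_rag = big["accuracy"] - small["accuracy"]

print(f"cost reduction:    {cost_cut:.1%}")
print(f"latency reduction: {latency_cut:.1%}")
print(f"RAG gain: 8B +{rag_gain_8b} pts, 70B +{rag_gain_70b} pts")
print(f"size gap with RAG: {size_gap_with_rag} pts")
```

On these rounded inputs the cost reduction comes out just under 89% and the latency reduction at about 41%. Adding RAG is worth 65 to 80 percentage points of accuracy, while moving from 8B to 70B with RAG is worth 3.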

Benchmark Results

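| Config | RAG | Accuracy | Latency | Cost |
| --- | --- | --- | --- | --- |
| 70b_rag | Yes | 100% | 564ms | $0.0203 |
| 8b_rag | Yes | 97% | 333ms | $0.0023 |
| 70b_no_rag | No | 35% | 421ms | $0.0119 |
| 8b_no_rag | No | 17% | 261ms | $0.0014 |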

100 FAQ items · evaluated by qwen/qwen3.6-27b · models via Groq free tier

Takeaway

RAG is the dominant variable, not model size. Without it, even the 70B model answers correctly only 35% of the time. With it, the 8B model reaches 97% accuracy at 88% lower cost and 41% lower latency. For constrained FAQ use cases, reaching for a bigger model first is the wrong instinct.

Live Demo

The production implementation of 8b_rag is running at tokofiktif.rival.my.id. Ask it anything about TokoFiktif — shipping policy, returns, seller registration.

Source: faq-rag-benchmark · tokofiktif