How I define naïve and agentic RAG

For me, naïve RAG is a system with three parts: an indexer, a retriever, and a generation engine. The indexer gathers all knowledge relevant to the user base, processes it (usually by chunking and embedding), and stores it in a database (usually a vector database). When a user types a query, the system always retrieves first and passes the search results to the generation engine, which then generates an answer to the user's query. People also refer to this as single-shot RAG or static-retrieval RAG.
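The retrieve-then-generate flow described above can be sketched in a few lines. This is a minimal illustration, not a production design: the bag-of-words `embed` function stands in for a real embedding model, `generate` stands in for an LLM call, and all function names are hypothetical.

```python
import math
from collections import Counter


def embed(text):
    """Toy bag-of-words 'embedding'; a real system would call an embedding model."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def build_index(chunks):
    """Indexer: embed every chunk once and keep (chunk, vector) pairs."""
    return [(chunk, embed(chunk)) for chunk in chunks]


def retrieve(index, query, k=2):
    """Retriever: return the k chunks most similar to the query."""
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]


def generate(query, context):
    """Placeholder for the generation engine (an LLM call in a real system)."""
    return f"Answer to '{query}' based on: " + " | ".join(context)


def naive_rag(index, query, k=2):
    """Single-shot RAG: always retrieve first, then generate exactly once."""
    return generate(query, retrieve(index, query, k))


index = build_index([
    "The press line is serviced every Monday",
    "Vacation requests go through the HR portal",
    "Torque wrenches are calibrated yearly",
])
print(naive_rag(index, "when is the press line serviced", k=1))
```

Note that `naive_rag` has no branch: retrieval runs exactly once for every query, whether or not it is needed.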

An agentic RAG system operates more dynamically. It indexes the data the same way as the previous approach, but at query time the system can analyse the query, decide whether a retrieval is needed, analyse the retrieval results, and decide to retrieve again until the query can be answered. Usually, such a system achieves this by using tools (functions with specific capabilities to perform tasks) and extensive planning, validation, and decomposition of tasks.
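The decide, retrieve, and re-evaluate cycle described above can be sketched as a small loop. This is an illustrative sketch only: in a real system `decide` would be an LLM call with tool definitions, while here it is a hard-coded stub. The names `agentic_rag`, `toy_search`, `toy_decide`, `toy_answer`, and the `DOCS` data are all hypothetical.

```python
def agentic_rag(query, search, decide, answer, max_steps=4):
    """Loop until the decision step says the evidence is sufficient or the step limit is hit."""
    evidence = []
    for _ in range(max_steps):
        action = decide(query, evidence)
        if action["type"] == "answer":
            break
        # The agent chose to call the search tool; feed the results back as new evidence.
        evidence.extend(search(action["query"]))
    return answer(query, evidence)


# Toy stand-ins so the loop can be run end to end.
DOCS = {
    "supplier of part 42": ["Part 42 is supplied by Acme."],
    "acme contact": ["Acme's contact is Jane Doe."],
}


def toy_search(search_query):
    """Stub search tool: exact-match lookup instead of a vector search."""
    return DOCS.get(search_query, [])


def toy_decide(query, evidence):
    """Stub planner: decomposes the question into two hops, then stops."""
    if not evidence:
        return {"type": "search", "query": "supplier of part 42"}
    if len(evidence) == 1:
        return {"type": "search", "query": "acme contact"}
    return {"type": "answer"}


def toy_answer(query, evidence):
    """Stub generator: concatenates the collected evidence."""
    return " ".join(evidence)


print(agentic_rag("Who is the contact for the supplier of part 42",
                  toy_search, toy_decide, toy_answer))
```

The `max_steps` limit is a design choice of this sketch. Without some cap, a faulty decision step could keep searching indefinitely.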

Timeline of retrieval-enabled AI systems

Somewhere near the end of 2023, we started seeing the first production-grade RAG (retrieval-augmented generation) chatbots. These chatbots marked a new milestone in the adoption of large language models (LLMs) in enterprise work life. LLM usage started scaling in enterprises and, through different use cases, enterprises started seeing the real benefit LLMs have to offer: rather than fearing AI because of its obvious drawbacks, a controlled, use-case-based adoption can actually help fill the gap caused by demographic change in OECD nations like Germany [->].

For example, older, more experienced workers were a source of knowledge and often helped onboard new colleagues, passing on mostly undocumented knowledge to their often younger colleagues in the process. This knowledge, while precious, was not available to everyone, let alone searchable. This, among many other problems, was solvable with RAG. We built multiple systems in which experienced workers were casually interviewed during their work and asked to explain things on video. These videos were then used to create multimodal source documents, which were ingested into a vector database and made searchable. I remember my first board meeting at one of Germany's automotive manufacturing giants and seeing them eager to invest after our first MVP demo.

As interest in RAG chatbots skyrocketed throughout 2024 and 2025, we saw a wide range of mostly positive developments in the technology and structure of RAG: LLMs became more powerful with larger context windows, embedding models and rerankers went multimodal, and so on. As a result, failing to absolutely nail chunk size, overlap, and similar parameters was no longer fatal. Using MaxSim and similar algorithms in ColPali ->, multimodal patches can be compared inline and re-sorted according to relevance in real time, making complex-to-extract documents (some enterprises have data dating back to the 1960s!) easily searchable. At some point during this period, tool calling with LLMs became so robust that it was suddenly possible to let the model decide when to search, rerank, or run whatever workflow might be desired. This was the rise of agentic RAG. Instead of doing a single first-pass retrieval and submitting the result to the generation engine, the system would send the search results back to the model as a tool result, evaluate them, and search again until the right data was found. This boosted the accuracy of such systems considerably.
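To make the MaxSim scoring mentioned above concrete, here is a minimal pure-Python sketch of late-interaction scoring. For each query vector it takes the best-matching document vector (patch or token) by dot product and sums those maxima. The tiny 2-dimensional vectors and page names are made up for illustration, and real systems run this on batched tensors rather than Python lists.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def maxsim(query_vecs, doc_vecs):
    """Late-interaction score: sum over query vectors of the best match among document vectors."""
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)


def rerank(query_vecs, docs):
    """Sort document names by descending MaxSim score; `docs` maps name -> list of vectors."""
    return sorted(docs, key=lambda name: maxsim(query_vecs, docs[name]), reverse=True)


# Hypothetical embeddings: two query tokens, two pages with two patches each.
query = [[1.0, 0.0], [0.0, 1.0]]
pages = {
    "page_b": [[1.0, 0.0], [1.0, 0.0]],  # only matches the first query token
    "page_a": [[1.0, 0.0], [0.0, 1.0]],  # matches both query tokens
}
print(rerank(query, pages))
```

Because every query vector picks its own best patch, a page is rewarded for covering all parts of the query rather than for being similar on average.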

So the main question among RAG experts: Is RAG dead?

This has meanwhile become a meme. RAG experts (for example, folks over at LightOn in their blog) have pointed out that even with large context windows, context rot is still a big issue. If you are an enterprise with over 20 TB of documents, you are never going to fit them all inside an LLM's context window. One problem I have been noticing a lot is that enterprises that adopted RAG chatbots in 2024/25 have not developed them towards agentic systems but have instead tried to fix the search while hanging on to the same chunk/overlap settings and ingestion pipelines. Since they need enterprise approval for more capable vector databases, they often resort to Postgres extensions like pgvector or to cloud offerings like Azure (Cognitive | AI) Search. These are good choices if you are mostly interested in text-vs-text similarity comparisons, but even the simplest hybrid of HNSW vector search and keyword search requires a substantial amount of knowledge and coding. The low-hanging fruit I see for enterprises with existing chatbots is to adopt an agentic backend, which will typically be more accurate than naïve RAG.
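One piece of such a hybrid setup is merging the ranked list from the vector search with the ranked list from the keyword search. The text above does not name a fusion method, so the sketch below uses reciprocal rank fusion (RRF) as an assumed, commonly used choice. The function name, the document ids, and the constant `k=60` (a conventional default) are illustrative.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids into one list.

    Each document receives 1 / (k + rank) from every list it appears in
    (rank starts at 1); documents are returned by descending total score.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical results from the two retrievers for the same query.
vector_hits = ["a", "b", "c"]
keyword_hits = ["b", "d", "a"]
print(reciprocal_rank_fusion([vector_hits, keyword_hits]))
```

RRF only needs ranks, not raw scores, so it avoids having to normalise cosine similarities against keyword scores. The fusion step is the small part; index tuning, filtering, and keeping both indexes in sync are where most of the effort goes.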

So the question remains: has RAG died? Personally, I believe RAG has found even more adoption than I could have imagined. We have specialized coding agents powered by specialized code embedders, and we have audio and video embedders that further extend the scope of data we can perform RAG over. Many companies have developed specialized software that can build an index over your audio stash, making recorded data like client-support phone calls searchable. Others, like Anthropic, have apparently made software engineers obsolete and no longer need them ->.

RAG in 2026

If I were starting a greenfield project that requires any kind of accuracy, I would definitely choose an agentic architecture. Agentic paradigms like multi-agent systems and single-agent multi-tool setups have fundamentally changed what is possible and have permanently widened the range of use cases, including ones we thought impossible in the early days of RAG. Seamless integration of agents into tasks like legal due diligence checks is suddenly possible because we can fire up a swarm of agents that take on subtasks and report back to a central orchestrator. We have judging algorithms that check whether a subtask was fulfilled and recursively optimize subtasks. All of this has proven most useful in use cases that require a lot of looking things up.

My go-to RAG setup in 2026

I would personally prefer a single agent with multiple tools if the task were to create a highly capable RAG chatbot. The main drawback of multi-agent systems is that they require substantial effort to tune and iterate on, which is usually worth it only for complex scenarios requiring the utmost accuracy.

Indexing

For indexing, I would choose multimodal embedders and work with document-extraction systems that preserve document layout. I would prefer one of the battle-tested, open-source vector databases. Since the context window of most modern embedding models has grown from the 512-token sequence length of BERT-based models to around 8192 tokens, we have a lot of leeway in choosing chunk size and overlap. The goal is obviously to need as few hops as possible to get to the data, and these choices are crucial for recall and F1.
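As a small illustration of the chunk size and overlap parameters discussed above, here is a sliding-window chunker over a token list. It is a sketch under simple assumptions: tokens are already available as a list, and the function name `chunk_tokens` is hypothetical. Layout-aware extraction, which the text recommends, would replace the fixed window with document structure.

```python
def chunk_tokens(tokens, chunk_size, overlap):
    """Split `tokens` into windows of `chunk_size` that share `overlap` tokens with the previous window."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once a window has reached the end, to avoid emitting a redundant tail chunk.
        if start + chunk_size >= len(tokens):
            break
    return chunks


tokens = list(range(10))
for chunk in chunk_tokens(tokens, chunk_size=4, overlap=1):
    print(chunk)
```

With a model that accepts around 8192 tokens, `chunk_size` can be set much larger than the 512-token limit allowed, which is where the extra leeway comes from.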

Vector Database

Since I am a big fan of ColPali and multivector embeddings, I would choose one of the many databases that support them natively. Clustering support, performance, and reliability are other crucial factors. Personally, I have mostly used Vespa -> where enterprise guidelines allowed it; otherwise I have stuck with Azure and implemented the complementary algorithms in backend code.

Agent Framework

I am not a big fan of having many dependencies or of chasing the shiniest new tools; I prefer native SDKs. This has spared me a lot of headaches recently, as supply chain attacks are becoming more frequent and more severe (e.g. the LiteLLM incident ->). I often resort to OpenAI's Python SDK, published as the openai package, and it suits my needs. My team and I have also used langgraph and their deepagents, but other than marginal speedups, I do not see any advantage.

Evaluations

Many teams treat evaluations as optional, yet an evaluation setup is crucial to the iterative improvement of any RAG system and gives you valuable insight into which component of a complex system is not performing as intended. Having 2-3 go-to questions, typing them in after every update, and checking whether the answers feel similar is not scientific and definitely not scalable. Setting up a golden dataset and running it against every engine update might be time-consuming and tiresome, but it will save you a lot of time in the long run.
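A golden-dataset run for the retrieval component can be sketched as follows. This is an assumption-laden illustration: the dataset layout (`question`, `relevant_ids`), the function names, and the stub retriever are all hypothetical, and it only scores retrieval, not the generated answers.

```python
def retrieval_metrics(retrieved, relevant):
    """Precision, recall, and F1 of one retrieved id list against the golden relevant ids."""
    retrieved_set, relevant_set = set(retrieved), set(relevant)
    hits = len(retrieved_set & relevant_set)
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    recall = hits / len(relevant_set) if relevant_set else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


def evaluate(golden, retriever, k=3):
    """Average the per-question metrics over the whole golden dataset."""
    if not golden:
        raise ValueError("golden dataset is empty")
    per_question = [
        retrieval_metrics(retriever(item["question"])[:k], item["relevant_ids"])
        for item in golden
    ]
    return {
        name: sum(m[name] for m in per_question) / len(per_question)
        for name in ("precision", "recall", "f1")
    }


# Hypothetical golden dataset and a stub retriever returning fixed ids.
golden = [
    {"question": "q1", "relevant_ids": ["d1", "d2"]},
    {"question": "q2", "relevant_ids": ["d3"]},
]
canned = {"q1": ["d1", "d9"], "q2": ["d3"]}
print(evaluate(golden, lambda question: canned[question]))
```

Running the same `evaluate` call after every engine update gives comparable numbers per release, which is the point of replacing the 2-3 go-to questions.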

Traces

I recommend tracking traces of agentic planning/analysing and tool calls. It helps track the decision tree and figure out where exactly the agent goes off the rails and, trust me, it will happen more often than you would want. Mostly, I have used mlflow and it serves its purpose, but there are others in the market which do their job just fine.
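To show what a trace of tool calls contains, here is a minimal in-memory tracer. It is not the mlflow API; it is a hypothetical stand-in (`Tracer`, `Span`, and `traced` are made-up names) that records the same kind of information a tracing tool would capture: which step ran, with what inputs, what it returned, whether it failed, and how long it took.

```python
import functools
import time
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class Span:
    """One recorded step: a tool call or a planning decision."""
    name: str
    inputs: dict
    output: Any = None
    error: Optional[str] = None
    duration_s: float = 0.0


class Tracer:
    """Collects spans in call order so the agent's decision path can be replayed."""

    def __init__(self):
        self.spans = []

    def traced(self, name):
        """Decorator that records every keyword-argument call of the wrapped function."""
        def decorator(fn):
            @functools.wraps(fn)
            def wrapper(**kwargs):
                span = Span(name=name, inputs=dict(kwargs))
                self.spans.append(span)
                start = time.perf_counter()
                try:
                    span.output = fn(**kwargs)
                    return span.output
                except Exception as exc:
                    # Record the failure, then let the caller handle it.
                    span.error = repr(exc)
                    raise
                finally:
                    span.duration_s = time.perf_counter() - start
            return wrapper
        return decorator


tracer = Tracer()


@tracer.traced("search_tool")
def search(query):
    """Stub tool used only to demonstrate the tracer."""
    return [f"result for {query}"]


search(query="calibration schedule")
for span in tracer.spans:
    print(span.name, span.inputs, span.output, span.error)
```

Reading the spans in order shows the path the agent took, which is what you need when it goes off the rails.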

So agentic RAG is better and I should always use it, right?

While agentic RAG is better in terms of accuracy, among other things, there are still drawbacks:

  • Latency: Decisions, tool calls, and recursion add significant latency compared with single-shot RAG.
  • Cost: Since we are calling the LLM provider more often and consuming more tokens, costs grow at least linearly with the number of steps, because each call typically resends the accumulated context.
  • Routing Errors: LLM decision-making is often faulty and such errors need comprehensive checks.
  • Over Searching: It is not uncommon for an LLM to go into a spiral and be overly cautious; search budgets etc. help reduce this.
  • Failure Cascade: Sometimes everything goes wrong. This is why we need tracing and evals.
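The search budget mentioned in the list above can be as simple as a counter that the agent loop consults before every tool call. The sketch below is one hypothetical way to do it: the class name, the two limits, and the token counts are illustrative assumptions, not a recommendation for specific values.

```python
class SearchBudget:
    """Caps the number of tool calls and the total tokens an agent may spend on one query."""

    def __init__(self, max_calls, max_tokens):
        self.max_calls = max_calls
        self.max_tokens = max_tokens
        self.calls = 0
        self.tokens = 0

    def try_spend(self, tokens):
        """Record one call costing `tokens` if it fits the budget; return False otherwise."""
        if self.calls + 1 > self.max_calls or self.tokens + tokens > self.max_tokens:
            return False
        self.calls += 1
        self.tokens += tokens
        return True


budget = SearchBudget(max_calls=2, max_tokens=100)
for estimated_tokens in (40, 70, 50, 1):
    allowed = budget.try_spend(estimated_tokens)
    print(estimated_tokens, "allowed" if allowed else "rejected")
```

When `try_spend` returns False, the agent loop should stop searching and answer with the evidence it has, which bounds both the latency and the cost of a spiralling query.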

Conclusion

Agentic RAG is not a universal replacement. It is the right choice when ambiguity, multi-hop retrieval, tool orchestration, or evidence validation matter. For narrow FAQ-style workloads, naïve retrieve-then-read pipelines may still win on cost, latency, and simplicity. Jumping on the agentic train may cost you more than it returns, so I recommend doing the due diligence first.