ShipTalk - SRE, DevOps, Platform Engineering, Software Delivery

Beyond the Magic Box: Solving AI Hallucinations with Precision RAG (with Evgeny Ilinykh)

By Harness | Season 4, Episode 6

In this episode of the ShipTalk Podcast, host Dewan Ahmed (Principal Developer Advocate at Harness) sits down with Evgeny Ilinykh (Founder of GuidedMind.ai and former Tesla Engineering Manager) to move past the AI hype and get into the engineering reality of Retrieval-Augmented Generation (RAG).

If your AI agents are hallucinating, the problem probably isn't your model—it’s your retrieval layer. Evgeny breaks down how to turn the "black box" of LLMs into a transparent, production-ready system that developers can actually trust.

What we cover:
- The Death of Deterministic Software: Moving from hardcoded paths to agentic AI logic.
- The "Dark Spots" of Vector Space: Why hallucinations are actually retrieval failures.
- Contextual Retrieval: Insights into how system-level context changes the game for accuracy.
- Scaling to Production: Solving the "dirty work" of messy PDFs, table parsing, and chunking.
- Standardizing AI Delivery: Will RAG become as common as CI/CD in the modern dev stack?

RESOURCES & LINKS:
1. Connect with our guest: https://www.linkedin.com/in/eilinykh/
2. Explore GuidedMind.ai: https://guidedmind.ai/
3. Read the Anthropic Research: https://www.anthropic.com/engineering/contextual-retrieval

Dewan Ahmed: Alright. Good morning, good afternoon, good evening, time-appropriate greetings. My name is Dewan, and I'm the host of ShipTalk Podcast, where we talk about the ins and outs, ups and downs of software delivery. And I'm super excited, because with me I have Evgeny, former engineering manager at Tesla, and now founder of GuidedMind.ai, and a RAG consultant. Welcome to the show, Evgeny.

Evgeny Ilinykh: Thank you for having me. Hey, Dewan, and hi, everyone.

Dewan Ahmed: Yes, I mean, RAG, neural networks, AI... these are the things everyone is trying to learn and get excited about, but I'm sure your journey didn't start here. Your journey started somewhere very different from where things are today. So, could you please share with us where your journey started, and what led you to become the founder of GuidedMind and a RAG consultant?

Evgeny Ilinykh: Absolutely, yeah. My journey started more than 20 years ago in the IT world. I started my career as a .NET engineer, then made the non-obvious decision to switch to SAP and became a business consultant, helping businesses design their business processes. I spent a big part of my career as an SAP business consultant working at large enterprises like Mars and PepsiCo.

After that, I decided to move to San Francisco, to the United States, and continue my career here. I worked in small startups first, and then at Tesla, where I spent the last six years of my career and grew from senior engineer to tech lead. Now I've decided it's time to start my own startup. The primary reason is that we live in a completely different world now. Previously we built software; now we build AI agents. I consider it a big shift in technology.

Dewan Ahmed: Yeah, I have two follow-up questions. First: what drew you from enterprise architecture and traditional software development to agentic AI development? And second: which lessons from the last three decades of traditional software development still hold true in how people are building agentic AI?

Evgeny Ilinykh: Yeah, so my belief is that we build software tools that communicate with us on the information level. Previously, we were significantly limited because we built deterministic systems—precise processes where each step is hardcoded. Now, things have changed because we work with big data and large language models. We take data and put weights on top of it, and those weights represent the logic. We aren't building deterministic systems anymore, but systems with many, many possible paths.

When I first heard about this technology, I thought, "Wow, it's exciting." There are so many opportunities to build new kinds of things. Regarding your second question about what holds true: at some point, we realized this is not just a "magic box." It has intelligence, but it also has new kinds of limitations. In software, we hardcoded intelligence and knew exactly what it was. Here, we have hallucinations.

But there are similarities. For example, if you set the temperature to zero in an LLM, it becomes more precise. It limits the model's opportunity to invent new paths. This is why AI should still be considered regular software, just with many more parameters and conditions.
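
For readers who want to try this, here's a minimal sketch of pinning temperature to zero. It assumes the OpenAI Python SDK; any chat API with a `temperature` parameter works the same way:

```python
# Minimal sketch: temperature=0 for more deterministic output.
# Assumes the OpenAI Python SDK and an OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative model name
    messages=[{"role": "user", "content": "Summarize RAG in one sentence."}],
    temperature=0,  # greedy-like decoding: limits the model's chance to "invent new paths"
)
print(response.choices[0].message.content)
```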

Dewan Ahmed: Yeah, and I know we were chatting before the call where you mentioned retrieval is the real bottleneck—not the model itself—and you were referencing a paper. Would you like to highlight more on why you think retrieval is the real bottleneck?

Evgeny Ilinykh: Yeah, absolutely. And by the way, it’s not my paper; I just admire what Anthropic is working on. One of their research papers relates to "contextual retrieval." Contextual retrieval is different from regular retrieval. Normally, we try to find something close to our query using similarity search. But this technology has limitations; it’s not an exact match. If you search for a specific number or a specific time period, similarity search can miss it. It might return data that is not properly ranked, or, if the system only returns the "top 3" results, you might lose the most important information even though it exists in your storage.

That’s why they built a kind of two-step search: they use BM25 and semantic search together. This demonstrates why context is the most important part of AI agents. It’s not a magic box. We need to be precise and understand this new tech stack. Don't assume something will magically happen if you just wire tooling together. It will still be a messy thing, and we need to be precise from an engineering perspective.
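
Here's a minimal sketch of that two-step idea: BM25 for lexical matching blended with embedding similarity. The toy `embed()` function and the 50/50 weighting are illustrative assumptions, not details from the paper:

```python
# Hybrid retrieval sketch: BM25 (exact-token match) combined with
# embedding similarity, in the spirit of Anthropic's contextual-retrieval
# write-up. Uses the rank_bm25 library.
import numpy as np
from rank_bm25 import BM25Okapi

docs = [
    "Q3 revenue grew 3% to $314M.",
    "The retrieval layer ranks chunks by similarity score.",
    "Set temperature to zero for more deterministic output.",
]

def embed(text: str, dim: int = 256) -> np.ndarray:
    # Stand-in embedding (hashed bag-of-words); swap in a real model.
    vec = np.zeros(dim)
    for tok in text.lower().split():
        vec[hash(tok) % dim] += 1.0
    return vec

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def hybrid_search(query: str, k: int = 2, alpha: float = 0.5):
    # Lexical scores: BM25 catches exact tokens ("314M") that semantic search can miss.
    lexical = BM25Okapi([d.lower().split() for d in docs]).get_scores(query.lower().split())
    lexical = lexical / (lexical.max() if lexical.max() > 0 else 1.0)
    # Semantic scores: cosine similarity in embedding space.
    q = embed(query)
    semantic = np.array([cosine(q, embed(d)) for d in docs])
    combined = alpha * lexical + (1 - alpha) * semantic  # blend both signals
    return [(docs[i], round(float(combined[i]), 3))
            for i in np.argsort(combined)[::-1][:k]]

print(hybrid_search("what was Q3 revenue in dollars?"))
```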

Dewan Ahmed: Yeah, talking about magic not working, I know a lot of people are familiar with "my model is hallucinating." You mentioned that hallucination starts at the retrieval layer mostly. Could you please expand more on that in practical terms?

Evgeny Ilinykh: Yeah, absolutely. When we talk about retrieval, we are talking about vector space—a huge space with many dimensions, sometimes 3,000 or more. This is efficient because those dimensions represent nuances. But imagine that some areas of this vector space are completely "dark": there are only a few vectors there, or nothing close to that area. When a query lands in one of those dark areas, the LLM tries to find any vector that is close, even if it's not close enough. That’s what hallucination is.

The LLM connects the dots incorrectly. It picks the closest vector and predicts the next token in the wrong direction. From that point, the hallucination unfolds like a dream—not close to reality. Retrieval tools can fix this by "coloring" those dark areas with context: instead of going off in a wrong direction, the LLM focuses on the provided context.

Dewan Ahmed: Is this similar to an auditing layer for retrieval? Typically, we test our code, APIs, and pipelines. Are we able to test what we’re getting from retrieval?

Evgeny Ilinykh: Yeah, for vector databases, we have a similarity score. We can see how close two vectors are to each other—your intent and the context you found. You can filter by it and remove irrelevant chunks. With RAG, we fix the problem of the LLM being a "black box" because we can see what's happening.
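
A minimal sketch of that kind of audit: score every retrieved chunk against the query and drop anything below a cutoff (the 0.75 threshold here is an arbitrary illustration, not a recommendation):

```python
# Retrieval-audit sketch: instead of blindly passing "top k" chunks to
# the LLM, keep only chunks whose similarity clears a threshold.
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def filter_chunks(query_vec: np.ndarray, chunks: list[dict], min_score: float = 0.75):
    # Each chunk is assumed to look like {"text": ..., "vector": ...}.
    scored = [(c["text"], cosine(query_vec, c["vector"])) for c in chunks]
    kept = [(text, s) for text, s in scored if s >= min_score]
    if not kept:
        # Nothing relevant in storage: better to answer "I don't know"
        # than to let the model improvise from a "dark" region.
        return []
    return sorted(kept, key=lambda pair: pair[1], reverse=True)
```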

Dewan Ahmed: There are different types of RAG, like Agentic RAG or GraphRAG. Many people aren't aware of where to fit which one. Would you mind sharing the different types?

Evgeny Ilinykh: Absolutely. This area is growing rapidly. With GraphRAG, you still search using semantic search, but you also have dependencies—nodes and edges connected to each other. You can walk the graph to find related nodes. If you find one relevant node, you can say, "Give me 5 more related to it." This adds logic to how the data is linked together.

If we think about coding, you have a codebase. You don’t just have one file with a class; you have blocks that use that class. That’s important because now you know all use cases and can give the LLM more context on how a function is used.
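
A minimal sketch of that graph walk, using networkx over a hypothetical code-dependency graph; the node names and edges are made up for illustration:

```python
# GraphRAG sketch: after a semantic match lands on one node, walk the
# graph to pull related nodes (e.g., every call site of a class).
import networkx as nx

g = nx.DiGraph()
g.add_edge("checkout.py::apply_discount", "pricing.py::Discount")  # uses the class
g.add_edge("cart.py::total", "pricing.py::Discount")               # uses the class
g.add_edge("tests/test_pricing.py", "pricing.py::Discount")        # tests the class

def expand(graph: nx.DiGraph, hit: str, limit: int = 5) -> list[str]:
    # "Give me 5 more related to it": neighbors in either direction.
    related = list(graph.predecessors(hit)) + list(graph.successors(hit))
    return related[:limit]

print(expand(g, "pricing.py::Discount"))
# -> every block that uses the class, not just the file that defines it
```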

Dewan Ahmed: The Anthropic paper shows that retrieval improves when chunks carry system-level context. How important is that idea for use cases like codebases or incident response workflows?

Evgeny Ilinykh: If you're familiar with your context and provide it efficiently as part of the request, you'll be efficient with tokens and the context window. You might return a list of 20 files and pollute your context window, confusing the LLM. Or you can be precise: "We need to build something with this function, and here is an example of how to use it." If it's properly summarized, the coding agent will be much more efficient.
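
Here's a rough sketch of the contextual-retrieval idea from the Anthropic paper: before embedding a chunk, ask an LLM for one sentence situating it within the whole document, and prepend that sentence. The `llm` callable and the prompt wording are placeholders, paraphrased rather than taken verbatim from the paper:

```python
# Contextual-retrieval sketch: embed "situating context + chunk"
# instead of the bare chunk, so each vector carries document-level context.
CONTEXT_PROMPT = """<document>
{document}
</document>
Here is a chunk from that document:
<chunk>
{chunk}
</chunk>
Write one short sentence situating this chunk within the overall document,
to improve search retrieval of the chunk. Answer with only that sentence."""

def contextualize(document: str, chunk: str, llm) -> str:
    # `llm` is any callable that takes a prompt string and returns text.
    context = llm(CONTEXT_PROMPT.format(document=document, chunk=chunk))
    return f"{context}\n\n{chunk}"  # embed this, not the bare chunk

# Index time: store embed(contextualize(doc, chunk, llm)) alongside the raw
# chunk text; query-time retrieval stays exactly the same.
```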

Dewan Ahmed: You have been at the forefront of building production-ready systems. What does it take to move from "toy demos" to production-ready retrieval?

Evgeny Ilinykh: When you go through testing and debugging, you realize your initial design might not be good enough. For example, you might find you’re spending a lot of money storing chunks in a vector database. Not all information is at the same level of importance. Some files should be summarized, or some vectors might need fewer dimensions. This increases complexity, and doing it manually is a nightmare. That’s why it’s important to have the right tools, like RAG-as-a-service, where you can experiment and iterate faster.
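
One way to act on that, sketched under the assumption of the OpenAI embeddings API (whose text-embedding-3 models accept a `dimensions` parameter); the tiering rule itself is a made-up illustration:

```python
# Tiered-storage sketch: spend full-dimension embeddings only on
# high-value documents, cheaper short vectors everywhere else.
from openai import OpenAI

client = OpenAI()

def embed_for_tier(text: str, important: bool) -> list[float]:
    dims = 3072 if important else 256  # smaller vectors cost less to store
    resp = client.embeddings.create(
        model="text-embedding-3-large",
        input=text,
        dimensions=dims,
    )
    return resp.data[0].embedding
```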

Dewan Ahmed: You mentioned RAG-as-a-service. What pain points led you to start GuidedMind.ai?

Evgeny Ilinykh: RAG is a perfect example of how we engineer things. You might think it's just three steps: retrieval, augmentation, and generation. In reality, each step has subtasks. For example, when you upload data, you need preprocessing. Data pipelines need to look at the data and remove "pollution." In RAG, you have tables, images in PDFs, charts... you need tools to break these apart. If you have plain text, that's one thing. If you have charts, they need to be extracted or replaced with text descriptions.
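
As one example of that dirty work, here's a sketch that pulls text and tables out of a PDF separately, so tables can be re-serialized instead of arriving as scrambled text. It assumes the pdfplumber library; chart and image handling would need an extra captioning step not shown here:

```python
# PDF preprocessing sketch: extract prose and tables on separate paths.
import pdfplumber

def extract(path: str) -> tuple[list[str], list[str]]:
    texts, tables = [], []
    with pdfplumber.open(path) as pdf:
        for page in pdf.pages:
            texts.append(page.extract_text() or "")
            for table in page.extract_tables():
                # Re-serialize each row as "col | col | col" so the
                # table structure survives chunking and embedding.
                tables.append("\n".join(
                    " | ".join(cell or "" for cell in row) for row in table
                ))
    return texts, tables
```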

Then there's chunking. You can't just set a chunk size and an overlap. If you have paragraphs, or rows in a table, a different approach is needed. One bad file format can break your whole system, and if a chunk is cut in the middle of a paragraph, the most important information might be missing.
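
A minimal sketch of paragraph-aware chunking: split on blank lines and pack whole paragraphs up to a size budget, so no chunk is cut mid-paragraph (the 1,000-character budget is arbitrary):

```python
# Paragraph-aware chunking sketch: respect paragraph boundaries instead
# of cutting at a fixed character offset.
def chunk_by_paragraph(text: str, max_chars: int = 1000) -> list[str]:
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current = [], ""
    for para in paragraphs:
        if current and len(current) + len(para) + 2 > max_chars:
            chunks.append(current)   # close the chunk at a paragraph boundary
            current = para
        else:
            current = f"{current}\n\n{para}" if current else para
    if current:
        chunks.append(current)
    return chunks
```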

Dewan Ahmed: What were some of the early prototypes when you were building a RAG-as-a-service platform?

Evgeny Ilinykh: The "aha" moment was an experiment where I saw someone put an entire codebase into one text file and create embeddings on top of it. With a small local model—around 8 billion parameters—you start to see it operate on a completely different level because it's familiar with the context. It starts to perform like a cloud-based LLM. The only thing that changed was the context.

Dewan Ahmed: Is this why you think companies need retrieval observability before they try to fine-tune their models?

Evgeny Ilinykh: I believe so. Many tools focus on workflow automation, but they don't give you the precision to configure RAG. They don't cover context much; they focus on fine-tuning or memory. But context is critical. If you want to build powerful agents, look at what Anthropic is doing—they focus on context a lot.

Dewan Ahmed: What needs to happen for AI agents to become more trustworthy for mission-critical workflows?

Evgeny Ilinykh: I think the industry will become more mature. Right now, things like ChatGPT or Claude are kind of black boxes. Companies have their "secrets" for how they get good results, and they don't want competitors to catch up, so they don't tell us everything. But eventually, those hidden details will become publicly available. We’ll build more tools like RAG-as-a-service, which will give us the opportunity to build agents faster and more reliably. If you can test and debug context, the agents become more reliable.

Dewan Ahmed: Do you think RAG will ever be as standardized as CI/CD or monitoring?

Evgeny Ilinykh: I believe so. It might not be in the exact shape it's in right now, but it will be part of the tech stack, or a design pattern people use frequently. It will be something like relational databases.

Dewan Ahmed: What is your suggestion to leadership at companies building or adopting AI? What do they need to be careful of?

Evgeny Ilinykh: Because of the rapid changes, we are all in a rush. We have the same issue we had before with software: we try to release faster and ship a buggy, non-production-ready product. With AI agents, it's the same. We need to focus on the small things. Focus on your data pipeline. Don't skip important blocks, because eventually the whole system will fall apart in production. If your answers are irrelevant 50% of the time, what is the point of the agent? People don't trust AI agents today not because the LLMs aren't trustworthy, but because we haven't spent enough time making the systems solid.

Dewan Ahmed: What is the long-term mission of GuidedMind.ai?

Evgeny Ilinykh: It's in the name: I'm trying to build a tool that is "guided" and transparent. We want to be able to look at an AI agent not as a black box, but with instruments to precisely configure, debug, and test it, all in one place. RAG is the framework that helps with the retrieval side of that.

Dewan Ahmed: Where can our listeners find your work or connect with you?

Evgeny Ilinykh: You can find me on LinkedIn. And if you want to look at the tool I'm working on, go to GuidedMind.ai. You can try it for free or schedule a demo.

Dewan Ahmed: Awesome. Thanks so much, Evgeny, for spending time with us. Listeners, this was the ShipTalk Podcast, sponsored by Harness, the AI-native software delivery platform. I'm your host, Dewan. We'll link Evgeny's LinkedIn and GuidedMind.ai in the description. Tune in for the next episode.

Evgeny Ilinykh: Thank you so much. Thanks for having me.