I read McKinsey's report on GenAI project failures the same way I read my credit card statement: with a mix of denial and resignation. "Surely our GenAI pilots are different," I thought. Spoiler alert: they're not. But here's what the report gets wrong.
The headline screams that most GenAI projects fail. True-ish. But the reason isn't what you think. It's not because the models are dumb. It's not because the technology doesn't work. Projects fail because someone in a meeting room promised 99% accuracy in Q2, and now it's August and the chatbot keeps hallucinating about product lines your company doesn't have.
Let me be blunt: We've forgotten that LLM outputs require serious optimization work.
The Gap Between "Working AI" and "Production AI"
Here's what nobody tells you when you greenlight a GenAI project. Getting an LLM to generate something is easy. Getting it to generate something reliable enough for your actual business is a completely different beast.
I've watched brilliant engineering teams build beautiful proof-of-concepts where the model cranks out creative, coherent responses. Everyone nods. The executive sponsor gets excited. PowerPoint slides multiply like rabbits. Then someone asks the simple question: "But will it actually work for our customers?"
And that's when the lights start flickering.
The problem isn't that GenAI doesn't work. The problem is that we're treating LLM output optimization like it's some nice-to-have afterthought, when it should be the entire game from day one.
What Everyone Forgets: Optimization is 80% of the Work
Let me walk you through what we've learned the hard way:
Grounding and Guardrails: Your model needs to stay in its lane. If you're asking an LLM to help customers troubleshoot billing issues, it needs to know the difference between a possible answer and an accurate answer. That requires guardrails. Serious ones. And they take time to build.
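To make the idea concrete, here's a deliberately toy grounding check: before an answer ships, verify that its content is actually supported by the retrieved documentation and decline otherwise. Real guardrails use entailment models or citation checks, not word overlap; the function name, threshold, and fallback message here are all illustrative assumptions, not a real API.

```python
def grounded_answer(answer: str, context_passages: list[str], min_overlap: float = 0.6) -> str:
    """Toy grounding check: return the answer only if enough of its
    content words appear in the retrieved context; otherwise decline.
    The 0.6 threshold is an arbitrary placeholder you would tune."""
    fallback = "I can't answer that from the billing documentation."
    answer_words = {w.lower().strip(".,!?") for w in answer.split() if len(w) > 3}
    context_words = {w.lower().strip(".,!?") for p in context_passages for w in p.split()}
    if not answer_words:
        return fallback
    overlap = len(answer_words & context_words) / len(answer_words)
    return answer if overlap >= min_overlap else fallback
```

The point isn't the overlap heuristic; it's that "possible answer vs. accurate answer" becomes an explicit, testable gate in the pipeline rather than a hope.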
RAG (Retrieval-Augmented Generation): You think you can just feed your knowledge base into a model and call it a day? No. RAG requires tuning. Which documents get retrieved? In what order? How do you prevent the model from mixing signals across conflicting sources? This is where projects actually live or die, and it's unglamorous work.
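A sketch of what "which documents, in what order, without conflicting sources" looks like as code. This stands in for a real retriever (which would use embeddings, not keyword overlap); the `source`/`version` fields and the dedup-by-newest-revision policy are assumptions chosen to show one way to keep two revisions of the same policy out of the same prompt.

```python
def retrieve(query: str, docs: list[dict], k: int = 3) -> list[dict]:
    """Toy retriever: keep only the newest revision of each source so
    conflicting versions of the same policy never reach the prompt
    together, then rank survivors by keyword overlap with the query."""
    q = set(query.lower().split())
    latest = {}
    for d in docs:
        if d["source"] not in latest or d["version"] > latest[d["source"]]["version"]:
            latest[d["source"]] = d
    scored = sorted(
        latest.values(),
        key=lambda d: len(q & set(d["text"].lower().split())),
        reverse=True,
    )
    return scored[:k]
```

Every line of this is a tuning decision: the dedup policy, the ranking function, the cutoff `k`. That's the unglamorous work.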
Fine-tuning and Evaluation: Want your model to sound like your brand? To handle edge cases your industry cares about? To not confidently give wrong answers? That's fine-tuning. And before you can fine-tune, you need to know what "good" looks like, which means building evaluation frameworks before you even start optimizing.
The Evaluation Problem Nobody Wants to Solve: Here's the kicker: you can't just eyeball LLM outputs and call it good. You need systematic evaluation. Metrics. Baselines. Continuous testing. Most teams skip this because it feels like overhead. Then they launch, and suddenly they're dealing with production incidents that could have been caught in week two of the POC.
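"Systematic evaluation" can start far smaller than teams assume. A minimal sketch, assuming you have a labeled golden set and any callable model: run every case, report the aggregate accuracy, and keep every failing case so regressions are diagnosable. Production harnesses add fuzzy matching and LLM-as-judge scoring; exact-match comparison here is a simplifying assumption.

```python
def evaluate(model, golden_set: list[dict]) -> dict:
    """Minimal eval harness: score a model against a labeled golden set.
    Returns overall accuracy plus every failing case, so each run tells
    you not just 'how good' but 'wrong on exactly what'."""
    failures = []
    for case in golden_set:
        got = model(case["input"])
        if got != case["expected"]:
            failures.append({"input": case["input"], "expected": case["expected"], "got": got})
    return {"accuracy": 1 - len(failures) / len(golden_set), "failures": failures}
```

Run this in CI on every prompt or retrieval change and "could have been caught in week two of the POC" stops being a hypothetical.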
The Promise Problem: Why Your Timeline is Already Broken
I've been in enough rooms to recognize the pattern. Someone says: "We need 99% accuracy on customer support responses by end of Q2."
My internal alarm bells go ding ding ding.
Here's the thing about 99% accuracy with LLMs: it's achievable, but not in 12 weeks from a cold start. It requires grounding and guardrails, a tuned retrieval pipeline, fine-tuning against a real evaluation framework, and enough iteration cycles to actually close the gap.
If you're committing to 99%+ accuracy on a complex task in a compressed timeline, yellow lights should be flashing everywhere. Not because it's impossible, but because you're either underestimating the work or setting yourself up to ship something that will embarrass you.
What We've Learned: The Incremental Approach
Here's where I'm going to give you something more useful than hand-wringing.
Start with honest expectations. Your first GenAI implementation won't hit production-grade accuracy. Maybe it gets to 85%. That's fine. That's actually good if it's on a well-defined problem with proper evaluation metrics.
Build evaluation into the POC stage. Don't wait until you're in production to ask "Is this actually working?" Instrument your POC to measure accuracy from day one. Set a baseline. Track improvement. Know exactly which cases the model handles well and which ones make it lose its mind. This single practice will save you from 80% of GenAI project disasters.
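"Know exactly which cases the model handles well" implies slicing results, not just averaging them. A sketch of that instrumentation, assuming each POC result is tagged with a query category (the `category`/`correct` field names are illustrative):

```python
from collections import defaultdict

def accuracy_by_category(results: list[dict]) -> dict:
    """Break POC results down per query category, so the baseline says
    'strong on billing, weak on fraud' instead of one blended number."""
    totals, correct = defaultdict(int), defaultdict(int)
    for r in results:
        totals[r["category"]] += 1
        correct[r["category"]] += int(r["correct"])
    return {cat: correct[cat] / totals[cat] for cat in totals}
```

A single blended accuracy number hides exactly the failure modes that surface as production incidents; the per-category view is what makes the baseline actionable.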
Create incremental delivery milestones. Don't aim for "90% accuracy across all use cases" in one sprint. Aim for "65% accuracy on high-volume, low-complexity queries, with built-in escalation to humans." Then iterate. Then expand. Then optimize. Each milestone should move you toward your goal, not bet everything on a single launch.
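The milestone-one shape described above ("65% on high-volume, low-complexity queries, with escalation to humans") is ultimately a routing rule. A hedged sketch, where the complexity labels and confidence threshold are placeholder assumptions you would calibrate against your own data:

```python
def route(query_complexity: str, confidence: float, threshold: float = 0.8) -> str:
    """Milestone-one routing: the model answers only low-complexity
    queries it is confident about; everything else goes to a human.
    The 0.8 threshold is a placeholder to be tuned from eval data."""
    if query_complexity == "low" and confidence >= threshold:
        return "model"
    return "human"
```

Iterating then means widening this gate deliberately, complexity tier by complexity tier, as the eval numbers earn it.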
Be ruthlessly honest about where 95%+ accuracy is non-negotiable. In customer success, some things matter more than others. Answering "What's your return policy?" can tolerate higher error rates than "Is this charge fraudulent?" Know the difference. Invest optimization effort where it actually matters for your business.
Partner with humans, don't replace them. At least initially. GenAI works best as a force multiplier for humans, not as a human replacement. This isn't romantic; it's pragmatic. It also means your accuracy requirements can be more realistic.
The Real Reason Projects Fail
McKinsey will tell you projects fail because of organizational resistance, poor change management, or inadequate talent. All true.
But dig deeper, and you'll find the real reason: misaligned expectations between what people think GenAI can do and what it actually takes to build production-grade GenAI systems.
People see ChatGPT making magic happen and think it's easy. They don't see the optimization work, the evaluation cycles, the guardrail engineering, the fine-tuning data preparation. They see the finished product and work backward, imagining it was fast and straightforward.
It wasn't.
The Pragmatic Path Forward
If you're leading a GenAI initiative, here's what I'd actually do: set honest accuracy expectations up front, instrument evaluation from the first POC, ship incremental milestones with human escalation built in, and spend your optimization budget where accuracy is genuinely non-negotiable.
The Honest Conclusion
McKinsey's right that most GenAI projects fail. But it's not because GenAI doesn't work. It's because we're treating it like magic instead of like the engineering problem it actually is.
LLM outputs don't become production-grade through hope and executive enthusiasm. They become production-grade through systematic optimization, rigorous evaluation, and realistic timelines.
The companies winning with GenAI aren't the ones moving fastest. They're the ones being most honest about what it takes to move well.
Here's my question for you: In your organization, where are people making promises about GenAI accuracy that feel optimistic? And where do you actually have time to build the evaluation infrastructure that makes those promises realistic?
I'm genuinely curious. This is the gap where most projects live—and where the interesting conversations actually happen.