Close the GenAI “Learning Gap”: Self‑Improving AI Without Fine‑Tuning

AI DEMO

2:15 PM – 3:00 PM

Room 275

SPEAKER

Ben McHone

Staff Engineering Consultant, Source Allies

SPEAKER

Matt Vincent

Founder, Source Allies


## Session: Close the GenAI “Learning Gap”: Self‑Improving AI Without Fine‑Tuning
**Track:** AI Demo | **Time:** 2:15 PM–3:00 PM | **Room:** 275 | **Type:** AI Demo
**Conference:** CIRAS AI Summit for Iowa — May 6, 2026, Scheman Building, Iowa State University, Ames IA

### Speaker(s)

**Ben McHone** — Staff Engineering Consultant, Source Allies (Urbandale, IA)
Ben McHone is a Staff Engineering Consultant at Source Allies, specializing in deploying agentic AI systems to production. He focuses on metric‑driven development and real‑world reliability, addressing the question: How do we know we can trust this technology? Ben is a DSPy contributor, LangChain Expert Program member, and Arize / Phoenix Ambassador.

**Matt Vincent** — Founder, Source Allies (Urbandale, IA)
Matt Vincent founded Source Allies, an Iowa‑headquartered consultancy specializing in Data & AI with multiple GenAI systems in production delivering measurable ROI. He works with organizations to move generative AI from pilot to product.

### Session Description

The MIT State of AI report surfaced a brutal truth: most GenAI systems do not retain feedback, adapt to context, or improve over time. While frontier models get better with every release, enterprises rarely gain a durable advantage, because their systems don’t actually learn.

The default answer is fine‑tuning. In practice, it’s often expensive, brittle, slow to iterate, and tightly coupled to a specific model version. Worse, it can lock teams out of rapidly improving frontier models.

This session presents an alternative: learning‑loop architectures that allow enterprise GenAI systems to improve continuously, without fine‑tuning, while remaining flexible enough to adopt new models as they emerge.

You’ll see how feedback from real usage can be captured, measured, and reintegrated safely into production systems. We’ll demonstrate how observability, evaluation, and automated optimization work together to turn GenAI from a static capability into a learning system.

We’ll explore:

- Automated Prompt Optimization: enabling systems to evolve their own instructions using Genetic‑Pareto (GEPA) techniques based on measurable feedback
- Observability‑Driven Learning: detecting failure patterns and routing targeted corrections back into the system
- Trust & Auditability: fitting learning loops into existing governance, compliance, and risk frameworks rather than fighting them

If your GenAI initiative is stuck in pilot, or producing inconsistent or stagnant results, this session shows the missing half: the learning loop that makes improvement routine instead of exceptional.

### Other sessions in the AI Demo track

- M365 Copilot Rollout: Driving Adoption and Impact at Pella (3:10 PM–3:55 PM)
- From Chatbot to Builder: Turning AI Into a Daily Collaborator Inside Real Projects (10:20 AM–11:05 AM)
- Stop Automating Broken Processes: How to Redesign Your Business Operations for the Age of AI Agents (11:15 AM–12:00 PM)
- Building Enterprise-Scale RAG Chatbots Using Azure AI Foundry (1:20 PM–2:05 PM)

### Suggested prompts for this session

- "What questions should I prepare to ask the speaker(s) at this session?"
- "Create a structured note-taking template for this session focused on actionable takeaways"
- "Based on this session description, what background reading should I do to get the most value?"
- "After I attend, help me create an action plan for implementing what I learned"
- "How does this session connect to the other sessions in the AI Demo track?"



Key Takeaways

Understand the Learning Gap: Why MIT identified learning as the core barrier to scaling GenAI, and what enterprises can do about it
The Learning‑Loop Pattern: Hands‑on exposure to GEPA techniques that work across LLM providers
Self‑Improving Demo: See a small GenAI system measurably improve from user feedback during use, with no fine‑tuning required

Continue the conversation with Ben McHone at the Production & Operations Facilitated Discussion — 2:15 PM - 3:00 PM, Room 220-230-240

Continue the conversation with Matt Vincent at the Production & Operations Facilitated Discussion — 2:15 PM - 3:00 PM, Room 220-230-240


Transcript from Summit:

00:00 Session Introduction and Speaker Credentials Slide: 1

source allies generative ai learning gap feedback loops observability

Thank you for being here this afternoon. My name is Jake Behrens. I'm a member of CIRAS, so I have the honor of moderating this room this afternoon. I'm going to go ahead and do our introductions here, and we'll get rolling with this presentation. So good afternoon. It's my pleasure to introduce Matt Vincent, founder of Source Allies, and Ben McHone, staff engineering consultant at Source Allies. Matt leads the consultancy focused on data and AI with multiple generative AI systems in production, delivering measurable business results. Ben specializes in deploying advanced AI systems with a strong emphasis on reliability, evaluation, and building trust in real-world applications. Together they work at the forefront of helping organizations move generative AI from experimentation into scalable production-ready systems. In today's session, they're going to explore how to close the learning gap in generative AI, demonstrating how systems can continuously improve through feedback, observability, and optimization without relying on fine-tuning.

01:02 Stanford Research and Optimize Anything Library Slide: 2

stanford research optimize anything it consultancy des moines generative ai systems

So please join me in welcoming Matt Vincent and Ben McHone. Thank you. Appreciate it. So what we're talking about today is something that Ben and I were really excited about when we first saw this paper come out from Stanford. And then, while we were developing the talk, a super simple library came out that allows you to do what we're talking about. It's really exciting stuff. We're happy that you're here. We're happy that you weren't scared away by fine-tuning, because it is kind of a technical topic, and what we're talking about is more approachable than that. Source Allies, where Ben and I are from, is an IT consultancy based in Des Moines, and we specialize in data and AI. We've been around for almost 25 years now, and we are proud to be builders who teach. We do a lot of building, a lot of building of GenAI systems, and we've run into this roadblock that we're going to talk about today, and how to get past it.

02:08 Speaker Backgrounds and Open Source Contributions Slide: 2

open source multi-agent systems production deployment neural networks ai practice

And then as we're working, we really love teaching and learning, leveling up together with whoever we're working with. Back in 2002, I founded Source Allies, and I was one of several teammates who started our data and AI practice almost eight years ago now. And way back when I was in college, my dad led a neural networks research group, and I was what the keynote speaker today called one of the doubters. I did not believe that anything he was doing was actually possible, but here I am today. Ben? Yeah, my name is Ben McHone. I am an open source contributor and a staff engineering consultant with Source Allies. I've actually contributed to a lot of the tools that we'll show you today. I have taken multiple clients from idea to production with scalable single and multi-agent systems over the years, all of the buzzwords.

03:09 Recognition in AI Framework Communities Slide: 2

langchain dspy ai frameworks mit report ai applications

I've seen that progression over the last three or four years that we've been doing AI. I'm really excited to be giving this talk to everyone today. Back to you, Matt. Thanks, Ben. I was actually in San Francisco with Ben at a conference where he was speaking, and the leaders of these open source AI framework groups, like LangChain and DSPy (if you do AI development, you're familiar with these), were all super excited to see Ben. It was really weird: we show up from Iowa and it's like, Ben, thank you for helping us. So there's some credibility up here. So what we're talking about today is a report that came out from MIT late last year on the state of AI in business, and it talked about this big problem that everybody's running into, this big roadblock, which is that the AI applications, the generative AI applications that we're building, aren't learning.

04:12 Capturing Unspoken Organizational Knowledge Slide: 2

organizational knowledge institutional memory talent retention unspoken principles knowledge capture

You interact with it, you tell it, no, that's a little bit wrong, this is how we work. Or you want to guide it a little bit more before you expose it to a larger audience. And it's just not learning from how you actually work within your organization. Maybe it's because the things that you want to capture really aren't written down anywhere, but they come out through the interaction of a large group working together with the AI. This is what captures that. And it's actually what I'm hearing a lot of people being concerned about at the conference. You have people who have been at the business for a long time and have a lot of knowledge, and then people who are new to the business and don't have that knowledge, and what's going to happen when all of that talent eventually does retire? There's a lot of promise in this approach as a way to capture the knowledge and strategies and principles that are really unspoken.

05:16 Trust as the Primary Barrier to AI Adoption Slide: 2

trust ai adoption project failure capability organizational readiness

So we're going to talk about that. We're going to talk about the normal go-to: when you're using large language models and they don't fit for you, the normal go-to is to do some fine-tuning. There are some pitfalls to that, and we're going to talk about what to do instead. We welcome questions throughout, so just feel free to raise your hand. And we'll also be around for questions afterwards. So most GenAI projects fail due to lack of trust, not capability. JC was talking about how there's average intelligence, there's great human intelligence, and then there's the best AI intelligence. So we don't need more intelligent models to do a lot of the use cases that we're trying to do right now. But there are some impediments. One of them is trust. They say AI moves at the speed of trust. Does anybody want to offer what is being done within your organization to start to build trust with AI and actually have it be something that is more usable?

06:29 Audience Examples of Building AI Trust Slide: 2

read-only access use cases human oversight database security verification

Yes. Great, yep. Sharing what's worked, having that be kind of a teaching tool so people know that there are some successes out there, and where it doesn't fit. Anybody else? Yes. We have a lot of read-only access for AI, so that if they go to query the ERP or any database, the user can see what to do with that data, interpret it, and create visuals for that data without risking any harm. Yeah, I like that. People love that a lot. Read-only access that prevents the AI from going off the rails and deleting databases, but still seeing what the capabilities are while you're building trust. Awesome. How about one last one?

07:28 LLM Evaluations and Expert Comparison Slide: 50

llm evaluations expert benchmarks performance measurement evaluation metrics quality assessment

Yes. So I guess we're more in the cautionary stage where it's a lot of things that recommend it's not. You can't distinct the answer. You have to know yourself and then trust it more as it gets better. Great. We're fine tuning that actually as we go. Awesome. So not necessarily trusting the output, and building up. Using it as a tool, but you can't just trust it. No blind trust. Yes. Great. There's one other thing that is coming out of industry, and it's this idea that we need some way to leverage the experts in our company and figure out how the AI measures up in comparison to how an expert would do a particular task. And that measurement capability is actually the foundation upon which what we're talking about today is built.

08:37 Industry Emphasis on Evaluation Frameworks Slide: 7

evaluation frameworks pilots measurement mit report model quality

So measuring how the AI is performing: in industry terms, it's called LLM evaluations. Here are some industry quotes that say, really, if you're not investing in evals, you're not shipping, you're guessing. A lot of pilots now are coming out of the gate with this. Pilots, when you're just trying to prove something out, are coming with some sort of measurement ability. And we'll talk more about that. So that's sort of the foundation. Then the MIT State of AI in Business report says, here's the big problem, the big barrier that everybody is running into. And again, it's not model quality, it's that these AI systems that we're interacting with aren't learning. So if you have a team or group or division or department using AI, and we're working with it and teaching it, like, here's how we do some things, oh, this isn't really documented, but this is how we like to work, none of that is captured.

09:49 The Groundhog Day Problem in Enterprise AI Slide: 8

iterative learning feedback loops enterprise ai team collaboration undocumented knowledge

So it literally is Groundhog Day all over again, every day. The AI is just going to keep giving you the same answer. And that's the big problem. So what does big tech do to solve for that? You have feedback loops. You have the thumbs up, thumbs down. We've all seen these. Thumbs down, then you can give feedback. But who is that helping? It's not helping your team or your company. It's helping big tech make a better model. It's improving things for everybody, but not necessarily for your particular group. So MIT talks about how AI is really great for individual brainstorming but stalls out in enterprise settings because these systems lack iterative learning. And I actually heard a great quote at lunch: someone was saying, I'm not interested in AI that is just for an individual.

10:56 Moving from Individual to Team AI Usage Slide: 8

team ai enterprise settings individual use brainstorming scalability

I want to be thinking about AI that is for multiple people, that's for teams. And that's where a lot of companies are right now. They want to break out from the individual use. So MIT gave us this big problem, but they actually didn't give us an answer. They just said, it's a problem, and we think the solution is that you need some sort of learning loop, but they didn't tell us how. And that's what we're sharing today. Fine-tuning, that's the typical go-to. Fine-tuning requires thousands, sometimes hundreds of thousands, of examples. When you do fine-tuning, you are changing the weights in the model. These big, giant, smart models are actually nothing more than a CSV file. So you're changing the numbers in the CSV. When you do that, you have higher standards when it comes to governance and how the model is governed within your organization.

11:59 Fine-Tuning Limitations and Costs Slide: 8

fine-tuning model weights governance ml ops anthropic

It's really difficult to fine-tune the models that we like using. We like using Anthropic and OpenAI, but fine-tuning is something that typically happens with open-weight models. And then finally, when you go to fine-tuning, which really changes the behavior of the model so it can adapt to your domain, it's a whole other tier of expenses. You're basically getting into MLOps and AIOps. It's a lot of expense to do that instead of paying pennies per interaction. Okay, so those are some of the downsides of fine-tuning. Another interesting thing as we get into the solution is that LLMs are very good at doing what you tell them, and a lot of the failures that we encounter come from not being specific enough, not removing ambiguity.

13:05 Prompt Failures as Knowledge Gaps Slide: 8

prompt engineering knowledge gaps ambiguity specificity system prompts

So most prompt failures are actually knowledge gaps, where there are some principles or strategies that we just didn't know how to express. And then the AI gives us a result, and we're like, well, that's kind of stupid. We all know what a prompt is, of course. Probably a lot of people know that there's also something called a system prompt or a system message. It's like a prompt that we never see, behind the scenes, that says: be kind to the person you're working with, here are some ethical guidelines, and here are some rules to follow. The key point, though, is that prompt changes, changing that wording just a little bit, can make massive differences in how the model performs.

14:07 Mathematical Approach to Prompt Engineering Slide: 28

prompt engineering mathematical optimization scientific method team copilot systematic improvement

And this used to be prompt engineering. People made fun of it. Prompt engineering was going to be a $200,000 career. But what we're showing today is actually a mathematical, scientific way to change the words that are in every single application that we're using. Even if you're just using a team copilot, there's a way to give it a prompt that helps it work with your team better and recognize how your business is working. It's a way to mathematically, scientifically, methodically make these small prompt changes to get better outcomes for you. This is a guy who talks a lot about AI on the internet, and he says LLMs just don't come with instructions in the box. So that's kind of the thing.

15:06 LLMs Lack Built-In Instructions Slide: 13

llm instructions agents skills tools instruction optimization

We're all figuring out what the right instructions are. There was a talk early this morning that was sharing what a model is and what agents are and what skills are and what tools are. So there are all of these places where you give instructions to the model. And again, we're figuring that out together. So here are some of the go-tos that we have. Fine-tuning: okay, it's expensive, and that's the downside. RAG: that's where you pull in all of your documents. Maybe you have existing SOPs. That works great. We love RAG. But it doesn't help you bring out the reasoning and the strategy that your team is using when you're doing work within your organization. Memory is really good. But we don't want memorization of just things that have been done in the past.

16:10 Limitations of Current AI Improvement Methods Slide: 15

fine-tuning rag retrieval augmented generation memory systems skills

We want to extract lessons from that. So all of these things that we want just didn't exist until this method that we're going to show you. If anybody's working on skills, a skill is just a text file that helps you guide your AI to do a particular thing. And that actually isn't scaling very well. A lot of people are trying to figure out how you scale that beyond just one person. How do you scale that to multiple people, a team? So those are the problems that we're trying to answer. That's what we are going to answer today. And this is the kind of "what if we could?" What if we had a way to do all of this continuous improvement and have it be fully auditable? Because instead of changing model weights, these numbers in a giant CSV file, we're changing the instructions. And you can see, oh, the model is performing better because of these words.

17:10 Text-Based Self-Improvement System Introduction Slide: 15

optimize anything stanford auditable instructions text optimization

I can get that. An auditor can look at that and say, I understand what that's doing. There are all of these great benefits. And it just so happens to get you similar, sometimes even better, performance gains compared to fine-tuning, the really expensive thing. I'm not going to hold you in suspense further. This is the library that came out that really does this magic thing, that codifies what the author at Stanford released. It allows you to take text, take some feedback on why that text was good or bad, and make better text. It works on text. So it's super simple stuff. It can actually be done with a spreadsheet. Take some really bad interactions with your AI, talk to some subject matter experts, and ask, why was it bad?

18:11 Extracting Principles from Feedback Slide: 15

principle extraction expert feedback text optimization generalization strategy extraction

What would an expert have said? And then run it through this. And what's interesting is you're not getting the hard-coded answers embedded into this text; you're getting higher-level principles and strategies extracted for you that can apply to your entire domain, including examples not yet seen. Okay, so we're getting to an example. We're going to do a chat thing. Yeah, AI is not all about chat, but it's an example that we all recognize. So we ask a question, we get an answer, we give a thumbs up or thumbs down. Someone's asking about photosynthesis, and we say it was a thumbs down because you didn't talk about light absorption details. So to be more precise, photosynthesis is a two-step blah, blah, blah. So we have a feedback loop where the humans, or maybe even smarter, super expensive models, are saying, here's a better answer.
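As a rough illustration of the thumbs-down record described above, here is a minimal Python sketch that stores each piece of feedback as a row and writes it out as CSV, so it could live in a spreadsheet. The class name, field names, and helper functions are all hypothetical; nothing here comes from the speakers' code or from any library shown in the talk.

```python
import csv
import io
from dataclasses import asdict, dataclass, fields


@dataclass
class FeedbackRecord:
    """One row of feedback. Field names are illustrative assumptions."""
    question: str
    answer: str
    thumbs_up: bool
    reason: str         # why the reviewer rated the answer this way
    better_answer: str  # what an expert would have said instead


def to_csv(records: list[FeedbackRecord]) -> str:
    """Serialize records as CSV text so they can be kept in a plain spreadsheet."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=[f.name for f in fields(FeedbackRecord)])
    writer.writeheader()
    for record in records:
        writer.writerow(asdict(record))
    return buffer.getvalue()


def thumbs_down(records: list[FeedbackRecord]) -> list[FeedbackRecord]:
    """Keep only the failures, which are the rows worth learning from."""
    return [r for r in records if not r.thumbs_up]


records = [
    FeedbackRecord(
        question="How does photosynthesis work?",
        answer="Plants turn sunlight into food.",
        thumbs_up=False,
        reason="Did not talk about light absorption details.",
        better_answer="Photosynthesis is a two-step process...",
    ),
]
print(to_csv(thumbs_down(records)))
```

The design choice is simply to keep the reason and the better answer next to the rating, since the rating alone says nothing about what to change.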

19:16 Michelin Star Chef Example Setup Slide: 15

michelin star food domain expert comparison omelet example evaluation metrics

We're applying that to food. We have lots of insurance and retail and energy and defense and med tech examples, but we're applying it to food. There's a rating system called Michelin stars. If you're traveling, these are like, hey, here are some useful places to go for an interesting experience. So it's like fancy food. But that's what we want to do. We have an example. It's running in code. You can stop by our booth and we'll show you the code if you want to see the details of it. But we're saying, OK, how would you make an omelet? And then how would a Michelin fancy chef make an omelet? So it's that difference between how anybody would do it and how an expert would do it.

20:16 Iterative Optimization Process Slide: 15

iterative optimization evaluation cycles system prompt reinforcement learning performance gains

And what are we doing behind the scenes? We're basically asking Michelin-trained chefs, what is the really great answer for how you make an omelet? What are all the things that you need to consider, temperature and seasoning and all the things? And we use AI to say, okay, for the answer that was given (this is a simplified example), how many of the bullet points or statements made by the expert chef were represented in the answer that the AI gave you? Maybe one out of five. So then we tell the AI why it was wrong, and we do that for 30, 40, 50 cycles, run it through the system, and give it a new system prompt. And you get the same kind of performance gains that you get from the really super expensive reinforcement learning.
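The scoring step described above can be sketched in a few lines of Python. This is a simplified stand-in: the talk uses an AI judge to decide whether each expert point is represented, while the sketch below substitutes plain keyword matching so it runs without a model. The function name, the omelet checklist, and its keywords are all hypothetical examples.

```python
def coverage_feedback(answer: str, expert_points: dict[str, list[str]]) -> tuple[float, str]:
    """Score an answer by how many expert points it covers, and explain the gaps.

    expert_points maps each point's label to keywords that signal it was covered.
    Keyword matching is an assumption standing in for the AI judge in the talk.
    """
    text = answer.lower()
    covered = [
        label for label, keywords in expert_points.items()
        if any(keyword.lower() in text for keyword in keywords)
    ]
    missed = [label for label in expert_points if label not in covered]
    score = len(covered) / len(expert_points) if expert_points else 0.0
    feedback = (
        f"Covered: {', '.join(covered) or 'nothing'}. "
        f"Missed: {', '.join(missed) or 'nothing'}."
    )
    return score, feedback


# Hypothetical checklist an expert chef might give for an omelet.
omelet_points = {
    "pan temperature": ["heat", "temperature"],
    "seasoning": ["salt", "season"],
    "fat choice": ["butter", "oil"],
    "egg preparation": ["whisk", "beat"],
    "doneness": ["soft", "set", "curd"],
}

score, feedback = coverage_feedback("Crack two eggs into a pan and add salt.", omelet_points)
print(score)     # 0.2, one point out of five
print(feedback)
```

Returning the text alongside the number matters here: the number says how far off the answer was, and the text is what gets handed back to the AI as the reason it was wrong.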

21:22 Poor Question Quality and Generic Responses Slides: 46, 49

question quality context roast cooking generic answers expert feedback

Okay, that was a lot of detail. Food. We're talking about food. I'm going to hand it over to Ben. No, thank you. So as Matt had alluded to, we are talking food. And one of the challenges that we face in organizations is that sometimes we don't know the question that we should be asking. So in this example, the question isn't really great. It's, how do I make a roast? That's leaving out a lot of variables. Are there any home chefs in the room? So you may say this is a bad question because I don't know how big the roast is. I don't know what type of meat it is. I don't know what technique you're using. All of these lead to a very generic answer. Hey, let's flip every 30 to 45 minutes. Not super helpful. And so we give a thumbs down, and we explain that flipping every 30 to 45 minutes is just actively bad advice. We need to provide better context. And 20 to 30 minutes per pound, well, that's a blunt instrument.

22:24 Learning Scientific Principles Over Memorization Slides: 46, 50, 39

maillard reaction collagen cooking oils scientific principles principle learning

We don't actually know without more of those details. So we give that feedback, and we let our subject matter experts actually provide what that ideal answer is. We start with something very simple. That answer came from something like this in the prompt: hey, answer questions accurately, use any context that you have. But it doesn't even know that it's supposed to be acting like a Michelin star chef. When we are done with the process, this goes on and on. If you want to read the full ending prompt, feel free to scan that. But the interesting thing, as Matt had alluded to, is that nowhere in the final result does the word roast even appear. Instead, we're talking about the scientific qualities of the answer. We're talking about the Maillard reaction, the browning that you get on your meat when you're cooking it. It's talking about collagen structures, and what oils to use in what cases.

23:25 Role-Playing Prompts No Longer Effective Slide: 46

role-playing prompts model evolution prompt effectiveness system prompts prompt engineering

Said simply, it learned the principles, not the answers to the questions that we are asking. And we can see a very strong result. On the left here, how do you make a roast? This is the original. Same bad answer. But with that updated prompt, it starts out by telling you that you need to choose the right cut, asking you immediately: beef, pork, or lamb? It tells you how to prepare the meat and how to season it. Did you have something else? I was just going to add one thing. By the way, it's been proven that if you say act like a super experienced data scientist or act like a super experienced Michelin star chef, that doesn't work. That kind of prompting does not work. It did work a few years ago, but the models have grown since then. And why is this a problem? Why can't we just have somebody on our team sit down and write it? Well, JC had mentioned in the keynote that it is somebody's job at Anthropic to write the system prompt or the soul document.

24:29 Anthropic System Prompt Complexity Slide: 27

anthropic system prompt gepa prompt engineering claude

These documents have been leaked. Anthropic does also release these themselves after some delay. They're about 24,000 to 25,000 words and many, many lines long. I have never been part of a team that can actually justify spending that much time on one document every single cycle. So that is where GEPA, the parent library to Optimize Anything that Matt had introduced, comes in. This is the result of a research paper out of Stanford saying that reflective prompt evolution can outperform reinforcement learning. Essentially, give the AI a signal, and it's a better prompt engineer than you or I. And what it does is help us extract the principles, practices, strategies, and techniques, specifically not rote memorization of what the answer should have been. This is important because, yeah, if we just say, when asked how to cook a roast, respond with this, it will always be correct.

25:33 Automated Feedback Loop with ChatGPT Slides: 48, 29

chatgpt feedback loop prompt improvement manual process subject matter experts

But it does not generalize to every other recipe that I may want to attempt. So how might we leverage this under the hood? How does optimize anything really work? Well, it's doing something very similar to this. I went into ChatGPT and typed this question. It was very helpful and gave me the right formatting on the output. But essentially, we need to collect those weird examples. Where does it fail? This is that thumbs down, by the way, from our subject matter experts. And then we request improvement. We go in and we tell our AI: I used this system prompt, with this input, the user's question, and it gave me this weird answer. And it was weird for these reasons. And then it gives us a new prompt, and we can try again with our subject matter experts. Now, this doesn't scale very well, but it's a really good starting point. Now, what do we do if that doesn't scale? We can use programming, development, to automate this process.
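The manual loop described above (system prompt, user input, weird answer, and the reasons it was weird) can be sketched as a small helper that assembles the improvement request. This is a minimal illustration only: the class name, function name, and wording of the request are assumptions made for this example and are not taken from the demo or from any library.

```python
from dataclasses import dataclass


@dataclass
class FailedInteraction:
    """One thumbs-down example collected from a subject matter expert (hypothetical shape)."""
    user_input: str
    model_answer: str
    expert_feedback: str


def build_improvement_request(system_prompt: str, failures: list[FailedInteraction]) -> str:
    """Assemble the text we would hand to a chat model to ask for a better system prompt."""
    parts = [
        "I used this system prompt:",
        system_prompt,
        "",
        "It produced answers that our experts rejected:",
    ]
    for i, failure in enumerate(failures, start=1):
        parts += [
            f"Example {i}",
            f"User input: {failure.user_input}",
            f"Answer: {failure.model_answer}",
            f"Why it was rejected: {failure.expert_feedback}",
            "",
        ]
    # Ask for principles, not memorized answers, as the talk emphasizes.
    parts.append(
        "Rewrite the system prompt so it captures the general principles behind "
        "this feedback. Do not hard-code answers to these specific questions."
    )
    return "\n".join(parts)


if __name__ == "__main__":
    example = FailedInteraction(
        user_input="How do I make a roast?",
        model_answer="Flip it every 30 to 45 minutes.",
        expert_feedback="Too generic; it never asks which cut of meat is being used.",
    )
    print(build_improvement_request("You are a helpful agent.", [example]))
```

The returned string is what a person would paste into a chat assistant by hand; automating the loop means sending it to a model programmatically and re-testing the prompt that comes back.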

26:38 GEPA vs Traditional Gradient Descent Slide: 41

gepa gradient descent natural language feedback machine learning optimization

We can use that feedback and let the system actually write its own feedback and say it's correct for these reasons. It included the right oil, it included the information about the Maillard reaction, but it missed information about the internal temperature. The difference between a traditional optimization in machine learning, also known as gradient descent, and GEPA is that the only signal a traditional system will get is that number, 0 to 1. Not a whole lot of understanding of where we went right and where we went wrong. So that natural language feedback is invaluable. Let's take a look at a little bit of a visual here. In machine learning, large language models, this is actually what the inside of the brain kind of looks like conceptually. We have all of these hills. These are all of the expert topics. And that red ball there, that is what a traditional process will do.

27:39 Hill-Climbing Visualization of GEPA Advantage Slide: 41

hill climbing optimization local maxima solution space human feedback

It starts at some point in the map, and it starts trying to climb the hills around it, getting to the highest point on that plane. The problem is it climbs to not quite the highest hill, but to something that is, oh, middle of the road. But because of the way that technology works, it gets stuck. Meanwhile, GEPA, the blue ball, or the green ball, sorry, is actually jumping around because it's given human feedback, and it's able to see, oh, I need to jump over here, and eventually it found that peak much faster and with hundreds, not thousands, of examples. You can actually run this with as few as 10 examples, but in our example, I think we had 200 question and answer pairs from experts. So the question that I have for everyone is this: we're not optimizing prompts. We're optimizing text. Where else might we see text in our AI applications? Any examples?

28:44 Text Optimization Beyond Prompts Slide: 32

system prompts skills mcp servers tool routing llm as judge

Earlier we had talked about skills. There's also MCP servers, tool routing logic. All of these are possible. The two that this demo focuses on are the system prompts and what is called an LLM as a judge. This is what powers the evaluations that Matt had talked about. This is a stand-in for our users so that we can test tens to hundreds of different prompts along the way without driving our subject matter experts up a wall. So we are almost through all of the math-heavy material, I promise. But this was too cool not to show, so I wanted to bring this to light. This is why GEPA is so effective compared to traditional methods. That 0 circle at the top, that is where we start. That's the initial "how do I make a roast" question. It didn't do very well.

29:40 GEPA's Exploration Strategy Visualization Slide: 32

exploration strategy prompt evolution graph visualization optimization path descendant nodes

And it tried five different prompts to improve. And number five was the winner of that generation. In a traditional world, we would have thrown out one through four. But GEPA allows us to continuously explore that space, eventually landing all the way over here on the left on child 12, which was a descendant of one, which performed quite a bit worse than number five. It allows us to more effectively explore the expertise of our language models. And a really cool thing about this optimize anything is that the library now spits out a graph like this, and you can hover over each one of those nodes and see how the prompt has evolved. So hover over node 0, and it's "you're a helpful agent, try not to be rude." And then hover over node 12, and it has all the stuff in there about how you need to pay attention to flavor and how you lock it in and the chemistry of cooking.
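The idea of keeping prompts one through four alive instead of discarding them can be illustrated with a toy selection rule: keep every candidate that is the best on at least one evaluation example, rather than only the candidate with the best average. This is a simplified sketch of Pareto-style selection under assumed data; the function names, prompt names, and scores are hypothetical and do not reproduce the library's internals.

```python
def greedy_winner(scores: dict[str, list[float]]) -> str:
    """Traditional approach: keep only the candidate with the highest average score."""
    return max(scores, key=lambda name: sum(scores[name]) / len(scores[name]))


def pareto_candidates(scores: dict[str, list[float]]) -> set[str]:
    """Keep every candidate that ties for best on at least one evaluation example."""
    n_examples = len(next(iter(scores.values())))
    keep: set[str] = set()
    for i in range(n_examples):
        best = max(per_example[i] for per_example in scores.values())
        keep |= {name for name, per_example in scores.items() if per_example[i] == best}
    return keep


# Hypothetical per-example scores for three candidate prompts on three questions.
EXAMPLE_SCORES = {
    "prompt_1": [0.2, 0.9, 0.3],  # weak on average, but the best on question 2
    "prompt_2": [0.4, 0.4, 0.4],  # never the best anywhere
    "prompt_5": [0.8, 0.5, 0.7],  # best on average
}

if __name__ == "__main__":
    print("greedy keeps:", greedy_winner(EXAMPLE_SCORES))
    print("pareto keeps:", sorted(pareto_candidates(EXAMPLE_SCORES)))
```

In this toy data the greedy rule keeps only `prompt_5`, while the Pareto-style rule also keeps `prompt_1`, so a later descendant of `prompt_1` still has a chance to become the overall winner.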

30:40 Auditable Performance on Efficient Frontier Slides: 34, 35

efficient frontier evaluation categories subject matter experts auditable performance alignment

Yeah, that's a great call out, Matt. So really, the difference, said simply, is that the old way is: this version feels better, we got a higher number. The new one says we know that it performs in these categories very well for our evaluations. We no longer are asking, is it good enough? We can now constantly say this prompt sits on the, how do we say it, the efficient frontier of our evaluations. Said simply, it aligns with our subject matter experts. So what actually changes when we do this? We're changing the instructions and logic. Matt had mentioned this is an auditor's dream. No longer are we looking at why is this a 0.5 instead of a 0.4? We're looking at: make sure to remember the Maillard reaction. Know to use avocado oil in these scenarios, olive oil in these scenarios, and sunflower oil here. We aren't locked into specific models.

31:42 Advantages of Instruction-Based Optimization Slides: 37, 38

instructions model flexibility claude gpt open weight models

We can continue to use the Claudes and the GPTs of the world, but we still have the flexibility to use those open weight models if we choose. And because it's just instructions, it means that undoing these changes takes minutes, not days or weeks. Before, our code looked something like this, and on average it was getting about 67% of the answer correct. And if we looked for strict accuracy, meaning it hit every bullet point that an expert cared about, we only got 35% of the questions correct. So afterwards, we saw task-specific information. It very clearly described the inputs. It described what output it's expecting. Lean into food science. It added the domain-specific knowledge, the food chemistry. And most importantly, it defines strategies, understanding the difference between different cooking methods and the trade-offs.
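The two metrics mentioned here, average score and strict accuracy, differ only in how per-question results are aggregated. A minimal sketch follows, assuming each question already has a coverage value between 0 and 1 (the fraction of expert bullet points the answer hit); the function names and sample numbers are illustrative, not the demo's data.

```python
def average_score(coverages: list[float]) -> float:
    """Mean fraction of expert bullet points covered across all questions."""
    return sum(coverages) / len(coverages)


def strict_accuracy(coverages: list[float]) -> float:
    """Fraction of questions where the answer hit every expert bullet point."""
    return sum(1 for c in coverages if c >= 1.0) / len(coverages)


if __name__ == "__main__":
    # Hypothetical coverage for four questions: two perfect answers, two partial ones.
    coverages = [1.0, 0.5, 0.75, 1.0]
    print("average score:", average_score(coverages))
    print("strict accuracy:", strict_accuracy(coverages))
```

Strict accuracy is always less than or equal to the average score, which is consistent with the gap between the two figures quoted in the talk.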

32:45 9.7% Performance Improvement Results Slide: 41

performance improvement task-specific food science food chemistry strict accuracy

And the results speak for themselves. We didn't change the data available to the system. We only changed the instructions. And we ended up with a 9.7% improvement in the average score and an 8.8% improvement in strict accuracy. Again, this is just from listening to our subject matter experts. So, a couple of other examples from industry, at scale. This one is from Shopify. Shopify runs one of the largest e-commerce platforms in the world, probably second only to Amazon. And they were running a very expensive system, analyzing every storefront. They were spending millions of dollars a year, and they covered 13% of stores. Not great. They used GEPA to actually train a smaller model to be more effective, and now they can cover 100% of shops, seventy-five times cheaper.

33:45 Shopify's 75x Cost Reduction Case Study Slide: 41

shopify e-commerce cost reduction scale coverage

They got over five times the coverage, and they spent seventy-five times less. Another example is from Dropbox. If you don't know, when you search for a file in Dropbox, your search and the files are actually going to AI and saying, hey, does this file and description match this search term? And again, Dropbox used GEPA to move to a smaller model and cut the time it takes to adapt to changing needs from their users from weeks to days, all just from listening to feedback. This one is a bit more local. I'm currently working with a Fortune 500 client, and they're having AI write queries against their data lake. Think Databricks or Power BI. When we started out, the AI knew nothing about that environment, and it was only scoring 58%. In under an hour and with less than $5 worth of AI usage, we got all the way up to 89%.

34:50 Dropbox Search and Fortune 500 SQL Examples Slides: 44, 47

dropbox file search sql queries data lake databricks

Again, all just leveraging existing knowledge from the team, saying yes or no. So again, let's go ahead through what changed. We went domain-specific rather than general. We kept everything task-specific. We learned strategies and prescribed what output we actually cared about. All auditable. Our data governance folks love it. So let's flip the script. Instead of giving the thumbs up and thumbs down information to ChatGPT, to Claude, to Copilot, let's bring that back internally and improve our own products, creating the competitive advantage instead of just rising with the tide. I'll leave you all with an architecture overview. This outlines what we've been talking about today. That chat has a thumbs up and thumbs down. Our users can provide feedback. And then all of that goes into this optimization pipeline. We store that in a tool, an open source tool called Phoenix, so that the data never leaves our customer's environment.

35:55 Internal Feedback Loop Architecture Slide: 49

feedback architecture phoenix user judge competitive advantage internal optimization

And then that is used to train up a user judge. And the user judge, along with the user feedback, allows us to optimize and say, yes or no, we are actually improving. This allows us to run on a weekly or monthly basis, depending on the amount of feedback we've gotten, and continuously improve with how the team uses the tool. I'll pass it back over to Matt to close this out, and then we'll be ready for some questions. So yeah, this is second to last slide. This is if you want to use this tomorrow. Here's what you can do. Yes, you need access to a developer who will pull down that optimize anything library. But all you need to give them is a spreadsheet. So a spreadsheet is 10 to 30 interactions with an AI system. And then you sit that down in front of the expert and you say, where did this go wrong? And they write down, okay, here's where they went.

36:57 Practical Implementation with Spreadsheets Slide: 50

implementation spreadsheet optimize anything expert feedback practical guide

This was, like, completely off. This was actually a good part of the answer. But then you get the expert feedback in there. You feed it to optimize anything. You spit out the text. That's all it is: text. It doesn't matter how you're building your GenAI app. And there is some place, no matter what you're using, from Office Copilot to custom GenAI, where you can drop in this text and get big improvements, again, without the expense of fine-tuning. So we're obviously super excited about it. Hopefully some of that enthusiasm rubs off. And let's see, we have one other thing: there are resources in QR codes at our booth in the sponsor area, and Ben and I will be over there to take any deep dives for anybody who wants to dig into code or more specifics.
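The spreadsheet of 10 to 30 interactions described above could look something like the following once exported to CSV. This is only a sketch of one possible layout: the column names (`question`, `ai_answer`, `thumbs`, `expert_feedback`) and the sample rows are assumptions for illustration, and the step of handing the rows to the optimizer is deliberately left out because the talk does not show that call.

```python
import csv
import io

# Hypothetical export of the expert-review spreadsheet.
SAMPLE_CSV = """question,ai_answer,thumbs,expert_feedback
How do I make a roast?,Flip it every 30 to 45 minutes.,down,Flipping that often is bad advice; ask which cut of meat first.
How do I make an omelet?,"Use low heat, butter, and stir constantly.",up,Good coverage of heat control.
"""


def load_feedback(csv_text: str) -> list[dict[str, str]]:
    """Parse the spreadsheet and keep only rows where the expert wrote feedback."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    return [row for row in rows if row["expert_feedback"].strip()]


if __name__ == "__main__":
    for row in load_feedback(SAMPLE_CSV):
        print(row["thumbs"], "|", row["question"], "|", row["expert_feedback"])
```

Keeping the expert's free-text reason next to each interaction is the important part, since the natural language feedback is the signal the optimization relies on.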

38:06 Comparison to Andrej Karpathy's Auto-Research Slide: 50

andrej karpathy auto-research gepa model training text optimization

But we'd love to hear what questions you have or where we can provide some clarity. Yes, Adam. I'm just curious to understand how this compares to the Andrej Karpathy approach to self-improvement and if that's been played into this model or this way of approaching improvement. The question, for the recording (sorry everyone, that mic does not go through the recording, so I'll be repeating): how does this compare to the Andrej Karpathy auto research that was unveiled, what was it, maybe a month ago? They're very similar in concept. GEPA is a year, year and a half old at this point. Auto research is really focused on the architecture of training models, while this is very much optimizing text. They can be one and the same, because again, the code to train a model is also text.

39:08 Phoenix for Data Residency and Observability Slide: 50

phoenix observability data residency open source traces

So I would say that both are feasible and show the same promise. Yeah, and one of the QR codes is an Andrej Karpathy post because all things lead back to Andrej Karpathy. Yes. Adam again. Why did you decide on Phoenix? And what's the significance behind the Phoenix portion of your database? The observability in AI is really important. So you need some ability to kind of log the traces or the interactions, the turns between the user and the AI. And that's basically what we're doing with that product, and it's one of the open source products that Ben supports and helps.

40:13 Phoenix Integration and Data Privacy Slide: 50

phoenix data privacy open source on-premises private cloud

But basically what we do from there is we kind of click through a bunch of things where we said, well, these are really great examples or these are really bad examples. And we do the classic data science thing. We turn it into a data set. We carve off and tag 80% of it for training data. We hold out 20% for validation data. And then we point optimize anything at that data set and have it give a new prompt. And we set the new prompt in Phoenix. Phoenix also stores prompts. And then our app just automatically pulls in the new prompt text. That was one way of explaining it. Yeah, I'll go ahead and echo what you said, Matt. Absolutely, it all holds true. A little bit of a different reason why we chose Phoenix is that it is open source. We don't have to worry about the data residency problem. If we have clients that are all on-prem or all in their own cloud, it makes it really easy for us to adhere to those desires of keeping everything private, not sending off our very valuable LLM interactions and really company data to a third party.

41:20 Advice for Engineering Students Slide: 50

engineering education dmacc career advice learning to learn tool agnostic

We're able to stay in control of that. And then on top of that, the feature set of Phoenix just all meshes very, very well with a system like Optimize Anything. Anyone other than Adam? He is. I guess I could ask what I've been asking pretty much everybody, but I teach engineering transfer courses for students that are going on to a four-year program from DMACC. And so most of what I'm interested in here today, and he also teaches at DMACC, is this: what are the main tools that you would say are important that we make sure our students understand before they go out into the workforce?

42:22 Avoiding Tool Loyalty in Education Slide: 50

tool loyalty hadoop adaptability career paths data science

Two, three, four years from now, Daniel. Yeah, that's an interesting question, and I think it is unfortunately a little bit dependent on what path they choose to take. I would give different advice to somebody looking to become a data scientist, to somebody becoming a data engineer, different advice to a software engineer. So I think the general advice that I would give to anyone going from a two-year to a four-year degree, like DMACC to Iowa State, as an example, would be remain curious, learn to learn. Don't get too hung up on one specific tool set. Because we've seen with the age of AI, things change so frequently that if we spend too much time making sure that this one tool set is perfect, we run the risk of that being out of date by the time they're out of school. We saw this back in the early 2010s with Hadoop clusters.

43:25 Domain-Agnostic Learning Loop for AEC Workflows Slide: 50

aec industry construction revit api clash detection domain-specific ai

They were all the rage. Everyone had it. You need to go into Hadoop. And now I haven't worked with anyone that has a Hadoop cluster in a few years. So that is where I would be leaning: learn how to learn. Don't be loyal to any one tool. Understand that judgment point. Matt, do you have anything to add there? No, great answer. If I'm building a domain-specific AI tool for AEC workflows using, like, the Revit API, where would you start with the self-improvement loop: with the feedback on incorrect element detection, missed clashes, or something else? I'm sorry, I'm not familiar. What is an AEC environment there? Sorry, like the construction industry. Matt, do you have any thoughts there?

44:32 Closing Remarks on Domain Adaptability Slide: 50

domain-agnostic expert feedback continuous improvement principle extraction universal applicability

The incredible thing about this is that it really is the expert feedback and saying, okay, here's why a clash was missed. Here's what I know from my 30 years of experience. And doing that 30, 40, 100, 200 times, or just setting up this loop that just every two weeks, it just pulls in anybody who gave any feedback and updates the prompt to make it better. So it doesn't matter what the strategy is, you're extracting those out. So it really becomes a domain-agnostic way to make domain-specific AI.

So good afternoon. It's my pleasure to introduce Matt Vincent, founder of Source Allies, and Ben McHone, staff engineering consultant at Source Allies. So Matt leads the consultancy focused on data and AI with multiple generative AI systems in production, delivering measurable business results. So Ben specializes in deploying advanced AI systems with a strong emphasis on reliability, evaluation, and building trust in real-world applications.

So together they work at the forefront of helping organizations move generative AI from experimentation into scalable production-ready systems. So in today's session, they're going to explore how to close the learning gap in generative AI, demonstrating how systems can continuously improve through feedback, observability, and optimization without relying on fine-tuning. So please join me in welcoming Matt Vincent and Ben McHone. Thank you.

Appreciate it. So what we're talking about today is something that Ben and I were really excited about when we first saw this paper come out from Stanford. And then actually while we were developing the talk, a super simple library came out that allows you to do what we're talking about. It's really exciting stuff.

We're happy that you're here. We're happy that you weren't scared away by fine-tuning, because it is kind of a technical topic, and what we're talking about is more approachable than that. Source Allies, where Ben and I are from, is an IT consultancy based in Des Moines, and we specialize in data and AI. We've been around for almost 25 years now, and we are proud to be builders who teach.

We do a lot of building, a lot of building of GenAI systems, and we've run into this roadblock that we're going to talk about today and how to get past it. And then as we're working, we really love teaching and learning, leveling up together, whoever we're working with. Back in 2002, I founded Source Allies, and I was one of several teammates who started our data and AI practice almost eight years ago now.

And way back when I was in college, my dad led a neural networks research group, and I was what the keynote speaker today called one of the doubters. I did not believe that anything he was doing was actually possible, but here I am today. Ben? Yeah, my name is Ben McHone. I am an open source contributor and a staff consultant with Source Allies.

I've actually contributed to a lot of the tools that we'll show you today. I have taken multiple clients from idea concept to production with scalable single and multi-agent systems over the years, all of the buzzwords. I've seen that progression over the last three or four years that we've been doing AI.

I'm really excited to be giving this talk to everyone today. Back to you, Matt. Thanks, Ben. I was actually in San Francisco with Ben at a conference where he was speaking, along with the leaders of these open source AI framework groups. If you do AI development, you're familiar with these: LangChain, DSPy.

Everybody was super excited to see Ben. It was really weird, like we show up from Iowa and it's like, Ben, thank you for helping us. So there's some credibility up here. So what we're talking about today is a report that came out from MIT late last year talking about the state of AI in business, and it talked about this big problem that everybody's running into, this big roadblock: the AI applications, the generative AI applications that we're building, aren't learning.

Maybe it's because the things that you want to capture really aren't written down anywhere, but they come out through the interaction of a large group working together with the AI. This is what captures that. And it's actually what I'm hearing a lot of people being concerned about at the conference. You have people who have been at the business for a long time and have a lot of knowledge, and then people who are new to the business and don't have that knowledge, and what's going to happen when all of that talent eventually does retire?

There's a lot of promise in this approach as a way to capture the knowledge and strategies and principles that are really unspoken. So we're going to talk about that. We're going to talk about the normal go-to. The normal go-to when you're using large language models and they don't fit for you is to do some fine-tuning.

There's some pitfalls to that. And we're going to talk about what to do instead. And we welcome questions throughout. So just feel free to raise your hand.

And we'll also be around for questions afterwards. So most GenAI projects fail due to lack of trust, not capability. JC was talking about how there's average intelligence, there's great human intelligence, and then there's the best AI intelligence.

So we don't need more intelligent models to do a lot of the use cases that we're trying to do right now. But there are some impediments. One of them is trust. They say AI moves at the speed of trust.

Does anybody want to offer what is being done within your organization to get past, to start to build trust with AI and actually have it be something that is more usable? Yes. Great, yep. Sharing what's worked, having that be kind of that teaching tool so people know that there are some successes out there and where it doesn't fit.

Anybody else? Yes. Do you have a lot of read-only access for AI so that if they go to create ERP or any database, the user can see what to do with that data, interpret it, create visuals for that data without risking any harm? Yeah, I like that.

People love that a lot. Read-only access that prevents the AI from going off the rails and deleting databases, but still seeing what the capabilities are while you're building trust. Awesome. How about one last one?

Great. We're fine tuning that actually as we go. Awesome. So not necessarily trusting the output and building up.

Using it as a tool, but you can't just trust it. No blind trust. Yes. Great.

There's one other thing that is coming out of industry, and it's this idea that we need some way to leverage the experts in our company and figure out how you measure in comparison to how an expert would do a particular task. And that measurement capability is actually the foundation upon which what we're talking about today is built. So, measuring how the AI is performing: in industry terms, it's called LLM evaluations.

So here's some industry quotes that talk about really if you're not investing in evals, you're not really shipping, you're guessing. A lot of pilots now are coming out of the gate. Pilots, when you're just trying to prove something out, are coming with some sort of measurement ability. And we'll talk more about that.

So that's sort of the foundation. Then the MIT State of AI in Business report says, here's the big problem, the big barrier that everybody is running into. And again, it's not model quality, it's that these AI systems that we're interacting with aren't learning. So if you have a team or group or division or department using AI, and we're working with it and teaching it, like, here's how we do some things, oh, this isn't really documented, but this is how we like to work, none of that is captured.

You have the thumbs up, thumbs down. We've all seen these. Thumbs down, then you can give feedback. But who is that helping?

It's not helping your team or your company. It's helping big tech make a better model that really, you know, is improving things for everybody, but not necessarily for your particular group. So MIT talks about how AI is really great for individual brainstorming but stalls out in enterprise settings because these systems lack iterative learning. And I actually heard a great quote at lunch. Someone was saying, I'm not interested in AI that is just for an individual. I want to be thinking about AI that is for multiple people, that's for teams.

And that's where a lot of companies are right now. They want to break out from the individual use. So MIT gave this big problem, but they actually didn't give us an answer. They just said, it's a problem.

And we think the solution is that you need some sort of loop, a learning loop, but they didn't tell us how. And that's what we're sharing today. Fine-tuning, that's the typical go-to. Fine-tuning requires thousands, sometimes hundreds of thousands of examples.

When you do fine-tuning, you are changing the weights in the model. These big, giant, smart models are actually nothing more than a CSV file. So you're changing the numbers in the CSV. When you do that, you have higher standards when it comes to governance and how the model is governed within your organization. It's really difficult to fine-tune the models that we like using.

We like using Anthropic and OpenAI. But fine-tuning is something that typically happens with open weight models. And then finally, when you go to fine-tuning, which really kind of changes the behavior of the model so it can adapt to your domain, it's a whole other tier of expenses. Like, you're basically getting into MLOps, AI ops.

It's a lot of expense to do that instead of paying pennies per interaction. Okay, so those are some of the downfalls of fine-tuning. Another interesting thing as we get into the solution is LLMs are very good at doing what you tell them, and a lot of the failures that we encounter come from not being specific enough, not removing ambiguity. So most prompt failures are actually knowledge gaps, where some of these principles or strategies, we just didn't have a way, we didn't know how to express them.

And then the AI gives us a result, and we're like, well, that's kind of stupid. But the key point here is that we all know what a prompt is, of course. Probably a lot of people know that there's also something called a system prompt or a system message. It's like a prompt that we never see, behind the scenes, that says: be kind to the person you're working with, here are some ethical guidelines, and here are some rules to follow.

So the key point, though, is that prompt changes, changing that wording just a little bit, can make massive differences in how the model performs. And this used to be prompt engineering. People made fun of it. It was going to be a $200,000 career, prompt engineering. But what we're showing today is actually a mathematical, a scientific way to change the words that are in every single application that we're using.

Even if you're just using a team copilot, there's a way to give it a prompt that helps it work with your team better and recognize how your business is working. These small prompt changes are a way to mathematically, scientifically, methodically change the prompt to get better outcomes for you. This is a guy who talks a lot about AI on the internet, and he says LLMs just don't come with instructions in the box. So that's kind of the thing.

So, some of the go-tos that we have. Fine-tuning: okay, it's expensive. That's the downfall. RAG, that's where you pull in all of your documents.

Maybe you have existing SOPs. That works great. We love RAG. But it doesn't help you bring out the reasoning and the strategy that your team is using when you're doing work within your organization.

Memory is really good. But we don't want memorization of just things that have been done in the past. We want to extract lessons from that. So all of these things that we want just don't exist yet until this method that we're going to show you.

If anybody's working on skills, a skill is just a text file that helps you guide your AI to do a particular thing. And that actually isn't scaling very well. A lot of people are trying to figure out how you scale that beyond just one person. How do you scale that to multiple people, a team?

So those are the problems that we're trying to answer. That's what we are going to answer today. And this is the kind of what if we could? What if we had a way to do all of this continuous improvement, to have it be fully auditable?

Because instead of changing model weights, these numbers in a giant CSV file, we're changing the instructions. And you can see, oh, the model is performing better because of these words. An auditor can look at that and say, I understand what that's doing. There's all of these great benefits.

And they just so happen to get you similar to sometimes even better performance gains when compared to fine-tuning, the really expensive thing. I'm not going to hold you in suspense further. This is the library that came out that really does this magic thing that kind of codifies what the author at Stanford released. It allows you to take text and take some feedback on why that text was good or bad and make better text.

It works on text. So it's super simple stuff. It can actually be done with a spreadsheet. Take some really bad interactions with your AI, talk to some subject matter experts, and say, why was it bad?

We're going to do a chat thing. Yeah, AI is not all about chat, but it's an example that we all recognize. So we ask a question, we get an answer, we give a thumbs up or thumbs down. Say someone's asking about photosynthesis. We say it was a thumbs down because you didn't talk about light absorption details.

So to be more precise, photosynthesis is a two-step blah, blah, blah. So we have a feedback loop where the humans, or maybe even smarter, super expensive models, are saying, here's a better answer. We're applying that to food. We have lots of insurance and retail and energy and defense and med tech examples, but we're applying it to food.

There's a rating. There's a rating system called Michelin stars, and if you're traveling, it tells you, hey, these are some useful places to go for an interesting experience. So it's like fancy food. But that's what we want to do.

We have an example. It's running in code. You can stop by our booth and we'll show you the code if you want to see the details of it. But we're saying, OK, how would you make an omelet?

And then how would a Michelin fancy chef make an omelet? So it's that difference between how anybody would do it and then how an expert would do it. And what are we doing behind the scenes? We're basically taking, we're asking Michelin-trained chefs, what is the really great answer of how you make an omelet?

What are all the things that you need to consider: temperature, seasoning, and all the things? And we use AI to say, okay, the answer that was given (this is a simplified example), how many of the bullet points or statements made by the expert chef were represented in the answer that the AI gave you? Maybe one out of five. So then we tell the AI why it was wrong, and we do that for 30, 40, 50 cycles, and run it through the system, and give it a new system prompt.
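
The scoring step described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the speakers' code: in the talk an AI judge decides whether each expert point is represented in the answer, whereas here a naive keyword check stands in for that judge. The names `coverage_score` and `omelet_points`, and the checklist itself, are hypothetical.

```python
def coverage_score(answer: str, expert_points: dict[str, list[str]]) -> float:
    """Fraction of expert points that the answer mentions.

    expert_points maps each expert point to keywords that count as evidence
    for it. The keyword check is a crude stand-in for an AI judge.
    """
    if not expert_points:
        return 0.0
    text = answer.lower()
    covered = sum(
        1
        for keywords in expert_points.values()
        if any(keyword.lower() in text for keyword in keywords)
    )
    return covered / len(expert_points)


# Hypothetical expert checklist for "how would you make an omelet?"
omelet_points = {
    "control the heat": ["low heat", "medium-low", "temperature"],
    "season the eggs": ["salt", "season"],
    "use butter": ["butter"],
    "keep the curds moving": ["stir", "shake the pan"],
    "finish soft": ["slightly runny", "soft center"],
}

answer = "Crack two eggs into a hot pan with some butter and cook until firm."
score = coverage_score(answer, omelet_points)
print(score)  # 0.2, i.e. one out of five expert points
```

Only the butter point is matched, so the answer scores one out of five. The missing points are exactly what would be handed back to the model as feedback.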

And you get the same kind of performance gains that you get from the really super expensive reinforcement learning. Okay, that was a lot of detail. Food. We're talking about food.

I'm going to hand it over to Ben. No, thank you. So as Matt had alluded to, we are talking food. And one of the challenges that we face in organizations is sometimes we don't know the question that we should be asking.

So in this example, the question isn't really great. It's how do I make a roast? That's leaving out a lot of variables. Are there any home chefs in the room?

So you may say this is a bad question because I don't know how big the roast is. I don't know what type of meat it is. I don't know what technique you're using. All of these lead to a very generic answer.

Hey, let's flip every 30 to 45 minutes. Not super helpful. And so we give a thumbs down and we explain that flipping every 30 to 45 minutes is just actively bad advice. We need to provide better context. And 20 to 30 minutes per pound, well, that's a blunt instrument.

When we are done with the process, this goes on and on. If you want to read the full ending prompt, feel free to scan that. But the interesting thing, as Matt had alluded to, is nowhere in the final result does the word roast even appear. Instead, we're talking about the scientific qualities of the answer.

We're talking about the Maillard reaction, the browning that you get on your meat when you're cooking it. It's talking about collagen structures, what oils to use in what cases. Said simply, it learned the principles, not the answers to the questions that we are asking. And we can see a very strong result.

On the left here, how do you make a roast? This is the original. Same bad answer. But with that updated prompt, it starts out by telling you that you need to choose the right cut, asking you immediately: beef, pork, or lamb? It tells you how to prepare the meat, how to season it.

Did you have something else? I was just going to add one thing. By the way, it's been proven that if you say act like a super experienced data scientist or act like a super experienced Michelin star chef, that doesn't work. That kind of prompting does not work.

It did work a few years ago, but the models have grown since then. And why is this a problem? Why can't we just have somebody on our team sit down and write it? Well, JC had mentioned in the keynote that it is somebody's job at Anthropic to write the system prompt or the soul document.

So that is where GEPA, the parent library to optimize anything that Matt had introduced, comes in. This is a result of a research paper out of Stanford saying that reflective prompt evolution can outperform reinforcement learning. Essentially, give the AI a signal, and it's a better prompt engineer than you or I. And what it does is it helps us extract the principles, practices, strategies, and techniques, specifically not rote memorization of what the answer should have been.

This is important because, yeah, if we just say, when asked how to cook a roast, respond with this, it will always be correct. But it does not generalize to every other recipe that I may want to attempt. So how might we leverage this under the hood? How does optimize anything really work?

Well, it's doing something very similar to this. I went into ChatGPT and typed this question. It was very helpful and gave me the right formatting on the output. But essentially, we need to collect those weird examples.

Where does it fail? This is that thumbs down, by the way, from our subject matter experts. And then we request improvement. We go in and we tell our AI, I used this prompt, this system prompt, this input, the user's question, and it gave me this weird answer.

And it was weird for these reasons. And then it gives us a new prompt, and we can try again with our subject matter experts. Now, this doesn't scale very well, but it's a really good starting point. Now, what do we do if that doesn't scale?
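
The manual version of this loop amounts to assembling one message from four pieces: the system prompt, the user's input, the answer, and the experts' reasons. Below is a minimal sketch of that assembly. The function name `build_improvement_request` and the exact wording of the message are assumptions for illustration, not what the speakers typed.

```python
def build_improvement_request(
    system_prompt: str, user_input: str, answer: str, reasons: list[str]
) -> str:
    """Assemble the message a person would paste into a chat assistant."""
    reason_lines = "\n".join(f"- {reason}" for reason in reasons)
    return (
        "I used this system prompt:\n"
        f"{system_prompt}\n\n"
        "With this user input:\n"
        f"{user_input}\n\n"
        "It produced this answer:\n"
        f"{answer}\n\n"
        "Subject matter experts rejected it for these reasons:\n"
        f"{reason_lines}\n\n"
        "Rewrite the system prompt so future answers avoid these problems. "
        "Capture general principles rather than the answer to this one question."
    )


request = build_improvement_request(
    system_prompt="You are a helpful cooking assistant.",
    user_input="How do I make a roast?",
    answer="Cook 20 to 30 minutes per pound and flip every 30 to 45 minutes.",
    reasons=[
        "Flipping every 30 to 45 minutes is bad advice.",
        "Minutes per pound is a blunt instrument; it ignores cut and technique.",
    ],
)
print(request)
```

The final instruction asks for principles rather than a memorized answer, which mirrors the point made earlier about generalizing beyond a single recipe.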

We can use programming, development, to automate this process. We can use that feedback and let the system actually write its own feedback and say it's correct for these reasons: it included the right oil, it included the information about the Maillard reaction, but it missed information about the internal temperature. The difference between a traditional optimization in machine learning, also known as gradient descent, and GEPA is that the only signal a traditional system will get is that number, 0 to 1.

Not a whole lot of understanding of where did we go right and where did we go wrong. So that natural language feedback is invaluable. Let's take a look at a little bit of a visual here. In machine learning, large language models, this is actually what the inside of the brain kind of looks like conceptually.
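
To make the contrast concrete, here is a small sketch of an evaluation result that carries both the number and the written explanation. The `Evaluation` type and `evaluate` function are hypothetical names. How the judge decides which points were covered is left out, and the lists are supplied by hand.

```python
from dataclasses import dataclass


@dataclass
class Evaluation:
    score: float   # the only thing a purely numeric optimizer would see
    feedback: str  # the extra natural-language signal


def evaluate(covered: list[str], missed: list[str]) -> Evaluation:
    """Turn a judge's checklist results into a score plus written feedback."""
    total = len(covered) + len(missed)
    score = len(covered) / total if total else 0.0
    parts = []
    if covered:
        parts.append("Correct because it included: " + ", ".join(covered) + ".")
    if missed:
        parts.append("It missed: " + ", ".join(missed) + ".")
    return Evaluation(score=score, feedback=" ".join(parts))


result = evaluate(
    covered=["the right oil", "the Maillard reaction"],
    missed=["the internal temperature"],
)
print(round(result.score, 2))  # 0.67
print(result.feedback)
```

The score alone (0.67) says the answer was partly right. The feedback string says which part was wrong, and that is what a prompt-rewriting step can act on.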

We have all of these hills. These are all of the expert topics. And that red ball there, that is what a traditional process will do. It starts at some point in the map, and it starts trying to climb the hills around it, getting to the highest point on that plane.

The problem is it climbs to not quite the highest hill, but to something that is, oh, middle of the road. But because of the way that technology works, it gets stuck. Meanwhile, GEPA, the blue ball, or the green ball, sorry, is actually jumping around because it's given human feedback, and it's able to see, oh, I need to jump over here, and eventually it found that peak much faster and with hundreds, not thousands, of examples. You can actually run this with as few as 10 examples, but in our example, I think we had 200 question-and-answer pairs from experts.
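
The "gets stuck" behavior of the red ball can be shown with a toy example. This sketch only illustrates a greedy climber stopping on a local peak. It is not gradient descent on a real model and it does not implement the feedback-driven jumping described for the other ball. The landscape values are made up.

```python
def hill_climb(heights: list[int], start: int) -> int:
    """Move to the higher neighbor until none exists; return the final index."""
    position = start
    while True:
        neighbors = [
            i for i in (position - 1, position + 1) if 0 <= i < len(heights)
        ]
        best = max(neighbors, key=lambda i: heights[i])
        if heights[best] <= heights[position]:
            return position  # no neighbor is higher, so the climber stops here
        position = best


# Hypothetical one-dimensional "landscape" of answer quality.
landscape = [1, 3, 5, 4, 2, 6, 9, 7]
stuck_at = hill_climb(landscape, start=0)
best_possible = max(range(len(landscape)), key=lambda i: landscape[i])
print(landscape[stuck_at], landscape[best_possible])  # 5 9
```

Starting from the left, the climber stops at height 5 because both neighbors are lower, even though a peak of height 9 exists further along.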

So the question that I have for everyone is this. We're not optimizing prompts. We're optimizing text. Where else might we see text in our AI applications? Any examples?

This is what powers the evaluations that Matt had talked about. This is a stand-in for our users so that we can test tens to hundreds of different prompts along the way without driving our subject matter experts up a wall. So we are almost through all of the math-heavy stuff, I promise. But this was too cool not to show, so I wanted to bring this to light.

This is why GEPA is so effective compared to traditional methods. That 0 circle at the top, that is where we start. That's the initial, how do I make a roast question. It didn't do very well.

It allows us to more effectively explore the expertise of our language models. And a really cool thing about this optimize anything is that the library now spits out a graph like this, and you can hover over each one of those nodes and see how the prompt has evolved. So hover over node 0, and it's "you're a helpful agent, try not to be rude." And then hover over node 12, and it has all the stuff in there about how you need to pay attention to flavor and how you lock it in and the chemistry of cooking.

We no longer are asking, is it good enough? We can now constantly say this prompt sits on the, how do we say it, the efficient frontier of our evaluations. Said simply, it aligns with our subject matter experts. So what actually changes when we do this?
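
One common reading of "efficient frontier" is a Pareto frontier: the set of prompt versions that no other version beats on every evaluation example at once. The sketch below computes that set for made-up scores. It is an illustration of the idea under that assumption, not a description of how the library selects candidates. The node names and numbers are hypothetical.

```python
def dominates(a: list[float], b: list[float]) -> bool:
    """True if a is at least as good as b everywhere and better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def efficient_frontier(scores: dict[str, list[float]]) -> list[str]:
    """Names of prompts whose per-example scores no other prompt dominates."""
    return [
        name
        for name, row in scores.items()
        if not any(
            dominates(other, row)
            for other_name, other in scores.items()
            if other_name != name
        )
    ]


# Hypothetical scores for three prompt versions on three validation questions.
scores = {
    "node_0": [0.2, 0.4, 0.2],
    "node_7": [0.8, 0.4, 0.6],
    "node_12": [0.6, 1.0, 0.6],
}
frontier = efficient_frontier(scores)
print(frontier)  # ['node_7', 'node_12']
```

Here `node_0` drops out because `node_7` is at least as good on every question, while `node_7` and `node_12` both stay because each wins on a different question.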

We're changing the instructions and logic. Matt had mentioned this is an auditor's dream. No longer are we looking at why is this a 0.5 instead of a 0.4? We're looking at make sure to remember the Maillard reaction.

Know to use avocado oil in these scenarios, olive oil in these scenarios, and sunflower oil here. We aren't locked into specific models. We can continue to use the Claudes and the GPTs of the world, but we still have the flexibility to use those open-weight models if we choose. And because it's just instructions, it means that undoing these changes takes minutes, not days or weeks.

Before, our code looked something like this, and our average score, it was getting about 67% of the answer correct. And if we looked for strict accuracy, meaning it hit every bullet point that an expert cared about, we only got 35% of the questions correct. So afterwards, we saw task-specific information. It very clearly described the inputs.

It described what output it's expecting. Lean into food science. It added the domain-specific knowledge, the food chemistry. And most importantly, it defines strategies, understanding the difference between different cooking methods and the trade-offs.
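
The two "before" metrics quoted above, average score and strict accuracy, measure different things, and a short calculation shows why they can sit far apart. The per-question coverage values below are made up for illustration and are not the figures from the talk. Both functions assume a non-empty list.

```python
def average_score(coverages: list[float]) -> float:
    """Mean fraction of expert points covered across all questions."""
    return sum(coverages) / len(coverages)


def strict_accuracy(coverages: list[float]) -> float:
    """Share of questions where every expert point was covered."""
    return sum(1 for c in coverages if c == 1.0) / len(coverages)


# Hypothetical per-question coverage values.
coverages = [1.0, 0.5, 0.75, 1.0, 0.25]
print(average_score(coverages))    # 0.7
print(strict_accuracy(coverages))  # 0.4
```

Partial credit lifts the average, but strict accuracy only counts the questions where nothing the expert cared about was missed.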

Again, this is just from listening to our subject matter experts. So, a couple of other examples from industry at scale. This is an example from Shopify. Shopify runs one of the largest e-commerce platforms in the world, probably second only to Amazon. And they were running a very expensive system, analyzing every storefront.

They were spending millions of dollars a year, and they covered 13% of stores. Not very great. They used GEPA to actually train a smaller model to be more effective, and now they can cover 100% of shops 75 times cheaper. They got over five times the coverage, and they spent 75 times less. Another example is from Dropbox. If you don't know, when you search for a file in Dropbox, your search and the files are actually going to AI and saying, hey, does this file and description match this search term?

And again, Dropbox used GEPA to use, again, a smaller model and cut the time to adapt to changing needs from their users from weeks to days, all just from listening to feedback. This is a bit more local. I'm currently working with a Fortune 500 client, and they're having AI write queries against their data lake. Think Databricks or Power BI.

When we started out, the AI knew nothing about that environment, and it was only scoring a 58%. In under an hour and less than $5 worth of AI usage, we got all the way up to 89%. Again, all just leveraging existing knowledge from the team, saying yes or no. So again, let's go ahead through what changed.

We went domain-specific rather than general. We kept everything task-specific. We learned strategies and prescribed what output we actually cared about. All auditable.

Our data governance folks love it. So let's flip the script. Instead of giving the thumbs up and thumbs down information to ChatGPT, to Claude, to Copilot, let's bring that back internally and improve our own products, creating the competitive advantage instead of just rising with the tide. I'll leave you all with an architecture overview.

This outlines what we've been talking about today. That chat has a thumbs up and thumbs down. Our users can provide feedback. And then all of that goes into this optimization pipeline.

We store that in a tool, an open source tool called Phoenix, so that the data never leaves our customer's environment. And then that is used to train up a user judge. And the user judge, along with the user feedback, allows us to optimize and say, yes or no, we are actually improving. This allows us to run on a weekly or monthly basis, depending on the amount of feedback we've gotten, and continuously improve with how the team uses the tool.

I'll pass it back over to Matt to close this out, and then we'll be ready for some questions. So yeah, this is second to last slide. This is if you want to use this tomorrow. Here's what you can do.

Yes, you need access to a developer who will pull down that optimize anything library. But all you need to give them is a spreadsheet. So a spreadsheet is 10 to 30 interactions with an AI system. And then you sit that down in front of the expert and you say, where did this go wrong?

And they write down, okay, here's where it went wrong. This was, like, completely off. This part they got; this was actually a good part of the answer. But then you get the expert feedback in there.
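
A spreadsheet like the one described above can be read with nothing more than the standard library. The sketch below is one possible shape for it. The column names (`question`, `answer`, `rating`, `expert_feedback`) are assumptions, since the talk only says "a spreadsheet", and the two rows are invented examples.

```python
import csv
import io

# Hypothetical column names and rows; a real sheet would have 10 to 30 rows.
SHEET = """question,answer,rating,expert_feedback
How do I make a roast?,Cook 20 to 30 minutes per pound.,down,"Minutes per pound is a blunt instrument; ask about cut and size."
How do I make an omelet?,"Whisk eggs, cook gently in butter.",up,Good: mentions gentle heat and butter.
"""


def load_feedback(csv_text: str) -> list[dict[str, str]]:
    """Read rows and keep only those where the expert wrote something."""
    rows = csv.DictReader(io.StringIO(csv_text))
    return [row for row in rows if row["expert_feedback"].strip()]


examples = load_feedback(SHEET)
failures = [row for row in examples if row["rating"] == "down"]
print(len(examples), len(failures))  # 2 1
```

With a file exported from a spreadsheet tool, the same `csv.DictReader` call works on `open("feedback.csv", newline="")` instead of the in-memory string; the file name here is a placeholder.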

You feed it to optimize anything. It spits out the text. That's all it is: text. It doesn't matter how you're building your GenAI app.

And there is some place, no matter what you're using, from Office Copilot to custom GenAI, where you can drop in this text and get big improvements, again, without the expense of fine-tuning. So we're obviously super excited about it.

Hopefully some of that enthusiasm rubs off. And let's see, we have one other thing: there are resources in QR codes at our booth in the sponsor area, and Ben and I will be over there to take any deep dives for anybody who wants to dig into code or more specifics. But we'd love to hear what questions you have or where we can provide some clarity. Yes, Adam.

I'm just curious to understand how this compares to the Andrej Karpathy approach to self-improvement and if that's been played into this model or this way of approaching improvement.

The question, for the recording (sorry everyone, that mic does not go through the recording, so I'll be repeating): how does this compare to the Andrej Karpathy auto research that was unveiled, what was it, maybe a month ago? They're very similar in concept. GEPA is a year, year and a half old at this point.

So auto research is really focused on the architecture of training models; this is very much optimizing text. They can be one and the same, because again, the code to train a model is also text. So I would say that both are feasible and show the same promise. Yeah, and one of the QR codes is an Andrej Karpathy post, because all things lead back to Andrej Karpathy.

Yes. Adam again. Why did you decide Phoenix? And what's the significance behind the Phoenix portion of your database?

Observability in AI is really important. So you need some ability to kind of log the traces or the interactions, the turns between the user and the AI. And that's basically what we're doing with that product, and it's one of the open source products that Ben supports and helps. But basically what we do from there is we kind of click through a bunch of things where we said, well, these are really great examples or these are really bad examples.

And we do the classic data science thing. We turn it into a data set. We carve it up: we tag 80% of it for training data. We hold out 20% for validation data.
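
The 80/20 carve-up can be sketched with the standard library alone. The function name `split_dataset`, the fixed seed, and the placeholder examples are assumptions for illustration. How the speakers actually tag examples in their tooling is not shown in the talk.

```python
import random


def split_dataset(examples: list, train_fraction: float = 0.8, seed: int = 0):
    """Shuffle a copy of the examples and split into (train, validation)."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Placeholder interactions standing in for tagged traces.
examples = [f"interaction-{i}" for i in range(200)]
train, validation = split_dataset(examples)
print(len(train), len(validation))  # 160 40
```

Holding out the validation portion means a new prompt is judged on interactions it was never optimized against, which guards against the rote memorization mentioned earlier.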

And then we point optimize anything to that data set and have it give a new prompt. And we set the new prompt in Phoenix. Phoenix also stores prompts. And then our app just automatically pulls in the new prompt text.
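
The idea of an app that always pulls the newest prompt, and can step back to an earlier one in minutes, can be sketched with a tiny versioned store. The `PromptStore` class below is an in-memory stand-in invented for illustration. It is not the Phoenix API, and the prompt texts are made up.

```python
class PromptStore:
    """In-memory stand-in for a prompt registry; not the Phoenix API."""

    def __init__(self, initial_prompt: str) -> None:
        self._versions = [initial_prompt]

    def publish(self, prompt: str) -> int:
        """Save a new version and return its version number."""
        self._versions.append(prompt)
        return len(self._versions) - 1

    def latest(self) -> str:
        """What the app reads on each request, so no redeploy is needed."""
        return self._versions[-1]

    def rollback(self) -> str:
        """Drop the newest version, keeping at least the original prompt."""
        if len(self._versions) > 1:
            self._versions.pop()
        return self.latest()


store = PromptStore("You are a helpful agent. Try not to be rude.")
store.publish("Explain the food science behind each step before giving times.")
current = store.latest()     # the optimized prompt
restored = store.rollback()  # back to the original prompt
print(current)
print(restored)
```

Because the prompt is plain text held outside the application code, publishing or undoing a change is a data update rather than a release.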

That was one way of explaining it. Yeah, I'll go ahead and echo what you said, Matt. Absolutely, it all holds true. A slightly different reason why we chose Phoenix is that it is open source. We don't have to worry about the data residency problem.

If we have clients that are all on-prem or all in their own cloud, it makes it really easy for us to adhere to those desires of keeping everything private, not sending off our very valuable LLM interactions, and really company data, to a third party. We're able to stay in control of that. And then on top of that, the feature set of Phoenix just all meshes very, very well with a system like optimize anything. Anyone other than Adam?

He is. I guess I could ask what I've been asking pretty much everybody, but I teach engineering transfer courses for students that are going on to a four-year program from DMACC. And so, most of what I'm interested in here today, and he also teaches at DMACC, is: what are the main tools that you would say are important that we make sure our students understand before they go out into the workforce? Two, three, four years from now, Daniel.

Yeah, that's an interesting question, and I think it is unfortunately a little bit dependent on what path they choose to take. I would give different advice to somebody looking to become a data scientist than to somebody becoming a data engineer, and different advice again to a software engineer. So I think the general advice that I would give to anyone going from a two-year to a four-year degree, like DMACC to Iowa State, as an example, would be: remain curious, learn to learn. Don't get too hung up on one specific tool set.

Because we've seen with the age of AI, things change so frequently that if we spend too much time making sure that this one tool set is perfect, we run the risk of that being out of date by the time they're out of school. We saw this back in the early 2010s with Hadoop clusters. They were all the rage. Everyone had it.

You need to go into Hadoop. And now I haven't worked with anyone that has a Hadoop cluster in a few years. So that is where I would be leaning is learn how to learn. Don't be loyal to any one tool.

Understand that judgment point. Matt, do you have anything to add there? No, great answer.

If I'm building a domain-specific AI tool for AEC workflows using, like, the Revit API, where would you start with the self-improvement loop: with the feedback on incorrect element detection, missed clashes, or something else?

I'm sorry, I'm not familiar. Could you, what is an AEC environment there? Sorry, like construction industry. Matt, do you have any thoughts there?

So it really becomes a domain-agnostic way to make domain-specific AI.