Tabular Foundation Models Meet Manufacturing: A Practical Exploration

PRODUCTION AND OPERATIONS

11:15 AM – 12:00 PM

Room 275

SPEAKER

Aditya Balu

Data Scientist, Iowa State University - Translational AI Center

## Session: Tabular Foundation Models Meet Manufacturing: A Practical Exploration
**Track:** Production and Operations | **Time:** 11:15 AM–12:00 PM | **Room:** 275 | **Type:** Expert Talk
**Conference:** CIRAS AI Summit for Iowa — May 6, 2026, Scheman Building, Iowa State University, Ames IA

### Speaker(s)

**Aditya Balu** — Data Scientist, Iowa State University - Translational AI Center (Ames, IA)
Aditya Balu is a Data Scientist at the Translational AI Center (TrAC) at Iowa State University, with over 14 years of experience in AI and machine learning. His research spans scientific machine learning, topology optimization, additive manufacturing, and deep learning for engineering applications, with publications in venues including Nature Computational Science, ICML, and Engineering Applications of Artificial Intelligence.

He also develops and teaches AI/ML micro-credential courses through Iowa State's TrAC, covering topics from natural language processing to scientific machine learning. His work sits at the intersection of AI and manufacturing, bridging academic research with practical industry applications.

### Session Description

Manufacturing AI problems share a common profile: small labeled datasets, heterogeneous process and sensor variables, missing values, and the need for reliable predictions with minimal tuning. For years, gradient-boosted trees like XGBoost and CatBoost have been the default choice for these tabular prediction tasks — from predicting tool wear in milling to estimating creep rupture life of turbine components to detecting process anomalies.

A new class of pretrained models — tabular foundation models (TFMs) — is challenging this status quo. Models such as TabPFN, TabICL, and Mitra can ingest raw tabular data and deliver competitive predictions in seconds without task-specific training, hyperparameter tuning, or elaborate feature engineering. Their strengths — robustness to missing data, handling of mixed feature types, and strong performance in small-sample regimes — align remarkably well with the realities of manufacturing data.

This talk introduces tabular foundation models to the manufacturing and applied AI community. We begin with an accessible overview of how TFMs work and what distinguishes them from conventional ML pipelines. Through select case studies in machining and materials performance prediction, we explore what changes when a traditional ML workflow is replaced with a tabular foundation model on real manufacturing problems. We examine where these models deliver genuine advantages, where they encounter limitations, and what practical considerations arise when thinking about deployment. The talk concludes with a forward look at open opportunities at this intersection — including few-shot anomaly detection, integration with physics-informed modeling, cross-process transfer learning, and real-time shop floor deployment.

### Other sessions in the Production and Operations track

- Success Story #1 - Vision AI Efforts in Attribute Detections and Measurements (3:10 PM–3:55 PM)
- Success Story #2 - Natural Language Search for Member Benefits (3:10 PM–3:55 PM)
- Industrial AI Success Stories: Because Even My Title Needed Machine Learning (10:20 AM–11:05 AM)
- AI Attribute Intelligence: Automating Detection, Extraction, and Standardization at Scale (1:20 PM–2:05 PM)

Key Takeaways

Tabular foundation models are a natural fit for manufacturing AI problems.
TFMs don't replace domain expertise — they lower the barrier to entry.
The intersection of tabular foundation models and manufacturing is wide open.

Continue the conversation with Aditya Balu at the Production & Operations Facilitated Discussion — 2:15 PM - 3:00 PM, Room 220-230-240

Transcript from Summit:

00:00 Introduction of Dr. Aditya Balu Slide: 1

aditya balu artificial intelligence machine learning manufacturing tabular foundation models

Thank you for being here in the production and operations track. My name is Jake Behrens. I'll be helping moderate the room here for these sessions today, and I have the honor of introducing our speakers as well. So good morning. It's my pleasure to introduce Dr. Aditya Balu, a data scientist at the Translational AI Center here at Iowa State University. With over 15 years of experience in artificial intelligence and machine learning, Dr. Balu has contributed extensively to the field, with research published in leading venues such as Nature Computational Science, ICML, and Engineering Applications of Artificial Intelligence. His work focuses on applying advanced AI techniques to engineering and manufacturing challenges, bridging cutting-edge research with real-world impact. In addition, he plays an active role in developing and teaching AI and machine learning programs to help translate these innovations into practice. In today's session, Dr. Balu will explore the emerging role of tabular foundation models in manufacturing, highlighting how these approaches can simplify workflows, improve predictions, and open new possibilities for industrial AI applications.

01:02 Tabular Data and ChatGPT Limitations Slide: 1

tabular data chatgpt predictions manufacturing rate large language models

So please join me in welcoming Dr. Aditya Balu. Thanks. Thanks a lot, Jake. I hope I'm audible. Good morning, everyone, and thanks, Jake, for the nice introduction. Today I'll be talking about tabular foundation models for manufacturing. Just to give you the context of how this comes in: I'm sure you have all seen some kind of tabular data, a bunch of inputs and probably an output. You want to predict what manufacturing rate you have, or when you're going to get your order from Amazon, whether it's going to come tomorrow or the day after. All of these are tabular data in some form or other. And have you ever wondered, can I just dump all this information into ChatGPT, ask it to give me the predictions, and be done with it? Has anyone tried something like that so far? Okay, at least one.

02:03 Why LLMs Are Not Suited for Tabular Data Slide: 1

large language models foundation models tabular data language models data type mismatch

But I'm sure... do you have any experience you want to share in terms of what you saw when you did that? Yeah, so I'm actually just a freshman. I would say some of the more basic ones would be like, maybe I'm working on a problem with a lot of data or parts to it. I try to upload it all at once and sometimes it gets lost or it doesn't do the right calculation. Sometimes I might break it up and sometimes that helps, but other times, if it's just too much, it doesn't work. Yeah. So as you mentioned, too much data is one thing, but fundamentally there's one problem with dumping all the tabular data into ChatGPT or any of your favorite LLMs that you're using today. The problem is that they are not meant for tabular data. They are meant for language. They are meant to have a conversation with you. They're usually called large language models, or foundation models for language. But at the same time, in the last few years there's been a big shift towards building foundation models for tabular data.

03:09 Traditional Machine Learning vs Foundation Models Slide: 2

traditional machine learning feature engineering data cleaning missing data model training

And that is the term tabular foundation models that you're seeing here. To compare a foundation model with what we do in traditional machine learning: in traditional machine learning you have a new data set, and essentially you train a fresh model, tune the hyperparameters, validate, and then deploy. And you do this for almost every task that you are trying to work on. More often than not, you have to train the models from scratch. You have to do a lot of feature engineering. And anyone who has worked with real-world data knows that the most important thing is cleaning up the data. There are a lot of cases where you have missing data. You have cases where the data is noisy, and you have to do a lot of data imputation and things like that before you even get started training the model. And what we saw as an opportunity is foundation models, right?

04:17 Foundation Models Handle Noisy Data Slide: 2

noisy data missing data data imputation chatgpt analogy preprocessing

So think about it: today in ChatGPT, even if you write gibberish, sentences that are not perfectly grammatical and perhaps have a lot of typos, it still understands what you're saying, it still tries to figure out what you're trying to do, and it gives you an output which is probably relevant. Unless you are giving it complete nonsense and expecting some output, in which case it may not. But as long as the input is reasonably okay, it gives you a somewhat reasonable output, right? Similarly, in the case of tabular foundation models, if there is some noise, if there are cases with missing data, if there are a lot of these issues, you don't have to clean it up yourself. That is one relief that you'll see. And the next part is, just like with ChatGPT, for say a medical application you don't train a separate ChatGPT model, and for a finance application you don't train your own model directly.

05:21 Pre-training on Synthetic Tabular Data Slide: 2

pre-training synthetic data zero-shot few-shot domain agnostic

At least you can fine-tune them later on, but more often than not, just using ChatGPT alone gets you quite far. It is the same idea with tabular foundation models. There is one foundation model that is pre-trained with a lot of synthetic data covering different possibilities. So think about it: you're pre-training on almost millions of tabular data sets. It is synthetic data, but it is tabular, so the model has learned a lot about how the numbers behave, how the trends shift, what kinds of trends are even possible in data, and things like that. It gets some kind of insight from it, in some sense. In the same way, once you have this pre-trained model, be it for a medical application, a finance application, an agriculture application, or whatnot, we can use the foundation model to make predictions zero-shot, or perhaps few-shot.

06:34 Zero-Shot and Few-Shot Learning Explained Slide: 2

zero-shot few-shot low-data regime training examples data efficiency

Okay, so just to explain what zero-shot or few-shot means: in ChatGPT, you can say, here are some examples of how I want the response to be. That is few-shot, where you're giving some examples. Zero-shot is where you're just saying, this is what it is, give me an output. In both cases you can work with it. So to summarize the unique benefits of using a tabular foundation model: first, having one foundation model helps us reduce the training time. You don't have to train your own models. It makes things robust and so on. Second, it works especially well in the low-data regime. Earlier, if you had to train your own model, the question was, how much data do I need? How many data points should I collect? And the answer usually goes into the thousands, probably more than thousands, into millions.

07:39 Benefits of Tabular Foundation Models Slide: 2

reduced training time data robustness low-data regime manufacturing domain agnostic

But with these kinds of tabular foundation models, perhaps you just need 20 or 30 examples, and you may get much better results than you would have imagined. That's one particular benefit that I see, particularly in manufacturing, where collecting data is very difficult. That is one particular reason why tabular foundation models are very impactful. The third is data imputation. As I mentioned, you don't have to clean up the data. You don't have to do any pre-processing per se before you feed in the data. The model can handle it even if the data is noisy, even if data is missing. The fourth is that it is domain agnostic, as I already mentioned. And the fifth is that tabular intelligence in itself is something that's being developed in recent times, and a lot of industries are investing in it.

08:45 Industry Investment in Tabular Intelligence Slide: 4

tabular intelligence amazon mitra prior labs startups industry investment

In fact, Amazon has its own tabular foundation model called Mitra, and Prior Labs is a startup which is working on tabular foundation models. And then there are other startups which are doing a lot of work in building zero-shot and few-shot models, which help us do a lot of tabular intelligence. I do have some demos I can show you at the end if time permits. So, to give you an idea: of course there is a whole story of how the model is trained, its architecture, and so on, but the idea is that what you feed the model is your X train, which is the inputs of your training data, or whatever small set of data you're providing as context. Then Y train is the labels for those inputs. And X test is the samples, the inputs for which you want to make predictions.

09:47 Tabular Foundation Model Architecture Slide: 4

tabpfn transformer architecture self-attention prior labs context window

So it's just like saying, here are a few examples: these are X, these are Y. And you're asking for the prediction on a few X's that you have. You know the inputs and you want to know what the output is. Just like GPT is a transformer-based model, this is also a transformer-based model. TabPFN v2 is one of the tabular foundation models that exist right now. Apart from that, there are many others like TabICL, TabDPT, and so on. These are all tabular foundation models, but the most famous one is TabPFN, short for prior-data fitted networks. It was created by a startup called Prior Labs, based out of Germany. I think they also have an office in New York now. These TabPFN models are also based on the Transformer. They use the same self-attention, across the rows, to create a context. And earlier, when we started working on TabPFN, the context window was around 10,000 rows and 500 features, or 500 columns, of tabular data.
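The context-and-query interface described here can be illustrated with a toy sketch. The snippet below is not TabPFN and involves no pretraining; it is a hand-written, attention-style stand-in (a softmax-weighted average over context rows) that only shows how X train, Y train, and X test fit together in one pass with no fitting step. The function name `in_context_predict`, the `temperature` parameter, and the data are all hypothetical.

```python
import math


def in_context_predict(X_train, y_train, X_test, temperature=1.0):
    """Toy stand-in for in-context prediction (NOT TabPFN).

    Each test row attends to every context row; the prediction is the
    attention-weighted mean of the context labels.
    """
    predictions = []
    for query in X_test:
        # Attention score: negative squared distance to each context row.
        scores = [
            -sum((q - k) ** 2 for q, k in zip(query, key)) / temperature
            for key in X_train
        ]
        # Softmax over the context rows (shifted by the max for stability).
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        predictions.append(
            sum(w * y for w, y in zip(weights, y_train)) / total
        )
    return predictions


# Hypothetical context: four labelled rows with two features each.
X_train = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
y_train = [0.0, 1.0, 1.0, 2.0]

# Rows to predict: one identical to a context row, one equidistant from all.
X_test = [[0.0, 0.0], [0.5, 0.5]]

preds = in_context_predict(X_train, y_train, X_test, temperature=0.1)
print(preds)
```

The open-source `tabpfn` package exposes a scikit-learn-style `fit` and `predict` interface that takes the same three arrays; the real model replaces the fixed distance rule above with a pretrained transformer.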

10:56 Expanding Context Window Limits Slides: 5, 4

context window tabpfn row limits column limits scalability

Now these models can work with almost 100,000 rows and 2,000 or 5,000 columns, one of those numbers. Again, that's not a stopping point. If you are hitting any of these hurdles, there are ways you can get around them. But this is where the current limits are for these kinds of models. Just like when we started out with ChatGPT 3.5 or 4, your context window was 256,000 tokens or something like that, but now we are working with 1,000,000 tokens. In the same way, the context window is at this stage right now, but we are hoping that it expands even further. Yes, which one? Yes, so Hollmann et al. is a paper which...

11:57 Machining Surface Roughness Prediction Example Slides: 5, 6

machining surface roughness turning operation r-squared small dataset

is basically a Nature paper, published on using TabPFN for scientific applications. Yeah. Thank you. I have a few examples I wanted to show. The first example is manufacturing machining data. It's an extreme example, but I thought, let's start with this extreme example. This is actually from a paper which came out around 2010 or so. The inputs are the cutting velocity (speed), the feed rate for a turning operation, the depth of cut, and the nose radius of the tool. The output is the surface roughness of whatever workpiece you're working on. It is a simple manufacturing process: you have four input parameters and one output. Zero-shot, using just twenty-two training rows and the rest of the samples for testing, you get an R-squared value of around 0.938.
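For reference, the R-squared score quoted here is conventionally one minus the ratio of the residual sum of squares to the total sum of squares on the held-out rows. A minimal sketch follows; the roughness values are made up for illustration and are not the data from the paper.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical measured vs. predicted surface roughness on test rows.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.1, 1.9, 3.2, 3.8]

score = r_squared(y_true, y_pred)
print(round(score, 3))
```

A score of 1.0 means perfect predictions, and a score of 0.0 means the model does no better than always predicting the mean of the test labels.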

13:18 Performance on Small Sample Sizes Slide: 7

small sample size regressor performance machine learning models low-data regime manufacturing

Getting this kind of performance zero-shot is quite good, compared with what you would expect if you trained your own model. When I started working on this very small data set, the best model I got was around 0.91 or 0.92, given the same data split. So this was a very good performance when we started out. This is one example, and as I say here, most regressors collapse below 50 samples. How do you even work with 50 samples for any machine learning model? That is a question people ask quite often, but this is one example where we have been able to do good prediction with very few shots. And then this is another example. We are in Iowa, so we should talk about agriculture to some extent.

14:21 Agriculture Yield Prediction Datasets Slide: 8

yield prediction agriculture soybeans weather data missing values

So this is agriculture yield prediction. We published this in an AAAI workshop earlier this year. We worked with three data sets. The first is for soybeans in the US. It has about 86,000 samples. Then we have a global one, from multiple regions, with around 28,000 samples. And then one specific data set for the European Union, with around 8,600 samples. The inputs, as you can imagine for yield prediction, include the kind of weather in that area, as aggregated weather features, plus some crop information and things like that. And then you also have, so this one doesn't have any missing values, but as you can see, this one has about 5 to 13% missing values.

15:22 TabPFN Outperforms Traditional Models on Yield Prediction Slide: 9

tabpfn v2 catboost xgboost random forest autogluon

And this one is categorical-heavy in terms of the samples, and very heterogeneous, because you're working with very diverse and complete samples. So you can see the large, complete case, the diverse, complete case, and then the small-but-missing case as well. These are three varieties of cases. As you can see here, TabPFN v2, almost zero-shot, performs much better than all the machine learning models that we have known all along, like CatBoost, XGBoost, and Random Forest. If you have worked in machine learning on tabular data for a while, you will have heard these names. And you can see that this performs much better than those, zero-shot. And then there's something called AutoGluon, which is essentially fine-tuning on top of whatever you get from TabPFN to make it even better. So that's essentially the whole story that we have here.

16:29 Performance on Global and Missing-Data Cases Slide: 11

global dataset random forest missing values auto-imputation performance comparison

Another example is this global case, where you can see that even zero-shot we get close to this performance. Obviously Random Forest is doing better in this case, but it's not that different: 0.9716 to 0.9794. It's not a major difference, but still something to note. The key part is the compute. You can get this result in less than a second, rather than preparing and training a model and doing all the things that you have to do for training. In the same way, you can see that in this one, where you have missing samples, Random Forest and the others don't do as well, but TabPFN v2, because of the auto-imputation and things like that, does much better than all the others.

17:30 Impact of Data Imputation on Performance Slides: 10, 11, 12

data imputation missing data automl dataset completeness performance impact

One thing I probably need to mention is how data imputation affects things in general: in terms of performance, it gives you 0.91 instead of the 0.93s and 0.97s that you have seen all along. But this certainly is an example of how it gets impacted. So this is how TabPFN v2, or tabular foundation models in general, can perform much better than what you have seen so far. If you want to look at it in terms of when to use what, you can clearly see that when you have a large and complete data set, you can always argue that you can fine-tune your model. In that case you can go with AutoGluon or AutoML-type architectures, where you create a model using TabPFN and then fine-tune it, and you do much better. But if you have diverse and perhaps complete data, then you can either go...

18:44 When to Use Tabular Foundation Models vs AutoML Slide: 12

automl tabular foundation models model selection small data large data

with tabular foundation models, or you can even go with AutoML-type architectures. But if you have small data with missing values and scenarios like that, going with tabular foundation models helps a lot. So one bottom line is that, especially if you're in a low-data regime, tabular foundation models certainly win. The second thing you'll notice is that they can work with large data as well. But because you have more data, you can always improve and do better. And the bigger picture that you need to understand is that tabular foundation models are not going to replace traditional machine learning models any day. Tabular foundation models can be further fine-tuned using AutoML and so on. But in general, the idea that we are trying to convey is: use it to get an initial guess, right?

19:47 Rapid Iteration and Digital Twin Analogy Slide: 12

rapid iteration digital twin baseline model data quality feature engineering

It's very quick, you get responses very quickly, and you can use it to work on the bigger picture rather than just the machine learning model that you're working with. Think about it: when we talk about physical AI, simulations, and so on, we always say, I don't care about the accuracy of the simulation as long as I'm able to quickly iterate and move forward, right? That's how the idea of the digital twin works. In the same way, suppose your goal is not to get a perfectly accurate machine learning model, but a close-to-accurate model, and then you want to quickly iterate and see what else you can do. Do I need to add more data? Do I need to bring in other data features, and things like that? I don't want to keep on training a model when I don't even know whether that model is really what I want to train, or whether the data is the problem, or what the problem is, right? More often than not, what I've seen when working with different industries is that there is data which you need to improve, and you also need to improve the model.

20:57 Vehicle Sensor Data from CAN Bus Slides: 13, 14

can bus vehicle sensor data agricultural combines noisy data missing values

But this at least helps me focus on the data rather than the model, because I know that the model can do about as well as I want, in some sense. There's another example here: vehicle sensor data, like what you get from a CAN bus. In this case, it's for large combines, to detect some kind of information on soil moisture or different things that you can get. Sorry, correct? Yes, so it is from that. Again, for the sake of anonymity, I'm not saying which combine, what data, and other things, but the idea is we had about 8 features of CAN bus signals, aggregated by the different unit IDs of the experiments that we were doing.

21:58 TabPFN Performance on Noisy Sensor Data Slide: 15

noisy data sensor reliability correlation tabpfn performance data heterogeneity

And it's again a tabular regression problem, and it's as real-life as it gets. This is one of the noisiest data sets that I've ever seen. It has all kinds of heterogeneity in the inputs, going all the way from NaNs to very high numbers, and at the same time very low numbers, on the order of 10 to the power minus 8. It also had a lot of missing values, cases where the sensor was not really robust. We don't even know whether the sensor can be relied on, or whether I should even use that data, and things like that. And there's no real linear structure that you can work with. So this is as real as it could get in terms of a data set. Here again, you can see all the models kind of give up, while this one did much better than the rest of them.

23:02 Sample Size Impact on Model Performance Slide: 16

sample size correlation performance plateau data utilization diminishing returns

Again, of course, you can say it's not that different, 878288. It's not that different, but the key part is we were able to do this in less than a day. So we could at least understand what the data issues are, and we could go ahead and do the other things that we wanted to do, because this model was not the only thing that was stopping us. We wanted to use this model to go build something else to improve the sensor, understand which sensors we need to replace, and things like that. One thing that you'll see concerns the number of samples that you're using, right? If you're using 10% of the samples, you get a correlation of about 0.84, but if you go all the way to 50%, you get 0.88. You can keep increasing further and see what happens, but in most cases it doesn't improve much after that.
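A sample-size study of this kind can be sketched as a small loop: keep a growing fraction of the context rows, predict the held-out rows, and record the correlation. Everything below is an illustrative assumption rather than the setup used in the talk: the data is synthetic, `learning_curve` is a hypothetical helper, and a one-nearest-neighbour rule stands in for the tabular foundation model.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def nearest_neighbour_predict(X_ctx, y_ctx, X_test):
    """Stand-in predictor: copy the label of the closest context row."""
    preds = []
    for query in X_test:
        best = min(
            range(len(X_ctx)),
            key=lambda i: sum((a - b) ** 2 for a, b in zip(query, X_ctx[i])),
        )
        preds.append(y_ctx[best])
    return preds


def learning_curve(X_ctx, y_ctx, X_test, y_test, fractions, predict_fn, seed=0):
    """Correlation on the test rows for each fraction of context rows kept."""
    order = list(range(len(X_ctx)))
    random.Random(seed).shuffle(order)
    curve = {}
    for frac in fractions:
        keep = order[: max(2, round(frac * len(order)))]
        preds = predict_fn(
            [X_ctx[i] for i in keep], [y_ctx[i] for i in keep], X_test
        )
        curve[frac] = pearson(y_test, preds)
    return curve


# Synthetic one-feature data: a linear trend plus a deterministic wiggle.
X_ctx = [[i / 199] for i in range(200)]
y_ctx = [3 * row[0] + 0.1 * math.sin(37 * i) for i, row in enumerate(X_ctx)]
X_test = [[j / 49] for j in range(50)]
y_test = [3 * row[0] for row in X_test]

curve = learning_curve(
    X_ctx, y_ctx, X_test, y_test, [0.1, 0.5, 1.0], nearest_neighbour_predict
)
for frac, corr in curve.items():
    print(f"{frac:.0%} of context rows -> correlation {corr:.3f}")
```

Passing a different `predict_fn` with the same three-argument signature lets the same loop be reused with any model that predicts from a context set.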

24:11 Relational Foundation Models and Kumo AI Slides: 15, 16, 17

kumo ai relational foundation model multiple tables conversational queries data relations

So I think it's 0.882, and it's more or less stuck there. It doesn't go further from there. One thing I wanted to talk about is how the industry is moving. So far, what we have seen is giving the model data and making a prediction. The question that I think most of you may have is: okay, what do I do with it? How does it matter to me? That's where the other models I was talking about come in. Kumo AI is one. It's a relational foundation model built on top of a tabular foundation model. Think of it this way: it works with multiple tables, it understands the relations between them, and it tries to use that to essentially have a conversation with you.

25:12 AWS Mitra and Conversational Analytics Slide: 17

aws mitra amazon quicksight s3 buckets conversational analytics natural language

You can ask questions like, what are the insights on this? Then it will essentially identify the relations among all of the tables. You don't need to flatten multiple tables into one big table and then work with it. This kind of relational foundation model is something people are using now, especially at DoorDash, Snowflake, and others, to understand what the relations are, get insights from them, and go from there. As for other models, AWS has Mitra. On top of that, has anyone here heard of Amazon Quick? Amazon Quick, or AWS Quick, is another dashboard-type platform that has these kinds of features for having a conversation based on a data set. You can have conversations based on a tabular data set. You can connect S3 buckets, work with them directly, and have conversations with the data there.

26:13 Future of Tabular Intelligence and Auto-Imputation Slide: 18

tabular intelligence auto-imputation missing values ai insights agriculture

So that's something that I've seen people do quite a lot. And again, there are other major players in this space doing something similar. If you ask me where the future is, tabular intelligence is obviously something we have seen quite a bit of across different spaces. I've covered agriculture, manufacturing, autonomous vehicles, and things like that. But you can use it for other applications as well: medical applications, FinOps, and a lot of others. The other thing is missing values. Usually you try to impute them yourself, but here you are using auto-imputation. That is something that helps us a lot. And perhaps it can help us understand that maybe missing values are not really a bug.
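One simple way to check whether missingness itself carries information, rather than being a bug to impute away, is to compare the target for rows where a reading is missing against rows where it is present. This is a minimal sketch with made-up numbers; `sensor` and `defect_rate` are hypothetical names, and it is not how any particular foundation model handles missing values internally.

```python
import math

def target_by_missingness(sensor, target):
    """Mean target where the sensor reading is missing vs. where it is present."""
    missing = [t for s, t in zip(sensor, target) if math.isnan(s)]
    present = [t for s, t in zip(sensor, target) if not math.isnan(s)]

    def mean(values):
        return sum(values) / len(values) if values else float("nan")

    return {"missing": mean(missing), "present": mean(present)}

nan = float("nan")
sensor = [0.9, nan, 1.1, nan, 1.0, 0.8]              # hypothetical readings with gaps
defect_rate = [0.02, 0.10, 0.03, 0.12, 0.01, 0.02]   # hypothetical target
gap = target_by_missingness(sensor, defect_rate)
print(gap)
```

A large difference between the two means suggests the gaps are informative (for example, a sensor that drops out under the same conditions that cause defects), so dropping or blindly imputing those rows would discard signal.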

27:16 Reduced Data Requirements for Modeling Slide: 18

data requirements low-data regime few samples model training data collection

It's perhaps something deeper that you can get from those things. And more often than not, we realize that LLMs or AI models can understand data differently from how we do. So perhaps we get a different insight than what we have seen so far. And one thing that is a relief for us is that perhaps you don't need a lot of data. So far we thought we needed to collect a lot of data to train our models. But perhaps we don't. We just need a few samples: a few hundred, or even thousands, or probably a million at most. You don't need a lot of data to start training or using your own models for tabular intelligence in particular. So with this, since I have some time, I can quickly show you a demo. But before I go there, are there any questions that I can answer for you?

28:24 Edge AI Considerations and Model Size Slide: 18

edge ai edge devices model size memory constraints compute

Edge AI would be a great example of where to use this stuff, right?

Yes. Edge AI is somewhere it will be useful, but there's one caveat to understand: these are all foundation models. Just as you can't put a big Llama model on an edge device, you'll have similar considerations here. But I think this is relatively easy. You can use it on your own laptop, so it's not that bad in terms of memory and compute. Any other questions?

Thanks. My background is in metal cutting, so I appreciate that you had the example on turning. Sometimes tabulated data has a different purpose for why it was constructed.

29:20 Tabular Data Purpose and Knowledge Encoding Slides: 17, 18

metal cutting experimental data lookup tables material properties data purpose

So your turning example is essentially a set of experimental test results that map out some of the parameter space, but there are also guidelines in handbooks that are essentially an encoding of knowledge: look-up tables of ranges of parameters to use, or look-up tables for the roughness that was achieved under certain conditions. And sometimes it's a different thing. It's a look-up table of, say, material properties. Different metals have different stiffnesses, yield strengths, and ultimate tensile strengths. I'm wondering about the role of tabular intelligence given the combination of the purpose for which the tabulated data was created and the purpose for which the user is trying to use it.

Yeah, that's an excellent question. I think that's very close to what I was talking about with Kumo AI: perhaps you have a large database, right?

30:30 Relational Databases and Natural Language Queries Slide: 18

relational databases natural language queries sql filtering kumo ai data filtering

You may have, as you mentioned, different material properties and different manufacturing conditions, and you may even have a database covering multiple manufacturing processes like turning, cutting, and milling. You can have a lot of conditions that are all part of the same database. But instead of writing a particular look-up or a SQL query yourself, you can essentially say, this is the data that I want, and then get some kind of insight from it. You could say it in natural language, just like you might go to your bank account now and say, I want to know the trends in my last year of purchases. It will then filter out the data that is relevant and provide you some insights from it. So think about it from that perspective. It can work with that relational database, understand the relations among multiple tables, or even filter the data using a SQL query, and give you something that is more relevant to what you want.
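For context, this is the kind of join-and-filter query a user would otherwise write by hand against such a database. The sketch uses Python's built-in `sqlite3` with two hypothetical tables (`materials`, `turning_runs`) and made-up values. It does not show Kumo's API or query language, only the manual step that a natural-language interface aims to replace.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE materials (
    material_id INTEGER PRIMARY KEY, name TEXT, yield_strength_mpa REAL);
CREATE TABLE turning_runs (
    run_id INTEGER PRIMARY KEY, material_id INTEGER,
    feed_mm_rev REAL, roughness_um REAL);
INSERT INTO materials VALUES (1, 'steel_a', 350.0), (2, 'aluminium_b', 275.0);
INSERT INTO turning_runs VALUES
    (1, 1, 0.10, 1.2), (2, 1, 0.20, 2.4),
    (3, 2, 0.10, 0.8), (4, 2, 0.20, 1.6);
""")

def mean_roughness_by_material(conn, max_feed):
    """Join runs to materials and average roughness for feeds up to max_feed."""
    query = """
        SELECT m.name, AVG(r.roughness_um)
        FROM turning_runs AS r
        JOIN materials AS m ON m.material_id = r.material_id
        WHERE r.feed_mm_rev <= ?
        GROUP BY m.name
        ORDER BY m.name
    """
    return conn.execute(query, (max_feed,)).fetchall()

result = mean_roughness_by_material(conn, 0.10)
print(result)
```

The question "what roughness do I get at low feed for each material?" maps to the `JOIN`, `WHERE`, and `GROUP BY` above; a relational foundation model is described in the talk as working out that mapping from the table relations.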

31:39 Handling Data Bias and Imbalanced Temperature Ranges Slide: 18

data bias imbalanced data temperature range extrapolation training distribution

But again, the key thing is to know what data sets you have, so that you can have that kind of relational graph built and actually do something like that. Does that make sense? Any other questions?

Thank you, sir. I have a question about the low data usage for training the model.

Yeah.

I understand that you don't need as much data to train the model, but if there is inherent bias in the existing data that you're using for training, how does it help with extrapolating to something that's not there in the data? So, as an example, we have temperature data. All my temperature data is around, say, 100 degrees Fahrenheit,

32:40 Model Understanding and Temperature Regimes Slide: 18

model understanding semantic meaning temperature regimes pattern recognition extrapolation limits

but there are only a few points that are at, say, 300 Fahrenheit. I understand it works on low data, but I don't have enough data for 300 Fahrenheit. Would it still be able to make predictions correctly with less data, or do we have to mash the data at the beginning itself so that there is a good spread?

Right, so it's a great question. Yes, there will be some bias from the data that you're starting with. If you're saying that you're going to start with, say, almost all 100-degree samples and then probably one or two samples at 300 Fahrenheit, expecting good results at 300 Fahrenheit may be expecting too much. Obviously, the bias is built into the model, because what it is doing is seeing some kind of relation or trend within the data. It doesn't understand that it's a temperature.

33:44 Data Quality vs Model Performance Trade-offs Slide: 18

data quality model performance anomaly detection trend modeling trade-offs

It doesn't even understand that something is changing from one temperature regime to another. In the same way, if you give ChatGPT a bunch of things and ask for something as an output, it may not do it, because it doesn't understand the connection between the multiple files that you have provided and what you're asking for as an output. So that extrapolation capability is certainly going to depend on the bias in the data that you're providing. If you provide very clean, fully balanced data, then it may do much better. I think the way to rephrase the question is: given the data, the best model performance that you could get in very little time is going to be what you get from tabular foundation models. You could perhaps invest more energy to move it a little bit, but data is king ultimately.

34:49 Use Case Example: Metal Formability Across Temperature Ranges Slide: 18

metal formability temperature ranges material properties data requirements manufacturing

You probably need to work on the data. If you are starting out and you see that there are a few 300s and the rest of the data is around 100, you see that this is the best performance you can get, and you ask: is it sufficient? Maybe it is, because you're probably doing anomaly detection. It doesn't matter whether the model thinks it is 300 or 150. It's just for predicting an anomaly. It's anomalous, so it's good. So you perhaps don't need more data. But suppose you are making a more specific prediction of some trend, such as how a metal manufacturing process behaves from 100 degrees to 300 degrees. There's a complete difference in formability; the material properties and everything change quite drastically between these two regimes. In that case, you may need to collect some more data in the rest of the regime. So it depends on the data, but at the same time, it gives you a good quick start to go from there, basically.

Thank you.
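The coverage concern raised in this exchange (many samples near 100 degrees, only a few near 300) can be checked with a simple per-regime count before any model is fitted. This is a minimal sketch; the temperatures, bin edges, and the `min_samples` threshold are all illustrative assumptions.

```python
def regime_counts(values, edges):
    """Count samples in each half-open bin [edges[i], edges[i + 1])."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        for i in range(len(edges) - 1):
            if edges[i] <= v < edges[i + 1]:
                counts[i] += 1
                break
    return counts

def sparse_regimes(values, edges, min_samples):
    """Return (low, high, count) for every bin with fewer than min_samples points."""
    counts = regime_counts(values, edges)
    return [(edges[i], edges[i + 1], c)
            for i, c in enumerate(counts) if c < min_samples]

# Hypothetical temperatures in Fahrenheit: mostly near 100, two points near 300.
temps_f = [98, 100, 101, 99, 102, 100, 97, 103, 300, 305]
flagged = sparse_regimes(temps_f, edges=[50, 150, 250, 350], min_samples=5)
print(flagged)
```

Bins flagged here are the regimes where predictions should be treated with caution, or where more data would need to be collected if the goal is a trend across regimes rather than anomaly detection.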

35:51 Demo Overview: Prior Labs and Concrete Strength Dataset Slide: 10

demo prior labs concrete compressive strength tabpfn v2 mean squared error

Any other questions? All right. If there are no further questions, and if you do have questions I can answer them later on as well, I still want to see if I can show you a quick demo. I am using two examples for the demo. One is from priorlabs.ai. That's the startup that built the TabPFN v2 model I was talking about. You can see that you can actually upload your own data set and play around with it, with either this model or previous models. In this case, I just chose one of the sample data sets. As you can see, there are a lot of samples there already. It can be sales or it can be industrial.

36:54 Uploading Datasets to Prior Labs Platform Slide: 10

prior labs platform csv upload excel upload dataset upload target variable

And you can see that there's a concrete compressive strength data set. That's what I had loaded earlier. As you can see, it has a bunch of features, and then finally the target and what the prediction is. For this particular data set, and I can provide the exact metrics, TabPFN gets better performance than even random forest and XGBoost. Linear regression has the maximum error. TabPFN v2, 2.5 plus, has the minimum error in MSE. And it provides a simple output. You can easily upload your own data set. It allows you to directly upload either a CSV file or an Excel file with a header row and around 20 to 40,000 rows,

37:57 Local Deployment and Code Access Slide: 5

local deployment code access download model data privacy local inference

including a column for what to predict. So this is a simple, user-interface-based way of doing it. Or, if you are like me and like to code, you can always get the code and run it on your own local machine. If you don't want to use that server, you can just download the model and run it locally. That is equally easy. You can just access it from here and then run it. This is just how to run the tabular foundation model alone. There is another example, which is the Kumo RFM that I was talking about. You can see that in Kumo RFM they already have a few data sets here. One of them is e-commerce, where they have data on returns, views, items, orders, and users. There are other data sets like insurance and F1 racing, or you can upload your own data set or link it to your Amazon S3 buckets or Snowflake.

39:12 Kumo RFM Demo: E-commerce Relational Data Slide: 5

kumo rfm relational foundation model e-commerce schema inference relational graph

Once you have the data, you can either infer the schema or write down the schema yourself. That's an option. Once you provide the schema and all that information, it will create a graph, something like this, to capture how each table is related to the others. Once you have all of this ready, you can go here, and all you have to do is select which data set you are working with. Say you're working with e-commerce; then it will suggest "how many orders will each user have in the next 30 days". That's the question that you're asking. It will then analyze your question. If you can see here, it is making a query, not quite a SQL query but what they call a predictive query language, where you're saying that you are predicting the orders for each user. It then essentially works out what SQL query it needs to run and creates a table.
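To make the question "how many orders will each user have in the next 30 days" concrete, the sketch below builds that target by hand from an orders table: for a chosen cutoff date, count each user's orders in the following 30 days. The data and function name are hypothetical, and this is not Kumo's predictive query syntax; it only illustrates the kind of label table that such a query has to produce.

```python
from collections import Counter
from datetime import date, timedelta

def orders_in_next_window(orders, users, cutoff, days=30):
    """Count each user's orders with cutoff < order_date <= cutoff + days."""
    end = cutoff + timedelta(days=days)
    counts = Counter(user for user, day in orders if cutoff < day <= end)
    # Users with no orders in the window still get an explicit zero.
    return {user: counts.get(user, 0) for user in users}

# Hypothetical orders table: (user_id, order_date).
orders = [
    ("u1", date(2026, 1, 5)),
    ("u1", date(2026, 1, 20)),
    ("u2", date(2025, 12, 30)),  # before the cutoff: history, not part of the label
    ("u2", date(2026, 2, 15)),   # more than 30 days after the cutoff
    ("u3", date(2026, 1, 31)),   # exactly on the window's last day
]
labels = orders_in_next_window(orders, ["u1", "u2", "u3", "u4"],
                               cutoff=date(2026, 1, 1))
print(labels)
```

Everything dated on or before the cutoff is available as input history, and everything in the window after it forms the target, which keeps future information out of the features.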

40:23 Conversational Queries with Kumo RFM Slide: 4

conversational queries kumo rfm sql generation predict query user questions

And then once it has the table, it will make a prediction based on that. You can ask further questions and have a conversation, in some sense. I just wanted to show you these two examples of how you can take tabular data and use a tabular foundation model for inference. Yes, Vijay.

What was the first one? What was the first tool that you showed?

It's called Prior Labs. TabPFN is the model. Prior Labs is the startup that actually trained that model.

Thank you.

Yes, it's free, of course. I have not paid a single cent to them so far.

Do you have a link to the...

Yes, my slides will be there, and they will have the link, so you should be able to access it from there as well.

I was going to ask a question along those lines, too, about favorite tools.

41:25 Comparison of Tabular Foundation Model Tools Slide: 4

tabpfn tab-icl tab-dpt amazon mitra tool comparison

Obviously, this is one of them. Any other favorite tools, based on benefits that they might have over this?

So, the thing is, this area moves fast, like any AI models in this space. ChatGPT, like 5.5, and then Claude Opus are all fighting with each other, and it's the same here. When we started working on this, TabPFN was at v1, and then we had TabICL, and then TabDPT. So these are the three major ones. In TabICL, ICL is in-context learning, and DPT is, I think, some kind of predictive transformer. I don't know what the D is off the top of my head. These three models have been really good. And then Amazon's Mitra is the other one, which came out very recently. These three or four are the ones that are doing really well right now. If you ask me which one is best so far, I think Prior Labs' TabPFN is well tested in so many broad areas.

42:33 Data Privacy and On-Premise Model Use Slide: 4

data privacy on-premise local inference company data model security

Hitachi and so many other companies have already used it. I've worked with several industries myself to help them use tabular models for their problems. So TabPFN is the first go-to. If not, you can go to TabICL, and Mitra is the third one. That's the ranking if you ask me today. Tomorrow, I don't know. Yes.

Is it safe to use like company code right now, or are they training models off the data you give them?

So your question is, is it safe to use it on company code? You mean company data? Yes, absolutely. When I work with industries, I do not use the user interface for this. I literally download the trained model. It is so small that I can do inference on my Mac, and it runs on that. I think we have time for one more question. Yes.

You say you download the code and you use it.

43:33 Hardware Recommendations: Mac vs GPU Workstations Slide: 4

mac gpu workstations google colab hardware recommendations nvidia

And the thing is, I've been asking for Macs, because right now with Nvidia you're fighting gamers for the GPUs, too, and the price is crazy, so we say, get the Macs, right? Do you feel that that's a policy to go with?

Well, not really. It so happened that I'm using a Mac and it's working very well. I did my PhD using GPU computing, so yes, I understand why you would go with that. Perhaps the other alternative is that these kinds of models are easily accessible even from Google Colab or things like that. So that's another way you can quickly train the model. It's going to be very easy to do it in Google Colab as well.

I agree, but Google Colab is a temporary instance. You would probably train the model and then download it to your local machine.

Yes.

44:34 Closing Remarks and Thanks Slide: 4

closing remarks thank you session end

Any other questions? I think we're right at time. So everybody, please join me in thanking Dr. Balu for his presentation.

So good morning. It's my pleasure to introduce Dr. Aditya Balu, a data scientist at the Translational AI Center here at Iowa State University. With over 15 years of experience in artificial intelligence and machine learning, Dr. Balu has contributed extensively to the field, with research published in leading venues such as Nature Computational Science, ICML, and Engineering Applications of Artificial Intelligence.

His work focuses on applying advanced AI techniques to engineering and manufacturing challenges, bridging cutting-edge research with real-world impact. In addition, he plays an active role in developing and teaching AI and machine learning programs to help translate these innovations into practice.

In today's session, Dr. Balu will explore the emerging role of tabular foundation models in manufacturing, highlighting how these approaches can simplify workflows, improve predictions, and open new possibilities for industrial AI applications. So please join me in welcoming Dr. Aditya Balu. Thanks.

Thanks a lot, Jake. I hope I'm audible. So, good morning, everyone. Thanks, Jake, for the nice introduction.

So, today I'll be talking about tabular foundation models for manufacturing. Just to give you the context of how this comes in: I'm sure you have all seen some kind of tabular data, a bunch of inputs and probably an output. You want to predict what manufacturing rate you have, or when you're going to get your order from Amazon, whether it's going to come tomorrow or the day after. These are all tabular data in some form or another.

And have you ever wondered, can I just dump all this information into ChatGPT, ask it to give me the predictions, and be done with it? Has anyone tried something like that so far? Okay, at least one. Do you have any experience you want to share about what you saw when you did that?

Yeah, so I'm actually just a freshman. I would say some of the more basic cases would be when I'm working on a problem that has a lot of data or parts to it. I try to upload it all at once, and sometimes it gets lost or it doesn't do the right calculation. Sometimes I break it up and that helps, but other times, if it's just too much, it doesn't work.

Yeah. So as you mentioned, too much data is one thing, but fundamentally there's one problem with dumping all the tabular data into ChatGPT or any of your favorite LLMs that you're using today. The problem is that they are not meant for tabular data. They are meant for language.

They are meant to have a conversation with you. They're usually called large language models, or foundation models for language. But at the same time, in the last few years there's been a big shift towards building foundation models for tabular data. That's where the term tabular foundation models that you're seeing here comes from.

To compare a foundation model with what we do in traditional machine learning: in traditional machine learning, you have a new data set, and you essentially train a fresh model, tune the hyperparameters, validate, and then deploy. And you do this for almost every task that you are trying to work on. More often than not, you have to train the models from scratch. You have to do a lot of feature engineering.

And anyone who has worked with real-world data knows that the most important thing is cleaning up the data. There are a lot of cases where you have missing data. You have cases where the data is noisy, and you have to do a lot of data imputation and things like that before you can even get started training the model. And what we saw as an opportunity is foundation models, right?

And the next part is, just like with ChatGPT, you don't train a separate ChatGPT model for, say, a medical application, and you don't train your own model for a finance application. You can fine-tune them later on, but more often than not, just using ChatGPT alone gets you quite far. It is the same idea with tabular foundation models. Here there is one foundation model that is pre-trained on a lot of synthetic data covering different possibilities.

So think about it: you're pre-training on almost millions of tabular data sets. It is synthetic data, but from that tabular data the model has learned a lot about what the numbers can look like, how the trends shift, and what kinds of trends are even possible in a data set. It gets some kind of insight from it, in some sense. In the same way as with GPT, once you have this pre-trained model, be it for a medical application, a finance application, an agriculture application, or whatnot, we can use the foundation model to make predictions zero-shot, or perhaps few-shot.

So, to summarize the unique benefits of using a tabular foundation model: first, one foundation model helps us reduce the training time. You don't have to train your own models. It makes things robust. And second, it works especially well in the low-data regime.

Earlier, if you had to train your own model, the question was: how much data do I need? How many data points should I collect? And it usually goes into the thousands, probably more, even millions. But with these kinds of tabular foundation models, perhaps you just need 20 or 30 examples, and perhaps you'll get much better results than you would have imagined.

So that's one particular benefit that I see, particularly in manufacturing, where collecting data is very difficult. That is one particular reason why tabular foundation models are very impactful. The third is, as I said, data imputation. As I mentioned, you don't have to clean up the data.

You don't have to do any pre-processing per se before you feed in the data. It can cope even if the data is noisy or has missing values. The fourth is that it is domain agnostic, as I already mentioned. And the fifth is that tabular intelligence in itself is something that's being developed in recent times, and a lot of industries are investing in it.

And then the y-train is what the labels are supposed to be. And X-test is the samples, the inputs for which you want to make predictions. So it's just like saying, here are a few examples: these are X, these are y.

And you're asking what the prediction is for a new set of X that you have. You know the inputs and you want to know what the output is. Just as GPT is a transformer-based model, this is also a transformer-based model.
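The X-train, y-train, X-test interface can be illustrated with a toy in-context predictor: no weights are fitted, and each test row is answered by attending over the training rows supplied as context. This is a deliberately simplified stand-in (a softmax-weighted average of labels, with an assumed `temperature` parameter), not TabPFN's architecture, which is a pretrained transformer.

```python
import math

def attention_predict(X_train, y_train, X_test, temperature=0.1):
    """Toy in-context regression: softmax-weighted average of the training labels.

    Each test row "attends" to the context rows, with scores given by the
    negative squared distance divided by an illustrative temperature.
    """
    predictions = []
    for x in X_test:
        scores = [-sum((a - b) ** 2 for a, b in zip(x, row)) / temperature
                  for row in X_train]
        top = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        predictions.append(sum(w * y for w, y in zip(weights, y_train)) / total)
    return predictions

# A handful of labelled context rows, then queries with unknown outputs.
X_train = [[0.0], [1.0], [2.0], [3.0]]
y_train = [0.0, 2.0, 4.0, 6.0]
preds = attention_predict(X_train, y_train, [[1.0], [2.5]])
print(preds)
```

The point of the sketch is the calling pattern: the labelled examples are passed in at prediction time as context, rather than being baked into a model through a separate training run.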

TabPFN v2 is one of the tabular foundation models that exist right now. Apart from that, there are many others like TabICL and TabDPT. These are all tabular foundation models, but the most famous one is TabPFN; PFN stands for prior-data fitted network. It was created by a startup called Prior Labs, based in Germany.

I think they also have an office in New York now. These TabPFN models are also based on the transformer. They use the same self-attention, across the rows, to create a context. When we started working on TabPFN, the context window was around 10,000 rows and 500 features, that is, 500 columns of tabular data.

It's just like when we started out with ChatGPT 3.5 or 4, where your context window was 256,000 tokens or something like that, and now we are working with 1,000,000 tokens. In the same way, the context window is at this stage right now, but we are hoping that it expands even further. Yes, which one? Yes, so Hollmann et al. is a paper, a Nature paper, which was published on using TabPFN for scientific applications.

Yeah. Thank you. Yeah. I have a few examples I wanted to show.

The first example is manufacturing machining data. It's an extreme example, but I thought, let's start with the extreme example. This is actually from a paper that came out around 2010 or so. The inputs are the cutting speed, the feed rate for a turning operation, the depth of cut, and the nose radius of the tool.

The output is the surface roughness that you get on whatever workpiece you're working on. It's a simple manufacturing process: you have four input parameters and one output. Zero-shot, using just 22 training rows and the rest of the samples for testing, you get an R-squared value of around 0.938. That's quite good performance for zero-shot. By comparison, when I started training my own models on this kind of very small data set, the best model I got was around 0.91 or 0.92, given the same data split.
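For reference, the R-squared value quoted here is the coefficient of determination: one minus the ratio of the residual sum of squares to the total sum of squares around the mean of the true values. The sketch below computes it from scratch; the roughness numbers are made up for illustration and are not the values from the turning data set.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Hypothetical surface roughness values (true vs. predicted).
roughness_true = [1.0, 2.0, 3.0, 4.0]
roughness_pred = [1.1, 1.9, 3.2, 3.8]
score = r_squared(roughness_true, roughness_pred)
print(round(score, 3))
```

In this toy case the residual sum of squares is 0.10 and the total sum of squares is 5.0, giving 0.98. A score of 1.0 means perfect predictions, and 0.0 means no better than always predicting the mean.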

So this was very good performance when we started out. This is one example. As I say here, most regressors collapse below 50 samples. How do you even work with 50 samples with any machine learning model? That is a question people ask quite often, but this is one example where, with very few samples, we have been able to make good predictions.

And then this is another example of a problem. We are in Iowa, so we should talk about agriculture to some extent. This is agricultural yield prediction. We published this in a AAAI workshop earlier this year.

We worked with three data sets. The first is for soybean in the US. It has about 86,000 samples. Then we have a global one from multiple regions, with around 28,000 samples.

And then there is one specific data set for the European Union, with around 8,600 samples. As you can imagine, for yield prediction you need to know what the weather is like in that area. So the inputs are aggregated weather features, some crop information, and things like that. This data set doesn't have any missing values, but as you can see, this one has about 5 to 13% missing values.

As you can see here, TabPFN v2, almost zero-shot, performs much better than all the machine learning models that we have known all along, like CatBoost, XGBoost, and random forest. If you have worked in machine learning on tabular data for a while, you will have heard of these. And you can see that it performs much better than those, zero-shot. Then there's something called AutoGluon, which essentially fine-tunes on top of what you get from TabPFN to make it even better.

So that's essentially the whole story that we have here. Another example is the global case, where you can see that even with zero-shot we are able to get close to this performance. Random forest is doing better in this case, but not by much: 0.9716 versus 0.9794. It's not a major difference, but still something to note. The key part is the compute.

You can get this result in less than a second, rather than preparing and training a model and doing all the things that you have to do for training. In the same way, you can see in this one that when you have missing samples, random forest and the others don't do as well, but TabPFN v2, because of the auto-imputation and so on, does much better than all of them. One thing I probably need to mention is that the data imputation does impact the overall results: in terms of performance, it gives you 0.91 instead of the 0.93s and 0.97s that you have seen all along. But this certainly is an example of how performance gets impacted in general.

So this is about how TabPFN v2, or tabular foundation models in general, can perform much better than what you have seen so far. If you want to see when to use what: when you have a large and complete data set, you can always argue that you can fine-tune your model, in which case you can go with AutoGluon or AutoML-type architectures, where you create a model using TabPFN and then fine-tune it with AutoGluon-type architectures, and you do much better. If you have diverse and perhaps complete data, then you can go either with tabular foundation models or with AutoML-type architectures. But if you have small data sets with missing data, going with tabular foundation models helps a lot. So one bottom line is that, especially if you're in a low-data regime, tabular foundation models certainly win.
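The rule of thumb above can be written down as a small helper. This is only a sketch of the heuristic as stated in the talk: the function name and the `small_rows` threshold are assumptions made for illustration, and `context_rows` borrows the roughly 10,000-row context window mentioned earlier for early TabPFN versions.

```python
def suggest_starting_point(n_rows, missing_fraction,
                           small_rows=1_000, context_rows=10_000):
    """Encode the talk's rule of thumb; both thresholds are illustrative."""
    if n_rows <= small_rows or missing_fraction > 0.0:
        # Small data or missing values: start zero-shot with a foundation model.
        return "tabular foundation model (zero-shot)"
    if n_rows <= context_rows:
        # Complete, moderately sized data: either route is reasonable.
        return "tabular foundation model or AutoML"
    # Large and complete data: there is enough signal to justify fine-tuning.
    return "AutoML or fine-tuning on top of a foundation model"

print(suggest_starting_point(22, 0.0))
print(suggest_starting_point(8_600, 0.10))
print(suggest_starting_point(86_000, 0.0))
```

In practice the boundaries depend on the model version and the data, so the output is best read as a starting point for a quick first experiment rather than a final choice.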

The second thing that you'll notice is that it can work with large data as well. But because you have more data, you can always improve and do better. And the bigger picture that you need to understand is that tabular foundation models are not going to replace traditional machine learning models any day soon. They can be fine-tuned.

Tabular foundation models can be further fine-tuned using AutoML and so on. But in general, the idea is: use it to get an initial guess, right? It's very quick, you get responses very quickly, and you can use it to work on the bigger picture rather than just the machine learning model that you're working with. Think about it this way: when we talk about physical AI, simulations, and so on, we always say, I don't care about the accuracy of the simulation as long as I'm able to quickly iterate and move forward, right?

That's how the idea of the digital twin works. In the same way, maybe your goal is not a perfectly accurate machine learning model but a close-to-accurate one, so that you can quickly iterate and see what else you can do. Do I need to add more data? Do I need to bring in other data features, and things like that?

I don't want to keep on training a model when I don't even know whether that model is really what I want to train, or whether the data is the problem, or what the problem is, right? More often than not, what I've seen when working with different industries is that there is data you need to improve, and you also need to improve the model. But this at least helps me focus on the data, because I know the model is doing about as well as I want, in some sense. There's another example here.

Vehicle sensor data: it's what you get from a CAN bus, in this case for large combines, to essentially detect some kind of information, such as soil moisture or other things that you can get. Sorry, correct? Yes, so it is from that. Again, for the sake of anonymity, I'm not saying which combine, what data, and other things, but the idea is we had about 8 features of CAN bus signals, aggregated by the unit IDs of the experiments that we were doing.

So this is as real as it could get in terms of the data set that you can see. Here again, you can see all the models kind of give up when this did much better than the rest of them. Again, of course, you can say it's not that different, 878288. It's not that different, but the key part is we were able to do this in less than a day.

So we could at least understand what the data issues are and what the different factors are. And we could go ahead and do the other things that we wanted to do, because this model was not the only thing that was stopping us. We wanted to use this model to go build something else: to improve the sensor, understand which sensors we need to replace, and things like that. One thing that you'll see is the effect of the number of samples that you're using, right?

So you can see that if you're using 10% of the samples, you get a correlation of about 0.84, but if you go all the way to 50%, you get 0.88. You can keep increasing further and see what happens, but in most cases it doesn't improve much after that. So I think it's 0.882, and it's more or less stuck there.
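The sample-size experiment described here can be sketched as a small learning-curve loop. This is a minimal illustration on synthetic data, not the combine data set from the talk: `learning_curve` and `pearson` are hypothetical helper names, and a 1-nearest-neighbor predictor stands in for the foundation model so that the example runs with the standard library alone.

```python
import math
import random


def pearson(a, b):
    """Pearson correlation between two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def nearest_neighbor_fit_predict(X_train, y_train, X_test):
    """Stand-in model (1-nearest neighbor on one feature), not a TFM."""
    return [
        y_train[min(range(len(X_train)), key=lambda i: abs(X_train[i] - x))]
        for x in X_test
    ]


def learning_curve(fit_predict, X_train, y_train, X_test, y_test,
                   fractions, seed=0):
    """Score the model on the test set using growing training subsets."""
    rng = random.Random(seed)
    order = list(range(len(X_train)))
    rng.shuffle(order)
    scores = {}
    for frac in fractions:
        subset = order[:max(2, int(len(order) * frac))]
        preds = fit_predict([X_train[i] for i in subset],
                            [y_train[i] for i in subset], X_test)
        scores[frac] = pearson(y_test, preds)
    return scores


# Synthetic data, for illustration only.
rng = random.Random(42)
X = [rng.uniform(0, 10) for _ in range(250)]
y = [2 * x + rng.gauss(0, 1) for x in X]
scores = learning_curve(nearest_neighbor_fit_predict,
                        X[:200], y[:200], X[200:], y[200:], [0.1, 0.5, 1.0])
for frac, r in scores.items():
    print(f"{frac:.0%} of training rows -> correlation {r:.3f}")
```

Any callable with the same `fit_predict(X_train, y_train, X_test)` shape could be passed in place of the stand-in, which is how a plateau like the one described would show up for a real model.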

It doesn't go further from there. One thing I wanted to talk about is how the industry is moving. So far, what we have seen is giving it data, making a prediction, and things like that. The question that I think most of you may have is: okay, what do I do with it?

How does it matter to me? And that's where the other models that I was talking about come in. Kumo AI is one model: it's a relational foundation model built on top of a tabular foundation model. So think about this as something that works with multiple tables.

It understands the relations between them and tries to use that to essentially have a conversation with you. You can ask questions like, what are the insights on this? Then it will identify the relations among all of them. You don't need to flatten the data from multiple tables into one big table and then work with it. So this kind of relational foundation model is something that people are using now, especially at DoorDash, Snowflake, and others, to understand what the relations are, get insights from them, and go from there.

Among the other models, AWS has Mitra. On top of that, has anyone here heard of Amazon Quick? Amazon Quick, or AWS Quick, is another dashboard-type platform that has these kinds of features for having a conversation based on a data set. You can have conversations based on a tabular data set. You can connect S3 buckets, work with them directly, and have conversations with the data there.

But you can use it for other applications as well: medical applications, FinOps, and a lot of other applications have these things. Again, the other thing is missing values: you usually try to impute them and do something on your own, but here you are using automatic imputation and things like that. And that is something that helps us a lot. And perhaps it can help us understand that maybe missing values are not really a bug.
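What "doing something on your own" typically looks like can be sketched as column-mean imputation, the manual step that the automatic imputation described above is meant to spare you. This is a generic illustration in plain Python, not a description of how TabPFN handles missing values internally; `impute_column_means` is a hypothetical helper name.

```python
def impute_column_means(rows):
    """Replace each None with the mean of the observed values in its column."""
    n_cols = len(rows[0])
    means = []
    for j in range(n_cols):
        observed = [row[j] for row in rows if row[j] is not None]
        # Fall back to 0.0 if a column has no observed values at all.
        means.append(sum(observed) / len(observed) if observed else 0.0)
    return [
        [means[j] if row[j] is None else row[j] for j in range(n_cols)]
        for row in rows
    ]


readings = [
    [1.0, None],
    [3.0, 4.0],
    [None, 8.0],
]
print(impute_column_means(readings))
```

Mean imputation discards the fact that a value was missing in the first place, which is one reason a model that accepts the gaps directly can be attractive.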

So far we thought we needed to collect a lot of data to train our models and do things. But perhaps we don't need a lot of data. We just need a few samples: a few hundred, or thousands, or probably a million at most. But you don't need a lot of data to start training your own models or using your own models for tabular intelligence in particular.

So with this, since I have some time, I can quickly show you a demo. But before I go there, are there any questions that I can answer for you? Edge AI will be a great example where to use this stuff, right? Yes.

Edge AI is somewhere it will be useful, but there's one caveat to understand: these are all foundation models. Just as you can't put a big Llama model on an edge device, you'll have similar considerations here. But I think this is relatively easy. You can use it on your own laptop, so it's not that bad in terms of memory and compute.

Any other questions? Thanks. My background is in metal cutting, so I appreciate that you had the example on turning. Sometimes tabulated data has a different purpose for why it was constructed.

And I'm wondering about the role of tabular intelligence given the combination of the purpose for which the tabulated data was created and the purpose for which the user is trying to use it. Yeah, that's an excellent question, right? I think that's very close to what I was talking about with Kumo AI: perhaps you have a large database, right? You may have, as you mentioned, different material properties and different manufacturing conditions, and you may even have a database covering multiple manufacturing processes like turning, cutting, milling, and so on.

You can have a lot of conditions which are all part of the same database. But instead of writing a particular lookup table or a SQL query to say this is the data that I want, you can get some kind of insight from it directly. You could say it in natural language, just like you might now go to your bank account and say, I want to know the trends in my purchases over the last year, and it will filter out the data that is relevant and then provide you some insights from it. So think about it from that perspective.

So it can essentially work on that relational database, understand the relations among multiple tables, or even filter the data using a SQL query or something, and give you something more relevant to what you want. But again, the key thing is to know which data sets you have, so that you can build that kind of relational graph and actually do something like that. Does it make sense? Any other questions?

Thank you, sir. So I've got a question about the low data usage for training the model. Yeah. I understand that you don't need as much data to train the model, but if there is inherent bias in the existing data that you're using for training, how does it help with extrapolating to something that's not there in the data?

So, an example, right? We have temperature data. All my temperature data is around, say, 100 degrees Fahrenheit, but there are only a few points at, say, 300 Fahrenheit. I understand it works on low data, but I don't have enough data for 300 Fahrenheit.

Would it still be able to do predictions correctly with less data, or do we have to mash the data at the beginning itself so that there is a good spread? Right, so it's a great question, right? So, yes, there will be some bias from the data that you're starting with, right? If you're saying that you're only going to start with, say, all 100-degree samples and then probably one or two samples at 300 Fahrenheit, then expecting good results at 300 Fahrenheit may be expecting too much.

Obviously, the bias is built into the model per se, because what it is doing is seeing some kind of relation or trend within the data. It doesn't understand that it's a temperature. It doesn't even understand that, from one temperature regime to another, something is changing. It's just like ChatGPT: if you give it a bunch of things and ask for some output, it may not do it, because it doesn't understand the connection between the multiple files that you have provided and what you're asking for as an output.

In the same way, that extrapolation capability is certainly going to depend on the bias in the data that you're providing. If you provide very clean, fully balanced data, then it may do much better. I think the way to rephrase the question is: given the data, the best model performance that you can get in very little time is going to be what you get from tabular foundation models. You could perhaps invest more energy to move it a little bit, but data is the king ultimately.

If it's just for predicting an anomaly, it doesn't matter whether it thinks it is 300 or thinks it's 150. It's anomalous, so it's good. So perhaps you don't need more data.

But suppose you are making a more specific prediction of some trend, say how metal manufacturing processes behave from 100 degrees to 300 degrees, where the formability, the material properties, and everything else change quite drastically between these two regimes. In that case, perhaps you need to collect some more data in the rest of the regime. So it depends on the data, but at the same time, it gives you a good quick start to go from there, basically.
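One way to act on this answer before trusting predictions in a regime is to count how many training samples actually fall there. The sketch below is a generic coverage check, not something shown in the talk; the helper name `sparse_regimes`, the bin width, and the minimum count are all hypothetical choices.

```python
from collections import Counter


def sparse_regimes(values, bin_width, min_count):
    """Return {bin_start: count} for bins holding fewer than min_count samples."""
    counts = Counter(int(v // bin_width) * bin_width for v in values)
    return {start: n for start, n in sorted(counts.items()) if n < min_count}


# Mostly ~100 F readings plus two points near 300 F, as in the question.
temperatures_f = [100 + (i % 5) for i in range(50)] + [300, 302]
print(sparse_regimes(temperatures_f, bin_width=50, min_count=10))
```

A bin flagged here is a hint that predictions in that regime lean on very few examples, which matches the advice to collect more data there when the regime matters for the prediction task.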

Thank you. Any other questions? All right. Then if there are no further questions, I mean, you can, if you have questions, I can answer them later on as well.

But I still want to see if I can show you a quick demo. I am using two examples for the demo. One is from priorlabs.ai. That's the startup which built and runs the model, the TabPFN v2 model that I was talking about.

And you can see that you can actually upload your own data set and play around with it, on either this model or previous models and things like that. In this case, I just chose one of the sample data sets. As you can see, there are a lot of samples there already. It can be sales data or it can be industrial data.

As you can see, for this particular data set, for which I can provide the exact metrics, TabPFN gets better performance than even random forest, XGBoost, and the others. Linear regression has the maximum error. TabPFN v2, 2.5 plus, has the minimum error in MSE. And it provides a simple output.

And you can easily upload your own data set. It allows you to directly upload either a CSV file or an Excel file with a header row, of around 20 to 40,000 rows, including a column for what to predict. So this is a simple, user-interface-based way of doing it. Or if you are more like me and like to code, then you can always get the code and run it on your own local machine.

You don't want to use that server. You want to run it on your own local machine. You can just download the model and then run it in your own machine, local machine. And that is also equally easy.
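Running a model locally and scoring it by MSE, as in the comparison above, can be sketched with an estimator-agnostic helper. The assumption here, to be checked against the package documentation, is that the downloaded model is exposed through a scikit-learn-style `fit`/`predict` interface (for example a `TabPFNRegressor` class in the `tabpfn` package). The `MeanRegressor` below is only a stand-in so the sketch runs without any downloads, and `evaluate` is a hypothetical helper name.

```python
class MeanRegressor:
    """Stand-in estimator with scikit-learn-style fit/predict; not a TFM."""

    def fit(self, X, y):
        self.mean_ = sum(y) / len(y)
        return self

    def predict(self, X):
        return [self.mean_ for _ in X]


def mean_squared_error(y_true, y_pred):
    """Average squared difference between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def evaluate(model, X_train, y_train, X_test, y_test):
    """Fit any fit/predict-style model and return its test MSE."""
    model.fit(X_train, y_train)
    return mean_squared_error(y_test, model.predict(X_test))


# Toy data for illustration. Assumption: a locally installed foundation
# model with the same interface could be passed in place of MeanRegressor.
X_train, y_train = [[0.0], [1.0], [2.0]], [1.0, 3.0, 5.0]
X_test, y_test = [[3.0]], [7.0]
print(evaluate(MeanRegressor(), X_train, y_train, X_test, y_test))
```

Keeping the helper independent of any one model makes it straightforward to put several candidates through the same train/test split and compare their errors side by side.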

You can just access it from here and then run it. This is how to run just the tabular foundation model alone. And there is another example, which is the Kumo RFM that I was talking about. You can see that in Kumo RFM, they already have a few data sets here.

One of them is e-commerce, where they have data on returns, views, items, orders, users, and so on. There are other data sets like insurance and F1 racing, or you can even upload your own data set or link it to your Amazon S3 buckets or Snowflake as well. And once you have the data, you can either infer the schema or actually write down the schema yourself. That's an option that you have.

So once you provide the schema and all that information, it will create a graph, something like this, to capture how each table is related to the others. And once you have all of this ready, you can go here. All you have to do is select which table you are working with. Say you pick e-commerce: then it will suggest, how many orders will each user have in the next 30 days?

And that's the question that you're asking. Then it will analyze your question. If you can see here, it is making a query, like a SQL query, in terms of what they call a predictive query language, where you're saying that we are predicting the orders for each user. Then it essentially finds out what SQL query it needs to run to create a table. And once it has the table, it will make a prediction based on that.
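For a sense of what such a query saves you from writing by hand, the target table for "how many orders will each user have in the next 30 days" can be built manually from a users table and an orders table. The sketch below is plain Python on made-up rows; it does not reproduce Kumo's query language or implementation, and `orders_in_window` is a hypothetical helper name.

```python
from datetime import date, timedelta


def orders_in_window(users, orders, cutoff, days=30):
    """Count each user's orders placed after `cutoff`, up to `days` later."""
    end = cutoff + timedelta(days=days)
    counts = {user_id: 0 for user_id in users}
    for user_id, order_date in orders:
        if user_id in counts and cutoff < order_date <= end:
            counts[user_id] += 1
    return counts


# Made-up rows standing in for two related tables.
users = ["u1", "u2", "u3"]
orders = [
    ("u1", date(2025, 12, 30)),  # before the cutoff, not counted
    ("u1", date(2026, 1, 5)),
    ("u1", date(2026, 1, 20)),
    ("u2", date(2026, 2, 15)),   # after the 30-day window, not counted
]
print(orders_in_window(users, orders, cutoff=date(2026, 1, 1)))
```

With more tables (returns, views, items), every extra relation adds another hand-written join and aggregation like this one, which is the flattening work a relational foundation model is described as avoiding.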

And you can ask further questions and have a conversation, in some sense. I just wanted to show you these two examples of how you can take tabular data and use a tabular foundation model for inference. Yes, Vijay. What was the first one?

What was the first tool that you showed? It's called Prior Labs. TabPFN is the model. Prior Labs is the startup which actually trained that model.

Thank you. Yes, it's free, of course. I have not paid a single cent to them so far. Do you have a link to the... Yes, and my slides will be there, and they will have the link, so you should be able to access it from there as well.

So, yeah. I was going to ask a question in those regards, too, about favorite tools. Obviously, this is one of them. Any other favorite tools based on benefits that they might have over this?

So, the thing is, this area is just like any other AI model space: ChatGPT, like 5.5, and then Claude Opus are all fighting with each other, and it's the same way here. When we started working on it, TabPFN was at v1, and then we had TabICL, and then TabDPT. So these are the three major ones: TabPFN, TabICL, where ICL is in-context learning, and TabDPT, where DPT is, I think, some predictive transformer.

I don't know what the D is off the top of my head. These three models have been really good. And then Amazon's Mitra is the other one, which came out very recently. So these three or four are the ones which are doing really well right now. If you ask me which one is best so far, I think it's Prior Labs' TabPFN, which is well tested in so many broad areas.

And that's the rating if you want, if you ask me today. Tomorrow, I don't know. Yes. Is it safe to use like company code right now or are they training models off the data you give them?

So your question is, is it safe to use it on company code? Is that company data? Yes, absolutely. Especially when I work with industries, I do not use the user interface of this.

I literally download the trained model. It is so small that I can do inference on my Mac, and it runs on that. Yeah. I think we have time for one more question.

Yes. You say you download the code and you use it. And the thing is, I've been asking for Macs because right now, with Nvidia, you're fighting gamers for it too, and the price is crazy, and so we say, get the Macs, right? Yeah.

Do you feel that that's the policy to go with? Well, not really. I mean, it so happened that I'm using a Mac and it's working very well. I did my PhD using GPU computing, so yes, I understand why you would go with that.

Perhaps the other alternative is that these kinds of models are easily accessible even from Google Colab or things like that. So that's another way you can quickly train the model. It's going to be very easy to do it in Google Colab as well. I agree, but Google Colab is a temporary instance.

You will probably train the model then download it to your local. Yes. Any other questions? I think we're right at our time there.

So everybody, please join me in thanking Dr. Balu for his presentation.