Data Warehouses and Cloud Drives: The spit and duct tape of AI

Here we go again. New tech comes around, everyone feels the pressure, and the duct tape comes out. This time it’s AI and context.

It’s hard to argue with the revolutionary potential of deep learning and LLMs in the enterprise. The horizontal automation opportunities that emerge when you can contextualize different datasets alongside one another make a strong case for those building with this new technology. Still, as the pressure mounts to turn generative AI work into positive unit economics and ROI, we find ourselves falling back on the same old duct tape: data warehouses and cloud storage.

We all recognize the importance of context now. Transformer technology lets us parse massive amounts of information to build the context needed for analysis, inference, and training (if you choose to do so). The limitations come from a few key areas:

The legacy tools we have at our disposal
Context cost (or tokens, as we know them), and the way these deep learning platforms need to be grounded

To start, we need to be clear about what we are trying to achieve with context for generative AI. To get these deep learning models to focus, we need to provide the optimal amount of context to generate predictions with the accuracy we need. Running inference without the appropriate amount of context often produces generic outputs based on the model's internal training data. Enter our legacy tools.

Most organizations have invested a tremendous amount of time and money into building a data ecosystem that lets different parts of the organization operate. That means separate systems for finance, HR, marketing, sales, manufacturing, logistics, and more. The clever solve over the past decade or so has been cloud migration and data sharing environments. Enter the hyperscalers and virtual data operators. To date, the methods used to ground inference have been a rerun of the same idea: move everything into a centralized location to try to guide generative systems with the appropriate context. There are several problems with this approach.

The first problem is the overhead required to manage a variety of ETL pipelines that convert raw data sources into formats deep learning models can understand. This usually means throwing everything into a common folder and letting the feature models generate their own embeddings over whatever data sits there. This is fraught with sub-problems, including context sparseness, quantitative information stripped of its appropriate context, and all the usual data cleansing and normalization challenges. In essence, we are transferring a bunch of files into a common folder and asking the LLM to parse through them. Even a cursory understanding of how this technology is built, and of the data pipeline most people use, reveals a wide range of challenges related to accuracy, cost, latency, and confusion around inference.

The second problem is cost. The more context you feed these models, the more likely they are to get that context wrong. Generative systems do not inherently understand the intelligence in the data. They simply predict the next tokens based on their internal training data and the context you provide. This approach also increases token costs, because you are asking the model to process all the context every time rather than using a sophisticated data retrieval pipeline that pulls only the context needed for the prediction.
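To make the token-cost point concrete, here is a minimal sketch that compares the prompt size when every document is sent versus when only the best-matching document is retrieved first. Everything in it is an assumption for illustration: the four sample documents are made up, token counts are approximated by whitespace-separated words (a real tokenizer would give different numbers), and the keyword-overlap scoring is a naive stand-in for a real retrieval pipeline.

```python
# Rough comparison of prompt size: whole corpus vs. a retrieved subset.
# All names and data here are hypothetical, for illustration only.

def approx_tokens(text: str) -> int:
    """Approximate a token count by counting whitespace-separated words."""
    return len(text.split())

def score(query: str, doc: str) -> int:
    """Naive relevance: number of distinct query words that appear in the doc."""
    doc_words = set(doc.lower().split())
    return sum(1 for word in set(query.lower().split()) if word in doc_words)

def retrieve(query: str, docs: list[str], k: int) -> list[str]:
    """Return the k documents with the highest keyword overlap."""
    return sorted(docs, key=lambda d: score(query, d), reverse=True)[:k]

def prompt_tokens(query: str, context_docs: list[str]) -> int:
    """Approximate size of a prompt made of the query plus its context."""
    return approx_tokens(query) + sum(approx_tokens(d) for d in context_docs)

# Made-up documents standing in for files dumped into a common folder.
DOCS = [
    "q3 revenue for the emea region was flat compared with q2",
    "the hr policy on parental leave was updated in march",
    "warehouse inventory counts for the ohio plant are reconciled weekly",
    "marketing spend on paid search rose during the holiday campaign",
]

query = "what was q3 revenue in emea"

full = prompt_tokens(query, DOCS)                          # send everything
narrow = prompt_tokens(query, retrieve(query, DOCS, k=1))  # send the best match

print(f"all documents: ~{full} tokens; retrieved subset: ~{narrow} tokens")
```

The gap widens as the folder grows: the "send everything" prompt scales with the size of the corpus, while the retrieved prompt scales with k.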

The final problem is the real-time nature of this data. Because we update the information in these data warehouses and cloud drives through a more static ETL process, we lose the ability to integrate recency into the predictions. The answer to any query is only as fresh as the last update to the data warehouse or cloud drive. This gets considerably worse when you start to think about agentic functions.

There are far more elegant ways to manage data in the age of AI. Replicating an old process with new technology increases costs, hurts performance, adds latency, and exposes organizations to significant security risks related to the role-based access controls for that data. We use a federated approach that maps directly to our backend systems because it lets us retrieve only the data we need for a specific inference or calculation, move it into the right environment, and avoid relying on an LLM to parse it when it is not suited to do so. To all organizations embracing AI: please do not repeat the mistakes of the past by using spit and duct tape to get a cheap, proof-of-concept AI solution in front of your leadership. If you do it right, all the potential of AI will be yours moving forward as the technology advances and optimization techniques mature.
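For readers who want a picture of what a federated approach could look like, here is a minimal sketch under stated assumptions. It is not our implementation: the `Source` class, the `federated_fetch` function, the role names, and the in-memory dictionaries standing in for live backend systems are all hypothetical. The idea it illustrates is that each request goes to the source systems at query time, access is checked per source, and simple calculations are done in code rather than handed to an LLM.

```python
from dataclasses import dataclass
from typing import Callable

Record = dict[str, object]

@dataclass(frozen=True)
class Source:
    """A hypothetical connector to one backend system."""
    name: str
    allowed_roles: frozenset[str]
    fetch: Callable[[str], list[Record]]  # queries the system by key

# In-memory stand-ins for live backend systems (made-up data).
_FINANCE = {"acct-42": [{"invoice": "INV-1", "amount": 1200}]}
_HR = {"acct-42": [{"owner": "J. Doe", "salary_band": "C"}]}

SOURCES = {
    "finance": Source("finance", frozenset({"analyst", "controller"}),
                      lambda key: _FINANCE.get(key, [])),
    "hr": Source("hr", frozenset({"hr_partner"}),
                 lambda key: _HR.get(key, [])),
}

def federated_fetch(role: str, key: str, wanted: list[str]) -> dict[str, list[Record]]:
    """Fetch records for `key` from each wanted source the role may read."""
    results: dict[str, list[Record]] = {}
    for name in wanted:
        source = SOURCES[name]
        if role not in source.allowed_roles:
            continue  # access is enforced per source, at request time
        results[name] = source.fetch(key)
    return results

# An analyst asks for finance and HR data; only finance is returned.
context = federated_fetch("analyst", "acct-42", ["finance", "hr"])

# The arithmetic is done in code, not by asking a model to read the records.
total = sum(record["amount"] for record in context["finance"])
print(f"sources returned: {sorted(context)}; invoice total: {total}")
```

Because nothing is copied into a shared folder, the records are as current as the source system, and a role that cannot read a source never sees its data in the prompt.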