
Trust, the last frontier of AI

I haven’t written a single line of code since last December, yet I have never produced so much code. It’s what everyone is calling the Claude Code/Codex effect. We are entering the era of autopilot programming.

The typical workflow now is to write specs that establish an implementation plan, optionally with some tests, and then feed them to your favorite coding agent. Beyond automated tests, you essentially click through the functionality to check it. That’s it.

Of course, this doesn’t mean the code is production-ready out of the gate, but at least you can verify correctness at a functional level. We are at the stage where agents are perfect for throwaway or personal code, and it will soon be true for almost any type of code, as continued model improvements combine with the rise of formal methods [1].

If it’s not the end of software engineering as we know it, it is definitely a major shift in practice. We went from autocompletion to fully automated code generation in a few years: no Copilot, no assistant, pure auto mode. Everybody’s guru Andrej Karpathy says so himself.

Andrej Karpathy on coding agents

What’s next after coding agents? Which tasks will be fully automated? Sequoia just published a very interesting piece on the shift from copilot to autopilot tools and its impact on different industries. It distinguishes writing code, a form of intelligence that can be fully assumed by AI, from judgment, the result of experience and expertise, which remains human.

Maybe data scientists are safer than software engineers…

But when the output is not code but data-driven insight or analysis, the story changes.

Even with a simple query on a single dataset, say “what is the average student grade?”, your favorite AI model will return a number that is probably correct. But how do you know it is actually correct? Did it pick the right dataset? Did it run the correct query?

If the AI generates code in a language you don’t know, whether Scala or even Chinese, you are in for a tough time inspecting what it generated. You can constrain the AI to use a tool you understand, maybe SQL or pandas, so you can check the generated query.
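Constraining the agent to a tool you know makes the check concrete. A minimal sketch of what that review might look like in pandas (the dataset and column names here are invented for illustration): the generated query is one readable line, and you can add an independent sanity check alongside it.

```python
import pandas as pd

# Hypothetical dataset; names and values are assumptions for illustration.
grades = pd.DataFrame({
    "student": ["Alice", "Bob", "Chloe", "David"],
    "grade": [14.0, 11.5, 16.0, 12.5],
})

# The kind of query an agent might generate: one line, easy to inspect.
avg = grades["grade"].mean()

# An independent check written by the reviewer, not the agent.
assert avg == sum(grades["grade"]) / len(grades)
print(avg)  # 13.5
```

The point is not the arithmetic but the review surface: a single pandas expression can be verified in seconds, which is exactly what breaks down at pipeline scale.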

But what if you ask for something more complex, say, over a large collection of datasets, “who are the customers most likely to churn next month?” Starting from scratch, this often requires a complex data pipeline: table selection, data curation in whatever language the data lake at hand speaks, then training and validating a machine learning model… How do you investigate such an AI-generated pipeline? Even if it is generated in a language you know, whether SQL or Python, reviewing it is slow, painstaking work, and still error-prone. Unlike traditional software, which is reviewed once and reused many times, data analysis has a terrible review-to-usage ratio if the pipeline must be inspected for every query.
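To see why the review surface explodes, here is a deliberately tiny sketch of such a pipeline in pandas. All table names, schemas, and the stand-in scoring rule are invented; a real churn pipeline would involve far more tables and an actual trained model, and every step below is a separate thing the reviewer must check.

```python
import pandas as pd

# Hypothetical source tables; names and schemas are assumptions.
customers = pd.DataFrame({"customer_id": [1, 2, 3],
                          "signup_month": ["2024-01", "2024-06", "2024-09"]})
activity = pd.DataFrame({"customer_id": [1, 1, 2, 3, 3, 3],
                         "logins_last_30d": [0, 1, 12, 3, 4, 2]})

# Step 1: curation — aggregate raw activity per customer.
features = (activity.groupby("customer_id", as_index=False)
            .agg(logins=("logins_last_30d", "sum")))

# Step 2: table selection and join with the customer dimension.
dataset = customers.merge(features, on="customer_id", how="left")

# Step 3: a stand-in "model" (a heuristic threshold, not a trained one).
# In a real pipeline this is a training + validation stage of its own.
dataset["churn_risk"] = (dataset["logins"] < 5).astype(int)

print(dataset[["customer_id", "churn_risk"]])
```

Even at toy scale, a reviewer must verify the aggregation, the join keys and join type, and the modeling step independently; each is a place where an agent can be silently wrong.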

For code generation, formal methods promise great things: once you have a formal specification, AI can generate code that is provably correct, and you pay the cost of writing the specification only once. The catch is autoformalization: creating a specification that truly matches your intent is still hard. Unlike software code, where the spec can be reused, data analysis pipelines often require new specifications for each query, so formal guarantees are less practical.
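A lightweight analogue of this idea, well short of a real proof, is an executable specification: a property written once and checked against any candidate implementation. The sketch below uses sorting as a stand-in (the spec and candidate are invented for illustration, not a formal-methods toolchain).

```python
# A toy executable "specification" for sorting, written once:
# the output must be a permutation of the input and be in order.
def satisfies_sort_spec(xs, ys):
    is_permutation = sorted(xs) == sorted(ys)
    is_ordered = all(a <= b for a, b in zip(ys, ys[1:]))
    return is_permutation and is_ordered

# An AI-generated candidate implementation we want to trust.
def candidate_sort(xs):
    return sorted(xs)

# The spec is reused across inputs (and across candidate implementations).
inputs = [[3, 1, 2], [], [5, 5, 1]]
assert all(satisfies_sort_spec(xs, candidate_sort(xs)) for xs in inputs)
```

The one-time cost lives in writing `satisfies_sort_spec` so that it truly captures intent; that is precisely the autoformalization problem, and for ad-hoc data queries that cost recurs with every new question.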

From model interpretability to AI system interpretability

One way to tackle this is to move up the abstraction hierarchy and let the user choose an intermediate representation they are comfortable with, so they can inspect the agent’s output easily. This sits between the raw generated code and the natural language interaction between the user and the agent. Such an intermediate representation has to be properly structured and trusted. This is where data transformation visualization tools like Dataiku, dbt, or Dagster could shine. They expose the structure of the pipeline, data sources, transformations, and dependencies, providing a structured view that makes AI-generated workflows easier to inspect and validate at the system level. In doing so, they introduce a layer of transparency and control that may prove surprisingly resistant to the full automation of the software industry.
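What such an intermediate representation buys you can be sketched in a few lines: the pipeline is exposed as a plain dependency graph that a human can read and a machine can validate, in the spirit of (but not the actual API of) tools like dbt or Dagster. All node names below are invented.

```python
# A minimal sketch of a pipeline as a structured, inspectable graph.
# Node names and the schema of this dict are assumptions for illustration.
pipeline = {
    "raw_events":     {"deps": [],                 "kind": "source"},
    "clean_events":   {"deps": ["raw_events"],     "kind": "transform"},
    "churn_features": {"deps": ["clean_events"],   "kind": "transform"},
    "churn_model":    {"deps": ["churn_features"], "kind": "model"},
}

# The reviewer reads the structure rather than every node's generated code.
for node, meta in pipeline.items():
    print(f"{node} <- {meta['deps']} [{meta['kind']}]")

# And simple structural checks can run automatically:
# every declared dependency must exist in the graph.
assert all(d in pipeline
           for meta in pipeline.values()
           for d in meta["deps"])
```

The design point is that structural checks (missing sources, dangling dependencies, unexpected new nodes) are cheap at this level, whereas they are buried when the only artifact is generated code.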

The real bottleneck may no longer be producing software. It may be making AI-generated systems that we can trust, if not fully understand.

Mechanistic interpretability tries to peek inside a single model and see which circuits light up for certain patterns. But autonomous agents are more like a whole orchestra than a single instrument: multiple models, memory, tools, and reasoning loops interacting in ways that aren’t obvious from looking at any one piece [2].

If AI is going to build our data pipelines, write our analysis scripts, or automate workflows at scale, the real challenge won’t be producing the correct output. It will be seeing clearly what it did, why it did it, and whether we can trust it.