Ceilings and Floors

In the last six months I've shipped two things that weren't possible eighteen months ago. One is a flight simulator of a Piper PA-28 Cherokee, built with a friend in a hotel lobby bar after a bottle of wine. The other is a service that processes legal paperwork that used to cost litigants hundreds of thousands of pounds, and now does it for under $400 in API spend.

There are two stories getting told about AI and knowledge work, and most of the time they get told as if they were the same story. They aren't. The first is about the floor coming up: average people doing average jobs now produce work that two years ago needed a trained professional. Drafting, summarising, basic analysis, first-pass code. The work juniors used to do, basically. The second story is about the ceiling: a small number of skilled people are getting something close to superpowers, and one person ships what used to take a team. Two different shifts, hitting two different populations. Conflating them gets you the wrong answer about almost everything else.

The flight sim is the ceiling story. The legal service, which is called Casefile, is mostly the floor story.

Where the floor rises

Take court bundles. If you're a self-represented litigant in a UK court (divorcing, fighting an eviction, suing your employer), you have to assemble a bound, paginated, indexed bundle of every document relevant to your case. Sometimes the underlying papers run to tens of thousands of pages. Your solicitor, if you have one, can't do it efficiently because they don't know your case the way you do, and you can't do it efficiently either, because you don't know what counts as legally relevant. The whole exercise is a real minefield.

Paying someone else to do it is out of reach for most people. Having advocates read every page of the underlying documents and flag what matters can run into hundreds of thousands of pounds. A lot of cases never get brought, because the paperwork costs more than the case is worth chasing.

I've spent the last six months building a service called Casefile that takes that work off the litigant. Recent matters that would once have cost hundreds of thousands of pounds in advocate time we now process for under $400 in LLM API costs. The lawyers who've seen the resulting bundles, with twenty-plus years at the bar between them, have said they've never seen a client produce a bundle of this quality, and that their working life would be a lot easier if it became common.

Casefile is one example. Translation is another, and a slightly different one. For most translation use cases, the floor is the ceiling. Most people who reach for a translation aren't chasing literary fluency; they want to read a foreign-language menu, or navigate a website in another language. When the LLM clears that bar, it's done most of the job. Customer service is well into the same territory. So is entry-level legal drafting, and the kind of routine code junior developers write in their first six months. The floor of competent output has come up across most of what knowledge workers actually do.

After the floor rises

The bit everyone talks about is displacement. Work that used to take a person now doesn't, or not to the same extent. Document review and discovery in law are about to look very different. Translation is already past that point. Customer support is well into it. A lot of paperwork-driven employment was, when you look at it carefully, charging a premium for a task whose underlying difficulty was nothing more than paying attention, and when attention got cheap, the premium left.

The bit nobody talks about, and the one that probably matters more, is that work which wasn't getting done at all starts getting done. There was no shortage of demand for it. People couldn't afford to pay for it. The clearest case I know is the legal one again. Smaller litigants couldn't afford competent paperwork, so a lot of them couldn't bring cases at all, which meant people with resources got away with screwing over people without them, because chasing them was too expensive for the other side. Cost was a shield. When AI gets the cost of competent paperwork down to $400, the shield comes off.

This second half almost never makes it into the AI conversation, because most people writing about AI live in industries where the high price reflected genuinely hard work, not rationed access. In categories where access is what was rationed (legal aid, basic medical information, certain kinds of advice), floor-raising doesn't redistribute existing work. It causes work to happen that wasn't happening before, for people who couldn't get to it. Yes, some people lose out when the floor rises, and that part gets a lot of airtime. The other thing that happens, and that almost never gets airtime, is that a lot of people who were locked out are suddenly let in. Both matter.

Where the ceiling rises

A few months ago a friend was helping me move. We stopped at a hotel halfway through, ended up in the lobby bar with a bottle of wine, and I started showing him what I'd been seeing in Claude. He runs a mid-sized IT firm in the UK and is a few weeks from getting his private pilot's licence; he's been training in a Piper PA-28 Cherokee. Years ago I was halfway through my own PPL and stopped, because a hobby with the failure modes of light aviation was hard to defend with a newborn at home.

After a bit of brainstorming we decided to build him a flight simulator of the Cherokee, right there at the bar, that evening.

The first half hour got the bones of a working sim onto the screen: a 3D landscape with the Cherokee rendered in it, plus basic control bindings. My friend started flying it and giving real-pilot feedback. Too heavy this way. Too responsive there. The roll rate doesn't feel right when I pull through the bank. He had been training in the actual aircraft we were trying to emulate, and his feedback was the body memory of a pilot who knew exactly what was off.

The LLM took the feedback in and did something I wasn't expecting. It downloaded flight-dynamics textbooks and papers from NASA, the FAA, and academic sources, describing what it was extracting from each as it went. Then it rebuilt the physics engine of the simulator from the equations in those documents. Not pattern-matched. Derived. Fifteen minutes after the feedback, the new physics were running.

I handed the controls back. My friend flew for thirty seconds and said: this is now almost exactly what it feels like to fly that plane.

A small studio used to spend a year on something like that. The LLM did the work of a research engineer: it read the relevant literature and built a working model from it. Syntax stopped being the bottleneck. Synthesis got compressed.

What the ceiling can't do

There was one thing the same LLM, that same evening, could not do. We kept trying to make it write an autopilot, and it kept crashing the plane straight into the ground. The autopilot routines it produced were confident and authoritative and flew us directly into hillsides every time. Every other piece of the sim worked. We could expand the world to an infinite landscape and drop airstrips wherever we wanted. We could tune the feel of the controls. We could not, no matter what we tried, get the LLM to fly the plane it had built.

The asymmetry is the lesson: the same model that read a flight-dynamics textbook and rebuilt a physics engine from it could not navigate the three-dimensional space that physics engine described. That's roughly the shape of what current LLMs are good and bad at. They're unusually good at building systems, given enough context. What they're bad at is acting inside those systems over time: planning moves through space and reacting to feedback as it arrives.

Designer, not pilot.

False ceilings

The autopilot was the easy kind of failure. It crashed. We saw the crash. We stopped trying to make it fly.

The harder failures look like successes.

Early on with Casefile, we ran an ingestion that processed our side's documents and the opposition's together. The system started extracting "facts of the case" that did not square with anything we knew about the matter, and at first they looked like hallucinations. They weren't.

The opposition's filings cited prior cases, as legal filings do. The system had pulled the statements of fact from those cited cases and folded them into the facts of our case. Every extracted statement was a real fact from a real legal document. The bundle would have read clean, and it would have been wrong on the inside. In the matter at hand, the cited-case facts were significantly worse than the real facts, and accepting them as ours would have made our position look far darker than it actually was.

The catch wasn't a technical inspection, it was a smell test. A person who knew the case read the extracted facts and noticed they didn't sound like the case.

This kind of error is harder to spot than fabrication, because nothing about any individual output is wrong. The atom is fine. The structure is wrong. Most discussion of LLM failure focuses on hallucination in the narrow sense, meaning made-up facts. The more common failure mode is misplacement: real facts in the wrong context, real reasoning applied to the wrong frame.

What saved us was domain knowledge: knowing what the facts of the case should sound like, specifically. What fixed it was scaffolding. We restructured the ingestion to respect the hierarchy of the documents, so that "facts cited as precedent inside a filing" couldn't be silently merged with "facts of the case being filed." Neither the smell test nor the scaffolding would have happened without someone who knew the domain. What you get out of an LLM is downstream of your ability to call bullshit on it.
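To make the scaffolding idea concrete, here is a minimal sketch of one way to keep the two kinds of fact apart. It is an illustration, not Casefile's actual code: the names Scope, Fact and case_facts, the file name, and the sample statements are all invented for the example. The assumption is that the extraction step can label each statement with where it came from, so that the merge step filters on that label instead of trusting the text.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class Scope(Enum):
    CASE = "case"    # asserted about the matter being filed
    CITED = "cited"  # taken from an authority the filing cites


@dataclass(frozen=True)
class Fact:
    """Hypothetical record for one extracted statement."""
    text: str
    source_doc: str                   # filing the statement was found in
    scope: Scope
    cited_case: Optional[str] = None  # set only when scope is CITED


def case_facts(facts: List[Fact]) -> List[Fact]:
    """Return only the statements made about the matter itself."""
    for f in facts:
        # A cited fact must name its case, and a case fact must not.
        # Refuse inconsistent provenance instead of guessing.
        if (f.scope is Scope.CITED) != (f.cited_case is not None):
            raise ValueError(f"inconsistent provenance: {f.text!r}")
    return [f for f in facts if f.scope is Scope.CASE]


# Invented sample data: both statements sit in the same filing.
extracted = [
    Fact("The tenancy began in March.", "opposition_filing.pdf", Scope.CASE),
    Fact("The tenant caused extensive damage.", "opposition_filing.pdf",
         Scope.CITED, cited_case="Hypothetical v Example"),
]

for fact in case_facts(extracted):
    print(fact.text)
```

The point of the design is that the second statement is perfectly real, and it still never reaches the facts of the case, because its scope says it belongs to someone else's. Labelling the scope correctly in the first place is the hard part, and that is where the domain knowledge comes in.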

What doesn't move

Floors move. Ceilings move. Taste doesn't.

Taste is the discipline of looking at an output and knowing what's wrong with it before you can say why. It's what told the lawyers the bundles were good, and it's what told us the extracted facts weren't. AI can't do it for you, and that's a real problem, because AI strongly encourages you to stop practising.

The dynamic isn't new. Organisations have been atrophying their senior people's judgment forever. The Peter Principle has been corporate folk wisdom for half a century: people get promoted to just past where they're competent, and from then on the work is mostly politicking and rubber-stamping. The only reason the output of most large organisations stays acceptable is that there are still, somewhere down the chain, relatively underpaid junior people giving a shit. They are the quality floor.

AI doesn't introduce this dynamic; it amplifies it. If you stop reading critically because the LLM has read it for you, if you stop drafting because it's drafted for you, your taste rots the way a senior manager's taste rots when they stop touching the work. It happens quietly. You don't notice when it's gone.

What that does to institutions

Scale this up and it stops being an individual problem.

What gets eaten first by an LLM matches the contours of junior knowledge work: research, drafting, summarising, basic analysis, first-pass code. The people who do that work are mostly juniors, and juniors, as I said above, are the ones holding up the quality floor. Not because juniors are smarter than seniors. Because their job is doing the work, not approving it.

Combine those facts and you get the obvious failure state at scale. Companies look at the cost line. They see that LLMs can produce work resembling what their junior employees produce. They decide they can move all the decision-making up the chain. The seniors stay. The juniors go. What gets lost, and what they don't at first notice losing, is the quality floor underneath the building.

There's no software fix for an organisation that's stopped having someone in the room who's paying attention. The output keeps coming. It even looks plausible. But the people closest to the customer are gone, along with the people closest to the product and the people closest to where things actually break. You can see the shape of this already in companies announcing mass layoffs for what is, when you do the maths, a small percentage bump in operating margin, justified by the hope that an LLM can do the work the people actually talking to their customers were doing.

This is already happening. The companies doing it are about to find out.

Two practical conclusions

First: don't use an LLM as a replacement brain, use it as a replacement team. The model is a research engineer when you need literature read, a fast draft writer when you want bones, a sparring partner the rest of the time. It's also, mind you, a junior. Broad reach. Unreliable depth. Direct it the way you'd direct a small team you respected but didn't entirely trust. Don't ask it for the answer. The judgment stays with you.

Second: if you can use an LLM that way, build yourself a new job. Don't spend the time it gives you doing the same work faster. Spend it taking on work you couldn't take on before. Move up. Otherwise you're using a senior person to babysit a junior whose mistakes you can't spot, and you haven't gained the thing the new tools were promising.

The current AI conversation is mostly about which is the bigger story, the floor or the ceiling. They aren't the same story.

The work AI is best at is the work that lets a person who already gives a shit do better than they could have done alone. Everything else is a small percentage bump justified by a hope, and the recent history of new tools entering the workplace doesn't suggest that hopes like that tend to pay out.

If all you're doing is babysitting an LLM, it might work for you commercially. I don't know that it works at a human level.

The floor rises through fluency. The ceiling rises through iteration.

The floor takes you from incompetent to not embarrassing. The ceiling takes you from competent to actually excellent.