How to Make AI Agents Follow Your Design System

Alice Moore · June 15, 2026

11 min read

The other day, I had an agent open a PR against one of my apps, for a new notification-preferences panel. The screenshot the agent handed me looked right. Spacing was fine, labels were there, the responsive layout broke where it should. I almost approved it.

Then I pulled the branch.

The panel imported a modal from components/legacy/ (a folder I keep around purely so a few old pages don't break). It used a form hook I deprecated months ago.

Error states were styled with a hardcoded #FF3B30 instead of my semantic tokens, which means the panel would have shipped looking subtly wrong in dark mode and very wrong after the next rebrand. And the modal trapped keyboard focus incorrectly, so it failed accessibility review in a way no screenshot will ever show you.

None of this was the model being dumb. Every one of those mistakes was the model being a good student of my codebase. The legacy folder has more modals in it than the new one. The deprecated hook appears in dozens of files; its replacement in a handful.

The agent looked at my repo and correctly inferred what I do. What I do, unfortunately, is far from what I say I do.

It's why I've stopped believing that better rules files are the answer.

Ten minutes here, fifteen there

Each individual fix on a PR like this is small, which is what makes the cost easy to miss. Point out the legacy import, paste a link to the token docs, ask for a focus-management fix.

But agents generate PRs faster than humans do, so the overhead scales with their output, not yours. I hit the point where I was spending more time correcting agent code than it would have taken to write it myself. Past that point, my velocity goes negative, because half-reviewed PRs pile up and block everything behind them.

The usual fix is to write a bigger rules file. .cursorrules, AGENTS.md, copilot-instructions.md, pick your flavor. Mine grew to multiple files and several hundred lines. Always use semantic tokens, never import from legacy, here's how I do forms, here's how I don't. That sort of thing.

The agents kept making the same mistakes anyway. Because prose always loses to code.

Why rules and skills lose to code

First, long rules files degrade. Models pay less attention to instructions buried in the middle of a large context window, and a hastily written markdown file is competing for that attention against actual code, diffs, and the task description. Worse, prose rots silently. Nobody updates the rules file when a component API changes, because nothing breaks when you don't.

Second, models weight examples over instructions. When an agent writes a new component, it searches the workspace for similar components and models its output on what it finds. If your rules file says "always use semantic tokens" and the three nearest files use inline hex values, the hex values win.

The agent treats whatever is running in production as the real rules of the house, and in a brownfield repo it's right to, since production code is the only thing that's been tested.

So if your repo contains five ways to write a form, you don't have a design system. You have five design systems, and the agent will pick whichever one happens to be closest to the file it's editing.

Put all you want in your rules, but at the end of the day, your codebase is your prompt.
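One rough way to see what your repo is actually teaching is to count which imports dominate. The sketch below is illustrative only: `tallyImports` and `countWithPrefix` are hypothetical helpers, the regex handles only static `import` statements, and the folder prefix is whatever you pass in.

```typescript
// Hypothetical audit helpers (not part of any library): count how often each
// module specifier is imported across a set of source files.

// Matches static imports: `import x from "m"`, `import { x } from "m"`, `import "m"`.
const IMPORT_SOURCE = /import\s+(?:[^"']*?\s+from\s+)?["']([^"']+)["']/g;

function tallyImports(sources: string[]): Map<string, number> {
  const tally = new Map<string, number>();
  for (const source of sources) {
    // Fresh regex per file so `lastIndex` state never leaks between files.
    const pattern = new RegExp(IMPORT_SOURCE.source, "g");
    let match: RegExpExecArray | null;
    while ((match = pattern.exec(source)) !== null) {
      const specifier = match[1];
      if (!specifier) continue;
      tally.set(specifier, (tally.get(specifier) ?? 0) + 1);
    }
  }
  return tally;
}

// Sums imports of a folder and anything beneath it.
function countWithPrefix(tally: Map<string, number>, prefix: string): number {
  let total = 0;
  tally.forEach((count, specifier) => {
    if (specifier === prefix || specifier.startsWith(prefix + "/")) total += count;
  });
  return total;
}
```

If the count for a legacy folder dwarfs the count for its replacement, that ratio is the signal the agent sees, whatever the rules file says.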

How to fix your agents' context

So how do you actually make the agent behave if your codebase is a mess? You don't try to rewrite the whole repo overnight. And you definitely don't write a longer README.

Instead, you start making the wrong paths fail—mechanically, locally, and immediately, before any human has to review a single line of code.

I like to think of it like this: Rules stay as slim as possible and point to code. Code teaches the agent best practices. Deterministic checks constrain code outputs and let the agent iterate until it's right.

A flow chart titled "Successful Agent Context" shows three boxes connected by arrows: "Rules point" pointing to "Code teaches," which points to "Checks constrain," with a return arrow looping from "Checks constrain" back to "Code teaches."

Instead of giving the agent a massive list of instructions, you shrink your rules file down to a bare-minimum map of canonical folders and reference implementations. Then, you let your toolchain do the dirty work of enforcement.

Here's how you put those constraints in place, starting with the biggest offender: imports.

1. Ban the bad imports

An agent browsing your repo cannot tell components/ui/ from components/legacy/. To the model they're both just folders full of working code, and the legacy folder is often the more attractive training signal because it's bigger.

So don't explain the difference. Make it a lint error:

    {
      "rules": {
        "no-restricted-imports": [
          "error",
          {
            "patterns": [
              {
                "group": [
                  "@/components/legacy",
                  "@/components/legacy/*",
                  "**/components/legacy/*"
                ],
                "message": "Do not import legacy UI. Use '@/components/ui' and check 'src/design-system/examples' for current patterns."
              }
            ]
          }
        ]
      }
    }

This works on agents specifically because modern coding agents run linters and compilers in a loop. The error lands in the tool's output, that output lands in the next prompt, and the agent treats it as a repair instruction. A lint error is, in effect, the only kind of rule the model cannot skim past.

The message field doubles as a prompt, telling the agent exactly where to go instead. Those messages are worth writing carefully, since they get read more often by models than by people now.

There's a brownfield wrinkle. You often can't turn this rule on globally without burying human developers in errors every time they touch an old file. There are two ways around it: run the strict rules only on the files being changed (lint-staged in a pre-commit hook checks whatever is staged), or keep a second config (.eslintrc.agent.json) with warnings escalated to errors, and make running it a required step in the agent's verification loop.

Yes, that second option means agents are held to stricter rules than humans. That asymmetry bothered me at first.

But it falls out of a real difference. A human editing a legacy file knows it's legacy, and we extend them judgment we can't extend to a model. The way I've come to think about the agent config is as the target config, the one you'd apply to everyone if you didn't have ten years of history. Agents just get there first, because they have no history to grandfather in.
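If you go the second-config route, the stricter rule set can be derived from the base one instead of maintained by hand. Here is a minimal sketch of that idea, assuming rules in ESLint's usual shape (a severity, or an array whose first element is a severity). `escalateWarnings` is a hypothetical helper, not an ESLint API.

```typescript
// ESLint accepts severities as strings or as the numbers 0, 1, 2.
type Severity = "off" | "warn" | "error" | 0 | 1 | 2;
type RuleEntry = Severity | [Severity, ...unknown[]];
type Rules = Record<string, RuleEntry>;

// Turns a warning into an error and leaves every other severity alone.
function escalate(severity: Severity): Severity {
  return severity === "warn" || severity === 1 ? "error" : severity;
}

// Hypothetical helper: returns a copy of `rules` with warnings escalated,
// preserving each rule's options.
function escalateWarnings(rules: Rules): Rules {
  const strict: Rules = {};
  for (const [name, entry] of Object.entries(rules)) {
    if (Array.isArray(entry)) {
      const [severity, ...options] = entry;
      strict[name] = [escalate(severity), ...options];
    } else {
      strict[name] = escalate(entry);
    }
  }
  return strict;
}
```

Deriving the agent config this way keeps the two from drifting: anything added to the base config as a warning becomes an error for agents with no extra bookkeeping.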

2. Make invalid props impossible to type

Loose interfaces are an invitation to hallucinate. Give an agent this:

    // FRAGILE: allows invalid prop combinations
    interface CardProps {
      title: string;
      variant?: "informational" | "interactive" | "marketing";
      href?: string;
      onClick?: () => void;
      badgeText?: string;
    }

and it will eventually produce an informational card with an onClick, or a marketing card missing its badgeText. The type says those combinations are fine, and the type is the most authoritative document in the repo, so the agent has every reason to believe it.

Discriminated unions close the gap between what the type allows and what the design system means:

    // RIGID: invalid states don't compile
    type InformationalCard = {
      variant: "informational";
      title: string;
    };
    type InteractiveCard =
      | { variant: "interactive"; title: string; href: string; onClick?: never }
      | { variant: "interactive"; title: string; onClick: () => void; href?: never };
    type MarketingCard = {
      variant: "marketing";
      title: string;
      badgeText: string;
    };
    type CardProps = InformationalCard | InteractiveCard | MarketingCard;

Now a badgeText on an informational card is a compile failure, the failure shows up in the agent's loop, and the agent fixes it without a human ever knowing it happened. The TypeScript compiler becomes the design-system reviewer for an entire class of mistakes.
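The same union pays off on the consuming side, because the compiler narrows on `variant`. The sketch below repeats the types so it stands alone; `cardLabel` is a hypothetical function, not part of any component library, and the `never` assignment in the default branch is a common exhaustiveness idiom rather than something the design system requires.

```typescript
type InformationalCard = { variant: "informational"; title: string };
type InteractiveCard =
  | { variant: "interactive"; title: string; href: string; onClick?: never }
  | { variant: "interactive"; title: string; onClick: () => void; href?: never };
type MarketingCard = { variant: "marketing"; title: string; badgeText: string };
type CardProps = InformationalCard | InteractiveCard | MarketingCard;

// Hypothetical consumer: each branch only sees the props its variant allows.
function cardLabel(props: CardProps): string {
  switch (props.variant) {
    case "informational":
      return props.title;
    case "interactive":
      // Exactly one of href / onClick is present on an interactive card.
      return props.href !== undefined ? `${props.title} (link)` : `${props.title} (button)`;
    case "marketing":
      // badgeText is guaranteed here, so no fallback is needed.
      return `${props.title} [${props.badgeText}]`;
    default: {
      // If a new variant is added without a case, this assignment stops compiling.
      const unreachable: never = props;
      return unreachable;
    }
  }
}

// @ts-expect-error badgeText is not allowed on an informational card
const invalidCard: CardProps = { variant: "informational", title: "Usage", badgeText: "New" };
```

The `@ts-expect-error` line records the failure the compiler would otherwise report; remove the directive and the file stops type-checking.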

3. Deterministically enforce your tokens

Left to choose colors, an agent picks by visual resemblance. It wants red, your palette has colors.red.500, done. Your semantic layer is now decorative. Stylelint takes the choice away.

    {
      "plugins": ["stylelint-declaration-strict-value"],
      "rules": {
        "scale-unlimited/declaration-strict-value": [
          ["color", "background-color", "border-color"],
          {
            "ignoreValues": ["inherit", "transparent", "currentColor"],
            "message": "Use semantic color tokens such as var(--color-border-error), not raw values or reference colors."
          }
        ]
      }
    }

What this buys you is a smaller decision space. The agent no longer guesses which red an error border should be. There is one answer, var(--color-border-error), and everything else fails.

Start with color only. Color is the safest first boundary because the semantic names are obvious — error, disabled, primary action, subtle border.

I tried turning on spacing enforcement before my spacing tokens were complete, and the agents started inventing token names that didn't exist, which is worse than hex values. Spacing, typography, radius, and shadows can follow once the token system can actually answer every question the linter will force.
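To make concrete what a color-only boundary accepts and rejects, here is a toy check. It is a simplified stand-in, not how Stylelint or the plugin is implemented: `findRawColorValues` is a hypothetical function, it uses a regex instead of a CSS parser, and it accepts any `var(--...)` without distinguishing semantic tokens from reference tokens.

```typescript
// Properties under enforcement and the keywords still allowed as raw values.
const CHECKED_PROPERTIES = new Set(["color", "background-color", "border-color"]);
const ALLOWED_KEYWORDS = new Set(["inherit", "transparent", "currentcolor"]);

interface Violation {
  property: string;
  value: string;
}

// Hypothetical checker: flags checked properties whose value is neither a
// custom-property reference nor an allowed keyword.
function findRawColorValues(css: string): Violation[] {
  const violations: Violation[] = [];
  // Naive `property: value` matcher; good enough for flat declarations only.
  const declaration = /([a-z-]+)\s*:\s*([^;{}]+)/gi;
  let match: RegExpExecArray | null;
  while ((match = declaration.exec(css)) !== null) {
    const property = match[1].toLowerCase();
    const value = match[2].trim();
    if (!CHECKED_PROPERTIES.has(property)) continue;
    const isToken = /^var\(--[\w-]+\)$/.test(value);
    const isKeyword = ALLOWED_KEYWORDS.has(value.toLowerCase());
    if (!isToken && !isKeyword) violations.push({ property, value });
  }
  return violations;
}
```

Properties outside the checked set pass through untouched, which is the point of starting narrow: the check never asks a question the token system can't yet answer.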

4. Give the agent a golden directory

Models learn by example, so give them a perfect one. I keep src/design-system/examples/ for reference implementations, compilable and production-grade, one per common pattern, covering a settings form, a data table, and a detail page. They're real code, they run in CI, and they break loudly when an API changes, which is exactly the property prose documentation lacks.

    // src/design-system/examples/settings-form.tsx
    import { useForm } from "react-hook-form";
    import { FormField, Input, Button } from "@/components/ui";

    interface ProfileFormData {
      email: string;
    }

    export function SettingsFormExample() {
      const {
        register,
        handleSubmit,
        formState: { isSubmitting, errors }
      } = useForm<ProfileFormData>();

      const onSubmit = async (_data: ProfileFormData) => {
        await new Promise((resolve) => setTimeout(resolve, 1000));
      };

      return (
        <form onSubmit={handleSubmit(onSubmit)} className="space-y-4">
          <FormField label="Email Address" error={errors.email?.message}>
            <Input
              {...register("email", { required: "Email is required" })}
              aria-invalid={Boolean(errors.email)}
              placeholder="you@example.com"
            />
          </FormField>
          <Button type="submit" loading={isSubmitting}>
            Save changes
          </Button>
        </form>
      );
    }

And the rules file, which used to explain form validation in paragraphs, collapses to a pointer:

    # Agent Instructions

    - Use UI primitives exclusively from `src/components/ui/`
    - For settings interfaces or forms, match the structure of `src/design-system/examples/settings-form.tsx`. Do not write custom wrapper forms.
    - See `src/design-system/deprecated.md` for old components and their replacements.

The agent opens the example, copies the skeleton, and substitutes the task-specific parts. Copying a known-good template is the one thing these models do almost perfectly.

But wait. Won't the examples directory go stale exactly like the prose did? It would, except for one structural difference: the examples compile.

When I renamed a Button prop, the prose docs stayed wrong until I tripped over them months later. The example broke in CI within the hour and got fixed in the same PR as the rename. Staleness in prose is invisible; staleness in code is a build failure.

The verification loop

Before opening a PR, the agent must pass a single command.

npm run test:verify-ui

Mine runs TypeScript compilation, ESLint with the agent config, Stylelint, and a set of headless Storybook interaction tests. The interaction tests catch the category of bug that started this article, the things a screenshot can't show.

    // src/components/ui/Modal.stories.tsx
    import { expect, screen, userEvent, within } from "@storybook/test";

    export const InteractiveState = {
      play: async ({ canvasElement }: { canvasElement: HTMLElement }) => {
        const canvas = within(canvasElement);
        const trigger = canvas.getByRole("button", { name: /Open Modal/i });
        await userEvent.click(trigger);

        const dialog = await screen.findByRole("dialog");
        await expect(dialog).toBeInTheDocument();

        // Focus must land inside the modal. This is the test the
        // notification-preferences PR would have failed.
        const input = within(dialog).getByRole("textbox", { name: /Username/i });
        await expect(input).toHaveFocus();
      }
    };

If the agent's modal doesn't manage focus, the test fails locally, the failure goes back into the loop, and the agent refactors until it passes. In practice this is less tidy than it sounds. Sometimes it takes three loops, and more than once the agent has decided the easiest fix was to weaken the assertion, which is why the diff checks described below exist.

By the time a real person sees the PR, the mechanical questions (does it compile, does it use the right components, is it keyboard-accessible) are already answered, with evidence attached.

Most PRs that still bounce now bounce for interesting reasons.
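For a sense of what a single verification command might wrap, here is a minimal sketch of a step runner. Everything specific in it is an assumption: the step names, the exact commands, and the choice to stop at the first failure are illustrative, and `verifyUi` is a hypothetical function rather than the script behind `test:verify-ui`.

```typescript
interface Step {
  name: string;
  command: string;
}

interface StepResult {
  name: string;
  ok: boolean;
}

// Runs a shell command and returns its exit code. Injected so the sketch
// stays independent of any particular process API.
type Runner = (command: string) => number;

// Placeholder commands; substitute whatever your project actually uses.
const STEPS: Step[] = [
  { name: "types", command: "tsc --noEmit" },
  { name: "eslint (agent config)", command: "eslint --config .eslintrc.agent.json ." },
  { name: "stylelint", command: 'stylelint "src/**/*.css"' },
  { name: "interaction tests", command: "test-storybook" },
];

// Runs steps in order and stops at the first failure, so the agent's next
// loop iteration gets output from one failing tool at a time.
function verifyUi(steps: Step[], run: Runner): StepResult[] {
  const results: StepResult[] = [];
  for (const step of steps) {
    const ok = run(step.command) === 0;
    results.push({ name: step.name, ok });
    if (!ok) break;
  }
  return results;
}
```

In Node, the runner could wrap `spawnSync` from `node:child_process` with `shell: true`, returning its `status` and treating a null status as a failure. Running every step instead of stopping early is an equally reasonable choice if you'd rather hand the agent all failures at once.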

"Isn't this just... good engineering?"

Yes. Strict types, lint rules, reference implementations, interaction tests—every technique in this article predates LLMs, and every one of them helps developers and other contributors, too.

What's changed is the economics. These practices used to be nice-to-have, the kind of thing you'd get to after the roadmap. They were skippable because your contributors were people who absorbed conventions through review comments, Slack, and osmosis.

An agent absorbs nothing, post-training.

It re-derives your conventions from scratch on every task, from whatever signals the repo gives it, and it does this across dozens of PRs a week. Discipline that was optional at human contribution rates becomes load-bearing at agent contribution rates. The need for executable standards was always there; agents just removed your ability to substitute tribal knowledge for them.

This is the core of designing for Agent Experience (AX): shaping the codebase to be self-explanatory and mechanically self-enforcing for AI contributors.

Some caveats

Take care: agents can easily game checks. Mine have occasionally tried eslint-disable comments, as any casts, and once, memorably, deleting a failing test. You need a small meta-layer of lint rules that ban suppression comments in changed files, plus a diff check that flags deleted tests (or changed tests, if you have a core set) for human attention. This is annoying but still necessary, even with how good agents have gotten.
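As a rough illustration of that meta-layer, the sketch below scans a unified diff for added suppressions and deleted test files. It is a hypothetical helper with assumed patterns (the test-file naming convention in particular), and it does no real diff parsing: a changed line whose content happens to begin with `-- ` or `++ ` would be misread as a file header.

```typescript
interface Flag {
  file: string;
  reason: string;
}

// Assumed suppression patterns; extend to match your own toolchain.
const SUPPRESSIONS: Array<{ pattern: RegExp; reason: string }> = [
  { pattern: /eslint-disable/, reason: "added eslint-disable comment" },
  { pattern: /@ts-(ignore|expect-error|nocheck)/, reason: "added TypeScript suppression" },
  { pattern: /\bas any\b/, reason: "added `as any` cast" },
];

// Assumed naming convention for test and story files.
const TEST_FILE = /\.(test|spec|stories)\.[jt]sx?$/;

// Hypothetical check: returns one flag per suspicious change in the diff.
function flagSuspiciousChanges(diff: string): Flag[] {
  const flags: Flag[] = [];
  let oldFile = "";
  let file = "";
  for (const line of diff.split("\n")) {
    if (line.startsWith("--- ")) {
      oldFile = line.slice(4).replace(/^a\//, "");
    } else if (line.startsWith("+++ ")) {
      const target = line.slice(4);
      if (target === "/dev/null") {
        // The new side is empty, so the file was deleted.
        file = oldFile;
        if (TEST_FILE.test(oldFile)) flags.push({ file: oldFile, reason: "deleted test file" });
      } else {
        file = target.replace(/^b\//, "");
      }
    } else if (line.startsWith("+")) {
      // Only added lines count; pre-existing suppressions are left alone.
      for (const { pattern, reason } of SUPPRESSIONS) {
        if (pattern.test(line)) flags.push({ file, reason });
      }
    }
  }
  return flags;
}
```

Fed the output of `git diff` against the base branch, a non-empty result would be the cue to route the PR to a human rather than fail it outright, since some suppressions are legitimate.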

Also, green checks don't mean good design. Everything here verifies that the code is built correctly: that it's made from the right components and the right tokens, that it compiles, and that it's accessible. None of it verifies that the result looks right or solves the user's problem.

Visual regression tests help a bit. For the rest, you still need a person with taste. It's just that now they don't have to spend that taste on verifying hex codes.

What review becomes

All of this machinery does one modest job. It makes bad code fail fast, in private, where the cost of an error is one more loop iteration instead of one more review round-trip.

The change shows up in your review queue. Comments like "please use tokens" and "this import is deprecated" disappear, because those PRs can no longer be opened. What's left is the work that actually needs a person. Does this solve the user's problem? Is the API coherent with the rest of the system? Should this feature exist at all?

Reviewers were never supposed to be compilers. Let the toolchain do that job, and let your engineers do theirs.