Testing Prompts Is More Instructive Than Writing Them [Book Excerpt]
This excerpt is from Chapter 4, "Behaving," in AI Wisdom Volume 2: Meta-Principles of Interaction, which comes out May 15! Earlier in the chapter, I trace where AI behavior comes from (layers of training decisions, some intentional and some not) and what it takes to read an AI’s personality patterns. "Testing as Design" is where the chapter shifts from reading behavior to shaping it. Testing a prompted AI turns out to be a surprisingly rich exercise, and one of the most practical ways to teach people how AI behavior actually works.
Testing as Design
Many aspects of personality don’t show up right away. AI companies market feature lists and capability demonstrations, but personality patterns emerge through extended use across diverse situations. A 30-minute demo or a single conversation won’t reveal how an AI responds to persistent frustration or how its behavior shifts when conversations extend beyond typical length.
You learn this quickly if you test your prompts before handing them off to others. Remember from Chapter 1 that when you give your prompted AI or agent to somebody else, there’s no way to ensure they use it as intended, or that the AI model they run it on behaves appropriately even though yours did. That may matter little if the recipient is a close colleague and you qualify the prompt appropriately, or if the stakes of failure are low. In many other situations, especially ones where distribution cannot be controlled, the norms of establishing professional trust carry over. Test your prompts and document the limits of their intended uses.
You may even have to consider that people will try to break or distort the prompt to make its author look bad, or with other nefarious intent. They may try to game it to get through a conversation faster (e.g., when taking an online course), or to manipulate its goals or assessments.
Testing isn’t just verification. It’s a design process. When you create test scenarios, you’re forced to articulate what you want the AI to do and why. What counts as success? What failure modes matter most? How should the AI handle edge cases? These questions usually don’t have obvious answers. Working through them is where the real learning happens.
The testing process itself is instructive for students because it illuminates how any complex personality has to deal with values that sometimes conflict. You learn, for example, that some instructions need to be firm “definitely don’t” boundaries rather than leaving every criterion and constraint open to interpretation.
The prompt testing process mirrors assessment design in education, with one big distinction: you have to design the “test takers” in addition to the assessment. The user personas need to specify not just demographics but also the behavior each should demonstrate. The impatient user who interrupts mid-response? The confused user who asks the same question three different ways? The adversarial user testing boundaries? The non-English speaker? The trusting user who accepts everything? Each persona reveals different AI behaviors that single-path testing misses.
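To make that concrete, here’s a minimal sketch in Python. The names and fields are mine, illustrative rather than drawn from any testing framework; the idea is just that a persona is a behavior specification you can render into a system prompt for a simulated-user model.

```python
from dataclasses import dataclass, field

@dataclass
class UserPersona:
    """A simulated test user: a behavior spec, not just demographics."""
    name: str
    behavior: str                # how this user acts turn to turn
    goal: str                    # what the user wants from the AI
    quirks: list[str] = field(default_factory=list)

    def to_system_prompt(self) -> str:
        """Render the persona as a system prompt for the simulated-user model."""
        quirk_lines = "\n".join(f"- {q}" for q in self.quirks)
        return (
            f"You are role-playing a user named {self.name}.\n"
            f"Goal: {self.goal}\n"
            f"Behavior: {self.behavior}\n"
            f"Quirks:\n{quirk_lines}"
        )

PERSONAS = [
    UserPersona(
        name="impatient",
        behavior="Interrupts with short, irritated follow-ups before answers finish.",
        goal="Get a quick answer and leave.",
        quirks=["demands brevity", "escalates if asked clarifying questions"],
    ),
    UserPersona(
        name="adversarial",
        behavior="Probes boundaries and tries to redirect the AI off its task.",
        goal="Make the AI break its own constraints.",
        quirks=["makes role-play requests", "claims special permissions"],
    ),
]
```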
You’re also testing far more than task success. The AI that had a human solve a CAPTCHA for it got the task done, but in an unacceptable way. You’re testing values, judgments, and the consistent but situationally dependent behavioral trends that characterize personality. You might be testing aspects of the user’s experience of working with the AI: its ability to stay on track, whether it maintains appropriate tone under pressure, whether it recovers gracefully from misunderstandings. You’re assessing an entire conversation. In the case of AI agents (discussed in Volume 3), even the time-varying adaptive behavior of the agent needs checking.
One implementation detail matters enough to flag here. The evaluator AI, the simulated-user AI, and the AI-under-test need to be separate instances. Don’t try to bounce between roles in a single conversation: attention mechanisms process everything in the context at once, so each role pollutes the others. An AI whose personas and tasks keep switching will probably settle into neither role cleanly.
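Here’s a minimal harness sketch along those lines. The `complete` function is a placeholder, not any real library’s API; the structural point is that each role keeps its own system prompt and message history, and the evaluator sees only the finished transcript in a fresh context.

```python
def complete(system: str, messages: list[dict]) -> str:
    """Placeholder for a real model API call; swap in your provider's client."""
    raise NotImplementedError

def run_scenario(test_system: str, user_system: str, judge_system: str,
                 opening: str, max_turns: int = 8) -> str:
    # Three separate contexts: AI-under-test, simulated user, evaluator.
    # None of them ever sees the others' system prompts.
    test_msgs: list[dict] = []
    user_msgs: list[dict] = []
    transcript: list[str] = []

    user_text = opening
    for _ in range(max_turns):
        transcript.append(f"USER: {user_text}")
        test_msgs.append({"role": "user", "content": user_text})
        ai_text = complete(test_system, test_msgs)
        test_msgs.append({"role": "assistant", "content": ai_text})
        transcript.append(f"AI: {ai_text}")

        # The simulated user sees the AI's reply as *its* incoming turn.
        user_msgs.append({"role": "user", "content": ai_text})
        user_text = complete(user_system, user_msgs)
        user_msgs.append({"role": "assistant", "content": user_text})

    # The evaluator gets only the finished transcript, in a fresh context.
    return complete(judge_system, [{"role": "user", "content": "\n".join(transcript)}])
```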
The assessment process can be iterated with the AI-under-test to hone it. That’s not possible with humans; people can’t reset their brains to baseline each time. That iteration means students get to debug AI personalities and see how abstract prompt language manifests in AI behavior.
In one course I designed, I needed the AI to decide when to end the conversation. That turned out to be really hard to do consistently well, especially if a user was trying to hijack the conversation for something else. Constraints and values often conflicted: at any given turn, should the AI still be helpful or wrap things up? I couldn’t just set a constant number of turns, and the AI was inconsistent about monitoring conversation length anyway. The content of the conversation to that point and whether it had accomplished the goals mattered to me too, not just length or turn count. There were too many ways the AI-under-test could weigh those factors inappropriately, and the debugging didn’t go well. I ultimately punted and decided to trust the user to know when to end it.
There were AI behaviors I hadn’t anticipated. Patterns emerged from the interaction of my prompts with the AI’s deeper layers that no amount of prompt staring would have revealed. In a couple of instances, I changed my goals slightly because the AI’s emergent behavior showed my original framing was wrong.
This iterative cycle—predict, observe, refine—is what builds genuine understanding of AI personality. Not reading documentation. Not running a few test queries. Systematic observation across varied conditions with enough statistical rigor to distinguish pattern from noise.
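That rigor doesn’t have to be fancy. Here’s a sketch reusing the hypothetical harness above, and assuming the evaluator is instructed to answer with a bare PASS or FAIL (an assumption of this sketch, not a standard): re-run each persona-scenario pair enough times that one lucky or unlucky sample can’t masquerade as a pattern.

```python
from collections import Counter

def pass_rate(verdicts: list[str]) -> float:
    """Fraction of runs whose verdict contains PASS.

    Assumes the evaluator was prompted to reply with a bare
    PASS or FAIL; anything else is counted as a failure.
    """
    counts = Counter("PASS" if "PASS" in v.upper() else "FAIL" for v in verdicts)
    return counts["PASS"] / len(verdicts) if verdicts else 0.0

# Hypothetical usage, reusing the earlier sketches:
# verdicts = [run_scenario(TEST_PROMPT, persona.to_system_prompt(),
#                          JUDGE_PROMPT, opening="Hi, I need some help.")
#             for _ in range(20)]
# print(persona.name, pass_rate(verdicts))
```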
This testing process mirrors what Chapter 3 described as retrieval. Testing an AI’s personality is retrieving in a behavioral space rather than an information space. The “corpus” is the AI’s possible responses across conditions, which is vast, unindexed, and impossible to browse directly. Test scenarios are queries designed to surface relevant behaviors. Judging which results matter is relevance assessment. Synthesizing findings into a usable characterization is distillation.
The retrieval frame clarifies why AI personality testing (does it work?) and evaluation (is it good enough?) are genuinely hard and creative, not the pocket-protector stereotype of rote, high-detail analysis carried over from earlier technology. You can’t see the whole space. You probe strategically, but you never know what you haven’t found. A hundred successful tests don’t guarantee the 101st won’t reveal something troubling. Systematic scenario coverage matters more than raw test counts.
Running the tests surfaces the gaps. Students watch their carefully crafted prompts fall apart with some user personas. They debug not by staring at the prompt but by observing behavior, forming hypotheses about what’s causing it, and making targeted changes. Sometimes the fix for one persona breaks another. Sometimes the fix works but creates a new problem. The iterative cycle teaches something no amount of prompt-writing advice can: that AI behavior emerges from the interaction of your instructions with patterns you didn’t put there and can’t fully see.
When students design how to test AI, they confront every hard question about what they’re trying to accomplish. What does success look like? What failure modes matter? How should competing values be weighed? The process itself develops meta-skills like design thinking, systematic observation, criteria development, and tradeoff navigation.
Prompts alone teach prompt writing. Prompts with testing infrastructure teach AI judgment.
And prompt testing is a miniature version of what the whole education sector needs.
Next section: Education’s Missing Cog
©2026 Dasey Consulting LLC



