My perspectives on the potential uses of AI large language models (LLMs) often puzzle people. They can seem contradictory. For example, I frequently acknowledge the significant concern of AI bias, yet suggest that AI could excel at discussing or evaluating bias. That doesn't seem to make sense. To many, if the AI is biased or inaccurate, then it is always so.
Certainly, when projecting what LLMs will do well, I'm making educated guesses. Anyone working with these chatbots will surely agree that predicting their performance on any particular use is error-prone. Sometimes they are frustrating, other times astounding. LLMs continue to surprise me.
I can't rely on evidence to project an AI's level of success. There is some evidence on general abilities, but it could be irrelevant to your context and task request. We must be cautious extrapolating AI performance and behavior; performance can be inconsistent even across closely related tasks. The nature and level of bias from an out-of-the-box LLM can be completely different once you've given it context through your interaction.
The first judgment we need to make in AI use is whether to use it at all. People's intuitions about LLMs often diverge depending on their experience level with the tech, and the intuitions learned from experience are hard to explain. That's what's happening when I say a biased AI might be useful for assessing bias: I'm carrying assumptions that don't match my audience's.
To bridge this gap, I've been examining the assumptions underlying my judgments about AI use. I think expressing my logic, to the extent my gut follows logic, is a useful exercise. At a minimum, I hope to stimulate your thinking.
It's crucial to note that this doesn't negate the need for vigilance. Users should always critically evaluate AI outputs rather than blindly trusting them. These considerations are meant to inform pre-use decisions and expectations, not to lower our guard against potential AI shortcomings.
Here are—as far as I can tell—some of the notions I use frequently when projecting LLM abilities.
LLMs Behave in Niche Ways
The behavior and performance level of LLMs is highly context-dependent, a concept Ethan Mollick aptly describes as the "Jagged Frontier." The niche we define for the AI, through the context we give it, significantly influences its outputs. The roles and knowledge realms we steer an LLM toward can dramatically alter its behavior.
The core of our request matters immensely. Are you seeking an opinion, a fictional narrative, or "unbiased" nonfiction? Each of these frames the AI's task differently, potentially leading to vastly different outputs.
This niche-specific behavior means that AI can appear biased or unbiased, accurate or inaccurate, depending on the conversational context you establish. For instance, if you task an LLM with helping develop lesson plans about societal bias and discrimination, I do not expect it to express overtly biased viewpoints. The educational niche about bias differs significantly from the niche of general opinions on societal issues.
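To make the niche idea concrete, here is a minimal sketch of how the same core question can be placed in two different niches through the surrounding context. The `call_llm` function is a hypothetical placeholder for whatever chat interface or API you use, and the prompt wording is my own illustration, not a prescribed method.

```python
# Sketch: the same core question placed in two different niches.
# call_llm is a hypothetical placeholder, not a real library function.

def call_llm(prompt: str) -> str:
    """Placeholder for a real chat-completion call to your provider."""
    raise NotImplementedError("Plug in your LLM provider here.")

core_question = "How does bias show up in standardized testing?"

# Niche 1: general opinion -- the model is free to editorialize.
opinion_prompt = (
    "You are a commentator sharing opinions on social issues.\n\n"
    + core_question
)

# Niche 2: educational design -- the same question, steered toward
# lesson planning, which draws on different source material and tone.
education_prompt = (
    "You are an instructional designer helping a high-school teacher "
    "build a lesson plan examining bias and discrimination. "
    "Favor evidence-based, classroom-appropriate framing.\n\n"
    + core_question
)

# Expect noticeably different behavior from the same model:
# opinion_reply = call_llm(opinion_prompt)
# lesson_reply = call_llm(education_prompt)
```

The point of the sketch is only that the framing, not the core question, defines the niche the model responds from.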
However, this doesn't guarantee the AI will be entirely unbiased within that educational niche. It might still overemphasize certain biases while neglecting others, reflecting often subtle issues in its training data. For example, an AI tasked with creating historical narratives might produce accurate accounts but still perpetuate biases in its selection and emphasis of events. Still, if the niche you define in your AI interaction emphasizes education, the risk of prejudicial language should be lower.
To be clear, defining the AI's niche well is no guarantee that all biases will go away. It might introduce other biases particular to that niche or include cross-cutting biases that many niches share. The point is only that the biases and inaccuracies within the niche are the ones that matter, not all biases and inaccuracies. For that reason, I pay more attention to the LLM's general performance level in a domain (e.g., asking it to do math may not go well) than to highly publicized biases and errors.
LLM Behavior Relates to the Niche's Data Sources
When we successfully guide an LLM into the appropriate thinking niche, its potential for bias or inaccuracy depends on the characteristics of the data in that niche used to train it. The vocabulary and writing style will typically align with the niche unless specifically directed otherwise. The perspectives in that niche, including biased and inaccurate ones, will come along too by virtue of the AI training on that information.
While AI companies usually don't disclose their training data details, my intuition about LLM performance includes educated guesses about the data sources used to train for the niche.
Medical discussions, for example, likely draw from legitimate, voluminous data sources that emphasize reducing bias, improving cultural sensitivity, and providing evidence-based information. That's especially true if the niche you define through your prompting emphasizes legitimate medical information over rumor (e.g., you request output using medical terminology). If you're an educator and further narrow the niche to medical education, then the training-data fodder for that realm is unlikely to use prejudicial language. Though the medical system itself may have biases, the information sources discussing medical education typically strive to avoid perpetuating them.
Niches can also introduce their own biases. Using AI to generate case studies for medical education, for instance, might inadvertently overemphasize certain kinds of patients, or reinforce undesirable diagnostic patterns. Those biases are baked into the clinical and research literature upon which the AI was surely trained.
Other niches I trust less, sometimes because I don't think much information existed on the subject to train the LLM, and other times because it seems the AI company will steer the response. For instance, when asking about very recent events or niche scientific discoveries, the AI might struggle due to limited training data. Similarly, when discussing controversial political topics or emerging social issues, the AI's responses might be overly cautious or steered towards a particular viewpoint, reflecting the AI company's policies rather than a comprehensive analysis. In educational contexts, this could manifest when discussing rapidly evolving fields like AI ethics or when exploring culturally sensitive topics where the AI's training data might be limited or biased.
LLMs Can Perform Differently on Evaluation Tasks Compared to Creative Ones
The discovery, around 2017, that early language models could perform high-quality sentiment analysis even while the text they generated remained largely incoherent highlights an important aspect of LLMs. These models contain intricate networks of categorizers, some of which can be remarkably effective.
Evaluation tasks are often categorization ones. Sometimes a prompt interaction contains many requests for the LLM to categorize, some implicitly. I cover this more in a recent article.
This characteristic should result in LLMs performing differently on evaluation tasks compared to creative ones. When using AI to rate originality in writing, identify potential biases in text, or critique created content, I expect it to perform better, on average, than when asking it to generate original content. It's why I trust LLMs more to evaluate whether some data contains a potential bias, and under what conditions it would matter, than to express something in an unbiased way, all other aspects being equal.
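As a rough illustration of evaluation framed as categorization, here is a small sketch. The label set is my own invention, and `call_llm` is again a hypothetical placeholder; the point is that the model is asked to pick from constrained categories rather than to create.

```python
# Sketch: framing an evaluation task as constrained categorization.
# The label set is invented for illustration; call_llm is a hypothetical placeholder.

LABELS = ["no apparent bias", "possible bias", "clear bias"]

def build_bias_check_prompt(passage: str) -> str:
    """Ask the model to pick a category, not to rewrite or create."""
    return (
        "Classify the following passage for potential bias.\n"
        f"Respond with exactly one label from this list: {', '.join(LABELS)}.\n"
        "Then give a one-sentence justification.\n\n"
        f"Passage:\n{passage}"
    )

# verdict = call_llm(build_bias_check_prompt(some_passage))
```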
My intuition suggests that LLMs are more reliable when judging human work than when generating original material (though NOT when judging whether it’s AI or human work). That’s an apples-to-oranges comparison; I’m not sure what a comparison metric would even be. Good categorization, at least for concepts that are frequently used, came along in AI evolution before LLMs figured out how to put the conceptual pieces together in the right way for useful output. Such categorizers seem to be necessary precursors for creative skill.
My sense that LLMs have better evaluation skill is also due to the ease of providing examples. Even a small number of examples can significantly enhance an AI's ability to evaluate, whereas achieving consistent behavior in content creation often proves more challenging, even with examples.
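A hedged sketch of what "a small number of examples" might look like in practice, continuing the hypothetical bias-check prompt above. The example passages and labels below are invented purely for illustration.

```python
# Sketch: a handful of labeled examples added to the evaluation prompt.
# The example passages and labels below are invented for illustration.

FEW_SHOT_EXAMPLES = [
    ("The committee chose the most qualified candidates.", "no apparent bias"),
    ("Older employees struggle to learn new software.", "clear bias"),
]

def build_few_shot_bias_prompt(passage: str) -> str:
    """Show the model the rubric through examples before the real passage."""
    shots = "\n\n".join(
        f'Passage: "{text}"\nLabel: {label}' for text, label in FEW_SHOT_EXAMPLES
    )
    return (
        "Classify each passage for potential bias, "
        "following the labeled examples.\n\n"
        f"{shots}\n\n"
        f'Passage: "{passage}"\nLabel:'
    )

# verdict = call_llm(build_few_shot_bias_prompt(some_passage))
```

Even two or three examples like these give the model a concrete rubric to imitate, which is much harder to achieve when asking it to create content in a consistent style.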
LLMs Can Be Effective at Generating 'Plausible' Content
In many learning scenarios, especially in experiential learning, absolute accuracy is less critical than plausibility. When creating case studies for developing stronger judgment skills, AI-generated scenarios don't need to be textbook-accurate representations of typical situations. Instead, they need to present plausible patients, legal clients, or students.
These AI-generated, plausible scenarios often better represent real-world decision-making compared to more stereotypical examples, as median cases rarely exist in practice. I frequently advise educators to use AI for generating numerous, plausible case descriptions, with the educator screening them before presenting to students. Increasingly, these AI-generated scenarios require minimal editing.
However, this approach raises important ethical considerations. By relying on AI-generated content for education, we risk introducing or reinforcing biases that may not be immediately apparent. For instance, the AI might consistently generate scenarios that subtly reinforce stereotypes or oversimplify complex situations. I find this can easily be overcome by attaching additional information to the prompt that the AI can use to direct its behavior (e.g., a statement of medical ethics you want the LLM to follow in case generation).
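As a rough sketch of what "attaching additional information to the prompt" could look like for case generation: the ethics excerpt, topic, and difficulty parameters here are placeholders I made up, and an educator would still screen every generated case before using it.

```python
# Sketch: steering case generation by attaching a guidance document.
# ETHICS_STATEMENT and the case parameters are illustrative placeholders.

ETHICS_STATEMENT = (
    "Cases must represent diverse patient populations, avoid stereotyped "
    "presentations, and never tie a diagnosis to demographic traits alone."
)

def build_case_prompt(topic: str, difficulty: str) -> str:
    """Bundle the request, difficulty level, and guidance into one prompt."""
    return (
        "Generate a plausible (not textbook-perfect) patient case for "
        f"medical education on the topic of {topic}.\n"
        f"Target difficulty: {difficulty}.\n"
        "Follow this guidance when constructing the case:\n"
        f"{ETHICS_STATEMENT}\n"
        "End with three discussion questions for students."
    )

# A batch for the educator to screen before classroom use:
# drafts = [call_llm(build_case_prompt("chest pain triage", level))
#           for level in ("introductory", "intermediate", "advanced")]
```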
When considering AI performance, it's vital to distinguish between tasks requiring absolute accuracy (such as determining a student's grade) and those where relative accuracy suffices (creating cases at various difficulty levels). Similarly, we must consider whether historical nonfiction is necessary or if plausible scenarios are adequate for the task at hand. These distinctions become particularly important when using AI in high-stakes situations, such as medical diagnosis or legal decision-making.
Anticipating AI bias and inaccuracies requires a nuanced understanding of how these models operate in different contexts. By considering the niche, data sources, task type, and required level of accuracy, we might better predict and mitigate potential issues in AI use.
The responsible use of AI requires a combination of technical understanding, problem-resolution approaches, critical thinking, and ethical reasoning. But you must get to first base and decide whether to even attempt to use AI for some purpose. That's a gut feel that should get better as you use LLMs.