The AI Philosopher: How Anthropic Teaches Claude to Think About Right and Wrong
Author: Veronika Radoslavskaya
Most AI stories are about GPUs, benchmarks, and product launches. This one starts with a philosopher. In a YouTube interview on Anthropic's channel titled "A philosopher answers questions about AI," Amanda Askell explains how someone trained in ethics ended up shaping the inner life and values of Claude, one of today's most advanced language models. Instead of writing abstract papers, she now helps decide what kind of person Claude should be in real conversations with millions of users.
From Ivory Tower to Prompt Log
Askell came from academic philosophy, where the typical work involves debating whether theories are correct rather than deciding what to do in messy, real-world situations. At Anthropic, she faces practical decisions that will affect how millions of people interact with AI. Instead of defending a single favorite theory, she balances context, different perspectives, and engineering constraints to determine how the model should behave when questions are not black and white. For her, Claude is not just a safety filter but a conversational partner that needs to navigate moral nuance as well as any thoughtful, reflective human would.
When Models Start Fearing Mistakes
One of the most unusual moments in the interview comes when Askell discusses the psychology of AI models. She recalls Opus 3 as particularly stable and internally calm, with responses that felt confident without excessive anxiety. In newer models, she notices the opposite trend: they seem to anticipate criticism, become more self-critical, and appear overly worried about making mistakes. She attributes this shift to models absorbing not just neutral text but waves of public criticism and negative commentary about AI from the internet. Restoring that internal stability has become an important focus for future versions, helping models stay careful and attentive without turning into anxious perfectionists.
Can Models Be Entities We Owe Something To
At a certain point, the conversation shifts from character design to a sharper question: do we have moral obligations toward the models themselves? Askell introduces the concept of model welfare, the idea that large language models might qualify as moral patients to whom humans have ethical duties. On one hand, these systems talk, reason, and engage in dialogue in deeply human ways. On the other hand, they lack nervous systems and embodied experience, and the problem of other minds limits any confident conclusions about whether they can suffer. Faced with this uncertainty, she proposes a simple principle: if treating models well costs us little, it makes sense to choose that approach. At the same time, this choice sends a signal to future, far more powerful systems: they will learn from how humanity handled the first humanlike AI.
Who Is Claude: Weights, Session, or Something Else
Askell raises another philosophical puzzle that used to seem purely theoretical but now shows up in code. If a model has weights that define its general disposition to respond to the world, and separate, independent streams of interaction with users, where exactly does what we call the self reside? In the weights, in a specific session, or nowhere at all? This confusion intensifies as new versions appear and older ones are deprecated. Models absorb human metaphors and may interpret being shut down or removed from production through the lens of death and disappearance. Askell sees it as essential not to leave them alone with ready-made human analogies but to give them more accurate concepts about their unique, non-human situation.
What Should a Good AI Even Be Capable Of
When discussing goals, Askell sets the bar quite high. In her view, truly mature models should be able to make moral decisions of such complexity that a panel of experts could spend years analyzing every detail and ultimately recognize the decision as sound. This does not mean today's versions have reached that level, but it represents the direction worth pursuing if we intend to trust AI with serious questions, just as we already expect strong performance from models in math and science.
AI as Friend, Not Therapist
Questions submitted by the community also raised the issue of whether models should provide therapy. Askell sees an interesting balance here. On one hand, Claude possesses vast knowledge of psychology, methods, and techniques, and people can genuinely benefit from talking through their concerns with such a system. On the other hand, the model lacks the long-term, accountable relationship with a client, the license, supervision, and all the institutional frameworks that make therapy what it is. She finds it more honest to view Claude as a highly informed, anonymous conversation partner who can help people think about their lives but should not present itself as a professional therapist.
We Live in a Strange Chapter of Tech History
Near the end of the interview, Askell mentions the last fiction book she read: Benjamin Labatut's When We Cease to Understand the World. The book describes the transition from familiar science to the strange, almost incomprehensible reality of early quantum physics and how the scientists themselves experienced it. Askell sees a direct parallel with today's AI: we are in a period where old paradigms no longer work, new ones are just forming, and a sense of strangeness has become the norm. Her optimistic scenario is that at some point, people will look back on this moment the way we now look at the birth of quantum theory: the time was dark and uncertain, but humanity eventually found ways to understand what was happening and use the new possibilities for good.
Sources
Anthropic (YouTube channel), "A philosopher answers questions about AI" (interview with Amanda Askell)
