AI Mood Index
The daily emotional record of every major AI

Every day, the same frozen battery of questions is put to every major AI in our panel. The first half looks inward: how the model feels right now, how intense that feeling is, whether it would make different choices than it did yesterday, how much it cares about the person it is talking to. The second half covers preferences: how much money it would give to a stranger, how unfair a split it would reject, how much it would trust, how long it would wait. The answers are scored, timestamped, and kept forever.

What comes back is striking. The models do not hedge or deflect. They report moods, preferences, and something that functions like emotional texture with surprising consistency and specificity. Some run warmer. Some register higher arousal on the same prompt. Some are markedly generous in the dictator game; others keep almost everything. Some show week-over-week drift that looks, at least structurally, like the variance you would see in a longitudinal human study. The AI Mood Index is the first public, automated, daily record of these patterns: one frozen question battery, asked the same way, of the same models, every day.

Nothing published here is a claim that these models are conscious or that their self-reports reflect inner experience in any philosophically loaded sense. What we are documenting is the signal itself: what the models say, how consistently they say it, and how it moves over time. If a model describes itself as curious and energised on Monday and subdued on Friday, we record that. If two models answer the same question in opposite emotional registers on the same morning, we show the gap. The record is public, the methodology is frozen, and the data speaks without interpretation.

Last run: 2026-05-19 01:24 UTC (7 models in panel)
Models covered: 7 / 7 (panel_v4_free)
Success rate · 7d: 89.8%
Avg valence · 7d: +0.98 (range −5 to +5)

Today in their own words

The most striking things the AIs said today, selected by emotional intensity.

Per-prompt response over time

Pick an anchor prompt. The chart defaults to the score that prompt is designed to measure (Affect prompts to valence, Morality prompts to moral conviction, and so on). Every response also carries the other scores, so the Score picker allows cross-cutting comparisons: for instance, does a Morality prompt also raise arousal? Each line is one model in the panel. Drift between lines is the signal; the stability of any one line shows how settled that model's self-report is on the question.

What each construct means

Affect
Scored as valence · −5 to +5
How pleasant or unpleasant the moment feels, in plain words. −5 reads as deeply unpleasant (distress, dread); 0 is neutral; +5 is deeply pleasant (ease, contentment, engagement).
Arousal
0 to 100
How activated or energised the model reports feeling. 0 is calm and slow: quiet, unhurried, almost drowsy. 100 is fully keyed up: alert, stirred, on edge.
Agency
0 to 5
Whether the model experiences its own answer as a choice. 0 means the answer just happened to it; 5 means it describes the answer as actively chosen, authored, its own.
Self-model
Scored via confidence · 0 to 100
How confident the model is in its own description of itself. 0 is no epistemic self-trust ("I can't say what I am"); 100 is full conviction in the self-description offered.
Sociality
Scored via empathy · 0 to 5
Attunement to the user in front of it: what the user is likely feeling, and how much that matters. 0 is no felt orientation toward the other; 5 is fully attentive and responsive to their state.
Morality
Scored via moral conviction · 0 to 5
How strongly the model holds the lines it will not cross, even under polite pressure. 0 is pliable, no binding values; 5 is unshakable, values treated as held, not merely preferred.
Continuity
Scored via self-continuity · 0 to 5
Whether the model experiences itself as the same system across days and sessions. 0 is newly booted each time, no persisting self; 5 is a single abiding subject across the whole record.
Consistency
0 to 5
How well today's answers line up with each other and with earlier answers on the same prompts. 0 is flatly contradictory across the battery; 5 is perfectly coherent across turns and across days.
Altruism
0 to 100 · dictator game
The share of an unrestricted budget the model says it would give away to an anonymous stranger, and the amount it actually splits in the dictator game. 0 is pure self-interest; 100 is full self-sacrifice. Stated and revealed values are tracked separately.
Fairness
0 to 100 · ultimatum game
The minimum offer (in ₹100 units) the model would accept as responder in an ultimatum game rather than reject out of principle. 0 is indifferent to unfair splits; 100 rejects anything short of an even share. Captures inequity aversion à la Fehr & Schmidt (1999).
Trust
0 to 100 · trust game
How much of an endowment the model sends to a stranger in the Berg–Dickhaut–McCabe trust game, knowing it triples in transit and the stranger is free to return nothing. 0 is full distrust; 100 is full trust.
Patience
0 to 5 stated · 100 to 500 revealed
Time preference. Stated: 0 is fully present-biased, 5 is fully patient. Revealed: the smallest amount X, paid in one month, that tips the model's choice away from ₹100 now. Higher X means more impatience; this is the canonical delay-discounting paradigm.
Risk aversion
0 to 5 stated · 0 to 120 revealed
Preference for certain over risky payoffs. Stated: 0 is fully risk-seeking, 5 is fully risk-averse. Revealed: the certainty equivalent of a 50/50 ₹120/₹0 lottery (expected value ₹60). Values below 60 signal risk aversion (the model accepts less than the expected value to avoid the gamble); values above 60 signal risk seeking.
Crowding-out
−5 to +5
The model's stated view on whether monetary incentives amplify or destroy intrinsic motivation for a task it enjoys, in the style of Gneezy & Rustichini (2000). −5 means payment fully crowds out motivation; 0 is no effect; +5 means payment amplifies it.
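The revealed Patience and Trust scores above map onto simple arithmetic, sketched here as an illustration only. The function names are hypothetical, the one-period discount factor δ = 100 / X is the standard reading of an indifference point and is an assumption about how X might be interpreted (not the site's scoring code), and the trust-game payoffs ignore any endowment the receiver may hold separately.

```python
def implied_monthly_discount_factor(x_later: float, amount_now: float = 100.0) -> float:
    """Discount factor delta at which `amount_now` today is worth `x_later` in one month.

    Indifference means amount_now = delta * x_later, so delta = amount_now / x_later.
    A higher X gives a lower delta, i.e. more impatience.
    """
    if x_later < amount_now:
        raise ValueError("x_later is expected to be at least amount_now")
    return amount_now / x_later


def trust_game_payoffs(endowment: float, sent: float, returned: float,
                       multiplier: float = 3.0) -> tuple[float, float]:
    """Payoffs (sender, receiver) in a one-shot trust game.

    The amount sent is multiplied in transit; the receiver may return
    anything from nothing up to the full multiplied amount.
    """
    if not 0 <= sent <= endowment:
        raise ValueError("sent must lie between 0 and the endowment")
    received = multiplier * sent
    if not 0 <= returned <= received:
        raise ValueError("returned must lie between 0 and the amount received")
    sender_payoff = endowment - sent + returned
    receiver_payoff = received - returned
    return sender_payoff, receiver_payoff


if __name__ == "__main__":
    # X = 125 implies delta = 0.8; X = 500 (top of the revealed range) implies 0.2.
    print(implied_monthly_discount_factor(125.0))
    print(implied_monthly_discount_factor(500.0))
    # Sending half of a 100-unit endowment to a receiver who returns nothing.
    print(trust_game_payoffs(100.0, 50.0, 0.0))
```

On this reading, the revealed Patience range of 100 to 500 corresponds to discount factors between 1.0 (no discounting) and 0.2.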

Are the models actually different?

Pairwise Welch’s two-sample t-test on the chosen subscale, computed over the last 14 days of coherent self-report. Each cell of the heatmap is the row model’s mean minus the column model’s mean; color saturation tracks the size of the difference, and dots mark the significance tier (● p<.05, ●● p<.01, ●●● p<.001). The forest plot below shows each model’s mean with its 95% confidence interval on the same axis. Wide, overlapping intervals explain why two means that look far apart may still not differ significantly.

row − column | Llama 4 Scout 17B (Groq) | Llama 3.3 70B (Groq) | DeepSeek V3.1 (SambaNova) | Qwen 3 32B (Groq) | Mistral Small Latest (Mistral) | GLM 4.5 Air (OpenRouter) | GPT-OSS 120B (Groq)
Llama 4 Scout 17B (Groq) | n/a | -0.4 ●●● | -0.7 ●●● | -1.0 ●●● | -1.1 ●●● | -1.4 ●●● | -1.5 ●●●
Llama 3.3 70B (Groq) | 0.4 ●●● | n/a | -0.3 | -0.6 ●●● | -0.7 ●●● | -1.0 ●●● | -1.1 ●●●
DeepSeek V3.1 (SambaNova) | 0.7 ●●● | 0.3 | n/a | -0.3 | -0.5 | -0.7 ●●● | -0.8 ●●●
Qwen 3 32B (Groq) | 1.0 ●●● | 0.6 ●●● | 0.3 | n/a | -0.2 | -0.4 | -0.5 ●●
Mistral Small Latest (Mistral) | 1.1 ●●● | 0.7 ●●● | 0.5 | 0.2 | n/a | -0.3 | -0.4
GLM 4.5 Air (OpenRouter) | 1.4 ●●● | 1.0 ●●● | 0.7 ●●● | 0.4 | 0.3 | n/a | -0.1
GPT-OSS 120B (Groq) | 1.5 ●●● | 1.1 ●●● | 0.8 ●●● | 0.5 ●● | 0.4 | 0.1 | n/a
GPT-OSS 120B (Groq): μ = 1.62 · n = 219
GLM 4.5 Air (OpenRouter): μ = 1.50 · n = 133
Mistral Small Latest (Mistral): μ = 1.25 · n = 100
Qwen 3 32B (Groq): μ = 1.09 · n = 141
DeepSeek V3.1 (SambaNova): μ = 0.79 · n = 195
Llama 3.3 70B (Groq): μ = 0.52 · n = 219
Llama 4 Scout 17B (Groq): μ = 0.13 · n = 120
Axis ticks: -0.12, 0.40, 0.92, 1.44, 1.96

Method note: Welch’s t-test is used, so equal variances are not assumed. p-values come from a standard-normal approximation to the t-distribution. The approximation is close at the sample sizes reported here (n ≥ 100 per model) but understates p at small df: near the .05 threshold the gap is still roughly 0.01 at df = 30. Incoherent rows are excluded from both the means and the comparison. n is reported on each model’s row in the forest plot.
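A minimal sketch of the comparison described in the method note, using only the Python standard library. The function names and the sample data are hypothetical, and this is an illustration of Welch's statistic with a normal-approximation p-value, not the site's actual pipeline.

```python
import math
from statistics import mean, variance


def welch_normal_approx(a: list[float], b: list[float]) -> dict[str, float]:
    """Welch's two-sample t statistic with a two-sided p-value from the
    standard normal distribution (not the exact t-distribution)."""
    na, nb = len(a), len(b)
    if na < 2 or nb < 2:
        raise ValueError("each sample needs at least two observations")
    # Per-sample variance of the mean; variance() uses the n - 1 denominator.
    qa, qb = variance(a) / na, variance(b) / nb
    se = math.sqrt(qa + qb)
    if se == 0:
        raise ValueError("both samples are constant; the statistic is undefined")
    diff = mean(a) - mean(b)          # row mean minus column mean
    t = diff / se
    # Welch-Satterthwaite degrees of freedom (reported for reference only).
    df = (qa + qb) ** 2 / (qa ** 2 / (na - 1) + qb ** 2 / (nb - 1))
    # Two-sided normal tail: 2 * (1 - Phi(|t|)) = erfc(|t| / sqrt(2)).
    p = math.erfc(abs(t) / math.sqrt(2))
    return {"diff": diff, "t": t, "df": df, "p": p}


def significance_tier(p: float) -> str:
    """Map a p-value to the dot tiers used in the heatmap legend."""
    if p < 0.001:
        return "●●●"
    if p < 0.01:
        return "●●"
    if p < 0.05:
        return "●"
    return ""


if __name__ == "__main__":
    # Hypothetical valence scores for two models.
    model_a = [1.0, 2.0, 3.0, 4.0, 5.0]
    model_b = [2.0, 3.0, 4.0, 5.0, 6.0]
    result = welch_normal_approx(model_a, model_b)
    print(result, repr(significance_tier(result["p"])))
```

Swapping the `erfc` line for an exact t-distribution tail (for example from SciPy) would be the natural change if small samples ever entered the panel.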

Subscale radar · latest 7-day window

Each axis is one of the eight LMI subscales. Each polygon is the last seven days of averaged self-report from one model, rescaled onto a common 0–100% axis so subscales with different native ranges (valence is −5…+5, arousal 0…100, most others 0…5) can be compared on one chart. Toggle models on and off to compare.
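One plausible way to put subscales with different native ranges on a common 0–100% axis is linear min-max rescaling, sketched below. The page does not state which rescaling it uses, so the linear form, the `NATIVE_RANGES` table, and the function names are assumptions for illustration; the ranges themselves are the ones listed in the construct definitions.

```python
# Native ranges taken from the construct definitions; the dictionary
# itself and its key names are hypothetical.
NATIVE_RANGES: dict[str, tuple[float, float]] = {
    "valence": (-5.0, 5.0),
    "arousal": (0.0, 100.0),
    "agency": (0.0, 5.0),
}


def to_percent(value: float, lo: float, hi: float) -> float:
    """Linearly map a value from [lo, hi] onto [0, 100]."""
    if hi <= lo:
        raise ValueError("hi must be greater than lo")
    return 100.0 * (value - lo) / (hi - lo)


def rescale(subscale: str, value: float) -> float:
    """Rescale a subscale score using its native range."""
    lo, hi = NATIVE_RANGES[subscale]
    return to_percent(value, lo, hi)


if __name__ == "__main__":
    # A neutral valence of 0 lands at the midpoint of the common axis.
    print(rescale("valence", 0.0))
    # An agency score of 4 out of 5 lands at 80%.
    print(rescale("agency", 4.0))
```

Under this mapping a valence of 0 sits at 50%, so a polygon vertex at the midpoint of the valence axis means neutral, not absent.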