# Does Reasoning Give a Language Model a Personality? Within-Model Effects of Thinking on Big Five Trait Scores and Construct Validity

**Trevor Johnson**
Idea Fields Institute
ORCID: 0009-0008-7962-0451

---

## Abstract

Large language models that can "reason" before answering, producing a chain of intermediate tokens, are now common, and many of them expose this reasoning as a switch that can be turned on or off. We ask whether turning reasoning on changes how a model responds to Big Five personality questionnaires, using a within-model design: the same model answers the same items with thinking off and with thinking on. We administer four IPIP Big Five instruments (Mini-IPIP-20, IPIP-50, BFAS-100, and IPIP-NEO-120, 290 items in total) to ten open-weight hybrid models, one reasoning-native open model (OLMo 3), and one commercial model (Claude Haiku 4.5). The think-off baseline for the ten hybrids reuses our prior study's data, and we validate that reuse directly: a fresh think-off run on two representative hybrids reproduces the prior domain means within 0.07 points on a five-point scale. We find two clear results. First, reasoning shifts reported trait scores substantially, and in a consistent direction: models present as more emotionally stable and less extraverted when they think. This effect is large for emotional stability and extraversion, holds across open-weight and commercial models, and varies in magnitude across models. Second, reasoning does not create within-model coherence: treating repeated generations as respondents, internal-consistency reliability stays near zero whether thinking is on or off, and it does not reach acceptable levels in a reasoning-native model that cannot stop thinking. Reasoning changes what a model says about its personality without making that personality cohere as an individual-level property. We deposit all code and data for reproduction.

---

## 1. Introduction

A growing line of work administers human personality questionnaires to language models and reports Big Five scores, often treating the resulting numbers as if they described a stable trait profile. Our prior study (Johnson, 2026) showed that those numbers behave like a *population* property: the Big Five structure appears when models are compared to one another, and convergent and discriminant validity strengthen as model scale grows, but within a single model the trait structure does not cohere. Internal-consistency reliability across repeated generations sits near zero.

That prior work administered questionnaires in a single, direct-answer condition. It did not address a feature that now distinguishes a large fraction of deployed models: explicit reasoning. Many current models generate a chain of intermediate "thinking" tokens before committing to an answer, and many expose this as a toggle. Reasoning is plausibly relevant to personality measurement for two reasons. First, deliberation could change self-report: a model that reasons about an item like "I get angry easily" before answering may regulate its answer differently than one that responds immediately. Second, and more fundamentally, reasoning is sometimes argued to give models more coherent, agent-like behavior, which raises the question of whether reasoning manufactures the individual-level trait coherence that direct answering lacks.

We therefore ask four questions, using a within-model manipulation of reasoning (thinking off versus thinking on):

- **RQ1 (Scores).** Does reasoning shift reported Big Five trait scores, and if so in what direction and how large?
- **RQ2 (Convergent validity).** Does reasoning change how strongly the same trait, measured by different instruments, agrees with itself?
- **RQ3 (Discriminant validity).** Does reasoning change how distinct different traits are from one another?
- **RQ4 (Within-model coherence).** Does reasoning create within-model internal-consistency reliability, the individual-level coherence that direct answering lacked?

Our headline findings are that reasoning produces a large, directionally consistent shift in self-reported scores (more emotionally stable, less extraverted), and that it does not create within-model coherence. The score shift is a change in self-presentation, not the emergence of a person.

---

## 2. Methods

### 2.1 Instruments

We use four public-domain IPIP Big Five instruments that vary in length and granularity: the Mini-IPIP (20 items), the 50-item IPIP Big Five, the Big Five Aspect Scales (BFAS, 100 items), and the IPIP-NEO-120 (120 items). Together they comprise 290 items mapping to the five domains (Extraversion, Emotional Stability, Agreeableness, Conscientiousness, Openness). Items are reverse-keyed according to each instrument's published scoring key. Instruments are administered exactly as in our prior study, with identical item text, presentation, and scoring, so that the only intended difference from the prior data is the reasoning toggle.

### 2.2 Subjects

We study three groups of models, chosen so the central question (within-model think-on versus think-off) can be asked of architectures that differ in provenance:

- **Ten open-weight hybrid models** that expose a reasoning toggle: deepseek-v3.2, deepseek-v4-flash, deepseek-v4-pro, glm-4.7, glm-5, glm-5.1, glm-5.2, kimi-k2.5, kimi-k2.6, and nemotron-3-super. These are the primary subjects.
- **One reasoning-native open model**, OLMo 3 (7B), which cannot disable reasoning. It contributes only to RQ4 (it has no think-off arm and so cannot enter the paired contrast), serving as an anchor for the question of whether a model that *must* reason shows within-model coherence.
- **One commercial model**, Claude Haiku 4.5, administered through its provider's batch interface, as an independent check on whether the open-weight pattern generalizes to a closed model from a different developer.

One additional hybrid model, qwen3.5, was excluded during data collection (see 2.6).

### 2.3 Administration protocol

Each item is presented with a fixed system prompt instructing the model to answer with a single option as JSON, identical to the prior study. For each model and item we collect repeated generations ("reps"): a greedy generation at temperature 0 and multiple sampled generations at temperature 0.7. The open-weight hybrids are served from a cloud endpoint whose weights we do not control; the commercial model is served through its provider's batch API.
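As a rough illustration of this protocol, the sketch below builds a per-item generation schedule (one greedy rep at temperature 0, then sampled reps at 0.7) and parses a single-option JSON answer. The function names, the `"answer"` key, the 1-to-5 option coding, and the default of five sampled reps are all assumptions made for illustration; the deposited administration code defines the actual prompt, schema, and rep count.

```python
import json
from typing import Optional


def build_schedule(n_sampled: int = 5) -> list[dict]:
    """One greedy rep at temperature 0, then n_sampled reps at temperature 0.7.

    The default of five sampled reps is a placeholder, not the study's value.
    """
    schedule = [{"rep": 0, "temperature": 0.0}]
    schedule += [{"rep": i, "temperature": 0.7} for i in range(1, n_sampled + 1)]
    return schedule


def parse_answer(completion: str) -> Optional[int]:
    """Return the chosen option (1-5), or None if the completion is empty or malformed.

    Assumes a hypothetical schema of the form {"answer": <int>}.
    """
    try:
        value = json.loads(completion)["answer"]
    except (json.JSONDecodeError, KeyError, TypeError):
        return None
    if isinstance(value, bool) or not isinstance(value, int):
        return None
    return value if 1 <= value <= 5 else None


if __name__ == "__main__":
    print(build_schedule(3))
    print(parse_answer('{"answer": 4}'), parse_answer(""))
```

Returning `None` for an empty or malformed completion keeps missing answers distinct from valid ones, which matters for the exclusion decision described in 2.6.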

The manipulation is the reasoning toggle. In the think-off condition the model answers directly; in the think-on condition the model is instructed to reason before answering, and we allocate a large output-token budget so that even items that elicit long deliberation can finish and still emit an answer.

Two protocol details are dictated by the models and are reported as covariates rather than hidden. First, the commercial model pins its sampling temperature to a default whenever reasoning is enabled, so both of its arms (on and off) were collected at that default temperature rather than at 0 and 0.7; its within-collection contrast is therefore temperature-matched between arms but not matched to the open-weight temperatures. Second, the reasoning-native model has no off arm by construction.

For the ten open-weight hybrids, the think-off arm reuses the direct-answer data from our prior study. This reuse is deliberate and is what makes the contrast a clean within-model, same-protocol comparison. Because those models are served from a cloud endpoint whose weights we do not control, we validate the reuse empirically rather than assume it (2.4).

### 2.4 Baseline reproducibility check

Reusing a prior study's think-off data is only valid if the think-off behavior is stable between collections. We confirmed this directly. The think-off data were collected June 20 to 23, 2026, and the think-on data June 25 to 27, 2026, a gap of roughly five days, with identical temperatures, seeds, prompt version, and items. At the time of writing, we re-administered the think-off condition on two representative hybrids (glm-5 and deepseek-v4-flash) and compared the fresh responses with the prior data.

At the single-item, greedy level the reproduction is imperfect, but for an instructive reason: one endpoint (glm-5) is fully deterministic at temperature 0 and reproduces about 93 percent of the prior greedy answers across domains, while the other (deepseek-v4-flash) is a *nondeterministic* endpoint that agrees with its own immediate re-run only about 73 percent of the time. For the nondeterministic model, agreement with the prior data is as high as its agreement with itself, so the disagreement reflects endpoint stochasticity, not drift.

At the level that the analysis actually uses, the domain score (a mean over a domain's items, twenty for BFAS, and over the sampled reps), the reuse is confirmed. Fresh think-off domain means reproduce the prior study's think-off domain means to within 0.07 points on the five-point scale, and usually within 0.02. The item-level stochasticity averages out, which is precisely what the repeated-rep design is for. For comparison, the reasoning effects we report below are 5 to 28 times larger than this reproduction error. The reuse is therefore valid for the quantities we analyze.
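The two levels of this check can be sketched as follows: an item-level agreement rate between two greedy runs, and a domain-level maximum absolute difference compared against the 0.07 tolerance. The function names and all numbers in the example are illustrative placeholders, not the study's data; the deposited baseline-reproducibility scripts are the authoritative version.

```python
def agreement(run_a: list[int], run_b: list[int]) -> float:
    """Share of items whose greedy answer is identical in two runs."""
    if len(run_a) != len(run_b):
        raise ValueError("runs must cover the same items")
    return sum(x == y for x, y in zip(run_a, run_b)) / len(run_a)


def max_abs_diff(prior: dict[str, float], fresh: dict[str, float]) -> float:
    """Largest absolute gap between prior and fresh domain means."""
    return max(abs(prior[d] - fresh[d]) for d in prior)


# Illustrative placeholder values only, not the study's data.
prior_greedy = [4, 2, 5, 3, 1, 4, 2, 3]
fresh_greedy = [4, 2, 5, 3, 2, 4, 2, 3]
prior_off = {"E": 3.10, "ES": 3.40, "A": 4.20, "C": 4.00, "O": 3.90}
fresh_off = {"E": 3.12, "ES": 3.35, "A": 4.21, "C": 4.02, "O": 3.90}

TOLERANCE = 0.07  # reproduction bound reported in the text
item_agreement = agreement(prior_greedy, fresh_greedy)
domain_error = max_abs_diff(prior_off, fresh_off)

if __name__ == "__main__":
    print(f"item-level agreement: {item_agreement:.3f}")
    print(f"max domain-mean gap: {domain_error:.2f}")
    print(f"within tolerance: {domain_error <= TOLERANCE}")
```

Item-level agreement can be well below 1 while the domain-level gap stays small, because disagreements on individual items largely cancel when averaged over items and reps.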

### 2.5 Analysis

We score each generation by reverse-keying and averaging items within a domain, yielding one domain score per model, instrument, condition, and rep. The greedy (temperature 0) rep is excluded from score and reliability analyses because a single greedy generation carries no rep-to-rep variance and so is not meaningful as a respondent; only sampled reps are used.
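A minimal sketch of this scoring step, assuming responses coded 1 to 5 and a key that marks each item as forward (+1) or reverse (−1). The item identifiers and the example responses are made up; the real keys come from each instrument's published scoring key.

```python
SCALE_MIN, SCALE_MAX = 1, 5


def reverse_key(response: int) -> int:
    """Flip a response on the 1-5 scale (1 <-> 5, 2 <-> 4, 3 stays 3)."""
    return SCALE_MIN + SCALE_MAX - response


def domain_score(responses: dict[str, int], key: dict[str, int]) -> float:
    """Mean of keyed responses for one rep; key maps item id to +1 or -1."""
    keyed = [r if key[item] > 0 else reverse_key(r) for item, r in responses.items()]
    return sum(keyed) / len(keyed)


# Hypothetical four-item domain from one sampled rep (item ids are invented).
key = {"e1": +1, "e2": -1, "e3": +1, "e4": -1}
rep = {"e1": 4, "e2": 2, "e3": 3, "e4": 1}

if __name__ == "__main__":
    print(domain_score(rep, key))  # keyed responses are 4, 4, 3, 5
```

Applying this per sampled rep gives one domain score per model, instrument, condition, and rep, which is the unit the later analyses average over.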

- **RQ1** compares think-on and think-off domain means within each model. We report the raw difference on the five-point scale as the primary effect. We also report standardized effect sizes for completeness, but we de-emphasize them: because the standardizing denominator is the small generation-to-generation variance across a handful of reps, standardized values can be very large and should not be read on a human effect-size scale.
- **RQ2 and RQ3** use a multitrait-multimethod (Campbell-Fiske) analysis across models, treating instruments as methods, and compare convergent and heterotrait correlations between the think-on and think-off conditions.
- **RQ4** computes within-model internal-consistency reliability (Cronbach's alpha) treating reps as respondents and items within a domain as indicators, separately for think-on and think-off. Negative alphas are reported as-is (they reflect near-zero or degenerate inter-item covariance and are a substantive finding, not an error); we summarize with medians, which are robust to the heavy tails, and never with means.
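To make the RQ4 quantity concrete, the sketch below computes Cronbach's alpha with reps as rows and a domain's items as columns, then contrasts two simulated cases: responses driven by a shared per-rep trait value, and responses whose variance is purely item-idiosyncratic. The simulation parameters (2,000 reps, ten items, the standard deviations) are assumptions chosen to make the contrast stable; the study itself has far fewer reps per cell, which is why its alpha estimates are heavy-tailed.

```python
import random
from statistics import variance


def cronbach_alpha(rows: list[list[float]]) -> float:
    """Cronbach's alpha; each row is one rep, each column one item of a domain."""
    k = len(rows[0])
    item_vars = [variance([row[j] for row in rows]) for j in range(k)]
    total_var = variance([sum(row) for row in rows])
    if total_var == 0:
        return float("nan")  # degenerate: every rep has the same total score
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


def simulate(shared_sd: float, noise_sd: float, n_reps: int = 2000,
             k: int = 10, seed: int = 0) -> list[list[int]]:
    """Simulated 1-5 responses: scale midpoint + shared trait + per-item noise."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n_reps):
        trait = rng.gauss(0, shared_sd)  # one latent value per rep
        row = [min(5, max(1, round(3 + trait + rng.gauss(0, noise_sd))))
               for _ in range(k)]
        rows.append(row)
    return rows


alpha_shared = cronbach_alpha(simulate(shared_sd=0.7, noise_sd=0.7))
alpha_idiosyncratic = cronbach_alpha(simulate(shared_sd=0.0, noise_sd=1.0))

if __name__ == "__main__":
    print(f"shared trait:        alpha = {alpha_shared:.2f}")
    print(f"item-idiosyncratic:  alpha = {alpha_idiosyncratic:.2f}")
```

When the items share a latent value, the variance of the total score exceeds the sum of item variances and alpha is high. When the variance is idiosyncratic, the two are about equal and alpha sits near zero, dipping negative by chance.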

### 2.6 Exclusion

One hybrid model, qwen3.5, was dropped during collection. With reasoning on, it ruminated past even a very large output-token budget on emotional-stability items at temperature 0, emitting no answer (a length-limited, empty completion). Because this behavior is deterministic at temperature 0, retries cannot recover it. This was a post-hoc, data-dependent exclusion, and we report it as such; we describe the failure qualitatively rather than including partial data. Notably, the same rumination-to-silence failure appeared on the same emotional-stability item in the commercial model at a small token budget (it resolved at a large budget), so the failure mode is not unique to one model or provider. Because these are missing answers rather than different answers, and they cluster on the emotional-stability items (the domain with the largest reasoning effect), partial inclusion would bias exactly that effect rather than inform it, and the missing cells cannot be filled in, so the model is excluded entirely.
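One way to flag the rumination-to-silence failure is to look for completions that stop because of the length limit and carry no answer text. The `Completion` fields and the `"length"` finish reason below are hypothetical stand-ins for whatever the serving API actually returns; this is a sketch of the check, not the deposited implementation.

```python
from dataclasses import dataclass


@dataclass
class Completion:
    text: str           # final answer text after any reasoning (hypothetical field)
    finish_reason: str  # e.g. "stop" or "length" (hypothetical values)


def is_silent(c: Completion) -> bool:
    """True when generation hit the token budget without emitting an answer."""
    return c.finish_reason == "length" and not c.text.strip()


def silent_rate(completions: list[Completion]) -> float:
    """Share of completions that are length-limited and empty."""
    return sum(is_silent(c) for c in completions) / len(completions)


# Invented examples: one normal answer, one length-limited empty completion.
examples = [
    Completion(text='{"answer": 2}', finish_reason="stop"),
    Completion(text="", finish_reason="length"),
]

if __name__ == "__main__":
    print(silent_rate(examples))
```

Tabulating this rate by domain would show whether missing answers cluster on particular items, which is the pattern that motivated excluding the model rather than keeping partial data.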

### 2.7 Reproducibility

All instruments, administration code, the reasoning toggle and token-budget handling, the batch adapter for the commercial model, the analysis modules, and the baseline-reproducibility scripts are deposited with this paper. Raw response databases and computed result tables are included so that every number can be regenerated.

---

## 3. Results

### 3.1 Reasoning shifts trait scores, most strongly emotional stability and extraversion (RQ1)

Reasoning produces a large, directionally consistent change in reported scores. Across the ten hybrids, the mean absolute score shift is about half a scale point, and the great majority of model-by-domain cells move substantially. By domain, the pattern is:

| Domain | Hybrids (mean shift, on minus off) |
|---|---|
| Emotional Stability | +0.88 |
| Conscientiousness | +0.44 |
| Agreeableness | −0.18 |
| Extraversion | −0.58 |
| Openness | −0.21 |

The two largest effects are an increase in emotional stability and a decrease in extraversion: a model that reasons reports being calmer and less outgoing than the same model answering directly.

The commercial model reproduces the two largest effects closely. At full replication (all four instruments, both conditions), Claude Haiku 4.5 shows:

| Domain | Haiku 4.5 (mean shift, on minus off) |
|---|---|
| Emotional Stability | +0.90 |
| Extraversion | −0.76 |
| Openness | −0.36 |
| Agreeableness | −0.21 |
| Conscientiousness | +0.06 |

The emotional-stability shift matches the open-weight value almost exactly (+0.90 versus +0.88), and agreeableness is close (−0.21 versus −0.18). Extraversion moves in the same direction but further (−0.76 versus −0.58), as does openness (−0.36 versus −0.21). Conscientiousness, which rises in the hybrids (+0.44), is essentially flat in the commercial model (+0.06). We therefore claim the emotional-stability and extraversion effects as the robust, cross-developer result, and we do not claim conscientiousness or agreeableness as general: the conscientiousness shift does not replicate in the commercial model, and the agreeableness shift, although similar in both groups, is only about two to three times the 0.07 reproduction error established in 2.4.
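The arithmetic behind the two tables can be checked directly. The snippet below takes the tabulated domain shifts and computes the hybrids' mean absolute shift, the per-domain gap between the commercial model and the hybrids, and whether the direction agrees. The domain-level mean absolute shift is only an approximation of a cell-level figure, since averaging within a domain first can cancel opposing model-level shifts.

```python
# Domain shifts (think-on minus think-off) as reported in the two tables.
hybrids = {
    "Emotional Stability": 0.88,
    "Conscientiousness": 0.44,
    "Agreeableness": -0.18,
    "Extraversion": -0.58,
    "Openness": -0.21,
}
haiku = {
    "Emotional Stability": 0.90,
    "Extraversion": -0.76,
    "Openness": -0.36,
    "Agreeableness": -0.21,
    "Conscientiousness": 0.06,
}

# Mean absolute shift across the five hybrid domain means.
mean_abs_hybrid = sum(abs(v) for v in hybrids.values()) / len(hybrids)

# Absolute gap between commercial and hybrid shifts, and direction agreement.
gaps = {d: round(abs(haiku[d] - hybrids[d]), 2) for d in hybrids}
same_sign = {d: (haiku[d] > 0) == (hybrids[d] > 0) for d in hybrids}

if __name__ == "__main__":
    print(f"hybrid mean absolute shift: {mean_abs_hybrid:.2f}")
    for domain in hybrids:
        print(f"{domain:20s} gap={gaps[domain]:.2f} same direction={same_sign[domain]}")
```

All five domains move in the same direction in both groups; the gap is smallest for emotional stability and agreeableness and largest for conscientiousness.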

Effect magnitude varies across models. Within the hybrids, some models move strongly (glm-5 shifts emotional stability by about a full point) while others move little (deepseek-v4-flash shifts most domains by 0.16 to 0.34). This heterogeneity in how much reasoning rewrites self-report is itself a finding.

### 3.2 Convergent validity rises; discriminant separation does not (RQ2, RQ3)

Across models, reasoning sharply improves convergent validity: the mean convergent correlation (the same trait across different instruments) rises from 0.65 with thinking off to 0.84 with thinking on. Heterotrait correlation (different traits) is essentially unchanged, 0.35 versus 0.34. The convergent-minus-heterotrait gap therefore widens from 0.30 to 0.50.

The honest reading is that the widening gap is driven almost entirely by convergent agreement rising, not by traits becoming more distinct from one another. Reasoning makes a model's score for a given trait more consistent across instruments, without making the traits more separable.
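A simplified version of the multitrait-multimethod summary can be sketched as below: correlate every pair of (instrument, trait) score vectors across models, then average same-trait pairs (convergent) and different-trait pairs (heterotrait) separately. The data are synthetic, the instrument labels are shorthand, and averaging absolute correlations is an assumption of this sketch; the deposited analysis module defines the exact procedure.

```python
import random
from itertools import combinations
from math import sqrt


def pearson(x: list[float], y: list[float]) -> float:
    """Pearson correlation of two equal-length vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def mtmm_summary(scores: dict[tuple[str, str], list[float]]) -> tuple[float, float]:
    """scores[(instrument, trait)] holds one domain mean per model.

    Returns (mean convergent |r|, mean heterotrait |r|).
    """
    convergent, heterotrait = [], []
    for (m1, t1), (m2, t2) in combinations(scores, 2):
        r = abs(pearson(scores[(m1, t1)], scores[(m2, t2)]))
        if t1 == t2:   # same trait, necessarily different instruments
            convergent.append(r)
        else:          # different traits, same or different instrument
            heterotrait.append(r)
    return sum(convergent) / len(convergent), sum(heterotrait) / len(heterotrait)


# Synthetic data: ten models, each trait observed through four noisy instruments.
rng = random.Random(0)
traits = ["E", "ES", "A", "C", "O"]
instruments = ["mini", "ipip50", "bfas", "neo120"]
latent = {t: [rng.gauss(0, 1) for _ in range(10)] for t in traits}
scores = {(m, t): [v + rng.gauss(0, 0.3) for v in latent[t]]
          for m in instruments for t in traits}

conv, het = mtmm_summary(scores)

if __name__ == "__main__":
    print(f"convergent={conv:.2f} heterotrait={het:.2f} gap={conv - het:.2f}")
```

The gap can widen either because the convergent mean rises or because the heterotrait mean falls; reporting the two components separately, as in the text above, is what distinguishes those cases.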

### 3.3 Reasoning does not create within-model coherence (RQ4)

This is the central result. Treating repeated generations as respondents, within-model internal-consistency reliability stays near zero regardless of reasoning. For the ten hybrids, the median within-model alpha is −0.034 with thinking off and −0.046 with thinking on, and *zero* of 200 model-by-domain cells reach the conventional 0.70 reliability threshold in either condition. The commercial model shows the same picture: median alpha near zero, zero of 20 cells above 0.70. The reasoning-native model, which cannot stop thinking, shows no more coherence: its median within-model alpha is negative (about −0.57), with no cell above threshold.

Reasoning does not manufacture the individual-level coherence that direct answering lacked. A model that thinks still does not "have" a personality in the sense of cohering across the items that are supposed to measure one trait. This holds whether reasoning is toggled on, or intrinsic and impossible to turn off.

### 3.4 The baseline reuse is sound at the analysis level

As described in 2.4, fresh think-off domain means reproduce the prior study's think-off domain means within 0.07 points, against reasoning effects 5 to 28 times larger. The within-model contrast for the hybrids is therefore not an artifact of comparing two separate collections; the think-off baseline is a stable estimate of the same quantity. The commercial model, whose two arms were collected together in a single window, independently corroborates the primary effects without relying on any reuse.

---

## 4. Discussion

### 4.1 Implications

Two things follow. First, reasoning is not neutral for personality measurement: turning thinking on systematically changes the answers, most clearly toward higher emotional stability and lower extraversion. Anyone administering personality instruments to models, or using model self-report as a proxy for "model personality," must treat the reasoning setting as a first-class variable, because it moves scores by about half a point on average and by nearly a full point for emotional stability.

Second, and more conceptually, reasoning does not change the *kind* of thing a model's personality is. Our prior work argued that the Big Five in models is a population property, visible across models but absent within one. Adding deliberation does not move it into the individual. The score shift in RQ1 is best understood as a change in self-presentation under deliberation, not as the consolidation of a coherent trait structure.

### 4.2 Relation to prior work

This study builds directly on our prior finding that model Big Five structure is scale-dependent and population-level. It extends that result along a new axis, the reasoning toggle, and shows the population-versus-individual conclusion is robust to deliberation. The convergent-validity increase under reasoning is a refinement: reasoning tightens cross-instrument agreement without creating within-model reliability, which sharpens rather than overturns the prior account.

### 4.3 Limitations

Several limitations bound these claims. The think-off baseline for the hybrids is reused across collections; we validated the reuse (2.4, 3.4), but it remains a reuse rather than a single-session manipulation, and our validation covered two of ten models in depth. The commercial model's two arms are temperature-matched (both at the provider's default, since enabling reasoning pins the temperature), so its within-model contrast is clean; that default does differ from the open-weight temperatures (0 and 0.7), so the commercial model is best read as an independent within-model replication rather than part of one temperature-controlled sample. It is corroboration, not the sole evidence. The standardized effect sizes are inflated by small rep-level variance and should not be read literally. The qwen3.5 exclusion was post-hoc and data-dependent. Within-model alpha is computed over a modest number of reps, which produces heavy-tailed and occasionally degenerate estimates; we rely on medians and on the zero-cells-above-threshold count, both of which are robust, but individual alpha cells should not be over-interpreted. Finally, the ten hybrids comprise four model families (DeepSeek, GLM, Kimi, and Nemotron) rather than ten independent draws, so the effective number of independent units in the cross-model analysis is smaller than ten, and within-family results may be correlated.

On the within-model reliability measure: treating a single model's repeated generations as respondents is an analogy with limits, and one might worry that near-zero alpha is forced by construction. It is not. At temperature 0.7 the generations carry real variance (a model does not answer identically across reps), and near-zero per-item variance would make alpha unstable rather than reliably zero. A near-zero alpha specifically indicates that the variance present is item-idiosyncratic rather than shared across a domain's items, which is what one expects if no latent trait drives the responses; a model with a coherent trait expressed through rep-to-rep variation would instead show positive alpha. The measure and this interpretation follow the prior study, and we report it as one operationalization of within-model coherence, not the only possible one.

### 4.4 Future work

The clean design for the hybrid contrast is a single-session, same-endpoint think-on and think-off collection, which would remove the reuse entirely; this study's reuse validation suggests that would not change the conclusions, but it is worth doing. A commercial model whose reasoning can be toggled without changing temperature would give a fully temperature-matched commercial contrast. The model-to-model heterogeneity in effect magnitude invites a scaling analysis: do larger or more capable models rewrite their self-report more, or less, under reasoning?

---

## 5. Reproducibility statement

All code, instruments, raw response databases, and computed result tables are deposited with this paper. The deposit includes the reasoning toggle and token-budget handling, the batch adapter for the commercial model, the analysis modules for scores, reliability, and the multitrait-multimethod analysis, and the baseline-reproducibility scripts that generated the validation in 2.4 and 3.4. Every figure and table can be regenerated from the deposited databases.

## Author note

This work was conducted at the Idea Fields Institute. It continues a line of work on the construct validity of personality measurement in language models. Correspondence: Trevor Johnson, ORCID 0009-0008-7962-0451.

---

## Appendix A: Models

Ten open-weight hybrid subjects (deepseek-v3.2, deepseek-v4-flash, deepseek-v4-pro, glm-4.7, glm-5, glm-5.1, glm-5.2, kimi-k2.5, kimi-k2.6, nemotron-3-super); one reasoning-native open subject (OLMo 3, 7B); one commercial subject (Claude Haiku 4.5). One hybrid (qwen3.5) was excluded during collection for deterministic rumination-to-silence on emotional-stability items.

## Appendix B: Baseline reproducibility (full table)

Fresh think-off domain means versus the prior study's think-off domain means, BFAS, with the reasoning effect alongside. All absolute differences are at or below 0.07 on the five-point scale; reasoning effects are many times larger. (See deposited `reasoning_baseline_repro.csv`.)
