Iris | Mitchell Lab

The Problem

Expert UX evaluation doesn't scale.

Thorough HCI evaluation requires deep expertise across usability, accessibility, cognitive load, visual design, and behavioural psychology, disciplines rarely mastered by a single person. Evaluations involving multiple evaluators and frameworks require several days to weeks of effort, and specialist engagements can cost thousands to tens of thousands of dollars.

Iris deploys ten specialist AI agents simultaneously, each grounded in a distinct, peer-reviewed HCI framework, and returns colour-coded criterion matrices, severity scores, and a cross-framework composite score in under 90 seconds.

Three technical contributions distinguish Iris: a markdown-driven agent specification system for framework-agnostic extensibility; an interaction evidence pipeline that captures live CSS pseudo-class rules and keyboard focus screenshots via Playwright; and a cross-framework normalisation scheme mapping five heterogeneous scoring scales to a common 0–100 composite.

At a Glance

Specialist frameworks 10

Evaluator agents (parallel) 10

Input types 3

Total criteria evaluated 84

HTML-sensitive criteria 38

<90s per audit

$0.03 per audit

How It Works

Three steps to a full audit.

Input your UI

Paste a live URL, upload a screenshot, or drop in HTML source. For URLs and HTML, Iris launches a headless Playwright browser session that:

→ Captures a full-page screenshot at 1440px width
→ Extracts all interactive CSS pseudo-class rules (:hover, :focus, :active, :invalid, :valid, :checked…)
→ Takes a keyboard focus-state screenshot (Tab × 2) to verify visible focus rings
→ Returns the fully-rendered DOM including ARIA attributes, heading structure, and label associations
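The pseudo-class extraction step can be illustrated with a minimal sketch. It assumes the stylesheet text has already been collected from the page and covers only the filtering step; it does not reproduce the Playwright session itself. The names `INTERACTIVE_PSEUDOS`, `RULE_RE`, and `interactive_rules` are hypothetical, and the regex handles flat rules only (no nested at-rules).

```python
import re

# Pseudo-classes named in the list above. The source list ends with an
# ellipsis, so the real set is longer; this tuple is illustrative only.
INTERACTIVE_PSEUDOS = (":hover", ":focus", ":active", ":invalid", ":valid", ":checked")

# Matches "selector { declarations }" for flat, non-nested rules.
RULE_RE = re.compile(r"([^{}]+)\{([^{}]*)\}")

def interactive_rules(css_text: str) -> list[tuple[str, str]]:
    """Return (selector, declarations) pairs whose selector uses an interactive pseudo-class."""
    rules = []
    for selector, body in RULE_RE.findall(css_text):
        selector = selector.strip()
        # Substring match, so ":focus" also catches ":focus-visible" and ":focus-within".
        if any(pseudo in selector for pseudo in INTERACTIVE_PSEUDOS):
            rules.append((selector, " ".join(body.split())))
    return rules

sample = """
a { color: blue; }
a:hover { text-decoration: underline; }
button:focus { outline: none; }
input:invalid { border-color: red; }
"""

for selector, declarations in interactive_rules(sample):
    print(selector, "->", declarations)
```

A rule such as `button:focus { outline: none; }` is the kind of evidence an evaluator could weigh against the focus-state screenshot.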

10 agents evaluate in parallel

An Orchestrator Agent (Claude Haiku) uses tool use to dispatch all ten EvaluatorAgents in a single API round-trip. A concurrency semaphore (max 4 in flight) prevents rate-limit exhaustion. Each agent receives the full input bundle (screenshot, focus screenshot, rendered DOM, and interactive CSS) and evaluates it independently against its framework. Failed evaluations are retried with exponential backoff (5s, 10s, 20s). Results stream back to the client via SSE/NDJSON as each framework completes.
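The dispatch pattern described above (bounded concurrency, backoff retries, results emitted as each framework finishes) might look roughly like the following asyncio sketch. It is not the Iris implementation: `evaluate_with_retry`, `run_audit`, and the stub evaluator are hypothetical names, the evaluator call is a stand-in for a real model request, and treating the three quoted delays as three retries is an assumption.

```python
import asyncio
import json

MAX_IN_FLIGHT = 4              # concurrency cap quoted above
BACKOFF_SECONDS = (5, 10, 20)  # retry delays quoted above

async def evaluate_with_retry(name, evaluate, bundle, semaphore, sleep=asyncio.sleep):
    """Run one evaluator under the semaphore; retry after each backoff delay."""
    for delay in (*BACKOFF_SECONDS, None):  # None marks the final attempt
        try:
            async with semaphore:  # the slot is released before any backoff sleep
                return {"framework": name, "result": await evaluate(bundle)}
        except Exception as exc:
            if delay is None:
                return {"framework": name, "error": str(exc)}
            await sleep(delay)

async def run_audit(evaluators, bundle, sleep=asyncio.sleep):
    """Yield one NDJSON line per framework, in completion order."""
    semaphore = asyncio.Semaphore(MAX_IN_FLIGHT)
    tasks = [
        asyncio.create_task(evaluate_with_retry(name, fn, bundle, semaphore, sleep))
        for name, fn in evaluators.items()
    ]
    for finished in asyncio.as_completed(tasks):
        yield json.dumps(await finished)

async def demo():
    async def stub(bundle):  # stand-in for a real evaluator call
        return {"score": 72.0}
    async for line in run_audit({"nielsen": stub, "wcag": stub}, bundle={}):
        print(line)

asyncio.run(demo())
```

Sleeping outside the `async with` block means a failing evaluator does not hold one of the four slots while it waits to retry.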

Normalised results & composite grade

Five heterogeneous scoring scales (0–4↓ severity, 1–10↑, 1–10↓, pass/fail, and 0–4↓ ability) are normalised to a common 0–100 range. Per-criterion severity labels (None / Low / Medium / High / Critical) are assigned by each evaluator. Framework means are averaged into a composite score mapping to a letter grade: A+ (≥90), A (≥80), B (≥70), C (≥60), D (≥50), F (<50).

Scoring

One composite score from five scales.

The 10 frameworks use five heterogeneous raw scoring schemes. Iris normalises all raw scores to a common 0–100 range, averages within each framework, then averages across frameworks to produce a single composite score and letter grade.

Normalisation Functions

Nielsen & Ability Heuristics, 0–4, lower is better

$$N_{\text{Niel}}(s) = \frac{4 - s}{4} \times 100$$

s = 0 (no problem) → score 100 · s = 4 (catastrophe) → score 0

Shneiderman, Visual Design, HEART, Honeycomb, Norman, Fogg: 1–10, higher is better

$$N_{\text{std}}(s) = \frac{s - 1}{9} \times 100$$

s = 1 → score 0 · s = 10 → score 100

Cognitive Load, 1–10, lower is better

$$N_{\text{CLT}}(s) = \frac{10 - s}{9} \times 100$$

s = 1 (minimal load) → score 100 · s = 10 (overload) → score 0

WCAG 2.1, binary pass/fail

$$N_{\text{WCAG}}(x) = \begin{cases} 100, & \text{if } x \text{ passes} \\ 0, & \text{if } x \text{ fails} \end{cases}$$

Each criterion flip is worth 100/15 ≈ 6.7 points, making WCAG disproportionately sensitive to sampling temperature at T > 0.
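The four normalisation functions translate directly into code. The sketch below uses hypothetical function names and checks the endpoint values quoted above; the last lines reproduce the 100/15 sensitivity figure for a single WCAG criterion flip.

```python
def norm_nielsen(s: float) -> float:
    """0-4 severity, lower is better (Nielsen, Ability Heuristics)."""
    return (4 - s) / 4 * 100

def norm_std(s: float) -> float:
    """1-10 rating, higher is better."""
    return (s - 1) / 9 * 100

def norm_clt(s: float) -> float:
    """1-10 cognitive load, lower is better."""
    return (10 - s) / 9 * 100

def norm_wcag(passed: bool) -> float:
    """Binary pass/fail."""
    return 100.0 if passed else 0.0

# Endpoints of each scale map to 0 and 100.
print(norm_nielsen(0), norm_nielsen(4))
print(norm_std(1), norm_std(10))
print(norm_clt(1), norm_clt(10))

# One flipped criterion out of 15 moves the WCAG framework mean by 100/15.
all_pass = sum(norm_wcag(True) for _ in range(15)) / 15
one_fail = (sum(norm_wcag(True) for _ in range(14)) + norm_wcag(False)) / 15
print(round(all_pass - one_fail, 1))
```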

Aggregation & Grading

Step 1: Framework mean (over K criteria)

$$\bar{F} = \frac{1}{K} \sum_{k=1}^{K} N(s_k)$$

Step 2: Composite score (over M frameworks)

$$C = \frac{1}{M} \sum_{m=1}^{M} \bar{F}_m \in [0, 100]$$

Grade thresholds

A+	C ≥ 90	Exceptional
A	C ≥ 80	Strong across all dimensions
B	C ≥ 70	Good; targeted gaps remain
C	C ≥ 60	Adequate; structural fixes needed
D	C ≥ 50	Significant issues
F	C < 50	Fundamental failures

Normalisation ensures 100 always represents the theoretical best outcome under a given framework, enabling direct comparison of, for example, a WCAG pass rate against a Nielsen severity profile. Per-criterion severity labels (None / Low / Medium / High / Critical) are assigned independently by each evaluator agent.
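The two aggregation steps and the grade thresholds can be sketched as below. Function names are hypothetical, and the severity ratings in the example are made-up illustrative inputs rather than Iris output.

```python
from statistics import mean

# (minimum composite, grade), checked from the top down.
GRADE_THRESHOLDS = [(90, "A+"), (80, "A"), (70, "B"), (60, "C"), (50, "D")]

def framework_mean(normalised_scores: list[float]) -> float:
    """Step 1: mean of the normalised criterion scores within one framework."""
    return mean(normalised_scores)

def composite(framework_means: list[float]) -> float:
    """Step 2: unweighted mean of the framework means."""
    return mean(framework_means)

def letter_grade(c: float) -> str:
    """Map a composite score in [0, 100] to a letter grade."""
    for threshold, grade in GRADE_THRESHOLDS:
        if c >= threshold:
            return grade
    return "F"

# Illustrative only: four 0-4 severity ratings (0, 1, 2, 4) after normalisation.
example_framework = framework_mean([100.0, 75.0, 50.0, 0.0])
print(example_framework)

# Illustrative only: two framework means combined into a composite.
c = composite([example_framework, 90.0])
print(c, letter_grade(c))
```

Because every framework mean carries equal weight, a 15-criterion framework and a 5-criterion framework contribute the same amount to the composite.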

The 10 Frameworks

Why 10 frameworks, not one?

No single framework sees everything. WCAG misses persuasion. Nielsen misses accessibility quality. Fogg misses visual hierarchy. Running all 10 together surfaces blind spots that any single evaluation would leave hidden, and cross-framework agreement on an issue is a reliable signal of severity.

Coverage

Usability, accessibility, cognitive load, visual design, persuasion, ability-aware design, all measured in a single pass with 84 criteria across 10 disciplines.

Cross-validation

When 7 of 10 frameworks flag the same issue, you know it's real, not a quirk of one evaluator's lens. Multi-framework consensus is the strongest severity signal.

Prioritisation

The composite grade and per-framework scores show which dimensions to fix first. Iterating on Iris output alone took a Grade B design to Grade A in our validity study.

01 · Usability Visual + Code · 10 criteria

0–4↓

Nielsen's 10 Usability Heuristics

The industry-standard usability checklist. Visibility of system status, error prevention, flexibility, recognition rather than recall. Four criteria use live CSS and ARIA evidence (focus rings, :invalid states, skip links).

Nielsen, 1994 ↗

02 · Usability Visual + Code · 8 criteria

1–10↑

Shneiderman's 8 Golden Rules

Interaction principles emphasising consistency, informative feedback, and user control. CSS transition/animation rules serve as proxy evidence of dynamic feedback; :invalid/:valid CSS and pattern attributes assess error prevention.

Shneiderman, 1987 ↗

03 · Accessibility Visual + Code · 15 criteria

Pass/Fail

WCAG 2.1 (Level AA)

Covers the four WCAG principles: Perceivable, Operable, Understandable, Robust. All 15 criteria are HTML-sensitive, making this the only framework with uniform code dependence. The focus-state screenshot directly supports WCAG 2.4.7 Focus Visible; extracted CSS detects outline:none violations.

W3C, 2018 ↗

04 · Cognitive Visual + Code · 8 criteria

1–10↓

Cognitive Load Assessment

Intrinsic, extraneous, and germane load based on Sweller's CLT. Progressive disclosure assessed via aria-expanded, details/summary, and tab panel patterns. Miller's Law (7±2) informs the chunking criterion.

Sweller, 1988 ↗

05 · Visual Visual only · 10 criteria

1–10↑

Visual Design Principles

Gestalt psychology, hierarchy, contrast, alignment, proximity, similarity, white space, typography, colour harmony, figure-ground, and compositional balance. Screenshot-only; these properties cannot be inferred from source code.

Visual Design Principles ↗

06 · UX Goals Visual only · 5 criteria

1–10↑

Google HEART Framework

Happiness, Engagement, Adoption, Retention, Task Success. Applied inferentially from interface signals; scores represent predictive rather than measured assessments.

Rodden et al., 2010 ↗

07 · UX Goals Visual + Code · 7 criteria

1–10↑

UX Honeycomb

Morville's 7 facets: Useful, Usable, Desirable, Findable, Accessible, Credible, Valuable. The Accessible facet benefits from ARIA landmark and heading analysis. Credible and Valuable are assessed from trust signals and value proposition clarity.

Morville, 2004 ↗

08 · Design Theory Visual + Code · 7 criteria

1–10↑

Don Norman's Design Principles

Affordances, signifiers, mappings, constraints, feedback, and discoverability. CSS cursor:pointer rules serve as proxy evidence of affordance quality; transition/animation declarations indicate feedback provision.

Norman, The Design of Everyday Things, 1988 ↗

09 · Persuasion Visual only · 5 criteria

1–10↑

Fogg Behavior Model

Motivation, Ability (simplicity), and Prompts: does the interface drive target user actions? Evaluates call-to-action prominence, friction elimination, and reward signals.

Fogg, 2009 ↗

10 · Inclusive Visual + Code · 9 criteria

0–4↓

Ability Heuristics

Grounded in Wobbrock's Ability-Based Design, these nine heuristics move beyond binary WCAG compliance to evaluate the quality of accessibility features across diverse disability groups. In an empirical study with 37 HCI students, Ability Heuristics surfaced significantly more accessibility quality issues than WCAG and Nielsen combined, with equivalent workload. The nine heuristics: Adaptability · Equitable Experience · Flexible Task Completion · Efficiency · Multiple Modalities · Understandable Messages · Ease of Adoption · Ability Data Transparency · Help & Support.

Mitchell et al., CHI 2026 ↗ PDF ↗

Construct Validity

Does it actually measure quality?

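We built four versions of a single webpage with systematically varying design quality, from catastrophic (every criterion deliberately violated) to Grade A, and confirmed that the Iris composite score rises monotonically from V1 to V4. Eight of the 10 frameworks also rise monotonically; WCAG 2.1 (93.0 → 87.0) and Ability Heuristics (50.0 → 48.0) dip slightly from V2 to V3.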

11.9

Composite

V1: Catastrophic design. Red-on-red text, no semantic HTML, outline:none on all elements, marquee, Comic Sans at 9px.

66.2

Composite

V2: Structural fix. Semantic HTML, WCAG-compliant contrast, labelled inputs, keyboard navigation. Flat visually.

74.7

Composite

V3: Full polish. Design token system, hero, card grid, keyboard shortcuts, social proof, quiz system.

80.4

Composite

V4: Iterative refinement guided directly by V3 Iris output. Accessibility toolbar, ARIA accordion, prefers-reduced-motion.

Framework	V1 (F)	V2 (C)	V3 (B)	V4 (A)	V1→V2
Nielsen	18.3	61.0	72.0	77.0	+42.7
Shneiderman	7.9	60.2	75.3	82.9	+52.3
WCAG 2.1	13.0	93.0	87.0	87.0	+80.0
Cognitive Load	13.0	73.6	77.8	80.3	+60.6
Visual Design	7.8	70.4	76.1	81.1	+62.6
HEART	6.7	51.1	73.4	80.4	+44.4
UX Honeycomb	16.7	74.6	82.3	85.9	+57.9
Norman	8.0	77.5	79.7	83.0	+69.5
Fogg BM	11.1	50.4	75.6	82.2	+39.3
Ability Heuristics	16.3	50.0	48.0	64.0	+33.7
Composite (10 fw)	11.9	66.2	74.7	80.4	+54.3
Letter Grade	F	C	B	A

All versions evaluated at T=0 (greedy decoding), K=3 runs (V4: K=1). Scores ≥80 in bold.
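The composite row can be recomputed from the per-framework rows as a sanity check. The sketch below copies the table values verbatim and applies the unweighted mean; it also lists any framework whose score drops between consecutive versions. Variable names are hypothetical.

```python
# Per-framework scores copied from the table: (V1, V2, V3, V4).
SCORES = {
    "Nielsen":            (18.3, 61.0, 72.0, 77.0),
    "Shneiderman":        (7.9, 60.2, 75.3, 82.9),
    "WCAG 2.1":           (13.0, 93.0, 87.0, 87.0),
    "Cognitive Load":     (13.0, 73.6, 77.8, 80.3),
    "Visual Design":      (7.8, 70.4, 76.1, 81.1),
    "HEART":              (6.7, 51.1, 73.4, 80.4),
    "UX Honeycomb":       (16.7, 74.6, 82.3, 85.9),
    "Norman":             (8.0, 77.5, 79.7, 83.0),
    "Fogg BM":            (11.1, 50.4, 75.6, 82.2),
    "Ability Heuristics": (16.3, 50.0, 48.0, 64.0),
}

# Unweighted mean of the ten framework scores for each version.
composites = [round(sum(col) / len(col), 1) for col in zip(*SCORES.values())]
print(composites)

# Frameworks with at least one drop between consecutive versions.
non_monotonic = [
    name for name, v in SCORES.items()
    if any(later < earlier for earlier, later in zip(v, v[1:]))
]
print(non_monotonic)
```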

Monotonic Validity

All 10 frameworks score below 20 on V1, confirming Iris does not produce mid-range scores regardless of quality, a failure mode observed in single-prompt evaluators.

Largest Single Gain

WCAG 2.1 rose 80 points from V1→V2, confirming that structural accessibility fixes are correctly detected. The V4 accessibility toolbar produced the largest V3→V4 gain, in Ability Heuristics (+16 points).

Iterative Refinement

V4's targeted changes were guided entirely by V3's Iris per-criterion report, with no expert human re-assessment between iterations. Result: Grade B → Grade A.

Reproducibility

Temperature matters for evaluation systems.

LLM sampling temperature controls the entropy of token selection. At T=0 (greedy decoding), the model deterministically selects the highest-probability token, producing maximally reproducible outputs. For an automated evaluation system whose scores are intended to serve as actionable metrics, inter-run variance is a quality defect.

We ran a temperature sweep (T ∈ {0, 0.5, 1.0}, K=3 runs each) on a static page. Two findings emerged:

→ Structurally anchored frameworks (WCAG, HEART, Fogg BM) achieve σ=0.00 at T=0: greedy decoding fully eliminates sampling randomness for binary/enumerated criteria.
→ Systematic downward bias: the composite mean decreases monotonically with temperature (70.4 → 67.9 → 64.3 at T=0, 0.5, 1.0). Higher temperatures introduce both noise and a downward scoring bias, a double cost.

Composite score by temperature (static page, K=3)

T = 0.0 (recommended) 70.4

T = 0.5 67.9

T = 1.0 (API default) 64.3

Iris uses T=0 by default. The −6.1 point gap between T=0 and T=1.0 represents a systematic bias, not random variance.
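The sweep summary reduces to a mean and a standard deviation per temperature. The sketch below is an assumption about how such a summary could be computed: `summarise` is a hypothetical helper, population standard deviation is an arbitrary choice (the source does not say which estimator was used), and the three identical run scores are placeholders, not measured data. Only `REPORTED_MEANS` comes from the figures above.

```python
from statistics import mean, pstdev

def summarise(run_scores: list[float]) -> tuple[float, float]:
    """Mean and population standard deviation over K runs."""
    return round(mean(run_scores), 1), round(pstdev(run_scores), 2)

# Placeholder runs: three identical scores give a spread of zero,
# which is what sigma = 0.00 means for a framework at T=0.
print(summarise([80.0, 80.0, 80.0]))

# Composite means reported above, keyed by temperature.
REPORTED_MEANS = {0.0: 70.4, 0.5: 67.9, 1.0: 64.3}

temps = sorted(REPORTED_MEANS)
is_decreasing = all(
    REPORTED_MEANS[a] > REPORTED_MEANS[b] for a, b in zip(temps, temps[1:])
)
gap = round(REPORTED_MEANS[1.0] - REPORTED_MEANS[0.0], 1)
print(is_decreasing, gap)
```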