Your UI, through 10 expert lenses.
The Problem
Thorough HCI evaluation requires deep expertise across usability, accessibility, cognitive load, visual design, and behavioural psychology, disciplines rarely mastered by a single person. Evaluations involving multiple evaluators and frameworks require several days to weeks of effort, and specialist engagements can cost thousands to tens of thousands of dollars.
Iris deploys ten specialist AI agents simultaneously, each grounded in a distinct, peer-reviewed HCI framework, and returns colour-coded criterion matrices, severity scores, and a cross-framework composite score in under 90 seconds.
Three technical contributions distinguish Iris: a markdown-driven agent specification system for framework-agnostic extensibility; an interaction evidence pipeline that captures live CSS pseudo-class rules and keyboard focus screenshots via Playwright; and a cross-framework normalisation scheme mapping five heterogeneous scoring scales to a common 0–100 composite.
At a Glance
10 frameworks
<90s per audit
$0.03 per audit
How It Works
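Paste a live URL, upload a screenshot, or drop in HTML source. For URLs and HTML, Iris launches a headless Playwright browser session that captures the evidence each agent receives: a page screenshot, a keyboard-focus screenshot, the rendered DOM, and the interactive CSS (live pseudo-class rules).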
An Orchestrator Agent (Claude Haiku) uses tool use to dispatch all ten EvaluatorAgents in a single API round-trip. A concurrency semaphore (max 4 in flight) prevents rate-limit exhaustion. Each agent receives the full input bundle (screenshot, focus screenshot, rendered DOM, and interactive CSS) and evaluates it independently against its framework. Failed evaluations are retried with exponential backoff (5 s, 10 s, 20 s). Results stream back to the client via SSE/NDJSON as each framework completes.
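The dispatch pattern described above can be sketched with `asyncio`. This is a minimal illustration, not Iris's actual code: `evaluate_with_retry`, `run_audit`, and the `evaluate` callable are hypothetical names, and only the concurrency cap (4) and the retry delays (5, 10, 20 seconds) come from the text.

```python
import asyncio
from typing import AsyncIterator, Awaitable, Callable, Sequence, Tuple

MAX_IN_FLIGHT = 4               # concurrency cap stated in the text
BACKOFF_SECONDS = (5, 10, 20)   # retry delays stated in the text


async def evaluate_with_retry(
    framework: str,
    evaluate: Callable[[str], Awaitable[dict]],   # hypothetical evaluator call
    semaphore: asyncio.Semaphore,
    delays: Sequence[float] = BACKOFF_SECONDS,
) -> dict:
    """Run one evaluator under the semaphore, retrying after each delay."""
    for attempt in range(len(delays) + 1):
        try:
            async with semaphore:   # at most MAX_IN_FLIGHT evaluations at once
                return await evaluate(framework)
        except Exception:
            if attempt == len(delays):
                raise               # retries exhausted; surface the failure
            # Sleep outside the semaphore so a waiting retry holds no slot.
            await asyncio.sleep(delays[attempt])
    raise AssertionError("unreachable")


async def run_audit(
    frameworks: Sequence[str],
    evaluate: Callable[[str], Awaitable[dict]],
    delays: Sequence[float] = BACKOFF_SECONDS,
) -> AsyncIterator[Tuple[str, dict]]:
    """Yield (framework, result) pairs in completion order."""
    semaphore = asyncio.Semaphore(MAX_IN_FLIGHT)

    async def one(name: str) -> Tuple[str, dict]:
        return name, await evaluate_with_retry(name, evaluate, semaphore, delays)

    tasks = [asyncio.create_task(one(name)) for name in frameworks]
    for finished in asyncio.as_completed(tasks):
        yield await finished
```

Because `run_audit` yields results as they complete, a server could write each pair to the client as one NDJSON line or SSE event without waiting for the slowest framework.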
Five heterogeneous scoring scales (0–4↓ severity, 1–10↑, 1–10↓, pass/fail, and 0–4↓ ability) are normalised to a common 0–100 range. Per-criterion severity labels (None / Low / Medium / High / Critical) are assigned by each evaluator. Framework means are averaged into a composite score mapping to a letter grade: A+ (≥90), A (≥80), B (≥70), C (≥60), D (≥50), F (<50).
Scoring
The 10 frameworks use five heterogeneous raw scoring schemes. Iris normalises all raw scores to a common 0–100 range, averages within each framework, then averages across frameworks to produce a single composite score and letter grade.
Normalisation Functions
Nielsen & Ability Heuristics, 0–4, lower is better
s = 0 (no problem) → score 100 · s = 4 (catastrophe) → score 0
Shneiderman, Norman, HEART, Honeycomb, Fogg, 1–10, higher is better
s = 1 → score 0 · s = 10 → score 100
Cognitive Load, 1–10, lower is better
s = 1 (minimal load) → score 100 · s = 10 (overload) → score 0
WCAG 2.1, binary pass/fail
Pass maps to 100 and fail to 0. With 15 criteria, a single pass/fail flip moves the framework score by 100/15 ≈ 6.7 points, which makes WCAG disproportionately sensitive to sampling noise at T > 0.
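The mappings above fix only the endpoints of each scale. The sketch below assumes linear interpolation between those endpoints, which is the simplest mapping consistent with them; the function names are illustrative and not taken from Iris.

```python
def normalise_severity(s: float) -> float:
    """0-4 scale, lower is better (Nielsen, Ability Heuristics): 0 -> 100, 4 -> 0."""
    return 100.0 * (4 - s) / 4


def normalise_rating(s: float) -> float:
    """1-10 scale, higher is better (Shneiderman, Norman, HEART, Honeycomb, Fogg)."""
    return 100.0 * (s - 1) / 9


def normalise_load(s: float) -> float:
    """1-10 scale, lower is better (Cognitive Load): 1 -> 100, 10 -> 0."""
    return 100.0 * (10 - s) / 9


def normalise_wcag(passed: bool) -> float:
    """Binary pass/fail (WCAG 2.1): pass -> 100, fail -> 0."""
    return 100.0 if passed else 0.0


# One flipped criterion out of 15 shifts the WCAG framework mean by 100/15.
all_pass = sum(normalise_wcag(True) for _ in range(15)) / 15
one_fail = (14 * normalise_wcag(True) + normalise_wcag(False)) / 15
print(round(all_pass - one_fail, 1))  # 6.7
```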
Aggregation & Grading
Step 1, Framework mean (over K criteria)
Step 2, Composite score (over M frameworks)
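Written out, the two averaging steps take the following form. The symbols $n_{m,k}$ (normalised score of criterion $k$ in framework $m$), $F_m$ (framework mean), and the per-framework subscript on $K_m$ are notation assumed here for illustration; $K$, $M$, and the composite $C$ follow the text.

```latex
\[
F_m = \frac{1}{K_m} \sum_{k=1}^{K_m} n_{m,k},
\qquad
C = \frac{1}{M} \sum_{m=1}^{M} F_m ,
\qquad M = 10 .
\]
```

Each framework contributes equally to $C$ regardless of how many criteria it contains.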
Grade thresholds
| A+ | C ≥ 90 | Exceptional |
| A | C ≥ 80 | Strong across all dimensions |
| B | C ≥ 70 | Good; targeted gaps remain |
| C | C ≥ 60 | Adequate; structural fixes needed |
| D | C ≥ 50 | Significant issues |
| F | C < 50 | Fundamental failures |
Normalisation ensures 100 always represents the theoretical best outcome under a given framework, enabling direct comparison of, for example, a WCAG pass rate against a Nielsen severity profile. Per-criterion severity labels (None / Low / Medium / High / Critical) are assigned independently by each evaluator agent.
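A small sketch of the aggregation and grading steps, using the thresholds from the table above. The function names are illustrative, and whether Iris grades the rounded or unrounded composite is not stated, so this version grades the unrounded value. The example input is the set of V4 framework means reported in the Construct Validity table.

```python
from statistics import fmean
from typing import Sequence

# (minimum composite, grade), checked from highest to lowest.
GRADE_THRESHOLDS = [(90, "A+"), (80, "A"), (70, "B"), (60, "C"), (50, "D")]


def framework_mean(normalised_scores: Sequence[float]) -> float:
    """Step 1: mean of the normalised criterion scores within one framework."""
    return fmean(normalised_scores)


def composite(framework_means: Sequence[float]) -> float:
    """Step 2: unweighted mean of the framework means."""
    return fmean(framework_means)


def letter_grade(c: float) -> str:
    """Map a composite score to its letter grade."""
    for minimum, grade in GRADE_THRESHOLDS:
        if c >= minimum:
            return grade
    return "F"


# V4 framework means as reported in the Construct Validity table.
v4_means = [77.0, 82.9, 87.0, 80.3, 81.1, 80.4, 85.9, 83.0, 82.2, 64.0]
c = composite(v4_means)
print(round(c, 1), letter_grade(c))  # 80.4 A
```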
The 10 Frameworks
No single framework sees everything. WCAG misses persuasion. Nielsen misses accessibility quality. Fogg misses visual hierarchy. Running all 10 together surfaces blind spots that any single evaluation would leave hidden, and cross-framework agreement on an issue is a reliable signal of severity.
Coverage
Usability, accessibility, cognitive load, visual design, persuasion, and ability-aware design are all measured in a single pass, with 84 criteria across 10 disciplines.
Cross-validation
When 7 of 10 frameworks flag the same issue, you know it's real, not a quirk of one evaluator's lens. Multi-framework consensus is the strongest severity signal.
Prioritisation
The composite grade and per-framework scores show which dimensions to fix first. Iterating on Iris output alone took a Grade B design to Grade A in our validity study.
The industry-standard usability checklist. Visibility of system status, error prevention, flexibility, recognition rather than recall. Four criteria use live CSS and ARIA evidence (focus rings, :invalid states, skip links).
Nielsen, 1994
Interaction principles emphasising consistency, informative feedback, and user control. CSS transition/animation rules serve as proxy evidence of dynamic feedback; :invalid/:valid CSS and pattern attributes assess error prevention.
Shneiderman, 1987
All 15 criteria are HTML-sensitive, making this the only framework with uniform code dependence. Covers the Perceivable, Operable, Understandable, and Robust principles. The focus-state screenshot directly supports WCAG 2.4.7 Focus Visible; extracted CSS detects outline:none violations.
W3C, 2018
Intrinsic, extraneous, and germane load based on Sweller's CLT. Progressive disclosure is assessed via aria-expanded, details/summary, and tab panel patterns. Miller's Law (7±2) informs the chunking criterion.
Sweller, 1988
Gestalt psychology, hierarchy, contrast, alignment, proximity, similarity, white space, typography, colour harmony, figure-ground, and compositional balance. Screenshot-only; these properties cannot be inferred from source code.
Visual Design Principles
Happiness, Engagement, Adoption, Retention, Task Success. Applied inferentially from interface signals; scores represent predictive rather than measured assessments.
Rodden et al., 2010
Morville's 7 facets: Useful, Usable, Desirable, Findable, Accessible, Credible, Valuable. The Accessible facet benefits from ARIA landmark and heading analysis. Credible and Valuable are assessed from trust signals and value proposition clarity.
Morville, 2004
Affordances, signifiers, mappings, constraints, feedback, and discoverability. CSS cursor:pointer rules serve as proxy evidence of affordance quality; transition/animation declarations indicate feedback provision.
Norman, The Design of Everyday Things, 1988
Motivation, Ability (simplicity), and Prompts: does the interface drive target user actions? Evaluates call-to-action prominence, friction elimination, and reward signals.
Fogg, 2009
Grounded in Wobbrock's Ability-Based Design, these nine heuristics move beyond binary WCAG compliance to evaluate the quality of accessibility features across diverse disability groups. In an empirical study with 37 HCI students, Ability Heuristics surfaced significantly more accessibility quality issues than WCAG and Nielsen combined, with equivalent workload. Adaptability · Equitable Experience · Flexible Task Completion · Efficiency · Multiple Modalities · Understandable Messages · Ease of Adoption · Ability Data Transparency · Help & Support.
Mitchell et al., CHI 2026
Construct Validity
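We built four versions of a single webpage with systematically varying design quality, from catastrophic (every criterion deliberately violated) to Grade A, and confirmed that the Iris composite score rises monotonically from V1 to V4. Eight of the ten frameworks also rise at every step; WCAG 2.1 and Ability Heuristics dip slightly between V2 and V3 (see the table below).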
V1 (composite 11.9): Catastrophic design. Red-on-red text, no semantic HTML, outline:none on all elements, marquee, Comic Sans at 9px.
V2 (composite 66.2): Structural fix. Semantic HTML, WCAG-compliant contrast, labelled inputs, keyboard navigation. Flat visually.
V3 (composite 74.7): Full polish. Design token system, hero, card grid, keyboard shortcuts, social proof, quiz system.
V4 (composite 80.4): Iterative refinement guided directly by V3 Iris output. Accessibility toolbar, ARIA accordion, prefers-reduced-motion.
| Framework | V1 (F) | V2 (C) | V3 (B) | V4 (A) | V1→V2 |
|---|---|---|---|---|---|
| Nielsen | 18.3 | 61.0 | 72.0 | 77.0 | +42.7 |
| Shneiderman | 7.9 | 60.2 | 75.3 | **82.9** | +52.3 |
| WCAG 2.1 | 13.0 | **93.0** | **87.0** | **87.0** | +80.0 |
| Cognitive Load | 13.0 | 73.6 | 77.8 | **80.3** | +60.6 |
| Visual Design | 7.8 | 70.4 | 76.1 | **81.1** | +62.6 |
| HEART | 6.7 | 51.1 | 73.4 | **80.4** | +44.4 |
| UX Honeycomb | 16.7 | 74.6 | **82.3** | **85.9** | +57.9 |
| Norman | 8.0 | 77.5 | 79.7 | **83.0** | +69.5 |
| Fogg BM | 11.1 | 50.4 | 75.6 | **82.2** | +39.3 |
| Ability Heuristics | 16.3 | 50.0 | 48.0 | 64.0 | +33.7 |
| Composite (10 fw) | 11.9 | 66.2 | 74.7 | **80.4** | +54.3 |
| Letter Grade | F | C | B | A | |
All versions evaluated at T=0 (greedy decoding), K=3 runs (V4: K=1). Scores ≥80 in bold.
Monotonic Validity
All 10 frameworks score below 20 on V1, confirming that Iris does not default to mid-range scores regardless of design quality, a failure mode observed in single-prompt evaluators.
Largest Single Gain
WCAG 2.1 rose +80 points from V1→V2, confirming that structural accessibility fixes are correctly detected. The V4 accessibility toolbar produced the largest V3→V4 gain in Ability Heuristics (+16 points).
Iterative Refinement
V4's targeted changes were guided entirely by V3's Iris per-criterion report, with no expert human re-assessment between iterations. Result: Grade B → Grade A.
Reproducibility
LLM sampling temperature controls the entropy of token selection. At T=0 (greedy decoding), the model deterministically selects the highest-probability token, producing maximally reproducible outputs. For an automated evaluation system whose scores are intended to serve as actionable metrics, inter-run variance is a quality defect.
We ran a temperature sweep (T ∈ {0, 0.5, 1.0}, K=3 runs each) on a static page. Two findings emerged:
Zero variance at T=0: structurally anchored frameworks (WCAG, HEART, Fogg BM) achieve σ=0.00 at T=0, showing that greedy decoding fully eliminates sampling randomness for binary/enumerated criteria.
Systematic downward bias: composite mean decreases monotonically with temperature (70.4 → 67.9 → 64.3 at T=0, 0.5, 1.0). Higher temperatures introduce both noise and a downward scoring bias, a double cost.
Composite score by temperature (static page, K=3)
Iris uses T=0 by default. The −6.1 point gap between T=0 and T=1.0 represents a systematic bias, not random variance.
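The bias figures can be reproduced from the reported composite means. The sketch below is illustrative only: the helper names are made up, the per-run scores passed to `run_stats` are hypothetical (the page reports only means), and the use of the population standard deviation for σ is an assumption.

```python
from statistics import fmean, pstdev
from typing import Dict, Sequence, Tuple

# Composite means reported above for the static page (K=3 runs each).
REPORTED_MEANS: Dict[float, float] = {0.0: 70.4, 0.5: 67.9, 1.0: 64.3}


def run_stats(scores: Sequence[float]) -> Tuple[float, float]:
    """Mean and population standard deviation over K runs at one temperature."""
    return fmean(scores), pstdev(scores)


def bias_vs_greedy(means: Dict[float, float]) -> Dict[float, float]:
    """Difference between each temperature's mean and the T=0 baseline."""
    baseline = means[0.0]
    return {t: round(m - baseline, 1) for t, m in means.items()}


print(bias_vs_greedy(REPORTED_MEANS))  # {0.0: 0.0, 0.5: -2.5, 1.0: -6.1}

# Hypothetical K=3 runs that agree exactly, as greedy decoding would give
# for a structurally anchored framework: the spread is zero.
_, sigma = run_stats([87.0, 87.0, 87.0])
print(sigma)  # 0.0
```

Separating the two quantities makes the distinction in the text concrete: σ measures run-to-run noise at one temperature, while the difference from the T=0 mean measures the systematic shift.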