arqmetrica
Methodology

How the arqmetrica AI Maturity Index actually works.

Most published "AI maturity scores" fall apart under scrutiny. They are unfalsifiable surveys, with no scoring rubric and no benchmarking — and they are almost always written by the vendor that benefits from the score. The arqmetrica Index is built differently. This page documents how: what we measure, why those things, how scoring works, and how we benchmark you against your peers. Audit-friendly by design.

Why this Index exists

Almost every consultancy has, by now, published an "AI maturity model". Almost all of them share three flaws. The questions are unfalsifiable: vague self-ratings against vaguer descriptors, with no rubric distinguishing one level from the next. The scoring is hidden: respondents are told they are at "Level 3 of 5" without ever seeing the formula that produced the number. And the benchmarking is typically absent, or quoted from a survey of self-selecting respondents recruited through the vendor's own marketing list. The consequence is that the scores cannot be compared, cannot be reproduced, and cannot be challenged. They serve a marketing function, not a measurement one.

The arqmetrica AI Maturity Index is built to a different standard. The methodology is published in full on this page. Each question carries a citation to the framework clause it operationalises. The scoring formula is open code, auditable in the public repository. Benchmarks are calibrated against the most-cited European mid-market AI research — MIT Sloan/BCG, Stanford AI Index, Capgemini — and continuously refined from anonymised real Index responses as each cohort accumulates. The Index is designed to be defensible in front of an audit committee, a regulator, or a sceptical board chair. That is unusual, and intentional.

The five framework anchors

The Index is anchored in five published, externally maintained frameworks. We do not invent constructs. Where an authoritative body has already defined what "good" looks like, the Index uses that definition and cites it.

EU AI Act — Regulation (EU) 2024/1689. The first comprehensive horizontal AI law, in force since August 2024, with the high-risk obligations enforceable from August 2026. Issued by the European Parliament and Council. Authoritative for risk classification, prohibited practices, transparency duties and the data-governance regime under Article 10. The Index uses the Act to anchor the Governance & ethics dimension; individual questions cite specific articles.

OECD AI Principles — revised 2024. Issued by the OECD AI Policy Observatory and adopted by 47 governments. Authoritative as the most widely subscribed-to set of values-based AI principles globally. The Index draws on Principle 2.4 (building human capacity) for the People & capability dimension, and on Principle 1.4 (robustness, security and safety) for parts of Tooling.

NIST AI RMF 1.0. The U.S. National Institute of Standards and Technology AI Risk Management Framework, published January 2023. Authoritative for the operational mechanics of AI risk management — the Map, Measure, Manage and Govern functions. The Index uses it to structure Data foundations and Tooling, and to validate that Governance covers all four NIST functions.

ISO/IEC 42001:2023. The first international management-system standard for AI, published by ISO/IEC JTC 1. Authoritative as the basis for AI management-system certification, in the way ISO 27001 is for information security. The Index draws on Clause 5 (Leadership), Clause 8 (Operation) and Clause 9 (Performance evaluation) for Strategy, Tooling and ROI respectively.

Stanford AI Index 2024 and the MIT Sloan / BCG longitudinal study. Two of the most rigorous empirical sources on enterprise AI. The Stanford AI Index — Stanford Institute for Human-Centered AI — provides population-level statistics on adoption, talent and investment. The MIT Sloan / BCG "Expanding AI's Impact with Organizational Learning" research, run annually since 2017, supplies the only multi-year evidence we have on which behaviours actually predict AI value capture. We treat these as reference data, not as inspiration.

The six dimensions

The Index measures six dimensions. Each is a distinct construct supported by at least one of the five framework anchors. Each carries a weight; the six weights sum to 100. The weights are not arbitrary. They reflect what the underlying empirical literature — principally the MIT Sloan / BCG longitudinal study — identifies as the relative leverage each dimension exerts on AI value capture in mid-market organisations.

Strategy & vision — 18% (the highest weight). In the MIT Sloan / BCG longitudinal data, the strongest single predictor of AI value capture is the clarity and board-level alignment of an organisation's AI strategy. Companies that score high on strategy clarity outperform on every downstream measure — pilot-to-production ratio, ROI per use case, revenue lift. Strategy is therefore weighted highest.

Data foundations, People & capability, Governance & ethics, ROI & measurement — 17% each. These are the four operational dimensions on which AI value either compounds or breaks down. The published evidence does not give us a robust basis for ranking them against each other in mid-market settings, so they are weighted equally. The Index does not pretend to precision the data does not support.

Tooling & infrastructure — 14% (the lowest weight). Tooling matters, but in the causal order it is downstream. A company with the right strategy, data, people and governance will procure or build adequate tooling within a couple of budget cycles. A company with the wrong strategy will procure expensive tooling and waste it. Tooling is the easiest of the six to fix once the others are sound, and the most expensive thing to over-invest in when they are not. The lower weight reflects that asymmetry, not a view that tooling is unimportant.
Dimension | Weight | Primary framework
Strategy & vision | 18% | ISO/IEC 42001:2023 §5 — Leadership
Data foundations | 17% | NIST AI RMF 1.0 — Map function (Data)
People & capability | 17% | OECD AI Principle 2.4 — Building human capacity
Governance & ethics | 17% | EU AI Act (Regulation (EU) 2024/1689)
Tooling & infrastructure | 14% | NIST AI RMF 1.0 — Map function (Infrastructure)
ROI & measurement | 17% | ISO/IEC 42001:2023 §9 — Performance evaluation
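
As a point of reference, the sketch below shows one way these weights could be expressed as a constant. It is illustrative only — the canonical definitions live in src/index/dimensions.ts, and the identifier names here are assumptions.

// Illustrative sketch of the dimension weights above; names are assumptions.
const DIMENSION_WEIGHTS: Record<string, number> = {
  strategyVision: 18,
  dataFoundations: 17,
  peopleCapability: 17,
  governanceEthics: 17,
  toolingInfrastructure: 14,
  roiMeasurement: 17,
};

// The six weights sum to 100 by construction; a guard makes that explicit.
const totalWeight = Object.values(DIMENSION_WEIGHTS).reduce((a, b) => a + b, 0);
console.assert(totalWeight === 100, 'dimension weights must sum to 100');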

How scoring works

The scoring methodology is deliberately simple. Simplicity is a property, not a limitation: it makes the Index reproducible on a single sheet of paper, and it leaves nowhere for adjustments to hide.

The assessment contains 24 questions — four per dimension. Each question presents four ordinal response options, (a) through (d). Each option carries a fixed numerical score on a four-stage maturity ladder: none (0), nascent (33), established (67), optimised (100). The mapping is identical for every question and is published as the constant OPTION_SCORES in the code.

A dimension score is the unweighted arithmetic mean of the four question scores in that dimension. The four questions in each dimension are calibrated during pilot testing to carry approximately equal diagnostic weight; weighting them differently would introduce a layer of judgement we cannot defend. The overall Index score is the weighted mean of the six dimension scores, using the weights in the table above.

Both dimension and overall scores are integers in the range 0–100. Rounding is performed once at the end of each step, to the nearest integer; we do not round inside the sums or carry decimals between steps. This preserves arithmetic accuracy without inflating apparent precision. The scoring is implemented in src/index/scoring.ts and is pinned by twelve unit tests.
option_score ∈ {100, 67, 33, 0} // a / b / c / d
dimension_score = round(mean(option_scores))
overall_score = round(Σ(dimension_score × weight) / 100)
Scoring formula

How a single response becomes a score.

Each question presents four mutually exclusive options. The respondent's choice is mapped to a numeric value through a fixed lookup: option a = 100, b = 67, c = 33, d = 0. The dimension-level score is the rounded mean of the option values for that dimension's questions, so every dimension is reported on a 0–100 scale. The overall maturity score is the weighted mean of the six dimension scores, with the weights set out in the next section.
Per-dimension score (rounded mean of option values).
Overall maturity score (weight-normalised composition).
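
For readers who prefer code to prose, here is a minimal TypeScript sketch of the arithmetic above. It is illustrative, not the production implementation — the canonical code is src/index/scoring.ts, and apart from OPTION_SCORES the identifiers are assumptions.

// Fixed option-to-score lookup: a = 100, b = 67, c = 33, d = 0.
const OPTION_SCORES = { a: 100, b: 67, c: 33, d: 0 } as const;
type Option = keyof typeof OPTION_SCORES;

// Dimension score: unweighted mean of the four question scores, rounded once.
function dimensionScore(answers: Option[]): number {
  const sum = answers.reduce((acc, o) => acc + OPTION_SCORES[o], 0);
  return Math.round(sum / answers.length);
}

// Overall score: weighted mean of the six dimension scores; weights sum to 100.
function overallScore(dimScores: Record<string, number>, weights: Record<string, number>): number {
  const weighted = Object.keys(weights).reduce((acc, dim) => acc + dimScores[dim] * weights[dim], 0);
  return Math.round(weighted / 100);
}

// Example: a dimension answered (a, b, b, c) scores round((100 + 67 + 67 + 33) / 4) = 67.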
Weight derivation

How each dimension's weight is justified.

Weights are not arbitrary; each is anchored to a published source plus a one-line rationale. The six weights sum to 100 by construction.

Dimension | Weight | Rationale | Source
Strategy & vision | 18% | Strongest single predictor of value capture in MIT Sloan/BCG longitudinal data. | MIT Sloan/BCG 2024 §4
Data foundations | 17% | Foundational dependency: no AI value without data discipline. | NIST AI RMF GOVERN-1 + EU AI Act Art. 10
People & capability | 17% | Strongest determinant of pilot-to-production rate. | MIT Sloan/BCG 2024 §6
Governance & ethics | 17% | Direct EU AI Act enforcement weight (Articles 9, 10, 14). | EU Reg 2024/1689
Tooling & infrastructure | 14% | Necessary but not sufficient — capped at 14 to prevent vendor-stack overweighting. | Stanford AI Index 2024
ROI & measurement | 17% | Outcome dimension — closes the value loop. | ISO/IEC 42001:2023 §9
Total | 100% | |
Item calibration

Each of the 24 questions traces to a published framework.

Every item in the assessment is mapped to a specific clause or chapter in one of the five anchor sources, so each respondent's score can be traced back to where the construct came from.

Question-to-source map (excerpt).

Dimension | Item summary | Anchor source / clause
Strategy & vision | Has your board or executive team formally endorsed an AI strategy? | ISO/IEC 42001:2023 §5.1 (Leadership and commitment)
Strategy & vision | Is there a single accountable executive for AI outcomes across the organisation? | ISO/IEC 42001:2023 §5.3 (Roles and responsibilities)
Data foundations | How well-documented is the lineage of data feeding production AI systems? | EU AI Act Art. 10 (Data and data governance)
People & capability | How structured is your AI literacy and upskilling programme? | OECD AI Principle 2.4 (Building human capacity)
Governance & ethics | Have you classified your AI use cases against the EU AI Act risk tiers? | EU AI Act Art. 6 + Annex III (Risk classification)
Governance & ethics | Do you have a documented incident response process for AI failures? | NIST AI RMF MANAGE-4 (Incident response)
Tooling & infrastructure | How mature is your model deployment and monitoring stack? | ISO/IEC 42001:2023 §8 (Operation)
ROI & measurement | Do you track value attribution back to specific AI initiatives? | ISO/IEC 42001:2023 §9 (Performance evaluation)

Excerpt of 8 representative items; the full 24-question set is published at /the-index/start.
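
One way to picture this calibration is as a record that carries its citation with it. The shape below is a hypothetical illustration — the field names and structure are assumptions, not the published question definitions at /the-index/start.

// Hypothetical shape for one assessment item; each question travels with the
// framework clause it operationalises. Illustrative only.
interface IndexQuestion {
  dimension: string; // e.g. 'Governance & ethics'
  prompt: string;    // the question text shown to the respondent
  anchor: string;    // the framework clause the item operationalises
}

const exampleItem: IndexQuestion = {
  dimension: 'Governance & ethics',
  prompt: 'Have you classified your AI use cases against the EU AI Act risk tiers?',
  anchor: 'EU AI Act Art. 6 + Annex III (Risk classification)',
};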

Q1 2026 cohort

Who responded — and over what window.

This Q1 2026 cohort underpins the Q2 2026 published edition and comprises responses collected from 1 January to 31 March 2026: 437 valid completions out of 612 starts (71.4% completion rate, median completion time 11m 23s).

By industry

Industry | Respondents | Share
Manufacturing | 89 | 20.4%
Financial services | 67 | 15.3%
Professional services | 58 | 13.3%
Tech & software | 53 | 12.1%
Retail & e-commerce | 47 | 10.8%
Logistics | 39 | 8.9%
Healthcare | 31 | 7.1%
Education | 22 | 5.0%
Energy & utilities | 18 | 4.1%
Public sector | 13 | 3.0%
Total | 437 | 100%

By employee band

Employee band | Respondents | Share
50–99 | 142 | 32.5%
100–249 | 184 | 42.1%
250–499 | 111 | 25.4%
Total | 437 | 100%

Off-band respondents (excluded from the published mid-market medians)

The Index form accepts companies of any size. The published cohort focuses on the 50–499 mid-market core (N=437 — the breakdowns above). Respondents from outside that range completed the assessment and received their personal report, but their scores are not aggregated into the published mid-market figures. We track them here for full transparency.

  • 1–49 (small business): 24
  • 500+ (large enterprise): 16

By country

Country | Respondents | Share
Portugal | 156 | 35.7%
Spain | 98 | 22.4%
France | 64 | 14.6%
Germany | 49 | 11.2%
Italy | 28 | 6.4%
Netherlands | 18 | 4.1%
Belgium / Luxembourg | 11 | 2.5%
Ireland | 7 | 1.6%
Other EU | 6 | 1.4%
Total | 437 | 100%

By respondent role

Role | Respondents | Share
C-suite | 87 | 19.9%
VP / Director | 156 | 35.7%
Senior Manager | 142 | 32.5%
Other | 52 | 11.9%
Total | 437 | 100%

Reliability & validity

What we measure and how confident we are.

Confidence intervals on each median are reported using the Bonett-Price distribution-free approximation described in the note below. With the cohort interquartile range of 27 points (p25=33, p75=60), the 95% CI on the overall N=437 median is approximately ±2.0 points; sectoral medians with N>50 carry a CI of roughly ±4–6 points depending on N; sectoral medians with N<30 widen to ±8–10 points and should be treated as directional rather than precise.

Bonett-Price distribution-free 95% confidence interval for a sectoral median (asymptotic approximation; for N<30 we report exact binomial CIs in the per-sector tables).
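
To make the distribution-free approach concrete, the sketch below computes the exact binomial (order-statistic) confidence interval for a median — the method the per-sector tables fall back to when N<30. It is an illustration in TypeScript, not the analysis code, and it implements the generic order-statistic interval rather than the Bonett-Price approximation used for larger N.

// Distribution-free confidence interval for a median via order statistics.
// For sorted scores y_(1) <= ... <= y_(n), the interval [y_(l), y_(n-l+1)]
// covers the population median with probability >= 1 - alpha, where l is the
// largest integer with P(Binomial(n, 0.5) <= l - 1) <= alpha / 2.

function binomialCdfHalf(n: number, k: number): number {
  // P(X <= k) for X ~ Binomial(n, 0.5), built up term by term.
  let pmf = Math.pow(0.5, n); // P(X = 0)
  let cdf = 0;
  for (let i = 0; i <= k; i++) {
    cdf += pmf;
    pmf = (pmf * (n - i)) / (i + 1); // P(X = i + 1) from P(X = i) when p = 0.5
  }
  return cdf;
}

function medianConfidenceInterval(scores: number[], alpha = 0.05): [number, number] {
  const y = [...scores].sort((a, b) => a - b);
  const n = y.length;
  let l = 1; // widest interval as the fallback when n is too small for 1 - alpha coverage
  for (let k = 1; k <= Math.floor(n / 2); k++) {
    if (binomialCdfHalf(n, k - 1) <= alpha / 2) l = k;
    else break;
  }
  return [y[l - 1], y[n - l]]; // y_(l) and y_(n - l + 1), 1-indexed order statistics
}

Because the endpoints are observed scores, the interval never extends beyond the range of the data — a useful property when sectoral samples are small.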
Limitations

What this index does not yet do.

  • Self-report bias: respondents may overstate maturity in social-desirability dimensions, particularly Governance and ROI.
  • Selection bias: Index-takers self-select; the cohort is not a probability sample of the European mid-market universe.
  • Small-N sectoral cuts: N<30 in Energy & Utilities and Public Sector, with correspondingly wide confidence intervals.
  • No external validation cohort yet: planned for the Q3 2026 edition, paired with a structured peer-organisation sample.
Planned reliability work

What we publish next, when N supports it.

  • Cronbach's α at N>200 per dimension (target Q3 2026) — a minimal sketch of the statistic follows this list.
  • Test-retest correlation at 90 days, on a rolling cohort (target Q4 2026).
  • Convergent validity against externally published AI investment ratios (target Q4 2026).
  • Inter-rater reliability for AI Act risk classification (planned Q1 2027).
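
For readers unfamiliar with the statistic, the sketch below shows how Cronbach's α would be computed for one four-item dimension. It is purely illustrative; the published figures will be computed on the real cohort once N>200 per dimension.

// Cronbach's alpha for one dimension: rows are respondents, columns are the
// k item scores for that dimension. Requires at least two respondents and two items.
function sampleVariance(xs: number[]): number {
  const mean = xs.reduce((a, b) => a + b, 0) / xs.length;
  return xs.reduce((a, x) => a + (x - mean) ** 2, 0) / (xs.length - 1);
}

function cronbachAlpha(itemScores: number[][]): number {
  const k = itemScores[0].length;
  const itemVariances = Array.from({ length: k }, (_, j) =>
    sampleVariance(itemScores.map((row) => row[j])),
  );
  const totals = itemScores.map((row) => row.reduce((a, b) => a + b, 0));
  const sumOfItemVariances = itemVariances.reduce((a, b) => a + b, 0);
  return (k / (k - 1)) * (1 - sumOfItemVariances / sampleVariance(totals));
}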

How peer benchmarks work

A score in isolation tells you very little. The Index is therefore designed around peer comparison from the first question. Each respondent is benchmarked against peers in the same industry × employee-band cohort — for example, manufacturing, 250–499 employees.

Every benchmark is calibrated against the most-cited European mid-market AI research — principally the Stanford AI Index 2024, the MIT Sloan / BCG 2024 longitudinal study, and the Capgemini EU AI Act readiness survey — with the specific source cited on each dimension row. As real Index responses accumulate, each cohort crosses a 50-response statistical threshold and the calibration shifts to being driven primarily by live arqmetrica data. Published research remains the anchor; real responses progressively sharpen the fit.

Results are reported as percentile bands: top quartile (≥ p75), above median (p50–p75), at median (p25–p50), below median (p10–p25), and bottom decile (< p10). Bands, not raw percentiles, prevent over-interpretation of small-sample noise.
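
A minimal sketch of how a cohort percentile could map to the published bands — the function name and shape are assumptions, not the production code:

type BenchmarkBand = 'top quartile' | 'above median' | 'at median' | 'below median' | 'bottom decile';

// Maps a respondent's cohort percentile to the published band.
// Cut-offs follow the band definitions above.
function toBenchmarkBand(percentile: number): BenchmarkBand {
  if (percentile >= 75) return 'top quartile';
  if (percentile >= 50) return 'above median';
  if (percentile >= 25) return 'at median';
  if (percentile >= 10) return 'below median';
  return 'bottom decile';
}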

Our transparency commitment

Three commitments hold the Index to the standard this page sets.

Anonymisation by default. Company-level responses are stored unattributed; no personally identifying information is captured unless the respondent opts in to receive a PDF copy of their result. Email addresses, when given, are kept in a separate table and can be erased independently of the underlying response, satisfying the GDPR right to erasure under Article 17. The deletion endpoint is /api/data/delete; the full data-handling rules live on the Data Ethics page.

Aggregate-only public reporting. The quarterly State of European Mid-Market AI report is built solely from anonymised aggregate cohort statistics. No individual response, and no company-identifying field, ever appears in published outputs.

Open methodology. The dimension definitions, weights and scoring formulas live as ordinary TypeScript code in src/index/dimensions.ts and src/index/scoring.ts, in the public arqmetrica repository. Anyone — auditor, regulator, competitor, sceptical client — can read the exact arithmetic that produced any given score. There are no hidden adjustments and no proprietary multipliers.

And one line we will not cross: the Index scores companies, never individuals. We do not deploy AI to classify or rate the people who complete the assessment. This is not a policy we expect to revise.
Reproduction package

Everything you need to verify the numbers.

The methodology, scoring formula, and weight derivation are open under CC BY 4.0. Cohort-level data is anonymised, but the underlying counts — the cohort tables on this page — are the verification artefact: anyone can reproduce the medians by re-running the formula against the same distribution. The 24 questions are listed in full at /the-index/start.

Take the assessment →