A Quiz in a Lab Coat

I took a pre-test on a learning platform and scored 100%. The platform was supposed to give me study recommendations based on my knowledge and experience, but according to the assessment there were “no observable skill gaps.” I am an expert, at least according to this pre-test.
There were 25 questions drawn from a validated pool of over 2,000. The assessment was timed and used what the platform calls an adaptive testing engine, all backed by industry benchmarking analytics.
The issue is that I’m not an expert in the domain, however much my ego would love that to be true. The truth is, the assessment is garbage.
Let’s explore why.
25 questions cannot reliably measure expertise across a domain that normally requires ~80–120 items
In psychometrics, reliability is a function of:
- domain breadth
- item discrimination
- item difficulty spread
- number of items
25 items is:
- barely enough for entry-level skill classification,
- nowhere near enough to make a claim like “no observable skill gaps” across an entire domain.
Even NCLEX, FAA tests, and CompTIA exams use 70–150 items with enormous research backing.
A validated pool of 2,000 questions doesn’t matter if you’re only sampling 1% of it.
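For a rough sense of how much reliability you give up by shrinking a test, the Spearman–Brown prophecy formula is the standard back-of-the-envelope tool. The numbers below are hypothetical (assume a well-built 100-item exam with reliability 0.90); the point is the direction and size of the drop, not the exact figure.

```python
def spearman_brown(rho_full: float, length_ratio: float) -> float:
    """Predicted reliability when a test is shortened (or lengthened) by length_ratio."""
    return (length_ratio * rho_full) / (1 + (length_ratio - 1) * rho_full)

# Hypothetical numbers: a well-built 100-item exam with reliability 0.90,
# cut down to a 25-item pre-test (length ratio = 0.25).
full_reliability = 0.90
short_reliability = spearman_brown(full_reliability, 25 / 100)
print(f"Predicted reliability of the 25-item version: {short_reliability:.2f}")
# ~0.69, well below the 0.90+ usually expected when making high-stakes claims.
```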
“Adaptive testing engine” is usually marketing language unless it’s built on true IRT
Legitimate adaptive testing uses:
- 2PL or 3PL Item Response Theory
- item difficulty calibration
- discrimination parameters
- exposure control
- Bayesian ability estimation
Most commercial “adaptive tests” instead just do:
- If correct → harder question; if incorrect → easier question.
These types of tests are not adaptive measurements; they're just branching quizzes.
Branching ≠ CAT (Computer Adaptive Testing).
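To make the contrast concrete, here is a minimal sketch of the 2PL machinery a genuine CAT is built on: every item carries a calibrated difficulty and discrimination, and the next item is chosen to maximize Fisher information at the current ability estimate, not just “harder” or “easier.” The item bank and ability estimate below are invented for illustration.

```python
import numpy as np

def p_correct_2pl(theta: float, a: float, b: float) -> float:
    """2PL item response function: P(correct | ability theta, discrimination a, difficulty b)."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

def item_information(theta: float, a: float, b: float) -> float:
    """Fisher information the item provides about theta under the 2PL model."""
    p = p_correct_2pl(theta, a, b)
    return a**2 * p * (1 - p)

# Hypothetical calibrated item bank: (discrimination a, difficulty b)
bank = [(0.8, -1.5), (1.2, -0.5), (1.5, 0.0), (1.1, 0.7), (1.6, 1.8), (1.3, 2.5)]

theta_hat = 0.4  # current ability estimate

# A real CAT picks the unused item that is most informative at theta_hat...
best = max(bank, key=lambda ab: item_information(theta_hat, *ab))
print("Max-information pick (a, b):", best)

# ...whereas a branching quiz just steps one difficulty level up or down,
# regardless of how much the item actually tells you about the examinee.
```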
If you end with 100%, the algorithm failed to probe your upper limit
A proper CAT exam will:
- push you until you hit your ceiling,
- present items until your ability estimate stabilizes,
- continue until the measurement error drops below a threshold.
If you hit 100%, that means one of two things:
- The test had no upper-difficulty items in the domain, or
- The engine reached its stopping rule too early (a common problem with poorly implemented CATs).
Either way: It didn’t actually measure your ability. It just ran out of questions.
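Here is a hedged sketch of what that stopping rule looks like, and why an all-correct run signals the engine gave up rather than found a ceiling. The item pool, prior, and threshold are invented, and the ability update is a simple Bayesian grid (EAP) estimate rather than whatever a particular vendor ships; the behavior is the point.

```python
import numpy as np

def p_correct(theta, a, b):
    """2PL probability of a correct response."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

# Hypothetical item pool (a, b); note there is nothing harder than b = 1.0.
pool = [(1.2, -1.0), (1.0, -0.5), (1.3, 0.0), (1.1, 0.5), (1.4, 1.0)]

grid = np.linspace(-4, 4, 161)      # ability grid for a simple Bayesian (EAP) update
posterior = np.exp(-grid**2 / 2)    # standard-normal prior
posterior /= posterior.sum()

SE_TARGET = 0.30                    # stop once the ability estimate is precise enough
administered = 0

for a, b in pool:                   # a real CAT would pick each item by max information
    response = 1                    # the examinee answers everything correctly
    like = p_correct(grid, a, b) if response else 1 - p_correct(grid, a, b)
    posterior *= like
    posterior /= posterior.sum()
    administered += 1

    theta_hat = float(np.sum(grid * posterior))
    se = float(np.sqrt(np.sum((grid - theta_hat)**2 * posterior)))
    if se < SE_TARGET:
        break

print(f"items given: {administered}, theta estimate: {theta_hat:.2f}, SE: {se:.2f}")
# With only easy and medium items and all-correct responses, the SE never reaches
# the target; the loop simply exhausts the pool. A perfect score here means the
# engine ran out of questions, not that it found a ceiling.
```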
No “observable skill gaps” is a nonsense conclusion from such a small sample
This is like:
- giving someone 25 random math problems,
- which all happen to be easy,
- and concluding: “You have mastered all of mathematics.”
A high score on a narrow sample ≠ mastery of the whole domain.
In psychometrics, this is called domain underrepresentation, which is the single most common assessment error.
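A quick way to quantify this: even a perfect 25/25 score leaves a wide confidence interval on the proportion of the domain actually mastered. The sketch below uses a standard Wilson score interval and assumes, generously, that the 25 items were a random sample of the domain.

```python
import math

def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score confidence interval for a binomial proportion."""
    p_hat = successes / n
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2))
    return center - half, center + half

low, high = wilson_interval(25, 25)
print(f"95% CI for 'proportion of domain mastered': {low:.2f} to {high:.2f}")
# Roughly 0.87 to 1.00: a perfect 25-item score is still consistent with gaps
# across more than a tenth of the domain, before even accounting for item difficulty.
```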
The Real Issue
This platform is offering:
- a diagnostic test
- personalized learning recommendations
- confidence-based claims about your “expert” level
But mathematically: A 25-item sample cannot produce that level of diagnostic precision.
The recommendation engine has no data to work with, so it returns the only thing it can:
“No observable skill gaps.”
This is not an insight; it’s a failure state.
What Does a Real Adaptive Diagnostic Look Like?
A legitimate assessment would:
- map skills to a competency model
- sample multiple items per skill
- adapt at the skill-cluster level
- continue until confidence intervals shrink
- estimate your ability, not your score
- return granular skill-gap probabilities (see the sketch below)
And would require at least:
- 60–120 items for broad domains
- 30–50 items for narrow domains
- or continuous sampling until uncertainty is low
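As a minimal sketch of what “granular skill-gap probabilities” could mean in practice (not how any particular platform does it): give each skill cluster its own items, keep a Beta posterior over per-skill proficiency, and report the probability that proficiency falls below a mastery threshold. The skills, counts, and threshold below are invented, and a production system would use IRT-calibrated items rather than raw counts.

```python
from scipy.stats import beta

MASTERY_THRESHOLD = 0.75   # "skill gap" = P(true proficiency < 0.75)

# Hypothetical per-skill results: (items administered, items correct)
skills = {
    "skill_area_a": (8, 7),
    "skill_area_b": (6, 3),
    "skill_area_c": (5, 5),
}

for skill, (n, k) in skills.items():
    # Beta(1 + correct, 1 + incorrect) posterior from a uniform prior
    gap_probability = beta.cdf(MASTERY_THRESHOLD, 1 + k, 1 + (n - k))
    print(f"{skill:15s} P(skill gap) = {gap_probability:.2f}")

# Even 5/5 on a skill leaves a meaningful gap probability (about 0.18 here),
# which is exactly the uncertainty a 25-item, whole-domain "no gaps" verdict hides.
```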
Anything else is a quiz wearing a lab coat.