Optimal Number of Test Runs

I’ve been looking at some public tests and many people seem to run only one test. Others run 3. I run 10. Is there a sweet spot?

Funnily, in Pat’s recent article in the Performance Calendar, he says that he runs 9 tests.

“Each test was configured for 9 runs so that we could collect enough data to be reasonably confident with the results…”

Yeah, more than one run is absolutely critical if you are looking at the times. If you are just looking at how the page is constructed (requests, waterfall, caching, etc.), then a single run works OK for a quick snapshot.

I usually do 9 instead of 10 because the times are from the median run, and a median makes more sense for an odd number of samples. Depending on the page, even 9 may not be enough. If you click on the “plot full results” link below the data table you can see what the distribution of the times looks like for all of the runs and where the median run times were. Eyeballing that should give you a better sense of whether you are looking at a representative result or at a page that has wildly variable performance.
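For a rough numeric version of that eyeball check, here is a minimal Python sketch. It assumes you have copied the load times for each run out of the results by hand; the numbers and the `summarize_runs` helper are made up for illustration and are not part of the tool.

```python
import statistics


def summarize_runs(load_times_ms):
    """Return min, median, max and the spread relative to the median."""
    ordered = sorted(load_times_ms)
    median = statistics.median(ordered)
    return {
        "min": ordered[0],
        "median": median,
        "max": ordered[-1],
        # (max - min) / median: a large value suggests highly variable runs.
        "relative_spread": (ordered[-1] - ordered[0]) / median,
    }


# Hypothetical load times (ms) for nine runs of one page; one run is an outlier.
runs = [2100, 2250, 2180, 2400, 2150, 3900, 2200, 2120, 2300]
summary = summarize_runs(runs)
print(summary)
```

A large relative spread is a hint that a single median figure may not be representative and that the full distribution is worth a look.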

Thanks, Pat. Nice pro tip to run an odd number of tests. Now I’m curious how the median is selected for an even number of tests. (Is there a tiebreaker?)

A larger sample size is always better, but is there an imposed or suggested upper limit on the number of tests? I wouldn’t want to hog the tool’s resources.

Yeah, the tool doesn’t let you submit more than 10 runs, which is why you see a lot around 9 or 10 :)

I believe the median for an even number of tests is actually the run that’s on the faster side of the midpoint (not a true median, since it picks a specific run).
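As a sketch of that selection rule (not the tool’s actual code), the Python below sorts the runs by a metric and takes the lower-middle one, so an even count resolves to the faster side of the midpoint. The `pick_median_run` helper, the run dictionaries, and the timings are all hypothetical.

```python
def pick_median_run(runs, metric="loadTime"):
    """Pick the run in the middle of the sorted order.

    For an even number of runs, this picks the lower (faster) of the
    two middle runs, so the result is always one specific run.
    """
    if not runs:
        raise ValueError("need at least one run")
    ordered = sorted(runs, key=lambda run: run[metric])
    return ordered[(len(ordered) - 1) // 2]


# Hypothetical load times (ms) for ten runs.
times = [2100, 2250, 2180, 2400, 2150, 3900, 2200, 2120, 2300, 2500]
all_runs = [{"run": i + 1, "loadTime": t} for i, t in enumerate(times)]

median_even = pick_median_run(all_runs)      # 10 runs: lower-middle run
median_odd = pick_median_run(all_runs[:9])   # 9 runs: exact middle run
print(median_even, median_odd)
```

With ten runs the two middle values here are 2200 and 2250, and the sketch picks the 2200 run; with nine runs there is a single middle run and no tie to break.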


Would it be possible to let users choose the metric by which the median run is selected? For example, Speed Index or Start Render instead of Load Time?

I believe you may be working on this already, as I stumbled across a query param somewhere (can’t find it now) called something like criteria=loadTime.

Best Regards,

Jason Hofmann

I don’t have UI for it right now, but you can pass medianMetric=X to have it use an arbitrary metric (if you look at the XMLResult for a given test you can see all of the metric names).
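A minimal Python sketch of building such a URL follows. Only the medianMetric parameter comes from the post above; the host name, the `xmlResult.php` path, the `test` parameter, the test ID, and the `SpeedIndex` metric name are assumptions for illustration, so check them against your own instance and the metric names in the XML result.

```python
from urllib.parse import urlencode


def result_url(base, test_id, median_metric=None):
    """Build a result URL, optionally asking for a specific median metric."""
    params = {"test": test_id}  # "test" parameter name is an assumption
    if median_metric:
        params["medianMetric"] = median_metric
    # "xmlResult.php" is an assumed path, not confirmed by the thread.
    return f"{base}/xmlResult.php?{urlencode(params)}"


# Placeholder host and test ID; replace with real values.
url = result_url("https://wpt.example.com", "TEST_ID", "SpeedIndex")
print(url)
```

Leaving `median_metric` unset omits the parameter, in which case the default (loadTime, per the thread) would apply.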

Did I document any that aren’t supported or aren’t correct? (It’s hard to tell by testing, since a few of them would, and should, return the same result as the default, loadTime):

Those should all work (when and if they apply; domTime doesn’t always). It also works for all of the non-time measurements (bytes, requests, etc.).