Benchmark methodology

dvno · July 27, 2015

Hi all, I'm new to the site. I'm getting back into custom PC building after a good long eight-to-ten-year hiatus (back when AMD named their processors after horses and all).

I've been doing a lot of research into my upcoming build, and have glossed over a ton of benchmarks. And in doing so, I've come up with some generic questions that I was hoping might spark a general discussion about benchmark methodology. Some of these questions are completely nit-picky, and others stem from my background in designing and running psych and other empirical studies.

1. Why aren't folks doing the benchmarking more transparent about replication (assuming they replicate)? Like, how many times is a benchmark run, and how do they ensure that the system is in a similar state (heat and otherwise) before each benchmark? How can they identify order effects, if they exist, and how do they control for them (maybe running 3DMark before PCMark will result in different scores than the other way around)?

2. Why don't folks employ basic stats to identify whether some purported difference is actually significant? Like, basic means analyses (t-tests) and identifying standard errors?

3. Would more transparent methods and better stats be something that members of the community would want to see? If not, why not?

and, 4. What are the basic methods that LTT uses, in terms of replication and confound elimination?

Yoinkerman · July 27, 2015

They gloss over that stuff because no one cares to listen to the testing methodology over and over and over again.

Mostly they run the benchmarks on a clean Windows install and cold boot between runs. The difference that temperature makes between runs isn't significant; it doesn't take a statistician to figure out that a 30 point delta on a 5000 3DMark score between runs isn't significant. I'm sure they could figure out margins of error and stuff, but largely there's no point. Benchmarks are supposed to provide general ideas, not be research.

Edit: testing methods are around, if you dig deeper

dvno · July 27, 2015

Good point. I think that answered (at least subjectively) one question I had - whether people care about the specifics and the stats. But let me press the point again, in a different way:

A 30 "unit" difference in any benchmark is meaningless unless we know both what we can attribute the difference to and whether the difference is replicable.

Coming from an amateurish stats background, I don't know whether a 30 unit difference means anything unless there is some standard measure of error, or at least some guidelines for replication.

Also, even if most people didn't care, if one benchmarked well (with a statistically sound method), then one could post that data as well, for those who might care. That is, part of what it is to benchmark well is to do it in some statistically tractable way. Or isn't it?
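To make the "standard measure of error" idea concrete, here is a minimal sketch in Python. The scores are invented for illustration, and Welch's t statistic is just one reasonable choice of test, not something any reviewer in this thread is known to use.

```python
import math
import statistics

def summarize(scores):
    """Return (mean, sample standard deviation, standard error of the mean)."""
    mean = statistics.mean(scores)
    sd = statistics.stdev(scores)        # sample standard deviation (n - 1)
    se = sd / math.sqrt(len(scores))     # standard error of the mean
    return mean, sd, se

def welch_t(a, b):
    """Welch's t statistic for two independent samples with unequal variances."""
    mean_a, _, se_a = summarize(a)
    mean_b, _, se_b = summarize(b)
    return (mean_a - mean_b) / math.sqrt(se_a ** 2 + se_b ** 2)

# Hypothetical repeated runs on two cards; the means differ by 30 points.
card_a = [5010, 5035, 4990, 5020, 5005]
card_b = [5040, 5060, 5025, 5055, 5030]

print("card A (mean, sd, se):", summarize(card_a))
print("card B (mean, sd, se):", summarize(card_b))
print("Welch t:", welch_t(card_b, card_a))
```

The t value only means something once it is compared against a t distribution with the appropriate degrees of freedom, and the answer depends entirely on how noisy the individual runs are. The same 30 point gap could be clear-cut or lost in the noise.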

LogicalDrm · July 28, 2015

Testing with PC software isn't that exact, nor does it need to be as exact as scientific testing. Most of the time each test is run once. In my own tests I've found that the difference between runs is 1-20 points on scores of 1k-10k points. Not so important. For a difference to be significant, we would be talking about a few hundred points. You can also notice this in many LTT comparisons: the graph shows some difference, but Luke still says that the two things (e.g. GPUs) are on par.

With something like temperature testing, there is a bit more to check, like ambient temperature and whether other components are being stressed at the same time. These things make comparing two different reviews pretty much worthless. Within a single review, temps are usually presented as the delta between ambient and the max/avg of the cores. That eliminates most of the errors the testing could introduce. In many CPU cooler reviews you can also see that tests are run with a ton of different configs of fans, speeds and overclock settings. That is about as much variation as you get in PC hardware testing.
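The delta-over-ambient idea can be sketched in a few lines of Python. The readings below are hypothetical, and the function name is made up for this example.

```python
def delta_over_ambient(core_temps_c, ambient_c):
    """Return (max delta, average delta) of core temperatures over ambient, in degrees C."""
    deltas = [temp - ambient_c for temp in core_temps_c]
    return max(deltas), sum(deltas) / len(deltas)

# Hypothetical readings: the same cooler tested in a 21 C room and a 25 C room.
cool_room = delta_over_ambient([68.0, 71.0, 70.0, 69.0], ambient_c=21.0)
warm_room = delta_over_ambient([72.0, 75.0, 74.0, 73.0], ambient_c=25.0)

print("cool room (max, avg delta):", cool_room)
print("warm room (max, avg delta):", warm_room)
```

The raw core temperatures differ by 4 degrees between the two rooms, but the deltas come out the same. This treats the cooler's behavior as shifting one-for-one with ambient, which is only an approximation.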

As a natural science student I'm bored to death staring at the equations and configs they present in EVERY scientific article. It's quite enough that reviews post their testing rig and OC settings without going into further detail, like explaining the difference between precision and accuracy every time.

Yoinkerman · July 28, 2015

They have the testing methodology somewhere in a document so people can follow along at home. I haven't bothered to look for it myself.

A whole pile of people all over the world in a bunch of different situations all run the synthetic benchmarks of 3dmark and the like that you can compare your results to. With a given CPU/GPU combo there's a small spread of results. Some motherboards produce slightly higher scores sometimes, some slightly lower. After a while of doing a bunch of benchmark runs, you kind of get a feel for what to expect and where each part should fall. If it doesn't perform as expected then something is up.

No, they don't benchmark to a scientific standard. Basically a CPU gets about x many 3dmarks and a gpu gets y many 3dmarks, and if the results from testing aren't close they investigate further. You could probably figure out the curve of results and how many fall in the 95th or 98th percentile and figure out the proper margin of error but there's no need when you can just eyeball it.
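For anyone who would rather not eyeball it, here is a rough sketch of "figuring out the curve of results" in Python. It assumes you already have a list of scores for one CPU/GPU combo; the numbers are invented, and using the central 95% as the cutoff is an arbitrary choice for illustration.

```python
import statistics

def spread_summary(scores):
    """Return the median and the 2.5th and 97.5th percentiles of a set of scores."""
    # n=40 yields 39 cut points spaced 2.5 percentile points apart.
    cuts = statistics.quantiles(scores, n=40, method="inclusive")
    return {"median": statistics.median(scores), "p2_5": cuts[0], "p97_5": cuts[-1]}

def is_unusual(score, summary):
    """Flag a result that falls outside the central 95% of the reference scores."""
    return score < summary["p2_5"] or score > summary["p97_5"]

# Hypothetical reference results for one hardware combo: 4900, 4910, ..., 5100.
reference = list(range(4900, 5101, 10))
summary = spread_summary(reference)

print(summary)
print("4700 unusual?", is_unusual(4700, summary))
print("5010 unusual?", is_unusual(5010, summary))
```

A flagged score is a prompt to investigate further (drivers, thermals, background tasks), not proof that something is wrong.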

Tie Lightning · July 29, 2015

For me, benchmarks get to the point where I don't really mind if they don't tell me how they did it... If I see a benchmark chart from 4 separate sources and the R9 Fury performs slightly better than the 980 but worse than the Fury X and the 980 Ti... I don't care, because that is how it is. I don't trust just one source, but when there are multiple published sources unanimously agreeing I don't think twice...

LogicalDrm · July 29, 2015

Tie Lightning said: "For me, benchmarks get to the point where I don't really mind if they don't tell me how they did it... If I see a benchmark chart from 4 separate sources and the R9 Fury performs slightly better than the 980 but worse than the Fury X and the 980 Ti... I don't care, because that is how it is. I don't trust just one source, but when there are multiple published sources unanimously agreeing I don't think twice..."

For me many GPU and CPU benchmark scores have become irrelevant. I just want to see the graph and compare how tall/long the piles are.

dvno · July 29, 2015

EDIT: SUMMARY

Not everyone cares that much about benchmarks.
Not everyone should care that much about benchmarks.
You should care about benchmark methods if:

You want to be able to compare various benchmarker scores.
You want to be able to compare various benchmarker methods.
You consider / want to consider benchmarking as a quasi or bona fide science.
You want to find the best way to benchmark components, where that means that you're really identifying what is causing a purported difference in a benchmark, or you've devised a bona fide replicable setup / method, or you're using basic stats to identify significant differences.

P.S. What the hell is a "mark" unit in 3DMark anyhow? Or any unit in any of these benchmarks? That's probably a separate issue.

***

Good points all around. I think it is safe to say, given what everyone has written, that most folks who are using benchmarks to guide their purchasing decisions are being very conscientious about checking as many benchmarks as possible and taking them in with a skeptical eye. That still leaves three related and general questions open: What would it take to "benchmark better?" How feasible is it to "benchmark better?" And how might one encourage people to be more transparent about their benchmarking (more on that later)? A few comments before I dig into these questions a little more.

Yoinkerman said: "They have the testing methodology somewhere in a document so people can follow along at home. I haven't bothered to look for it myself.

A whole pile of people all over the world in a bunch of different situations all run the synthetic benchmarks of 3dmark and the like that you can compare your results to. With a given CPU/GPU combo there's a small spread of results. Some motherboards produce slightly higher scores sometimes, some slightly lower. After a while of doing a bunch of benchmark runs, you kind of get a feel for what to expect and where each part should fall. If it doesn't perform as expected then something is up.

No, they don't benchmark to a scientific standard. Basically a CPU gets about x many 3dmarks and a gpu gets y many 3dmarks, and if the results from testing aren't close they investigate further. You could probably figure out the curve of results and how many fall in the 95th or 98th percentile and figure out the proper margin of error but there's no need when you can just eyeball it."

1) It would be sweet to find that document. If anyone can send a link I'd super love to nerd out over it.

2) It would be sweet to have some way to collate and organize these benchmark scores across benchmarks for a given component (Though, I'm sure that someone will quickly point out an obvious source to this, as it probably does exist but I'm just too daft to find it).

3) I don't know about "eyeballing it." Standard errors and significance depend on sample size: a difference of 2% might be meaningful when you have a sample size in the hundreds, but not if you have a smaller sample. The problem then becomes, "what unites two or more members of a sample?" If you're concerned with one GPU across the board versus one GPU + CPU combo, then you've got two very different samples.
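A back-of-the-envelope sketch of the sample-size point, in Python. It assumes two groups with the same number of runs and the same run-to-run standard deviation, and it uses a normal approximation (for very small samples a t distribution would be more appropriate). All the numbers are hypothetical.

```python
import math

def z_for_difference(diff, sd, n):
    """Approximate z score for a difference between two group means.

    Assumes both groups have n runs each and share the standard deviation sd.
    """
    se_diff = sd * math.sqrt(2.0 / n)   # standard error of the difference
    return diff / se_diff

# Hypothetical: a 2% gap on a 5000 point score (100 points), with an sd of 150 points.
for n in (3, 30, 300):
    print(n, "runs per group -> z =", round(z_for_difference(100, 150, n), 2))
```

The gap stays fixed at 100 points, but the z score grows with the square root of the sample size, so the same 2% can be indistinguishable from noise at 3 runs and unmistakable at 300. How large a sample is needed depends on the actual run-to-run spread, which is exactly the number reviews rarely publish.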

Tie Lightning said: "For me, benchmarks get to the point where I don't really mind if they don't tell me how they did it... If I see a benchmark chart from 4 separate sources and the R9 Fury performs slightly better than the 980 but worse than the Fury X and the 980 Ti... I don't care, because that is how it is. I don't trust just one source, but when there are multiple published sources unanimously agreeing I don't think twice..."

4) I get it. I think that this goes to the introductory point: Most people who know a thing or two about pc parts probably approach benchmarks a certain way (Although, it might be nice to do a study to see what pc enthusiasts are looking for in a benchmark, but that's another project that will require an interesting ethical review board application).

LogicalDrm said: "For me many GPU and CPU benchmark scores have become irrelevant. I just want to see the graph and compare how tall/long the piles are."

5) I don't quite buy this. You can manipulate most graphs to make any difference in two or more bars look gargantuan or minuscule. But then again, some people might not care about benchmarks at all. But then, we have to ask two separate questions: Why not? Is it because they aren't providing meaningful / useful / reliable data? Or is there a better way to choose parts etc.?
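The graph point can be shown with plain arithmetic. This Python sketch computes how much longer one bar is drawn than another when the axis does not start at zero. The scores and the function name are made up for the example.

```python
def apparent_ratio(score_a, score_b, axis_start=0.0):
    """Ratio of drawn bar lengths when the axis starts at axis_start instead of 0."""
    if axis_start >= min(score_a, score_b):
        raise ValueError("axis_start must be below both scores")
    return (score_a - axis_start) / (score_b - axis_start)

# Hypothetical scores of 5100 and 5000: a real difference of 2%.
from_zero = apparent_ratio(5100, 5000)                    # axis starts at 0
truncated = apparent_ratio(5100, 5000, axis_start=4900)   # axis starts at 4900

print("bar ratio with axis from zero:", from_zero)
print("bar ratio with truncated axis:", truncated)
```

With the axis starting at 4900, the taller bar is drawn twice as long as the shorter one even though the scores differ by only 2%, so "how tall the piles are" depends on where the chart's baseline sits.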

******

This all came about because of an LTT video where Luke, who seems like a really swell guy, claims that he is doing some science and that we should click the "like" button to see him do more science on some video card comparison. I understand that he was probably saying this in a joking way. I didn't really think of benchmarks as science until then, but then I got to thinking: "This isn't science, it's not nearly controlled or replicable enough," which then got me asking what it would take to make benchmarking a science, and whether that would be a good thing.

So here are some preliminary ideas:

A) Be more transparent about which confounds you controlled for and how. For synthetic tests: how often you installed a new image, cool-down times between tests, pilot tests to observe and control for any order effects (if there are any; say running one benchmark before the other messes with some low-level components, which then might have a residual effect on another test), etc.

A(i)) For thermal tests, the clear and simple solution is to use an air-conditioned room. LTT probably will be able to do this once they move to their new office, which looks very nice and much more lab-worthy.

B) Be more transparent about replication: how many times did you run the test, and why did you stop at one number rather than another (say 3 versus 4 data points)? Probably the best way to do this is to look at averages, and then figure out whether the next test is significantly different from the previous average (where a set of average scores would approach a Gaussian distribution).

C) Use stats derived from (B) to enable one benchmarker to really compare, in a more apples-to-apples way, their benchmark methods to another benchmarker's, and have a discussion about which methods are better, with data to support their claim; which might help us come to...

D) An easily searchable, collated database for benchmark scores across components and control conditions.
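As a rough sketch of idea (B), here is one possible stopping rule in Python. It swaps the "is the next run significantly different from the running average" check for a simpler criterion: keep running until the standard error of the mean drops below a chosen fraction of the mean. The thresholds, the function name, and the simulated benchmark are all assumptions made for illustration.

```python
import math
import random
import statistics

def run_until_stable(run_benchmark, rel_se_target=0.005, min_runs=3, max_runs=30):
    """Repeat a benchmark until the standard error of the mean is small enough.

    run_benchmark is any zero-argument callable that returns one positive score.
    Stops once SE / mean <= rel_se_target, or after max_runs runs.
    """
    scores = []
    while len(scores) < max_runs:
        scores.append(run_benchmark())
        if len(scores) >= min_runs:
            mean = statistics.mean(scores)
            se = statistics.stdev(scores) / math.sqrt(len(scores))
            if se / mean <= rel_se_target:
                break
    return scores

# Stand-in for a real benchmark: noisy scores around 5000 (hypothetical numbers).
rng = random.Random(0)
scores = run_until_stable(lambda: rng.gauss(5000, 40))

print("runs needed:", len(scores))
print("mean score:", statistics.mean(scores))
```

Reporting the rule alongside the results ("we ran until the standard error was under 0.5% of the mean, which took N runs") would answer the "why 3 and not 4" question directly. Any data-dependent stopping rule can bias results slightly, so a fixed run count chosen from a pilot test is a reasonable alternative.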

But, again, these are just some ideas. And again, for most people I don't think that this level of replication and accuracy is necessary, or even wanted, apparently. I don't think we would lose anything by advocating for a tighter level of benchmark design and methods: those who just want to glance at the charts still can, but those who want to make meaningful comparisons between benchmarks would now be able to do so. It's a false dilemma to suggest that advocating for better benchmark methods will be a detriment to those who don't particularly care either way. In any case, such tighter scrutiny of benchmark methods is certainly a minimum if you want to bring the title of science into the discussion, or even if you want to have an informative discussion about benchmarks between people and between components.

But I could be totally wrong.

Yoinkerman · July 29, 2015

Doing benchmarks more scientifically also means paying an employee umpteen more hours to do work that doesn't have a lot of forward value.

I guess at the end of the day, more empirical testing costs money that reviewers aren't willing to spend.

There are some controls in place, though I couldn't say offhand what they are or how they account for different things.

I would think they reimage Windows for every different video card to minimize driver interference, cold boot to account for program interference (or order effects), and do each benchmark 3+ times. I'm sure they've run the same benchmark 30 times and have decided 3 or 5 produces a set they're happy with. Having a 3 point average produce a result that's within a few % of a 30 point average is close enough to not waste the manpower.
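The "3 point average versus 30 point average" claim is easy to check by simulation. This Python sketch assumes run-to-run noise is roughly Gaussian with a standard deviation of 30 points on a 5000 point score. Those numbers are guesses loosely based on the figures mentioned in this thread, not measurements.

```python
import random
import statistics

def compare_short_and_long_average(mean=5000.0, sd=30.0, short_n=3,
                                   long_n=30, trials=1000, seed=1):
    """Simulate how far a short_n-run average lands from a long_n-run average.

    Returns (median gap, largest gap), each as a fraction of the long average.
    """
    rng = random.Random(seed)
    gaps = []
    for _ in range(trials):
        runs = [rng.gauss(mean, sd) for _ in range(long_n)]
        long_avg = statistics.mean(runs)
        short_avg = statistics.mean(runs[:short_n])   # only the first few runs
        gaps.append(abs(short_avg - long_avg) / long_avg)
    return statistics.median(gaps), max(gaps)

median_gap, worst_gap = compare_short_and_long_average()
print("typical gap between 3-run and 30-run average:", median_gap)
print("worst gap seen across the simulated trials:", worst_gap)
```

Under these assumed noise levels the short average stays close to the long one, which supports the "close enough" argument. The conclusion flips if the real spread is much larger, so it is worth plugging in a measured standard deviation rather than a guessed one.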

LogicalDrm · July 30, 2015

So many points to reply to. I'll get back to this later today when I have a better keyboard. But for starters, don't take everything said in reviews to the letter. After all, they are made either to sell or to entertain.
