From the December
1990 meeting, BASS Vol. 19 No. 3
WISHFUL THINKING
by Tom Nousaine (Illinois)
In the November 1990 Stereophile
editor John Atkinson and staffer Will Hammond provide a good model for wishful-thinking
analysis in discussing the results of their CD-tweak listening tests ("As
We See It: Music, Fractals and Listening Tests"). In January of 1991 Martin
Colloms reprises the original wishful-thinking paradigm in discussing his well-known
1986 amplifier comparisons ("As We See It: Working the Front Line").
I call their analyses wishful because
they draw conclusions based on evidence that doesn't support such findings.
In the CD-tweak test Atkinson and
Hammond conducted a 3222-trial single-blind listening experiment to determine
whether CD tweaks (green ink, Armor-All, expensive transports) altered the sound
of compact-disc playback. Subjects overall were able to identify tweaked vs untweaked
CDs only 48.3% of the time, and the proportion that scored highly (five, six,
or seven out of seven trials--Stereophile's definition of a keen-eared listener)
was well within the range to be expected if subjects had been merely guessing.
Atkinson declared that there were
"some listeners who could and did hear a difference." In response to
several letters showing how the statistics didn't support this conclusion, Hammond
insisted that "...the total of the tweaks used resulted in a sonic difference
that was detected correctly well beyond the probability of it being a chance occurrence"
(February 1991, p. 65).
Given the numbers published, this
conclusion is simply not supported. However, there were analyses which seemed
to support positive results. For example, an analysis of one musical selection,
through all listening sessions, judged by males comparing different transports
is shown as being "significant: p<.001," i.e., the probability of these
scores occurring from chance alone is less than 0.1%.
Further analysis shows 71% (132/186)
correct identifications when A and B were different and only 32% correct (62/194)
when they were the same. Taken alone, the first proportion would be significant
against the 50% criterion; significance requires a score that exceeds 50% by an
amount that depends on the size of the sample. The difference between 71% and 32%,
moreover, seems too great to be a chance happening.
So doesn't this support their conclusions?
Nope--they used the wrong criterion for comparison. When the trials where B was
different from A (A-B or B-A) are combined with the trials where A and B were
the same (A-A and B-B), the combined score of 50.7% correct is not significantly
different from what one would expect by chance. The data do suggest two important
things, though. First, listeners are disposed to report differences even when
there are none. This group in this example reported a difference 68% of the time
when the second presentation was the same as the first. Second, one should have
an equal number of same (A-A/B-B) and different (A-B/B-A) trials when the 50%
criterion is employed. Otherwise the criterion score must be adjusted to account
for response bias, the tendency of subjects to report differences even when a
component is compared with itself. [There is an additional bias problem in the
later trials if the subjects know that the number of same and different trials
are equal. This is not a simple matter. Pub.]
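The comparison against the 50% criterion can be sketched with an exact binomial tail. This is only an illustration under stated assumptions: it uses the counts quoted above (132 of 186 "different" trials and 62 of 194 "same" trials correct), treats every trial as an independent coin flip under pure guessing, and the function name is made up for this sketch.

```python
from math import comb

def binomial_upper_tail(correct: int, trials: int, p: float = 0.5) -> float:
    """Exact probability of scoring `correct` or more out of `trials` by guessing."""
    return sum(
        comb(trials, k) * p**k * (1 - p) ** (trials - k)
        for k in range(correct, trials + 1)
    )

# Counts quoted in the article for the transport comparison.
different_only = binomial_upper_tail(132, 186)      # "different" trials alone
pooled = binomial_upper_tail(132 + 62, 186 + 194)   # "same" and "different" combined

print(f"different trials only: p = {different_only:.2e}")
print(f"pooled trials:         p = {pooled:.2f}")
```

Under these assumptions the "different" trials alone look wildly significant, while the pooled score is the kind of result guessing produces routinely, which is the point being made about choosing the wrong criterion.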
This sort of response bias was first
seen in the blind amplifier tests staged by Quad in 1978. In those, the experimenters
used the preference style of test: subjects were asked, "Do You Prefer A,
Prefer B, or Have No Preference?" Subjects expressed a preference for either
A or B 35% of the time when the amplifier was being compared with itself. They
were, in other words, biased to prefer A or B (i.e., to report a difference) even
when A = B. The people at Quad reported this bias correctly, concluding that based
on the numbers these subjects were unable to identify amplifiers by sound alone.
Years later, in 1986, Martin Colloms
claimed to have proved that amplifiers sound different with a 63% correct rate
in a double-blind test report ("Amplifiers Do Sound Different," Hi-Fi
News and Record Review, May 1986). In this case Colloms made large analytical
errors. He ignored an unusually large part of his experiment (approximately 25%
of the trials), a choice that may have introduced experimental bias. Colloms based
his analysis only on the trials where the amplifiers were different, without compensating
for the response bias already discussed. Listeners scored 63.3% correct during
those trials where the amplifiers were different (95 of the 150 A-B/B-A trials).
However, subjects scored correctly only 65% of the time when the amplifiers were
the same (26 of 40 A-A/B-B trials). Another way of saying this is that subjects
reported a difference 35% of the time (14/40 trials) when there could have been
no difference.
There are two analytical ways to
compensate: 1) compare the correct rate of the sames and the differents; 63.3%
vs 65% is not a significant difference, and 2) adjust the criterion score. Because
of response bias, we would expect a hypothetical 100-trial study in which differences
were inaudible and which had all different comparisons to produce 67.5 correct
responses--35 correct responses because of bias plus 50% of the remaining 65 trials
by guessing. Thus a 63.3% correct rate is below the 67.5% expected due to chance
alone. [It seems to me that Nousaine is trying to have it both ways here: if a
score in the neighborhood of 67% is to be expected on the A-B/B-A trials because
of bias toward reporting a difference, then 65% correct is all the more significant
on the A-A/B-B trials, where the subjects must overcome this bias in their answers.
If you combine the "same" and "different" trials for the Colloms
tests, as the author does for the CD-tweak tests, the results do appear significant.
See the note at the end of the article. Pub.]
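Both compensations described above can be sketched in a few lines. This is a rough illustration, not the author's own calculation: the first function uses a pooled two-proportion z-test with a normal approximation (an assumption that is only approximate with 40 "same" trials), the second reproduces the 35-plus-half-of-65 arithmetic, and both function names are invented here.

```python
from math import erf, sqrt

def two_proportion_p_value(hits_a: int, n_a: int, hits_b: int, n_b: int) -> float:
    """Two-sided p-value for a pooled two-proportion z-test (normal approximation)."""
    pooled = (hits_a + hits_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (hits_a / n_a - hits_b / n_b) / se
    normal_cdf = 0.5 * (1 + erf(abs(z) / sqrt(2)))
    return 2 * (1 - normal_cdf)

def bias_adjusted_chance(bias: float) -> float:
    """Expected correct rate on all-different trials when nothing is audible:
    'different' reports due to bias, plus half of the rest by guessing."""
    return bias + 0.5 * (1 - bias)

# 1) Correct rate on "different" trials (95/150) vs "same" trials (26/40).
p_value = two_proportion_p_value(95, 150, 26, 40)
# 2) Criterion adjusted for the 35% tendency to report a difference.
adjusted = bias_adjusted_chance(0.35)

print(f"sames vs differents: p = {p_value:.2f}")
print(f"adjusted chance criterion = {adjusted:.1%}")
```

The second function simply restates the reasoning in the text; whether that adjustment is the right model of listener behavior is exactly what the publisher's bracketed comment disputes.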
Note that the much attacked ABX technique,
where a forced choice is made, is free of this problem. In an ABX test a criterion
of 50% due to chance is correct given a large enough sample size; however, most
researchers recommend a 75%-correct criterion to eliminate the possibility that
small bias errors will influence the results.
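As a rough illustration of how the score needed to beat a 50% chance baseline depends on sample size, the sketch below finds the smallest number of correct answers whose one-sided binomial tail falls at or below 5%. The 5% level and the function name are assumptions made for this example; it does not model the bias errors the 75% recommendation is meant to guard against.

```python
from math import comb

def min_correct_for_significance(trials: int, alpha: float = 0.05) -> int:
    """Smallest score k with P(X >= k) <= alpha under 50/50 guessing.

    A return value greater than `trials` means no score can reach significance.
    """
    total = 2 ** trials
    tail = 0
    for k in range(trials, -1, -1):
        tail += comb(trials, k)      # running count of outcomes with k or more correct
        if tail / total > alpha:
            return k + 1             # the previous k was the last one still significant
    return 0

for n in (16, 40, 100):
    k = min_correct_for_significance(n)
    print(f"{n} trials: at least {k} correct ({k / n:.0%})")
```

The required percentage drifts down toward 50% as the number of trials grows, which is why a fixed percentage criterion only makes sense alongside a stated sample size.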
Returning to Colloms, what if the
4.2-percentage-point differential (67.5% vs 63.3%) were significant? That is, what if the
4.2-point greater rate at which subjects scored wrong was more than we could attribute to
chance alone? The most logical conclusion is that there was some sort of bias
or systematic error introduced into the study. In Colloms's study this is likely.
He arbitrarily excluded a large number of trials because of poor test conditions.
There may have been bias built into the test procedure itself, or the exclusion
itself may have been systematic.
And returning again to Stereophile
and the CD tweaks, notice the strength of the response bias compared with earlier
tests. In the Quad experiments, audio professionals reported preferences about
a third of the time when amplifiers being compared were the same. Colloms's subjects
heard differences 35% of the time when amplifiers were compared with themselves.
However, the Stereophile report discloses that their subjects answered "different"
a whopping 58% of the time in total when the presentations were identical (41.2%
correct in A-A and B-B comparisons).
This is a phenomenon we must be aware
of in ourselves. People with an interest in sound will tend to hear things simply
from trying not to miss them. These data show we are disposed to "hear"
or "guess about" nonexistent differences one-third to one-half of the
time even if the "coach" is blind. Imagine how strong the tendency can
be when the coach is a trusted friend, reviewer, or salesperson who imposes no
scientific controls on himself. But can't we all just put aside our biases when
listening? Obviously, according to these data, not. [Not being able to put aside
one's biases perfectly and cleanly, simply through effort, is another way of saying
that no one is immune to the placebo effect.--Ed.]
Sometimes research can seem to grow
in significance over time. For instance, in his January 1991 "As We See
It," Martin Colloms recalls his 1986 experiment as being validated by a statistician
who supported his conclusion that amplifiers sound different, even after such
colleagues as Stanley Lipshitz pointed out weaknesses in the analysis. Lipshitz
himself told me that as far as he knew, Colloms has never disclosed any additional
analyses that supported his conclusions, and I can't see any way to support them
given the extensive body of data supplied in the HFNRR report.
Based on their response to previous
controversies, I predict that within two years Stereophile magazine will refer
to their experiment as having proved that CD tweaks were reliably identified under
blind conditions, ignoring the valid statistical objections already raised in
its own pages.
It's sad. While I continue to subscribe
to the magazine for entertainment, I'm not so sure I can accept evaluations of
other people's products in a publication that is not able to evaluate its own
research rationally--especially when many of its writers, and its editor, spend
so much time pointing fingers at those who over the years have added much to our
knowledge of audio.
2002 Update from Tom Nousaine:
There are two available ABX-style
comparison devices. QSC sells an ABX box, and there is a PC-based system (PCABX,
available free from www.pcabx.com) from Arny Krueger, one of the original ABX
Company guys. I have four available: the two above, a one-off made for Bob Carver,
and the original ABX box.
[Publisher's note: A discussion
with Nousaine revealed that he expects results from same/different tests that
are different from those from ABX tests. We did not reach a consensus about
how to interpret the former. He seemed to interpret the 65% correct scores in
Colloms's A-A/B-B amplifier tests as an illustration of a 35% bias toward hearing
differences where there are none, out of his expectation that when the amplifiers
are the same the subjects should say so 100% of the time. But 26/40 correct
is significant to better than 95% if you use the standard ABX criterion. Does
this mean that the ABX's 50% baseline doesn't apply to same/different tests?
Yes, Nousaine said. This answer surprised me; I always thought that the data
were analyzed the same way. Professor Richard Greiner, to pick one example,
conducted same/different tests for his recent AES paper on the detection of
polarity, and carried out the statistical analysis in the familiar manner--with
a null result being 50% correct answers.
One thing we did agree on was that
switching during musical passages (what I call a running-music test) was much
more likely to generate false-difference reports than hearing the same piece
or short passage over and over (a repeated-music test). Without knowing which
method was used, one can't precisely predict the tendency to report nonexistent
differences. I believe that repeated-music testing should be used because it
appears to be more sensitive, and because it more closely mimics the listening
habits of both subjective reviewers and casual listeners. We should take every
opportunity to refine our tests for greater sensitivity and make them duplicate
actual listening conditions more closely.
Nousaine gave me a preliminary
report on an ongoing experiment designed to do these things. Among the new protocols
he uses is one designed to mimic what happens when the non-blind listener chooses
a certain passage that best illustrates a particular sonic feature: After one
round of tests, a second round is conducted using only the music on which correct
answers have been given. The results of this experiment, an amplifier comparison
conducted on an audiophile system by its owner, will be published in due course.--EBM]
Here is a reader's comment:
Dear Sir,
I read the article "Wishful Thinking"
on your website with a lot of interest. It was very enlightening on
the misuse of statistics and how misunderstood they are. Indeed, the
articles criticized in "Wishful thinking" manipulated the
numbers to give a positive conclusion. As the author, Mr. Nousaine,
pointed out, they only used part of their data to support their claims.
Mr. Nousaine then started some complex reasoning to correct this bias.
Unfortunately, the reasoning used by Mr. Nousaine
is also incorrect ...
And the note by the Publisher at the bottom of
the article also shows he does not grasp the statistical theory to
be used in this case.
I will try to explain how this data should be
interpreted.
In the tests described, where subjects listen
to pairs of audio samples and must say whether they are different or not,
a major problem occurs: we do not know the propensity of
the subjects to push that damn "I heard a difference" button
even when there is no difference.
The only way to interpret the data is therefore
to test whether the "I heard a difference" answers are randomly
distributed. This means checking whether the frequency of "I
heard a difference" answers was higher when there was actually a difference
than when there was not.
In the CD-tweak test from Stereophile, for example,
the subjects answered "I heard a difference" 132 times in
the 186 trials where there was a difference. That is 71% of the time.
But they also said "I heard a difference" 132 times in the
194 trials where there was no difference. That is 68% of the time.
It is not very difficult to see that 68% is not statistically different
from 71% (see also below).
However, in the amplifier test, the results are
different: we have 95 "I heard a difference" answers in
the 150 trials where there was actually a difference (63%), but the
subjects answered "I heard a difference" only 14 times out
of the 40 trials where there was no difference (35%).
Here, 63% looks different from 35%, but as the
sample sizes are very different, slightly stricter mathematics is
needed. We have to actually calculate whether chance alone could
give such a difference. If the 109 "I heard a difference" answers
are randomly distributed among the 190 trials, what would be the probability
of getting 14 or fewer such answers in a random subsample of 40 trials
by chance alone? This is a classical "marbles in a bag"
problem that can be calculated using the hypergeometric distribution.
Using the nice calculator at Stat Trek (http://www.stattrek.com/Tables/Hypergeometric.aspx,
just enter the parameters of the problem: 190, 40, 109, 14), we
can calculate that the probability of obtaining these numbers by chance
alone is about 0.12%. This is well below any statistical criterion
for randomness, so we must conclude that, statistically, the subjects
could pinpoint a faint difference between the amplifiers, even if
they have a high propensity to hear a difference when there is none.
(Doing the same hypergeometric calculation on
the CD-tweak example, the probability of obtaining those numbers by
chance would be 30.5%, which is much higher than the usual 5% threshold
used in statistics. We must therefore conclude that the answers are
randomly distributed and that the subjects press the "I heard
a difference" button 69% (264/380) of the time, whether a difference
actually exists or not.)
I hope this clarifies the matter a bit. Debunking
false ideas is good, but only if you use correct reasoning.
Best regards,
Olivier Van Cantfort
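For readers without access to an online calculator, the "marbles in a bag" calculation in the letter can be sketched with the standard library alone. The function and parameter names are invented for this illustration, and the parameter order differs from the calculator's; the counts are the ones quoted in the letter.

```python
from math import comb

def hypergeom_lower_tail(population: int, marked: int, sample: int, observed: int) -> float:
    """P(X <= observed), where X counts marked items in `sample` draws made
    without replacement from `population` items, `marked` of which are marked."""
    unmarked = population - marked
    lowest = max(0, sample - unmarked)   # fewest marked items the sample can contain
    favourable = sum(
        comb(marked, k) * comb(unmarked, sample - k)
        for k in range(lowest, observed + 1)
    )
    return favourable / comb(population, sample)

# Amplifier test: 109 "I heard a difference" answers among 190 trials;
# 14 or fewer of them fall in the 40 "same" trials.
amplifiers = hypergeom_lower_tail(190, 109, 40, 14)

# CD-tweak test: 264 such answers among 380 trials;
# 132 or fewer of them fall in the 194 "same" trials.
cd_tweaks = hypergeom_lower_tail(380, 264, 194, 132)

print(f"amplifiers: {amplifiers:.2%}")
print(f"CD tweaks:  {cd_tweaks:.1%}")
```

Exact integer arithmetic is used until the final division, so the result does not depend on floating-point rounding of the very large binomial coefficients.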