Wishful Thinking
A sample article from the archives of the B.A.S. Speaker

From the December 1990 meeting, BASS Vol. 19 No. 3

WISHFUL THINKING by Tom Nousaine (Illinois)

In the November 1990 Stereophile, editor John Atkinson and staffer Will Hammond provide a good model for wishful-thinking analysis in discussing the results of their CD-tweak listening tests ("As We See It: Music, Fractals and Listening Tests"). In January of 1991 Martin Colloms reprises the original wishful-thinking paradigm in discussing his well-known 1986 amplifier comparisons ("As We See It: Working the Front Line").

I call their analyses wishful because they draw conclusions based on evidence that doesn't support such findings.

In the CD-tweak test Atkinson and Hammond conducted a 3222-trial single-blind listening experiment to determine whether CD tweaks (green ink, Armor-All, expensive transports) altered the sound of compact-disc playback. Subjects overall were able to identify tweaked vs untweaked CDs only 48.3% of the time, and the proportion that scored highly (five, six, or seven out of seven trials--Stereophile's definition of a keen-eared listener) was well within the range to be expected if subjects had been merely guessing.

Atkinson declared that there were "some listeners who could and did hear a difference." In response to several letters showing how the statistics didn't support this conclusion, Hammond insisted that "...the total of the tweaks used resulted in a sonic difference that was detected correctly well beyond the probability of it being a chance occurrence" (February 1991, p. 65).

Given the numbers published, this conclusion is simply not supported. However, there were analyses which seemed to support positive results. For example, an analysis of one musical selection, through all listening sessions, judged by males comparing different transports is shown as being "significant: p<.001," i.e., the probability of these scores occurring from chance alone is less than 0.1%.

Further analysis shows 71% (132/186) correct identifications when A and B were different and only 32% correct (62/194) when they were the same. The first proportion would be significant when compared with the 50% criterion, which is a score that exceeds 50% by an amount that depends on the size of the sample. The difference between 71% and 32%, moreover, seems too great to be a chance happening.

So doesn't this support their conclusions? Nope--they used the wrong criterion for comparison. When the trials where B was different from A (A-B or B-A) are combined with the trials where A and B were the same (A-A and B-B), the combined score of 50.7% correct is not significantly different from what one would expect by chance. The data do suggest two important things, though. First, listeners are disposed to report differences even when there are none. This group in this example reported a difference 68% of the time when the second presentation was the same as the first. Second, one should have an equal number of same (A-A/B-B) and different (A-B/B-A) trials when the 50% criterion is employed. Otherwise the criterion score must be adjusted to account for response bias, the tendency of subjects to report differences even when a component is compared with itself. [There is an additional bias problem in the later trials if the subjects know that the numbers of same and different trials are equal. This is not a simple matter. Pub.]
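The comparison above can be checked with a short calculation. The following Python sketch is an added illustration, not part of the original analysis. It assumes a one-sided exact binomial test against pure guessing, and the function name binomial_upper_tail is made up for this example. The counts are the ones quoted above; pooled, they come to 194 of 380, or about 51%, in line with the roughly 50% combined score reported.

```python
from math import comb

def binomial_upper_tail(correct: int, trials: int, p: float = 0.5) -> float:
    """P(X >= correct) for X ~ Binomial(trials, p): the chance of doing this well by guessing."""
    return sum(comb(trials, k) * p**k * (1 - p)**(trials - k)
               for k in range(correct, trials + 1))

# Counts quoted in the article for the one "significant" subset.
different_correct, different_trials = 132, 186   # A-B / B-A trials
same_correct, same_trials = 62, 194              # A-A / B-B trials

# Pool both kinds of trial before comparing with the 50% criterion.
pooled_correct = different_correct + same_correct
pooled_trials = different_trials + same_trials

p_different_only = binomial_upper_tail(different_correct, different_trials)
p_pooled = binomial_upper_tail(pooled_correct, pooled_trials)

print(f"different trials only: {different_correct}/{different_trials}, p = {p_different_only:.2g}")
print(f"all trials pooled:     {pooled_correct}/{pooled_trials}, p = {p_pooled:.2g}")
```

Under these assumptions the different-only score looks overwhelmingly significant, while the pooled score is nowhere near the usual 5% threshold. That is the pattern one would expect from listeners who answer "different" most of the time regardless of what is played.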

This sort of response bias was first seen in the blind amplifier tests staged by Quad in 1978. In those, the experimenters used the preference style of test: subjects were asked, "Do You Prefer A, Prefer B, or Have No Preference?" Subjects expressed a preference for either A or B 35% of the time when the amplifier was being compared with itself. They were, in other words, biased to prefer A or B (i.e., to report a difference) even when A = B. The people at Quad reported this bias correctly, concluding that based on the numbers these subjects were unable to identify amplifiers by sound alone.

Years later, in 1986, Martin Colloms claimed to have proved that amplifiers sound different with a 63% correct rate in a double-blind test report ("Amplifiers Do Sound Different," Hi-Fi News and Record Review, May 1986). In this case Colloms made large analytical errors. He ignored an unusually large part of his experiment (approximately 25% of the trials), a choice that may have introduced experimental bias. Colloms based his analysis only on the trials where the amplifiers were different, without compensating for the response bias already discussed. Listeners scored 63.3% correct during those trials where the amplifiers were different (95 of the 150 A-B/B-A trials). However, subjects scored correctly only 65% of the time when the amplifiers were the same (26 of 40 A-A/B-B trials). Another way of saying this is that subjects reported a difference 35% of the time (14/40 trials) when there could have been no difference.

There are two analytical ways to compensate: 1) compare the correct rate of the sames and the differents; 63.3% vs 65% is not a significant difference, and 2) adjust the criterion score. Because of response bias, we would expect a hypothetical 100-trial study in which differences were inaudible and which had all different comparisons to produce 67.5 correct responses--35 correct responses because of bias plus 50% of the remaining 65 trials by guessing. Thus a 63.3% correct rate is below the 67.5% expected due to chance alone. [It seems to me that Nousaine is trying to have it both ways here: if a score in the neighborhood of 67% is to be expected on the A-B/B-A trials because of bias toward reporting a difference, then 65% correct is all the more significant on the A-A/B-B trials, where the subjects must overcome this bias in their answers. If you combine the "same" and "different" trials for the Colloms tests, as the author does for the CD-tweak tests, the results do appear significant. See the note at the end of the article. Pub.]
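Both compensation methods described above reduce to a few lines of arithmetic. The Python sketch below is an added illustration that follows the article's own reasoning, which the publisher's note disputes. The function names are invented for this example, and the choice of a pooled two-proportion z-test for method 1 is an assumption, since the article does not say which test was used.

```python
from math import erf, sqrt

def two_proportion_z(x1: int, n1: int, x2: int, n2: int) -> tuple[float, float]:
    """Pooled two-proportion z-test; returns (z, two-sided p-value)."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    normal_cdf = 0.5 * (1 + erf(abs(z) / sqrt(2)))
    return z, 2 * (1 - normal_cdf)

def bias_adjusted_chance(false_difference_rate: float) -> float:
    """Article's model: answers driven by bias are all 'different' (hence
    correct on different-only trials); the remainder are 50/50 guesses."""
    return false_difference_rate + 0.5 * (1 - false_difference_rate)

# Method 1: correct rate on "different" trials vs. "same" trials.
z, p_value = two_proportion_z(95, 150, 26, 40)
print(f"63.3% vs 65%: z = {z:.2f}, two-sided p = {p_value:.2f}")

# Method 2: chance-level score on different-only trials, given 14/40 false "different" answers.
criterion = bias_adjusted_chance(14 / 40)
print(f"bias-adjusted chance score: {criterion:.1%}")
```

The second function reproduces the 67.5% figure from the 35% bias rate. Whether that model of listener behavior is the right one is exactly the point in dispute between the author and the publisher.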

Note that the much attacked ABX technique, where a forced choice is made, is free of this problem. In an ABX test a criterion of 50% due to chance is correct given a large enough sample size; however, most researchers recommend a 75%-correct criterion to eliminate the possibility that small bias errors will influence the results.
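How far above 50% a forced-choice score has to be depends on the number of trials. As a rough added illustration, the Python sketch below finds the smallest score that pure guessing would reach less than 5% of the time. The 5% threshold and the function names are assumptions made for this example, not figures taken from the ABX literature.

```python
from math import comb

def guessing_tail(correct: int, trials: int) -> float:
    """P(X >= correct) when every forced choice is a fair coin flip."""
    return sum(comb(trials, k) for k in range(correct, trials + 1)) / 2**trials

def min_correct(trials: int, alpha: float = 0.05):
    """Smallest score whose probability under guessing is below alpha (None if no score qualifies)."""
    for correct in range(trials + 1):
        if guessing_tail(correct, trials) < alpha:
            return correct
    return None

for n in (10, 16, 25, 50, 100):
    needed = min_correct(n)
    print(f"{n:3d} trials: {needed} correct ({needed / n:.0%}) needed")
```

With 16 trials, for instance, this threshold works out to 12 correct, which is 75%. The required percentage falls toward 50% as the number of trials grows.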

Returning to Colloms, what if the 4.2-percentage-point differential (67.5%-63.3%) were significant? That is, what if the 4.2-point greater rate at which subjects scored wrong were more than we could attribute to chance alone? The most logical conclusion is that there was some sort of bias or systematic error introduced into the study. In Colloms's study this is likely. He arbitrarily excluded a large number of trials because of poor test conditions. There may have been bias built into the test procedure itself, or the exclusion itself may have been systematic.

And returning again to Stereophile and the CD tweaks, notice the strength of the response bias compared with earlier tests. In the Quad experiments, audio professionals reported preferences about a third of the time when amplifiers being compared were the same. Colloms's subjects heard differences 35% of the time when amplifiers were compared with themselves. However, the Stereophile report discloses that their subjects answered "different" a whopping 58.8% of the time in total when the presentations were identical (41.2% correct in A-A and B-B comparisons).

This is a phenomenon we must be aware of in ourselves. People with an interest in sound will tend to hear things simply from trying not to miss them. These data show we are disposed to "hear" or "guess about" nonexistent differences one-third to one-half of the time even if the "coach" is blind. Imagine how strong the tendency can be when the coach is a trusted friend, reviewer, or salesperson who imposes no scientific controls on himself. But can't we all just put aside our biases when listening? Obviously, according to these data, not. [Not being able to put aside one's biases perfectly and cleanly, simply through effort, is another way of saying that no one is immune to the placebo effect.--Ed.]

Sometimes research can seem to grow in significance over time. For instance, in his February 1991 "As We See It," Martin Colloms recalls his 1986 experiment as being validated by a statistician who supported his conclusion that amplifiers sound different, even after such colleagues as Stanley Lipshitz pointed out weaknesses in the analysis. Lipshitz himself told me that as far as he knew, Colloms has never disclosed any additional analyses that supported his conclusions, and I can't see any way to support them given the extensive body of data supplied in the HFNRR report.

Based on their response to previous controversies, I predict that within two years Stereophile magazine will refer to their experiment as having proved that CD tweaks were reliably identified under blind conditions, ignoring the valid statistical objections already raised in its own pages.

It's sad. While I continue to subscribe to the magazine for entertainment, I'm not so sure I can accept evaluations of other people's products in a publication that is not able to evaluate its own research rationally--especially when many of its writers, and its editor, spend so much time pointing fingers at those who over the years have added much to our knowledge of audio.

2002 Update from Tom Nousaine:

There are two available ABX-style comparison devices. QSC sells an ABX box, and there is a PC-based system (PCABX, available free from www.pcabx.com) from Arny Krueger, one of the original ABX Company guys. I have four available: the two above, a one-off made for Bob Carver, and the original ABX box.

[Publisher's note: A discussion with Nousaine revealed that he expects results from same/different tests that are different from those from ABX tests. We did not reach a consensus about how to interpret the former. He seemed to interpret the 65% correct scores in Colloms's A-A/B-B amplifier tests as an illustration of a 35% bias toward hearing differences where there are none, out of his expectation that when the amplifiers are the same the subjects should say so 100% of the time. But 26/40 correct is significant to better than 95% if you use the standard ABX criterion. Does this mean that the ABX's 50% baseline doesn't apply to same/different tests? Yes, Nousaine said. This answer surprised me; I always thought that the data were analyzed the same way. Professor Richard Greiner, to pick one example, conducted same/different tests for his recent AES paper on the detection of polarity, and carried out the statistical analysis in the familiar manner--with a null result being 50% correct answers.

One thing we did agree on was that switching during musical passages (what I call a running-music test) was much more likely to generate false-difference reports than hearing the same piece or short passage over and over (a repeated-music test). Without knowing which method was used, one can't precisely predict the tendency to report nonexistent differences. I believe that repeated-music testing should be used because it appears to be more sensitive, and because it more closely mimics the listening habits of both subjective reviewers and casual listeners. We should take every opportunity to refine our tests for greater sensitivity and make them duplicate actual listening conditions more closely.

Nousaine gave me a preliminary report on an ongoing experiment designed to do these things. Among the new protocols he uses is one designed to mimic what happens when the non-blind listener chooses a certain passage that best illustrates a particular sonic feature: After one round of tests, a second round is conducted using only the music on which correct answers have been given. The results of this experiment, an amplifier comparison conducted on an audiophile system by its owner, will be published in due course.--EBM]

Here is a reader's comment:
 
Dear Sir,

I read the article "Wishful Thinking" on your website with a lot of interest. It was very enlightening on the misuse of statistics and how misunderstood they are. Indeed, the articles criticized in "Wishful thinking" manipulated the numbers to give a positive conclusion. As the author, Mr. Nousaine, pointed out, they only used part of their data to support their claims. Mr. Nousaine then started some complex reasoning to correct this bias.

Unfortunately, the reasoning used by Mr. Nousaine is also incorrect ...

And the note by the Publisher at the bottom of the article also shows he does not grasp the statistical theory to be used in this case.

I will try to explain how this data should be interpreted.

In the tests described, where subjects listen to pairs of audio samples and must tell if they are different or not, a major problem occurs: we do not know the propensity of the subjects to push that damn "I heard a difference" button even when there is no difference.

The only way to interpret the data is therefore to test if the "I heard a difference" answers are randomly distributed. This means to look if the frequency of answers "I heard a difference" was higher when there was actually a difference than when there was not.

In the CD-tweak test from Stereophile, for example, the subjects answered "I heard a difference" 132 times in the 186 trials where there was a difference. That is 71% of the time. But they also said "I heard a difference" 132 times in the 194 trials where there was no difference. That is 68% of the time. It is not very difficult to see that 68% is not statistically different from 71% (also see below).

However, in the amplifier test, the results are different: we have 95 answers "I heard a difference" in the 150 trials where there was actually a difference (63%), but the subjects only answered "I heard a difference" 14 times out of the 40 trials where there was no difference (35%).

Here, 63% looks different from 35%, but as the sample sizes are very different, slightly stricter mathematics is needed. We have to actually calculate whether chance alone could give such a difference. If the 109 answers "I heard a difference" are randomly distributed among the 190 trials, what would be the probability of getting 14 or fewer such answers in a random subsample of 40 trials by chance alone? This is a classical "marbles in a bag" problem that can be calculated using the hypergeometric distribution. Using the nice calculator at Stat Trek (http://www.stattrek.com/Tables/Hypergeometric.aspx; just enter the parameters of the problem: 190, 40, 109, 14), we can calculate that the probability of obtaining these numbers by chance alone is about 0.12%. This is well below any statistical criterion for randomness, so we must conclude that, statistically, the subjects could pinpoint a faint difference between the amplifiers, even if they have a high propensity to hear a difference when there is none.

(Doing the same hypergeometric calculation on the CD-tweak example, the probability of obtaining those numbers by chance would be 30.5%, which is much higher than the usual 5% threshold used in statistics. We must therefore conclude that the answers are randomly distributed and that the subjects press the "I heard a difference" button 69% (264/380) of the time, whether there actually exists a difference or not.)

I hope this clarifies the matter a bit. Debunking false ideas is good, but only if you use correct reasoning.

Best regards,
Olivier Van Cantfort
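The hypergeometric calculation described in the letter above can be reproduced without the online calculator. The Python sketch below is an added illustration using only the standard library. The function names are invented for this example, and the tail is taken as "this many or fewer 'I heard a difference' answers among the same-component trials," which is how the letter frames the amplifier case and is assumed here for the CD-tweak case as well.

```python
from math import comb

def hypergeom_pmf(k: int, population: int, successes: int, draws: int) -> float:
    """P(exactly k marked items in `draws` items drawn without replacement)."""
    return (comb(successes, k) * comb(population - successes, draws - k)
            / comb(population, draws))

def hypergeom_lower_tail(k_max: int, population: int, successes: int, draws: int) -> float:
    """P(at most k_max marked items in the drawn subsample)."""
    return sum(hypergeom_pmf(k, population, successes, draws)
               for k in range(k_max + 1))

# Amplifier test: 190 trials, 109 "difference" answers overall,
# at most 14 of them falling in the 40 same-amplifier trials.
p_amplifier = hypergeom_lower_tail(14, population=190, successes=109, draws=40)

# CD-tweak subset: 380 trials, 264 "difference" answers overall,
# at most 132 of them falling in the 194 same-disc trials.
p_cd_tweak = hypergeom_lower_tail(132, population=380, successes=264, draws=194)

print(f"amplifier test: p = {p_amplifier:.4f}")
print(f"CD-tweak test:  p = {p_cd_tweak:.3f}")
```

Run with the counts quoted in the letter, the amplifier probability should land near the 0.12% it reports and the CD-tweak probability near 30%.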


 

The Boston Audio Society
PO BOX 260211
Boston MA 02126

updated 5/15/07