On blind A/B testing

Mark Hammer · November 16, 2007, 11:24:35 AM

Years ago, I remember reading something in JAES (Journal of the Audio Engineering Society) regarding appropriate methods for blind listening tests. It may have even been by Canadian audio engineering legend Floyd Toole. I may well be wrong on both counts, but I know I read it somewhere.

The gist was that the switching back and forth cannot be a simple matter of A->B->A->B->A->B->A->B->A->B->A->B. Rather, the pattern has to be random, such that the listener cannot know for certain whether the next sound heard is via the same equipment as the last sample or different equipment. Even when someone else is wielding the controls and toggles back and forth between A and B, the knowledge that every other sample is produced via the same means can still lead to expectations exaggerating and biasing what is perceived.

I know back in the days when I used to be an animal researcher, we would use something referred to as the Gellerman series for running rats in mazes. Surprisingly, there is no Wikipedia entry I can find for Gellerman series, but they are essentially several random series of A/B alternations. So, AABABBBABBAAABA, etc. We used them to arrange for baiting one of the two alternative routes a rat might take in a simple maze, or buttons a pigeon might peck, or food coverings a monkey might lift up to find food. Anywhere that you have two possible choices and you don't want the individual to be able to predict which is going to occur next.

Arranging for blind testing on one's own, at home, is difficult to do. I mean, if you're the one throwing the toggle, then obviously you know what to expect, right? The ideal would be to have some automated way to generate a randomly-selected A-vs-B outcome and have the circuit switched FOR you, with information about which of the two choices was made on that occasion only available after the fact, rather than before it. My understanding is that noisy diodes are often used as sources of random number generators or some other random value. Perhaps it might be possible to use a noise source to generate a randomly-selected switch 1 or switch 2 outcome.

I suppose that "proper" blind A/B home-testing would involve a "decision box" that had some screw terminals, or perhaps jacks, such that the two technical changes/alternatives to evaluate would be connected for switching. On one side of the box, unseen, might be two or 3 low illumination LEDs. These would let you know after the fact what you had just heard and evaluated. Obviously, any opportunity to see even a tiny bit of their light would have to be precluded. (You don't want any "Clever Hans" effects: http://en.wikipedia.org/wiki/Clever_Hans ). The user would employ some sort of remote momentary switch to request another changeover. So, you press the button, listen to whatever you're supposed to be listening to, and either you note what LED was lit that time, or someone else helping you does it. You press again, write down your evaluation, and note which choice it was, and so on.

The overarching goal is that the user be able to select between 2 or 3 sources for comparison purposes, without ever knowing in advance which of the alternatives is being evaluated.
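A minimal software sketch of the "decision box" idea described above, offered only as an illustration of the workflow (press, listen, note, then learn which source it was). The class name `DecisionBox` and its methods are hypothetical, and a real build would drive relays and hidden LEDs rather than print to a screen.

```python
import random


class DecisionBox:
    """Software stand-in for the 'decision box': picks a source unseen, reveals it later."""

    def __init__(self, sources=("A", "B"), rng=None):
        self.sources = tuple(sources)
        self.rng = rng or random.Random()
        self._current = None   # hidden from the listener until reveal()
        self.log = []          # (trial number, source, listener's note)

    def press(self):
        """Button press: switch to a randomly chosen source without saying which."""
        self._current = self.rng.choice(self.sources)

    def reveal(self, note):
        """After listening: record the note, then disclose which source was playing."""
        if self._current is None:
            raise RuntimeError("press() must be called before reveal()")
        source, self._current = self._current, None
        self.log.append((len(self.log) + 1, source, note))
        return source


box = DecisionBox(rng=random.Random(2007))  # fixed seed only so the demo repeats
for note in ["brighter", "muddier", "no difference"]:
    box.press()
    box.reveal(note)
for trial, source, note in box.log:
    print(trial, source, note)
```

Passing `sources=("A", "B", "C")` would cover the three-way comparison mentioned above; the key property is simply that the note is written down before the source is disclosed.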

Now, one of you CMOS-and-relay geniuses has to be able to come up with something that can do that, right? Alternatively, are there already existing circuits that can do this or be adapted to do it? I know the 70's and 80's had a surfeit of simple LED-based game circuits/projects. Maybe there is something in one of those wonderful Rudolf Graf circuit compendiums.

dirk · November 16, 2007, 12:01:50 PM

http://www.provide.net/~djcarlst/abx_new.htm

dirk · November 16, 2007, 11:06:36 PM

I think the double blind test is the greatest invention of the 20th century. It has saved millions of lives, because now for the first time ever humans can really know if something works or not. You would think that this is the end of all snakeoil products, but I guess people like lies much more than they like the truth.

The audio industry is no exception. Lots of professionals in the audio industry think that audio is completely subjective. Total nonsense, of course; the only thing subjective is whether you like it or not. All the rest can be measured objectively.

R.G. · November 17, 2007, 08:32:20 AM

Quote:
Now, one of you CMOS-and-relay geniuses has to be able to come up with something that can do that, right?

I don't know that I'd lay claim to "CMOS-and-relay genius" but I do know a CMOS random-order circuit that would work. I'll see what I can gen up. I think four DIPs would work OK.

Quote:
http://www.provide.net/~djcarlst/abx_new.htm

This one is really neat, though.

dirk · November 17, 2007, 11:01:41 AM

Here's a hardware version without a microcontroller.
http://sound.westhost.com/abx-tester.htm

R.G. · November 17, 2007, 11:11:45 AM

Yeah, that's a good source for the actual switching rear end.

I found a set of code for the 8-pin DIP PIC 12C508 controller that would do the work. It produces a 25-bit pseudorandom sequence. If you hooked it up so the thing shifted state once each time a switch was closed, it would provide a semi-infinite length of ABX selections for that rear-end when it was made from relays. The pattern repeats every five minutes when clocked at 500 kHz, so with manual clocking it would last centuries before repeating. You'd never need to turn the controller off, only the relay power.
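For readers unfamiliar with how a shift-register pseudorandom generator works, here is a rough Python sketch of the general technique (a Fibonacci linear-feedback shift register). It is not the PIC code referred to above, which is not shown in this thread. The tap positions (25, 22) are an assumption taken from commonly published maximal-length tap tables; if they are indeed maximal, the register cycles through 2^25 - 1 = 33,554,431 states before repeating. The seed value and the function names are arbitrary.

```python
def lfsr_step(state, taps, nbits):
    """Advance a Fibonacci LFSR one step; taps are 1-based bit positions."""
    feedback = 0
    for tap in taps:
        feedback ^= (state >> (tap - 1)) & 1
    return ((state << 1) | feedback) & ((1 << nbits) - 1)


def period(seed, taps, nbits):
    """Count steps until the register returns to its seed (small registers only)."""
    state, steps = lfsr_step(seed, taps, nbits), 1
    while state != seed:
        state = lfsr_step(state, taps, nbits)
        steps += 1
    return steps


# 4-bit demo: taps (4, 3) visit all 15 non-zero states before repeating.
print(period(0b0001, (4, 3), 4))

# One button press = one step of a 25-bit register; the low bit picks A or B.
state = 0x1ABCDEF  # any non-zero 25-bit seed (arbitrary choice)
picks = []
for _ in range(12):
    state = lfsr_step(state, (25, 22), 25)
    picks.append("AB"[state & 1])
print("".join(picks))
```

The register must never be seeded with zero, since an all-zero state feeds back zero forever. In hardware the same structure is a shift register plus one XOR gate, which is why it fits comfortably in a tiny controller.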

So Mark, was that something like you had in mind? A set of relays that select A, B, or X, and which changes randomly between them each time a switch was pushed? Takes one 8-pin DIP and whatever it takes to drive the relays. Probably better not to use solid state switching to avoid any possibility of the switches polluting the process.

dirk · November 18, 2007, 01:16:05 AM

Although double blind testing is a great tool, it can be misused. Just like all statistics.

If the difference between the test subjects is small, you need lots of tests.
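A rough way to put numbers on "you need lots of tests", assuming each trial is an independent right/wrong guess about which source is playing. The sketch finds the score a pure guesser would rarely reach, then asks how often a listener with a given true hit rate would reach it. The function names and the 5% cutoff are illustrative choices, not anything specified in this thread.

```python
from math import comb


def tail(n, k, p):
    """P(at least k correct out of n) when each trial is correct with probability p."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def threshold(n, alpha=0.05):
    """Smallest score that pure guessing reaches with probability <= alpha."""
    for k in range(n + 2):
        if k > n or tail(n, k, 0.5) <= alpha:
            return k


def power(n, p_true, alpha=0.05):
    """Chance that a listener who is right p_true of the time passes the threshold."""
    k = threshold(n, alpha)
    return tail(n, k, p_true) if k <= n else 0.0


# Columns: trials, passing score, power at a 60% hit rate, power at a 90% hit rate.
for n in (10, 20, 50, 100):
    print(n, threshold(n), round(power(n, 0.6), 2), round(power(n, 0.9), 2))
```

A listener who is right 90% of the time shows up quickly, while one who is right only 60% of the time needs a much longer run before the result stands out from guessing. With very few trials `threshold` returns `n + 1`, meaning no score at all can pass.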

Mark Hammer · November 19, 2007, 11:24:56 AM

Quote from: R.G. on November 17, 2007, 11:11:45 AM
So Mark, was that something like you had in mind? A set of relays that select A, B, or X, and which changes randomly between them each time a switch was pushed? Takes one 8-pin DIP and whatever it takes to drive the relays. Probably better not to use solid state switching to avoid any possibility of the switches polluting the process.

More or less. Naturally, I was hoping for something that would not involve programming a PIC.

Just to go back to the "Gellerman series" for a bit, one of the properties of those A-B sequences was that they would not "randomly" present one of the possible choices so many times in a row as to bias expectations. They were originally developed to present two-choice alternatives to rats and similar beasties, but they were also used for research with higher species (monkeys, children, etc). If I remember correctly, although there was a possibility that the next one or two outcomes could be the same as the last one, no single outcome would occur more than 3 times in a row. Once you hit the same event occurring 4 or more times, you reach a threshold where people/rats start to expect things. With only a couple of possible outcomes occurring, you don't want the judge/decider/behaver to either expect that things will stay the same, or expect that they will necessarily shift. The intent of the series is for them to simply not know in advance. As you can imagine, creating "randomness" when the range of possible outcomes is very small, is a tough row to hoe. Creating perceived randomness when the range of possibilities is very large is a breeze by comparison.
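The "no more than 3 in a row" property recalled above can be sketched in a few lines. This is not a reproduction of the published Gellerman series, only a random A/B generator with a run-length cap; `capped_sequence` and `longest_run` are hypothetical names.

```python
import random


def capped_sequence(length, max_run=3, rng=None):
    """Random A/B sequence in which no choice repeats more than max_run times in a row."""
    rng = rng or random.Random()
    seq = []
    for _ in range(length):
        pick = rng.choice("AB")
        # If the last max_run picks were all `pick`, force the other choice.
        if len(seq) >= max_run and all(s == pick for s in seq[-max_run:]):
            pick = "B" if pick == "A" else "A"
        seq.append(pick)
    return "".join(seq)


def longest_run(seq):
    """Length of the longest stretch of identical symbols."""
    best = run = 0
    prev = None
    for s in seq:
        run = run + 1 if s == prev else 1
        best = max(best, run)
        prev = s
    return best


seq = capped_sequence(40)
print(seq, longest_run(seq))
```

One trade-off of a hard cap: after three identical outcomes the next one is forced, so a listener who counts could predict it. Whether that matters in practice depends on how closely the listener is tracking the sequence.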

You know, for that matter, you could program a Gellerman series into a 4017 counter (with open/closed dipswitches) since each series is 10 steps, and then randomly select from one of several 4017 outputs. Chip A has series 1, chip B has series 2, etc.

~arph · November 19, 2007, 11:53:21 AM

What about a high speed clock to get a random A/B (see Andrew's Vanishing Point) and two flip-flops in series? An LED coupled to the last one, so one would always see the delayed A/B. The flip-flop clock is on a (debounced) footswitch. Or did I miss something?
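A small simulation of the two-flip-flop idea as read here: the first stage latches whatever the fast clock happens to be at the instant of the press and drives the switching, while the second stage holds the previous selection and drives the LED. The class and attribute names are hypothetical, and the fast clock is replaced by a canned list so the behaviour is easy to follow.

```python
import itertools


class DelayedIndicator:
    """Two flip-flops in series: stage 1 drives the relay, stage 2 drives the LED."""

    def __init__(self, sample_clock):
        self.sample_clock = sample_clock  # callable returning 0 or 1
        self.relay = 0   # first flip-flop: the selection now playing
        self.led = 0     # second flip-flop: the selection before that

    def press(self):
        """Footswitch edge: both flip-flops clock at once."""
        self.led, self.relay = self.relay, self.sample_clock()
        return "AB"[self.led]


# Stand-in for "whatever the fast clock reads at the instant of the press".
fake_clock = itertools.cycle([1, 0, 0, 1, 1]).__next__
box = DelayedIndicator(fake_clock)
for _ in range(5):
    shown = box.press()
    print("LED shows previous selection:", shown, "| now playing (hidden):", "AB"[box.relay])
```

On the very first press the LED only reflects the power-on state, so that reading carries no information. In a software-only version, `random.getrandbits(1)` could stand in for sampling the fast clock.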

runmikeyrun · November 19, 2007, 01:19:25 PM

This is how NOS tubes (and other components) get such high regard and high prices. While I do not dispute the difference between, say, a GOOD Mullard and a BAD Chinese current-production 12AX7, there are a lot of subjectivities when it comes to tubes and amps. Some people have been known to use worn NOS tubes just "because it's an NOS RCA". I myself tried several NOS tubes in my head and went back to the new production JJs because they were stronger and worked better.

Use the example below:

Our guitarist has a JCM 800 that for years delivered a punishing mix of low end thunder and amazing well rounded distortion. We had surmised it had original equipment NOS Siemens EL34s (and probably pretty decent preamp tubes) which are prized for their low end punch and grind, and his amp certainly delivers that. A photo of a power tube and careful comparison by several cork sniffers on another board confirmed this. Upon checking further we discovered that the preamp was full of GROOVE TUBES! Surprise #1. Just to be safe, an eBay auction of a matched quad of Siemens EL34s (pictures matched the amp's tubes) was purchased in fear of a rise in prices when new ones would be needed.

Well, after another couple of years it was time to put the new power tubes in. So, a recent trip to the amp doctor reveals surprise #2: the current power tubes are SOVTEKS and NOT Siemens! So all these great attributes of the so-called Siemens were coming from run of the mill Sovteks. Surprise #3: the matched quad from eBay were Chinese, not the Siemens the guitarist thought he was purchasing. The tech then tried to offer him an NOS quad for $300, after we had just learned that cheapie Sovteks were delivering more than what we needed.

So what's the lesson? Trust your ears. His JCM 800 sounded amazing with GT pres and Sovtek powers. If a mullard ge sounds good in your circuit, then go with it. If a 2N5088 works, use it. If you think that marketing a JRC4558 is going to increase sales of your pedal even though it sounds exactly the same as any other 4558 then use it. I feel that even an enclosure can affect the way people think a pedal sounds. Look at the Mesa distortion pedals- they look the way people think they should sound. Some people go insane looking for the right sounding chip, diode combination, LED color, all kinds of things. I am guilty myself and have learned that if it sounds good use it.

frankclarke · November 19, 2007, 01:26:26 PM

An old Belfast design would be mercury-tilt switches on a see-saw, shake it up, listen and open the lid to reveal A or B.
Or a rotary switch that rotates past 360 degrees with a spin-the-wheel interface.
A "blinking LED" multivibrator circuit with a stop switch and relays would work.
This guy powers up with random LEDs:
http://www.play-hookey.com/digital/experiments/counter_ic_4029.html

Mark Hammer · November 19, 2007, 03:07:33 PM

Quote from: runmikeyrun on November 19, 2007, 01:19:25 PM
So what's the lesson? Trust your ears. His JCM 800 sounded amazing with GT pres and Sovtek powers. If a mullard ge sounds good in your circuit, then go with it. If a 2N5088 works, use it. If you think that marketing a JRC4558 is going to increase sales of your pedal even though it sounds exactly the same as any other 4558 then use it. I feel that even an enclosure can affect the way people think a pedal sounds. Look at the Mesa distortion pedals- they look the way people think they should sound. Some people go insane looking for the right sounding chip, diode combination, LED color, all kinds of things. I am guilty myself and have learned that if it sounds good use it.

I agree...up to a point. The trouble is that "ears" also involves memory, and memory involves expectations. People hear all sorts of things that aren't really there if they expect to. Just try taking a shower the next time you're waiting to get a call back about a job, or the next time the baby is sick, and try NOT hearing the phone ring or the baby cry. Obviously, since we rely on our senses every day all day, and we're not dead yet, one should place some faith in one's ears. And there are often differences/changes in tone that are so evident that even those without expectations or any specialized training can detect and sometimes even describe (Your great-grandmother could easily tell the difference between a cello played through a Fuzz Face, and a cello recorded clean). But the boundary between when ears should be trusted, and when they shouldn't, is much fuzzier than we'd like to think and many times things get heard that are either nonexistent or so fleeting and situationally dependent as to be nonexistent.

For these reasons, the desire to study what is actually heard quite apart from expectations prompted the development and use of psychophysical methods, among them blind A/B testing, that take steps to get around expectation effects. I look at this as an issue of freedom from slavery to misplaced beliefs. There may be things we only think we hear that nevertheless harness us into unnecessary financial commitments or ways of doing things. A good test methodology lets you know exactly how the world works, so that you can focus on the useful and necessary and step around the pointless. Hell, I bet you if we did the proper blind A/B testing, we could probably reduce the number of different distortion pedals sold by Musictoyz by at least 30%.

FWIW, I was listening to an interview with producer Daniel Lanois (U2, Dylan, Emmylou Harris, etc) the other night, and he was talking about how production is different during the day vs nighttime, and that some musicians prefer and even insist on doing mixes or other post-production at certain times of the day. So, there may even be some auditory phenomena that only show up during daylight or when daylight ends, for all we know.

In any event, I thank all those folks contributing to this thread. I imagine there are many on this forum that would like a certain level of certainty regarding sonic experiments they do at home on their own. And, if a small simple piece of equipment permits them to do more rigorous research on their own and report back, so much the better. It will advance us all.

R.G. · November 19, 2007, 05:40:58 PM

Quote from: Mark Hammer on November 19, 2007, 11:24:56 AM
... Naturally, I was hoping for something that would not involve programming a PIC.
Yep, I figured.
Quote from: Mark Hammer on November 19, 2007, 11:24:56 AM
Just to go back to the "Gellerman series" for a bit, one of the properties of those A-B sequences was that they would not "randomly" present one of the possible choices so many times in a row as to bias expectations. They were originally developed to present two-choice alternatives to rats and similar beasties, but they were also used for research with higher species (monkeys, children, etc). ...
You know, for that matter, you could program a Gellerman series into a 4017 counter (with open/closed dipswitches) since each series is 10 steps, and then randomly select from one of several 4017 outputs. Chip A has series 1, chip B has series 2, etc.
I'll tell you what - can you provide me with a good, long readout of the Gellerman series? I'll gin up a set of programming to do it into a loop, and set it up to start at a random place in the loop each time it's used.
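The "loop with a random starting point" idea might look something like this in software. The pattern below is a made-up placeholder, not a real Gellerman series (none has been posted in this thread); the function name `selections` is likewise hypothetical.

```python
import random

# Placeholder pattern only: NOT a real Gellerman series. Substitute the
# published sequences once they are found.
PLACEHOLDER_LOOP = "ABBABAABABBAABABBABA"


def selections(loop, rng=None):
    """Yield A/B picks forever, starting at a random offset into the loop."""
    rng = rng or random.Random()
    position = rng.randrange(len(loop))
    while True:
        yield loop[position]
        position = (position + 1) % len(loop)


picker = selections(PLACEHOLDER_LOOP)
print("".join(next(picker) for _ in range(12)))
```

The longer the stored loop, the harder it is for a listener to work out where in the sequence a session began.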

brett · November 19, 2007, 07:13:15 PM

Quote:
The trouble is that "ears" also involves memory, and memory involves expectations. People hear all sorts of things that aren't really there if they expect to.

This is highly relevant. But it is like the tip of the iceberg.

Even logical, de-biased testing will still be almost completely unsatisfactory. Why? Because it will still be full of expectations. *People* are full of expectations (except for new-borns).

Imagine the results of some perfect test of NOS tubes. If the tests show no difference with cheap Chinese tubes, those who previously thought NOS tubes were overpriced won't change their minds (OK, this is obvious). If the tests show a difference, very few of those who expected NOS to be better will change their minds. Instead, they will justify their previous position by attacking the test, the testers, and almost anything else. They'll say things like "but they tested in a completely quiet room" or "they tested in a weird circuit". As a researcher, I've watched this for 25 years.

For almost everyone, believing is seeing. Seeing is not believing.

cheers

Mark Hammer · November 19, 2007, 07:56:21 PM

Well, I suppose there is no cure for those whose beliefs remain impervious to data. I hear you loud and clear on that one. We'll leave them to their own little universe. Me, I'm just aiming for something to assist those doing little experiments on their own, with an open mind, who wish to know if they are knocking on the door of opportunity, or banging their head against a brick wall.

R.G., I did a fair amount of Gellerman googling today, but all I could find were research reports referring to them. I have some old research methods books from the 40's downstairs, and I may have even saved all my rat data from 1977 when I was using the series (paid to move that crap across the continent thrice!). I'll see if I can find the information in those sources. The great thing about them is that after a bit, you can't tell where the heck you are in the sequence or which sequence you're using. All you know is you have no idea what's coming next.

frankclarke · November 19, 2007, 08:41:01 PM

Purely random is OK in double blind, it doesn't have to be Gellerman or any specific methodology. There are a limited number of paths to exercise, each path doesn't cost anything, so everything will get tested soon enough. I remember the guy prepping my rats for research, they weren't going down any mazes after that.

mac · November 20, 2007, 12:27:32 PM

I read somewhere long ago that an A/B test can be influenced to some degree by the set of individuals selected. A test with a set of weak character people may differ from a set of strong personality people.
The reason is that weak people can be highly influenced by previous data, like "NOS tubes sound better" or a "vintage AC30 is better than a new one". They do believe that because someone told them or read it in a magazine. When they go to the A/B test they have some kind of prejudgment.
On the other hand, strong character ones tend to develop their own thoughts.

Hard to say it in English, sorry.

mac

Mark Hammer · November 20, 2007, 01:17:38 PM

Quote from: mac on November 20, 2007, 12:27:32 PM
I read somewhere long ago that an A/B test can be influenced to some degree by the set of individuals selected. A test with a set of weak character people may differ from a set of strong personality people.
The reason is that weak people can be highly influenced by previous data, like "NOS tubes sound better" or a "vintage AC30 is better than a new one". They do believe that because someone told them or read it in a magazine. When they go to the A/B test they have some kind of prejudgment.
On the other hand, strong character ones tend to develop their own thoughts.

Hard to say it in English, sorry.

mac

No, you said it well, and clearly. This is why "blind" testing, with an unknown sequence is used; so that the personality of the individual listening does not get in the way. If they do not know what they are going to hear next, then their expectations will not influence their perception in any systematic way. Sometimes, even when you WANT to be honest and unbiased, it is very difficult to prevent what you believe or know from influencing what you hear. This is why I am interested in a way for people to be able to conduct unbiased "blind" testing on their own, at home, without anyone else's assistance.

frankclarke · November 20, 2007, 02:28:55 PM

I have a programmable switcher in the mail, so I think I will actually rig up something akin to Mark's suggestion. Make 2 copies of an effect, double-blind AB them to make sure they sound the same (there must be an online program to do the sadistics). Then make a tweak, and go through the routine again. If you can hear the difference (95% confidence), and you like the tweak better, apply the tweak to both, make sure they both sound the same, and carry on. It would be good if the test took about 5 mins, 10 mins per mod plus soldering and design time. Painful, but at the end of the day you have something way better than the original. Subject to double blind trials.

Does the sound of a previous variant affect the perception of the next one though? Dang.
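For the statistics step in the routine above, a one-sided binomial test is one simple option, assuming each trial is scored as a right or wrong identification and that guessing gives a 50% hit rate. The function names are illustrative.

```python
from math import comb


def p_value(correct, trials):
    """One-sided chance of scoring at least `correct` out of `trials` by guessing."""
    return sum(comb(trials, i) for i in range(correct, trials + 1)) / 2**trials


def audible(correct, trials, confidence=0.95):
    """True if the score beats guessing at the requested confidence level."""
    return p_value(correct, trials) <= 1 - confidence


print(round(p_value(12, 16), 4), audible(12, 16))
print(round(p_value(11, 16), 4), audible(11, 16))
```

With 16 trials, 12 correct clears the 95% bar and 11 does not. Note that failing to clear the bar is weaker evidence than it might seem: it shows only that a difference was not detected in that run, not that the two units sound the same.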

OK scope questions for Mark:

For block randomization from a chip, do you want the chip to generate a series, do the switching and give a readout afterwards, Mark? Do you want blocking (an equal number of A's and B's)? Is AAAAABBBBB ever an acceptable output or do you need stratification?
I'm wondering if I could do a simple mechanical A/B seesaw to do one-at-a-time. That isn't group randomization, it would be minimization.
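To make the blocking question concrete, here is a sketch of permuted-block randomization: each block holds equal numbers of A and B in shuffled order. The block size of 10 and the function names are assumptions for illustration.

```python
import random


def balanced_block(size=10, rng=None):
    """Shuffle a block holding equal numbers of A and B."""
    if size % 2:
        raise ValueError("block size must be even")
    rng = rng or random.Random()
    block = list("A" * (size // 2) + "B" * (size // 2))
    rng.shuffle(block)
    return "".join(block)


def blocked_series(blocks, size=10, rng=None):
    """Concatenate independently shuffled balanced blocks."""
    rng = rng or random.Random()
    return "".join(balanced_block(size, rng) for _ in range(blocks))


print(blocked_series(3))
```

A plain shuffle will occasionally produce AAAAABBBBB (one of the 252 equally likely orderings of a 10-step block), so if that output is unacceptable the block would need to be re-drawn or filtered by a run-length rule. Blocking also has a cost: a listener who counts can deduce the last pick in each block.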

Mark Hammer · November 20, 2007, 03:26:07 PM

Quote from: frankclarke on November 20, 2007, 02:28:55 PM
Does the sound of a previous variant affect the perception of the next one though?

Depends on what you're measuring/presenting and whether there are any receptor fatigue or contrast effects. For instance, if one was comparing two fuzzes played through a Marshall at high volume with loads of searing sizzle, it may be advisable to interpolate a period of silence to let those poor little ear neurons recover. If it was two 60 dB tones and I was asking you to determine if one was a higher pitch, compared to some reference, I could probably just present them via headphones one after the other, at 5-second intervals or something, without any worries.

The sequential impact of the previous stimulus on the subsequent one is one of the principal reasons why we know far more about hearing and vision than we do about taste and smell. I can do all manner of double-blind testing of tastes, but I have to let you get the previous taste out of your mouth before presenting the next one. Same thing with smells. Take one whiff and you have to wait a while for the previous smell to be metabolized and fade out before presenting the next one. Takes a whole lot longer to do that than it takes to present a bunch of consecutive images, or a bunch of 3-second sound samples.

Of course, the optimum waiting time for each of the different senses is not the same thing as the optimum waiting/retention time for human memory, which is the pivotal factor no matter what you're testing. So, it may well take 20 minutes for the taste of that last Scotch Bonnet pepper to finally leave your mouth, before I present the next chili and ask you to tell me if it's hotter. You may have no particular place to go and don't mind suffering through 100 consecutive comparisons. After 20 minutes, the direct physical effects of the previous event may have dissipated, but how accurate will your memory be after 20 minutes when I ask you to tell me if THIS pepper is hotter or milder than the last one? Ideally, any form of double-blind testing needs to present the stimuli to be compared/tested in a manner which prevents them from interfering with each other (i.e., what you sense), prevents expectations from corrupting what you perceive (separate from what you sense), and prevents the recollection of any reference or standard used for comparison (in A/B/Y testing) from fading too much.

It's tricky knowing what humans perceive.