Formant Filter / Vowel circuits or how to do it?

Started by Mr. Lime, October 22, 2019, 10:11:20 AM


Rob Strand

Quote: Anyone have any insights into the Korg Miku pedal, and its approach to mimicking speech?  I'm not aiming to clone a Miku. I'm just interested in the approach they took to the use of Vocaloid or something like it.  Perhaps simply slowing down a sample would reveal a little.
My guess is Miku is the name they gave to the Vocaloid "singer" profile.   Some info here.  Seems like there should be a patent.
https://en.wikipedia.org/wiki/Vocaloid

(That product suffers from unfortunate switch placement.  #whatweretheythinking)
Send:     . .- .-. - .... / - --- / --. --- .-. -
According to the water analogy of electricity, transistor leakage is caused by holes.

Mark Hammer

It's sort of the other way around.  Hatsune Miku is the name of what is essentially an anime character who gives live holographic "performances", using Vocaloid to generate the singing, in a manner that can be synced with the CGI stage hologram.  Apparently a very popular act, hence the naming of the Korg pedal after "her".

Rob Strand

Quote: Apparently a very popular act, hence the naming of the Korg pedal after "her".
I found the same stuff about 2 mins after I posted.    Initially the pedal was only released in Japan partly due to the enormous popularity of Miku there but was then released elsewhere.

This video is pretty good.  It gives a good overall perspective of the history, motives and sounds:
https://www.youtube.com/watch?v=aveUEZkcQno

Pretty sure it's totally DSP and  un-analogable.

PRR

> mid 1800's but it's not clear at what point it could be considered speech.

Nice grip on that tube.


Mark Hammer

Quote from: Rob Strand on October 26, 2019, 09:09:50 PM
Pretty sure it's totally DSP and  un-analogable.
I suspect that what requires it to be totally DSP is the manner in which it progresses through a variety of vowel-like sounds that have the superficial characteristics of speech.  In other words, it doesn't just go wah-wah-wah (or even yoy-yoy-yoy - see what I did there, Jimi? )  So, insomuch as there needs to be a sort of decision about what formants to produce next, it MUST be digital.  But, as various speech-faking circuits have demonstrated over the years, if the electronic production of formants is deliberately limited in its range, it very much CAN be analogable.  I guess the associated question that is compelled is "Can a little more variety in formant production than mere wah-wah be produced without having to resort to DSP?".

For a number of years, I've been blathering on about the idea of a "talked-to" pedal.  Not a vocoder, as such, but something that responded to the amplitude of different vocal frequency bands, and used them to control various parameters.  Naturally, it would be difficult to provide much fine control over parameters using only voice (I find it hard to imagine intentional control over more than 3 parameters), but if paired up with a wah-like foot-pedal, I would imagine a great deal of articulation could be achieved; especially if, say, any detection of guttural sounds increased distortion to be mixed in.
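To make the "talked-to" idea a bit more concrete, here is a minimal software sketch, not a circuit: split the voice into three bands, follow the level of each band, and treat the three levels as control signals. The band centres (300, 1200 and 4000 Hz), the Q of 2 and the 50 ms release are arbitrary assumptions for illustration, and the function names are made up for this example.

```python
import math


def bandpass_coeffs(freq, q, sample_rate):
    """Biquad bandpass (unity peak gain), normalised so that a0 = 1."""
    w0 = 2.0 * math.pi * freq / sample_rate
    alpha = math.sin(w0) / (2.0 * q)
    a0 = 1.0 + alpha
    return (alpha / a0, 0.0, -alpha / a0,
            -2.0 * math.cos(w0) / a0, (1.0 - alpha) / a0)


def band_envelope(signal, freq, q=2.0, sample_rate=48000, release_s=0.05):
    """Bandpass the signal, then follow its peak level with a slow release."""
    b0, b1, b2, a1, a2 = bandpass_coeffs(freq, q, sample_rate)
    release = math.exp(-1.0 / (release_s * sample_rate))
    x1 = x2 = y1 = y2 = env = 0.0
    out = []
    for x in signal:
        y = b0 * x + b1 * x1 + b2 * x2 - a1 * y1 - a2 * y2
        x2, x1, y2, y1 = x1, x, y1, y
        # Instant attack, exponential release.
        env = max(abs(y), env * release)
        out.append(env)
    return out


def voice_to_controls(voice, bands=(300.0, 1200.0, 4000.0), sample_rate=48000):
    """One control signal per band (assumed band centres, tune to taste)."""
    return [band_envelope(voice, f, sample_rate=sample_rate) for f in bands]


# Usage: a 300 Hz test tone mostly drives the lowest control.
SR = 48000
tone = [math.sin(2.0 * math.pi * 300.0 * n / SR) for n in range(4800)]
low, mid, high = voice_to_controls(tone, sample_rate=SR)
```

Each control could then be scaled onto whatever it is meant to drive (filter sweep, distortion mix and so on), with the foot-pedal handling the fine articulation.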

ElectricDruid

(That Hatsune Miku video is totally pop-gone-meta, blew my mind.)

Yes, it can be done in analog, even with multiple formants, but the amount of circuitry starts to get large. A vocoder is a decent demonstration of both the principle and the problem - it can generate highly speech-like sound (intelligible, even), but it uses a lot of circuitry to get there.

We could design a system with several sweepable bandpass filters with control over Frequency and Resonance (so perhaps a voltage-controlled state variable design - there are lots of those). Each filter would need to be in a specific position for each vowel. We could attach (say) an eight-step sequencer to each filter to control the cutoff, with each step adjusted to where we need it (say the first five were AEIOU, then add a few more). Bonus points if we add a second sequencer to each channel for resonance too, so we can alter the width of individual peaks. Moving instantly from one vowel to another isn't realistic or that interesting, so let's add a VC-glide circuit on our CV outputs for both Cut-off and Resonance too. That can probably take a global control, just to keep things reasonably under control, though it wouldn't have to.

How you want to trigger this beastie is up to you. You could step through each vowel in turn by incrementing the sequencers. You could jump to a specific vowel by sending them some three-bit binary input from somewhere. You could do something else entirely.
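As a quick way to hear whether the scheme is worth building, the same structure can be mocked up in software. This is only a sketch under assumptions: two parallel bandpass filters instead of "several", a five-step sequence, a one-pole lag standing in for the VC-glide, and approximate formant frequencies that are placeholders to be tuned by ear. All names are invented for the example.

```python
import math

# Approximate F1/F2 centre frequencies in Hz. Placeholder values: tune by ear.
VOWELS = {
    "A": (730.0, 1090.0),
    "E": (530.0, 1840.0),
    "I": (270.0, 2290.0),
    "O": (570.0, 840.0),
    "U": (300.0, 870.0),
}


class Glide:
    """One-pole lag on a control value, standing in for the VC-glide circuit."""

    def __init__(self, start, time_s, sample_rate):
        self.value = start
        self.coeff = 1.0 - math.exp(-1.0 / (time_s * sample_rate))

    def step(self, target):
        self.value += self.coeff * (target - self.value)
        return self.value


class BandpassSVF:
    """Chamberlin-style state-variable filter; only the bandpass output is used."""

    def __init__(self, sample_rate):
        self.sample_rate = sample_rate
        self.low = 0.0
        self.band = 0.0

    def process(self, x, freq, q):
        f = 2.0 * math.sin(math.pi * freq / self.sample_rate)
        self.low += f * self.band
        high = x - self.low - self.band / q
        self.band += f * high
        return self.band


def formant_sequence(signal, steps, step_len, sample_rate=48000, q=8.0, glide_s=0.05):
    """Run `signal` through two parallel bandpass filters whose centre
    frequencies step through the vowels in `steps`, smoothed by the glide."""
    glides = [Glide(f, glide_s, sample_rate) for f in VOWELS[steps[0]]]
    filters = [BandpassSVF(sample_rate) for _ in glides]
    out = []
    for n, x in enumerate(signal):
        targets = VOWELS[steps[(n // step_len) % len(steps)]]
        y = 0.0
        for glide, filt, target in zip(glides, filters, targets):
            y += filt.process(x, glide.step(target), q)
        out.append(y)
    return out


# Usage: a crude 110 Hz sawtooth stepped through A-E-I-O-U, 0.25 s per vowel.
SR = 48000
saw = [2.0 * ((110.0 * n / SR) % 1.0) - 1.0 for n in range(SR * 5 // 4)]
shaped = formant_sequence(saw, "AEIOU", SR // 4, SR)
```

Resonance is fixed here for brevity; a second table and a second Glide per filter would cover the "second sequencer for resonance" part.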

There's nothing there that's hard for analog. It's just that there's a lot of it, and it would take a lot of circuit building before you even knew if it sounded any good and was worth pursuing. One advantage of digital/dsp solutions is the low cost development...take the same general purpose board you used last time, bung some new code on it, kick a few ideas around, see if it's a go or a no-go. Fast and cheap.



StephenGiles

So if the amount of circuitry gets large it becomes rackmount surely, with pretty flashing lights to impress your girlfriend!!
"I want my meat burned, like St Joan. Bring me pickles and vicious mustards to pierce the tongue like Cardigan's Lancers.".

anotherjim

IIRC, the Roland Vocoder/strings keyboards had choir voicings using basic formant filtering -  the full Vocoder part is separate. They are barely convincing - but wait till the ensemble chorus is switched on. This video demo of the Behringer VC340 clone has a section devoted to it from 11:56...


Mark Hammer

Most vocoders will have some means for introducing simulated sibilance into the mix/output.  A guitar on its own will not have much frequency content "up there", and certainly not in any sustained way (i.e., you may get it AS you pick, but not immediately after that).  And although vowel production does not require the high-frequency content needed for plosives and fricatives, having at least some sibilance may be useful in making formant production feel more realistic and speech-like.  I recall getting a letter from RG several decades back with an idea he had about a means to introduce "breathing" into a pedal.  I should see if I can dig that note up.

The question is "How" to bring natural sibilance of a guitar into a processed sound.  Here, I think it useful to borrow something from acoustic simulators and "exciters".  That is, highpass filter one copy of the guitar signal, and clip the dickens out of what's left, for mixing in selectively with the filter-shaped guitar signal.  The Miku pedal, at least from the demos I've heard, does not incorporate any simulated sibilance, only vowels.
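In signal terms, the highpass-and-clip path might look something like the sketch below. The 3 kHz corner, the gain of 50 and the 20% mix are guesses picked only for illustration, and the first-order filter is the simplest thing that could work, not a recommendation.

```python
import math


def highpass(signal, cutoff_hz, sample_rate):
    """First-order RC-style highpass."""
    rc = 1.0 / (2.0 * math.pi * cutoff_hz)
    a = rc / (rc + 1.0 / sample_rate)
    out, y, x_prev = [], 0.0, 0.0
    for x in signal:
        y = a * (y + x - x_prev)
        x_prev = x
        out.append(y)
    return out


def hard_clip(signal, gain, limit=1.0):
    """Boost, then flatten everything beyond +/- limit."""
    return [max(-limit, min(limit, gain * x)) for x in signal]


def add_sibilance(guitar, shaped, cutoff_hz=3000.0, gain=50.0, mix=0.2,
                  sample_rate=48000):
    """Mix a highpassed, heavily clipped copy of `guitar` into `shaped`
    (the formant-filtered signal). All defaults are assumed values."""
    fizz = hard_clip(highpass(guitar, cutoff_hz, sample_rate), gain)
    return [s + mix * f for s, f in zip(shaped, fizz)]


# Usage: a step input produces a short burst of fizz that dies away,
# much like pick attack with nothing sustained behind it.
burst = add_sibilance([1.0] * 2000, [0.0] * 2000)
```

Because the clipped path only has content while there is high-frequency energy in the guitar signal, it fades on its own after the pick attack.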

Rob Strand

Quote: There's nothing there that's hard for analog.
If you look at the wiki link I posted you see there's a Singer Library.  In the description,
"The database must have all possible combinations of phonemes of the target language, including diphones (a chain of two different phonemes) and sustained vowels, as well as polyphones with more than two phonemes if necessary"

Not the thing for analog.  As for approximating some of the features with a large analog ckt, sure.

Rob Strand

Quote: The question is "How" to bring natural sibilance of a guitar into a processed sound.
The Miku unit does have some level sensitive behaviours which are demo'd in that video.
How to invoke an 's' vs a 'p' on guitar, hmm ...  (slide or pluck  ;D)  The start sounds of the Miku are kind of non-consonant.
With a digital device you could assign what signal feature triggers what consonant, but I suspect you would need
a number of modes so you could select different ways the unit operated.  It would sound pretty weird if you only got 's' or 'p' as well.

ElectricDruid

Quote from: Rob Strand on October 27, 2019, 05:58:49 PM
Quote: There's nothing there that's hard for analog.
If you look at the wiki link I posted you see there's a Singer Library.  In the description,
"The database must have all possible combinations of phonemes of the target language, including diphones (a chain of two different phonemes) and sustained vowels, as well as polyphones with more than two phonemes if necessary"

Not the thing for analog.  As for approximating some of the features with a large analog ckt, sure.

Agree that trying to build a database is going to be hard for an analog circuit! But the OP's original request was for formant filter/vowel sounds, not an analog recreation of the Vocaloid software.

The actual sound production model they're using wouldn't be beyond reach:



It's a lot of filters to build, but then, so's a vocoder.

Fancy Lime

Von Kempelen's contraption in action:



Uses German words but I think one gets the idea. Pretty darn impressive for the time, imho.

Andy
My dry, sweaty foot had become the source of one of the most disturbing cases of chemical-based crime within my home country.

A cider a day keeps the lobster away, bucko!

Fancy Lime

Quote from: StephenGiles on October 27, 2019, 05:34:16 PM
So if the amount of circuitry gets large it becomes rackmount surely, with pretty flashing lights to impress your girlfriend!!
Du can nicht impressen der girlfriend mit das blinkenlichten.

Fancy Lime

And then there is the "Voder":



A machine from 1939, developed at Bell Labs by Homer Dudley. Can anyone please tell me why on earth half the schematics (e.g. at 4:10) are in German, of all languages? For an American machine. In nineteen thirty nine?

Also: why did I ever put up with singers when I could have built that? Way better singing voice, intonation and articulation than any of the vocalists I had the joy of working with. Almost beats Hatebeak for coolness factor, too. Almost.

Andy

tubegeek

Mark is SPOT ON about "How To Wreck A Nice Beach" - fascinating, unique book. The Voder research is definitely a big part of the early part of the book. And RG - among Bell Labs' applications was the first scrambler for secure comms (between Roosevelt & Churchill during WWII.) A room full of gear at either end, doing encoding, scrambling and decoding ALL ANALOG of course. A lot in there about Claude Shannon too.

The title is a rendition of the faulty decoding that can happen on the phrase "How to recognize speech." The book is full of weird stuff like that - it occupies the very-slim space where the military-industrial complex and early rap records intersect. A totally essential read for weirdos like... well, if you don't recognize yourself then you may not be one.
"The first four times, we figured it was an isolated incident." - Angry Pete

"(Chassis is not a magic garbage dump.)" - PRR

Rob Strand

Quote: But the OP's original request was for formant filter/vowel sounds, not an analog recreation of the Vocaloid software.
I think Mark's questions created a minor fork in the thread.


Very cool videos.

Quote: Uses German words but I think one gets the idea. Pretty darn impressive for the time, imho.
The technique to get the 'L' sound was very cool as well.

Quote: A machine from 1939, developed at Bell Labs by Homer Dudley.
Sounds better than the chips that came out in the 80s, mainly because of the pitch changes.
A similar type of speech generator was used by Stephen Hawking.


Quote: German, of all languages? For an American machine. In nineteen thirty nine?
Maybe there was a European patent?

PRR

> The technique to get the 'L' sound was very cool as well.

How do you say "l"? I never noticed, but you put your tongue in the center of your mouth-jug. Keeps lengthwise resonance but shifts transverse resonance way up.

The 300:30 selection ratio and year of practice may be why we can't affordably replace singers with voder-talkers.

MaxPower

Yep, things went bonkers pretty quickly. Synths can create vowel sounds. Parallel filters help so parallel parametric eqs should work. The eqs would need to be re-tuned for each vowel sound. A microcontroller project maybe?

Or you can build an eq/filter for each vowel sound. Add a step sequencer....
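For the microcontroller route, the "re-tune the EQs for each vowel" step boils down to a lookup table plus a coefficient calculation. The sketch below uses the widely used peaking-EQ biquad form; the formant frequencies are approximate placeholders, the 12 dB boost and Q of 5 are arbitrary, and whether the sections then run in series or in parallel is left open. `retune` and `gain_db_at` are names made up for this example.

```python
import cmath
import math

# Approximate F1/F2/F3 in Hz for three vowels. Placeholder values only.
VOWEL_FORMANTS = {
    "A": (730.0, 1090.0, 2440.0),
    "I": (270.0, 2290.0, 3010.0),
    "U": (300.0, 870.0, 2240.0),
}


def peaking_coeffs(freq, gain_db, q, sample_rate):
    """Peaking-EQ biquad, returned as (b0, b1, b2, a0, a1, a2)."""
    a = 10.0 ** (gain_db / 40.0)
    w0 = 2.0 * math.pi * freq / sample_rate
    alpha = math.sin(w0) / (2.0 * q)
    cos_w0 = math.cos(w0)
    return (1.0 + alpha * a, -2.0 * cos_w0, 1.0 - alpha * a,
            1.0 + alpha / a, -2.0 * cos_w0, 1.0 - alpha / a)


def retune(vowel, gain_db=12.0, q=5.0, sample_rate=48000):
    """One set of biquad coefficients per formant of the chosen vowel."""
    return [peaking_coeffs(f, gain_db, q, sample_rate)
            for f in VOWEL_FORMANTS[vowel]]


def gain_db_at(coeffs, freq, sample_rate=48000):
    """Magnitude response of one biquad at `freq`, in dB (sanity check)."""
    b0, b1, b2, a0, a1, a2 = coeffs
    z = cmath.exp(-2j * math.pi * freq / sample_rate)
    h = (b0 + b1 * z + b2 * z * z) / (a0 + a1 * z + a2 * z * z)
    return 20.0 * math.log10(abs(h))


# Usage: coefficients for "A", then confirm the first peak sits at 730 Hz.
bank = retune("A")
peak = gain_db_at(bank[0], 730.0)
```

A step sequencer then only has to call `retune` with the next vowel name and load the new coefficients, ideally with some smoothing so the change is not abrupt.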
What lies behind us and what lies before us are tiny matters, compared to what lies within us - Emerson

Mark Hammer

Without wishing to get unnecessarily tangential, let's distinguish between processing intended to produce relatively static, or at least adjustable, timbres that are reminiscent of vocal sounds, and processing intended to create the illusion of human speech.  That is, processing that changes over time, in a manner that mimics how human speech changes.  In a sense, it is a superficial "Turing test", based purely on acoustic cues.

When I was working, and attending a meeting, I noticed that whenever I inhaled audibly, all eyes at the table would turn to me, in expectation that I was going to say something, and potentially something of import (the fools!).  But consider the oral cue here.  If one is going to "make a point", that utterance is intended to come out in one big continuous stream, rather than in halting little snippets.  So the big inhale is needed to provide the air for a longer utterance.  We may not be consciously aware of it, but it's a cue nonetheless.

Derek Trucks uses a similar device when playing slide solos.  He'll often slide down the neck, in an unpitched multi-string gliss, before making "the big statement" and it sounds for all the world like a big inhale.

To tie all of this together, I introduced this to draw attention to what makes us THINK sounds produced by an electronically processed instrument sound "just like" speech.  It's not JUST the formants and their shaping, but the overall speech cues.  It's part of how your dog knows (or assumes) that you're not just mumbling to yourself, but are "communicating" to it.  And conversely, it's how stable-level synthetic speech, without such cues, sounds robotic and inhuman.