Thinking about formant filters: there don't seem to be many analog circuits around for them.
Sure, there are anti-wah pedals and sort-of talking pedals, but none of them really address a wide range of different frequency shifts.
R.G.'s Walking Sing-Wah has some clever tricks, but the controls seem very limited.
I would love to have "pig squeals", like the ones you can hear in goregrind, accessible through a guitar effect.
What's the best/simplest approach to achieve formant filters? Any guesses?
First I thought about parallel phasers with six stages each, giving three shifting notches, with a phase-inverted LFO for the second one, but the tone shaping is limited to the resonance control.
Second I thought about parallel state variable filters, which give much better control over the frequency shifting, again with anti-parallel LFO sweeping. The Parasit Studio Sentient Machine, for example.
But that's not nearly satisfying, as there's only one notch, so there might need to be more than one SVF in series.
I'm thinking about something like parametric EQs: at least two in series, with another two in parallel. The frequency pots would be controlled by vactrols, like a common Mutron filter, but with anti-parallel LFO sweeping, or even two LFOs.
BYOC's Parametric EQ looks a little like this:
http://byocelectronics.com/parametricEQschem.pdf
Am I barking up the right tree, or what are you guys' suggestions on this topic?
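For what it's worth, a quick digital sketch of the parallel-resonator idea can help audition whether a few peaks parked at vowel-ish positions sound promising before committing to vactrols and LFOs. The formant frequencies below are only ballpark textbook values, and scipy's iirpeak stands in for an SVF bandpass:

```python
# Rough sketch of the parallel-resonator idea: a few resonant bandpasses in
# parallel, summed, each parked at a (ballpark) vowel formant frequency.
# Not a circuit, just a quick way to audition static "vowel" peaks.
import numpy as np
from scipy import signal

FS = 44100

def formant_bank(x, formants, q=8.0, fs=FS):
    """Sum of parallel resonant bandpasses, one per formant frequency in Hz."""
    y = np.zeros_like(x)
    for f0 in formants:
        b, a = signal.iirpeak(f0 / (fs / 2), q)   # normalised centre frequency
        y += signal.lfilter(b, a, x)
    return y / len(formants)

# Quick test with a sawtooth standing in for a distorted guitar:
t = np.arange(FS) / FS
saw = signal.sawtooth(2 * np.pi * 110 * t)           # roughly an open A
ah = formant_bank(saw, formants=[730, 1090, 2440])   # rough "ah" positions
ee = formant_bank(saw, formants=[270, 2290, 3010])   # rough "ee" positions
```

Swapping the static frequency lists for slowly modulated ones would approximate the anti-parallel LFO sweep.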
You should read up a little on speech-production. I don't mean that as any admonishment, but rather it may provide some ideas.
Spectrograms of speech sounds show formants as essentially bursts of acoustic energy within a given band, and moving in a given way. The "movement" is both in terms of the amplitude envelope and where the energy-band is situated.
As dual-band "anti-wahs" illustrate, formants don't all move in the same direction at the same time.
Pulling up a deck chair for this one. :icon_lol:
Thanks for the quick response, Mark.
I see your point concerning the movement and I understand that this may be the tricky part of an analog circuit that should behave like a studio plugin.
On the other hand, it might not be necessary to have two-dimensional movement. When I watch this video, for example, lots of sounds can be achieved just by sweeping peaks from left to right, while cutoff frequency and resonance serve as tone controls.
I'm not sure and I might be wrong, but I would rather focus on getting different peaks from bandpass and notch amplitudes.
Starts at ~ 1:50:
Taking a closer look at the Sing-Wah, R.G. mentioned that parametric controls could (theoretically) replace the inductors.
So we get closer to having controls for frequency, boost/cut and resonance, although he found the simulations not too promising.
On the other hand, there are problems with cap switching and inductors as well.
(http://www.geofex.com/article_folders/EQs/parmet.gif)
Here's the full article:
http://www.geofex.com/article_folders/sing-wah/sing-wah.htm
Vocoders attempt to mimic voice sounds by simply modulating the amplitude of fixed bands. So if one's voice has peaks at, say, 350, 540, and 900 Hz, those bands would be emphasized in the instrument signal, to the neglect of all other frequencies. Because vocoders simply track the relative energy of the various individual bands, and not so much the movement of speech energy across the spectrum, more realistic synthesis of speech sounds relies on having more bands for detection/modulation. For instance, the old PAiA/Craig Anderton vocoder only used 8 bands. It was okay, but not strikingly voice-like. More realistic voice qualities would need 12 or more bands.
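To make that band-tracking idea concrete, here's a bare-bones channel-vocoder sketch along those lines: split the voice into a handful of log-spaced bands, follow each band's envelope, and use it to scale the same band of the instrument. Band edges, filter orders and smoothing are arbitrary placeholder choices, not taken from any particular design:

```python
# Bare-bones channel vocoder: the envelope of each voice band scales the
# matching band of the instrument. Band edges / smoothing are placeholders.
import numpy as np
from scipy import signal

def vocoder(carrier, modulator, fs, n_bands=8, f_lo=100.0, f_hi=5000.0):
    edges = np.geomspace(f_lo, f_hi, n_bands + 1)          # log-spaced bands
    env_b, env_a = signal.butter(1, 50.0 / (fs / 2))       # envelope smoother
    out = np.zeros_like(carrier)
    for lo, hi in zip(edges[:-1], edges[1:]):
        b, a = signal.butter(2, [lo / (fs / 2), hi / (fs / 2)], btype="bandpass")
        voice_band = signal.lfilter(b, a, modulator)
        inst_band = signal.lfilter(b, a, carrier)
        envelope = signal.lfilter(env_b, env_a, np.abs(voice_band))
        out += inst_band * envelope                        # band-by-band VCA
    return out / (np.max(np.abs(out)) + 1e-9)
```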
You will note, in the first plug-in shown on a scope pic, that not only do the peaks move around, but their width changes too. Interestingly, that seems to be a property of some wahs. They are not simply "bandpass filters" providing the same selectivity no matter where the sweep is situated at the moment, but tend to change Q with the sweep. That may well be responsible for their vocal-like qualities, insomuch as the frequency content is modified the way formants are.
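If anyone wants to audition that Q-versus-sweep behaviour in isolation, here's a tiny block-by-block sketch in which the resonance is deliberately tied to the momentary centre frequency; the Q law is a pure guess, there only to make the effect obvious:

```python
# Swept resonant bandpass whose Q rises with the sweep position, processed
# block by block. The Q-vs-frequency law is a guess, just to hear the effect.
import numpy as np
from scipy import signal

def swept_bandpass(x, fs, f_lo=400.0, f_hi=2200.0, lfo_hz=1.0, block=256):
    y = np.empty_like(x)
    zi = None
    for start in range(0, len(x), block):
        pos = 0.5 * (1 + np.sin(2 * np.pi * lfo_hz * start / fs))  # 0..1 LFO
        f0 = f_lo * (f_hi / f_lo) ** pos                           # log sweep
        q = 2.0 + 8.0 * pos                                        # Q follows sweep
        b, a = signal.iirpeak(f0 / (fs / 2), q)
        seg = x[start:start + block]
        if zi is None:
            zi = signal.lfilter_zi(b, a) * seg[0]
        y[start:start + block], zi = signal.lfilter(b, a, seg, zi=zi)
    return y
```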
Quote from: Mark Hammer on October 22, 2019, 04:02:26 PM
... not only do the peaks move around, but their width changes too. Interestingly, that seems to be a property of some wahs.
Yes, and I would hypothesize this property is one of the key differences between "vocal" sounding wahs and wahs that some think sterile.
You might need to consider using a microcontroller to make the controls more flexible.
Check out Jurgen Haible's stuff. He was always doing crazy stuff like this and has some good filter circuits that may be worth looking at.
Modern hearing-aids which attempt to emphasize speech over noise have grown to "128 channels", in an attempt to handle each formant(*) individually.
Also because DSP channels have got really cheap, and to offer a higher-price model "better" than the 16-channel jobs. OTOH there is much evidence that until computers and their trainers get smarter, a 3-channel aid does very well against many-channel processing.
(* Dang, another word I have to teach my spell-chucker.)
An unverified schematic from last year.
I don't think I ever breadboarded it.
(https://i.postimg.cc/G8kq9V41/Trikbox-SVF-Anti-Wah-1-unverified.png) (https://postimg.cc/G8kq9V41)
And another. I feel sure I tested this one, though I can't remember if I liked it.
(https://i.postimg.cc/MM2JCkkk/Trikbox-Flo-Rida-Anti-Wah1-folie-a-deux-style-filters.png) (https://postimg.cc/MM2JCkkk)
I second Mark's suggestion to look into actual speech production and vowel formant frequencies. It's interesting stuff and you'll come out with different ideas.
I did my own research into this some time ago. I was trying to write a digital "vowel oscillator" algorithm. The idea was that it would have two controls, one for frequency, and another for vowel shape. I hoped to get something that could "sing" vowel sounds.
I discovered that vowels are distinguished by the number and frequency of their peaks. These vary slightly between speakers, and between men and women and children (who have voices pitched differently) but that the overall pattern is similar enough for us to identify the vowel. At a minimum, three peaks are enough to distinguish the vowels, but four or five makes the job much easier.
One key part is that the formants *don't* change much with changing pitch. So if you sing a low "Aaa" and a high "Aaa" the formants are pretty much the same (since they're both you singing it) - after all, that's what identifies it as you singing an A. A side-effect of this is that if the sung pitch goes too high, it can go *above* the range of some important formant peaks. They "fall off the bottom". This is why soprano opera singers are hard to understand. We've not got enough energy in the lower ranges that we need to be able to hear important formants that help us identify which vowel it is. Apparently people who write operas (librettists) are aware of this and either make sure that nothing of vital importance to the plot is sung in the upper registers, or make sure that it's repeated by someone else lower down! So the soprano goes "Oh my god! She's killed him!" and then the tenor goes "No! I don't believe it! She can't have killed him." just to make sure you get it.
Checking my old code, I see a reference in the comments:
Formant data is taken from "Acoustic Characteristics of American English Vowels", Hillenbrand, Getty, Clark, and Wheeler, 1995.
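Not the original code, but a minimal source-filter sketch of that kind of vowel oscillator might look like this; the formant numbers are rounded textbook averages rather than the Hillenbrand data, and the "glottal" source is just an impulse train:

```python
# Minimal "vowel oscillator": impulse-train source at the sung pitch, shaped by
# three fixed formant resonators per vowel. Formant values are rounded textbook
# averages (male voice), NOT the Hillenbrand et al. data cited above.
import numpy as np
from scipy import signal

FS = 44100
VOWELS = {                      # (F1, F2, F3) in Hz, rough averages
    "a": (730, 1090, 2440),
    "i": (270, 2290, 3010),
    "u": (300, 870, 2240),
}

def sing(pitch_hz, vowel, dur=1.0, q=10.0, fs=FS):
    n = int(dur * fs)
    source = np.zeros(n)
    source[::max(1, int(fs / pitch_hz))] = 1.0        # crude glottal pulses
    out = np.zeros(n)
    for f0 in VOWELS[vowel]:
        b, a = signal.iirpeak(f0 / (fs / 2), q)
        out += signal.lfilter(b, a, source)
    return out / np.max(np.abs(out))

low_a = sing(110.0, "a")     # low "Aaa"
high_a = sing(440.0, "a")    # high "Aaa": same formants, different pitch
```

The last two lines make the point above: same formants, different pitch, and it still reads as the same vowel.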
I hope it helps.
> the formants *don't* change much with changing pitch.
Pitch is in the vocal cords. Formants are the shape/size of the mouth cavity. Different things.
> This is why soprano opera singers are hard to understand.
You don't have to understand the opera soprano. It's an instrument, not a telegram.
> or make sure that it's repeated by someone else lower down! So the soprano goes "Oh my god! She's killed him!" and then the tenor goes "No! I don't believe it! She can't have killed him."
And yeah, that too. Well spotted.
Mezzo: We must be going.
Soprano: Oh yes we're going!
Bari: We are going, oh yes we are going...
Chorus: Going now! Going now!!
Soprano: We really hate to leave you but we really must be going
Bari: Yes we are going, going now
Tenor: Fetch my hat, fetch my gloves (servant enters with hat&gloves)
Sopr: Give our best to all your loves! (Goodbye hugs all around)
Alto: Going!
Bass: Going!
Sopr: Going Now! Now! Now!
Host: (taps hourglass) Vaya con Dios, dammit, already! Get going NOW!
three Mary Fords: Vaya con Dios
ALL: Going now! Going Now!
Bari: Going!
Alto: ........ Going!
Tenor: .................Going!
Sopr: ..........................Going!
ALL: Going going going!!!! Now! Now! Now!
Host: (sotto voice) real soon now.
(Orchestral flourish, everybody dances in circles.....) Going now! Going now!
I saw a modulated filter somewhere using PWM to control analogue switches for a switched resistor filter, and it made quite convincing yi-yi-yi-yi-yi sounds from a guitar input. (Hope this makes sense). It may have been on this forum.
EDIT: found it, Parasit studio's Sentient Machine: https://www.parasitstudio.se/sentientmachine.html
Quote from: ElectricDruid on October 24, 2019, 01:42:25 PM
I second Mark's suggestion to look into actual speech production and vowel formant frequencies. It's interesting stuff and you'll come out with different ideas.
...
One needs to keep in mind that vowels are recognizable as such, and "the same", whether spoken by a 5 year-old, a 95 year-old, an Australian, an Italian, a Russian, a man, a woman, a Buddhist monk, a person with a raspy voice. It's not the absolute frequency content, but the relative content, and the pattern, including how the relationships between bursts are maintained. That's the miracle of speech recognition in humans. We're attuned to, and track, patterns that let us recognize words in spite of accent, pitch, rasp/fry, and other inconsistencies across human speakers.
Making a guitar mimic speech sounds, especially vowels, is somewhat easier, because there is no accent to contend with, and the vowel to be mimicked is a "pure" sound, rather than a vowel as spoken by a Tuvan throat singer.
Quote
One needs to keep in mind that vowels are recognizable as such, and "the same", whether spoken by a 5 year-old, a 95 year-old, an Australian, an Italian, a Russian, a man, a woman, a buddhist monk, a person with a raspy voice.
...
That's the miracle of speech recognition in humans.
Indeed. I remember reading articles on speech recognition saying how AT&T had difficulties rolling out speech recognition due to the different ethnic groups in the US.
Really nice article, written by a well known DSP Guru from AT&T/Bell Labs,
https://www.cs.brandeis.edu/~cs136a/CS136a_docs/Speech_Recognition_History.pdf
Thanks for that, Rob.
I'll return the favour by noting a decent book on the evolution of synthetic voice in music, wartime communications, etc., including vocoding, talk boxes, and such: "How to Wreck a Nice Beach" ( https://www.researchgate.net/publication/283979041_How_to_Wreck_a_Nice_Beach_The_Vocoder_from_World_War_II_to_Hip-Hop_the_Machine_Speaks_Dave_Tompkins ). It's a little chaotic, in terms of writing style, but covers a great deal of the last 100 years, and the interconnections between the many ways in which such technology was developed, marketed, and incorporated into popular culture. A wealth of musical suggestions as well. A lot of artists I had simply never heard of before.
Quote
It's a little chaotic, in terms of writing style, but covers a great deal of the last 100 years, and the interconnections between the many ways in which such technology was developed, marketed, and incorporated into popular culture. A wealth of musical suggestions as well.
Amazing that the vocoder has its origins in Bell Labs. I suppose if you listened to scrambled speech as part of your job, it only takes one bright spark to make use of it elsewhere. As a child, even single-sideband CB radio sounded cool to me.
Quote
A lot of artists I had simply never heard of before.
I felt equally out of touch reading the first page and a half.
Bell Labs built on a body of knowledge from the even earlier work done in making vocal tract mechanical models. I'd have to look up the dates, but in the ?1890s? there were talking machines which produced very reasonable speech when operated by trained women moving levers and such to move the model tracts' parts.
Two formant filters are reputed to be enough for vowel mimicry. Some variants of vowelizers/diphthongizers use a third, but rarely. OTA state variable filters are a nice analog way to mechanize this, but that leaves open the question of how to drive the filters to the intended frequencies. I haven't built one of these in over a decade. Today, I'd use a uC with a table for the control voltages, PWM to make the control voltages, and some kind of "gliding" control to move the sounds smoothly between the intended vowel approximations.
You could do the same in a dsp, of course.
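A rough sketch of that table-plus-glide control scheme, with placeholder vowel targets, update rate and glide time; on a real uC the two outputs would become PWM duty cycles driving the filters' frequency control inputs:

```python
# Control-side sketch of the uC scheme above: a table of per-vowel targets for
# two formant filters, approached through a one-pole "glide". All values are
# placeholders, not measured vowel data.
import numpy as np

VOWEL_TABLE = {          # (F1, F2) targets in Hz for the two formant filters
    "a": (730, 1090),
    "e": (530, 1840),
    "i": (270, 2290),
    "o": (570, 840),
    "u": (300, 870),
}

def glide_through(vowels, hold_s=0.25, glide_tau_s=0.05, update_hz=1000.0):
    """Step through vowels, gliding each filter's control value to its target."""
    alpha = 1.0 - np.exp(-1.0 / (glide_tau_s * update_hz))   # one-pole coefficient
    cv = np.array(VOWEL_TABLE[vowels[0]], dtype=float)
    trace = []
    for v in vowels:
        target = np.array(VOWEL_TABLE[v], dtype=float)
        for _ in range(int(hold_s * update_hz)):
            cv += alpha * (target - cv)          # exponential glide toward target
            trace.append(cv.copy())              # -> two PWM duties / CVs
    return np.array(trace)

cv_trace = glide_through(["a", "e", "i", "o", "u", "a"])
```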
While you could connect vocoders with scramblers, Bell's interest was probably:
* Helping mutes talk (mirror of helping deaf hear, which Bell worked on).
* Putting coded speech on bad telegraph lines to carry more conver$ations with less copper.
Vocoders never replaced direct speech for general telephony. Inexperienced people do not understand simple vocoders.
A few vocoder mute-aids were prototyped but never caught on. Throat-buzzers (Electrolarynx) can replace lost vocal cords so throat/lips can shape a "speech" which is not pretty but becomes intelligible.
When a vocoder is complex enough for special uses (Type-n-Talk), there's probably some better way to do it (storing recorded snips in ROM).
Quote
Bell Labs built on a body of knowledge from the even earlier work done in making vocal tract mechanical models. I'd have to look up the dates, but in the ?1890s? there were talking machines which produced very reasonable speech when operated by trained women moving levers and such to move the model tracts' parts.
Page 4 of the article I posted goes back to 1773 and 1791 then some more in mid 1800's but it's not clear at what point it could be considered speech.
Anyone have any insights into the Korg Miku pedal, and its approach to mimicking speech? I'm not aiming to clone a Miku. I'm just interested in the approach they took to the use of Vocaloid or something like it. Perhaps simply slowing down a sample would reveal a little.
Quote
Anyone have any insights into the Korg Miku pedal, and its approach to mimicking speech? I'm not aiming to clone a Miku. I'm just interested in the approach they took to the use of Vocaloid or something like it. Perhaps simply slowing down a sample would reveal a little.
My guess is Miku is the name they gave to the Vocaloid "singer" profile. Some info here. Seems like there should be a patent.
https://en.wikipedia.org/wiki/Vocaloid
(That product suffers from unfortunate switch placement. #whatweretheythinking)
It's sort of the other way around. Hatsune Miku is the name of what is essentially an anime character who gives live holographic "performances", using Vocaloid to generate the singing, in a manner that can be synced with the CGI stage hologram. Apparently a very popular act, hence the naming of the Korg pedal after "her".
Quote
Apparently a very popular act, hence the naming of the Korg pedal after "her".
I found the same stuff about 2 mins after I posted. Initially the pedal was only released in Japan partly due to the enormous popularity of Miku there but was then released elsewhere.
This video is pretty good. Gives a good overall perspective of history, motives and sounds,
https://www.youtube.com/watch?v=aveUEZkcQno
Pretty sure it's totally DSP and un-analogable.
> mid 1800's but it's not clear at what point it could be considered speech.
Nice grip on that tube.
(https://i.postimg.cc/gwm0pG5Z/Speaking-Machine-mid-1800s.gif) (https://postimg.cc/gwm0pG5Z)
Quote from: Rob Strand on October 26, 2019, 09:09:50 PM
Pretty sure it's totally DSP and un-analogable.
I suspect that what requires it to be totally DSP is the manner in which it progresses through a variety of vowel-like sounds that have the superficial characteristics of speech. In other words, it doesn't just go wah-wah-wah (or even yoy-yoy-yoy - see what I did there, Jimi? ) So, insomuch as there needs to be a sort of decision about what formants to produce next, it MUST be digital. But, as various speech-faking circuits have demonstrated over the years, if the electronic production of formants is deliberately limited in its range, it very much CAN be analogable. I guess the associated question that is compelled is "Can a little more variety in formant production than mere wah-wah be produced without having to resort to DSP?".
For a number of years, I've been blathering on about the idea of a "talked-to" pedal. Not a vocoder, as such, but something that responded to the amplitude of different vocal frequency bands, and used them to control various parameters. Naturally, it would be difficult to provide much fine control over parameters using only voice (I find it hard to imagine intentional control over more than 3 parameters), but if paired up with a wah-like foot-pedal, I would imagine a great deal of articulation could be achieved; especially if, say, any detection of guttural sounds increased distortion to be mixed in.
(That Hatsune Miku video is totally pop-gone-meta, blew my mind.)
Yes, it can be done in analog, even with multiple formants, but the amount of circuitry starts to get large. A vocoder is a decent demonstration of both the principle and the problem - can generate highly speech-like sound (intelligible, even), but will use a lot of circuitry to get there.
We could design a system with several sweepable bandpass filters with control over Frequency and Resonance (so perhaps a voltage-controlled state variable design - there are lots of those). Each filter would need to be in a specific position for each vowel. We could attach (say) an eight-step sequencer to each filter to control the cutoff, with each step adjusted to where we need it (say the first five were AEIOU, then add a few more). Bonus points if we add a second sequencer to each channel for resonance too, so we can alter the width of individual peaks. Moving instantly from one vowel to another isn't realistic or that interesting, so let's add a VC-glide circuit on our CV outputs for both Cut-off and Resonance too. That can probably take a global control, just to keep things reasonably under control, though it wouldn't have to.
How you want to trigger this beastie is up to you. You could step through each vowel in turn by incrementing the sequencers. You could jump to a specific vowel by sending them some three-bit binary input from somewhere. You could do something else entirely.
There's nothing there that's hard for analog. It's just that there's a lot of it, and it would take a lot of circuit building before you even knew if it sounded any good and was worth pursuing. One advantage of digital/dsp solutions is the low cost development...take the same general purpose board you used last time, bung some new code on it, kick a few ideas around, see if it's a go or a no-go. Fast and cheap.
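In that fast-and-cheap spirit, a quick code mock-up of the beastie might look something like this: two resonators, each with a step-sequenced frequency and resonance and a glide between steps, run block by block. All the step values are placeholders, and filter state isn't carried between blocks, so expect some zipper noise:

```python
# Quick mock-up of the sequenced dual-resonator idea: each filter has its own
# step sequence for frequency and Q, with a glide between steps. All step
# values are placeholders, just to get something audible quickly.
import numpy as np
from scipy import signal

def sequenced_formants(x, fs, seq_f, seq_q, step_s=0.3, glide=0.05, block=256):
    """seq_f / seq_q: per-step (filter1, filter2) frequency (Hz) and Q values."""
    n_steps = len(seq_f)
    f = np.array(seq_f[0], dtype=float)
    q = np.array(seq_q[0], dtype=float)
    y = np.zeros_like(x)
    for start in range(0, len(x), block):
        step = int(start / fs / step_s) % n_steps       # current sequencer step
        f += glide * (np.array(seq_f[step]) - f)        # glide toward step targets
        q += glide * (np.array(seq_q[step]) - q)
        seg = x[start:start + block]
        for i in range(2):                              # two parallel resonators
            b, a = signal.iirpeak(f[i] / (fs / 2), q[i])
            y[start:start + block] += signal.lfilter(b, a, seg)
    return y / 2

# e.g. five "vowel" steps (roughly A E I O U) for the two filters:
seq_f = [(730, 1090), (530, 1840), (270, 2290), (570, 840), (300, 870)]
seq_q = [(8.0, 8.0)] * 5
```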
So if the amount of circuitry gets large it becomes rackmount surely, with pretty flashing lights to impress your girlfriend!!
IIRC, the Roland Vocoder/strings keyboards had choir voicings using basic formant filtering - the full Vocoder part is separate. They are barely convincing - but wait till the ensemble chorus is switched on. This video demo of the Behringer VC340 clone has a section devoted to it from 11:56...
Most vocoders will have some means for introducing simulated sibilance into the mix/output. A guitar on its own will not have much frequency content "up there", and certainly not in any sustained way (i.e., you may get it AS you pick, but not immediately after that). And although vowel production does not require the high-frequency content needed for plosives and fricatives, having at least some sibilance may be useful in making formant production feel more realistic and speech-like. I recall getting a letter from RG several decades back with an idea he had about a means to introduce "breathing" into a pedal. I should see if I can dig that note up.
The question is "How" to bring natural sibilance of a guitar into a processed sound. Here, I think it useful to borrow something from acoustic simulators and "exciters". That is, highpass filter one copy of the guitar signal, and clip the dickens out of what's left, for mixing in selectively with the filter-shaped guitar signal. The Miku pedal, at least from the demos I've heard, does not incorporate any simulated sibilance, only vowels.
Quote
There's nothing there that's hard for analog.
If you look at the wiki link I posted you see there's a Singer Library. In the description,
"The database must have all possible combinations of phonemes of the target language, including diphones (a chain of two different phonemes) and sustained vowels, as well as polyphones with more than two phonemes if necessary"
Not the thing for analog. As for approximating some of the features with a large analog ckt, sure.
Quote
The question is "How" to bring natural sibilance of a guitar into a processed sound.
The Miku unit does have some level sensitive behaviours which are demo'd in that video.
How to invoke an 's' vs a 'p' on guitar, hmm... (slide or pluck ;D). The start sounds of the Miku are kind of non-consonant.
With a digital device you could assign which signal feature triggers which consonant, but I suspect you would need a number of modes so you could select different ways for the unit to operate. It would sound pretty weird if you only got 's' or 'p' as well.
Quote from: Rob Strand on October 27, 2019, 05:58:49 PM
Quote
There's nothing there that's hard for analog.
If you look at the wiki link I posted you see there's a Singer Library. In the description,
"The database must have all possible combinations of phonemes of the target language, including diphones (a chain of two different phonemes) and sustained vowels, as well as polyphones with more than two phonemes if necessary"
Not the thing for analog. As for approximating some of the features with a large analog ckt, sure.
Agree that trying to build a database is going to be hard for an analog circuit! But the OP's original request was for formant filter/vowel sounds, not an analog recreation of the Vocaloid software.
The actual sound production model they're using wouldn't be beyond reach:
(https://upload.wikimedia.org/wikipedia/commons/thumb/2/22/Excitation_plus_Resonances_%28EpR%29_voice_model_%28Bonada_et_al._2001%2C_Fig.1%29.svg/560px-Excitation_plus_Resonances_%28EpR%29_voice_model_%28Bonada_et_al._2001%2C_Fig.1%29.svg.png)
It's a lot of filters to build, but then, so's a vocoder.
Von Kempelen's contraption in action:
Uses German words but I think one gets the idea. Pretty darn impressive for the time, imho.
Andy
Quote from: StephenGiles on October 27, 2019, 05:34:16 PM
So if the amount of circuitry gets large it becomes rackmount surely, with pretty flashing lights to impress your girlfriend!!
Du can nicht impressen der girlfriend mit das blinkenlichten.
And then there is the "Voder":
A machine from 1939, developed at Bell Labs by Homer Dudley. Can anyone please tell me why on earth half the schematics (e.g. at 4:10) are in German, of all languages? For an American machine. In nineteen thirty nine?
Also: why did I ever put up with singers when I could have built that? Way better singing voice, intonation and articulation than any of the vocalists I had the joy of working with. Almost beats Hatebeak for coolness factor, too. Almost.
Andy
Mark is SPOT ON about "How To Wreck A Nice Beach" - fascinating, unique book. The Voder research is definitely a big part of the early part of the book. And RG - among Bell Labs' applications was the first scrambler for secure comms (between Roosevelt & Churchill during WWII.) A room full of gear at either end, doing encoding, scrambling and decoding ALL ANALOG of course. A lot in there about Claude Shannon too.
The title is a rendition of the faulty decoding that can happen on the phrase "How to recognize speech." The book is full of weird stuff like that - it occupies the very-slim space where the military-industrial complex and early rap records intersect. A totally essential read for weirdos like... well, if you don't recognize yourself then you may not be one.
Quote
But the OP's original request was for formant filter/vowel sounds, not an analog recreation of the Vocaloid software.
I think Mark's questions created a minor fork in the thread.
Very cool videos.
Quote
Uses German words but I think one gets the idea. Pretty darn impressive for the time, imho.
The technique to get the 'L' sound was very cool as well.
Quote
A machine from 1939, developed at Bell Labs by Homer Dudley.
Sounds better than the chips that came out in the 80s, mainly because of the pitch changes.
Similar type of speech generator was used by Stephen Hawking.
Quote
German, of all languages? For an American machine. In nineteen thirty nine?
Maybe there was a European patent?
> The technique to get the 'L' sound was very cool as well.
How do you say "l"? I never noticed, but you put your tongue in the center of your mouth-jug. Keeps lengthwise resonance but shifts transverse resonance way up.
The 300:30 selection ratio and year of practice may be why we can't affordably replace singers with voder-talkers.
Yep, things went bonkers pretty quickly. Synths can create vowel sounds. Parallel filters help, so parallel parametric EQs should work. The EQs would need to be re-tuned for each vowel sound. A microcontroller project, maybe?
Or you can build an eq/filter for each vowel sound. Add a step sequencer....
Without wishing to get unnecessarily tangential, let's distinguish between processing intended to produce relatively static, or at least adjustable, timbres that are reminiscent of vocal sounds, and processing intended to create the illusion of human speech. That is, processing that changes over time, in a manner that mimics how human speech changes. In a sense, it is a superficial "Turing test", based purely on acoustic cues.
When I was working, and attending a meeting, I noticed that whenever I inhaled audibly, all eyes at the table would turn to me, in expectation that I was going to say something, and potentially something of import (the fools!). But consider the oral cue here. If one is going to "make a point", that utterance is intended to come out in one big continuous stream, rather than in halting little snippets. So the big inhale is needed to provide the air for a longer utterance. We may not be consciously aware of it, but it's a cue nonetheless.
Derek Trucks uses a similar device when playing slide solos. He'll often slide down the neck, in an unpitched multi-string gliss, before making "the big statement" and it sounds for all the world like a big inhale.
To tie all of this together, I introduced this to draw attention to what makes us THINK sounds produced by an electronically processed instrument sound "just like" speech. It's not JUST the formants and their shaping, but the overall speech cues. It's part of how your dog knows (or assumes) that you're not just mumbling to yourself, but are "communicating" to it. And conversely, it's how stable-level synthetic speech, without such cues, sounds robotic and inhuman.
Not analog, but I got some interesting results in DSP with what I think is two resonant filters in series. It's been a long time since I did this, so I can't remember the details. These are really resonant SVFs, so it should be fairly easily adaptable to analog.
By "easy" of course I mean theoretically possible. This is probably:
An envelope follower going into two independent gain/offset blocks, so that the direction, center, and sweep of the filters are independently controllable. The output of each of those goes to the center frequency control of one of the SVFs. Q is fixed. I was really trying to get a wah/anti-wah thing going, but when I stumbled on this I laughed so hard that I just left it that way. I think it gets some pretty good guttural sounds on the low strings.
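A sketch of that signal flow, with placeholder gain/offset numbers and a simple rectify-and-smooth standing in for the envelope follower:

```python
# Sketch of the patch described above: one envelope follower feeding two
# independent gain/offset blocks, each setting the centre frequency of a very
# resonant bandpass, with the two filters in series. Gains/offsets are
# placeholder values; Q is fixed.
import numpy as np
from scipy import signal

def dual_swept_formants(x, fs, q=12.0, block=256,
                        gains=(+1500.0, -900.0), offsets=(600.0, 1600.0)):
    env_b, env_a = signal.butter(1, 20.0 / (fs / 2))      # envelope smoother
    env = signal.lfilter(env_b, env_a, np.abs(x))
    y = np.empty_like(x)
    for start in range(0, len(x), block):
        seg = x[start:start + block]
        level = env[start:start + block].mean()           # block-average envelope
        for g, o in zip(gains, offsets):                  # two gain/offset blocks
            f0 = np.clip(o + g * level, 80.0, 0.45 * fs)  # per-filter centre freq
            b, a = signal.iirpeak(f0 / (fs / 2), q)
            seg = signal.lfilter(b, a, seg)               # filters in series
        y[start:start + block] = seg
    return y / (np.max(np.abs(y)) + 1e-9)
```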
https://www.soundclick.com/artist/default.cfm?bandID=1373300&content=songs
Check the sound clips for "Tuvan Throat Singer" and "Munchkin Choir".
> distinguish between processing intended to produce relatively static, or at least adjustable, timbres that are reminiscent of vocal sounds, and processing intended to create the illusion of human speech.
The naked voder made vowels. It took a year of training for "girls" to get good at "speech".
The music-filter analogy is: you can put an "oo" filter in easy, but to "sing" "yes we're going really going going now" is tons more work/practice/control. (Experience your mouth already has, hence talk-boxes.)
> the illusion of human speech.
As opposed to? Unhuman speech? Superhuman speech? The only "hit records" with non-human speech are Collins et al's whale songs (https://en.wikipedia.org/wiki/Songs_of_the_Humpback_Whale_(album)#Reception) (which are Greek to me) and some meaningless oinks in Piggies (https://en.wikipedia.org/wiki/Piggies) and a few others.
When I say "illusion of human speech", I mean that it passes a sort of Turing test, and sounds to us like a person speaking, even if it sounds like a poor recording of speech (e.g., the way an old Edison wax cylinder might). When we hear Vocaloid or Auto-Tune or Stephen Hawking, it sounds synthetically generated to us; more on the machine side of "the uncanny valley" than on the human side. We can certainly make out the words, which I guess was the principle objective of all those industrial efforts to squeeze most intelligibility out of least bandwidth, but sometimes at the cost of not sounding human. So, for me at least, the challenge is identifying those aspects that can bring things a little closer to the human side of the uncanny valley. If it can't be done, I'll accept that. But until we can't tell if it's live or Memorex, I think we ought to keep trying. :icon_biggrin:
Quote from: PRR on October 26, 2019, 04:19:03 PM
When a vocoder is complex enough for special uses (Type-n-Talk), there's probably some better way to do it (storing recorded snips in ROM).
Texas Instruments' Speak-and-Spell family used linear predictive coding, stored in ROM to contain phonemes (sounds that make up speech).
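For anyone who hasn't bumped into LPC: the idea is to fit an all-pole filter to a chunk of speech (the filter captures the formants), then re-excite it with pulses or noise. A minimal sketch, with the order and pitch as placeholder values and no stability safeguards:

```python
# Minimal LPC sketch: fit an all-pole filter to one speech frame (autocorrelation
# method), then resynthesize it from an impulse train. Order and pitch are
# placeholders -- illustration only, not TI's actual implementation.
import numpy as np
from scipy import signal, linalg

def lpc_coeffs(frame, order=12):
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    a = linalg.solve_toeplitz((r[:order], r[:order]), r[1:order + 1])
    return np.concatenate(([1.0], -a))               # A(z) = 1 - sum a_k z^-k

def lpc_resynth(frame, fs, pitch_hz=110.0, order=12):
    a = lpc_coeffs(frame * np.hamming(len(frame)), order)
    excitation = np.zeros(len(frame))
    excitation[::max(1, int(fs / pitch_hz))] = 1.0   # voiced excitation
    return signal.lfilter([1.0], a, excitation)      # 1/A(z) synthesis filter
```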
Quote from: Mark Hammer on October 30, 2019, 09:13:20 AM
When we hear Vocaloid or Auto-Tune or Stephen Hawking, it sounds synthetically generated to us; more on the machine side of "the uncanny valley" than on the human side.
The irony of that is that auto-tune *isn't* synthetically generated - it's actual human speech, processed, more like a vocoder than a speak-n-spell.
This implies there's a quality of genuine speech that can be lost (or removed) from actual speech and leave something that is intelligible but alien. Perhaps if we could figure out what that was, we could add it to other sound and make something that sounds human but unintelligible? I think I've probably heard sounds in that category over the years, and not just travelling in foreign countries!
In some ways, Autotune illustrates how complex and multidimensional the "humanification" of synthetic speech is. What makes Autotune, or at least the way that many musicians have used it, sound nonhuman is the suddenness of the shift in formants and pitch.
Here's an analogy that might be useful to spark thinking: how do dogs recognize something as another animal? They spend most of their time around people. So how does a German Shepherd see a dumb little Yorkie or Dachshund and recognize it as another dog or at least another living thing? There will be traits that are shared by "living things" that are uncharacteristic of objects (and vice versa). It's partly their physical characteristics, but also the nature of the movement. Some movement is perceptibly and uniquely "biological".
A few years back I took an online course in ChucK, which is a somewhat bizarre programming language for sound creation. One of the assignments was to create a synthetic conversation between two beings, one small and one large. Here's my version:
https://www.kadenze.com/courses/physics-based-sound-synthesis-for-games-and-interactive-systems-iv/gallery/file-submission-20-6-synthe-tique-dialogue-fantastique-a-couple-o-critters-sittin-around-talkin/gary-worsham-d7cd8750-fed7-423?browsing_scope=course
I recommend this class if you're at all interested in physical modeling synthesis. It's not analog though so sorry for the OT.
DL
Prelinguistic infants (we'll say between 9 and 15 months for argument's sake) will babble away, practicing the twists and turns of the phonemes of the language they're being raised in. When one ignores the absence of discernible "meaning" in what they're babbling, and simply examines the duration of their utterances, how and where pauses are inserted, and the prosody contained (i.e., how the pitch moves over the course of the utterance), it is largely indistinguishable from the speech of adult talkers. By 10-12 months, most infants have mastered what sounds like adult speech, even though they don't necessarily have any actual words to plug into it. I remember well when our eldest was maybe 15-16 months or so, and we were at a Burger King. He sauntered over to the play area where he spied some kids who must have been around 7-8. He opened his mouth, said "Hi", and then poured forth with a stream of absolute gibberish that nevertheless had the superficial/structural qualities of speech. The kids all looked at me and asked "What did he say?", assuming by those structural properties that there was some underlying communicative intent and not simply gibberish. The sample that Larry linked to provides an excellent illustration of that. Not a single word in there, but it sure sounds like a conversation.
That's why I suggest that if one wants to mimic a "talking" guitar, it helps to understand what leads us to perceive sounds as more speech-like and not simply mechanical sounds from the objects around us. It doesn't necessarily have to involve sophisticated technology. For instance, Jeff Beck's use of the humble wah in his rendition of the Willie Dixon tune "I Ain't Superstitious" sounds like a conversation with Rod Stewart.
Quote from: Mark Hammer on October 31, 2019, 09:04:05 AM
Prelinguistic infants (we'll say between 9 and 15 months for argument's sake) will babble away, practicing the twists and turns of the phonemes of the language they're being raised in. ...
Along the same lines:
Similarly: "Wenn ist das Nunstück git und Slotermeyer? Ja! Beiherhund das Oder die Flipperwaldt gersput!"