Software Voice Vowel Detection in ActionScript 3.0
In case you think I’ve solved this and you’re just here for a download link, I have to let you know that I came pretty close but gave up. I’ll tell you what I did and why I decided to put the project down. If you’ve followed the thread on Flashcoders, you already have some insight. Perhaps this post will get you thinking and you’ll come up with a workable solution!
The start: while implementing a text-to-speech engine (which returns an on-the-fly .mp3 file), I harnessed the power of SoundMixer.computeSpectrum. This let me pretty easily move a character’s jaw up and down based on the amplitude of the audio playback. As long as the jaw doesn’t move too drastically, it looks pretty decent.
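A rough sketch of that amplitude-to-jaw mapping could look like the following. It is written in TypeScript rather than ActionScript so it is easy to run anywhere, and the function name, `maxAngle`, and `gain` are assumptions made up for illustration. `samples` stands in for one frame of floats read out of the ByteArray that SoundMixer.computeSpectrum fills.

```typescript
// Hypothetical helper: turn one frame of spectrum samples into a jaw angle.
// maxAngle (degrees) and gain are invented tuning values, not from the project.
function jawAngle(samples: number[], maxAngle: number = 20, gain: number = 2): number {
  if (samples.length === 0) return 0;

  // Mean absolute amplitude of the frame.
  let sum = 0;
  for (const s of samples) sum += Math.abs(s);
  const level = sum / samples.length;

  // Clamp so a loud frame never opens the jaw past maxAngle.
  const clamped = Math.min(1, level * gain);
  return clamped * maxAngle;
}
```

Capping the angle is one simple way to keep the jaw from moving too drastically; smoothing between frames would be another.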
But what I really wanted to do was to shape the mouth to match the audio as best I could. Since I was using a software voice (not related in any way to Mac OS X’s voices), I could, in theory, match patterns more accurately.
I began by creating a spectrum analyzer so I could evaluate the .readFloat values coming through SoundMixer. Now, I wanted to generate vowel “patterns” of values that I could store and use for matching later on the fly. I added an input text field and a speak button. A handy array in my spectrum class would gobble up values as they poured through the SoundMixer. Another button would later trace out all of the values captured. Yes, I just ran this application for each vowel I entered and played back. I ignored all zero values for each vowel, as there were tons of these… mostly at the beginning and the end of the audio file playing back.
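The capture step described above can be sketched roughly like this. It is TypeScript rather than ActionScript, and the class and method names are hypothetical. Each call to `capture` stands in for one enterFrame’s worth of values read from SoundMixer.

```typescript
// Hypothetical recorder: gobbles up non-zero values frame by frame.
class PatternRecorder {
  private values: number[] = [];

  // Call once per frame with that frame's samples; zeros are skipped.
  capture(frame: number[]): void {
    for (const v of frame) {
      if (v !== 0) this.values.push(v);
    }
  }

  // Copy of everything captured so far, usable as a lookup pattern.
  getPattern(): number[] {
    return this.values.slice();
  }

  // Comma-separated dump, like the trace button's output.
  dump(): string {
    return this.values.join(", ");
  }
}
```

Running it once per vowel and pasting the dump into a lookup array mirrors the manual process described above.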
So I ran and collected values for naked vowels (“a”, “e”, “i”, “o”, and “u”). Granted, even with naked vowels, the resulting values sometimes vary a little from run to run. For every enterFrame I collected 256 values. As an example, here are the values for the vowel “e” (for my software voice, non-zero values only, beginning to end; I’m showing this one because it had the least data of any vowel):
0.16437114775180817, 0.05698421224951744, 0.29469117522239685, 0.36785733699798584, 0.6720859408378601, 0.6766841411590576, 0.7320562601089478, 0.5057891607284546, 0.20905239880084991, 0.27565640211105347, 0.21965934336185455, 0.05556119605898857, 0.03433304652571678, 0.0008950285846367478, 0.21720515191555023, 0.24865098297595978, 0.22235532104969025, 0.0186882633715868, 0.015403595753014088, 0.27536413073539734, 0.5621691346168518, 0.29421526193618774, 0.6785570383071899, 0.3395964205265045, 0.7934200763702393, 0.5752798914909363, 0.0018509186338633299, 0.6312869787216187, 0.27040114998817444, 0.4692630171775818, 0.1586538404226303, 0.1783161163330078, 0.16662229597568512, 0.0667736753821373, 0.18298551440238953, 0.0435587540268898, 0.13021166622638702, 0.07257943600416183, 0.05768268555402756, 0.12160884588956833, 0.3065049350261688, 0.5201613306999207, 0.8288012146949768, 0.6042874455451965, 0.5649490356445313, 0.4499855637550354, 0.5027022361755371, 0.27770909667015076, 0.10652794688940048, 0.053154200315475464, 0.05952844396233559, 0.027240918949246407, 0.012978718616068363, 0.43212512135505676, 0.5084068775177002, 0.3296736478805542, 0.6569280624389648, 0.19170114398002625, 0.3795267641544342, 0.6922668218612671, 0.5504468083381653, 0.0013365419581532478, 0.3074118196964264, 0.04432011768221855, 0.08865272253751755, 0.13337476551532745, 0.015104932710528374, 0.3372088372707367, 0.3984711766242981, 0.4381348192691803, 0.6821690797805786, 0.38577720522880554, 0.28415364027023315, 0.8259784579277039, 0.19797925651073456, 0.18903198838233948, 0.3253178000450134, 0.24393196403980255, 0.16757753491401672, 0.034594301134347916, 0.25724005699157715, 0.03448990359902382, 0.0033073625527322292, 0.09720642119646072, 0.38554030656814575, 0.1810891032218933, 0.5335835814476013, 0.6567003726959229, 0.42182138562202454, 0.5153235793113708, 0.6158512830734253, 0.2590691149234772, 0.09425458312034607, 0.42378973960876465, 0.11871729046106339, 0.3611906170845032, 
0.06403621286153793, 0.5142664909362793, 0.45043256878852844, 0.20055122673511505, 0.29153770208358765, 0.3764188587665558, 0.28491273522377014, 0.18202126026153564, 0.15635745227336884, 0.3402288258075714, 0.22502505779266357, 0.25899720191955566
There you have it. My “e”. I created arrays for each vowel that serve for lookup.
Then, while the sound was playing, I’d look for a value at or close to a pattern’s starting value, and from there keep checking whether the following values stayed close (excluding zero values, of course). As for readFloat: it reads a 32-bit float at the ByteArray’s current position and advances that position by four bytes, so each call returns the next value. Anyway, I found that you don’t need to check every single incoming value against the pattern array. Sometimes just checking the first one or two values works a charm, and you can probably check every 10th value or so. If you check everything and are extremely strict, you really risk a stray value breaking an otherwise good match. Once a match is ruled out, stop checking against that vowel. Why keep checking when you know it has already failed? Save some cycles.
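As a minimal sketch of that matching idea (in TypeScript, not the original ActionScript), the function below compares incoming non-zero values against a stored pattern, optionally checks only every Nth value, and bails out at the first miss. The function names, the `tolerance` default, and the `stride` parameter are all assumptions for illustration.

```typescript
// Hypothetical matcher: true if `incoming` stays within `tolerance` of
// `pattern` at every checked index. `stride` = 10 checks every 10th value.
function matchesPattern(
  incoming: number[],
  pattern: number[],
  tolerance: number = 0.01,
  stride: number = 1
): boolean {
  if (pattern.length === 0 || incoming.length < pattern.length) return false;
  const step = Math.max(1, Math.floor(stride));
  for (let i = 0; i < pattern.length; i += step) {
    // Early exit: this vowel has already failed, so stop checking it.
    if (Math.abs(incoming[i] - pattern[i]) > tolerance) return false;
  }
  return true;
}

// Returns the first vowel whose pattern matches, or null if none do.
function detectVowel(
  incoming: number[],
  patterns: Record<string, number[]>,
  tolerance: number = 0.01,
  stride: number = 1
): string | null {
  for (const vowel of Object.keys(patterns)) {
    if (matchesPattern(incoming, patterns[vowel], tolerance, stride)) return vowel;
  }
  return null;
}
```

A larger stride trades accuracy for speed: fewer comparisons per frame, but values between the checked positions are never looked at.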
Those pattern values have many decimal places, which also risks throwing off matches. So I used Number(level.toFixed(3)), and this seemed to work pretty well. I don’t trim the values before stuffing them into the pattern arrays, though; keeping them as captured gives me some flexibility later.
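The same trimming trick works unchanged in TypeScript, so here is a small runnable sketch of it. The helper names `trim` and `sameLevel` are made up for illustration.

```typescript
// Round a level to a fixed number of decimal places at comparison time,
// leaving the stored pattern values untouched.
function trim(level: number, places: number = 3): number {
  return Number(level.toFixed(places));
}

// Two levels count as equal if they round to the same value.
function sameLevel(a: number, b: number, places: number = 3): boolean {
  return trim(a, places) === trim(b, places);
}
```

One caveat with rounding: two values that are very close can still land on opposite sides of a rounding boundary (0.1644 and 0.1646 trim to 0.164 and 0.165), so it does not behave exactly like a tolerance check.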
After a positive test for a vowel, I dispatch a custom event so I can manipulate a mouth in the document class, or do whatever I’d like. Testing has gone pretty well. It’s working.
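In ActionScript this would normally go through the built-in event system. The TypeScript sketch below only illustrates the shape of the idea with a tiny hand-rolled dispatcher, and every name in it is hypothetical.

```typescript
type VowelListener = (vowel: string) => void;

// Hypothetical stand-in for a custom-event dispatcher.
class VowelDispatcher {
  private listeners: VowelListener[] = [];

  addListener(listener: VowelListener): void {
    this.listeners.push(listener);
  }

  // Notify every listener that a vowel was detected.
  dispatch(vowel: string): void {
    for (const listener of this.listeners) listener(vowel);
  }
}
```

The listener is where the mouth would be reshaped; the detector itself only announces which vowel it found.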
Now, one also has to consider that vowels inside words sound quite different. So those sounds have to be pulled out as patterns of their own: A, Ahh, Cat, Dart, etc. At that point the patterns aren’t really vowels anymore, but sounds. Those would need to be added.
Why did I quit on the project? Well, this vowel recognizer I’ve coded up ONLY works for this exact software voice I am using. Which means if you tried using it, it wouldn’t work for you unless you were using the exact same text-to-speech voice that I am using. That’s a bummer. You’d have to generate your own patterns for the voice you were using. And trust me, it’s a bit of a pain in the ass.
I’ve noticed that even just testing with a few vowels, I’d get a little response lag. That’s to be expected. The only way this would be perfect is if you knew the match exactly as it happened, or slightly ahead of time. My system merely evaluates what has already passed through and acts as soon as it can, which at times is visually disturbing. With all the effort, I’m not sure it’s always worth it. When it happens quickly, it’s pretty awesome. I have also noticed that at times, in normal words (not naked vowels), the system detects the vowel sounds and fires the proper events. That’s pretty awesome.
This project started off as a nice-to-have feature. Purely visual. I wanted to see if I could quickly come up with a solution.
I looked at many things:
- somehow using dynamic cue points.
- splitting up the known text string into words, listening for pauses so I could walk through the words as they are spoken, and evaluating the word currently being spoken (note: this may still be a decent way to go as long as you’re supplying the text string to be converted to voice).
- matching visual bitmap representations of waveforms.
- some other stuff I can’t quite remember.
So my solution almost works, but it’s not quite fast enough. It’s also strictly tied to one exact software voice… if even the pitch or tempo were to change, my system would break.
So there you have it. I’ll leave it in place for my local musings. I wonder if anyone has tackled this problem before or might plan on doing so. I’d be curious to see what you did or what you come up with. Cheers.