Emulating the AtariVox


Urchlay


I'm considering adding AtariVox emulation to the next release of Stella (what can I say, I'm hooked on "Man Goes Down").

 

Am wondering if anyone has samples of the Speakjet speech chip in .wav (or whatever) format, or if anyone would be willing to help by making such samples, since I don't (yet) have an AtariVox.

 

I'll basically need one sample of each phoneme, though it's not clear to me whether I'll be able to get by with just that (by changing the pitch of each sample on the fly, as needed), or whether I'll actually need different samples for the different pitches the chip can produce.
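
For what it's worth, the crudest way to change a sample's pitch is plain resampling. Here's a sketch in C (linear interpolation; note that this approach shortens or lengthens the sound along with shifting its pitch, which may or may not be acceptable):

#include <stddef.h>

/* Naive pitch shift by linear-interpolation resampling: reading the
   source at `ratio` times normal speed raises the pitch by that
   factor (ratio > 1.0) or lowers it (ratio < 1.0).  Returns the
   number of output samples written. */
size_t pitch_shift(const short *in, size_t in_len,
                   short *out, size_t out_max, double ratio)
{
    size_t n = 0;
    double pos = 0.0;

    while (n < out_max && (size_t)pos + 1 < in_len) {
        size_t i = (size_t)pos;
        double frac = pos - (double)i;
        out[n++] = (short)((1.0 - frac) * in[i] + frac * in[i + 1]);
        pos += ratio;
    }
    return n;
}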


If you just want speech, then you'll need only the 72 speech allophones. Although if I'm recording anyway, I could do them all if you want?

 

What sample rate do you need?

 

I usually record AtariVox stuff at 8 kHz (through a Griffin iMic, as it gives a clearer sample than my sound card).

 

 

I'd suggest doing them at the SpeakJet's default settings -

 

Volume 96

Speed 114

Pitch 88

Bend 5


I'm considering adding AtariVox emulation to the next release of Stella (what can I say, I'm hooked on "Man Goes Down").

 

Am wondering if anyone has samples of the Speakjet speech chip in .wav (or whatever) format, or if anyone would be willing to help by making such samples, since I don't (yet) have an AtariVox.

 

I'll basically need one sample of each phoneme, though it's not clear to me whether I'll be able to get by with just that (by changing the pitch of each sample on the fly, as needed), or whether I'll actually need different samples for the different pitches the chip can produce.

I think someone's already made all of the samples. Ken at speechchips.com announced that he was going to write a standalone Speakjet emulator but I don't know if he ever finished. Worth asking though. Also worth asking him if I reinvented any wheels with the work I did here:

 

I wrote a program that takes plain text and tries to translate it into Speakjet phoneme codes. It uses the same old (public domain) Naval Research Laboratory algorithm as the text-to-speech chip that can be attached to a Speakjet. It's just something I'm playing around with, so it may need some work, especially since the phonemes in the Naval source code didn't always have an exact match in the Speakjet.
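
For reference, the Naval rules all have the same shape: a letter group is rewritten to a phoneme string whenever its left and right context patterns match. So the core data structure is tiny; the two rules below are illustrations only, not entries from the real table:

/* Shape of one Naval letter-to-sound rule.  `match` is rewritten to
   `phonemes` when the surrounding text fits the context patterns. */
struct nrl_rule {
    const char *left;     /* context required before the match */
    const char *match;    /* literal letters the rule consumes */
    const char *right;    /* context required after the match */
    const char *phonemes; /* replacement phoneme string */
};

static const struct nrl_rule a_rules[] = {
    { " ", "ARE", " ", "AA R" },  /* the word "are" on its own */
    { "",  "A",   "",  "AX"   },  /* fallback: lone 'a' -> schwa */
};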

 

It may sometimes work better than the text-to-speech chip, since my program first tries to parse the Phrasealator.Dic file from the Phrasealator.exe program (which contains a few hundred common English words that have been manually optimized). If nothing else, it should remove some of the tedium of hand-crafting Speakjet codes for words that aren't covered by the Phrasealator.exe program.

 

The attached file contains source, plus Win32/DOS and OSX command-line builds.

 

EDIT: fixed problem with OSX build.

Avox_text2speech.zip

Edited by batari

since the phonemes in the Naval source code didn't always have an exact match in the Speakjet.

 

They're for the SPO256-AL2 chip.

 

I made an emulator for the VecVox (so that games written for the SPO256-based VecVoice would work on it).

 

After a lot of experimenting, I came up with this conversion table which you might find useful -

 

 

SPO256-AL2 -> SpeakJet

 0 ->   4    16 -> 140    32 -> 163    48 -> 185
 1 ->   5    17 -> 191    33 -> 175    49 -> 128
 2 ->   6    18 -> 169    34 -> 180    50 -> 182
 3 ->   1    19 -> 128    35 -> 166    51 -> 133
 4 ->   2    20 -> 154    36 -> 179    52 -> 151
 5 -> 156    21 -> 176    37 -> 189    53 -> 164
 6 -> 155    22 -> 139    38 -> 168    54 -> 169
 7 -> 131    23 -> 135    39 -> 148    55 -> 187
 8 -> 195    24 -> 136    40 -> 186    56 -> 142
 9 -> 199    25 -> 128    41 -> 194    57 -> 184
10 -> 165    26 -> 132    42 -> 194    58 -> 153
11 -> 141    27 -> 183    43 -> 167    59 -> 152
12 -> 129    28 -> 170    44 -> 144    60 -> 149
13 -> 192    29 -> 190    45 -> 145    61 -> 178
14 -> 153    30 -> 138    46 -> 147    62 -> 159
15 -> 134    31 -> 160    47 -> 150    63 -> 170
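
If it helps, here's the same table as a C lookup array (values copied straight from the list above):

/* SpeakJet code for each SPO256-AL2 allophone (indices 0-63) */
static const unsigned char spo256_to_speakjet[64] = {
      4,   5,   6,   1,   2, 156, 155, 131,
    195, 199, 165, 141, 129, 192, 153, 134,
    140, 191, 169, 128, 154, 176, 139, 135,
    136, 128, 132, 183, 170, 190, 138, 160,
    163, 175, 180, 166, 179, 189, 168, 148,
    186, 194, 194, 167, 144, 145, 147, 150,
    185, 128, 182, 133, 151, 164, 169, 187,
    142, 184, 153, 152, 149, 178, 159, 170
};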

 

 

IIRC he was talking about using the TTS256 algorithm in a PC-based software converter.

 

(It's a shame the chip is fixed at 9600 baud. If he made a 19200 version it could be dropped straight into an AVox with no additional electronics.)

 

I was thinking, if Ken hasn't got the .wavs, a quick way would be to put the chip in demo mode, record all the sounds it spits out, and edit them later.

Edited by Richard H.

IIRC he was talking about using the TTS256 algorithm in a PC-based software converter.

You're right - I missed that the first time I read it.

 

I searched for many hours trying to find if he finished and released such a program but I could not find anything. Ironically, it took me about the same amount of time to come up with my own since the source code for text-to-speech was public domain and easy to find. The program started as something built into bB, but I decided that it would be more useful as a separate utility so I split it off.


Hrm, you guys actually know what you're talking about... this will be my first attempt at doing anything with speech synthesis, so I might have to ask you a lot of dumb questions :)

 

www.speechchips.com does have an emulator for the SPO256, called ChipTalk... Windows binary, free download, no source. Once it's installed, there are 64 .wav files in c:/Program Files/ChipTalk/phonemes. They're 8-bit, mono, 11 kHz.

 

Apparently it does something more than play each wav file in sequence: I tried doing that manually, and it sounds awful. The ChipTalk program somehow blends them together so they sound like words, instead of discrete phoneme samples. Unfortunately, ChipTalk worked perfectly the first time I ran it (this is using Wine on a Linux box), but now it's producing no audio and unplayable .wav files. I don't have a Windows machine anywhere (haven't had one for years), so I'll have to try something else...

 

...apparently, the Speakjet has 72 phonemes, but the SPO256 emulator only has 64, so I wouldn't be able to use the .wav files as-is anyway.

 

I think what I need is some "Speech Synthesis for Dummies" type of documentation. Either of you have any suggestions?

 

Another approach to this: instead of writing my own speech synth, I could use something like Festival (software speech synth, can be used as a library). The upside to this is that all the hard parts are done; I'd just be converting Speakjet phoneme codes to Festival codes. The down side is that it probably wouldn't sound anything like a real Speakjet.

 

Anyway, I better get going, or I'll be late for work...


Apparently it does something more than play each wav file in sequence

 

Ask Ken, but I'm pretty sure he's just stringing the .wavs together. As a side note, the .wavs he's using came from an Intellivoice emulator. I just went through the list and identified which were SPO256 matches and then relabeled them. They're actually pretty poor quality, which is another reason why the speech sounds bad.

 

If you hook ChipTalk up to an actual SPO256-AL2, you'll notice a big difference as transition smoothing between sounds is done on the actual chip (and SpeakJet).

 

 

As for the SpeakJet's extra 8 phonemes, they're not that essential. I don't think Ken adjusted the original algorithm to incorporate them.

 

 

Here's an example of some plain text (no creative spellings etc.) that I put through Ken's TTS chip; it'll be the same for batari's software.

 

I think the algorithm works well.

TTS256.wav

Edited by Richard H.

As for the SpeakJet's extra 8 phonemes, they're not that essential. I don't think Ken adjusted the original algorithm to incorporate them.

 

 

Here's an example of some plain text (no creative spellings etc.) that I put through Ken's TTS chip; it'll be the same for batari's software.

 

I think the algorithm works well.

 

Actually, from reading the Atarivox docs, it looks like there's no TTS involved: the sample code sends what look like raw phoneme codes to the chip:

 

dc.b	 183,7,159,146,164; \HE \FAST \EHLE \LO \OWWW

 

So, for a *really* rudimentary emulation, I could just pick up the bits as they're clocked out the joystick port, gather them up into bytes, use each byte to decide what phoneme .wav to play (and ignore bytes that do other things, like setting pitch/bend/etc). That'd do for my initial implementation; I could add other stuff as I figure out how to do it (right now, I don't know how to process audio samples to do things like change their pitch, or even resample them... am learning as I go).
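
The dispatch itself wouldn't need to be anything more than this, assuming (per the SpeakJet manual) that 128-199 are the sound codes and lower values are control codes; play_phoneme_wav() is a hypothetical stand-in for whatever plays a recorded sample:

extern void play_phoneme_wav(unsigned char code);

/* Hypothetical dispatch for each byte assembled from the port */
void vox_dispatch(unsigned char b)
{
    if (b >= 128 && b <= 199)   /* SpeakJet phoneme/sound code */
        play_phoneme_wav(b);
    /* anything else is a control code (pause, volume, speed,
       pitch, bend, ...) and gets ignored for now */
}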

 

If you're willing to take the time to do it, I'd love to have high-quality samples (44.1KHz or so). If it turns out I need to downsample them, I can do that myself (I may not know how to write code that does downsampling yet, but I do know how to use existing software to do it for me...)

 

I guess record them at the Speakjet's default settings, like you said. Presumably, the defaults are the "real" samples, and the chip modifies them on the fly when the settings are changed...?


Actually, from reading the Atarivox docs, it looks like there's no TTS involved: the sample code sends what look like raw phoneme codes to the chip:
That's right, there's no TTS in the SpeakJet. It needs an additional chip or software.

 

 

So, for a *really* rudimentary emulation, I could just pick up the bits as they're clocked out the joystick port, gather them up into bytes, use each byte to decide what phoneme .wav to play (and ignore bytes that do other things, like setting pitch/bend/etc).

 

Yes, that'd work

 

 

I guess record them at the Speakjet's default settings, like you said. Presumably, the defaults are the "real" samples, and the chip modifies them on the fly when the settings are changed...?
Yes, I'm guessing it must be so because those are the default values the chip starts up with.

 

(BTW it doesn't use samples, but something called 'Mathematical Sound Architecture' to make the sounds in real time)

 

 

If you're willing to take the time to do it, I'd love to have high-quality samples (44.1KHz or so).

 

Sure, no problem!

 

 

 

BTW do you have the SpeakJet docs?

 

(I've attached them in case you don't.)

speakjetusermanual.pdf

Edited by Richard H.

I'd suggest that a good approach to start with would be to make the emulator output speech data to the PC serial port and have the Speakjet play it. As a first approximation, have the handshake line report the status of the handshake wire from the Atarivox. The timing on this wouldn't be quite right, but it should be good enough for most well-written software.

 

I'd suggest something like the following:

int speech_timer, speech_data;

/* Call once per CPU cycle.  The AtariVox driver bit-bangs 19200-baud
   serial out SWCHA bit 0; at ~1.19 MHz that is roughly 62 CPU cycles
   per bit.  A sample reads as 1 here when the pin is driven low, so
   an idle (high) line keeps speech_data at zero and we poll for a
   start bit every cycle. */
void emulate_speech(void)
{
  if (!speech_timer)
  {
    speech_data <<= 1;
    if (~SWCHA & SWACNT & 1)        /* pin driven low this cycle? */
      speech_data |= 1;
    if (speech_data)
    {
      if (speech_data == 1)         /* possible start bit (line low) */
        speech_timer = 31;          /* re-sample in the middle of it */
      else if (speech_data == 2)    /* line high again: spurious start */
        speech_data = 0;
      else
      {
        if (speech_data & 1024)     /* start + 8 data + stop collected */
        {
          if (!(speech_data & 1))   /* stop bit present (line was high) */
          {
            int i, b = 0;
            for (i = 0; i < 8; i++) /* serial data arrives LSB first and
                                       inverted by the sampling above, so
                                       reverse the bits, then flip them */
              b |= ((speech_data >> (8 - i)) & 1) << i;
            out_byte(255 ^ b);
          }
          speech_data = 0;          /* no stop bit: discard the frame */
        }
        speech_timer = 62;          /* next sample lands mid-bit */
      }
    }
  }
  else
    speech_timer--;
}

Hopefully I figured that code right. That should cause data to be output to the AtariVox when the code properly bit-bashes the joystick port. The code should reject spurious start bits or data bytes with no stop bit.


(BTW it doesn't use samples, but something called 'Mathematical Sound Architecture' to make the sounds in real time)

 

...going to be difficult/impossible to emulate that correctly. A proprietary algorithm embedded in a chip, and me without an electron microscope :)

 

I've got 2 approaches I can take:

 

1. Get samples recorded from a Speakjet, try to munge them in realtime so they sound right... I guess I'd need to at minimum be able to speed them up, slow them down, change their pitches, and blend them together (so the "th" and "e" in the word "the" will sound like a word instead of two discrete phonemes). This would be a lot of work, and I doubt it'd sound right in the end... If I knew exactly what I was doing (hah!), I might be able to get it to sound somewhat like a Speakjet... but I doubt it, since the Speakjet isn't based on samples in the first place.

 

2. Use an existing open source speech synthesis library (such as rsynth: http://sourceforge.net/projects/rsynth/). This would probably sound more like real speech than anything I could do with samples. It would also take a lot less work, but it wouldn't sound much like an actual Speakjet.

 

So how important is it that the emulation sound exactly like the real thing?

 

I spent a couple hours messing with the rsynth code, and I'm pretty sure I can map the Speakjet phonemes to the rsynth phonemes... and support most of the Speakjet commands (fast, slow, stress/relaxation, different pitches) with existing code in rsynth. Am not sure if it's the right way to go, but it's definitely the *easy* way to go.

 

If you're willing to take the time to do it, I'd love to have high-quality samples (44.1KHz or so).

 

Sure, no problem!

 

Thanks... but I'd hate for you to spend a lot of time making samples, if it isn't truly necessary. What do you think?

 

BTW do you have the SpeakJet docs?

 

(I've attached them in case you don't.)

 

Yep, I had just finished downloading it when I saw your message :)


I'd suggest that a good approach to start with would be to make the emulator output speech data to the PC serial port and have the Speakjet play it. As a first approximation, have the handshake line report the status of the handshake wire from the Atarivox. The timing on this wouldn't be quite right, but it should be good enough for most well-written software.

 

Yow! The code looks good, but I'm trying to emulate the Atarivox in software... I haven't actually got one to hook up to a serial port (and I hear they're hard to come by these days).

 

I'd use code like yours for building up the data bytes, then use them as input for the Speakjet emulation (replace the out_byte() with my emulation function, basically). Actually, would be easy enough to support a real Atarivox, if I but had one for testing...


1. Get samples recorded from a Speakjet, try to munge them in realtime so they sound right... I guess I'd need to at minimum be able to speed them up, slow them down, change their pitches, and blend them together (so the "th" and "e" in the word "the" will sound like a word instead of two discrete phonemes). This would be a lot of work, and I doubt it'd sound right in the end... If I knew exactly what I was doing (hah!), I might be able to get it to sound somewhat like a Speakjet... but I doubt it, since the Speakjet isn't based on samples in the first place.

 

2. Use an existing open source speech synthesis library (such as rsynth: http://sourceforge.net/projects/rsynth/). This would probably sound more like real speech than anything I could do with samples. It would also take a lot less work, but it wouldn't sound much like an actual Speakjet.

I'd suggest #2, as #1 is going to sound semi-crappy (like the Chiptalk program) and not a lot like a real Speakjet anyway.


would be easy enough to support a real Atarivox, if I but had one for testing...
PM me your address, you can have mine and I'll build myself another one.

 

 

I'd suggest #2, as #1 is going to sound semi-crappy (like the Chiptalk program) and not a lot like a real Speakjet anyway.

 

I'd say that is a good choice too, as #1 is going to do the chip a real injustice.

 

If you could get the open source speech synth to sound a bit like the SpeakJet, that would be cool too.


PM me your address, you can have mine and I'll build myself another one.

 

OK, but I'll buy it from you. The only reason I don't buy one from the AA store is that they're out of stock (or maybe they never carried them?)

 

I'd suggest #2, as #1 is going to sound semi-crappy (like the Chiptalk program) and not a lot like a real Speakjet anyway.

 

I'd say that is a good choice too, as #1 is going to do the chip a real injustice.

 

 

So, two votes for using an existing engine instead of trying to stitch together samples... Actually, three, since I vote for it too.

 

If you could get the open source speech synth to sound a bit like the SpeakJet, that would be cool too.

 

Might be able to. rsynth has a pretty tweakable set of parameters, so I should be able to get fairly close. Its phoneme list doesn't exactly match the Speakjet one-for-one, but it's not too far off, either.


The only reason I don't buy one from the AA store is that they're out of stock

I'm working on getting them back soon.

 

I've been busy building a standalone programmer for the SpeakJet chips. It'll configure them ready to go onto the boards, without having to connect them up to a PC (and use PhraseALator).


I've been busy building a standalone programmer for the SpeakJet chips. It'll configure them ready to go onto the boards, without having to connect them up to a PC (and use PhraseALator).

 

Hm. What does the programming do? Anything I should worry about, for emulation purposes?


What does the programming do? Anything I should worry about, for emulation purposes?

 

No, it's just for changing the baud rate of the chip, altering the event config, clearing the EEPROM and writing a new start-up phrase. You don't need to worry about this for emulation.

 

The programmer just makes it easier / faster for Albert when he's building the AVoxes.


A question for you, Richard...

 

When sending a stream of speech to the SpeakJet, at what point does the chip start to speak? Does it start talking immediately when it gets the first phoneme, or does it wait until it's got a certain number of phonemes in its buffer, or does it start talking when X amount of time passes after the last phoneme was sent... or does it do something else I haven't thought of yet?

 

One reason this matters is that there's no "end of phrase" marker for most of the phrases that "Man Goes Down" says. It just stops sending data after e.g. the "n" at the end of the word "down". My code needs to be able to tell when to start talking...

 

...and both the speech synthesis libraries I'm working with (rsynth and flite; I haven't made up my mind which to use yet) require at least a full word's worth of phonemes before they can start speaking. If I try to have rsynth speak individual phonemes as they come in, it sounds worse than using samples would, and if I do the same with flite, it doesn't say anything at all :(

 

Here's what flite sounds like, saying "man goes down"...

mgd_phrase.wav


When sending a stream of speech to the SpeakJet, at what point does the chip start to speak?
It starts talking straight away; the data enters the 64-byte input buffer and commands are executed in FIFO order.

 

The READY line on the Speakjet is checked to determine whether the chip can receive more data (high = yes, low = no).
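
For emulation purposes that part is easy to model: just a byte queue plus a READY flag. A sketch, with made-up names:

#define SJ_BUFSIZE 64

static unsigned char sj_buf[SJ_BUFSIZE];
static int sj_count = 0;        /* bytes waiting in the input FIFO */

/* READY pin: high while there's room for more data */
int sj_ready(void) { return sj_count < SJ_BUFSIZE; }

/* Called with each byte assembled from the joystick port;
   bytes arriving while the buffer is full are dropped here
   (an assumption about what the real chip does) */
void sj_write(unsigned char b)
{
    if (sj_count < SJ_BUFSIZE)
        sj_buf[sj_count++] = b;
}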

 

 

Here's what flite sounds like, saying "man goes down"...

Hey, that's not bad at all :)


It starts talking straight away; the data enters the 64-byte input buffer and commands are executed in FIFO order.

 

Ugh. I was afraid of that.

 

If I knew enough to write my own speech synthesis engine, I'd be fine... but I don't, so I'm having to use an existing library. None of the ones I've looked at so far will let me add phonemes to their buffers once they've started rendering. I can feed them one phoneme at a time, but then I might as well be using samples (rsynth and flite both work somewhat like the SpeakJet itself: they blend adjacent phonemes together to smooth the transitions, by mathematically modelling the human throat, mouth, and nose... I don't pretend to understand how *that* works). If I tell either library to say "M", then "A", then "N"... I end up with something that sounds like 3 distinct, unrelated noises (like I'd get using samples). If I tell either one to say "MAN" all at once... well, I already posted a .wav file.

 

So I've got two options:

 

1. Go with samples. I'll be able to play them in realtime, but they will sound like crap.

 

2. Use rsynth or flite anyway, gather up the phonemes into a buffer until the game stops writing data to the port for a full frame, then speak the entire phrase at once. This will sound a lot better than samples, but it will cause a noticeable delay between the time a real SpeakJet would start talking and the time the emulated one will (1/60 sec per phoneme... average phrase is probably under 15 phonemes, so 1/4 second). The longer the phrase, of course, the longer the delay.

 

I dunno which is more important: better-sounding speech, or accurate timing of the speech.

 

I'm about 90% done with implementing #2 (well, the barebones version: no support for stress/relax, pitch/bend/speed changes, etc). When I get it working correctly, I'm willing to build binaries for people to try out (presumably this means Windows binaries), if anyone's interested.
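
The core of it is just this (a sketch; speak_phrase() stands in for the rsynth/flite wrapper):

extern void speak_phrase(const unsigned char *codes, int len);

static unsigned char phrase[256];
static int phrase_len = 0;
static int idle_frames = 0;

/* Called with each byte decoded from the joystick port */
void vox_byte(unsigned char b)
{
    if (phrase_len < (int)sizeof phrase)
        phrase[phrase_len++] = b;
    idle_frames = 0;
}

/* Called once per video frame: speak after one silent frame */
void vox_frame(void)
{
    if (phrase_len > 0 && ++idle_frames >= 1) {
        speak_phrase(phrase, phrase_len);
        phrase_len = 0;
    }
}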

 

There's also one more option:

 

3. Spend the next few months learning enough about the theory of speech synthesis, so I'll know enough to write my own engine. I wish I could do this, but realistically I just don't have the energy or attention span any more (I could do it if it was my job, or if I were unemployed, but I can't do it and the job I have now...)

 

The READY line on the Speakjet is checked to determine whether the chip can receive more data (high = yes, low = no).

That much, I've got working so far :)

 

Here's what flite sounds like, saying "man goes down"...

Hey, that's not bad at all :)

 

Does it sound even remotely like the real thing? I've only ever heard an AtariVox once, maybe a year ago, and I can't remember how it sounds now...

 

That's the default speed and pitch settings. One more annoying wrinkle here is that the SpeakJet lets you change speed/pitch in mid-phrase, but neither rsynth nor flite does this (you have to start a new phrase with a new speed/pitch, so there's a noticeable "stop" in the middle of the speech). I can probably modify one library or the other to support fake phonemes that tell it to change the settings, though...
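
The "fake phoneme" idea would probably amount to carrying the prosody changes in-band as tagged events, something like this (types and names are hypothetical):

/* Tagged events let speed/pitch changes ride along in the same
   queue as the phoneme codes */
enum vox_tag { VOX_PHONEME, VOX_SET_SPEED, VOX_SET_PITCH };

struct vox_event {
    enum vox_tag tag;
    int value;      /* phoneme code, or the new speed/pitch value */
};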


4th option:

 

Speak the word after every pause command or end of phrase. The pause character is sent quite often in normal speech and might be a good enough break in the flow of speech so nobody would notice, and the delay would be shorter.
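
In terms of the buffering sketch above, that's just one extra flush condition, assuming (per the SpeakJet manual) that codes 0-6 are the pause commands:

/* Same vox_byte() as before, with one extra test */
void vox_byte(unsigned char b)
{
    if (phrase_len < (int)sizeof phrase)
        phrase[phrase_len++] = b;
    idle_frames = 0;
    if (b <= 6) {               /* \P0..\P6: flush on any pause */
        speak_phrase(phrase, phrase_len);
        phrase_len = 0;
    }
}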


5th option: Have one or more "fake" I/O addresses that are supported within the emulator for purposes of speech output. For software using something similar to the driver on Richard H.'s web site, all that would be necessary would be for the speech driver to tell the emulator where its text pointer was stored (e.g. by writing its address to one of the fake I/O addresses). Once the emulator knew that, it could look ahead as needed. The timing wouldn't quite match a real AtariVox, but I doubt anything other than the real thing would have perfect timing anyway.

