The voice user interface is the next UI.
My first computer didn’t have a mouse. The only things I could enjoy on it were a few games.
Then I started using Windows, with its windowing system and a mouse. I liked it.
Now I use touch not just on mobile devices but even on my notebook. I don’t think I can go back to a mouse anymore, because touch is so comfortable.
And now we’re starting to use our voices to talk to computers. As a way of communicating, voice is clearly the most intuitive.
What are the ingredients?
According to Jungling Hu, chair of the AI frontier conference, an AI assistant has six components. I attended one of her meetups, where she gave a great presentation.
- Speech recognition
- NLU
- Recommender system
- Dialog system
- Chatbot
- Speech synthesis
My goal is to develop my own open-source version of Alexa for Japanese.
One of my friends told me that natural language processing is one of the most difficult fields and that I’d better not try, because I don’t have an NLP background.
But…
Why not?
I’m going to write the pieces of code one by one and share the insights along with the dirty code.
Anyway, I started off with speech recognition, because I learned it in the Udacity AI Nanodegree course.
Prepare an app for recording
The first thing I had to do was collect data. The problem was that there seemed to be no license-free data. I could use the LibriSpeech ASR corpus, which I used in the course, but I could not find any good alternative for Japanese. There are a few voice corpora for Japanese, but none of them are license-free.
So I developed a native app to record my voice.
What the app does is pretty simple.
- Keeps showing sentences
- Lets me record my voice
- Uploads the voice to the server
I can also fix the pronunciation if it’s wrong.
To tell the truth, I developed a web app first, but I could not make it work on Safari. That’s why I ended up developing a native app.
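The server side just needs an endpoint that accepts the audio and the sentence it belongs to. Here is a minimal sketch of what that could look like with Flask; the endpoint and field names are illustrative, not my actual API.

```python
# Minimal upload endpoint sketch (Flask). The path and field names are
# illustrative; the app only needs somewhere to POST each recording
# together with the sentence it belongs to.
import os
from flask import Flask, request

app = Flask(__name__)
UPLOAD_DIR = "recordings"
os.makedirs(UPLOAD_DIR, exist_ok=True)

@app.route("/upload", methods=["POST"])
def upload():
    sentence_id = request.form["sentence_id"]  # which sentence was read
    audio = request.files["audio"]             # the recorded WAV file
    audio.save(os.path.join(UPLOAD_DIR, f"{sentence_id}.wav"))
    return "ok", 200

if __name__ == "__main__":
    app.run()
```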
Input data and labels
The input data is WAV files of my voice.
What to use for the labels is not so obvious.
The Japanese language has two kinds of characters: Chinese characters, called Kanji, and Hiragana. Kanji are ideographs and Hiragana are phonograms.
So my natural choice was Hiragana.
あいうえお かきくけこ さしすせそ たちつてと なにぬねの はひふへほ まみむめも やゆよ らりるれろ わをん ぁぃぅぇぉ ゃゅょ ゎ っ がぎぐげご ざじずぜぞ だぢづでど ばびぶべぼ ぱぴぷぺぽ ゔ ー <Space>
There are 84 characters including the space. I use the space to mark where a breathing pause is taken.
I converted Kanji to Hiragana with an NLP tool called mecab, which is quite popular in Japan but not perfect.
Then I converted the Japanese comma and a few other punctuation marks into spaces.
But there are a few problems.
- Some characters, like は and わ, can have the same sound.
- Sometimes a comma is used just for clarity and no breathing pause is taken there. Where to take a breathing pause is ultimately up to the reader.
- The younger generation uses ー to express long vowel sounds, but ー is not traditional Japanese.
Anyway, below is an example of the conversion.
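And here is a minimal sketch of that pipeline in Python. It is not my exact script; it assumes the mecab-python3 binding with an IPA dictionary that provides the -Oyomi (reading) output, and the example in the last comment is only illustrative.

```python
# A sketch of the label pipeline: raw sentence -> Hiragana label string.
# Assumes the mecab-python3 binding with an IPA dictionary that provides
# the "-Oyomi" (reading) output format.
import MeCab

# 84 labels: 83 Hiragana characters (including ー) plus the space.
LABELS = list(
    "あいうえおかきくけこさしすせそたちつてとなにぬねの"
    "はひふへほまみむめもやゆよらりるれろわをん"
    "ぁぃぅぇぉゃゅょゎっ"
    "がぎぐげござじずぜぞだぢづでどばびぶべぼぱぴぷぺぽゔー"
) + [" "]
char_to_index = {c: i for i, c in enumerate(LABELS)}
index_to_char = {i: c for c, i in char_to_index.items()}

tagger = MeCab.Tagger("-Oyomi")  # outputs the reading in Katakana

def to_hiragana_label(text):
    """Convert a raw sentence into a Hiragana label string with pauses."""
    yomi = tagger.parse(text).strip()        # Katakana reading of the sentence
    out = []
    for ch in yomi:
        if "ァ" <= ch <= "ヴ":                # shift Katakana down to Hiragana
            out.append(chr(ord(ch) - 0x60))
        elif ch in "、。！？":                 # punctuation -> breathing pause
            out.append(" ")
        else:                                 # ー and anything else stays
            out.append(ch)
    return "".join(out)

# to_hiragana_label("私は猫が好きだ。") would give something like
# "わたしはねこがすきだ " (the exact reading depends on the dictionary).
```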
Record voice
I recorded roughly 173 minutes of my voice, which is almost 3 hours. The whole process took about 10 hours, because I needed to check the conversions, record the voice, and then check the recordings.
- 1,374 sentences, 8,558 seconds, for training
- 308 sentences, 1,833 seconds, for validation
Since the data may not be enough, I didn’t prepare a test set.
Model
Since I learned about voice user interfaces at Udacity, I tweaked the model I used for my final project there.
I use MFCCs as features because they are faster to train on than spectrograms.
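Extracting the MFCC frames from a WAV file is only a couple of lines. Here is an illustration using librosa; any MFCC helper works the same way.

```python
# MFCC feature extraction sketch (librosa used here just as an illustration).
import librosa

def wav_to_mfcc(path, sr=16000, n_mfcc=13):
    """Load a WAV file and return a (time_steps, n_mfcc) feature matrix."""
    y, sr = librosa.load(path, sr=sr)                        # mono waveform
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)   # (n_mfcc, frames)
    return mfcc.T                                            # time-major for the RNN
```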
For the model, I use a Conv1D layer, a bidirectional RNN, and a time-distributed layer. Conv1D improves both training speed and accuracy. A bidirectional RNN sounds reasonable because speech is sequential. The time-distributed layer at the end is trained with the CTC loss function.
Because the CTC loss function was given by Udacity, I only understand it at a high level, but it is the magic ingredient: my model learns nothing without the time-distributed layer and the CTC loss.
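Roughly, the model looks like this in Keras. This is a simplified sketch: the layer sizes (and the choice of GRU as the recurrent cell) are illustrative, and the CTC loss itself is attached outside this function, in the training code I reused from Udacity.

```python
# A simplified sketch of the acoustic model in Keras.
# Layer sizes here are illustrative, not the exact values I trained with.
from tensorflow.keras.layers import (Input, Conv1D, BatchNormalization,
                                     Bidirectional, GRU, TimeDistributed,
                                     Dense, Activation)
from tensorflow.keras.models import Model

def acoustic_model(input_dim=13, units=200, output_dim=85):
    # input_dim: MFCC coefficients per frame
    # output_dim: 84 Hiragana/space labels + 1 CTC blank
    inputs = Input(shape=(None, input_dim), name="mfcc_input")

    # Conv1D over time: improves both training speed and accuracy.
    x = Conv1D(filters=200, kernel_size=11, strides=2,
               padding="same", activation="relu")(inputs)
    x = BatchNormalization()(x)

    # Bidirectional RNN: speech is sequential, and future context helps too.
    x = Bidirectional(GRU(units, return_sequences=True))(x)
    x = BatchNormalization()(x)

    # TimeDistributed dense + softmax: one character distribution per frame.
    x = TimeDistributed(Dense(output_dim))(x)
    outputs = Activation("softmax", name="softmax")(x)

    return Model(inputs=inputs, outputs=outputs)

# The CTC loss (keras.backend.ctc_batch_cost) is attached to this softmax
# output during training, which is the part I reused from Udacity.
```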
You can see my model and the other utility functions on my GitHub.
Result
Training loss and validation loss decreased as expected. The results looked great, at least in my Jupyter notebook. Even the validation predictions looked correct.
You can check my Jupyter notebook here.
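For reference, turning the frame-wise softmax output back into Hiragana is just a greedy CTC decode plus the index-to-character map from the labeling sketch above. A rough sketch:

```python
# Greedy CTC decoding sketch: frame-wise softmax output -> Hiragana string.
# `index_to_char` is the mapping built in the labeling sketch above.
import numpy as np
import tensorflow as tf

def predict_text(model, mfcc, index_to_char):
    y_pred = model.predict(mfcc[np.newaxis, ...])     # shape (1, frames, labels)
    input_len = np.array([y_pred.shape[1]])
    decoded, _ = tf.keras.backend.ctc_decode(y_pred, input_len, greedy=True)
    best_path = decoded[0].numpy()[0]                 # padded with -1
    return "".join(index_to_char[i] for i in best_path if i >= 0)
```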
But when I prepared a web app, where I can test with any voice and any sentence, the results were not always good.
Potential reasons are …
- The data is definitely not enough. I may need 100x more data.
- Where a breathing pause is taken is tricky. I’m not sure whether I should include <space>, which represents a breathing pause, at all. Or should I have added <space> manually after recording the voice?
- I used Hiragana for the labels, but there are exceptions, and some Hiragana characters unfortunately share the same pronunciation.
Here’s an example of test results.
If the sentence is pretty short, I can see the AI is learning something. I’m happy about that.
You can try my demo here.
But you need to try multiple times to get a meaningful result, even for a short sentence.
By the way, I used an online novel for my sentences. It happened to be a fantasy novel, so it contains less common words about magic and monsters, and there are even grammatical mistakes. This may also be a potential problem.
Conclusion
Voice and natural language understanding are exciting.
If you have ideas to improve the performance, please give me your advice.
Suggestions for labeling, suggestions for the model, or any other suggestions are welcome.
I will build a better model with your suggestions and more data, and report on it in this blog.
By the way, I believe that having an open version of Alexa would be pretty good for every language.
Thanks,