Real-time Vietnamese Speech Recognition

This article provides a step-by-step guide on how to design a real-time speech recognition application for a non-English language.

Joe Nguyen · 6/30/2019 12:56:28 PM


Voice_Waveform.jpg

 

1. Introduction

Machine learning has become an indispensable technology behind many impressive applications, and speech recognition is one of its most successful uses. In this article, a real-time speech recognition application is described in detail. The project was completed by a second-year UNSW student who had been exploring machine learning for only a short period of time. To achieve this goal, ANNHUB is used to design, train and evaluate a neural network model that recognizes 10 different Vietnamese words, and the trained model is then deployed into a real-time LabVIEW application. The whole project took only one week: collecting the dataset, developing a feature extraction algorithm to clean and extract features from the data, developing the neural network model and the real-time LabVIEW application, and deploying the trained model into that application.

 

2. Data collection 

To collect voice data, a built-in computer microphone with voice recording software is used. The idea is to record a spoken word and save it as an audio file. Three data sets are recorded: one for training, one for evaluation and one for testing. You can download these data sets from the Voice Recognition ANNHUB example link. The data structure is shown below:

Voice_Data_structure.jpg

Each audio file is 1 second long, sampled at 22050 samples/s with 16 bits per sample, and contains a single word. Every word is recorded 45 times: 35 samples for training, 5 for testing and 5 for evaluation.
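The recordings themselves can be made with any voice recording software. For readers who prefer to script the capture process, the sketch below shows one possible way to record the 45 takes of a word in Python, assuming the sounddevice and scipy packages are available; the original project simply uses the computer's built-in recording software, and the file names here are illustrative.

```python
import sounddevice as sd
from scipy.io import wavfile

FS = 22050        # samples per second, matching the article
DURATION = 1.0    # one second per utterance

def record_word(path):
    """Record one 1-second, 16-bit mono utterance and save it as a WAV file."""
    audio = sd.rec(int(DURATION * FS), samplerate=FS, channels=1, dtype='int16')
    sd.wait()                      # block until the recording is finished
    wavfile.write(path, FS, audio)

# 45 takes per word: 35 for training, 5 for evaluation, 5 for testing
for take in range(45):
    input(f"Press Enter and say the word (take {take + 1}/45)...")
    record_word(f"word_take{take + 1:02d}.wav")
```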

 

3. Feature extraction 

       Voice_Waveform.jpg

 

As you can see from the waveform above, there is noise before and after the word. However, we are only interested in the data that belongs to the word itself.

Algorithm to extract the data: after recording the audio, convert it to a 1-D array, remove the negative part and plot it (as shown below).

Extract_Voice_Data.jpg

 

Set up a window that moves along the x-axis in steps of 1000 samples per iteration (depending on the sample rate); the width of the window is 1600 samples (also depending on the sample rate). The samples inside the window are summed and compared to a threshold value. If the sum is greater than the threshold (3.5 in this case), the position of the window is recorded as the start of the word. The window keeps moving until the sum falls below the threshold; the position of the window at that point is recorded as the end of the word.

 

These positions are written into an array as shown below. Finally, the word is extracted from the raw data (including the negative values). A code sketch of this segmentation step follows the figure.

Voice_Data_position.jpg
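The sketch below illustrates this segmentation step in Python under a few assumptions: the recording is a mono WAV file, the samples are normalised to the range [-1, 1], and the threshold of 3.5 therefore applies to the normalised, rectified signal. The window width, step size and threshold may need adjusting for a different sample rate or scaling; the original algorithm is implemented in LabVIEW.

```python
import numpy as np
from scipy.io import wavfile

WINDOW = 1600      # window width in samples (depends on the sample rate)
STEP = 1000        # samples the window moves per iteration
THRESHOLD = 3.5    # empirical threshold on the summed window

fs, raw = wavfile.read("word_take01.wav")                  # hypothetical file name
signal = raw.astype(np.float64) / np.iinfo(np.int16).max  # normalise 16-bit samples
rectified = np.clip(signal, 0.0, None)                     # discard the negative part

start, end = None, None
for pos in range(0, len(rectified) - WINDOW, STEP):
    window_sum = rectified[pos:pos + WINDOW].sum()
    if start is None and window_sum > THRESHOLD:
        start = pos                    # first window above the threshold: word starts
    elif start is not None and window_sum < THRESHOLD:
        end = pos + WINDOW             # sum drops below the threshold: word ends
        break

if start is None:
    raise ValueError("no word detected in this recording")

# extract the word from the raw signal, negative values included
word = signal[start:(end if end is not None else len(signal))]
```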

 

Although clean data has been extracted from the raw signal, audio features are still not clearly visible, and it is hard to distinguish between different words just by looking at the clean raw audio data. To overcome this issue, a popular feature extraction method for audio signals is used: Mel-Frequency Cepstral Coefficients (MFCCs). In this article, the MFCC algorithm is implemented in LabVIEW; for more information on the method, please visit the resources below:

https://haythamfayek.com/2016/04/21/speech-processing-for-machine-learning.html

https://au.mathworks.com/help/audio/examples/speaker-identification-using-pitch-and-mfcc.html
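Since the MFCC algorithm in this project is implemented in LabVIEW, there is no Python code for it in the original application. Purely for illustration, the snippet below shows how equivalent coefficients can be obtained with the librosa package on one of the extracted word segments; the file name is hypothetical.

```python
import librosa

# load the extracted word segment; sr=None keeps the original 22050 Hz sample rate
samples, sr = librosa.load("word_take01.wav", sr=None)

# 13 Mel-frequency cepstral coefficients per frame, shape (n_mfcc, n_frames)
mfcc = librosa.feature.mfcc(y=samples, sr=sr, n_mfcc=13)
print(mfcc.shape)
```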

 

4. Prepare a training dataset in the correct ANNHUB format.

After applying MFCCs, we get a matrix with 40 columns; however, we are only interested in the data from the 2nd column to the 13th column. Take 10 elements from each of these columns and put them into a 1-D array. These will be our features.

Voice_Dataset_in_ANNHUB_Format.jpg
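A minimal sketch of this flattening step is shown below. It assumes the MFCC output is held as a NumPy matrix with one coefficient per column, that columns are counted from 1 as in the text, and that the ten elements are the first ten rows of each selected column; the exact orientation depends on how the MFCC step is implemented.

```python
import numpy as np

def to_feature_vector(mfcc_matrix):
    """Flatten an MFCC matrix into the 1-D feature row used for training.

    Keeps columns 2-13 (12 coefficients) and the first 10 elements of each,
    giving a 12 x 10 = 120-element feature vector for one recording.
    """
    selected = mfcc_matrix[:10, 1:13]     # rows 1-10, columns 2-13 (1-based)
    return selected.T.reshape(-1)         # one column after another

example = to_feature_vector(np.zeros((40, 40)))   # placeholder matrix
print(example.shape)                              # (120,)
```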

 

5. Design a Neural Network in ANNHUB.

After audio features are extracted and exported in the correct ANNHUB data format in step 4, this training data can be imported directly into the ANNHUB software. The following steps are used to design, train, evaluate and test the neural network model for this speech recognition task.

Step 1: Design a Neural Network model.

Design_Neural_Network_Model_In_ANNHUB.jpg

This process constructs a Neural Network model by selecting the training algorithm (1), the activation functions for each layer (2), the pre-processing method (3), the post-processing method (4), the number of hidden nodes (5), the training ratio that splits the training set into training, validation and test parts (6), and the loss/cost function (7). The whole configuration is done with a few simple clicks.
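These choices are all made in the ANNHUB user interface, so there is no code to write. For readers who want a rough idea of what an equivalent configuration looks like in code, the sketch below maps the numbered items onto a scikit-learn pipeline; the hidden-layer size, activation function and training algorithm shown here are illustrative assumptions, not the settings used in the article.

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPClassifier

# (3) pre-processing      -> StandardScaler normalises each feature
# (5) hidden nodes        -> hidden_layer_sizes (20 is an assumption)
# (2) activation function -> 'tanh' on the hidden layer (assumption)
# (1) training algorithm  -> solver='adam' (ANNHUB offers its own algorithms)
# (4) post-processing /
# (7) loss function       -> softmax output with cross-entropy loss, which
#                            MLPClassifier applies automatically for classification
# (6) training ratio      -> handled by the validation split in the next sketch
model = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(20,), activation='tanh', solver='adam'),
)
```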

 

 Step 2: Train the Neural Network model.

After being configured, the Neural Network is ready to be trained.

Train_Speech_Neural_Network_In_ANNHUB.jpg

 

On the training page, the stopping criteria are selected first (1), then the appropriate training algorithm parameters are specified (2) before training starts (3). The early stopping technique is applied automatically to avoid overfitting (overtraining).
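As with the previous step, training happens inside ANNHUB. The sketch below shows the same idea, early stopping on a held-out validation split, using scikit-learn; the file names and network settings are assumptions carried over from the previous sketch.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

# hypothetical arrays: one 120-element feature row per recording and an
# integer label 0-9 identifying which of the ten Vietnamese words was spoken
X_train = np.load("features_train.npy")
y_train = np.load("labels_train.npy")

# early_stopping=True holds back part of the training data as a validation set
# and stops when the validation score stops improving, mirroring the article's
# use of early stopping to avoid overfitting
clf = MLPClassifier(
    hidden_layer_sizes=(20,),
    activation='tanh',
    solver='adam',
    max_iter=2000,            # stopping criterion: maximum number of epochs
    early_stopping=True,
    validation_fraction=0.15,
    n_iter_no_change=20,
)
clf.fit(X_train, y_train)
print("validation accuracy per epoch:", clf.validation_scores_)
```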

 

 Step 3: Evaluate the trained Neural Network model.

After being trained, the Neural Network is evaluated with popular evaluation techniques supported in ANNHUB, such as the ROC curve and the confusion matrix. The final test of the trained Neural Network on a new test dataset is shown below:

Evaluate_Speech_Neural_Network_ANNHUB.jpg

 

The model achieves 94% accuracy on a completely unseen test dataset.
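Continuing the scikit-learn sketch from the training step, the evaluation that ANNHUB performs can be approximated with an accuracy score and a confusion matrix on the held-out test recordings; the file names are again hypothetical, and clf is the classifier fitted in the previous sketch.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# hypothetical test set: the 5 recordings per word never used in training
X_test = np.load("features_test.npy")
y_test = np.load("labels_test.npy")

y_pred = clf.predict(X_test)                               # clf fitted above
print("test accuracy:", accuracy_score(y_test, y_pred))    # about 0.94 in the article
print(confusion_matrix(y_test, y_pred))                    # 10 x 10 confusion matrix
```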

6. Deploy the trained Neural Network model in the LabVIEW application.

Once the trained model has been evaluated, tested and verified, it can be exported into a supported format that can be used directly in supported programming languages, including LabVIEW. To load and use the trained model, the appropriate Application Programming Interface (API) is used. For more information on the ANNHUB LabVIEW API, please see the ANNHUB Help page.

 

The block diagram below shows how to deploy the trained model in the LabVIEW environment.

Deploy_Speech_Nerual_Network_In_LabVIEW.jpg

This LabVIEW application provides a starting point for a complete real-time speech recognition application. The built-in microphone and a State Machine architecture are used to fulfill this task.
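The actual deployment uses the ANNHUB LabVIEW API and a LabVIEW state machine, so there is no Python equivalent in the project. Purely as a structural sketch, the loop below shows the same record, segment and classify cycle in Python, reusing the hypothetical helpers and classifier from the earlier sketches.

```python
import sounddevice as sd

FS = 22050   # sample rate used throughout the article

# assumed to exist from the earlier sketches:
#   extract_word(signal)       -> threshold-based segmentation (section 3)
#   compute_mfcc(word)         -> MFCC matrix for one word segment
#   to_feature_vector(matrix)  -> 120-element feature row (section 4)
#   clf                        -> the trained classifier (section 5)

while True:
    # state 1: record one second of audio from the built-in microphone
    audio = sd.rec(FS, samplerate=FS, channels=1, dtype='float64')
    sd.wait()

    # state 2: segment the spoken word and compute its MFCC features
    word = extract_word(audio.ravel())
    features = to_feature_vector(compute_mfcc(word))

    # state 3: classify and report the predicted word index (0-9)
    print("predicted word:", clf.predict(features.reshape(1, -1))[0])
```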

  

7. Conclusion 

In this article, a real-time speech recognition application for a non-English language has been developed. With the ANNHUB software, the Neural Network design process becomes very simple, and with ANNAPI (the ANNHUB LabVIEW API) it is also easy to deploy the trained model into a real-time LabVIEW application.

  

 

 
