Spanish Language Learning Application

Team Name

The Language Factory

Timeline

Fall 2022 – Spring 2023

Students

  • Shan Nathwani
  • Salman Nazir
  • Ninad Pandit
  • Raunak Kunwar
  • Chidera Nwankwo

Abstract

Our project aims to create a Spanish language learning application that incorporates a speech recognition system using machine learning. The motivation for this project stems from the observation that many students do not see the value of learning a second language, particularly Spanish, which is the second most spoken language in the world. The application will record the user’s audio input and pre-process it to extract key features of the audio [2]. A pre-trained convolutional neural network will be used to determine the accuracy of the user’s pronunciation. The system will be trained on an extensive dataset of Spanish audio dialogue, and the trained model will be savable to avoid the need for re-training [1]. The application will be user-friendly and will allow users to proceed to the next question if their pronunciation meets the accuracy threshold. The project’s future work includes finding a more diverse and expansive dataset to increase the model’s accuracy and training the model on more epochs to improve detection accuracy. Overall, our project aims to encourage Spanish language learning by providing an interactive and engaging platform for users to practice their speaking skills.

Background

Several surveys indicate that, after English, Spanish is the most spoken language. A Washington Post article, combined with data from the United States Census Bureau, reports that 20 percent of Americans speak more than one language, compared with 50 percent of Europeans. The prevailing attitude today is that learning a second language is unnecessary in a country where English predominates. Our former sponsors have also stated that learning Spanish has become boring, as professors now compete with smartphones for students' attention. One reason students are not interested in learning a language, especially Spanish, is that they do not often see its value.

Project Requirements

  1. The application must implement a machine learning model of some kind.
  2. The model must be specifically designed for speech recognition in the Spanish language.
  3. The application must be able to record the user’s audio input and pre-process it to extract key features of the audio.
  4. The system should maintain an extensive dataset of Spanish audio dialogue that is used to train the model.
  5. The trained model must be savable, so that re-training is not necessary.
  6. The system must be able to compare the pre-processed user’s input audio with the trained model’s output to determine the word error rate (see the sketch after this list).
  7. The product must be delivered as an application (web or mobile).
  8. The application should be user-friendly and easy to navigate, with clear instructions and feedback provided to the user.
  9. The application should allow the user to proceed to the next question if their pronunciation meets the accuracy threshold.
  10. The system must have a high accuracy rate in recognizing spoken words to provide accurate results.
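
Requirement 6 refers to word error rate (WER), the standard edit-distance metric computed over words. The snippet below is a minimal, illustrative computation of that metric and is not taken from the project’s codebase.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Levenshtein distance over words, normalized by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edits needed to turn the first i reference words
    # into the first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

# Example: one substituted word out of three -> WER of about 0.33
print(word_error_rate("buenos dias amigo", "buenos dios amigo"))
```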

System Overview

Our team’s section of the Spanish language learning application consists of a speaking and accuracy check system. The other sections, which cover writing and reading, are outside of our scope and are handled by another team. The general idea of our application is to take a user’s Spanish audio input and determine whether their pronunciation is close enough for a native Spanish speaker to understand. The user is first asked to say a random word from our vocabulary bank in Spanish. After the user is prompted to speak and records their voice, the audio data is pre-processed, meaning that the key features of the audio are extracted, and these features are passed into the pre-trained model. The preprocessing is done using existing Python libraries such as librosa. Once the pronunciation is accurate enough relative to the reference pronunciation, the user proceeds to the next question. Our speech recognition model was implemented as a convolutional neural network trained on an extensive dataset of Spanish audio dialogue.
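
A condensed sketch of that flow is shown below. It assumes MFCC features extracted with librosa and a saved Keras model; the function name, the model file name, and the 0.8 acceptance threshold are illustrative assumptions, not the project’s actual identifiers.

```python
import numpy as np
import librosa
import tensorflow as tf

def extract_features(wav_path, sr=16000, n_mfcc=13, max_frames=200):
    """Load audio with librosa and return a fixed-size MFCC feature matrix."""
    audio, _ = librosa.load(wav_path, sr=sr)
    mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=n_mfcc)
    # Pad or truncate along the time axis so every clip has the same shape
    if mfcc.shape[1] < max_frames:
        mfcc = np.pad(mfcc, ((0, 0), (0, max_frames - mfcc.shape[1])))
    return mfcc[:, :max_frames]

# Hypothetical accuracy check: pass the features to the saved model and
# compare the predicted class confidence against a threshold.
model = tf.keras.models.load_model("pronunciation_model.h5")  # assumed file name
features = extract_features("user_recording.wav")[np.newaxis, ..., np.newaxis]
confidence = float(model.predict(features).max())
if confidence >= 0.8:  # threshold value is an assumption
    print("Pronunciation accepted, move to the next question.")
```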

Results

Pre-Processing and Environment Setup (~12 mins):

This clip demonstrates how we set up our environment for training. It walks through the creation of our phonetic Spanish dictionary, the transformation of our dataset’s transcriptions into their phonetic counterparts, and finally the preprocessing of our dataset’s audio files so they can be passed into the CNN model.
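
As a rough illustration of the transcription step, the sketch below assumes a small hand-written mapping from Spanish graphemes to phonemes; the project’s actual phonetic dictionary and conversion rules are more complete and may differ.

```python
# Hypothetical, simplified grapheme-to-phoneme rules for Spanish.
G2P_RULES = {
    "ch": "tʃ", "ll": "ʝ", "rr": "r", "qu": "k",
    "c": "k", "z": "s", "j": "x", "ñ": "ɲ", "v": "b", "h": "",
}

def to_phonetic(word: str) -> str:
    """Convert a Spanish word to a crude phonetic string."""
    word = word.lower()
    phones = []
    i = 0
    while i < len(word):
        # Try two-letter digraphs first, then single letters
        if word[i:i + 2] in G2P_RULES:
            phones.append(G2P_RULES[word[i:i + 2]])
            i += 2
        else:
            phones.append(G2P_RULES.get(word[i], word[i]))
            i += 1
    return "".join(phones)

print(to_phonetic("chaqueta"))  # -> "tʃaketa"
```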

Construction and Training of our Model (~8 mins):

This clip explains why we chose a convolutional neural network to recognize speech and how the network was constructed and trained.
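
The exact architecture is covered in the clip and the design documents; the Keras sketch below only illustrates the general shape of a small CNN over MFCC feature matrices, with layer counts and sizes chosen as placeholder assumptions.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_model(input_shape=(13, 200, 1), num_classes=50):
    """Small CNN over MFCC feature matrices; sizes are placeholders."""
    model = tf.keras.Sequential([
        tf.keras.Input(shape=input_shape),
        layers.Conv2D(32, (3, 3), activation="relu", padding="same"),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(64, (3, 3), activation="relu", padding="same"),
        layers.MaxPooling2D((2, 2)),
        layers.Flatten(),
        layers.Dense(128, activation="relu"),
        layers.Dropout(0.3),
        layers.Dense(num_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model

model = build_model()
# model.fit(train_features, train_labels, epochs=50, validation_split=0.2)
# model.save("pronunciation_model.h5")  # saved so re-training is not needed
```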

Demonstration of our Application (~5 mins):

This clip walks through the core requirements of the application, which include prompting the user to speak a Spanish word, recording the user’s audio input (and preprocessing it), and testing the model’s prediction on this recording.
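
As a rough sketch of the recording step, the following assumes the `sounddevice` and `soundfile` libraries; the application may use a different recording backend, and the duration and file name are placeholder assumptions.

```python
import sounddevice as sd
import soundfile as sf

def record_user(path="user_recording.wav", seconds=3, sr=16000):
    """Record a short clip from the default microphone and save it as WAV."""
    print("Speak the prompted Spanish word now...")
    audio = sd.rec(int(seconds * sr), samplerate=sr, channels=1)
    sd.wait()  # block until the recording is finished
    sf.write(path, audio, sr)
    return path

# The saved file would then be preprocessed (MFCC extraction) and passed
# to the trained model as in the earlier sketch.
record_user()
```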

Model Accuracy:

The model currently achieves a test accuracy of 95%, the result of training for 50 epochs.

Future Work

Due to processing power and time constraints, we had to choose a rather small Spanish audio dataset for training. Future work includes finding a more diverse and expansive dataset to increase the model’s accuracy; our current dataset is audio from TEDx Talks, so the diversity of the words spoken may be low depending on the topics covered. The model could also be trained for more epochs to improve prediction accuracy.

Project Files

Project Charter (link)

System Requirements Specification (link)

Architectural Design Specification (link)

Detailed Design Specification (link)

Poster (link)

Source Code and Documentation (link)

References

  1. Hernandez-Mena, Carlos D. “TEDx Spanish Corpus. Audio and Transcripts in Spanish Taken from the TEDxTalks; Shared under the CC BY-NC-ND 4.0 License.” Openslr.org. Universidad Nacional Autonoma de Mexico, 2019. https://www.openslr.org/67/.
  2. Shehzen, M. “How to Generate MFCC from Audio. - ML for Lazy 2021.” Medium, Analytics Vidhya, 12 July 2021, https://medium.com/analytics-vidhya/how-to-generate-mfcc-from-audio-ml-for-lazy-2021-42c2fdfa208.
