The Recognition of Persian Phonemes Using PPNet

Saber Malekzadeh, Mohammad Hossein Gholizadeh, Seyed Naser Razavi, Hossein Ghayoumi Zadeh

DOI: 10.4103/jmss.JMSS_32_19

Abstract


Background: In this paper, a novel approach is proposed for the recognition of Persian phonemes in the Persian consonant-vowel combination (PCVC) speech dataset. Deep neural networks (NNs) now play a crucial role in classification tasks, yet the best results in automatic speech recognition still fall short of the human recognition rate. Deep learning techniques show outstanding performance on many other classification tasks, such as image and document classification, sometimes surpassing humans. The gap between automatic and human speech recognition depends largely on the features of the data that are fed to the deep NNs. Methods: In this research, the sound samples are first cut into 50 ms segments for the exact extraction of phoneme sounds. The phonemes are then divided into 30 classes: 23 consonants, 6 vowels, and a silence phoneme. Results: The short-time Fourier transform is applied to the segments, and the results are fed to PPNet (a new deep convolutional NN architecture) as the classifier. A total average accuracy of 75.87% is reached, which is the best result so far compared with other algorithms on separated Persian phonemes (as in the PCVC speech dataset). Conclusion: This method can be used not only for recognizing mono-phonemes but also as an input to the selection of the best words in speech transcription.
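The feature-extraction step described in the Methods (a 50 ms phoneme sample converted to a short-time Fourier transform magnitude representation before classification) can be sketched as below. This is a minimal illustration, not the paper's implementation: the sample rate, FFT size, hop length, and log-magnitude scaling are assumptions, since the abstract does not state them.

```python
import numpy as np

def stft_features(signal, sr=22050, frame_ms=50, n_fft=256, hop=64):
    """Log-magnitude STFT spectrogram of one 50 ms phoneme sample.

    sr, n_fft, and hop are illustrative assumptions, not the
    settings used in the paper.
    """
    n_samples = int(sr * frame_ms / 1000)
    x = np.asarray(signal, dtype=float)[:n_samples]
    if len(x) < n_samples:                      # zero-pad short samples
        x = np.pad(x, (0, n_samples - len(x)))
    window = np.hanning(n_fft)
    frames = []
    for start in range(0, len(x) - n_fft + 1, hop):
        seg = x[start:start + n_fft] * window   # windowed segment
        spec = np.abs(np.fft.rfft(seg))         # magnitude spectrum
        frames.append(np.log1p(spec))           # compress dynamic range
    return np.stack(frames, axis=1)             # (freq_bins, time_frames)

# Example: a synthetic 50 ms vowel-like 440 Hz tone
sr = 22050
t = np.arange(int(0.05 * sr)) / sr
tone = np.sin(2 * np.pi * 440 * t)
feat = stft_features(tone, sr=sr)
print(feat.shape)  # (129, 14) with these assumed parameters
```

A 2-D spectrogram of this kind is the natural input for a convolutional classifier such as the PPNet architecture the paper proposes, whose layer configuration is not given in the abstract.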

Keywords


Persian consonant-vowel combination, Persian, PPNet, speech recognition, short-time Fourier transform




ISSN : 2228-7477