Speech Processing Lab
  
Computer Engineering Dep.
Sharif University of Technology



Projects of Speech Processing Laboratory (SPL)

Some of the current SPL projects are listed below. Complete information about each project and its development activities will be added later. Publications from these projects are available on the Publication page.

 
 Speech Recognition: Dictation system
      NEVISA is a Persian dictation system developed as a result of this project. NEVISA is the first product of SPL's recognition engine, which uses the most popular algorithms and methods in the speech recognition field. The engine uses HMM-based modeling and MFCC feature extraction with some modifications. For more information about this application, see ASR-Gooyesh's official web site.
     
 Learning Dialogue Management in Spoken Dialogue Systems
      The goal of this project is to design and implement a spoken dialogue system (SDS) for the Persian language. The system comprises four parts: automatic speech recognition (ASR), natural language understanding (NLU), dialogue management (DM) and spoken language generation (SLG). We focus on the dialogue management part, which aims to produce natural communication between users and the system, so that the system can interact with users and provide the information covered by its knowledge.
     
 Speaker Recognition
     

Automatic recognition of a speaker based on speaker-specific features embedded in speech waves has become an important branch of biometric identification known as speaker recognition. Speaker recognition technology has been used in many applications, from access control and transaction authentication to speech data management and personalization. The widespread use of telephone and cellular communication has made speaker identification from telephony speech very popular, yet it has also brought many challenges. Errors introduced by handset and session variability are among the most crucial. Choosing appropriate features and selecting models have been widely addressed.

     
 Speaker Diarization
      The main task in speaker diarization is answering the question
"Who spoke when?". In this task, the number of speakers in the input
is unknown. The speakers themselves are also unknown, i.e. we do not
have any trained speaker models. In a nutshell, we have no prior
knowledge about the input, and the task is to determine how many
speakers are present and where in the input each of them is
speaking.
     
 Keyword Spotting in Continuous Speech
      This system recognizes keywords in speech online. It uses modern
speech recognition and spotting methods for searching and
recognizing.
     
 Statistical Natural Language Understanding
      Spoken language understanding (SLU) is closely related to natural language understanding (NLU), a field that has been studied for half a century. However, the problem of SLU has its own characteristics. Unlike general-domain NLU, SLU is, in the current state of technology, focused only on specific application domains. Hence, many domain-specific constraints can be included in the understanding model. Ostensibly, this may make the problem easier to solve. Unfortunately, spoken language is much noisier than written language. The inputs to an SLU system are not as well-formed as those to an NLU system. They often do not comply with rigid syntactic constraints. Disfluencies such as false starts, repairs, and hesitations are pervasive, especially in conversational speech, and errors made by speech recognizers are inevitable. Therefore, robustness is one of the most important issues in SLU. On the other hand, a robust solution tends to over-generalize and introduce ambiguities, leading to a reduction in understanding accuracy. A major challenge in SLU is thus to strike an optimal balance between robustness and the constraints that prevent over-generalization and reduce ambiguities.
     
   Robust Speech Recognition
      This project started two years ago and has since implemented many approaches to noise and speaker robustness. Since our goal is to develop speech recognition applications that operate in real environments, many of these methods have been finalized and integrated into the recognition engine. Some of these methods are:
  • On robust features: CMS, PCA, RASTA-PLP, RCC, Liftering
  • On speech enhancement: Spectral Subtraction, Microphone array and beam-forming
  • On model adaptation: MLLR and MAP
  • On model prediction: PMC
  • On speaker normalization: VTLN 
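
Of the robust-feature methods listed above, cepstral mean subtraction (CMS) is the simplest to illustrate. The following is a minimal sketch, not the code of the SPL engine: it assumes an utterance is given as a list of cepstral frames, and the function name `cepstral_mean_subtraction` is hypothetical. Subtracting the per-coefficient mean removes any offset that is constant over the utterance, which is how a fixed channel appears in the cepstral domain.

```python
def cepstral_mean_subtraction(frames):
    """Subtract the per-coefficient mean over all frames of one utterance.

    `frames` is a list of equal-length lists of cepstral coefficients
    (an assumed input layout, chosen for illustration).
    """
    if not frames:
        return []
    num_frames = len(frames)
    dim = len(frames[0])
    # Mean of each cepstral coefficient across the utterance.
    means = [sum(frame[k] for frame in frames) / num_frames for k in range(dim)]
    # Remove the mean from every frame.
    return [[frame[k] - means[k] for k in range(dim)] for frame in frames]


# Toy example: 3 frames of 2 coefficients.
frames = [[1.0, 10.0], [2.0, 12.0], [3.0, 14.0]]
normalized = cepstral_mean_subtraction(frames)

# The same frames with a constant "channel" offset added to every frame.
shifted = [[c + 5.0 for c in frame] for frame in frames]
normalized_shifted = cepstral_mean_subtraction(shifted)
```

In this toy example both inputs give the same normalized features, since the constant offset is absorbed by the mean.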
     
   Language modeling and Natural Language Processing
      Linguistic information is a necessary part of any spoken language system. SPL has prepared and developed statistical and grammatical language models for the Persian language for the first time. Equivalent research is also under way for the English language and our English speech recognition engine. Some of the prepared language models are:
  • N-Grams (N=1,2,3) for Persian and English
  • Word clustering-based N-grams
  • Grammatical rules using GPSG for Persian
  • Probabilistic grammars

More information is available on ASR-Gooyesh.com
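
As an illustration of the N-gram models listed above, here is a minimal sketch of a bigram (N=2) model with add-one smoothing. It is not the SPL implementation: the function names, the toy corpus and the choice of add-one smoothing are assumptions made for the example.

```python
from collections import Counter


def train_bigram(sentences):
    """Count bigrams and their histories over whitespace-tokenized sentences."""
    history_counts = Counter()
    bigram_counts = Counter()
    vocab = set()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        history_counts.update(tokens[:-1])            # every token that has a successor
        bigram_counts.update(zip(tokens[:-1], tokens[1:]))
        vocab.update(tokens[1:])                      # every token that can be predicted
    return history_counts, bigram_counts, vocab


def bigram_prob(model, prev, word):
    """Add-one smoothed estimate of P(word | prev)."""
    history_counts, bigram_counts, vocab = model
    return (bigram_counts[(prev, word)] + 1) / (history_counts[prev] + len(vocab))


# Toy corpus, for illustration only.
model = train_bigram(["open the door", "close the door", "open the window"])
p_door = bigram_prob(model, "the", "door")        # (2 + 1) / (3 + 6)
p_window = bigram_prob(model, "the", "window")    # (1 + 1) / (3 + 6)
total = sum(bigram_prob(model, "the", w) for w in model[2])
```

With smoothing, the probabilities of all words after a given history still sum to one, and unseen bigrams receive a small non-zero probability.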

     
Telephony Speech Recognition
      NEWSHA is the first speech-enabled Persian computer-telephony system. It is the result of more than three years of research in SPL. For information about this project and the related applications, please see ASR-Gooyesh.com
     
Embedded Speech Recognition
      One of our missions in SPL is to develop an embedded speech recognition engine for low-resource devices such as smartphones and PDAs. A voice translator and an application launcher are our two primary applications in this area. More details are available on ASR-Gooyesh's web site.
     
  Keyword Spotting
      Keyword spotting, finding specific words in an audio stream, is another research field in SPL. The first version of this software, which finds the names of 25 countries in a stream, is now available on ASR-Gooyesh.com.
  Confidence measure and Out Of Vocabulary
      Ranking the recognized word or word sequence is a necessary ability in speech recognition and word spotting systems. For a practical system, this ranking and the detection of out-of-vocabulary words are vital, especially in command recognition systems. This project has been active for two years.
     
  Speech Enhancement
      Single-microphone speech enhancement is of wide interest due to the variety of its applications. However, it has remained a challenging research topic for years, and a large number of studies have addressed this problem. Statistical modeling, which is used in most speech processing applications, including recognition and coding, has provided a promising perspective in speech enhancement. Speech enhancement using statistical methods can generally be categorized into two classes, assumed models and trained models. These models are also referred to in the literature as short-term and long-term models, respectively.
Statistical models can be seen as parameterized sets of probability density functions (pdf). In most speech applications, it is common to model the speech signal as a stochastic process that is locally stationary within a small segment of the signal called a frame. An assumed model is related to the statistics of a frame. In this type of modeling, a statistical model such as a specific Gaussian distribution is assumed for the parameters of the speech and noise signals within a frame. The formulation of the enhancement problem is derived from this assumption and from an optimization criterion such as minimum mean square error (MMSE). The Wiener filter, the short-time spectral amplitude (STSA) and log-spectral amplitude (LSA) estimators, and MMSE estimators using non-Gaussian priors are examples of this type of modeling. On the other hand, a model that describes the signal statistics over multiple frames is called a trained model. The most common method used for trained modeling is the hidden Markov model (HMM), in which a priori information about clean speech and noise is stored in the HMM parameters. HMM-based speech enhancement has overcome the deficiencies of classical enhancement methods in dealing with rapid variations of noise characteristics and in removing the annoying non-stationary “musical” background noise.
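
To make the assumed-model case concrete, the sketch below computes a Wiener gain for each frequency bin of a single frame. It is an illustration under stated assumptions, not SPL's implementation: the noise power per bin is assumed to be known and positive, the a priori SNR is estimated by the simple maximum-likelihood rule (noisy power divided by noise power, minus one, floored at zero), and the name `wiener_gain` and its `gain_floor` parameter are hypothetical.

```python
def wiener_gain(noisy_power, noise_power, gain_floor=0.0):
    """Per-bin Wiener gain G = snr / (1 + snr) for one frame.

    noisy_power[k] is the power of the noisy spectrum in bin k and
    noise_power[k] > 0 is the (assumed known) noise power in that bin.
    """
    gains = []
    for p_noisy, p_noise in zip(noisy_power, noise_power):
        # Maximum-likelihood estimate of the a priori SNR, floored at zero.
        snr = max(p_noisy / p_noise - 1.0, 0.0)
        gains.append(max(snr / (1.0 + snr), gain_floor))
    return gains


# Toy frame with three bins and unit noise power in each bin.
gains = wiener_gain([4.0, 1.0, 0.5], [1.0, 1.0, 1.0])
# Same frame with a small gain floor (an assumed value) applied.
floored = wiener_gain([4.0, 1.0, 0.5], [1.0, 1.0, 1.0], gain_floor=0.1)
```

The enhanced spectrum would be obtained by multiplying each noisy spectral coefficient by its gain. Bins dominated by noise are attenuated strongly, and a small gain floor is a common way to limit the resulting "musical" artifacts.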
 
     
Voice Activity Detection (VAD)
      To separate voice signals from non-voice ones, every speech-based system, especially recognition and enhancement systems, needs a VAD block. We have worked on the standard VADs, ETSI's AMR and ITU-T's G.722 VAD, and have also developed two new ones. These VADs are now incorporated in our recognition engine.
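
The basic idea of a VAD block can be sketched with a simple frame-energy threshold. This is a deliberately naive illustration: it is neither one of the standard VADs mentioned above nor one of the detectors developed at SPL, and the function names, frame length and threshold are assumptions chosen for the example.

```python
def frame_energies(samples, frame_len):
    """Mean energy of consecutive non-overlapping frames."""
    return [
        sum(s * s for s in samples[start:start + frame_len]) / frame_len
        for start in range(0, len(samples) - frame_len + 1, frame_len)
    ]


def energy_vad(samples, frame_len, threshold):
    """Mark a frame as voice (True) when its mean energy exceeds the threshold."""
    return [energy > threshold for energy in frame_energies(samples, frame_len)]


# Toy signal: silence, a louder burst, then low-level background noise.
samples = [0.0] * 4 + [0.5, -0.5, 0.5, -0.5] + [0.01] * 4
decisions = energy_vad(samples, frame_len=4, threshold=0.01)
```

Practical detectors usually add adaptive noise tracking, several features and a hangover scheme on top of such a decision rule, which a fixed threshold cannot replace.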
     
  Distance talking and microphone array
      Distance talking and speaker localization are the main topics of this project.
     
  Speech synthesis: Text-To-Speech (TTS)
      SPL also conducts research on TTS methods and aims to develop a practical synthesis system that can be incorporated into other applications, such as speech-enabled telephony systems.
     
     
  Native and non-native pronunciation ranking
      Ranking the pronunciation of an utterance is one of the major parts of language learning applications. We have used different approaches, especially HMM-based ranking, to achieve this goal.
     
Fast likelihood computation
      Likelihood computation is one of the main obstacles to moving a speech recognition engine to low-resource computers and devices. Hence, various fast likelihood computation approaches have been implemented to decrease the computational load in real-time applications.
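
The page does not say which approaches were implemented, so the following sketch only illustrates two widely used ideas under stated assumptions: precomputing the constant term and inverse variances of each diagonal-covariance Gaussian, and approximating the mixture log-likelihood by its best-scoring component instead of a full log-sum. All function names are hypothetical.

```python
import math


def precompute(means, variances):
    """Cache the terms of a diagonal Gaussian that do not depend on the input."""
    const = -0.5 * sum(math.log(2.0 * math.pi * v) for v in variances)
    inv_var = [1.0 / v for v in variances]
    return const, means, inv_var


def log_likelihood(x, cached):
    """log N(x; mean, diag(var)) using the cached constant and inverse variances."""
    const, means, inv_var = cached
    return const - 0.5 * sum(
        (xi - m) ** 2 * iv for xi, m, iv in zip(x, means, inv_var)
    )


def gmm_log_likelihood_max(x, log_weights, components):
    """Approximate log sum_k w_k N_k(x) by max_k (log w_k + log N_k(x))."""
    return max(lw + log_likelihood(x, c) for lw, c in zip(log_weights, components))


# Toy two-component mixture in two dimensions.
components = [precompute([0.0, 0.0], [1.0, 1.0]), precompute([3.0, 3.0], [1.0, 1.0])]
log_weights = [math.log(0.5), math.log(0.5)]
x = [0.0, 0.0]

single = log_likelihood(x, components[0])
approx = gmm_log_likelihood_max(x, log_weights, components)
exact = math.log(
    sum(math.exp(lw + log_likelihood(x, c)) for lw, c in zip(log_weights, components))
)
```

The max approximation never exceeds the exact value and is close to it when one component dominates, as in this toy example. It avoids the exponentials and the logarithm of the full sum for every frame.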
     
       
     

© 2010 Speech Processing Lab. All rights reserved.
For any question/comment please contact spl [at] ce.sharif.edu