Some of the current projects in the SPL are listed below. Complete information about each project and its development activities will be added later. Publications from these projects are available on the Publication page.
|
|
 |
Speech Recognition: Dictation system |
| |
|
|
NEVISA is a Persian dictation system developed as a result of this project. It is the first product of SPL's recognition engine, which uses the most popular algorithms and methods in the speech recognition field. The engine uses HMM-based modeling and MFCC feature extraction with some modifications. For more information about this application, see ASR-Gooyesh's official web site. |
| |
|
|
|
|
 |
Learning Dialogue Management in Spoken Dialogue Systems |
| |
|
|
The goal of this project is to implement a spoken dialogue system (SDS) for the Persian language. An SDS comprises four parts: automatic speech recognition (ASR), natural language understanding (NLU), dialogue management (DM), and spoken language generation (SLG). We focus on the dialogue management part, which aims to produce natural communication between users and the system, so that the system can interact with users and provide the information covered by its knowledge. |
| |
|
|
|
|
 |
Speaker Recognition |
| |
|
|
Automatic recognition of a speaker based on speaker-specific features embedded in speech waves has become an important branch of biometric identification known as speaker recognition. Speaker recognition technology has been used in many applications, from access control and transaction authentication to speech data management and personalization. The widespread use of telephone and cellular communication has made speaker identification from telephony speech very popular, yet it has also brought many challenges. Errors introduced by handset and session variability are among the most crucial. The choice of appropriate features and models has been widely addressed. |
| |
|
|
|
|
 |
Speaker Diarization |
| |
|
|
The main task in speaker diarization is to answer the question "Who spoke when?". In this task, the number of speakers in the input is unknown. The speakers themselves are also unknown, i.e. we do not have any trained speaker models. In a nutshell, we have no prior knowledge about the input, and the task is to determine how many speakers are present in the input and where in the input each of them is speaking. |
| |
|
|
|
|
 |
Keyword spotting in continuous speech |
| |
|
|
This system can recognize keywords in speech online. It uses modern methods from speech recognition and keyword spotting for searching and recognizing. |
| |
|
|
|
|
 |
Statistical Natural Language Understanding |
| |
|
|
Spoken language understanding (SLU) is closely related to natural language understanding (NLU), a field that has been studied for half a century. However, the problem of SLU has its own characteristics. Unlike general-domain NLU, SLU is, in the current state of technology, focused only on specific application domains. Hence, many domain-specific constraints can be included in the understanding model. Ostensibly, this may make the problem easier to solve. Unfortunately, spoken language is much noisier than written language. The inputs to an SLU system are not as well-formed as those to an NLU system. They often do not comply with rigid syntactic constraints. Disfluencies such as false starts, repairs, and hesitations are pervasive, especially in conversational speech, and errors made by speech recognizers are inevitable. Therefore, robustness is one of the most important issues in SLU. On the other hand, a robust solution tends to over-generalize and introduce ambiguities, leading to a reduction in understanding accuracy. A major challenge for SLU is thus to strike an optimal balance between robustness and the constraints that prevent over-generalization and reduce ambiguity. |
| |
|
|
| |
 |
Robust Speech Recognition |
| |
|
|
This project started two years ago and has since implemented many approaches to noise and speaker robustness. Since our goal is to develop speech recognition applications that operate in real environments, many of these methods have been developed and finalized in the recognition engine. Some of these methods are:
- On robust features: CMS, PCA, RASTA-PLP, RCC, liftering
- On speech enhancement: spectral subtraction, microphone array and beamforming
- On model adaptation: MLLR and MAP
- On model prediction: PMC
- On speaker normalization: VTLN
|
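As an illustration of the simplest robust-feature method listed above, cepstral mean subtraction (CMS) removes the per-coefficient average of the cepstral features over an utterance, which cancels a constant channel offset. The following is a minimal sketch, not SPL's implementation; the function name and the toy feature values are assumptions made for the example.

```python
def cepstral_mean_subtraction(frames):
    """Subtract the per-coefficient mean over all frames of one utterance.

    `frames` is a list of equal-length feature vectors (one per frame).
    """
    if not frames:
        return []
    n_frames = len(frames)
    dim = len(frames[0])
    # Mean of each cepstral coefficient across the utterance.
    means = [sum(frame[d] for frame in frames) / n_frames for d in range(dim)]
    # Remove that mean from every frame.
    return [[frame[d] - means[d] for d in range(dim)] for frame in frames]


# Toy "cepstra" (hypothetical values): three frames of two coefficients.
features = [[1.0, 10.0], [2.0, 12.0], [3.0, 14.0]]
normalized = cepstral_mean_subtraction(features)
```

After normalization each coefficient has zero mean over the utterance, so a fixed additive offset in the cepstral domain no longer affects the features.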
| |
|
|
| |
 |
Language modeling and Natural Language Processing |
| |
|
|
For any spoken language system, linguistic information is a necessary part of the system. For the first time for the Persian language, statistical and grammatical language models have been prepared and developed by SPL. Equivalent research is being carried out for the English language and our English speech recognition engine. Some of the prepared language models are:
- N-grams (N=1,2,3) for Persian and English
- Word clustering-based N-grams
- Grammatical rules using GPSG for Persian
- Probabilistic grammars
More information is available on ASR-Gooyesh.com. |
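To make the N-gram entry in the list above concrete, here is a minimal sketch of a bigram (N=2) model with add-one smoothing. It is an illustration only, not the SPL models: the tiny English corpus, the function names, and the choice of add-one smoothing are all assumptions made for the example.

```python
from collections import Counter


def train_bigram(sentences):
    """Count histories and bigrams over whitespace-tokenized sentences."""
    history_counts = Counter()
    bigram_counts = Counter()
    vocab = set()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        vocab.update(tokens[1:])            # words that can be predicted
        history_counts.update(tokens[:-1])  # words that can act as history
        bigram_counts.update(zip(tokens[:-1], tokens[1:]))
    return history_counts, bigram_counts, vocab


def bigram_prob(prev, word, history_counts, bigram_counts, vocab):
    """Add-one smoothed estimate of P(word | prev)."""
    return (bigram_counts[(prev, word)] + 1) / (history_counts[prev] + len(vocab))


# Hypothetical toy corpus.
corpus = ["open the door", "close the door", "open the window"]
history_counts, bigram_counts, vocab = train_bigram(corpus)
p_door = bigram_prob("the", "door", history_counts, bigram_counts, vocab)
p_window = bigram_prob("the", "window", history_counts, bigram_counts, vocab)
```

In this toy corpus "door" follows "the" twice and "window" once, so the smoothed estimate for "door" is the larger of the two, while unseen continuations still receive a small non-zero probability.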
| |
|
|
|
|
 |
Telephony Speech Recognition |
| |
|
|
NEWSHA is the first speech-enabled Persian computer-telephony system. It is the result of more than three years of research at SPL. For information about this project and related applications, please see ASR-Gooyesh.com. |
| |
|
|
 |
 |
Embedded Speech Recognition |
| |
|
|
One of our missions at SPL is to develop an embedded speech recognition engine for low-resource computers such as smartphones and PDAs. A voice translator and an application launcher are our two primary applications in this area. More details are available on ASR-Gooyesh's web site. |
| |
|
|
| |
 |
Keyword Spotting |
| |
|
|
Keyword spotting, finding specific words in an audio stream, is another research field at SPL. The first version of this software, which finds the names of 25 countries in a stream, is now available on ASR-Gooyesh.com. |
|
|
|
| |
 |
Confidence measure and Out Of Vocabulary |
| |
|
|
Ranking the recognized word or word sequence is a necessary ability in speech recognition and word spotting systems. For a practical system, this ranking and the detection of out-of-vocabulary words are vital, especially in command recognition systems. This project has been active for two years. |
| |
|
|
| |
 |
Speech Enhancement |
| |
|
|
Single-microphone speech enhancement is of wide interest due to the variety of its applications. However, it has remained a challenging research topic for years, and a large number of studies have addressed this problem. Statistical modeling, which is used in most speech processing applications, including recognition and coding, has provided a promising prospect for speech enhancement. Speech enhancement using statistical methods can generally be categorized into two classes, assumed models and trained models. These are also referred to in the literature as short-term and long-term models, respectively. Statistical models can be seen as parameterized sets of probability density functions (pdfs).
In most speech applications, it is common to model the speech signal as a stochastic process that is locally stationary within a small segment of the signal called a frame. An assumed model is related to the statistics of a frame. In this type of modeling, a statistical model such as a specific Gaussian distribution is assumed for the parameters of the speech and noise signals within a frame. The formulation of the enhancement problem is derived from this assumption and an optimization criterion such as minimum mean square error (MMSE). The Wiener filter, the short-time spectral amplitude (STSA) and log-spectral amplitude (LSA) estimators, and MMSE estimators using non-Gaussian priors are examples of this type of modeling.
On the other hand, a model that describes the signal statistics over multiple frames is called a trained model. The most common method used for trained modeling is the hidden Markov model (HMM), in which a priori information about clean speech and noise is stored in the HMM parameters. HMM-based speech enhancement has overcome the deficiencies of classical enhancement methods in dealing with rapid variations of noise characteristics and in removing the annoying non-stationary “musical” background noise.
|
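As a small illustration of the assumed-model class described above, the Wiener filter applies to each frequency bin of a frame the gain G = ξ / (1 + ξ), where ξ is the a priori signal-to-noise ratio. The sketch below is not SPL's implementation: it assumes the noise power spectrum is already known and strictly positive, estimates ξ by simple power subtraction, and uses a hypothetical gain floor; the function name and all values are chosen for the example.

```python
def wiener_gains(noisy_psd, noise_psd, gain_floor=0.05):
    """Per-bin Wiener gains for one frame.

    noisy_psd: power of the noisy signal in each frequency bin.
    noise_psd: estimated noise power in each bin (assumed > 0).
    gain_floor: lower bound on the gain (an assumption of this sketch).
    """
    gains = []
    for noisy, noise in zip(noisy_psd, noise_psd):
        # Power-subtraction estimate of the a priori SNR, clipped at zero.
        snr = max(noisy / noise - 1.0, 0.0)
        # Wiener gain: xi / (1 + xi).
        gain = snr / (1.0 + snr)
        gains.append(max(gain, gain_floor))
    return gains


# Hypothetical three-bin frame.
noisy_psd = [4.0, 1.0, 10.0]
noise_psd = [1.0, 1.0, 2.0]
gains = wiener_gains(noisy_psd, noise_psd)
```

Bins dominated by speech keep a gain close to one, while bins where the noisy power is no larger than the noise estimate are attenuated down to the floor. The enhanced spectrum is obtained by multiplying each noisy spectral component by its gain.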
| |
|
|
 |
 |
Voice Activity Detection (VAD) |
| |
|
|
To separate voice from non-voice signals, every speech-based system, especially recognition and enhancement systems, needs a VAD block. Here we worked on the standard VADs, ETSI's AMR and ITU-T's G.722 VADs, and also developed two new ones. These VADs are now incorporated in our recognition engine. |
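The role of a VAD block can be illustrated with a very simple energy-based detector. This sketch is neither one of the standard VADs named above nor one of SPL's own detectors; it only shows the frame-level decision such a block produces. The frame length, the decibel margin, and the use of the quietest frame as the noise-floor estimate are assumptions of the example.

```python
import math


def frame_energies_db(samples, frame_len):
    """Average power of each non-overlapping frame, in decibels."""
    energies = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        power = sum(x * x for x in frame) / frame_len
        # Small constant avoids log10(0) on all-zero frames.
        energies.append(10.0 * math.log10(power + 1e-12))
    return energies


def energy_vad(samples, frame_len=160, margin_db=10.0):
    """Mark a frame as voice if it exceeds the noise floor by margin_db."""
    energies = frame_energies_db(samples, frame_len)
    if not energies:
        return []
    # Assumption: the quietest frame approximates the background noise level.
    noise_floor = min(energies)
    return [e > noise_floor + margin_db for e in energies]


# Hypothetical signal: quiet segment, loud segment, quiet segment.
quiet = [0.01, -0.01] * 160
loud = [0.5, -0.5] * 160
decisions = energy_vad(quiet + loud + quiet)
```

Each entry of `decisions` corresponds to one 160-sample frame. A recognizer or enhancement system would use such flags to skip non-voice frames or to update its noise estimate during them.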
| |
|
|
| |
 |
Distance talking and microphone array
|
| |
|
|
Distance talking and speaker localization are the main topics of this project. |
| |
|
|
| |
 |
Speech synthesis: Text-To-Speech (TTS) |
| |
|
|
SPL also conducts research on TTS methods and aims to develop a practical synthesis system that can be incorporated into other applications, such as speech-enabled telephony systems. |
| |
|
|
| |
|
|
| |
 |
Native and non-native pronunciation ranking |
| |
|
|
Ranking the pronunciation of an utterance is one of the major parts of language learning applications. We have used different approaches, especially HMM-based ranking, to achieve this goal.
|
| |
|
|
 |
 |
Fast likelihood computation |
| |
|
|
Likelihood computation is one of the main obstacles to moving a speech recognition engine to low-resource computers and devices. Hence, various fast likelihood computation approaches have been implemented in order to decrease the computational load in real-time applications. |
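The source does not say which fast likelihood methods are used, so the following is only a sketch of two common ideas, under stated assumptions: precomputing the constant part of a diagonal-covariance Gaussian log-likelihood, and abandoning the evaluation of a component as soon as its partial score can no longer beat the best one found so far. The class and function names are hypothetical, and picking only the best-scoring component is itself an approximation to a full mixture likelihood.

```python
import math


class DiagGaussian:
    """Diagonal-covariance Gaussian with precomputed constants."""

    def __init__(self, mean, var):
        self.mean = mean
        # Precompute 1 / (2 * variance) once instead of per frame.
        self.half_inv_var = [0.5 / v for v in var]
        # Constant term: -0.5 * (D * log(2*pi) + sum(log(var))).
        self.log_const = -0.5 * (len(mean) * math.log(2.0 * math.pi)
                                 + sum(math.log(v) for v in var))

    def log_likelihood(self, x, floor=-math.inf):
        """Return log N(x), or None once the score drops below `floor`."""
        score = self.log_const
        for xd, md, hd in zip(x, self.mean, self.half_inv_var):
            diff = xd - md
            score -= diff * diff * hd
            if score < floor:
                return None  # the score only decreases, so stop early
        return score


def best_component(x, gaussians):
    """Index and log-likelihood of the best-scoring Gaussian for x."""
    best_index, best_score = -1, -math.inf
    for i, g in enumerate(gaussians):
        score = g.log_likelihood(x, floor=best_score)
        if score is not None and score > best_score:
            best_index, best_score = i, score
    return best_index, best_score


# Hypothetical two-component example.
gaussians = [DiagGaussian([0.0, 0.0], [1.0, 1.0]),
             DiagGaussian([5.0, 5.0], [1.0, 1.0])]
index, score = best_component([4.8, 5.1], gaussians)
```

Because every dimension can only lower the running score, a component that has already fallen below the current best can be dropped without evaluating its remaining dimensions, which saves work when feature vectors are long and most components are poor matches.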
| |
|
|
| |
|
|
|
| |
|
|