I previously worked with Hynek Hermansky on distortion-invariant feature design for acoustic models.
For my Masters (by research) I worked in the Speech and Vision Lab at IIIT Hyderabad with Kishore Prahallad, on efficient back-off strategies for high-quality speech synthesis.
2016
Far-field ASR without parallel data
Vijayaditya Peddinti, Vimal Manohar, Yiming Wang, Daniel Povey and Sanjeev Khudanpur
Submitted to Interspeech, 2016
Abstract
In far-field speech recognition systems, training acoustic models with alignments generated from parallel close-talk microphone data provides significant improvements. However, it is not practical to assume the availability of large corpora of parallel close-talk microphone data for training. In this paper we explore methods to reduce the performance gap between far-field ASR systems trained with alignments from distant microphone data and those trained with alignments from parallel close-talk microphone data. These methods include the use of a lattice-free sequence objective function that tolerates minor mis-alignment errors, and the use of data selection techniques to discard badly aligned data. We present results on single distant microphone and multiple distant microphone scenarios of the AMI LVCSR task, and identify prominent causes of alignment errors in AMI data.
@inproceedings{peddinti2016ami,
author = {Peddinti, Vijayaditya and Manohar, Vimal and Wang, Yiming and Povey, Daniel and Khudanpur, Sanjeev},
title = {Far-field ASR without parallel data},
booktitle = {Submitted to Interspeech}
}
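As a rough illustration of the data-selection step mentioned in the abstract, the sketch below keeps utterances whose forced alignment looks trustworthy. The selection statistic (average per-frame acoustic log-likelihood of the alignment) and the threshold are hypothetical stand-ins, not necessarily the criterion used in the paper.

# Hypothetical data-selection pass over alignment scores; illustrative only.
def select_utterances(alignment_scores, threshold=-8.0):
    """alignment_scores: dict utt_id -> (total_loglike, num_frames)."""
    kept = []
    for utt, (total_loglike, num_frames) in alignment_scores.items():
        # Keep the utterance if its alignment scores better than the threshold
        # on a per-frame basis; badly aligned data tends to score much lower.
        if num_frames and total_loglike / num_frames > threshold:
            kept.append(utt)
    return kept

scores = {"utt1": (-600.0, 100), "utt2": (-1500.0, 100)}   # toy numbers
print(select_utterances(scores))                            # -> ['utt1']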
Purely sequence-trained neural networks for ASR based on lattice-free MMI
Daniel Povey, Vijayaditya Peddinti, Daniel Galvez, Pegah Ghahremani, Vimal Manohar, Yiming Wang, Xingyu Na and Sanjeev Khudanpur
Submitted to Interspeech, 2016
Abstract
In this paper we describe a method to perform sequence-discriminative training of neural network acoustic models without the need for frame-level cross-entropy pre-training. We use the lattice-free version of the maximum mutual information (MMI) criterion. To make its computation feasible we use a phone n-gram language model in place of the word language model. To further reduce its space and time complexity we compute the objective function using neural network outputs at one third the standard frame rate. These changes enable us to perform the computation for the forward-backward algorithm on GPUs. Further, the reduced output frame rate also provides a significant speed-up during decoding. We present results on 5 different LVCSR tasks with training data ranging from 100 to 2100 hours. Models trained with this lattice-free MMI criterion provide a relative word error rate reduction of ~15% over those trained with the cross-entropy objective function, and ~8% over those trained with cross-entropy and sMBR objective functions. A further relative reduction of ~2.5% can be obtained by fine-tuning these models with the word-lattice based sMBR objective function.
@inproceedings{povey2016,
author = {Povey, Daniel and Peddinti, Vijayaditya and Galvez, Daniel and Ghahremani, Pegah and Manohar, Vimal and Wang, Yiming and Na, Xingyu and Khudanpur, Sanjeev},
title = {Purely sequence-trained neural networks for ASR based on lattice-free MMI},
booktitle = {Submitted to Interspeech}
}
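For reference, the MMI objective maximized in this line of work has the standard form below; this is the textbook formulation, not an excerpt from the paper. The acoustic scale kappa and the replacement of the word language model by a phone n-gram denominator model are the choices described in the abstract.

% MMI objective over training utterances u with acoustics O_u and reference
% word sequences W_u.  In the lattice-free setting the denominator sum runs
% over all sequences accepted by a phone n-gram "denominator graph" rather
% than over word lattices.
\mathcal{F}_{\mathrm{MMI}}
  = \sum_{u} \log
    \frac{p_{\theta}(O_u \mid W_u)^{\kappa}\, P(W_u)}
         {\sum_{W} p_{\theta}(O_u \mid W)^{\kappa}\, P(W)}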
2015
Winner of the IARPA ASpIRE challenge [press announcement]
Reverberation robust acoustic modeling using i-vectors with time delay neural networks
Vijayaditya Peddinti, Guoguo Chen, Daniel Povey and Sanjeev Khudanpur
Proceedings of Interspeech, 2015
Abstract
In reverberant environments there are long-term interactions between speech and corrupting sources. In this paper a time delay neural network (TDNN) architecture, capable of learning long-term temporal relationships and translation-invariant representations, is used for reverberation-robust acoustic modeling. Further, iVectors are used as an input to the neural network to perform instantaneous speaker and environment adaptation, providing a 10% relative improvement in word error rate. By sub-sampling the outputs at TDNN layers across time steps, training time is reduced. Using a parallel training algorithm we show that the TDNN can be trained on ~5500 hours of speech data in 3 days using up to 32 GPUs. The TDNN is shown to provide results competitive with state-of-the-art systems in the IARPA ASpIRE challenge, with 27.7% WER on the dev test set.
@inproceedings{peddinti2015reverb,
author = {Peddinti, Vijayaditya and Chen, Guoguo and Povey, Daniel and Khudanpur, Sanjeev},
title = {Reverberation robust acoustic modeling using i-vectors with time delay neural networks},
booktitle = {Proceedings of Interspeech}
}
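A minimal PyTorch sketch (not the Kaldi implementation) of the sub-sampled TDNN idea described above: splicing a small set of time offsets at each layer is equivalent to a dilated 1-D convolution, so higher layers see wide temporal context while each layer only computes a couple of offsets. Layer sizes and offsets here are illustrative, not the paper's configuration.

import torch
import torch.nn as nn

class TDNNLayer(nn.Module):
    """Splice frames at relative offsets {0, offset} and apply affine + ReLU.

    With kernel_size=2 and dilation=offset, the convolution combines frames
    t and t+offset, which is how sub-sampled TDNN splicing is commonly realised.
    """
    def __init__(self, in_dim, out_dim, offset):
        super().__init__()
        self.conv = nn.Conv1d(in_dim, out_dim, kernel_size=2, dilation=offset)
        self.relu = nn.ReLU()

    def forward(self, x):          # x: (batch, feat_dim, time)
        return self.relu(self.conv(x))

# Illustrative stack: dense splicing at the input, then increasingly wide
# offsets.  With no padding, each TDNNLayer shortens the time axis by `offset`.
tdnn = nn.Sequential(
    nn.Conv1d(40, 256, kernel_size=5),   # splice 5 consecutive input frames
    nn.ReLU(),
    TDNNLayer(256, 256, offset=3),
    TDNNLayer(256, 256, offset=6),
    TDNNLayer(256, 256, offset=9),
)

feats = torch.randn(8, 40, 200)          # (batch, mel bins, frames)
out = tdnn(feats)
print(out.shape)                         # time axis shrinks by the total context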
Audio Augmentation for Speech Recognition
Tom Ko, Vijayaditya Peddinti, Daniel Povey and Sanjeev Khudanpur
Proceedings of Interspeech, 2015
Abstract
Data augmentation is a common strategy adopted to increase the quantity of training data, avoid overfitting and improve robustness of the models. In this paper, we investigate audio-level speech augmentation methods which directly process the raw signal. The method we particularly recommend is to change the speed of the audio signal, producing 3 versions of the original signal with speed factors of 0.9, 1.0 and 1.1. The proposed technique has a low implementation cost, making it easy to adopt. We present results on 4 different LVCSR tasks with training data ranging from 100 hours to 1000 hours, to examine the effectiveness of audio augmentation in a variety of data scenarios. An average relative improvement of 4.3% was observed across the 4 tasks.
@inproceedings{ko2015augmentation,
author = {Ko, Tom and Peddinti, Vijayaditya and Povey, Daniel and Khudanpur, Sanjeev},
title = {Audio Augmentation for Speech Recognition},
booktitle = {Proceedings of Interspeech}
}
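A minimal sketch of the speed-perturbation recipe described above, shelling out to the sox command-line tool (similar in spirit to how the Kaldi recipes implement it); the file names are placeholders. sox's speed effect resamples the signal, so both duration and pitch change by the given factor.

import subprocess

def speed_perturb(wav_in, wav_out, factor):
    """Write a speed-perturbed copy of wav_in using sox's `speed` effect."""
    subprocess.run(
        ["sox", wav_in, wav_out, "speed", str(factor)],
        check=True,
    )

# Produce the three training copies: 0.9x, 1.0x (the original) and 1.1x.
for factor in (0.9, 1.1):
    speed_perturb("utt0001.wav", f"utt0001_sp{factor}.wav", factor)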
Best paper award
A time delay neural network architecture for efficient modeling of long temporal contexts
Vijayaditya Peddinti, Daniel Povey and Sanjeev Khudanpur
Proceedings of Interspeech, 2015
[abstract] [bib]
Abstract
Recurrent neural network architectures have been shown to efficiently model long term temporal dependencies between acoustic events. However, the training time of recurrent networks is higher than that of feed-forward networks due to the sequential nature of the learning algorithm. In this paper we propose a time delay neural network architecture which models long term temporal dependencies with training times comparable to standard feed-forward DNNs. The network uses sub-sampling to reduce computation during training. On the Switchboard task we show a relative improvement of 6% over the baseline DNN model. We present results on several LVCSR tasks with training data ranging from 3 to 1800 hours to show the effectiveness of the TDNN architecture in learning wider temporal dependencies in both small and large data scenarios.
@inproceedings{peddinti2015multisplice,
author = {Peddinti, Vijayaditya and Povey, Daniel and Khudanpur, Sanjeev},
title = {A time delay neural network architecture for efficient modeling of long temporal contexts},
booktitle = {Proceedings of Interspeech},
publisher = {ISCA}
}
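As a worked example of how per-layer splicing offsets in such a TDNN compose into the network's total temporal context: the left/right context of the whole network is simply the sum of the most negative/most positive offsets over the layers. The offsets below are illustrative, close to but not necessarily the exact configuration in the paper.

# Per-layer splicing offsets (illustrative).  Each layer looks at frames
# t + o, for every offset o in its list, relative to its own input.
layer_offsets = [
    [-2, -1, 0, 1, 2],   # layer 1: dense splicing of 5 frames
    [-1, 2],             # layer 2
    [-3, 3],             # layer 3
    [-7, 2],             # layer 4
]

left_context = sum(min(offsets) for offsets in layer_offsets)    # -13
right_context = sum(max(offsets) for offsets in layer_offsets)   # +9
print(f"total input context: [{left_context}, +{right_context}] frames")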
2014
Deep Scattering Spectrum with deep neural networks
Vijayaditya Peddinti, Tara N. Sainath, S. Maymon, Bhuvana Ramabhadran, David Nahamoo and Vaibhava Goel
Proceedings of ICASSP, 2014
Abstract
State-of-the-art convolutional neural networks (CNNs) typically use a log-mel spectral representation of the speech signal. However, this representation is limited by the spectro-temporal resolution afforded by log-mel filter-banks. A novel technique known as Deep Scattering Spectrum (DSS) addresses this limitation and preserves higher resolution information, while ensuring time-warp stability, through the cascaded application of the wavelet-modulus operator. The first order scatter is equivalent to log-mel features, and standard CNN modeling techniques can directly be used with these features. However, the higher order scatter, which preserves the higher resolution information, presents new challenges in modeling. This paper explores how to effectively use DSS features with CNN acoustic models. Specifically, we identify the normalization, neural network topology and regularization techniques needed to model the higher order scatter effectively. The use of these higher order scatter features, in conjunction with CNNs, results in a relative improvement of 7% compared to log-mel features on TIMIT, providing a phonetic error rate (PER) of 17.4%, one of the lowest reported PERs to date on this task.
@inproceedings{peddinti2014,
author = {Peddinti, Vijayaditya and Sainath, Tara N. and Maymon, S. and Ramabhadran, Bhuvana and Nahamoo, David and Goel, Vaibhava},
title = {Deep Scattering Spectrum with deep neural networks},
booktitle = {Acoustics, Speech and Signal Processing (ICASSP), 2014 IEEE International Conference on},
pages = {210-214}
}
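The first- and second-order scattering coefficients referred to above have a compact standard form; this is the general deep scattering spectrum definition, not a formula taken from the paper. First-order scatter is a wavelet-modulus followed by low-pass averaging, which is why it is comparable to log-mel features, and second-order scatter re-decomposes the modulus envelopes to recover the finer structure that the averaging removes.

% Deep scattering spectrum: cascaded wavelet-modulus operators followed by a
% low-pass filter \phi; \psi_{\lambda_1}, \psi_{\lambda_2} are band-pass
% wavelets at scales \lambda_1, \lambda_2.
S_1 x(t, \lambda_1)            = \bigl( |x \ast \psi_{\lambda_1}| \ast \phi \bigr)(t)
S_2 x(t, \lambda_1, \lambda_2) = \bigl( \bigl| |x \ast \psi_{\lambda_1}| \ast \psi_{\lambda_2} \bigr| \ast \phi \bigr)(t)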
Evaluating speech features with the Minimal-Pair ABX task (II): Resistance to noise
Thomas Schatz, Vijayaditya Peddinti, Yuan Cao, Francis Bach, Hynek Hermansky and Emmanuel Dupoux
Proceedings of Interspeech, 2014
@inproceedings{schatz-peddinti-cao-bach-hermansky-dupoux:is2014c,
author = {Schatz, Thomas and Peddinti, Vijayaditya and Cao, Yuan and Bach, Francis and Hermansky, Hynek and Dupoux, Emmanuel},
title = {Evaluating speech features with the Minimal-Pair ABX task (II): Resistance to noise},
booktitle = {Proc. of INTERSPEECH}
}
Deep Scattering Spectra with Deep Neural Networks for LVCSR Tasks
Tara N Sainath, Vijayaditya Peddinti, Brian Kingsbury, Petr Fousek, Bhuvana Ramabhadran and David Nahamoo
Proceedings of Interspeech, 2014
Abstract
Log-mel filterbank features, which are commonly used features for CNNs, can remove higher-resolution information from the speech signal. A novel technique, known as Deep Scattering Spectrum (DSS), addresses this issue and looks to preserve this information. DSS features have shown promise on TIMIT, both for classification and recognition. In this paper, we extend the use of DSS features to LVCSR tasks. First, we explore the optimal multi-resolution time and frequency scattering operations for LVCSR tasks. Next, we explore techniques to reduce the dimension of the DSS features. We also incorporate speaker adaptation techniques into the DSS features. Results on 50 and 430 hour English Broadcast News tasks show that the DSS features provide a 4-7% relative improvement in WER over log-mel features, within a state-of-the-art CNN framework which incorporates speaker adaptation and sequence training. Finally, we show that DSS features are similar to multi-resolution log-mel + MFCCs, and that similar improvements can be obtained with this representation.
@inproceedings{sainath2014deep,
author = {Sainath, Tara N. and Peddinti, Vijayaditya and Kingsbury, Brian and Fousek, Petr and Ramabhadran, Bhuvana and Nahamoo, David},
title = {Deep Scattering Spectra with Deep Neural Networks for LVCSR Tasks},
booktitle = {Proceedings of Interspeech},
publisher = {ISCA},
url = {http://ttic.uchicago.edu/~haotang/speech/IS140389.pdf}
}
2013
Evaluating speech features with the Minimal-Pair ABX task: Analysis of the classical MFC/PLP pipeline
Thomas Schatz, Vijayaditya Peddinti, Francis Bach, Aren Jansen, Hynek Hermansky and Emmanuel Dupoux
Proceedings of Interspeech, 2013
@inproceedings{schatz-peddinti-bach-jansen-hermansky-dupoux:is2013,
author = {Schatz, Thomas and Peddinti, Vijayaditya and Bach, Francis and Jansen, Aren and Hermansky, Hynek and Dupoux, Emmanuel},
title = {Evaluating speech features with the Minimal-Pair ABX task: Analysis of the classical MFC/PLP pipeline},
booktitle = {Proc. INTERSPEECH}
}
A Summary Of The 2012 JHU CLSP Workshop on Zero Resource Speech Technologies and Models of Early Language Acquisition
Aren Jansen, Emmanuel Dupoux, Sharon Goldwater, Mark Johnson, Sanjeev Khudanpur, Kenneth Church, Naomi Feldman, Hynek Hermansky, Florian Metze, Richard Rose, Michael Seltzer, Pascal Clark, Ian Mcgraw, Balakrishnan Varadarajan, Erin Bennett, Benjamin Borschinger, Justin Chiu, Ewan Dunbar, Abdellah Fourtassi, David Harwath, Chia-Ying Lee, Keith Levin, Atta Norouzain, Vijayaditya Peddinti, Rachael Richardson, Thomas Schatz and Samuel Thomas
Proceedings of ICASSP, 2013
@inproceedings{jansen-dupoux-goldwater-johnson-khudanpur-church-feldman-hermansky-metze-rose-seltzer-clark-mcgraw-varadarajan-bennett-borschinger-chiu-dunbar-fourtassi-harwath-lee-levin-norouzain-peddinti-richardson-schatz-thomas:icassp2013,
author = {Jansen, Aren and Dupoux, Emmanuel and Goldwater, Sharon and Johnson, Mark and Khudanpur, Sanjeev and Church, Kenneth and Feldman, Naomi and Hermansky, Hynek and Metze, Florian and Rose, Richard and Seltzer, Michael and Clark, Pascal and Mcgraw, Ian and Varadarajan, Balakrishnan and Bennett, Erin and Borschinger, Benjamin and Chiu, Justin and Dunbar, Ewan and Fourtassi, Abdellah and Harwath, David and Lee, Chia-Ying and Levin, Keith and Norouzain, Atta and Peddinti, Vijayaditya and Richardson, Rachael and Schatz, Thomas and Thomas, Samuel},
title = {A Summary Of The 2012 JHU CLSP Workshop on Zero Resource Speech Technologies and Models of Early Language Acquisition},
booktitle = {Proc. ICASSP},
address = {Vancouver, Canada}
}
Mean Temporal Distance: Predicting ASR Error from Temporal Properties of Speech Signal
Hynek Hermansky, Ehsan Variani and Vijayaditya Peddinti
Proceedings of ICASSP, 2013
@inproceedings{hermansky-variani-peddinti:icassp2013,
author = {Hermansky, Hynek and Variani, Ehsan and Peddinti, Vijayaditya},
title = {Mean Temporal Distance: Predicting ASR Error from Temporal Properties of Speech Signal},
booktitle = {Proc. ICASSP},
address = {Vancouver, Canada}
}
Filter-Bank Optimization for Frequency Domain Linear Prediction
Vijayaditya Peddinti and Hynek Hermansky
Proceedings of ICASSP, 2013
Abstract
The sub-band Frequency Domain Linear Prediction (FDLP) technique estimates autoregressive models of the Hilbert envelopes of sub-band signals from windowed segments of the discrete cosine transform (DCT) of a speech signal. The shapes of the windows and their positions on the cosine transform of the signal determine the implied filtering of the signal. Thus, the choice of shape, position and number of these windows can be critical for the performance of the FDLP technique. So far, we have used Gaussian or rectangular windows. In this paper asymmetric, cochlear-like filters are studied. Further, a frequency differentiation operation, which introduces an additional set of parameters describing the local spectral slope in each frequency sub-band, is introduced to increase the robustness of the sub-band envelopes in noise. The performance gains achieved by these changes are reported in a variety of additive noise conditions, with an average relative improvement of 8.04% in phoneme recognition accuracy.
@inproceedings{peddinti2013filterbank,
author = {Peddinti, Vijayaditya and Hermansky, Hynek},
title = {Filter-Bank Optimization for Frequency Domain Linear Prediction},
booktitle = {Proceedings of ICASSP},
address = {Vancouver, Canada},
publisher = {IEEE},
pages = {7102 - 7106}
}
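A rough numpy/scipy sketch of the core FDLP step the abstract refers to: take the DCT of a speech segment, window the slice of DCT coefficients corresponding to one sub-band, fit an autoregressive model to it, and read the sub-band Hilbert envelope off the all-pole spectrum. The window shape and placement (the quantities optimized in the paper) are reduced here to a plain rectangular slice, purely for illustration.

import numpy as np
from scipy.fftpack import dct
from scipy.linalg import solve_toeplitz

def fdlp_envelope(segment, band, num_bands, order=20, num_points=512):
    """Approximate the Hilbert envelope of one sub-band of `segment` via FDLP.

    segment   : 1-D speech signal (one analysis window, e.g. ~1 s)
    band      : index of the sub-band to analyse
    num_bands : number of equal-width rectangular bands (a crude stand-in for
                the Gaussian / cochlear-like windows studied in the paper)
    """
    c = dct(segment, type=2, norm='ortho')               # DCT of the signal
    band_size = len(c) // num_bands
    sub = c[band * band_size:(band + 1) * band_size]      # rectangular "window"

    # Autocorrelation-method linear prediction on the DCT coefficients.
    r = np.correlate(sub, sub, mode='full')[len(sub) - 1:]
    ar = solve_toeplitz(r[:order], r[1:order + 1])         # AR coefficients
    gain = r[0] - np.dot(ar, r[1:order + 1])               # prediction error power
    poly = np.concatenate(([1.0], -ar))                    # A(z) = 1 - sum a_k z^-k

    # The all-pole model's power spectrum, sampled on a dense grid, approximates
    # the sub-band Hilbert envelope across the original segment.
    spec = np.fft.rfft(poly, n=2 * num_points)
    return gain / np.abs(spec[:num_points]) ** 2

envelope = fdlp_envelope(np.random.randn(8000), band=3, num_bands=16)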
2011
Significance of vowel epenthesis in Telugu text-to-speech synthesis
Vijayaditya Peddinti and Kishore Prahallad
Proceedings of ICASSP, 2011
Abstract
Unit selection synthesis inventories have coverage issues, which lead to missing syllable or diphone units. In the conventional back-off strategy of substituting the missing unit with approximate unit(s), the rules for approximate matching are hard to derive. In this paper we propose a back-off strategy for Telugu TTS systems emulating native speaker intuition. It uses reduced vowel insertion in complex consonant clusters to replace missing units. The inserted vowel identity is determined using a rule-set adapted from L2 (second language) acquisition research in Telugu, reducing the effort required in preparing the rule-set. Subjective evaluations show that the proposed back-off method performs better than the conventional methods.
@inproceedings{peddinti2011,
author = {Peddinti, Vijayaditya and Prahallad, Kishore},
title = {Significance of vowel epenthesis in Telugu text-to-speech synthesis},
booktitle = {Acoustics, Speech and Signal Processing (ICASSP), 2011 IEEE International Conference on},
pages = {5348-5351}
}
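A toy sketch of the back-off idea in the abstract, with a hypothetical unit inventory and a made-up epenthetic-vowel rule (the paper's rule-set is adapted from Telugu L2-acquisition research and is not reproduced here): when a consonant-cluster unit is missing from the inventory, a reduced vowel is inserted into the cluster and synthesis falls back to smaller units that do exist.

# Hypothetical unit inventory and epenthesis rule, for illustration only.
INVENTORY = {"ka", "ra", "kra", "ta", "pa"}      # units the voice actually has
EPENTHETIC_VOWEL = {"default": "a*"}              # "a*" marks a reduced vowel

def back_off(cluster_unit):
    """Realise a possibly-missing consonant-cluster unit with available units.

    If the unit exists, use it directly; otherwise insert the reduced vowel
    after the leading consonant and return the resulting smaller units.
    """
    if cluster_unit in INVENTORY:
        return [cluster_unit]
    vowel = EPENTHETIC_VOWEL["default"]
    consonants = list(cluster_unit.rstrip("a"))   # crude cluster split, toy only
    nucleus = cluster_unit[len(consonants):] or "a"
    return [consonants[0] + vowel] + [c + nucleus for c in consonants[1:]]

print(back_off("kra"))   # -> ['kra']        (unit exists, no back-off needed)
print(back_off("tra"))   # -> ['ta*', 'ra']  (reduced vowel inserted in the cluster)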
Exploiting Phone-Class Specific Landmarks for Refinement of Segment Boundaries in TTS Databases
Vijayaditya Peddinti and Kishore Prahallad
Proceedings of Interspeech, 2011
Abstract
High-accuracy speech segmentation methods invariably depend on manually labelled data. However, under-resourced languages do not have the annotated speech corpora required to train these segmenters. In this paper we propose a boundary refinement technique which uses knowledge of phone-class specific sub-band energy events, in place of manual labels, to guide the refinement process. The use of this knowledge enables proper placement of boundaries in regions with multiple spectral discontinuities in close proximity. It also helps in the correction of large alignment errors. The proposed refinement technique places 82% of boundaries within 20 ms of the actual boundary. Combining the proposed technique with an iterative isolated HMM training technique boosts this accuracy to 89%, without the use of any manually labelled data.
@inproceedings{peddinti2011exploiting,
author = {Peddinti, Vijayaditya and Prahallad, Kishore},
title = {Exploiting Phone-Class Specific Landmarks for Refinement of Segment Boundaries in TTS Databases},
booktitle = {Proceedings of Interspeech 2011}
}
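A schematic numpy sketch of the refinement idea (the actual phone-class specific landmark definitions from the paper are not reproduced; the sub-band choice and search window below are hypothetical): around each initial HMM boundary, look for the largest jump in the sub-band energy expected to change at that phone-class transition and snap the boundary to it.

import numpy as np

def refine_boundary(subband_energy, initial_frame, search_radius=10):
    """Snap an initial boundary to the strongest nearby sub-band energy event.

    subband_energy : frame-level energy in the sub-band expected to change at
                     this phone-class transition (hypothetical choice)
    initial_frame  : boundary frame from the forced alignment
    search_radius  : how many frames around the initial boundary to search
    """
    lo = max(initial_frame - search_radius, 1)
    hi = min(initial_frame + search_radius, len(subband_energy) - 1)
    # Landmark = frame with the largest absolute energy change in the window.
    deriv = np.abs(np.diff(subband_energy[lo - 1:hi + 1]))
    return lo + int(np.argmax(deriv))

# Toy example: a step in sub-band energy a few frames after the HMM boundary.
energy = np.concatenate([np.full(50, 1.0), np.full(50, 6.0)])
print(refine_boundary(energy, initial_frame=46))   # -> 50 (snapped to the step)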