Voice activity detection (VAD), which separates the voice regions of an input speech signal from its silence or noise regions, is an indispensable pre-processing step in continuous speech recognition, speech coding, and noise estimation and reduction. While much successful research has been conducted on continuous speech in noiseless environments and on isolated words in noisy environments, there are few VAD methods for continuous speech in heavily noisy environments. Because unvoiced consonants have characteristics very similar to those of noise, performing VAD and the subsequent noise estimation and removal without paying special attention to unvoiced consonants may seriously distort them.
In this dissertation, assuming that the voiced sound regions have been removed by a method developed in our lab, we propose a method that explicitly extracts the boundaries between unvoiced consonant regions and noise regions so that more exact VAD can be performed. The proposed method is based on a histogram in the frequency domain, which was successfully used by Hirsch for noise estimation, and on a similarity measure of frequency components between adjacent frames. To evaluate the proposed method, experiments on unvoiced consonant boundary detection were carried out on noisy speech signals at 10 dB and 15 dB SNR. Across all seven kinds of noise, the overall rate of correct extraction was approximately 90%. The proposed algorithm could be used for VAD in speech recognition and speech coding, as well as for noise estimation and reduction in heavily noisy environments.
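
The two ingredients named above, a frequency-domain histogram and an adjacent-frame similarity measure, can be illustrated with a minimal sketch. It is not the algorithm of this dissertation: the frame length, hop size, number of histogram bins, the choice of cosine similarity as the similarity measure, and all function names are assumptions made only for illustration.

```python
import numpy as np

def frame_spectra(signal, frame_len=256, hop=128):
    """Magnitude spectra of Hann-windowed frames, shape (n_frames, n_bins)."""
    n_frames = 1 + (len(signal) - frame_len) // hop
    window = np.hanning(frame_len)
    frames = np.stack([signal[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1))

def histogram_noise_estimate(spectra, n_hist_bins=40):
    """Per-frequency-bin noise level (assumed form of a histogram estimate):
    the centre of the most populated histogram bin of magnitudes over time."""
    noise = np.empty(spectra.shape[1])
    for k in range(spectra.shape[1]):
        counts, edges = np.histogram(spectra[:, k], bins=n_hist_bins)
        peak = np.argmax(counts)
        noise[k] = 0.5 * (edges[peak] + edges[peak + 1])
    return noise

def adjacent_frame_similarity(spectra, eps=1e-12):
    """Cosine similarity between each frame's spectrum and the next frame's
    (an assumed similarity measure; the source does not specify one)."""
    a, b = spectra[:-1], spectra[1:]
    num = np.sum(a * b, axis=1)
    den = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1) + eps
    return num / den

# Usage on synthetic white noise (illustrative input only)
rng = np.random.default_rng(0)
x = rng.normal(size=4000)
S = frame_spectra(x)
noise_level = histogram_noise_estimate(S)
similarity = adjacent_frame_similarity(S)
```

Under these assumptions, a candidate boundary between noise and an unvoiced consonant would be a frame where the adjacent-frame similarity drops sharply or where the spectrum rises clearly above the histogram-based noise level; any thresholds would have to be tuned on real data.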