Abstract in another language
The subject of this PhD thesis is the efficient and robust processing and analysis of audio recordings derived from a call center. The thesis comprises two parts. The first part is dedicated to dialogue/non-dialogue detection and to speaker segmentation. The systems developed are a prerequisite for detecting (i) the audio segments that actually contain a dialogue between the system and the call center customer and (ii) the change points between the system and the customer. In this way, the volume of the audio recordings that need to be processed is significantly reduced, while the system is automated.

Several systems are developed to detect the presence of a dialogue. This is the first effort reported in the international literature in which the audio channel is exclusively exploited, and also the first time that the speaker utterance duration is estimated. The most sophisticated system is fully automated. It is based on actor indicator functions, i.e., functions which indicate whether an actor speaks at a given time instant. In particular, the cross-correlation and the magnitude of the corresponding cross-power spectral density of a pair of indicator functions are fed as input to various classifiers, such as voted perceptrons, radial basis function networks, random trees, and support vector machines, for dialogue/non-dialogue detection. AdaBoost is also exploited to boost classifier efficiency. The aforementioned classifiers are trained using ground-truth indicator functions determined by human annotators. For testing, actual indicator functions are derived by applying audio activity detection and actor clustering to the audio recordings. The MUSCLE movie database, part of which was developed for the thesis and which is available on demand for academic purposes, is used for experimental evaluation. High dialogue/non-dialogue detection accuracy is reported.

Several systems are then developed for speaker segmentation. The most sophisticated algorithm for automatic speaker segmentation is based on the Bayesian Information Criterion (BIC). BIC tests are not performed for every window shift (e.g., every few milliseconds), as is common in the literature, but only when a speaker change is most probable to occur. This is done by estimating the next probable change point using a model of utterance durations; the inverse Gaussian is found to fit the distribution of utterance durations best. As a result, fewer BIC tests are needed, making the proposed system less computationally demanding in time and memory, and considerably more efficient with respect to missed speaker change points. A feature selection algorithm based on the branch and bound search strategy is applied in order to identify the most efficient features for speaker segmentation. Furthermore, a new theoretical formulation of BIC is derived by applying centering and simultaneous diagonalization. This formulation is considerably more computationally efficient than the standard BIC when the covariance matrices are estimated by estimators other than the usual maximum likelihood ones. Two commonly used pairs of figures of merit are employed and their relationship is established. Computational efficiency is achieved through the speaker utterance modeling, whereas robustness is achieved by feature selection and by applying BIC tests at appropriately selected time instants. Preliminary experiments are conducted on the conTIMIT database, which was created as part of the thesis.
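To make the role of the BIC test concrete, the following minimal Python sketch implements the standard ΔBIC test for a hypothesized speaker change inside one analysis window. It illustrates the generic criterion only, not the thesis implementation; the function name, the penalty weight, and the use of NumPy are assumptions made for the example.

```python
import numpy as np

def delta_bic(X, t, lam=1.0):
    """Standard Delta-BIC test for a speaker change at frame index t.

    X   : (n, d) array of feature vectors (e.g. MFCCs) from one analysis window.
    t   : hypothesized change point, 0 < t < n.
    lam : penalty weight (the lambda of the BIC literature).

    A positive return value supports the hypothesis of two different speakers.
    """
    n, d = X.shape
    # Maximum-likelihood (biased) covariance log-determinants for the whole
    # window and for the two sub-windows separated by the candidate point t.
    logdet = lambda Z: np.linalg.slogdet(np.cov(Z, rowvar=False, bias=True))[1]
    # Extra free parameters of the two-Gaussian model: d mean values plus
    # d(d + 1) / 2 covariance entries, weighted by the usual log(n) penalty.
    penalty = 0.5 * lam * (d + 0.5 * d * (d + 1)) * np.log(n)
    return (0.5 * n * logdet(X)
            - 0.5 * t * logdet(X[:t])
            - 0.5 * (n - t) * logdet(X[t:])
            - penalty)
```

In the thesis, such tests are reported to be run only around the next probable speaker change point, predicted from an inverse Gaussian model of utterance durations (such a model could, for instance, be fitted with scipy.stats.invgauss.fit), rather than for every window shift; this is what reduces the number of BIC tests and the computational load.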
The conTIMIT database is available for academic purposes. Experimental results indicate that the proposed modifications yield superior performance compared to existing approaches. Finally, the new formulation of BIC has found an additional application in phone segmentation.

The second part of the thesis is dedicated to emotional speech processing. To begin with, gender classification by processing emotional speech is investigated, because gender information is found to improve speech emotion classification accuracy. A large pool of 1418 features is created, including 619 features tested for the first time in the context of the reported research; to the best of the author's knowledge, this is the largest such feature set. The features are related to statistics of contours computed over pitch, formants, energy, autocorrelation, MPEG-7 descriptors, Fujisaki's parameters, jitter, and shimmer. A branch and bound feature selection algorithm is applied to select a subset of 15 features. The fact that features tested for the first time are included among the selected ones confirms the novelty of the approach. The selected features are fed as input to 8 classifiers, namely K-nearest neighbors, radial basis function neural networks, probabilistic neural networks, support vector machines, discriminant analysis-based classifiers, classification trees, learning vector quantizers, and neural gas networks. Two databases are employed: the Berlin Database of Emotional Speech and the Danish Emotional Speech database. Perfect classification accuracy (100%) is obtained when applying K-nearest neighbors, radial basis function neural networks, support vector machines, and learning vector quantizers. Methodological research focuses on a comparative study of the performance gains among the classifiers as well as among variants of particular classifiers. Furthermore, a comparative study of the classification accuracy between the 2 databases is provided. The reported classification results outperform those obtained by state-of-the-art techniques.

The next problem addressed is the way emotion is perceived by humans. Systematic discrepancies seem to occur in the way different people perceive the emotion expressed in the same audio recording. Often, even the person who uttered the recording is not able to name his/her own emotion. Thus, there is an inherent difficulty in determining an objective emotional label for an audio recording. Consequently, a robust and effective solution is to annotate the data using as many annotators as possible. For this reason, the Danish Emotional Speech database and the Speech Under Simulated and Actual Stress database are annotated by 6 annotators along 3 basic axes: perceptual judgments, including loudness, rate, blareness, melodicity, articulation, and steadiness; emotion primitives, consisting of valence, activation, and confrontation; and emotional states, comprising anger, happiness, neutral, sadness, surprise, and stress. The Emotion Annotation Tool (EmotAnnTool) was utilized for this purpose. The percentage of correctly classified audio recordings is relatively small. However, this is not due to poor annotation quality, but rather reflects the fact that emotion recognition is a difficult task even for humans. The pairwise emotion label agreement among all 6 annotators is analyzed statistically using the kappa statistic, and it becomes evident that there is little consensus among the annotators.
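As an illustration of this agreement analysis, the short Python sketch below computes Cohen's kappa for every pair of annotators. The helper name, the toy labels, and the use of scikit-learn are assumptions made for the example; the actual thesis data and tooling are not reproduced here.

```python
from itertools import combinations
from sklearn.metrics import cohen_kappa_score

def pairwise_kappa(annotations):
    """Cohen's kappa for every pair of annotators.

    annotations : dict mapping an annotator id to the list of emotion labels
                  that annotator assigned, all lists ordered by recording.
    Returns a dict {(annotator_a, annotator_b): kappa}.
    """
    return {
        (a, b): cohen_kappa_score(annotations[a], annotations[b])
        for a, b in combinations(sorted(annotations), 2)
    }

# Toy example with invented labels (not data from the thesis):
annotations = {
    "annotator_1": ["anger", "neutral", "happiness", "sadness", "anger"],
    "annotator_2": ["anger", "neutral", "anger", "sadness", "neutral"],
}
print(pairwise_kappa(annotations))
```

Kappa values close to zero indicate agreement barely above chance, consistent with the little consensus among annotators reported above.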
Thus, determining a common label for each audio recording is not an easy task, especially for specific groups of emotions, such as happiness and anger.

Finally, we created one of the few audio databases in the Greek language with spontaneous, emotionally colored audio recordings. The quality of spontaneous emotions is expected to be better than that of acted emotions, although spontaneous data are more difficult to collect. The data were provided by the Singular Logic S.A. company to the Artificial Intelligence and Information Analysis Laboratory of the Department of Informatics, Aristotle University of Thessaloniki. Specifically, the audio recordings are derived from the state-of-the-art call center of Hellenic Seaways, which provides information about Greek ship timetables and makes reservations. The emotional labels are: anger-annoyance, frustration, excuse, neutral, and satisfaction. Negative emotions are analyzed in depth, since 3 out of the 5 emotion labels correspond to negative emotions; in a call center scenario, customers who experience negative emotions require careful handling in order to improve the quality of service. For each emotional label we propose a set of keywords, as well as the corresponding tone and style. The database is also annotated with respect to the gender of the speaker, the presence of background noise, the number of speakers, and the number of words, rendering it suitable for additional applications, such as smart human-computer interaction.

The PhD thesis concludes with a summary of the contributions to emotional speech processing. Moreover, future work is discussed.