Emotion-based Analysis and Classification of Audio Music

Abstract

This research work addresses the problem of music emotion recognition using audio signals. Music emotion recognition research has been gaining ground over the last two decades. The typical approach starts with a dataset composed of music files and associated emotion ratings given by listeners. This data, typically audio signals, is first processed by computational algorithms to extract and summarize its characteristics, known as features (e.g., beats per minute, spectral metrics). Next, the feature set is fed to machine learning algorithms that look for patterns connecting the features to the emotional annotations. The result is a computational model able to infer the emotion of a new, unlabelled music file based on the previously found patterns.

Although several studies have been published, two main issues remain open and are the current barrier to progress in the field. First, a high-quality, sizeable and public audio dataset is needed, one that can be widely adopted as a standard and used across different works; the publicly available datasets suffer from known issues such as low-quality annotations or limited size. Second, we believe novel emotionally-relevant audio features are needed to overcome the plateau of recent years. Supporting this idea is the fact that the vast majority of previous works focused on the computational classification component, typically using a similar set of audio features originally proposed to tackle other audio analysis problems (e.g., speech recognition).

Our work focuses on these two problems. Proposing novel emotionally-relevant audio features requires knowledge from several fields. Thus, our work started with a review of the music and emotion literature to understand how emotions can be described and classified, how music and its dimensions work, and, finally, to merge both fields by reviewing the identified relations between musical dimensions and emotional responses. Next, we reviewed the existing audio features, relating each with one of the eight musical dimensions: melody, harmony, rhythm, dynamics, tone color, expressive techniques, musical texture and musical form. As a result, we observed that audio features are unbalanced across musical dimensions, with expressive techniques, musical texture and musical form said to be emotionally relevant but lacking audio extractors.

To address these issues, we propose several audio features. These build on previous work to estimate the main melody notes from the low-level audio signal. From these notes, various musically-related metrics were extracted, e.g., glissando presence, articulation information and changes in dynamics. To assess their relevance to emotion recognition, a dataset containing 900 audio clips, annotated in four classes (Russell's quadrants), was built. Our experimental results show that the proposed features are emotionally relevant and that their inclusion in emotion recognition models leads to better results. Moreover, we also measured the influence of both existing and novel features, leading to a better understanding of how different musical dimensions influence specific emotion quadrants. These results give us insights into the open issues and help us define possible research paths for the near future.
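The pipeline outlined in the abstract (feature extraction followed by supervised classification into Russell's quadrants) can be illustrated with a minimal sketch. The snippet below assumes librosa for feature extraction and scikit-learn for classification; the descriptors, file paths and labels are illustrative placeholders, not the thesis's actual feature set or its 900-clip dataset.

import numpy as np
import librosa
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def extract_features(path):
    # Summarize one clip with a few standard descriptors
    # (tempo, spectral centroid statistics, MFCC statistics).
    y, sr = librosa.load(path, sr=22050, mono=True)
    tempo, _ = librosa.beat.beat_track(y=y, sr=sr)            # beats per minute
    centroid = librosa.feature.spectral_centroid(y=y, sr=sr)  # spectral metric
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)        # timbre summary
    return np.hstack([tempo, centroid.mean(), centroid.std(),
                      mfcc.mean(axis=1), mfcc.std(axis=1)])

# Hypothetical annotated data: audio clips labelled with Russell's quadrants (Q1-Q4).
paths = ["clip_001.mp3", "clip_002.mp3", "clip_003.mp3", "clip_004.mp3"]
labels = ["Q1", "Q2", "Q3", "Q4"]

X = np.vstack([extract_features(p) for p in paths])
model = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
model.fit(X, labels)  # with a real dataset, cross-validation would be used instead

# Infer the quadrant of a new, unlabelled clip from the learned patterns.
print(model.predict(extract_features("new_clip.mp3").reshape(1, -1)))

The thesis's contribution targets the feature extraction step, complementing such generic descriptors with features tied to expressive techniques, musical texture and musical form derived from estimated melody notes.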

Keywords

audio music emotion recognition, bi-modal approaches, emotionally-relevant audio features, expressive techniques, music and emotion, music information retrieval, musical texture

PhD Thesis

Emotion-based Analysis and Classification of Audio Music, 2019
