Why Pop? A System to Explain How Deep Learning Models Classify Music
Description
The impact of Artificial Intelligence (AI) on daily life has increased significantly. AI is making major strides into critical areas such as
healthcare, but also into areas such as entertainment and leisure. Deep neural
networks have been pivotal in making these advancements possible. However, a well-known problem with deep neural networks is the lack of explanations for the choices
they make. To address this, several methods have been proposed in the research literature.
One example is ranking individual features by how influential they are in the
decision-making process. In contrast, a newer class of methods is based
on Concept Activation Vectors (CAVs), which extract higher-level concepts
from the trained model, capturing information as a combination of several features
rather than a single one.
and not just one. The goal of this thesis is to employ concepts in a novel domain: to
explain how a deep learning model uses computer vision to classify music into different
genres. Thanks to advances in deep learning for computer vision
classification tasks, it is now standard practice to convert an audio clip into
a corresponding spectrogram and to use that spectrogram as an image input to the deep
learning model. A pre-trained model can thus classify spectrogram images
(representing songs) into musical genres. The proposed explanation system, called
“Why Pop?”, tries to answer questions about the classification process, such as
which parts of the spectrogram influence the model the most, which concepts were
extracted, and how these concepts differ across classes. These explanations help the
user gain insight into what the model has learned, its biases, and its decision-making process.
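As a rough illustration of the audio-to-image step described above, the sketch below converts a clip into a log-scaled mel spectrogram with librosa and saves it as an image. The file name, sampling rate, and mel parameters are illustrative assumptions rather than values taken from the thesis.

```python
import librosa
import librosa.display
import numpy as np
import matplotlib.pyplot as plt

# Load a short audio clip (path and sampling rate are placeholder assumptions).
y, sr = librosa.load("song_clip.wav", sr=22050)

# Compute a mel spectrogram and convert power to decibels, the usual
# log-scaled representation fed to image-based classifiers.
S = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128, fmax=8000)
S_db = librosa.power_to_db(S, ref=np.max)

# Save the spectrogram as an image so it can serve as input to a
# pre-trained vision model.
plt.figure(figsize=(4, 4))
librosa.display.specshow(S_db, sr=sr, x_axis="time", y_axis="mel")
plt.axis("off")
plt.savefig("song_clip_spectrogram.png", bbox_inches="tight", pad_inches=0)
plt.close()
```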
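The CAV idea mentioned above can be summarised in a few lines: a linear classifier is trained to separate a layer's activations for concept examples from activations for random examples, and the vector normal to its decision boundary is the concept activation vector, as in Kim et al.'s TCAV. The sketch below assumes activations and gradients have already been extracted from the network; the array names, shapes, and random placeholders are assumptions for illustration only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Placeholder activations from one layer of the trained network:
# rows are examples, columns are activation units.
concept_acts = np.random.randn(100, 512)   # spectrogram patches showing a concept
random_acts = np.random.randn(100, 512)    # random counterexamples

X = np.vstack([concept_acts, random_acts])
y = np.array([1] * len(concept_acts) + [0] * len(random_acts))

# A linear probe separates concept from random activations; its weight
# vector (normal to the decision boundary) is the concept activation vector.
probe = LogisticRegression(max_iter=1000).fit(X, y)
cav = probe.coef_[0] / np.linalg.norm(probe.coef_[0])

# The sensitivity of a class prediction to the concept is the directional
# derivative of the class logit along the CAV, approximated here with
# precomputed gradients of the logit w.r.t. the layer activations.
grads = np.random.randn(50, 512)           # placeholder gradients for 50 test songs
sensitivities = grads @ cav
tcav_score = np.mean(sensitivities > 0)    # fraction of songs positively influenced
print(f"TCAV score: {tcav_score:.2f}")
```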