Deep Learning for Sound Recognition

Dr. Michael Frishkopf, Professor of Music and Director of the Canadian Centre for Ethnomusicology (CCE), asks: how do we recognize the components and attributes of sound? How do we describe and parse an audio recording of music, speech, or environmental sounds, extract sonic features, classify types, segment units, and identify the sources of sounds? Some recordings capture a single sound source: a single instrument, speaker, or bird. Others capture multiple but coordinated sources: a musical ensemble or a conversation. Typically in fieldwork, however, a recording encompasses a complex mix of uncoordinated sound sources, a total soundscape that may include music as well as speech, music from multiple groups performing simultaneously, many speakers speaking at once, or many bird calls, all layered together with “noise” such as the sounds of crowds, highways and factories, rain, wind and thunder. Unlike the analogous challenge for visual “recordings” (photographs), recognizing complex sound environments on audio recordings remains a rather mysterious process.

In contrast to an earlier era of “small data” (largely the result of the limited capacity of expensive analog recorders), the advent of inexpensive, portable, high-capacity digital recording devices, combined with a growing interest in sound across the humanities, social sciences, and sciences, has produced vast collections of sound recordings and brought sound into the realm of “big data.” To date, however, most of this sound collection data is not annotated and is therefore, for all practical purposes, inaccessible for research.

Computational recognition of sound, its types, sources, and components, is crucial for a wide array of fields, including ethnomusicology, music studies, sound studies, linguistics (especially phonetics), media studies, library and information science, and bioacoustics, in order to enable indexing, searching, retrieval, and regression of audio information. While expert human listeners may be able to recognize complex sound environments with ease, the process is slow: they listen in real time, and they must be trained to hear sonic events contrapuntally. Through this project, ‘Deep Learning for Sound Recognition’, we aim to explore opportunities for applying big-data deep learning that will ultimately enable these functions across large sound collections for ongoing interdisciplinary research.

To accompany this project, the CCE is also currently engaged in an experiment on ILAM's Sound Files to see whether instruments generated by MIDI (playing singly, in pairs, or in triples, and overlapping or not) can be accurately identified. MIDI generation was chosen because it allows the team to produce a virtually unlimited number of examples. That experiment has not yet concluded. The team will continue with the task of instrument recognition, which appears to be more feasible, meaningful, and useful than place/region recognition.
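
The experimental design above (instruments playing singly, in pairs, or in triples, overlapping or not) can be pictured as a list of labelled example specifications. The following is only a minimal sketch: the instrument names, the `example_specs` function, and the dictionary layout are hypothetical illustrations, not the CCE's actual setup, and no MIDI is rendered here.

```python
from itertools import combinations

# Hypothetical instrument list; the source does not say which instruments were used.
INSTRUMENTS = ["flute", "violin", "piano", "trumpet"]


def example_specs(instruments, max_group=3):
    """Enumerate labelled example specifications.

    Each spec names the instruments in one clip and whether they overlap in time.
    A solo instrument has nothing to overlap with, so solos are listed once.
    """
    specs = []
    for size in range(1, max_group + 1):
        for group in combinations(instruments, size):
            if size == 1:
                specs.append({"instruments": group, "overlapping": False})
            else:
                # Pairs and triples appear in both sequential and overlapping form.
                for overlapping in (False, True):
                    specs.append({"instruments": group, "overlapping": overlapping})
    return specs


specs = example_specs(INSTRUMENTS)
print(len(specs), "labelled example specifications")
```

Because every specification carries its own label, each one could in principle be rendered many times with different notes and timings, which is what makes the number of training examples virtually unlimited.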

Another possibility is to focus on so-called "unsupervised learning", in which all the data would be given to a learning algorithm to cluster as it will, and the team would then try to interpret the resulting clusters. Although Dr. Frishkopf had not put a priority on this type of machine learning, it is becoming apparent that it is worth trying, at least as a first step.
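
As an illustration of the "cluster first, interpret afterwards" idea, here is a minimal sketch using plain k-means, one common clustering algorithm. The source does not say which algorithm or audio features the project would use; the `kmeans` function, the two-dimensional feature vectors, and the choice of two clusters are all assumptions made for the example.

```python
import math


def kmeans(points, k, iterations=20):
    """Plain k-means on a list of equal-length tuples; returns (centroids, labels)."""
    # Deterministic initialisation for this sketch: pick evenly spaced input points.
    step = len(points) // k
    centroids = [points[i * step] for i in range(k)]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centroids, labels


# Made-up feature vectors standing in for two summary features per recording.
features = [
    (0.10, 0.20), (0.15, 0.22), (0.12, 0.18),
    (0.90, 0.80), (0.88, 0.85), (0.92, 0.79),
]
centroids, labels = kmeans(features, k=2)
print(labels)
```

The algorithm only returns cluster numbers. Deciding whether a cluster corresponds to an instrument, a speaker, or background noise is the interpretive step that would remain with the researchers.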