We consider the problem of detecting, isolating and classifying elephant calls in continuously recorded audio. Such automatic call characterisation can assist conservation efforts and inform environmental management strategies. In contrast to previous work, in which call detection was performed on audio segments several seconds in length, we perform call activity detection at discrete time instants, which implicitly allows call endpointing. For experimentation, we employ two annotated datasets, one containing Asian and the other African elephant vocalisations. We evaluate several shallow and deep classifier models, and show that the current best performance can be improved by using an audio spectrogram transformer (AST). Furthermore, we show that transfer learning leads to improvements in terms of both computational complexity and performance. Finally, we consider automated sub-call classification using an accepted vocalisation taxonomy, a task which has not previously been attempted, and for which the transformer architectures again provide the best performance. Our best classifiers achieve an average precision (AP) of 0.962 for binary call activity detection, and an area under the receiver operating characteristic curve (AUC) of 0.957 and 0.979 for call classification (5 classes) and sub-call classification (7 classes), respectively. These represent new benchmarks or improvements over the previous best systems.
Elephant, automated call characterisation, passive acoustic monitoring, transformer, deep neural networks