Explainable Audio Classification of Playing Techniques with Layer-wise Relevance Propagation

Deep convolutional networks (convnets) in the time–frequency domain can learn an accurate and fine-grained categorization of sounds. For example, in the context of music signal analysis, this categorization may correspond to a taxonomy of playing techniques: vibrato, tremolo, trill, and so forth. However, convnets lack an explicit connection with the neurophysiological underpinnings of musical timbre perception. In this article, we propose a data-driven approach to explain audio classification in terms of physical attributes in sound production. We borrow from current literature in "explainable AI" (XAI) to study the predictions of a convnet which achieves an almost perfect score on a challenging task: the classification of five comparable real-world playing techniques from 30 instruments spanning seven octaves. Mapping the signal into the carrier-modulation domain using the scattering transform, we decompose the network's predictions over this domain with layer-wise relevance propagation. We find that the regions most relevant to the predictions are localized around the physical attributes with which the playing techniques are performed.


INTRODUCTION
Our scientific understanding of sound production has grown considerably since the early years of computer music [1,2]. For example, the mechanics of a piano can be simulated with enough precision to allow faithful synthesis [3]. However, for other instruments, we lack a closed-form description of the interaction between player and instrument [4]. Such is the case, in particular, when this interaction belongs to the "extended" vocabulary of playing techniques: tremolo, vibrato, staccato, and so forth [5]. This is because the gesture of the performer is more difficult to specify for extended techniques than for the so-called "ordinary" technique [6].
Meanwhile, the renewed interest in machine learning for audio signal processing has advanced the state of the art in the task of playing technique classification [7]. This task is motivated by the key role of playing techniques in the computational analysis of musical performance [8]. Of particular interest is the subfamily of "periodic modulation techniques" (PMT), in which the musician alternates quickly between two positions: e.g., pressing and releasing a key to produce a trill [9]. In comparison with the ordinary technique, a PMT audio signal modulates periodically in amplitude, in frequency, or both.
For this reason, the scattering transform offers a judicious choice of feature map for PMT classification. Indeed, it represents the audio signal x in terms of a tensor Sx which is indexed by time t, first-order wavelet frequency λ1, and second-order wavelet frequency λ2 [10]. Prior research has proven an approximate closed-form expression for an idealized model of PMT, in which both the carrier and the modulator are sinusoids of respective frequencies ω1 and ω2 [11]. Under this idealization, the energy in Sx has a local maximum at the scattering path (λ1, λ2) = (ω1, ω2). Yet, real-world PMTs involve non-sinusoidal carriers and modulators [12]. The corresponding Sx yields several nonzero regions in the scattering transform domain [13].
Supported by an Atlanstic2020 project on Trainable Acoustic Sensors (TrAcS). Companion website: https://github.com/changhongw/examod.
Passing the tensor Sx as input to a supervised classifier has led to state-of-the-art performance over several datasets for playing technique recognition [14]. It has also made it possible to match subjective ratings of auditory similarity between playing techniques across different instruments and mutes [15]. Another strong tendency of recent research is to switch from shallow classifiers (e.g., support vector machines) to deep learning (e.g., convnets) [16,17]. But despite the growth of data-driven approaches to musical acoustics [18], the perceptual underpinnings of playing technique recognition remain poorly understood. What makes the difference, for example, between a vibrato and a trill [19,20]? Answering this kind of research question requires insight into the spectrotemporal characteristics of the audio signal at hand, and not simply an accurate classification.
In this article, we propose to characterize real-world playing techniques in terms of sparse activations in feature space. A prior publication has tackled this problem with unsupervised dictionary learning over magnitude spectra [21]. The originality of our approach is that it is supervised: rather than decomposing the tensor Sx, it decomposes the prediction of a deep neural network f over the domain (λ1, λ2). The resulting decomposition does not measure which pairs (ω1, ω2) are present in x but, more specifically, which are relevant to the value of f(Sx). We decompose the predictions using the layer-wise relevance propagation (LRP) method, which explains the predictions of pre-trained models by assigning a relevance score to each neuron. Although LRP has led to many publications in image and text processing [22], it has rarely been applied to speech [23,24] and never to music. Our main finding is that LRP aligns with current knowledge about sound production and musical gestures.

Deep Taylor decomposition
We define a neural network f of depth M recursively over layers:
where W_m and b_m are the weight matrix and the bias vector of layer m, and ρ is a rectified linear unit (ReLU). The indices i and j run over the neurons of layer m − 1 and layer m, respectively. Let y_m denote the layer-wise output of the network, i.e., y_m = f_m(Sx). The relevance at the deepest layer, which takes y_{M−1} as input, is defined as the prediction itself: R_M(y_{M−1}) = f_M(Sx). Our goal is to decompose R_M into shallower layers until reaching R_0 at the level of the input f_0(Sx) = Sx. We seek an LRP rule of the form:
in which the link matrix L_m preserves total relevance:
Before identifying L_m, we impose that the relevance R_m and the activation y_m be proportional at each node j [25]:
with c_m[j] an unknown proportionality factor. In this case, R_m(y_{m−1})[j] = 0 is equivalent to y_m[j] = 0, which defines a plane in R^{N_{m−1}} according to Eq. (1). We then define a search direction d_m ∈ R^{N_{m−1}} and solve for a root point ỹ_{m−1} from:
where α_m is a scalar. We then obtain for every j:
To identify L_m, we perform a deep Taylor decomposition [26], which comprises a series of Taylor decompositions of the relevance, applied recursively at each node. Specifically, we write a Taylor expansion of the function R_m at the root point ỹ_{m−1}:
Neglecting the higher-order terms in Eq. (7) and injecting Eq. (6) into Eq. (7), we obtain the base formula for deriving different LRP rules [26]:
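The display equations referred to in this section as Eq. (1) to Eq. (8) can be sketched as follows. This is a reconstruction from the definitions given in the text and from the standard deep Taylor derivation [26], not a verbatim copy of the original equations. The index convention W_m[i, j], the shorthand R_m[j] for R_m(y_{m−1})[j], the absence of the bias from the denominator of Eq. (8), and the assignment of equation numbers are all assumptions.

```latex
\begin{align}
f_m(Sx)[j] &= \rho\Big(\sum_i W_m[i,j]\, f_{m-1}(Sx)[i] + b_m[j]\Big), \tag{1}\\
R_{m-1}[i] &= \sum_j L_m[i,j]\, R_m[j], \tag{2}\\
\sum_i L_m[i,j] &= 1 \quad\Longrightarrow\quad \sum_i R_{m-1}[i] = \sum_j R_m[j], \tag{3}\\
R_m[j] &= c_m[j]\, y_m[j], \tag{4}\\
\tilde{y}_{m-1} &= y_{m-1} - \alpha_m\, d_m, \tag{5}\\
\alpha_m &= \frac{\sum_i W_m[i,j]\, y_{m-1}[i] + b_m[j]}{\sum_i W_m[i,j]\, d_m[i]}, \tag{6}\\
R_m[j] &= R_m[j]\Big|_{\tilde{y}_{m-1}}
  + \sum_i \frac{\partial R_m[j]}{\partial y_{m-1}[i]}\bigg|_{\tilde{y}_{m-1}}
    \big(y_{m-1}[i] - \tilde{y}_{m-1}[i]\big) + \ldots, \tag{7}\\
R_{m-1}[i] &= \sum_j \frac{W_m[i,j]\, d_m[i]}{\sum_{i'} W_m[i',j]\, d_m[i']}\; R_m[j]. \tag{8}
\end{align}
```

Under these assumptions, the step from Eq. (7) to Eq. (8) reads as follows for an active node j. The zeroth-order term vanishes because the root point satisfies R_m[j] = 0. The partial derivative equals c_m[j] W_m[i, j] by Eq. (1) and Eq. (4), and y_{m−1} − ỹ_{m−1} = α_m d_m by Eq. (5). Since the numerator of Eq. (6) equals y_m[j] for an active node, the first-order term for input i becomes W_m[i, j] d_m[i] / Σ_{i'} W_m[i', j] d_m[i'] times c_m[j] y_m[j] = R_m[j], which is the summand of Eq. (8). The link matrix in Eq. (8) sums to one over i, in agreement with Eq. (3).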

Baseline rule: LRP-0
According to the simplest rule, known as LRP-0, the search direction is defined as in [26]. Therefore, according to Eq. (8), we obtain the relevance redistribution formula:
To avoid division by zero, the LRP-ε rule adds a small positive number ε to the denominator of Eq. (9).
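As an illustration, the LRP-0 and LRP-ε rules can be sketched for a single dense layer in NumPy. The sketch assumes that the search direction of LRP-0 is the input activation itself, so that each input i receives the share y[i] W[i, j] / Σ_{i'} y[i'] W[i', j] of the relevance of output j. This is the usual form of LRP-0, but the function name, the argument layout, and the omission of the bias are choices made here and are not taken from the paper.

```python
import numpy as np

def lrp_epsilon_dense(y_prev, W, R_out, eps=1e-9):
    """Redistribute the relevance of layer m onto layer m-1 (hypothetical helper).

    y_prev : (N_in,) activations of layer m-1
    W      : (N_in, N_out) weights, W[i, j] links input i to output j
    R_out  : (N_out,) relevance of layer m
    eps    : small positive stabiliser; eps=0 gives LRP-0
    """
    z = y_prev[:, None] * W            # contributions z[i, j] = y_prev[i] * W[i, j]
    denom = z.sum(axis=0) + eps        # one denominator per output j, plus epsilon
    return (z / denom) @ R_out         # R_prev[i] = sum_j z[i, j] / denom[j] * R_out[j]

# Toy layer: three inputs, two outputs.
y_prev = np.array([1.0, 2.0, 0.0])
W = np.array([[1.0, -1.0],
              [0.5,  1.0],
              [2.0,  3.0]])
R_out = np.array([2.0, 1.0])

R_lrp0 = lrp_epsilon_dense(y_prev, W, R_out, eps=0.0)
R_lrpe = lrp_epsilon_dense(y_prev, W, R_out, eps=1e-6)
```

With eps=0 the total relevance is conserved exactly, whereas a positive eps absorbs a small fraction of it. An inactive input (here the third one) receives no relevance under either rule.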

Advanced rule: LRP-[ε, z+]
The LRP-z+ rule searches for the nearest root point on a line segment, and the name "z+" originates from [26]. LRP-[ε, z+] is a composite rule which applies the LRP-ε rule to convolutional layers and the LRP-z+ rule to fully connected layers. We refer to [25] for a complete list of propagation rules. We implement LRP in Python via the Zennit package.
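The z+ rule can be sketched for a dense layer in the same spirit. The sketch assumes the form commonly given in the deep Taylor literature, in which only the positive parts of the weights carry relevance: z+[i, j] = y[i] max(W[i, j], 0). Whether this matches Eq. (10) term by term cannot be checked here, and the function name and the tiny stabiliser in the denominator are assumptions. This is a standalone illustration and not the Zennit implementation.

```python
import numpy as np

def lrp_zplus_dense(y_prev, W, R_out, eps=1e-12):
    """z+ redistribution for one dense layer (hypothetical helper).

    Only positive weights contribute, so non-negative activations and
    non-negative incoming relevance yield non-negative outgoing relevance.
    """
    W_pos = np.maximum(W, 0.0)             # keep the positive part of each weight
    z_pos = y_prev[:, None] * W_pos        # z+[i, j] = y_prev[i] * max(W[i, j], 0)
    denom = z_pos.sum(axis=0) + eps        # guard against an all-zero column
    return (z_pos / denom) @ R_out

y_prev = np.array([1.0, 2.0, 0.0])
W = np.array([[1.0, -1.0],
              [0.5,  1.0],
              [2.0,  3.0]])
R_out = np.array([2.0, 1.0])

R_zplus = lrp_zplus_dense(y_prev, W, R_out)
```

In a composite rule such as LRP-[ε, z+], a helper of this kind would be applied to the fully connected layer, and an ε-stabilised rule to the convolutional layers.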

Scattering transform
As a biologically plausible surrogate for human perceptual judgments of isolated audio events [15], the scattering transform decomposes audio signals using wavelet convolutions, modulus nonlinearities, and average pooling. The first-order scattering transform S1x maps the signal x into the time-frequency domain by convolving it with a wavelet filterbank ψλ1, taking the modulus, and averaging with a lowpass filter ϕ:
$$S_1x(t, \lambda_1) = \big(|x \ast \psi_{\lambda_1}| \ast \phi\big)(t).$$
S1x is essentially a constant-Q transform (CQT), which is a commonly used representation for music signal analysis [16,17]. Yet, the averaging loses temporal modulations, which are critical to the discrimination of PMTs. To recover this information, we perform a second-order decomposition of the unaveraged S1x with a wavelet filterbank ψλ2 [10]:
$$S_2x(t, \lambda_1, \lambda_2) = \big(\big||x \ast \psi_{\lambda_1}| \ast \psi_{\lambda_2}\big| \ast \phi\big)(t).$$
We use Sx = S1x + S2x as input to a convnet for playing technique classification. Backpropagating the predictions using the LRP rules in Section 2, we obtain the relevance score R0(Sx)[t, λ1, λ2], which shows the contribution of each element in Sx.
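The role of the second order can be illustrated with a toy computation on an amplitude-modulated sinusoid. The sketch below is not the filterbank used in the paper. It assumes Gaussian (Gabor-like) bandpass filters defined in the Fourier domain with a bandwidth of one quarter of the centre frequency, and it replaces the lowpass filter ϕ by a global time average. All names and parameter values are hypothetical.

```python
import numpy as np

def gabor_response(n, sr, xi, sigma):
    """Fourier-domain response of an analytic Gaussian bandpass filter centred at xi Hz."""
    freqs = np.fft.fftfreq(n, d=1.0 / sr)
    return np.exp(-((freqs - xi) ** 2) / (2.0 * sigma ** 2))

def wavelet_modulus(x, sr, xi, sigma):
    """Compute |x * psi| by circular convolution in the Fourier domain."""
    psi_hat = gabor_response(len(x), sr, xi, sigma)
    return np.abs(np.fft.ifft(np.fft.fft(x) * psi_hat))

sr, n = 8192, 2 ** 14                      # 2 seconds of audio at 8192 Hz
t = np.arange(n) / sr
carrier_hz, modulator_hz = 440.0, 8.0      # idealised tremolo: omega_1 and omega_2
x = (1.0 + 0.5 * np.cos(2 * np.pi * modulator_hz * t)) * np.cos(2 * np.pi * carrier_hz * t)

# First order: envelope of the band around the carrier, before any averaging.
u1 = wavelet_modulus(x, sr, carrier_hz, carrier_hz / 4)

# Second order: filter the envelope at several modulation rates, then average over time.
lambda2_grid = [2.0, 4.0, 8.0, 16.0, 32.0]
s2 = np.array([wavelet_modulus(u1, sr, lam2, lam2 / 4).mean() for lam2 in lambda2_grid])
best_lambda2 = lambda2_grid[int(np.argmax(s2))]
```

For this idealised signal, the second-order energy along λ1 = 440 Hz is largest at the modulation rate λ2 = 8 Hz, whereas averaging u1 directly would discard that information.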

Deep convolutional network
We train a convnet with three convolutional layers and one dense layer. Each convolutional layer comprises a one-dimensional convolution unit, a batch normalization, a ReLU, and an average pooling. The dense layer is followed by a softmax unit. The input to the convnet is the tensor of scattering coefficients, either S1x or Sx. The corresponding feature dimensions are 74 and 1200, and the numbers of trainable parameters are 10.3 K and 2.1 M. In both cases, the network is trained with early stopping and a batch size of 32. We use a weighted cross-entropy loss because the classes are unbalanced.
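The weighted cross-entropy loss can be sketched in NumPy as follows. The text does not state how the class weights are chosen. The inverse-frequency weighting and the normalisation by the sum of the weights used below are assumptions, and both function names are hypothetical.

```python
import numpy as np

def inverse_frequency_weights(labels, n_classes):
    """One weight per class, inversely proportional to its frequency (assumed scheme)."""
    counts = np.bincount(labels, minlength=n_classes)
    return counts.sum() / (n_classes * np.maximum(counts, 1))

def weighted_cross_entropy(logits, labels, class_weights):
    """Class-weighted cross-entropy, averaged with the weights of the target classes."""
    shifted = logits - logits.max(axis=1, keepdims=True)          # stabilise the softmax
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    per_example = -log_probs[np.arange(len(labels)), labels]      # -log p(target)
    w = class_weights[labels]                                     # weight of each example
    return float((w * per_example).sum() / w.sum())

# Toy batch: four examples, two classes, class 1 is under-represented.
labels = np.array([0, 0, 0, 1])
logits = np.zeros((4, 2))
weights = inverse_frequency_weights(labels, n_classes=2)
loss = weighted_cross_entropy(logits, labels, weights)
```

With this weighting, an error on the rare class contributes more to the loss than an error on a frequent class, which counteracts the imbalance between playing techniques.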

Studio On Line dataset
We use a subset of the Studio On Line dataset (version 0.9HQ) [7] that includes five types of PMTs: tremolo, flatterzunge, trill, bisbigliando, and vibrato.We call this subset SOL-PMT, which contains 2530, 0.5

Evaluation
We extract the scattering features for the SOL-PMT dataset with 8 and 2 filters per octave at the first and second order, respectively. The averaging scale of the lowpass filter is 2^13, resulting in a frame size of T = 2^13 (186 ms).

Local relevance maps
Decomposing the predictions f(Sx) to the input Sx, we obtain 1200 × 32 relevance values R0(Sx) for each test audio example. To obtain example-wise (local) relevance maps, we average the relevance over the 32 time frames and visualize it in terms of carrier-modulation pairs, i.e., (λ1, λ2). Another finding is that S1x, equivalent to a CQT, exhibits low relevance values for all PMTs as compared to S2x. S1x is conceptually equivalent to many popular audio front ends in deep learning, which are considered for a wide range of tasks, and it indeed shows a large amount of energy (see Fig. 1 (b) and Fig. 2 top).
The relevance values are more localized and globally structured vertically, with high values at the modulation rate of the playing techniques across pitch. This means that the convnet successfully enforces the pitch invariance that is needed for the task at hand. S1x shows almost no relevance to the prediction as compared to S2x. For a given class such as vibrato, low-energy regions in S2x exhibit high evidence, probably because they discriminate that class from the other PMTs.

CONCLUSION
We propose a framework to explicitly connect networks' predictions to the physical attributes of audio signals. This is achieved by mapping the signal into a carrier-modulation domain using the scattering transform, a surrogate of auditory perception. We then decompose the predictions of a convnet trained for playing technique classification to this domain using the layer-wise relevance propagation method. Our findings show that highly relevant regions are localized around the modulation rates of playing techniques, regardless of pitch. This explicit connection between networks' predictions and physical attributes of audio signals, fully data-driven, opens new avenues for sound production and music gesture analysis.

Fig. 1 displays the local explanation maps of five test examples, one from each class, at the same pitch A4 = 440 Hz (except A#4 = 466 Hz for trill). Panel (a) is the log-spectrogram showing the spectro-temporal characteristics of each technique, followed by the corresponding Sx visualizations in (b). The column labeled 0 in each subfigure of (b) is S1x and the remaining colored region is S2x, as annotated in the left subfigure. The same regions in (c) and (d) correspond to the relevance values for S1x and S2x, respectively. Fig. 1 (c) and (d) are the relevance maps R0(Sx) for each class obtained by applying the LRP-ε and LRP-[ε, z+] rules, respectively. A core observation is that relevance scores are localized around the modulation rate of the playing techniques, e.g., 8 Hz for trill and 32 Hz for flatterzunge. Indeed, these are the physical attributes with which the playing techniques are performed according to our knowledge of music gestures. Additionally, relevance values do not positively correlate with scattering energy. High-energy regions in Sx, e.g., with λ2 > 64 Hz for trill, tremolo, and flatterzunge, do not show strong evidence in the corresponding relevance maps.
Our experiments show that S1x, despite its large energy (see Fig. 1 (b) and Fig. 2 top), has little impact on the decisions of a classifier that jointly considers S1x and S2x for PMT classification. Comparing R0(Sx) derived from (c) LRP-ε and (d) LRP-[ε, z+] for each playing technique class, we notice that the latter provides more contrasted relevance values. To quantify the effect of the LRP rules, we calculate the kurtosis of R0(Sx) for each class over its test examples. The mean kurtosis values obtained from the LRP-ε and LRP-[ε, z+] rules are 13.82 and 19.22, respectively, and the corresponding standard deviations are 8.81 and 13.72. The higher values from the LRP-[ε, z+] rule confirm our observations in Fig. 1 (c) and (d). Therefore, we use the LRP-[ε, z+] rule for class-wise relevance aggregation.

Class-wise aggregation
We propose to derive class-wise explanations by aggregating local relevance maps in the test set. For a specific class, we first register the locations of the top-n maximal values of each local relevance map R^k_0(Sx):
$$P^k = \underset{(\lambda_1, \lambda_2)}{\operatorname{top-}n\,\operatorname{argmax}}\; R_0^k(Sx), \tag{13}$$
where k = 1, ..., K indexes the test examples of this class. Let I^k be a λ1 × λ2 matrix whose entry at (λ1, λ2) is 1 if (λ1, λ2) ∈ P^k and 0 otherwise. Summing I^k over the test examples yields the class-wise aggregated map: $R_0(Sx) = \sum_{k=1}^{K} I^k$. Fig. 2 bottom shows the top-5 argmax, i.e., n = 5 in Eq. (13), aggregated relevance maps for the five PMT classes. To show the corresponding input, we display the averaged scattering coefficients over the test examples for each class (see Fig. 2 top). Similarly to Fig. 1, the left column in each subfigure corresponds to S1x and the remaining colored region is S2x. The class-aggregated relevance maps further support our findings from the local maps in Section 4.2.
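The aggregation procedure described in this section can be sketched in a few lines of NumPy. The function names are hypothetical, and the local relevance maps are assumed to be already averaged over time and arranged on a (λ1, λ2) grid.

```python
import numpy as np

def top_n_locations(relevance_map, n):
    """Row and column indices of the n largest entries of a 2-D relevance map."""
    flat_idx = np.argsort(relevance_map, axis=None)[-n:]    # flat indices of the top n
    return np.unravel_index(flat_idx, relevance_map.shape)

def aggregate_class_relevance(local_maps, n=5):
    """Count, per (lambda1, lambda2) cell, how many examples rank it among their top n."""
    local_maps = np.asarray(local_maps)
    aggregated = np.zeros(local_maps.shape[1:], dtype=int)
    for relevance_map in local_maps:                        # one map per test example k
        rows, cols = top_n_locations(relevance_map, n)
        aggregated[rows, cols] += 1                         # add the indicator matrix I^k
    return aggregated

# Two toy local maps on a 2 x 3 grid, aggregated with n = 2.
maps = np.array([[[0.0, 5.0, 1.0],
                  [2.0, 9.0, 0.0]],
                 [[7.0, 0.0, 0.0],
                  [0.0, 8.0, 1.0]]])
class_map = aggregate_class_relevance(maps, n=2)
```

Each entry of the aggregated map lies between 0 and K, and the entries sum to K times n. A cell that is highly relevant for most examples of the class therefore stands out regardless of the absolute scale of the individual relevance maps.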

Table 1 lists the classification accuracy for each PMT class. The nearly perfect scores demonstrate the effectiveness of Sx for PMT recognition. The considerable performance drop after removing S2x from the full feature map verifies the importance of S2x for the discrimination of PMTs.
Coefficients with carrier frequencies below 32 Hz are removed, as those coefficients represent spectro-modulations that are inaudible. The audio examples, which differ in length, are set to 2^18 samples (around 6 seconds) by truncating or zero-padding. The full feature map Sx for each audio example then has size 1200 × 32, where 1200 is the feature dimension and 32 is the number of time frames, each spanning 2^13 samples (186 ms). After randomly shuffling the data, we split each playing technique class into training, validation, and test subsets in a 6:2:2 ratio for each instrument. We provide a full description of the split and the file IDs on the companion website.
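The length normalisation and the 6:2:2 split can be sketched as follows. Both helpers are hypothetical. In particular, the rounding used to turn the ratio into subset sizes and the random seed are assumptions, and the actual split is the one documented on the companion website.

```python
import numpy as np

def fix_length(x, target_len=2 ** 18):
    """Truncate or zero-pad a 1-D signal to exactly target_len samples."""
    x = np.asarray(x)
    if len(x) >= target_len:
        return x[:target_len]
    return np.pad(x, (0, target_len - len(x)))       # zeros appended at the end

def split_622(n_items, seed=0):
    """Shuffle item indices and split them into train/validation/test by 6:2:2."""
    order = np.random.default_rng(seed).permutation(n_items)
    n_train = int(round(0.6 * n_items))
    n_val = int(round(0.2 * n_items))
    return order[:n_train], order[n_train:n_train + n_val], order[n_train + n_val:]

short = fix_length(np.ones(10), target_len=16)       # padded with six zeros
long = fix_length(np.ones(20), target_len=16)        # truncated to 16 samples
n_frames = 2 ** 18 // 2 ** 13                         # fixed length divided by frame size
train_idx, val_idx, test_idx = split_622(10)
```

To follow the description in the text, the split would be applied separately to the examples of each playing technique and each instrument, so that every instrument contributes to all three subsets.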