Audio-Visual Speaker Diarization in the Framework of Multi-User Human-Robot Interaction

Speaker diarization answers the question "who is speaking at a given time?". This information is valuable for scene analysis in domains such as robotics. In this paper, we introduce a temporal audio-visual fusion model for multi-user speaker diarization that has low computing requirements, good robustness, and no training phase. The proposed method identifies the dominant speakers and tracks them over time by measuring the spatial coincidence between sound locations and visual presence. The model is generative, its parameters are estimated online, and it does not require training. Its effectiveness was assessed on two datasets: a public one and one collected in-house with the Pepper humanoid robot.


INTRODUCTION
Matching a speech signal to its emission source is crucial for accurately analyzing a scene with several speakers. This task, commonly named speaker diarization, consists in assigning audio segments to classes that correspond to speaker identities. It answers the question "who spoke when?" [1]. Early work on speaker diarization focused on the audio modality [2,3]. Nowadays, typical audio-only speaker diarization systems are composed of three components: (1) speech segmentation, where the audio input is decomposed into short segments, each assumed to contain a single speaker, and the noise is filtered out; this component can be seen as a voice activity detection module; (2) extraction of audio embeddings from the segments, the most notable being MFCCs [4], speaker factors [5] and i-vectors [6]; and (3) a clustering module, where the extracted embeddings are clustered into speakers and the number of speakers is determined. It is also possible to pre-process the signal upstream with deep-learning-based speech enhancement and denoising techniques, which lead to significant improvements [7]. Speaker representation has been largely improved by neural networks and by newer embeddings such as d-vectors [8] and x-vectors [9].
An interesting alternative consists in fusing audio and visual data. The two modalities provide complementary information, so audio-visual approaches to speaker diarization are likely to be more robust than audio-only approaches. Sound can be associated with face or mouth tracking through spatial coincidence on the image plane [10,11], on a ground plane [12] or in 3D [13]. Methods that track persons in 3D require spatially distributed camera networks and microphone arrays, which are impractical in complex (real) scenarios. On the other hand, methods relying on the image plane or the ground plane may lack information and suffer much more from occlusion, but they offer easier data collection and can be used in a wider range of scenarios. In addition to mouth position, several methods exploit the synergy between utterances and lip movements through approaches such as mutual information [2] and deep learning [14].
In this study, we propose a method that models this fusion through the spatial coincidence of visual and sound source localization (SSL), and combines this concordance model with a dynamic Bayesian formulation that tracks the identity of the active speaker. SSL provides several benefits in multi-user conversations, such as the ability to handle overlapping speech segments and the removal of the need for a voice activity detection (VAD) module. The proposed method can be applied in various acoustic conditions by leveraging spatial information from SSL and face location. This paper is organized as follows: Section 2 presents our model for multi-user speaker diarization, and Section 3 presents the experimental setup and the evaluation of the proposed method.

Problem definition
First, we introduce the notations and definitions of variables. Scalars are written in italic, vectors in bold italic, and matrices in underlined italic. Upper-case letters denote random variables while lower-case letters denote their realizations. We denote by $t$ the time-step index of both visual and audio frames, which are synchronized with each other. At frame $t$, there are at most $N$ visual observations, $X_t = (X_{t1}, \ldots, X_{tn}, \ldots, X_{tN}) \in \mathbb{R}^{2 \times N}$, where the random variable $X_{tn}$ corresponds to the mouth location of person $n$ in image $t$. A multi-person tracker provides a time series of $N$ image locations, namely $X_{1:t} = \{X_1, \ldots, X_t\}$, and the associated visual-presence binary masks $V_{1:t}$: the variable $V_{tn}$ is associated with $X_{tn}$ such that $V_{tn} = 1$ if person $n$ is present in image $t$ and $V_{tn} = 0$ otherwise. $N_t = \sum_n V_{tn}$ is the number of persons observed at frame $t$. In practice, when $V_{tn} = 0$, we set $X_{tn} = X_{t-i,n}$, with $t - i$ the most recent frame where $V_{t-i,n} = 1$. We also consider an SSL module that provides the azimuth and elevation of the dominant sound source at each audio frame $t$. The sound-source location can then be mapped onto the image plane, such that an azimuth-elevation pair of observations is transformed into an image location modeled by a random variable. We write $Y_t = (Y_{t1}, \ldots, Y_{tk}, \ldots, Y_{tK}) \in \mathbb{R}^{2 \times K}$ for the $K$ audio-visual observations associated with the visual frame at $t$, and $Y_{1:t} = \{Y_1, \ldots, Y_t\}$ for the corresponding time series. To these audio-visual observations we associate speech-activity binary masks $A_{1:t} = \{A_1, \ldots, A_t\}$, such that $A_t = 1$ if there is an active audio source at frame $t$ and $A_t = 0$ otherwise. The objective is to track the dominant speaker(s) at time $t$ by associating, over time, the audio activity (if any) with one of the tracked persons. Audio sources outside the image are not taken into account. This is also referred to as audio-visual speaker diarization, addressed below in the framework of temporal graphical models.
We introduce a time series of discrete latent variables, $S_{1:t} = \{S_1, \ldots, S_t\}$, such that $S_t = n$, $n \in \{1, 2, \ldots, N\}$, if person $n$ is both observed and speaks at frame $t$, and $S_t = 0$ if none of the visible persons speaks at frame $t$. Notice that $S_t = 0$ covers two different cases: either there is at least one active sound source at $t$ ($A_t = 1$) but its location cannot be associated with any of the visible persons, in which case it can be interpreted as noise, or there is no active sound source at $t$ ($A_t = 0$). We also use another latent variable, $Z_{t,1:K} = (Z_{t1}, \ldots, Z_{tK})$, where $Z_{tk} = n$ represents the attribution of sound source $k$ to visual identity $n$, and $Z_{tk} = 0$ means that source $k$ is not assigned to any person in the image.
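The convention of holding the last observed mouth location when a person is not visible, together with the count $N_t$, can be illustrated with a minimal Python sketch. The function name `fill_missing_positions` and the list-based data layout are assumptions made for illustration only; they do not describe the tracker used in this work.

```python
def fill_missing_positions(positions, visible):
    """Hold the last observed location of each person while they are not visible.

    positions[t][n] is an (x, y) tuple, or None when person n is not detected.
    visible[t][n] is the binary presence mask V_tn (0 or 1).
    Both layouts are hypothetical and chosen for illustration.
    """
    last_seen = {}  # person index -> most recent observed location
    filled = []
    for frame_positions, frame_visible in zip(positions, visible):
        row = []
        for n, (pos, v) in enumerate(zip(frame_positions, frame_visible)):
            if v == 1:
                last_seen[n] = pos
            # X_tn = X_{t-i,n}; stays None if person n was never observed
            row.append(last_seen.get(n))
        filled.append(row)
    return filled


def persons_per_frame(visible):
    """N_t = sum_n V_tn for every frame t."""
    return [sum(frame_visible) for frame_visible in visible]
```

A person who has never been seen keeps a `None` location, which a caller would have to exclude from the mixture model.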

Speaker Diarization Model
The temporal speaker diarization problem can be formulated as finding a maximum-a-posteriori (MAP) solution, namely finding the most probable configuration of the latent state $S_t$ that maximizes the following posterior probability distribution, also referred to as the filtering distribution:
$$\hat{s}_t = \arg\max_{s_t} P(S_t = s_t \mid u_{1:t}) \tag{1}$$
Following Bayes' formula, the posterior probability in (1) can be developed as:
$$P(S_t = s_t \mid u_{1:t}) = \frac{P(u_t \mid S_t = s_t)\, G_{s_t}}{\sum_{n=0}^{N} P(u_t \mid S_t = n)\, G_n} \tag{2}$$
with $G_n = \sum_{i=0}^{N} P(S_t = n \mid S_{t-1} = i)\, P(S_{t-1} = i \mid u_{1:t-1})$ and $u_t = (x_t, v_t, y_t, a_t)$. The evaluation of (2) is recursive, and a reasonable number of simultaneously tracked persons (5 to 8) needs to be considered in order to keep the computation tractable. Computing (2) requires the observed likelihood $P(u_t \mid S_t = s_t)$ and the transition probabilities $P(S_t = j \mid S_{t-1} = i)$, which are explained in the next two subsections.
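One step of this recursion can be sketched in a few lines of Python. This is a minimal illustration of a standard discrete Bayesian filter built from the quantities named above ($G_n$, the observed likelihood and the transition probabilities); the function name `filter_step` and the list-based arguments are assumptions, not the implementation used in the experiments.

```python
def filter_step(prev_posterior, transition, likelihood):
    """One recursive update of the filtering distribution over S_t.

    prev_posterior[i] = P(S_{t-1} = i | u_{1:t-1})
    transition[i][n]  = P(S_t = n | S_{t-1} = i)
    likelihood[n]     = P(u_t | S_t = n), known up to a common factor
    State 0 means "no visible person speaks".
    """
    n_states = len(prev_posterior)
    # G_n = sum_i P(S_t = n | S_{t-1} = i) P(S_{t-1} = i | u_{1:t-1})
    predicted = [
        sum(transition[i][n] * prev_posterior[i] for i in range(n_states))
        for n in range(n_states)
    ]
    unnormalized = [likelihood[n] * predicted[n] for n in range(n_states)]
    total = sum(unnormalized)
    if total <= 0.0:
        raise ValueError("all states have zero probability")
    posterior = [value / total for value in unnormalized]
    # MAP estimate of the active speaker at time t
    map_state = max(range(n_states), key=lambda n: posterior[n])
    return posterior, map_state
```

Because the posterior is normalized at every step, the likelihood only needs to be known up to a factor that does not depend on the state.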

EM Audio-Visual Observation Model
The main feature of the proposed model is its ability to robustly associate the SSL at time $t$ with a person. Expectation-maximization (EM) for a Gaussian mixture model infers the posterior probability that a person utters speech from audio and visual observations that are mapped onto the same mathematical space. We distinguish two cases. If there is no audio activity at time $t$ ($A_t = 0$), the posterior is evaluated with formula (3), where $c$ is a small positive scalar, e.g., $c = 0.2$. If a sound source is active at time $t$ ($A_t = 1$), we assign it to a visual identity $n$ such that $Z_{tk} = n$ plays the role of an assignment variable in a mixture model. Its location $y_{tk}$ is assumed to be drawn from the following Gaussian/uniform mixture:
$$P(y_{tk} \mid \theta_t) = \sum_{n=1}^{N} v_{tn}\, p_{tn}\, \mathcal{N}(y_{tk}; x_{tn}, \Sigma_{tn}) + p_{t0}\, \mathcal{U}(y_{tk}; \beta) \tag{4}$$
where $\theta_t = (\{p_{tn}\}_{n=0}^{N}, \{\Sigma_{tn}\}_{n=1}^{N}, \beta)$ denotes the set of model parameters, namely the priors, which satisfy $\sum_{n=1}^{N} v_{tn} p_{tn} + p_{t0} = 1$, the $2 \times 2$ covariance matrices $\Sigma_{tn}$, and the parameter $\beta$ that characterizes the outlier component of the mixture, namely a uniform distribution. The Gaussian components are centered on the observed mouth locations $x_{tn}$, so their means are fixed. The parameter set $\theta_t$ can be estimated via the EM algorithm for Gaussian mixtures.
The algorithm begins with the E-step, which evaluates the posterior probabilities $r_{tkn}$ using the current parameter values $\theta_t$. $Z_t$ is the assignment variable: $Z_{tk} = n$ means that $y_{tk}$ is generated by component $n$. We first compute $r_{tkn}$ for all $n$, $1 \le n \le N$, which is the probability that a sound source is associated with a visible person:
$$r_{tkn} = \frac{v_{tn}\, p_{tn}\, \mathcal{N}(y_{tk}; x_{tn}, \Sigma_{tn})}{\sum_{i=1}^{N} v_{ti}\, p_{ti}\, \mathcal{N}(y_{tk}; x_{ti}, \Sigma_{ti}) + p_{t0}\, \mathcal{U}(y_{tk}; \beta)} \tag{5}$$
We can also write the probability that a sound source is not associated with a visible person ($n = 0$), either because it corresponds to a sound emitted by a non-visible person or because it is emitted by another type of source, i.e., the posterior of the uniform component of the mixture:
$$r_{tk0} = \frac{p_{t0}\, \mathcal{U}(y_{tk}; \beta)}{\sum_{i=1}^{N} v_{ti}\, p_{ti}\, \mathcal{N}(y_{tk}; x_{ti}, \Sigma_{ti}) + p_{t0}\, \mathcal{U}(y_{tk}; \beta)} \tag{6}$$
The M-step re-estimates the parameters using the current responsibilities:
$$p_{tn} = \frac{1}{K} \sum_{k=1}^{K} r_{tkn}, \qquad \Sigma_{tn} = \frac{1}{\sum_{k=1}^{K} r_{tkn}} \sum_{k=1}^{K} r_{tkn}\, (y_{tk} - x_{tn})(y_{tk} - x_{tn})^{\top} + \varepsilon I \tag{7}$$
where $\varepsilon > 0$ is a scalar acting as a regularization parameter to prevent empty clusters, and $I$ is the $2 \times 2$ identity matrix. The algorithm can be easily initialized by setting all the priors equal to $1/(N + 1)$ and all the variances equal to a positive scalar. Because the component means are fixed, the algorithm converges in only a few iterations.
We have $N$ faces and $K$ sources, which represents a combinatorial problem at each iteration: for each possible association at $t$, all the possible cases have to be considered for the next step, so the number of face-source associations grows combinatorially over time. To address this issue and keep the audio-visual model tractable, instead of computing all combinations, we reduce all sources to a single dominant source $y_{tn^*_t}$. This first requires choosing the person with the highest speaking probability, as represented by the prior, $n^*_t = \arg\max_n p_{tn}$. The mean source $y_{tn^*_t}$ is then the sound source location that is considered the most probable given $x_{tn^*_t}$, with the case $n = 0$ handled separately, and we write $u_{tn^*} = (y_{tn^*_t}, x_t, v_t, a_t)$. Finally, by noting that the observed-data likelihood $P(u_{tn^*})$ does not depend on $S_t$ and by assuming a uniform distribution over the priors of the visible persons $n$ ($v_{tn} = 1$), i.e., $\pi_{t0} = \pi_{tn} = 1/(N_t + 1)$, we obtain the following observation model:
$$P(u_{tn^*} \mid S_t = n) \propto P(S_t = n \mid u_{tn^*}) \tag{14}$$
This allows the observed likelihood (left-hand side of (14)) to be replaced with the posterior (right-hand side of (14)) in (2).
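The EM procedure with fixed means can be sketched as follows in plain Python. Several choices here are assumptions made for illustration and are not stated in the text: the uniform density is taken as $1/\beta$, the priors are re-estimated as the mean responsibility (the standard EM update), a fixed number of iterations is used, and the names `gaussian_2d` and `em_fixed_means` are hypothetical.

```python
import math


def gaussian_2d(y, mean, cov):
    """Density of a 2D Gaussian with a full 2x2 covariance matrix."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    dx, dy = y[0] - mean[0], y[1] - mean[1]
    # Quadratic form with the closed-form inverse of a 2x2 matrix
    quad = (d * dx * dx - (b + c) * dx * dy + a * dy * dy) / det
    return math.exp(-0.5 * quad) / (2.0 * math.pi * math.sqrt(det))


def em_fixed_means(sources, faces, visible, init_cov, beta, eps, n_iter=5):
    """EM for a Gaussian/uniform mixture whose means are the face locations.

    sources: K image locations y_tk; faces: N mouth locations x_tn;
    visible: N binary flags v_tn. Component 0 is the uniform outlier class.
    Returns (priors, covariances, responsibilities).
    """
    n_faces = len(faces)
    priors = [1.0 / (n_faces + 1)] * (n_faces + 1)
    covs = [[list(row) for row in init_cov] for _ in range(n_faces)]
    uniform = 1.0 / beta  # assumption: uniform density over an area beta
    resp = []
    for _ in range(n_iter):
        # E-step: responsibilities r_tkn of each component for each source
        resp = []
        for y in sources:
            weights = [priors[0] * uniform]
            for n in range(n_faces):
                density = gaussian_2d(y, faces[n], covs[n])
                weights.append(visible[n] * priors[n + 1] * density)
            total = sum(weights)
            if total > 0.0:
                resp.append([w / total for w in weights])
            else:  # numerical underflow everywhere: treat as an outlier
                resp.append([1.0] + [0.0] * n_faces)
        # M-step: priors, then covariances around the fixed means
        priors = [sum(r[n] for r in resp) / len(sources)
                  for n in range(n_faces + 1)]
        for n in range(n_faces):
            mass = sum(r[n + 1] for r in resp)
            if mass < 1e-12:
                continue  # no source supports this face; keep its covariance
            sxx = sxy = syy = 0.0
            for y, r in zip(sources, resp):
                dx, dy = y[0] - faces[n][0], y[1] - faces[n][1]
                sxx += r[n + 1] * dx * dx
                sxy += r[n + 1] * dx * dy
                syy += r[n + 1] * dy * dy
            covs[n] = [[sxx / mass + eps, sxy / mass],
                       [sxy / mass, syy / mass + eps]]
    return priors, covs, resp
```

In this sketch, a source that lies far from every face is absorbed by component 0, which is how the uniform component acts as an outlier class.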

State Transition Model Audio-Visual
The state transition probabilities, $p(S_t = j \mid S_{t-1} = i)$, provide the temporal component for tracking speech turns across time steps. They are defined case by case in (15), based on the presence or absence of persons and on their speaking status (for convenience and without loss of generality we set $v_{t0} = 1$). The first case of (15) defines the self-transition probability $p_s$, e.g., $p_s = 0.8$, of a person $i$ present at both $t - 1$ and $t$. The second case defines the transition probability from a person $i$ present at $t - 1$ to another person $j$ present at $t$. The third case simply forbids transitions from a person $i$ present at $t - 1$ to a person $j$ present at $t - 1$ but not present at $t$. The fourth case represents the transition probability from a person $i$ present at $t - 1$ but not present at $t$ to a person $j$ present at $t$. The fifth case defines the transition probability from a person $i$ not present at $t - 1$ to a person $j$ that is not present at $t$. One may easily verify that $\sum_{j=0}^{N} p(S_t = j \mid S_{t-1} = i) = 1$.
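The following Python sketch builds a simplified transition matrix in the spirit of these cases. It is not the definition in (15): apart from the self-transition probability $p_s$, all values are assumptions (the remaining mass is spread uniformly over the other present states, and a state that is absent at $t$ redistributes its mass uniformly), and only presence at time $t$ is used. The name `transition_matrix` is hypothetical.

```python
def transition_matrix(visible_now, p_s=0.8):
    """Simplified transition matrix over states 0..N.

    visible_now[n] is 1 if state n is available at time t; state 0
    ("no visible person speaks") must always be available.
    Returns matrix[i][j] approximating p(S_t = j | S_{t-1} = i).
    """
    if visible_now[0] != 1:
        raise ValueError("state 0 must always be available")
    n_states = len(visible_now)
    present = [j for j in range(n_states) if visible_now[j] == 1]
    matrix = []
    for i in range(n_states):
        row = [0.0] * n_states  # transitions to absent persons stay at 0
        if visible_now[i] == 1:
            others = [j for j in present if j != i]
            if others:
                row[i] = p_s  # the same state keeps the floor
                for j in others:  # assumed: uniform hand-over to the others
                    row[j] = (1.0 - p_s) / len(others)
            else:
                row[i] = 1.0
        else:
            # assumed: a state that has left the scene hands over uniformly
            for j in present:
                row[j] = 1.0 / len(present)
        matrix.append(row)
    return matrix
```

Each row of the returned matrix sums to one, so it can be used directly in a recursive filtering update.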

Data
The proposed method is evaluated on two corpora. The first corpus is CAV3D [15]. It contains 20 sequences with durations ranging from 15 s to 80 s. The evaluation is conducted on the subset SOT, composed of 9 sequences with a single speaker, and on the subset SOT2, composed of 6 sequences with a single active speaker and a second, interfering person who does not speak.
The second corpus was recorded in-house on Pepper, a humanoid robot from SoftBank Robotics. It contains dialogues between two or three persons. A total of 9 different subjects participated in this experiment: 2 women and 7 men. Participants were asked to speak one at a time and to avoid overlapping. We introduced variations in the dialogues by asking participants to randomly move in and out of the scene, and to face the robot or look at each other. Different positions in the room were used to obtain different acoustic configurations. In three-person dialogues, the participant positioned in the middle was asked to remain silent and act as a distractor. The total duration is around eleven minutes, and the dominant speaker is carefully labeled in each frame. Windows in which a bounding box is either undetected or inaccurately computed are set aside. The Pepper SSL was observed to occasionally fail to activate; since the Pepper SSL module is not calibrated, windows containing speech but no SSL detection are discarded as well.

Technical Specifications
For this experiment, the speaker diarization model is implemented for CAV3D and the Pepper corpus with some differences. For CAV3D, SSLs are extracted with [16], and an interpolation from the 3D coordinates given by the SSL to the speaker mouth location is fitted using the SOT subset. The Pepper SSL module retrieves the direction of the emitting source (azimuth and elevation angles) from the TDOAs measured on the different microphone pairs. The angles provided by the sound source localization engine match the real position of the source with an average accuracy of 10 degrees. The transition from angles to image-plane locations is made by interpolation: we record the SSL output for a loudspeaker placed at different positions in the image and perform a regression to obtain a mapping from angles to image positions. Sound sources located outside the image are filtered out. We calibrate and fine-tune the parameters for each microphone configuration using the first SOT sequence for CAV3D and a training sequence for the Pepper corpus, in preparation for the testing phase. Thus we set $\Sigma = \mathrm{Diag}[300, 800]$, $\beta = 10^7$, $\varepsilon = 100$ for CAV3D and $\Sigma = \mathrm{Diag}[300, 500]$, $\beta = 300000$, $\varepsilon = 200$ for the Pepper corpus. The remaining parameters are shared between both experiments: $c = 0.2$ and $p_s = 0.8$.
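The angle-to-image calibration can be illustrated with a short Python sketch. The form of the regression is not specified in the text; as an assumption, the sketch fits two independent least-squares lines (azimuth to horizontal pixel, elevation to vertical pixel). The function names and the calibration tuple layout are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def fit_angle_to_pixel(calibration):
    """calibration: list of (azimuth_deg, elevation_deg, u_px, v_px) tuples,
    e.g. recorded with a loudspeaker placed at known image positions."""
    azimuths, elevations, us, vs = zip(*calibration)
    return fit_line(azimuths, us), fit_line(elevations, vs)


def angle_to_pixel(mapping, azimuth, elevation, width, height):
    """Map an SSL direction to the image plane; None if outside the image."""
    (slope_u, offset_u), (slope_v, offset_v) = mapping
    u = slope_u * azimuth + offset_u
    v = slope_v * elevation + offset_v
    if 0.0 <= u < width and 0.0 <= v < height:
        return (u, v)
    return None  # sources outside the image are filtered out
```

A linear fit is the simplest option; if the camera has noticeable lens distortion, a higher-order regression may be a better fit for the same calibration data.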

Results
The diarization performance is evaluated with the Diarization Error Rate (DER); the lower, the better. It is the sum of three terms: Missed Detection (MS), False Alarm (FA), and Speaker Error (SPKE).
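A frame-level version of this metric can be sketched in Python as follows. The sketch assumes one dominant speaker label per frame, hypothesis identities already mapped to reference identities, and no forgiveness collar; scoring tools used in practice typically work on time segments instead. The function name `frame_der` is hypothetical.

```python
def frame_der(reference, hypothesis):
    """Frame-level DER and its three terms, as fractions of reference speech.

    Labels: 0 = nobody speaks, n >= 1 = person n is the dominant speaker.
    """
    if len(reference) != len(hypothesis):
        raise ValueError("reference and hypothesis must have the same length")
    missed = false_alarm = speaker_error = speech = 0
    for ref, hyp in zip(reference, hypothesis):
        if ref != 0:
            speech += 1
            if hyp == 0:
                missed += 1          # speech present but not detected
            elif hyp != ref:
                speaker_error += 1   # speech assigned to the wrong person
        elif hyp != 0:
            false_alarm += 1         # speaker predicted during silence
    if speech == 0:
        raise ValueError("reference contains no speech frames")
    return {
        "MS": missed / speech,
        "FA": false_alarm / speech,
        "SPKE": speaker_error / speech,
        "DER": (missed + false_alarm + speaker_error) / speech,
    }
```

The values are fractions of the reference speech duration; multiplying by 100 gives percentages.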
To evaluate the influence of acoustic conditions, we test on the CAV3D dataset with and without an oracle VAD. These results can be compared to those of the state of the art (SOTA) in audio-visual speaker diarization, WST [14], and in audio-only speaker diarization, VBx [17], on the AMI corpus [18]. The AMI corpus is a collection of meetings that shows similarities with the Pepper corpus. WST yields DERs of 21.3% and 21.1% on the two AMI subsets ES and IS, and VBx yields 38.65% on the whole AMI corpus. Both are computed without an oracle VAD. We observe comparable results, taking into account the variation of DER between datasets. The theoretical complexity of the algorithm is $O(n^2)$. The audio-visual observation model accounts for 99% of the running time: for 10 detected sound sources and 5 considered persons, it takes 0.0631 seconds out of a total running time of 0.0635 seconds.

CONCLUSION
We proposed a model for temporal speaker diarization based on principled mathematical and algorithmic concepts, coupled with two types of perception: SSL and the image plane. This model shows good results and adapts to different acoustic conditions without a training phase. The diarization method is therefore not biased towards a particular training dataset, which makes it applicable to a large number of practical human-robot interaction scenarios. The VAD function and the robustness are carried by the uniform component of the mixture, which collects the sound source locations that are far from the Gaussian components centered on the faces. However, the audio-image fusion requires an interpolation between SSL and the image plane for every microphone configuration; removing this dependency on the microphone configuration is a challenge for future work.

Table 1. Performance (%) of our model on the different experimental sets. $A_t$ is set either with an oracle VAD derived from the diarization labels, or from the presence or absence of SSL.
The use of an oracle significantly lowers the DER, as it gives valuable information to reduce the number of FA. Results on CAV3D are promising: the model loses only 4.28% of DER between SOT and SOT2 without oracle VAD. With a DER of 19.27% on the Pepper corpus, we can assume that our model fulfills its diarization goal in a standard robotic case. The method shows interesting results on SOT2, with a SPKE of 0%: whenever it detects a speaker, the prediction matches the right person. However, this performance decreases substantially on the Pepper dataset, which is due to more complex scenarios and may also be related to the quality of the Pepper microphones.