Conversational Feedback in Scripted versus Spontaneous Dialogues: A Comparative Analysis

Scripted dialogues such as movie and TV subtitles constitute a widespread source of training data for conversational NLP models. However, there are notable linguistic differences between these dialogues and spontaneous interactions, especially regarding the occurrence of communicative feedback such as backchannels, acknowledgments, or clarification requests. This paper presents a quantitative analysis of such feedback phenomena in both subtitles and spontaneous conversations. Based on conversational data spanning eight languages and multiple genres, we extract lexical statistics, classifications from a dialogue act tagger, expert annotations, and labels derived from a fine-tuned Large Language Model (LLM). Our main empirical findings are that (1) communicative feedback is markedly less frequent in subtitles than in spontaneous dialogues and (2) subtitles contain a higher proportion of negative feedback. We also show that dialogues generated by standard LLMs lie much closer to scripted dialogues than to spontaneous interactions in terms of communicative feedback.


Introduction
While the amount of text data available for training or fine-tuning LLMs is large and growing steadily, spoken conversational data remains relatively scarce. Although corpora of spontaneous spoken interactions have been collected for various languages (Dingemanse and Liesenfeld, 2022), those are generally of modest size and limited to specific topics or tasks. Due to this scarcity of available data, a common approach for the development of conversational models is to rely on corpora of authored dialogues extracted from movie scripts (Danescu-Niculescu-Mizil and Lee, 2011) or movie and TV subtitles (Lison et al., 2018; Davies, 2021).
However, those dialogues are markedly different from spontaneous interactions. Most importantly, movie scripts and subtitles are explicitly written with the aim of narrating a story. Subtitles must also abide by strict length constraints, and thus tend to only transcribe the most salient part of each turn. As a consequence, many conversational phenomena such as disfluencies (Shriberg, 1996), overlapping talk (Schegloff, 2000), and backchannels (Yngve, 1970) are either absent or uncommon in those dialogues, unless their presence happens to contribute to the storyline (Berliner, 1999; Chepinchikj and Thompson, 2016).
This paper provides a quantitative analysis of how subtitles differ from spontaneous dialogues, focusing more specifically on conversational feedback (Allwood et al., 1992) and grounding (Clark and Schaefer, 1989) phenomena. To highlight differences in linguistic properties between subtitles and spontaneous conversation corpora, we first compile a range of lexical statistics and use a dialogue act tagger to estimate the relative frequencies of various feedback signals. To obtain more fine-grained estimates on three core feedback categories, respectively Agreement / Acceptance, Acknowledgement / Backchannel and Negative Feedback, we collect manual annotations on multiple dialogue samples and fine-tune an LLM on those annotations to automatically detect the presence of such feedback in our corpora. Finally, we apply the fine-tuned LLM to synthetic dialogues generated with standard autoregressive LLMs, and show that those dialogues are comparatively much closer to scripted dialogues than to spontaneous interactions when it comes to the frequency and type of conversational feedback. Those experiments are conducted for eight languages (English, Chinese, French, German, Hungarian, Italian, Japanese and Norwegian) for which corpora of spontaneous dialogues are readily available.
The paper is structured as follows. Section 2 reviews related work, and Section 3 presents the corpora employed in our experiments. Section 4 describes the observed lexical distributions of feedback phenomena and Section 5 compares them to estimates derived with a dialogue act tagger. In Section 6, we describe the manual annotation of dialogue samples and the fine-tuning of an LLM to automate this process. Finally, Section 7 describes the results of applying this LLM-based method to synthetic dialogues, and Section 8 concludes.

Conversational Feedback and Grounding
A key aspect of any communicative activity is the management of the common ground, a process often called conversational grounding (Clark and Schaefer, 1989). The study of grounding and related phenomena, such as conversational feedback (Allwood et al., 1992), has been instrumental to cognitive approaches to communication (Clark, 1996), and to dialogue system development (Traum, 1994; Paek and Horvitz, 2000; Yaghoubzadeh et al., 2015).
Feedback and grounding can happen at any of the levels of communication, which include simple contact, perception, understanding and higher-level evaluation of what has been said (Allwood et al., 1992; Clark, 1996). Conversational feedback may appear at different positions in a dialogue. However, a number of corpus studies found that it tends to occur at specific places, mostly where it causes little interference (Kjellmer, 2009). These places of occurrence have also been referred to as Feedback Relevant Spaces (Heldner et al., 2013; Howes and Eshghi, 2021). Although, arguably, any utterance relates directly or indirectly to grounding (through implicit and high-level pragmatic inference, Clark and Schaefer 1989), acknowledgments and other positive feedback signals (see Ex. (1)), along with repair (see Ex. (2)), have been identified as the most prominent grounding mechanisms (Jefferson, 1972; Bunt, 1994). Their frequency in human-human dialogue is known to be very high (e.g., Stolcke et al., 2000a), and they appear universally across languages (Liesenfeld and Dingemanse, 2022; Dingemanse et al., 2015). These conversational signals, while they do not cover all grounding phenomena, can therefore be seen as a useful proxy to quantify feedback in a dialogue.

(1) A: and uh it really does irk me to see those guys out there uh you know making that ///much money///
    B: ///yeah///

(Notation: ///text/// is produced in overlap with the speech of the other speaker. Example from Switchboard, Godfrey et al., 1992.)

Recent works have emphasized the role of feedback and grounding signals in their study of human-human conversations (Fusaroli et al., 2017; Dideriksen et al., 2022; Dingemanse and Liesenfeld, 2022) as well as human-agent interaction (Visser et al., 2014; Hough and Schlangen, 2016; Buschmeier and Kopp, 2018; Axelsson et al., 2022).
The literature tends to merge the two closely related concepts of backchannels and acknowledgments. Backchannels (Yngve, 1970), or continuers (Schegloff, 1982), are not positioned on the main channel, but uttered by the "listener", often as low-intensity, unobtrusive overlapping speech (Heldner et al., 2010) or non-verbally (Allwood et al., 2007; Truong et al., 2011). Acknowledgments, on the other hand, have a slightly broader, functional definition as minimal positive feedback (Jefferson, 1984; Allwood et al., 1992).
There is a large body of work on lexical markers, also called cue phrases or discourse markers (Jefferson, 1984; Allwood et al., 1992; Muller and Prévot, 2003), since they present interesting linguistic features and constitute convenient explicit cues for automatically detecting feedback utterances (Jurafsky et al., 1998; Gravano et al., 2012; Prévot et al., 2015). Gravano et al. (2012) developed a list of affirmative cue words consisting of alright, mm-hm, okay, right, uh-huh and yeah. Form-function studies of similar lists have been carried out at least for Swedish (Allwood, 1988), U.S. English (Ward, 2006), and French (Prévot et al., 2015).
Few studies have, however, concentrated on direct negative feedback associated with rejection and corrective dialogue acts. Although Allwood et al. (1992) suggest a polarity dimension for characterizing feedback, most recent studies have focused on positive feedback. Indeed, in collaborative dialogue and everyday conversations, which are the two genres dominating available datasets, positive feedback constitutes the large majority of explicit feedback (e.g., Malisz et al., 2016). Negative feedback is instead often expressed constructively, using repair mechanisms, specifically clarification requests (Purver, 2004). These may rely on simple lexical cues (e.g., for English, pardon?, huh?), sluices (such as what?, who?), or on clarification ellipsis, as in the following example (Fernández et al., 2007):

(2) A: and then we're going to turn east
    B: mmhmm
    A: not straight east slightly sort of northeast
    B: slightly northeast?

The occurrence of feedback signals in dialogue transcriptions can be detected using various types of sequence labeling models, from classical hidden Markov models (Stolcke et al., 2000b) to modern neural architectures and large language models (Liu et al., 2017; Noble and Maraev, 2021).

Analysis of Subtitles
Subtitles are typically short written text snippets that accompany audiovisual content on the screen. Due to constraints on length, readability and writing conventions, they are often subject to condensation and normalization, whereby non-standard verbal elements (repetitions, signs of hesitation, etc.) are omitted or replaced by more standard alternatives (Gottlieb, 2012). As subtitles are displayed alongside audiovisual content, viewers can typically recover omitted dialogue-relevant cues from the accompanying images and sounds. Interlingual subtitling, where the original language of the audio differs from the subtitling language, is somewhat different from intralingual subtitling, in which audio and subtitles share the same language and non-verbal elements are also rendered in writing for the benefit of hearing-impaired audiences or non-native speakers (Gottlieb, 2012).
Rühlemann (2020) compared real conversations with scripted ones and observed that continuers were absent from the latter. Prevot et al. (2019) compared data from the OpenSubtitles corpus (Lison and Tiedemann, 2016; Lison et al., 2018) in English, French and Mandarin with both written and conversational corpora, and found that OpenSubtitles occupied an intermediate position between written and conversational data in terms of lexical and syntactic features. This paper builds upon those earlier works but focuses specifically on communicative feedback, using a combination of lexical statistics and manual and automated annotations to quantify its frequency in various corpora.

Corpora
We rely on data from both OpenSubtitles and existing, publicly available corpora of real conversations covering eight different languages (see Table 1).

Spontaneous Dialogues
German (de)

We use the Hamburg MapTask corpus (HZSK, 2010), in which twelve dyads of (L2) speakers of German engage in short task-oriented dialogues.

Italian (it)
We use the CLIPS corpus (Savy and Cutugno, 2009), consisting of both a map task and a difference-spotting task between images. We exclude dialogues with a high proportion (> 10%) of utterances containing dialectal words.

Japanese (ja)
This language is represented by the transcripts of the CallHome Japanese corpus (Den and Fry, 2000) consisting of 120 unscripted telephone conversations between native speakers, mostly family members or close friends.

Norwegian (no)
We use the NoTa-Oslo corpus (Johannessen et al., 2007), containing interviews and conversations recorded in 2004-2006 with 166 informants from the Oslo area. The dialogues consist of 10-minute semi-formal interviews and 30-minute informal dialogues with other informants.

Mandarin Chinese (zh)
The source of our Mandarin Chinese data was CALLHOME (Wheatley, 1996), consisting of unscripted telephone conversations between native speakers.

Subtitles
The scripted dialogues are extracted from OpenSubtitles 2018 (Lison et al., 2018), a large collection of over 3.7 million subtitles (amounting to ≈ 22.1 billion words) extracted from the OpenSubtitles.org database and covering 60 languages. We include both (1) subtitles for the hearing impaired, where the subtitle language and the original audio language are identical, and (2) subtitles for foreign audiences. The subtitles are then filtered according to several criteria. Only recent movies (year ≥ 1990) are included to reflect contemporary language use, as is the case for the corpora of spontaneous conversations. We also omit subtitles with fewer than 100 utterances and exclude genres that are less relevant for this study (Documentary, Reality-TV, Biography, Sport, Musical, Music, Adult, Animation, Short and Game-Show).
We sample up to ten movies per audience type (hearing impaired vs. foreign audience) from the five largest genres, namely drama, comedy, crime, action, and romance. Table 1 shows the number of movies and utterances per language for the selected subtitles. Note that subtitles are typically segmented by dialogue turns or sentences rather than utterances. The term "utterance" should therefore be understood broadly in this paper.
This paper focuses on the textual aspects of grounding phenomena. While speech-related and non-linguistic aspects of communicative feedback (such as timing, intonation, gestures or gaze) are both important and well studied, in particular for acknowledgements and backchannels, this information is not available in subtitle corpora, which are intrinsically limited to text transcriptions.

Lexical Analysis
Lexical statistics of acknowledgment cues give us a first picture of feedback frequency. Acknowledgments tend to be produced by the addressee (not the main speaker) and are therefore often short productions, uttered in overlap and potentially with a lower voice. Of those three properties (brevity, overlap, lower volume), only the first is practically measurable in our experiments, as the subtitles are by construction text-based.
Given their relation to acknowledgments, we first analyse "very short utterances" (Edlund et al., 2009), defined here as three tokens or less. Feedback is also very well represented at the initial positions of longer turns/contributions. We therefore targeted two locations: very short utterances (all tokens) and the initial positions (one token) of all other utterances. Comparing term frequencies between these locations and the overall corpus allowed us to compile language-specific lists of cue words, as sketched after the list below. Those lists of cue words (presented in Table 3 in the Appendix) are divided into four core classes of feedback:
• positive feedback/acknowledgment (+)
• neutral/continuer (=)
• negative feedback (-)
• clarification request (?)
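To make the extraction step concrete, the following is a minimal sketch of such a cue word candidate search over tokenized utterances; the count threshold and log-ratio cutoff are illustrative values, not the ones used in the paper, and the resulting candidates would still be manually curated into the four classes above:

```python
import math
from collections import Counter

def cue_word_candidates(utterances, max_short_len=3, min_count=20, min_log_ratio=2.0):
    """Rank tokens over-represented in feedback-prone locations (very short
    utterances and turn-initial positions) relative to the overall corpus.
    `utterances` is a list of token lists, one per utterance."""
    target, overall = Counter(), Counter()
    for tokens in utterances:
        overall.update(t.lower() for t in tokens)
        if 0 < len(tokens) <= max_short_len:
            target.update(t.lower() for t in tokens)  # all tokens of very short utterances
        elif tokens:
            target[tokens[0].lower()] += 1            # initial token of longer utterances
    t_total, o_total = sum(target.values()), sum(overall.values())
    scored = []
    for tok, count in target.items():
        if count < min_count:
            continue
        # Log-ratio of relative frequencies: high values indicate tokens
        # concentrated in the feedback-prone locations.
        log_ratio = math.log2((count / t_total) / (overall[tok] / o_total))
        if log_ratio >= min_log_ratio:
            scored.append((tok, log_ratio))
    return sorted(scored, key=lambda x: -x[1])
```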
We plot in Figure 1 the frequencies of those feedback classes in each corpus, both in terms of absolute frequency (left side) and in terms of the relative proportions of the feedback classes (right side). Figure 2 shows the distribution of the most frequent lexical items observed in the utterances of plot (b) for English.
We observe that the statistics based on cue words differ substantially between subtitles and spontaneous dialogues. This difference is observed across all languages and sub-genres (see Appendix A for other languages). We sought to identify and reduce other sources of variation between corpora. STAC, as a chat corpus, exhibits different patterns than other dialogue corpora, notably due to the presence of emojis. Similarly, for English and French, we explored the impact of politeness expressions (highly frequent in OpenSubtitles). Those peculiarities did not, however, change the overall picture of our analysis (see Figure 13 in Appendix A).
One key difference between real dialogues and subtitles relates to the overall frequency of feedback cues, which is much higher in spontaneous dialogues (40-50%) than in subtitles (10-20%), as shown in Figure 1. We also compared our English cue word lists against the annotations in Switchboard. After grouping feedback-related labels into a single Feedback category, we find that the cue word lists yield an F1 score of 0.76.

Dialogue Act Tagging
Although lexical statistics do highlight substantial differences between subtitles and spontaneous dialogues, they remain imprecise estimates, as many cue words related to feedback tend to be ambiguous. In this section, we refine our analysis using a dialogue act tagging model trained on the DAMSL-Switchboard corpus.

Data
We map the original set of Switchboard (SWBD) tags, and their clustered DAMSL-SWBD equivalents, into five coarse dialogue act (DA) classes: Forward looking, Yes/no answers, Assessment, Backchannel and Other. The two classes most directly relevant for feedback, namely Backchannel and Assessment, are inspired, in part, by Mezza et al. (2018). Distinguishing between these two feedback-related classes is also motivated by Goodwin (1986), who outlines a number of positional and functional differences between them. The Backchannel category consists of the SWBD-DAMSL labels (see web.stanford.edu/~jurafsky/ws97/manual.august1.html) Acknowledge (Backchannel) (SWBD tag b), Backchannel in question form (bh), Response Acknowledgment (bk), Summarize/reformulate (bf) and Signal-non-understanding (br). As this latter tag suggests, negative feedback signals are also part of the Backchannel category, since they are too infrequent to reliably learn as a separate class. The Assessment category comprises not only the label Agree/Accept (aa), but also Appreciation (ba) and Exclamation (fe). The Forward looking category contains utterances expressing explanations, instructions and suggestions, as well as questions. Table 4 in Appendix B shows the distribution of instances per label and their SWBD tag.
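For illustration, the feedback-relevant part of this mapping can be written down directly from the tags named above; the fallback to Other is a simplification of the full grouping given in Table 4:

```python
# SWBD-DAMSL tags -> coarse dialogue act classes (feedback-relevant subset).
SWBD_TO_COARSE = {
    "b":  "Backchannel",  # Acknowledge (Backchannel)
    "bh": "Backchannel",  # Backchannel in question form
    "bk": "Backchannel",  # Response Acknowledgment
    "bf": "Backchannel",  # Summarize/reformulate
    "br": "Backchannel",  # Signal-non-understanding (negative feedback)
    "aa": "Assessment",   # Agree/Accept
    "ba": "Assessment",   # Appreciation
    "fe": "Assessment",   # Exclamation
}

def coarse_label(swbd_tag: str) -> str:
    # Remaining tags map to Forward looking, Yes/no answers or Other,
    # following the full grouping in Table 4 (not reproduced here).
    return SWBD_TO_COARSE.get(swbd_tag, "Other")
```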

Model Training
We fine-tune the monolingual bert-base-cased pretrained model (Devlin et al., 2019), using 80% of the Switchboard data for training and 20% for development and testing. We set up the task as a sequence classification problem, including the preceding utterance as context. We train the model with a batch size of 8, a learning rate of 4e-5 and default values for the other parameters. We run and compare three different random seeds, yielding similar performance. To improve recall, we also adjust the probability thresholds for the feedback classes.
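A minimal sketch of this fine-tuning setup with the Hugging Face transformers library is shown below; the batch size, learning rate and number of labels follow the text, while the number of epochs, maximum sequence length and dataset field names are assumptions:

```python
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          TrainingArguments, Trainer)

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-cased", num_labels=5)  # the five coarse DA classes

def encode(batch):
    # The preceding utterance is supplied as the first segment,
    # the utterance to classify as the second.
    return tokenizer(batch["previous"], batch["utterance"],
                     truncation=True, padding="max_length", max_length=128)

args = TrainingArguments(
    output_dir="da-tagger",
    per_device_train_batch_size=8,  # batch size reported in the text
    learning_rate=4e-5,             # learning rate reported in the text
    num_train_epochs=3,             # assumption: not reported
    seed=42,                        # one of the three seeds compared
)
# trainer = Trainer(model=model, args=args,
#                   train_dataset=train_data.map(encode, batched=True),
#                   eval_dataset=dev_data.map(encode, batched=True))
# trainer.train()
```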
The model performs relatively well on the Switchboard test set, yielding an accuracy of 0.81. The F1 scores for the Assessment and Backchannel classes are respectively 0.59 and 0.83. This score difference may be due to Backchannel instances being better represented in the training data, as well as some label confusion between the Assessment and Yes/no answers categories.

Empirical Results
We then use the trained dialogue act tagger to detect conversational feedback signals in both the spontaneous dialogues and the subtitles. For non-English corpora, we machine-translate the data into English using the Google Translate API. Feedback-annotated conversational corpora are non-existent for most languages, and the quality of current MT systems is generally considered high enough to serve as a viable alternative (Isbister et al., 2021).
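One plausible way to implement this translation step is sketched below with the google-cloud-translate client; the exact client, batching and error handling used in the paper are not specified:

```python
from google.cloud import translate_v2 as translate

client = translate.Client()  # requires GOOGLE_APPLICATION_CREDENTIALS to be set

def translate_utterances(utterances, source_lang):
    """Translate a batch of utterances into English before DA tagging."""
    results = client.translate(utterances, source_language=source_lang,
                               target_language="en")
    return [r["translatedText"] for r in results]
```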
Table 2 presents the empirical results obtained with our dialogue act tagger on both the spontaneous dialogue and subtitle corpora. We observe that backchannels are considerably more frequent (by a factor of three) in spontaneous dialogues than in subtitles for half of the languages, which is in line with the results of our lexical analysis in Section 4. The number of utterances labeled as Assessment differs less, but subtitles still seem to contain less of this feedback type in almost all genres and languages except French (see Appendix B for details). Given that the tagger is only trained on a single corpus, some of the differences found may also be attributed to the tagger's ability to generalize to certain domains. We therefore also conduct a manual error analysis.

Error Analysis
In general, the proportion of the Backchannel category in the spontaneous conversations is lower for Hungarian, Italian, Norwegian and Mandarin than for the other languages. This is likely due to the use of infrequent spelling variants of backchannel signals such as hmm or mh. We have also found that the tagger has difficulties detecting feedback signals that are part of longer utterances, whether they appear in utterance-initial position or not. We also observe a general tendency to associate sentence-final question marks with feedback cues. When inspecting the most frequent utterances tagged as feedback, we further notice that short utterances pose challenges for machine translation due to polysemy, e.g., the Italian Cosa? ("Thing?"), also translatable as "What?".

Further Annotations
The results from the dialogue act tagger do show some clear trends regarding the extent to which communicative feedback is expressed in subtitles compared to spontaneous interactions. However, the use of DAMSL-Switchboard as the sole source of training data is a limiting factor in our analysis, in particular when it comes to non-English dialogues, which must be machine-translated prior to labeling. Furthermore, the tagger does not provide information about the frequency of negative feedback, although the lexical analysis from Section 4 does seem to point towards a higher frequency of those communicative signals in subtitles.
We therefore complement the analyses of the two previous sections with a manual annotation effort.
To this end, we sample from each corpus a set of 300 utterances to annotate. However, as evidenced by the results of the previous sections, many utterances in our corpora do not seem to contain any communicative feedback. To ensure the annotation process covers a sufficiently broad variety of feedback signals despite this class imbalance, we do not select the utterances purely at random, but rather select half among those marked as feedback-relevant by the cue words of Section 4, and the other half among those that are not.
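A minimal sketch of this balanced sampling procedure follows; the random seed and the handling of corpora with few cue-marked utterances are assumptions:

```python
import random

def sample_for_annotation(utterances, is_cue_marked, n=300, seed=0):
    """Draw half of the annotation sample from utterances containing a
    feedback cue word and half from the remainder, to counter the
    class imbalance observed in the corpora."""
    rng = random.Random(seed)
    marked   = [u for u, m in zip(utterances, is_cue_marked) if m]
    unmarked = [u for u, m in zip(utterances, is_cue_marked) if not m]
    return rng.sample(marked, n // 2) + rng.sample(unmarked, n - n // 2)
```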

Annotation Process
We recruited six annotators with prior expertise in linguistic annotation and proficient in the language of the corpus to annotate. The annotators were provided with each utterance in its context, and were tasked to decide whether the utterance in question contains one of the following three categories of communicative feedback, defined in the annotation guidelines as follows:

AGREE_ACCEPT: indicates that the speaker agrees with or accepts what has been said.
ACK_BACK: indicates that the speaker is listening to her interlocutor, or at least heard what has been said, without necessarily agreeing with it or committing to its content.
NEGATIVE_FEEDBACK: indicates that the speaker could not hear or understand her interlocutor, or even rejects or disagrees with what the other person has said.
Answers to explicit questions should not be considered as feedback. Each utterance can be tagged with zero, one, or multiple feedback labels. These categories specifically target and distinguish between different conversational feedback phenomena, and are therefore somewhat more comprehensive than the categories employed by the tagger of the previous section, where similar categories were derived by merging the available feedback-relevant dialogue act labels from the SWBD annotations.
A total of 24 corpus samples, each comprising 300 utterances, were annotated; the full set of annotated dialogue samples is available at https://github.com/NorskRegnesentral/conv_feedback. Three corpus samples (respectively for English, French and Chinese) were doubly annotated. Across those three samples, the kappa score for inter-annotator agreement was 0.59 for AGREE_ACCEPT, 0.42 for ACK_BACK and 0.54 for NEGATIVE_FEEDBACK. This relatively low inter-annotator agreement illustrates the challenging nature of the annotation task, in particular due to the lack of explicit turn boundaries in subtitles, which makes it at times difficult to determine the context behind each utterance.
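Assuming the reported scores are per-category Cohen's kappa over binary presence/absence judgments (the paper does not name the exact kappa variant), the computation can be sketched as follows:

```python
from sklearn.metrics import cohen_kappa_score

CATEGORIES = ["AGREE_ACCEPT", "ACK_BACK", "NEGATIVE_FEEDBACK"]

def per_category_kappa(ann1, ann2):
    """Agreement between two annotators on the same sample. `ann1` and
    `ann2` are lists of label sets, one set per utterance, since each
    utterance can carry zero, one, or multiple feedback labels."""
    return {cat: cohen_kappa_score([cat in labels for labels in ann1],
                                   [cat in labels for labels in ann2])
            for cat in CATEGORIES}
```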

Annotation Results
Figure 3 illustrates the frequencies of the three feedback categories across the 24 annotated samples. We again observe a lower proportion of both Agree / Accept and Acknowledgement / Backchannel feedback in the subtitles compared to real interactions. The proportion of negative feedback is, however, higher for the subtitles. We hypothesise that this may stem from the fact that disagreements between interlocutors are more interesting from a storytelling perspective, and are therefore more common in subtitles than in real interactions. We also investigated whether subtitles for foreign audiences differed from subtitles written for the hearing impaired (as the latter often need to adhere more closely to the original on-screen conversation), but did not find any substantial disparity.

LLM-based Annotation
The frequencies in Figure 3 are obtained from the manually annotated dialogue samples. However, those samples only cover a small fraction of the available corpora. Furthermore, as the sampling procedure relied on the use of cue words to cover a sufficiently broad set of feedback types (see above), it is likely to overestimate the proportion of communicative feedback. To mitigate this bias, we fine-tune an instruction-tuned Gemma 2 model (Gemma Team et al., 2024) to predict the probability of an utterance including one of the three defined feedback categories. The fine-tuning relied on LoRA (Hu et al., 2021) and included as instructions the annotation guidelines also provided to the human experts. The full set of 24 dialogue samples was used for the fine-tuning, each utterance being provided in its local dialogue context. For non-English utterances, we also include in the prompt an English translation of the utterance and its context, obtained using Google Translate.
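A minimal sketch of this LoRA fine-tuning setup with the peft library is given below; the specific Gemma 2 variant, the LoRA hyperparameters and the prompt template are assumptions, as the paper does not report them:

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model_id = "google/gemma-2-9b-it"  # assumption: exact Gemma 2 variant not reported
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Wrap the base model with LoRA adapters (hyperparameters are illustrative).
lora = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                  target_modules=["q_proj", "v_proj"],
                  task_type="CAUSAL_LM")
model = get_peft_model(model, lora)

def build_prompt(guidelines, context, utterance, translation=None):
    """Prompt containing the annotation guidelines, the local dialogue
    context and, for non-English data, an English translation."""
    parts = [guidelines, "Context:", context, "Utterance:", utterance]
    if translation:
        parts += ["English translation:", translation]
    parts.append("Feedback labels (AGREE_ACCEPT / ACK_BACK / NEGATIVE_FEEDBACK / NONE):")
    return "\n".join(parts)
```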
The fine-tuned Gemma 2 LLM was then applied to all corpora to predict whether their utterances contained one of the three categories of feedback defined above. The results are shown in Figure 4. The proportions of communicative feedback are somewhat lower in the full corpora than in the annotated samples (which is expected given how the dialogue samples were derived), but the overall trends remain similar to Figure 3.

Conversational Feedback in Synthetic Dialogues
We conclude by investigating the occurrence of communicative feedback in synthetic dialogues generated with autoregressive language models. More precisely, we wish to analyze whether the communicative feedback generated by those models is closer to the patterns found in real interactions or to those of scripted dialogues such as subtitles.
To this end, we use available GPT-2 models (Radford et al., 2019) for the eight covered languages. The following pre-trained models are employed: gpt2-base (English), gpt-fr-cased-small (French), german-gpt2 (German), gpt2-small-italian (Italian), PULI-GPT-2 (Hungarian), norwegian-gpt2 (Norwegian), gpt2-chinese-cluecorpussmall (Mandarin Chinese), and japanese-gpt2-medium (Japanese). The use of GPT-2 models is motivated by practical considerations and the need to obtain pre-trained models for each of the eight languages. For each corpus, we derive a fine-tuned version of its corresponding GPT-2 model by further training the model on the corpus dialogues. To account for differences in corpus size, the number of epochs is adjusted to ensure that the total number of gradient updates is similar across all corpora. The GPT-2 models are then employed to produce synthetic dialogues (100 dialogues of about 50 turns per model). For the fine-tuned models, all turns are automatically generated, while for the base models, the following dialogue start is used as context to bias the model towards the generation of dialogues: Hi! - Hi, how are you? - Fine, and you? Finally, the LLM annotator from the previous section is applied to those synthetic dialogues to estimate their frequency of communicative feedback.

Figure 5: Frequency of communicative feedback in synthetic dialogues generated using GPT-2 models, either applied without fine-tuning or after fine-tuning on corpora of spontaneous interactions or subtitles.

The results are shown in Figure 5. We observe that the synthetic dialogues generated with the standard GPT-2 models, without any further fine-tuning, are much closer to those derived from subtitles than to those derived from spontaneous interactions when it comes to communicative feedback. This is notably the case for positive and neutral feedback. The occurrence of negative feedback is, however, not as common as in subtitles. Although the above results were obtained using only GPT-2 pre-trained models, we expect to find similar patterns for other (and more recent) LLMs.
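As an illustration of the generation step, here is a minimal sketch using the transformers library; the decoding parameters (sampling, top-p, temperature, generation length) are assumptions, as the paper does not report them:

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # gpt2-base, the English model
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Seed context used for the base (non-fine-tuned) models to bias them
# towards dialogue; the fine-tuned models generate all turns from scratch.
seed = "- Hi!\n- Hi, how are you?\n- Fine, and you?\n"

inputs = tokenizer(seed, return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=400, do_sample=True,
                        top_p=0.95, temperature=0.9,
                        pad_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```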

Conclusion and Future Work
As evidenced in this paper, movie and TV subtitles exhibit notable linguistic differences from actual spontaneous dialogues in the amount and type of conversational feedback they include. Based on a collection of corpora of both spontaneous dialogues and subtitles across eight languages, we provide both lexical statistics and dialogue act estimates derived with a fine-tuned dialogue act tagger. We show that the proportion of conversational feedback is considerably lower in subtitles than in spontaneous dialogues across the corpora included. Furthermore, the type of conversational feedback also differs, as negative feedback is proportionally more frequent in subtitles. This is corroborated by manual annotations of 24 dialogue samples from the selected corpora, and by the use of a fine-tuned LLM trained on those annotations. Finally, we also show that dialogues generated by language models are closer to scripted dialogues than to real interactions in their use of communicative feedback. Beyond their linguistic interest, these results can provide useful insights for the development of conversational models, as those are often trained on scripted dialogues and may therefore struggle both to understand communicative feedback from the user and to produce such feedback themselves.

A Conversational Feedback Lexical Statistics

Cue Word Lists
In Table 3, we present the list of cue words used for computing the lexical statistics in Section 4. Content warning: the lists contain potentially offensive language.

Lexical Statistics Plots
Figures 6-12 present statistics for utterance and feedback types as well as common feedback-related lexical items for different languages. Figure 13 shows politeness keywords and emojis in our English and French corpora.

B Detailed Dialogue Act Tagging Results
Dialogue Act Grouping

Results per Corpus
Tables 5 and 6 present the results of our dialogue act tagger per (sub)corpus. Here, we only make a binary distinction by grouping the feedback-relevant classes Backchannel and Assessment into a single Feedback category. The number of utterances refers to the final version of the data after pre-processing, with meta-linguistic information removed.
Figure 1: Frequency of conversational feedback of various types among utterances in the English corpora (both spontaneous and subtitles), based on manually curated lists of cue words. Fig. (a) shows the absolute frequency while Fig. (b) zooms in on utterances labelled with at least one feedback type. + denotes positive feedback/acknowledgement, = neutral/continuer feedback, - negative feedback, ? clarification requests, and 'OTH' stands for other utterances. fo and hi respectively stand for 'foreign audience' and 'hearing-impaired' subtitles. Corpora without these prefixes are spontaneous dialogues.

Figure 2: Most common lexical items associated with communicative feedback, as detected through manually curated lists of cue words in English, factored by corpus.

Figure 3: Frequency of communicative feedback depending on the source of the dialogue sample (spontaneous interactions or subtitles) and the category of feedback, based on annotations from human experts.

Figure 4: Frequency of communicative feedback depending on the corpus type and category of feedback, based on the predictions of the fine-tuned Gemma 2 model trained on human annotations.
Figure 13: Short utterance distribution including politeness and emojis.

Table 1: Overview of dialogue data sources for both spontaneous conversations and subtitles employed in this paper.

Table 2: Proportions (%) of the relevant dialogue act groups detected by the BERT-based dialogue act tagger in the spontaneous conversation (SpConv) and in the subtitle (Subs) corpora.

Table 4: Instances created from the DAMSL-SWBD corpus with labels mapped to coarse-grained dialogue act groups.

Table 6: Number and frequency of communicative feedback phenomena predicted by the BERT-based dialogue act tagger on our subtitle corpora. Non-English datasets were automatically translated into English before inference.