Enriching a Lexical Resource for French Verbs with Aspectual Information

The paper presents a syntactico-semantic lexicon of over a thousand French verbs. It has been created by manually adding lexical aspect features to verb frames from TreeLex [16]. We present how the original syntactic resource has been adapted to the current project, our aspect assignment procedure and an overview of the resulting lexical resource.


Introduction
For Natural Language Processing (e.g., Information Extraction, Syntactic Parsing, Text Generation), as well as language-oriented Digital Humanities applications (e.g., Discourse Analysis, stylometry), large-scale lexical resources that are both machine-tractable and human-readable remain a valuable asset, even in a field which today appears dominated by robust Machine Learning algorithms and giga-word corpora. For instance, even though syntactic parsing has seen great advances in the past 10 years, thanks to the development of Treebanks and dependency-annotated corpora, even the best parser fails to capture in a consistent and predictable way such an intuitive linguistic notion as transitivity. In this sense, (semi-)manually constructed lexicons are an indispensable complement to corpus-driven resources (e.g., "word embeddings", n-gram datasets). We see the symbolic/Machine Learning divide as a consequence of the fact that each type of resource addresses only a portion of the problem. Thus, the challenge contemporary NLP systems are facing today is more how to integrate different knowledge sources than to prove that one source is better, or more consistent, than the other.
In this paper, we present TreeLex++, an extension of TreeLex [16], a syntactic lexicon for French, based on the French Treebank (FTB), enriched here with aspectual information. Different lexical resources have been devised over several decades for the automatic processing of French texts, in different theoretical frameworks: from the manually-encoded Lexicon-Grammar tables [13] framed in a distributionalist framework, to contemporary large-scale, semi-automatically induced lexicons such as the Lefff [24,23], or resources acquired by way of "serious games", such as Jeux de Mots [17,18]. Most of those lexical resources have focused on providing a formalized description of the main syntactic categories, with an emphasis on verbal predicates. In extending TreeLex with aspectual information, our goal is primarily to set up a large-scale aspectual characterization process of verbs. Secondly, we wish to provide the NLP and DH communities with a resource which combines corpus-induced syntactic characterizations with basic aspectual distinctions, based on Vendler's classification [25].
In the first sections, we present how TreeLex++ derives from the original FTB-induced TreeLex resource (Sections 2 and 3). Then we move on to the presentation of our aspectual semantics characterization process (Section 4). In Section 5 we give a general overview of the present state of the resource. Section 6 is dedicated to conclusions and perspectives.

TreeLex
TreeLex is a syntactic lexicon automatically extracted from the French Treebank [1]. The lexicon contains ca. 2000 contemporary French verbs with their syntactic realizations and frequencies found in the FTB. The FTB is a corpus of newspaper texts (Le Monde newspaper, 1990-1993), in which constituent trees were originally encoded in XML format. In addition to lexical information for every word (category, lemma, person, number, gender, etc.), the corpus provides a syntactic structure for each sentence: both syntactic groups and functions are indicated (see Figure 1). The XML-based annotation schema has since been complemented with a more straightforward tabulated format, following the CoNLL specifications that were widely adopted after the CoNLL shared task on dependency parsing [20].
The FTB annotation schema is centered around the verbal nucleus (VN), which makes syntactic dependents easily accessible. This corpus organization is exploited by [16] in order to obtain obligatory arguments and provide syntactic frames for verbs present in the FTB. The resulting lexicon, called TreeLex, provides a rich syntactic representation of each argument, since both functions and their phrasal realizations are encoded. Example (1) shows a lexical entry for the transitive verb entraver 'to impede', which takes a nominal subject (SUJ:NP) and a nominal direct object (OBJ:NP). If a verb allows for different syntactic combinations (i.e., either a different list of functions or different realizations), every frame is listed separately. Therefore, a single verb (more precisely, its lemma) can be found several times in the lexicon, see (2). As no semantic disambiguation was performed, this strategy aims at distinguishing potentially different senses associated with each frame. Here, in (2a-b), voler has the meaning of 'to steal' whereas in (2c) it can be translated as 'to fly'.

3. détruire: SUJ:NP,(OBJ:NP)
Finally, since multi-word units are indicated in the FTB, TreeLex lists 465 multi-word verbs, such as courir le risque 'to take a risk' or donner lieu 'to result/take place'.
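To make the entry notation concrete, the following is a minimal sketch of how an entry such as détruire: SUJ:NP,(OBJ:NP) could be split into its lemma and arguments. It assumes that arguments are comma-separated FUNCTION:REALIZATION pairs, that parentheses mark optional arguments, and that alternative realizations are joined with a slash; the names Argument and parse_entry are illustrative and are not part of TreeLex.

```python
from typing import List, NamedTuple, Tuple


class Argument(NamedTuple):
    function: str                  # syntactic function, e.g. "SUJ" or "OBJ"
    realizations: Tuple[str, ...]  # phrasal realizations, e.g. ("NP",) or ("NP", "Ssub")
    optional: bool                 # True when the argument is written in parentheses


def parse_entry(entry: str) -> Tuple[str, List[Argument]]:
    """Split 'lemma: FUNC:REAL,(FUNC:REAL)' into a lemma and its arguments.

    Hypothetical helper: the entry syntax is inferred from the examples
    printed in the paper, not from an official TreeLex specification.
    """
    lemma, _, frame = entry.partition(":")
    arguments = []
    for chunk in frame.strip().split(","):
        chunk = chunk.strip()
        optional = chunk.startswith("(") and chunk.endswith(")")
        if optional:
            chunk = chunk[1:-1]
        function, _, realization = chunk.partition(":")
        arguments.append(Argument(function, tuple(realization.split("/")), optional))
    return lemma.strip(), arguments


lemma, arguments = parse_entry("détruire: SUJ:NP,(OBJ:NP)")
for argument in arguments:
    print(lemma, argument.function, argument.realizations, argument.optional)
```

Keeping the optionality flag separate from the function label makes it easy to drop or keep optional arguments later without re-parsing the entry.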

Beyond TreeLex: towards TreeLex++
On the other hand, TreeLex's size makes an in-depth qualitative linguistic study feasible. For example, it could be extended with semantic information to investigate interactions between semantic and syntactic properties of verbs. For French, several projects have produced lexical resources containing syntactic and semantic verbal properties, or different levels of semantic information, e.g., verbal semantic classes (LVF, cf. [10]), thematic roles (French FrameNet, cf. [7]) or lexical aspect (Nomage, cf. [3] or [9]). In the current project, we decided to focus on high-level syntax-semantics relationships and thus we augmented the syntactic frames in TreeLex with manually encoded aspectual information. Our approach differs from [3] and [9], as verbal aspect assignment is guided by corpus examples rather than by elicited sentences. Similarly to [9], aspect is assigned to a verb-frame couple rather than to a verb alone. Nevertheless, the level of detail of our aspectual classes is distinct from both [3] and [9]: we use only the four major Vendlerian classes.
In order to prepare the TreeLex data for aspect assignment, several modifications have been adopted. First, all frames had to be represented in a uniform way. Therefore all syntactic arguments, whether optional or not, have been treated equally and indications of optional realizations have been removed. In particular, verbs such as détruire 'to destroy' in (3) were transformed into (4):
4. détruire: SUJ:NP,OBJ:NP
Second, we had to address the ambiguity in TreeLex entries. As shown in (2), TreeLex verbs may appear with several frames. According to [16], this affects about 40% of TreeLex verbs. Such multiple frames may indicate a polysemous and/or a polyaspectual verb. However, all different syntactic realizations of a single argument structure (the same sequence of functions) are listed as separate frames in TreeLex, see (5). This representation is therefore unclear: it may show a true semantic (meaning) difference or introduce an artificial syntactic (frame) ambiguity. For example, the direct object (OBJ) of the verb déplorer 'to regret/deplore' in (5) has two syntactic realizations (a nominal phrase, NP, or a subordinate phrase, Ssub) but this syntactic variation does not imply a difference in meaning. In order to avoid such an artificial ambiguity, we grouped all frames which differed only by their phrasal realization. Therefore, the double nature of OBJ in (5) is currently represented as in (6).

6. déplorer: SUJ:NP,OBJ:NP/Ssub
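The two preparation steps described above (dropping optionality marks, then grouping frames that differ only in phrasal realization) can be sketched as follows, together with the single-frame filter that the text goes on to describe. This is an illustration under assumptions: the two input frames for déplorer are reconstructed from the prose, the frames in the last call are invented, and normalize, merge_frames and is_single_frame are hypothetical names rather than the authors' code.

```python
from typing import Dict, List, Tuple


def normalize(frame: str) -> str:
    """Drop optionality marks: 'SUJ:NP,(OBJ:NP)' becomes 'SUJ:NP,OBJ:NP'."""
    return frame.replace("(", "").replace(")", "")


def merge_frames(frames: List[str]) -> List[str]:
    """Group frames sharing the same sequence of functions.

    Realizations found in the same position are joined with '/'.
    Assumes every argument is a well-formed 'FUNCTION:REALIZATION' pair.
    """
    grouped: Dict[Tuple[str, ...], List[List[str]]] = {}
    for frame in frames:
        pairs = [arg.split(":", 1) for arg in normalize(frame).split(",")]
        functions = tuple(function for function, _ in pairs)
        slots = grouped.setdefault(functions, [[] for _ in functions])
        for slot, (_, realization) in zip(slots, pairs):
            if realization not in slot:
                slot.append(realization)
    return [
        ",".join(f"{function}:{'/'.join(slot)}" for function, slot in zip(functions, slots))
        for functions, slots in grouped.items()
    ]


def is_single_frame(frames: List[str]) -> bool:
    """True when a verb is left with exactly one frame after grouping."""
    return len(merge_frames(frames)) == 1


print(merge_frames(["SUJ:NP,OBJ:NP", "SUJ:NP,OBJ:Ssub"]))
print(is_single_frame(["SUJ:NP,OBJ:NP", "SUJ:NP"]))  # invented frames
```

One caveat of this design: realizations are merged per position, so any co-occurrence restriction between the realizations of different arguments is lost in the grouped frame.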
In an effort to reduce semantic ambiguity, we decided to consider only verbs which, after syntactic grouping, appeared with a single syntactic frame. As a consequence, verbs such as voler in (2) have been excluded. Multi-word verbal units have been omitted as well, as their meaning is usually idiosyncratic and conventional. Moreover, due to their idiomatic nature, their syntactic construction appears heavily constrained.
Finally, all 1161 remaining verbs have been coupled with examples extracted from the FTB. We collected corpus examples in order to illustrate how each frame is instantiated and to provide a real context for aspect assignment.

Incorporating lexical aspect
Aspectual information has been added manually to TreeLex verbs. Unlike grammatical aspect, lexical aspect refers to inherent semantic properties indicating the way in which predicates are structured in relation to time. In the most general terms, the properties in question have to do with the presence (or lack thereof) of an end point (limit or boundary), duration or dynamicity in the lexical structure of certain classes of verbs. Thus, for instance, the presence of a limit distinguishes between telic (i.e., denoting a time-limited situation) and atelic verbs.
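The three properties named above can be read as binary features. The sketch below assumes the usual decomposition of the four Vendlerian classes into [±dynamic], [±durative] and [±telic], which agrees with the groupings given in the data overview of this paper (telic: ACH, ACC; dynamic: ACH, ACC, ACT; durative: STATE, ACT, ACC). The table and function names are illustrative only.

```python
from typing import Optional

# Feature bundles for the four classes used in TreeLex++.
# Assumption: standard Vendler-style decomposition, not a table copied from the paper.
VENDLER_FEATURES = {
    "STATE": {"dynamic": False, "durative": True, "telic": False},
    "ACT": {"dynamic": True, "durative": True, "telic": False},
    "ACC": {"dynamic": True, "durative": True, "telic": True},
    "ACH": {"dynamic": True, "durative": False, "telic": True},
}


def classify(dynamic: bool, durative: bool, telic: bool) -> Optional[str]:
    """Return the class whose feature bundle matches, or None if no class does."""
    wanted = {"dynamic": dynamic, "durative": durative, "telic": telic}
    for label, features in VENDLER_FEATURES.items():
        if features == wanted:
            return label
    return None


print(classify(dynamic=True, durative=False, telic=True))
```

Only four of the eight possible feature combinations correspond to a class, which is why the function may return None.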

Annotation procedure
Aspectual assignment is a relatively new task in the field of natural language annotation. The research presented here is therefore to be seen as a first step towards a full-fledged syntactic/semantic lexical resource. Our aspect assignment procedure consisted in a double manual annotation by two experts in semantics. It is not a "standard" annotation process, since, after the initial annotation phase, a final adjudication phase took place in order to arrive at the annotations presented in the current version of TreeLex++. This process, which departs from established annotation approaches, is to be considered as a way of ensuring consistency in the current phase, where aspectual tagging is entirely performed manually. Each verb has been considered along with its syntactic frame and the corresponding examples found in the FTB. The assignment task consisted in choosing one of the four classes (tags) in Table 2. Each decision was made after applying the usual tests presented in the literature on verb lexical aspect (see [12,15,25,8,27,19,6,22], among others). We have used the following six tests (cf. Table 3):
T1: progressive form of être en train de 'to be V-ing'
In order to illustrate our procedure, let us take the verb invoquer 'to invoke' in one of the sentences where it appears in the corpus, given in (7).
It is important to mention that verbs were annotated according to their meaning in the sentences found in the FTB corpus. Verbal polysemy was addressed only if different meanings appeared in the corpus. It is known that phrasal context can influence verbal aspect ([8,26] inter alia). Upon applying the tests presented above, plural subjects and direct objects were transformed into their singular forms, so as to avoid the effect that plural arguments can turn ACC predicates (écrire un article en dix jours 'to write a paper in ten days') into ACT ones (écrire des articles pendant dix jours 'to write papers for ten days'). Likewise, we have used past perfective tenses (Elle a travaillé (hier) 'She has worked (yesterday)') in order to avoid a habitual reading, which is usually obtained with imperfective tenses (Elle (travaillait/travaille) à la poste 'She (worked/works) at the post office'). Since imperfective tenses favour a habitual reading, the dynamicity property [±dynamic] of the verb becomes inaccessible. For similar reasons, frequency adverbs triggering iterative or habitual readings (souvent 'often', tous les jours 'every day') were not taken into account either, since they interfere with verbal aspectual features.
We obtain an aspectual characterization limited to the meanings appearing in the corpus. It is an annotation neither of verbs as lemmas nor of verbs in sentences, but rather of verbal structures (verb + arguments) in a discursive context, which allowed us to identify verbal meaning and to avoid polysemy as much as possible.

Annotation consistency assessment: Inter-Rater Reliability
Based on the annotation process outlined above, we have been able to estimate the inter-rater reliability (IRR), by taking into account the annotations produced by two annotators on 1161 verbs. The annotators are both experts in aspectual semantics. Each verb in the list has been annotated independently by each annotator, even though a final adjudication step yielded the annotations visible in the current version of the lexicon. Comparing the annotations produced by both annotators was necessary, in order to arrive at a consistent decision in the final resource. For example, atteler 'to tie' was initially labelled "ACT" by annotator 1, while annotator 2 was not sure of his annotation. After the first annotation phase, both annotators agreed to tag the entry as "ACT". For the purpose of assessing the inter-rater agreement, however, we consider the initial annotation, which counts as a disagreement case. Conversely, for cerner 'to surround', annotator 1 was not sure of her annotation, while annotator 2 initially labelled the entry as "ACH". After comparing their annotations, both annotators finally agreed on labelling this entry as "ACC". Again, this case counts as a disagreement between both annotators. As can be seen, the final decision does not reflect either annotator's initial decision, which underlines the fact that aspectual annotation is a complex task. Cases such as the one discussed here therefore strongly argue in favor of a post-annotation adjudication phase.
The following IRR statistics were produced using the R packages {irr} and {irrCAC}. Assessing IRR is not a straightforward task, since many methods have been presented in the literature. We choose to present "standard" IRR statistics, such as Cohen's Kappa [5], in this preliminary stage, alongside Gwet's "Agreement Coefficient" score AC1 [14,28]. Since the present lexical resource is still under construction, these IRR scores are essentially a way of assessing the complexity of the aspectual annotation task presented here, and therefore the consistency of the annotation procedure. In the annotation task under consideration, each annotator had to categorize 1161 verbal entries into 4 major classes: ACC, ACH, ACT, STATE. Three hybrid classes were also considered: ACC/ACH, ACH/ACT and STATE/ACH. For example, varier 'to vary' was initially labelled "ACH/ACT" by annotator 1 (final decision: "ACT"). Finally, a "not sure" tag was also used. As a consequence, the initial list of verbal entries has been associated with 8 different tags, including "not sure".
As can be seen in Table 5, both annotators agree on 82.6% of the cases, with an estimated 9.7% of chance agreement. The reported Kappa score (0.744) indicates a moderate inter-rater agreement, which is not uncommon for complex tasks. In our case, this score can be largely attributed to the fact that 4 major classes and 3 hybrid ones were considered. Gwet's AC1 score (0.806) is slightly higher than Cohen's Kappa, which can be attributed to the fact that Gwet's AC1 is a chance-corrected agreement coefficient that is known to yield higher values than Cohen's Kappa (and other coefficients) in certain configurations. Regardless of the method, these figures indicate a "moderate" to "good" inter-annotator agreement.
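Both coefficients share the form (p_o - p_e) / (1 - p_e) and differ only in how the chance agreement p_e is estimated. The sketch below is a plain-Python illustration of the textbook definitions, not the {irr} or {irrCAC} implementations used for the reported figures; the two short label lists are invented toy data.

```python
from collections import Counter
from typing import Sequence


def chance_corrected(p_o: float, p_e: float) -> float:
    """Generic chance-corrected agreement: (p_o - p_e) / (1 - p_e)."""
    return (p_o - p_e) / (1 - p_e)


def observed_agreement(a: Sequence[str], b: Sequence[str]) -> float:
    """Proportion of items on which both raters chose the same label."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohen_kappa(a: Sequence[str], b: Sequence[str]) -> float:
    """Cohen's Kappa: p_e is the sum of products of the raters' marginal proportions."""
    n = len(a)
    count_a, count_b = Counter(a), Counter(b)
    p_e = sum(count_a[k] * count_b[k] for k in set(a) | set(b)) / (n * n)
    return chance_corrected(observed_agreement(a, b), p_e)


def gwet_ac1(a: Sequence[str], b: Sequence[str]) -> float:
    """Gwet's AC1: p_e = sum(pi_k * (1 - pi_k)) / (q - 1), pi_k = mean marginal of class k.

    Requires at least two distinct categories (q >= 2).
    """
    n = len(a)
    count_a, count_b = Counter(a), Counter(b)
    categories = set(a) | set(b)
    pi = {k: (count_a[k] + count_b[k]) / (2 * n) for k in categories}
    p_e = sum(p * (1 - p) for p in pi.values()) / (len(categories) - 1)
    return chance_corrected(observed_agreement(a, b), p_e)


# Invented toy annotations, for illustration only.
rater_1 = ["ACC", "ACH", "ACT", "STATE", "ACH", "ACH"]
rater_2 = ["ACC", "ACH", "ACT", "ACH", "ACH", "ACC"]
print(cohen_kappa(rater_1, rater_2), gwet_ac1(rater_1, rater_2))
```

Plugging the reported values into the generic formula, chance_corrected(0.826, 0.097) gives approximately 0.807, which is close to the reported AC1 of 0.806. This suggests, though the text does not say so explicitly, that the 9.7% chance-agreement estimate is the one associated with AC1; a Kappa of 0.744 at 82.6% observed agreement would correspond to a chance agreement of roughly 0.32.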

At this point, it is worth emphasizing once more that, once the preliminary annotation was completed, a final adjudication phase took place, which yielded the final aspectual annotations visible in the current version of TreeLex++. Since these final annotations are those end users will see, it is necessary to assess IRR scores between each annotator and the final annotations. In this case, Kappa scores in the 0.85 range and AC1 scores in the 0.9 range can be reported. End users of the TreeLex++ lexical resource should therefore consider that the proposed aspectual annotations are consistent, and that the annotation procedure based on syntactico-semantic tests achieves good results for the classes considered. As encouraging as they might seem, these figures should not obscure the fact that there is still considerable room for improvement, in terms of both scale and detail. For future versions, we are contemplating Games With A Purpose (GWAP) such as JeuxdeMots [17] as a source of user input. We are confident JeuxdeMots players will consider favorably new games, such as aspect-oriented tasks, provided we are able to propose 'gamified' versions of the present annotation procedure.

Data in TreeLex++
The resulting resource, TreeLex++, contains 1161 verbs enriched with syntactic (frame) and semantic (lexical aspect) properties. It is available in text format as a CSV (comma-separated values) file. Each verb is accompanied by its frame, its lexical aspect, the number of examples found in the FTB and their full list. To simplify the search for the inflected form in the example text, the corresponding verb is enclosed in <b> and </b> tags, as presented in (8):
8. Quant à moi , je trouve qu' on se <b>fiche</b> du monde en n' expliquant pas les choses en langage courant . 'As for me, I think that they don't give a toss about the people by giving no explanation in the common language.'
To make linguistic generalizations easier, information encoded in syntactic frames has been translated into several representations: the number of syntactic arguments; whether a verb is reflexive or not; a general frame (a list of syntactic functions and obligatory clitics); a simplified frame (a list of syntactic functions alone); and the full frame including syntactic realizations (types of phrases). The corresponding syntactic information for déplorer in (6) and the reflexive verb se ficher 'to not give a toss' presented in TreeLex++ format is given in Table 6. A brief summary of syntactic realizations of TreeLex++ verbs is given in Table 7 below. The number of arguments in TreeLex++ does not exceed three and the vast majority of verbs (74.24%) have two arguments. However, as indicated in Table 6, this does not necessarily correspond to a transitive structure (SUJ.OBJ), as the second argument may have a different function than a direct object (see Table 1). The distribution of verbal aspectual classes found in TreeLex++ is given in Table 8. The majority of verbs in TreeLex++ are telic (ACH or ACC). If we look at dynamicity, only a small proportion of verbs (8.87%) are true statives; the bulk of the entries are dynamic (ACH, ACC or ACT). However, the distribution of durative (STATE, ACT, ACC) and non-durative (ACH) verbs is almost equal.
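Two of the conventions just described lend themselves to a short illustration: deriving the argument count and the simplified frame from a full frame, and locating the tagged verb form in an example sentence. The sketch assumes that the simplified frame joins functions with a dot, as in the SUJ.OBJ notation used in the text. The reflexive flag and the general frame are left out because the encoding of clitics is not shown here, and the function names are hypothetical.

```python
import re
from typing import Dict, List, Union


def derive_representations(full_frame: str) -> Dict[str, Union[int, str]]:
    """Derive the argument count and the simplified frame from a full frame."""
    functions = [argument.split(":", 1)[0] for argument in full_frame.split(",")]
    return {
        "n_args": len(functions),
        "simplified_frame": ".".join(functions),
        "full_frame": full_frame,
    }


def highlighted_forms(example: str) -> List[str]:
    """Return the inflected verb forms enclosed in <b>...</b> tags."""
    return re.findall(r"<b>(.*?)</b>", example)


print(derive_representations("SUJ:NP,OBJ:NP/Ssub"))
print(highlighted_forms("je trouve qu' on se <b>fiche</b> du monde"))
```

A non-greedy pattern is used so that a sentence containing several tagged forms yields each form separately.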

The resource is neither syntactically nor semantically balanced, which is probably due to the content of the FTB corpus (newspaper texts).
As shown in Table 8, most verbs are assigned a single aspect. Hence, it seems that our approximate disambiguation technique is quite efficient. Three verbs, however, exhibit a double aspect: excéder, observer, and traverser. Indeed, judging from their context, these verbs are truly polysemous in the FTB: excéder is ambiguous between 'to exceed' and 'to infuriate', observer is used as either 'to observe' or 'to respect/keep' and traverser corresponds to 'to cross' or 'to experience'. Therefore, even when syntactic properties are restricted to a single frame, certain semantic ambiguities may remain.

Conclusions and perspectives
TreeLex++ is a lexical resource which associates both syntactic and semantic properties, for over a thousand verbs, illustrated with attested examples taken from the FTB.Such a database offers a valuable resource for fundamental linguistics research, NLP and DH applications.
From a fundamental research perspective, TreeLex++ makes it possible to identify correlations, if any, between syntactic frames and aspect values. In other words, it allows researchers to work at the syntax/semantics interface. For instance, intuitively, accomplishment verbs (ACC) should be associated with transitive verbs (2-argument predicates). TreeLex++ provides an opportunity to verify this hypothesis empirically: not only can it be confirmed or refuted, but we can also estimate the degree of association between syntactic structures and aspectual classes. The first findings presented in [4] show how TreeLex++ can be put to use in this perspective. As for NLP applications, a number of practical uses of aspectual information are cited in [9]: the assessment of event factuality, text summarization, machine translation or automatic detection of temporal relations. We anticipate performance gains for those tasks from integrating TreeLex++ as a symbolic resource within a Machine Learning processing chain.
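One simple way to quantify the degree of association between frames and aspectual classes is Cramér's V computed from a frame-by-aspect contingency table. This is only one possible measure and is not necessarily the one used in [4]; the (frame, aspect) pairs below are invented toy data, and cramers_v is an illustrative name.

```python
import math
from collections import Counter
from typing import List, Tuple


def cramers_v(pairs: List[Tuple[str, str]]) -> float:
    """Cramér's V for a list of (frame, aspect) observations.

    V = sqrt(chi2 / (n * (min(rows, cols) - 1))), ranging from 0
    (no association) to 1 (perfect association).
    """
    n = len(pairs)
    joint = Counter(pairs)
    rows = Counter(frame for frame, _ in pairs)
    cols = Counter(aspect for _, aspect in pairs)
    chi2 = 0.0
    for frame in rows:
        for aspect in cols:
            expected = rows[frame] * cols[aspect] / n
            chi2 += (joint[(frame, aspect)] - expected) ** 2 / expected
    k = min(len(rows), len(cols)) - 1
    return math.sqrt(chi2 / (n * k)) if k > 0 else 0.0


# Invented toy observations, for illustration only.
toy = [("SUJ.OBJ", "ACC")] * 3 + [("SUJ", "ACT")] * 3
print(cramers_v(toy))
```

With sparse cells, as can be expected for rare frames, the chi-square approximation is unreliable, so an exact or permutation-based test may be preferable in practice.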
In its current version, TreeLex++ contains only single-frame verbs, which cover roughly half of the entries in TreeLex. In order to include the remaining half in TreeLex++, we first have to employ a true semantic disambiguation technique. As mentioned in Section 5, a verb with a unique syntactic combination may still be polysemous and polyaspectual. In the case of several frames, this potential ambiguity is multiplied and the human disambiguation effort, already complex and time-consuming, increases considerably. A possible solution could be a lexical look-up of verb-frame couples in LVF [10] in order to identify different verb senses. However, pairing the senses with the corresponding FTB examples would require an ad-hoc approach. As mentioned above, another available option is to leverage user input, by resorting to crowd-sourcing or "Game With A Purpose" platforms. We have taken steps towards this end by contacting the developer of Jeux de Mots, Mathieu Lafourcade, with a view to integrating the aspectual information from TreeLex++ into the existing Jeux de Mots lexical network. This will allow for the development of new types of lexical games. We also hope Lafourcade's lexical propagation and integrity checking mechanisms will allow us to capture more general syntax/semantics properties than those which can currently be found in the FTB.
An evaluation methodology for our resource is also in order, beyond Inter-Rater Reliability scores, to determine the accuracy as well as the coverage of our aspectual assignment process. For instance, we could compare our results with aspect values attributed to verbs in the Nomage project [3]. However, the Nomage methodology (for verbs) differs from ours, as aspect assignment is based on elicited examples rather than on verb uses in a corpus. Another comparison could be made with the syntactico-semantic resource described in [9], which served to train an automatic classifier of verbal aspect. Unfortunately, this data does not seem to be publicly available. Moreover, both resources use aspectual values different from ours, so the corresponding tagsets would first have to be converted in order to provide equivalent information. Again, we turn towards the Jeux de Mots platform, in the hope of gaining insights from users' input on lexical aspect assignment tasks, as well as from the network's built-in sanity checking mechanisms.

Figure 1 A sample of FTB sentence annotation.

T2: question related to dynamicity Que s'est-il passé hier? 'What happened yesterday?'
T3: use of aspectual semi-auxiliaries commencer à 'to start doing something', continuer de 'to keep on doing something', arrêter de 'to stop doing something'
T4: duration complement en x temps 'in x time'
T5: duration complement pendant x temps 'during x time'
T6: imperfective paradox V[temps inaccompli] IMPLIQUE V[temps accompli] 'V[imperfect tense] IMPLIES V[perfect tense]'

Table 3
A grid for the allocation of aspectual classes to TreeLex verbs.

7. Pour justifier cette décision, la direction invoque la déprime du marché automobile. 'To justify this decision, the management invokes the depression of the automobile market.'
T1: This verb cannot appear in a progressive form: *La direction est en train d'invoquer la déprime du marché automobile.
T2: La direction a invoqué la déprime du marché automobile is an acceptable answer to the question Que s'est-il passé hier?
T3: This verb cannot appear as a complement of commencer, continuer, etc.: *La direction a commencé/continué à invoquer la déprime du marché automobile.
T4: invoquer is not compatible with en x temps: *La direction a invoqué la déprime du marché automobile en deux heures.
T5: The sentence is not compatible with pendant x temps either: *La direction a invoqué la déprime du marché automobile pendant deux heures. This sentence is only acceptable in an iterative reading.
T6: La direction invoquait la déprime du marché automobile does not imply La direction a invoqué la déprime du marché automobile.
Thus, according to the battery of tests summarized in Table 4, invoquer in (7) should be assigned to the ACHIEVEMENT class.

Table 6
Syntactic information in TreeLex++.

Table 7
The distribution of verbs with respect to the number of arguments.

Table 8
Aspect distribution in TreeLex++.