Using Grammatical Inference For Structure Induction

Given the huge quantity of currently available textual information, text mining tackles the task of searching for useful knowledge in natural language documents. When dealing with a free-format textual corpus (e.g. job announcements) where linguistic rules are not respected, a time-consuming morpho-syntactic analysis is of little help. However, text mining techniques may exploit linguistic sub-structures in the text. In this paper, we present an application of grammatical inference (GI) in a machine learning system applied to a text corpus. We specify and use the grammatical inference process as an instance of a constraint satisfaction problem that instantiates automata in a (language-inclusion) lattice.


Introduction
Textual databases constitute the major part of the currently available information. Significant research work concentrates on Information Extraction (IE) from these databases.
Given a textual corpus, the information extraction process applied by Text Mining techniques (e.g. [24], [27], [26], [13]) consists of searching for non-explicit information in such corpora. As an example, Text Mining can extract significant information from marine catastrophe bulletins, such as the sequence of events preceding such disasters.
In a basic approach, the IE task would be tedious if no a priori structural information were available about the text. On the other hand, given the cost of a syntactic analysis, an IE process based on a full morpho-syntactic analysis of documents would often be unrealistic. When dealing with free-format texts, such an analysis would be of little interest in a text mining process that is usually based on key patterns.
In free-format texts, the rules of linguistic grammars are seldom respected. These texts rather contain a few words and omit entities such as determiners, verbs and punctuation.
In the current work, we are interested in the structures of sub-languages in free-format texts. For example, consider an advertisement for an exhibition on Egypt that will take place in Paris. Knowing the structure of the sub-language representing the address (where the exhibition takes place) may avoid concluding too quickly (and wrongly) on the place of the exhibition from the mere presence of the place name Egypt.
Text Mining has been an active research field since 1991 through the MUC programs. However, building a new system or adapting an existing one to a new domain is still domain specific and time-consuming. Although symbolic and statistical methods have been applied in some IE systems (e.g. [21], [28]), few have combined Grammatical Inference with (naive) statistical information.
Techniques of Grammatical Inference (GI) ([15], [16], [17]) promise to be useful in this field. They allow the Text Mining process to capitalize on the (partial) morpho-syntactic structure of patterns (or of sub-languages) with a small amount of information on the structure of the contents. These techniques attempt to induce the structure of source data (a flow of signs) as a set of production rules of a regular grammar. The induced grammar being an element of a (language-inclusion) lattice, text mining amounts to an informed search (seen as a generalization) within this lattice, carrying the required information and semantics.
This paper describes research work on the design and implementation of a GI process that was first successfully applied in a Pattern Recognition project on documents such as summaries, dictionaries and scientific reports. Here, once the linguistic structure of these documents is extracted, Textual Data Mining techniques ([24], [27], [25], [26], [20], [23]) are applied to such documents to extract valuable knowledge from the data.
For example, a given report document (in the corpus) can be (logically) structured by a production rule like: report ← abstract, outline, chapter, sub-chapter, chapter, references. Having one rule per example, the objective is to generalize these rules in a GI process and to propose an automaton that describes the underlying language.
A second application of this process was on a seminar announcement corpus. A seminar announcement may have the following structure (one of the possible formats): seminar ← heading, subject, speaker, date, hour, address, organizer. In this second experiment, the aim is first to learn how to recognize slot values (and their structures), and then to capture slot fillers from new announcements.
In this paper, we focus on the extraction of the structure and the content of the above announcement corpus using Grammatical Inference. We apply the Grammatical Inference process to a set of regular production rules. As in the above examples, each rule (one for each element of the sample set) represents the (logical) structure of the sample (footnote 1). Negative descriptions (and samples) can be provided in order to denote those structures that must be rejected. The inference engine then produces a representative regular grammar that will recognize documents in their respective context.
In the following, some background on Regular Inference is given.

Grammatical Induction
The problem of grammatical inference can be considered in a Constraint Satisfaction Problem framework (see e.g. [6]). Although some work (e.g. [10]) tackled this problem as an instance of graph coloring, the proposed approach gave an interesting but quite general idea of the question.
In [8], Gold showed that any recursively enumerable class of languages is identifiable using a complete representation with both positive and negative data. However, the class of regular languages cannot be correctly identified from positive examples only. Although the usual case in document handling is to learn from positive examples only (given in the set I+), the induced grammar can be drastically refined by some negative examples (the set I−), which avoids over-generalization (footnote 2).
It is known that any algorithm that constructs a deterministic finite automaton (DFA) with a minimum number of states compatible with all the data already processed can identify any regular language in the limit ([7]).
To achieve that, we developed an original and complete algebraic framework for Grammatical Inference. In this framework, we define a relation over the (language-inclusion) lattice of automata represented by the set of all samples I = (I+ ∪ I−) that leads to the construction of partitions over that search space. To realize that, an initial algebra A_GI is assigned to a regular grammar G_I associated to the sample set I. Then we focus on the definition of a quotient algebra A_GI/R of A_GI that leads to a uniquely defined isomorphism from A_GI/R to the language of the induced automaton A. This automaton is supposed to govern and generalize the language structuring the sample set.
Footnote 1: Various sections, like the address section of a seminar announcement, are handled in turn.
Footnote 2: As opposed to overfitting, the extreme case of over-generalization for an alphabet Σ is the language Σ*.
Within this algebraic framework (describing why the logical description is processed in that way), we discuss a general Constraint Satisfaction specification that characterizes the search space of the GI problem. Then, we define a set of constraints that outlines the quotient algebra above and constructs the final induced DFA (the automaton A).
Here, each valid announcement begins with the word seminar. The rule r_5 states that a message cannot begin with a determiner (Det), while r_6 states that a message with two successive Organizations must be rejected. Note that naively assuming that Organization always follows the word seminar is wrong.
In the next section, an algebraic specification of the GI problem is stated. Then, in sections 4 and 4.1, some practical issues and the implementation of the proposed CSP framework are reported together with some examples. Finally, some relationships with other work in this field are recalled in section 9.

Algebraic View of the GI
In the algebraic specification below, and relative to the sample set (I = I+ ∪ I−), we describe the properties of the partitions over the terms of the A_GI-algebra associated to the grammar G_I. Then we formally characterize a relation from these partitions to L(A), the language of the final induced automaton. This is done by defining a set of constraints that define a congruence relation R over the terms of A_GI. The latter produces a quotient algebra A_GI/R whose terms are isomorphic to those of L(A).
Quotients of the A_GI-algebra give a (language-inclusion) lattice. Here, our main aim in (regular) Grammatical Inference is to characterize this lattice and to guide the search in it.
The Grammatical Inference problem can be specified by using the relation between an initial many-sorted algebra and context-free grammars ([11]) (footnote 3). To construct the algebra associated to a context-free grammar G, each non-terminal of G is assigned a class of derivation trees. Consequently, the non-terminals of G are the sorts of a many-sorted algebra whose operations are defined by the production rules of G. The derivation trees (and the language) of any non-terminal X denote the carrier of the sort X of the algebra.
Let G = (N, T, P, S) be a context-free grammar and L_G be its language, where N is the set of non-terminals, T the set of terminals, P the set of production rules and S the start symbol. Let us associate to G the A_G-algebra whose signature is ((N ∪ T), Op), where Op is the set of names given to the productions in P. The terms of this algebra are (possibly partial) derivation trees starting from any non-terminal of G.
An A_G-algebra is initial in a category C based on the same signature if, for every algebra B of C, there exists a unique homomorphism f : A_G → B.
Let us now consider the sample set I = I+ ∪ I− (with I+ ∩ I− = ∅), the grammar G_I built from the set I, the A_GI-algebra associated to G_I and the language L(A) (A is the final induced DFA). We are interested in f such that f : A_GI → L(A). Consider the set of finite automata associated to the elements of I (one automaton per element of I) and let Tree(I) denote the tree of all these automata.
We define the automaton G_I = (Q ∪ {S}, Σ, P_δ, S, F) associated to Tree(I), where Q is the set of all states in Tree(I), Σ is the set of terminals in I and P_δ is the set of names given to the transitions in Tree(I). The start symbol S is such that S → p_01 | p_02 | ..., where p_ij ∈ P_δ and j is the rank of the transition in the i-th automaton associated to each element of I. F is the set of final states in Tree(I). G_I associated to Tree(I) is a possibly non-deterministic but ε-free (circuit-free) automaton.
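The construction of Tree(I) described above can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions: samples are assumed to be lists of tokens, and the function name `build_sample_tree`, the state names `q0, q1, ...` and the representation of transitions as triples are hypothetical choices, not the paper's implementation. Each sample gets its own chain of states hanging from the start symbol S, so the result may be non-deterministic.

```python
def build_sample_tree(positive, negative):
    """Build one chain of states per sample, all rooted at the start symbol S.

    Returns (transitions, final_pos, final_neg) where transitions is a list of
    (source, symbol, target) triples. The encoding is an assumption made for
    illustration only.
    """
    start = "S"
    transitions = []
    final_pos, final_neg = set(), set()
    next_id = 0
    for samples, finals in ((positive, final_pos), (negative, final_neg)):
        for word in samples:
            state = start
            for symbol in word:
                target = f"q{next_id}"   # fresh state for every symbol read
                next_id += 1
                transitions.append((state, symbol, target))
                state = target
            finals.add(state)            # last state of the chain is final
    return transitions, final_pos, final_neg


# Toy sample set (hypothetical): two positive headings and one negative one.
transitions, f_pos, f_neg = build_sample_tree(
    positive=[["seminar", "Org"], ["seminar", "Date"]],
    negative=[["Det"]],
)
```

In this toy case S has two outgoing transitions labelled `seminar`, which illustrates why the automaton is only "possibly non-deterministic" before any states are merged.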
If L_I+ is the language of the positive samples (resp. L_I− for the negative ones) generated by the final induced automaton A (that accepts only L_I+), then, for any partition of Q containing equivalent states (cf. section 4),
Footnote 3: This relation is easily extended to regular grammars.
Let us consider a congruence relation R, a partition Tree(I)/R of Tree(I) and its regular grammar G_R; A_GI and A_GI/R are the algebras assigned to G_I and G_R. In the following section, we define a homomorphism homo_R from A_GI to A_GI/R that formally defines the equivalence classes of the A_GI-algebra. Then, we state a constraint satisfaction specification of the (language-inclusion) lattice induced by homo_R and propose a Constraint Logic Program (CLP [6]) that searches, under some constraints, for a (not necessarily minimal canonical) DFA in that lattice.

The Quotient Algebra
Let A_GI = ((Q ∪ Σ), Op) be an algebra associated to the (regular) grammar of the sample set I. Terms of A_GI are derivation trees (denoted â or b̂) of the form r_i(r_j, r_m(..., r_k(r_n)...)) and of some sort q ∈ Q. Let R be a congruence relation on A_GI. Op is the set of names (like r_i) of rules of G_I of the form (q', α → q) or (α → q), with α ∈ Σ and q, q' ∈ Q. The quotient algebra A_GI/R induced by R is defined as follows: if r_i is the name of a production rule (q', q'') → q with q, q', q'' ∈ Q, then [r_i]([r_j], [r_m](..., [r_k]([r_n])...)) = [r_i(r_j, r_m(..., r_k(r_n)...))]. A derivation tree [â] in A_GI/R is constructed using the elements congruent to â ∈ A_GI.
Equivalently, (r_i, r_j) ∈ R implies r_i ≡ r_j (same as [r_i] ≡ [r_j]).
Quotient algebras are characterized by the universal property (up to an isomorphism [12]). This property is stated by the following (homomorphism) theorem applied to A_GI (the proofs are out of the scope of this paper):
Theorem 1. Let A_GI be the algebra associated to Tree(I) and R a congruence relation on A_GI. Then homo_R : A_GI → A_GI/R is a homomorphism that has the following property.
Let f : A_GI → L(A) be a homomorphism with the (former) congruence relation R; then there exists a unique homomorphism f̄ : A_GI/R → L(A) such that the following mapping diagram is commutative, i.e., f = f̄ ∘ homo_R.

Commutative Mapping Diagram
One can first show that homo_R is a homomorphism before proving the above theorem. One may note that the quotient algebra defines equivalence classes on A_GI. Defining f will let us reach our goal, which is to define L(A) from A_GI. In the next section, we define a CSP specification by the Congruence predicate that defines the congruence relation R (over the terms of A_GI) and hence characterizes the search space Lat_R (see the figure below) and the instantiations in it. Then we discuss the properties of I+ and I− with respect to L_I, L_I+, L_I− and L(A).
In the following figure, the top element Σ* of the lattice Lat_R represents the set of all the words that can be constructed over the alphabet Σ, and the bottom element ∅ represents the empty set. This search space contains all automata (one for each element of Σ*) among which the final automaton A is searched.

The Congruence Predicate
Recall that R is a congruence relation over (the sorts of) A_GI. The Congruence predicate constructs the store θ (a set of constraints) and assigns an equivalence class [q_i] to each q_i ∈ Q. The set θ may contain constraints over variables x_i, such as = and ≠. Whenever the set of final constraints is satisfiable and there is more than one solution, we choose the one that minimises the number of equivalence classes. Initially, θ = ∅.
In order to extract equivalence classes, this predicate is applied to every pair of (compound) terms of A_GI. Within each pair of terms, the predicate is applied to every pair of sub-terms of â_1 and â_2. Backtracking is used to compute a consistent θ (which characterizes Lat_R). Initially, [q_i] is the equivalence class of each q_i ∈ Q. Elements of I+ and I− are distinguished, hence we distinguish the final states of these two sets (F+ and F−, with F = F+ ∪ F− and F+ ∩ F− = ∅) from each other and from any other equivalence class.
Predicate Congruence(r_1, r_2): adds constraints to the constraint store θ.
Let r_1 and r_2 be transition rules (for â_1, â_2) with α, β ∈ Σ.
Given the rules r_1 and r_2 above (depicted in figure 3 below), the application of the Congruence predicate can produce 3 different configurations.
Although, in its simplest form, [α] = α, we introduced the notion of equivalence class for the alphabet using the lexical class function CL(α) = [α]. For example, different city names or two (possibly different) organizations (university, research laboratory) are equivalent. Obtaining the final induced automaton is a matter of search in Lat_R. This automaton is the solution of a consistent instantiation in the constraint store θ. Among all solutions, we pick the one that minimizes the number of states.
This predicate takes as input the sets I+ and I− and generates the final DFA, which is in turn an executable CLP program representing the induced grammar. In the application (i.e. test) phase, we try to match the automaton against a new input seminar announcement. In case of success, further processing can take place (e.g. assigning slot filler values in the announcement corpora, as stated briefly in the next section).
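The effect of a candidate partition on the sample automaton can be illustrated with a small Python sketch. It is not the CLP Congruence predicate itself: it only checks, for a partition given explicitly, the two conditions stated in the text, namely that the merged automaton is deterministic and that no equivalence class mixes final states of I+ and I−. The function name `quotient`, the dictionary `block_of` and the toy data are assumptions made for illustration.

```python
def quotient(transitions, final_pos, final_neg, block_of):
    """Merge states according to block_of (state -> equivalence class name).

    Returns (delta, accept, reject) for the merged automaton, or None when the
    partition is inconsistent (non-deterministic, or F+ and F- are mixed).
    """
    delta = {}
    for source, symbol, target in transitions:
        key = (block_of[source], symbol)
        merged_target = block_of[target]
        # setdefault stores the first target seen; a different one is a conflict
        if delta.setdefault(key, merged_target) != merged_target:
            return None
    accept = {block_of[q] for q in final_pos}
    reject = {block_of[q] for q in final_neg}
    if accept & reject:
        return None
    return delta, accept, reject


# Hypothetical sample automaton: two positive chains and one negative chain.
transitions = [
    ("S", "seminar", "q0"), ("q0", "Org", "q1"),
    ("S", "seminar", "q2"), ("q2", "Date", "q3"),
    ("S", "Det", "q4"),
]
final_pos, final_neg = {"q1", "q3"}, {"q4"}

good = {"S": "A", "q0": "B", "q2": "B", "q1": "C", "q3": "C", "q4": "D"}
bad = {"S": "A", "q0": "B", "q2": "B", "q1": "C", "q3": "C", "q4": "C"}
result = quotient(transitions, final_pos, final_neg, good)
```

A search procedure would enumerate such partitions (here replaced by two hand-written ones) and keep a consistent one with the fewest classes, which corresponds to the minimisation criterion mentioned above.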
The following figure shows a more general case. Note that if we consider α_1 (resp. β_1) as the left context of α_2 (resp. β_2) and α_3 (resp. β_3) as its right context, we will cover, to some extent, the case studied in [21]:

Figure 4. Contexts and states
Applying the Congruence predicate to the above case will produce 5 different configurations (depending on the equivalence classes of α_i, β_i) with various numbers of states, among which the final induced minimal DFA has 4 states. The constraint store will then decide the final induced DFA, considering all transitions and the negative examples.
It is worth emphasizing that Grammatical Induction applied only to positive examples (I+) tends to over-generalize L+ (see e.g. [20]). Hence, one may express negative descriptions that are representative of the words to be rejected. For example, we may state that a seminar announcement heading containing the Hour value must be rejected. The I− set of section 5.5 contains some negative examples for an announcement heading.

An Example
The theoretical aspects and the implementation issues of the related work were first validated using the experimental protocol cited in [9]. The following example is an original one that shows some interesting aspects of the grammatical inference engine.
The generated automaton A and L(A) are given below. Here, a notation like a/s,f over an arc means that the transition is part of a success (s) or failure (f) derivation. The tag <s,f> means that the transition can possibly lead to a success or to a failure. It is interesting to note that even though I contains words of the context-free language a^n b^n, the above DFA extracts the following underlying knowledge from the sample set I (cf. a^n b^m): the DFA recognizes either an even number of a's followed by an even number of b's, or an odd number of a's followed by an odd number of b's. But it also rejects an even number of a's (resp. b's) followed by an odd number of b's (resp. a's) (which are words in L−). Obviously, A cannot generate that context-free language but learns a part of it.
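To make the language described above concrete, here is a small hand-written DFA in Python that accepts a^n b^m exactly when n and m have the same parity. It is a sketch of the described language only, not the induced automaton: the state names are hypothetical (they are not the q_0 ... q_5 of the paper), and it assumes that both blocks are non-empty (n, m ≥ 1).

```python
# Transition table of a parity-checking DFA (state names are illustrative).
DELTA = {
    ("start", "a"): "a_odd",
    ("a_odd", "a"): "a_even",
    ("a_even", "a"): "a_odd",
    ("a_odd", "b"): "b_match",      # 1 b after an odd number of a's
    ("a_even", "b"): "b_mismatch",  # 1 b after an even number of a's
    ("b_match", "b"): "b_mismatch",
    ("b_mismatch", "b"): "b_match",
}
ACCEPTING = {"b_match"}


def accepts(word):
    """Return True if word is a^n b^m with n, m >= 1 and n, m of equal parity."""
    state = "start"
    for symbol in word:
        state = DELTA.get((state, symbol))
        if state is None:          # missing transition: reject
            return False
    return state in ACCEPTING
```

Under these assumptions `accepts("aabb")` and `accepts("abbb")` both hold, which shows how the regular approximation covers a^n b^n but also words with n ≠ m.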
Related to this induction is the notion of grammatical enrichment, which may be defined as follows. Suppose that the state [q] originates from I−. If there is any successful derivation of ω ∈ L(A) containing L([q]), then we say that I− enriches L+. In the above example, bb derived through q_0 q_5 q_2 q_3 is in the enriched L+ if we do not constrain that derivation to use only success (<s> or <s,f> tag) edges.

The Text Mining Application
As mentioned above, we used the GI system to extract the linguistic structure of different parts of a seminar announcement database. An example of such an announcement is given below. We want to extract various pieces of information such as the Date or the Subject of a seminar. Final measurements like the research fields of a university (or a researcher, etc.) can then be extracted. In this process, whose goal is to extract slot fillers, valuable template slot fillers are already defined by an expert, who knows in advance which kind of information is contained (and sought) in the database.
It is also appropriate to note that a seminar announcement can be incomplete. For instance, the Hour may be missing from an announcement or it can be expressed in a different form (for example, by the expression "Friday afternoon").
The remainder of this paper describes the use of the Grammatical Inference engine (consolidated by a Bayesian analysis, see section 5.3) for the textual IE task applied to the seminar announcement corpus.

Slots and Fillers of the Corpus
The following slots are defined for the seminar announcements corpus (abbreviations are further used in the paper). An announcement starts with the seminar (séminaire in French) keyword.

Related Grammatical Inference
In the IE process applied to natural language texts, there are major differences between Sentence Analysis and traditional NLP parsers. The goal of syntactic analysis in an IE system is not to produce a complete parse tree for each sentence in the text. Instead, our system needs only to perform partial parsing. That is, it needs only to construct as much structure as the IE task requires.
Current methods (see e.g. [2], [1]) generally use global constraints to resolve local ambiguities. But because of gaps in grammatical and lexical coverage, full sentence parsers may end up making poor local decisions about structures in order to create a parse spanning the entire sentence.
Furthermore, syntactic analysis is avoided in a text mining process for several more reasons: the cost and complexity of this analysis; the little use made of its results (the goal is not to correct errors or to translate the text); the fact that the texts may not follow correct and complete syntax rules (of French in our case); etc.
A partial parser looks for fragments of text that can be reliably recognized, e.g., noun and verb groups. Because of its limited coverage, a partial parser can rely on general pattern-matching techniques, particularly finite-state machines, to identify these fragments deterministically based on purely local syntactic elements. Partial parsing is well suited for information extraction applications for an additional reason: the ambiguity resolution decisions that make full parsing difficult can be postponed until later stages of the processing, where top-down expectations from the information extraction task can guide the system's actions.
In our seminar announcement corpus, the subject is similar to a noun group but may not follow its rigorous syntax. The inference stage then helps, in this case, to retain the rules effectively used in the examples. Therefore, the corresponding text mining process will rather be a syntax-directed process.
Starting from a sample set (positive examples and negative case descriptions, see section 5.5 for an example of GI), Grammatical Inference (GI) induces a regular grammar (a DFA) for this sample set. In the test phase, sentences presented to the grammar are regarded as belonging (or not) to the language generated by the induced grammar.
Grammatical Inference carries out a classification of the sentences (accept or reject means belonging or not to a given language) but, in its original form, it does not handle the semantics of these constructions. Hence, Bayesian measures guide the process by predicting the slot and its value (in its context) to be submitted to the grammar. The IE process is then achieved with more precision and reliability (see also [35]).

Naive Bayesian use
Several text mining techniques use Bayesian analysis, which (even in its naive form) gives interesting results. In the method known as naive Bayes, the document is represented as a vector of characteristics (e.g. the various sections of an announcement). Other representations such as bag of words consider the text as a collection of words in which any internal structure (physical, logical, morpho-syntactic or semantic) is ignored.
The Bayesian rule is recalled below. Given a hypothesis (e.g. having a given section of the class C in a given context inside a seminar announcement) and an announcement E over C, we have: P(C | E) = P(E | C) × P(C) / P(E). The idea is to express the weighted probability of the membership of a pattern or a sub-language within a class C according to the characteristics of the text E and those of other texts classified as such.
To summarise the current process, key patterns leading to the recognition of the various (but not all) fillers of an announcement are first defined during the training stage. Together with the key patterns, the frequency measurements and the regular production rules help to decide (to classify) a section of the announcement. During the test phase, a pattern p first gets a probability of belonging to a slot filler from the presence of a deterministic keyword (100%) and/or from the probability (from the frequency table) of its (possibly left and right) contexts. p is then submitted to the induced grammar according to these probabilities. Failure cases are postponed to the post-processing step. The process uses backtracking to consider other possibilities (see section 6 for the Sub filler).
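The test-phase control flow summarised above (rank candidate slots by probability, submit the pattern to the corresponding grammar, fall back to the next candidate on failure) can be sketched as follows. This is an illustrative sketch only: the function `assign_slot`, the regular expressions standing in for the induced grammars, and the probability values are all hypothetical and are not taken from the paper's system.

```python
import re


def assign_slot(pattern, slot_probabilities, recognizers):
    """Try candidate slots in decreasing probability order.

    Returns (slot, probability) for the first slot whose recognizer accepts
    the pattern, or None so the case can be left to post-processing.
    """
    ranked = sorted(slot_probabilities.items(),
                    key=lambda item: item[1], reverse=True)
    for slot, probability in ranked:
        if probability > 0 and recognizers[slot](pattern):
            return slot, probability
    return None


# Stand-ins for the induced grammars (assumptions, for illustration only).
recognizers = {
    "Date": lambda s: re.fullmatch(r"\d{1,2}/\d{1,2}/\d{4}", s) is not None,
    "Hr": lambda s: re.fullmatch(r"\d{1,2}:\d{2} H", s) is not None,
}

# The Date hypothesis is tried first, fails, and the Hr hypothesis succeeds.
choice = assign_slot("14:30 H", {"Date": 40, "Hr": 30}, recognizers)
```

The loop plays the role of backtracking over alternatives; a pattern rejected by every candidate grammar returns None, matching the idea that failure cases are deferred to post-processing.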

Details of the GI process
It is easy to see that a simple textual search (based on keywords) is not appropriate for extracting knowledge from our seminar announcements. Knowledge extraction methods based on Bayesian analysis make it possible to predict the position of a given piece of information in the text together with its average length (see e.g. [35]). This technique, based on learning the position of a section (e.g. the <Sub> section), would not be appropriate here because of the free format of the announcements. In addition, an announcement can be incomplete. Thus, obtaining the induced grammar of e.g. the <Adr-Place> section makes it possible to analyse the content of that sub-language.
We use grammatical inference on various sub-languages (e.g. the heading or the subject of an announcement) that may contain relevant information. As an example, the heading can contain a topic, a subject or an organizer that may be expanded in the remainder of the announcement. The subject (Sub) can add precise details to the Topic of the seminar and vice versa. Such complementary data are both recorded in the frequency table and hard-coded in the production rules. The sequence of operations is governed by the key patterns, the probabilities from the frequency table (table 1) and, finally, by the production rules (footnote 10).
It may be noted that if Grammatical Induction is performed only on positive examples (the set I+ below), then the result tends to over-generalise the induced language. Hence, the expert may express negative descriptions that are representative of the words that must be rejected (footnote 11). For example, the expert may state that a seminar announcement heading containing the Hour value must be rejected. The following example contains some negative examples for an announcement heading (the set I−).

An Example of GI
As an example, the results of grammatical inference on the heading of announcements follow. The grammar below partially describes what the headings of the sample set contained. Hence, the following I+ does not cover all possible headings in all seminar announcements, but only those of the sample set.
Footnote 10: However, we are not in the context of the so-called Probabilistic Grammars.
Footnote 11: For the seminar announcement case, negative examples are quite straightforward.
The induced grammar accepts the language L+ (the induced language of I+) and rejects the words of L− (the induced language of I−). The final induced automaton accepts the language given below. The rules that reject unsuitable constructions (i.e. words in L−) are not reported here for the sake of clarity. However, one may observe that a rejection takes place in the induced DFA when a derivation (upon a token) leads to a final failure state F− (see section 5.4).
The language of the induced finite state automaton
The language induced from the set I = (I+ ∪ I−) for the heading part of announcements is given below. Recall that this definition gives only the successful derivation paths.
Nota Bene: the induced grammar is an operational logical grammar (extended DCG). Predicates expressing constraints and actions are then added to its rules (see the Date example below). As examples of actions performed while recognizing elements in their context: a <Thème> may contain a part of the Subject, in which case the value corresponding to the Subject is added to the <Sub> filler; for a <Ville>, the corresponding city value is added to the <ADR-Place> filler. Other possible adjustment actions are performed during the post-processing phase.

An example: the Date analyser DCG
Below, some of the induced grammar rules (annotated with their semantic actions, given inside brackets) for the <Date> filler are given. The case where a part of a Date (e.g., the day name) is missing is not reported here.

Nota Bene: the value 100 (parameter of the predicate add) indicates the confidence coefficient of the filler assigned to the slot. Here, the case of <Date> is rather simple and follows a known format. We may however note that the presence of "matin/après-midi" (morning/afternoon, i.e. AM/PM) in the <Date> will complete the <Hr> slot filler.

Frequency Measurements
The following percentage values are computed from the input samples.
Table 1. Frequency table of the announcement sections
In the above table, a cell C_ij gives the frequency (or the Support, see below for a definition) with which the column j followed the line i in the training set. The Pres (Presence) column (the last one) gives the frequency of each element of the line in the training set (e.g. the Sub is present in only 45% of the announcements). We add to this table two other values: 77% of the announcements contain a Topic in their heading, and 18% of the headings contain an indication of the organizer (Org).
The cells containing 0% are of particular interest because they indicate the cases that do not occur. For example, <OrgSp> never follows the heading of an announcement.
As an example, we apply the conditional probability to the section Sub of the example of section 5, where the slot of the second line is not determined. This example shows how the post-processing helps decide that slot's filler. Given table 1 above, the probability that the unknown section (in the example given in section 5) is a Subject (surrounded by the Heading and the Speaker) is 12%. However, this announcement does not contain a Subject in its heading, and the Speaker is the successor of a Subject in 23/45 of the cases. Therefore, the filler is predicted at 23% (weighted 51%) to be the Subject.
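The frequency measurements described above can be sketched in Python as follows. This is a sketch under assumptions: each training announcement is assumed to be available as an ordered list of section labels, frequencies are expressed as percentages of the number of announcements (which is consistent with the 23% and 45% figures quoted above), and the function names and the toy corpus are hypothetical.

```python
from collections import Counter


def frequency_table(announcements):
    """Return (follow_pct, presence_pct) as percentages of all announcements.

    follow_pct[(i, j)] is the share of announcements in which section j
    immediately follows section i; presence_pct[i] is the share in which
    section i occurs at all.
    """
    n = len(announcements)
    follows, presence = Counter(), Counter()
    for sections in announcements:
        presence.update(set(sections))
        follows.update(set(zip(sections, sections[1:])))
    follow_pct = {pair: 100.0 * c / n for pair, c in follows.items()}
    presence_pct = {s: 100.0 * c / n for s, c in presence.items()}
    return follow_pct, presence_pct


def weighted(follow_pct, presence_pct, first, second):
    """Frequency of `second` following `first`, relative to presence of `first`."""
    return 100.0 * follow_pct.get((first, second), 0.0) / presence_pct[first]


# Toy corpus (hypothetical section sequences, not the paper's data).
corpus = [
    ["Head", "Date", "Sub", "Sp"],
    ["Head", "Sub", "Sp"],
    ["Head", "Date", "Sp"],
    ["Head", "Date", "Hr"],
]
follow_pct, presence_pct = frequency_table(corpus)
```

With the figures quoted in the text, the same weighting gives 100 × 23 / 45, i.e. about 51%, which is the weighted value reported for the Subject.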
Note that the strongest probability of the section that follows the heading is the Date section.However, one can recognize a Date by the keywords in the induced grammar.
The depth of the Morpho-Syntactic analysis engine is a system parameter.In some cases, the (partial) linguistic class from this filler can be extracted giving a (partial) Noun Group (even without any initial determinant, see e.g.[1]).

Results for the Example
This section briefly describes some experiments on the seminar announcement corpus.
For the grammatical induction, the GI process is applied within the morphological step in order to learn to reject useless combinations, such as constructions that are linguistically ambiguous and useless for us (footnote 16). Once the learning step was completed on the seminar corpus, we obtained the following results for the seminar example of section 5 (the confidence coefficient for a filler value is reported at right when it is less than 100; the original database is in French):
Org = "Institut de Physique Nucleaire de Lyon"
Sub = "Le probleme des conversions de modes" (51)
Sp = "Yves Colin de Verdiere" (51)
OrgSp = "Institut Fourier Grenoble" (61)
Hr = "14:30 H"
Adr_Plc = "Salle 27-Rez de chaussee-Bât. Paul Dirac"
Adr = "Institut de Physique Nucleaire de Lyon"
Date = ""

Performance Evaluation
Several textual IE systems, notably those of MUCs, involved large training corpora with thousands of documents (see e.g.[26]).However, such large training corpora (and their associated templates) may not be available for most real tasks.
Experiments with smaller training collections (such as the 100 documents provided for MUC-6) suggest that fully automated learning techniques applied to a few text examples with minimal automatic syntactic processing may not be able to achieve sufficient coverage (see e.g.[34]).
We paid a special attention to the over generalization pitfall of the GI engine.An amount of work was done in testing the GI on several different corpora (bibliography, abstract, table of content, etc.) in order to improve induction algorithm.The GI engine is parametric such that several different degrees of generalization 17 can be set (by varying the constraints over the language-inclusion lattice of automata).The output automaton is then tested against 16 Here, some linguistic knowledge is required to eliminate useless lexical class combinations from morphological analysis. 17Three for the moment the training set and the one (that accepts all positive examples rejecting all negative one) with the least number of states is chosen.One may observe that the refinement operator is hard-coded within the the Congruence Predicate of section 4 In addition, another parameter is available in the GI engine the turns on-off the so-called enrichment issue (section 4.1).
However, we are aware that larger sample sets (and other domain-specific corpora, such as abstract scanning) are needed to improve the system. A larger sample set has a drawback, however. Recall that the search space is the language-inclusion lattice specified by the GI process and illustrated by the Congruence Predicate. This search space grows exponentially with the size of the sample set I.
Starting with 300 examples, we applied ten-fold cross-validation and observed that the results did not change significantly when more examples were added. Metric used: in the IE task (i.e. the corpus is known to contain announcements), the evaluation metrics are based on filler presence and prediction.

$$\text{Precision} = \frac{\text{Number of correctly assigned slots}}{\text{Number of assigned slots}}$$

$$\text{Recall} = \frac{\text{Number of correctly assigned slots}}{\text{Number of correct present slots}}$$

In addition, a harmonic measure called the F-measure (see e.g. [29]) is used to give the mean of the above values:

$$F\text{-measure} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$

The diagram of figure 5 shows the performance percentages we obtained. For the seminar announcements corpus, the high performance values (95% and 80%) are not surprising, given the intended slots and the relatively low risk of error. The system is quite domain-specific and may still be enhanced. Student work is currently under way to adapt the system to other corpora.
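As an illustration of the slot-based metrics, the following Python sketch computes precision, recall and the usual harmonic-mean F-measure, 2PR/(P+R), from predicted and reference templates. The function name `slot_scores`, the dictionary layout and the toy data are assumptions made for the example; an empty string is taken to mean that a slot is not assigned (or not present).

```python
def slot_scores(predicted, gold):
    """Return (precision, recall, f_measure) over a list of documents.

    `predicted` and `gold` are parallel lists of dicts mapping a slot
    name to its filler. Empty or missing fillers do not count as
    assigned (in `predicted`) or as present (in `gold`).
    """
    assigned = correct = present = 0
    for pred, ref in zip(predicted, gold):
        present += sum(1 for value in ref.values() if value)
        for slot, value in pred.items():
            if not value:
                continue
            assigned += 1
            if ref.get(slot) == value:
                correct += 1
    precision = correct / assigned if assigned else 0.0
    recall = correct / present if present else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


# Toy data: four slots are present, two are assigned, both correctly.
gold = [{"Org": "IPNL", "Hr": "14:30 H",
         "Sp": "Yves Colin de Verdiere", "OrgSp": "Institut Fourier Grenoble"}]
predicted = [{"Org": "IPNL", "Hr": "14:30 H", "Sp": "", "Date": ""}]

precision, recall, f_measure = slot_scores(predicted, gold)
```

With these toy templates every assigned slot is correct, but only half of the present slots are found, so the F-measure falls between the two values.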

The Related Work
Several textual IE systems have been proposed since research in the area was stimulated by the MUC program of DARPA (e.g. [22], [29]).
The use of a pattern dictionary is common to many systems. Some use clustering to create patterns by generalizing those identified by an expert (see e.g. [33]). The dictionary used during the analysis step basically contains keywords (and their lexical classes).
Syntactic information can be used, as in Autoslog ([31], [32]), which uses a set of general syntactic patterns validated by an expert. Among these systems, some use advanced syntactic analysis to identify the relationships between the syntactic elements and the linguistic entities (e.g. in [28]). This analysis is costly (when semantic information is not used) and may limit the system, especially if linguistic rules are not respected (as in our seminar examples).
In many IE systems, human interaction is heavily required throughout the different phases of training. Machine learning techniques such as decision trees are used ([30]) to extract coreferences from annotated coreference examples.
Among these systems, the current work is close to the RAPIER system ([21]). RAPIER is an ILP system that takes pairs of documents and filled templates and induces rules that directly extract fillers for the slots in the template. This system uses constraints on the words and part-of-speech tags in the left and right contexts surrounding the fillers. To some extent, our system can be seen from this point of view since, as mentioned in the GI section (section 4), our grammatical inference engine implements this technique implicitly. In addition, these results should be compared with those of the Named-Entity research work (see e.g. [3]), which aims to learn names by identifying all named locations, persons, organizations, dates and so on.

Conclusion
In this paper, a new constraint satisfaction framework for GI has been presented and implemented by an operational constraint logic program that outputs the final DFA. Here, the algebraic specification gives a theoretical framework that states why some processing steps are performed (e.g. merging states) before explaining how they are performed. This specification allows us to show that the homomorphism f (section 3.1) exists, and we gave an implementation of it by the Congruence predicate, which produces a set of constraints. If this set is satisfiable, then we choose a solution with the fewest states. Among other works in the field, [13] and [14] proposed similar methods for document analysis. However, within the algebraic and constraint satisfaction frameworks of grammatical inference, the logical aspects of direct grammar extraction have, as far as we know, not yet been investigated.
This work was initiated in a (paper) document processing project where GI results are used to classify documents and then translate them into machine-readable form. Other applications dealing with more general multimedia contents (video in particular) are under study.
The GNU-Prolog code of the implementation is available from the author.
On top of the GI part, we designed and implemented an IE system that fills the slots of a template associated with seminar announcements using Bayesian measurements. Once the template slots are filled, the usual techniques of Data Mining can be applied to the results, since the resulting slot values simply describe a relational database scheme. One current use of the system is to extract information such as the research fields of universities, laboratories or researchers, that is, to guide PhD students in their research. This is a work in progress, and the performance results encourage us to continue the project. We plan first to enhance the system and then to extend it to other corpora such as job announcements and marine weather announcements. The aim is to establish statistics on marine catastrophes and forecasts. The system will be integrated into a (database) data mining engine in order to establish valuable information on marine events.
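The remark that filled templates describe a relational database scheme can be illustrated with a small Python sketch that stores templates in an SQLite table (one column per slot) and runs a simple aggregate query. The table name `seminar`, the function names, the query and the second speaker organization are assumptions chosen for the example; the slot names follow the seminar template of the paper.

```python
import sqlite3

SLOTS = ("Org", "Sub", "Sp", "OrgSp", "Hr", "Adr_Plc", "Adr", "Date")


def store_templates(conn, templates):
    """Create the (hypothetical) seminar table and insert filled templates."""
    columns = ", ".join(f'"{slot}" TEXT' for slot in SLOTS)
    conn.execute(
        f"CREATE TABLE IF NOT EXISTS seminar "
        f"(id INTEGER PRIMARY KEY, {columns})")
    names = ", ".join(f'"{slot}"' for slot in SLOTS)
    placeholders = ", ".join("?" for _ in SLOTS)
    for template in templates:
        # Slots missing from a template are stored as NULL.
        conn.execute(
            f"INSERT INTO seminar ({names}) VALUES ({placeholders})",
            [template.get(slot) for slot in SLOTS])
    conn.commit()


def talks_per_speaker_organization(conn):
    """Count announcements per speaker organization, sorted by name."""
    rows = conn.execute(
        'SELECT "OrgSp", COUNT(*) FROM seminar '
        'WHERE "OrgSp" IS NOT NULL GROUP BY "OrgSp" ORDER BY "OrgSp"')
    return rows.fetchall()


conn = sqlite3.connect(":memory:")
store_templates(conn, [
    {"Sp": "Yves Colin de Verdiere", "OrgSp": "Institut Fourier Grenoble"},
    {"Sp": "Speaker B", "OrgSp": "Institut Fourier Grenoble"},
    {"Sp": "Speaker C", "OrgSp": "Laboratoire X"},
    {"Hr": "14:30 H"},
])
counts = talks_per_speaker_organization(conn)
```

Any standard SQL or data mining tool can then be applied to the table; the aggregate above is only one example of the kind of statistic mentioned in the text.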

Figure 3. Transitions for r_1 and r_2

<Sub> the (general) topic and the subject of the seminar,
<Org> the organizer, i.e. a university, lab, ...
<Adr-Plc> the address and/or the place where the seminar takes place,
<Sp> the person who will give the talk,
<OrgSp> the organization of the speaker (e.g. the research lab of the speaker),
<Date> the date of the seminar,
<Hr> the beginning hour (or the time range) of the seminar.

Table
Here, OrgSp abbreviates Organizer-Speaker, Pres stands for Present, Sub for Subject, Plc for Place, Hr for Hour and Sp for Speaker: