(Surface) Sequoia Treebank

The corpus contains 3099 French sentences, from Europarl, Est Republicain newspaper, French Wikipedia and European Medicine Agency.

It has initially been annotated with constituency trees and converted to surface syntactic dependency trees.

It has recently been augmented with deep syntactic dependencies, see the Deep Sequoia corpus page.

Adding deep syntactic dependencies has been the occasion of correcting some surface dependencies, leading to the deep-sequoia-7.0 release, which contains several kinds of syntactic annotations of the same sentences:

  • deep syntactic dependency graphs
  • surface dependency trees
  • surface constituency trees

Description of the (surface) Sequoia corpus

Contact : ,

The corpus is freely available under the free licence LGPL-LR (Lesser General Public License For Linguistic Resources) cf. http://infolingu.univ-mlv.fr/DonneesLinguistiques/Lexiques-Grammaires/lgpllr.html

If you use the surface syntactic trees, please cite the following paper:

Candito M. and Seddah D., 2012 : "Le corpus Sequoia : annotation syntaxique et exploitation pour l’adaptation d’analyseur par pont lexical", Actes de TALN'2012, Grenoble, France

The total number of sentences is 3099 (as of version 6.0 and above), and each sentence is annotated for part-of-speech and phrase-structure, following the French Treebank guidelines.

The constituency trees were then automatically converted to dependency trees.

Manual corrections were performed on the version 3.1 to obtain versions 3.2 and further 3.3. The corrected errors were spotted by Bruno Guillaume, using the grew tool, on the automatically converted dependency trees. ( http://wikilligramme.loria.fr/doku.php?id=sequoia )

Versions 4.0 and above contain manual annotations of long-distance dependencies, as described in :

Candito M. and Seddah D., "Effectively long-distance dependencies in French : annotation and parsing evaluation", Proceedings of TLT'11, 2012, Lisbon, Portugal)

In versions 6.0 and above, some duplicate sentences have been removed (from the EMEA-test part of the corpus), leading to a corpus of 3099 sentences. See the appendix for the ids of the removed sentences.

Contact : ,

Version history

V6.0 may 2014

This version contains some manual corrections (with respect to previous versions), performed at the occasion of the Deep Sequoia project : the Alpage Team and the Sémagramme team have added deep syntactic annotations to the dependency version of the sequoia corpus (see https://deep-sequoia.inria.fr/fr/). During the deep sequoia project, some errors in the surface dependency trees have been corrected.

As a result, contrary to previous releases : the dependency version no longer results from automatic conversion of the constituency trees.

V5.2 june 2013

Changes pertaining to constituency trees (and to automatically-derived dependency trees) :

- various manual corrections (spotted during manual annotation of "deep" syntactic dependencies, by Alpage and Calligramme)

- debug of the distinction between the categories P and P+D : now compound prepositions can bear the category P+D if they amalgamate the determiner

Changes pertaining to dependency trees only :

- complement versus adjunct distinction for prepositional dependents of adjectives or adverbs : (possible labels for prepositional dependents of adjectives and adverbs are : "mod" for modifiers, and "a_obj", "de_obj" or "p_obj" for complements

- handling of coordination of determiners as normal coordination (despite the absence of DP in the FTB) and better handling of ranges of determiners : "trois ou quatre personnes" (three or four people) => coordinating conjunction now depends on the first determiner "trois à quatre personnes" (three to four people) => "à" (to) now regularly depends on the first determiner

- various debugs during prediction of labels absent from the constituency trees: including - nominal modifiers of prepositions "deux ans avant l'élection" (two years before the election) => "ans" (years) is now a modifier of the preposition "avant" (before)

V5.1 march 2013

- Manual corrections of errors that were spotted by Bruno Guillaume using grew : http://wikilligramme.loria.fr/doku.php?id=sequoia:errors

- annotation of canonical functions for the arguments of causative verbal nucleus (faire + infinitival verb) Labels FCT1:FCT2 mean: - FCT1= surface grammatical function - FCT2= canonical grammatical function So in "Paul fait payer la somme par son avocat", "Paul" receives the label "SUJ:ARGC", as it is the surface subject, but the underlying causer

- annotation of non referential "il" subject clitics (feature void=y) Label SUJ:SUJ, feature void=y Plus feature 'intrinsimp=y' in case of impersonal verbs (with no "personal" counterpart)

V4.0 october 2012

Manual annotation of long distance dependencies.

In the conll files, columns 7 and 8 give the head and dependency label that take into account non-local phenomena, leading to sometimes non projective trees. Columns 9 and 10 contain the head and dependency label obtained via automatic conversion, and are necessarily projective.

V3.3 july 2012

Manual corrections of errors that were spotted by Bruno Guillaume using grew :

http://wikilligramme.loria.fr/doku.php?id=sequoia

- 60 corrected errors in the phrase-structure trees

- 13 errors coming from the automatic conversion to dependencies (still to correct)

- 45 false errors (no correction to make)

Addition of the *xml files : phrase structure trees in the XML format of the FTB

V3.2

a few manual corrections

V3.1

addition of the *+fct.mrg_strict files, strictly compliant to the FTB annotation scheme

V3.0 june 2012

First public release

Content

    • Files *+fct.mrg_strict are files with one sentence per line, with phrase-structure in bracketed format, in which the phrase-structures are strictly compliant to the French Treebank annotation scheme.
    • Files *.xml are the XML FrenchTreebank format for the *+fct.mrg_strict files
    • Files *+fct.mrg are the same sentences, still in bracketed format, but with phrase-structures compliant to the ftb-uc treebank instanciation :

*** prepositions that dominate a infinitival VP do project a PP

*** any sentence introduced by a complementizer (CS tag) is grouped into a Sint constituent

    • In first releases, the *conll files are obtained through automatic conversion of the *+fct.mrg files to surfacic projective dependency trees, in CoNLL 2006 format

From version 4.0, columns 7 and 8 give the head and dependency label that take into account manual annotation of LDDs, that override the automatic conversion, leading to sometimes *non projective trees*. Columns 9 and 10 contain the head and dependency label obtained via automatic conversion, and are necessarily projective.

From version 6.0 : the *conll files contain some manual corrections, performed during the Deep Sequoia project (see https://deep-sequoia.inria.fr/fr/).

Number of sentences for each sub-domain :

561 sentences Europarl file= Europar.550+fct.mrg

529 sentences EstRepublicain file= annodis.er+fct.mrg

996 sentences French Wikipedia file= frwiki_50.1000+fct.mrg

574 sentences EMEA (dev) file= emea-fr-dev+fct.mrg

544 sentences EMEA (test) file= emea-fr-test+fct.mrg, among which 101 were removed (because duplicates) in version 6.0.

Data split (TALN 2012 experiments)

The "neutral" domain is made of EstRepublicain + Europarl + FrWiki, and the split into dev and test sets is the following :

head -265 annodis.er+fct.mrg >> sequoia-neutre-dev+fct.mrg

head -280 Europar.550+fct.mrg >> sequoia-neutre-dev+fct.mrg

head -498 frwiki_50.1000+fct.mrg >> sequoia-neutre-dev+fct.mrg

tail -264 annodis.er+fct.mrg >> sequoia-neutre-test+fct.mrg

tail -281 Europar.550+fct.mrg >> sequoia-neutre-test+fct.mrg

tail -498 frwiki_50.1000+fct.mrg >> sequoia-neutre-test+fct.mrg

Appendix : duplicate sentences removed in version 6.0

< emea-fr-test_00301

< emea-fr-test_00302

< emea-fr-test_00303

< emea-fr-test_00304

< emea-fr-test_00305

< emea-fr-test_00306

< emea-fr-test_00307

< emea-fr-test_00308

< emea-fr-test_00309

< emea-fr-test_00310

< emea-fr-test_00311

< emea-fr-test_00312

< emea-fr-test_00313

< emea-fr-test_00314

< emea-fr-test_00315

< emea-fr-test_00316

< emea-fr-test_00317

< emea-fr-test_00318

< emea-fr-test_00319

< emea-fr-test_00320

< emea-fr-test_00321

< emea-fr-test_00322

< emea-fr-test_00323

< emea-fr-test_00324

< emea-fr-test_00325

< emea-fr-test_00326

< emea-fr-test_00327

< emea-fr-test_00328

< emea-fr-test_00329

< emea-fr-test_00330

< emea-fr-test_00331

< emea-fr-test_00332

< emea-fr-test_00333

< emea-fr-test_00334

< emea-fr-test_00335

< emea-fr-test_00336

< emea-fr-test_00337

< emea-fr-test_00338

< emea-fr-test_00339

< emea-fr-test_00340

< emea-fr-test_00341

< emea-fr-test_00342

< emea-fr-test_00343

< emea-fr-test_00344

< emea-fr-test_00345

< emea-fr-test_00346

< emea-fr-test_00347

< emea-fr-test_00348

< emea-fr-test_00349

< emea-fr-test_00350

< emea-fr-test_00351

< emea-fr-test_00352

< emea-fr-test_00353

< emea-fr-test_00354

< emea-fr-test_00355

< emea-fr-test_00356

< emea-fr-test_00357

< emea-fr-test_00358

< emea-fr-test_00359

< emea-fr-test_00360

< emea-fr-test_00361

< emea-fr-test_00362

< emea-fr-test_00363

< emea-fr-test_00364

< emea-fr-test_00365

< emea-fr-test_00366

< emea-fr-test_00367

< emea-fr-test_00368

< emea-fr-test_00369

< emea-fr-test_00370

< emea-fr-test_00371

< emea-fr-test_00372

< emea-fr-test_00373

< emea-fr-test_00374

< emea-fr-test_00375

< emea-fr-test_00376

< emea-fr-test_00377

< emea-fr-test_00378

< emea-fr-test_00379

< emea-fr-test_00380

< emea-fr-test_00381

< emea-fr-test_00382

< emea-fr-test_00383

< emea-fr-test_00384

< emea-fr-test_00385

< emea-fr-test_00386

< emea-fr-test_00387

< emea-fr-test_00388

< emea-fr-test_00389

< emea-fr-test_00390

< emea-fr-test_00391

< emea-fr-test_00392

< emea-fr-test_00393

< emea-fr-test_00394

< emea-fr-test_00395

< emea-fr-test_00396

< emea-fr-test_00397

< emea-fr-test_00398

< emea-fr-test_00399

< emea-fr-test_00400

The corpus contains French sentences, from Europarl, Est Republicain newspaper, French Wikipedia and European Medicine Agency.

The corpus is freely available under the free licence LGPL-LR (Lesser General Public License For Linguistic Resources) cf. http://infolingu.univ-mlv.fr/DonneesLinguistiques/Lexiques-Grammaires/lgpllr.html

If you use it, please cite the following paper:

Candito M. and Seddah D., 2012 : "Le corpus Sequoia : annotation syntaxique et exploitation pour l’adaptation d’analyseur par pont lexical", Actes de TALN'2012, Grenoble, France

The total number of sentences is around 3200, and each sentence is annotated for part-of-speech and phrase-structure, following the French Treebank guidelines.

The constituency trees were then automatically converted to dependency trees.

Manual corrections were performed on the version 3.1 to obtain versions 3.2 and further 3.3. The corrected errors were spotted by Bruno Guillaume, using the grew tool, on the automatically converted dependency trees. ( http://wikilligramme.loria.fr/doku.php?id=sequoia )

Versions 4.0 and above contain manual annotations of long-distance dependencies, as described in :

Candito M. and Seddah D., "Effectively long-distance dependencies in French : annotation and parsing evaluation", Proceedings of TLT'11, 2012, Lisbon, Portugal)

Version history

  • V4.0 october 2012
    • Manual annotation of long distance dependencies.
      • In the conll files, columns 7 and 8 give the head and dependency label that take into account non-local phenomena, leading to sometimes non projective trees. Columns 9 and 10 contain the head and dependency label obtained via automatic conversion, and are necessarily projective.
  • V3.3 july 2012
    • Manual corrections of errors that were spotted by Bruno Guillaume using grew
      • 60 corrected errors in the phrase-structure trees
      • 13 errors coming from the automatic conversion to dependencies (still to correct)
      • 45 false errors (no correction to make)
    • Addition of the *xml files : phrase structure trees in the XML format of the FTB
  • V3.2 a few manual corrections
  • V3.1 addition of the *+fct.mrg_strict files, strictly compliant to the FTB annotation scheme
  • V3.0 june 2012 : First public release

Content

  • Files *+fct.mrg_strict are files with one sentence per line,

with phrase-structure in bracketed format, in which the phrase-structures are strictly compliant to the French Treebank annotation scheme.

  • Files *.xml are the XML FrenchTreebank format for the *+fct.mrg_strict files
  • Files *+fct.mrg are the same sentences, still in bracketed format, but with phrase-structures compliant to the ftb-uc treebank instanciation :
    • prepositions that dominate a infinitival VP do project a PP
    • any sentence introduced by a complementizer (CS tag) is grouped into a Sint constituent
  • Files *conll contain the dependency trees, obtained through automatic conversion, in CoNLL 2006 tabulated format. From version 4.0, columns 7 and 8 give the head and dependency label that take into account manual annotation of LDDs, that override the automatic conversion, leading to sometimes non projective trees. Columns 9 and 10 contain the head and dependency label obtained via automatic conversion, and are necessarily projective.

Number of sentences for each sub-domain :

561 sentences Europarl file= Europar.550+fct.mrg

529 sentences EstRepublicain file= annodis.er+fct.mrg

996 sentences French Wikipedia file= frwiki_50.1000+fct.mrg

574 sentences EMEA (dev) file= emea-fr-dev+fct.mrg

544 sentences EMEA (test) file= emea-fr-test+fct.mrg

Data split (TALN 2012 experiments)

The "neutral" domain is made of EstRepublicain + Europarl + FrWiki, and the data split into dev and test corpus is the following :

head -265 annodis.er+fct.mrg >> sequoia-neutre-dev+fct.mrg

head -280 Europar.550+fct.mrg >> sequoia-neutre-dev+fct.mrg

head -498 frwiki_50.1000+fct.mrg >> sequoia-neutre-dev+fct.mrg

tail -264 annodis.er+fct.mrg >> sequoia-neutre-test+fct.mrg

tail -281 Europar.550+fct.mrg >> sequoia-neutre-test+fct.mrg

tail -498 frwiki_50.1000+fct.mrg >> sequoia-neutre-test+fct.mrg

Contact

,


LANGUE/LANGUAGE

Calendar

Mo Tu We Th Fr Sa Su
30 31 01 02 03 04 05
06 07 08 09 10 11 12
13 14 15 16 17 18 19
20 21 22 23 24 25 26
27 28 29 30 31 01 02