MÉLUSINE

ON MEASURE ABOVE ALL ELSE


"On measure above all else", preface to Étienne Brunet, Author accounts I, statistical studies, from Rabelais to Gracq. Paris, Honoré Champion, 2009, pp. 8-17.

Étienne Brunet is a linguist well known for his statistical work on the great texts of French literature (see his Wikipedia entry). We met at the meetings that Bernard Quemada, director of INaLF (National Institute of the French Language), organized in Nancy. Étienne directed a center for lexical studies in Nice, while I represented the researchers of my CNRS laboratory in Meudon. When he told me about the book he was thinking of calling "Author accounts", an apt and colloquial pun, I could not refuse him the following preface, in which I sought above all to make better known the immense work of this tireless finder. It must be said that we shared the same taste for wordplay, since in 1991 I had given the title Comptes A rebours to a collective volume on Huysmans' work. Need I add that these books never brought us a penny?

Étienne Brunet, Selected Writings Volume 1: Author accounts. Statistical studies. From Rabelais to Gracq. Texts edited by Damon Mayaffre, preface by Henri Béhar, Champion, Paris, Digital Letters Collection, 2009.

Notice: By gathering sixteen studies and an associated DVD, this book by Étienne Brunet participates in the renewal of literary research. The great authors of three centuries of French literature, from Du Bellay to Gracq, from Rabelais to Le Clézio, are reread by articulating traditional poetics and emerging digital hermeneutics. The author combines qualitative reading and computer-assisted quantitative reading. He puts the working hypothesis to the test of the machine and supports the researcher's intuition with textual statistics. Far from being diminished, the pleasure of the text is enlivened by it. If the emphasis is often placed on the lexical and thematic dimension of the works, the reader will also find the study of parts of speech in Balzac, syntactic sequences in Flaubert, verb tenses in Zola, the sentence in Proust, rhymes in Hugo, Verlaine or Rimbaud, etc. Above all, and for the first time, he will find associated in a complementary manner in this volume the final conclusions of thirty-five years of literary analysis and the research tool that made them possible. On the DVD, consulting the databases processed by the HYPERBASE software allows both hypertextual reading of a considerable mass of texts (Rabelais, Corneille, Racine, Rousseau, Hugo, Sand, Giraudoux, Colette, France, etc.) and the implementation, in new analyses, of the processing methodology presented. Étienne Brunet is professor emeritus at the University of Nice and founder of the Bases, Corpus and Language laboratory. He is a specialist in computer science and statistics applied to text studies, and the author of the Hyperbase software. He has published works on Hugo, Zola, Proust, Giraudoux, and on French vocabulary from 1789 to the present day.


Text of my preface:

On measure above all else

I have always been struck by the divide (of which I am myself a victim) between knowledge of mathematics and knowledge of literature, between what Pascal called, in the vocabulary of his time, the spirit of geometry and the spirit of finesse, or, to put it more plainly, between numbers and letters. The oldest, most popular, and most watched television game show is called "Numbers and Letters". Combining two rounds, "the longest word" and "the count is good", it is reputed to rest on the contestants' skill at calculation and knowledge of vocabulary. When I was in charge of the university's audio-visual and computer science department, I found the corridor leading to the technical rooms too austere and had it brightened up with a television set that was left on permanently. A single glance was enough to see that this program drew the largest number of spontaneous viewers. I suppose they combined skills in calculation and vocabulary without any difficulty. Paradoxically, attendance was not at all the same in the classrooms where courses in lexical statistics were advertised!

Why does something that presents itself as a game, and that requires skills in domains as distant from each other as mathematics and vocabulary, become off-putting, incomprehensible, obscure, as soon as it is treated seriously? Why wouldn't the same student who is captivated by the program take up the questions that Étienne Brunet tirelessly puts to literature, and I would even say to great literature?

* * *

Indeed, refusing the divide described above, he has specialized in the numerical study of large bodies of literary texts. And here again, I never cease to be amazed by the general behavior of our contemporaries, who take with the greatest seriousness the daily polls that the press lavishes on them and nod learnedly over the correspondence analyses (AFC) printed in magazines, but affect not to know the number of words in In Search of Lost Time, the average length of its sentences, and so on, as if this were a sacrilege, an act of lèse-majesté against the eminent dignity of Letters!

After his thesis on the structure and evolution of Jean Giraudoux's vocabulary (1), Étienne Brunet devoted his days and nights, so to speak, to bringing out the value of the texts stored on the computers of the former Treasury of the French Language (which later became the National Institute of the French Language and is now the National Center for Textual and Lexical Resources). First came the whole of French vocabulary since the Revolution (2), then those great massifs that are the works of Marcel Proust, Émile Zola, and Victor Hugo, always based on the data stored in Nancy, which he carefully completed and revised (3). Along the way he also contributed, through indexes and concordances, to the building of the Rousseau corpus (4).

To all this have been added, since 1995, digital publications, chiefly in the form of the CD-ROM, chosen for its storage capacity, its ease of handling, and the everlasting lifespan then predicted for it. The reader would do well to become familiar with his Rabelais, his Balzac (also available on the Internet), his Rimbaud, his Pascal, his Proust, in short with the endless output that he produces in his own laboratory in Nice and generously entrusts to publishers clear-sighted enough to distribute it (5). I mention only for the record the first work of this kind, conceived as a prototype of what a study aid for a literary work, prepared under the eye of its author, might have become: the Julien Gracq CD-ROM, withdrawn from sale at the last moment, and likewise the Paul Éluard, to which I shall have to return.

Besides the documentary facilities that such works provide (concordances, contexts, word lists, etc.), they make possible a large number of studies, to which Étienne Brunet devotes himself in the articles gathered here. In summary, these deal with the statistics of large corpora: their structure, their internal evolution, their major tendencies, words that are gaining or losing ground, lexical richness and variety, and specificities that are either internal to the work or external (relative to a period, a given genre, etc.). I will not presume to discuss here the complex and very subjective notion of "lexical richness", so much debated elsewhere. Suffice it to say that Étienne Brunet devised his W index for this purpose and set it out in detail in his thesis on Giraudoux.
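To give a rough idea of what an index of this kind measures, here is a minimal sketch. It assumes the form in which W is usually quoted in the stylometry literature, W = N^(V^-a), where N is the number of word occurrences, V the number of distinct forms, and a a constant commonly set to 0.172. The formula and the constant come from that secondary literature, not from this preface, and the tokenizer and function names are purely illustrative.

```python
import re

def tokenize(text: str) -> list[str]:
    """Naive split into lowercase graphic forms (runs of letters only)."""
    return re.findall(r"[^\W\d_]+", text.lower())

def brunet_w(tokens: list[str], a: float = 0.172) -> float:
    """W = N ** (V ** -a), as commonly quoted; a = 0.172 is an assumed default.

    N is the number of tokens, V the number of distinct forms.
    """
    n = len(tokens)
    v = len(set(tokens))
    if n == 0:
        raise ValueError("empty text")
    return n ** (v ** -a)

# Two samples of equal length: the more varied one gets the lower W.
varied = tokenize("longtemps je me suis couché de bonne heure")
repetitive = tokenize("je me je me je me je me")
print(brunet_w(varied), brunet_w(repetitive))
```

Under this formulation a lower W corresponds to a more varied vocabulary. Comparisons only make sense between texts tokenized in the same way.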

Thanks to him, no one need remain unaware of the essential characteristics of the vocabulary in the works of our great writers and even, more generally, in French texts from the 16th century to the present day.

* * *

In the early days, given this fracture built into our minds, the rare literary scholars who ventured into the statistical study of vocabulary had to go through the engineer or the technician, politely formulating a request that had to be free of any gap. "Oh! Venerable grand master of the Machine, could you, in your unparalleled magnanimity, procure for me a printout of all the novel endings recorded in your august machines?" "Yes, my boy, but it will take time." "No problem, so long as I can draw on it for unparalleled studies!"

In the end, after a wait of a few months, I received at home (the reader will have understood that I am speaking only of my own experience) an enormous parcel of computer printouts that gave me, quite simply, the texts in question in reverse order! My question, no doubt poorly formulated and so misunderstood, had produced a monster.

More astute than I, and more courageous, Étienne Brunet, a fine literary scholar as everyone knows, first became a statistician, then an analyst-programmer, undeniably won over by the Pascal language! For everything about him speaks to us of Pascal, the author of the Provinciales but also the inventor of probability theory and of the calculating machine known as the Pascaline!

He is himself both witness to and actor in the evolution, or rather the revolution, that took us in a few years from computing to microcomputing, and from the isolated machine to the worldwide network. After working on the big machines of Sophia-Antipolis, he managed, without scaling back any of his ambitions as a researcher, to adapt to the personal computer, having first mastered the transfer of data across the successive networks that preceded the universal reign of the Internet.

At the same time, the programs dedicated to the study of vocabulary have evolved, moving from the examination of graphic forms to that of the parts of speech. The complaint is a constant one: machines can only "read" or spot a string of characters, a sequence of letters delimited by punctuation or white space. Even within this limit, it is possible to examine an author's lexicon and, through the study of co-occurrences, to move on to what Brunet calls "themes" or lexical fields (I would rather say key concepts): Colette's bestiary, colors in Rimbaud, time in Proust, religious vocabulary... Hence the ambition to characterize each author's "vision of the world" (others say "lexical universe") by the statistical approach alone.
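The passage from graphic forms to co-occurrences can be sketched in a few lines. This is only an illustration under stated assumptions: the function names, the fixed window of neighboring words, and the raw counts are hypothetical choices, and nothing here reproduces the method actually used in Brunet's software.

```python
import re
from collections import Counter

def graphic_forms(text: str) -> list[str]:
    """Sequences of letters delimited by punctuation or white space."""
    return re.findall(r"[^\W\d_]+", text.lower())

def cooccurrences(tokens: list[str], pole: str, window: int = 5) -> Counter:
    """Count the forms found within `window` words of each occurrence of `pole`."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok != pole:
            continue
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                counts[tokens[j]] += 1
    return counts

tokens = graphic_forms("Le temps perdu, le temps retrouvé.")
print(cooccurrences(tokens, "temps", window=1).most_common())
```

Raw counts like these favor frequent grammatical words. A real study would weigh them against each word's overall frequency before speaking of a theme.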

But, fortunately, other programs (CORDIAL, FRANTEXT) have become reliable enough to tag grammatical categories objectively and to allow, when the study requires it, a lemmatization of the vocabulary, in other words its reduction to dictionary form. I should say that for the examination of a literary text, and of poetry in particular, lemmatization has always seemed to me to miss its object, as if one wanted to describe a suburban house by Le Corbusier's standards. But I can see that when one is interested in large bodies of text, one feels the need to define them in their broadest lines, through standardized categories. Likewise, one wants to go as far as studying the syntactic structure of texts. And if the software for this is not quite up to the task, one can get around the difficulty with a shortcut of which Étienne Brunet is the master: grammatical words signalling subordination, for example, or punctuation as a guide to rhythm and sentence length...
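The kind of shortcut described here, approaching syntax through surface clues, might look like the following sketch. The list of subordinating words is a short illustrative sample chosen for this example, not a list taken from Brunet, and the sentence splitter is deliberately crude.

```python
import re

# Illustrative sample only; "qu" catches the elided form in "qu'il".
SUBORDINATORS = {"que", "qu", "qui", "dont", "où", "si",
                 "quand", "lorsque", "puisque", "comme"}

def split_sentences(text: str) -> list[list[str]]:
    """Split on strong punctuation, then into lowercase word forms."""
    parts = re.split(r"[.!?…]+", text)
    sentences = [re.findall(r"[^\W\d_]+", p.lower()) for p in parts]
    return [s for s in sentences if s]  # drop fragments with no words

def syntax_proxies(text: str) -> dict[str, float]:
    """Mean sentence length and subordinators per 100 words."""
    sentences = split_sentences(text)
    n_words = sum(len(s) for s in sentences)
    if n_words == 0:
        raise ValueError("no words found")
    n_sub = sum(1 for s in sentences for w in s if w in SUBORDINATORS)
    return {
        "sentences": len(sentences),
        "mean_sentence_length": n_words / len(sentences),
        "subordinators_per_100_words": 100 * n_sub / n_words,
    }

print(syntax_proxies("Il dit que tout va bien. Je pars quand il vient!"))
```

Such figures are only indirect evidence: "si" and "comme", for instance, are not always subordinating, which is why the results call for a reader who knows the text.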

Besides these far-reaching monumental works, the master of lexical statistics has made a point of producing a number of tools, which he has made available to the research community and the interested public. Their full titles indicate clearly enough their purpose and usefulness:

  • CD-ROM THIEF (Tools for Helping Interrogation and Exploitation of Frantext), statistical base for Mac and PC, online and offline (12 chronological slices, from 1500 to 1990, 117 million words), INaLF (Nice), 1996.
  • CD-ROM BALZAC (prototype), in collaboration with Professor Kiriu (Tokyo), Mac and PC version, 1996.
  • BALZAC ON THE INTERNET, in collaboration with Professor Kiriu (Tokyo). Concordances and contexts of La Comédie humaine (address: http://lolita.unice.fr), 1996.
  • CD-ROM FRANCIL, text database of French across the French-speaking world (oral, press and literature), for Mac and PC, 76 texts, 4.5 million words, AUPELF-UREF, INaLF (Nice), 1998.
  • CD-ROM BATELIER (Base de Textes Littéraires pour l'Enseignement et la Recherche), co-published by MEN, INaLF and Champion, 1998 (Mac and PC).
  • Statistical base ÉCRIVAINS (70 authors from the 17th to the 20th century, 55 million words). Digital data extracted from Frantext, on dual-format CD-ROM for Mac and PC, INaLF (Nice), 1999.

Besides various text databases, not sold commercially, on La Fontaine, George Sand, Nerval, Baudelaire, Maupassant, Jules Verne, and Saint-John Perse, his absolute masterpiece in my eyes is undeniably the HYPERBASE software, which more or less sums up all his approaches, with its many documentary and statistical functions. Intended for a wide public, it should be part of the toolkit of every student of literature and the human sciences, since it can process historical or journalistic corpora as readily as literary ones. The instructional booklet that accompanies it is clear and explicit enough that there is no need to go through its chapters here. To sum up, and in a spirit consistent with what literary research ought to be, I would say that all the studies gathered here can be carried out and verified (in the scientific sense of the term) by the reader, provided he has the digitized text.

I would add, and this is not the least merit of Étienne Brunet's work, that he knows how to give a very clear account of it, with that touch of Giralducian humor said to be characteristic of Normaliens. Thus, from the outset: "The computer knows no modesty", or again, on Victor Hugo's rhymes, asserting after Valéry, and not without having verified it himself on the corpus studied, that the second is inspired and the first obtained through research and perspiration. The portrait he draws of his fellow student Gilbert Cesbron, and the rare evocation of his own youth (in connection with Julien Gracq), make me think that there is in him a repressed novelist, or rather a narrator who shelters behind mathematical laws so as not to let his sensitive personality show through.

* * *

This volume gathers articles that, taken together, can serve as an introduction to the use of the tools and works mentioned above. Beyond that, it bears witness to the evolution of the sciences and techniques bearing on that singular branch of literary studies known as lexical statistics, lexicometry, or even textometry. The article on Giraudoux offers the clearest illustration of this. Likewise, with all due respect for the pioneer that Guiraud was, Brunet shows discreetly but firmly what the machine allows us to write today about Rimbaud, which could only be conjecture or approximation in the days of the word counter.

Like his Renaissance predecessors, the universal man he has become has run up, in the course of his work, against the problem, alas unavoidable, of phynance (to spell it as Père Ubu does), in its noblest guise of course, which goes by the name of "copyright" and which, thanks to the hold of Walt Disney's heirs, now extends to 70 years after the writer's death. Thus his productions on Gracq and Éluard, among others, will become available in 2077 and 2022 respectively, unless the legislator makes an exception for teaching and research! Another solution, the simplest but not the most realistic, would be for the software publisher to agree to pay the rights holders the share that is due to them. People were surprised that I was able to produce the CD-ROM of the journal Europe (which Étienne Brunet discusses here), and often asked me pointedly how I had managed to get around this delicate copyright problem (instead of appreciating the countless services that such a work can render to our culture). There is no circumvention of the law in this, and not even a secret: it is the owner of the title (in this case Les Amis d'Europe, a society that I chair) that produced a digital edition of its own output, thereby fostering indispensable work such as that of our specialist in large corpora.

The fact that he has been interested, throughout his research career, in large bodies of text allows him to put forward (always with Père Goriot's prudence) a few laws that promise to have a bright future. On the one hand, there is the specificity of each author and each work, measured against other writers of the same period. Thus Flaubert moves from the verb to the noun; conversely, in Proust the noun gives way to the verb, under the effect of moral preoccupations. Zola's vocabulary is less rich than that of Proust or Giraudoux, but it is more concrete.

On the other hand, it seems that all the corpora studied obey the law of the dominance of literary genres (cf. the article on Flaubert). That is to say, whatever the corpus examined, it is governed by the constraints of the literary genre to which it belongs. But, it will be objected, this law may hold for classical texts while having no relevance for avant-garde works, which set themselves against the traditional typology of genres! I can testify, from my own observations, that the objection does not hold. At a certain level, any literary work ends up falling within one of the great genres that structure literary discourse (prose, theater, poetry, correspondence, etc.). To this is added a law that remains mysterious (insofar as it has not been universally validated), that of the aging of the artist, which shows itself in the progression of verbs and the regression of nouns over time.

* * *

After high initial hopes, the most common reaction I have observed to these works of lexical statistics is disappointment. We want everything from the machine, and at once. When, after an ever shorter computing time, it delivers, for example, the specificities of each chapter of a novel, the reader exclaims: "but I already knew that!", or again: "one need only read the text in question to know that it is mainly about... and about...". This first reaction is quite justified. However, I would point out to my interlocutor that his knowledge was only intuitive and subjective. It still remained to prove this particular feature of the chapter examined, and to establish it scientifically, beyond dispute!

Let it not be said that the researcher teaches us nothing, or that the mountain has given birth to a mouse! He always proves something to us. At last the science of literature to which we all aspire (alongside criticism, whatever its allegiance) can rely on tangible, indisputable, and verifiable facts.

So much so that Étienne Brunet often amuses himself by testing "common sense", that is to say the received opinions about this or that author, this or that text. "At last Malherbe came" to purify the language: we shall see in the following pages in what way this happened. But Victor Hugo did not put the red cap on his dictionary! Conversely, we mass-produce "reading illusions" on every subject. Our researcher comes along to set things straight, quietly and with certainty.

A Renaissance scholar in his ability to reconcile the mathematical and the literary approaches to texts, Étienne Brunet has the same concern to share his discoveries with the whole of humankind. I know of no researcher more generous or less concerned with personal glory. To these qualities inherent in the scholar we must add what the machine could not tell us: the quality of reading, the depth, and the finesse that he brings to the examination of works which, we can be sure, he knows better than anyone.

Complement:

Measures and excess in French letters in the 20th century: homage to Henri Béhar, professor at the New Sorbonne: theater, surrealism and avant-gardes, literary computer science. Paris: H. Champion, 2007. 525 p.; 23 x 16 cm. An exploration of the audacities and literary experiments of the 20th century according to several axes linked to the works and centers of interest of H. Béhar, professor emeritus at the New Sorbonne: the relations of theater and reality, surrealism and avant-gardes, the Platonic myth of the androgyne, literary computer science in its links with publishing, research, teaching.


  1. The Vocabulary of Giraudoux. Structure and Evolution. Éditions Slatkine, Geneva, 1978, 688 p.
  2. French Vocabulary from 1789 to the Present Day, according to the data of the Treasury of the French Language. Preface by Paul Imbs. 1982. TLQ 17. 3 vol. 1836 p.
  3. Proust's Vocabulary, with the complete and synoptic Index of In Search of Lost Time, Slatkine-Champion, 1918 p., 3 vol., 1983 (preface by J.Y. Tadié); Zola's Vocabulary, followed by the complete and synoptic Index of the Rougon-Macquart, 3 volumes, 472 p., 646 p., 357 p. and 5500 p. on standardized microfiches, Slatkine-Champion, 1985 (preface by H. Mitterand); Victor Hugo's vocabulary, 1988, Slatkine-Champion editions. Vol.1, 484 p., vol. 2, 637 p., vol. 3, 556 p., +27 standardized microfiches containing the synoptic index of Hugo's works (6878 p.).
  4. Index of Émile, XLIII-LIII, 583 p. Slatkine, 1980; Concordance of Émile, XV, 720 p. 1980; Index of Letters written from the mountain, 344 p., 1983; Index of Considerations on the government of Poland and Index-concordance of the Constitution project for Corsica, 288 p., 1986 (in collaboration with Léo Launay); Index of J.-J. Rousseau's theatrical and lyrical work (for Le Devin de village, p. 375-390, in collaboration with A. and G. Fauconnier), 1986.
  5. I mention here only the discs available commercially: CD-ROM RABELAIS, in collaboration with Marie-Luce Demonet, Mac and PC version (with the help of the National Library of France, the Municipal Library of Lyon, and the National Book Center), éditions Les Temps qui courent, Paris, 1995; CD-ROM RIMBAUD, Éditions Champion, Paris, 1999 (Mac and PC); CD-ROM PROUST, Éditions Champion, Paris, 1999 (Mac and PC); CD-ROM PASCAL, Éditions Champion, Paris, 1999 (Mac and PC); CD-ROM RABELAIS, Éditions Champion, Paris, 1999 (Mac and PC).
