Errata, and a lesson of caution for “culturomics”

In the previous post I shared my enthusiasm about possible applications of digital, quantitative tools for studying historical data. I focused on how Google’s N-gram Viewer may lead to interesting findings in the field of esotericism, both for gauging trends in the popularity or spread of a certain term (such as “esotericism” itself) in various languages, and for researching philological matters. In the latter case, one of the preliminary results looked much more like a breakthrough at first sight than it has turned out to be. Instead, we now have an illustration of a serious difficulty with this kind of research, one which emphasises the necessity of the critically minded, suspicious human scholar in the middle of all the digital tools.

First, the error. In the original post (which will soon be updated and corrected to avoid misleading anyone), the following was reported about the search for “ésotérisme” in French:

In the French, it turns out that Jacques Matter is predated by two other references to ésotérisme. Both are (accidentally, it would seem) from 1811: the second volume of Pierre Leroux’s De l’humanité, de son principe et de son avenir, and volume 9 of Henri Martin’s Histoire de France. Both references use esotericism dismissively about features of religion the authors don’t like: the esotericism of the Essenes and Pharisees in the case of Leroux, and that of the Papacy in the case of Martin (although the latter is more ambiguous, distinguishing between the esotericism of the “ancient Orient” and the “negative esotericism” of the “sceptical philosophers”).

This remark appears to have been much too hasty. In fact, a check of the biographical data of the two authors should already have shown that neither Leroux nor Martin could have published anything in 1811: Leroux was 14 years old, and Martin merely a little boy of 1.
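A plausibility check of this kind is easy to automate. The following is only a minimal sketch: the `Record` class, the `is_suspicious` helper and the minimum-age threshold are my own assumptions for illustration and are not part of any Google tool. The birth years are simply derived from the ages given above.

```python
from dataclasses import dataclass


@dataclass
class Record:
    """A hypothetical catalogue entry: author, birth year, catalogued date."""
    author: str
    birth_year: int
    catalogued_year: int


# Assumption: an author younger than this is unlikely to have published
# a multi-volume work. The threshold is arbitrary and can be adjusted.
MIN_PLAUSIBLE_AGE = 16


def is_suspicious(record: Record, min_age: int = MIN_PLAUSIBLE_AGE) -> bool:
    """Flag records whose catalogued year implies an implausibly young author."""
    return record.catalogued_year - record.birth_year < min_age


records = [
    Record("Pierre Leroux", 1797, 1811),  # 14 years old in 1811
    Record("Henri Martin", 1810, 1811),   # 1 year old in 1811
]

flagged = [r.author for r in records if is_suspicious(r)]
print(flagged)
```

A check like this only catches dates that are impossibly early. A wrong date that still falls within the author's adult life would pass unnoticed, so it cannot replace looking at the front matter of the scanned book.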

What appears to have happened is that Google’s metadata are utterly wrong in this case. I had already become aware of problems with the metadata, especially a bad tendency to date later editions of a work to the year in which the first edition was printed, which causes all sorts of chronological problems. One is that the later editions distort the frequency numbers of certain n-grams, both for the year in which the original was published and for the year in which the later edition actually appeared. Another is that words from the critical apparatus of modern editions of older prints (including modern introductions, notes, and references) are in fact transposed back to the year of the original’s publication, creating the potential illusion that a certain scholarly neologism actually appeared centuries ago.
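The effect of such misdating on the counts can be illustrated with a toy example. Everything in it is invented for the purpose: the years, the hit counts and the `hits_by_year` helper are hypothetical and do not come from the Google data set.

```python
from collections import Counter

# Hypothetical volumes: (catalogued_year, actual_year, hits for some term).
# The second entry stands for a modern critical edition whose notes use the
# term, but which is catalogued under the year of the first edition.
volumes = [
    (1700, 1700, 0),  # original print: the term does not occur
    (1700, 1950, 5),  # modern edition, misdated to 1700
    (1950, 1950, 2),  # unrelated modern work, dated correctly
]


def hits_by_year(volumes, use_actual_year):
    """Sum hits per year, using either the actual or the catalogued year."""
    counts = Counter()
    for catalogued_year, actual_year, hits in volumes:
        year = actual_year if use_actual_year else catalogued_year
        counts[year] += hits
    return dict(counts)


as_catalogued = hits_by_year(volumes, use_actual_year=False)
as_published = hits_by_year(volumes, use_actual_year=True)
print(as_catalogued)
print(as_published)
```

In this toy case the catalogued dates place five occurrences in 1700, where the term never actually appeared in print, and at the same time undercount 1950. Both distortions described above come from the same single metadata error.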

But this was not the problem for the two works in question, for these books have nothing to do with the year 1811 in the first place. Using Google’s own scanned copies and checking the front matter, we see clearly that De l’humanité appeared in 1840, while Henri Martin’s Histoire de France was published in 1844.

Why then are both listed as published in 1811? For Histoire de France, the reason seems to be relatively simple. The publication year is typeset with a set of numerals in which a 4 looks very much like a 1. If the cataloguing was done entirely by scan, the date seems to be a product of the imperfect OCR technology used in the process. (Although it is also possible that a human worker made the same error of perception.) When it comes to Leroux’s book, it is less clear what happened, since Roman numerals were used for the frontispiece. Here it seems the problem must have been human in the first place, arising either in the metadata of the library in which it was scanned, or during the construction of the data set.

Whatever the cause, it’s wrong. This means two things: that Jacques Matter’s Histoire critique du gnosticisme still contains the earliest known occurrence of the word ésotérisme in the French language, and that the search tool of Google Books / N-Gram Viewer still has some critical faults which make it difficult to use for this kind of research without spending a lot of time double-checking and actively criticizing the data it gives you.

For the sake of finding the earliest reference, this shouldn’t be too much of a problem, since you can actually double-check it yourself when the whole book is scanned (thus making the process a little more similar to the old-fashioned way, just without having to deal with librarians). It is somewhat more troubling that errors of this kind (I’ve already seen a few too many of them in too short a time) point to a large source of artifacts in the quantitative data generated from the data set. With metadata this inaccurate, it soon becomes unclear what the graphs really measure.

This being said, we shouldn’t forget that the research team at Harvard which created the data set were the first to warn that the metadata were unsatisfactory for the French and German corpora, as well as for all corpora before 1800. In other words: at present it really only makes sense to look seriously at English works from 1800 to today. My cursory probings of esotericism/ésotérisme/Esoterik seem to confirm this. Some problems with the German search were already mentioned in the first post, and it is also instructive that, in the end, it is primarily for the English sources that any real progress has come about in the type of philological research I attempted.

(Thanks to Jean-Pierre Laurant and Marco Pasi for spotting the errors and notifying me quickly.)

 

Creative Commons License
This work by Egil Asprem was first published on Heterodoxology. It is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.

