Searching for the keyword

Back in July of last year I came upon a fascinating paper on quantitative linguistics which looks at keywords in text and how they convey content. At the time I struggled to find a context in which to report and comment on the study by Dresden and Bologna physicists Eduardo Altmann, Giampaolo Cristadoro and Mirko Degli Esposti, and so put the paper to one side, only to rediscover it recently when tidying up my to-do list.

Keywords are of interest to online writers looking to elevate their work in search engine rankings, but going by the online discussion of the subject it would seem that relatively few give it serious thought. Some even continue under the mistaken impression that keyword meta-tagging influences the Google algorithm. It doesn’t, and focussing on such a simplistic techno-subeditorial approach to keyword use is a waste of time.

What is particularly interesting about the Altmann et al. study is how it shows the importance of placement over frequency of keywords. By looking at long-range statistical correlations the researchers were able to identify relationships between distant sections of text, in the sense that those sections preferentially use the same words and letters. What you have is a series of structured linguistic levels ranging from the text as a whole to its fundamental building blocks.

The computational method employed by Altmann and his colleagues involves translating texts into binary codes, replacing vowels with a 1 and consonants with a 0. In this way one can work through a text bit by bit, identifying structures and repetitive patterns. This is done by first mapping the symbolic sequence to a numerical one and then computing correlation functions on the result.
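To make the vowel and consonant mapping concrete, here is a rough Python sketch. It is my own illustration and not the authors' code: the function names to_binary and autocorrelation are hypothetical, the vowel set assumes unaccented English letters, and a plain sample autocovariance stands in for whatever correlation functions the paper actually uses.

```python
def to_binary(text, vowels="aeiou"):
    """Map each letter to 1 (vowel) or 0 (consonant); drop non-letters.

    Assumption: any letter not in `vowels` counts as a consonant,
    which is only sensible for unaccented English text.
    """
    return [1 if ch in vowels else 0 for ch in text.lower() if ch.isalpha()]


def autocorrelation(bits, lag):
    """Sample autocovariance of a 0/1 sequence at the given lag.

    Positive values mean symbols `lag` positions apart tend to agree,
    negative values mean they tend to differ.
    """
    n = len(bits) - lag
    if lag < 0 or n <= 0:
        raise ValueError("lag must be non-negative and shorter than the sequence")
    mean = sum(bits) / len(bits)
    return sum((bits[i] - mean) * (bits[i + lag] - mean) for i in range(n)) / n


if __name__ == "__main__":
    print(to_binary("War and Peace"))  # [0, 1, 0, 1, 0, 0, 0, 1, 1, 0, 1]

    sample = to_binary("Well, Prince, so Genoa and Lucca are now just family estates")
    for lag in (1, 2, 5, 10):
        print(lag, autocorrelation(sample, lag))
```

On a short sample like this the numbers are mostly noise. The long-range effect the researchers describe only becomes visible when the same calculation is run over a full-length book and the correlation is followed out to very large lags.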

Taking Tolstoy’s War and Peace as an example, the researchers identified repeating patterns in the text. These are the long-range correlations referred to above. With a numerical analysis based on the mathematical framework of information theory one can determine the connection between any two letters, words or blocks of words located at arbitrarily distant points in a text. And the language in which the text is written is immaterial.

Altmann et al. also studied the ‘burstiness’ of words in a text, that is, the tendency of particular words to occur in clusters rather than being spread evenly through the text. What they found is that the more frequently a certain word is used within a single passage of writing, the more likely the word is to be representative of a certain subject. One key point to note is that certain words which crop up repeatedly throughout a text may not occur in bursts within any given passage. Such repeated words will exhibit long-range correlation, but they are not closely related to the subject of the text.
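One way to put a number on burstiness is to look at the gaps between successive occurrences of a word. The sketch below uses the burstiness parameter B = (sigma - mu) / (sigma + mu) proposed by Goh and Barabási, where mu and sigma are the mean and standard deviation of those gaps. To be clear, this is a measure I have swapped in for illustration, not necessarily the one Altmann et al. use, and the function names word_positions and burstiness and the simple tokeniser are my own assumptions.

```python
import re
from statistics import mean, pstdev


def word_positions(text, word):
    """Return the token indices at which `word` occurs (case-insensitive).

    Assumption: a crude tokeniser that keeps runs of a-z and apostrophes.
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    return [i for i, tok in enumerate(tokens) if tok == word.lower()]


def burstiness(positions):
    """Goh-Barabasi burstiness of the gaps between sorted positions.

    Returns -1 for perfectly regular spacing, values near 0 for
    random (Poisson-like) spacing, and values approaching +1 for
    strongly clustered occurrences.
    """
    gaps = [b - a for a, b in zip(positions, positions[1:])]
    if len(gaps) < 2:
        raise ValueError("need at least three occurrences")
    mu = mean(gaps)
    sigma = pstdev(gaps)
    return (sigma - mu) / (sigma + mu)


if __name__ == "__main__":
    evenly_spread = [0, 10, 20, 30, 40]              # a function word, say
    clustered = [0, 1, 2, 3, 100, 101, 102, 103]     # a topical keyword, say
    print(burstiness(evenly_spread))
    print(burstiness(clustered))
```

The evenly spread word scores at the bottom of the scale while the clustered one scores above zero, which matches the intuition in the paragraph above: a word can be common across the whole text without ever arriving in bursts.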

What does this mean in practical terms? Long-range textual correlations and burstiness could be used to develop more sophisticated internet search algorithms that save users from having to wade through pages and pages of irrelevant material, and the statistical analysis could be incorporated into software designed to detect plagiarism.

One thing I am sure of is that Altmann et al.’s findings will not help to improve my writing.