Here be data

Data is king. In linguistics as well as in any other discipline, no serious claim can be made without solid data. And surely the revolution of Big Data hails a new dawn for linguistics research too: the computer-assisted ability to compile humongous corpora, crunch inordinately vast amounts of data and watch previously unseen patterns emerge has some magical appeal.

Yet, big or not, data and its mathematical exploitation aren’t everything. I was reminded of this by two recent developments in language study.

How will English mark future time reference?

The first was a follow-up study to Keith Chen’s 2013 research purporting to show that the way a language marks the future affects the way its speakers prepare (or not) for the future. Specifically, if a language systematically marks Future Time Reference (FTR), its speakers are less likely to display future-oriented behaviour (such as saving for their retirement) than if their language uses present-tense forms to refer to the future. The rationale is that someone who uses the latter type of language sees the future as more continuous with the present.

Chen’s findings are highly seductive: if true, they would constitute hard evidence for the Sapir-Whorf hypothesis that the language you speak influences your outlook on the world — and in this case, your behaviour also.

Chen’s research has attracted a lot of criticism, most of which focuses either on its methodology or on the way the hypothesis is formulated (G.K. Pullum, for example, argues that it could very well be laid out the other way round).

Strangely enough, criticism of the linguistic aspects of his work has been sparse. My first impulse was to question the choice of weather forecasts as a corpus: of course weather forecasts deal almost exclusively with the future, but how can they be deemed representative of how a whole linguistic community refers to future time? Not only do they account for only a tiny fragment of that community’s linguistic output, they also constitute a highly ritualized form of speech, far from the spontaneity of ordinary speakers.

But where I choked was when Chen cited English as a ‘strong-FTR’ language (where future tense is obligatory) as opposed to German (which may dispense with the future tense to mark future time reference). That is just plain contrary to the way English works. As Pullum points out in another blog post, English commonly resorts to the present tense to convey future time reference; he gives the following examples:

Meg’s mother arrives tomorrow.
If the phone rings, don’t answer it.
My flight takes off at 8:30.
IBM is declaring its fourth-quarter profits tomorrow.

Arguing that the be going to structure is a future tense is just as wrong: it’s just the verb go in the present continuous. As for calling will + V the future tense, I’m equally baffled: will is nothing more than a modal auxiliary used, yes, in the present tense. Its original, etymological meaning is that of volition, which is how Shakespeare sometimes used it (here in its preterite form, would):

LEONTES
He dreads his wife.

PAULINA
So I would you did; then ’twere past all doubt
You’d call your children yours.

(The Winter’s Tale, 2.3.52-129)

And it is how it is still used today when, for example, taking marriage vows. Future time reference (only one of the many uses of will) is just an extension of the prospective outlook inherent in the concept of volition.

The bottom line is that, unlike French or Spanish for example, the morphology of the English verb does not comprise a future tense; English only has two tenses, the past and the present (or, more accurately, the non-past), various uses of the latter being resorted to for FTR.

Admittedly, I have no knowledge of most of the languages that Chen claims to have investigated; but after reading that English is a ‘strong-FTR’ language, my feeling was exactly the same as Pullum’s, even before I read his blog post:

If the facts are shaky for English, how likely are they to be accurate on languages that have not been studied nearly so intensively?

Yet this new study by Austrian and German economists does not consider it necessary to go back on the flawed linguistic premise:

According to Chen’s (2013) linguistic-savings hypothesis, languages which grammatically separate the future and the present (like English or Italian) induce less future-oriented behavior than languages in which speakers can refer to the future by using present tense (like German).

Cole Robertson, an Oxford PhD student specializing in language study, has written a very effective critical summary of it here, but he does not question the assertion either:

As with English, Italian requires a speaker to mark the future tense, e.g. “it will rain tomorrow” not simply “it rains tomorrow.” German, on the other hand, requires no such marking—speakers are free to say the equivalent of “it rains tomorrow.”

For both studies (Chen’s and the new one), data is not the problem: the researchers had access to vast amounts of it and, in the latter research, it was, to the best of the researchers’ ability, checked for cultural bias. After all, the hypothesis may well be a valid one. But we don’t know that because the main assumption, the classification of languages according to FTR marking, is flawed; it would be like trying to account for Jupiter’s orbit on the basis of its being a telluric planet.

In other words, mere data crunching, however extensive and sophisticated, does not constitute a linguistic analysis. In this case, to be honest, they could, nay should, have used a linguist.

IPCC reports, in the Flesch

The same goes for the second development, viz. the release of a study of IPCC reports from the viewpoint of their readability. Authored by four European academics, it was published on October 12 in Nature Climate Change and got some coverage in the French media, less so in other countries apparently. If you think that climate change is a vital concern, and that political action is urgent, then you’ll agree that their intention is good: to show that the lack of readability of IPCC Summaries for Policymakers (SPMs) is detrimental to their influence on governments, and to suggest ways for improvement.

Here data is not the problem either: the authors compiled a large corpus consisting of 20 IPCC SPM reports between 1990 and 2014 as well as their coverage in scientific and popular media, for a total of 1,024 texts (a word count would have been more relevant to assess corpus size, but there you are).

Alas, the method is flawed. As can be seen from the first page of the article, the measure of readability is based on the Flesch Reading Ease algorithm, which computes readability scores from two variables: the length of words and the length of sentences.
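For reference, the commonly published form of the Flesch Reading Ease formula is reproduced below. Note that in this form word length is counted in syllables per word and sentence length in words per sentence; software implementations typically have to estimate both counts from raw text.

```latex
\[
\text{FRE} = 206.835
  \;-\; 1.015 \,\frac{\text{total words}}{\text{total sentences}}
  \;-\; 84.6 \,\frac{\text{total syllables}}{\text{total words}}
\]
```

Higher scores indicate text that the formula rates as easier to read; nothing in it refers to vocabulary, syntax or meaning.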

This in itself is a disputable way of assessing readability. I can think of many examples where short sentences, made up of short words, may pose problems for the reader/hearer. Headlines are a case in point: their very brevity, imposed by over a century of copyediting tradition, breeds all manner of ambiguities and interpretative pitfalls. Yet, one example of the worst excesses of headlinese passes the Flesch Readability Test with flying colours:

Screen Shot 2015-10-30 at 16.31.39

I challenge you to parse that headline properly (click here and ⌘/Ctrl-F the headline for the answer). ‘Noun piles’, a recurring feature of British newspaper headlines, fare equally well regardless of the actual headache endured to decipher them:

Screen Shot 2015-10-30 at 16.40.24

Obviously, brevity does not systematically equal clarity: using longer, more precise words (consider ‘Congressman Oyster opposes inquiry into crowd mishandling’) and fleshing sentences out with morpho-syntactic clues (such as articles and auxiliaries) can improve readability dramatically by reducing the amount of interpretative effort required of the reader.

But worse still, as Language Log’s Mark Liberman has pointed out time and again (and indeed very recently), the Flesch Reading Ease algorithm doesn’t even care whether it’s dealing with real language or not: ‘word length’ is nothing more than the number of characters in an unbroken string, and ‘sentence length’ the number of those strings between two strong punctuation marks. Chains of random characters may score highly on the test, provided they consist of short enough units:

Screen Shot 2015-10-28 at 11.35.50
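The point can be reproduced with a few lines of Python. What follows is only a minimal sketch, not the implementation used in the study or on Language Log: the function names `count_syllables` and `flesch_reading_ease`, the vowel-run syllable heuristic and both sample strings are assumptions made for illustration. The constants are those of the standard Flesch Reading Ease formula.

```python
import re


def count_syllables(word: str) -> int:
    """Crude estimate (an assumption): one syllable per run of vowels, minimum one."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def flesch_reading_ease(text: str) -> float:
    """Score a text with the standard Flesch Reading Ease formula."""
    # 'Sentences' are whatever sits between strong punctuation marks.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    # 'Words' are unbroken runs of letters, meaningful or not.
    words = re.findall(r"[A-Za-z]+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one word and one sentence")
    syllables = sum(count_syllables(w) for w in words)
    return (
        206.835
        - 1.015 * (len(words) / len(sentences))
        - 84.6 * (syllables / len(words))
    )


# Hypothetical samples: pure gibberish versus a real but heavy English sentence.
gibberish = "Xq zr blk. Pfft grk wz. Mnb vcx qj."
prose = (
    "Intergovernmental negotiations concerning anthropogenic emissions "
    "necessitate international cooperation."
)

gibberish_score = flesch_reading_ease(gibberish)
prose_score = flesch_reading_ease(prose)
print(f"gibberish: {gibberish_score:.2f}")
print(f"prose:     {prose_score:.2f}")
```

Under these assumptions the string of consonant clusters comes out as far 'easier' than the grammatical sentence, simply because its units are short. The formula never checks whether those units are words of any language.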

Resorting to such a flawed methodology in what purports to be a ‘linguistic analysis’ bespeaks a lack of awareness of the reality of language and intersubjective communication. The pity is that it undermines the authors’ original intention, laudable though it is. IPCC reports may well be unreadable for a non-specialist audience, and that may very well be an important issue when it comes to influencing career politicians’ policy choices. But we won’t know from that study: if the demonstration is flawed, the conclusion is left unsupported.

Of course not everyone is going to notice that: at 24.8, the readability score of the study’s abstract (the part that goes out to the general public) means that it will be understood by roughly the same target audience as IPCC reports themselves.


One thought on “Here be data”

  1. It’s true that the linguistic analysis used by Chen was not nuanced (although based on an existing linguistic typology), but arguing that some of the languages are analysed incorrectly is not the same as explaining away the correlation. That is, how would incorrect data points provide an alternative explanation for the correlation?

    However, there was another way in which the analysis was not linguistically informed: it did not control for the historical relationships between languages. Chen’s most recent paper shows that at least part of the original claim was inflated by a lack of control for these historical relations. http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0132145
