Category Archives: Pama-Nyungan

Color in Pama-Nyungan: update

Last November, Hannah Haynie and I published a paper in the Proceedings of the National Academy of Sciences on color term systems in Pama-Nyungan. In it, we used phylogenetic methods to show that color term systems can both gain and lose terms. While they do so mostly in accordance with prior work on color term systems (Berlin and Kay, Kay and Maffi, and colleagues), we also found evidence for ‘exceptional’ systems that appeared not to conform to the B&K system. We used data from the Chirila database and fairly standard phylogenetic methods of ancestral state reconstruction.

For an analysis of this type to be correct, several assumptions must be satisfied:

  • sample data need to be representative of the languages as a whole;
  • sample data need to be correct;
  • the analytical tools need to be applicable to what’s being studied;
  • the analyses need to be interpreted correctly.

Over the last six months, Hannah and I have been in correspondence with David Nash about many of these points, particularly those involving sampling, the correctness of the underlying data, and judgments about what is a color term. In particular, in the original version of Table S1, a data conversion error resulted in words from several languages being associated with the wrong row in the table (particularly Wargamay and Warlmanpa). This did not affect the analyses reported in the paper, as the error was introduced when spreadsheets were converted to Microsoft Word documents for uploading to the journal’s online submission site. [The corrected table is available here.]

The discussions with Nash revolved around several issues already identified both in our paper and the supplementary materials:

  • the difficulty of determining whether a color term is genuinely absent from the language, or simply not recorded;
  • the difficulty of establishing the ranges of color terms glossed in English by non-native speakers of the language;
  • the issue of polysemy, for example, whether a term glossed as “unripe, green” is truly a color term, or whether “green” here is meant solely in the sense of “unripe, not ready for eating” (and therefore not glossing a true color term).

Coding decisions of this type are based on a careful philological analysis of each individual source, and while phylogenetic analyses are usually robust to individual errors, systematic errors may bias the results. In general, where Hannah and I were unsure, we tended to include rather than exclude; this applies especially to terms for ‘green’ and terms for ‘red’ based on words meaning ‘blood’ (which could be interpreted as the descriptive adjective ‘bloody’ rather than a true color term). For ‘green’ terms, many languages have a word that is glossed as ‘green’ or ‘unripe’; while some of these terms do appear to be real color terms (in that they can refer to items that aren’t unripe, like shirts), others aren’t — they refer to the ripeness of fruit, not directly to its color. (We had a similar problem with ‘grey’, which was often ambiguously glossed as a color term or a word referring only to grey hair.)

Another issue is the extent to which we make use of data from closely related languages in determining the color inventory of a particular language variety. For example, if a particular variety appears to lack a term for ‘blue’, but a term is present in other languages in the subgroup, are we justified in treating the lack of a term as a true omission? In our analyses, we treated such cases as absent rather than indeterminate, because we did not want to omit true variation in the color inventories of languages. But one could also argue that color inventories are unlikely to vary so much between dialects of the same language (or closely related languages in a subgroup), so unrecorded colors are probably omissions from data collection rather than genuine absences from the language.

We suspect that some terms were not recorded because of the linguists’ expectations about what items are present (or not) in a language. For example, Australian languages are stereotypically claimed to lack color terms beyond black, white, red, and yellow; this can lead researchers not to ask for terms like blue or purple.

Finally, data for this paper came from the Chirila database (Bowern 2016), which while extensive (800,000+ items), is by no means exhaustive. Nash brought to our attention several cases where color terms had been recorded in sources which are not in Chirila. These are also noted in the revised supplementary table and reflected in the newly uploaded analysis files.

In order to assess the impact of our coding decisions, as well as the impact of terms which were missing from Chirila and hence recorded as absent from the languages, we re-ran all analyses. We ran two sets of updated analyses. One simply corrected errors resulting from data missing from Chirila. The other also used Nash’s alternative judgments about presence/absence of color terms like ‘green’. In neither case were our main conclusions affected. That is, we still find support for both color gain and color loss. While, as is expected, the numerical values of individual results changed somewhat, our inferences and conclusions stand. Color loss is possible (under this model), though it’s substantially less common than color gain.

I am currently working on a new update to Chirila and many of these revised sources will be available there.

New bootcamp under way!

The 2017 grammar boot camp starts tomorrow. Three students (with bios below) will be working with me on materials for Noongar. We’re very lucky to be working with Denise Smith-Ali, Noongar linguist, and Sue Hanson from the Goldfields Language Centre. Our main focus for the month is to put together a phonological description of Noongar, with sound files to illustrate what we are describing. In some ways, this is pretty straightforward (in that it’s the sort of thing linguists do, the scope is known, etc.) but in other ways, it’ll be a challenge! For example, we want to make something easy to access, and easy to edit and update. We’ll be posting more about this as we make decisions.

Akshay Aitha: Akshay is a rising senior at UC Berkeley working on a double major in Linguistics and Applied Mathematics (with a concentration in Logic). My main research interest at the moment is the functional structure of nominals, especially in my heritage language, Telugu. I also have a strong enthusiasm for linguistic fieldwork. Outside of my coursework, I’ve been involved as a research assistant on various phonetics and fieldwork projects under graduate students in the Berkeley Linguistics department, and I’m also involved in my department as an officer of our club for undergraduates, SLUgS.

Lydia Ding: Lydia is a recent graduate of Carleton College, where she majored in Linguistics and completed a senior thesis for distinction on wh-questions in Nukuoro [nkr] (Polynesian). Her primary interests lie in language documentation, syntax, morphology, and computational linguistics.

Sarah Mihuc: Sarah is a recent graduate of McGill University with a BA Honours in Linguistics & Computer Science. She works on anti-agreement and on word order in Kabyle Berber. She also has experience in experimental and computational linguistics, and fieldwork on two Mayan languages.

New paper on language and genetics in Aboriginal Australia

Somewhat belatedly, here is a link to new work by me and my colleagues on gene-language coevolution in Pama-Nyungan, the peopling of Sahul, and migration and admixture in the Pleistocene. It was recently published in Nature. There’s a lot in this paper, a Genomic History indeed. There has been some media attention, particularly Michael Erard’s piece on Pama-Nyungan phylogenetics and how important computational work has been to recent advances in Australian language history. There’s also a summary piece in The Conversation, particularly about the genetic side of the paper.

Conference talk on grammar boot camps

I run a grammar boot camp every year, where a small group of students write a grammar of a language in a month. Last year it was Ngalia, and this year (starting in a few weeks) it’ll be Cundalee Wangka and Kuwarra. I also ran a year-long grammar group to pilot the idea in 2013, using materials from Tjupan. All four languages are varieties of the Wati subgroup of Pama-Nyungan and all the books are based on fieldwork conducted by Sue Hanson.

At the recent Wanala Conference run by the Goldfields Language Centre, Anaí Navarro, Matthew Tyler and I did a video presentation about the boot camp, its aims, methods, and results. Here’s a link to the video. Be warned that it’s 190 MB and 22 minutes long.

Introducing CHIRILA

I am very pleased to announce that the first phase of CHIRILA (Contemporary and Historical Resources for the Indigenous Languages of Australia) has been released. This represents approximately 180,000 words from 155 different Australian languages. It is a subset of the full database (of approx 780,000 items); eventually I hope to be able to release most of the data. Currently, the first phase is that for which we have explicit permission, or which is already in the public domain.
The material is hosted on the CHIRILA web site; please see the site for more information about the contents of the database, how to download data, what formats are available, and the like. We do not provide a web interface to the data; you download it and use Excel or a database program to read the files.
We hope the data will be useful to researchers, community members, and others with an interest in Australia’s Indigenous language heritage. The site also includes access to the preprint of a paper describing the database (both the online and full versions).

Explorations in Pama-Nyungan Phylogenetics

I recently gave one of the plenary talks at a workshop on phylogenetic algorithms at the Lorentz Center in Leiden (Netherlands). In the talk I gave an overview of a number of recent results from my research program, including the creation of a Pama-Nyungan phylogeny and some of the research results that come from that.

The slides are available from this link.

One of the results that is worth highlighting is the distribution of innovative languages within subgroups. A standard theory argues that languages innovate in the center of their ranges. The innovations diffuse across the language area over time, and therefore areas around the periphery tend to show more archaisms than those in the center. This distribution should also apply to language subgroups, assuming that language split occurs through the gradual accretion of isoglosses so that dialects split into separate languages.

If this is true, subgroup areas should show the same distributions, if not in absolute terms, then in large measure. That is, more innovative languages should lie towards the center of subgroups, and more conservative ones should lie around the edges.

It turns out that it is straightforward to plot the most innovative languages in each subgroup, according to how much basic vocabulary they have replaced. In the Chirila database, there are basic vocabulary lists coded by cognacy. To get a sense of how innovative a language is, we can simply sum, for each word in the language, the number of languages that share that cognate and divide it by the total number of language-cognate items. That gives us a sense of the extent to which languages participate in the most archaic vocabulary in the family. Plotting the most innovative language in each subgroup gives us the following map.

As you can see, the most innovative languages are not, in most cases, in the center of the subgroups, but rather on the peripheries.
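For readers who want to try the scoring step on their own cognate-coded lists, here is a minimal sketch of one reading of the calculation described above. It is an illustration only, not the code used for the map: the function names, the `(language, meaning, cognate_set)` row layout, and the toy data are all assumptions, and the placeholder language and subgroup names are made up.

```python
from collections import defaultdict


def retention_scores(rows):
    """Score each language by how widely shared its cognates are.

    rows: iterable of (language, meaning, cognate_set) tuples (assumed layout).
    For each language, sum the number of languages attesting each of its
    cognate sets, then divide by the total number of language-cognate items.
    Lower scores mean more innovative (less widely shared) vocabulary.
    """
    rows = list(rows)
    languages_per_set = defaultdict(set)
    for language, meaning, cognate_set in rows:
        languages_per_set[(meaning, cognate_set)].add(language)

    total_items = len(rows)
    sums = defaultdict(int)
    for language, meaning, cognate_set in rows:
        sums[language] += len(languages_per_set[(meaning, cognate_set)])
    return {language: s / total_items for language, s in sums.items()}


def most_innovative(scores, subgroup_of):
    """Pick the lowest-scoring language in each subgroup."""
    picks = {}
    for language, score in scores.items():
        subgroup = subgroup_of[language]
        if subgroup not in picks or score < scores[picks[subgroup]]:
            picks[subgroup] = language
    return picks


# Made-up toy data: five placeholder languages, two meanings.
rows = [
    ("A", "hand", "h1"), ("B", "hand", "h1"), ("C", "hand", "h2"),
    ("D", "hand", "h1"), ("E", "hand", "h3"),
    ("A", "eye", "e1"), ("B", "eye", "e1"), ("C", "eye", "e1"),
    ("D", "eye", "e1"), ("E", "eye", "e2"),
]
subgroup_of = {"A": "North", "B": "North", "C": "North",
               "D": "South", "E": "South"}

scores = retention_scores(rows)
picks = most_innovative(scores, subgroup_of)
print(scores)
print(picks)
```

In this toy example the languages with unshared cognate sets (C and E) come out as the most innovative in their subgroups. Other normalizations (for example, dividing by the number of items per language) would be equally defensible readings of the description.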

What can explain the discrepancy? It’s probably the result of migratory expansions. That is, the languages that are the most innovative are the ones at the ‘ends’ of their subgroup phylogenetic expansions. Put differently, the most innovative languages are the ones that have undergone the most branching; another way of thinking about this is that more innovation happens on lineages with more branching events. This echoes a result from other work by Atkinson, Pagel, and colleagues, who also found that lineage splitting speeds up change.

One might think that this result reflects language contact; that is, that languages on the periphery might be in contact with more different languages, which leads to an increase in unidentifiable vocabulary. But these languages are not the only ones which are in contact with languages from other subgroups. In fact, if we map the most conservative languages in each subgroup, they are also often to be found around the periphery.

It may still be the case that the center-periphery model holds in areas where languages have stopped expanding, and that Pama-Nyungan subgroups were (on the whole) not formed by diversification in situ.

It’s also interesting to plot the most and least conservative subgroups:

This is a bit more dodgy. For example, I strongly suspect that Thura-Yura’s place in this list is inflated by Wirangu having (as loans) a number of items that are otherwise found only in Western Pama-Nyungan languages, and by Wirangu overall showing some Pama-Nyungan retentions that are otherwise replaced in the rest of Thura-Yura. The broad trend, however, is that the further east, the less conservative. The correlation between longitude and retention is -0.49. The correlation doesn’t hold for latitude (0.05) or number of languages in the subgroup (-0.02).
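As a rough illustration of the kind of check reported above, the sketch below computes a correlation coefficient between subgroup longitude and a retention score. It assumes a Pearson correlation, which the post does not specify, and the longitudes and retention values are invented placeholders rather than figures from the database.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return covariance / (spread_x * spread_y)


# Hypothetical subgroup centroids (degrees east) and retention scores.
longitudes = [120.0, 130.0, 140.0, 150.0]
retention = [0.8, 0.6, 0.7, 0.4]

r = pearson(longitudes, retention)
print(round(r, 2))
```

A negative value here would correspond to the pattern described above: the further east the subgroup, the lower its retention.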

Language by source materials

For the curious, here is a map of the languages in the full database, color-coded by number of items. As you can see, there’s considerable variation, but there are also a good number of languages with substantial holdings.


Counts of sources in Australian lexical database, as at August 19, 2015