Category Archives: Lexicology/lexicography

Introducing CHIRILA

I am very pleased to announce that the first phase of CHIRILA (Contemporary and Historical Resources for the Indigenous Languages of Australia) has been released. This represents approximately 180,000 words from 155 different Australian languages. It is a subset of the full database (of approximately 780,000 items); eventually I hope to be able to release most of the data. For now, the first phase covers the material for which we have explicit permission, or which is already in the public domain.
The material is hosted at; please see the web site for more information about the contents of the database, how to download data, what formats are available, and the like. We do not provide a web interface to the data; you download it and use Excel or a database program to read the files.
We hope the data will be useful to researchers, community members, and others with an interest in Australia’s Indigenous language heritage. The site also includes access to the preprint of a paper describing the database (both the online and full versions).

Language by source materials

For the curious, here is a map of the languages in the full database, color-coded by number of items. As you can see, there’s considerable variation, but there are also a good number of languages with substantial holdings.


Counts of sources in Australian lexical database, as at August 19, 2015

Plain English Description of Australian Comparative Database

I have circulated this plain English description of the Pama-Nyungan (now Comparative Australian*) lexical database to various language centres in Australia, but I’m posting it here too in case it’s useful to others writing such descriptions, and in case others would like to know about the database in broad terms. I am in the process of writing a more detailed paper that describes the database.



One of the great things about co-teaching is all the stuff you learn from your co-instructor. Arienne gave a nice demo today of TextStat, a flexible concordance program from the Dutch studies dept at the Freie Universitaet Berlin. It’s free, and available for Windows, Mac, and Linux.

Its major advantage is that it will read Word and OpenOffice files. That is, you don’t need to format the input text in any special format before it’s imported into the program. It will also retrieve web pages.

As programs go, it’s pretty simple. It does wordlist generation and concordancing, and you can view citations in context or in list format. But that’s already pretty useful. It’s very memory-light and doesn’t take up much space on the hard drive. Installation is easy (on Windows, just unzip the archive). If you want high-powered concordance software, NLP tools are for you, but if you want an easy way to see what’s in your data, this is definitely the way to go.
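
For anyone wondering what wordlist generation and concordancing actually involve, here is a minimal Python sketch. It is not how TextStat itself works; the function names, the tokenizing rule, and the sample sentence are all made up for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    """Split text into lowercase word tokens (a deliberately simple rule)."""
    return re.findall(r"\w+", text.lower())


def wordlist(text):
    """Return (word, frequency) pairs, most frequent first."""
    return Counter(tokenize(text)).most_common()


def concordance(text, keyword, width=2):
    """Return keyword-in-context lines, with `width` tokens on each side."""
    tokens = tokenize(text)
    target = keyword.lower()
    lines = []
    for i, tok in enumerate(tokens):
        if tok == target:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left} [{tok}] {right}".strip())
    return lines


# Invented sample text, not taken from any real corpus.
sample = "The snake swam past the reef. A sea snake is hard to photograph."

for line in concordance(sample, "snake"):
    print(line)
```

A real concordancer adds file import, sorting, and a much smarter tokenizer, but the wordlist and the citations in context both come down to counting tokens and slicing a window around each hit.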

Public domain pictures of Australian fauna

I just came across, and it’s great. I am continuing to work (albeit slowly) on the Bardi dictionary, and one of the tasks is finding good illustrations for the plants and animals that have been identified in Bardi country. Coming up with good pictures can be difficult. It’s a web/electronic dictionary, so they can be in colour, but the pictures need to be good illustrations of the relevant species and ideally they would show distinguishing markings. Ideally, they would be taken in an environment similar to the area around One Arm Point. And it would make things much simpler if the pictures were either in the public domain, or at least able to be reproduced without breaking the law. I have many pictures of One Arm Point: last time I was on fieldwork I spent an afternoon taking photos of everything I could think of, with an eye to illustrating the dictionary. But we have hundreds of plants and animals in the dictionary, and there are some that I’d prefer not to get close enough to photograph, like this sea snake:

Sea has a nice range of pictures, many of which are sourced from public domain or sharable sites.

Call for discussion

As readers may know, I’ve been compiling a database of lexical items for Australian languages (funded by the NSF). It started with Pama-Nyungan but I’m now expanding it to the rest of the families. It’s reached 600,000 items now so it’s quite a nifty research tool for looking at language relationships.
I have more than 1000 sources, including data from many people in the Australianist community, for which I’m very grateful. This data is subject to a whole bunch of different requirements, permissions, and restrictions, ranging from “sure, here’s the file, feel free to pass it on” to “I can give you this but you can’t tell anyone you have a copy.” This makes long-term planning for the materials rather complicated.
Currently the only people with access to this database are me, my research assistants, the post-doc working on the project, and 3 people I’ve given downloads to. As word is getting out about the database, I’m now receiving requests to look things up. That’s great; it took a heap of work by many people to put it together and it’s right that it shouldn’t be my private play-database. There are many lifetimes of work that could be done on this.
This now raises the question about how (and whether) to make the materials more available in a way that’s useful, and that protects the original copyright and IP rights of contributors. For example, I have e-copies (in some cases, re-typed copies) of material that’s still in print. I would not want the original authors to lose sales.
Here is something I’m starting to think about in terms of db development, and I’d welcome readers’ feedback. I want to stress that this will be a very long process, well beyond the life of the current grant (which will be up in 2012).
  • Users would need to register to get any access to data. In order to register, they would have to sign an agreement to respect the rights of the depositors and the database owners. This would at minimum include strictly non-commercial use and would be subject to fair use agreements.
  • There would be different user types which would provide access to different data levels. Queries might originally be limited to a single language or a certain number of word lookups, for example.
  • User types/roles might include the following (these roles are, of course, not mutually exclusive):
      • community member for a language or group (total access to sources on that language);
      • data contributor (provider of wordlists or reconstructions; in one tongue-in-cheek game-theory-oriented case, I am thinking of an access level such as ‘if you allow your data to be viewed by others, you can get full access to the data from other people who have made the same agreement’*);
      • student;
      • general researcher.
  • Every single data point would be referenced with its original source (that is currently how the db has words; it is not like some of these online dbs that have the original source only on the dictionary ‘front page’; the LEGO lexicon project, for example, does that). Citations would need to be to the original data source (as well as some acknowledgement of this db).
  • I am also considering how reconstructions might be shown; for example, would all supporting data be shown, or just a subset of it?
  • There are many other issues to consider; please let me know your thoughts. Anyone is welcome to comment, and I’d particularly like to hear from potential users and Aboriginal community members.
    *In another, even more tongue-in-cheek access view, I am thinking that the number of search returns you get would scale with the number of data points you have contributed; in that view, Luise Hercus and Patrick McConvell, for example, would get essentially unlimited searching, while the linguist who shall remain nameless but whose response was “over my dead body” would be restricted to, say, 10 hits.
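
To make the access levels above a little more concrete, here is a rough Python sketch of how the checks might fit together. None of this exists in the database: the role names, field names, the `public` flag, and the numbers in `max_hits` are all assumptions made up for illustration.

```python
from dataclasses import dataclass, field

# Hypothetical role labels, loosely following the list above.
COMMUNITY = "community"
CONTRIBUTOR = "contributor"
STUDENT = "student"
RESEARCHER = "researcher"


@dataclass
class User:
    name: str
    roles: set = field(default_factory=set)                # roles can overlap
    community_languages: set = field(default_factory=set)
    signed_agreement: bool = False   # registration requires signing
    shares_own_data: bool = False    # opted in to reciprocal sharing
    contributed_items: int = 0


@dataclass
class Entry:
    form: str
    language: str
    source: str                      # every data point keeps its original source
    public: bool = False             # assumed flag: cleared for any registered user
    open_to_sharers: bool = False    # depositor joined the reciprocal agreement


def can_view(user, entry):
    """Decide whether a user may see one entry (illustrative rules only)."""
    if not user.signed_agreement:
        return False
    if entry.public:
        return True
    if COMMUNITY in user.roles and entry.language in user.community_languages:
        return True
    if CONTRIBUTOR in user.roles and user.shares_own_data and entry.open_to_sharers:
        return True
    return False


def max_hits(user, base=10, items_per_extra_hit=100):
    """Footnote idea: search returns scale with contributed data points."""
    return base + user.contributed_items // items_per_extra_hit
```

With these made-up numbers, someone who has contributed nothing gets the base 10 hits, and someone who has contributed 50,000 items gets 510. Whether limits should be per query, per day, or per language is one of the things I'd want feedback on.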

Linguist in the news

Jangari’s in an article in the Sydney Morning Herald!