Call for discussion

As readers may know, I’ve been compiling a database of lexical items for Australian languages (funded by the NSF). It started with Pama-Nyungan but I’m now expanding it to the rest of the families. It’s reached 600,000 items now so it’s quite a nifty research tool for looking at language relationships.
I have more than 1000 sources, including data from many people in the Australianist community, for which I’m very grateful. This data is subject to a whole bunch of different requirements, permissions, and restrictions, ranging from “sure, here’s the file, feel free to pass it on” to “I can give you this but you can’t tell anyone you have a copy.” This makes long-term planning for the materials rather complicated.
Currently the only people with access to this database are me, my research assistants, the post-doc working on the project, and 3 people I’ve given downloads to. As word is getting out about the database, I’m now receiving requests to look things up. That’s great; it took a heap of work by many people to put it together and it’s right that it shouldn’t be my private play-database; there are many lifetimes of work that could be done on this.
This now raises the question about how (and whether) to make the materials more available in a way that’s useful, and that protects the original copyright and IP rights of contributors. For example, I have e-copies (in some cases, re-typed copies) of material that’s still in print. I would not want the original authors to lose sales.
Here is something I’m starting to think about in terms of db development, and I’d welcome readers’ feedback. I want to stress that this will be a very long process, well beyond the life of the current grant (which will be up in 2012).
  • Users would need to register to get any access to data. In order to register, they would have to sign an agreement to respect the rights of the depositors and the database owners. This would at minimum include strictly non-commercial use and would be subject to fair use agreements.
  • There would be different user types which would provide access to different data levels. Queries might originally be limited to a single language or a certain number of word lookups, for example.
  • User types/roles might include the following (these roles are, of course, not mutually exclusive):
      • community member for a language or group (total access to sources on that language);
      • data contributor (provider of wordlists or reconstructions; in one tongue-in-cheek, game-theory-oriented version, I am thinking of an access level such as ‘if you allow your data to be viewed by others, you can get full access to the data from other people who have made the same agreement’*);
      • student;
      • general researcher.
  • Every single data point would be referenced with its original source (that is how the db currently stores words; it is not like some online dbs that give the original source only on the dictionary ‘front page’, as the LEGO lexicon project does, for example). Citations would need to be to the original data source (as well as some acknowledgement of this db).
  • I am also considering how reconstructions might be shown; for example, would all supporting data be shown, or just a subset of it?
  • There are many other issues to consider; please let me know your thoughts. Anyone is welcome to comment, and I’d particularly like to hear from potential users and Aboriginal community members.
    *In another, even more tongue-in-cheek access view, I am thinking that the number of search returns you can have would be scaled by the number of data points you have contributed; in that view, Luise Hercus and Patrick McConvell, for example, would get essentially unlimited searching, while the linguist who shall remain nameless but whose response was “over my dead body” would be restricted to, say, 10 hits.
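
To make the access ideas above a little more concrete, here is a minimal Python sketch of how such rules might be expressed. It is only an illustration under stated assumptions, not a description of how the db works or will work: the names `Role`, `Entry`, `User`, `can_view`, `hit_limit`, and `search` are all hypothetical, as are the `open_access` and `reciprocal` flags and the floor of 10 hits (taken from the "say, 10 hits" in the footnote). The one feature drawn directly from the current db is that every entry carries its own source.

```python
from dataclasses import dataclass, field
from enum import Enum, auto

BASE_HITS = 10  # hypothetical floor, echoing the "say, 10 hits" in the footnote


class Role(Enum):
    COMMUNITY_MEMBER = auto()
    CONTRIBUTOR = auto()
    STUDENT = auto()
    RESEARCHER = auto()


@dataclass(frozen=True)
class Entry:
    language: str
    form: str
    gloss: str
    source: str        # original source, kept on every single entry
    contributor: str   # who deposited the item
    open_access: bool  # assumption: depositor lets any registered user see it
    reciprocal: bool   # assumption: depositor joined the reciprocal-sharing pool


@dataclass
class User:
    name: str
    roles: set = field(default_factory=set)  # roles are not mutually exclusive
    signed_agreement: bool = False
    community_languages: set = field(default_factory=set)
    in_reciprocal_pool: bool = False
    contributed_points: int = 0


def can_view(user: User, entry: Entry) -> bool:
    """Decide whether a user may see one entry."""
    # No access at all without registering and signing the agreement.
    if not user.signed_agreement:
        return False
    # Open material, and a depositor's own material, is always visible.
    if entry.open_access or entry.contributor == user.name:
        return True
    # Community members get total access to sources on their language.
    if Role.COMMUNITY_MEMBER in user.roles and entry.language in user.community_languages:
        return True
    # Contributors who share their data see others who made the same agreement.
    return (Role.CONTRIBUTOR in user.roles
            and user.in_reciprocal_pool
            and entry.reciprocal)


def hit_limit(user: User) -> int:
    """Search returns scaled by contributed data points, with a small floor."""
    return max(BASE_HITS, user.contributed_points)


def search(user: User, entries: list, language: str) -> list:
    """Single-language lookup: viewable entries only, capped at the user's limit."""
    visible = [e for e in entries if e.language == language and can_view(user, e)]
    return visible[:hit_limit(user)]
```

Because each `Entry` keeps its own `source`, anything returned by `search` can be cited to the original data source rather than only to the db. Whether students and general researchers should see anything beyond openly shared material is exactly the kind of question still open for discussion; the sketch simply assumes they do not.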

    9 responses to “Call for discussion”

    1. Pingback: Endangered Languages and Cultures » Blog Archive » Have your say

    2. Could you add to your posting some information on whether NSF, in supporting this work, has imposed any publication or accessibility requirements on the project and, if so, what they are? Thanks.

    3. Sure. The requirements for accessibility are not specifically about the database. As I understand it, the NSF encourages grantees to make the results of their research available as widely as possible. I have been doing this by putting paper offprints on my web page (and I’m working with Yale’s library to set up something in their institutional repository, but that’s down the road a little because they don’t have an e-repository yet). I have also been working informally with a number of Aboriginal and Torres Strait Islander communities in making materials available, and I will continue to do that no matter what discussions take place here.
      In my original grant application, I stressed the complexity of the IP restrictions on this material and left it for future consideration.
      So, in short, while I can’t speak for the NSF, I believe that they would encourage making these materials available within IP practicalities, but there is nothing in their requirements that must shape the form of this distribution.

    4. Hi Claire,

      This is a really interesting question. As someone who runs a few language databases (e.g. the Austronesian Basic Vocabulary Database and POLLEX-Online), my take on this is that there are far too many linguists out there who think of the data as their own private playground. I too have encountered people who wouldn’t give me data “over their dead bodies”. As I see it, there’s absolutely no justification for this – the data do not belong to them. They might have collected it, but it’s part of humanity’s shared knowledge and should be treated as such. Anything else is cultural theft. (For the record, the vast majority of people that I’ve asked for data have been very happy to contribute.) I really think linguistics needs to look towards disciplines like biology, where data are deposited in online repositories like GenBank as a condition of publication.

      So – I have all the data available (except where the databases are still in-development) with prominent notices about how to cite the database in question. It does mean I get ripped off occasionally by people downloading data and reposting it without attribution (I won’t mention any names). On the other hand, hopefully it’ll shame those not-over-my-dead-body types into doing the right thing.

      As for IP issues – I think citing the original source (dictionary, wordlist, whatever) is the way to go. On the other hand – we need to get credit for building databases. I like the way the WALS database has done it, where the IP belongs to the authors of the chapters, i.e. Matthew Dryer wrote the chapter on word order, and provided the word order data. If you use that data, you need to cite Dryer’s chapter. This might work for situations where all the entries are from one source (e.g. the Arosi wordlist here from Fox’s 1970 dictionary), but what do you do when there are many sources for, say, one language? For the POLLEX website I’ve tried to add a source reference for each entry, but I suspect people will just cite the database rather than the multiple sources.

      As an aside, I’ve never really understood the point of limiting how much people can use your website – surely that’s just punishing your best users? (Note that this is a different issue to limiting the automatic spidering of the site).


    5. Pingback: Have your say – Ethnos Project Crisis Zone

    6. “Every single data point would be referenced with its original source”
      Don’t you mean that the reference is to the proximate source, that is, the archive item or publication from which your project imported the datum? Each source you used would in turn have other sources, declared for many or most vocabulary items.
      “Citations would need to be to the original data source”
      I suppose citation of your proximate data source would suffice, as that source could be checked and someone, if motivated, could follow the trail back as far as needed.

    7. I’m glad to see you considering these issues. I wonder, though, why you want to take on the role of archivist. Is there not an appropriate archive somewhere that can house, ensure the long-term preservation of, and provide appropriate (and appropriately restricted) access to these materials? Presumably, you would be able to provide updated versions of the database as your work progresses.

    8. Emily, I don’t see this as archiving the materials but as setting up something that would be an ongoing project with contributors to both the raw data and the historical side of things. Multiple versions would be a nightmare for a project like this. But anyway, even if I did outsource the archiving side of things, these questions would still need to be resolved.

    9. Yes, absolutely they would still need to be resolved (and any archive worth its salt would listen to your preferences about how). I hope you will think about archiving at some point, because resources such as the one you are creating should be preserved for posterity.
