Category Archives: Technology and Software

Great digital tools

Nick Thieberger has a great post on new digital tools in the humanities (bleeding over into linguistics). It’s been a while since I’ve done any trawling for new programs, and it looks like there are plenty of new things available for lots of different types of projects. Some are a bit too enigmatic for my liking. NewRadial, for example, is ‘data analysis for the humanities,’ but exactly what that entails isn’t clear. Catma looks kind of useful though. I can imagine using it to tag texts for interesting grammatical features, for example. Text Analysis Markup System is another program in the same vein.

I couldn’t quite see the point of voyant-tools, though it does produce pretty word graphics. Nodex looks like it might be a handy network mapping tool (e.g. for mapping loanword data). It’s Windows-only though, I see. OpenHeatMap is a simpler version of Google’s Fusion Tables. Lots of bibliographical software here, including some nice plugins for Zotero. And here’s a list of transcription tools.

Enjoy!

TextStat

One of the great things about co-teaching is all the stuff you learn from your co-instructor. Arienne gave a nice demo today of TextStat, a flexible concordance program from the Dutch Studies department at the Freie Universitaet Berlin. It’s free, and available for Windows, Mac, and Linux.

Its major advantage is that it will read Word and OpenOffice files. That is, you don’t need to convert the input text to any special format before importing it into the program. It will also retrieve web pages.

As programs go, it’s pretty simple. It does wordlist generation and concordancing, and you can view citations in context or in list format. But that’s already pretty useful. It’s very memory-light and doesn’t take up much space on the hard drive. Installation is easy (just unzip the archive on Windows). If you want high-powered concordance software, NLP tools are for you, but if you want an easy way to see what’s in your data, this is definitely the way to go.
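
For anyone who hasn’t used a concordancer, the core idea is easy to sketch. The few lines of Python below are just an illustration of wordlist generation and keyword-in-context display, not of how TextStat itself works; the file name and search word are made up.

# Minimal keyword-in-context (KWIC) concordance: an illustration of what
# programs like TextStat do, not their actual implementation.
import re
from collections import Counter

def wordlist(text):
    """Return a frequency list of the word forms in the text."""
    return Counter(re.findall(r"\w+", text.lower()))

def concordance(text, keyword, width=30):
    """Print each occurrence of keyword with `width` characters of context."""
    for m in re.finditer(r"\b%s\b" % re.escape(keyword), text, re.IGNORECASE):
        left = text[max(0, m.start() - width):m.start()].rjust(width)
        right = text[m.end():m.end() + width]
        print("%s[%s]%s" % (left, m.group(0), right))

if __name__ == "__main__":
    # 'mytext.txt' and the search word are placeholders; TextStat can read
    # .doc/.odt files directly, which this sketch does not.
    text = open("mytext.txt", encoding="utf-8").read()
    print(wordlist(text).most_common(20))
    concordance(text, "ngathu")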

Australian Language Polygons and new Centroid files

I’ve finished a *draft* Google Earth (.kmz) file with locations of Australian languages, organised by family and subgroup.

Some things to note:

  • You may use these files for education and research purposes only.
  • NO commercial use under any circumstances without my written permission.
  • NO republication under any circumstances without my written permission.
  • You may quote from these files. Please use the following citation: Bowern, C. (2011). Centroid Coordinates for Australian Languages v2.0. Google Earth .kmz file, available from http://pantheon.yale.edu/~clb3/
  • These files represent my compilation of many available sources, but are known to be deficient in a number of areas. Some sources are irreconcilable. This work is unsuitable for use as evidence in Native Title (land) claims.
  • Please do not repost or circulate these files. Send interested people to this page. I will be updating the files from time to time.
  • Please let me know of errors! The easiest way to do this is to change the polygon or centroid point for the language(s) you are correcting, and send me that item as a kml file.
  • If you use derivatives of this file (e.g. you calculate language areas from it, convert it to ArcGIS, etc.), that’s fine, but please send me a copy of the derivative file. One example of this kind of derivative use is sketched below.
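
As an illustration of that last point, here is a rough sketch of pulling the polygons out of the .kmz and estimating language areas. It assumes the usual KMZ layout (a zip archive containing a doc.kml with standard Placemark/Polygon markup) and the shapely and pyproj libraries; the file name is a placeholder, and raw latitude/longitude areas are meaningless, hence the reprojection to an equal-area projection.

# Sketch: extract language polygons from the .kmz and estimate their areas.
# Assumes the conventional KMZ layout (a zip containing doc.kml) and standard
# KML markup; element paths and the file name may need adjusting.
import zipfile
import xml.etree.ElementTree as ET
from shapely.geometry import Polygon
from shapely.ops import transform
from pyproj import Transformer

KML_NS = {"kml": "http://www.opengis.net/kml/2.2"}
# Equal-area projection for Australia (GDA94 / Australian Albers, EPSG:3577).
to_albers = Transformer.from_crs("EPSG:4326", "EPSG:3577", always_xy=True).transform

def polygons(kmz_path):
    with zipfile.ZipFile(kmz_path) as kmz:
        root = ET.fromstring(kmz.read("doc.kml"))
    for pm in root.iter("{http://www.opengis.net/kml/2.2}Placemark"):
        poly_el = pm.find(".//kml:Polygon", KML_NS)
        if poly_el is None:
            continue  # centroid placemarks contain Points, not Polygons
        coords_el = poly_el.find(".//kml:coordinates", KML_NS)
        name_el = pm.find("kml:name", KML_NS)
        # KML coordinates are "lon,lat[,alt]" triples separated by whitespace
        pts = [tuple(map(float, c.split(",")[:2])) for c in coords_el.text.split()]
        yield (name_el.text if name_el is not None else "?", Polygon(pts))

for name, poly in polygons("australian_languages.kmz"):  # placeholder file name
    area_km2 = transform(to_albers, poly).area / 1e6
    print("%s\t%.0f km2" % (name, area_km2))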

Python script to convert backslash codes to tabbed text

Sophia Gilman is a Yale student who’s been working on my NSF Pama-Nyungan project this year. One of the things she’s been working on is a script to convert irregularly ordered backslash codes to a tabbed text file (for further import into database programs). The script takes the backslash file, detects the headword code, asks you a bunch of questions about it, and sets up a file with each backslash code as a column in a table.
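
The core of that kind of conversion can be sketched in a few lines of Python. This is only an illustration of the general backslash-code-to-table idea, not Sophia’s script (which, among other things, detects the headword code interactively, promotes subentries, and splits multiple glosses); the headword marker and file names here are just examples.

# Illustration only: turn a backslash-coded lexicon into a tab-separated table,
# one row per entry, one column per backslash code.
import csv
import re
from collections import OrderedDict

HEADWORD = r"\lx"  # standard MDF headword marker; the real script asks you for this

def read_entries(path):
    entries, entry = [], None
    for line in open(path, encoding="utf-8"):
        m = re.match(r"(\\\S+)\s*(.*)", line.strip())
        if not m:
            continue  # continuation lines are ignored in this sketch
        code, value = m.groups()
        if code == HEADWORD:           # a new headword starts a new record
            entry = OrderedDict()
            entries.append(entry)
        if entry is not None:
            # concatenate repeated codes rather than overwriting them
            entry[code] = (entry.get(code, "") + " " + value).strip()
    return entries

def write_table(entries, out_path):
    columns = []
    for e in entries:                  # collect every code seen, in order
        for code in e:
            if code not in columns:
                columns.append(code)
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f, delimiter="\t")
        writer.writerow(columns)
        for e in entries:
            writer.writerow([e.get(c, "") for c in columns])

write_table(read_entries("lexicon.txt"), "lexicon.tsv")  # placeholder file names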

The script was developed specifically for our project and its needs, but it’s flexible enough that it might be useful for others too. We’re making the script available for free, but it’s Sophia’s work, and she (and the NSF project BCS-844550) should be acknowledged if you use the script in work that results in publications.

A couple of notes on the script:

  • It’s a Python script. You need to have Python installed on your computer to run it. If you don’t have Python and you have backslash-coded files for Australian languages that you would like to convert, we can help. If you’re working on another area, though, I’m afraid we can’t provide any support for script use.
  • SIL’s MDF (Toolbox) standard codes are hard-coded into the program.
  • Some features are specific to the needs of my NSF project and may be irritating to others:
  • Subentries are converted to main entries. The program makes some effort to treat material appropriate to the entry as a whole as belonging to each newly created record.
  • Multiple glosses are converted to multiple records.
  • Examples are not split into multiple table columns; they are grouped into a single column.

The script is available here. If you modify it for your own use, we’d appreciate a copy.

Call for discussion

As readers may know, I’ve been compiling a database of lexical items for Australian languages (funded by the NSF). It started with Pama-Nyungan but I’m now expanding it to the rest of the families. It’s reached 600,000 items now, so it’s quite a nifty research tool for looking at language relationships.
I have more than 1000 sources, including data from many people in the Australianist community, for which I’m very grateful. This data is subject to a whole bunch of different requirements, permissions, and restrictions, ranging from “sure, here’s the file, feel free to pass it on” to “I can give you this but you can’t tell anyone you have a copy.” This makes long-term planning for the materials rather complicated.
Currently the only people with access to this database are me, my research assistants, the post-doc working on the project, and 3 people I’ve given downloads to. As word is getting out about the database, I’m now receiving requests to look things up. That’s great: it took a heap of work by many people to put it together, and it’s right that it shouldn’t be my private play-database; there are many lifetimes of work that could be done on this.
This now raises the question about how (and whether) to make the materials more available in a way that’s useful, and that protects the original copyright and IP rights of contributors. For example, I have e-copies (in some cases, re-typed copies) of material that’s still in print. I would not want the original authors to lose sales.
Here is something I’m starting to think about in terms of db development, and I’d welcome readers’ feedback. I want to stress that this will be a very long process, well beyond the life of the current grant (which will be up in 2012).
  • Users would need to register to get any access to data. In order to register, they would have to sign an agreement to respect the rights of the depositors and the database owners. This would at minimum include strictly non-commercial use and would be subject to fair use agreements.
  • There would be different user types which would provide access to different data levels. Queries might originally be limited to a single language or a certain number of word lookups, for example.
  • User types/roles might include the following (these roles are, of course, not mutually exclusive):
  • community member for a language or group (total access to sources on that language);
  • data contributor (provider of wordlists or reconstructions; in one tongue-in-cheek, game-theory-oriented case, I am thinking of an access level such as ‘if you allow your data to be viewed by others, you can get full access to the data from other people who have made the same agreement’*);
  • student;
  • general researcher.
  • Every single data point would be referenced with its original source (that is how the db currently stores words; it is not like some online dbs that have the original source only on the dictionary ‘front page’; the LEGO lexicon project, for example, does that). Citations would need to be to the original data source (as well as some acknowledgement of this db).
  • I am also considering how reconstructions might be shown; for example, would all supporting data be shown, or just a subset of it?
  • There are many other issues to consider; please let me know your thoughts. Anyone is welcome to comment, and I’d particularly like to hear from potential users and Aboriginal community members.
    *In another, even more tongue-in-cheek access scheme, I am thinking that search returns could be scaled by contributed data points; in that view, Luise Hercus and Patrick McConvell, for example, would get essentially unlimited searching, while the linguist who shall remain nameless but whose response was “over my dead body” would be restricted to, say, 10 hits.

Elan transcription mode

The good people who maintain Elan recently announced a new version, with a new transcription mode (and some other goodies too). I’ve been using it for a few days now.

For the most part, it’s working well. It is a definite improvement over the annotation mode for rapid transcription. Cutting down on the navigation between annotations and between tiers produces a noticeable time-saving, as well as a minor decrease in frustration with the program, which is definitely worth it. I use Elan for all transcription now, so I’m pretty pleased with this. The table interface is nice, navigation is easy, and starting and stopping the audio with tab is also very useful. I like being able to keep the current annotation in the centre too.

Segmentation mode has also improved and is more intuitive and easier to use. The program also seems (though it may be my imagination) to be running a little faster and coping better with audio on my Mac.

There are a couple of things that I think could be improved for usability. One is the barrier that the naming conventions in the program impose. When I first started using Elan I could never keep straight the differences between ‘included in’, ‘time subdivision’, and ‘symbolic subdivision’. The explanation in the manual was pretty difficult to follow. Now, it turns out that I did not interpret the different types in the same way that the authors of the program did. This means that my free translations are not of the correct type to show up automatically in the annotation mode (they need to be ‘symbolic associations’). I can see the justification for treating them as ‘symbolic associations’ (though I maintain that this terminology is not user-friendly), though I can also see perfectly reasonable arguments for having free translations as ‘included in’ or ‘time subdivision’, depending on what the unit of the parent tier is. After all, the parent tier unit could be either intonation units (useful for transcription, but not full sentences) or sentences. Ideally one would represent both, but that’s a big job, especially when one can’t read off the ‘sentences’ from the waveform in segmentation mode. One has to transcribe the IUs first, then merge those annotations into a set of larger ones; but we would want the sentence-level annotations to be the parent of the IU translations, and that, I think, one can’t do. [Please tell me if I’m wrong…]

Changing the data type for free translations is not a problem for new transcriptions, but it is incredibly irritating for the large number of partially complete transcript files I have in my collection. I’ll need to change quite a few files by hand. However, because you can’t change a tier type once the tier has information in it, and you apparently can’t paste information into a tier of type symbolic association, this is going to be a big job. I’m not sure if it’s worth doing, since almost all my searching is done on the language transcription tier anyway. It’ll introduce a major inconsistency into the transcripts, but unless there’s an easier way to change the type of a tier so that it is linked, I think a better use of my time is doing more transcription…
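
For anyone facing the same clean-up, the .eaf files are plain XML, so it is at least easy to script an audit of which tiers have which types before deciding what to change. Here is a minimal sketch; it assumes the standard EAF attribute names, the folder name is made up, and it only reports — actually changing a tier’s type safely is another matter.

# Report each tier's linguistic type, constraint, and parent across a folder of
# .eaf files, so you can see which transcripts still need their free-translation
# tiers changed. EAF files are XML; the attribute names below follow the EAF schema.
import glob
import xml.etree.ElementTree as ET

for path in sorted(glob.glob("transcripts/*.eaf")):   # placeholder folder name
    root = ET.parse(path).getroot()
    constraints = {
        lt.get("LINGUISTIC_TYPE_ID"): lt.get("CONSTRAINTS", "(top level)")
        for lt in root.iter("LINGUISTIC_TYPE")
    }
    print(path)
    for tier in root.iter("TIER"):
        ling_type = tier.get("LINGUISTIC_TYPE_REF")
        print("  %-20s type=%-15s constraint=%-22s parent=%s" % (
            tier.get("TIER_ID"),
            ling_type,
            constraints.get(ling_type, "?"),
            tier.get("PARENT_REF", "-"),
        ))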

I haven’t worried too much about the impenetrability of instructions in the past, but while I’m on that topic, I have to say that the instructions for the new annotation mode have some of the same issues. There are a lot of screenshots, and there’s a lot of detail, which is great, but the tier names in the example are extremely difficult to remember, and the tier types are also pretty unintuitive. For example, why have ‘po’ for ‘practical orthography’ but ‘tf’ for ‘free translation’? Why reverse some names but not others? In the past I have not worried too much about things like this, because I don’t have a problem learning new software and I usually don’t mind spending a bit of time figuring out what the program is trying to say (or, more often, I figure it out without reading the manual in the first place). But now that I am spending more time teaching this software as part of classes, I’m increasingly finding that very complex naming conventions that are not easily remembered are really unhelpful in teaching. They really increase the take-up time needed and the amount of time needed to explain things in class, because I have to explain the ontology as well as how to use the program. That would be ok if I taught a separate class on fieldwork software, but I can’t do that, and I don’t really want to.

Elan is not, of course, the only piece of software to have a confusing ontology and set of naming conventions (how long was it before Praat got an “open” file button??). But it’s something that could be improved, I think. (Of course, for current users who have made the effort to learn the existing ontology, changing it will merely annoy…)

Anyway, back to the review. One thing I noticed with this mode is that it makes navigation within the annotation more difficult, at least on a Mac. It is quite difficult to move between words within the annotation, because some of the shortcuts (e.g. option + right) are now defined to move the user between table cells in the annotation. This makes proofing transcripts a bit difficult, as is going back and correcting an earlier error. It’s also possible to make the program crash if one has many short annotations and navigates too fast between them. Stopping and starting with the Tab button also seems to be a little unreliable, and one of my files got stuck in loop mode. The automatic playback doesn’t always work either.

I would like it if the waveform window showed a little more than just the annotation, say half a second on each side. As it is, it is quite difficult to select the start of the annotation to play from the beginning (though I see now that one can use Shift-Tab to play from the beginning). I would also find it useful to have some idea of how far through the file I am. One can get a rough sense from the scroll bar on the right-hand side, but the indicator in the main annotation mode, which shows the density of annotation points and where one is in the file, is something I find very useful. I think it might also be useful if, when one stopped and started the annotation with tab, the playback started from 0.25 seconds (or something similar) before the current cursor position. Unless I’m very quick with the ‘stop’ button I often miss the first part of the next word. This would also be very useful for segments created with the ‘segmentation’ mode, which have no buffer on either side. This makes them hard to use as subtitles for movies, because the annotations don’t stay on the screen long enough.

Speaking of the silence recogniser, this is an area where tier copying is problematic. If I create annotations using this mode, I can’t then copy those annotations to a tier with a translation type as dependent. I have to create a new tier, or reassociate the translation tier with the channel1 tier, and then rename it. This seems to defeat the purpose of using a template, since I basically have to set up the tiers from scratch for each file. [While I’m talking about help and the audio recognizer, why does it appear under ‘screen display’ in the help files? That’s not very intuitive. Also, there’s something wrong with the search function in the built-in help; it doesn’t find ‘recognizer’ or ‘silence’.] There’s an ‘add tier’ button but it doesn’t seem to do anything.

Finally, I think I would like to see keyboard shortcuts for navigating between the different modes. Currently, there are buttons at the top of the screen in annotation mode for navigating between ‘grid’, ‘text’, ‘subtitles’, ‘audio recogniser’, ‘lexicon’, ‘metadata’ and ‘controls’. Now, I don’t need to run the recogniser more than once per file (why isn’t that a ‘mode’?).

So, all in all, I’m impressed with the transcription mode. I wish the instructions about tier setups had been a little clearer 5 years ago when I set up my templates, but it’s too late to change that now. There are also some issues with tiers which make the program not as easy to use as it might be. Elan is still very much the best transcription program for fieldwork, and it could be even better with some more attention paid to usability.

Elan and Sendpraat redux

I had an earlier post on sendpraat and Praat with Elan. Here’s an update from Han Sloetjes (via Ruth Singer).

It’s possible that there’s an issue with the file permissions, and this needs to be fixed in the terminal window:

  • open a Terminal window (Applications/Utilities/Terminal)
  • type “cd ” (without the quotes) and drag the folder where sendpraat is into the Terminal window. This copies the path to the folder. With the Terminal window active, press enter.
  • You should now be “in” the right folder and you can type “ls -al” followed by enter.
  • The output should be something like this:

dhcp68:sendpraat han$ ls -al
-r-xr-xr-x@  1 han  admin  17400 27 mrt  2008 sendpraat
-r-xr-xrwx@  1 han  staff  17400 11 mrt  2008 sendpraat_intel
-rw-r--r--@  1 han  staff  21900 13 jan  2006 sendpraat_ppc

  • If there are no “x” (executable) characters in the first part of the sendpraat line, you can run the following command:

chmod 557 sendpraat

  • followed by enter, then try ls -al again to see the changes. If there is an “x”, try again from within ELAN.