One of the things I wished for while working on the Bardi grammar was a time-aligned corpus. So now that I’ve largely finished the grammar (apart from covering my now-submitted manuscript with scribbles about typos, new things to include, and the like) I have time to make a better job of it. It’s also part of working out what to prioritise for the next few weeks. Here’s my workflow (put here so I can find it again, to keep me honest, and in case it’s useful for someone else):
- I have fairly decent metadata (audition sheets) for most of my recordings; I slacked off a bit on the last trip but I also recorded fewer stories, so I’ve got away with it for the moment. First job was to go through the metadata and make sure that each story has its own page (with its own reference key) in my text database.
- My student Laura chopped up and extracted a lot of the stories from the older files. I still need to do the rest, but as I go through checking that each file is labelled with the right story, I’m also ticking off that each story has a sound file. (Laura did this already, but she used the OSX-only file colour coding and I’m working on a PC.)
- About half the files have transcripts already (from my earlier trips, where I transcribed on paper, checked the transcripts with speakers using a tape player, and then typed them up when I got back to computer-land). For those, I’m exporting the transcript to a new file, adding an Elan header, and importing it into Elan. [Note that for some reason I cannot import ‘Toolbox’ files, only ‘Shoebox’ files; it probably has something to do with the character encoding. If importing a Toolbox file gives you an undefined error in Elan, try the Shoebox setting instead.] These then get time-aligned in ‘bulldozer’ mode and checked at the same time (and annotated/tagged if there are things I want to find again).
- Recordings without transcripts get a new Elan file, with segments created ready for transcription mode.
- Once all that’s done, the files will be exported back into Toolbox with the time-alignment codes.
- When an individual story is done, it should be run through CuPeD for the web archive of stories. I still need to do some work to get that running.
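
For anyone who wants to do the "does every story have a sound file?" check without relying on file colour coding, here is a minimal Python sketch. It assumes (purely for illustration) that each story's files are named after its reference key, e.g. `KEY.wav` for the audio and `KEY.eaf` for the Elan file, and that they all sit in one folder. The function names `audit_stories` and `missing` are made up for this example; adjust the naming convention to whatever your own metadata uses.

```python
from pathlib import Path


def audit_stories(keys, media_dir, audio_ext=".wav", elan_ext=".eaf"):
    """Report, for each story key, whether an audio file and an Elan file exist.

    Assumes files are named <key><extension> inside media_dir
    (a hypothetical convention, not a requirement of Elan or Toolbox).
    """
    media_dir = Path(media_dir)
    report = {}
    for key in keys:
        report[key] = {
            "audio": (media_dir / f"{key}{audio_ext}").is_file(),
            "elan": (media_dir / f"{key}{elan_ext}").is_file(),
        }
    return report


def missing(report, kind):
    """Return the sorted story keys that lack the given kind of file."""
    return sorted(key for key, status in report.items() if not status[kind])
```

Calling `missing(audit_stories(keys, "stories"), "audio")` then gives the list of stories that still need a sound file extracted, and the same call with `"elan"` gives the ones not yet set up for time-alignment.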
The final stage down the track is to interlinearize (or at least tag by part of speech) some of the corpus. I doubt I’ll be able to do this in the near future, but if I can figure out a way to do it semi-automatically it would be a nice resource.
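
One simple way semi-automatic tagging could work is plain dictionary lookup: tag every word that is already in a word list and flag the rest for checking by hand. The sketch below is only that, a sketch. It assumes the text lines use a `\tx` marker and that tags go on a `\ps` line (common Toolbox defaults, but the markers in any given project may differ), and the lexicon entries are placeholder strings, not real Bardi forms. It also tags whole words only and does no morphological parsing.

```python
def tag_words(words, lexicon, unknown="?"):
    """Pair each word with its tag from the lexicon, or `unknown` if absent.

    Lookup ignores case and surrounding punctuation.
    """
    tagged = []
    for word in words:
        form = word.strip(".,;:!?\"'").lower()
        tagged.append((word, lexicon.get(form, unknown)))
    return tagged


def tag_toolbox_record(lines, lexicon, text_marker="\\tx", pos_marker="\\ps"):
    """Copy the record's lines, adding a tag line after each text line.

    The marker names are assumptions; pass your own if they differ.
    """
    out = []
    for line in lines:
        out.append(line)
        if line.startswith(text_marker + " "):
            words = line[len(text_marker):].split()
            tags = [tag for _, tag in tag_words(words, lexicon)]
            out.append(pos_marker + " " + " ".join(tags))
    return out


# Placeholder lexicon: invented forms mapped to part-of-speech labels.
lexicon = {"aaa": "n", "bbb": "v"}
record = ["\\ref S01.001", "\\tx Aaa bbb ccc."]
for line in tag_toolbox_record(record, lexicon):
    print(line)
```

Words that come out tagged `?` are the ones to check manually, and each one checked can be added to the lexicon so the next pass catches more.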