[This is not a fieldwork post; it’s a real-time one]
I’ve just succeeded in getting the Yan-nhaŋu dictionary from Toolbox into TshwaneLex. Why did I spend about 10 hours on this, you ask? For two reasons. One is that the Toolbox version clearly still needed a huge amount of work before it would be publishable. The other is that there are structural issues in the dictionary entries that can’t really be solved in Toolbox. For example, say you want a headword (a verb) with its senses, examples, and then the paradigm information at the bottom. There’s no way to associate the paradigm information with the entry as a whole rather than with an individual sense. I also wanted something that would rigorously enforce formatting and information order (and ideally one where that could be changed easily, since there will be several versions of this dictionary). I’m still interlinearising, so I’ll be exporting the headwords and one-word translations back out periodically.
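To make the structural point concrete, here is a minimal Python sketch of the entry shape described above: senses carry their own examples, while the paradigm hangs off the entry as a whole. The class and field names (`Entry`, `Sense`, `paradigm`) and the placeholder values are invented for illustration; this is not TshwaneLex’s actual data model.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Sense:
    """One sense of a headword, with the examples that illustrate it."""
    gloss: str
    examples: List[str] = field(default_factory=list)


@dataclass
class Entry:
    """A dictionary entry: a headword, its senses, and entry-level information."""
    headword: str
    part_of_speech: str
    senses: List[Sense] = field(default_factory=list)
    # The paradigm belongs to the entry as a whole, not to any single sense.
    paradigm: Dict[str, str] = field(default_factory=dict)


# Placeholder data only; these are not real dictionary forms.
entry = Entry(
    headword="HEADWORD",
    part_of_speech="v",
    senses=[Sense("first gloss", ["example 1"]), Sense("second gloss")],
    paradigm={"past": "PAST-FORM", "future": "FUTURE-FORM"},
)
```

In a flat, marker-based file, a field that follows a sense tends to be read as belonging to that sense, which is why this entry-level slot is hard to express there.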
Getting the data into TshwaneLex wasn’t completely straightforward, so here’s what I did:
- Cleaned up the Toolbox file.
  - Removed all line breaks within fields (necessary for CSV importing).
  - Removed empty fields.
- Exported the file about 15 times, each time with the \lx field and another field. This was necessary because the fields were out of order.
- For fields where the information is associated with a subpart of the entry (e.g. semantic fields, definitions), the headword and the sense/gloss (or other related field) were exported together with the new field. Examples:
  - \sd was exported with the database sorted by \ge, and \ge was moved to the top.
  - \rf, \xn and \xe were exported sorted by \xv.
- These were then cleaned up in Word:
  - Backslash codes were removed.
  - Items with missing information were removed.
  - The fields were separated by $ (not a comma, since commas appear in fields like \ee).
- The whole thing was then very carefully imported into TshwaneLex, in a specific order:
  - word list (just the \lx fields)
  - items that are daughters of the \lx field (e.g. \ps, grammar tags, some cross-references, sources, etymology)
  - \de and other sense information that relies on identity with \ge
  - cultural information (\ee and other things relying on \ee)
- I also had to change the DTD and do a bit of manual editing.
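The cleanup and export steps above could also be scripted rather than done by hand in Toolbox and Word. What follows is a minimal Python sketch under stated assumptions, not a record of what was actually done for this import. It assumes a standard-format file in which every field starts with a backslash marker at the beginning of a line and every record starts with \lx. The function names (`parse_records`, `export_pairs`) and the sample data are invented for illustration.

```python
import re

# Invented sample data in standard-format style; not real dictionary entries.
SAMPLE = r"""\_sh v3.0  400  MDF
\lx word1
\ps n
\ge
\de a long definition that
wraps onto a second line

\lx word2
\ps v
\ge speak
"""

FIELD = re.compile(r"\\(\w+)\s*(.*)")


def parse_records(text):
    """Return a list of records, each a list of (marker, value) pairs.

    Line breaks inside a field are folded into one line, and fields
    with no content are dropped.
    """
    records = []
    current = None
    for line in text.splitlines():
        match = FIELD.match(line)
        if match:
            marker, value = match.group(1), match.group(2).strip()
            if marker == "lx":
                # Every \lx starts a new record.
                current = []
                records.append(current)
            if current is not None:  # ignore header lines before the first \lx
                current.append([marker, value])
        elif current and line.strip():
            # A line with no marker continues the previous field.
            current[-1][1] = (current[-1][1] + " " + line.strip()).strip()
    # Remove empty fields.
    return [[(m, v) for m, v in rec if v] for rec in records]


def export_pairs(records, marker, sep="$"):
    """Pair each headword with every value of one other field."""
    rows = []
    for rec in records:
        headwords = [v for m, v in rec if m == "lx"]
        if not headwords:
            continue  # skip items with missing information
        for m, v in rec:
            if m == marker:
                rows.append(f"{headwords[0]}{sep}{v}")
    return rows


records = parse_records(SAMPLE)
for row in export_pairs(records, "ps"):
    print(row)
```

Each returned row is one line of a `$`-separated file pairing a headword with one other field, so calling `export_pairs` once per marker mirrors the repeated exports described above.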
What’s not there now?
- Have to reproof the cross-referencing (but I had to do that anyway)
- The sources are associated with the entry, not the senses (but that was incorrect in the Toolbox entry too, and would have needed to be changed by hand there too)
- Have to add back the second examples (but there weren’t more than a few fields with two examples at this stage, and they are findable through the ‘corpus’ function, and I have a huge number of examples to add anyway)
- I didn’t add the random morphology in, but most of that was just cluttering up the entries.
Note: Toolbox XML doesn’t work for importing. There’s some bug in the way it’s exported (or in the way that TshwaneLex interprets it), so that TshwaneLex can never resolve which item is the lemma and which is a subentry of the lemma (e.g. \lx field contents vs \ps contents). About 4 of the 10 hours spent on importing went into investigating this.