Standard Average Australian

Posted on January 1, 2018 | 2 comments

My slides for my recent Association for Linguistic Typology talk on “Standard Average Australian” are now available on Zenodo. The slides are self-explanatory, I think, and the Zenodo page has the long abstract that I submitted to ALT for conference review. In brief, the talk is about the (largely unreferenced) claims that many Australianists (including me, I should add) have made about the languages of the country.

I am currently writing up the results for submission to Linguistic Typology. Thanks very much to the ALT conference participants, particularly the Australianists I talked to about this.


HTK error list

Posted on December 1, 2017 | Leave a comment

http://www.ling.ohio-state.edu/~bromberg/htk_problems.html used to have a really useful list of common HTK errors, what they mean, and how to solve them. It was maintained by then graduate student Ilana Heintz. The site disappeared about a year ago, but the Internet Archive had a copy (I haven’t been able to find a live copy elsewhere). I’m posting it again here in case it’s useful to others, with ongoing thanks to Dr. Heintz for her work in collating all this and making others’ lives a *lot* easier.

UNDERSTANDING HTK ERROR MESSAGES

Various problems & solutions I’ve come across in using HTK for building a WSJ recognizer and for my dissertation work in Language Modeling. If you’re here to find answers for your own project, consider posting your problems & solutions on your own website, for others to learn from, too.

PROBLEM	SOLUTION
HLEd -d prondict -i monophone.mlf mkphones0.led words.mlf Does nothing, only #!MLF!# is returned in the output.	There need to be double quotes around the lab filename in the words.mlf file: “/xxx.LAB” instead of ‘/xxx.lab’
HDMan -l hdman.log -w lists/all.wordlist lists/all.words.monophones.dict lists/cmudict.sort ERROR [+1452] ReadDictProns: word A out of order in dict lists/cmudict.sort FATAL ERROR – Terminating program HDMan	Unix sort doesn’t seem to match the sort HTK is looking for. Python’s sort function seems to work. Numbers are sorted with ‘.’ before 0, shorter before longer (1 < 1.0 < 10 < 100)
HLEd -l ‘*’ -d lists/allwords.prons.dict -i lists/all.phonemlf src/mkphones0.led lists/all.wordmlf ERROR [+5013] ReadString: String too long FATAL ERROR – Terminating program HLEd	Make changes to the pronunciation dictionary: replace all multiple spaces with a single space; replace all tabs with a single space; put a ‘\’ before every double quote (“); put a ‘\’ before any dictionary entry beginning with a single quote (‘).
HLEd -l ‘*’ -d lists/allwords.prons.dict.notabnospace -i lists/all.phonemlf src/mkphones0.led lists/all.wordmlf ERROR [+1232] NumParts: Cannot find word ~ in dictionary FATAL ERROR – Terminating program HLEd	Add that word to the dictionary; re-sort if necessary.
ERROR [+1232] NumParts: Cannot find word MR. STEINBERG in dictionary FATAL ERROR – Terminating program HLEd	In the MLF file, the line “MR.” ended with a slash, remove the slash from the MLF file.
HLEd -l ‘*’ -d prondict -i train.monophone.mlf mkphones0.led train.rem.mlf ERROR [+6550] LoadHTKList: Label Name Expected FATAL ERROR – Terminating program HLEd	For all numbers in train.rem.mlf, precede them with \ so they don’t look like a time.
HLEd -d train.prondict -i train.monophone.mlf mkphones0.led tdt4.arabicBN.mlf ERROR [+1232] NumParts: Cannot find word #(tdAxl in dictionary FATAL ERROR – Terminating program HLEd	some of these words ended in \) in the mlf, which was screwing with how it appears in the dictionary. I took out the \) in the mlf, now have to make sure everything has its correct entry in the prondict.
HLEd -d train.prondict -i train.monophone.mlf mkphones0.led tdt4.arabicBN.mlf ERROR [+6550] LoadHTKLabels: Junk at end of HTK transcription FATAL ERROR – Terminating program HLEd	Add -T 1 to the command line. Where it stops, look in the .mlf file for that transcription. There may be a blank line or something kooky in it. This will help you find a lot of the errors that HLEd comes up with.
HCopy -C configall -S wav2mfcc.scp ERROR [+6270] OpenParmChannel: Cannot read parameterised WAV data ERROR [+6313] OpenAsChannel: OpenParmChannel failed ERROR [+6316] OpenBuffer: OpenAsChannel failed ERROR [+1050] OpenParmFile: Config parameters invalid FATAL ERROR – Terminating program HCopy	moved the HCopy configurations out of configall and into their own configuration file without the HCopy: prefixes
HCopy -C confighcopy -S wav2mfcc.arabicBN.scp -T 1 data2/20000610_0330_0430_voa_arb_spl0.wav -> data2/20000610_0330_0430_voa_arb_spl0.mfcc ERROR [+6251] Input file is not in RIFF format ERROR [+6213] OpenWaveInput: Get[format]HeaderInfo failed ERROR [+6313] OpenAsChannel: OpenWaveInput failed ERROR [+6316] OpenBuffer: OpenAsChannel failed ERROR [+1050] OpenParmFile: Config parameters invalid FATAL ERROR – Terminating program HCopy	Seems to work if I put a single file on the command line. I couldn’t figure out the problem, but it worked when I used a different computer. Maybe it’s a 64-bit vs 32-bit problem?
HCompV -C src/ConfigHVite -f 0.01 -v 0.01 -m -S lists/train.plp.list -M hmm0 proto/hmm0/prototype_base ERROR [+7032] FreezeOptions: vecSize not set ERROR [+5105] AllocBlock: Cannot allocate block data of 4294967288 bytes FATAL ERROR – Terminating program HCompV	Was using the wrong hmm0/prototype; make sure it has the appropriate lines at the top (how the MFCCs were defined, E_Z_A_D etc., means of zero, variances of one).
HCompV -C src/ConfigHVite -f 0.01 -v 0.01 -m -S lists/train.plp.list -M hmm0 proto/hmm0/prototype ERROR [+7031] GetTransMat: Bad Trans Mat Sum in Row 3 HMM Def Error: GetTransMat failed at line 40/col 14/char 1028 in proto/hmm0/prototype ERROR [+7050] HMError: ERROR [+7032] LoadHMMSet: GetHMMDef failed ERROR [+2028] Initialise: LoadHMMSet failed FATAL ERROR – Terminating program HCompV	In the prototype file, at the matrix, the copy and paste had split up the lines, so the rows did not add up to one. Make sure each row fits on a single line.
HCompV -C configall -T 1 -A -D -m -M hmm0 -f 0.01 -S train_mfcc.list hmm0/prototype ERROR [+5050] ReadConfigFile: = expected line 1/col 8/char 7 in configall ERROR [+5020] InitShell: ReadConfigFile failed on file configall ERROR [+2000] HCompV: InitShell failed FATAL ERROR – Terminating program HCompV	If the first column of the config file lists the program name (HVite, HCopy, etc), make sure there is a colon after the name. HCopy: TARGETKIND=MFCC_0_D_A Also make sure any ‘#’ for comments come at the beginning of the line, not the second column.
HCompV -c ConfigHVite -T 1 -A -D -m -M hmm0 -f 0.01 -S train_mfcc.list hmm0/prototype No HTK Configuration Parameters Set HCompV: Computing side based cepstral mean ….. ERROR [+2039] HCompV: AccGenUtt: speaker pattern matching failure on file: hmm0/prototype	The -c needs to be -C, or else the config file isn’t read.
HCompV -C ConfigHVite -T 1 -A -D -m -M hmm0 -f 0.01 -S train_mfcc.list hmm0/prototype ERROR [+2050] CheckData: Parameterisation in ./20001001_10.mfcc is incompatible with hmm hmm0/prototype	In hmm0/prototype, change USER to MFCC_0_D_A (when HCopy is run with MFCC_0 as the TARGETKIND).
HCompV -C ConfigHVite -T 1 -A -D -m -M hmm0 -f 0.01 -S train_mfcc.list hmm0/prototype ERROR [+2050] CheckData: Vector size in /data/data3/bromberg/fisher/segmented/fla_0069_122.mfcc[39] is incompatible with hmm hmm0/proto[13]	In the first line of hmm0/proto, which you need to create by hand in order to run HCompV, make sure the vecSize is the same as the size of the MFCCs. Here it’s saying that the MFCC has 39 dimensions but the proto only calls for 13. Here is a sample script for making the proto file.
HCompV -A -T 1 -S trainsets/training-extfiles0 -l lineObservations -I labels.mlf -o lineObservations -m -M models/hmm0.0 hmmdefs/version1-hmm-top-23vec Calculating Fixed Variance HMM Prototype: hmmdefs/version1-hmm-top-23vec Segment Label: lineObservations Num Streams : 1 UpdatingMeans: Yes Target Direct: models/hmm0.0 * stack smashing detected *: HCompV terminated	HTK is a 32-bit program. Install GCC 3.4 to run it on a 64-bit machine; otherwise some parts work and others hit a stack overflow.
HERest -C src/ConfigHVite -I lists/train.phonemlf -t 250.0 150.0 1000.0 -S train.mfcc.list -H hmm0/macros -H hmm0/hmmdefs -M hmm1 lists/monophones1 ERROR [-7324] StepBack: File … bad data or over pruning	Possible problems include corrupt mfcc, non-matching or non-existent labels. In this case, I had to re-calculate the mean & variance for the prototype hmm using only 1/2 the data, and the problem went away. If every file is considered bad data, you may have derived the features wrong. Go back to HCopy and check the parameters (config file).
HERest -C src/ConfigHVite -I lists/train.phonemlf -t 250.0 150.0 1000.0 -S train.mfcc.list -H hmm0/macros -H hmm0/hmmdefs -M hmm1 lists/monophones1 Saving hmm’s to dir hmm1 ERROR [+7031] PutTransMat: Row 4 of transition mat sum = 1.064684 FATAL ERROR – Terminating program HERest	Too much data. Use the -p option, splitting the input and processing over several machines, then doing a separate HERest pass with -p 0 to accumulate the accumulators. Or, as above, use a smaller portion of the data. Also, make sure that the file durations are spread evenly across lists. Don’t put all the long files together, mix them up with short ones.
HERest -C src/ConfigHVite -I lists/train.phonemlf -t 250.0 150.0 1000.0 -S lists/train.plp.list -H hmm0/macros -H hmm0/hmmdefs -M hmm1 lists/monophones1 ERROR [+5010] InitSource: Cannot open source file hmm0/macros ERROR [+7010] LoadAllMacros: Can’t open file ERROR [+5010] InitSource: Cannot open source file hmm0/hmmdefs ERROR [+7010] LoadAllMacros: Can’t open file ERROR [+7050] LoadHMMSet: Macro name expected ERROR [+2321] Initialise: LoadHMMSet failed FATAL ERROR – Terminating program HERest	Need to make ‘macros’ file in hmm0 directory. Copy first few lines of the prototype into macros, then append to it the vFloors file.
HERest -C src/ConfigHVite -I lists/train.phonemlf -t 250.0 150.0 1000.0 -S lists/train.plp.list -H hmm0/macros -H hmm0/hmmdefs -M hmm1 lists/monophones1 ERROR [+5010] InitSource: Cannot open source file hmm0/hmmdefs ERROR [+7010] LoadAllMacros: Can’t open file ERROR [+7050] LoadHMMSet: Macro name expected ERROR [+2321] Initialise: LoadHMMSet failed FATAL ERROR – Terminating program HERest	Need to manually create hmmdefs file. From htkbook: “…hmmdefs containing a copy for each of the required monophone HMMs is constructed by manually copying the prototype and relabeling it for each required monophone (including sil).” Use the build_hmmdefs.py script. Add another copy of the hmm at the bottom with the label ‘sil’.
HERest -C src/ConfigHVite -I lists/all.phonemlf -t 250.0 150.0 1000.0 -S lists/train.plp.list -H hmm0/macros -H hmm0/hmmdefs -M hmm1 lists/monophones1 Pruning-On[250.0 150.0 1000.0] ERROR [+6510] LOpen: Unable to open label file /scratch/ilana/wsj/data/WSJ0/SI_TR_S/01G/01GC020X.lab FATAL ERROR – Terminating program HERest	The label file names in the all.phonemlf file were not in all caps. Changed the script that made the word-mlf file to have the filenames in all caps, then HLEd does the phone-mlf correctly.
HERest -A -C configall -p 3 -I train.monophone.mlf -S train.list3 -t 250.0 150.0 1000.0 -H hmm0/hmmdefs -M hmm1 monophones ERROR [+6510] LOpen: Unable to open label file /data/data3/fisher/segmented/fla_0069_122.lab FATAL ERROR – Terminating program HERest	In the mlf file, the filenames (utterance names) did not begin with /, so they couldn’t be matched to the filenames in train.list3. Make sure filenames in the mlf begin with / and are wrapped in quotation marks.
HERest -C src/ConfigHVite -I lists/all.phonemlf -t 250.0 150.0 1000.0 -S lists/train.plp.list -H hmm0/macros -H hmm0/hmmdefs -M hmm1 lists/monophones1 Pruning-On[250.0 150.0 1000.0] WARNING [-7325] LoadUtterance: No labels in file /scratch/ilana/wsj/data/WSJ0/SI_TR_S/01G/01GO031F.lab in HERest Segmentation fault	Rework prompts2mlf_word.py to not let a file begin with ‘.’; redo word and phone mlfs.
HERest -C src/ConfigHVite -I lists/all.phonemlf -t 250.0 150.0 1000.0 -S lists/train.plp.list -H hmm0/macros -H hmm0/hmmdefs -M hmm1 lists/monophones1 Pruning-On[250.0 150.0 1000.0] ERROR [+7011] SaveHMMSet: Cannot create MMF file hmm1/macros	mkdir hmm1
HERest -C ConfigHVite -I 20001001_1.monophone.mlf -t 250.0 150.0 1000.0 -S train_plp.list -H hmm0/macros -H hmm0/hmmdefs -M hmm1 monophones HMM Def Error: GetToken: Symbol expected at line 1/col 4/char 3 in hmm0/macros ERROR [+7050] HMError: HMM Def Error: GetOptions: GetToken failed at line 1/col 5/char 4 in hmm0/macros ERROR [+7050] HMError: HMM Def Error: LoadAllMacros: GetOptions Failed at line 1/col 0/char -1 in hmm0/macros ERROR [+7050] HMError: HMM Def Error: LoadAllMacros: Macro sym expected at line 1/col 0/char -1 in hmm0/hmmdefs ERROR [+7050] HMError: ERROR [+7050] LoadHMMSet: Macro name expected ERROR [+2321] Initialise: LoadHMMSet failed FATAL ERROR – Terminating program HERest	hmmdefs file is screwy, the line ~h ‘aa’ needs to come before the BEGINHMM line.
HERest -C ConfigHVite -I 20001001_1.monophone.mlf -t 250.0 150.0 1000.0 -S train_plp.list -H hmm0/macros -H hmm0/hmmdefs -M hmm1 monophones Pruning-On[250.0 150.0 1000.0] ERROR [+6510] LOpen: Unable to open label file 20001001_1.plp.lab FATAL ERROR – Terminating program HERest	The filenames have to be matching within the mlf files and the individual names of the pfiles. xxx.lab and xxx.pf, no variations.
HERest -A -C configall -p 1 -I train.monophone.mlf -S train.list1 -t 250.0 150.0 1000.0 -H hmm0/hmmdefs -M hmm1 monophones ERROR [+5105] AllocBlock: Cannot allocate block data of 5000000 bytes FATAL ERROR – Terminating program HERest	One of the training files is too large for the system to process. Rerun the HERest command with -T 1, see what file it fails on, remove it from train.list1, and try again. (Alternatively split up that file and its transcript and replace the original file with its splits in the training list and the mlf.)
HHEd -H hmm4/macros -H hmm4/hmmdefs -M hmm5 sil.hed lists/monophones1 ERROR [+7030] GetHMMDef: Trans Mat Dimensions not 3 x 3 HMM Def Error: LoadAllMacros: GetHMMDef failed at char 188656 in hmm4/hmmdefs ERROR [+7050] HMError: ERROR [+7050] LoadHMMSet: Macro name expected ERROR [+2628] Initialise: LoadHMMSet failed FATAL ERROR – Terminating program HHEd	Add sp to monophones 1; To make hmmdefs4, use this script: sil2sp.pl, which I got from this htk tutorial website.
HHEd -H hmm4/macros -H hmm4/hmmdefs -M hmm5 src/sil.hed lists/monophones1 WARNING [-2631] EditTransMat: No trans mats to edit! in HHEd	Was using wrong monophones list; must use the one updated with ‘sp’
HERest -A -C configall -I train.monophone.sp.mlf -S train.list -t 250.0 150.0 1000.0 -H hmm5/hmmdefs -M hmm5 monophones.sp Pruning-On[250.0 150.0 1000.0] ERROR [+7332] Create Insts: Cannot have Tee models at start or end of transcription FATAL ERROR – Terminating Program HERest	There is an ‘sp’ short pause as the last symbol before ‘.’ in the mlf. In a previous step there was an HLEd command with a set of commands in a file like ‘mkphones1.led’. Make sure ‘IS sil sil’ is in that .led file, which puts the ‘sil’ at beginning and end of each utterance.
HERest -C src/ConfigHVite -I lists/all.sp.phonemlf -t 250.0 150.0 1000.0 -S lists/train.plp.list -H hmm5/macros -H hmm5/hmmdefs -M hmm6 lists/monophones1.sp Pruning-On[250.0 150.0 1000.0] ERROR [+7332] CreateInsts: Cannot have Tee models at start or end of transcription FATAL ERROR – Terminating program HERest	Recreate the phone mlf to have ‘sil’ before and after each utterance; use “IS sil sil” in the .led file for HLEd. If it still doesn’t work, find by hand the utterances that end in ‘sp’ and add ‘sil’ before the period, or use the Python command line to fix it.
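
The pronunciation-dictionary fixes described above for ERROR [+5013] (whitespace and quote escaping) and ERROR [+1452] (sort order) could be scripted roughly as follows. This is only a minimal sketch, not part of the original list: the function names `clean_dict_line` and `sort_dict_lines` are hypothetical, the code assumes plain ASCII quote characters and one dictionary entry per line, and it assumes that Python's default string ordering on the head word is acceptable to HDMan (the table only says that it "seems to work").

```python
import re


def clean_dict_line(line):
    """Normalize one dictionary line (hypothetical helper)."""
    # Replace tabs and runs of spaces with a single space.
    line = re.sub(r"[ \t]+", " ", line.strip())
    # Put a backslash before every double quote that is not already escaped.
    line = re.sub(r'(?<!\\)"', r'\\"', line)
    # Put a backslash before an entry that begins with a single quote.
    if line.startswith("'"):
        line = "\\" + line
    return line


def sort_dict_lines(lines):
    """Clean all non-empty lines, then sort them by head word.

    Uses Python's default string comparison, which puts '.' before '0'
    and shorter strings before longer ones (1 < 1.0 < 10 < 100).
    """
    cleaned = [clean_dict_line(l) for l in lines if l.strip()]
    return sorted(cleaned, key=lambda l: l.split(" ", 1)[0])
```

To use it, read the dictionary into a list of lines, pass the list to `sort_dict_lines`, and write the result to a new file instead of overwriting the original, so the two versions can be compared if HDMan or HLEd still complains.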

HVite -l ‘*’ -o SWT -b SILENCE -C ConfigHVite -a -H hmm7/macros -H hmm7/hmmdefs -i 20001001_1.realigned.monophone.mlf -m -t 250.0 -y lab -I 20001001_1.mlf -S train_plp.list prondict.sort.sp monophones ERROR [+6510] LOpen: Unable to open label file 20001001_1.lab FATAL ERROR – Terminating program HVite	name of .lab file in 20001001_1.mlf (word mlf) was wrong
HVite -l ‘*’ -o SWT -b SILENCE -C ConfigHVite -a -H hmm7/macros -H hmm7/hmmdefs -i 20001001_1.realigned.monophone.mlf -m -t 250.0 -y lab -I 20001001_1.mlf -S train_plp.list prondict.sort.sp monophones nothing appears in new transcription	Trying to realign the transcription using the current hmms. This may be the fault of me not training with enough data. Try just copying the original monophone transcript to realigned.monophone.mlf and continue
HERest -C configall -I 20001001_1-10.realigned.monophone.mlf -t 250.0 150.0 1000.0 -S train_mfcc.list -H hmm7/macros -H hmm7/hmmdefs -M hmm8 monophones Pruning-On[250.0 150.0 1000.0] ERROR [+6510] LOpen: Unable to open label file 20001001_5.lab FATAL ERROR – Terminating program HERest	The HVite process did not create a label file for every utterance; some had no tokens surviving, including file 5. Need to go back to HVite and change some parameters to make sure it can get through all utterances. For instance, change the beam searching parameters with the -t flag (htkbook pg 301)
HERest -C configall -I train.realigned.monophone.mlf -t 250.0 150.0 1000.0 -p 0 -H hmm7/macros -H hmm7/hmmdefs -M hmm8 hmm8/HER1.acc hmm8/HER2.acc hmm8/HER3.acc hmm8/HER4.acc hmm8/HER5.acc hmm8/HER6.acc ERROR [+7060] InitHMMSet: Expected newline after 2’th HMM ERROR [+2321] Initialise: MakeHMMSet failed FATAL ERROR – Terminating program HERest	forgot to put ‘monophones’ on the command line before the list of .acc files
HVite -T 1 -C configall -H hmm9/macros -H hmm9/hmmdefs -S train_mfcc.list -l ‘*’ -i recog_mono2/monophones.mlf -o S -w wdnet -p 0.0 -s 5.0 prondict.sort.final monophones Read 37 physical / 37 logical HMMs WARNING [-8520] CreateSEIndex: No transitions to state 5 in HVite WARNING [-8520] CreateSEIndex: No transitions to state 5 in HVite Read lattice with 859 nodes / 1713 arcs Created network with 6391 nodes / 7245 links	I’m doing recognition at the hmm9 stage as part of debugging. There are a few hmms in hmm8/hmmdefs and hmm9/hmmdefs that have no transition from state 4 to state 5, including ‘O’ and ‘silst’. This is a problem. It started occurring after realignment. So for the two iterations of HERest after realignment, use -u mv.
HHEd -B -H hmm9/macros -H hmm9/hmmdefs -M hmm10 src/mktri.hed lists/monophones1.sp ERROR [+2635] FindBaseModel: Cannot Find HMM sl in Current List FATAL ERROR – Terminating program HHEd	Found and removed ‘sl’ from lists/triphones1
HHEd -T 1 -H hmm9/hmmdefs -M hmm10 mktri.hed monophones HHEd 34/34 Models Loaded [5 states max, 1 mixes max] CL triphones Cloning current hmms to produce new set {(*-. Error ) expected ERROR [+7230] EdError: item list parse error FATAL ERROR – Terminating program HHEd	Because ‘.’ is in the monophones list, the first triphone code in mktri.hed is invalid. remove it.
HERest -B -A -C configall -s stats -p 0 -I train.triphone.cw.mlf -t 250.0 150.0 1000.0 -H hmm11/macros -H hmm11/hmmdefs -M hmm12 triphones hmm12/HER1.acc hmm12/HER2.acc Pruning-On[250.0 150.0 1000.0] ERROR [+7191] Infinite WtAcc! (or) ERROR [+7191] Infinite MuAcc! FATAL ERROR – Terminating program HERest	Comes up in the accumulation process when doing a split re-estimation. WtAcc due to a row sum error on the transition matrix (?). Tried: -u mv in the command, which gave me MuAcc instead. Tried: using files less than a minute in length. Tried: splitting the data up into more parallel sections (4). Tried: removing -B from the combining step in HERest, making the resulting hmm text form rather than binary. This worked, but I don’t know why…
HHEd -H hmm12/macros -H hmm12/hmmdefs -M hmm13 tree.hed triphones ERROR [+2662] AssignStructure: cannot find tree for U-r+sil state 5 FATAL ERROR – Terminating program HHEd	I’m using 5 middle states but tree.hed only has TB lines for 3 middle states. Add more TB lines to tree.hed for states 5 & 6
HHEd -B -H hmm12/hmmdefs -M hmm13 tree.hed triphones ERROR [+2662] AssignStructure: cannot find tree for t2-ay+D2 state 2 FATAL ERROR – Terminating program HHEd	One of the phonemes in the triphone listed has no indication of how to cluster it in tree.hed. Remove it from the prondict and start over (with a shortened monophone list), or remove it from fulllist and the HHEd command will run. If you have one like this you probably have a few, look carefully.
HHEd -B -H hmm12/macros -H hmm12/hmmdefs -M hmm13 src/tree.hed lists/triphones1 > log ERROR [+2662] AssignStructure: cannot find tree for ax-sp+d state 2 FATAL ERROR – Terminating program HHEd	Recreate tree.hed using local monophone list, mkclscript from tutorial, then add to that QS part of tree.hed
HHEd -H hmm12/macros -H hmm12/hmmdefs -M hmm13 tree.hed triphones ERROR [+2662] FindProtoModel: no proto for z-sp+A in hSet FATAL ERROR – Terminating program HHEd	In the cross-word triphone models, the sp causes problems, so remove it from the monophone list, from the extra triphones, from tree.hed
HERest -B -C ConfigHVite -I 20001001_1.triphone.mlf -t 250.0 150.0 1000.0 -S train_plp.list -H hmm13/macros -H hmm13/hmmdefs -M hmm14 triphones ERROR [+5010] InitSource: Cannot open source file Q-n+A ERROR [+7010] LoadHMMSet: Can’t find file ERROR [+2321] Initialise: LoadHMMSet failed FATAL ERROR – Terminating program HERest	The state-tying actually caused some states to be tied, meaning they get renamed. This is shown in the tiedlist created in the previous step with HHEd and CO “tiedlist” at the end of tree.hed. In the tiedlist output file, there are two columns in some places; the second column names the new label for the hmm in the first. It means those two are tied. So Q-n+A is tied to another triphone and thereby renamed. REPLACE TRIPHONES WITH TIEDLIST ON THE COMMAND LINE.
HERest -B -C configall -I train.triphone.mlf -t 250.0 150.0 1000.0 -S train.realigned.list -H hmm13/macros -H hmm13/hmmdefs -M hmm14 tiedlist ERROR [+7231] InitSource: Cannot open source file y-l-A FATAL ERROR – Terminating program HERest	There is a triphone that HERest is trying to reestimate that does not appear in the tiedlist. Recreate the fulllist (all possible triphones) and redo the HHEd step for decision tree tying etc.
HERest -T 1 -D -A -C configall -p 1 -I train.triphone.mlf -S train.mfcc.norm.list -t 250.0 150.0 1000.0 -H hmm15/hmmdefs -M hmm16 tiedlist HERest ML Updating: Transitions Means Variances Parallel-Mode[1] System is SHARED 51987 Logical/15201 Physical Models Loaded, VecSize=39 1 MMF input files Pruning-On[250.0 150.0 1000.0] Processing Data: fla_0130_96.mfcc; Label fla_0130_96.lab Utterance prob per frame = -5.125980e+01 Processing Data: fla_0530_22.mfcc; Label fla_0530_22.lab ERROR [+7321] CreateInsts: Unknown label m+H FATAL ERROR – Terminating program HERest	add a line to the mktri.led file that has ‘NB sp’ or ‘NB garbage’ or ‘NB whatever’ for whatever monophone for which you don’t want the biphone context to be made. Go back and remake the triphone transcript, then try the re-estimation again.
HHEd -A -H mix_moreA/hmmdefs -M mix_moreA 10.hedscript tiedlist WARNING [-2637] HeaviestMix: mix 4 in n2-O+sh2 has v.small gConst [-200000045056.000000] in HHEd WARNING [-2637] HeaviestMix: mix 1 in n2-O+sh2 has v.small gConst [-170000023552.000000] in HHEd WARNING [-2637] HeaviestMix: mix 3 in n2-O+sh2 has v.small gConst [-109999996928.000000] in HHEd WARNING [-2637] HeaviestMix: mix 4 in n2-O+sh2 has v.small gConst [-140000002048.000000] in HHEd ERROR [+2697] HeaviestMix: heaviest mix is defunct! FATAL ERROR – Terminating program HHEd	Trying to increase the number of Gaussian mixtures for each hmm at the end of training, incrementing by 2 each time. From htkbook: “Defunct mixture components can be prevented by setting the -w option in HERest so that all mixture weights are floored to some level above MINMIX.”
HVite -H hmm15/macros -H hmm15/hmmdefs -S lists/dt.list -l ‘*’ -i recog/dt.out.mlf -w wdnet -p 0.0 -s 5.0 lists/allwords.prons.dict.final lists/tiedlist ERROR [+8251] ReadLattice: Word worrisome not in dict ERROR [+3210] DoAlignment: ReadLattice failed FATAL ERROR – Terminating program HVite	Made sure vocab, wordnet were all uppercase; dictionary is all uppercase;
HVite -H hmm15/macros -H hmm15/hmmdefs -S lists/dt.list -l ‘*’ -i recog/dt.out.mlf -w wdnet.upper -p 0.0 -s 5.0 lists/allwords.prons.dict.final lists/tiedlist ERROR [+8251] ReadLattice: Word -PAU- not in dict ERROR [+3210] DoAlignment: ReadLattice failed FATAL ERROR – Terminating program HVite	add it to the pronunciation dictionary
HVite -H hmm15/macros -H hmm15/hmmdefs -S lists/dt.list -l ‘*’ -i recog/dt.out.mlf -w wdnet.upper -p 0.0 -s 5.0 lists/allwords.prons.dict.addrecog lists/tiedlist WARNING [-8221] InitPronHolders: Total of 77 duplicate pronunciations removed in HVite ERROR [+8231] GetHCIModel: Cannot find hmm [???-]IY[+???] FATAL ERROR – Terminating program HVite	Change dictionary and wdnet to all lowercase
HVite -T 1 -C src/ConfigHVite -H hmm15/hmmdefs -H hmm15/macros -S lists/dt.list -i recog/dt.out.mlf -o S -w wdnet.lower -p -10.0 -s 15.0 -t 450.0 250.0 40000.0 lists/allwords.rons.dict.addrecog.lower lists/tiedlist > recog.log ERROR [+8250] ReadLattice: Premature end of lattice file before header ERROR [+3210] DoAlignment: ReadLattice failed FATAL ERROR – Terminating program HVite	Go back into wdnet.lower and uppercase the first line and the J,I,W,S,L etc
HVite -T 1 -C src/ConfigHVite -H hmm15/hmmdefs -H hmm15/macros -S lists/dt.list -i recog/dt.out.mlf -o S -w wdnet.lower -p -10.0 -s 15.0 -t 450.0 250.0 40000.0 lists/allwords.rons.dict.addrecog.lower lists/tiedlist > recog.log ERROR [+8231] GetHCIModel: Cannot find hmm [l-]e[+sh] FATAL ERROR – Terminating program HVite	Changed pronunciation of [inhalation] to l ey sh. There is a monophone somewhere in the dictionary, or in the monophone set, that is not represented as an hmm (try “cat hmm0/hmmdefs | grep '~h'” to see what is represented). You may need to either change pronunciations in the dictionary to eliminate barely-used monophones or retrain with all of the monophones intact. If you generated the monophone list from the monophone transcript, it’s possible that some monophones in the prondict were left out, because they never occurred in the first pronunciation of any word. Try regenerating the monophone list from the dictionary using shell scripting instead of HLEd.
HVite -T 1 -C src/ConfigHVite -H hmm15/hmmdefs -H hmm15/macros -S lists/dt.list -i recog/dt.out.mlf -o S -w wdnet.lower -p -10.0 -s 15.0 -t 450.0 250.0 40000.0 lists/allwords.rons.dict.addrecog.lower lists/tiedlist > recog.log ERROR [+6313] OpenParmChannel: cannot read HTK Header in File /u/drspeech/data/WSJ0/SI_DT_05/050/050A0503.nst ERROR [+6313] OpenAsChannel: OpenParmChannel failed ERROR [+6316] OpenBuffer: OpenAsChannel failed ERROR [+3250] ProcessFile: Config parameters invalid FATAL ERROR – Terminating program HVite	Changed dt.list to dt.plp.list
HVite -z lat -l $expname -C ../configall -t 150.0 -A -D -T 1 -w $expname.htk.lm -s 12.0 -p -10.0 -H ../hmmdefs.16 -S ../dev.mfcc0.list1 prondict.norm8.sort.sp ../tiedlist ERROR [+8231] GetHCIModel: Cannot find hmm [u-]n[+???] FATAL ERROR – Terminating program HVite	Haven’t figured this one out. Can’t find a pronunciation with fishy phonemes as mentioned. HDecode has no problem with all of the same inputs except for ARPA-based lm, and I don’t see anything wrong with the htk-lattice-lm. So, I dunno.
HHEd -B -H hmm15/macros -H hmm15/hmmdefs -M hmm16 src/train_mix_inc_2.hed lists/train+cv.triphonemlf ERROR [+7036] CreateHMM: multiple use of logical HMM name sp ERROR [+7060] InitHMMSet: Error in CreateHMM ERROR [+2628] Initialise: MakeHMMSet failed FATAL ERROR – Terminating program HHEd
Reading dictionary from diss/lib/myprondict ERROR [+8050] ReadDict: Probability malformed 2 ERROR [+8013] ReadDict: Dict format error ERROR [+9999] Initialise: ReadDict failed FATAL ERROR – Terminating program HDecode.long	Problems in the pronunciation dictionary: single quotes and double quotes need a backslash; brackets possibly need a backslash. Narrow down the problem by reducing the prondict to only a few lines and gradually adding lines until the error comes up. One of the last pronunciations has a non-existent phoneme (2); change it.
HDecode.long -z lat -l decodeLCA_nonums_wordLM -C configall -t 150.0 -A -D -T 1 -w lev.alltext.word.lm -s 12.0 -p -10.0 -H mix_moreA/hmmdefs -S lev.dev.mfcc.list4 levtrain.prondict.ver3 tiedlist Reading dictionary from levtrain.prondict.ver3 Reading acoustic models… Read 4163 physical / 230643 logical HMMs ERROR [+9999] HLVNet: no model label for phone (uw-gar+gar) FATAL ERROR – Terminating program HDecode.long	I have a ‘gar’bage model that is like sp, should not belong to any triphones. In HDecode, only the phonemes associated with start and/or endnode are allowed to be monophone-only. Go back and add gar triphones to full_list, remake the tiedlist. Might need to add some info for gar to tree.hed. Re-estimate from there forward.
HDecode.long -z lat -l * -C ConfigHVite -t 150 -A -D -T 1 -w 20001001_1.lm -s 12.0 -p -10.0 -H hmm15/hmmdefs -S train_plp.list prondict.sort tiedlist ERROR [+4019] HDecode: beam width expected FATAL ERROR – Terminating program /u/drspeech/opt/htk-3.4/i586-linux/bin/HDecode.long	The value after the -t flag must be a float. Change to 150.0
HDecode.long -z lat -l * -C ConfigHVite -t 150.0 -A -D -T 4 -w 20001001_1.lm -s 12.0 -p -10.0 -H hmm15/hmmdefs -S train_plp.list prondict.sort tiedlist ERROR [+9999] HDecode: cannot find STARTWORD ‘<s>’ FATAL ERROR – Terminating program /u/drspeech/opt/htk-3.4/i586-linux/bin/HDecode.long	Add <s> and </s> to the pronunciation dictionary with a pronunciation of sil.
HDecode.long -z lat -l * -C ConfigHVite -t 150.0 -A -D -T 4 -w 20001001_1.lm -s 12.0 -p -10.0 -H hmm15/hmmdefs -S train_plp.list prondict.sort tiedlist ERROR [+9999] HDecode: cannot find file ‘sp’ FATAL ERROR – Terminating program /u/drspeech/opt/htk-3.4/i586-linux/bin/HDecode.long	add sp sil to the end of the prondict
HDecode.long -z lat -l decodeA -C ../configall -t 150.0 -A -D -T 1 -w mix2.unk.knd.lm -s 12.0 -p -39.0 -H ../hmmdefs.16 -S ../dev.mfcc0.list4 prondict.expand ../tiedlist FATAL ERROR – Terminating program HDecode.long ERROR [+5010] InitSource: Cannot open source file f-uw+x ERROR [+7010] LoadHMMSet: Can’t find file ERROR [+4128] Initialise: LoadHMMSet failed	There is a mismatch between the hmms that are defined in the hmmdefs file and those that are listed in the tiedlist. One or the other needs to change, probably the tiedlist. This may involve going back far enough to re-create the hmms used in the last HHEd command, so to recreate the tiedlist.
HDecode.long -z lat -l * -C ConfigHVite -t 150.0 -A -D -T 4 -w 20001001_1.lm -s 12.0 -p -10.0 -H hmm15/hmmdefs -S train_plp.list prondict.sort tiedlist WARNING [-9999] no token survived to sent end! in HDecode.long Segmentation fault	This is the model I built on a single sound file, so maybe that’s the right answer…
ERROR [+9999] HLVNet: no model label for phone (.-q+r) FATAL ERROR – Terminating program /u/drspeech/opt/htk-3.4/i586-linux/bin/HDecode.long	Remove ‘. .’ from the pronunciation dictionary
HDecode.long -z lat -l decodeA -C configall -t 150.0 -A -D -T 1 -w p.3grams.lm -s 12.0 -p -10.0 -H mix_moreA/hmmdefs.16 -S dev.mfcc0.list3 prondict.norm8.sort tiedlist ERROR [+9999] HLVNet: no model label for phone (x-sil+S) FATAL ERROR – Terminating program HDecode.long	It shouldn’t be looking for a triphone with ‘sil’ in the middle. Search for ‘sil’ in the pronunciation dictionary; the only words it should serve as the pronunciation for are <s> and </s>.
ERROR [+9999] HDecode: Incompatible parm kinds MFCC_0 vs. MFCC_D_A_0 FATAL ERROR – Terminating program /u/drspeech/opt/htk-3.4/i586-linux/bin/HDecode.long	Changed the format of the hmms to mfcc_d_a_0 even though the original files were made into MFCC_0. They’ve gotta be the same. Use MFCC_D_A_0 in the hcopy config file.
Reading dictionary from diss/lib/myprondict Reading acoustic models…Read 26745 physical / 250049 logical HMMs ERROR [+9999] HLVNet: no model label for phone (sil-}+w) FATAL ERROR – Terminating program HDecode.long	There are still some labels in the pronunciation dictionary that do not have defined acoustic models (}). Change those labels, which may have come in through the pronunciation-building script.
ERROR [+8113] ReadARPAngram: failed reading lm prob at char 1283900 in diss/data/language_model/fsms.4grams.64Kvocab.lm	This error can be reproduced by having the wrong number of ngrams present in the lm file as compared to the number defined at the top of the file. Make sure the LM and prondict have the same encoding. Make sure all quotes and double quotes have a backslash. Make sure the lm and prondict contain <s> and </s>.
WARNING [-8100] ReadARPAngram: unseen word ‘إسأ’ in ngram in HDecode.long	Words in lm not present in pronunciation dict. Write a script to find them and add them in, being sure to resort the pronunciation dictionary afterwards.
ReadNGrams: 1827th 2Grams out of order	The bigrams are not in alphabetical order.
HLRescore -n $lm -f -y crec -r 10.0 -t 150.0 -s 20.0 -p -42.0 -C configall -A -D -T 1 -S unconstrained.list $prondict.expand Reading LM from mix1.unk.knd.lm ERROR [+8150] ReadNGrams: 577308th 2Grams out of order FATAL ERROR – Terminating program HLRescore	Nothing seems out of place in the LM which was made and not messed with. LM worked for HDecode.
HLRescore -n $lm -f -y crec -t 150.0 -s 12.0 -p -10.0 -C configall -A -D -T 1 $prondict $lattice WARNING [-9999] word 0 not in LM wordlist in HLRescore HLRescore: HLat.c:415: LatTopSort: Assertion `time+1 == lat->nn’ failed.	Got rid of the second error (LatTopSort) by determinizing, minimizing, and topologically sorting the fsm before converting to pfsg & htklat. Make sure that NULL (or whatever you’ve substituted for NULL in the lattice) exists in both the pronunciation dictionary and the language model.
HLRescore -n $lm -f -y crec -t 150.0 -s 12.0 -p -10.0 -C configall -A -D -T 1 $prondict $lattice ERROR [+8250] ReadLattice: Premature end of lattice file before header ERROR [+4013] HLRescore: can’t read lattice FATAL ERROR – Terminating program HLRescore	In the lattice, change ‘NODES’ to ‘N’ and ‘LINKS’ to ‘L’.
HLRescore -n $lm -f -y crec -t 150.0 -s 12.0 -p -10.0 -C configall -A -D -T 1 $prondict $lattice ERROR [+8251] ReadLattice: Word bEd:bEd not in dict ERROR [+4013] HLRescore: can’t read lattice FATAL ERROR – Terminating program HLRescore	Needed to use transducer=1 to get the pfsg to print correctly with fsm-to-pfsg, but now the format is messing up HLRescore. Either change to transducer=0 in fsm-to-pfsg, or, use sed to change each term:term to term.
HResults -I reference.mlf /dev/null decoded.mlf ERROR [+6550] LoadHTKList: Label Name Expected FATAL ERROR – Terminating program HResults	In the reference.mlf file, there exists either a blank line, or a digit without backslash or quotes, or something else unpalatable to HResults. Use HResults -f to figure out which utterance it’s in (the one _after_ the last one listed), and fix it. For instance, put quotations around a number or backslash a quote, etc.
HResults -I reference.mlf /dev/null hypothesis.mlf ERROR [+6570] Get LabelList: n[1] > numLists[0] FATAL ERROR – Terminating program HResults	Run the command again with -f to show full results. Look in the reference mlf file at the utterance _after_ the last one listed before the error shows up. It’s empty. Put something there or remove it (must be removed from reference).
HResults -I reference.mlf /dev/null hypothesis.mlf ERROR [+6510] LOpen: Unable to open label file NBCTV_MORNING_20070111.lab FATAL ERROR – Terminating program HResults	One of the utterances in the hypothesis mlf does not have a corresponding utterance in the reference mlf. Either it’s missing entirely or the names don’t match; check spelling and capitalization of the filenames in the two mlfs.
HERest -A -D -T 10 -C configall -C hmmadapt6-1/config_adapt -S adapt6.list -I train.triphone.new.mlf -H hmmadapt6-1/hmmdefs.16 -H hmmadapt6-1/glob -K hmmadapt6-2 mllr -u a tiedlist ERROR [+999] Components missing from Base Class list (4630 74080) ERROR [+999] BaseClass check failed FATAL ERROR – Terminating program HERest	Trying to do adaptation. Built regression tree using instructions here. But because my hmm definitions were incremented to 16 mixes, I had to change the last line of the global file to: CLASS 1 {*.state[2-4].mix[1-16]}
HERest -A -D -T 10 -C configall -C hmmadapt6-1/config_adapt -S adapt6.list -I train.triphone.new.mlf -H hmmadapt6-1/hmmdefs.16 -H hmmadapt6-1/glob -K hmmadapt6-2 mllr -u a tiedlist ERROR [+999] Output xform mask *.%%% does not match filename data2/20001220_1530_1600_NTV_ARB/20001220_1530_1600_NTV_ARB_3.mfcc FATAL ERROR – Terminating program HERest	Added -h data2//_ARB_??.mfcc to the command line, which matches the filenames being given to HERest. Also take away ‘mllr’
HERest -C configall -C hmmadapt6-1/config_adapt -S adapt6.list -I train.triphone.new.mlf -z tmf -h data2//_ARB_??.mfcc -H hmmadapt6-1/hmmdefs.16 -H hmmadapt6-1/glob -u a -K hmmadapt6-2 -M hmmadapt6-2 -d hmmadapt6-1 tiedlist ERROR [+7060] InitHMMSet: Expected newline after 1’th HMM ERROR [+2321] Initialise: MakeHMMSet failed FATAL ERROR – Terminating program HERest	The “-h data2//_ARB_??.mfcc” command is causing a bunch of filenames to come up as part of the command, and it thinks that one of those is the list of hmms. Instead, use data2//, and PUT SINGLE QUOTES AROUND IT: -h ‘data2//’
HERest -C configall -C config.global -S adapt6.list -I train.triphone.new.mlf -u a -z tmf -K xforms mllr1 -J classes -h ‘data2//’ -H hmmorig/hmmdefs.16 tiedlist no output	No errors, but no new files in xforms or anywhere else. Change the -h part of the command. Wherever there is a ‘%’, that part will be the name of the output file (plus mllr or whatever comes after the output directory after the -K flag). So if you have two input sources or speakers, put the %% to match the characters that group each speaker’s utterances, give only those utterances as input, and you’ll get a transform for that speaker with an appropriate name.
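
The ‘unseen word’ warning from ReadARPAngram in the table above comes with the advice to write a script that finds LM words missing from the pronunciation dictionary. Here is a minimal sketch of such a script in Python. It is not part of the original list, and the function names are made up for illustration. It assumes an ARPA-format LM with a \1-grams: section, a dictionary whose first whitespace-separated field on each line is the word, and the same encoding and quote-escaping in both files.

```python
def arpa_unigrams(lines):
    """Collect the words listed in the unigram section of an ARPA-format LM."""
    words = set()
    in_unigrams = False
    for line in lines:
        line = line.strip()
        if line.startswith("\\"):
            # Section markers look like \data\, \1-grams:, \2-grams:, \end\
            in_unigrams = (line == "\\1-grams:")
            continue
        if in_unigrams and line:
            fields = line.split()
            if len(fields) >= 2:
                # Fields are: log probability, word, optional backoff weight
                words.add(fields[1])
    return words


def dict_words(lines):
    """Collect head words, assuming the word is the first field on each line."""
    return {line.split()[0] for line in lines if line.strip()}


def missing_from_dict(lm_lines, dict_lines):
    """Return the LM unigrams that have no entry in the dictionary, sorted."""
    return sorted(arpa_unigrams(lm_lines) - dict_words(dict_lines))
```

To use it, open the LM and the dictionary (with the encoding they actually use) and pass the two file objects to missing_from_dict. After adding pronunciations for the words it reports, the dictionary still has to be re-sorted in the order HTK expects.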

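Several of the HResults errors in the table come down to a reference mlf with an empty or malformed utterance, or a hypothesis utterance with no counterpart in the reference. The sketch below is one way to look for these before running HResults. It is not from the original list and the names are hypothetical. It assumes the usual MLF layout (a #!MLF!# header, a quoted filename line per utterance, label lines, and a lone period closing each utterance), and it compares names without their extension, on the assumption that the reference uses .lab and the hypothesis uses .rec.

```python
def read_mlf(lines):
    """Map each utterance name in an MLF to its list of label lines."""
    entries = {}
    current = None
    for raw in lines:
        line = raw.strip()
        if line == "#!MLF!#":
            continue
        if current is None:
            if line:
                # Utterance header such as "*/utt1.lab": drop quotes and path
                current = line.strip('"').split("/")[-1]
                entries[current] = []
        elif line == ".":
            current = None  # a lone period closes the utterance
        else:
            # Blank lines are kept so that they can be reported
            entries[current].append(line)
    return entries


def stem(name):
    """Strip the extension so that utt1.lab and utt1.rec compare equal."""
    return name.rsplit(".", 1)[0]


def check_mlfs(ref_lines, hyp_lines):
    """Report likely trouble spots in a reference/hypothesis MLF pair."""
    ref, hyp = read_mlf(ref_lines), read_mlf(hyp_lines)
    ref_stems = {stem(name) for name in ref}
    return {
        "empty_in_ref": sorted(n for n, labs in ref.items() if not any(labs)),
        "blank_lines_in_ref": sorted(
            n for n, labs in ref.items() if any(labs) and "" in labs
        ),
        "hyp_not_in_ref": sorted(n for n in hyp if stem(n) not in ref_stems),
    }
```

The comparison is case-sensitive on purpose, so a hypothesis name that differs from the reference only in capitalization shows up under hyp_not_in_ref.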
Posted in Other, Technology and Software

Tagged forced alignment, htk

Recently Added to Chirila

Posted on September 20, 2017 | Leave a comment

Wergaia
Holmer’s Darumbal list
Mathi-Mathia
Sidney May papers

Posted in Chirila

Class on journal article writing

Posted on August 30, 2017 | Leave a comment

Last spring, I taught a graduate class on how to submit an article to a journal. Our department, like many, has a qualifying paper requirement, where students write two “publishable” or “near-publishable” research papers as a stepping stone to the dissertation. Faculty have always had the expectation that students would submit these papers to a journal, but my impression (as Director of Graduate Studies) was that this wasn’t happening as quickly or as frequently as it should. Hence this class.

Students were third and fourth year graduate students. They had all already passed our qualifying paper requirement, and had at least one manuscript to work with. We met once a week for an hour as a group, and the students met with a partner outside of class for at least an hour too. During our group meeting the students reported briefly on what they’d done with their writing buddies. I also did all the activities.

This is a writing-intensive class for graduate students in linguistics who are interested in gaining more experience with writing and publication. Students may enroll with the permission of the instructor and need to have a QP or other piece of writing that would be suitable for submission to a journal by the end of the semester.

The class counts towards the departmental seminar requirement for graduates in third and fourth year.

In order to pass the class, students will need to do the following:

1. Submit at least one paper to a journal.
2. Submit an abstract to at least one conference.
3. Provide a referee report for at least one paper for a colleague.
4. Have a ‘writing buddy’ within the class, to whom you provide regular feedback.
5. Provide weekly feedback to the group regarding progress.

We will meet weekly as a group for an hour, and you will also meet your writing buddy for an hour.
Assessment: this was a pass/fail class.

Here was the weekly schedule. I did not make detailed handouts for class, since this was an additional class for me. We did not use a textbook. If doing this again, I could see some advantage to using something like “writing a journal article in 12 weeks” but I don’t think it’s crucial.

Week 1: General writing and research skills. Backing up, some techniques for writing consistently, and the like. Expectations of working with a writing buddy (regular time to meet with them). The students made a research project list for homework and posted it for everyone (I showed them mine, which led into a discussion of how many projects someone should be working on at any one time). We also talked about how to identify self-sabotaging tendencies in academic work.

Week 2: Identify the manuscript to submit and what needs to be done to it in order to make it publishable/submittable (e.g. are the data sufficient, writing clarity, organization, length, engagement with the literature). We talked about word limits, general properties of journal articles, minimal publishable units, and the like.

Week 3: How to pick a journal. We talked about main journals in the field, how to figure out what’s an appropriate place to send a manuscript (what goes to Language, for example). Homework was to figure out what journal (+ backup journal) they wanted to target. We brainstormed journals and the decision process for where to send a paper.

Week 4: How to submit an article to a journal. We walked through the Diachronica online submission process, registering for the site, creating a submission, explaining all the steps, and talking about how submission platforms differ. We also talked about how to interact with journal editors, what a presubmission inquiry looks like, and when it’s ok to ask for an update. Homework for this (and previous weeks) was to continue working on what needed to be done to the paper to submit it.

Week 5: Check-in. We went through what each person was doing on their paper, where they were at, what still needed to be done.

Week 6: What a referee report looks like. How long they take to do and receive, what sort of things get commented on, tone, etc. We wrote a report on a published paper (anonymized) and I shared reports I had received on a couple of papers.

Week 7: Revising and resubmitting. How to respond to referee reports. What to expect from an editor’s decision, whether you need to respond to everything, how to deal with conflicting recommendations, what to submit in a revision. Desk rejections and what they mean. I shared copies of an original submission, referee reports, resubmission, and subsequent acceptance of a paper.

Weeks 8-11: Refereeing our papers. We did three rounds of refereeing. Each week, everyone brought two copies of their paper to class, and we spent half an hour commenting on two papers. Homework was to revise the paper in accordance with the suggestions from the class “referees”. We also talked about the comments they were giving.

Week 12: Turning a journal article into a conference paper abstract. Differences between articles and conference talks.

Week 13: Dealing with proofs. Proof marks, what sorts of things can be corrected at proof stage, etc.

I also had a paper I wanted to submit that spring, and since there were 5 students in the class, I teamed up with one of them as a writing buddy too.

The deadline for submission of papers was May 10, and most of the papers were submitted fairly close to that date. Of the 5 students (+ me), the results so far are: 1 accept with minor revision (a few days ago), 1 revise and resubmit (last week), 1 reject with helpful reviews for revision and submission elsewhere (in June), 1 technical rejection (+ submission elsewhere; about a week after submission), and 2 still under review.

I think it worked pretty well, and I will probably offer it again in a year or two (not this coming year).

Posted in Other, teaching

Polygons and centroids now on Zenodo

Posted on August 25, 2017 | Leave a comment

I’ve updated the polygon and centroid files for Australian language locations, and placed them on Zenodo. This means there’s something stable for you to reference if you want to use them and refer to them. As always, comments and corrections very welcome. And as always, please consider using the Zenodo community for Australian languages to upload your own materials.

Posted in Chirila

Videos for Zenodo uploads

Posted on July 5, 2017 | Leave a comment

I made some videos about how to upload files to the Zenodo repository for Australian languages:

The first video shows how to sign up for a Zenodo account.

The second shows how to upload files to the Australian Languages Zenodo community. It should be a help for anyone who would like to upload files but isn’t sure how.

Posted in Chirila, Media, Technology and Software

Color in Pama-Nyungan: update

Posted on July 5, 2017 | Leave a comment

Last November, Hannah Haynie and I published a paper in the Proceedings of the National Academy of Sciences on color term systems in Pama-Nyungan. In it, we used phylogenetic methods to show that color term systems can both gain and lose terms, and that while they do so mostly in accordance with prior work on color term systems (Berlin and Kay, Kay and Maffi, and colleagues), we also found evidence for ‘exceptional’ systems that appeared not to conform to the B&K system. We used data from the Chirila database and fairly standard phylogenetic methods of ancestral state reconstruction.

For an analysis of this type to be correct, several assumptions must be satisfied:

sample data need to be representative of the languages as a whole;
sample data need to be correct;
the analytical tools need to be applicable to what’s being studied;
the analyses need to be interpreted correctly.

Over the last six months, Hannah and I have been in correspondence with David Nash about many of these points, particularly those involving sampling, the correctness of the underlying data, and judgments about what is a color term. In particular, in the original version of Table S1, a data conversion error resulted in words from several languages being associated with the wrong row in the table (particularly Wargamay and Warlmanpa). This did not affect the analyses reported in the paper, as the error was introduced when spreadsheets were converted to Microsoft Word documents for uploading to the journal’s online submission site. [The corrected table is available here.]

The discussions with Nash revolved around several issues already identified both in our paper and the supplementary materials:

the difficulty of determining whether a color term is genuinely absent from the language, or simply not recorded;
the difficulty of establishing the ranges of color terms glossed in English by non-native speakers of the language;
the issue of polysemy, for example, whether a term glossed as “unripe, green” is truly a color term, or whether “green” here is meant solely in the sense of “unripe, not ready for eating” (and therefore not glossing a true color term).

Coding decisions of this type are based on a careful philological analysis of each individual source, and while phylogenetic analyses are usually robust to individual errors, systematic errors may bias the results. In general, where Hannah and I were unsure, we tended to include rather than exclude; this applies especially to terms for ‘green’ and terms for ‘red’ based on words meaning ‘blood’ (which could be interpreted as the descriptive adjective ‘bloody’ rather than a true color term). For ‘green’ terms, many languages have a word that is glossed as ‘green’ or ‘unripe’; while some of these terms do appear to be real color terms (in that they can refer to items that aren’t unripe, like shirts), others aren’t — they refer to the ripeness of fruit, not directly to its color. (We had a similar problem with ‘grey’, which was often ambiguously glossed as a color term or a word referring only to grey hair.)

Another issue is the extent to which we make use of data from closely related languages in determining the color inventory of a particular language variety. For example, if a particular variety appears to lack a term for ‘blue’, but a term is present in other languages in the subgroup, are we justified in treating the lack of a term as a true omission? In our analyses, we treated such cases as absent rather than indeterminate, because we did not want to omit true variation in the color inventories of languages. But one could also argue that color inventories are unlikely to vary so much between dialects of the same language (or closely related languages in a subgroup), so unrecorded colors are probably omissions from data collection rather than genuine absences from the language.

We suspect that some terms were not recorded because of the linguists’ expectations about what items are present (or not) in a language. For example, Australian languages are stereotypically claimed to lack color terms beyond black, white, red, and yellow; this can lead researchers not to ask for terms like blue or purple.

Finally, data for this paper came from the Chirila database (Bowern 2016), which while extensive (800,000+ items), is by no means exhaustive. Nash brought to our attention several cases where color terms had been recorded in sources which are not in Chirila. These are also noted in the revised supplementary table and reflected in the newly uploaded analysis files.

In order to assess the impact of our coding decisions, as well as the impact of terms which were missing from Chirila and hence recorded as absent from the languages, we re-ran all analyses. We ran two sets of updated analyses. One simply corrected errors resulting from data missing from Chirila. The other also used Nash’s alternative judgments about presence/absence of color terms like ‘green’. In neither case were our main conclusions affected. That is, we still find support for both color gain and color loss. While, as is expected, the numerical values of individual results changed somewhat, our inferences and conclusions stand. Color loss is possible (under this model), though it’s substantially less common than color gain.

I am currently working on a new update to Chirila and many of these revised sources will be available there.

Posted in Chirila, Historical, Pama-Nyungan

Anggarrgoon