Now that I’ve calmed down a bit, I decided to write about this, to remind myself later, after I graduate from this master’s program, that there was this moment during my thesis work when I felt extremely frustrated… at myself, for being so careless.
The deadline is getting near: less than two months away, to be precise. So I’m trying my best not to waste any time, given that even now I’m still struggling with the experiments.
The thing is, working in computational linguistics (or natural language processing) means dealing with a whole lot of text. Long story short, I now have to build a matrix of co-occurrence frequencies for pairs of words in a corpus of approximately 120 million sentences. Well, not simply pairs of words, actually, since I also have to include grammatical properties of each word, such as its POS tag or category. Sorry, that sounds a bit technical, I know. Just skip it.
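Just to illustrate the idea (this is a toy sketch, not my actual pipeline): counting how often POS-tagged words appear together in the same sentence could look something like this, where each sentence is assumed to be a list of (word, POS tag) pairs.

```python
from collections import Counter
from itertools import combinations

def count_cooccurrences(sentences):
    """Count how many sentences each unordered pair of tagged words shares."""
    counts = Counter()
    for sentence in sentences:
        # Each unordered pair of distinct tagged words in a sentence counts once;
        # sorting makes the pair key order-independent.
        for a, b in combinations(sorted(set(sentence)), 2):
            counts[(a, b)] += 1
    return counts

# A tiny hypothetical corpus of pre-tagged sentences.
corpus = [
    [("dog", "NN"), ("barks", "VBZ")],
    [("dog", "NN"), ("barks", "VBZ"), ("loudly", "RB")],
]
counts = count_cooccurrences(corpus)
print(counts[(("barks", "VBZ"), ("dog", "NN"))])  # co-occur in 2 sentences
```

At the real scale (120 million sentences, a 20,000-word vocabulary) this would of course be done in chunks on a cluster, but the counting logic is the same.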
Anyway, to be able to go further into the next experiment, I had to make sure that the words to be evaluated were contained in the matrix *my supervisor emphasized this, twice*. I made a simple shell script to check it, done. Then I started building the matrix, 20,000 × 20,000 in size, and it took around 15 hours to extract it from only 1 million sentences. Thanks to computer clusters, I managed to get the matrix extracted from 40 million sentences in 3 days.
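In hindsight, the coverage check boils down to a set difference. My actual check was a shell script, but the idea could be sketched in Python like this (the function name and data are hypothetical):

```python
def check_coverage(gold_words, vocabulary):
    """Fail loudly if any gold-standard word is absent from the matrix vocabulary."""
    missing = set(gold_words) - set(vocabulary)
    if missing:
        # Better to crash here than to discover a missing row three days later.
        raise ValueError(f"missing from matrix vocabulary: {sorted(missing)}")
    return True

# Hypothetical data: the matrix vocabulary and the gold-standard nouns.
vocab = {"cat", "dog", "house"}
gold = ["cat", "dog"]
print(check_coverage(gold, vocab))  # True
```

The point is that the check should fail loudly *before* the expensive extraction starts, not silently pass on a sloppy comparison.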
While waiting for the rest to finish, I decided to use the currently available matrix for the next experiment and see what would happen. There are 44 words (all nouns) in the gold standard to be evaluated. So after extracting those 44 words from the matrix, there should have been 44 rows in the submatrix. I started to panic when I saw only 43 rows in the output file. Sh*t, I’m screwed.
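The extraction step itself could also have warned me instead of silently dropping a row. A sketch, assuming a hypothetical storage format where the matrix is a mapping from each target word to its row of co-occurrence counts:

```python
def extract_rows(matrix, words):
    """Pull the rows for the evaluation words, warning about any that are absent."""
    submatrix = {w: matrix[w] for w in words if w in matrix}
    missing = [w for w in words if w not in matrix]
    if missing:
        # Surface the problem immediately rather than leaving a short output file.
        print(f"WARNING: {len(missing)} gold word(s) not in matrix: {missing}")
    return submatrix

# Hypothetical toy matrix and gold-standard list with one missing word.
matrix = {"cat": [1, 0, 2], "dog": [0, 3, 1]}
sub = extract_rows(matrix, ["cat", "dog", "unicorn"])
print(len(sub))  # 2 rows, and the warning names the missing word
```

With a warning like that, the 43-vs-44 mismatch would have been obvious at a glance instead of a nasty surprise.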
Apparently, I was missing one word! That single missing word forced me to delete all of the results I’d gotten over the last three days and start anew. Too confident, too lazy to double-check, and look what happened. Please learn from this, dear me!