Thursday, May 21, 2015

Discovered problem with the parser

I've been racking my brain for the past few days, trying to figure out why the results I saw differed from what I expected. Specifically, the parser output (the XML file) didn't line up with the descriptions.

After much digging, I think I found the culprit: the parser is simply skipping some lines. I've narrowed it down to a few instances around line 30, where some of the later sentences that start with "1 hour..." are not parsed. Frustratingly, some of these sentences get parsed and others don't, which makes no sense.

I really don't want to spend time figuring out what's wrong with the parser. It's certainly possible that I'm using it incorrectly, but I don't see how. I have a text file with a bunch of sentences, and I feed it to the parser.

I could try spelling out the leading number in every sentence that starts with a number followed by "hour" or "hr". Ridiculous, if you ask me.
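For reference, the substitution can be sketched roughly like this in Python. The lookup table, the regular expression, and the function name are made up for illustration, and it assumes one sentence per line with the number at the very start:

```python
import re

# Hypothetical lookup table; extend it to cover whatever numbers show up in the text.
NUMBER_WORDS = {
    "1": "One", "2": "Two", "3": "Three", "4": "Four", "5": "Five",
    "6": "Six", "7": "Seven", "8": "Eight", "9": "Nine", "10": "Ten",
}

# Digits at the start of a line, followed by whitespace and "hour(s)" or "hr(s)".
LEADING_HOURS = re.compile(r"^(\d+)(?=\s+(?:hours?|hrs?)\b)", re.IGNORECASE)


def spell_out_leading_hours(line):
    """Replace a leading number before "hour"/"hr" with its spelled-out form."""
    def replace(match):
        digits = match.group(1)
        # Numbers missing from the table are left untouched.
        return NUMBER_WORDS.get(digits, digits)

    return LEADING_HOURS.sub(replace, line)
```

Lines that don't start with a number, or whose number isn't followed by "hour" or "hr", pass through unchanged.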

--
Did it, doesn't seem to have helped.

There is definitely something going on at line 34. It's not getting picked up by the parser.

--
Still not sure what it is. I cut down the text going into the parser to very few lines. I think there may be a chance to work around this issue if I put a space before every period. Let's give it a shot, should be easy to do in the "periods" file.
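A rough Python sketch of that workaround, assuming the input is plain text held in a string (the function name is made up here). Note that it would also split decimals such as "1.5", so it only suits text without them:

```python
import re


def space_before_periods(text):
    """Insert a space before every period that isn't already preceded by whitespace."""
    return re.sub(r"(?<!\s)\.", " .", text)
```

Periods that already have a space in front of them are left alone, so running it twice doesn't pile up extra spaces.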

--
Well, that seems to do the trick. Relieved. Still, I'm unhappy that I had to go through this insane journey to find out that the parser has a problem. Maybe I'll submit a bug report to them.

At a cursory glance, all the results look good. One remaining problem is that some of the start and end times come out in reverse order, but that should be easy to handle: just keep track of which time was seen first in the sentence. If there's a dictionary implementation that keeps entries in insertion order, maybe one backed by a linked list, then this should be easy, assuming of course that its iterator behaves as expected.
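In Python, for example, collections.OrderedDict iterates in insertion order (and so do plain dicts from 3.7 on). Here is a minimal sketch of the idea, assuming times look like "9:30" or "17:00"; the pattern and function name are placeholders, not what the parser actually emits:

```python
import re
from collections import OrderedDict

# Hypothetical pattern for clock times such as "9:30" or "17:00".
TIME_PATTERN = re.compile(r"\b\d{1,2}:\d{2}\b")


def times_in_order_seen(sentence):
    """Map each time found in the sentence to its position, in order of first appearance."""
    seen = OrderedDict()
    for match in TIME_PATTERN.finditer(sentence):
        # setdefault keeps the first position if the same time appears again.
        seen.setdefault(match.group(), match.start())
    return seen
```

Iterating over the result yields the times in the order they appeared in the sentence, so the first key is whichever time was seen first.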
