Monday, January 26, 2015

Preprocessing? More like manipulating

Playing around a bit with the Stanford CoreNLP interpreter is giving me a sense of what it can do. It seems fairly capable of understanding times, even when they are written in various formats. Some examples:
1 pm
1:00 pm
It has trouble with dash-separated times, like:
Sometimes the text is already nicely formatted, like:
11:30am to 1pm
Days aren't reliably detected from their common abbreviations. The problematic ones are TUES, SAT, and SUN. It seems strange that TUES wouldn't work, since it seems unique (unlike SAT and SUN, it isn't also an ordinary word).
I think additional preprocessing of the data is in order. Based on these preliminary findings, here's my game plan:
1. Find all times that are of the form TIME-TIME, then replace the dash with a space.
2. Find all abbreviations of days of the week, then replace each with the full word.
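Step 2 could be sketched roughly like this in Python. The abbreviation table and the name expand_days are my own placeholders; only TUES, SAT, and SUN are known to cause trouble, and the rest are there for consistency. I'm also assuming the abbreviations appear in upper case, so the verb "sat" in ordinary prose is left alone.

```python
import re

# Hypothetical abbreviation table (an assumption, not taken from the data set).
DAY_NAMES = {
    "MON": "Monday",
    "TUE": "Tuesday",
    "TUES": "Tuesday",
    "WED": "Wednesday",
    "THU": "Thursday",
    "THUR": "Thursday",
    "THURS": "Thursday",
    "FRI": "Friday",
    "SAT": "Saturday",
    "SUN": "Sunday",
}

# Longest abbreviations first so "TUES" is tried before "TUE".
# The \b anchors keep "SUN" from matching inside "SUNDAY" or "SUNSET".
# An optional trailing period ("SAT.") is consumed along with the abbreviation.
_DAY_RE = re.compile(
    r"\b(" + "|".join(sorted(DAY_NAMES, key=len, reverse=True)) + r")\b\.?"
)


def expand_days(text):
    """Replace upper-case day abbreviations with the full day name."""
    return _DAY_RE.sub(lambda m: DAY_NAMES[m.group(1)], text)


print(expand_days("TUES 1 pm"))
print(expand_days("SAT. & SUN"))
```

Words that merely start with an abbreviation, such as SUNDAY, pass through unchanged.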
I found examples in my data set with crude Control-F searching. Given the size of the data set, the slow performance was sometimes annoying, especially for searches with no matches. I'm playing around with the idea of building a tool to help me quickly find text. For example, I'd probably only search the sign's text, whereas my text editor searches everything. It is also possible that building such a tool is a waste of time, since I may not use it much. It may also be a difficult problem, and therefore something I don't want to spend time solving.
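For a sense of scale, a bare-bones version of such a search might look like the sketch below. It assumes the data is already loaded as a list of dictionaries and that the sign's text lives under a "text" key; both the key and the name find_signs are hypothetical.

```python
def find_signs(signs, needle, field="text"):
    """Return (index, text) pairs for signs whose text contains needle.

    Only the given field is searched, and matching is case-insensitive.
    """
    needle = needle.lower()
    hits = []
    for i, sign in enumerate(signs):
        value = sign.get(field, "")  # signs without the field never match
        if needle and needle in value.lower():
            hits.append((i, value))
    return hits


# Made-up records, only to show the call.
signs = [
    {"id": "a1", "text": "TUES 1 pm"},
    {"id": "tues-02", "text": "11:30am to 1pm"},
]
print(find_signs(signs, "tues"))
```

The second record is skipped even though its id contains "tues", which is the point of searching only the sign text.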
I probably can't just go in blindly and look for a TIME-TIME pattern. First, I need to make sure that the two elements are both times; otherwise I could end up removing dashes from anywhere in the text. I could use the assumption that all times have a number in them, but that would fail on something like "9pm-Midnight".
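Here is a rough Python sketch of that idea, using the "every time contains a number" assumption. The pattern and the name split_time_ranges are mine, and I've added one extra assumption of my own: a bare number range like "1-2" is left alone unless at least one side carries minutes or am/pm. As expected, "9pm-Midnight" is not handled.

```python
import re

# A "time" here always contains a number: hour, optional ":MM", optional am/pm.
TIME = r"\d{1,2}(?::\d{2})?(?:\s?[ap]m)?"

# Both sides of the dash must look like a time. The lookarounds stop the
# pattern from matching in the middle of longer tokens such as "555-1234".
TIME_RANGE = re.compile(
    rf"(?<![\w:-])({TIME})\s?-\s?({TIME})(?![\w:-])",
    re.IGNORECASE,
)


def split_time_ranges(text, sep=" "):
    """Replace the dash in TIME-TIME with sep (a space by default)."""

    def _replace(match):
        start, end = match.group(1), match.group(2)
        # Extra assumption: two bare numbers ("1-2") are probably not a time
        # range, so they are returned untouched.
        if start.isdigit() and end.isdigit():
            return match.group(0)
        return f"{start}{sep}{end}"

    return TIME_RANGE.sub(_replace, text)


print(split_time_ranges("11:30am-1pm"))
print(split_time_ranges("9pm-Midnight"))
```

Passing sep=" to " instead would produce the "11:30am to 1pm" style, which might be worth trying if a plain space turns out not to help.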
Regular expressions:

Trying to learn regex...
I had written my "json" file incorrectly. It was actually several JSON objects written one after another, which is not a single valid JSON document. The proper way to store multiple JSON objects is to wrap them in a list or a dictionary.
Parsing json objects?
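One way this could go in Python, as a sketch: read the back-to-back objects with the standard json module's raw_decode, then write them out again as a single list. The function name and the sample records are made up for illustration, and I'm assuming the objects are separated only by whitespace.

```python
import json


def iter_concatenated_json(text):
    """Yield each object from a string of JSON objects written back to back."""
    decoder = json.JSONDecoder()
    text = text.strip()
    pos = 0
    while pos < len(text):
        # raw_decode parses one value starting at pos and returns where it ended.
        obj, pos = decoder.raw_decode(text, pos)
        yield obj
        # raw_decode does not skip whitespace, so step over it manually.
        while pos < len(text) and text[pos].isspace():
            pos += 1


# Made-up stand-in for the badly written file contents.
raw = '{"text": "1 pm"}\n{"text": "TUES"}\n'

signs = list(iter_concatenated_json(raw))

# Wrapped in a list, the data is one valid JSON document that json.loads accepts.
fixed = json.dumps(signs, indent=2)
print(json.loads(fixed) == signs)
```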

