Monday, January 26, 2015

Preprocessing? More like manipulating

Playing around a bit with the Stanford CoreNLP interpreter is allowing me to get a sense of what it can do. It seems fairly capable of understanding times, even when written in various formats. Some examples:
1:00
1pm
1 pm
1:00pm
1:00 pm
It has trouble with dash-separated times, like:
8-9am
8am-9am
Sometimes the text is already nicely formatted, like:
11:30am to 1pm
---
Days aren't reliably detected from their common abbreviations. The problematic ones are TUES, SAT, and SUN. It seems strange that TUES wouldn't work, since it looks unambiguous.
---
I think additional preprocessing of the data is in order. Based on these preliminary findings, here's my game plan:
1. Find all times that are of the form TIME-TIME, then replace the dash with a space.
2. Find all abbreviations of days of the week, then replace each with the full word.
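
Step 2 could be sketched in Python roughly like this. The abbreviation table is my own guess at the variants that show up, and it assumes the sign text is uppercase (so that ordinary words like "sat" or "sun" are left alone); neither assumption has been checked against the data.

```python
import re

# Hypothetical abbreviation table; the real data may need more variants.
DAY_ABBREVIATIONS = {
    "MON": "Monday",
    "TUE": "Tuesday",
    "TUES": "Tuesday",
    "WED": "Wednesday",
    "THU": "Thursday",
    "THUR": "Thursday",
    "THURS": "Thursday",
    "FRI": "Friday",
    "SAT": "Saturday",
    "SUN": "Sunday",
}

# Whole-word, uppercase-only match, with an optional trailing period.
# Longer abbreviations are listed first so THURS is tried before THU.
_DAY_RE = re.compile(
    r"\b("
    + "|".join(sorted(DAY_ABBREVIATIONS, key=len, reverse=True))
    + r")\b\.?"
)


def expand_days(text):
    """Replace uppercase day abbreviations with the full day name."""
    return _DAY_RE.sub(lambda m: DAY_ABBREVIATIONS[m.group(1)], text)
```

For example, `expand_days("NO PARKING TUES")` gives `"NO PARKING Tuesday"`, while a word that is already spelled out, such as `SATURDAY`, is left untouched because of the word boundaries.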
---
I found examples in my data set with crude Control-F searching. Given the size of the data set, the slow performance was sometimes annoying, especially for searches with no matches. I'm playing around with the idea of building a tool to help me quickly find text. For example, I'd probably only search the sign's text, whereas my text editor searches everything. It is also possible that building such a tool is a waste of time - I may not use it much. It may also be a difficult problem, and therefore something I don't want to spend time solving.
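
If I do build it, the simplest version might be a small function like the one below. This is only a sketch: it assumes the records are already loaded as a list of dictionaries, and the field name "text" is a placeholder for wherever the sign's text actually lives.

```python
def search_signs(records, query, field="text"):
    """Return the records whose sign text contains query (case-insensitive).

    `records` is assumed to be a list of dicts; `field` is a hypothetical
    key name for the sign's text and would need to match the real data.
    """
    needle = query.lower()
    return [
        record
        for record in records
        if needle in str(record.get(field, "")).lower()
    ]
```

Because it only looks at one field, a search for "tues" would skip matches in IDs, coordinates, or anything else stored alongside the sign.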
---
I probably can't just go in blindly and look for a TIME-TIME pattern. First, I need to make sure that both elements are actually times, otherwise I could end up removing dashes from anywhere in the text. I could assume that every time has a number in it, but that would fail on something like "9pm-Midnight".
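
Here is a rough Python sketch of that idea using a regular expression. It goes one step further than "has a number in it": as an extra assumption of mine, the end time must carry an am/pm marker, so that things like dates or phone numbers keep their dashes. As expected, it does nothing for "9pm-Midnight".

```python
import re

# A start time: 1-2 digits, optional :MM, optional am/pm.
_START = r"\d{1,2}(?::\d{2})?(?:\s*[ap]m)?"
# An end time: same shape, but am/pm is required (my own added assumption).
_END = r"\d{1,2}(?::\d{2})?\s*[ap]m"

_TIME_RANGE_RE = re.compile(
    r"\b(" + _START + r")\s*-\s*(" + _END + r")\b",
    re.IGNORECASE,
)


def split_time_ranges(text, separator=" "):
    """Replace the dash in TIME-TIME with `separator`, leaving other dashes alone."""
    return _TIME_RANGE_RE.sub(
        lambda m: m.group(1) + separator + m.group(2), text
    )
```

With the default separator, `split_time_ranges("8-9am")` gives `"8 9am"` and `split_time_ranges("8am-9am")` gives `"8am 9am"`, while `"MON-FRI"` and `"555-1234"` come back unchanged. Passing `separator=" to "` would instead produce the "11:30am to 1pm" style.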
---
Regular expressions:
http://www.tutorialspoint.com/python/python_reg_expressions.htm

Trying to learn regex...
http://www.zytrax.com/tech/web/regex.htm
---
I had written my "json" file incorrectly. It was actually a series of JSON objects written one after another, which is not valid JSON as a whole. The proper way to store multiple JSON objects is to wrap them in a single list or dictionary.
http://stackoverflow.com/questions/21058935/python-json-loads-shows-valueerror-extra-data
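
A small Python illustration of the difference, using made-up records (the "id" and "text" keys are placeholders, not my real field names):

```python
import json

# Made-up sample records; the keys are hypothetical.
records = [
    {"id": 1, "text": "NO PARKING 8-9am"},
    {"id": 2, "text": "TUES 1pm"},
]

# What I had: objects written back to back, with nothing wrapping them.
broken = "".join(json.dumps(record) for record in records)

# What works: one top-level list containing all the objects.
valid = json.dumps(records)


def is_valid_json(text):
    """Return True if `text` parses as a single JSON document."""
    try:
        json.loads(text)
        return True
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return False


loaded = json.loads(valid)
```

Here `is_valid_json(broken)` is False (the "Extra data" error from the link above), `is_valid_json(valid)` is True, and `loaded` is an ordinary Python list of dictionaries, so `loaded[1]["text"]` is `"TUES 1pm"`.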
---
Parsing json objects?
http://stackoverflow.com/questions/2835559/parsing-values-from-a-json-file-in-python
