Wednesday, January 7, 2015

Preprocessing the data

I wasn't sure how best to represent the data, but I did know that leaving it in CSV format wasn't ideal. I thought about what Henry and I did for our Yelp project, where we took JSON data and stored just the elements we needed in dictionaries. I went ahead and converted the data into JSON, adding an "id" field so each record could be told apart easily.
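A minimal sketch of that kind of conversion might look like the following. It assumes Python, and the column names ("location", "description") and the sample rows are made up for illustration; the real CSV's headers may differ.

```python
import csv
import io
import json

def csv_to_records(csv_text):
    """Parse CSV text into a list of dicts, adding a sequential "id" to each."""
    reader = csv.DictReader(io.StringIO(csv_text))
    records = []
    for i, row in enumerate(reader):
        record = dict(row)
        record["id"] = i  # position in the file doubles as a unique id
        records.append(record)
    return records

# Hypothetical sample data, not taken from the real dataset.
sample = (
    "location,description\n"
    "Main St,NO PARKING 8AM-6PM\n"
    "Broadway,NO STANDING ANYTIME\n"
)
records = csv_to_records(sample)
print(json.dumps(records, indent=2))
```

Using the row position as the id is the simplest choice; if the source data already has a stable key, that would be a better id.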

---

Looking at the parking sign text, it's clear that it's too varied to be easily manipulated. It probably needs some preprocessing. Sometimes "NO PARKING" isn't cleanly separated from the characters around it; sometimes it is, by a space. Some times are written as "11:30AM" while others are written as "1PM". Sometimes two times are joined into a range, like "10:30AM-2PM". NYC also has different terms for things: there's "parking", then there's "standing", and finally there's "stopping".
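One way the time formats could be normalized is with a regular expression that accepts both "11:30AM" and "1PM" and converts each match to 24-hour "HH:MM". This is only a sketch: the pattern and the function names (`to_24h`, `find_times`) are assumptions, and it only covers the formats mentioned above.

```python
import re

# Hour, optional ":MM", then AM or PM (with or without a space before it).
TIME_RE = re.compile(r"(\d{1,2})(?::(\d{2}))?\s*(AM|PM)", re.IGNORECASE)

def to_24h(hour, minute, meridiem):
    """Convert 12-hour clock parts to a zero-padded 24-hour "HH:MM" string."""
    hour = int(hour) % 12          # 12AM becomes 0, 12PM becomes 0 then +12
    if meridiem.upper() == "PM":
        hour += 12
    return "%02d:%02d" % (hour, int(minute or 0))

def find_times(text):
    """Return every time found in text, in order, as 24-hour strings."""
    return [to_24h(*m.groups()) for m in TIME_RE.finditer(text)]

print(find_times("11:30AM"))      # a single time with minutes
print(find_times("1PM"))          # a single time without minutes
print(find_times("10:30AM-2PM"))  # a range yields two times
```

A range like "10:30AM-2PM" simply comes back as two times in order, so the start and end can be read off as the first and second items.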

I'm thinking of having a list of labels. Examples include "parking", "no parking", "standing", a time of day or a pair of times, "before", "after", and a day or days of the week. Then I'd try to find these labels in the descriptions. If we can see that a sign has some of these labels, it should be easier to translate.
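The label idea could be sketched roughly as below. The label set, the patterns, and the `find_labels` helper are all assumptions for illustration, and the sample sign texts are invented. The "no ..." patterns are checked first and allow a missing space, so "NOPARKING" is not also reported as plain "parking".

```python
import re

# Ordered so that "no parking" is tried before "parking", and so on.
LABEL_PATTERNS = [
    ("no parking", re.compile(r"NO\s*PARKING")),
    ("no standing", re.compile(r"NO\s*STANDING")),
    ("no stopping", re.compile(r"NO\s*STOPPING")),
    ("parking", re.compile(r"PARKING")),
    ("standing", re.compile(r"STANDING")),
    ("stopping", re.compile(r"STOPPING")),
    ("before", re.compile(r"\bBEFORE\b")),
    ("after", re.compile(r"\bAFTER\b")),
    ("day", re.compile(
        r"\b(?:MON(?:DAY)?|TUE(?:S|SDAY)?|WED(?:NESDAY)?|THU(?:RS|RSDAY)?"
        r"|FRI(?:DAY)?|SAT(?:URDAY)?|SUN(?:DAY)?)\b")),
]

def find_labels(text):
    """Return the labels found in a sign description, in pattern order."""
    remaining = text.upper()
    found = []
    for label, pattern in LABEL_PATTERNS:
        if pattern.search(remaining):
            found.append(label)
            # Blank out the match so a later, shorter pattern can't reuse it.
            remaining = pattern.sub(" ", remaining)
    return found

print(find_labels("NOPARKING 8AM-6PM MON THRU FRI"))
print(find_labels("2 HOUR PARKING 9AM-7PM EXCEPT SUNDAY"))
```

This only says which labels are present, not how they relate to each other, so it's a first pass at sorting signs rather than a full translation.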
