Wednesday, January 28, 2015

Musings from class

Further developments on getting keywords/phrases/tokens.

I'm working with a smaller sample set, the first 100 data points. My regular expression seems to find TIME-TIME patterns with no false positives, though I can't speak to false negatives. I'm having a harder time replacing the dashes with spaces. I tried Python's regex search-and-replace function (`re.sub`), but I'm not sure how to do what I want. I might be overthinking this.
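A minimal sketch of what I'm after, assuming a simplified TIME pattern (my real pattern and sign text may differ): `re.sub` with backreferences keeps the two matched times and swaps only the dash between them for a space.

```python
import re

# Hypothetical TIME pattern for illustration, e.g. "7AM" or "8:30PM";
# the real one in my project may be different.
TIME = r"\d{1,2}(?::\d{2})?\s*[AP]M"

# Capture both times so the replacement can reference them with \1 and \2.
TIME_DASH_TIME = re.compile(rf"({TIME})-({TIME})", re.IGNORECASE)

def dash_to_space(text):
    # Only the dash between two TIME matches is replaced; other dashes
    # (like "MON-FRI") are left alone.
    return TIME_DASH_TIME.sub(r"\1 \2", text)

print(dash_to_space("NO PARKING 7AM-7PM MON-FRI"))
# NO PARKING 7AM 7PM MON-FRI
```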

---
In class today I had some thoughts about next steps from having these "annotated" datasets.

First, I want to see how much coverage I have with "no parking" signs. By coverage, I mean what percentage of the unique locations have a "no parking" sign. (Unique locations are key, because some locations will have multiple signs; there are fewer unique locations than signs.) How much coverage is there with "no standing", "no stopping", etc.? Can I achieve 100% coverage with these three signs (classes)? I think not. Is there a reasonably small number of such "classes" that would grant 100% coverage?

Since my new game plan is more manual rule based, I think coverage is an important metric. If I can achieve good coverage with a few types of classes, it may be good enough for now. That is, I can leave out some data points, but my final solution will still be good enough to use.

It's possible that there is more than one sign of the same class at the same location, like two "no parking" signs on the same pole. I'm not sure whether such cases exist, but it is a possibility. In that case, measuring coverage as the number of "no parking" signs divided by the number of unique locations would be incorrect, since a single location would be counted twice. Something to look out for.

Stemming from this, it's also possible that one location has signs of multiple classes, such as both a "no parking" sign and a "no standing" sign. Again, a naive calculation that sums sign counts across classes would inflate coverage in such scenarios.
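Both pitfalls above go away if coverage is computed over sets of unique locations rather than raw sign counts. A sketch, assuming a made-up list of (location, class) pairs rather than my actual data:

```python
# Hypothetical data for illustration; my real dataset has different
# fields and needs parsing first.
signs = [
    ("loc1", "no parking"),
    ("loc1", "no parking"),   # duplicate: same class, same pole
    ("loc1", "no standing"),  # second class at the same location
    ("loc2", "no stopping"),
    ("loc3", "no parking"),
]

# Set of all unique locations, regardless of class.
all_locations = {loc for loc, _ in signs}

def coverage(cls):
    # Fraction of unique locations with at least one sign of this class.
    # Using a set means duplicate signs at one location count once.
    covered = {loc for loc, c in signs if c == cls}
    return len(covered) / len(all_locations)

print(coverage("no parking"))  # 2 of 3 unique locations
```

Per-class coverages computed this way can still sum to more than 100% when one location carries multiple classes, which is fine; the number to watch is the coverage of the *union* of the chosen classes.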

