Tuesday, January 13, 2015

Meeting with John Cho

Met with my adviser, John Cho.

I explained how I wanted to approach solving this problem. I want to create an input vector of various features, such as the presence of a "no parking" phrase, starting and ending times, etc. I would extract these using an NLP parser such as the Stanford CoreNLP tool. Then I'd manually work out the output results for some of the inputs. Each output result would also be a vector, with a field for each time slot in each day. These manually labeled examples would become the training data for the model. Then we'd use the model to classify the rest of the inputs.
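Here is a rough sketch of what those two vectors might look like. It is only an illustration: the regex stands in for a real parser like CoreNLP, and the three-field input layout, the hourly slots, and all function names are my own assumptions, not a settled design.

```python
import re

DAYS = ["mon", "tue", "wed", "thu", "fri", "sat", "sun"]


def to_hour(text):
    """Convert a string like '8am' or '12pm' to an hour in 0..23."""
    hour = int(text[:-2]) % 12
    return hour + 12 if text.endswith("pm") else hour


def extract_features(sign):
    """Hypothetical input vector: [has 'no parking', start hour, end hour].

    Missing times are encoded as -1. A regex stands in for the NLP parser.
    """
    lowered = sign.lower()
    times = re.findall(r"\d{1,2}(?:am|pm)", lowered)
    start = to_hour(times[0]) if len(times) > 0 else -1
    end = to_hour(times[1]) if len(times) > 1 else -1
    return [int("no parking" in lowered), start, end]


def empty_label():
    """Output vector: one 0/1 field per hourly slot per day (7 * 24 = 168)."""
    return [0] * (len(DAYS) * 24)


def mark_no_parking(label, day, start, end):
    """Set the slots for hours [start, end) on one day to 1 (no parking)."""
    offset = DAYS.index(day) * 24
    for hour in range(start, end):
        label[offset + hour] = 1
    return label


features = extract_features("No parking 8am-9am")
label = mark_no_parking(empty_label(), "mon", 8, 9)
```

A hand-labeled training example would then be a pair like (features, label), with the label filled in manually for every day the sign applies to.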

He believes that this approach will not work. Specifically, he brought up the idea that the different features in a vector are treated as independent of each other, in a machine learning sense. I'm not sure if that's actually true. Another point he brought up was that the ordering of the text matters. For example:

1. No parking 8am-9am except Sunday
2. No parking Sunday except 8am-9am

Very different meanings, with the exact same words. My approach seems to fall apart on this example, since I'm just picking out keywords to populate a feature vector. It isn't impossible, though. The phrase after "No parking" describes when the restriction applies, and the phrase after "except" describes the exception to it. Unfortunately this is getting further into NLP land, which seems like a big can of worms.
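To convince myself of the problem, here is a tiny check using a plain bag-of-words count (my own stand-in for "picking out keywords", not anything Cho suggested). The two signs come out identical once word order is thrown away.

```python
from collections import Counter


def bag_of_words(sign):
    """Count words in a sign, ignoring case and word order."""
    return Counter(sign.lower().split())


sign_a = "No parking 8am-9am except Sunday"
sign_b = "No parking Sunday except 8am-9am"

# The texts differ, but their order-free features are the same.
same_features = bag_of_words(sign_a) == bag_of_words(sign_b)
print(same_features)
```

Any model that only sees these counts has no way to give the two signs different outputs.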

He recommended I try to build a set of rules that would take the parsed input and produce the output. It seems doable, but potentially very tedious. A lot of the success will depend on how well I can parse the data into the pieces I need.
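A minimal sketch of what a couple of such rules could look like, handling only the two example signs above. Everything here is an assumption on my part: the phrase formats, the function names, and the choice to split on "except". Real signs would need many more rules, and time ranges that cross midnight are not handled.

```python
import re

TIME_RANGE = re.compile(r"(\d{1,2})(am|pm)-(\d{1,2})(am|pm)")
DAY_NAMES = {"monday", "tuesday", "wednesday", "thursday",
             "friday", "saturday", "sunday"}


def to_hour(number, suffix):
    """Convert ('8', 'am') or ('12', 'pm') to an hour in 0..23."""
    hour = int(number) % 12
    return hour + 12 if suffix == "pm" else hour


def parse_phrase(phrase):
    """Return (days, hours) covered by a phrase; None means unconstrained."""
    phrase = phrase.strip().lower()
    if not phrase:
        return None, None
    match = TIME_RANGE.fullmatch(phrase)
    if match:
        start = to_hour(match.group(1), match.group(2))
        end = to_hour(match.group(3), match.group(4))
        return None, set(range(start, end))
    if phrase in DAY_NAMES:
        return {phrase}, None
    raise ValueError(f"no rule for phrase: {phrase!r}")


def covers(constraint, day, hour):
    """True if the (days, hours) constraint includes this day and hour."""
    days, hours = constraint
    return (days is None or day in days) and (hours is None or hour in hours)


def parse_sign(sign):
    """Split 'No parking <rule> except <exception>' into two constraints."""
    body = sign.strip().lower()
    if not body.startswith("no parking"):
        raise ValueError(f"unsupported sign: {sign!r}")
    main, _, exception = body[len("no parking"):].partition("except")
    exception_rule = parse_phrase(exception) if exception.strip() else None
    return parse_phrase(main), exception_rule


def is_prohibited(sign, day, hour):
    """Parking is prohibited if the rule applies and no exception does."""
    main, exception = parse_sign(sign)
    if not covers(main, day, hour):
        return False
    return exception is None or not covers(exception, day, hour)
```

With this, is_prohibited("No parking 8am-9am except Sunday", "monday", 8) is true, while is_prohibited("No parking Sunday except 8am-9am", "monday", 8) is false, so the word order now changes the result.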

Ultimately, it seems that Cho likes the project. He said I should look into getting Santa Monica data. They're probably more likely to have the type of data that NYC had; LA city is fairly hopeless.
