Wednesday, May 6, 2015

Multi-platform development woes, discovery

Because I felt limited by only developing on my laptop, I decided to give my iMac a go. I was already committing code to GitHub, so it was easy to sync up. Here's the story so far.

First, I didn't want to re-download the shapefile from the NYC website, so I transferred the already converted CSV file from laptop to desktop. I turned on the entire pipeline and gave it a whirl. I ran into a number of issues:
  1. Hard-coded file names. All my data file names were hard-coded as constants at the top of the pipeline script. I tried simply making them relative, but then ran into the forward slash vs. backslash incompatibility between the two systems. I resolved this by creating a config file containing a variable that describes which system I'm on. This config file will not be pushed to the repository.
  2. Scattered paths. Some other hard-coded file names and paths were buried further down, so I moved them up to the top of the pipeline script.
  3. Missing JDK. I had to install the JDK to run the parser.
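
For illustration, here is a minimal sketch of the config-file idea in Python. It is not the actual pipeline code: the file name `local_config.json`, the `data_dir` key, and both helper functions are hypothetical. Instead of a "which system am I on" flag, it stores the machine-specific data directory in the untracked config file and lets `pathlib` pick the right separator.

```python
import json
from pathlib import Path


def load_data_dir(config_path):
    """Read the machine-specific data directory from a local JSON config.

    The config file (a hypothetical local_config.json) is meant to be
    listed in .gitignore so every machine keeps its own copy.
    """
    with open(config_path, encoding="utf-8") as f:
        config = json.load(f)
    # "data_dir" is an assumed key name, not one from the real project.
    return Path(config["data_dir"])


def data_file(data_dir, *parts):
    """Join path components using the current OS's separator."""
    return Path(data_dir).joinpath(*parts)
```

With this in place, the constants at the top of the pipeline script could be built as `data_file(load_data_dir("local_config.json"), "raw", "signs.csv")`, with no slashes typed by hand.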

Once it was running, I noticed it reported processing over 700 files. Odd, I thought. I recalled it being 160 files with 500 entries in each, for a total of 80,000. Now I was suddenly faced with over 350,000 data points?
  1. I checked the input CSV, and sure enough, it had 381,445 entries. What's going on here?
  2. I downloaded just the "signs" and "locations" CSVs from NYC.
    1. There are 94,205 entries in "signs"
    2. There are 742,243 entries in "locations"
  3. Now I'm just lost.
I guess it's possible that the dataset changed on me during my time working on this project. However, some of these values don't reconcile at all. For example, how can there be more "locations" than "signs"?
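
A quick way to sanity-check these numbers is to count the entries in each CSV and work out how many 500-entry files that implies. The sketch below is only an illustration in Python: the function names are made up, and it assumes each CSV has a single header row, which may not hold for the NYC files.

```python
import csv
import math

BATCH_SIZE = 500  # entries per output file, as recalled above


def count_entries(csv_path, has_header=True):
    """Count non-empty data rows in a CSV file."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        rows = sum(1 for row in csv.reader(f) if row)
    # Assumption: at most one header row precedes the data.
    return max(rows - 1, 0) if has_header else rows


def expected_batches(entries, batch_size=BATCH_SIZE):
    """Number of output files needed to hold the given number of entries."""
    return math.ceil(entries / batch_size)


print(expected_batches(80_000))   # 160
print(expected_batches(381_445))  # 763
```

So 381,445 entries would produce 763 files, which is consistent with the "over 700" reported by the pipeline and points at the input CSV rather than the batching.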

Looking at the signs and locations CSV files more closely, it looks like they're mislabeled. According to the metadata explanations PDF, the entries in the signs file are actually locations, and vice versa.

The "locations" file, which appears to actually be a "signs" file, also contains the signs' descriptions. What's curious is that a lot of the descriptions don't read like parking regulations along the lines of "no parking mondays 10am-2pm", but are instead something like "Curb Line", "Property Line", or "Building Line". Their "sign code" fields are correspondingly "CL", "PL", or "BL". I haven't found examples of other types of these "non-signs".

Ideally, if I don't count these "line" lines, there will be exactly the same number of signs (in this locations.CSV file) as there are data points from the shapefile. Perhaps doing this right now is the way to start clarifying things.
--
Having done it, I get 406,056 entries in the "locations" file, which I'm assuming is actually the "signs" file. That still doesn't match the 381,445 entries in my input CSV.
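
For reference, the filtering step can be sketched roughly like this in Python. This is not the exact method used for the count above: the column name `sign_code` is a placeholder (the real header in the NYC file may differ), and the set of codes to skip only covers the three "line" types found so far.

```python
import csv

# The "line" codes observed so far; there may be other non-sign codes.
NON_SIGN_CODES = {"CL", "PL", "BL"}


def count_real_signs(csv_path, code_column="sign_code"):
    """Return (kept, skipped) row counts after dropping the "line" entries.

    code_column is a hypothetical column name and should be replaced by
    whatever the file actually uses for the sign code.
    """
    kept = 0
    skipped = 0
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            code = (row.get(code_column) or "").strip().upper()
            if code in NON_SIGN_CODES:
                skipped += 1
            else:
                kept += 1
    return kept, skipped
```

Returning both counts makes it easy to check that kept plus skipped adds back up to the total number of rows in the file.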

I'm thinking I shouldn't treat signs.csv and locations.csv as truth, and should revisit the shapefile instead: download it to my Mac, run the conversion tool to get the CSV, and count the lines.
