First, I didn't want to re-download the shapefile from the NYC website, so I transferred the already-converted CSV file from my laptop to my desktop. I started up the entire pipeline and gave it a whirl. I ran into a number of issues:
- Hard-coded file names. All my data file names were hard-coded into constants at the head of the pipeline script. I tried to simply make them relative, but then ran into the forward-slash/backslash incompatibility between the two systems. I resolved this by creating a config file, which contains a variable describing which system I'm on. This config file will not be pushed to the repository.
- Moved some other hard-coded file names and paths up to the top of the pipeline script.
- Had to install the JDK to run the parser.
- I checked the input CSV, and sure enough, it had 381,445 entries. What's going on here?
- I downloaded just the "signs" and "locations" CSVs from NYC.
- There are 94,205 entries in "signs"
- There are 742,243 entries in "locations"
- Now I'm just lost.
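As an aside on the hard-coded path problem from the first bullet, here is a minimal sketch of how a per-machine setting could drive path construction. This is not my actual config file: the names `DATA_DIRS` and `data_path`, the system labels, and the directories are all hypothetical. Using `pathlib` means the separators never appear in the pipeline code itself.

```python
from pathlib import Path

# Hypothetical per-machine data directories. In practice these would live
# in the untracked config file rather than in the pipeline script.
DATA_DIRS = {
    "laptop": Path.home() / "parking-data",   # assumed location
    "desktop": Path("D:/parking-data"),       # assumed location
}


def data_path(system: str, filename: str) -> Path:
    """Build the path to a data file for the given system.

    Joining with the / operator lets pathlib pick the right separator,
    so no slashes are hard-coded into file names.
    """
    return DATA_DIRS[system] / filename
```

The pipeline script would then only need the system name from the config file, for example `data_path("desktop", "signs.csv")`.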
I guess it's possible that the dataset changed on me during my time working on this project. However, some of these values don't reconcile at all. For example, how can there be more "locations" than "signs"?
Looking at the signs and locations CSV files more closely, it looks like they're mislabeled. According to the metadata explanations PDF, the entries in the signs file are actually locations, and vice versa.
The "locations" file, which appears to actually be a "signs" file, also contains the signs' descriptions. What's curious is that a lot of the descriptions don't look like parking regulations in the flavor of "no parking mondays 10am-2pm", but instead read something like "Curb Line", "Property Line", or "Building Line". Their "sign code" fields match: "CL", "PL", and "BL". I haven't found examples of other types of these "non-signs".
Ideally, if I don't count these "line" entries, there will be exactly the same number of signs (in this locations.csv file) as there are data points from the shapefile. Perhaps doing this right now is the way to start clarifying things.
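The count described above could be sketched roughly as follows. This is only an illustration, not the script I ran: it assumes the CSV has a header row and that the sign code lives in a column named `sign_code`, which is a made-up name. The three codes to skip are the ones noted above.

```python
import csv
import io

# Codes for the "line" entries that are not real parking signs.
LINE_CODES = {"CL", "PL", "BL"}


def count_signs(rows, code_field="sign_code"):
    """Return (total rows, rows whose sign code is not a "line" code).

    `rows` is any iterable of dicts, e.g. a csv.DictReader.
    `code_field` is an assumed column name, not taken from the real file.
    """
    total = 0
    kept = 0
    for row in rows:
        total += 1
        code = (row.get(code_field) or "").strip().upper()
        if code not in LINE_CODES:
            kept += 1
    return total, kept


def count_signs_in_file(path, code_field="sign_code"):
    """Same count, reading from a CSV file on disk."""
    with open(path, newline="") as f:
        return count_signs(csv.DictReader(f), code_field)


# Tiny made-up sample; "X1" is a placeholder code, not a real one.
sample = (
    "sign_code,description\n"
    "CL,Curb Line\n"
    "X1,NO PARKING MONDAY 10AM-2PM\n"
    "PL,Property Line\n"
)
total, kept = count_signs(csv.DictReader(io.StringIO(sample)))
```

Streaming the rows instead of loading the whole file keeps memory use flat, which matters a little with several hundred thousand entries.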
--
Having done it, I get 406,056 entries in the "locations" file, which I'm assuming is actually the "signs" file.
I'm thinking I shouldn't treat signs.csv and locations.csv as truth, and should instead revisit the shapefile: download it to my Mac, run the conversion tool to get the CSV, and count the lines.