Saturday, June 6, 2015

Learning from data

I implemented the features that find "anytime" tokens and days of the week, and stored the results as fields in the JSON. I also wanted to handle the case where "except" appears before a day of the week, for example "... except Sunday". I noticed that almost all instances of this pattern were for Sunday specifically, so I counted the words that follow "except". Note that these counts are from before normalizing abbreviations:

  • sunday - 34,504
  • sun - 4,492
  • trucks - 3,931
  • authorized - 2,871
  • school - 630
  • su - 286
  • commercial - 266
  • vehicles - 164
  • farmers - 129
  • saturday - 85
  • taxis - 64
  • comm - 61
  • miu - 29
  • city - 26
  • buses - 21
  • pick - 20
  • tlc - 18
  • taxis/fhv's - 17
  • loading - 16
  • license - 6
  • tour - 6
  • m - 6
  • horse - 5
  • nycta - 3
  • sunda - 3
  • flea - 3
  • greenmarket - 3
  • friday - 2
  • 8am-11am - 2
  • n - 2
  • state - 2
  • mon - 2
  • us - 1
  • deliveries - 1
  • sunady - 1
Clearly Sunday is the dominant day. Notice the variations of Sunday as well: several abbreviations (sun, su) and misspellings (sunda, sunady).
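
A count like the one above could be produced along these lines. This is only a minimal sketch, not the code I actually ran: the sample strings in `SAMPLE_DESCRIPTIONS` are made up for illustration, and the helper name `count_words_after_except` is hypothetical. It takes the first whitespace-delimited token after each "except" and tallies it, with no abbreviation normalization.

```python
import re
from collections import Counter

# Made-up sample strings for illustration; the real data set is not shown here.
SAMPLE_DESCRIPTIONS = [
    "NO PARKING 8AM-6PM EXCEPT SUNDAY",
    "NO STANDING ANYTIME EXCEPT TRUCKS LOADING",
    "NO PARKING 7AM-7PM EXCEPT SUN",
    "NO STANDING EXCEPT AUTHORIZED VEHICLES",
]

# Capture the first whitespace-delimited token that follows the word "except".
AFTER_EXCEPT = re.compile(r"\bexcept\s+(\S+)", re.IGNORECASE)


def count_words_after_except(descriptions):
    """Tally the lowercased token that follows each occurrence of 'except'."""
    counts = Counter()
    for text in descriptions:
        for match in AFTER_EXCEPT.finditer(text):
            counts[match.group(1).lower()] += 1
    return counts


counts = count_words_after_except(SAMPLE_DESCRIPTIONS)
for word, n in counts.most_common():
    print(f"{word} - {n}")
```

Because nothing is normalized, "sunday", "sun" and "su" land in separate buckets, which is exactly why the variants show up as separate entries in the list.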

After this, I should be able to create a separate HTML page for every time slot. Let's actually calculate how many there will be.

24 hours in a day
4 blocks of 15 minutes per hour

24 * 4 = 96

That's not so bad.
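
As a sanity check on that number, the slots can be enumerated directly. This is a rough sketch under my own assumptions: the function name `time_slots` and the "HH:MM" label format are hypothetical, and how each slot maps to an HTML page is left open.

```python
from datetime import datetime, timedelta

SLOT_MINUTES = 15  # length of one time slot, as in the calculation above


def time_slots(slot_minutes=SLOT_MINUTES):
    """Return the start time of every slot in a day as 'HH:MM' strings."""
    # The date is arbitrary; only the time of day is used.
    midnight = datetime(2015, 6, 6)
    slots_per_day = (24 * 60) // slot_minutes
    return [
        (midnight + timedelta(minutes=i * slot_minutes)).strftime("%H:%M")
        for i in range(slots_per_day)
    ]


slots = time_slots()
print(len(slots))           # 96
print(slots[0], slots[-1])  # 00:00 23:45
```

Each label could then serve as a key or file name for one generated page.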
