S3 static website analytics

Posted on 19 Mar 2015
Tags: meta

In an earlier post, I sketched out some technical aspects of this site and, in particular, mentioned that it is hosted as a static website on Amazon S3. This means that I can interact with it as a bucket and can distill the bucket's access logs into something resembling conventional server logs.

I hacked together a little script in Python to do just this. Before it runs, the logs are (incrementally) downloaded from their bucket programmatically using S3cmd. The script then loops over these files and parses them, filtering out all the bucket events (like the one in which this very page was uploaded) that aren’t involved in delivering data to HTTP clients. I also filter out the various inhuman bots that show up to skew the statistics. I’m using freegeoip.net for geolocation.
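
A minimal sketch of the parsing and filtering step might look like the following. It is not the script itself: the regular expression covers only the leading fields of an S3 access log line as I understand the format, the names parse_line and is_page_view are made up for illustration, and treating only WEBSITE.GET.OBJECT and REST.GET.OBJECT operations as page deliveries is an assumption.

```python
import re
from datetime import datetime

# Leading fields of an S3 server access log line; anything after the
# User-Agent field is ignored. The field layout is an assumption.
LOG_PATTERN = re.compile(
    r'(?P<owner>\S+) (?P<bucket>\S+) \[(?P<time>[^\]]+)\] (?P<ip>\S+) '
    r'(?P<requester>\S+) (?P<request_id>\S+) (?P<operation>\S+) (?P<key>\S+) '
    r'"(?P<request_uri>[^"]*)" (?P<status>\S+) (?P<error>\S+) '
    r'(?P<bytes_sent>\S+) (?P<object_size>\S+) (?P<total_time>\S+) '
    r'(?P<turnaround>\S+) "(?P<referrer>[^"]*)" "(?P<user_agent>[^"]*)"'
)

# Operations assumed to correspond to serving an object to an HTTP client.
DELIVERY_OPERATIONS = {"WEBSITE.GET.OBJECT", "REST.GET.OBJECT"}


def parse_line(line):
    """Return a dict of fields for one log line, or None if it doesn't match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    # Timestamps look like 19/Mar/2015:04:12:31 +0000
    record["time"] = datetime.strptime(record["time"], "%d/%b/%Y:%H:%M:%S %z")
    return record


def is_page_view(record):
    """True if the record is an object delivery rather than, say, an upload."""
    return record["operation"] in DELIVERY_OPERATIONS
```

Each downloaded log file can then be read line by line, keeping only the records for which parse_line returns a dict and is_page_view is true.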

There’s not yet enough data to say or visualize much of interest, so here are a few quick observations in advance of a later, deeper analysis:

  • Greetings to the reader who is located in Donets’ka Oblast; your life is probably much more interesting than mine right now.
  • Binned by hour, this site is most viewed during the 04:00 hour, Coordinated Universal Time (UTC). Possible explanations involve you all being night owls, my being a night owl since I typically upload around midnight locally, or a readership primarily located in Oceania.
  • Referrer spam is an attack surface to be aware of memetically, even in situations where access logs aren’t to be published.
  • My most popular post so far has been Stats miscellany, with 46 views, which suggests that this is a well I should go to again.
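
The hourly binning mentioned in the list above can be sketched in a few lines. This is an illustration under assumptions rather than the code that produced the numbers: views_by_hour and peak_hour are hypothetical names, and the input is assumed to be a list of timezone-aware datetime objects, one per page view.

```python
from collections import Counter
from datetime import timezone


def views_by_hour(timestamps):
    """Count page views per hour of day, after converting each time to UTC."""
    return Counter(ts.astimezone(timezone.utc).hour for ts in timestamps)


def peak_hour(timestamps):
    """Return the UTC hour with the most views, or None if there are no views."""
    counts = views_by_hour(timestamps)
    if not counts:
        return None
    return counts.most_common(1)[0][0]
```

Converting to UTC before binning means views logged with different offsets land in the same bucket; ties between hours are resolved arbitrarily by Counter.most_common.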

Edit (2015-03-23): More thorough screening for bot-like behavior (attempting to read robots.txt, having the substring ‘bot’ in User-Agent, etc.) has revealed the Donets’ka Oblast IP to most likely be non-human. That’s probably the safer choice these days. The access peak at 04:00 UTC is left unchanged by the screening.

Edit (2015-03-23): Somehow missed Baidu’s spider in previous screening rules. Now, with it excluded, the readership looks much less Chinese and the access peak shifts to a more believable 01:00 UTC.
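
For the curious, the screening rules described in these edits could be sketched roughly as below. The record layout (dicts with "ip", "key", and "user_agent" entries) and the function names are assumptions made for the example, and the "spider" marker is one guess at a rule that would catch a crawler whose User-Agent contains "spider" but not "bot". An IP is dropped entirely as soon as any one of its requests looks bot-like.

```python
# Substrings that mark a User-Agent as a crawler. "bot" comes from the
# rule described above; "spider" is an assumed addition.
BOT_MARKERS = ("bot", "spider")


def bot_ips(records):
    """Return the set of IPs that show bot-like behavior in any request."""
    flagged = set()
    for record in records:
        agent = record["user_agent"].lower()
        asked_for_robots = record["key"] == "robots.txt"
        if asked_for_robots or any(marker in agent for marker in BOT_MARKERS):
            flagged.add(record["ip"])
    return flagged


def human_records(records):
    """Drop every record whose IP was flagged by bot_ips."""
    records = list(records)
    flagged = bot_ips(records)
    return [record for record in records if record["ip"] not in flagged]
```

Flagging by IP rather than by individual request means that a client which politely reads robots.txt once has all of its other requests excluded too.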