mikepk.com about current projects contact

Web Server Logging Tip

April 28, 2013

One of my recent favorite techniques when using nginx is to modify the logging format to make it easier to work with. It's a simple thing, but it made me slap my forehead and say "Why have I not been doing it this way all along?" I'm using nginx for most of my projects these days, but obviously this will work with any web server that lets you modify the log format.

I've used the standard "Common Log Format" format for as long as I can remember. This means I've invariably ended up writing simple, one-off, custom parsers or looking for existing tools to extract interesting information from the web logs. This little bit of extra work has always irritated me whenever I've wanted a quick answer to some simple question that I can only get from the log (for whatever reason).

Now I use a custom web log format that includes tabs as field separators and removes the extra delimiters from the CLF. Like I said, simple, but makes my life just a little bit easier. I also include a few other fields (like response time information) that are available in the nginx log_format directive.

So in your nginx.conf (or virtual server conf), first you define a log format:

log_format timed_tab_combined \
 '$remote_addr\t$time_iso8601\t$request\t'
 '$status\t$http_referer\t$http_user_agent\t'
 '$body_bytes_sent\t$msec\t$request_time';

Then you specify that format in your access_log directive. If you want, you can also keep the common log format and write out two log files (although this is really unnecessary).

access_log  /var/log/nginx/custom-weblog.access.log timed_tab_combined;

Now if I ever need to check for a particular field, I can use the 'cut' command line tool to extract just the fields I care about. This makes visually scanning log data a lot easier without the extra noise.

I can get quick answers to simple questions like "How many times did Googlebot GET my '/scores' URL in the current log file? I 'cut' on fields 3 and 6, the request uri and the user agent respectively. Then a little grep-foo and a line count and voila, an answer.

$ sudo cut -f3,6 /var/log/nginx/custom-weblog.access.log | grep -P "^GET /scores" | grep "Googlebot" | wc -l
3117

Another example: whats the distribution of IP addresses in my current log file? (ip addresses in the example blanked for privacy).

sudo cut -f1 /var/log/nginx/custom-weblog.access.log | sort | uniq -c | sort -n | tail
   1439 108.xx.xx.xx
   1442 79.xx.xx.xx
   1442 92.xx.xx.xx
   1689 60.xx.xx.xx
   1746 65.xx.xx.xx
   2034 142.xx.xx.xx
   2141 66.xx.xx.xx
   2542 82.xx.xx.xx
   2697 182.xx.xx.xx
   8334 66.xx.xx.xx
I, for one, welcome our new robot overlords…

It also means if I have to write a one-off parser, it becomes trivially easy since it's just a tab-separated file. Since all of the information is still present, it's also super easy to convert this back into Common Log Format if you need to use existing, off-the-shelf, log parsers.

There are other ways to skin this particular cat, but the simplicity of this approach appeals to me.