Parsing nginx logs using awk
Out of the box, nginx writes its access log via the ngx_http_log_module, using the predefined combined format:
log_format combined '$remote_addr - $remote_user [$time_local] '
                    '"$request" $status $body_bytes_sent '
                    '"$http_referer" "$http_user_agent"';
The log_format directive supports many variables; combined uses the following:
- $remote_addr – client IP address
- $remote_user – username supplied with basic authentication, generally blank
- [$time_local] – server timestamp (local time)
- "$request" – HTTP request method, path and protocol
- $status – HTTP response status code
- $body_bytes_sent – size of the response body (in bytes)
- "$http_referer" – referring URL (if present)
- "$http_user_agent" – client user agent
Refer to the ngx_http_core_module documentation for more information.
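As an illustration, a (hypothetical) request logged in the combined format looks like this:

203.0.113.5 - - [10/Oct/2023:13:55:36 +1000] "GET /index.html HTTP/1.1" 200 2326 "-" "Mozilla/5.0"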
Using awk, we can then break each line of the log file into "columns". By default awk splits on whitespace, which lets us easily report on things such as HTTP response codes - the status is the ninth column ($9):
awk '{print $9}' access.log | sort | uniq -c | sort -rn
126570 200
4144 404
1620 304
1416 302
891 301
302 499
5 206
3 405
1 403
1 "-"
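The same report can also be produced in a single awk pass by counting statuses in an associative array - a sketch equivalent to the sort | uniq -c pipeline above:

awk '{ count[$9]++ } END { for (code in count) print count[code], code }' access.log | sort -rn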
From these response codes, we can see there are 4144 404 "Not Found" responses. To determine which URLs are returning this response, we can again use awk:
awk '($9 ~ /404/)' access.log | awk '{print $7}' | sort | uniq -c | sort -rn
3343 /wp-content/themes/eNurse/assets/en_background1.jpg
307 /grassblade-lrs/xAPI/statements
53 /eshop/clearance-products
38 /core/assets/frontend/layout/css/fancybox_overlay.png
24 /core/assets/frontend/layout/css/[email protected]
24 /core/assets/frontend/layout/css/[email protected]
13 /my-account/favicon.ico
11 /courses/favicon.ico
8 /learninghub/favicon.ico
7 /eshop/clearance-products?p=3
7 /apple-touch-icon.png
6 /professional-file/professional-file-enurse-cpd/favicon.ico
6 /blogsarticles/favicon.ico
6 /apple-touch-icon-precomposed.png
...
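The two awk invocations can also be collapsed into one - a stricter variant that uses an exact string comparison on the status column rather than a regular expression:

awk '$9 == "404" { print $7 }' access.log | sort | uniq -c | sort -rn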
To filter on other response codes, just change the pattern within the awk command. The example below would return all permanently redirected pages, based on a 301 "Moved Permanently" response:
awk '($9 ~ /301/)' access.log | awk '{print $7}' | sort | uniq -c | sort -rn
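Since the filter is just a regular expression, it can also match a whole class of responses - for example, anchoring on a leading 3 reports every redirect (301, 302, 304 and so on):

awk '($9 ~ /^3/)' access.log | awk '{print $7}' | sort | uniq -c | sort -rn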
Another neat way to analyse the logs is to determine the country of origin, by looking up client IP addresses with the geoip package:
awk '{ print $1 }' access.log | sort | uniq -c | sort -rn | head -n 25 | awk '{ printf("%5d\t%-15s\t", $1, $2); system("geoiplookup " $2 " | cut -d \\: -f2 ") }'
3213 174.129.242.xxx US, United States
1582 203.37.226.xxx AU, Australia
960 101.175.136.xxx AU, Australia
907 58.96.63.xxx AU, Australia
883 221.121.132.xxx AU, Australia
807 111.69.116.xxx NZ, New Zealand
733 118.208.154.xxx AU, Australia
684 1.156.134.xxx AU, Australia
665 1.41.98.xxx AU, Australia
657 106.69.55.xxx AU, Australia
649 58.7.198.xxx AU, Australia
620 121.212.60.xxx AU, Australia
614 211.31.50.xxx AU, Australia
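Building on this, the per-IP counts can be rolled up into a per-country summary. A sketch, assuming geoiplookup from the geoip package as above (note it counts unique client IPs rather than total requests, and runs one lookup per IP, so it can be slow on large logs):

awk '{ print $1 }' access.log | sort -u | while read ip; do geoiplookup "$ip" | cut -d: -f2; done | sort | uniq -c | sort -rn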