Tutorial (Part 2): Visualizing Apache Access Logs

In this part, we will:

  1. Parse and clean raw Apache logs into a Pandas dataframe
  2. Bundle requests that share the same source and target ("edge aggregation")
  3. Create different kinds of graph views of the same logs, each revealing different insights into the data

You can download this notebook to run it locally.

In [1]:
import pandas
import graphistry

try:
    from urllib.parse import unquote # Python 3
except ImportError:
    from urllib import unquote       # Python 2

graphistry.register(key='<email pygraphistry@graphistry.com to get one api key>')

Download+Parse Apache Logs to Create a Pandas Dataframe

Raw Apache logs are a bit tricky to parse:

  • The time field contains a space, so it gets split into two columns. We merge them back.
  • The cmd_path_proto field bundles the HTTP command, the path accessed, and the protocol version into a single column. We split it into three columns.

Sample raw data:

    - - [14/Feb/2015:01:56:03 -0800] "GET /robots.txt HTTP/1.0" 200 252 "-" "Mozilla/5.0 (compatible; MJ12bot/v1.4.5; http://www.majestic12.co.uk/bot.php?+)"
    - - [14/Feb/2015:01:56:10 -0800] "GET /honeypot//%22http://amunhoney.sourceforge.net//%22 HTTP/1.0" 404 284 "-" "Mozilla/5.0 (compatible; MJ12bot/v1.4.5; http://www.majestic12.co.uk/bot.php?+)"
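To see the timestamp problem concretely, here is a quick illustration (not part of the notebook itself) of what a plain space-separated split does to the bracketed date:

```python
# Illustrative only: a space-separated split breaks the timestamp in two.
line = '- - - [14/Feb/2015:01:56:03 -0800] "GET /robots.txt HTTP/1.0" 200 252'
parts = line.split(' ')

# After the host/identity/user fields, the timestamp spans two tokens:
#   parts[3] == '[14/Feb/2015:01:56:03'
#   parts[4] == '-0800]'

# Rejoining them, stripping the brackets, and dropping the timezone recovers
# a string that pandas.to_datetime can parse with '%d/%b/%Y:%H:%M:%S':
ts = (parts[3] + parts[4]).strip('[]').split('-')[0]
print(ts)  # 14/Feb/2015:01:56:03
```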
In [2]:
url = 'http://www.secrepo.com/self.logs/access.log.2015-02-14.gz'

def parseApacheLogs(url):
    fields = ['host', 'identity', 'user', 'time_part1', 'time_part2', 'cmd_path_proto', 
             'http_code', 'response_bytes', 'referer', 'user_agent', 'unknown']
    data = pandas.read_csv(url, compression='gzip', sep=' ', header=None, names=fields, na_values=['-'])

    # Pandas' parser mistakenly splits the date into two columns, so we must concatenate them
    time = data.time_part1 + data.time_part2
    time_trimmed = time.map(lambda s: s.strip('[]').split('-')[0]) # Drop the timezone for simplicity
    data['time'] = pandas.to_datetime(time_trimmed, format='%d/%b/%Y:%H:%M:%S')
    # Split column `cmd_path_proto` into three columns, and decode the URL (ex: '%20' => ' ')
    data['command'], data['path'], data['protocol'] = zip(*data['cmd_path_proto'].str.split().tolist())
    data['path'] = data['path'].map(lambda s: unquote(s))
    # Drop the fixed columns and any empty ones
    data1 = data.drop(['time_part1', 'time_part2', 'cmd_path_proto'], axis=1)
    return data1.dropna(axis=1, how='all')

logs = parseApacheLogs(url)
host http_code response_bytes referer user_agent time command path protocol
0 200 252 NaN Mozilla/5.0 (compatible; MJ12bot/v1.4.5; http:... 2015-02-14 01:56:03 GET /robots.txt HTTP/1.0
1 404 284 NaN Mozilla/5.0 (compatible; MJ12bot/v1.4.5; http:... 2015-02-14 01:56:10 GET /honeypot//"http://amunhoney.sourceforge.net//" HTTP/1.0
2 404 303 NaN Mozilla/5.0 (compatible; MJ12bot/v1.4.5; http:... 2015-02-14 01:56:15 GET /honeypot//"http://glastopf.org//" HTTP/1.0

Graph connecting Hosts to URLs: Simple Version

We create a host-to-path graph using both edge and node tables, as shown in tutorial part 1.

In [3]:
def host2pathGraph(logs):
    def getEdgeTable(logs):
        edges = logs.copy()
        # Color edges by HTTP result code
        http_code_to_color = {code: color for color, code in enumerate(edges['http_code'].unique())}
        edges['ecolor'] = edges['http_code'].map(lambda code: http_code_to_color[code])
        return edges
    def getNodeTable(edges):
        nodes0 = edges['host'].to_frame('nodeid')
        nodes0['pcolor'] = 96000
        nodes1 = edges['path'].to_frame('nodeid')
        nodes1['pcolor'] = 96001
        return pandas.concat([nodes0, nodes1], ignore_index=True).drop_duplicates()
    edges = getEdgeTable(logs)
    nodes = getNodeTable(edges)
    return (edges, nodes)

plotter = graphistry.bind(source='host', destination='path', node='nodeid', \
                          edge_color='ecolor', point_color='pcolor')

Graph connecting Hosts to URLs: Declutter via Edge Aggregation

To avoid crowding a graph with many edges between the same nodes, we are going to bundle multi-edges into one edge with added summary attributes. A multi-edge is a set of edges that share the same source and destination.

For each bundle of requests, we compute:

  • The earliest time
  • The latest time
  • The most frequent referer

The first two computations use Pandas' built-in min and max aggregators. Then, to extract the most frequent referer, we write our own custom aggregator: mostFrequent.

In [4]:
#Bundle edges into a Pandas group when they share the same attributes like 'host' and 'path'
grouped_logs = logs.groupby(['host', 'path', 'user_agent', 'command', 'protocol', 'http_code'])

# Make dataframes count, min_time, max_time, and referer that are indexed by the groupby keys.
count = grouped_logs.size().to_frame('count')
min_time = grouped_logs['time'].agg('min').to_frame('time (min)')
max_time = grouped_logs['time'].agg('max').to_frame('time (max)')

def mostFrequent(x):
    s = x.value_counts()
    return s.index[0] if len(s.index) > 0 else None
referer = grouped_logs['referer'].agg(mostFrequent)

# Join into one table based on the same groupby keys
# We remove the indexes (via reset_index) since we do not need them anymore.
summary = count.join([min_time, max_time, referer]).reset_index()
host path user_agent command protocol http_code count time (min) time (max) referer
0 ////bbs/skin/ggambo5100_board/setup.php Microsoft Internet Explorer/4.0b1 (Windows 95) POST HTTP/1.1 404 2 2015-02-14 12:41:55 2015-02-14 12:41:56 None
1 ////bbs/skin/ggambo5100_board/setup.php Mozilla/5.0 (Macintosh; U; PPC Mac OS X Mach-O... POST HTTP/1.1 404 10 2015-02-14 12:41:18 2015-02-14 12:48:05 None
2 ////bbs/skin/ggambo5100_board/setup.php Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB... POST HTTP/1.1 404 2 2015-02-14 12:54:55 2015-02-14 12:54:57 None
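A quick sanity check of the custom aggregator on a toy Series (illustrative): value_counts sorts by descending frequency and drops NaN, so index[0] is the most common non-null value, and the length guard handles all-null groups.

```python
import pandas

def mostFrequent(x):
    # value_counts() sorts by frequency (descending) and ignores NaN/None
    s = x.value_counts()
    return s.index[0] if len(s.index) > 0 else None

print(mostFrequent(pandas.Series(['a', 'b', 'a', None])))  # a
print(mostFrequent(pandas.Series([None, None])))           # None
```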

Plot. For an even cleaner view, try using a histogram filter in the visualization to show only nodes with a degree of 100 or less.

In [5]:
edges, nodes = host2pathGraph(summary)
plotter.plot(edges, nodes)
Switching Lenses: Another View of the Same Data

There are many ways to cast data into a graph. Each reveals different insights.

For an alternate view of the web logs, we can visualize how users browse from page to page.

In [6]:
def path2pathGraph(summary):
    host2path = summary[['host', 'path']].copy()
    host2path['path'] = host2path['path'].map(lambda p: p.split('?')[0])
    sessions = pandas.merge(host2path, host2path, on='host').drop_duplicates()

    host2color = {host: 265000 + index for index, host in enumerate(sessions.host.unique())}
    sessions['ecolor'] = sessions['host'].map(lambda x: host2color[x])
    return sessions

sessionEdges = path2pathGraph(summary)
host path_x path_y ecolor
0 ////bbs/skin/ggambo5100_board/setup.php ////bbs/skin/ggambo5100_board/setup.php 265000
15 ////bbs/skin/ggambo5100_board/setup.php ////bbs/skin/ggambo5100_board/write.php 265000
30 ////bbs/skin/ggambo5100_board/setup.php ////bbs/skin/ggambo6000_board/setup.php 265000
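The self-merge above is what turns each host's list of pages into page-to-page pairs; a toy example (illustrative, not from the tutorial):

```python
import pandas

df = pandas.DataFrame({'host': ['h1', 'h1', 'h2'],
                       'path': ['/a', '/b', '/c']})

# Merging a table with itself on 'host' pairs every path with every other
# path seen from the same host; the duplicated 'path' column is
# disambiguated as path_x / path_y.
pairs = pandas.merge(df, df, on='host')
print(len(pairs))             # 5 rows: 4 for h1 (2x2), 1 for h2
print(sorted(pairs.columns))  # ['host', 'path_x', 'path_y']
```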
In [7]:
graphistry.bind(source='path_x', destination='path_y', edge_color='ecolor').plot(sessionEdges)

Explore In-Tool for Deeper Insights

For example, you can quickly explore the browsing session of an individual host:

  • Click on an edge to open its label
  • On the host field, use the filter icon to filter on the edge's host value
  • Recluster the graph
  • Restart by opening the filters menu and disabling or deleting the generated host filter

Another View: Attacker Fingerprints

An attacker will often use multiple computers with similar malformed browser fingerprints.

Try filtering out Mozilla-based browsers with the following exclusion:

    point:__nodeid__ like "Mozilla%"
In [8]:
graphistry.bind(source='host', destination='user_agent').plot(summary)