Logfile Parsing

When parsing logfiles on a Linux machine, a handful of commands help to get the desired results, e.g., when searching for specific events in firewall logs.

In this post, I list a few standard parsing commands such as grep, sort, uniq, or wc. Furthermore, I present a few examples of these small tools. However, it’s all about trial and error when building large command pipes. ;)

Of course, the two most important building blocks are cat for displaying a complete text file on the screen (stdout), and the pipe | which is used after every call to forward the output to the next tool. For live viewing of log files, tail -f is used. Note that not all of the following tools can be used with this type of live viewing, e.g., sort, which needs the complete input before it can print anything. However, at least “grep” and “cut” can be used.

Filter, Replace, Omit, etc.

  • grep [-v] <text>: Prints only the lines that contain the specified value. With -v, it prints only the lines that do NOT contain the specified value. Example: cat file | grep 1234 or cat file | grep -v 5678 .
  • sort [-g] [-k <position>] [-r]: Sorts the input. -g sorts numbers by their numerical value. With -k the start and stop fields of the sort key can be set precisely. -r reverses the order. Example: Sort only by the 23rd field: cat file | sort -k 23,23 .
  • uniq [-f <number>] [-s <number>] [-w <number>] [-c]: Removes repeated adjacent lines, which is why the input is usually sorted first. -f skips fields, -s skips chars, -w compares only n chars. Example: Delete all lines that have the same value in the 5th field (= skip the first 4 fields) while only comparing the first 10 chars: cat file | uniq -f 4 -w 10 . -c can be used to print the number of occurrences for each line.
  • wc [-l]: Counts lines, words, and bytes. -l counts only the lines (the most common use).
  • comm [-1] [-2] [-3]: Compares two sorted files and prints three columns with entries only present in file 1, only in file 2, or in both. These columns can be suppressed with the -1, -2, and -3 switches. Example: Print the lines that are unique to file2: comm -13 file1 file2 .
  • tr -s ' ': The tool “translate” can be used for many things. One of my default cases is to squeeze double spaces in logfiles into single ones with tr -s ' ' . But it can be used for other use cases, such as converting uppercase letters to lowercase: tr '[:upper:]' '[:lower:]' , e.g., to make IPv6 addresses look alike.
  • cut -d ' ' -f <field>: Prints only the field specified with -f. The field separator must be set to the appropriate one, e.g., space or comma. Example: Print the 23rd field: cat file | cut -d ' ' -f 23 .
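  • awk '{print $<position>}': Basically the same as the aforementioned command, except that awk splits on runs of whitespace by default, so double spaces do not shift the field numbers. It displays only the selected field, such as: cat file | awk '{print $23}' .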
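  • tail -n +<number>: Starts the output at line n, i.e., omits the first n-1 lines. E.g., when a file starts with three comment lines that should be omitted: cat file | tail -n +4 . (With GNU head, head -n -<number> does the opposite and omits the last n lines.)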
  • sed 's/regexp/replacement/': Replaces the part of each line that matches the regex. The expression should be quoted so that the shell does not interpret its special characters. E.g., everything before the keyword “hello” (and the keyword itself) should be removed: cat file | sed 's/.*hello//' .
  • paste -d '<delimiter>' <file1> <file2>: Merges two files line by line, separated by the given delimiter. For example, if you have a list with hostnames and a second list with the corresponding IP addresses, and you want to merge them into one file.
  • dos2unix <file>, unix2dos <file>: In case you have to convert the line endings (CR LF vs. LF) from one OS to another. Sometimes this happens to me when working with Notepad++ on Windows and some tools on Linux at the same time.
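A few of the less obvious tools from this list can be tried out on throwaway data. The following is only a minimal sketch: all file names and contents are made up for illustration, and it assumes a shell with mktemp available.

```shell
#!/bin/sh
# Made-up sample data; file names and contents are purely illustrative.
workdir=$(mktemp -d)

printf '%s\n' '# comment 1' '# comment 2' '# comment 3' 'entry a' 'entry b' > "$workdir/with_header.log"
printf '%s\n' 'alpha' 'bravo' 'charlie' > "$workdir/file1"
printf '%s\n' 'bravo' 'charlie' 'delta' > "$workdir/file2"
printf '%s\n' 'host1' 'host2' > "$workdir/hostnames"
printf '%s\n' '192.0.2.1' '192.0.2.2' > "$workdir/addresses"

# tail -n +4 starts printing at line 4, i.e., it skips the three comment lines.
body=$(tail -n +4 "$workdir/with_header.log")

# comm expects sorted input; -13 hides columns 1 and 3,
# which leaves the lines that only exist in file2.
only_in_file2=$(comm -13 "$workdir/file1" "$workdir/file2")

# tr folds uppercase to lowercase so that IPv6 addresses look alike.
lowered=$(printf '%s\n' '2001:DB8::ABCD' | tr '[:upper:]' '[:lower:]')

# paste joins both files line by line, here separated by a single space.
merged=$(paste -d ' ' "$workdir/hostnames" "$workdir/addresses")

printf '%s\n' "$body" "$only_in_file2" "$lowered" "$merged"
rm -rf "$workdir"
```

Each result is stored in a variable first so that the individual steps can be inspected one by one.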

A few Examples

Here are a few examples out of my daily business. Let’s grep through some firewall logs. The raw log format looks like the following:

Connections from Host X to Whom?

Let’s say I want to know how many destination IPs appeared in a certain policy rule. The relevant policy-id in my log files is “219” (grep id=219). To avoid problems with double spaces, I squeeze them into single spaces (tr -s ' '). The destination IP address field is the 23rd field in the log entries, and I only want to see this field. Since the delimiter in the log file is space, I have to set it to ' ' (cut -d ' ' -f 23). Finally, I sort this list (sort) and filter multiple entries (uniq). Here is the result:
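The pipe described above can be sketched as follows. This is not the original log: the helper make_entry, the “dst=” label, the filler fields, and all IP addresses are assumptions that merely place a destination in the 23rd field of a made-up entry.

```shell
#!/bin/sh
# Hypothetical helper: builds a made-up log entry with the policy id in
# field 1 and the destination in field 23. Not a real firewall format.
make_entry() {  # $1 = policy id, $2 = destination IP
    printf 'id=%s  src=192.0.2.10' "$1"   # two spaces on purpose
    i=3
    while [ "$i" -le 22 ]; do printf ' f%s' "$i"; i=$((i + 1)); done
    printf ' dst=%s\n' "$2"
}

log=$(mktemp)
{
    make_entry 219 198.51.100.7
    make_entry 219 198.51.100.9
    make_entry 219 198.51.100.7
    make_entry 300 203.0.113.5
} > "$log"

# Filter the rule, squeeze spaces, keep field 23, sort, drop duplicates.
destinations=$(cat "$log" | grep id=219 | tr -s ' ' | cut -d ' ' -f 23 | sort | uniq)

printf '%s\n' "$destinations"
# Appending wc -l answers the "how many" question.
printf '%s\n' "$destinations" | wc -l
rm -f "$log"
```

Without the tr -s ' ' step, the double space in the sample entries would create an empty field, and cut would return the wrong column.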

If I want to have the whole log entry lines (and not only the IP addresses), I can sort by the 23rd field only (sort -k 23,23) and use uniq on the 23rd field (= skip the first 22 fields) while only comparing the following 20 chars (uniq -f 22 -w 20). This is the result:
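A minimal sketch of this variant, again on made-up entries (the helper make_entry and all field contents are assumptions). The sort key is restricted to field 23 with -k 23,23, and the -w option assumes GNU uniq.

```shell
#!/bin/sh
# Hypothetical helper: made-up entry with the destination in field 23.
make_entry() {  # $1 = source IP, $2 = destination IP
    printf 'id=219 src=%s' "$1"
    i=3
    while [ "$i" -le 22 ]; do printf ' f%s' "$i"; i=$((i + 1)); done
    printf ' dst=%s\n' "$2"
}

log=$(mktemp)
{
    make_entry 192.0.2.10 198.51.100.9
    make_entry 192.0.2.11 198.51.100.7
    make_entry 192.0.2.12 198.51.100.9
} > "$log"

# Sort whole lines by field 23, then keep one line per destination:
# uniq skips 22 fields and compares at most the next 20 characters.
entries=$(sort -k 23,23 "$log" | uniq -f 22 -w 20)

printf '%s\n' "$entries"
rm -f "$log"
```

Only one complete entry per destination survives, so the three sample entries shrink to two lines.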


Count of Connections from Host Y

Another example is the count of connections from host Y, grouped by destination. The starting point is the source IP address (grep src=<IP address>). Double spaces should be removed (tr -s ' '). Only the destination IP address is relevant, which is the 23rd field (cut -d ' ' -f 23). The output is sorted (sort) and counted per unique entry (uniq -c). To have the counters sorted by their numerical value in descending order, another sort (sort -g -r) is used. This is it:
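As a rough sketch on made-up data, the counting pipe could look like this. The helper make_entry, the field labels, and the addresses are assumptions. Note one deviation from the plain grep described above: grep -F with a trailing space is used so that 192.0.2.10 does not also match 192.0.2.100.

```shell
#!/bin/sh
# Hypothetical helper: made-up entry with the destination in field 23.
make_entry() {  # $1 = source IP, $2 = destination IP
    printf 'id=219  src=%s' "$1"          # two spaces on purpose
    i=3
    while [ "$i" -le 22 ]; do printf ' f%s' "$i"; i=$((i + 1)); done
    printf ' dst=%s\n' "$2"
}

log=$(mktemp)
{
    make_entry 192.0.2.10  198.51.100.7
    make_entry 192.0.2.10  198.51.100.9
    make_entry 192.0.2.100 203.0.113.5
    make_entry 192.0.2.10  198.51.100.7
    make_entry 192.0.2.10  198.51.100.7
} > "$log"

# -F treats the pattern as a fixed string (the dots are literal);
# the trailing space keeps 192.0.2.100 out of the result.
counts=$(grep -F 'src=192.0.2.10 ' "$log" | tr -s ' ' | cut -d ' ' -f 23 \
    | sort | uniq -c | sort -g -r)

printf '%s\n' "$counts"
rm -f "$log"
```

The destination with the most connections ends up in the first line, because sort -g -r orders the counters numerically and in reverse.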


Summary of Session-End Reasons

Grep every log entry that has the keyword “reason” in it (grep reason), then strip each line up to the last field, which is the reason entry. This is done via a regex that is replaced by nothing (sed 's/.*reason.//'). Finally, as in the examples above, the output is sorted, the unique entries are counted, and the counts are sorted. Here it is:
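The steps above can be sketched like this. The sample lines are invented, including the reason values tcp-fin and aged-out, and are only meant to show the mechanics.

```shell
#!/bin/sh
# Made-up log lines; the last field carries a hypothetical session-end reason.
log=$(mktemp)
printf '%s\n' \
    'id=219 src=192.0.2.10 dst=198.51.100.7 reason=tcp-fin' \
    'id=219 src=192.0.2.11 dst=198.51.100.9 reason=aged-out' \
    'id=219 src=192.0.2.12 dst=198.51.100.7 reason=tcp-fin' \
    'id=219 src=192.0.2.13 dst=198.51.100.7 action=drop' > "$log"

# Keep lines with a reason, strip everything up to and including "reason="
# (the trailing "." matches the "="), then count and rank the reasons.
summary=$(grep reason "$log" | sed 's/.*reason.//' | sort | uniq -c | sort -g -r)

printf '%s\n' "$summary"
rm -f "$log"
```

The entry without a reason field is dropped by grep before sed ever sees it.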

Display Filter with Regex

Here is another example on how to “improve” a logfile output with sed in order to have a better view on it. The following output is from tcpdump sniffing on a network for ICMPv6 DAD messages.

I only want to see the timestamps along with the MAC and IPv6 addresses. That is, I want to throw away all other words and symbols from this output. This can be done with sed s/regexp/replacement/ , called with a regex and an empty replacement. In my example, I want to remove everything between the > sign and the “has” keyword. The regex for this is >.*has. which means: beginning with >, followed by anything (“.*”) until “has”, followed by a single character (“.”). On the command line, the > must be escaped or the whole expression quoted so that the shell does not treat it as a redirection. With a second run I remove everything from the comma to the end of the line:
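Both replacements can be tried on a single line. The input below is hand-written and only loosely modelled on tcpdump output with link-level headers, so the exact layout is an assumption. The second expression, s/,.*// , is my reading of “everything from the comma to the end”.

```shell
#!/bin/sh
# Hand-written sample line, loosely modelled on tcpdump output (assumption).
line='12:34:56.789012 00:11:22:33:44:55 > 33:33:ff:00:00:01, ethertype IPv6 (0x86dd), length 86: :: > ff02::1:ff00:1: ICMP6, neighbor solicitation, who has 2001:db8::1, length 32'

# First expression: drop everything from the first ">" up to "has" plus one char.
# Second expression: drop everything from the first remaining comma onwards.
# Quoting the expressions keeps the shell from treating ">" as a redirection.
trimmed=$(printf '%s\n' "$line" | sed -e 's/>.*has.//' -e 's/,.*//')

printf '%s\n' "$trimmed"
```

Passing both expressions with -e to one sed call is equivalent to piping the output through sed twice.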

That’s it. ;)

Featured image: “Needle in the haystack” by Gregor Gruber is licensed under CC BY-NC-ND 2.0.

One thought on “Logfile Parsing”

  1. Nice summary. Here’s another example, this is a firewall log. > X.X.X.X.22: S 1621549071:1621549071(0) win 26883 (DF)

    I want to pull out the unique source IPs (sorted on all 4 octets, but not on the port number).

    tcpdump -ntr /var/log/pflog tcp | awk '{print $1}' | sort -t. -k1,1n -k2,2n -k3,3n -k4,4n | uniq


    Disclaimer: IPs are used as an example, not from an actual log file.
