Logfile Parsing

When parsing logfiles on a Linux machine, a handful of commands help to get the desired results, e.g., when searching for specific events in firewall logs.

In this post, I list a few standard parsing commands such as grep, sort, uniq, or wc, and present a few examples of these small tools. However, it’s all about trial and error when building large command pipes. ;)

Of course, the two most important building blocks are cat, which prints a complete text file to the screen (stdout), and the pipe |, which is placed after every call to forward the output to the next tool. For a live view of log files, tail -f is used. Note that not all of the following tools work with this kind of live view, e.g., sort, which has to read its complete input before it can print anything. However, at least grep and cut can be used.
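As a minimal sketch of such a live view, the function below filters a stream for one policy ID. The function name watch_policy and the log path in the usage comment are assumptions for illustration only. --line-buffered is a GNU grep option that flushes the output after every line, so that later pipe stages see each match immediately.

```shell
# Hypothetical helper: keep only the lines of one firewall policy.
# It reads stdin, so it can sit behind "tail -f".
watch_policy() {
  # The trailing space stops policy_id=219 from also matching policy_id=2190.
  grep --line-buffered "policy_id=$1 "
}

# Live usage (the path is an assumption):
#   tail -f /var/log/firewall.log | watch_policy 219
```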

Filter, Replace, Omit, etc.

grep [-v] <text>: Prints only the lines that contain the specified value. When using it with -v, it prints only the lines that do NOT have the specified value. Example: cat file | grep 1234 or cat file | grep -v 5678 .
sort [-g] [-k <position>] [-r]: Sorts the input. -g compares numbers by their numerical value instead of character by character. -k restricts the sort key to a start and stop field. -r reverses the order. Example: Sort only by the 23rd field: cat file | sort -k 23,23 .
uniq [-f <number>] [-s <number>] [-w <number>] [-c]: Removes repeated lines. Only adjacent duplicates are detected, so the input is usually sorted first. -f skips the first n fields, -s skips the first n chars, -w compares only n chars. Example: Remove all lines that have the same value in the 5th field (= skip the first 4 fields) while only comparing the following 10 chars: cat file | uniq -f 4 -w 10 . -c prefixes each line with the number of its occurrences.
wc -l: wc counts lines, words, and bytes. -l prints only the line count (mostly used).
comm [-1] [-2] [-3]: Compares two sorted files and prints three columns: lines only present in file 1, lines only present in file 2, and lines present in both. Each column can be suppressed with the corresponding switch -1, -2, or -3. Example: Print the lines that are unique to file2: comm -13 file1 file2 .
tr -s ' ': The tool “translate” can be used for many things. One of my default cases is to squeeze repeated spaces in logfiles into a single one with tr -s ' ' . But it can be used for other use cases, such as replacing uppercase letters with lowercase ones: tr '[:upper:]' '[:lower:]', e.g., to have IPv6 addresses look alike.
cut -d ' ' -f <field>: Prints only the field specified with -f. The field separator must be set to the one your file uses, e.g., space or comma. Example: Print the 23rd field: cat file | cut -d ' ' -f 23 .
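awk '{print $<position>}': Basically the same as the aforementioned command, except that awk splits on any run of whitespace by default, so double spaces do not shift the field numbers. It displays only the selected field, such as: cat file | awk '{print $23}' .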
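tail -n +<number>: Prints the input starting with line <number>, i.e., omits the first <number>-1 lines. E.g., when a file starts with three comment lines that should be omitted: cat file | tail -n +4 . (The similar-looking head -n -<number> omits the last n lines instead.)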
sed 's/regexp/replacement/': Replaces the part of each line that matches the regex. E.g., everything before the keyword “hello” (and the keyword itself) should be removed: cat file | sed 's/.*hello//' .
paste -d '<delimiter>' <file1> <file2>: Merges two files line by line, with -d setting the delimiter placed between the two parts (a tab by default). For example, if you have a list of hostnames and a second list with the corresponding IP addresses, and you want to merge them into one file.
dos2unix <file>, unix2dos <file>: Converts the line endings (CR LF vs. LF) from one OS convention to the other. Sometimes I need this when working with Notepad++ on Windows and some tools on Linux at the same time.
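To see several of these tools working together, here is a small sketch on made-up data. The sample entries, host names, and variable names are assumptions for illustration and do not come from a real log.

```shell
# Made-up sample: three comment lines followed by four entries.
log=$(mktemp)
printf '%s\n' '# c1' '# c2' '# c3' \
  'host2  DENY' 'host1 PERMIT' 'host2  DENY' 'host1 permit' > "$log"

# Skip the 3 comment lines, squeeze repeated spaces, lowercase everything,
# then sort by the 1st field only and count the duplicates.
summary=$(tail -n +4 "$log" | tr -s ' ' | tr '[:upper:]' '[:lower:]' \
  | sort -k 1,1 | uniq -c)
printf '%s\n' "$summary"

# comm needs sorted input; -13 leaves only the lines unique to the 2nd file.
a=$(mktemp); b=$(mktemp)
printf '%s\n' 10.0.0.1 10.0.0.2 > "$a"
printf '%s\n' 10.0.0.2 10.0.0.3 > "$b"
only_b=$(comm -13 "$a" "$b")
printf '%s\n' "$only_b"

rm -f "$log" "$a" "$b"
```

Without the two tr calls, uniq would treat "host1 PERMIT" and "host1 permit" as different lines, which is why normalizing comes before sorting and counting.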

A few Examples

Here are a few examples from my daily business. Let’s grep through some firewall logs. The raw log format looks like the following:

Jan 1 23:59:58 172.16.1.1 fd-wv-fw01: NetScreen device_id=fd-wv-fw01 [Root]system-notification-00257(traffic): start_time="2015-01-01 23:59:55" duration=3 policy_id=206 service=dns proto=17 src zone=Trust dst zone=Untrust action=Permit sent=93 rcvd=132 src=2003:51:6012:123:c24a:ff:fe09:5346 dst=2001:500:1::803f:235 src_port=56854 dst_port=53 src-xlated ip=2003:51:6012:123:c24a:ff:fe09:5346 port=56854 dst-xlated ip=2001:500:1::803f:235 port=53 session_id=3883 reason=Close - RESP
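To find the field number needed for cut or sort -k, it can help to print every field of one log line together with its position. The following is a minimal sketch; the helper name number_fields is hypothetical, and the sample line is the log entry above, shortened to its first 23 fields.

```shell
# The sample log entry from above, cut off after the dst= field.
line='Jan 1 23:59:58 172.16.1.1 fd-wv-fw01: NetScreen device_id=fd-wv-fw01 [Root]system-notification-00257(traffic): start_time="2015-01-01 23:59:55" duration=3 policy_id=206 service=dns proto=17 src zone=Trust dst zone=Untrust action=Permit sent=93 rcvd=132 src=2003:51:6012:123:c24a:ff:fe09:5346 dst=2001:500:1::803f:235'

# Hypothetical helper: print each whitespace-separated field with its index.
number_fields() {
  awk '{ for (i = 1; i <= NF; i++) print i ": " $i }'
}

# Show only the destination address field and its position.
printf '%s\n' "$line" | number_fields | grep 'dst='
```

For this line, the destination address shows up as field 23. Note that the timestamp inside start_time="..." contains a space and therefore counts as two fields.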

Connections from Host X to Whom?

Let’s say I want to know which destination IPs appeared in a certain policy rule. The relevant policy ID in my log files is “219” (grep id=219). To avoid problems with double spaces, I squeeze them into single ones (tr -s ' '). The destination IP address is the 23rd field in the log entries, and it is the only one I want to see. Since the delimiter in the log file is a space, I have to set it to ' ' (cut -d ' ' -f 23). Finally, I sort this list (sort) and remove duplicate entries (uniq). Here is the result:

dst=77.0.74.170
dst=77.0.77.111
dst=79.204.238.115
dst=93.220.253.102
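Put together, the steps described above can be sketched as a single pipeline. The function name unique_dsts is hypothetical; the file name in the usage comment is the one used elsewhere in this post.

```shell
# Hypothetical helper: list each destination seen for one policy ID once.
# Reads log lines on stdin.
unique_dsts() {
  # "id=219" matches policy_id=219; note that it would also match other
  # fields ending in "id=219...", such as a session_id starting with 219.
  grep "id=$1" | tr -s ' ' | cut -d ' ' -f 23 | sort | uniq
}

# Usage:
#   cat 2015-01-01.fd-wv-fw01.log | unique_dsts 219
#   cat 2015-01-01.fd-wv-fw01.log | unique_dsts 219 | wc -l   # how many
```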

If I want to have the whole log entry lines (and not only the IP addresses), I can sort on the 23rd field (sort -k 23,24, a key that runs from field 23 through field 24) and apply uniq to the 23rd field (= skip the first 22 fields) while only comparing the following 20 chars (uniq -f 22 -w 20). This is the result:

weberjoh@jw-nb10:~$ cat 2015-01-01.fd-wv-fw01.log | grep id=219 | tr -s ' ' | sort -k 23,24 | uniq -f 22 -w 20

Jan 1 02:17:11 172.16.1.1 fd-wv-fw01: NetScreen device_id=fd-wv-fw01 [Root]system-notification-00257(traffic): start_time="2014-12-31 04:56:53" duration=76818 policy_id=219 service=tcp/port:30005 proto=6 src zone=DMZ dst zone=Untrust2 action=Permit sent=15235078 rcvd=130943813 src=192.168.110.12 dst=77.0.74.170 src_port=49913 dst_port=30005 src-xlated ip=10.49.254.5 port=2364 dst-xlated ip=77.0.74.170 port=30005 session_id=4296 reason=Close - TCP RST

Jan 1 05:53:02 172.16.1.1 fd-wv-fw01: NetScreen device_id=fd-wv-fw01 [Root]system-notification-00257(traffic): start_time="2015-01-01 04:55:25" duration=3457 policy_id=219 service=tcp/port:30005 proto=6 src zone=DMZ dst zone=Untrust2 action=Permit sent=386518 rcvd=1532970 src=192.168.110.12 dst=77.0.77.111 src_port=50279 dst_port=30005 src-xlated ip=10.49.254.5 port=1701 dst-xlated ip=77.0.77.111 port=30005 session_id=7535 reason=Close - TCP RST

Jan 1 04:36:29 172.16.1.1 fd-wv-fw01: NetScreen device_id=fd-wv-fw01 [Root]system-notification-00257(traffic): start_time="2014-12-31 05:54:15" duration=81734 policy_id=219 service=tcp/port:30005 proto=6 src zone=DMZ dst zone=Untrust2 action=Permit sent=18559326 rcvd=63638696 src=192.168.110.12 dst=79.204.238.115 src_port=49925 dst_port=30005 src-xlated ip=10.49.254.5 port=2721 dst-xlated ip=79.204.238.115 port=30005 session_id=4147 reason=Close - TCP RST

Jan 1 05:53:04 172.16.1.1 fd-wv-fw01: NetScreen device_id=fd-wv-fw01 [Root]system-notification-00257(traffic): start_time="2014-12-31 05:54:18" duration=86326 policy_id=219 service=tcp/port:30005 proto=6 src zone=DMZ dst zone=Untrust2 action=Permit sent=24870176 rcvd=276776662 src=192.168.110.12 dst=93.220.253.102 src_port=49926 dst_port=30005 src-xlated ip=10.49.254.5 port=1858 dst-xlated ip=93.220.253.102 port=30005 session_id=4483 reason=Close - TCP RST
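As an alternative to the sort and uniq combination (a different technique, not the one used above), awk can print the first complete line for each distinct value of a field. This is a sketch under the assumption that the destination address is always field 23. The function name first_line_per_dst is hypothetical.

```shell
# Hypothetical alternative: print the first full log line seen for each
# distinct value of field 23 (the dst= field), without sorting first.
# awk splits on runs of whitespace, so no "tr -s" is needed here.
first_line_per_dst() {
  awk '!seen[$23]++'
}

# Usage:
#   grep id=219 2015-01-01.fd-wv-fw01.log | first_line_per_dst
```

The lines come out in order of first appearance rather than sorted by destination, and the whole field is compared instead of a fixed number of chars.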

Count of Connections from Host Y

Another example is the number of connections from host Y, grouped by destination. The starting point is the source IP address (grep src=192.168.113.11). Double spaces are squeezed (tr -s ' '). Only the destination IP address is relevant, which is the 23rd field (cut -d ' ' -f 23). The output is sorted (sort) and counted per unique entry (uniq -c). To have the counters sorted by their numerical value in descending order, another sort (sort -g -r) is used. This is it:

209319 dst=8.8.8.8
2851 dst=88.198.52.243
230 dst=198.20.8.241
209 dst=224.0.0.251
159 dst=198.20.8.246
102 dst=192.168.5.1
50 dst=93.184.221.109
11 dst=172.16.1.5
9 dst=91.189.92.152
5 dst=91.189.95.36
4 dst=141.30.13.10
3 dst=192.168.9.6
2 dst=218.2.0.123
2 dst=103.41.124.53
1 dst=78.223.8.102
1 dst=77.0.138.150
1 dst=61.174.50.229
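The counting steps described above can be sketched as one pipeline. The function name count_dsts is hypothetical, and the trailing space in the grep pattern is a small addition of mine so that 192.168.113.11 does not also match 192.168.113.110.

```shell
# Hypothetical helper: count connections per destination for one source IP,
# most frequent destination first. Reads log lines on stdin.
count_dsts() {
  grep "src=$1 " | tr -s ' ' | cut -d ' ' -f 23 | sort | uniq -c | sort -g -r
}

# Usage:
#   cat 2015-01-01.fd-wv-fw01.log | count_dsts 192.168.113.11
```

The first sort is required because uniq -c only merges adjacent lines; the second sort then orders the counters themselves.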

Summary of Session-End Reasons

Grep every log entry that contains the keyword “reason” (grep reason), then strip everything up to the last field, which is the reason entry. This is done with a regex whose match is replaced by nothing (sed s/.*reason.//). Finally, as in the examples above, the output is sorted, the unique entries are counted, and the counts are sorted. Here it is:

311970 Close - RESP
219406 Close - AGE OUT
69236 Traffic Denied
56179 Close - TCP FIN
3621 Close - ICMP Unreach
2968 Close - TCP RST
191 Creation
34 Close - ALG
24 Close - OTHER
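The steps described above can be sketched as the following pipeline. The function name count_reasons is hypothetical. sed is used instead of cut because the reason text itself contains spaces.

```shell
# Hypothetical helper: tally the session-end reasons, most frequent first.
# Reads log lines on stdin.
count_reasons() {
  # ".*reason." removes everything up to and including "reason=".
  grep reason | sed 's/.*reason.//' | sort | uniq -c | sort -g -r
}

# Usage:
#   cat 2015-01-01.fd-wv-fw01.log | count_reasons
```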

Display Filter with Regex

Here is another example of how to “improve” a logfile output with sed in order to get a better view of it. The following output is from tcpdump sniffing on a network for ICMPv6 DAD messages.

16:41:24.392554 90:27:e4:35:38:a8 > 33:33:ff:87:cb:e9, ethertype IPv6 (0x86dd), length 78: :: > ff02::1:ff87:cbe9: ICMP6, neighbor solicitation, who has fe80::1441:9488:9187:cbe9, length 24

16:43:33.904282 00:26:08:b2:ad:78 > 33:33:ff:b2:ad:78, ethertype IPv6 (0x86dd), length 78: :: > ff02::1:ffb2:ad78: ICMP6, neighbor solicitation, who has fe80::226:8ff:feb2:ad78, length 24

16:53:55.789861 90:27:e4:35:38:a8 > 33:33:ff:87:cb:e9, ethertype IPv6 (0x86dd), length 78: :: > ff02::1:ff87:cbe9: ICMP6, neighbor solicitation, who has fe80::1441:9488:9187:cbe9, length 24

16:54:08.964875 a0:0b:ba:b6:d8:2e > 33:33:ff:b6:d8:2e, ethertype IPv6 (0x86dd), length 78: :: > ff02::1:ffb6:d82e: ICMP6, neighbor solicitation, who has fe80::a20b:baff:feb6:d82e, length 24

16:55:01.020645 90:27:e4:35:38:a8 > 33:33:ff:87:cb:e9, ethertype IPv6 (0x86dd), length 78: :: > ff02::1:ff87:cbe9: ICMP6, neighbor solicitation, who has fe80::1441:9488:9187:cbe9, length 24

I only want to see the timestamps along with the MAC and IPv6 addresses. That is, I want to throw away all other words and symbols from this output. This can be done with sed s/regexp/replacement/ which is called with a regex and an empty replacement. In my example, I want to remove everything between the first > sign and the “has” keyword. The regex for this is >.*has. which means: beginning with > (which is escaped so that the shell does not treat it as a redirection), followed by anything “.*” until “has”, followed by a single character “.”. With a second run I remove everything from the first comma to the end of the line:

weberjoh@jw-nb09:~$ cat test | sed s/\>.*has.// | sed s/,.*//

16:41:24.392554 90:27:e4:35:38:a8 fe80::1441:9488:9187:cbe9
16:43:33.904282 00:26:08:b2:ad:78 fe80::226:8ff:feb2:ad78
16:53:55.789861 90:27:e4:35:38:a8 fe80::1441:9488:9187:cbe9
16:54:08.964875 a0:0b:ba:b6:d8:2e fe80::a20b:baff:feb6:d82e
16:55:01.020645 90:27:e4:35:38:a8 fe80::1441:9488:9187:cbe9
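The same two substitutions can also run in a single sed call. The following is a sketch; the function name dad_summary is hypothetical. Putting each expression in single quotes means that the > needs no backslash, since the shell leaves quoted text alone.

```shell
# Hypothetical helper: reduce tcpdump DAD lines to time, MAC, and IPv6 address.
# 1st expression: drop everything from the first ">" through "has" plus the
#                 one character after it.
# 2nd expression: drop everything from the first remaining comma to the end.
dad_summary() {
  sed -e 's/>.*has.//' -e 's/,.*//'
}

# Usage:
#   cat test | dad_summary
```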

That’s it. ;)

Featured image: “Needle in the haystack” by Gregor Gruber is licensed under CC BY-NC-ND 2.0.

One thought on “Logfile Parsing”

Dragan Jovicic says:

2017-02-28 at 16:27

Nice summary. Here’s another example, this is a firewall log.

54.213.154.7.58462 > X.X.X.X.22: S 1621549071:1621549071(0) win 26883 (DF)

I want to pull out unique source IPs (sorted on all 4 octets, but not on port number).

tcpdump -ntr /var/log/pflog tcp | awk '{print $1}' | sort -t. -k1,1n -k2,2n -k3,3n -k4,4n | uniq

Result:

94.200.161.222.42082
95.9.28.245.60443
95.32.206.59.18985
95.174.121.23.7266
95.190.145.13.5125
97.83.108.216.29141
101.51.255.184.7634
101.108.145.106.44364
103.37.161.202.44739
103.53.52.121.61776
103.77.4.198.61898
103.197.105.82.65365
103.207.47.214.49517
103.208.235.82.15378
103.232.64.40.46758
103.235.179.67.60577
104.128.101.21.4485
104.159.224.6.9971

Disclaimer: IPs are used as an example, not from an actual log file.
@routerdragon
