As part of an ongoing contract, there was a need to parse and specifically classify traffic from multiple common log format files. As the rest of the stack in play was using Perl, the utility written to do so was also written in Perl. Yes, I know its 2020.
For a bit more context, the classification was being done for two rules: Is this client a trusted proxy? Is this client a possible search bot?
The first implementation used the Net::Subnet module to match subnets generated from ASNs via Prefix Whois and formatted as HAProxy Map Files. This worked as intended, but sometimes took over 5 minutes to crunch through an hour’s worth of logs. As traffic ramped up, it would often would exceed the fail-safe runtime timeout of 5 minutes, and not produce any output.
This was a fairly simple script, few hundred lines of perl code. What could possibly be taking that long to crunch through fairly simple data? Time to break out the profiler!
Since this is a Perl program, the goto profiler is NYTProf. It sidecars into Perl execution simple by a command line switch, perl -d:NYTProf
, but introduces significant overhead in the wallclock runtime. No bother, reduce the window of logs processed and go make another pot of coffee.
After crunching the data, this stood out in the results:
999081 1.94s line Net/Subnet.pm
522269 284ms line Date/Parse.pm (including 1 string eval)
157478 200ms line [...]report.pl
130619 48.3ms line Time/Local.pm
so almost 2 seconds spent in Net/Subnet
, which was 95% of the entire runtime. Digging deeper, this was an innoculus anon subroutine/closure:
return sub { ((inet_aton(shift) // return !1) & $mask) eq $masked_net };
After sleeping on this, figured there was no real good way to optimise this out. Alternative methods considered were building an in memory sqlite3 database with start/end IPs for the needed matches, as well as abusing Perl dictionaries and binary operations, but both of those were quite ugly. Pinged a few of my cohort on IRC, and it was suggested to use Prefix/Radix trees.. (Thanks LH!) Now, there is a subclass of Radix trees called ‘Patricia Trie’, which are extremely well suited to doing IP subnet matching.
Sometimes finding what you need is just matter of using the proper terms. Enter Net::Patricia. This was a minor conversion from Net::Subnet, requiring two helpers:
my $ip_match4 = new Net::Patricia AF_INET;
my $ip_match6 = new Net::Patricia AF_INET6;
sub add_ip {
my $ip = shift or confess();
my $type = shift or confess();
my $lookup = $ip_match4;
if ( $ip =~ /:/ ){
$lookup = $ip_match6;
}
my $match = $lookup -> match_string($ip);
$type = $type | $match if defined $match; # combine values if there is already a matching value
$lookup -> add_string($ip,$type);
}
sub match_ip {
my $ip = shift or confess();
my $type = shift or confess();
my $lookup = $ip_match4;
if ( $ip =~ /:/ ){
$lookup = $ip_match6;
}
my $match = $lookup -> match_string($ip);
return 0 if not defined $match;
return $match & $type;
}
Each target type is defined as a bitmask. Granted this is a very quick and dirty check between IPV4 and IPV6, but it gets the job done.
The result?
522269 294ms line Date/Parse.pm (including 1 string eval)
269671 233ms line [...]report.pl
224702 130ms line Net/Patricia.pm (including 1 string eval)
130619 44.3ms line Time/Local.pm
A script that took over 5 minutes on the full data set, reduced to 1.5 seconds wallclock. While premature optimization is the root of all evil, if the profiler says it’s bad, it likely is.