Michael Blum

Developer from Chicago

Mapping IP addresses - SQL vs Redis


Setup

Indexing 3+ million IP addresses

In the next few blog posts I’m going to go into detail on creating an IP mapping service: IPbot. One of the first challanges was being able to efficiently query three million IP addresses and their associated metadata.

Using MaxMind’s free GeoLite data:

GeoLite2 IP databases

I got a set of some three million addresses and their associated metadata.

My first approach used SQLite for its ease of set up and ingestion but I discovered it took quite a long time to return a single record. Normally SQLite fits the bill perfectly - local, resolves to a single file, lightweight, etc.

I wanted to speed this up, a half-second to a full second seemed pretty long for making a local DB query. Searching for an in-memory solution, Redis looked like a good candidate.

For the use case of joining an IP address with its geographic place, a relational JOIN seems the easy way to go.

I found a nice solution for relating an IP addresses in Redis, a noSQL engine:

store ip ranges in Redis

Let’s see how much faster this approach is over the naive SQLite approach.

SQLite

It took quite a while to index the CSV data into SQLite - in the neighborhood of 15 minutes. Querying for 8.8.8.8 three different times vs Redis:

SQLite Redis
9.425 0.004
5.637 0.004
0.477 0.004

{:.tablestyle}

It seems like it takes a sec for SQLite to warm up - lets see if that’s consistant when we run 1,000 queries.

Redis

We query first for the network / CIDR with the numeric representation of the IP address we’re looking for (lets find teh home of Google’s DNS server @ 8.8.8.8):

from netaddr import IPNetwork
>>> int(IPNetwork('8.8.8.8').ip)

To find the network an IP belongs to we break down an IP address into it’s network and broadcast addresses (python’s netaddr is very useful for this):

# index cidr metadata by their cidrs as the key
network = int(IPNetwork('8.8.0.0/18')[0])
broadcast = int(IPNetwork('8.8.0.0/18')[-1])

Index the metadata for a CIDR (notation for designating an IP range):

HMSET cidr:8.8.0.0/18 broadcast 134758399 network 134742016 lat 37.7510 long -97.8220 geoid 6252001

then create an index that maps a network’s broadcast address to it’s CIDR:

ZADD cidr:index 134758399 8.8.0.0/18

I did the same for the location metadata - creating an HMSET based on the geoid.

To query for an address we simpily call:

127.0.0.1:6379> ZRANGEBYSCORE cidr:index 134744072 +inf limit 0 1
1) "8.8.0.0/18"

127.0.0.1:6379> HGETALL cidr:8.8.0.0/18
 1) "broadcast"
 2) "134758399"
 3) "lat"
 4) "37.7510"
 5) "geoid"
 6) "6252001"
 7) "long"
 8) "-97.8220"
 9) "network"
10) "134742016"

127.0.0.1:6379> HGETALL geoid:6252001
1) "city"
2) ""
3) "country"
4) "United States"

Indexing the data into Redis was taking a long time - a quick Google releaved that Redis has a special --pipe mode for ingesting millions of rows of metadata: Mass Insertions. The difference was quite impresive:

Generate a Redis-protocol file:

time (python index_redis.py)
( python index_redis.py; )  
281.49s user
5.42s system
98% cpu
4:50.11 total

Ingest protocol file into Redis:

time (cat migration.txt | redis-cli -a redis --pipe)
All data transferred. Waiting for the last reply...
Last reply received from server.
errors: 0, replies: 6788124

real 0m45.654s
user 0m0.952s
sys 0m4.684s

Wow..

I figured there might be a small speed increase but 45 seconds for ingesting a 798 M file is very impresive.

Protip: if you’re indexing a lot of data into Redis, use --pipe.

Benchmark

Generate 2,000 random IPv4 addresses:

ipblock = lambda: random.randrange(1, 255)

def get_ipv4_addr():
    return '{}.{}.{}.{}'.format(ipblock(), ipblock(), ipblock(), ipblock())

I got a hit rate of 1717 mappable addresses (85%) since not every random address will map to a geographic location.

Running the benchmark took some time:

( python benchmark.py; ) 492.46s user 125.16s system 96% cpu 10:40.84 total

Averaged over 1,000 queries we got these averages:

SQLite Redis
0.462 s 0.010 s

{:.tablestyle}

That’s a pretty drastic improvement although not suprising since Redis sits in memory while SQLite relies on disk I/O, which is an order of magnitude slower. I am suprised at the half-second responses from SQLite when I’m running an SSD rather than a platter drive.

Size

In terms of size, we get some apples and oranges comparisions of memory v.s. disk space:

  • SQLite file takes 315M of dosk space

  • Redis instance takes 576.72M of RAM

Came up with the Redis numbers using:

redis-cli -a redis INFO

We have a trade-off of low memory usage and half-second query times v.s. medium/high memory usage and millisecond query times. Seems like a worthwhile compromise.

Conclusions

Looking at the use-cases for SQLite: SQLite use-cases, my scenario seems to fit within SQLite’s limitations but I think the long query times are more a problem with how SQL processes JOINs across large data sets.

Github Repo

Check out the benchmark code for yourself:

ip-addresses