
Loading the GeoIP City database into PostgreSQL

Describes the necessary steps required to load the MaxMind GeoIP City database into a PostgreSQL database.

1   Introduction

It has always annoyed me that upon visiting the web site of some global conglomerate (like Sony), itching to spend my hard-earned cash on useless gadgets, I am asked to pick my country and region. Even worse, how am I supposed to know which download mirror I should be using, the one at CMU or the one at Fermilab? Sure, it is only one extra click, but there really is no reason why the correct version of a corporate site or mirror cannot be pre-selected, with the user given an option to pick an alternative.

So how does one go about getting the user's geographic location based on their IP? One way is by using GeoIP City Lite from MaxMind, a database mapping IPs to geographic locations. You can give it a try at the GeoIP demo site.

The database is available in two formats, a binary version accessed using a set of open source APIs (the recommended version) and a CSV version ready to load into an SQL database server. Loading the CSV version into a PostgreSQL database server is the focus of this article.

2   The GeoIP City Database

MaxMind ships two versions of the City database, a freely available Lite version and a more comprehensive commercial version. The accuracy of the Lite version is 99.3% at the country level and 69% at the city level. For the commercial version the accuracy is 99.8% at the country level and 81% at the city level.

The CSV dataset is composed of two tables: the blocks table maps IP ranges to location identifiers, and the locations table contains detailed information about each location. For the GeoIP City database, the locations table contains the two-letter ISO country code, a region identifier (states or regions), a city name, postal code, latitude and longitude, metro code and area code.

3   Lookup Performance

Before attempting to load the data into PostgreSQL you should look at the performance numbers below and decide which access method is best for your application. There is a set of APIs for accessing the GeoIP binary data, and two different methods for storing the information inside PostgreSQL.

The Open Source APIs MaxMind provides are tuned specifically for accessing the GeoIP data, as such they [seem to] perform much better than an SQL server. Take a look at the GeoIP benchmark page for a comparison of the performance of the different APIs.

With PostgreSQL you can either use the default server configuration with no contrib modules, Plain PostgreSQL, or install the IP4r contrib module.

Using a Python test script I got the following performance numbers for the three access methods.

Method              Performance               Size    Notes
Plain PostgreSQL    3 queries / second        302MB   start_end_ix + end_start_ix (see Index Performance)
IP4r PostgreSQL     6,147 queries / second    285MB   GiST index on range column
GeoIP Python API    209,344 queries / second  ~30MB   GEOIP_MEMORY_CACHE

These numbers were measured on an Intel i7 @ 2.67GHz (with rather slow disks), but that shouldn't matter: they are given as a relative comparison between the access methods, not as an absolute measure of performance.


You can get far more queries / second if you run many at the same time.

4   Loading the data using the script

To allow for easier loading of the data, I wrote a utility for loading GeoIP data into a PostgreSQL database that will automagically initialize the database structure, load the data and create the indexes. The script requires Python 2.3 or above, Psycopg2 and an existing PostgreSQL database.

Before you can run the script you must:

  1. Get the GeoIP City CSV data
  2. Unzip the data in a directory of your choice
  3. Create a database to store the geo_ip data, using either pgAdmin or the CREATE DATABASE PostgreSQL command
  4. (optional but highly recommended) Get and install the PostgreSQL IP4r Module (see Installing IP4r)

If all goes well with the above steps, and you have Python and Psycopg2 installed, simply run the load_geoip.py script, providing the locations of the GeoIP data files and information about your database setup.

The script takes the following options:

Run Options:
-c Create database structure
-g Load the GeoIP data
-b BLOCKS_FILE GeoIP Blocks CSV file [default: GeoLiteCity-Blocks.csv]
 GeoIP Locations CSV file [default: GeoLiteCity-Location.csv]
-z ZIP_FILE GeoIP Locations ZIP file [default: none]
Database Options:
-H DB_HOST Database host [default: localhost]
 Database name [default: geoip_db]
-U USER_NAME User [default: geoip]
-s SCHEMA Database Schema [default: public]
--ip4r Use the IP4r module [default: False]

For example, the following command will create the structure and load the data into the default database (geoip_db) on localhost, using the default user name geoip:

python load_geoip.py -c -g -z GeoLiteCity_20090501.zip

5   Using the Database

5.1   With IP4r

When you are using the IP4r module you can use the get_location function like this:

SELECT city, region, country
FROM locations
WHERE locations.id = (SELECT get_location('64.151.75.21'))

The get_location function boils down to:

SELECT location_id
FROM blocks WHERE ip_range >>= ip4('64.151.75.21')

The >>= operator checks whether the value on the right is contained in, or equal to, the value on the left. In this case it is used to check whether the given IP address falls inside the range. It is one of the operators supported by the GiST index.

5.2   Without IP4r

Without IP4r, a query similar to the one below should be used. The inet_to_bigint function exploits a trick (see Utility Functions) to convert the given IP string into a 64-bit number.

SELECT city, region, country
FROM locations
WHERE locations.id = (
    SELECT location_id FROM blocks
    WHERE inet_to_bigint('64.151.75.21') BETWEEN start_ip AND end_ip
    LIMIT 1)

Without the trick, the WHERE clause would look something like this: WHERE 64*16777216 + 151*65536 + 75*256 + 21 BETWEEN start_ip AND end_ip
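The same arithmetic is easy to sanity check outside the database. Here is a small Python sketch (ip_to_bigint is a hypothetical helper of mine, not part of the loader script) that computes the same number the SQL trick produces:

```python
def ip_to_bigint(ip):
    """Convert a dotted-quad IPv4 string to the unsigned 32-bit
    number stored in the start_ip/end_ip columns."""
    a, b, c, d = (int(octet) for octet in ip.split('.'))
    return a * 16777216 + b * 65536 + c * 256 + d

# The sample IP used in the queries above
print(ip_to_bigint('64.151.75.21'))  # → 1083656981
```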

You can also do something like this:

SELECT city, region, country
FROM locations
JOIN blocks ON locations.id = blocks.location_id
WHERE inet_to_bigint('64.151.75.21') BETWEEN start_ip AND end_ip

On my machine the first query executes in ~0.210ms and the second in ~0.230ms. Looking at the Explain output, it seems that both queries perform a Seq Scan on the blocks table, but while the first query does a straightforward limit and then a primary key lookup, the second involves a 'Nested Loop', which may account for the slight difference in performance.

If you don't care how the data is loaded or what the performance characteristics of the various index configurations are, then you should skip the next two sections and check out Usage Ideas.

5.3   The LIMIT clause

Per Dmitry Ryabov's suggestion, adding a LIMIT 1 clause can drastically improve the performance of some queries (more with non-IP4r than IP4r queries, though).

As far as I can tell GeoIP will not return more than one location so it should be perfectly safe to use such a LIMIT. Please report any problems with this approach.

6   Manually Loading the data

The following sections provide information on manually creating the necessary database structures and loading the GeoIP data, for anyone who is either interested in finding out how the loader script works, or would like to manually load the data (due to lack of python or whatever else).

I have also provided a PostgreSQL script for loading GeoIP data that, with some minor customizations, will allow you to load the GeoIP data directly into a database, using the psql utility.

6.1   Schema

If you plan to integrate the GeoIP data inside one of your own databases, you should create the GeoIP tables inside a schema. PostgreSQL schemas are something like a database analog of file system directories: they allow you to logically organize and isolate database objects. When you create a database, PostgreSQL by default creates a schema called public. Unless otherwise specified, newly created tables and other database objects live inside the public schema.

Using a custom schema will help you manage your database without worrying about bloating your backups with the GeoIP data, since the GeoIP data can be reloaded at will. Also when the GeoIP data is updated you can easily drop the schema and all the objects within it with: DROP SCHEMA geoip CASCADE;.

If you decide to use a schema other than the default one, you will have to access the tables and functions inside that schema either by using their fully qualified name, like schema_name.object_name, or by adding your schema to the schema search_path.

Postgres uses the schema search_path to resolve the location of database objects. The first entry in the search path is the current schema, or the schema where new objects are created. The search_path can be set for the current session using a command like SET search_path TO geoip,public;.
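Putting the pieces above together, the whole lifecycle looks something like this (the schema name geoip is just an example):

```sql
-- Create a dedicated schema for the GeoIP tables
CREATE SCHEMA geoip;

-- Resolve unqualified names against it first, for this session
SET search_path TO geoip, public;

-- When a new GeoIP release ships, throw everything away in one go:
-- DROP SCHEMA geoip CASCADE;
```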

6.2   IP4r

PostgreSQL has an excellent contrib module that provides a more flexible IP data type, ip4, and an IP range data type, ip4r, that can be used to store the blocks table IP ranges. Even better, IP4r defines all the necessary operators on these types to allow for the creation of a GiST index on a range field.

6.2.1   Installing

Linux

Check your distribution for an ip4r package (yum search ip4r).

Under Fedora (10, and probably other versions):

su -c 'yum install postgresql-ip4r'

The ip4r.sql file is located at: /usr/share/postgresql-ip4r/ip4r.sql

Manually

To install the module manually:

  1. Get IP4r from: http://pgfoundry.org/projects/ip4r/
  2. Untar the source: tar xvf ip4r-version.tar.gz
  3. Build it: make USE_PGXS=1 (PostgreSQL 8.0 and above)
  4. Install it: make USE_PGXS=1 install (PostgreSQL 8.0 and above)

The ip4r.sql would probably be located in: /usr/local/share/postgresql/contrib/ip4r.sql

6.2.2   Configuring

Load the IP4r data types and functions, something like: psql -U postgres -f PATH_TO_IP4R_DIR/ip4r.sql some_database


You can also load the ip4r.sql script inside the template1 database so that every database you create will contain the ip4r definitions. This is only useful if you plan to have many databases with ip4r capabilities.

6.3   Table Structure

For PostgreSQL, the database structure for the City database is as follows:

CREATE TABLE locations
(
    id bigint NOT NULL,
    country character(2) NOT NULL,
    region character(2),
    city character varying(75),
    postal_code character varying(15),
    latitude numeric(6,4) NOT NULL,
    longitude numeric(7,4),
    metro_code integer,
    area_code integer,
    CONSTRAINT locations_pkey PRIMARY KEY (id)
);

CREATE TABLE blocks
(
    start_ip bigint NOT NULL,
    end_ip bigint NOT NULL,
    location_id bigint NOT NULL
);

The start_ip and end_ip columns are declared using the bigint data type because, even though an IP is just a 32-bit number, the SQL standard does not provide an unsigned data type, and so the most straightforward way to store an unsigned 32-bit number is in a 64-bit field.
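A quick Python check of that reasoning: the largest IPv4 address overflows a signed 32-bit integer, but fits comfortably in 64 bits:

```python
# 255.255.255.255 as an unsigned 32-bit number
max_ip = 255 * 16777216 + 255 * 65536 + 255 * 256 + 255

print(max_ip)              # → 4294967295
print(max_ip > 2**31 - 1)  # → True: too big for a signed 32-bit integer
print(max_ip < 2**63)      # → True: fits easily in bigint
```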

A much better way, of course, is to use the IP4r module.

6.4   Populating the tables

To populate the tables you can use the PostgreSQL COPY command which performs a bulk load into a table directly from some sort of structured file, like CSV.


You must be logged into the database as a superuser before you can use the COPY command with a file.


You need to edit each CSV file and remove the copyright header, or the COPY command will fail; to my knowledge the command cannot skip more than one header line.

The GeoIP data seems to be encoded in ISO-8859-15, so if your database is UTF-8, as it should be, make sure to set your client encoding as follows:

-- Data is in ISO-8859-15 format
SET client_encoding to "ISO-8859-15";
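Alternatively, you can convert the files themselves to UTF-8 before loading. A sketch using only the Python standard library (the reencode helper and output file name are my own, not part of the loader script):

```python
def reencode(src_path, dst_path):
    """Rewrite a CSV file from ISO-8859-15 (Latin-9) to UTF-8."""
    with open(src_path, encoding='iso-8859-15') as src, \
         open(dst_path, 'w', encoding='utf-8') as dst:
        for line in src:
            dst.write(line)

# e.g. reencode('GeoLiteCity-Location.csv', 'GeoLiteCity-Location.utf8.csv')
```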

Then to load each table execute (assuming your data files are in /tmp):

-- First the locations table
COPY locations (id, country, region, city, postal_code, latitude, longitude, metro_code, area_code)
FROM '/tmp/GeoLiteCity-Location.csv' WITH CSV HEADER;

-- Then the blocks table
COPY blocks
FROM '/tmp/GeoLiteCity-Blocks.csv' WITH CSV HEADER;

6.5   Converting to IP4r

The most straightforward way to convert the database to use the IP4r contrib module is to add the range column and then populate it with data using the built-in conversion function ip4r(ip4, ip4), like this:

-- Add the range column
ALTER TABLE blocks ADD COLUMN ip_range ip4r;

-- Perform the conversion
UPDATE blocks SET ip_range = ip4r(start_ip::ip4, end_ip::ip4);

-- Drop the unneeded columns
ALTER TABLE blocks DROP COLUMN start_ip;
ALTER TABLE blocks DROP COLUMN end_ip;
One of the cooler features of IP4r is that it provides a nice set of casts and conversion methods for creating ip4 and ip4r types from bigint, inet and plain strings.
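For illustration, a couple of those conversions in action (a sketch; check the IP4r documentation for the exact cast forms available in your version):

```sql
-- Text to ip4r (CIDR form) and text to ip4, then a containment test
SELECT '10.0.0.0/8'::ip4r >>= '10.1.2.3'::ip4;

-- The same bigint-to-ip4 casts the UPDATE above relies on
SELECT ip4r(16909060::ip4, 16909188::ip4);
```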

6.6   Indexing

If you are using the IP4r module then a GiST index on the ip_range column is the best choice for the blocks table. This type of index allows the use of the >>= operator and achieves excellent performance.

The code for the ip_range index is:

CREATE INDEX ix_ip_range ON blocks
    USING gist (ip_range) WITH (FILLFACTOR=100);

If for some unfortunate reason you cannot use the IP4r module, then you must index the start_ip and end_ip columns using a B-tree index. After running some tests (see Index Performance), it appears that the best performance for queries on the blocks table is attained by creating two composite indexes: one on the start_ip-end_ip fields and one on the end_ip-start_ip fields (note the order).

The DDL code for generating these indexes is:

CREATE INDEX ix_start_end_ip ON blocks
    USING btree (start_ip, end_ip) WITH (FILLFACTOR=100);
CREATE INDEX ix_end_start_ip ON blocks
    USING btree (end_ip, start_ip) WITH (FILLFACTOR=100);

Because you will not be updating the blocks table, the index fill factor should be set to 100%. This will [theoretically?] improve performance, since fewer pages need to be read to traverse the index, and it makes the index smaller, on the order of 20% compared to the default (80% FILLFACTOR).


Create the indexes AFTER loading the data, or your performance will suffer both during the load and at runtime, because FILLFACTOR=100 causes index pages to fragment when values are inserted.

For the three million rows in the blocks table the B-tree index sizes are quite modest, standing at 73MB for each composite index. The blocks table itself is around 140MB, so adding the two composite indexes or the single GiST index brings the total size of the database to around 300MB. However, disk space being as cheap as it is, the performance benefits of the indexes considerably outweigh their storage requirements.

6.7   Utility Functions

With the IP4r module the get_location function is defined as:

CREATE OR REPLACE FUNCTION get_location(inet) RETURNS bigint AS $$
    SELECT location_id FROM blocks
    WHERE ip_range >>= ip4($1)
$$ LANGUAGE SQL;

If you decided to load the data inside a schema but don't wish to set your search_path you should change the definition of the get_location function to include the fully qualified name for the blocks table, as in:

CREATE OR REPLACE FUNCTION get_location(inet) RETURNS bigint AS $$
    SELECT location_id FROM schema_name.blocks
    WHERE ip_range >>= ip4($1)
$$ LANGUAGE SQL;

Without IP4r, the IP address string must be converted to a bigint. And although you cannot directly cast an inet type to an integer, a trick I accidentally discovered for indirectly performing such a cast is to subtract inet '0.0.0.0' from an IP address in inet form (see the inet_to_bigint function below).

CREATE OR REPLACE FUNCTION inet_to_bigint(inet) RETURNS bigint AS $$
    SELECT $1 - inet '0.0.0.0'
$$ LANGUAGE SQL;

In a previous iteration of this article I had defined a get_location function for use without IP4r. However, for reasons I don't quite understand (yet), using the get_location function with the plain configuration resulted in really poor performance. My guess is that the query plan gets set at compile time, or something else happens that keeps the function from using both composite indexes.

7   Index Performance

In order to better understand what combination of B-tree indexes will result in the maximum search performance for the plain configuration, I wrote a Python script that performs the following steps for every permutation of the 4 possible indexes (start_ip, end_ip, start_ip-end_ip, end_ip-start_ip):

  1. Create a test database
  2. Populate it with GeoIP data, using the Data Loader script
  3. Setup Index configuration
  4. Restart the PostgreSQL server.
  5. Run the performance test
  6. Drop the test database

The test code (step 5) loads a pre-generated list of 10,000 random IPs, performs a SELECT statement on the blocks table for each IP in the list, and records the execution time for each query.
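The timing loop itself is straightforward. A generic sketch, where lookup stands in for whatever runs one SELECT against the blocks table (a psycopg2 cursor in the real script):

```python
import time

def benchmark(lookup, ips):
    """Run one lookup per IP; return (mean seconds per query, queries/second)."""
    timings = []
    for ip in ips:
        start = time.perf_counter()
        lookup(ip)
        timings.append(time.perf_counter() - start)
    mean = sum(timings) / len(timings)
    return mean, 1.0 / mean
```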

The random IPs were obtained using:

SELECT start_ip + (end_ip - start_ip) / 2 FROM blocks ORDER BY random() LIMIT 10000;

The following is a graph of the performance for the different index combinations. On the 'Y' Axis is the mean time per query, in seconds.


As you can see, the best performance is achieved using both the start-end and the end-start composite indexes. One interesting thing about this graph is that having all the indexes present is worse than having only the correct ones. It is not clear to me whether this is due to index overhead or the planner simply choosing the wrong index.

See Simple PostgreSQL index test code for the test code and the code that generated the chart.

Using the IP4r contrib module makes the database lightning fast, and I am not sure there is any point in attempting to get more performance out of the system. But you may try to either increase the amount of statistics Postgres stores for the blocks table, or see if using a unique or clustered index makes a difference.

8   Usage Ideas

Here are some relatively boring ideas for using the GeoIP data:

  • Automatically selecting mirrors based on the end user's location.
  • Trying to guess a user's time zone.
  • Providing the user with the closest location for something like your local franchise, parks, museums... whatever.

8.1   Automatic Mirror Selection

Let's say you have a table of mirrors (mirrors) with a structure similar to this:

Name          Type
server_ip     inet
location_id   bigint

You can populate the location_id using:

UPDATE mirrors SET location_id = get_location(server_ip);

Then we can query the database for the best mirror using something along the lines of this:

SELECT ip, distance_in_km(server_location.lat, server_location.lon, client_location.lat, client_location.lon) AS distance
FROM mirrors
JOIN locations AS server_location ON server_location.id = mirrors.location_id,
     locations AS client_location
WHERE client_location.id = (SELECT get_location(client_ip))
-- AND client_location.country = server_location.country
ORDER BY distance ASC

There probably is a better way to perform this query, and you would be wise to also take into account the server bandwidth or some other values when performing the selection. It may also be faster to first do a lookup only selecting servers from the client's country (uncomment the AND clause above) and then perform a full query if no results are found in the given country.

See the PL/pgSQL Earth (GPS) distance formula for the definition of the distance_in_km function.

You should also look into the PostgreSQL earth distance module.
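For reference, the great-circle math behind a distance_in_km-style function can be sketched with the haversine formula (my own illustration, not the linked PL/pgSQL code):

```python
from math import radians, sin, cos, asin, sqrt

def distance_in_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points, in kilometers."""
    earth_radius_km = 6371.0
    dlat = radians(lat2 - lat1)
    dlon = radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    return 2 * earth_radius_km * asin(sqrt(a))

# One degree of longitude at the equator
print(round(distance_in_km(0.0, 0.0, 0.0, 1.0), 1))  # → 111.2
```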

8.2   Selecting a Time Zone

Getting a user's Time Zone based on their IP was the original motivation behind looking into the GeoIP data and thus writing this article.

First to get the user's country of origin:

SELECT country FROM locations
WHERE locations.id = (SELECT get_location('64.151.75.21'))

Then using the Python pytz module we can lookup possible time zones like this:

>>> import pytz
>>> pytz.country_timezones('bg')
['Europe/Sofia']
>>> pytz.country_timezones('cl')
['America/Santiago', 'Pacific/Easter']
>>> pytz.country_timezones('cn')
['Asia/Shanghai', 'Asia/Harbin', 'Asia/Chongqing', 'Asia/Urumqi', 'Asia/Kashgar']
>>> pytz.country_timezones('nz')
['Pacific/Auckland', 'Pacific/Chatham']

Obviously this doesn't work so well. In certain cases you can get away with it, but in general, without more logic, this approach doesn't yield useful results. Take a look at the MaxMind FAQ for some information about time zones and GeoIP. One thing we actually considered was a JavaScript method, but since Siafoo's core functionality has been designed to work without JavaScript, we didn't want to require it for something as important as time zone selection.

Another approach that seems theoretically feasible is to track down a World Time Zones shape file, then to perform a hit test on the time zone polygons using the latitude and longitude obtained from the GeoIP database. That would be an interesting method and I would gladly pursue it if I could only find a free shape file.

9   Conclusion

The only problem I have with the GeoIP data is that it lacks direct Time Zone information. It would also be nice if the locations table had a continent field, something that can actually be added using this continents mapping file. Other than that the database is really nice and I wish more sites used it, in database or API form.

That is all. Have fun with the data and make stuff more usable. Thanks.


over 10 years ago (30 Mar 2009 at 01:58 PM) by tinodidriksen
Old article, but also outdated information. It has recently begun spreading on StumbleUpon, so figured I'd comment on it so others don't make the mistakes...

Do NOT use bigint or integer to store IPs. PostgreSQL has built-in datatypes for that; see the cidr and inet types at http://www.postgresql.org/docs/current/interactive/datatype-net-types.html and functions for using them at http://www.postgresql.org/docs/current/interactive/functions-net.html
over 10 years ago (30 Mar 2009 at 04:07 PM) by Stou S.
The choice of using bigints for IP ranges (read that again... RANGES not single addresses) was not a mistake but a design choice that came about after a careful evaluation of all available alternatives, including the cidr and inet datatypes.

The details escape me at the moment but the best solution I found was to use the ip4r module and if ip4r wasn't available to use bigint in order to actually index stuff and perform range lookups (which is what GeoIP does) as opposed to single IP comparisons.
over 10 years ago (03 Apr 2009 at 09:58 AM) by tinodidriksen
The cidr and inet types can store IP ranges just fine, they can be indexed, and you can perform lookups such as asking whether an IP is contained in a given range (inet_column >>= ''::inet). I don't see any use for the custom types and module that PostgreSQL does not do natively.
over 10 years ago (03 Apr 2009 at 01:36 PM) by tinodidriksen
Replying to myself...after talking in #postgresql it seems that I was wrong. Types cidr/inet are implemented, and so are the operators, but the indexes are not exactly optimal yet so any lookup would be dog slow atm. Seems ip4r is still king of the hill for IP range handling.
over 10 years ago (03 Apr 2009 at 01:49 PM) by Stou S.
Woo hoo! I was right =)
over 10 years ago (30 Mar 2009 at 09:41 PM) by J. Arzamendi
Wow! Great code! Can't wait to use it the next time I load the GeoIP City database into PostgreSQL. Thanks!
over 10 years ago (13 May 2009 at 01:43 PM) by dimk
It is sufficient to use only one index, "start_ip desc", if you change the query like this:

SELECT city, region, country
FROM locations
WHERE id = (
    SELECT location_id FROM blocks
    WHERE inet_to_bigint('64.151.75.21') BETWEEN start_ip AND end_ip
    ORDER BY start_ip DESC
    LIMIT 1)

This increases the speed many times (from 30-80 ms to 0.080-0.120 ms).
over 10 years ago (13 May 2009 at 03:04 PM) by Stou S.
You are absolutely right, I am not sure why I didn't think of that... since it seems that GeoIP never returns more than 1 record. Adding a 'LIMIT 1' to the various queries makes them a lot faster (and it makes sense).

Nice find.
over 9 years ago (07 Jul 2010 at 08:44 AM) by piane

This very elegant solution has a drawback: when the searched IP address is not in the table, pg does a table scan on (statistically) half the table, with a response time of about 20ms.

you can fix it this way :

WITH b AS ( SELECT location_id, end_ip FROM geoipBlocks
           WHERE (inet('64.151.75.21') - inet '0.0.0.0') >= start_ip
           ORDER BY start_ip DESC LIMIT 1 )
SELECT * from geoiplocations, b
where id = b.location_id and (inet('64.151.75.21') - inet '0.0.0.0') <= b.end_ip;
over 9 years ago (26 Sep 2010 at 12:27 PM) by andrey788
What does WITH b AS (SELECT ...) mean?
Post the full query!
over 9 years ago (06 Oct 2010 at 05:45 AM) by dimk
Thanks for the advice. But I used a slightly different query for backward compatibility with version 8.3:

SELECT city, region, country
FROM locations
WHERE id = (
    SELECT CASE WHEN inet_to_bigint('64.151.75.21') <= end_ip THEN location_id ELSE null END
    FROM blocks
    WHERE inet_to_bigint('64.151.75.21') >= start_ip
    ORDER BY start_ip DESC
    LIMIT 1)
over 9 years ago (08 Dec 2010 at 08:31 AM) by bricklen
If you are using PostgreSQL + ip2location, creating the following spatial index on the ip2location table results in a huge performance increase. You can apply it equally well to MaxMind:

Create a spatial index on the ip_from/ip_to attributes:

-- Taken from http://www.pgsql.cz/index.php/PostgreSQL_SQL_Tricks#Fast_interval_.28of_time_or_ip_addresses.29_searching_with_spatial_indexes

CREATE INDEX ip2location_range_gist ON ip2location USING gist ((box(point(ip_from,ip_from),point(ip_to,ip_to))) box_ops);

analyze verbose ip2location;

explain analyze
select *
from ip2location
where box(point(ip_from,ip_from),point(ip_to,ip_to)) @> box(point (3512069689,3512069689), point(3512069689,3512069689));

Query version where the IP needs to be converted to bigint:

explain analyze
select *
from ip2location
where box(point(ip_from,ip_from),point(ip_to,ip_to)) @> box(point(inet_to_bigint('209.85.238.57'),inet_to_bigint('209.85.238.57')), point(inet_to_bigint('209.85.238.57'),inet_to_bigint('209.85.238.57')));

The inet_to_bigint functions come from that same page:

-- from http://www.postgres.cz/index.php/PostgreSQL_SQL_Tricks#Bigint_to_ip
drop function if exists bigint_to_inet(bigint);
create function bigint_to_inet(bigint) returns inet as $$
select (($1>>24&255)||'.'||($1>>16&255)||'.'||($1>>8&255)||'.'||($1>>0&255))::inet
$$ language sql strict immutable;
grant execute on function bigint_to_inet(bigint) to public;

drop function if exists inet_to_bigint(inet);
CREATE OR REPLACE FUNCTION inet_to_bigint(inet) RETURNS bigint AS $$
SELECT $1 - inet '0.0.0.0'
$$ LANGUAGE SQL strict immutable;
grant execute on function inet_to_bigint(inet) to public;