It has always annoyed me that upon visiting the web site for some global conglamorate (like Sony), itching to spend my hard earned cash on useless gadgets, I am asked to pick my country and region. Even worse how am I supposed to know what download mirror I should be using, the one at CMU or the one at Fermilab? Sure it is only one extra click, but there really is no reason why the correct version of a corporate site or mirror can not be pre-select and the user provided with an option to pick an alternative.
So how does one go about getting the user's geographic location based on their IP? One way is by using GeoIP City Lite from MaxMind, a database mapping IPs to geographic locations. You can give it a try at the GeoIP demo site.
The database is available in two formats, a binary version accessed using a set of open source APIs (the recommended version) and a CSV version ready to load into an SQL database server. Loading the CSV version into a PostgreSQL database server is the focus of this article.
MaxMind ships two versions of the City database, a freely available Lite version and a more comprehensive commercial version. The accuracy of the Lite version is 99.3% at the country level and 69% at the city level. For the commercial version the accuracy is 99.8% at the country level and 81% at the city level.
The CSV dataset is composed of two tables, the blocks table consists of IP ranges mapped to location identifiers and the locations table contains detailed information about each location. For the GeoIP City database, the locations table contains the two letter ISO country code, region identifier (states or regions), a City name, Postal Code, Latitude and Longitude, Metro Code and Area Code.
Before attempting to load the data into PostgreSQL you should look at performance numbers below and decide which access method is best for your application. There is a set of APIs for accessing the GeoIP binary data, and two different methods for storing the information inside PostgreSQL.
The Open Source APIs MaxMind provides are tuned specifically for accessing the GeoIP data, as such they [seem to] perform much better than an SQL server. Take a look at the GeoIP benchmark page for a comparison of the performance of the different APIs.
With PostgreSQL you can either use the default server configuration with no contrib modules, Plain PostgreSQL, or install the IP4r contrib module.
Using a Python test script I got the following performance numbers for the three access methods.
|Plain PostgreSQL||3 queries / second||302MB||start_end_ix + end_start_ix (see Index Performance)|
|IP4r PostgreSQL||6,147 queries / second||285MB||GiST index on range column|
|GeoIP Python API||209,344 queries / second||~30MB||GEOIP_MEMORY_CACHE|
The table was created on an Intel i7 @ 2.67GHz (with rather slow disks) but that shouldn't matter because the numbers are given for a relative comparison between the access methods, not as an absolute measure of performance.
You can get far more queries / second if you run many at the same time.
To allow for easier loading of the data, I wrote Utility for loading GeoIP data into a PostgreSQL database, that will automagically initialize the database structure, load the data and create the indexes. The script requires Python 2.3 or above, Psycopg2 and an existing PostgreSQL database.
Before you can run the script you must:
The script takes the following options:
- Run Options:
-c Create database structure -g Load the GeoIP data -b BLOCKS_FILE GeoIP Blocks CSV file [default: GeoLiteCity-Blocks.csv] -l LOCATIONS_FILE GoeIP Locations CSV file [default: GeoLiteCity-Location.csv] -z ZIP_FILE GoeIP Locations ZIP file [default: none]
- Database Options:
-H DB_HOST Database host [default: localhost] -d DATABASE_NAME Database name [default: geoip_db] -U USER_NAME User [default: geoip] -s SCHEMA Database Schema [default: public] --ip4r Use the IP4r module [default: False]
For example the following command will create the structure and load the data into a database called geoip_db2 on the localhost using the user name geoip
python load_geoip.py -c -g -z GeoLiteCity_20090501.zip
When you are using the IP4r module you can use the get_location function like this:
1SELECT city, region, country
3WHERE locations.id = (SELECT get_location('220.127.116.11'))
The get_location function boils down to:
2FROM blocks WHERE ip_range >>= ip4('18.104.22.168')
It's pretty obvious that the >>= operator checks if the right side is contained or equal to the left side. In this case it is used to check if the given IP address is inside the range. This is one of the operators defined on the GiST index type.
Without using IP4r a query similar to the one below should be used. The inet_to_bigint function exploits a trick (see Utility Functions) to convert the given IP string into a 64bit number.
1SELECT city, region, country
3WHERE locations.id = (
4 SELECT location_id FROM blocks
5 WHERE inet_to_bigint('22.214.171.124') BETWEEN start_ip AND end_ip
6 LIMIT 1)
Without the trick the WHERE clause would look something like this: WHERE 64*16777216 + 151*65536 + 75*256 + 21 BETWEEN start_ip AND end_ip)
You can also do something like this:
1SELECT city, region, country
3JOIN blocks ON locations.id = blocks.location_id
4WHERE inet_to_bigint('126.96.36.199') BETWEEN start_ip AND end_ip
On my machine the first query executes in ~0.210ms and the second in ~0.230ms. Looking at the Explain output it seems that both queries perform a Seq Scan on blocks table but while the first query does a straight forward limit and then primary key lookup the second query involves a 'Nested Loop' which may account for the slight difference in performance.
If you don't care how the data is loaded or what the performance characteristics of the various index configurations are, than you should skip the next two section and check out Usage Ideas.
As far as I can tell GeoIP will not return more than one location so it should be perfectly safe to use such a LIMIT. Please report any problems with this approach.
The following sections provide information on manually creating the necessary database structures and loading the GeoIP data, for anyone who is either interested in finding out how the loader script works, or would like to manually load the data (due to lack of python or whatever else).
I have also provided a PostgreSQL script for loading GeoIP data that, with some minor customizations, will allow you to load the GeoIP data directly into a database, using the psql utility.
If you plan to integrate the GeoIP data inside one of your own databases you should create the GeoIP tables inside a schema. PostgreSQL Schemas are something like a database analog for file system directories, they allow you to logically organize and isolate database objects. When you create a database, PostgreSQL by default creates a schema called public. Unless otherwise specified newly created tables and other database objects live inside the public schema.
Using a custom schema will help you manage your database without worrying about bloating your backups with the GeoIP data, since the GeoIP data can be reloaded at will. Also when the GeoIP data is updated you can easily drop the schema and all the objects within it with: DROP SCHEMA geoip CASCADE;.
If you decide to use a schema, other than the default one, you will have to access the tables and functions inside that schema either by using their fully qualified name like schema_name.object_name or by adding your schema to the schema search_path.
Postgres uses the schema search_path to resolve the location of database objects. The first entry in the search path is the current schema, or the schema where new objects are created. The search_path can be set for the current session using a command like SET search_path TO geoip,public;.
PostgreSQL has an excellent contrib module that provides a more flexible IP data type, ip4 and an IP range data type ip4r that can be used to store the blocks table IP ranges. Even better IP4r defines all the necessary operators on these types to allow for the creation of a GiST index on a range field.
Check your distribution for ip4r package (yum search ip4r).
Under Fedora (10 and probably other versions)
su -c 'yum install postgresql-ip4r'
The ip4r.sql is located: /usr/share/postgresql-ip4r/ip4r.sql
To install the module manually:
- Get IP4r from: http://pgfoundry.org/projects/ip4r/
- Untar the source: tar xvf ip4r-version.tar.gz
- Build it: make USE_PGXS=1 (PostgreSQL 8.0 and above)
- Install it: make USE_PGXS=1 install (PostgreSQL 8.0 and above)
The ip4r.sql would probably be located in: /usr/local/share/postgresql/contrib/ip4r.sql
Load the IP4r data types and functions, something like: psql -U postgres -f PATH_TO_IP4R_DIR/ip4r.sql some_database
You can also load the ip4r.sql script inside the template1 database so that every database you create will contain the ip4r definitions. This is only useful if you plan to have many databases with ip4r capabilities
For PostgreSQL, the database structure for the City database is as follows:
1CREATE TABLE locations
3 id bigint NOT NULL,
4 country character(2) NOT NULL,
5 region character(2),
6 city character varying(75),
7 postal_code character varying(15),
8 latitude numeric(6,4) NOT NULL,
9 longitude numeric(7,4),
10 metro_code integer,
11 area_code integer,
12 CONSTRAINT locations_pkey PRIMARY KEY (id)
15CREATE TABLE blocks
17 start_ip bigint NOT NULL,
18 end_ip bigint NOT NULL,
19 location_id bigint NOT NULL,
The start_ip and end_ip columns are declared using the bigint data type because even though an IP is just a 32bit number, the SQL standard does not provide for an unsigned data type and so the most straight-forward way to store an unsigned 32bit number is by using a 64bit field.
A much better way, of course, is to use the IP4r module.
To populate the tables you can use the PostgreSQL COPY command which performs a bulk load into a table directly from some sort of structured file, like CSV.
You must be logged into the database as a super user before you can use the COPY command from a file.
You need to edit each 'csv' file and remove the 'copyright' header or the COPY command will fail, as to my knowledge the command can not skip more than one header line
The GeodIP data seems to be encoded in ISO-8859-15 (US ASCII?) so if your database is UTF-8, as it should be, make sure to set your client encoding as follows:
1-- Data is in ISO-8859-15 format
2SET client_encoding to "ISO-8859-15";
Then to load each table execute (assuming your data files are in /tmp):
1-- The locations table
2COPY locations (id, country, region, city, postal_code, latitude, longitude, metro_code, area_code)
3FROM '/tmp/GeoLiteCity-Location.csv' WITH CSV HEADER;
5-- Than the blocks table
7FROM '/tmp/GeoLiteCity-Blocks.csv' WITH CSV HEADER;
The most straight forward way to convert the database to use the IP4r contrib module is by adding the range column and then populating it with data using the builtin conversion function ip4r(ip4, ip4) like this:
1-- Add the range column
2ALTER TABLE blocks ADD COLUMN ip_range ip4r;
3ALTER TABLE blocks ALTER COLUMN ip_range SET STORAGE PLAIN;
4-- Perform Conversion
5UPDATE blocks SET ip_range = ip4r(start_ip::ip4, end_ip::ip4)
6-- Drop Unneeded Columns
7ALTER TABLE blocks DROP COLUMN start_ip;
8ALTER TABLE blocks DROP COLUMN end_ip;
One of the cooler features of IP4r is that it provides a nice set of casts and conversion methods for creating ip4 and ip4r types from bigint, inet and plain strings.
If you are using the IP4r module than a GiST index on the ip_range column is the best choice for the blocks table. This type of index allows the use of the >>= operator and achieves excellent performance.
The code for the ip_range index is:
1CREATE INDEX ix_ip_range ON blocks
2 USING gist (ip_range) WITH (FILLFACTOR=100);
If for some unfortunate reason you can not use IP4r module then you must index the start_ip and end_ip columns using a B-tree index. After running some tests (see Index Performance), it appears that the best performance for queries on the blocks table can be attained by creating two composite indexes, one using the start_ip-end_ip fields and one on the end_ip-start_ip fields (note the order).
The DDL code for generating these indexes is:
1CREATE INDEX ix_start_end_ip ON blocks
2 USING btree (start_ip, end_ip) WITH (FILLFACTOR=100);
3CREATE INDEX ix_end_start_ip ON blocks
4 USING btree (end_ip, start_ip) WITH (FILLFACTOR=100);
Because you will not be updating the blocks table the index fill factor should be set to 100%, this will [theoretically?] improve the performance since less pages need to be read to traverse the index, and it makes the index smaller on the order of 20% compared to the default (80% FILLFACTOR).
Create the indexes AFTER loading the database or your performance will suffer both during load and during runtime because FILLFACTOR=100 will cause index pages to get fragmented when values are inserted.
For the three million rows in the blocks table, the B-tree index sizes are quite modest, standing at 73MB for each composite. The blocks table itself is around 140MB, so adding the two composite indexes or the single GiST index brings the total size of the database to around 300MB. However disk space being as cheap as it is the performance benefits of the indexes considerably outweigh their storage requirements.
With the IP4r module the get_location function is defined as:
1CREATE OR REPLACE FUNCTION get_location(inet) RETURNS bigint AS $$
2 SELECT location_id FROM blocks
3 WHERE ip_range >>= ip4($1)
4$$ LANGUAGE SQL;
If you decided to load the data inside a schema but don't wish to set your search_path you should change the definition of the get_location function to include the fully qualified name for the blocks table, as in:
1CREATE OR REPLACE FUNCTION get_location(inet) RETURNS bigint AS $$
2 SELECT location_id FROM schema_name.blocks
3 WHERE ip_range >>= ip4($1)
4$$ LANGUAGE SQL;
Without IP4r the IP address string must be converted to a bigint. And although you can not directly cast an inet type to an integer, a trick I accidentally discovered for indirectly performing such a cast is to subtract inet '0.0.0.0' from an IP address in inet form (see inet_to_bigint function below).
1CREATE OR REPLACE FUNCTION inet_to_bigint(inet) RETURNS bigint AS $$
2 SELECT $1 - inet '0.0.0.0'
3$$ LANGUAGE SQL;
In a previous iteration of this article I had defined a get_location function for use without IP4r. However for reasons I don't quite understand (yet), using the get_location function with the plain configuration resulted in really poor performance. My guess is that the query plan gets set during compile time or something else happens that makes the function not use both composite indexes.
In order to better understand what combination of B-tree indexes will result in the maximum search performance for the plain configuration, I wrote a Python script that performs the following steps for every permutation of the 4 possible indexes (start_ip, end_ip, start_ip-end_ip, end_ip-start_ip):
- Create a test database
- Populate it with GeoIP data, using the Data Loader script
- Setup Index configuration
- Restart the PostgreSQL server.
- Run the performance test
- Drop the test database
The test code (step 6) loads a pre-generated list of 10,000 random IPs and performs a SELECT statement on the blocks table for each IP in the list and records the execution time for each query.
The random IPs were obtained using:
1SELECT start_ip + (end_ip - start_ip) / 2 FROM blocks ORDER BY random() LIMIT 10000;
The following is a graph of the performance for the different index combinations. On the 'Y' Axis is the mean time per query, in seconds.
As you can see the best performance is achieved using both the start-end and the end-start composite indexes. One interesting thing about this graph is that having all the indexes present is worse than having only the correct indexes. It is not clear to me if this is due to index overhead or if the planner is simply choosing the incorrect index.
See Simple PostgreSQL index test code for the test code and the code that generated the chart.
Using the IP4r contrib module makes the database lighting fast and I am not sure there is any point in attempting to get more performance out of the system. But you may try to either increase the amount of statistics postgres stores for the blocks table, or see if using a Unique or Clustered index makes a difference.
Here are some relatively boring ideas for using the GeoIP data
- Automatically selecting mirrors based on the end user's location.
- Trying to guess a user's time zone.
- Providing the user with the closest location for something like your local franchise, parks, museums... whatever.
Lets say you have a table of mirrors (mirrors) with a structure similar to this:
You can populate the location_id using:
1UPDATE mirrors SET location_id = get_location(server_ip);
Then we can query the database for the best mirror using something along the lines of this:
1SELECT ip, distance_in_km(server_location.lat, server_location.lon, client_location.lat, client_location.lon) AS distance
3 locations AS client_location
4JOIN locations ON locations.id = mirrors.location_id AS server_location
5WHERE client_location.id = (SELECT get_location(client_ip))
6-- AND client_location.country = server_location.country
7ORDER BY distance ASC
There probably is a better way to perform this query, and you would be wise to also take into account the server bandwidth or some other values when performing the selection. It may also be faster to first do a lookup only selecting servers from the client's country (uncomment the AND clause above) and then perform a full query if no results are found in the given country.
See the PL/pgSQL Earth (GPS) distance formula for the definition of the distance_in_km function.
You should also look into the PostgreSQL earth distance module.
Getting a user's Time Zone based on their IP was the original motivation behind looking into the GeoIP data and thus writing this article.
First to get the user's country of origin:
1SELECT country FROM locations
2WHERE locations.id = (SELECT get_location('188.8.131.52'))
Then using the Python pytz module we can lookup possible time zones like this:
1>>> import pytz
7['Asia/Shanghai', 'Asia/Harbin', 'Asia/Chongqing', 'Asia/Urumqi', 'Asia/Kashgar']
Another approach that seems theoretically feasible is to track down a World Time Zones shape file, then to perform a hit test on the time zone polygons using the latitude and longitude obtained from the GeoIP database. That would be an interesting method and I would gladly pursue it if I could only find a free shape file.
The only problem I have with the GeoIP data is that it lacks direct Time Zone information. It would also be nice if the locations table had a continent field, something that can actually be added using this continents mapping file. Other than that the database is really nice and I wish more sites used it, in database or API form.
That is all. Have fun with the data and make stuff more usable, Thanks.