How to (really) import the MaxMind GeoLite City CSV files into MySQL
There's a cool set of data over at MaxMind. They do a good job of IP geolocation and they give a decent version of their data for free. They want me to tell you: "This product includes GeoLite data created by MaxMind, available from http://www.maxmind.com/."
I use it to see where an IP address might be located when I'm looking at my web logs. I'm using Apache with mod_log_sql so my web logs are already in a MySQL db called `apachelogs`. There are loads of other ways to get your Apache log data into MySQL.
After I have the database `apachelogs` set up, I download and extract the CSV data from MaxMind (See here and use the link to 'Download the latest GeoLite City CSV Format'). Then I fire up the mysql client and create the tables to hold the GeoIP data:
USE apachelogs; -- your database name here.
DROP TABLE IF EXISTS `blocks`;
CREATE TABLE `blocks` (
  `startIPNum` int(10) unsigned NOT NULL,
  `endIPNum` int(10) unsigned NOT NULL,
  `locID` int(10) unsigned NOT NULL,
  PRIMARY KEY (`startIPNum`, `endIPNum`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1 PACK_KEYS=1 DELAY_KEY_WRITE=1;
DROP TABLE IF EXISTS `location`;
CREATE TABLE `location` (
  `locID` int(10) unsigned NOT NULL,
  `country` char(2) default NULL,
  `region` char(2) default NULL,
  `city` varchar(45) default NULL,
  `postalCode` char(7) default NULL,
  `latitude` double default NULL,
  `longitude` double default NULL,
  `dmaCode` char(3) default NULL,
  `areaCode` char(3) default NULL,
  PRIMARY KEY (`locID`),
  KEY `Index_Country` (`country`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1 ROW_FORMAT=FIXED;
Then load the data. Ubuntu 8.10 adds a 'helpful' service called AppArmor that confines programs like MySQL to a whitelist of files, to keep rogue software from touching the rest of your machine. This means mysqld can't read files in your home directory. The easiest fix for me was to just move the GeoLiteCity files into the directory where MySQL stores the database (mv Geo* /var/lib/mysql/apachelogs/).
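If you'd rather not move files into MySQL's data directory, another option is LOAD DATA LOCAL INFILE: with LOCAL, the mysql client reads the file and streams it to the server, so AppArmor's restrictions on mysqld don't get in the way. It does require local-infile to be enabled on both client and server, and the path below is just an example, not a path from my setup:

```sql
-- LOCAL makes the client (not mysqld) open the file, sidestepping AppArmor.
-- Start the client with: mysql --local-infile=1 apachelogs
LOAD DATA LOCAL INFILE '/home/you/GeoLiteCity-Blocks.csv'
INTO TABLE `blocks`
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
LINES TERMINATED BY '\n'
IGNORE 2 LINES;
```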
USE apachelogs;
LOAD DATA INFILE 'GeoLiteCity-Blocks.csv' INTO TABLE `blocks`
  FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
  LINES TERMINATED BY '\n'
  IGNORE 2 LINES;
LOAD DATA INFILE 'GeoLiteCity-Location.csv' INTO TABLE `location`
  FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
  LINES TERMINATED BY '\n'
  IGNORE 2 LINES;
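A quick sanity check after the load doesn't hurt. The row counts change with each data release, so I won't quote numbers, but both tables should be non-empty and a few blocks should join cleanly to locations:

```sql
-- Row counts should roughly match the CSV line counts (minus the 2 ignored lines).
SELECT COUNT(*) FROM `blocks`;
SELECT COUNT(*) FROM `location`;

-- Spot-check that blocks resolve to locations:
SELECT b.startIPNum, l.country, l.city
FROM `blocks` AS b
JOIN `location` AS l USING (locID)
LIMIT 5;
```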
After reading the docs, you might think it's a good idea to use "WHERE ip BETWEEN startIPNum AND endIPNum" - I did, and it was pretty slow. Here's a function that is much faster. Once this function is defined, IPTOLOCID(ip_address) returns the locID (the location ID in the GeoLiteCity-Location table) for an IP address given as a dotted-quad string:
DELIMITER $$
DROP FUNCTION IF EXISTS `IPTOLOCID` $$
CREATE FUNCTION `IPTOLOCID`(ip VARCHAR(15)) RETURNS int(10) unsigned
BEGIN
  DECLARE ipn INTEGER UNSIGNED;
  DECLARE locID_var INTEGER;
  -- Short-circuit the common private ranges. (Note: 172.16.0.0/12 is
  -- another private range not covered by these two patterns.)
  IF ip LIKE '192.168.%' OR ip LIKE '10.%' THEN
    RETURN 0;
  END IF;
  SET ipn = INET_ATON(ip);
  -- Find the block with the largest startIPNum <= ipn via the primary key,
  -- then confirm the address actually falls inside that block.
  SELECT locID INTO locID_var
  FROM `blocks`
  INNER JOIN (SELECT MAX(startIPNum) AS start
              FROM `blocks`
              WHERE startIPNum <= ipn) AS s
          ON (startIPNum = s.start)
  WHERE endIPNum >= ipn;
  RETURN locID_var;
END $$
DELIMITER ;
Here's an example of using the whole thing to see where in the world your web visitors are coming from in the last 30 days:
SELECT from_unixtime(a.time_stamp) AS `time`,
       a.remote_host, l.city, l.region, l.country,
       a.referer, a.request_uri
FROM access_log AS a
LEFT OUTER JOIN `location` AS l
       ON l.locID = apachelogs.IPTOLOCID(a.remote_host)
WHERE from_unixtime(a.time_stamp) > DATE_SUB(CURDATE(), INTERVAL 30 DAY)
ORDER BY a.time_stamp DESC;
In production, I have a column for LocID in access_log that is set by an INSERT trigger. This makes for very fast reads.
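Here's a sketch of that trigger approach, assuming you add a `locID` column to access_log (the column and trigger names here are my choices, not part of the mod_log_sql schema):

```sql
-- Add a column to hold the precomputed location ID.
ALTER TABLE access_log ADD COLUMN locID int(10) unsigned DEFAULT NULL;

-- Fill it on every insert, so the lookup cost is paid once at write time.
DELIMITER $$
CREATE TRIGGER access_log_set_locid
BEFORE INSERT ON access_log
FOR EACH ROW
BEGIN
  SET NEW.locID = IPTOLOCID(NEW.remote_host);
END $$
DELIMITER ;
```

With the column populated, the reporting query can join directly on access_log.locID = location.locID instead of calling IPTOLOCID per row.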