1.1. Download Schedule
======================
Whois data for the previous day's new domains are available for download at 1:00 AM PST or 8:00 AM GMT/UTC.

1.2. Directory structure and file name conventions
==================================================================
#zipped csv files are named in the format yyyy_mm_dd_tld.csv.gz
for example:
  2012_02_24_us.csv.gz represents the compressed data for Feb 24th, 2012 for tld "us"
  full_2012_02_24_us.csv.gz represents the compressed data that includes raw whois texts for Feb 24th, 2012 for tld "us"
  the directory add_2012_02_24_us represents the uncompressed data for Feb 24th, 2012 for tld "us"
  the directory add_full_2012_02_24_us represents the uncompressed data that includes raw whois texts for Feb 24th, 2012 for tld "us"

#unzipped csv files are in directories of the format add_yyyy_mm_dd/tld
for example:
  add_2012_05_19/com contains all the whois data files for .com on May 19th, 2012.
The files are named in the format P_N.csv, where P is the domain name prefix (0-9, a-z) and N is a counter that starts from 1.
For example:
  0_1.csv is the first whois data file for the domain names that start with a digit (0-9).
  a_1.csv is the first whois data file for the domain names that start with the letter a

#zipped mysqldump files are under directories of the format add_mysqldump_yyyy_mm_dd/tld
for example:
  add_mysqldump_2012_05_03/com contains the mysqldump file for .com on May 03, 2012

#trimmed csv files are under the directory trimmed
  trimmed/no_concat contains csv files from which personally identifiable information, such as email, name, street address, fax and phone numbers, has been removed

2.1. CSV Fields
=====================
"domainName","registrarName","contactEmail","whoisServer","nameServers","createdDate","updatedDate","expiresDate","standardRegCreatedDate","standardRegUpdatedDate","standardRegExpiresDate","status","Audit_auditUpdatedDate","registrant_email","registrant_name","registrant_organization","registrant_street1","registrant_street2","registrant_street3","registrant_street4","registrant_city","registrant_state","registrant_postalCode","registrant_country","registrant_fax","registrant_faxExt","registrant_telephone","registrant_telephoneExt","administrativeContact_email","administrativeContact_name","administrativeContact_organization","administrativeContact_street1","administrativeContact_street2","administrativeContact_street3","administrativeContact_street4","administrativeContact_city","administrativeContact_state","administrativeContact_postalCode","administrativeContact_country","administrativeContact_fax","administrativeContact_faxExt","administrativeContact_telephone","administrativeContact_telephoneExt","billingContact_email","billingContact_name","billingContact_organization","billingContact_street1","billingContact_street2","billingContact_street3","billingContact_street4","billingContact_city","billingContact_state","billingContact_postalCode","billingContact_country","billingContact_fax","billingContact_faxExt","billingContact_telephone","billingContact_telephoneExt","technicalContact_email","technicalContact_name","technicalContact_organization","technicalContact_street1","technicalContact_street2","technicalContact_street3","technicalContact_street4","technicalContact_city","technicalContact_state","technicalContact_postalCode","technicalContact_country","technicalContact_fax","technicalContact_faxExt","technicalContact_telephone","technicalContact_telephoneExt","zoneContact_email","zoneContact_name","zoneContact_organization","zoneContact_street1","zoneContact_street2","zoneContact_street3","zoneContact_street4","zoneContact_city","zoneContact_state","zoneContact_postalCode","zoneContact_country","zoneContact_fax","zoneContact_faxExt","zoneContact_telephone","zoneContact_telephoneExt"

2.2. Data Field Details
======================
The csv data fields are mostly self-explanatory by name, except for the following:
* createdDate: when the domain name was first registered/created
* updatedDate: when the whois data was updated
* expiresDate: when the domain name will expire
* standardRegCreatedDate: created date in the standard format (YYYY-mm-dd), e.g. 2012-02-01
* standardRegUpdatedDate: updated date in the standard format (YYYY-mm-dd), e.g. 2012-02-01
* standardRegExpiresDate: expires date in the standard format (YYYY-mm-dd), e.g. 2012-02-01
* Audit_auditUpdatedDate: the timestamp of when the whois record was collected, in the standard format (YYYY-mm-dd), e.g. 2012-02-01
* status: domain name status code; see http://www.wdbc.com/domain/status-codes.cfm for details
* registrant: The domain name registrant is the owner of the domain name. The registrant is responsible for keeping all WHOIS contact information up to date.
* administrativeContact: The administrative contact is the person in charge of the administrative dealings pertaining to the company owning the domain name.
* billingContact: The billing contact is the individual who is authorized by the registrant to receive the invoice for domain name registration and domain name renewal fees.
* technicalContact: The technical contact is the person in charge of all technical questions regarding a particular domain name.
* zoneContact: The domain technical/zone contact is the person who tends to the technical aspects of maintaining the domain's name server and resolver software, and database files.

2.3. Maximum Data Field Lengths
======================
domainName: 70, registrarName: 512, contactEmail: 256, whoisServer: 512, nameServers: 65535,
createdDate: 200, updatedDate: 200, expiresDate: 200,
standardRegCreatedDate: 200, standardRegUpdatedDate: 200, standardRegExpiresDate: 200,
status: 65535, Audit_auditUpdatedDate: 19,
registrant_email: 256, registrant_name: 256, registrant_organization: 256,
registrant_street1: 256, registrant_street2: 256, registrant_street3: 256, registrant_street4: 256,
registrant_city: 64, registrant_state: 45, registrant_postalCode: 45, registrant_country: 45,
registrant_fax: 45, registrant_faxExt: 45, registrant_telephone: 45, registrant_telephoneExt: 45,
administrativeContact_email: 256, administrativeContact_name: 256, administrativeContact_organization: 256,
administrativeContact_street1: 256, administrativeContact_street2: 256, administrativeContact_street3: 256, administrativeContact_street4: 256,
administrativeContact_city: 45, administrativeContact_state: 45, administrativeContact_postalCode: 45, administrativeContact_country: 45,
administrativeContact_fax: 45, administrativeContact_faxExt: 45, administrativeContact_telephone: 45, administrativeContact_telephoneExt: 45

Chapter 3. Database dumps
=======================
3.1. Software and hardware requirements for importing mysqldump files
3.2. Importing mysqldump files
3.3. Database Schema

3.1. Software and hardware requirements for importing mysqldump files
=======================
* Hardware Requirement:
  * Disk space requirement: at least one single 2 TB partition is required to store the mysql data files once they are loaded into the mysql server
  * Memory requirement: at least 16 GB of RAM
  * The server that collects the whois database has the following spec. It is recommended that your server be comparable to it:
      Core i7 Quad Core i7-2600 3.4 GHz
      16 GB DDR3-1333 UDIMM
      First Hard Drive: 2 TB SATA HDD (7200 RPM)
      Second Hard Drive: 2 TB SATA HDD (7200 RPM)
* Software Requirement:
  * MySQL server 5.1+ is recommended, although the dumps should also work with versions of mysql-server lower than 5.1

3.2. Importing mysqldump files
=======================
You should create a new database for each tld. For each tld, follow these steps:
* create a database for the tld, for example:
    mysql -uroot -ppassword -e "create database whoiscrawler_com"
* import the mysqldump file into the database, for example:
    gunzip