The scripts used to generate the data mart served through this web tool have been designed to take advantage of Perl's pattern-matching power. By defining appropriate regular expressions to filter their data, very large genotyping databases can be processed in a reasonable amount of time on standard personal computers, with all their data summarized and customized to the researcher's interests. Although every script can be run independently, provided its required input data is already available (the needed files are indicated in the comments section at the top of each script), a sequence of use may be defined as the following pipeline:
The SNP descriptive data contained in dbSNP is filtered for each chromosome and merged into a small set of compressed summaries (a minor data mart) to be used as added value for any SNP query. This summary is common to all datasets, so it only needs to be updated whenever dbSNP releases a new build. The needed files are indicated in the comments section at the top of the script.
Note: It takes ~27 hours to process dbSNP's build 132 descriptive data on a Core 2 Quad @ 2.40 GHz.
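The chromosome-by-chromosome filtering relies on the pattern matching mentioned above. A minimal sketch of the idea (in Python for illustration; the actual scripts are Perl, and the line format shown is hypothetical): stream the dump line by line and keep only the rows matching a regular expression, so memory use stays constant regardless of file size.

```python
import re

# Hypothetical row format: keep only lines whose SNP identifier is an rs number
RS_LINE = re.compile(r'^rs\d+\t')

def filter_lines(lines):
    """Stream lines and yield only those matching the pattern,
    so arbitrarily large files are processed in constant memory."""
    for line in lines:
        if RS_LINE.match(line):
            yield line

# Tiny stand-in for one chromosome's dump
sample = [
    "rs123\tA/G\tchr1\n",
    "# comment line\n",
    "ss456\tC/T\tchr2\n",
    "rs789\tT/T\tchrX\n",
]
kept = list(filter_lines(sample))
```

In the real scripts the same idea applies to each chromosome's file in turn, with the surviving rows merged into the compressed summaries.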
This is the script in charge of the data mart creation, carrying out the population-level part of the pre-processing. It handles any database's raw genotype tables, computing count summaries and population statistical indices and storing them in CSV files for later use.
Note: It takes ~12 hours to process all currently available datasets (all but 1000 Genomes, which takes roughly 8× longer) on a Core 2 Duo @ 2.13 GHz.
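The kind of per-SNP summary this step produces can be sketched as follows (Python for illustration; the genotype layout and the particular indices shown — allele frequency and expected heterozygosity — are assumptions, not necessarily the tool's exact choices):

```python
import csv
import io
from collections import Counter

# Hypothetical genotype table: one entry per individual for a given SNP
genotypes = ["AA", "AG", "GG", "AG", "AA", "AA"]

counts = Counter(genotypes)             # genotype count summary
alleles = Counter("".join(genotypes))   # allele counts
total = sum(alleles.values())
freq_a = alleles["A"] / total           # allele frequency, a simple population index
het_exp = 2 * freq_a * (1 - freq_a)     # expected heterozygosity under Hardy-Weinberg

# Store the summary as one CSV row, as the pre-processing step does
out = io.StringIO()
writer = csv.writer(out)
writer.writerow(["rs123", counts["AA"], counts["AG"], counts["GG"],
                 round(freq_a, 4), round(het_exp, 4)])
row = out.getvalue().strip()
```

One such row per SNP (and per population) is what later gets merged and loaded into the relational database.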
Once a database has been processed by the dataParser.pl script, a list of all the SNPs it contains is available. Using this list as input, it can be merged with the compressed dbSNP summaries obtained with dbSNPextraInfo.pl, producing an updated list that carries all the additional information previously stored from dbSNP and thus enriching the contents of the data mart.
Note: It takes ~2 hours to merge this data for all the databases on a Core 2 Duo @ 2.13 GHz.
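At its core the enrichment is a lookup-and-merge: each SNP in the database's list picks up whatever extra fields the dbSNP summary holds for it. A sketch under assumed data shapes (Python for illustration; field names are hypothetical):

```python
# Hypothetical dbSNP summary (as built in the earlier step) and a SNP list
# coming out of a processed genotype database
dbsnp_summary = {
    "rs123": {"chrom": "chr1", "gene": "GENE_A"},
    "rs789": {"chrom": "chrX", "gene": "GENE_C"},
}
snp_list = ["rs123", "rs555", "rs789"]

# Enrich each SNP with the extra dbSNP information when available
enriched = []
for rsid in snp_list:
    extra = dbsnp_summary.get(rsid, {})   # SNPs absent from dbSNP keep empty fields
    enriched.append({"rsid": rsid, **extra})
```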
When all the previous scripts have done their job, this one crawls through their results, generating the SQL statements needed to store all that data in a relational database.
Note: The SQL code generation is immediate, but importing it may take a few minutes.
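Generating the SQL is a straightforward row-to-statement translation, which is why it finishes immediately while the actual import takes longer. A sketch (Python for illustration; the table and column names are illustrative, not the tool's actual schema):

```python
import csv
import io

# Hypothetical CSV fragment produced by the previous steps
csv_text = "rs123,3,2,1,0.6667\nrs789,1,4,1,0.5000\n"

# Turn each CSV row into an INSERT statement
statements = []
for rsid, aa, ab, bb, freq in csv.reader(io.StringIO(csv_text)):
    statements.append(
        f"INSERT INTO snp_counts (rsid, aa, ab, bb, freq) "
        f"VALUES ('{rsid}', {aa}, {ab}, {bb}, {freq});"
    )
sql = "\n".join(statements)
```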
If the dbSNP information is already present, this script may be run to call all the remaining ones in an automated way, so no further user interaction is needed. Place all the previous scripts in the same folder, place all the needed files in the appropriate folders (each script states this information at its top), and configure this script's running options as needed (if you do not want to load the resulting CSV files into a MySQL database, set the corresponding variable to 0). Lean back, and just enjoy the melody.
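The driver's logic can be sketched as follows (Python for illustration; the script names after dataParser.pl and the flag name are hypothetical — the flag mirrors the "set the variable to 0" option for skipping the MySQL import):

```python
import subprocess

LOAD_INTO_MYSQL = 0   # 0 = generate the CSV/SQL files but skip the MySQL import
PIPELINE = ["dataParser.pl", "mergeWithDbSNP.pl", "sqlGenerator.pl"]

def run_pipeline(runner=subprocess.run):
    """Run each remaining script in order; import into MySQL only if asked."""
    executed = []
    for script in PIPELINE:
        runner(["perl", script], check=True)
        executed.append(script)
    if LOAD_INTO_MYSQL:
        runner(["mysql", "-e", "source datamart.sql"], check=True)
        executed.append("mysql-import")
    return executed

# Dry run with a no-op runner, just to show the call order
executed = run_pipeline(lambda cmd, check=True: None)
```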