Genescript Annotation Pipeline
Documentation
Installation Instructions
Perl and Perl Modules
First, Perl must be installed along with Bioperl 1.0 and the other required modules. Note that Bioperl 1.0 requires Perl 5.6. For installation instructions for Perl and the Perl modules, see the Perl, Bioperl, and CPAN pages.
To install Genescript, simply unpack the distribution into the destination directory. Change the HOMEDIR variable in the main configuration file ~genescript/config/main.conf to point to the directory where Genescript was unpacked. You must also change the 'use lib' lines in the three scripts ~genescript/gs, ~genescript/phtblast, and ~genescript/tools/genmapconf to point to the same directory. Additionally, in ~genescript/phtblast you may want to specify the locations of blastall and align2png (distributed with Genescript) unless they are in your path.
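Before editing ~genescript/phtblast, it can help to check whether blastall and align2png are already in your path. The following is a minimal sketch in Python and is not part of the Genescript distribution; the helper name find_tools is hypothetical.

```python
import shutil


def find_tools(names):
    """Map each program name to its full path on PATH, or None if it is absent."""
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    # blastall and align2png are the two programs phtblast needs to locate.
    for tool, path in find_tools(["blastall", "align2png"]).items():
        if path:
            print(f"{tool}: found at {path}")
        else:
            print(f"{tool}: not in your path; set its location in phtblast")
```

Any program reported as missing is one whose location you would need to enter in phtblast by hand.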
System Specific Binaries
Genescript uses two C applications to produce graphics for the HTML output. Binaries are provided for IRIX and Linux (Solaris binaries should be available soon). If you require binaries for other systems, you can compile them from the source code; the NCBI toolkit is required to do so. A warning to anyone considering running Genescript on systems other than the ones listed: the third party programs required by Genescript are only available on a limited number of platforms. Make sure you can get these programs for the platform of your choice. When you configure Genescript, be sure to select the correct binary for your system. You may move and rename the binaries as you see fit.
Third Party Tools
Finally, you must get any third party tools you plan to use from their respective web sites and authors. These are all available free of charge to academic users, but each requires a different license agreement in order to obtain it. Only a few of these programs are actually required. A list of required software can be found on the Download page. The purpose of the optional software is listed here.
Gene Predictors
These are completely optional. We recommend that users get at least Genscan and HMMgene.
Advanced dbEST Search
The dbEST database can be handled in one of two ways. It can either be used as a simple EST database, or the advanced retrieval system can be used. To use the advanced retrieval system, the programs Nclever and TIGR Assembler 2 are required.
PDF Support
Maps are produced in PostScript format by default. If you would like PDFs generated as well, Ghostscript needs to be installed.
Notes on Required Software
gff2ps
This program requires gawk. You may want to edit gff2ps and ensure that the GAWK variable points to your local gawk binary.
RepeatMasker
By default Genescript is configured to use MaskerAid with RepeatMasker. MaskerAid allows RepeatMasker to use wublast, which increases the speed of the program. If you do not want to use MaskerAid, make sure you change the appropriate configuration option as described in the Configuration section.
Configuration
After all the required software has been installed, you must configure Genescript. See Configuration.
Configuration
Before you can use Genescript it must be properly configured. In addition to entering the locations of third party software, data sources must be properly set up and configured.
EST and Genomic Databases
All the configuration options that need customization are marked with
Building the Databases
The EST and Genomic databases each consist of a BLAST database with an associated sequence database. The BLAST database is constructed using formatdb as documented by NCBI. The only condition is that the sequence names in the BLAST database must match the sequence names in the sequence database exactly. The sequence database is simply a raw FASTA file that is indexed with Bioperl. Genescript comes with two small utilities to construct and test the index. These can be found in the
The FASTA file for the sequence database must be placed within a subdirectory with the same name as the database. The subdirectory must in turn be placed in a directory containing all the sequence databases. The index for the sequence database should be placed in the same directory as the FASTA file and called
For example, if the TIGR human gene index database is called tigrhuman, you would create the databases as follows.
bash$ pwd
/data/sequencedb/tigrhuman
bash$ formatdb -t "TIGR Human Gene Index" -p F -n tigrhuman -i HGI.060102
bash$ dbindex HGI.060102
If you would like to test the database index, you can use dbfetch. Note that you must specify the FULL FASTA header (e.g. |gi|4028939|gb|AC001234.1|AC001234) to dbfetch. See the dbfetch.html documentation for more information.
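Because the sequence names must match exactly and dbfetch expects the full header, it can be useful to list the headers of a FASTA file before building the databases. The following Python sketch is not part of Genescript; the function names and the sample records are made up for illustration.

```python
def fasta_headers(lines):
    """Yield the full header of every FASTA record, without the leading '>'."""
    for line in lines:
        if line.startswith(">"):
            yield line[1:].rstrip("\r\n")


def header_mismatches(headers_a, headers_b):
    """Return, sorted, the headers that appear in only one of the two collections."""
    return sorted(set(headers_a) ^ set(headers_b))


if __name__ == "__main__":
    # Made-up sample records; in practice, iterate over an open FASTA file.
    sample = [
        ">seq1 hypothetical example record\n",
        "ACGTACGTACGT\n",
        ">seq2 another hypothetical record\n",
        "TTGACCATG\n",
    ]
    for header in fasta_headers(sample):
        print(header)
```

Comparing the header lists of two files with header_mismatches gives a quick check that the names agree exactly; an empty result means no differences were found.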
Updating the Databases
The database files must be rebuilt in order to upgrade them. After you've upgraded the FASTA files, rebuild the BLAST database and FASTA index as explained above. The program dbindex will warn you that an index already exists and must be destroyed. You will then be asked if you want to rebuild the index; answer yes.
Configuring Genescript
After all the databases have been built, they must be entered into Genescript's main configuration file ~genescript/config/main.conf. Genescript supports two types of databases: EST databases and comparative genomic databases (non-human genomic databases). EST databases must be listed under the
Each database (EST or comparative) must have parsing thresholds as well. These are added using options of the form
The location of the BLAST database must be provided as an absolute path under the option
For example, let's say we have two EST databases called tigrhuman and refseqn and one comparative database called mouse. The configuration file would be as follows.
CDNALIBS=tigrhuman,refseqn
# Configure TIGR
CDNALIB_tigrhuman_desc=Human TIGR
CDNALIB_tigrhuman_confidence=8
BLASTDB_tigrhuman=/data/blast/tigrhuman
TIGRHUMAN_min_id=0.8
TIGRHUMAN_min_score=50
TIGRHUMAN_max_expect=10
# Configure NCBI Refseq
CDNALIB_refseqn_desc=NCBI Refseq
CDNALIB_refseqn_confidence=10
BLASTDB_refseqn=/data/blast/refseqn
REFSEQN_min_id=0.8
REFSEQN_min_score=50
REFSEQN_max_expect=10
# Configure Mouse Genomic
COMPLIBS=mouse
COMPLIB_mouse_desc=Mouse Genome
COMPLIB_mouse_confidence=4
BLASTDB_mouse=/data/blast/mouse
MOUSE_min_id=0.1
MOUSE_min_score=0
MOUSE_max_expect=1e-6
FASTADBINDEX=/data/sequencedb
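To make the naming pattern in the example above concrete, here is a small Python sketch that reads such a file and collects the thresholds for one database. It assumes the format is plain KEY=value lines with '#' comments on their own lines, as the example suggests; the functions parse_conf and thresholds_for are hypothetical and do not reflect how Genescript itself reads its configuration.

```python
def parse_conf(text):
    """Parse KEY=value lines, skipping blank lines and '#' comment lines."""
    options = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep:  # ignore lines without '='
            options[key.strip()] = value.strip()
    return options


def thresholds_for(options, name):
    """Collect the parsing thresholds of one database; the prefix is the upper-case name."""
    prefix = name.upper()
    return {
        "min_id": float(options[f"{prefix}_min_id"]),
        "min_score": float(options[f"{prefix}_min_score"]),
        "max_expect": float(options[f"{prefix}_max_expect"]),
    }


SAMPLE = """\
CDNALIBS=tigrhuman,refseqn
# Configure TIGR
BLASTDB_tigrhuman=/data/blast/tigrhuman
TIGRHUMAN_min_id=0.8
TIGRHUMAN_min_score=50
TIGRHUMAN_max_expect=10
"""

if __name__ == "__main__":
    conf = parse_conf(SAMPLE)
    print(conf["CDNALIBS"].split(","))
    print(thresholds_for(conf, "tigrhuman"))
```

Note how the database name is lower case in CDNALIBS and BLASTDB_tigrhuman but upper case in the threshold options.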
Configuring gff2ps
If you plan on using the graphical output options, you must also configure the gff2ps display options for each database. To do this, run genmapconf in the tools directory.
Accessing dbEST
In order for Genescript to function properly, the human subset of dbEST is required. Depending on whether or not you plan on using the advanced EST retrieval, the dbEST database can be set up in one of two ways.
If you plan on using dbEST as a regular EST database (no clustering or enhanced EST retrieval), you can retrieve the sequences for the human subset of dbEST from ftp://ftp.ncbi.nih.gov/blast/db/est_human.Z. The database is then set up exactly as described in the section above.
If you plan on using the advanced retrieval and clustering method to access dbEST, you do not need a local sequence database. However, you still need a local dbEST human BLAST database. A pre-formatted BLAST database can be obtained from ftp://ftp.ncbi.nih.gov/blast/db/FormattedDatabases/est_human.tar.gz. Once you've obtained the est_human BLAST database, you must fill out the option BLASTDB_esthuman. You do not need to add esthuman to the CDNALIBS option. Remember to obtain TIGR Assembler 2 and Nclever as well.
Adding to the example in the section above, the dbEST database is added using the advanced retrieval method as follows.
BLASTDB_esthuman=/data/blast/est_human
NR BLAST Database
In order to run BLASTX on the predicted gene sequences, the NR BLAST database needs to be installed. Pre-formatted NR databases (both protein and nucleotide) can be retrieved from ftp://ftp.ncbi.nih.gov/blast/db/FormattedDatabases/ as the files nr.tar.gz and nt.tar.gz. After these have been unpacked, you must fill out the BLASTDB_nr and BLASTDB_nt options. For example:
BLASTDB_nr=/data/blast/nr
BLASTDB_nt=/data/blast/nt
Gene Predictors
To use a gene predictor with Genescript, there need to be supporting wrapper functions in
RepeatMasker and MaskerAid
If you did not install MaskerAid, you must make the following change in ~genescript/config/main.conf. Change the line
REPEATMASKER_opt=-q -w -gff
to
REPEATMASKER_opt=-q -gff
in order to disable MaskerAid.
Default Execution Parameters
You can also set the default execution parameters in the main configuration file. These are the parameters that will be used if the user does not specify them in their own configuration file. See Usage for information on the execution parameters.
Usage Instructions
The default pipeline components are controlled through the main configuration file. The administrator should set up a reasonable default configuration. Users can then create custom pipeline jobs that run only the components they choose by supplying a user-created configuration file. This file can override any configuration option, including parsing and scoring thresholds. Perhaps the quickest way for a user to set up a pipeline job is to copy one of the supplied example configuration files from the example/ directory.
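The override behaviour can be pictured as a simple merge in which any option present in the user file replaces the administrator's default. The Python sketch below only illustrates that idea; the function effective_options is hypothetical, and how gs actually combines the two files is not described here beyond the fact that user options override the defaults.

```python
def effective_options(defaults, user):
    """Return a new dict: the defaults, with every user-supplied option overriding them."""
    merged = dict(defaults)  # copy so the defaults are left untouched
    merged.update(user)
    return merged


if __name__ == "__main__":
    # Option names are taken from this documentation; the values are illustrative.
    defaults = {"MASKSEQ_run": "on", "GENERATEPS": "on", "TIGRHUMAN_min_id": "0.8"}
    user = {"GENERATEPS": "off", "TIGRHUMAN_min_id": "0.9"}
    for key, value in sorted(effective_options(defaults, user).items()):
        print(f"{key}={value}")
```

Options the user does not mention keep the values chosen by the administrator.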
Running Genescript
Genescript is run by executing the gs script. Its syntax is
gs <sequence> [-conf=config_file] [-d=result_directory]
The result directory, if omitted, will be the input sequence filename with special characters removed and .gs appended. For example, the command
gs cftr.seq -conf myconfig.conf
will use cftrseq.gs/ as the result directory. The configuration file that is passed to gs controls all the pipeline parameters. It is explained below. The gs program has full perldoc documentation for those interested.
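If you need to predict the default result directory from another script, the naming rule can be sketched as follows. This Python snippet is an approximation, not the code used by gs: it assumes that "special characters" means anything other than letters and digits, and that only the file name (not its directory) is used. The function name default_result_dir is hypothetical.

```python
import os
import re


def default_result_dir(sequence_path):
    """Strip non-alphanumeric characters from the file name and append '.gs'."""
    name = os.path.basename(sequence_path)
    # Assumption: every character that is not a letter or digit counts as special.
    return re.sub(r"[^A-Za-z0-9]", "", name) + ".gs"


if __name__ == "__main__":
    print(default_result_dir("cftr.seq"))
```

For the input cftr.seq this reproduces the cftrseq.gs directory named in the example; for file names containing other punctuation, check the result against what gs actually creates.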
Execution Options
All the execution options can be enabled or disabled by using the form
OPTION=on
OPTION=off
MASKSEQ_run=[on|off]
This runs RepeatMasker on the input sequence. It is required for any homology searches.
PREDICTOR_name_run=[on|off]
This runs the predictor 'name' on the unmasked sequence. The available predictors are genscan, hmmgene, mzef, and grailexp.
PREDICTORS_parse_gene=[on|off]
Causes the predicted gene sequences to be saved in a FASTA file. This is required for PREDICTORS_blastx_run.
PREDICTORS_parse_exon=[on|off]
Causes the predicted exon sequences to be saved in a FASTA file.
PREDICTORS_blastx_run=[on|off]
Runs BLASTX against the NR database on each of the extracted gene sequences.
DBEST_run=[on|off]
Runs the advanced dbEST search. Note that this requires Nclever and TIGR Assembler 2 to be installed.
DBEST_blastx_run=[on|off]
Runs BLASTX against the NR database on each of the EST clusters as well as any singleton EST sequences.
CDNALIB_name_run=[on|off]
Depending on which EST databases you installed locally, you can have them searched or not using options of this form. Replace 'name' with the lowercase database name of one of your local databases. As with the PREDICTOR_name_run option, you can specify this option multiple times (once for each database).
COMPLIB_name_run=[on|off]
Depending on which comparative databases you installed locally, you can have them searched or not using options of this form. Replace 'name' with the lowercase database name of one of your local databases. As with the PREDICTOR_name_run option, you can specify this option multiple times (once for each database).
MARKERS_epcr_run=[on|off]
Controls running e-PCR on the unmasked sequence.
GENERATEHTML=[on|off]
Generates HTML report files.
GENERATEPS=[on|off]
Generates PostScript report files. If Ghostscript is installed, PDF files will be generated as well.
GENERATE_embl=[on|off]
Generates EMBL annotation files for each GFF file.
GENERATE_genbank=[on|off]
Generates GenBank annotation files for each GFF file.
GENERATE_vista=[on|off]
Generates VISTA annotation files for each GFF file.
BLAST_megablast=[on|off]
Controls whether megablast is used for all the homology searches.
GFF2PS_show_repeats=[on|off]
Controls whether repeats are displayed on the graphical output.
NR_blastn_run=[on|off]
Runs BLASTN against NR on the masked sequence.
NR_blastx_run=[on|off]
Runs BLASTX against NR on the masked sequence.
MODELSCORING_run=[on|off]
Scores all the predicted gene models and reports the more likely predictions.
Miscellaneous Options
TITLE=My Title
This allows custom titles to be added to the HTML output of pipeline runs.
CSSBUTTONS=[on|off]
By default CSS buttons are used in the HTML reports. For Netscape 4.x compatibility this should be turned off.
Predicted Gene Model Scoring Parameters
These parameters allow the user to customize the scoring thresholds for the automated scoring system.
MODELSCORING_min_fscore=[0<x<1]
This causes individual exons to be parsed out based on their score. The score threshold is a number between 0 and 1.
MODELSCORING_min_gene_ave=[0<x<1]
This parses entire gene models based on the average score of their exons. The score threshold is a number between 0 and 1.
MODELSCORING_min_gene_max=[0<x<1]
This parses entire gene models based on the maximum score of any of their exons. The score threshold is a number between 0 and 1.
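One way to read these three thresholds is sketched below in Python. This is an interpretation, not Genescript's scoring code: it assumes that exons and gene models are kept when their score meets or exceeds the threshold, and that the two gene-level thresholds must both be satisfied. The function names are hypothetical.

```python
def filter_exons(exon_scores, min_fscore=0.0):
    """Keep the exon scores that reach the per-exon threshold (assumed inclusive)."""
    return [score for score in exon_scores if score >= min_fscore]


def passes_model_thresholds(exon_scores, min_gene_ave=0.0, min_gene_max=0.0):
    """Check a gene model against the average and maximum exon score thresholds."""
    if not exon_scores:
        return False  # a model without exons cannot be scored
    average = sum(exon_scores) / len(exon_scores)
    # Assumption: both gene-level thresholds must hold at the same time.
    return average >= min_gene_ave and max(exon_scores) >= min_gene_max


if __name__ == "__main__":
    scores = [0.25, 0.75, 0.5]  # made-up exon scores for one gene model
    print(filter_exons(scores, min_fscore=0.5))
    print(passes_model_thresholds(scores, min_gene_ave=0.5, min_gene_max=0.7))
```

Raising min_gene_ave rejects models whose exons score poorly on the whole, while min_gene_max only requires one strongly supported exon.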
Database Parsing Thresholds
Each database has options associated with it that indicate which BLAST hits should be considered good. These can be overridden by the user.
DATABASE_min_id=[0<x<1]
The minimum identity of a good hit, expressed as a fraction from 0 to 1.
DATABASE_min_score=[x]
The minimum score of a good hit. Takes a number 0 or greater.
DATABASE_max_expect=[0|1e-x]
The maximum expect value of a good hit. Takes a number that is 0 or greater and can be of the form 1e-6.
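Taken together, the three thresholds describe a filter on BLAST hits. The Python sketch below shows one plausible reading, in which a hit is good only if it satisfies all three limits; that combination rule and the function name is_good_hit are assumptions, not taken from the Genescript source.

```python
def is_good_hit(identity, score, expect, min_id, min_score, max_expect):
    """Return True if a hit meets the identity, score, and expect thresholds.

    identity is a fraction between 0 and 1, matching the DATABASE_min_id scale.
    """
    return identity >= min_id and score >= min_score and expect <= max_expect


if __name__ == "__main__":
    # Thresholds from the tigrhuman configuration example: 0.8, 50, 10.
    print(is_good_hit(0.85, 120, 1e-20, min_id=0.8, min_score=50, max_expect=10))
    print(is_good_hit(0.75, 120, 1e-20, min_id=0.8, min_score=50, max_expect=10))
```

With the mouse settings from the configuration example (min_id 0.1, min_score 0, max_expect 1e-6), the expect value is the limit that does most of the filtering.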
Looking at the Results
If you are planning to use the HTML mark-up, point your browser at the file resultdir/html/info.html; the interface should be fairly self-explanatory. For more advanced uses of the results, such as incorporation into a larger automated analysis, here is the breakdown of the result directory structure.
Genescript Architecture
Internal data storage
Genescript uses the GFF file format (http://www.sanger.ac.uk/Software/formats/GFF/GFF_Spec.shtml) to store all annotation information. This format was chosen for its simplicity and flexibility. One of the goals in developing Genescript was to avoid using heavy database software or complicated file types that are useful only in one program. Where necessary, simpler versions of the GFF files are generated for use in other programs such as gff2ps. The group field of the GFF file format is flexible enough to store almost any extra information Genescript needs to store.
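For readers who want to process the GFF files in their own scripts, the following Python sketch splits one GFF line into its tab-separated fields (seqname, source, feature, start, end, score, strand, frame, and the optional group field). It is a generic reader, not Genescript code; the sample line and the name parse_gff_line are made up, and the contents of the group field are left uninterpreted because the way Genescript encodes extra information there is not described here.

```python
from collections import namedtuple

GffRecord = namedtuple(
    "GffRecord", "seqname source feature start end score strand frame group"
)


def parse_gff_line(line):
    """Split one tab-delimited GFF line into fields; return None for comments and blanks."""
    line = line.rstrip("\r\n")
    if not line or line.startswith("#"):
        return None
    fields = line.split("\t")
    if len(fields) < 8:
        raise ValueError(f"expected at least 8 tab-separated fields, got {len(fields)}")
    seqname, source, feature, start, end, score, strand, frame = fields[:8]
    group = fields[8] if len(fields) > 8 else ""  # the group field is optional
    # score is kept as text because GFF allows '.' when no score is available
    return GffRecord(seqname, source, feature, int(start), int(end),
                     score, strand, frame, group)


if __name__ == "__main__":
    sample = "cftr\tgenscan\texon\t1200\t1350\t0.92\t+\t0\tgene_1"  # made-up line
    record = parse_gff_line(sample)
    print(record.feature, record.start, record.end, record.group)
```

Keeping the group field as an opaque string lets a script pass it through unchanged when it only needs coordinates and feature types.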
Pipeline Flow Diagram
The flow of data through the pipeline is shown in the following diagram. If you are viewing this documentation via the distributed text files, see the file gsflow.eps or gsflow.png.
Troubleshooting
Log Files
Main Error Log
Any errors generated by components of the pipeline are logged to the result directory in the file resultdir/log/error.log. Note that this file will not be empty even if everything is working correctly, because some third party programs use the error stream to print progress information.
RepeatMasker Error Log
RepeatMasker stores its errors in the file resultdir/temp/original.seq.stderr. Again, this file can also contain warnings that do not necessarily indicate the program is failing.