Genescript Annotation Pipeline

Documentation


Installation Instructions

Perl and Perl Modules

First, install Perl along with Bioperl 1.0 and the other required modules. Note that Bioperl 1.0 requires Perl 5.6. For installation instructions on Perl and Perl modules, see the Perl, Bioperl, and CPAN pages.

To install Genescript simply unpack the distribution to the destination directory. Change the HOMEDIR variable in the main configuration file ~genescript/config/main.conf to point to the directory where Genescript was unpacked. You must also change the 'use lib' lines in the three scripts ~genescript/gs, ~genescript/phtblast, and ~genescript/tools/genmapconf to point to the directory where Genescript was installed.
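
The edits described above can also be scripted. The following is a minimal sketch rather than a supported installer: the function name set_genescript_home is hypothetical, and it assumes that main.conf holds a line of the form HOMEDIR=<path> and that each 'use lib' statement starts at the beginning of a line. Check both assumptions against your copy of the files before relying on it.

```shell
# Hypothetical helper, not part of the Genescript distribution.
# Usage: set_genescript_home /opt/genescript
set_genescript_home() {
    gs_home=$1

    # Assumption: main.conf holds a line of the form HOMEDIR=<path>.
    conf=$gs_home/config/main.conf
    sed "s|^HOMEDIR=.*|HOMEDIR=$gs_home|" "$conf" > "$conf.new" &&
        cat "$conf.new" > "$conf" && rm -f "$conf.new"

    # Assumption: every 'use lib' statement starts in column one.
    # Writing back with cat keeps the scripts' execute permissions.
    for script in gs phtblast tools/genmapconf; do
        file=$gs_home/$script
        sed "s|^use lib .*|use lib '$gs_home';|" "$file" > "$file.new" &&
            cat "$file.new" > "$file" && rm -f "$file.new"
    done
}
```

Because the new value is substituted with sed, an installation path containing | or & would need to be escaped first.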

Additionally, in ~genescript/phtblast you may want to specify the locations of blastall and align2png (distributed with Genescript) unless they are in your path.

System Specific Binaries

Genescript uses two C applications to produce graphics for the HTML output. Binaries are provided for IRIX and Linux (Solaris binaries should be available soon). If you require binaries for other systems, you can compile them directly from the source code. Compiling these applications requires the NCBI toolkit.

A warning to anyone considering running Genescript on systems other than the ones listed: the third party programs required by Genescript are available on only a limited number of platforms. Make sure you can get these programs for the platform of your choice.

When you configure Genescript be sure to select the correct binary for your system. You may move and rename the binaries as you see fit.

Third Party Tools

Finally, you must get any third party tools you plan to use from their respective web sites and authors. These are all available free of charge to academic users, but each requires a different license agreement in order to obtain it.

Only a few of these programs are actually required. A list of required software can be found on the Download page. The purpose of each optional program is described here.

Gene Predictors

These are completely optional. We recommend users get at least Genscan and HMMgene.

Advanced dbEST Search

The dbEST database can be handled in one of two ways. It can either be used as a simple EST database, or the advanced retrieval system can be used. To use the advanced retrieval system, the programs Nclever and TIGR Assembler 2 are required.

PDF Support

Maps are produced in Postscript format by default. If you would like PDFs generated as well, Ghostscript needs to be installed.

Notes on Required Software

gff2ps

This program requires gawk. You may want to edit gff2ps and ensure that the GAWK variable points to your local gawk binary.

RepeatMasker

By default Genescript is configured to use MaskerAid with RepeatMasker. MaskerAid allows RepeatMasker to use wublast in order to increase the speed of the program. If you do not want to use MaskerAid, make sure you change the appropriate configuration option as described in the Configuration section.

Configuration

After all the required software has been installed you must configure Genescript. See Configuration.


Configuration

Before you can use Genescript it must be properly configured. In addition to entering the locations of third party software, data sources must be properly set up and configured.

EST and Genomic Databases

All the configuration options that need customization are marked with (change) in the example configuration file. See below for more specific instructions.

Building the Databases

The EST and Genomic databases each consist of a BLAST database with an associated sequence database.

The BLAST database is constructed using formatdb as documented by NCBI. The only condition is that the sequence names in the BLAST database must match the sequence names in the sequence database exactly.
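
The simplest way to meet this condition is to build both databases from the same FASTA file. If the two were built from different files (for instance after a partial update), a check along the following lines may help. It is only a sketch: the function names are hypothetical, and it treats the complete header line after the '>' as the sequence name, which is an assumption.

```shell
# List the header lines (without the leading '>') of a FASTA file, sorted.
fasta_names() {
    sed -n 's/^>//p' "$1" | sort
}

# Succeed only if two FASTA files carry exactly the same set of header lines.
# Usage: same_fasta_names blast_source.fasta sequence_db.fasta
same_fasta_names() {
    list_a=$(mktemp) && list_b=$(mktemp) || return 2
    fasta_names "$1" > "$list_a"
    fasta_names "$2" > "$list_b"
    cmp -s "$list_a" "$list_b"
    status=$?
    rm -f "$list_a" "$list_b"
    return $status
}
```

The comparison ignores the order of the sequences, since only the names themselves have to agree.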

The sequence database is simply a raw FASTA file that is indexed with Bioperl. Genescript comes with two small utilities to construct and test the index. These can be found in the ~genescript/tools/ directory and are called ~genescript/tools/dbindex and ~genescript/tools/dbfetch respectively.

The FASTA file for the sequence database must be placed within a subdirectory with the same name as the database. The subdirectory must in turn be placed in a directory containing all the sequence databases. The index for the sequence database should be placed in the same directory as the FASTA file and called fasta.index (the default for dbindex). It is very important that you run dbindex on the FASTA file only after the FASTA file is in its final directory, as Bioperl remembers the absolute path of the FASTA file.

For example, if the TIGR human gene index database is called tigrhuman you would create the databases as follows.

 bash$ pwd
 /data/sequencedb/tigrhuman
 bash$ formatdb -t "TIGR Human Gene Index" -p F -n tigrhuman -i HGI.060102
 bash$ dbindex HGI.060102
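
Once the commands above have been run, the directory layout can be verified with a small check such as the one below. This is a sketch only; check_seqdb is a hypothetical name, and the check assumes the index is a single file called fasta.index, as described above.

```shell
# Hypothetical check, not shipped with Genescript.
# Usage: check_seqdb /data/sequencedb tigrhuman
check_seqdb() {
    root=$1
    name=$2
    dir=$root/$name

    # The database must live in a subdirectory named after the database.
    if [ ! -d "$dir" ]; then
        echo "missing database directory: $dir" >&2
        return 1
    fi
    # The index must sit next to the FASTA file and be called fasta.index.
    if [ ! -e "$dir/fasta.index" ]; then
        echo "missing index: $dir/fasta.index (run dbindex in $dir)" >&2
        return 1
    fi
    echo "ok: $dir"
}
```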

If you would like to test the database index you can use dbfetch. Note that you must specify the full FASTA header (e.g. |gi|4028939|gb|AC001234.1|AC001234) to dbfetch. See the dbfetch.html documentation for more information.

Updating the Databases

The database files must be rebuilt in order to upgrade them. After you've upgraded the FASTA files, rebuild the BLAST database and FASTA index as explained above. The program dbindex will warn you that an index already exists and must be destroyed. You will then be asked if you want to rebuild the index; answer yes.

Configuring Genescript

After all the databases have been built, they must be entered into Genescript's main configuration file, ~genescript/config/main.conf. Genescript supports two types of databases: EST databases and comparative genomic databases (non-human genomic databases).

EST databases must be listed under the CDNALIBS option. You must provide descriptions and confidence levels using the CDNALIB_database_desc and CDNALIB_database_confidence options. Similarly for comparative databases you use the COMPLIBS option and associated COMPLIB_database_ options.

Each database (EST or comparative) must have parsing thresholds as well. These are added using options of the form DATABASE_min_id, DATABASE_min_score, and DATABASE_max_expect.

The location of the BLAST database must be provided as an absolute path under the option BLASTDB_database. Finally, the location of the sequence database directory must be specified under the FASTADBROOT option.

For example, let's say we have two EST databases called tigrhuman and refseqn and one comparative database called mouse. The configuration file would be as follows.

 CDNALIBS=tigrhuman,refseqn
 # Configure TIGR
 CDNALIB_tigrhuman_desc=Human TIGR
 CDNALIB_tigrhuman_confidence=8
 BLASTDB_tigrhuman=/data/blast/tigrhuman
 TIGRHUMAN_min_id=0.8
 TIGRHUMAN_min_score=50
 TIGRHUMAN_max_expect=10

 # Configure NCBI Refseq
 CDNALIB_refseqn_desc=NCBI Refseq
 CDNALIB_refseqn_confidence=10
 BLASTDB_refseqn=/data/blast/refseqn
 REFSEQN_min_id=0.8
 REFSEQN_min_score=50
 REFSEQN_max_expect=10
 # Configure Mouse Genomic
 COMPLIBS=mouse
 COMPLIB_mouse_desc=Mouse Genome
 COMPLIB_mouse_confidence=4
 BLASTDB_mouse=/data/blast/mouse
 MOUSE_min_id=0.1
 MOUSE_min_score=0
 MOUSE_max_expect=1e-6

 # Root directory of the sequence databases
 FASTADBROOT=/data/sequencedb
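
With several databases it is easy to misspell or forget one of the per-database options. The sketch below checks that every database listed under CDNALIBS and COMPLIBS has the options named in this section. It is not part of Genescript: the function names are hypothetical, and it assumes one OPTION=value pair per line, as in the example above.

```shell
# Succeed if option $2 is set in configuration file $1.
has_option() {
    grep -q "^ *$2=" "$1"
}

# Print the entries of comma-separated list option $2, one per line.
list_option() {
    sed -n "s/^ *$2=//p" "$1" | tr ',' '\n'
}

# Hypothetical validator. Usage: check_db_options /path/to/main.conf
check_db_options() {
    conf=$1
    missing=0
    for kind in CDNALIB COMPLIB; do
        for db in $(list_option "$conf" "${kind}S"); do
            # Threshold options use the upper-case database name.
            upper=$(printf '%s' "$db" | tr '[:lower:]' '[:upper:]')
            for opt in "${kind}_${db}_desc" "${kind}_${db}_confidence" \
                       "BLASTDB_${db}" "${upper}_min_id" \
                       "${upper}_min_score" "${upper}_max_expect"; do
                if ! has_option "$conf" "$opt"; then
                    echo "missing option: $opt"
                    missing=1
                fi
            done
        done
    done
    return $missing
}
```

The function prints one line per missing option and returns a non-zero status if anything is missing, so it can be used in a larger setup script.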

Configuring gff2ps

If you plan on using the graphical output options, you must also configure the gff2ps display options for each database. To do this, run genmapconf in the tools directory.

Accessing dbEST

In order for Genescript to function properly, the human subset of dbEST is required. Depending on whether or not you plan on using the advanced EST retrieval, the dbEST database can be set up in one of two ways. If you plan on using dbEST as a regular EST database (no clustering or enhanced EST retrieval), you can retrieve the sequences for the human subset of dbEST from ftp://ftp.ncbi.nih.gov/blast/db/est_human.Z. The database is then set up exactly as described in the section above.

If you plan on using the advanced retrieval and clustering method to access dbEST, you do not need a local sequence database. However, you still need a local dbEST human BLAST database. A pre-formatted BLAST database can be obtained from ftp://ftp.ncbi.nih.gov/blast/db/FormattedDatabases/est_human.tar.gz. Once you've obtained the est_human BLAST database, you must fill out the BLASTDB_esthuman option. However, you do not need to add esthuman to the CDNALIBS option. Remember to obtain TIGR Assembler 2 and Nclever as well.

Adding on to the example in the above section, we will add the dbEST database using the advanced retrieval method.

 BLASTDB_esthuman=/data/blast/est_human

NR BLAST Database

In order to run BLASTX on the predicted gene sequences, the NR BLAST database needs to be installed. Pre-formatted databases (both protein and nucleotide) can be retrieved from ftp://ftp.ncbi.nih.gov/blast/db/FormatedDatabases/ as the files nr.tar.gz and nt.tar.gz. After these have been unpacked, you must fill out the BLASTDB_nr and BLASTDB_nt options. For example:

 BLASTDB_nr=/data/blast/nr
 BLASTDB_nt=/data/blast/nt

Gene Predictors

To use a gene predictor with Genescript, there need to be supporting wrapper functions in Tools.pm. The predictors currently supported are Genscan, HMMgene, MZEF, and GrailEXP. The options for these predictors are all pre-configured. Remove any predictors that are not installed from the PREDICTORS option, and make sure to change all the path information for the predictors you do install. The path information is stored in the variables GENSCAN_cmd, GENSCAN_data, HMMGENE_cmd, GRAILEXP_cmd, GRAILEXP_env, MZEF_cmd, and MZEF_data.

RepeatMasker and MaskerAid

If you did not install MaskerAid you must make the following change in ~genescript/config/main.conf. Change the line

 REPEATMASKER_opt=-q -w -gff

to

 REPEATMASKER_opt=-q -gff

in order to disable MaskerAid.
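
If you prefer to script this change, a sketch such as the following could be used. The function name disable_maskeraid is hypothetical, and the sketch assumes the -w flag is not the first flag on the REPEATMASKER_opt line.

```shell
# Hypothetical helper: drop the -w flag from the REPEATMASKER_opt line.
# Usage: disable_maskeraid ~genescript/config/main.conf
disable_maskeraid() {
    conf=$1
    cp "$conf" "$conf.bak" || return 1    # keep a backup of the original
    # Only the REPEATMASKER_opt line is touched; -w may sit in the middle
    # of the flags or at the end of the line.
    sed '/^REPEATMASKER_opt=/{s/ -w / /;s/ -w$//;}' "$conf.bak" > "$conf"
}
```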

Default Execution Parameters

You can also set the default execution parameters in the main configuration file. These are the parameters that will be used if the user does not specify them in their own configuration file. See Usage for information on the execution parameters.


Usage Instructions

The default pipeline components are controlled through the main configuration file. The administrator should set up a reasonable default configuration. Users can then create custom pipeline jobs running only the components they choose by supplying a user-created configuration file. This file can override any configuration option, including parsing and scoring thresholds. Perhaps the quickest way for a user to set up a pipeline job is to copy one of the supplied example configuration files from the example/ directory.

Running Genescript

Genescript is run by executing the gs script. Its syntax is

  gs <sequence> [-conf=config_file] [-d=result_directory]

The result directory, if omitted, will be the input sequence filename with special characters removed and .gs appended. For example the command

 gs cftr.seq -conf myconfig.conf

will use

 cftrseq.gs/

as the result directory. The configuration file that is passed to gs controls all the pipeline parameters. It is explained below. The gs program has full perldoc documentation for those interested.
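
When gs is called from another script it can be useful to know the default result directory in advance. The sketch below re-creates the naming rule described above. It is not taken from the gs source: it assumes that "special characters" means everything except letters and digits, and that only the file name (not its directory) is used.

```shell
# Hypothetical re-creation of the default result directory name.
# Assumption: every character other than a letter or digit is removed.
default_result_dir() {
    name=$(basename "$1")
    printf '%s.gs\n' "$(printf '%s' "$name" | tr -cd '[:alnum:]')"
}

default_result_dir cftr.seq    # prints cftrseq.gs
```

Passing -d explicitly avoids any dependence on this rule.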

Execution Options

All the execution options can be enabled or disabled by using the form

 OPTION=on
 OPTION=off

MASKSEQ_run=[on|off]

This runs RepeatMasker on the input sequence. It is required for any homology searches.

PREDICTOR_name_run=[on|off]

This runs the predictor 'name' on unmasked sequence. The available predictors are genscan, hmmgene, mzef, and grailexp.

PREDICTORS_parse_gene=[on|off]

Causes the predicted gene sequences to be saved in a FASTA file. This is required for PREDICTORS_blastx_run.

PREDICTORS_parse_exon=[on|off]

Causes the predicted exon sequences to be saved in a FASTA file.

PREDICTORS_blastx_run=[on|off]

Runs BLASTX against the NR database on each of the extracted gene sequences.

DBEST_run=[on|off]

Runs the advanced dbEST search. Note that this requires Nclever and TIGR Assembler 2 to be installed.

DBEST_blastx_run=[on|off]

Runs BLASTX against the NR database on each of the EST clusters as well as any singleton EST sequences.

CDNALIB_name_run=[on|off]

Depending on which EST databases you installed locally, you can have them searched or not using options of this form. Replace 'name' with the lowercase database name of one of your local databases. As with the PREDICTOR_name_run option, you can specify this option multiple times (once for each database).

COMPLIB_name_run=[on|off]

Depending on which comparative databases you installed locally, you can have them searched or not using options of this form. Replace 'name' with the lowercase database name of one of your local databases. As with the PREDICTOR_name_run option, you can specify this option multiple times (once for each database).

MARKERS_epcr_run=[on|off]

Controls running e-PCR on the unmasked sequence.

GENERATEHTML=[on|off]

Generates HTML report files.

GENERATEPS=[on|off]

Generate PostScript report files. If Ghostscript is installed then PDF files will be generated as well.

GENERATE_embl=[on|off]

Generates EMBL annotation files for each GFF file.

GENERATE_genbank=[on|off]

Generates GenBank annotation files for each GFF file.

GENERATE_vista=[on|off]

Generates VISTA annotation files for each GFF file.

BLAST_megablast=[on|off]

Controls whether megablast is used for all the homology searches.

GFF2PS_show_repeats=[on|off]

Controls whether repeats are displayed on the graphical output.

NR_blastn_run=[on|off]

Runs BLASTN against NR on the masked sequence.

NR_blastx_run=[on|off]

Runs BLASTX against NR on the masked sequence.

MODELSCORING_run=[on|off]

Scores all the predicted gene models and reports the more likely predictions.
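
As an illustration, a user configuration file for a quick run might combine the options above as follows. The option names are those documented in this section; the particular selection, the file name myconfig.conf, and the database names tigrhuman and mouse (borrowed from the Configuration example) are only assumptions for the sake of the example.

```shell
# Write a small user configuration file. Options that are left out keep
# the defaults from the main configuration file.
cat > myconfig.conf <<'EOF'
MASKSEQ_run=on
PREDICTOR_genscan_run=on
PREDICTOR_hmmgene_run=on
PREDICTOR_mzef_run=off
PREDICTOR_grailexp_run=off
PREDICTORS_parse_gene=on
PREDICTORS_blastx_run=off
CDNALIB_tigrhuman_run=on
COMPLIB_mouse_run=off
GENERATEHTML=on
GENERATEPS=off
MODELSCORING_run=on
EOF

# The pipeline would then be started with (syntax as given above):
#   gs cftr.seq -conf=myconfig.conf
```

MASKSEQ_run is left on here because the EST database search depends on the masked sequence.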

Miscellaneous Options

TITLE=My Title

This allows custom titles to be added to the HTML output of pipeline runs.

CSSBUTTONS=[on|off]

By default CSS buttons are used in the HTML reports. For Netscape 4.x compatibility this should be turned off.

Predicted Gene Model Scoring Parameters

These parameters allow the user to customize the scoring thresholds for the automated scoring system.

MODELSCORING_min_fscore=[0<x<1]

This causes individual exons to be parsed out based on their score. The score threshold is a number between 0 and 1.

MODELSCORING_min_gene_ave=[0<x<1]

This parses entire gene models based on the average score of its exons. The score threshold is a number between 0 and 1.

MODELSCORING_min_gene_max=[0<x<1]

This parses entire gene models based on the maximum score of any of its exons. The score threshold is a number between 0 and 1.

Database Parsing Thresholds

Each database has options associated with it which indicate which BLAST hits should be considered good. These can be overridden by the user.

DATABASE_min_id=[0<x<1]

The minimum identity of a good hit, given as a fraction from 0 to 1 (for example, 0.8 for 80% identity).

DATABASE_min_score=[x]

The minimum score of a good hit. Takes a number 0 or greater.

DATABASE_max_expect=[0|1e-x]

The maximum expect value of a good hit. Takes a number that is 0 or greater and can be of the form 1e-6.
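
To get a feel for how the three thresholds interact, the sketch below applies them to a table of hits. It is not Genescript's own parser: the tab-separated input format (hit name, identity, score, expect) is invented for the illustration, and it assumes a hit must pass all three thresholds to count as good.

```shell
# Hypothetical filter, not part of Genescript.
# Input (assumed): tab-separated lines of  hit_name  identity  score  expect
# Usage: filter_hits MIN_ID MIN_SCORE MAX_EXPECT < hits.tsv
filter_hits() {
    awk -F '\t' -v min_id="$1" -v min_score="$2" -v max_expect="$3" '
        # Adding 0 forces numeric comparison, so values like 1e-6 work.
        ($2 + 0 >= min_id + 0) &&
        ($3 + 0 >= min_score + 0) &&
        ($4 + 0 <= max_expect + 0)
    '
}
```

For example, "filter_hits 0.8 50 10 < hits.tsv" would apply the thresholds used for the tigrhuman database in the Configuration section.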

Looking at the Results

If you are planning to use the HTML mark-up, point your browser at the file resultdir/html/info.html; the interface should be fairly self-explanatory. For more advanced uses of the results, such as incorporation into a larger automated analysis, here is the breakdown of the result directory structure.

/ (Graphical Overview Files)
The base of the result directory contains the graphical overview files. One of these files, combined.ps, is a condensed overview of all the strands (forward/neutral/reverse). All features of the same type are forced into a single track even if they overlap. The two other files, forward.ps and reverse.ps, contain detailed views of the forward/neutral and neutral/reverse strands respectively.

/seq/ (FASTA Sequence Files)
This directory contains all the FASTA sequence files. These files include the masked version of the sequence, predicted gene sequences, and high scoring hits from homology searches.

/gff/ (GFF Files)
All the GFF files are stored in this directory. Files with the extension .simple.gff are simplified GFF files made for use with gff2ps. Files with the extension .filt.gff are scored gene predictions, and files with the extension .evidence.gff are the associated evidence for the models.

/vista/ (VISTA Annotation Files)
If VISTA annotation files were requested they will reside in this directory.

/text/ (Raw Program Output Files)
The raw outputs from various programs are stored in this directory.

/blast/ (BLAST Report Files)
This directory stores all the BLAST report files that were generated during the annotation.

/html/ (HTML Files)
The HTML interface is stored in this directory. The most notable file here is info.html which is the root of the HTML interface.


Genescript Architecture

Internal data storage

Genescript uses the GFF file format (http://www.sanger.ac.uk/Software/formats/GFF/GFF_Spec.shtml) to store all annotation information. This format was chosen due to its simplicity and flexibility. One of the goals in developing Genescript was to avoid using heavy database software or complicated file types that are useful only in one program. Where necessary, simpler versions of the GFF files are generated for use in other programs such as gff2ps. The group field of the GFF file format is flexible enough to store almost any extra information Genescript needs to store.
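
Because GFF is a plain tab-separated format, the result files can be inspected with standard tools. The following sketch counts the features in a GFF file per source and feature type. The function name is hypothetical, and nothing here is specific to the way Genescript fills the group field.

```shell
# Count features per (source, feature) pair in a GFF file.
# GFF lines hold tab-separated fields in this order:
#   seqname source feature start end score strand frame group
gff_feature_counts() {
    awk -F '\t' '
        /^#/ || NF < 8 { next }    # skip comments and incomplete lines
        { count[$2 "\t" $3]++ }
        END { for (key in count) print key "\t" count[key] }
    ' "$1" | sort
}
```

Run against one of the files in the gff/ subdirectory of a result directory, this gives a quick summary of what each program contributed.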

Pipeline Flow Diagram

The flow of data through the pipeline is shown in the following diagram. If you are viewing this documentation via the distributed text files, see the file gsflow.eps or gsflow.png.


Troubleshooting

Log Files

Main Error Log

Any errors generated by components of the pipeline are logged to the result directory in the file resultdir/log/error.log.

Note that this file will not necessarily be empty even if everything is working correctly. Some third party programs use the error stream to print progress information.

RepeatMasker Error Log

RepeatMasker stores its errors in the file resultdir/temp/original.seq.stderr. Again, this file can also contain warnings which don't necessarily indicate the program is not working.