RATT Documentation

Overview

RATT is software to transfer annotation from a reference (annotated) genome to an unannotated query genome.

It was first developed to transfer annotations between different versions of a genome assembly. However, it can also transfer annotations between strains and even between different species, such as Plasmodium chabaudi onto P. berghei or Salmonella enterica onto Salmonella virchow. RATT can transfer any entry present on a reference sequence, such as the systematic id or an annotator's notes; such information would be lost in a de novo annotation. Furthermore, RATT checks whether gene models have changed between the two sequences and can correct changed start and stop codons, or frameshifts.

Please visit the http://ratt.sourceforge.net page for examples.

Installation at Sanger

At Sanger, the program is currently installed in ~tdo/Bin/ratt. Just be sure to set the variable RATT_HOME:

   RATT_HOME=/nfs/users/nfs_t/tdo/Bin/ratt; export RATT_HOME  (for bash)

If you want to change the configuration file (see Installation, point 4), create the file and set an environment variable pointing to it:

   RATT_CONFIG=/nfs/users/nfs_t/tdo/Bin/ratt/RATT.config_bac; export RATT_CONFIG  (for bash)

RATT needs the MUMmer package (http://mummer.sourceforge.net/) to generate the sequence comparison. The following MUMmer programs must be in your PATH: nucmer, delta-filter, show-snps and show-coords. RATT will not run without them. (These programs should be in the standard path at Sanger.)
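
A quick way to confirm this before starting a run is the shell built-in command -v. The following is a minimal sketch; the function name check_tools is hypothetical and not part of RATT.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of RATT): report which of the given
# programs cannot be found in PATH. Returns non-zero if any is missing.
check_tools() {
    local missing=0 tool
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing from PATH: $tool" >&2
            missing=1
        fi
    done
    return "$missing"
}

# The four MUMmer programs RATT relies on.
if check_tools nucmer delta-filter show-snps show-coords; then
    echo "All MUMmer programs found."
else
    echo "Add the MUMmer directory to PATH before running RATT." >&2
fi
```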

Installation

RATT was tested on Linux/Unix. It should also run on Mac OS X, provided the third-party tools are installed. All the installation help is written for Linux/Unix only.


1. Install the MUMmer package. Make sure the programs are in your path, e.g. PATH=$PATH:/path/to/Mummer/; export PATH. For visualization it is useful to have NCBI BLAST installed (to compare genomes in ACT), but this is not mandatory for RATT.

2. Download RATT to a directory of your choice:

svn co https://ratt.svn.sourceforge.net/svnroot/ratt ratt

3. Set the variable RATT_HOME to the directory where you unpacked the program. For example, if you downloaded RATT to ~/programs/, it will be unpacked into ~/programs/ratt/. Set the variable as follows (for bash):

    RATT_HOME=~/programs/ratt/; export RATT_HOME

These lines should be written into the ~/.bashrc (or equivalent system file).

4. As start codons and splice sites may vary between organisms, you will need a configuration file suited to your organism. Example configuration files for bacteria and eukaryotes, called RATT.config_bac and RATT.config_euk, are in the $RATT_HOME directory. Set the variable RATT_CONFIG to point to the file you want to use. For example, to use the bacterial configuration file:

   RATT_CONFIG=$RATT_HOME/RATT.config_bac; export RATT_CONFIG  (for bash)

If you need to generate your own, please do not change the tag lines that start with #. Example of a config file:

  #START
  ATG 
  #STOP
  TGA
  TAA
  TAG
  #SPLICE
  GT..AG
  #CORRECTSPLICE
  1

Then just adapt the RATT_CONFIG variable to point to your file.
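
As a sketch of this step, the commands below write a configuration with the same content as the example above, check that the four tag lines are present, and point RATT_CONFIG at the new file. The file name RATT.config_custom is a hypothetical choice.

```shell
#!/usr/bin/env bash
# Sketch: write a custom RATT configuration and point RATT_CONFIG at it.
# The file name RATT.config_custom is a hypothetical choice.
config_file="$PWD/RATT.config_custom"

# Same content as the documented example; edit the codon and splice-site
# lines as needed, but leave the lines starting with '#' unchanged.
cat > "$config_file" <<'EOF'
#START
ATG
#STOP
TGA
TAA
TAG
#SPLICE
GT..AG
#CORRECTSPLICE
1
EOF

# Confirm that all four tag lines are still present after editing.
for tag in '#START' '#STOP' '#SPLICE' '#CORRECTSPLICE'; do
    grep -qx -- "$tag" "$config_file" || echo "tag $tag is missing" >&2
done

RATT_CONFIG=$config_file; export RATT_CONFIG
echo "RATT_CONFIG=$RATT_CONFIG"
```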

5. You are ready to go.

How to call the program

RATT should be easy to call. The most difficult settings to get right are the nucmer parameters for determining synteny. To aid the user we have predefined several parameter sets which should be suitable for most transfers. However, advanced users can alter the nucmer parameters if they wish.

You will need embl files of the reference (parent) sequence, and these should be copied to a subdirectory within your working directory, e.g. embl. For the query you will need a (multi-)fasta file containing each contig/chromosome to be annotated.

Once you have the above files you can use RATT to transfer your annotations. For example, if you wished to transfer annotations between two strains of the same species, you would use:

$RATT_HOME/start.ratt.sh embl query.fasta Transfer1 Strain

More specifically, you can start RATT using our example dataset with:

start.ratt.sh ./embl Tb_F11.fasta F11 Strain

Here is the explanation of the parameters:

  $RATT_HOME/start.ratt.sh <Directory with embl-files> <Query-fasta sequence> <Resultname> <Transfer type> <optional: reference (multi) Fasta>

Directory with embl-files - This directory contains all the embl files whose annotation should be transferred to the query.

Query-fasta sequence - A multi-fasta file onto which the annotation will be mapped.

Resultname - The prefix you wish to give to each result file.

Transfer type - The following parameter sets can be used (see the table below for the settings of each set):

  (i) Assembly: Transfer between different assemblies.
  (ii) Assembly.Repetitive: As before, but the genome is extremely repetitive. Use this only if Assembly doesn't return good results (misses too many annotation tags).
  (iii) Strain: Transfer between strains. Similarity is between 95-99%.
  (iv) Strain.Repetitive: As before, but the genome is extremely repetitive. Use this only if Strain doesn't return good results (misses too many annotation tags).
  (v) Species: Transfer between species. Similarity is between 50-94%.
  (vi) Species.Repetitive: As before, but the genome is extremely repetitive. Use this only if Species doesn't return good results (misses too many annotation tags).
  (vii) Multiple: Use this when many annotated strains serve as the reference and you assume the newly sequenced genome has many insertions compared to the reference strains. This parameter set uses the best regions of each reference strain to transfer tags.
  (viii) Free: The user sets all parameters individually.

reference fasta - Name of the reference multi-fasta file. VERY IMPORTANT: the name of each sequence in the fasta description line MUST be the same as the name of its corresponding embl file. So if your embl file is called Tuberculosis.embl, the entry in your reference.fasta file has to be:

  >Tuberculosis
  ATTGCGTACG ...
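
Because a name mismatch is easy to overlook, it can help to check the names before starting a transfer. The following is a minimal sketch, assuming the annotation files end in .embl and that the sequence name is the first word of each fasta description line; the function name check_reference_names is hypothetical and not part of RATT.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of RATT): report every EMBL file in a
# directory whose base name has no matching '>' entry in the reference
# multi-fasta. Prints nothing and returns 0 when all names match.
check_reference_names() {
    local embl_dir=$1 fasta=$2 status=0 f name
    for f in "$embl_dir"/*.embl; do
        [ -e "$f" ] || continue        # directory contains no .embl files
        name=$(basename "$f" .embl)
        # Compare against the first word of each fasta description line.
        if ! awk -v n="$name" \
                '/^>/ { if (substr($1, 2) == n) found = 1 }
                 END  { exit !found }' "$fasta"; then
            echo "no FASTA entry named '$name' for $f"
            status=1
        fi
    done
    return "$status"
}

# Usage: check_reference_names embl reference.fasta
```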


Here is the explanation of the parameters used for determining synteny with MUMmer:

Parameter set for RATT

Parameter name            Word size  Identity cutoff  Cluster size  Max extend cluster  Anchor choice  Rearrange       Faux SNP  Example use
Assembly                  30         99               400           1000                (none)         -g -o 0         yes       Plasmodium falciparum onto itself
Assembly.Repetitive       30         99               400           1000                --maxmatch     -g -o 0         yes       Plasmodium berghei onto itself
Strain                    20         90               400           500                 (none)         -r -o 1         yes
Strain.global             20         90               400           500                 (none)         -g -o 1         yes       Mycobacterium tuberculosis H37Rv onto M. tuberculosis F11
Strain.Repetitive         20         90               400           500                 --maxmatch     -r -o 1         yes
Strain.global.Repetitive  20         90               400           500                 --maxmatch     -g -o 1         yes
Species                   10         40               400           1000                (none)         -r -o 5         no        Salmonella typhimurium onto S. virchow
Species.global            10         40               400           1000                (none)         -g -o 5         no        Salmonella typhimurium onto S. virchow
Species.Repetitive        10         40               400           1000                --maxmatch     -r -o 5         no        Plasmodium chabaudi onto P. berghei
Species.global.Repetitive 10         40               400           1000                --maxmatch     -g -o 5         no        Plasmodium chabaudi onto P. berghei
Multiple                  25         98               400           1000                --maxmatch     -q -o 1         no        Different Salmonella onto S. virchow
Free*                     RATT_l     RATT_ind         RATT_c        RATT_g              RATT_anchor    RATT_rearrange  no

(*) - must be set as bash variables. Alternatively the user might just update the start.ratt.sh file.
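
As a sketch of the Free setting, the variables named in the table can be exported before calling start.ratt.sh. The values below simply copy the Strain row, and the exact value format that start.ratt.sh expects for RATT_anchor and RATT_rearrange is an assumption; embl, query.fasta and Transfer1 are the placeholder names used earlier.

```shell
#!/usr/bin/env bash
# Sketch: choose the "Free" transfer type and supply the nucmer settings
# through the variables named in the table. The values copy the Strain
# row; the value format start.ratt.sh expects is an assumption.
export RATT_l=20                 # word size
export RATT_ind=90               # identity cutoff
export RATT_c=400                # cluster size
export RATT_g=500                # max extend cluster
export RATT_anchor=""            # anchor choice (e.g. --maxmatch)
export RATT_rearrange="-r -o 1"  # rearrange options

# Run only if RATT is installed where RATT_HOME points.
if [ -x "${RATT_HOME:-}/start.ratt.sh" ]; then
    "$RATT_HOME"/start.ratt.sh embl query.fasta Transfer1 Free
else
    echo "start.ratt.sh not found; set RATT_HOME first" >&2
fi
```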

Example Files

We have included an example, see http://ratt.sourceforge.net/example.html. It describes the transfer of Mycobacterium tuberculosis H37Rv onto M. tuberculosis F11.

Output files

There are several types of output file: statistics that report differences, files that refer to the query and files that refer to the reference. All file names start with the ResultName prefix specified by the user when starting RATT. Report files end with .csv and can be imported into spreadsheet programs. Files that end with gff or embl can be loaded into Artemis or ACT, see below. All files that carry the name of a replicon of the reference are relative to the reference. Files that carry the name of a query replicon are relative to the query sequence.

Reports:
The first report is printed while the program is running. It tells the user how many regions of the reference are syntenic with the query and vice versa. It also reports how many tags are transferred and how many are not. Tags include features like ncRNA, UTR, gap-tags, repetitive regions or CDS.

  1. The file ResultName-prefix.replicon.report.csv - Reports how many gene models were wrong after the transfer, and how they could be corrected.


Files for the reference:

  1. ResultName-prefix.replicon.NOTTransfered.embl - These are annotations that couldn't be transferred. This can include whole genes, or just exons.
  2. Reference/ResultName-prefix.replicon.Mutations.gff - This file contains all the differences of the query compared to the reference. It also shows the regions that are not syntenic between both genomes. This can be due to insertions/deletions, low similarity, or 100% similar repeats. Important: the annotation of those regions cannot be transferred!

Files for the query:

  1. ResultName-prefix.replicon.embl - These are the uncorrected annotations transferred from the reference onto the query.
  2. ResultName-prefix.replicon.final.embl - These are the corrected annotations for the query.
  3. ResultName-prefix.replicon.report.gff - An important file, as it shows where RATT has corrected CDS models, or where errors remain. This includes corrections/errors in start/stop codons, splice sites, frameshifts and joined exons.
  4. Query/ResultName-prefix.replicon.Mutations.gff - This file contains all the differences between the reference and query. In addition, it shows regions that are not syntenic between both genomes. This can be due to insertions/deletions, low similarity, or 100% similar repeats. Important: the annotation of these regions will not be transferred! The annotation of these regions in the query must be determined by other tools.

Post visualization

The best way to visualize RATT results is to use Artemis and ACT. http://ratt.sourceforge.net/example.html - gives examples using these tools but we include a brief tutorial here as well.

First, if your target genome has more than one replicon, the Query.fasta must be split into single contigs:

   mkdir Seq;
   cd Seq;
   $RATT_HOME/main.ratt.pl Split F11.fasta
   cd ..


This assumes your ResultName was F11 and the query is called F11.fasta.

To view the annotation:

   art Seq/F11.fasta + F11.embl + F11.final.embl + Query/F11.Mutations.gff + F11.Report.gff 


To see a comparative view with the transferred and untransferred gene models, you must first generate a comparison file (-m8) using BLAST. To perform this with the example set, make sure blastall is installed.

  formatdb -p F -i embl/Tb_H37Rv.fasta
  blastall -p blastn -m 8 -e 1e-10 -d embl/Tb_H37Rv.fasta -i Seq/F11.fasta -o comp.tb.blast


Now it can be opened in ACT:

   act embl/Tb_H37Rv.embl comp.tb.blast F11.fasta

Then open the annotation files, by clicking on File -> F11.fasta -> open entries and select the files F11.final.embl and F11.report.gff.

One can see that the first gene models have been transferred perfectly.

To see regions where the annotation couldn't be transferred, load the file F11.H37Rv.NOTtransfer.embl onto the Tb_H37Rv.embl file (Menu: File -> Tb_H37Rv.embl -> New Entry). For comparative purposes, load the entries F11.orignal.embl and F11.embl onto the F11.fasta file (Menu: File -> F11.fasta -> New Entry). Next, right-click over the F11 genome sequence and select "one line per entry" in the pop-up menu. Repeat this for the H37Rv genome.

Biological interpretation

Here we describe how the output of RATT could be analysed. The primary goal of RATT is to transfer annotation between genomes. Differences between genomes are often the focus of a comparative genomics project. Deleted genes, new genes, genes that are or were pseudogenes, or changes in genes, are of particular interest. By accurately annotating regions that are the same, annotator attention can be directed to areas where the sequences have diverged. Note that even subtle differences can have a major biological effect (e.g. a few SNPs in a promoter could change transcription). For this reason, RATT also highlights single nucleotide polymorphisms (SNPs) between genomes.


First the results should be loaded into Artemis:

   art Seq/F11.fasta + F11.final.embl + Query/F11.Mutations.gff + F11.Report.gff 

Seq/F11.fasta is the sequence file.
F11.final.embl contains the final annotation.
Query/F11.Mutations.gff contains the differences between the two genomes (SNPs/indels) as well as the regions of the genomes that are not in synteny.
F11.Report.gff reports the changes made by RATT - therefore, it also indicates where genes are different.

These files can also be loaded into an ACT view, as described in Post visualization. Here we describe their use in Artemis, which is nearly identical to ACT.

First, one should have a look at the regions that have no synteny with the reference:

  Menu: Select -> Feature Selector: As key, replace "CDS" with "Synteny". 
  Check the Key box and uncheck the Qualifier box. 
  Then press view.

A new window will open. In this window, each line records a region with no synteny. If the region is small (less than 200 base pairs), it is probably due to a deletion. RATT should be able to fix the gene models where these deletions occur, but the resulting genes are likely to be quite different. If the region is bigger, it might be a gap or a real insertion in the query. Such a region might therefore contain genes that:

  (i) Have lower similarity than specified in the comparison
  (ii) Are deleted in the reference
  (iii) Are a possible horizontal transfer
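
The same size-based triage can be sketched on the command line. This assumes the Mutations.gff file is tab-delimited GFF with the feature key in column 3 and start/end in columns 4 and 5, and that non-syntenic regions carry the key "Synteny" used in the Feature Selector step; neither has been checked against real RATT output. The function name list_nonsyntenic is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: for every feature with key "Synteny" in a GFF
# file, print start, end, length and a size label (tab-separated).
# Regions shorter than 200 bp are labelled "small", all others "large".
list_nonsyntenic() {
    awk -F'\t' '
        BEGIN { OFS = "\t" }
        $0 !~ /^#/ && $3 == "Synteny" {
            len = $5 - $4 + 1
            kind = (len < 200) ? "small" : "large"
            print $4, $5, len, kind
        }' "$1"
}

# Usage: list_nonsyntenic Query/F11.Mutations.gff
```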

Next we propose to look for changes in the genes. First, tick only the entry F11.Report.gff in the Artemis window (disable the entries F11.final.embl and F11.Mutations.gff). The feature keys you will then see are: Error, Frameshift, CorrectStart and CorrectStop. By systematically going through this list and checking the new annotation (enable the entry F11.final.embl again), you can find:

  Extended genes 
  Shorter genes - important domain deleted?
  Genes that are now pseudo genes
  Genes that were pseudo genes

This is very useful for getting a feeling for what kind of genes have changed. A biologist working with the species can easily determine whether important genes have changed.

The last step is to open the file of genes that were not transferred. These are the genes that couldn't be transferred due to deletions or too low similarity. The file can be viewed directly using ACT, or in Artemis:
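
For a quick overview before browsing, the feature keys in the report file can be tallied on the command line. This is a minimal sketch, assuming the report is tab-delimited GFF and that keys such as Error, Frameshift, CorrectStart and CorrectStop appear in column 3; the function name count_report_keys is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: count how often each feature key (GFF column 3)
# occurs in a report file. Output is "key<TAB>count", sorted by key.
count_report_keys() {
    awk -F'\t' '
        $0 !~ /^#/ && NF >= 3 { n[$3]++ }
        END { for (k in n) print k "\t" n[k] }' "$1" | sort
}

# Usage: count_report_keys F11.Report.gff
```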

  art Tb_H37Rv.embl + F11.H37Rv.NOTTransfered.embl

Just unselect the Tb_H37Rv.embl and you will see the non mapped annotation features.

For more information about Artemis and Act, please find user manuals here: http://www.sanger.ac.uk/Software/Artemis/manual/ and http://www.sanger.ac.uk/Software/ACT/v7/manual/.

Run RATT after an iCORN run

After running iCORN, the corrected sequence should in general also be annotated, provided an annotation exists. Just call RATT as:

  $RATT_HOME/start.ratt.sh <Directory with embl-files> <reference of latest annotation> <Resultname> 

This should transfer the annotation without any problems.

Run RATT after iMAGE or ABACAS

After running ABACAS or iMAGE, just use the output of the program as the Query input for RATT, and the annotation from a reference will be transferred. For ABACAS, the reference should be the one used for the ordering. For iMAGE, it could be the genome annotation from before the gap closing, or the annotation of a closely related reference.

Functionality of main.ratt.pl

The main program is main.ratt.pl. Normally a user won't need to call this program directly. Nevertheless, we describe here its different functions and how to call them:

$RATT_HOME/main.ratt.pl Transfer <embl Directory> <mummer SNP file> <mummer coord file> <ResultName>

This functionality uses the mummer output to map the annotation from embl files, which are in the <embl Directory>, to the query. It generates all the new annotation files (ResultName.replicon.embl), as well as files describing which annotations remain untransferred (Replicon_reference.NOTtransfered.embl).


$RATT_HOME/main.ratt.pl Correct <EMBL file> <fasta file> <ResultName>

Corrects a given annotation, as described previously. The corrections are reported and the new file is saved as <ResultName>.embl.


$RATT_HOME/main.ratt.pl Check <EMBL file> <fasta file> <ResultName>

Similar to the correct option, but it will only report errors in an EMBL file.


$RATT_HOME/main.ratt.pl EMBLFormatCheck <EMBL file> <ResultName postfix>

Some EMBL files have feature positions spanning several lines. This function consolidates these features so they appear on one line. The result name is <EMBL File>.<ResultName postfix>.


$RATT_HOME/main.ratt.pl Mutate <(multi-)fasta-file>

Every 250 base pairs a base is changed (mutated). The result is saved as <fastafile>.mutated. This is necessary to recalibrate RATT for similar genomes.


$RATT_HOME/main.ratt.pl Split <multifasta-file>

Splits a given multifasta file into individual files containing one sequence. This is necessary as visualization tools (e.g. Artemis) prefer single fasta files.
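
For illustration, the effect of such a split can be approximated with awk. This is a sketch only and not RATT's implementation; in particular, naming each output file after the first word of its fasta header plus ".fasta" is an assumption, and the function name split_fasta is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only (not RATT's code): write each sequence of a multi-fasta
# file to its own file in out_dir, named <first word of header>.fasta.
split_fasta() {
    local fasta=$1 out_dir=${2:-.}
    mkdir -p "$out_dir"
    awk -v dir="$out_dir" '
        /^>/ {
            if (out != "") close(out)   # keep only one file open at a time
            out = dir "/" substr($1, 2) ".fasta"
        }
        out != "" { print > out }       # header and sequence lines
    ' "$fasta"
}

# Usage: split_fasta F11.fasta Seq
```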


$RATT_HOME/main.ratt.pl Difference <mummer SNP file> <mummer coord file> <ResultName>

Generates files that report the SNPs, indels and regions not shared by the reference and query. It also prints statistics reporting the coverage for each replicon.


$RATT_HOME/main.ratt.pl Embl2Fasta <EMBL dir> <fasta file>

Extracts the sequence from embl files in the <EMBL directory> and saves it as a <fasta file>.