RATT is software to transfer annotation from a reference (annotated) genome to an unannotated query genome.
It was first developed to transfer annotations between different genome assembly versions. However, it can also transfer annotations between strains and even between different species, such as Plasmodium chabaudi onto P. berghei or Salmonella enterica onto Salmonella virchow. RATT is able to transfer any entries present on a reference sequence, such as the systematic id or an annotator's notes; such information would be lost in a de novo annotation. Furthermore, RATT checks whether gene models have changed between the two sequences and can correct changed start and stop codons, or frameshifts.
Please visit the http://ratt.sourceforge.net page for examples.
Installation at Sanger
At Sanger, the program is currently installed in ~tdo/Bin/ratt. Just be sure to set the variable RATT_HOME:
RATT_HOME=/nfs/users/nfs_t/tdo/Bin/ratt; export RATT_HOME (for bash)
If you want to change the configuration file (see Installation, point 4), create the file and set a system variable pointing to it:
RATT_CONFIG=/nfs/users/nfs_t/tdo/Bin/ratt/RATT.config_bac; export RATT_CONFIG (for bash)
RATT needs the MUMmer package (http://mummer.sourceforge.net/) to generate the sequence comparison. The following MUMmer programs must therefore be in your PATH: nucmer, delta-filter, show-snps and show-coords. RATT will not run without them. (These programs should be in the standard path at Sanger.)
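A quick way to confirm this before starting a transfer is to ask the shell where each program is. The following is a minimal sketch and not part of RATT; `missing_tools` is a hypothetical helper name chosen for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of RATT): print every program name given
# as an argument that cannot be found in the current PATH.
missing_tools() {
    local tool
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# The four MUMmer programs that RATT needs.
missing=$(missing_tools nucmer delta-filter show-snps show-coords)
if [ -n "$missing" ]; then
    # $missing is left unquoted on purpose so the names print on one line.
    echo "Not found in PATH:" $missing
else
    echo "All required MUMmer programs were found."
fi
```

If any name is reported, add the MUMmer installation directory to PATH before calling RATT.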
RATT was tested on Linux/Unix. It should run on OS X 10, but again, the third-party tools must be installed. All the installation help is written for Linux/Unix only.
2. Please download RATT to a directory of your choice:
3. Set the variable RATT_HOME to the directory where you unpacked the program. For example, if you downloaded RATT to ~/programs/, it will be unpacked into ~/programs/ratt/. Set the variable to (for bash):
RATT_HOME=~/programs/ratt/; export RATT_HOME
This line should be added to ~/.bashrc (or the equivalent start-up file of your shell).
4. As start codons and splice sites might vary between organisms, it will be necessary to generate a configuration file tailored to your specific needs. There are example configuration files for bacteria and eukaryotes, called RATT.config_bac and RATT.config_euk, in the $RATT_HOME directory. You will need to set a system variable pointing to the file. For example, if you want to use the bacterial configuration file:
RATT_CONFIG=$RATT_HOME/RATT.config_bac; export RATT_CONFIG (for bash)
If you need to generate your own, please do not change the # tags (#START, #STOP, #SPLICE, #CORRECTSPLICE). Example of a config file:
#START
ATG
#STOP
TGA
TAA
TAG
#SPLICE
GT..AG
#CORRECTSPLICE
1
Then just adapt the RATT_CONFIG variable.
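Because RATT relies on these tags, it can help to check a hand-edited file before pointing RATT_CONFIG at it. The sketch below is not part of RATT: `check_ratt_config` is a hypothetical helper, and it assumes only that each tag appears at the start of a line.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of RATT): report any section tag that is
# missing from a configuration file. Assumes each tag starts a line and is
# followed by whitespace or the end of the line.
check_ratt_config() {
    local config=$1 tag status=0
    for tag in '#START' '#STOP' '#SPLICE' '#CORRECTSPLICE'; do
        if ! grep -Eq "^${tag}([[:space:]]|\$)" "$config"; then
            echo "missing tag: $tag"
            status=1
        fi
    done
    return "$status"
}

# Check the file RATT_CONFIG points to, if the variable is set.
if [ -n "${RATT_CONFIG:-}" ] && [ -f "$RATT_CONFIG" ]; then
    check_ratt_config "$RATT_CONFIG" && echo "all tags present in $RATT_CONFIG"
fi
```

The check only looks at the tags; it does not validate the codons or splice patterns listed under them.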
5. You are ready to go.
How to call the program
RATT should be easy to call. The most difficult settings to get right are the nucmer parameters for determining synteny. To aid the user, we have predefined several parameter sets which should be suitable for most transfers. However, advanced users can alter the nucmer parameters if they wish.
You will need EMBL files of the reference (parent) sequence; these should be copied to a subdirectory of your working directory, e.g. embl. For the query you will need a (multi-)FASTA file containing each contig/chromosome to be annotated.
Once you have the above files you can use RATT to transfer your annotations. For example, if you wished to transfer annotations between two strains of the same species, you would use:
$RATT_HOME/start.ratt.sh embl query.fasta Transfer1 Strain
More specifically, you can start RATT using our example dataset with:
start.ratt.sh ./embl Tb_F11.fasta F11 Strain
Here is the explanation of the parameters:
$RATT_HOME/start.ratt.sh <Directory with embl-files> <Query-fasta sequence> <Resultname> <Transfer type> <optional: reference (multi) Fasta>
(*) - must be set as bash variables. Alternatively the user might just update the start.ratt.sh file.
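Since a wrong directory or query file is an easy mistake to make, the inputs can be checked before the call. This is a minimal sketch, not part of RATT: `ratt_preflight` is a hypothetical helper, and it assumes that the reference files carry the extension .embl and that the query is a FASTA file whose first line starts with ">".

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight check (not part of RATT): succeed only if the
# EMBL directory holds at least one *.embl file (assumed extension) and
# the query file looks like FASTA (its first line starts with '>').
ratt_preflight() {
    local embl_dir=$1 query=$2 first=
    if ! ls "$embl_dir"/*.embl >/dev/null 2>&1; then
        echo "no .embl files found in: $embl_dir"
        return 1
    fi
    if [ ! -f "$query" ]; then
        echo "query file not found: $query"
        return 1
    fi
    # Read the first line; '|| true' tolerates a file without a final newline.
    IFS= read -r first < "$query" || true
    case $first in
        '>'*) return 0 ;;
        *)    echo "not a FASTA file: $query"; return 1 ;;
    esac
}

# Example: run the transfer from the text only if the inputs look sane.
# ratt_preflight embl query.fasta &&
#     "$RATT_HOME/start.ratt.sh" embl query.fasta Transfer1 Strain
```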
We included an example, see http://ratt.sourceforge.net/example.html. It describes the transfer of annotation from Mycobacterium tuberculosis H37Rv onto M. tuberculosis F11.
There are several types of output file: statistics that report differences, files that refer to the query and files that refer to the reference. The files start with the resultName prefix specified by the user when starting RATT. Report files end with .csv and can be imported into spreadsheet programs. Files ending with gff or embl can be loaded into Artemis or ACT, see below. All files that carry the name of a reference replicon are relative to the reference; those that carry the name of a query replicon are relative to the query sequence.
Files for the query:
First, if your target genome has more than one replicon, the Query.fasta must be split into single contigs:
mkdir Seq; cd Seq; $RATT_HOME/main.ratt.pl Split F11.fasta; cd ..
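To illustrate what such a split produces, the sketch below writes each record of a multi-FASTA file to its own file. It is a hypothetical stand-in and not RATT's implementation: `split_fasta` and the output naming (<name>.fasta, taken from the first word of the header) are assumptions, and RATT's own Split may name its files differently.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for illustration only (not RATT's implementation):
# write each record of a multi-FASTA file to <outdir>/<name>.fasta, where
# <name> is the first word of the header line. The naming is an assumption.
split_fasta() {
    local fasta=$1 outdir=$2
    mkdir -p "$outdir"
    awk -v dir="$outdir" '
        /^>/ {
            if (out != "") close(out)            # finish the previous record
            name = substr($1, 2)                 # header word without ">"
            gsub(/[^A-Za-z0-9._-]/, "_", name)   # keep file names safe
            out = dir "/" name ".fasta"
        }
        out != "" { print > out }' "$fasta"
}

# Example call with the file names used in the text:
# split_fasta F11.fasta Seq
```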
To view the annotation:
art Seq/F11.fasta + F11.embl + F11.final.embl + Query/F11.Mutations.gff + F11.Report.gff
formatdb -p F -i embl/Tb_H37Rv.fasta
blastall -p blastn -m 8 -e 1e-10 -d embl/Tb_H37Rv.fasta -i Seq/F11.fasta -o comp.tb.blast
act embl/Tb_H37Rv.embl comp.tb.blast F11.fasta
Then open the annotation files by clicking on File -> F11.fasta -> open entries and selecting the files F11.final.embl and F11.Report.gff.
One can see that the first gene models have transferred perfectly.
To see regions where the annotation couldn't be transferred, load the file F11.H37Rv.NOTtransfer.embl onto the Tb_H37Rv.embl file (Menu: File -> Tb_H37Rv.embl -> New Entry). For comparative purposes, load the entries F11.orignal.embl and F11.embl onto the F11.fasta file (Menu: File -> F11.fasta -> New Entry). Next, right-click on the F11 genome sequence and select "one line per entry" in the pop-up menu. Please repeat this for the H37Rv genome.
Here we describe how the output of RATT could be analysed. The primary goal of RATT is to transfer annotation between genomes. Differences between genomes are often the focus of a comparative genomics project. Deleted genes, new genes, genes that are or were pseudogenes, or changes in genes, are of particular interest. By accurately annotating regions that are the same, annotator attention can be directed to areas where the sequences have diverged. Note that even subtle differences can have a major biological effect (e.g. a few SNPs in a promoter could change the transcription of a gene). For this reason, RATT also highlights single nucleotide polymorphisms (SNPs) between genomes.
art Seq/F11.fasta + F11.final.embl + Query/F11.Mutations.gff + F11.Report.gff
Seq/F11.fasta is the sequence file.
Obviously, these files can also be loaded into an ACT view, as described in Post visualization. Here we describe their use in Artemis, which is nearly identical to ACT.
First, one should have a look at the regions that have no synteny with the reference:
Menu: Select -> Feature Selector: As key, replace "CDS" with "Synteny". Check the Key box and uncheck the Qualifier box. Then press view.
A new window will open. In this window, each line records a region with no synteny. If the region is small (less than 200 base pairs), it is probably due to a deletion. RATT should be able to fix the gene models where these deletions occur, but the resulting genes are likely to be quite different. If the region is bigger, it might be a gap or a real insertion in the query. Therefore there might be genes that:
(i) Have lower similarity than specified in the comparison
(ii) Are deleted in the reference
(iii) Are a possible horizontal transfer
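For a genome with many such regions, the 200 base pair rule of thumb above can also be applied on the command line. The following is a rough sketch, not part of RATT: `classify_no_synteny` is a hypothetical helper, it assumes a tab-separated GFF file with the feature key "Synteny" in column 3 and start and end in columns 4 and 5, and which RATT output file carries these features is also an assumption.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of RATT): list "Synteny" features from a
# GFF file with their length and a rough class based on the 200 bp rule
# of thumb. Assumes tab-separated GFF: key in column 3, start and end in
# columns 4 and 5.
classify_no_synteny() {
    awk -F '\t' '
        !/^#/ && $3 == "Synteny" {
            len = $5 - $4 + 1
            class = (len < 200) ? "small (probable deletion)" : "large (gap or insertion?)"
            printf "%s\t%d\t%d\t%d\t%s\n", $1, $4, $5, len, class
        }' "$1"
}

# Example call; the file name is an assumption:
# classify_no_synteny F11.Report.gff
```

The output columns are sequence name, start, end, length and class. The class is only a hint for where to look first in Artemis.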
Next we propose to look for changes in the genes. First, tick only the entry F11.Report.gff in the Artemis window (disable the entries F11.final.embl and F11.Mutations.gff). The features you now see in Artemis are: Error, Frameshift, CorrectStart, CorrectStop. By systematically going through this list and checking the new annotation (re-enable the entry F11.final.embl), you can find:
Extended genes
Shorter genes - important domain deleted?
Genes that are now pseudo genes
Genes that were pseudo genes
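Before going through the list by hand, it can be useful to know how many features of each type there are. The sketch below is not part of RATT: `count_gff_keys` is a hypothetical helper, and it assumes that the types named above (Error, Frameshift, CorrectStart, CorrectStop) appear as the feature key in column 3 of a tab-separated GFF file.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of RATT): count the features in a GFF
# file by key (column 3), most frequent first.
count_gff_keys() {
    awk -F '\t' '!/^#/ && NF >= 3 { n[$3]++ }
                 END { for (k in n) printf "%d\t%s\n", n[k], k }' "$1" |
        sort -k1,1nr -k2,2
}

# Example call with the report file used in the text:
# count_gff_keys F11.Report.gff
```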
This is very useful for getting a feeling for what kind of genes have changed. A biologist working with the species can easily determine whether important genes have changed.
The last step is to open the genes that were not transferred. These are the genes that couldn't be transferred due to deletions or too low similarity. The file can be viewed directly using ACT, or in Artemis:
art Tb_H37Rv.embl + F11.H37Rv.NOTTransfered.embl
Just unselect the Tb_H37Rv.embl entry and you will see the unmapped annotation features.
Run RATT after an iCORN run
After running iCORN, the corrected sequence should generally also be annotated, provided an annotation exists. Just call RATT as:
$RATT_HOME/start.ratt.sh <Directory with embl-files> <reference of latest annotation> <Resultname>
This should transfer the annotation without any problems.
Run RATT after iMAGE or ABACAS
The output of iMAGE or ABACAS is used as the query input for RATT, and the annotation from a reference is transferred onto it. For ABACAS, the reference should be the one used for the ordering; for iMAGE, it could be the genome annotation from before the gap closing, or a closely related annotated reference.
Functionality of main.ratt.pl
The main program is main.ratt.pl. Normally a user won't need to call this program directly. Nevertheless, we describe here its different functions and how to call them:
$RATT_HOME/main.ratt.pl Transfer <embl Directory> <mummer SNP file> <mummer coord file> <ResultName>
This functionality uses the mummer output to map the annotation from embl files, which are in the <embl Directory>, to the query. It generates all the new annotation files (ResultName.replicon.embl), as well as files describing which annotations remain untransferred (Replicon_reference.NOTtransfered.embl).
Corrects a given annotation, as described previously. The corrections are reported and the new file is saved as <ResultName>.embl.
Similar to the correct option, but it will only report errors in an EMBL file.
Some EMBL files have feature positions spanning several lines; this function consolidates these features so they appear on one line. The result name is <EMBL File>.<ResultName postfix>.
Every 250 base pairs a base is changed (mutated). The result is saved as <fastafile>.mutated. This is necessary to recalibrate RATT for similar genomes.
Splits a given multifasta file into individual files containing one sequence. This is necessary as visualization tools (e.g. Artemis) prefer single fasta files.
Generates files that report the SNPs, indels and regions not shared by the reference and query. It also prints statistics reporting the coverage for each replicon.
Extracts the sequence from embl files in the <EMBL directory> and saves it as a <fasta file>.