ProSplign

ProSplign is a global alignment tool developed by Dr. Boris Kiryutin. It produces accurate spliced alignments and can align distantly related proteins with low similarity. Extra effort is made to locate frameshift positions.

The ProSplign algorithm is an integral component of the NCBI Eukaryotic Genome Annotation Pipeline, which has been used to annotate important genomes from many different plant and animal species (such as human, mouse, and cow). The Pipeline was used by the Sea Urchin Genome Sequencing Center for sequence analysis of the 814-megabase genome of the sea urchin Strongylocentrotus purpuratus, which was published in Science in 2006. Integrating ProSplign into the genome annotation pipeline significantly improved annotation quality over previously available methods. Following this success, the method was used to annotate Tribolium castaneum (Nature, 2008), taurine cattle (Science, 2009), Acyrthosiphon pisum (PLoS Biology, 2010), Nasonia (Science, 2010), and many other genomes.

ProSplign is also a central part of the automatic pipeline for Influenza virus genomes, an important part of the Influenza Genome Sequencing Project. Sponsored by the National Institutes of Health, the Influenza Project is an international collaboration of critical importance for public health. It has already led to multiple new discoveries about the recent evolution and pathogenesis of influenza, which have been published in leading journals including Journal of Virology, PLoS Biology, and Nature.

ProSplign is a utility for computing the alignment of proteins to genomic nucleotide sequence. This alignment can include eukaryotic splicing. At the heart of the program is a global alignment algorithm that specifically accounts for introns and splice signals. It is due to this algorithm that ProSplign is accurate in determining splice sites and tolerant to sequencing errors.

ProSplign uses BLAST hits to identify possible locations of genes and their duplications on genomic sequences and then to speed up the core dynamic programming.


This web site is a single-point source of information on ProSplign, the tool for computing protein-to-genomic alignments that include an effort to account for mRNA splicing. ProSplign was developed with the following goals in mind:

Accuracy in determining splice signals
Recognition of short exons and non-consensus splices where feasible
Ability to identify and separate multiple compartments typically representing gene copying events
Frameshift detection

ProSplign is used in the NCBI Eukaryotic Genome Annotation Pipeline to compute spliced protein alignments, and in the NCBI Prokaryotic Genome Annotation Pipeline to find frameshifted genes and to locate frameshift positions on the genome.

ProSplign can be used in several ways. There is no online version; you must download and install the console version, which is available for Linux (and may also be available for a few other platforms on request). You can also link to the ProSplign library from your own applications in a portable way, since ProSplign is part of the NCBI C++ Toolkit. Finally, ProSplign is available as a plugin for NCBI Genome Workbench.

Reference: ProSplign - Protein to Genomic Alignment Tool. B. Kiryutin, A. Souvorov, T. Tatusova. Manuscript in preparation

Binaries (updated 02/23/15)
Pre-built executables are available for Linux/i386 (64bit)

Sources
ProSplign was written for gene prediction at NCBI. No effort is made to maintain backward compatibility between versions.
ProSplign is included in the NCBI C++ Toolkit. For details on how to download, configure, and build the Toolkit, please consult the NCBI C++ Toolkit book.
You can browse the Toolkit's code through the LXR or Doxygen source browsers. Search for the C/C++ symbol CProSplign to go directly to the ProSplign sources.

Graphical view
NCBI Genome Workbench provides graphical alignment views. Watch the NCBI Genome Workbench tutorial for ProSplign.
A video tutorial is also available on YouTube.

Using the console version

The console version of ProSplign can be launched in two modes: pairwise and batch. Pairwise mode is useful if you need to quickly align a few sequences and don't want to compute separate BLAST hits for them. Batch mode is best suited to massive alignment jobs, e.g. as part of your genome annotation process. To see the parameters, run "./prosplign -help". Most of the parameters are for the internal NCBI gene prediction process.

In pairwise mode, put your protein query and nucleic acid subject sequences in two files (only the first sequence in each file will be aligned) and run the command line "./prosplign -full -nfa nuc.fa -pfa prot.fa -out aln.txt -fasn aln.asn". The nfa parameter is the file with the nucleic acid subject, and the pfa parameter is the file with the protein query. Text output is written to the file specified in the out parameter, and ASN.1 output to the file specified in the fasn parameter.

Batch mode is organized in three steps.

Run a BLAST program to generate 12-column, tab-separated output. Make sure the output is sorted by subject and then by query. For example (the input fasta files can be found here):

makeblastdb -dbtype nucl -in subj.fa
tblastn -query query.fa -db subj.fa -outfmt 6  | sort -k 2,2 -k 1,1 > blast.hit

resulting in:


gi|6679997|ref|NP_032143.1|     gi|37544107|ref|NT_010783.14|Hs17_10940 35.000  140     57      5       58      163     20639910        20639491        2.87e-11        62.0
gi|6679997|ref|NP_032143.1|     gi|37544107|ref|NT_010783.14|Hs17_10940 42.400  125     39      3       58      149     20602325        20601951        1.35e-15        74.7
gi|6679997|ref|NP_032143.1|     gi|37544107|ref|NT_010783.14|Hs17_10940 42.400  125     39      3       58      149     20625221        20624847        1.44e-14        71.6
gi|6679997|ref|NP_032143.1|     gi|37544107|ref|NT_010783.14|Hs17_10940 45.455  88      44      3       108     191     20647262        20646999        2.94e-12        64.7
gi|6679997|ref|NP_032143.1|     gi|37544107|ref|NT_010783.14|Hs17_10940 47.500  40      20      1       58      96      20610519        20610400        1.44e-05        45.1
gi|6679997|ref|NP_032143.1|     gi|37544107|ref|NT_010783.14|Hs17_10940 52.500  40      19      0       22      61      20602657        20602538        1.20e-05        45.1
gi|6679997|ref|NP_032143.1|     gi|37544107|ref|NT_010783.14|Hs17_10940 52.500  40      19      0       22      61      20625553        20625434        1.44e-05        45.1
gi|6679997|ref|NP_032143.1|     gi|37544107|ref|NT_010783.14|Hs17_10940 52.500  40      19      0       22      61      20640242        20640123        3.08e-05        43.9
gi|6679997|ref|NP_032143.1|     gi|37544107|ref|NT_010783.14|Hs17_10940 55.000  40      17      1       58      96      20647507        20647388        6.43e-08        52.0
gi|6679997|ref|NP_032143.1|     gi|37544107|ref|NT_010783.14|Hs17_10940 56.897  58      23      2       108     163     20610274        20610101        4.94e-11        61.2
gi|6679997|ref|NP_032143.1|     gi|37544107|ref|NT_010783.14|Hs17_10940 60.976  41      16      0       22      62      20610837        20610715        5.15e-10        58.2
gi|6679997|ref|NP_032143.1|     gi|37544107|ref|NT_010783.14|Hs17_10940 63.235  68      24      1       149     216     20609895        20609695        5.39e-23        96.3
gi|6679997|ref|NP_032143.1|     gi|37544107|ref|NT_010783.14|Hs17_10940 63.235  68      24      1       149     216     20639285        20639085        4.97e-21        90.5
gi|6679997|ref|NP_032143.1|     gi|37544107|ref|NT_010783.14|Hs17_10940 65.000  40      14      0       22      61      20647824        20647705        7.01e-10        57.8
gi|6679997|ref|NP_032143.1|     gi|37544107|ref|NT_010783.14|Hs17_10940 66.176  68      22      1       149     216     20601700        20601500        4.58e-23        96.3
gi|6679997|ref|NP_032143.1|     gi|37544107|ref|NT_010783.14|Hs17_10940 66.176  68      22      1       149     216     20624596        20624396        4.58e-23        96.3
gi|6679997|ref|NP_032143.1|     gi|37544107|ref|NT_010783.14|Hs17_10940 67.647  68      21      1       149     216     20646883        20646683        5.12e-25        102
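
Before passing a hit file to the compartment tool, it can help to check that every row has 12 columns and that the rows are ordered by subject and then by query. The following is a minimal Python sketch of such a check, not part of ProSplign. The field names follow the default column order of BLAST tabular output (-outfmt 6); the names BlastHit, parse_hits and is_sorted_by_subject_then_query are hypothetical. Note that the ordering test compares strings by code point, which may differ from the locale-dependent collation used by sort.

```python
from typing import List, NamedTuple


class BlastHit(NamedTuple):
    """One row of BLAST tabular output (-outfmt 6, default 12 columns)."""
    qseqid: str
    sseqid: str
    pident: float
    length: int
    mismatch: int
    gapopen: int
    qstart: int
    qend: int
    sstart: int
    send: int
    evalue: float
    bitscore: float


def parse_hits(text: str) -> List[BlastHit]:
    """Parse whitespace/tab separated hit lines, skipping blank lines."""
    hits = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        fields = line.split()
        if not fields:
            continue
        if len(fields) != 12:
            raise ValueError(
                f"line {line_no}: expected 12 columns, got {len(fields)}")
        hits.append(BlastHit(
            fields[0], fields[1], float(fields[2]),
            *(int(x) for x in fields[3:10]),   # length .. send (7 integers)
            float(fields[10]), float(fields[11]),
        ))
    return hits


def is_sorted_by_subject_then_query(hits: List[BlastHit]) -> bool:
    """True if rows are in non-decreasing (subject, query) order."""
    keys = [(h.sseqid, h.qseqid) for h in hits]
    return all(a <= b for a, b in zip(keys, keys[1:]))


# Two rows copied from the example output above.
SAMPLE = (
    "gi|6679997|ref|NP_032143.1|\tgi|37544107|ref|NT_010783.14|Hs17_10940\t"
    "35.000\t140\t57\t5\t58\t163\t20639910\t20639491\t2.87e-11\t62.0\n"
    "gi|6679997|ref|NP_032143.1|\tgi|37544107|ref|NT_010783.14|Hs17_10940\t"
    "42.400\t125\t39\t3\t58\t149\t20602325\t20601951\t1.35e-15\t74.7\n"
)

hits = parse_hits(SAMPLE)
print(len(hits), is_sorted_by_subject_then_query(hits))
```

In tblastn output a subject start greater than the subject end indicates a hit on the minus strand, which is the case for both sample rows.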

Run the compartment tool to find approximate locations of the protein instances on the nucleic acid (cat blast.hit | ./procompart -t > comp). Each line of the output file represents a single instance, or 'compartment'.

1       NT_010783.14    NP_032143.1     20601000        20603157        -       195     210.778
2       NT_010783.14    NP_032143.1     20609195        20611337        -       184     238.625
3       NT_010783.14    NP_032143.1     20623896        20626053        -       195     207.712
4       NT_010783.14    NP_032143.1     20638585        20640742        -       195     183.236
5       NT_010783.14    NP_032143.1     20646183        20648324        -       184     238.046

The tab-separated columns are:

compartment number, genomic id, protein id, compartment from, compartment to, strand, protein coverage, compartment BLAST score

The last two columns are for internal use, ignored by ProSplign.
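
For downstream scripting, the compartment file can be read into simple records. The sketch below is only an illustration of the column layout listed above; the Compartment type and parse_compartments function are hypothetical names, and the last two columns are dropped because the text states that ProSplign ignores them.

```python
from typing import List, NamedTuple


class Compartment(NamedTuple):
    """The first six columns of one compartment line."""
    number: int
    genomic_id: str
    protein_id: str
    start: int      # "compartment from"
    stop: int       # "compartment to"
    strand: str     # "+" or "-"


def parse_compartments(text: str) -> List[Compartment]:
    """Parse compartment lines; any columns after the sixth are ignored."""
    compartments = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        fields = line.split()
        if not fields:
            continue
        if len(fields) < 6:
            raise ValueError(
                f"line {line_no}: expected at least 6 columns, got {len(fields)}")
        compartments.append(Compartment(
            int(fields[0]), fields[1], fields[2],
            int(fields[3]), int(fields[4]), fields[5],
        ))
    return compartments


# First two lines of the example compartment output above.
SAMPLE = (
    "1\tNT_010783.14\tNP_032143.1\t20601000\t20603157\t-\t195\t210.778\n"
    "2\tNT_010783.14\tNP_032143.1\t20609195\t20611337\t-\t184\t238.625\n"
)

for comp in parse_compartments(SAMPLE):
    print(comp.number, comp.genomic_id, comp.start, comp.stop, comp.strand)
```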

Run ProSplign with the compartment file and the fasta files to generate an alignment for each compartment (./prosplign -i comp -fasta subj.fa,query.fa -nogenbank -o pro.asn -eo pro.txt). The .asn file contains alignments in ASN format. The .txt file is designed for human reading.
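
The three batch steps can be chained from a small driver script. The sketch below only assembles the shell commands quoted in this section, using the same options and default file names; the function names batch_commands and run_batch are hypothetical, and nothing is executed unless run_batch is called on a machine where makeblastdb, tblastn, procompart and prosplign are installed.

```python
import shlex
import subprocess
from typing import List


def batch_commands(subject_fa: str = "subj.fa", query_fa: str = "query.fa",
                   hit_file: str = "blast.hit", comp_file: str = "comp",
                   asn_out: str = "pro.asn", text_out: str = "pro.txt") -> List[str]:
    """Return the shell commands for the batch steps, in execution order."""
    q = shlex.quote  # protects file names containing spaces or shell characters
    return [
        # Step 1: build the nucleotide database and compute sorted hits.
        f"makeblastdb -dbtype nucl -in {q(subject_fa)}",
        f"tblastn -query {q(query_fa)} -db {q(subject_fa)} -outfmt 6"
        f" | sort -k 2,2 -k 1,1 > {q(hit_file)}",
        # Step 2: find compartments.
        f"cat {q(hit_file)} | ./procompart -t > {q(comp_file)}",
        # Step 3: align each compartment.
        f"./prosplign -i {q(comp_file)} -fasta {q(subject_fa)},{q(query_fa)}"
        f" -nogenbank -o {q(asn_out)} -eo {q(text_out)}",
    ]


def run_batch(commands: List[str]) -> None:
    """Run each command through the shell, stopping at the first failure."""
    for command in commands:
        subprocess.run(command, shell=True, check=True)


for command in batch_commands():
    print(command)
```

The commands are kept as shell strings rather than argument lists because two of them rely on pipes and output redirection.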

When ProSplign is run without the '-full' option, the output file shows 'partial' alignments. A partial alignment is made from the full global alignment by discarding low-identity portions of the alignment and keeping conserved portions. The conserved portions are marked in the 'pro.txt' text file with stars in the status line. Introns are marked with dots in the protein line. For example, the following fragment

1	NT_010783.14	NP_032143.1	20601000	20603157	-
20602957    CCTTTGGGCACAACGTGTCCTGAGGGGAGAGGCAGCGCCCTGTAGATGGGACGGGGGCACTAACCCTCAGGTTTGGGGCTTATGAATGTGAGTATCGCCA   20602858
                                                                                                                
                                                                                                                
            ------------------ M  A  T  D ----------------------------------------------------------------------
                                                                                                                
20602857    TCTAAGGCCAGATATTTGGCCAATCTCTGAATGTTCCTGGTCTCTGGAGGGATGGAGAGAGAGAAAAAAACAAACAGCTCCTGGAGCAGGGAGAGCGCTG   20602758
                                                                                                                
                                                                                                                
            ----------------------------------------------------------------------------------------------------
                                                                                                                
20602757    GCCTCTTCCTCTCCGGCTCCCTCCATTGCCCTCCGGTTTCTCCCCAGGCTCCCGGACGTCCCTGCTCCTGGCTTTTGCCCTGCTCTGCCTGCCCTGGCTT   20602658
                                                              S  R  T  S  L  L  L  A  F  A  L  L  C  L  P  W  L 
                                                              |  |  |  |     |  |        +  |  |  |  |     |    
            ------------------------------------------------- S  R  T  S  W  L  L  T  V  S  L  L  C  L  L  W  P 
                                                             ***************************************************
20602657    CAAGAGGCTGGTGCCGTCCAAACCGTTCCGTTATCCAGGCTTTTTGACCACGCTATGCTCCAAGCCCATCGCGCGCACCAGCTGGCCATTGACACCTACC   20602558
             Q  E  A  G  A  V  Q  T  V  P  L  S  R  L  F  D  H  A  M  L  Q  A  H  R  A  H  Q  L  A  I  D  T  Y  
             |  |  |     |           +  |  |  |     |  |     +  |  +  |  +  |           |  |  |  |     |  |  |  
             Q  E  A  S  A  F  P  A  M  P  L  S  S  L  F  S  N  A  V  L  R  A  Q  H  L  H  Q  L  A  A  D  T  Y  
            ****************************************************************************************************
20602557    AGGAGTTTGTAAGTTCTTGGGGAATGGGTGCGGGTCAGGGGTGGCAAGAAGGGGTGACTTTCCCCCACTGGGGAAGTAATGGGAGGAGACTAAGGAGCTC   20602458
            Q  E  F                                                                                             
            +  |  |                                                                                             
            K  E  F ............................................................................................
            ****************************************************************************************************
20602457    AGGGTTGTTTTCTGAAGCGAAAATGCAGGCAGATGAGCATAGGCTGAGCCAGGTTCCCAGAAAAGCAACAATGGGAGCTGGTCTCCAGCATAGAAACCAG   20602358
                                                                                                                
                                                                                                                
            ....................................................................................................
            ****************************************************************************************************
20602357    CAGTCCTTCTTGGTGGGGGGTCCTTCTCCTAGGAAGAAACCTATATCCCAAAGGACCAGAAGTATTCATTCCTGCATGACTCCCAGACCTCCTTCTGCTT   20602258
                                             E  E  T  Y  I  P  K  D  Q  K  Y  S  F  L  H  D  S  Q  T  S  F  C  F
                                             |        |  |  |  +     |  +  |  |     +     +  +  |     +  |  |  |
            ................................ E  R  A  Y  I  P  E  G  Q  R  Y  S --- I  Q  N  A  Q  A  A  F  C  F
            ****************************************************************************************************
            ...

means that the first four amino acids (MATD) were not aligned. The alignment starts with SRTS... on the protein. The first exon ends at KEF. The second exon starts with ERA... on the protein. The intron, which has a GT/AG splice, is marked with dots.
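
The markers described above (dots for introns in the protein line, stars for conserved portions in the status line) are easy to locate programmatically. The helper below is a small sketch, not a parser for the full text format: char_runs is a hypothetical name, and it assumes you have already extracted a single protein or status line as a string. An intron that continues across several text blocks would need those lines concatenated first.

```python
import re
from typing import List, Tuple


def char_runs(line: str, marker: str) -> List[Tuple[int, int]]:
    """Return (start, end) column ranges of consecutive `marker` characters.

    `end` is exclusive, as in Python slicing.
    """
    pattern = re.escape(marker) + "+"
    return [(m.start(), m.end()) for m in re.finditer(pattern, line)]


# Shortened, made-up lines in the style of the fragment above.
protein_line = "K  E  F ....."
status_line = "  ***  **"

print(char_runs(protein_line, "."))   # intron columns in the protein line
print(char_runs(status_line, "*"))    # conserved columns in the status line
```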

Algorithmic details

ProSplign works with input sequences on a pairwise basis. In other words, exon/intron structures are determined independently for each query and subject.

Dynamic programming alone is accurate in determining splice junctions but is computationally expensive. Also, if copies of a gene share the same genomic sequence and strand, applying it directly may produce incorrect results by connecting exons from different copies.

Thus, for every input query/subject pair, it is important to localize genes on the genomic sequence. ProSplign does this with an algorithm that groups the BLAST hits into compartments. The compartmentization step starts with computing protein-to-genomic BLAST hits, which give initial insight into the structure of the compartments. Hits are separated into two same-strand sets, and compartments are then identified within each strand. To do so, we formally define the optimization problem in terms of genomic sequence coverage and then solve it with a dynamic programming algorithm whose running time is short compared to the core dynamic programming described above.
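
To make the idea of compartments concrete, here is a deliberately simplified Python sketch. Only the first step, separating hits by strand, follows the description above. The grouping step is a naive stand-in that merges hits lying within a hypothetical max_gap of each other; it is not the coverage-based dynamic programming used by the ProSplign compartment tool and it ignores protein coordinates entirely. The function names and the max_gap value are assumptions made for illustration.

```python
from typing import Dict, List, Tuple

Interval = Tuple[int, int]


def split_by_strand(hits: List[Interval]) -> Dict[str, List[Interval]]:
    """Split (sstart, send) pairs by strand, normalised to (low, high).

    In tblastn output, sstart > send indicates a minus-strand hit.
    """
    strands: Dict[str, List[Interval]] = {"+": [], "-": []}
    for sstart, send in hits:
        if sstart <= send:
            strands["+"].append((sstart, send))
        else:
            strands["-"].append((send, sstart))
    return strands


def group_by_gap(intervals: List[Interval], max_gap: int) -> List[Interval]:
    """Naive grouping: merge intervals separated by at most max_gap bases."""
    groups: List[Interval] = []
    for low, high in sorted(intervals):
        if groups and low - groups[-1][1] <= max_gap:
            groups[-1] = (groups[-1][0], max(groups[-1][1], high))
        else:
            groups.append((low, high))
    return groups


# Subject coordinates of four hits from the example BLAST output.
hits = [(20602325, 20601951), (20601700, 20601500),
        (20610837, 20610715), (20609895, 20609695)]

by_strand = split_by_strand(hits)
print(group_by_gap(by_strand["-"], max_gap=2000))
```

With this toy threshold the four hits fall into two groups on the minus strand, which lie inside the first two compartments reported by procompart in the example.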

Frequently Asked Questions

Q: Why am I getting "Unable to locate XXX" exceptions?
A: Please make sure that sequence identifiers in the input hit file match those in the index file. When indexing your fasta files, ProSplign records sequence IDs exactly as they appear after the leading '>' while your blast program could have printed them slightly differently.
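
A quick way to diagnose this is to compare the identifiers in the hit file against those in the fasta files. The sketch below is an illustration under a stated assumption: it treats the first whitespace-delimited token after '>' as the sequence ID, which may not match how ProSplign indexes IDs in every case. The function names are hypothetical.

```python
from typing import Set


def fasta_ids(fasta_text: str) -> Set[str]:
    """Collect IDs from fasta headers (ASSUMPTION: first token after '>')."""
    ids = set()
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            tokens = line[1:].split()
            if tokens:
                ids.add(tokens[0])
    return ids


def ids_missing_from_fasta(hit_text: str, known_ids: Set[str], column: int) -> Set[str]:
    """Return IDs from one hit-file column that are absent from known_ids.

    Column 0 holds the query ID and column 1 the subject ID.
    """
    missing = set()
    for line in hit_text.splitlines():
        fields = line.split()
        if len(fields) > column and fields[column] not in known_ids:
            missing.add(fields[column])
    return missing


# Made-up data: the second hit uses a shortened form of the query ID.
FASTA = ">gi|1|ref|NP_1.1| example protein\nMATD\n"
HITS = (
    "gi|1|ref|NP_1.1|\tchr1\t35.0\t140\t57\t5\t58\t163\t900\t481\t1e-5\t62.0\n"
    "NP_1.1\tchr1\t42.4\t125\t39\t3\t58\t149\t400\t26\t1e-9\t74.7\n"
)

print(ids_missing_from_fasta(HITS, fasta_ids(FASTA), column=0))
```

Any ID reported here is a candidate cause of the exception.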

Q: What does the 'No compartment found' log file message mean? What is a compartment?
A: A compartment is a localized interval on the genomic sequence that provides bounds for ProSplign in its search for exons. Compartments are identified based on the input BLAST hits, so this message is generated when there are not enough hits, or when the hits are too weak or too inconsistent with each other to form a compartment.