BLAT Program Specifications

General:

BLAT produces two major classes of alignments: DNA-level alignments between two sequences of 95% or greater identity, which may include large inserts, and protein or translated-DNA alignments between sequences of 80% or greater identity, which may also include large inserts.  The output of BLAT is flexible.  By default it is a simple tab-delimited file which describes the alignment but does not include the sequence of the alignment itself.  Optionally it can produce BLAST and WU-BLAST compatible output as well as a number of other formats.

 

There are three main programs in the BLAT suite:  a stand-alone program called ‘blat’, a server called ‘gfServer’ which maintains an index of a genome in memory, and a client called ‘gfClient’ which can query the index over the network.  Since it takes some time (10 to 25 minutes) to index an entire genome, the gfServer/gfClient model is best suited for situations where interactive users wish to quickly locate a few sequences in the genome.  The genome index does take memory: close to a gigabyte for a nucleotide-based index and over two gigabytes for a translated protein index.  The stand-alone program is best suited for large batch alignments, which can be spread across many machines.  There are two additional programs: faToNib, which converts a .fa nucleotide file to a denser format suitable for random access, and nibFrag, which converts all or a portion of a .nib file back to .fa format.

Command Line:

The command line options of each of the programs are described below. Similar summaries of usage are printed when a command is run with no arguments.

 

blat

 

blat - Standalone BLAT sequence search command line tool

usage:

   blat database query [-ooc=11.ooc] output.psl

where:

   database is either a .fa file, a .nib file, or a list of .fa or .nib

   files, query is similarly a .fa, .nib, or list of .fa or .nib files

   -ooc=11.ooc tells the program to load over-occurring 11-mers from

               an external file.  This will increase the speed

               by a factor of 40 in many cases, but is not required

   output.psl is where to put the output.

options:

   -t=type     Database type.  Type is one of:

                 dna - DNA sequence

                 prot - protein sequence

                 dnax - DNA sequence translated in six frames to protein

               The default is dna

   -q=type     Query type.  Type is one of:

                 dna - DNA sequence

                 rna - RNA sequence

                 prot - protein sequence

                 dnax - DNA sequence translated in six frames to protein

                 rnax - DNA sequence translated in three frames to protein

               The default is dna

   -prot       Synonymous with -t=prot -q=prot

   -ooc=N.ooc  Use overused tile file N.ooc.  N should correspond to

               the tileSize

   -tileSize=N sets the size of match that triggers an alignment.

               Usually between 8 and 12

               Default is 11 for DNA and 5 for protein.

   -oneOff=N   If set to 1 this allows one mismatch in a tile and still

               triggers an alignment.  Default is 0.


   -minMatch=N sets the number of tile matches.  Usually set from 2 to 4

               Default is 2 for nucleotide, 1 for protein.

   -minScore=N sets minimum score.  This is twice the matches minus the

               mismatches minus some sort of gap penalty.  Default is 30

   -minIdentity=N Sets minimum sequence identity (in percent).  Default is

               90 for nucleotide searches, 25 for protein or translated

               protein searches.

   -maxGap=N   sets the size of maximum gap between tiles in a clump.  Usually

               set from 0 to 3.  Default is 2. Only relevant for minMatch > 1.

   -noHead     suppress .psl header (so it's just a tab-separated file)

   -makeOoc=N.ooc Make overused tile file

   -repMatch=N sets the number of repetitions of a tile allowed before

               it is marked as overused.  Typically this is 256 for tileSize

               12, 1024 for tile size 11, 4096 for tile size 10.

               Default is 1024.  Typically only comes into play with makeOoc

   -mask=type  Mask out repeats.  Alignments won't be started in masked region

               but may extend through it in nucleotide searches.  Masked areas

               are ignored entirely in protein or translated searches. Types are

                 lower - mask out lower cased sequence

                 upper - mask out upper cased sequence

                 out   - mask according to database.out RepeatMasker .out file

                 file.out - mask database according to RepeatMasker file.out

   -qMask=type Mask out repeats in query sequence.  Similar to -mask above but

               for query rather than target sequence.

   -minRepDivergence=NN - minimum percent divergence of repeats to allow

               them to be unmasked.  Default is 15.  Only relevant for

               masking using RepeatMasker .out files.

   -dots=N     Output dot every N sequences to show program's progress

   -trimT      Trim leading poly-T

   -noTrimA    Don't trim trailing poly-A

   -trimHardA  Remove poly-A tail from qSize as well as alignments in psl output

   -out=type   Controls output file format.  Type is one of:

                   psl - Default.  Tab separated format without actual sequence

                   pslx - Tab separated format with sequence

                   axt - blastz-associated axt format

                   maf - multiz-associated maf format

                   wublast - similar to wublast format

                   blast - similar to NCBI blast format

   -fine       For high quality mRNAs look harder for small initial and

               terminal exons.  Not recommended for ESTs
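
The typical -repMatch values quoted above (256, 1024 and 4096 for tile sizes 12, 11 and 10) differ by a factor of 4 for each base of tile size. The sketch below reproduces that pattern. The function name is hypothetical, and applying the pattern to tile sizes outside 10 to 12 is an assumption rather than something this document states.

```python
def typical_rep_match(tile_size):
    """Return a -repMatch value following the pattern quoted for tile sizes 10-12.

    Assumption: the factor-of-4 step per base continues outside the
    10-12 range that the usage text actually lists.
    """
    if not 1 <= tile_size <= 16:
        raise ValueError("tile_size must be between 1 and 16")
    # 4**4 = 256 at tileSize 12, 4**5 = 1024 at 11, 4**6 = 4096 at 10.
    return 4 ** (16 - tile_size)
```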

 

Here are some blat settings for common usage scenarios:

 

1) Mapping ESTs to the genome within the same species

    -ooc=11.ooc

2) Mapping full length mRNAs to the genome in the same species

    -ooc=11.ooc -fine -q=rna

3) Mapping ESTs to the genome across species

    -q=dnax -t=dnax

4) Mapping mRNA to the genome across species

    -q=rnax -t=dnax

5) Mapping proteins to the genome

    -q=prot -t=dnax

6) Mapping DNA to DNA in the same species

    -ooc=11.ooc -fastMap

7) Mapping DNA from one species to another species

    -q=dnax -t=dnax

    When mapping DNA from one species to another the

    query side of the alignment should be cut up into chunks

    of 25kb or less for best performance.
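
The scenarios above can be captured in a small lookup table, for example when a batch script needs to build blat command lines. This is only a sketch: the scenario keys and the function name are made up for illustration, while the option strings are copied from the list above. The argument order follows the usage line "blat database query [-ooc=11.ooc] output.psl".

```python
# Option sets copied from the numbered scenarios; the keys are hypothetical names.
BLAT_SCENARIOS = {
    "est_same_species": ["-ooc=11.ooc"],
    "mrna_same_species": ["-ooc=11.ooc", "-fine", "-q=rna"],
    "est_cross_species": ["-q=dnax", "-t=dnax"],
    "mrna_cross_species": ["-q=rnax", "-t=dnax"],
    "protein_to_genome": ["-q=prot", "-t=dnax"],
    "dna_same_species": ["-ooc=11.ooc", "-fastMap"],
    "dna_cross_species": ["-q=dnax", "-t=dnax"],
}


def blat_command(scenario, database, query, output):
    """Build an argument list of the form: blat database query [options] output."""
    options = BLAT_SCENARIOS[scenario]
    return ["blat", database, query, *options, output]
```

The returned list can be passed to subprocess.run(..., check=True) on a machine where blat is installed.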

 


gfServer

gfServer - Make a server to quickly find where DNA occurs in genome.

To set up a server:

   gfServer start host port file(s).nib

To remove a server:

   gfServer stop host port

To query a server with DNA sequence:

   gfServer query host port probe.fa

To query a server with protein sequence:

   gfServer protQuery host port probe.fa

To query a server with translated dna sequence:

   gfServer transQuery host port probe.fa

To process one probe fa file against a .nib format genome (not starting server):

   gfServer direct probe.fa file(s).nib

To figure out usage level:

   gfServer status host port

To get input file list:

   gfServer files host port

Options:

   -tileSize=N size of n-mers to index.  Default is 11 for nucleotides, 4 for

               proteins (or translated nucleotides).

   -minMatch=N Number of n-mer matches that trigger detailed alignment

               Default is 2 for nucleotides, 3 for proteins.

   -maxGap=N   Number of insertions or deletions allowed between n-mers.

               Default is 2 for nucleotides, 0 for proteins.

   -trans      Translate database to protein in 6 frames.  Note: it is best

               to run this on RepeatMasked data in this case.

   -log=logFile keep a log file that records server requests.

   -seqLog    Include sequences in log file

 

gfClient

gfClient - A client for the genomic finding program

usage:

   gfClient host port nibDir in.fa out.psl

where

   host is the name of the machine running the gfServer

   port is the same as you started the gfServer with

   nibDir is the path of the nib files relative to the current dir

       (note these are needed by the client as well as the server)

   in.fa a fasta format file.  May contain multiple records

   out.psl where to put the output

options:

   -t=type     Database type.  Type is one of:

                 dna - DNA sequence

                 prot - protein sequence

                 dnax - DNA sequence translated in six frames to protein

               The default is dna

   -q=type     Query type.  Type is one of:

                 dna - DNA sequence

                 rna - RNA sequence

                 prot - protein sequence

                 dnax - DNA sequence translated in six frames to protein

                 rnax - DNA sequence translated in three frames to protein

   -dots=N   Output a dot every N query sequences

   -nohead   Suppresses psl five line header

   -out=type   Controls output file format.  Type is one of:

                   psl - Default.  Tab separated format without actual sequence

                   pslx - Tab separated format with sequence

                   axt - blastz-associated axt format

                   maf - multiz-associated maf format

                   wublast - similar to wublast format

                   blast - similar to NCBI blast format

 

faToNib

faToNib - Convert from .fa to .nib format

usage:

   faToNib in.fa out.nib

 

nibFrag

nibFrag - Extract part of a nib file as .fa

usage:

   nibFrag file.nib start end strand out.fa

 

pslPretty

pslPretty - Convert PSL to human readable output

usage:

   pslPretty in.psl target.lst query.lst pretty.out

options:

   -axt - save in Scott Schwartz's axt format

   -dot=N Put out a dot every N records

   -long - Don't abbreviate long inserts

 

It is best to sort the psl file by target if it contains multiple targets; otherwise pslPretty will be very slow.  The target and query lists can each be a fasta file, a nib file, or a list of fasta and/or nib files, one per line.  Currently this only handles nucleotide-based psl files.

 

File Formats

 

.nib files

A .nib file describes a DNA sequence, packing two bases into each byte.  A nib file begins with a 32-bit signature, which is 0x6BE93D3A in the byte order of the machine that created the file and may appear as a byte-swapped version of the same number when read on a machine with a different architecture.  This is followed by a 32-bit number in the same format which gives the number of bases in the file.  This is followed by the bases themselves, packed two bases to the byte.  The first base is packed in the high-order 4 bits, the second base in the low-order 4 bits.  In C code:

byte = (base1<<4) + base2

The numerical values for the bases are:

0 – T,  1 – C,  2 – A,  3 – G,  4 – N (unknown), 5-15 – unused
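
As an illustration of the layout just described, here is a minimal Python sketch that writes and reads the signature, the base count and the packed bases. It covers only what this section states. Mapping unrecognized characters and unused codes to N, and padding an odd-length sequence with a zero nibble, are assumptions made for the sketch, and the function names are hypothetical.

```python
import struct

NIB_SIGNATURE = 0x6BE93D3A
BASE_TO_CODE = {"T": 0, "C": 1, "A": 2, "G": 3, "N": 4}
CODE_TO_BASE = {code: base for base, code in BASE_TO_CODE.items()}


def encode_nib(sequence):
    """Return .nib bytes for `sequence`, using this machine's byte order."""
    # Assumption: any character other than T, C, A, G is stored as N (4).
    codes = [BASE_TO_CODE.get(base, 4) for base in sequence.upper()]
    header = struct.pack("=II", NIB_SIGNATURE, len(codes))
    if len(codes) % 2:
        codes.append(0)  # assumption: pad the unused final low nibble with 0
    packed = bytes((codes[i] << 4) + codes[i + 1]
                   for i in range(0, len(codes), 2))
    return header + packed


def decode_nib(data):
    """Return the base string stored in .nib bytes, in either byte order."""
    for order in ("<", ">"):
        signature, base_count = struct.unpack(order + "II", data[:8])
        if signature == NIB_SIGNATURE:
            break
    else:
        raise ValueError("not a .nib file: bad signature")
    bases = []
    for byte in data[8:]:
        # High nibble holds the first base, low nibble the second.
        bases.append(CODE_TO_BASE.get(byte >> 4, "N"))
        bases.append(CODE_TO_BASE.get(byte & 0x0F, "N"))
    return "".join(bases[:base_count])
```

Trying both byte orders on the signature is one simple way to cope with a file written on a machine of the opposite byte order.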

 

.psl files

A .psl file describes a series of alignments in a dense, easily parsed text format.  It begins with a five-line header which describes each field.  Following this is one line for each alignment, with a tab between each field.  The fields are described below in a format suitable for many relational databases.

    matches int unsigned ,       # Number of bases that match that aren't repeats

    misMatches int unsigned ,    # Number of bases that don't match

    repMatches int unsigned ,     # Number of bases that match but are part of repeats

    nCount int unsigned ,           # Number of 'N' bases

    qNumInsert int unsigned ,     # Number of inserts in query

    qBaseInsert int unsigned ,     # Number of bases inserted in query

    tNumInsert int unsigned ,      # Number of inserts in target

    tBaseInsert int unsigned ,      # Number of bases inserted in target

    strand char(2) ,                # + or - for query strand, optionally followed by + or - for target strand

    qName varchar(255) ,           # Query sequence name

    qSize int unsigned ,            # Query sequence size

    qStart int unsigned ,         # Alignment start position in query

    qEnd int unsigned ,             # Alignment end position in query

    tName varchar(255) ,           # Target sequence name

    tSize int unsigned ,            # Target sequence size

    tStart int unsigned ,           # Alignment start position in target

    tEnd int unsigned ,             # Alignment end position in target

    blockCount int unsigned ,    # Number of blocks in alignment

    blockSizes longblob ,        # Size of each block in a comma separated list

    qStarts longblob ,      # Start of each block in query in a comma separated list

    tStarts longblob ,      # Start of each block in target in a comma separated list

Currently the program does not distinguish between matches and repMatches.  repMatches is always zero.
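
Because each alignment is a single tab-separated line with the 21 fields listed above, parsing is straightforward. The following is a rough sketch, assuming plain psl output (not pslx) and that the header lines have already been skipped or suppressed with -noHead. The function and constant names are hypothetical.

```python
PSL_FIELDS = [
    "matches", "misMatches", "repMatches", "nCount",
    "qNumInsert", "qBaseInsert", "tNumInsert", "tBaseInsert",
    "strand", "qName", "qSize", "qStart", "qEnd",
    "tName", "tSize", "tStart", "tEnd",
    "blockCount", "blockSizes", "qStarts", "tStarts",
]
TEXT_FIELDS = {"strand", "qName", "tName"}
LIST_FIELDS = {"blockSizes", "qStarts", "tStarts"}


def parse_psl_line(line):
    """Parse one psl alignment line into a dict keyed by field name."""
    values = line.rstrip("\n").split("\t")
    if len(values) != len(PSL_FIELDS):
        raise ValueError("expected %d fields, got %d"
                         % (len(PSL_FIELDS), len(values)))
    record = {}
    for name, value in zip(PSL_FIELDS, values):
        if name in TEXT_FIELDS:
            record[name] = value
        elif name in LIST_FIELDS:
            # Tolerate a trailing comma at the end of the list.
            record[name] = [int(item) for item in value.split(",") if item]
        else:
            record[name] = int(value)
    return record
```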

There is a little gotcha in the .psl format. It has to do with how coordinates are handled on the negative strand. In the qStart/qEnd fields the coordinates are where it matches from the point of view of the forward strand (even when the match is on the reverse strand). However on the qStarts[] list, the coordinates are reversed.

 

Here's an example of a 31-mer (query positions 0 through 30) that has 2 blocks that align on the minus strand and 2 blocks on the plus strand (this sort of thing sometimes happens in real life in response to assembly errors).

0         1         2         3 tens position in query
0123456789012345678901234567890 ones position in query
            ++++          +++++ plus strand alignment on query
    --------    ----------      minus strand alignment on query

Plus strand:
     qStart 12 qEnd 31 blockSizes 4,5 qStarts 12,26
Minus strand:
     qStart 4 qEnd 26 blockSizes 10,8 qStarts 5,19

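Essentially the minus strand blockSizes and qStarts are what you would get if you reverse complemented the query. However, the qStart and qEnd are not reversed. To get from one to the other, where revQStart and revQEnd are the start and end of the alignment in reverse-complemented query coordinates: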
     qStart = qSize - revQEnd
     qEnd = qSize - revQStart
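
The conversion can be sketched in Python as follows. The function names are hypothetical. The sketch assumes half-open coordinates (start inclusive, end exclusive), which is consistent with the example above, and a qSize of 31 as implied by the position ruler.

```python
def minus_strand_blocks_to_forward(q_size, block_sizes, q_starts):
    """Convert reverse-complement blocks to forward-strand (start, end) pairs."""
    forward = []
    for size, rev_start in zip(block_sizes, q_starts):
        rev_end = rev_start + size
        # Same rule as above, applied to one block at a time.
        forward.append((q_size - rev_end, q_size - rev_start))
    return sorted(forward)


def forward_range(q_size, block_sizes, q_starts):
    """Return (qStart, qEnd) on the forward strand for a minus-strand alignment."""
    rev_q_start = q_starts[0]
    rev_q_end = q_starts[-1] + block_sizes[-1]
    return q_size - rev_q_end, q_size - rev_q_start
```

With the minus strand values from the example (blockSizes 10,8 and qStarts 5,19), forward_range(31, [10, 8], [5, 19]) gives (4, 26), and the blocks map to forward positions 4 to 12 and 16 to 26, matching the diagram.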

Limits

The gfServer program requires approximately 1 byte for every 3 bases in the genome it is indexing in DNA mode, and 1.5 bytes for each unmasked base in translated mode. The blat program requires approximately two bytes for each base in the genome in DNA mode, and three bytes for each base in translated mode. The other programs use relatively little memory.
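
These per-base figures translate into a quick back-of-the-envelope estimate. The sketch below simply multiplies them out. The function name is hypothetical, and the results are only as approximate as the figures themselves.

```python
def estimate_memory_bytes(program, bases, translated=False):
    """Approximate memory use from the per-base figures in the Limits section.

    For gfServer in translated mode, `bases` should count unmasked bases only.
    """
    bytes_per_base = {
        ("gfServer", False): 1 / 3,  # 1 byte for every 3 bases
        ("gfServer", True): 1.5,
        ("blat", False): 2.0,
        ("blat", True): 3.0,
    }
    return int(round(bases * bytes_per_base[(program, translated)]))
```

For instance, a hypothetical genome of 3 billion bases comes to about 1 gigabyte for gfServer in DNA mode, which is in line with the "close to a gigabyte" figure given for a nucleotide index earlier in this document.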