BLAT produces two major classes of alignments: at the DNA level between two sequences that are of 95% or greater identity, but which may include large inserts, and at the protein or translated DNA level between sequences that are of 80% or greater identity and may also include large inserts. The output of BLAT is flexible. By default it is a simple tab-delimited file which describes the alignment but does not include the sequence of the alignment itself. Optionally it can produce BLAST and WU-BLAST compatible output as well as a number of other formats.
There are three main programs in the BLAT suite: a stand-alone program called ‘blat’, a server called ‘gfServer’ which maintains an index of a genome in memory, and a client called ‘gfClient’ which can query the index over the network. Since it takes some time (10 to 25 minutes) to index an entire genome, the gfServer/gfClient model is best suited for situations where interactive users wish to quickly locate a few sequences in the genome. The genome index does take memory: close to a gigabyte for a nucleotide based index and over two gigabytes for a translated protein index. The stand-alone program is most suited for large batch alignments, which can be spread across many machines. There are two additional programs: faToNib, which converts a .fa nucleotide file to a denser format which is suitable for random access, and nibFrag, which can convert all or a portion of a .nib file back to .fa format.
The command line options of each program are described below. A similar usage summary is printed when a command is run with no arguments.
blat
blat - Standalone BLAT sequence search command line tool
usage:
   blat database query [-ooc=11.ooc] output.psl
where:
   database is either a .fa file, a .nib file, or a list of .fa or .nib files,
   query is similarly a .fa, .nib, or list of .fa or .nib files,
   -ooc=11.ooc tells the program to load over-occurring 11-mers from
               an external file. This will increase the speed by a
               factor of 40 in many cases, but is not required,
   output.psl is where to put the output.
options:
   -t=type        Database type. Type is one of:
                    dna - DNA sequence
                    prot - protein sequence
                    dnax - DNA sequence translated in six frames to protein
                  The default is dna
   -q=type        Query type. Type is one of:
                    dna - DNA sequence
                    rna - RNA sequence
                    prot - protein sequence
                    dnax - DNA sequence translated in six frames to protein
                    rnax - DNA sequence translated in three frames to protein
                  The default is dna
   -prot          Synonymous with -t=prot -q=prot
   -ooc=N.ooc     Use overused tile file N.ooc. N should correspond to
                  the tileSize
   -tileSize=N    Sets the size of match that triggers an alignment.
                  Usually between 8 and 12. Default is 11 for DNA and 5
                  for protein.
   -oneOff=N      If set to 1 this allows one mismatch in a tile and still
                  triggers an alignment. Default is 0.
   -minMatch=N    Sets the number of tile matches. Usually set from 2 to 4.
                  Default is 2 for nucleotide, 1 for protein.
   -minScore=N    Sets minimum score. This is twice the matches minus the
                  mismatches minus some sort of gap penalty. Default is 30
   -minIdentity=N Sets minimum sequence identity (in percent). Default is
                  90 for nucleotide searches, 25 for protein or translated
                  protein searches.
   -maxGap=N      Sets the size of maximum gap between tiles in a clump.
                  Usually set from 0 to 3. Default is 2. Only relevant for
                  minMatch > 1.
   -noHead        Suppress .psl header (so it's just a tab-separated file)
   -makeOoc=N.ooc Make overused tile file
   -repMatch=N    Sets the number of repetitions of a tile allowed before
                  it is marked as overused. Typically this is 256 for
                  tileSize 12, 1024 for tileSize 11, 4096 for tileSize 10.
                  Default is 1024. Typically only comes into play with
                  makeOoc
   -mask=type     Mask out repeats. Alignments won't be started in masked
                  region but may extend through it in nucleotide searches.
                  Masked areas are ignored entirely in protein or
                  translated searches. Types are
                    lower - mask out lower cased sequence
                    upper - mask out upper cased sequence
                    out - mask according to database.out RepeatMasker .out file
                    file.out - mask database according to RepeatMasker file.out
   -qMask=type    Mask out repeats in query sequence. Similar to -mask
                  above but for query rather than target sequence.
   -minRepDivergence=NN
                  Minimum percent divergence of repeats to allow them to
                  be unmasked. Default is 15. Only relevant for masking
                  using RepeatMasker .out files.
   -dots=N        Output dot every N sequences to show program's progress
   -trimT         Trim leading poly-T
   -noTrimA       Don't trim trailing poly-A
   -trimHardA     Remove poly-A tail from qSize as well as alignments in
                  psl output
   -out=type      Controls output file format. Type is one of:
                    psl - Default. Tab separated format without actual sequence
                    pslx - Tab separated format with sequence
                    axt - blastz-associated axt format
                    maf - multiz-associated maf format
                    wublast - similar to wublast format
                    blast - similar to NCBI blast format
   -fine          For high quality mRNAs look harder for small initial and
                  terminal exons. Not recommended for ESTs
Here are some blat settings for common usage scenarios:
1) Mapping ESTs to the genome within the same species
-ooc=11.ooc
2) Mapping full length mRNAs to the genome in the same species
-ooc=11.ooc -fine -q=rna
3) Mapping ESTs to the genome across species
-q=dnax -t=dnax
4) Mapping mRNA to the genome across species
-q=rnax -t=dnax
5) Mapping proteins to the genome
-q=prot -t=dnax
6) Mapping DNA to DNA in the same species
-ooc=11.ooc -fastMap
7) Mapping DNA from one species to another species
-q=dnax -t=dnax
When mapping DNA from one species to another the query side of the alignment should be cut up into chunks of 25kb or less for best performance.
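The 25kb chunking recommended above can be done with a few lines of scripting before blat is run. The following is a minimal sketch in Python and is not part of the BLAT suite. The function names and the name:start-end chunk naming scheme are assumptions made for illustration; the names only serve to map hits back to the original query afterwards.

```python
def chunk_sequence(name, seq, max_len=25000):
    """Split one query sequence into pieces of at most max_len bases.

    Returns a list of (chunk_name, chunk_seq) pairs. Each chunk name records
    the zero-based, half-open range it came from (hypothetical convention).
    """
    if max_len <= 0:
        raise ValueError("max_len must be positive")
    chunks = []
    for start in range(0, len(seq), max_len):
        end = min(start + max_len, len(seq))
        chunks.append((f"{name}:{start}-{end}", seq[start:end]))
    return chunks


def write_fasta(chunks, path, width=60):
    """Write (name, seq) pairs to a .fa file, wrapping sequence lines."""
    with open(path, "w") as out:
        for name, seq in chunks:
            out.write(f">{name}\n")
            for i in range(0, len(seq), width):
                out.write(seq[i:i + width] + "\n")
```

The resulting .fa file can then be passed to blat as the query. Chunks are cut at fixed offsets with no overlap, so an alignment that spans a cut will be reported as two separate hits.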
gfServer
gfServer - Make a server to quickly find where DNA occurs in genome.
To set up a server:
   gfServer start host port file(s).nib
To remove a server:
   gfServer stop host port
To query a server with DNA sequence:
   gfServer query host port probe.fa
To query a server with protein sequence:
   gfServer protQuery host port probe.fa
To query a server with translated DNA sequence:
   gfServer transQuery host port probe.fa
To process one probe fa file against a .nib format genome (not starting server):
   gfServer direct probe.fa file(s).nib
To figure out usage level:
   gfServer status host port
To get input file list:
   gfServer files host port
Options:
   -tileSize=N    Size of n-mers to index. Default is 11 for nucleotides,
                  4 for proteins (or translated nucleotides).
   -minMatch=N    Number of n-mer matches that trigger detailed alignment.
                  Default is 2 for nucleotides, 3 for proteins.
   -maxGap=N      Number of insertions or deletions allowed between n-mers.
                  Default is 2 for nucleotides, 0 for proteins.
   -trans         Translate database to protein in 6 frames. Note: it is
                  best to run this on RepeatMasked data in this case.
   -log=logFile   Keep a log file that records server requests.
   -seqLog        Include sequences in log file
gfClient
gfClient - A client for the genomic finding program
usage:
   gfClient host port nibDir in.fa out.psl
where
   host is the name of the machine running the gfServer
   port is the same as you started the gfServer with
   nibDir is the path of the nib files relative to the current dir
          (note these are needed by the client as well as the server)
   in.fa a fasta format file. May contain multiple records
   out.psl where to put the output
options:
   -t=type        Database type. Type is one of:
                    dna - DNA sequence
                    prot - protein sequence
                    dnax - DNA sequence translated in six frames to protein
                  The default is dna
   -q=type        Query type. Type is one of:
                    dna - DNA sequence
                    rna - RNA sequence
                    prot - protein sequence
                    dnax - DNA sequence translated in six frames to protein
                    rnax - DNA sequence translated in three frames to protein
   -dots=N        Output a dot every N query sequences
   -nohead        Suppresses psl five line header
   -out=type      Controls output file format. Type is one of:
                    psl - Default. Tab separated format without actual sequence
                    pslx - Tab separated format with sequence
                    axt - blastz-associated axt format
                    maf - multiz-associated maf format
                    wublast - similar to wublast format
                    blast - similar to NCBI blast format
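A typical interactive setup starts one gfServer and then sends queries to it with gfClient. The sketch below only assembles the two command lines from the usage summaries above and does not run anything. The helper names are hypothetical, the host, port, and file names in the usage note are placeholders, and placing the options after the positional arguments is an assumption.

```python
def gfserver_start_cmd(host, port, nib_files, trans=False, log_file=None):
    """Build the argument list for 'gfServer start host port file(s).nib'."""
    cmd = ["gfServer", "start", host, str(port)]
    cmd += list(nib_files)
    if trans:
        # Index the database translated to protein in 6 frames.
        cmd.append("-trans")
    if log_file is not None:
        cmd.append(f"-log={log_file}")
    return cmd


def gfclient_cmd(host, port, nib_dir, in_fa, out_psl, t="dna", q="dna", out="psl"):
    """Build the argument list for 'gfClient host port nibDir in.fa out.psl'."""
    return [
        "gfClient", host, str(port), nib_dir, in_fa, out_psl,
        f"-t={t}", f"-q={q}", f"-out={out}",
    ]
```

For example, gfserver_start_cmd("localhost", 17779, ["chr1.nib", "chr2.nib"]) followed by gfclient_cmd("localhost", 17779, "nib", "in.fa", "out.psl") gives two lists that could be passed to subprocess.run once the BLAT programs are installed. The port passed to gfClient must be the one the server was started with.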
faToNib
faToNib - Convert from .fa to .nib format
usage:
   faToNib in.fa out.nib
nibFrag
nibFrag - Extract part of a nib file as .fa
usage:
   nibFrag file.nib start end strand out.fa
pslPretty
pslPretty - Convert PSL to human readable output
usage:
   pslPretty in.psl target.lst query.lst pretty.out
options:
   -axt           Save in Scott Schwartz's axt format
   -dot=N         Put out a dot every N records
   -long          Don't abbreviate long inserts
If the psl file contains multiple targets, it should be sorted by target; otherwise pslPretty will be very slow. The target and query lists can each be a fasta file, a nib file, or a list of fasta and/or nib files, one per line. Currently this only handles nucleotide based psl files.
.nib files
A .nib file describes a DNA sequence, packing two bases into each byte. A nib file begins with a 32 bit signature which is 0x6BE93D3A in the byte order of the machine that created the file; on a machine with the opposite byte order it reads as the byte-swapped version of the same number. This is followed by a 32 bit number in the same format which gives the number of bases in the file. This is followed by the bases themselves, packed two bases to the byte. The first base is packed in the high order 4 bits, the second base in the low order 4 bits. In C code:
   byte = (base1<<4) + base2
The numerical values for the bases are:
   0 - T, 1 - C, 2 - A, 3 - G, 4 - N (unknown), 5-15 - unused
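The layout described above can be illustrated with a short reader and writer. This is a minimal sketch in Python, not the code used by faToNib or nibFrag. Three things here are assumptions that the description does not specify: the unused low nibble of an odd-length sequence is padded with 0, characters other than T, C, A, G are written as N, and unused codes 5-15 are read back as N.

```python
import struct

NIB_SIGNATURE = 0x6BE93D3A
BASE_TO_CODE = {"T": 0, "C": 1, "A": 2, "G": 3, "N": 4}
CODE_TO_BASE = {code: base for base, code in BASE_TO_CODE.items()}


def encode_nib(seq, byte_order="<"):
    """Pack a DNA string into .nib bytes: signature, base count, packed bases.

    byte_order is "<" (little-endian) or ">" (big-endian), standing in for
    the byte order of the machine that writes the file.
    """
    codes = [BASE_TO_CODE.get(base, 4) for base in seq.upper()]
    if len(codes) % 2:
        codes.append(0)  # assumption: pad the final low nibble with 0
    body = bytes((codes[i] << 4) + codes[i + 1] for i in range(0, len(codes), 2))
    return struct.pack(byte_order + "II", NIB_SIGNATURE, len(seq)) + body


def decode_nib(data):
    """Unpack .nib bytes into a DNA string, detecting the writer's byte order."""
    for byte_order in ("<", ">"):
        signature, count = struct.unpack_from(byte_order + "II", data, 0)
        if signature == NIB_SIGNATURE:
            break
    else:
        raise ValueError("bad signature: not a .nib file")
    bases = []
    for byte in data[8:]:
        bases.append(CODE_TO_BASE.get(byte >> 4, "N"))    # first base: high 4 bits
        bases.append(CODE_TO_BASE.get(byte & 0x0F, "N"))  # second base: low 4 bits
    return "".join(bases[:count])
```

Trying both byte orders when checking the signature is how the reader copes with a file written on a machine of the opposite byte order.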
.psl files
A .psl file describes a series of alignments in a dense, easily parsed text format. It begins with a five line header which describes each field. Following this is one line for each alignment, with a tab between each field. The fields are described below in a format suitable for many relational databases.
   matches int unsigned ,       # Number of bases that match that aren't repeats
   misMatches int unsigned ,    # Number of bases that don't match
   repMatches int unsigned ,    # Number of bases that match but are part of repeats
   nCount int unsigned ,        # Number of 'N' bases
   qNumInsert int unsigned ,    # Number of inserts in query
   qBaseInsert int unsigned ,   # Number of bases inserted in query
   tNumInsert int unsigned ,    # Number of inserts in target
   tBaseInsert int unsigned ,   # Number of bases inserted in target
   strand char(2) ,             # + or - for query strand, optionally followed by + or - for target strand
   qName varchar(255) ,         # Query sequence name
   qSize int unsigned ,         # Query sequence size
   qStart int unsigned ,        # Alignment start position in query
   qEnd int unsigned ,          # Alignment end position in query
   tName varchar(255) ,         # Target sequence name
   tSize int unsigned ,         # Target sequence size
   tStart int unsigned ,        # Alignment start position in target
   tEnd int unsigned ,          # Alignment end position in target
   blockCount int unsigned ,    # Number of blocks in alignment
   blockSizes longblob ,        # Size of each block in a comma separated list
   qStarts longblob ,           # Start of each block in query in a comma separated list
   tStarts longblob ,           # Start of each block in target in a comma separated list
Currently the program does not distinguish between matches and repMatches. repMatches is always zero.
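Because every alignment is one tab-separated line with the 21 fields in the order listed above, a .psl file is straightforward to read from a script. The following is a minimal sketch in Python under that assumption. The class and function names are hypothetical, and tolerating a trailing comma in the list fields is a defensive assumption rather than something stated above.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Psl:
    matches: int
    misMatches: int
    repMatches: int
    nCount: int
    qNumInsert: int
    qBaseInsert: int
    tNumInsert: int
    tBaseInsert: int
    strand: str
    qName: str
    qSize: int
    qStart: int
    qEnd: int
    tName: str
    tSize: int
    tStart: int
    tEnd: int
    blockCount: int
    blockSizes: List[int]
    qStarts: List[int]
    tStarts: List[int]


def _int_list(text):
    """Parse a comma separated list, tolerating a trailing comma."""
    return [int(item) for item in text.split(",") if item]


def parse_psl_line(line):
    """Convert one tab-separated alignment line into a Psl record."""
    f = line.rstrip("\r\n").split("\t")
    if len(f) < 21:
        raise ValueError(f"expected 21 fields, found {len(f)}")

    def ints(lo, hi):
        return [int(x) for x in f[lo:hi]]

    return Psl(*ints(0, 8), f[8], f[9], *ints(10, 13), f[13], *ints(14, 18),
               _int_list(f[18]), _int_list(f[19]), _int_list(f[20]))


def read_psl(lines, header_lines=5):
    """Yield Psl records, skipping the header (pass 0 for headerless output)."""
    for number, line in enumerate(lines):
        if number < header_lines or not line.strip():
            continue
        yield parse_psl_line(line)
```

With an open file, list(read_psl(handle)) returns all alignments; header_lines=0 is for files written with the header suppressed.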
There is a little gotcha in the .psl format. It has to do with how coordinates are handled on the negative strand. In the qStart/qEnd fields the coordinates are where it matches from the point of view of the forward strand (even when the match is on the reverse strand). However on the qStarts[] list, the coordinates are reversed.
Here's an example of a 31-base query (positions 0 through 30) that has 2 blocks that align on the minus strand and 2 blocks that align on the plus strand (this sort of thing sometimes happens in real life in response to assembly errors).
      0         1         2         3 tens position in query
      0123456789012345678901234567890 ones position in query
                  ++++          +++++ plus strand alignment on query
          --------    ----------      minus strand alignment on query
   Plus strand:
      qStart 12 qEnd 31 blockSizes 4,5 qStarts 12,26
   Minus strand:
      qStart 4 qEnd 26 blockSizes 10,8 qStarts 5,19
Essentially the minus strand blockSizes and qStarts are what you would get if you reverse complemented the query. However the qStart and qEnd are non-reversed. To get from one to the other:
   qStart = qSize - revQEnd
   qEnd = qSize - revQStart
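The same conversion can be applied block by block to put every query block into forward-strand coordinates. The following is a minimal sketch in Python with a hypothetical function name. It assumes coordinates are zero-based with an exclusive end, which is what the example above implies (a qEnd of 31 for a last aligned position of 30), and it takes qSize to be 31 for that example.

```python
def query_blocks_forward(strand, q_size, block_sizes, q_starts):
    """Return (start, end) query ranges in forward-strand coordinates.

    Ranges are zero-based with an exclusive end, sorted by start position.
    """
    blocks = []
    for size, start in zip(block_sizes, q_starts):
        if strand.startswith("-"):
            # qStarts is given on the reverse-complemented query, so flip it:
            # start = qSize - revEnd, end = qSize - revStart
            blocks.append((q_size - (start + size), q_size - start))
        else:
            blocks.append((start, start + size))
    return sorted(blocks)


# The two alignments from the example above.
plus_blocks = query_blocks_forward("+", 31, [4, 5], [12, 26])
minus_blocks = query_blocks_forward("-", 31, [10, 8], [5, 19])
```

The first start and last end of each result reproduce the qStart and qEnd values given in the example: 12 and 31 for the plus strand, 4 and 26 for the minus strand.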
The gfServer program requires approximately 1 byte for every 3 bases in the genome it is indexing in DNA mode, and 1.5 bytes for each unmasked base in translated mode. The blat program requires approximately two bytes for each base in the genome in DNA mode, and three bytes for each base in translated mode. The other programs use relatively little memory.
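These rules of thumb are easy to turn into a quick estimate before choosing a machine. The sketch below is a direct transcription of the figures above into Python. The function name and its parameters are hypothetical, and the results are rough approximations, not measurements.

```python
def estimate_memory_bytes(program, genome_bases, translated=False, unmasked_bases=None):
    """Approximate memory use in bytes for 'gfServer' or 'blat'.

    unmasked_bases only matters for gfServer in translated mode; when it is
    not given, the whole genome is assumed to be unmasked.
    """
    if program == "gfServer":
        if translated:
            bases = genome_bases if unmasked_bases is None else unmasked_bases
            return 1.5 * bases       # 1.5 bytes per unmasked base
        return genome_bases / 3      # 1 byte for every 3 bases
    if program == "blat":
        return (3 if translated else 2) * genome_bases
    raise ValueError(f"no estimate available for {program!r}")
```

For a genome of 3 billion bases this gives about 1 gigabyte for gfServer in DNA mode and about 6 gigabytes for blat in DNA mode.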