Bioinformatics Glossary
Accession number: An identifier supplied by
the curators of the major biological databases upon submission of a novel
entry that uniquely identifies that sequence (or other) entry.
Active site: The amino acid residues at the
catalytic site of an enzyme. These residues provide the binding and activation
energy needed to place the substrate into its transition state and bridge
the energy barrier of the reaction undergoing catalysis.
Adenine: A purine base found in DNA and RNA.
Agents: Independent, autonomous, software modules
that can search the Internet for data or content pertinent to a particular
application, such as a gene, protein, or biological system.
Algorithm: A series of steps defining a procedure
or formula for solving a problem, that can be coded into a programming
language and executed. Bioinformatics algorithms typically are used to
process, store, analyze, visualize and make predictions from biological
data.
Alignment: The result of a comparison of two
or more gene or protein sequences in order to determine their degree of
base or amino acid similarity. Sequence alignments are used to determine
the similarity, homology, function or other degree of relatedness between
two or more genes or gene products.
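As an illustration of "degree of similarity", the following minimal Python sketch computes percent identity for two sequences that have already been aligned. The function name `percent_identity` and the use of "-" as the gap character are assumptions made for this example; real alignment programs report more refined scores.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Return the percentage of aligned columns that hold identical residues.

    Both inputs must already be aligned (equal length, gaps written as '-').
    Columns in which both sequences carry a gap are ignored.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    columns = [
        (a, b)
        for a, b in zip(aligned_a.upper(), aligned_b.upper())
        if not (a == gap and b == gap)
    ]
    if not columns:
        return 0.0
    # A column counts as a match only if both residues agree and neither is a gap.
    matches = sum(1 for a, b in columns if a == b and a != gap)
    return 100.0 * matches / len(columns)


if __name__ == "__main__":
    print(round(percent_identity("ACGT-A", "ACCTGA"), 1))
```

Identity is only the simplest measure; for proteins, similarity is usually scored with a substitution matrix so that conservative replacements count as partial matches.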
Allele: A given form of a gene that occupies
a specific position or locus on a chromosome. Variant forms of genes occurring
at the same locus are said to be alleles of one another.
Alternative splicing: The process by which the exons
of a single gene are joined in different combinations during mRNA splicing,
so that one gene can give rise to several distinct proteins. It occurs in
higher organisms.
Alternative splice-form: One of the possible
alternate combinations of exons, and hence one of the possible protein
products, produced from a single gene by alternative splicing.
Alu family: A common set of dispersed DNA sequences
found throughout the human genome; each is about 300 bases long and they
are repeated at least 500,000 times. Alu sequences are thought to derive
from the 7SL RNA gene and to have spread through primate genomes by
retrotransposition over tens of millions of years.
Amino acid: One of the 20 chemical building
blocks that are joined by amide (peptide) linkages to form a polypeptide
chain of a protein.
Analogy: Reasoning by which the function of
a novel gene or protein sequence may be deduced from comparisons with other
gene or protein sequences of known function. Identifying analogous
or homologous genes via similarity searching and alignment is one of the
chief uses of Bioinformatics. (See also alignment, similarity search.)
Annotation: A combination of comments, notations,
references, and citations, either in free format or utilizing a controlled
vocabulary, that together describe all the experimental and inferred information
about a gene or protein. Annotations can also be applied to the description
of other biological systems. Batch, automated annotation of bulk
biological sequence is one of the key uses of Bioinformatics tools.
Anticodon: The triplet of contiguous bases
on tRNA that binds to the codon sequence of nucleotides on mRNA. Example:
The codon for Glycine is GGG. The anticodon for Glycine is CCC.
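The codon/anticodon relationship can be sketched in a few lines of Python. The helper name `anticodon_for` is hypothetical, and the sketch assumes strict Watson-Crick pairing with both sequences written 5' to 3'; it ignores wobble pairing at the third codon position.

```python
RNA_COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}


def anticodon_for(codon: str) -> str:
    """Return the anticodon (5'->3') that pairs with an mRNA codon (5'->3')."""
    codon = codon.upper().replace("T", "U")  # accept DNA-style input as well
    if len(codon) != 3 or any(base not in RNA_COMPLEMENT for base in codon):
        raise ValueError(f"not a valid codon: {codon!r}")
    # The strands pair antiparallel, so complement the bases in reverse order.
    return "".join(RNA_COMPLEMENT[base] for base in reversed(codon))


if __name__ == "__main__":
    print(anticodon_for("GGG"))
    print(anticodon_for("AUG"))
```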
Antigen: Any foreign molecule that stimulates
an immune response in a vertebrate organism. Many antigens are proteins
such as the surface proteins of foreign organisms.
Antisense: DNA or RNA composed of the complementary
sequence to the target DNA/RNA. Also used to describe a therapeutic strategy
that uses antisense DNA or RNA sequences to target specific gene DNA sequences
or mRNA implicated in disease, in order to bind them and physically block
their expression.
Assembly: Compilation of overlapping sequences
from one or more related genes that have been clustered together based
on their degree of sequence identity or similarity. Sequence assembly may
be used to piece together "shotgun" sequencing fragments (see shotgun sequencing)
based upon overlapping restriction enzyme digests, or may be used to identify
and index novel genes from "single-pass" cDNA sequencing efforts.
Bacterial artificial chromosome (BAC): Cloning
vector that can incorporate large fragments of DNA. (see YACS)
Bacteriophage: A virus that infects bacteria.
The bacteriophage DNA has served as a basis for cloning vectors, and is
also utilized to create phage libraries containing human or other genes.
Baculovirus: An insect virus which forms the
basis of a protein expression system.
Base pair: A pair of nitrogenous bases (a purine
and a pyrimidine), held together by hydrogen bonds, that form the core
of DNA and RNA, i.e. the A:T, G:C and A:U interactions.
Beta sheet: A three dimensional arrangement
taken up by polypeptide chains that consists of alternating strands linked
by hydrogen bonds. The alternating strands together form a sheet that is
frequently twisted. One of the secondary structural elements characteristic
of proteins.
Bioinformatics:
1. The field of endeavor that relates to the collection, organization
and analysis of large amounts of biological data using networks of computers
and databases (usually with reference to the genome project and DNA sequence
information).
2. Bioinformatics is sometimes used interchangeably with the term
Computational Biology. More precisely, Computational Biology is defined as
the systematic development and application of computing systems and
computational solution techniques to models of biological phenomena;
Bioinformatics is defined as the systematic development and application of
computing systems and computational solution techniques to the analysis of
biological data obtained by experiments, modeling, database search, and
instrumentation.
Bivalent: Having two binding sites; having
2 free electrons available for binding.
Blunt-end (ligation): The joining of DNA fragments
that contain no overhang at either end and consequently no DNA bases available
for hybridization (cf. sticky-end ligation).
Carboxyl group: The -COOH functional group,
acidic in nature, found in all amino acids.
cDNA (complementary DNA): A DNA strand copied
from mRNA using reverse transcriptase. A cDNA library represents all of
the expressed DNA in a cell.
cDNA library: A set of DNA fragments prepared
from the total mRNA obtained from a selected cell, tissue or organism.
Chimeric clone: A cloning artifact created
by a foreign gene being inserted into a vector in an incorrect orientation
resulting in the expression of a protein consisting of a fusion of two
different gene products.
Chromat: Data file output from most popular
DNA sequencers. Chromat files consist of the fluorescent traces generated
by the sequencer for each of the four chemical bases, A, C, G, and T, together
with the sequence and measures of the error in the traces at each sequence
position.
Chromatin: The chromosome as it appears in
its condensed state, composed of DNA and associated proteins (mainly histones).
Chromosome: The structure in the cell nucleus
that contains all of the cellular DNA together with a number of proteins
that compact and package the DNA.
Clone: A population of genetically identical
cells or DNA molecules.
Cloning: The formation of clones or exact genetic
replicas.
Cluster: The grouping of similar objects in
a multidimensional space. Clustering is used for constructing new
features which are abstractions of the existing features of those objects.
The quality of the clustering depends crucially on the distance metric
in the space. In bioinformatics, clustering is performed on sequences,
high-throughput expression and other experimental data. Clusters of partial
or complete gene sequences can be used to identify the complete (contiguous)
sequence and to better identify its function. Clustering expression data
enables the researcher to discern patterns of co-regulation in groups of
genes.
Coding regions (CDS): The portion of a genomic
sequence bounded by start and stop codons that identifies the sequence
of the protein being coded for by a particular gene.
Codon: A sequence of three adjacent nucleotides
that designates a specific amino acid or a start/stop site for translation.
Combinatorial chemistry: The use of chemical
methods to generate all possible combinations of chemicals starting with
a subset of compounds. The building blocks may be peptides, nucleic acids
or small molecules. The libraries of compounds formed by this methodology
are used to probe for new pharmaceutical reagents (see high-throughput
screening).
Complementarity determining region (CDR): The
hypervariable regions of an antibody molecule, consisting of three loops
from the heavy chain and three from the light chain, that together form
the antigen-binding site.
Complexity (of gene sequence): The term "low
complexity sequence" may be thought of as synonymous with regions of locally
biased amino acid composition. In these regions, the sequence composition
deviates from the random model that underlies the calculation of the statistical
significance (P-value) of an alignment. Such alignments among low
complexity sequences are statistically but not biologically significant,
i.e., one cannot infer homology (common ancestry) or functional similarity.
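One simple way to make "locally biased composition" concrete is to compute the Shannon entropy of a sliding window. The sketch below is a stand-in for illustration only, not the method of any particular masking program; the window size and threshold are arbitrary assumptions, and the function names are hypothetical.

```python
import math
from collections import Counter


def shannon_entropy(seq: str) -> float:
    """Shannon entropy (bits per residue) of the residue composition of seq."""
    n = len(seq)
    if n == 0:
        return 0.0
    counts = Counter(seq.upper())
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def low_complexity_windows(seq: str, window: int = 12, threshold: float = 2.2) -> list:
    """Return start positions of windows whose entropy falls below threshold.

    window and threshold are illustrative values, not established defaults.
    """
    return [
        start
        for start in range(len(seq) - window + 1)
        if shannon_entropy(seq[start:start + window]) < threshold
    ]


if __name__ == "__main__":
    # A poly-glutamine run followed by a few other residues.
    print(low_complexity_windows("QQQQQQQQQQQQMKV", window=12, threshold=1.0))
```

Windows flagged in this way would typically be masked before a similarity search so that they do not produce statistically significant but biologically meaningless hits.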
Conformation: The precise three-dimensional
arrangement of atoms and bonds in a molecule describing its geometry and
hence its molecular function.
Consensus sequence: A single sequence delineated
from an alignment of multiple constituent sequences that represents a "best
fit" for all those sequences. A "voting" or other selection procedure is
used to determine which residue (nucleotide or amino acid) is placed at
a given position in the event that not all of the constituent sequences
have the identical residue at that position.
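The "voting" procedure can be sketched as a simple majority vote per alignment column. This is a minimal example under stated assumptions: the input sequences are already aligned to equal length, ties are reported with a placeholder character ("N" here, an arbitrary choice), and the function name `consensus` is hypothetical.

```python
from collections import Counter


def consensus(aligned: list, ambiguous: str = "N") -> str:
    """Build a majority-vote consensus from equal-length aligned sequences."""
    if len({len(seq) for seq in aligned}) != 1:
        raise ValueError("need at least one sequence, all of equal length")
    result = []
    for column in zip(*aligned):
        ranked = Counter(column).most_common()
        top_residue, top_count = ranked[0]
        # If two residues share the top count, the vote is undecided.
        tied = len(ranked) > 1 and ranked[1][1] == top_count
        result.append(ambiguous if tied else top_residue)
    return "".join(result)


if __name__ == "__main__":
    print(consensus(["ACGT", "ACGA", "ACCT"]))
```

Other selection procedures weight each sequence or emit ambiguity codes instead of a single placeholder; the majority vote is only the simplest variant.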
Constitutive synthesis (expression): Synthesis
of mRNA and protein at an unchanging or constant rate regardless of a cell’s
requirements (see housekeeping genes).
Contig: A length of contiguous sequence assembled
from partial, overlapping sequences, generated from a "shotgun" sequencing
project. Contigs are typically created computationally, by comparing
the overlapping ends of several sequencing reads generated by restriction
enzyme digestion of a segment of genomic DNA. The creation of contigs
in the presence of sequencing errors, ambiguities and the presence of repeats
is one of the most computationally challenging aspects of the role of Bioinformatics
in genome analysis.
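The idea of building a contig from overlapping reads can be illustrated with a greedy merge on exact suffix/prefix overlaps. This is a toy sketch: it assumes error-free reads on the same strand, no repeats and no read contained inside another, which is exactly what real assemblers cannot assume. The names `overlap` and `greedy_assemble` and the minimum overlap of 3 are choices made for this example.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of a that equals a prefix of b, or 0."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads: list, min_len: int = 3) -> list:
    """Repeatedly merge the pair of sequences with the longest exact overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_len:
                        best_len, best_i, best_j = n, i, j
        if best_len == 0:
            return contigs  # nothing left to merge
        merged = contigs[best_i] + contigs[best_j][best_len:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)


if __name__ == "__main__":
    print(greedy_assemble(["ATGCGT", "CGTACC", "ACCGGA"]))
```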
Convergence
The end-point of any algorithm that uses iteration or recursion to guide
a series of data processing steps. An algorithm is usually said to have
reached convergence when the difference between successive computed values
(or between computed and observed values) falls below a pre-defined threshold.
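A generic convergence test can be sketched as follows. The helper `iterate_until_converged` is hypothetical, and the tolerance and iteration cap are arbitrary; the usage example applies it to Newton's update for the square root of 2 purely because the answer is easy to check.

```python
def iterate_until_converged(update, start: float, tol: float = 1e-9, max_iter: int = 1000):
    """Apply update() repeatedly until successive values differ by less than tol.

    Returns the converged value and the number of steps taken.
    """
    value = start
    for step in range(1, max_iter + 1):
        new_value = update(value)
        if abs(new_value - value) < tol:  # the convergence criterion
            return new_value, step
        value = new_value
    raise RuntimeError("no convergence within max_iter steps")


if __name__ == "__main__":
    # Newton's iteration for sqrt(2): x <- (x + 2/x) / 2
    root, steps = iterate_until_converged(lambda x: 0.5 * (x + 2.0 / x), 1.0)
    print(root, steps)
```

The iteration cap matters in practice: an algorithm that oscillates or diverges would otherwise never satisfy the threshold.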
Cosmids
DNA vectors that allow the insertion of long fragments of DNA (up to
50 kbases).
Crystal structure
Term used to describe the high resolution molecular structure derived
by X-ray crystallographic analysis of protein or other biomolecular crystals.
Cytoplasm
The medium of the cell between the nucleus and the cell membrane.
Cytosine
A pyrimidine base found in DNA and RNA.
Data Cleaning
A process whereby automated or semi-automated algorithms are used to
process experimental data, including noise, experimental errors and other
artifacts, in order to generate and store high-quality data for use in
subsequent analysis. Data cleaning is typically required in high-throughput
sequencing where compression or other experimental artifacts limit the
amount of sequence data generated from each sequencing run or "read."
Data Mining
The ability to query very large databases in order to satisfy a hypothesis
("top-down" data mining); or to interrogate a database in order to generate
new hypotheses based on rigorous statistical correlations ("bottom-up"
data mining).
Data Processing
Data processing is defined as the systematic performance of operations
upon data such as handling, merging, sorting, and computing. The semantic
content of the original data should not be changed, but the semantic content
of the processed data may be changed.
Data Warehouses
Vast arrays of heterogeneous (biological) data, stored within a single
logical data repository, that are accessible to different querying and
manipulation methods.
Database
Any file system by which data gets stored following a logical process.
(see also relational database)
Deconvolution
Mathematical procedure to separate out the overlapping effects of molecules
such as mixtures of compounds in a high-throughput screen, or mixtures
of cDNAs in a high density array.
Deletion
A chromosomal alteration in which a portion of the chromosome or the
underlying DNA is lost.
Deletion mapping
Process in which different deletions in a region of DNA are created
and used to map the functionally critical areas of that DNA, e.g. the minimal
region of DNA required for a test promoter can be ascertained by systematic
deletions in the region of interest.
Dendrogram
A graphical procedure for representing the output of a hierarchical
clustering method. A dendrogram is strictly defined as a binary tree
with a distinguished root, that has all the data items at its leaves.
Conventionally, all the leaves are shown at the same level of the drawing.
The ordering of the leaves is arbitrary, as is their horizontal position.
The heights of the internal nodes may be arbitrary, or may be related to
the metric information used to form the clustering.
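To show how a hierarchical clustering yields a binary tree with the data items at its leaves, here is a small single-linkage sketch in plain Python. Single linkage and Euclidean distance are assumptions chosen for brevity (other linkage rules and metrics are common), and the nested-tuple representation `(left, right, merge_distance)` is just one convenient encoding of a dendrogram.

```python
def single_linkage(points: list, labels: list):
    """Agglomerative single-linkage clustering.

    Returns a nested tuple (left, right, merge_distance) whose leaves are the
    labels; merge_distance plays the role of the internal node height.
    Requires at least one point.
    """
    def dist(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

    # Each cluster is (subtree, indices of the member points).
    clusters = [(label, [i]) for i, label in enumerate(labels)]

    def cluster_distance(c1, c2):
        return min(dist(points[i], points[j]) for i in c1[1] for j in c2[1])

    while len(clusters) > 1:
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                d = cluster_distance(clusters[x], clusters[y])
                if best is None or d < best[0]:
                    best = (d, x, y)
        d, x, y = best
        merged = ((clusters[x][0], clusters[y][0], round(d, 3)),
                  clusters[x][1] + clusters[y][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (x, y)]
        clusters.append(merged)
    return clusters[0][0]


if __name__ == "__main__":
    # Three hypothetical genes described by a single expression value each.
    print(single_linkage([(0.0,), (1.0,), (5.0,)], ["g1", "g2", "g3"]))
```

The left/right order of the children carries no meaning, which matches the remark above that the ordering of the leaves is arbitrary.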
Dimer
A composite molecule formed by the binding of two molecules (see homodimers
and heterodimers).
Disulfide bond
Covalent link formed between the sulfur atoms of two different cysteine
residues in a protein. Important in maintaining the folded structure of
a protein, and also for linking different proteins in a complex.
DNA (deoxyribonucleic acid)
The chemical that forms the basis of the genetic material in virtually
all organisms. DNA is composed of the four nitrogenous bases Adenine, Cytosine,
Guanine, and Thymine, which are covalently bonded to a backbone of deoxyribose-phosphate
to form a DNA strand. Two complementary strands (where all Gs pair with
Cs and As with Ts) form a double helical structure which is held together
by hydrogen bonding between the cognate bases.
DNA fingerprinting
A technique for identifying human individuals based on a restriction
enzyme digest of tandemly repeated DNA sequences that are scattered throughout
the human genome, but are unique to each individual.
DNA microarrays
The deposition of oligonucleotides or cDNAs onto an inert substrate
such as glass or silicon. Thousands of molecules may be organized spatially
into a high-density matrix. These DNA chips may be probed to allow expression
monitoring of many thousands of genes simultaneously. Uses include study
of polymorphisms in genes, de novo sequencing or molecular diagnosis of
disease.
DNA polymerase
An enzyme that catalyzes the synthesis of DNA from a DNA template given
the deoxyribonucleotide precursors.
DNA probes
Short single stranded DNA molecules of specific base sequence, labeled
either radioactively or immunologically, that are used to detect and identify
the complementary base sequence in a gene or genome by hybridizing specifically
to that gene or sequence.
DNA sequencing
The technique in which the specific sequence of bases forming a particular
DNA region is deciphered.
DNase (Deoxyribonuclease)
One of a series of enzymes that can digest DNA.
Domain (protein)
A region of special biological interest within a single protein sequence.
However, a domain may also be defined as a region within the three-dimensional
structure of a protein that may encompass regions of several distinct protein
sequences that accomplishes a specific function. A domain class is a group
of domains that share a common set of well-defined properties or characteristics.
Drug
An agent that affects a biological process. Specifically, a molecule
whose molecular structure can be correlated with its pharmacological activity.
Drug discovery cycle
The cycle of events required to develop a new drug. Typically this involves
research, preclinical testing and clinical development, and can take from
5 to 12 years.
Electronic Northerns
The use of an electronic database of cDNA sequences (or probes derived
from them) in order to measure the relative levels of mRNAs expressed in
different cells or tissues. An example of the use of an electronic Northern
might be to identify the differences in the genes expressed in prostate
cancer and those in benign prostate hyperplasia, by subtracting the database
of one from the other and seeing which cDNAs remain.
Electrophoresis
The use of an external electric field to separate large biomolecules
on the basis of their charge by running them through acrylamide
or agarose gels.
Enhancers
DNA sequences that can greatly increase the transcription rates of genes
even though they may be far upstream or downstream from the promoter they
stimulate.
Enzyme
A class of proteins that are capable of catalyzing chemical reactions
(the making or breaking of chemical bonds). They do so by orienting their
substrates into a suitable geometry in a particular location (the active
site) where electrophilic or nucleophilic amino acid residues can participate
in the reaction. Enzymes are protein catalysts that speed up chemical reactions
that would otherwise be prohibitively slow under physiological conditions.
Epigenomics
The study of complex expression networks or linkages both spatially
(within the body) and temporally (at different times in development).
Equilibrium constant
Value that describes the equilibrium state of the reversible reaction
between two molecular species.
Eukaryote
A cell or organism with a distinct membrane-bound nucleus as well as
specialized membrane-based organelles (see also prokaryote).
Exon
The region of DNA within a gene that codes for a polypeptide chain or
domain. Typically a mature protein is composed of several domains coded
by different exons within a single gene.
Expressed Sequence Tags (ESTs)
A small sequence from an expressed gene that can be amplified by PCR.
ESTs act as physical markers for cloning and full length sequencing of
the cDNAs of expressed genes. Typically identified by purifying mRNAs,
converting to cDNAs, and then sequencing a portion of the cDNAs.
Expression (gene or protein)
A measure of the presence, amount, and time-course of one or more gene
products in a particular cell or tissue. Expression studies are typically
performed at the RNA (mRNA) or protein level in order to determine the
number, type, and level of genes that may be up-regulated or down-regulated
during a cellular process, in response to an external stimulus, or in sickness
or disease. Gene chips and proteomics now allow the study of expression
profiles of sets of genes or even entire genomes.
Expression profile
The level and duration of expression of one or more genes, selected
from a particular cell or tissue type, generally obtained by a variety
of high-throughput methods, such as sample sequencing, serial analysis,
or microarray-based detection.
Expression vector
A cloning vector that is engineered to allow the expression of protein
from a cDNA. The expression vector provides an appropriate promoter and
restriction sites that allow insertion of cDNA.
FASTA format
A sequence in FASTA format begins with a single-line description, followed
by lines of sequence data. The description line is distinguished from the
sequence data by a greater-than (">") symbol in the first column. It is
recommended that all lines of text be shorter than 80 characters in length.
An example sequence in FASTA format is:
>gi|532319|pir|TVFV2E|TVFV2E envelope protein
ELRLRYCAPAGFALLKCNDADYDGFKTNCSNVSVVHCTNLMNTTVTTGLLLNGSYSENRT
QIWQKHRTSNDSALILLNKHYNLTVTCKRPGNKTVLPVTIMAGLVFHSQKYNLRLRQAWC
HFPSNWKGAWKEVKEEIVNLPKERYRGTNDPKRIFFQRQWGDPETANLWFNCHGEFFYCK
MDWFLNYLNNLTVDADHNECKNTSGTKSGNKRAPGPCVQRTYVACHIRSVIIWLETISKK
TYAPPREGHLECTSTVTGMTVELNYIPKNRTNVTLSPQIESIWAAELDRYKLVEITPIGF
APTEVRRYTGGHERQKRVPFVXXXXXXXXXXXXXXXXXXXXXXVQSQHLLAGILQQQKNL
LAAVEAQQQMLKLTIWGVK
A FASTA file can also contain multiple sequences:
>VECTOR32 Synthetic vector sequence #32
ATGAGCGGCGGCCCCATGGGCGGCAGGCCCGGCGGCAGGGGCGCCCCCGCCGTGCAGCAG
AACATCCCCAGCACCCTGCTGCAGGACCACGAGAACCAGAGGCTGTTCGAGATGCTGGGC
>VECTOR33 Synthetic vector sequence #33
ACGAGCGGCGGTCCCATGGGCGCCAGGCCCGGCGGCAGGGGCGCTGCCGCCGTGCAGCAC
ATCATCCCCAGCACCCTGCAGCAGGACCACGAGTACCAGAGGCTGTTCGAGATGCTGGGC
>VECTOR34 Synthetic vector sequence #34
GTGAGCGGCGGCTACTTGGGCGGCAGGCCCGGCGGCAGGGGCGCCCACGCCGTGCAGCAG
Sequences are expected to be represented in the standard IUB/IUPAC amino
acid and nucleic acid codes with these exceptions: lower-case letters
are accepted and are mapped into upper-case; a single hyphen or dash can
be used to represent a gap of indeterminate length; and in amino acid sequences,
U and * are acceptable letters. Invalid characters (digits,
blanks) are automatically removed.
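The format rules above can be turned into a short reader. The sketch below is a minimal illustration, not the parser of any specific tool: the function name `parse_fasta` is hypothetical, and it applies only the clean-up described in this entry (upper-casing, and removal of digits and blanks).

```python
def parse_fasta(text: str):
    """Yield (description, sequence) pairs from FASTA-formatted text."""
    description, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new description line closes the previous record, if any.
            if description is not None:
                yield description, "".join(chunks)
            description, chunks = line[1:].strip(), []
        elif description is not None:
            # Drop digits and blanks; map lower-case letters to upper-case.
            cleaned = "".join(
                ch for ch in line if not ch.isdigit() and not ch.isspace()
            )
            chunks.append(cleaned.upper())
    if description is not None:
        yield description, "".join(chunks)


if __name__ == "__main__":
    example = ">VECTOR32 Synthetic vector sequence #32\natgagc ggc\nGGCCCC\n"
    for name, seq in parse_fasta(example):
        print(name, seq)
```

Sequence lines are concatenated, so a record may be wrapped at any line length; hyphens and the `*` character are kept because the format allows them.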
Fingerprint
A fingerprint is a set of motifs used to predict the occurrence of similar
motifs, in either an individual sequence or in a database. Fingerprints
are refined by iterative scanning of a composite protein sequence database.
A composite or multiple-motif fingerprint contains a number of aligned
motifs taken from different parts of a multiple alignment. True family
members are then easy to identify by virtue of possessing all elements
of the fingerprint, while subfamily members may be identified by possessing
only part of it.
Frameshift
A deletion or insertion (including duplication) of a number of bases that
is not a multiple of three, which causes the reading frame of a structural
gene to shift from the normal series of triplets.
Functional genomics
The use of genomic information to delineate protein structure, function,
pathways and networks. Function may be determined by "knocking out" or
"knocking in" expressed genes in model organisms such as worm, fruitfly,
yeast or mouse.
Fusion protein
The protein resulting from the genetic joining and expression of two different
genes (see chimeric clone).
Gaps (affine gaps)
A gap is defined as any maximal, consecutive run of spaces in a single
string of a given alignment. Gaps help create alignments that better conform
to underlying biological models and more closely fit patterns that one
expects to find in a meaningful alignment. The idea is to take into account
the number of contiguous gaps, and not only the number of spaces, when calculating
an alignment. Affine gaps contain a component for gap insertion and a component
for gap extension, where the extension penalty is usually much lower than
the insertion penalty. This mimics biological reality as multiple gaps
would imply multiple mutations, but a single mutation can lead to a long
gap quite easily.
Gap penalties
The penalty applied to a similarity score for the introduction of an
insertion or deletion gap, the extension of a gap, or both. Gap penalties
are usually subtracted from a cumulative score being determined for the
comparison of two or more sequences via an optimization algorithm that
attempts to maximize that score.
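An affine gap penalty can be written as a one-line formula and applied to each maximal run of gap characters. In this sketch the opening cost covers the first gap position and the extension cost is charged for each further position; conventions differ between programs, and the default values 10.0 and 0.5 as well as the function names are assumptions for illustration.

```python
from itertools import groupby


def affine_gap_penalty(length: int, gap_open: float = 10.0, gap_extend: float = 0.5) -> float:
    """Penalty for one gap of the given length: open once, then extend."""
    if length <= 0:
        return 0.0
    return gap_open + gap_extend * (length - 1)


def total_gap_penalty(aligned: str, gap_open: float = 10.0,
                      gap_extend: float = 0.5, gap_char: str = "-") -> float:
    """Sum the affine penalties over every maximal run of gap characters."""
    total = 0.0
    for char, run in groupby(aligned):
        if char == gap_char:
            total += affine_gap_penalty(len(list(run)), gap_open, gap_extend)
    return total


if __name__ == "__main__":
    print(total_gap_penalty("AC---GT-A"))
```

With these example values one gap of five positions costs 12.0, whereas five separate single-position gaps cost 50.0, which reflects the reasoning that one mutation event can create a long gap.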
Gel electrophoresis
A technique by which molecules are separated by size or charge by passing
them through a gel under the influence of an external electric field.
Gene Index
A listing of the number, type, label and sequence of all the genes identified
within the genome of a given organism. Gene indices are usually created
by assembling overlapping EST sequences into clusters, and then determining
if each cluster corresponds to a unique gene. Methods by which a cluster
can be identified as representing a unique gene include identification
of long open reading frames (ORFs), comparison to genomic sequence, and
detection of SNPs or other features in the cluster that are known to exist
in the gene.
GenBank
Data bank of genetic sequences operated by a division of the National
Institutes of Health.
Gene
Classically, a unit of inheritance. In practice, a gene is a segment
of DNA on a chromosome that encodes a protein and all the regulatory sequences
(promoter) required to control expression of that protein.
Gene chips (also Gene arrays)
The covalent attachment of oligonucleotides or cDNA directly onto a
small glass or silicon chip in organized arrays. Over 50,000 different
DNA fragments can be presented on a single chip providing a high throughput
parallel method of probing gene expression, genotype or gene function.
Gene expression
The conversion of information from gene to protein via transcription
and translation.
Gene families
Subsets of genes containing homologous sequences which usually correlate
with a common function.
Gene library
A collection of cloned DNA fragments created by restriction endonuclease
digestion that represent part or all of an organism’s genome.
Gene product
The product, either RNA or protein, that results from expression of
a gene. The amount of gene product reflects the activity of the gene.
Gene therapy
The use of genetic material for therapeutic purposes. The therapeutic
gene is typically delivered using recombinant virus or liposome based delivery
systems.
Genetic code
The mapping of all possible codons into the 20 amino acids including
the start and stop codons.
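The mapping can be expressed as a lookup table. The sketch below builds the standard genetic code from the conventional compact listing (bases in the order T, C, A, G; `*` marks a stop codon) and uses it in a hypothetical `translate` helper. It assumes the standard code only; some organisms and organelles use variant codes.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, one letter per codon in TCAG order (TTT, TTC, TTA, ...).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(sequence: str, stop_at_stop: bool = True) -> str:
    """Translate a coding sequence, reading codons from the first base."""
    dna = sequence.upper().replace("U", "T")  # accept mRNA input as well
    protein = []
    for i in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE.get(dna[i:i + 3], "X")  # X = unknown codon
        if amino_acid == "*" and stop_at_stop:
            break
        protein.append(amino_acid)
    return "".join(protein)


if __name__ == "__main__":
    print(translate("ATGGGGTAA"))
```

ATG serves both as the usual start codon and as the codon for methionine, so the table itself does not distinguish the two roles.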
Genetic engineering (Recombinant DNA technology)
The procedures used to isolate, splice and manipulate DNA outside the
cell. Genetic Engineering allows a recombinantly engineered DNA segment
to be introduced into a foreign cell or organism, and be able to replicate
and function normally.
Genetic marker
Any gene that can be readily recognized by its phenotypic effect, and
which can be used as a marker for a cell, chromosome, or individual carrying
that gene. Also, any detectable polymorphism used to identify a specific
gene.
Genome
The complete genetic content of an organism.
Genomic DNA (sequence)
DNA sequence typically obtained from mammalian or other higher-order
species, which includes both intron and exon sequence (coding sequence),
as well as non-coding regulatory sequences such as promoter, and enhancer
sequences.
Genomics
The analysis of the entire genome of a chosen organism.
Genotype
Strictly, all of the genes possessed by an individual. In practice,
the particular alleles present in a specific genetic locus.
Glycosylation
The addition of carbohydrate groups (sugars), e.g. to polypeptide chains.
Guanine (G)
One of the nitrogenous purine bases found in DNA and RNA.
Hairpin
A double-helical region in a single DNA or RNA strand formed by the
hydrogen-bonding between adjacent inverse complementary sequences to form
a hairpin shaped structure.
Haploid
A cell or organism containing only one set of chromosomes without the
homologous pairs. (cf. diploid)
Heterodimer
Protein composed of 2 different chains or subunits.
Heteroduplex
Hybrid structure formed by the annealing of two DNA strands (or an RNA
and DNA) that have sufficient complementarity in their sequence to allow
hydrogen bonding.
Hidden Markov model (HMM)
A joint statistical model for an ordered sequence of variables.
The result of stochastically perturbing the variables in a Markov chain
(the original variables are thus "hidden"), where the Markov chain has
discrete variables which select the "state" of the HMM at each step. The
perturbed values can be continuous and are the "outputs" of the HMM. A
Hidden Markov Model is equivalently a coupled mixture model where the joint
distribution over states is a Markov chain. Hidden Markov models are valuable
in bioinformatics because they allow a search or alignment algorithm to
be trained using unaligned or unweighted input sequences; and because they
allow position-dependent scoring parameters such as gap penalties, thus
more accurately modeling the consequences of evolutionary events on sequence
families.
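A small worked example may help. The sketch below implements the Viterbi algorithm, which finds the most probable sequence of hidden states for a series of observations. It uses discrete outputs rather than the continuous ones mentioned above, and the two-state "AT-rich"/"GC-rich" model with its probabilities is invented purely for illustration. All probabilities must be non-zero because the code works with logarithms.

```python
import math

# Hypothetical toy model: a DNA sequence switches between two compositions.
STATES = ("AT-rich", "GC-rich")
START_P = {"AT-rich": 0.5, "GC-rich": 0.5}
TRANS_P = {
    "AT-rich": {"AT-rich": 0.9, "GC-rich": 0.1},
    "GC-rich": {"AT-rich": 0.1, "GC-rich": 0.9},
}
EMIT_P = {
    "AT-rich": {"A": 0.4, "T": 0.4, "C": 0.1, "G": 0.1},
    "GC-rich": {"A": 0.1, "T": 0.1, "C": 0.4, "G": 0.4},
}


def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most probable hidden-state path for a non-empty sequence."""
    # scores[s] = best log-probability of any path that ends in state s
    scores = {
        s: math.log(start_p[s]) + math.log(emit_p[s][observations[0]])
        for s in states
    }
    backpointers = []
    for symbol in observations[1:]:
        new_scores, pointers = {}, {}
        for s in states:
            prev = max(states, key=lambda p: scores[p] + math.log(trans_p[p][s]))
            new_scores[s] = (scores[prev] + math.log(trans_p[prev][s])
                             + math.log(emit_p[s][symbol]))
            pointers[s] = prev
        scores = new_scores
        backpointers.append(pointers)
    # Trace back from the best final state.
    path = [max(states, key=lambda s: scores[s])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    return path[::-1]


if __name__ == "__main__":
    print(viterbi("AATTGCGC", STATES, START_P, TRANS_P, EMIT_P))
```

Profile HMMs used for sequence families follow the same principle, with match, insert and delete states at each alignment position instead of two composition states.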
High-throughput screening
The method by which very large numbers of compounds are screened against
a putative drug target in either cell-free or whole-cell assays. Typically,
these screenings are carried out in 96 well plates using automated, robotic
station based technologies or in higher-density array ("chip") formats.
HLA complex
Another name for the MHC in humans; refers to the "Human Leukocyte Antigen"
complex located on chromosome 6.
Homeobox
A highly conserved region in a homeotic gene composed of 180 bases (60
amino acids) that specifies a protein domain (the homeodomain) that serves
as a master genetic regulatory element in cell differentiation during development
in species as diverse as worms, fruitflies, and humans.
Homeodomain
A 60 amino-acid protein domain coded for by the homeobox region of a
homeotic gene.
Homeotic gene
A gene that controls the activity of other genes involved in the development
of a body plan. Homeotic genes have been found in organisms ranging from
plants to humans.
Homology
(strict) Two or more biological species, systems or molecules that share
a common evolutionary ancestor. (general) Two or more gene or protein sequences
that share a significant degree of similarity, typically measured by the
amount of identity (in the case of DNA), or conservative replacements (in
the case of protein), that they register along their lengths. Sequence
"homology" searches are typically performed with a query DNA or protein
sequence to identify known genes or gene products that share significant
similarity and hence might inform on the ancestry, heritage and possible
function of the query gene.
Housekeeping genes
Genes that are always expressed (i.e., they are said to be constitutively
expressed) due to their constant requirement by the cell.
Human Anti-Murine Antibody Response (HAMA)
An immune response generated in humans to antibodies raised in murine
(e.g. mouse or rat) cells.
Hybridization
The interaction of complementary nucleic acid strands. This can occur
between two DNA strands or between DNA and RNA strands, and is the basis
of many techniques such as Southern and northern blots.
Hydrogen bond
A weak chemical interaction between an electronegative atom (e.g. nitrogen
or oxygen) and a hydrogen atom that is covalently attached to another atom.
This bond holds the two strands of the DNA double helix together and is also
the primary interaction between water molecules.
Hydrophilicity
(lit. water-loving) The degree to which a molecule is soluble in water.
Hydrophilicity depends to a large degree on the charge and polarizability
of the molecule and its ability to form transient hydrogen-bonds with (polar)
water molecules.
Hydrophobicity
(lit. water-hating) The degree to which a molecule is insoluble in water,
and hence is soluble in lipids. If a molecule lacking polar groups is placed
in water, it will be entropically driven to find a hydrophobic environment
(such as the interior of a protein or a membrane).
Idiotype
Antibody variants localized to the variable portion of an immunoglobulin
that are recognised by their antigenic determinants. The determinants are
composed from the antigen-combining site or CDRs. Every unique antigenic
determinant has a specific antibody with its own unique idiotype.
Immunoglobulin
A member of the globulin protein family consisting of two light and
two heavy chains linked by disulfide bonds. All antibodies are immunoglobulins.
in silico (biology)
(Lit. "in silicon", i.e. computer mediated). The use of computers to simulate, process,
or analyse a biological experiment.
in situ hybridization
A variation of the DNA/RNA hybridization procedure in which the denatured
DNA is in place in the cell and is then challenged with RNA or DNA extracted
from another source. (See also fluorescence in situ hybridization).
Integration
The physical insertion of DNA into the host cell genome. Retroviruses
use this process, which is catalysed by a specific enzyme (integrase); other
DNA, e.g. transposons, can integrate at random sites.
Intracellular signalling
The communication of a molecular message from the surface of the cell
to the nucleus via the participation of a series of molecules, including
receptors, enzymes, proteins, and small-molecules. The end result of the
signalling process is the up- or down-regulation of a particular series
of genes that may be involved in cell growth, division or differentiation.
Introns
Nucleotide sequences found in the structural genes of eukaryotes that
are non-coding and interrupt the sequences containing information that
codes for polypeptide chains. Intron sequences are spliced out of their
RNA transcripts before maturation and protein synthesis. (cf. Exons)
Isoschizomers
Two different restriction enzymes which recognize the same recognition
sequence, e.g. Sma I and Xma I both recognize the sequence CCCGGG
(although they cut it at different positions).
Isozymes
Two or more enzymes capable of catalyzing the same reaction but varying
in their specificity due to differences in their structures and hence their
efficiencies under different environmental conditions.
Iteration
A series of steps in an algorithm whereby the processing of data is
performed repetitively until the result exceeds a particular threshold.
Iteration is often used in multiple sequence alignments whereby each set
of pairwise alignments is compared with every other, starting with the
most similar pairs and progressing to the least similar, until there are
no longer any sequence-pairs remaining to be aligned.
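The ordering of pairs described in this entry can be sketched in Python. This is only an illustration, not a real alignment program: it assumes the sequences already have equal length, scores each pair by simple fractional identity, and the names `identity`, `pairs_by_similarity` and the example sequences are hypothetical.

```python
from itertools import combinations

def identity(a, b):
    """Fraction of positions at which two equal-length sequences agree."""
    if len(a) != len(b):
        raise ValueError("this sketch assumes equal-length sequences")
    return sum(x == y for x, y in zip(a, b)) / len(a)

def pairs_by_similarity(seqs):
    """Return (name1, name2, identity) tuples, most similar pair first."""
    scored = [(n1, n2, identity(seqs[n1], seqs[n2]))
              for n1, n2 in combinations(sorted(seqs), 2)]
    return sorted(scored, key=lambda t: t[2], reverse=True)

# Hypothetical example sequences
seqs = {"s1": "GATTACA", "s2": "GATTAGA", "s3": "CATGACA"}

remaining = pairs_by_similarity(seqs)
order = []
while remaining:                      # iterate until no pairs are left
    order.append(remaining.pop(0))    # process the most similar pair next

for name1, name2, score in order:
    print(name1, name2, round(score, 2))
```

The loop ends when the list of pairs is empty, which plays the role of the stopping condition in the definition.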
Junk DNA
Term used to describe the excess DNA that is present in the genome beyond
that required to encode proteins. A misleading term since these regions
are likely to be involved in gene regulation, and other as yet unidentified
functions.
Karyotype
The constitution (typically number and size) of chromosomes in a cell
or individual.
Knockout mice (gene targeting)
Mice which have been engineered to lack a chosen gene. The gene is inactivated
in so-called embryonic stem cells using the technique of homologous recombination.
These cells are then introduced into an early-stage embryo (blastocyst)
and this is then transplanted into a recipient mouse. The subsequent progeny
lack the targeted gene in some cells. This technique is used to determine
the function of the chosen gene.
"Lab on a chip"
Term describing microdevices that allow rapid, microscale analysis
of DNA or protein in a single, fully integrated system. Typically, these
devices are miniature surfaces, made of silicon, glass or plastic, which
carry the necessary microdevices (pumps, valves, microfluidic controllers,
and detectors) that allow sample separation and analysis. These devices
are used in drug discovery, genetic testing and separation science.
Lead compound
A candidate compound identified as the best "hit" (tight binder) after
screening of a combinatorial (or other) compound library, that is then
taken into further rounds of screening to determine its suitability as
a drug.
Lead optimization
The process of converting a putative lead compound ("hit") into a therapeutic
drug with maximal activity and minimal side effects, typically using a
combination of computer-based drug design, medicinal chemistry and pharmacology.
Leucine zipper
DNA-binding protein motif in which 4-5 leucines are found at intervals
of 7 amino acids. This motif is typically present in transcription factors
and other proteins that bind DNA.
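A heptad leucine repeat of this kind can be searched for with a regular expression. The sketch below assumes a one-letter amino acid string; the function name `find_leucine_repeats` and the test sequence are hypothetical, and a match is only a hint, not proof of a functional zipper.

```python
import re

def find_leucine_repeats(protein, n_leucines=4):
    """Return start positions of n_leucines leucines spaced 7 residues apart."""
    # L, then (six arbitrary residues + L) repeated n_leucines - 1 times.
    # The lookahead lets overlapping matches be reported.
    pattern = re.compile(f"(?=(L(?:.{{6}}L){{{n_leucines - 1}}}))")
    return [m.start() for m in pattern.finditer(protein.upper())]

# Hypothetical sequence: leucines at positions 0, 7, 14 and 21
example = "L" + ("A" * 6 + "L") * 3
print(find_leucine_repeats(example))
```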
Lexicon
In Bioinformatics, a lexicon refers to a pre-defined list of terms that
together completely define the contents of a particular database.
(strict.) The component in the grammar which is in bare form a list
of words or lexical entries.
Library
A large collection of compounds, peptides, cDNAs or genes which may
be screened in order to isolate cognate molecules.
Ligand
Any small molecule that binds to a protein or receptor; the cognate
partner of many cellular proteins, enzymes, and receptors.
Linkage
The association of genes (or genetic loci) on the same chromosome. Genes
that are linked together tend to be transmitted together.
Linkage map
A genetic map of a chromosome or genome delineated by mapping the positions
of genes to their chromosomes by their linkage to readily identifiable
genetic loci.
Locus
The specific position occupied by a gene on a chromosome. At a given
locus, any one of the variant forms of a gene may be present. The variants
are said to be alleles of that gene.
Map unit
A measure of genetic distance between two linked genes that corresponds
to a recombination frequency of 1%.
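As a worked example of this definition, the distance in map units is simply the percentage of offspring that are recombinant. The function name `map_units` and the counts are hypothetical; the linear relationship is only a reasonable approximation for closely linked genes.

```python
def map_units(recombinant_offspring, total_offspring):
    """Genetic distance in map units: 1 unit = 1% recombination frequency."""
    if total_offspring <= 0:
        raise ValueError("total_offspring must be positive")
    return 100.0 * recombinant_offspring / total_offspring

# Hypothetical cross: 12 recombinants among 100 offspring
print(map_units(12, 100))  # 12.0 map units
```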
Markov chain
Any multivariate probability density whose independence diagram is a
chain. The variables are ordered, and each variable "depends" only on its
neighbors in the sense of being conditionally independent of the others.
Markov chains are an integral component of hidden Markov models.
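For sequence data, the simplest case is a first-order chain in which each base depends only on the base before it. The sketch below estimates the transition probabilities from a single sequence by counting adjacent pairs; the function name `transition_probs` and the toy sequence are hypothetical, and no pseudocounts are applied.

```python
from collections import Counter, defaultdict

def transition_probs(seq):
    """Estimate P(next symbol | current symbol) from adjacent pairs in seq."""
    counts = defaultdict(Counter)
    for current, following in zip(seq, seq[1:]):
        counts[current][following] += 1
    return {
        current: {nxt: n / sum(c.values()) for nxt, n in c.items()}
        for current, c in counts.items()
    }

# Toy sequence: pairs are AA, AC, CA
probs = transition_probs("AACA")
print(probs["A"])  # A is followed by A half the time and by C half the time
print(probs["C"])  # C is always followed by A
```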
Meiosis
A process within the cell nucleus that results in the reduction of the
chromosome number from diploid (two copies of each chromosome) to haploid
(a single copy) through two successive divisions in germ cells.
Melting (of DNA)
The denaturation of double-stranded DNA into two single strands by the
application of heat. (Denaturation breaks the hydrogen bonds holding the
double-stranded DNA together).
Messenger RNA (mRNA)
The complementary RNA copy of DNA formed from a single-stranded DNA
template during transcription. It is processed in the nucleus and then
migrates to the cytoplasm, where it carries the information to code
for a polypeptide.
Methylation
The addition of -CH3 (methyl) groups to a target site. Typically such
addition occurs on to the cytosine bases of DNA. (see maternal imprinting).
Microarray
A 2D array, typically on a glass, filter, or silicon wafer, upon which
genes or gene fragments are deposited or synthesized in a predetermined
spatial order allowing them to be made available as probes in a high-throughput,
parallel manner.
Microfluidics
The miniaturization of chemical reactions or pharmacological assays
into microscopic tubes or vessels in order to greatly increase their throughput,
by placing many of them side-by-side in an array.
Mimetics
Compounds that mimic the function of other molecules via their high
degree of structural (conformational) similarity, and hence physico-chemical
properties.
Missense mutation
A point mutation in which one codon (triplet of bases) is changed into
another designating a different amino acid.
Mitosis
The nuclear division that results in the replication of the genetic
material and its redistribution into each of the daughter cells during
cell division.
Modeling
In bioinformatics, modeling usually refers to molecular modeling, a
process whereby the three-dimensional architecture of biological molecules
is interpreted (or predicted), visually represented, and manipulated in
order to determine their molecular properties. (general) A series of mathematical
equations or procedures which simulate a real-life process, given a set
of assumptions, boundary parameters, and initial conditions.
Monomer
A single unit of any biological molecule or macromolecule, such as an
amino acid, nucleotide, polypeptide domain, or protein.
Monovalent
Having one binding site; strictly, an atom with only one free electron
available for binding in its highest energy shell.
Motif
A conserved element of a protein sequence alignment that usually correlates
with a particular function. Motifs are generated from a local multiple
protein sequence alignment corresponding to a region whose function or
structure is known. It is sufficient that it is conserved, and is hence
likely to be predictive of any subsequent occurrence of such a structural/functional
region in any other novel protein sequence.
Multigene family
A set of genes derived by duplication of an ancestral gene, followed
by independent mutational events resulting in a series of independent genes
either clustered together on a chromosome or dispersed throughout the genome.
Multiple (sequence) alignment
A Multiple Alignment of k sequences is a rectangular array, consisting
of characters taken from the alphabet A, that satisfies the following
conditions: there are exactly k rows; ignoring the gap character "-",
row number i is exactly the sequence s_i; and each
column contains at least one character different from "-". In practice
multiple sequence alignments include a cost/weight function, that defines
the penalty for the insertion of gaps (the "-" character) and weights identities
and conservative substitutions accordingly. Multiple alignment algorithms
attempt to create the optimal alignment defined as the one with the lowest
cost/weight score.
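The three conditions in this definition can be checked directly. The sketch below is a validity test only and says nothing about the cost of the alignment; the function name `is_valid_alignment` and the example sequences are hypothetical.

```python
def is_valid_alignment(rows, seqs, gap="-"):
    """Check the formal conditions for a multiple alignment of seqs."""
    # Exactly k rows, one per input sequence
    if len(rows) != len(seqs):
        return False
    # Rectangular array: all rows have the same length
    if len({len(row) for row in rows}) > 1:
        return False
    # Ignoring gaps, row i must be exactly sequence i
    if any(row.replace(gap, "") != seq for row, seq in zip(rows, seqs)):
        return False
    # Every column holds at least one non-gap character
    return all(any(ch != gap for ch in column) for column in zip(*rows))

seqs = ["GATTACA", "GATACA"]
print(is_valid_alignment(["GATTACA", "GAT-ACA"], seqs))    # True
print(is_valid_alignment(["GATTACA-", "GAT-ACA-"], seqs))  # False: all-gap column
```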
Multiplex sequencing
Approach to high-throughput sequencing that uses several pooled DNA
samples run through gels simultaneously and then separated and analyzed.
Mutagen
Any agent that can cause an increase in the rate of mutations in an
organism.
Mutation
An inheritable alteration to the genome that includes genetic (point
or single base) changes, or larger scale alterations such as chromosomal
deletions or rearrangements.
Naked DNA
Pure, isolated DNA devoid of any proteins that may bind to it.
NCEs (New Chemical Entities)
Compounds identified as potential drugs that are sent from research
and development into clinical trials to determine their suitability.
Nested PCR
The second round amplification of an already PCR-amplified sequence
using a new pair of primers which are internal to the original primers.
Typically done when a single PCR reaction generates insufficient amounts
of product.
Neural net
A neural net is an interconnected assembly of simple processing elements,
units or nodes, whose functionality is loosely based on the animal brain.
The processing ability of the network is stored in the inter-unit connection
strengths, or weights, obtained by a process of adaptation to, or learning
from, a set of training patterns. Neural nets are used in bioinformatics
to map data and make predictions, such as taking a multiple alignment of
a protein family as a training set in order to identify novel members of
the family from their sequence data alone.
Nonsense mutation
A point mutation in which a codon specific for an amino-acid is converted
into a "stop" codon.
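The difference between silent, missense and nonsense point mutations can be illustrated by translating a codon before and after the change. The sketch assumes the standard genetic code and DNA codons written with T; the function name `classify_point_mutation` is hypothetical, and it assumes the original codon encodes an amino acid.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

def classify_point_mutation(codon_before, codon_after):
    """Classify a codon change as 'silent', 'missense' or 'nonsense'."""
    before = CODON_TABLE[codon_before.upper()]
    after = CODON_TABLE[codon_after.upper()]
    if before == after:
        return "silent"        # same amino acid encoded
    if after == "*":
        return "nonsense"      # amino acid codon became a stop codon
    return "missense"          # a different amino acid is encoded

print(classify_point_mutation("GAG", "GTG"))  # missense (Glu -> Val)
print(classify_point_mutation("TGG", "TGA"))  # nonsense (Trp -> stop)
print(classify_point_mutation("CTT", "CTC"))  # silent (Leu -> Leu)
```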
Northern blotting
A technique to identify RNA molecules by hybridization that is analogous
to Southern blotting (see Southern blotting).
Nuclease
Any enzyme that can cleave the phosphodiester bonds of nucleic acid
backbones.
Nucleoside
A five-carbon sugar covalently attached to a nitrogen base.
Nucleotide
A nucleic acid unit composed of a five carbon sugar joined to a phosphate
group and a nitrogen base.
Object-Relational Database
Object databases combine the elements of object orientation and object-oriented
programming languages with database capabilities. They provide more than
persistent storage of programming language objects. Object databases extend
the functionality of object programming languages (e.g., C++, Smalltalk,
or Java) to provide full-featured database programming capability. The
result is a high level of congruence between the data model for the application
and the data model of the database. Object-relational databases are
used in Bioinformatics to map molecular biological objects (such as sequences,
structures, maps and pathways) to their underlying representations (typically
within the rows and columns of relational database tables.) This enables
the user to deal with the biological objects in a more intuitive manner,
as they would in the laboratory, without having to worry about the underlying
data model of their representation.
Oligonucleotide
A short molecule consisting of several linked nucleotides (typically
between 10 and 60) covalently attached by phosphodiester bonds.
Open reading frame (ORF)
Any stretch of DNA that potentially encodes a protein. Open reading
frames start with a start codon, and end with a termination codon. No termination
codons may be present internally. The identification of an ORF is the first
indication that a segment of DNA may be part of a functional gene.
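A minimal ORF scan following this definition might look like the sketch below. It assumes ATG as the only start codon, TAA, TAG and TGA as termination codons, and searches the three forward reading frames only (a fuller tool would also scan the reverse complement). The function name `find_orfs` and the `min_codons` parameter are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(dna, min_codons=1):
    """Return (start, end) index pairs of ORFs in the three forward frames.

    start is the index of the A in ATG; end is one past the stop codon.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i                      # first start codon opens the ORF
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None                   # first in-frame stop closes it
    return orfs

dna = "AAATGGCCTAAGG"
for start, end in find_orfs(dna):
    print(start, end, dna[start:end])
```

Because each ORF is closed at the first in-frame termination codon, no reported ORF contains an internal stop.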
Operator
A segment of DNA that interacts with the products of regulatory genes
and controls the transcription of one or more structural genes.
Operon
A unit of transcription consisting of one or more structural genes,
an operator, and a promoter.
Ortholog
Orthologs are genes in different species that evolved from a common
ancestral gene by speciation. Normally, orthologs retain the same function
in the course of evolution. Identification of orthologs is critical for
reliable prediction of gene function in newly sequenced genomes. (See also
Paralogs.)
Overlapping clones
Collection of cloned sequences made by generating randomly overlapping
DNA fragments with infrequently cutting restriction enzymes.
Palindrome
A region of DNA with a symmetrical arrangement of bases occurring about
a single point such that the base sequences on either side of that point
are identical when each strand is read in the same (5’ to 3’) direction, e.g.
5’ GAATTC 3’ whose complementary sequence is 3’ CTTAAG 5’.
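In computational terms, a DNA palindrome in this sense is a sequence that equals its own reverse complement. A minimal sketch, assuming an unambiguous A/C/G/T alphabet (the function names are hypothetical):

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(dna):
    """Return the complementary strand, read 5' to 3'."""
    return dna.upper().translate(COMPLEMENT)[::-1]

def is_palindromic(dna):
    """True if the sequence reads the same 5' to 3' on both strands."""
    return dna.upper() == reverse_complement(dna)

print(is_palindromic("GAATTC"))   # True
print(is_palindromic("GATTACA"))  # False
```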
Paralog
Paralogs are genes related by duplication within a genome. Orthologs
retain the same function in the course of evolution, whereas paralogs evolve
new functions, even if these are related to the original one.
Parameters
Parameters are user-selectable values, typically experimentally determined,
that govern the boundaries of an algorithm or program. For instance, selection
of the appropriate input parameters governs the success of a search algorithm.
Some of the most common search parameters in bioinformatics tools include
the stringency of an alignment search tool, and the weights (penalties)
provided for mismatches and gaps.
Pathways
Bioinformatics strives to define representations of key biological datatypes,
algorithms and inference procedures, including sequences, structures, biological
pathways and reactions. Representing and computing with biological pathways
requires ontologies for representing pathway knowledge; user interfaces
to these databases; physico-chemical properties of enzymes and their substrates
in pathways; and pathway analysis of whole genomes, including identifying
common patterns across species and species differences.
Pattern
Molecular biological patterns usually occur at the level of the characters
making up the gene or protein sequence. A pattern language must be defined
in order to apply different criteria to different positions of a sequence.
In order to have position-specific comparison done by a computer, a pattern-matching
algorithm must allow alternative residues at a given position, repetitions
of a residue, exclusion of alternative residues, weighting, and ideally,
combinatorial representation.
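One widely used pattern language of this kind is the PROSITE syntax, in which `x` means any residue, `[ST]` lists alternatives, `{P}` excludes residues and `(n)` or `(n,m)` gives repetitions. The sketch below converts a subset of that syntax to a Python regular expression; it ignores anchors and weighting, and the function name `prosite_to_regex` is hypothetical.

```python
import re

def prosite_to_regex(pattern):
    """Convert a simple PROSITE-style pattern to a regular expression string."""
    regex = pattern.replace("-", "")                     # element separators
    regex = regex.replace("{", "[^").replace("}", "]")   # {P}  -> [^P]  (exclusion)
    regex = regex.replace("(", "{").replace(")", "}")    # (2,4) -> {2,4} (repetition)
    regex = regex.replace("x", ".")                      # x    -> any residue
    return regex

# N-glycosylation site style pattern: N, not P, S or T, not P
regex = prosite_to_regex("N-{P}-[ST]-{P}")
print(regex)
print(re.search(regex, "AANGSAA") is not None)  # True: NGSA matches
print(re.search(regex, "ANPSA") is not None)    # False: P follows N
```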
Peptide
A short stretch of amino acids each covalently coupled by a peptide
(amide) bond.
Peptide bond (amide bond)
A covalent bond formed between two amino acids when the amino group
of one is linked to the carboxy group of another (resulting in the elimination
of one water molecule).
Phage (Bacteriophage)
A virus that infects bacterial cells and serves as a useful vector for
introducing genes into bacteria for a number of purposes.
Phage display
A technique in which phage are engineered to fuse a foreign peptide
or protein with their capsid (surface) proteins and hence display it on
their surfaces. The immobilized phage may then be used as a screen
to see what ligands bind to the expressed fusion protein exhibited (displayed)
on the phage surface.
Pharmacogenomics
The use of (DNA-based) genotyping in order to target pharmaceutical
agents to specific patient populations. Genetic differences are known to
affect responses to many types of drug therapy, and pharmacogenomics analysis
serves to customize the use of pharmaceuticals for specific subgroups of
patients. The rationale for this approach is that observed gene expression
differences may correlate with, and explain, the differences in side effects
and efficacy to drugs in humans.
Pharmacophore
The three dimensional spatial arrangement of atoms, substituents, functional
groups, or chemical features that together are sufficient to describe the
pharmacologically active components of a drug molecule or molecule series.
Phenotype
Any observable feature of an organism that is the result of one or more
genes.
Phylum
The segmentation of the animal kingdom into about 30 major groups collectively
known as phyla. The members of each phylum share the same basic structure
and organization. For instance, fish, birds, and human beings belong to
one phylum - the Chordata - because all have a notochord at some stage
of development.
Physical map
A physical map consists of a linearly ordered set of DNA fragments encompassing
the genome or region of interest. Physical maps are of two types, macro-restriction
maps and ordered clone maps. The former consists of an ordered set of large
DNA fragments generated by using restriction enzymes whose recognition
sequences are infrequently represented in the genome. An ordered clone
map consists of an overlapping collection of cloned DNA fragments. The
DNA may be cloned into any one of the available vector systems--YACs, cosmids,
phage, or even plasmids. Major advantages of ordered clone
maps are that they are of high resolution and directly provide the
clones for further study.
Plasmid
Any replicating DNA element that can exist in the cell independently
of the chromosomes. Synthetic plasmids are used for DNA cloning. Most commonly
found in bacterial cells.
Pleiotropy
The multiple effects on an organism’s phenotype due to a single gene
or allele, e.g. the cytokines which can bind to multiple cellular receptors
and affect growth and multiple immune pathways.
Point mutation
A mutation in which a single nucleotide in a DNA sequence is substituted
by another nucleotide.
Poly(A) tail
The stretch of Adenine (A) residues at the 3’ end of eukaryotic mRNA
that is added to the pre-mRNA as it is processed, before its transport
from the nucleus to the cytoplasm and subsequent translation at the ribosome.
Polyadenylation site
A site on the 3’-end of messenger RNA (mRNA) that signals the addition
of a series of Adenines during the RNA processing step and before the mRNA
migrates to the cytoplasm. These so-called poly(A) "tails" increase
mRNA stability and allow one to isolate mRNA from cells by PCR-amplification
using poly(T) primers.
Polygenic inheritance
Inheritance involving alleles at many genetic loci.
Polymerase chain reaction (PCR)
Technique used to amplify or generate large amounts of replica DNA of
a segment of any DNA whose "flanking" sequences are known. Oligonucleotide
primers which bind these flanking sequences are used by an enzyme (Taq
polymerase) to copy the sequence in between the primers. Cycles of heat
to break apart the DNA strands, cooling to allow the primers to bind, and
heating again to allow the enzyme to copy the intervening sequence lead
to a doubling of DNA at each cycle. The reactions are typically carried
out on a regulated heating block and consist of 30-35 cycles of repeated
amplification of all the DNA present. Single molecules of "target" DNA
can be amplified to microgram amounts of DNA. The target DNA can be of
any origin.
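The doubling at each cycle means that the number of copies grows exponentially with the number of cycles. The arithmetic can be sketched as follows; the function name `pcr_copies` and the `efficiency` parameter are hypothetical, and real reactions plateau well before the ideal figure.

```python
def pcr_copies(initial_copies, cycles, efficiency=1.0):
    """Copies after the given cycles; efficiency=1.0 means perfect doubling."""
    return initial_copies * (1 + efficiency) ** cycles

print(pcr_copies(1, 30))       # ideal doubling from one molecule: 2**30 copies
print(pcr_copies(1, 30, 0.9))  # assumed 90% efficiency per cycle, for comparison
```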
Polymorphism
(lit. many forms) The existence of a gene in a population in at least
two different forms at a frequency far higher than that attributable to
recurrent mutation alone. Variations in a population may be measured by
determining the rate of mutation in polymorphic genes (see SNPs).
Polypeptide
A single chain of covalently attached amino acids joined by peptide
bonds. Polypeptide chains usually fold into a compact, stable form (a domain)
that is part (or all) of the final protein.
Positional cloning
Method used to define the location of a gene on a chromosome and use
this information to identify and clone the gene. The location of the gene
is determined by linkage analysis of DNA from a large family containing
afflicted and normal members to identify linkages between the transmission
of the disease gene and observable genetic markers. This information is
then used to screen (by chromosomal jumping and walking) the location for
putative genes. The disease gene must be compared between the afflicted
and normal family members and be shown to be different in the two groups.
The full sequencing of the gene will then provide information regarding
the characteristics and function of the gene product, and a potential explanation
for the cause of the disease.
Post-transcriptional modification
Alterations made to pre-mRNA before it leaves the nucleus and becomes
mature mRNA.
Post-translational modification
Alterations made to a protein after its synthesis at the ribosome. These
modifications, such as the addition of carbohydrate or fatty acid chains,
may be critical to the function of the protein.
Primary sequence (protein)
The linear sequence of a polypeptide or protein.
Primary structure (protein)
see primary sequence.
Primer
A short oligonucleotide that provides a free 3’ hydroxyl for DNA or
RNA synthesis by the appropriate polymerase (DNA polymerase or RNA polymerase).
Probe
Any biochemical that is labelled or tagged in some way so that it can
be used to identify or isolate a gene, RNA, or protein.
Profile
Sequence profiles are usually derived from multiple alignments of sequences
with a known relationship, and consist of tables of position-specific scores
and gap-penalties. Each position in the profile contains scores for all
of the possible amino acids, as well as one penalty score for opening and
one for continuing a gap at the specified position. Attempts have been
made to further improve the sensitivity of the profile by refining the
procedures to construct a profile starting from a given multiple alignment.
Other representations for sequence domains or motifs do not necessarily
require the presence of a correct and complete multiple alignment, such
as hidden Markov models.
Prokaryote
An organism or cell that lacks a membrane-bounded nucleus. Bacteria
(including the cyanobacteria, formerly called blue-green algae) and archaea
are the surviving prokaryotes (cf. Eukaryote).
Promoter (site)
A promoter site is defined by its recognition by eukaryotic RNA polymerase
II; its activity in a higher eukaryote; by experimental evidence, or homology
and sufficient similarity to an experimentally defined promoter; and by
observed biological function.
Protein families
Sets of proteins that share a common evolutionary origin reflected by
their relatedness in function which is usually reflected by similarities
in sequence, or in primary, secondary or tertiary structure. Subsets of
proteins with related structure and function.
Proteome
The entire protein complement of a given organism.
Proteomics
The study of the proteome. Typically, the cataloging of all the expressed
proteins in a particular cell or tissue type, obtained by identifying the
proteins from cell extracts using a combination of 2D gel electrophoresis
and mass spectrometry. The large scale analysis of the protein composition
and function. (cf genomics)
Purine
A nitrogen-containing compound with a double-ring structure. The parent
compound of Adenine and Guanine.
Pyrimidine
A nitrogen-containing compound with a single six-membered ring structure.
The parent compound of Thymine, Cytosine and Uracil.
Query (sequence)
A DNA, RNA or protein sequence used to search a sequence database in
order to identify close or remote family members (homologs) of known function,
or sequences with similar active sites or regions (analogs), from which
the function of the query may be deduced.
Rational drug design (Structure based drug design)
The development of drugs based on the 3-dimensional molecular structure
of a particular target.
Reading frame
A sequence of codons beginning with an initiation codon and ending with
a termination codon, typically of at least 150 bases (50 amino acids) coding
for a polypeptide or protein chain (see ORF and URF).
Reagents
Sources of biological or chemical material that can be used as the starting
blocks in laboratory experiments. Reagents can range from chemicals needed
to perform a particular chemical reaction, constituents of a laboratory
protocol, or clones to be used in a large-scale gene expression study.
Recessive
Any trait that is expressed phenotypically only when the allele responsible
is present in both copies of a gene (cf. dominant).
Recombinant DNA (rDNA)
DNA molecules resulting from the fusion of DNA from different sources.
The technology employed for splicing DNA from different sources and for
amplifying the resultant heterogenous DNA.
Recombination
A new combination of alleles resulting from the rearrangement occurring
by crossing-over or by independent assortment (see crossing over).
Recursion
An algorithmic procedure whereby an algorithm calls on itself to perform
a calculation until a terminating condition is met (for example, the result
exceeds a threshold), at which point the algorithm exits. Recursion is a
powerful and concise way to process data, although it is not always the
most computationally efficient.
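As an illustration, the length of the longest common subsequence of two sequences can be computed recursively. In this sketch the function calls itself on shorter suffixes until one sequence is exhausted (the terminating condition); results are cached with `functools.lru_cache` so that repeated subproblems are not recomputed. The function name `lcs_length` is hypothetical, and very long inputs would hit Python's recursion limit.

```python
from functools import lru_cache

def lcs_length(a, b):
    """Length of the longest common subsequence of strings a and b."""

    @lru_cache(maxsize=None)
    def rec(i, j):
        # Terminating condition: one of the sequences is exhausted
        if i == len(a) or j == len(b):
            return 0
        if a[i] == b[j]:
            return 1 + rec(i + 1, j + 1)           # extend the match
        return max(rec(i + 1, j), rec(i, j + 1))   # skip a symbol in a or in b

    return rec(0, 0)

print(lcs_length("AGGTAB", "GXTXAYB"))  # 4 (the subsequence GTAB)
```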
Regulatory gene
A DNA sequence that functions to control the expression of other genes
by producing a protein that modulates the synthesis of their products (typically
by binding to the gene promoter). (cf. Structural gene).
Relational Database
A database that follows E. F. Codd’s 12 rules, a series of mathematical
and logical steps for the organization and systemization of data into a
software system that allows easy retrieval, updating, and expansion. An
RDBMS stores data in a database consisting of one or more tables of rows
and columns. The rows correspond to a record (tuple); the columns correspond
to attributes (fields) in the record. In an RDBMS, a view, defined as a
subset of the database that is the result of the evaluation of a query,
is a table. RDBMSs use Structured Query Language (SQL) for data definition,
data management, and data access and retrieval. Relational and object-relational
databases are used extensively in bioinformatics to store sequence and
other biological data.
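The ideas of tables, rows, columns, views and SQL can be seen in a small example using Python's built-in `sqlite3` module. The table name, column names and accession values below are hypothetical and serve only as an illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")   # throwaway in-memory database

# A table: each row is a record, each column an attribute
conn.execute(
    "CREATE TABLE sequence ("
    "accession TEXT PRIMARY KEY, organism TEXT, length INTEGER)"
)
conn.executemany(
    "INSERT INTO sequence VALUES (?, ?, ?)",
    [
        ("ACC001", "Homo sapiens", 1200),   # made-up example records
        ("ACC002", "Mus musculus", 950),
        ("ACC003", "Homo sapiens", 300),
    ],
)

# A view: the result of a query, itself usable as a table
conn.execute(
    "CREATE VIEW human_sequence AS "
    "SELECT accession, length FROM sequence WHERE organism = 'Homo sapiens'"
)

rows = conn.execute(
    "SELECT accession, length FROM human_sequence ORDER BY length"
).fetchall()
print(rows)
conn.close()
```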
Relational Database Management Systems (RDBMS)
A software system that includes a database architecture, query language,
and data loading and updating tools and other ancillary software that together
allow the creation of a relational database application.
Repeats (repeat sequences)
Repeat sequences and approximate repeats occur throughout the DNA of
higher organisms (mammals). For example, the Alu sequences of length
about 300 characters, appear hundreds of thousands of times in Human DNA
with about 87% homology to a consensus Alu string. Some short substrings
such as TATA-boxes, poly-A and (TG)* also appear more often than by chance.
Repeat sequences may also occur within genes, as mutations or alterations
to those genes. Repetitive sequences, especially mobile elements, have
many applications in genetic research. DNA transposons and retroposons
are routinely used for insertional mutagenesis, gene mapping, gene tagging,
and gene transfer in several model systems.
Repetitive elements
Repetitive elements provide important clues about chromosome dynamics,
evolutionary forces, and mechanisms for exchange of genetic information
between organisms. The most ubiquitous class of repetitive elements in the
DNA sequence in primate genomes is the Alu family of interspersed
repeats, which have arisen in the last 65 million years of evolution. Alu
repeats belong to a class of sequences defined as short interspersed elements
(SINEs). Approximately 500,000 Alu SINEs exist within the human
genome, representing about 5% of the genome by mass.
Replication
The synthesis of an informationally identical macromolecule (e.g. DNA)
from a template molecule.
Repressor
The protein product of a regulatory gene that combines with a specific
operator (regulatory DNA sequence) and hence blocks the transcription of
genes in an operon.
Restriction enzyme (restriction endonuclease)
A type of enzyme that recognizes specific DNA sequences (usually palindromic
sequences 4, 6, 8 or 16 base pairs in length) and produces cuts on both
strands of DNA containing those sequences only. The "molecular scissors"
of rDNA technology.
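A digestion can be simulated on a sequence string by locating every occurrence of the recognition sequence and cutting at a fixed offset within it. The sketch assumes a linear molecule and follows the top strand only; the function name `digest` and the test sequence are hypothetical, and the offset of 1 reflects EcoRI cutting after the G of GAATTC.

```python
def digest(dna, site, cut_offset):
    """Cut a linear DNA string at every occurrence of site.

    cut_offset is the number of bases of the site left on the upstream fragment.
    """
    cuts = []
    pos = dna.find(site)
    while pos != -1:
        cuts.append(pos + cut_offset)
        pos = dna.find(site, pos + 1)       # continue searching after this hit
    bounds = [0] + cuts + [len(dna)]
    return [dna[a:b] for a, b in zip(bounds, bounds[1:])]

fragments = digest("AAGAATTCTTGAATTCGG", "GAATTC", 1)
print(fragments)
print([len(f) for f in fragments])
```

The list of fragment lengths is the kind of information compared between individuals in an RFLP analysis.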
Restriction fragment length polymorphisms (RFLPs)
Variation within the DNA sequences of organisms of a given species that
can be identified by fragmenting the sequences using restriction enzymes,
since the variation lies within the restriction site. RFLPs can be used
to measure the diversity of a gene in a population.
Restriction map
A physical map or depiction of a gene (or genome) derived by ordering
overlapping restriction fragments produced by digestion of the DNA with
a number of restriction enzymes.
Reverse Genetics
The use of protein information to elucidate the genetic sequence encoding
that protein. Used to describe the process of gene isolation starting with
a panel of afflicted patients (see positional cloning).
Reverse transcriptase
A DNA polymerase that can synthesise a complementary DNA (cDNA) strand
using RNA as a template - a so-called RNA-dependent DNA polymerase.
Reverse transcriptase-PCR (RT-PCR)
Procedure in which PCR amplification is carried out on DNA that is first
generated by the conversion of mRNA to cDNA using reverse transcriptase.
Ribonucleic acid (RNA)
A category of nucleic acids in which the component sugar is ribose and
consisting of the four nucleotides Cytosine, Uracil, Guanine, and Adenine.
The three types of RNA are messenger RNA (mRNA), transfer RNA (tRNA) and
ribosomal RNA (rRNA).
Secondary structure (protein)
The organization of the peptide backbone of a protein that occurs as
a result of hydrogen bonds, e.g. alpha helix, beta-pleated sheet.
Selectivity
Selectivity of bioinformatics similarity search algorithms is defined
as the significance threshold for reporting database sequence matches.
As an example, for BLAST searches, the parameter E is interpreted as the
upper bound on the expected frequency of chance occurrence of a match within
the context of the entire database search. E may be thought of as
the number of matches one expects to observe by chance alone during the
database search.
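For ungapped local alignments, the expected number of chance matches is usually modelled with the Karlin-Altschul formula E = K * m * n * exp(-lambda * S), where S is the raw score, m the query length and n the database length. The sketch below simply evaluates that formula; the function name `expected_hits` is hypothetical, and the values of K and lambda in the example are illustrative only, since the real values depend on the scoring system.

```python
import math

def expected_hits(score, query_len, db_len, K, lam):
    """Expected number of chance matches with at least the given raw score."""
    return K * query_len * db_len * math.exp(-lam * score)

# Illustrative parameter values only (not taken from any particular matrix)
K, lam = 0.13, 0.32
for score in (40, 60, 80):
    e_value = expected_hits(score, query_len=300, db_len=10_000_000, K=K, lam=lam)
    print(score, e_value)
```

A match is then reported only if its E falls below the chosen significance threshold.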
Sense strand
The strand of double-stranded DNA that has the same sequence as the RNA
transcript (with T in place of U); the complementary, antisense strand acts
as the template for RNA synthesis. Typically only one gene product is produced
per gene, corresponding to the sense strand only. (Some viruses have open
reading frames in both the sense and the antisense strands).
Sensitivity
Sensitivity of bioinformatics similarity search algorithms centers around
two areas: first, how well the method can detect biologically meaningful
relationships between two related sequences in the presence of mutations
and sequencing errors; second, how the heuristic nature of the algorithm
affects the probability that a matching sequence will not be detected. At
the user's discretion, the speed of most similarity search programs can
be sacrificed in exchange for greater sensitivity - with an emphasis on
detecting lower scoring matches.
Sequence Tagged Site (STS)
A unique sequence from a known chromosomal location that can be amplified
by PCR. STSs act as physical markers for genomic mapping and cloning.
Sexual PCR (Molecular Diversity)
Sexual PCR is a form of PCR in which similar, but not identical, DNA
sequences are reassembled to obtain novel juxtapositions, simulating the
result of genetic recombination. The result is the creation of an array
of related genes which may possess improved characteristics. By repeated
rounds of recombination, selection and PCR-based amplification vastly improved
gene-products, such as enzymes with greater activity, may be generated
and selected.
Shotgun cloning
The cloning of an entire gene segment or genome by generating a random
set of fragments using restriction endonucleases to create a gene library
that can be subsequently mapped and sequenced to reconstruct the entire
genome.
Similarity (homology) search
Given a newly sequenced gene, there are two main approaches to the prediction
of structure and function from the amino acid sequence. Homology methods
are the most powerful and are based on the detection of significant extended
sequence similarity to a protein of known structure, or of a sequence pattern
characteristic of a protein family. Statistical methods are less successful
but more general and are based on the derivation of structural preference
values for single residues, pairs of residues, short oligopeptides or short
sequence patterns. The transfer of structure/function information to a
potentially homologous protein is straightforward when the sequence similarity
is high and extended in length, but the assessment of the structural significance
of sequence similarity can be difficult when sequence similarity is weak
or restricted to a short region.
Signal sequence (leader sequence)
A short sequence added to the amino-terminal end of a polypeptide chain
that forms an amphipathic helix allowing the nascent polypeptide to migrate
through membranes such as the endoplasmic reticulum or the cell membrane.
It is cleaved from the polypeptide after the protein has crossed the membrane.
Single nucleotide polymorphisms (SNPs)
Variations of single base pairs scattered throughout the human genome
that serve as measures of the genetic diversity in humans. About 1 million
SNPs are estimated to be present in the human genome, and SNPs are useful
markers for gene mapping studies.
Single-pass sequencing
Rapid sequencing of large segments of the genome of an organism by isolating
as many expressed (cDNA) sequences as possible and performing single sequencer
runs on their 5’ or 3’ ends. Single-pass sequencing typically results in
individual, error-prone sequencing reads of 400-700 bases, depending on
the type of sequencer used. However, if many of these are generated from
numerous clones from different tissues, they may be overlapped and assembled
to remove the errors and generate a contiguous sequence for the entire
expressed gene.
Site
Sites in sequences can be located either in DNA (e.g. binding sites,
cleavage sites) or in proteins. In order to identify a site in DNA, ambiguity
symbols are used to allow several different symbols at one position. Proteins,
however, need a different mechanism (see Pattern). Restriction enzyme cleavage
sites, for instance, have the following properties: limited length
(typically, less than 20 base pairs); definition of the cleavage site and
its appearance (3', 5' overhang or blunt); definition of the binding site.
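As an illustration of how ambiguity symbols let one site description cover
several concrete sequences, the following minimal sketch expands the standard
IUPAC nucleotide codes into a regular expression and scans a DNA string. The
function names are hypothetical, and the GTYRAC site (the HincII recognition
sequence) is used only as an example.

```python
import re

# Standard IUPAC nucleotide ambiguity codes mapped to regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def site_to_pattern(site):
    """Translate an IUPAC-coded site (hypothetical helper) into a regex string."""
    return "".join(IUPAC[symbol] for symbol in site.upper())


def find_sites(site, sequence):
    """Return 0-based start positions of all matches, including overlapping ones."""
    # A lookahead is used so that overlapping occurrences are not skipped.
    pattern = re.compile("(?=(" + site_to_pattern(site) + "))")
    return [match.start() for match in pattern.finditer(sequence.upper())]


# Y matches C or T, R matches A or G, so both GTCGAC and GTTAAC are found.
print(find_sites("GTYRAC", "AAGTCGACTTGTTAACGG"))  # prints [2, 10]
```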
Southern blotting
A procedure for the identification of DNA by transferring a fragment
separated on an agarose gel to a nitrocellulose filter, where it can be hybridized
with a complementary "probe" sequence.
Splice form
By using alternative splicing, a single message precursor from DNA can
generate an entire family of mRNAs and proteins. This can be utilized to
create specificity in cell-cell or cell-ligand interactions. A cell may
produce a given protein, but it will be a different splice-form of the
protein than that produced by an adjacent cell. In this manner, the two
cells have the potential to interact differently with other cells or molecules.
Two places where this has been extremely important are the production
of cell-surface specificity proteins in the immune and nervous systems.
Splice site
The sequence found at the 5’ and 3’ region of exon/intron boundaries,
usually defined by a consensus sequence:
             Intron
5’ CAGGTAAGT---------TNCAGG 3’
   A    G            C T
N represents any nucleotide; the bottom line represents alternative
nucleotides at the indicated positions.
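A consensus of this kind can be turned into a simple pattern search. The
sketch below assumes the alternative nucleotides read as (C/A)AG|GT(A/G)AGT
at the 5’ boundary and (T/C)N(C/T)AG|G at the 3’ boundary, where | marks the
exon/intron junction; the function name is hypothetical. Real splice sites
often deviate from the consensus, so exact matching of this kind will miss
many genuine sites and report some false ones.

```python
import re

# Assumed reading of the consensus; lookaheads allow overlapping matches.
DONOR = re.compile(r"(?=([CA]AGGT[AG]AGT))")      # 5' boundary: exon | GT...
ACCEPTOR = re.compile(r"(?=([TC][ACGT][CT]AGG))")  # 3' boundary: ...AG | exon


def candidate_splice_sites(dna):
    """Return (donors, acceptors) as 0-based indices.

    donors:    index of the first intron base (the G of GT)
    acceptors: index of the first exon base after the intron's AG
    """
    dna = dna.upper()
    donors = [m.start() + 3 for m in DONOR.finditer(dna)]
    acceptors = [m.start() + 5 for m in ACCEPTOR.finditer(dna)]
    return donors, acceptors


sequence = "GACAGGTAAGTCCCCCCCCCTTCAGGAA"
donors, acceptors = candidate_splice_sites(sequence)
# Slicing from a donor to an acceptor gives a candidate intron (GT...AG).
print(sequence[donors[0]:acceptors[0]])
```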
Splicing
The joining together of separate DNA or RNA component parts. For example,
RNA splicing in eukaryotes involves the removal of introns and the stitching
together of the exons from the pre-mRNA transcript before maturation.
Solvent accessibility
The surface area (typically measured in square angstroms) of a biological
molecule, usually a protein, that is exposed to solvent in its native,
folded form. Determining the solvent accessibility of a protein helps define
which amino acids in its molecular sequence are on the exterior of the
molecule, and thus available to participate in interactions with other
molecules.
Structural gene
Gene which encodes a structural protein (cf. Regulatory gene).
Structure prediction
Algorithms that predict the secondary, tertiary and sometimes even quaternary
structure of proteins from their sequences. Determining protein structure
from sequence has been dubbed "the second half of the Genetic Code" since
it is the folded tertiary structure of a protein that governs how it functions
as a gene product. As yet most structure prediction methods are only
partially successful, and typically work best for certain well-defined
classes of proteins.
Substitution matrix
A table of scores for replacing one amino acid with another in an alignment,
derived from a model of protein evolution at the sequence level. Widely used
substitution matrices include the Dayhoff, MDM (Mutation Data Matrix) or PAM
(Percent Accepted Mutation) matrices and the BLOSUM matrices. PAM matrices
are derived from global alignments of closely related sequences, and matrices
for greater evolutionary distances are extrapolated from those for lesser
ones. BLOSUM matrices are instead derived directly from conserved blocks
of aligned sequences at different levels of similarity.
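The extrapolation to greater evolutionary distances can be sketched as
repeated matrix multiplication: a mutation probability matrix for one unit
of distance is raised to the nth power to model n units. The example below
uses a made-up two-letter alphabet and made-up probabilities purely for
illustration; real matrices cover 20 amino acids and are estimated from
alignments. The log-odds conversion shown is one common convention, not
necessarily the one used for any particular published matrix.

```python
import math


def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def extrapolate(unit_matrix, distance):
    """Raise the unit-distance mutation probability matrix to `distance`."""
    n = len(unit_matrix)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(distance):
        result = mat_mul(result, unit_matrix)
    return result


def log_odds(prob_matrix, background):
    """Score = 10 * log10(P(i -> j) / background frequency of j)."""
    return [[10 * math.log10(p / background[j]) for j, p in enumerate(row)]
            for row in prob_matrix]


# Toy unit matrix (an assumption): each residue has a 1% chance of changing.
TOY_UNIT = [[0.99, 0.01],
            [0.01, 0.99]]
BACKGROUND = [0.5, 0.5]

scores_near = log_odds(extrapolate(TOY_UNIT, 10), BACKGROUND)
scores_far = log_odds(extrapolate(TOY_UNIT, 250), BACKGROUND)
# At short distances identities score well above substitutions; at large
# distances the scores flatten towards zero.
print(scores_near[0], scores_far[0])
```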
Subtraction library
A cDNA library that only contains cDNAs uniquely expressed in a given
cell or tissue. For example, T cells and B cells will express many common RNAs,
as well as a very small percentage which will be unique for T cells and
B cells respectively. To make a T cell subtraction library, the cDNA from
a T cell library is hybridized with a vast excess of B cell RNA. The commonly
expressed genes will result in RNA-cDNA hybrids which can be removed (or
subtracted) to leave only T cell specific cDNAs.
Trace
A series of coloured peaks from which the individual bases of
a sequence are derived. The original format is produced by the ABI Analysis
software; the Squirrel program converts it to SCF for use by xgap.
Tentative Consensus (TC)
The identification of a sequence from an EST cluster that represents
part or all of a complete gene. TCs are usually determined by clustering
ESTs allowing for sequencing errors, artefacts such as chimeric clones,
and naturally occurring biological phenomena such as alternative splicing.
Creation of a cluster allows one to generate a consensus sequence and then
identify a long open reading frame which would suggest the possibility
of that consensus representing a bona fide gene.
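The consensus step can be sketched as a per-column majority vote. This
minimal example assumes the ESTs in a cluster have already been aligned and
padded with "-" where a read has no coverage; the clustering and alignment
themselves, and the later search for an open reading frame, are not shown.
The function name is hypothetical.

```python
from collections import Counter


def consensus(aligned_reads):
    """Majority-vote consensus over pre-aligned, gap-padded reads."""
    length = max(len(read) for read in aligned_reads)
    result = []
    for column in range(length):
        # Collect the bases that actually cover this column.
        bases = [read[column] for read in aligned_reads
                 if column < len(read) and read[column] != "-"]
        if bases:
            result.append(Counter(bases).most_common(1)[0][0])
        else:
            result.append("N")  # no read covers this position
    return "".join(result)


reads = [
    "ATGGCCTA----",
    "--GGCGTAAGCT",  # contains one sequencing error (G instead of C)
    "----CCTAAGCT",
]
print(consensus(reads))  # prints ATGGCCTAAGCT
```

The single error in the second read is outvoted by the other two reads that
cover the same column, which is how overlapping error-prone reads can yield
a cleaner consensus.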
Tentative Human Consensus sequences (THCs)
A consensus sequence generated from human EST fragments. THCs may be
validated by comparison against databases of known human gene sequences,
human genomic sequences, or by identification of the ORFs or other sequence
features contained within the consensus as belonging to a known human gene
product.
Tertiary structure
Folding of a protein chain via interactions of its side chains,
including formation of disulfide bonds between cysteine residues.
Thymine
A pyrimidine base found in DNA but not in RNA.
Tissue
Section of an organ that consists of a largely homogeneous population
of cell types. Since many organs are multifunctional, they have developed
highly specialized cell types to perform different functions. Identifying
the section of an organ that is homogeneous for a particular cell type ensures
that the gene expression profiles extracted from those cells will accurately
represent the class of cells that make up the tissue.
Transcript
The single-stranded mRNA chain that is assembled from a gene template.
Transcription
The assembly of complementary single-stranded RNA on a DNA template.
Transcription factors
A group of regulatory proteins that are required for transcription in
eukaryotes. Transcription factors bind to the promoter region of a gene
and facilitate transcription by RNA polymerase.
Transfer RNA (tRNA)
A small RNA molecule that recognizes a specific amino acid, transports
it to a specific codon in the mRNA, and positions it properly in the nascent
polypeptide chain.
Transformation
A genetic alteration to a cell as a result of the incorporation of DNA
from a genetically different cell or virus; can also refer to the introduction
of DNA into bacterial cells for genetic manipulation.
Transgene
A foreign gene that is introduced into a cell or whole organism (e.g. transgenic
mice) for therapeutic or experimental purposes.
Translation
The process of converting RNA to protein by the assembly of a polypeptide
chain from an mRNA molecule at the ribosome.
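Transcription and translation can be mimicked on strings. The sketch below
builds the mRNA complementary to a DNA template strand and then reads it
codon by codon using the standard genetic code, starting at the first AUG
and stopping at the first stop codon. The function names are hypothetical,
and the model ignores everything a real cell does beyond base pairing and
codon lookup (promoters, splicing, reading-frame choice and so on).

```python
from itertools import product

# Standard genetic code, with codons enumerated in UCAG order.
BASES = "UCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

# DNA template base -> complementary RNA base.
COMPLEMENT = {"A": "U", "T": "A", "G": "C", "C": "G"}


def transcribe(template_5_to_3):
    """Return the mRNA (5'->3') complementary to a DNA template strand (5'->3')."""
    return "".join(COMPLEMENT[base] for base in reversed(template_5_to_3.upper()))


def translate(mrna):
    """Translate from the first AUG to the first stop codon (one-letter codes)."""
    start = mrna.find("AUG")
    if start == -1:
        return ""
    protein = []
    for i in range(start, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":  # stop codon
            break
        protein.append(amino_acid)
    return "".join(protein)


mrna = transcribe("TTAGGCCAT")
print(mrna, translate(mrna))  # prints AUGGCCUAA MA
```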
Transmembrane region
The region of a transmembrane protein that actually spans the membrane.
Transmembrane regions are usually hydrophobic in order to be thermodynamically
compatible with the lipid bilayer portion of the membrane. They may
consist of either alpha-helical or beta-strand secondary structure elements,
but in either case the external residues (the ones facing the membrane)
are invariably hydrophobic while the internal residues may be hydrophilic
(as in the case of a pore or channel) or polar. One common transmembrane
structural domain is the seven-helix bundle seen in numerous membrane receptors,
such as G protein-coupled receptors.
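Because transmembrane regions are usually hydrophobic, a common first-pass
way to look for them is a sliding-window hydropathy average. The sketch
below uses the Kyte-Doolittle hydropathy scale with a 19-residue window and
a threshold of 1.6; the scale, window and threshold are assumptions brought
in for illustration, and the function names are hypothetical. It is a crude
screen for alpha-helical segments, not a reliable predictor.

```python
# Kyte-Doolittle hydropathy values (positive = hydrophobic).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(protein, window=19):
    """Mean hydropathy of every window of `window` residues."""
    protein = protein.upper()
    return [sum(KYTE_DOOLITTLE[aa] for aa in protein[i:i + window]) / window
            for i in range(len(protein) - window + 1)]


def candidate_tm_windows(protein, window=19, threshold=1.6):
    """0-based start positions of windows whose mean hydropathy >= threshold."""
    return [i for i, score in enumerate(hydropathy_profile(protein, window))
            if score >= threshold]


# Artificial test protein: a 19-leucine stretch flanked by charged residues.
toy_protein = "D" * 15 + "L" * 19 + "K" * 15
print(candidate_tm_windows(toy_protein))
```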
Unidentified reading frame (URF)
An open reading frame encoding a protein of undefined function.
Uracil
Nitrogenous pyrimidine base found in RNA but not DNA.
Variable numbers of tandem repeats (VNTRs)
DNA sequence blocks of 2-60 base pairs which are repeated from two to
more than 20 times in different individuals. This polymorphism makes VNTRs
very useful DNA markers used in genomic mapping, linkage analysis and also
DNA fingerprinting.
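The polymorphism lies in the number of copies of the repeat unit, so two
individuals can be compared by counting tandem copies at the same locus.
A minimal sketch, assuming the repeat unit is already known; the unit CAGT
and both sequences are made up, and the function name is hypothetical:

```python
import re


def count_tandem_repeats(sequence, unit):
    """Length, in copies, of the longest uninterrupted run of `unit`."""
    runs = re.findall("(?:" + re.escape(unit) + ")+", sequence.upper())
    return max((len(run) // len(unit) for run in runs), default=0)


# Two made-up alleles of the same locus, differing only in copy number.
allele_1 = "GG" + "CAGT" * 5 + "TT"
allele_2 = "GG" + "CAGT" * 12 + "TT"
print(count_tandem_repeats(allele_1, "CAGT"),
      count_tandem_repeats(allele_2, "CAGT"))  # prints 5 12
```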
Variation (genetic)
The detection of DNA sequence variants genome-wide allows studies relating
the distribution of sequence variation to population history. This in turn
allows one to determine the density of SNPs or other markers needed for gene
mapping studies. Quantitation of these variations, together with analytical
tools for studying sequence variation, also allows genetic variation to be
related to phenotype.
Vector
Any agent that transfers material (typically DNA) from one host to another.
Typically, DNA vectors are either autonomous DNA elements (such as plasmids)
that can be manipulated and integrated into a host’s DNA, or recombinant viruses.
Virtual libraries
The creation and storage of vast collections of molecular structures
in an electronic database. These databases may be queried for subsets that
exhibit specific physicochemical features, or may be "virtually screened"
for their ability to bind a drug target. This process may be performed
prior to the synthesis and testing of the molecules themselves.
Visualization
Visualization is the process of representing abstract scientific data
as images that can aid in understanding the meaning of the data.
Weight matrix
The frequency with which each element (base or residue) occurs at each
position of a set of known examples of a pattern, such as binding sites, can
be divided by its background frequency to give a ratio for that element and
position. The individual ratios for all elements and positions are then
combined into a scoring profile known as a weight matrix. This profile can
be used to score candidate sequences for the presence of the pattern, and to
test how well an algorithm discriminates true instances from non-pattern sequences.
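One common way to build and use such a profile is a log-odds position weight
matrix. The sketch below is a minimal version under stated assumptions: the
example sites are already aligned and of equal length, the background is
uniform (0.25 per base), a pseudocount of 1 avoids taking the log of zero,
and the four sites are made up. All function names are hypothetical.

```python
import math

ALPHABET = "ACGT"


def build_weight_matrix(sites, background=None, pseudocount=1.0):
    """Per-position log2(observed frequency / background frequency)."""
    if background is None:
        background = {base: 0.25 for base in ALPHABET}  # assumed uniform
    total = len(sites) + pseudocount * len(ALPHABET)
    matrix = []
    for position in range(len(sites[0])):
        column = [site[position] for site in sites]
        matrix.append({
            base: math.log2(((column.count(base) + pseudocount) / total)
                            / background[base])
            for base in ALPHABET
        })
    return matrix


def score(matrix, candidate):
    """Sum of the per-position weights for a candidate of matrix width."""
    return sum(column[base] for column, base in zip(matrix, candidate))


def best_hit(matrix, sequence):
    """Return (score, start) of the highest-scoring window in `sequence`."""
    width = len(matrix)
    return max((score(matrix, sequence[i:i + width]), i)
               for i in range(len(sequence) - width + 1))


known_sites = ["TATAAT", "TATAAT", "TATGAT", "TACAAT"]  # made-up examples
matrix = build_weight_matrix(known_sites)
print(best_hit(matrix, "GGCGTATAATGCGC"))
```

Windows that resemble the known sites receive positive scores and unrelated
windows receive negative ones, which is the basis for discriminating pattern
from non-pattern sequences.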
Western blot
Technique in which specific antibodies are used to identify their antigens
from a mixture of proteins. Typically, these protein mixtures are first
separated by electrophoresis and then transferred onto nylon sheets by electrotransfer.
Radiolabeled or enzyme-linked antibodies are incubated with the sheets
and unbound antibodies are washed away, allowing the position of the bound
antibody to be revealed by autoradiography or by the color formed upon
addition of a substrate.
Wild type
Form of a gene or allele that is considered the "standard" or most common.
X chromosome
In mammals, the sex chromosome that is found in two copies in the homogametic
sex (female in humans) and one copy in the heterogametic sex (male in humans).
Yeast 2-hybrid system
A yeast-based method used to simultaneously identify, and clone the
gene for, proteins interacting with a known protein. The basis of this
method is a "transcriptional reporter assay" (see definition) in which
reporter gene expression is dependent on two domains. The first domain
is linked to the known protein. The second domain is genetically linked
to a library. If the library is screened against the known protein the
two domains will interact only if a protein from the library binds the
known protein, resulting in transcription activation of the reporter gene,
and a blue color. The "blue yeast clone" will contain the gene encoding
the newly identified protein.
Z-DNA
A conformation of DNA existing as a left-handed double helix (the phosphate-sugar
backbone forms a left-handed zig-zag course), which may play a role in
gene regulation.
Zinc fingers
A protein motif formed by the interaction of repeated cysteine and histidine
residues with a zinc ion. The spacing of the repeats results in finger-like
arrangements of the protein loops, which interact with DNA. These motifs
are typically found in transcription factors.