Introduction to Bioinformatics
Homework 2
due March 1st in class
This text is also available on the course web page.
In this homework you will have to go through a few of the common
sequence analysis steps that start with raw, experimentally obtained
DNA sequences and lead to biological interpretations. The sequences
that you are given are from an actual experiment and you will try to
analyze them in much the same way a researcher working on this project
might do.
Here are a few details of the experiment. The goal was to identify
proteins responsible for the development of a group of cells called
the primary mesenchyme cells (PMCs) that are formed during sea urchin
embryogenesis. PMCs have a migratory phase during which they align
themselves in a characteristic ring pattern. Their purpose is to
synthesize the larval skeleton, a stereotypically branched structure
made primarily of calcium carbonate. PMCs begin to synthesize the
skeleton as soon as they have formed the ring pattern. Reasonable
candidate proteins are therefore ones involved in migration, in
interactions with the extracellular matrix on which the cells migrate,
or in synthesis of the skeleton. To identify such candidates, mRNA was
purified from a crude PMC preparation (contaminated with other cell
types such as ectoderm and endoderm) from embryos of the sea urchin
species Strongylocentrotus purpuratus. The mRNA was
reverse-transcribed into cDNA from which an arrayed library was
prepared. 1000 cDNA clones were randomly chosen for sequencing. For
each clone two sequencing reactions were performed, one from each
end. Such an experiment is called EST (for Expressed Sequence Tag)
analysis; it provides a snapshot of the set of mRNAs expressed by
that particular tissue.
20 of these raw sequences are given to you to work with. You can find
them by following the link on the course web page. The sequences are
given in FASTA format: each sequence has a descriptive first line
starting with '>', followed by one or more lines of sequence. This
format is recognized by a large variety of programs, but you should
still check that each server program you use accepts it.
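If you want to inspect the FASTA file yourself before pasting it into a
server, the format described above can be read with a few lines of
Python. This is only a minimal sketch: the function name parse_fasta and
the two example records (clone_001, clone_002) are made up for
illustration and are not part of the homework data.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) pairs from FASTA-formatted text."""
    records = []
    header = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new descriptive line closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header = line[1:].strip()
            chunks = []
        elif header is not None:
            # Sequence lines are joined and upper-cased.
            chunks.append(line.upper())
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Hypothetical example input, not taken from the actual data set.
example = """>clone_001 hypothetical forward read
ACGTACGTAC
GGTTA
>clone_002 hypothetical reverse read
ttgacc
"""

for name, seq in parse_fasta(example):
    print(name, len(seq))
```

Printing the number of records and their lengths is a quick way to
confirm that the file you downloaded is complete.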
To perform your analysis you will use several public servers
accessible over the internet. There is usually more than one server
for each specific task, so if one of the recommended ones is
unavailable you can search for a similar one. Two good places to
start are:
Prepare a written report in which you address the following
questions:
- First you need to check how many different mRNA sequences are
present in your data set, since two or more clones could be copies of
the same mRNA. If this happens, the sequences most likely overlap only
partially. You need to detect such overlaps and form consensus
sequences called contigs. A program to do that is CAP3, available at
http://hercules.tigem.it/ASSEMBLY/assemble.html . Provide a list
of all the independent sequences (contigs or singletons) that were
found.
- Translate each independent sequence into protein. Since you don't
know the correct reading frame, do a translation for each of
the 6 reading frames. You can use the server at
http://www.expasy.ch/tools/dna.html . From the distribution of
stop codons, can you determine the most likely reading frame for any
of the sequences? Provide an example of a translation and explain why
you consider it the most likely one to be translated, or why it was
impossible to decide on a likely reading frame.
- Perform a BLAST search with all independent sequences against
GenBank using the NCBI server at http://www.ncbi.nlm.nih.gov/. Do a
search for both nucleotides and proteins. To do the protein search you
can use either your translated sequences from above, or simply the
blastx server. Which of the sequences give significant hits against
the database? Which of the hits are probably random? Use the list of
top hits and a couple of alignments in your argument.
- Looking at the information you have so far, try to identify how
similar each sequence is to something that is already known. You might
want to place each sequence into one of the following categories:
- The sequence most likely does not code for protein
- The sequence codes for a novel protein
- The protein might share some sequence motifs with other known
proteins
- This protein is already known in another species, but not in
S. purpuratus
- This protein has already been described in
S. purpuratus
This analysis involves a lot of speculation and you don't need to
decide with certainty on one of the above categories. It is OK if you
narrow down your decision in some cases to 2-3 of the categories. Give
a detailed explanation of your reasoning, with sequence/alignment
examples.
- The sequences you are working with are very likely to contain
errors. Some sources of errors are the following:
- During preparation of the cDNA library a small number of clones
can be formed in which the DNA has been reorganized. There could be
missing pieces, reshuffling of fragments, or even joining of 2
fragments from different mRNAs in the same clone.
- When the DNA sequence is read out from the sequencing gel there
could be misread, skipped, or extra bases. Such errors are more likely
to appear near either end of a raw sequence.
- In the process of contig assembly some of the reading errors from
above can be propagated and further amplified in the final
consensus.
Using the data you have so far, give an example of an error in your
sequences. Explain why you believe it is an error and its possible
source.
- Using the protein sequences that are most likely to be encoded by
your DNA fragments (possibly with some manual corrections of errors
you found) do a search against the Prosite sequence motif database
(http://www.expasy.ch/tools/scnpsit1.html). What motifs did you find?
Are there any clear true positives or clear false positives?
- Pick one of the sequences for which there are other known similar
proteins. Retrieve the amino acid sequence of several of the similar
proteins and, together with yours, do a multiple alignment of the
set. A useful server can be found at
http://www.ibc.wustl.edu/msa/clustal.html. Report the obtained
alignment and the algorithm used.
- From the above alignment identify conserved residues within the
family or superfamily of proteins. How do the conserved motifs relate
to the function of the known proteins? Can you predict anything about
the function of your own protein? To answer these questions you will
need to look up a couple of papers from the literature. A good
starting point to do this is Medline, available through the NCBI server.
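For the question on conserved residues, one simple way to scan a
finished multiple alignment is to look for columns in which all (or
most) sequences carry the same residue. The sketch below assumes the
alignment has already been exported as equal-length strings with '-'
for gaps. The function name conserved_columns, the min_fraction
parameter, and the three toy sequences are illustrative assumptions and
do not come from any server's output.

```python
from collections import Counter


def conserved_columns(aligned, min_fraction=1.0):
    """Return (position, residue) pairs for conserved alignment columns.

    aligned      -- list of equal-length aligned sequences, '-' marks a gap
    min_fraction -- fraction of all sequences that must share the residue
    Positions are 1-based column numbers in the alignment.
    """
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("aligned sequences must all have the same length")

    conserved = []
    for col in range(length):
        residues = [seq[col] for seq in aligned if seq[col] != "-"]
        if not residues:
            continue  # column consists of gaps only
        residue, count = Counter(residues).most_common(1)[0]
        # Gaps count against conservation: divide by the number of sequences.
        if count / len(aligned) >= min_fraction:
            conserved.append((col + 1, residue))
    return conserved


# Toy alignment, invented for illustration only.
toy_alignment = [
    "MKC-LGDE",
    "MRCALGDE",
    "MKCSLG-E",
]

print(conserved_columns(toy_alignment))
print(conserved_columns(toy_alignment, min_fraction=0.6))
```

Identity is a crude criterion; columns with chemically similar residues
(for example K and R) can also be functionally conserved, so treat the
output as a starting point for reading the alignment, not as the answer.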
Extra credit:
- Analyze the possible function of the protein sequence you picked
above in the context of sea urchin development. Would it be worthwhile to
pursue this protein in wet lab experiments?
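As a supplement to the six-frame translation question earlier in this
assignment, the following sketch shows what the translation server does
in principle: it translates the three frames of the given strand and the
three frames of its reverse complement with the standard genetic code,
after which stop codons ('*') can be counted per frame. The function
names and the short test sequence are illustrative assumptions, and
codons containing ambiguous bases such as N are simply reported as 'X'.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons from position 0; unknown codons give 'X'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frames(dna):
    """Return a dict mapping frame labels (+1..+3, -1..-3) to translations."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


def longest_open_stretch(protein):
    """Length of the longest run of residues not interrupted by a stop."""
    return max(len(part) for part in protein.split("*"))


# Short made-up sequence, not one of the homework clones.
for label, protein in six_frames("ATGGCCAAGTAA").items():
    print(label, protein, "stops:", protein.count("*"),
          "longest open stretch:", longest_open_stretch(protein))
```

A frame with a long stretch free of stop codons is a candidate coding
frame, but remember that sequencing errors can introduce frame shifts, so
the coding region of a real EST may switch frames partway through.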