Introduction to Bioinformatics
Homework 2

due March 1st in class
This text is also available on the course web page.


In this homework you will have to go through a few of the common sequence analysis steps that start with raw, experimentally obtained DNA sequences and lead to biological interpretations. The sequences that you are given are from an actual experiment and you will try to analyze them in much the same way a researcher working on this project might do.

Here are a few details of the experiment. The goal was to identify proteins responsible for the development of a group of cells called the primary mesenchyme cells (PMCs), which are formed during sea urchin embryogenesis. PMCs have a migratory phase during which they align themselves in a characteristic ring pattern. Their purpose is to synthesize the larval skeleton, a stereotypically branched structure made primarily of calcium carbonate. PMCs begin to synthesize the skeleton as soon as they have formed the ring pattern. Reasonable candidates are therefore proteins involved in migration, in interactions with the extracellular matrix on which the cells migrate, or in synthesis of the skeleton. To identify such candidates, mRNA was purified from a crude PMC preparation (contaminated with other cell types such as ectoderm and endoderm) from embryos of the sea urchin species Strongylocentrotus purpuratus. The mRNA was reverse-transcribed into cDNA, from which an arrayed library was prepared. 1000 cDNA clones were randomly chosen for sequencing. For each clone two sequencing reactions were performed, one from each end. Such an experiment is called EST (Expressed Sequence Tag) analysis: it provides a snapshot of the set of mRNAs expressed by that particular tissue.

You are given 20 of these raw sequences to work with. You can find them by following the link on the course web page. The sequences are in FASTA format, in which each sequence consists of a descriptive first line starting with '>', followed by one or more lines of sequence. This format is recognized by a large variety of programs, but you should check that each server program you use accepts it.
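If you want to inspect the sequences locally before pasting them into a server, the FASTA layout described above can be read with a few lines of Python. This is only a minimal sketch: the function name parse_fasta and the two example records are made up for illustration and are not part of the course data.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) pairs from FASTA-formatted text."""
    records = []
    header = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header = line[1:].strip()
            chunks = []
        elif header is not None:
            # Sequence lines are joined and upper-cased for consistency.
            chunks.append(line.upper())
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Hypothetical example input, not taken from the homework data set.
example = """>clone_001 hypothetical example record
ACGTTGCA
GGCTA
>clone_002
ttgacc
"""

for name, seq in parse_fasta(example):
    print(name, len(seq))
```

Printing the name and length of every record is a quick way to confirm that all 20 sequences were read and that none is unexpectedly short.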

To perform your analysis you will use several public servers accessible over the internet. There is usually more than one server for each specific task, so if one of the recommended ones is unavailable you can search for a similar one. Two good places to start are:

Prepare a written report in which you address the following questions:

  1. First you need to check how many different mRNA sequences are present in your data set, since two or more clones could be copies of the same mRNA. If this happens, the sequences most likely overlap only partially. You need to detect such overlaps and form consensus sequences called contigs. A program to do that is CAP3, available at http://hercules.tigem.it/ASSEMBLY/assemble.html . Provide a list of all the independent sequences (contigs or singletons) that were found.
  2. Translate each independent sequence into protein. Since you don't know the correct reading frame, do a translation for each of the 6 reading frames. You can use the server at http://www.expasy.ch/tools/dna.html . From the distribution of stop codons, can you determine the most likely reading frame for any of the sequences? Provide an example of a translation and explain why you consider it the most likely one to be translated, or why it was impossible to decide on a likely reading frame.
  3. Perform a BLAST search with all independent sequences against GenBank using the NCBI server at http://www.ncbi.nlm.nih.gov/. Search both the nucleotide and the protein databases. To do the protein search you can use either your translated sequences from above, or simply the blastx server. Which of the sequences give significant hits against the database? Which of the hits are probably random? Use the list of top hits and a couple of alignments in your argument.
  4. Looking at the information you have so far, try to identify how similar each sequence is to something that is already known. You might want to place each sequence into one of the following categories:
    1. The sequence most likely does not code for protein
    2. The sequence codes for a novel protein
    3. The protein might share some sequence motifs with other known proteins
    4. This protein is already known in another species, but not in S. purpuratus
    5. This protein has already been described in S. purpuratus
    This analysis involves a lot of speculation and you don't need to decide with certainty for one of the above categories. It is OK if you narrow down your decision in some cases to 2-3 of the categories. Give a detailed explanation of your reasoning, with sequence/alignment examples.
  5. The sequences you are working with are very likely to contain errors. Some sources of errors are the following:
    1. During preparation of the cDNA library a small number of clones can be formed in which the DNA has been reorganized. There could be missing pieces, reshuffling of fragments, or even joining of 2 fragments from different mRNAs in the same clone.
    2. When the DNA sequence is read out from the sequencing gel there could be misread, skipped, or extra bases. Such errors are more likely to appear at both ends of a raw sequence.
    3. In the process of contig assembly some of the reading errors from above can be propagated and further amplified in the final consensus.
    Using the data you have so far, give an example of an error in your sequences. Explain why you believe it is an error and its possible source.
  6. Using the protein sequences that are most likely to be encoded by your DNA fragments (possibly with some manual corrections of errors you found) do a search against the Prosite sequence motif database (http://www.expasy.ch/tools/scnpsit1.html). What motifs did you find? Are there any clear true positives or clear false positives?
  7. Pick one of the sequences for which there are other known similar proteins. Retrieve the amino acid sequence of several of the similar proteins and, together with yours, do a multiple alignment of the set. A useful server can be found at http://www.ibc.wustl.edu/msa/clustal.html. Report the obtained alignment and the algorithm used.
  8. From the above alignment identify conserved residues within the family or superfamily of proteins. How do the conserved motifs relate to the function of the known proteins? Can you predict anything about the function of your own protein? To answer these questions you will need to look up a couple of papers from the literature. A good starting point to do this is Medline, available through the NCBI server.
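To make the idea of overlap detection in question 1 concrete, here is a deliberately naive sketch that looks for an exact match between the end of one sequence and the start of another. It is not how CAP3 works: a real assembler tolerates mismatches and gaps and also compares reverse complements, which matters here because the two reads of a clone come from opposite ends. The function name and the min_len default are assumptions chosen for illustration.

```python
def suffix_prefix_overlap(a, b, min_len=20):
    """Length of the longest suffix of `a` that exactly equals a prefix of `b`.

    Returns 0 if no exact overlap of at least `min_len` bases exists.
    `min_len` is assumed to be at least 1.
    """
    max_len = min(len(a), len(b))
    # Try the longest possible overlap first and shrink until one matches.
    for k in range(max_len, min_len - 1, -1):
        if a[-k:] == b[:k]:
            return k
    return 0


# Toy example with a short threshold so the overlap is easy to see by eye.
left = "AAACGT"
right = "CGTTTT"
print(suffix_prefix_overlap(left, right, min_len=3))
```

A minimum overlap length is needed because very short matches between unrelated sequences occur by chance; the value to use on real reads is a judgment call.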
Extra credit:
  1. Analyze the possible function of the protein sequence you picked above in the context of sea urchin development. Would it be worthwhile to pursue this protein in wet lab experiments?
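For question 2, the six-frame translation performed by the server can also be sketched locally, which makes it easy to count stop codons per frame. The code below assumes the standard genetic code and treats any codon containing an ambiguous base as 'X'; the function names are illustrative only, and the output is meant as a cross-check on the server rather than a replacement for it.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (third base fastest).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(seq):
    """Reverse complement of an upper-case DNA string (N stays N)."""
    return seq.translate(str.maketrans("ACGTN", "TGCAN"))[::-1]


def translate(seq):
    """Translate from the first base; '*' marks a stop, 'X' an unknown codon."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frames(seq):
    """Return translations keyed '+1', '+2', '+3', '-1', '-2', '-3'."""
    seq = seq.upper()
    frames = {}
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(s[offset:])
    return frames


def longest_open_stretch(protein):
    """Length of the longest run of residues uninterrupted by a stop."""
    return max(len(part) for part in protein.split("*"))


# Hypothetical short sequence, for illustration only.
for frame, protein in six_frames("ATGGCCAAGTTTGGATAA").items():
    print(frame, protein, protein.count("*"), longest_open_stretch(protein))
```

A frame with few stops and one long uninterrupted stretch is a plausible coding frame, but keep in mind that a sequencing error can introduce a frameshift partway through a read.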