General help on EST clustering and the information provided here




What is an EST?

Expressed Sequence Tag: an EST is a short stretch of an expressed cDNA sequence. Usually, RNA is extracted from a particular tissue or cell type and reverse-transcribed into cDNA. Part of each cDNA clone is then sequenced, either in a single sequencing run from one end or in two runs, one from each end. Many cDNA clones and libraries are collected by the
I.M.A.G.E. Consortium.
Overview of EST generation

What is a Cluster?

An EST CLUSTER is a set of sequences that overlap or contain each other. A cluster is built by finding all-against-all sequence similarities and grouping similar sequences together, as described in the
Method. In the next step, the sequences of each individual cluster are assembled. During assembly, a cluster might fall apart into different CONTIGS. For each contig, a CONSENSUS sequence that best represents the alignment is computed. In addition, representative CLONES are selected for each cluster.
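
The grouping step can be pictured as joining every pair of similar sequences into the same group. The following is only a minimal sketch of that idea, not the procedure described in the Method: it assumes the pairwise similarity hits have already been computed, and the function name and sequence IDs are made up for illustration.

```python
from collections import defaultdict


def cluster_by_similarity(sequence_ids, similar_pairs):
    """Group sequences so that any two linked by a chain of hits share a cluster."""
    # Each sequence starts as the root of its own group (union-find).
    parent = {s: s for s in sequence_ids}

    def find(x):
        # Walk up to the root, shortening the path along the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Merge the groups of every pair reported as similar.
    for a, b in similar_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = defaultdict(set)
    for s in sequence_ids:
        groups[find(s)].add(s)
    return sorted(sorted(members) for members in groups.values())


# Hypothetical IDs and hits, for illustration only.
ids = ["est1", "est2", "est3", "est4", "est5"]
hits = [("est1", "est2"), ("est2", "est3"), ("est4", "est5")]
clusters = cluster_by_similarity(ids, hits)
print(clusters)
```

Note that est1 and est3 end up in the same cluster even though they have no direct hit, because both are similar to est2.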

Why cluster ESTs?

The information in the EST databases is highly redundant and of limited quality. By reducing the redundancy, one can obtain longer and higher-quality sequences. This information is linked to the information about the known genes within a cluster and can be used for a thorough analysis of the sequence. The goal is to collect as much of the available information as possible about a gene, whether known or unknown, in one place. Furthermore, we select a representative clone that physically represents the sequence and can be used in biological experiments. The clones can be obtained from the
Resource Center of the German Human Genome Project.

Why is there often more than one contig in a cluster?

The sequences in a cluster are assembled to obtain an alignment and consensus sequences. The assembly procedure, however, is stricter than the clustering step and requires strict overlaps. During this procedure the cluster might therefore fall apart into several contigs. There are several possible reasons for this. First, it might be due to a technical shortcoming in the data or the alignment method. Errors in the alignment might accumulate, so that further sequences can no longer be aligned with the given parameters. Some sequences might also be contaminated with low-quality or random sequence, vectors, concatemers, etc. On the other hand, it might reflect an intrinsic inconsistency in the data: the sequences might genuinely differ, for example because of differential splicing.
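
To illustrate how a stricter overlap requirement can split a cluster, here is a small sketch. The overlap records, the two thresholds and the helper names are all hypothetical; they are not the parameters used for the clustering or assembly on this site.

```python
def connected_groups(nodes, edges):
    """Return groups of nodes that are connected through the given edges."""
    neighbours = {n: set() for n in nodes}
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)

    seen, groups = set(), []
    for start in nodes:
        if start in seen:
            continue
        # Depth-first walk over everything reachable from this read.
        stack, group = [start], []
        seen.add(start)
        while stack:
            node = stack.pop()
            group.append(node)
            for nxt in neighbours[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        groups.append(sorted(group))
    return groups


# Hypothetical overlaps: (read_a, read_b, overlap_length, percent_identity)
overlaps = [
    ("est1", "est2", 120, 98.0),
    ("est2", "est3", 45, 91.0),  # short overlap with lower identity
    ("est3", "est4", 150, 99.0),
]
reads = ["est1", "est2", "est3", "est4"]


def edges_passing(min_length, min_identity):
    """Keep only the overlaps that satisfy both (assumed) thresholds."""
    return [(a, b) for a, b, length, identity in overlaps
            if length >= min_length and identity >= min_identity]


cluster = connected_groups(reads, edges_passing(40, 90.0))    # permissive
contigs = connected_groups(reads, edges_passing(100, 95.0))   # strict
print(cluster)
print(contigs)
```

With the permissive thresholds all four reads form one cluster; with the strict thresholds the weak est2/est3 overlap is rejected and the same reads separate into two contigs.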

What information can be found on this web site?

We clustered the EST and gene information for several organisms. Information about each cluster can be accessed by our cluster name or by searching for the GenBank accession number or the IMAGE ID of one of the sequences contained in the cluster. On the Detailed Query page you can also search for a library number or for text in the gene description, or search with a DNA or protein sequence against the consensus sequences of the clusters using BLAST. BLAST output is visualized by a simple BLAST viewing tool.

For each cluster, some of the information is linked on a dynamically created web page. This includes general information about the cluster, for example the representative clones (which are linked to the RZPD Home Page, where you can order them). You can also get the aligned and tagged sequences as a Staden Project (for which you have to install the Staden Package locally). In addition, there is information for each contig.

A contig might contain one or more known genes and one or more cDNA clones. The clones are sequenced from their ends. An overview of the alignment is displayed. The sequences are linked to their EMBL entries. Known genes are annotated and linked to GeneCards where possible.

Some further analyses of the sequence can be visualized: tagged repeat regions; open reading frames; further reads of clones within other clusters (clustering is done only by sequence overlap, not by annotation); sequence homologies to other clusters; and sequence homologies to the SYSTERS protein families.
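
As an illustration of what an open reading frame display is based on, the sketch below scans a sequence for stretches that start with ATG and end at an in-frame stop codon. It is a simplified assumption of how such a scan could work, not the tool used on this site: it only looks at the three forward frames, ignores the reverse strand, and skips ORFs that run off the end of the sequence (which is common in ESTs). The function name and example sequence are made up.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=2):
    """Return (start, end) positions of ATG...stop ORFs in the forward frames.

    Positions are 0-based and end-exclusive, and the stop codon is included.
    min_codons counts the codons before the stop codon.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        # Step through the sequence one codon at a time in this frame.
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None
    return orfs


example = "CCATGGCTTAAGG"  # hypothetical sequence
orfs = find_orfs(example)
for begin, end in orfs:
    print(begin, end, example[begin:end])
```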


further questions
Tim Beissbarth - Last Change: 6.03.00