An overview of the various planarian transcriptomes

This post is taken directly from my blog at blog.nextgenetics.net

In terms of sequences, there is currently quite a lot of data in the planarian (Schmidtea mediterranea) field. We have an assembled genome from the Genome Institute at Washington University in St. Louis and various transcriptome assemblies built from different sequencing platforms.

The four main transcriptome assemblies that we have are AAA (Blythe et al), Smed454 (Abril et al), BIMSB (Adamidi et al), and Heidelberg (Sandmann et al).

I am going to briefly go over the various transcriptomes and give some thoughts on what can be improved.


Pre-NGS data

The genome assembly is provided by the Genome Institute at Washington University in St. Louis. Predicted to be ~850 Mb in size, the planarian genome assembly consists of 43,294 supercontigs. Along with the genome assembly, an additional 77,833 ESTs are available, sequenced from various sources. The majority of the ESTs come from the genome sequencing project.

Genome
Genome coverage         ~11.6x
Number of contigs       43,294
Number of bases         901,626,601
Average length          20,825
Cytosine %              14.35 % (129,442,937)
Guanine %               14.35 % (129,419,385)
Adenine %               33.65 % (303,435,602)
Thymine %               33.63 % (303,283,077)
N %                     3.99 % (36,045,600)

ESTs
Transcriptome coverage  ???
Number of ESTs          77,833
Number of bases         48,285,735
Average length          620
Cytosine %              17.95 % (8,670,358)
Guanine %               18.46 % (8,916,672)
Adenine %               32.45 % (15,668,223)
Thymine %               31.04 % (14,990,344)
N %                     0.08 % (40,138)
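
For anyone who wants to reproduce this kind of summary on their own FASTA file, here is a minimal Python sketch. It is not the script used to produce the numbers above; the function names (read_fasta, composition) and the toy sequences are made up for illustration, and it assumes a plain FASTA input where anything other than a header line is sequence.

```python
from collections import Counter


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def composition(records):
    """Count sequences, total bases and per-base percentages (case-insensitive)."""
    counts = Counter()
    n_seqs = 0
    for _, seq in records:
        n_seqs += 1
        counts.update(seq.upper())
    total = sum(counts.values())
    return {
        "sequences": n_seqs,
        "bases": total,
        "mean_length": total / n_seqs if n_seqs else 0.0,
        "percent": {b: 100.0 * counts[b] / total for b in "CGATN"} if total else {},
    }


# Toy input standing in for a real FASTA file (made-up sequences).
toy = [">contig1", "ATTTGCAN", ">contig2", "AATT"]
stats = composition(read_fasta(toy))
print(stats)
```

For a real file, passing an open file handle to read_fasta should work the same way, since it only needs an iterable of lines.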

There is also a set of 30,825 gene annotations (SmedGD) done with MAKER using homology and available EST evidence: Sofia M.C. Robb, Eric Ross and Alejandro Sánchez Alvarado (2007) SmedGD: the Schmidtea mediterranea Genome Database Nucleic Acids Research, 36:D599-D606, doi:10.1093/nar/gkm684


NGS data

The transcriptome assemblies were done using Roche 454, Illumina, and ABI SOLiD. Here is a table of the raw data that went into each assembly. 

Assembly                     454              Illumina                      SOLiD
AAA (Blythe et al)           0.58 million SE  X                             507 million SE
Smed454 (Abril et al)        0.58 million SE  X                             X
BIMSB (Adamidi et al)        1.3 million SE   56 million PE, 20 million SE  X
Heidelberg (Sandmann et al)  1.3 million SE   336 million PE                X

SE = single-end reads, PE = paired-end reads, X = platform not used


Assemblies

The assembly method varies with the sequencing platform. De novo assembly of 454 reads is usually done with Newbler. De novo assembly of Illumina reads can be done with various short-read assemblers like SOAPdenovo or Velvet. Reference assembly can also be done on Illumina reads with BWA as a mapper and Cufflinks as an assembler. Currently, the only reliable option for assembling SOLiD reads is a reference assembly, typically done with TopHat/Bowtie and Cufflinks.

The AAA data set was assembled in our lab. I was able to get two other assembled sets of transcripts from BIMSB (Adamidi et al) and Heidelberg (Sandmann et al). Here are some statistics of these three assemblies:

                          AAA                   BIMSB                 Heidelberg
Number of transcripts     25,052                18,619                28,926
Mean length               950.52                1,288.28              1,074.91
Median length             804                   1,078                 715
Range of lengths          21 – 22,140           8 – 14,735            101 – 17,609
Number of bases           23,812,409            23,986,427            31,092,796
Cytosine %                15.88 % (3,781,022)   17.03 % (4,085,255)   16.52 % (5,135,878)
Guanine %                 17.42 % (4,148,181)   17.07 % (4,093,813)   16.75 % (5,207,613)
Adenine %                 34.10 % (8,120,012)   32.48 % (7,790,708)   30.41 % (9,456,616)
Thymine %                 32.60 % (7,762,999)   32.55 % (7,808,091)   29.99 % (9,326,216)
N %                       < 0.01 % (195)        0.87 % (208,560)      6.32 % (1,966,473)
Number mapped to genome*  24,842 (99.16 %)      18,331 (98.45 %)      28,226 (97.58 %)
Number of mapped loci*    39,993                28,002                48,513

*mapping was done with GMAP
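
The length and mapping rows in the table are simple summary statistics. As a rough sketch of how they can be computed (the helper names length_stats and percent_mapped are my own for this example, and the toy lengths are made up):

```python
from statistics import mean, median


def length_stats(lengths):
    """Summarise transcript lengths: count, total bases, mean, median, range."""
    lengths = list(lengths)
    if not lengths:
        raise ValueError("no transcripts given")
    return {
        "transcripts": len(lengths),
        "bases": sum(lengths),
        "mean": round(mean(lengths), 2),
        "median": median(lengths),
        "range": (min(lengths), max(lengths)),
    }


def percent_mapped(n_mapped, n_total):
    """Percentage of transcripts with at least one genome hit."""
    return round(100.0 * n_mapped / n_total, 2)


# Made-up transcript lengths, only to show the call.
toy_stats = length_stats([21, 804, 2000])

# Sanity check against the AAA column of the table.
aaa_mean = round(23_812_409 / 25_052, 2)
aaa_mapped = percent_mapped(24_842, 25_052)
print(toy_stats, aaa_mean, aaa_mapped)
```

Recomputing the AAA column this way gives the same mean length (950.52) and mapped percentage (99.16 %) as reported.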


AAA dataset

The AAA dataset was assembled using 454 and SOLiD reads. The 454 reads and ESTs were de novo assembled using Newbler, and the SOLiD reads were reference assembled using BioScope/Cufflinks. Since this assembly was done before TopHat and Bowtie supported mapping of color-space reads, we had to use BioScope to find split reads.

The 454 assembly was used as the backbone. The SOLiD assembly was used to determine strandedness where possible and to extend the 454 assembly. I believe we have saturated the transcriptome with our SOLiD reads, but because SOLiD read coverage varies so much across transcripts, we had a hard time assembling full-length transcripts.

Pros: More complete coverage of transcribed regions due to the SOLiD sequencing, strandedness can be determined with SOLiD, samples taken from a range of regeneration time points

Cons: Redundancy in transcripts, not full length, some transcripts containing introns were likely assembled from pre-mRNA


BIMSB dataset

The BIMSB assembly had a fair amount of Illumina 36 bp paired-end reads, which were assembled with SOAPdenovo. They also had a good amount of 454 reads, assembled with Newbler. The initial assembly actually produced around 26,000 transcripts. Using BLAT, they were able to determine transcript fusion/fission events and recluster the ~26,000 transcripts into ~18,000.
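
To give a feel for what a reclustering step can look like, here is a minimal Python sketch using union-find. This is not the BIMSB pipeline: the exact criteria they used are not described here, so the sketch simply assumes that some earlier step (for example, filtering BLAT alignments) has already produced pairs of transcript IDs that belong together. The function name cluster and the IDs are hypothetical.

```python
def cluster(ids, linked_pairs):
    """Group ids into clusters given pairs that should be merged (union-find)."""
    parent = {i: i for i in ids}

    def find(x):
        # Walk up to the root, shortening the path as we go (path halving).
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in linked_pairs:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    clusters = {}
    for i in ids:
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


# Hypothetical transcript IDs and links, for illustration only.
transcripts = ["t1", "t2", "t3", "t4", "t5"]
links = [("t1", "t2"), ("t2", "t3")]
groups = cluster(transcripts, links)
print(groups)
```

Here t1, t2 and t3 collapse into one cluster while t4 and t5 stay on their own, so five transcripts become three clusters.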

Pros: Very little redundancy, transcripts are more complete, proteomics data to support the transcripts

Cons: Slightly lower coverage of all genes due to the strict transcript clustering step, in which the lower 5% quantile was discarded; unknown sample conditions


Heidelberg dataset

The most recently assembled transcriptome is the Heidelberg transcriptome. The raw data contains a good amount of 454 reads from previous studies and a large amount of Illumina 36 bp PE reads. They were able to use the de novo assembly of 454 reads + ESTs as a scaffold. Velvet + Oases were then used to assemble the Illumina data together with the 454 assembly.

Pros: Good coverage of all the genes, more complete transcript lengths

Cons: PE read assembly contains a lot of Ns, same redundancy problem as AAA, all samples were taken from head pieces, which might bias read composition


General issues with assemblies

Multiple sequencing platforms. NGS read assemblers are usually designed for a specific platform. There is a hybrid assembler available (Mira2), but it is mainly used for genomic assemblies. Most of the planarian transcriptome studies use two sequencing platforms, resulting in two initial assemblies followed by a merging step.

The problem with this approach is that each assembler deals with the shortcomings of its sequencing platform in its own way, so there is no standardized metric for determining the ‘goodness’ of an assembly. When we merge two assemblies from two different platforms, are we compounding the faults of both?

Sample preparation. The read composition of a biological sample could skew assembly statistics, depending on the condition of the organism when the sample was taken and on the library preparation methods. How do the various assemblers deal with read composition?

Let’s say a transcript, ‘X’, is expressed just enough in one library to pass the threshold for the number of reads required for assembly. But in another library, it is not expressed at all. If we pool the reads of both libraries and assemble them together, do we run the risk of discarding ‘X’? Is it better to assemble all the libraries individually and then merge the individual assemblies? How much coverage would we lose if we did that?
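
One way to think about this question is with a toy model. The Python sketch below is purely an assumption on my part about how a cutoff might behave, not a description of any real assembler: it compares a hypothetical absolute cutoff (minimum read count) with a hypothetical relative cutoff (minimum reads per million). All numbers and names are invented for illustration.

```python
def passes_absolute(x_reads, min_reads):
    """Hypothetical cutoff: keep X if it has at least min_reads reads."""
    return x_reads >= min_reads


def passes_relative(x_reads, library_size, min_rpm):
    """Hypothetical cutoff: keep X if it reaches min_rpm reads per million."""
    return x_reads * 1_000_000 / library_size >= min_rpm


# Invented libraries: X is barely expressed in A and absent in B.
lib_a = {"x_reads": 10, "size": 1_000_000}
lib_b = {"x_reads": 0, "size": 1_000_000}
pooled = {
    "x_reads": lib_a["x_reads"] + lib_b["x_reads"],
    "size": lib_a["size"] + lib_b["size"],
}

MIN_READS = 10  # assumed absolute threshold
MIN_RPM = 10    # assumed relative threshold

kept_alone_abs = passes_absolute(lib_a["x_reads"], MIN_READS)
kept_pooled_abs = passes_absolute(pooled["x_reads"], MIN_READS)
kept_alone_rel = passes_relative(lib_a["x_reads"], lib_a["size"], MIN_RPM)
kept_pooled_rel = passes_relative(pooled["x_reads"], pooled["size"], MIN_RPM)
print(kept_alone_abs, kept_pooled_abs, kept_alone_rel, kept_pooled_rel)
```

Under the absolute cutoff, pooling cannot lose ‘X’, because read counts only add up. Under the relative cutoff, the empty library dilutes ‘X’ from 10 to 5 reads per million and it drops out. Which of these behaviours is closer to a given assembler is exactly the open question.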

SOLiD reference assembly. ABI SOLiD reads are a bitch to work with because they are in color-space. There are currently no reliable de novo assemblers for SOLiD (at least none that can handle the planarian’s AT-rich transcriptome). The available de novo assemblers just convert the color-space reads into nucleotide-space and then run a regular de novo assembly.
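
To show why that conversion is fragile, here is a small Python sketch of color-space decoding. It relies on the standard SOLiD di-base code, where numbering A, C, G, T as 0 to 3 makes each color the XOR of two neighbouring bases. The function name decode_colorspace and the example reads are made up, and the sketch assumes reads contain only a leading primer base followed by the digits 0 to 3 (no missing-call characters).

```python
BASES = "ACGT"


def decode_colorspace(read):
    """Convert a color-space read (primer base + color digits) to bases.

    Each base is derived from the previous base and the current color,
    so the output depends on every color call that came before it.
    The primer base itself is not part of the returned sequence.
    """
    prev = BASES.index(read[0])
    out = []
    for color in read[1:]:
        prev ^= int(color)  # next base number = previous base XOR color
        out.append(BASES[prev])
    return "".join(out)


# Made-up reads: the second differs from the first by one color call.
good = decode_colorspace("T0123")
bad = decode_colorspace("T0023")
print(good, bad)
```

The two reads differ in a single color, but the decoded sequences (TGAT versus TTCG) differ at every base from the error onwards. A single miscalled color therefore corrupts the rest of the read once it is converted, which is why mapping in color-space against a reference is the safer route.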

The best we can do is map the reads to the genome and reference assemble them. The reliance on this incomplete genome means we cannot discover anything that isn’t in the genome assembly. Another issue is that the genome is from a sexual strain of planarians. Mapping asexual reads onto a sexual genome is obviously not ideal.


Final Thoughts

There are four separate planarian transcriptomes right now: four independent transcriptome studies within a year of each other. Combined, these four studies comprise over 2 million 454 reads, over 400 million Illumina reads, and over 500 million SOLiD reads. I think that’s enough said.