This post is taken directly from my blog at blog.nextgenetics.net.
In terms of sequences, there is currently quite a lot of data in the planarian (Schmidtea mediterranea) field. We have an assembled genome from University of Washington’s genome institute and various transcriptome assemblies using different sequencing platforms.
The four main transcriptome assemblies that we have are:
Blythe MJ, Kao D, Malla S, Rowsell J, Wilson R, et al. (2010) A Dual Platform Approach to Transcript Discovery for the Planarian Schmidtea Mediterranea to Establish RNAseq for Stem Cell and Regeneration Biology. PLoS ONE 5(12): e15617. doi:10.1371/journal.pone.0015617
Josep F Abril, Francesc Cebrià, Gustavo Rodríguez-Esteban, et al. (2010) Smed454 dataset: unravelling the transcriptome of Schmidtea mediterranea. BMC Genomics 2010, 11:731 doi:10.1186/1471-2164-11-731
Catherine Adamidi, Yongbo Wang, Dominic Gruen, et al. (2011) De novo assembly and validation of planaria transcriptome by massive parallel sequencing and shotgun proteomics. Genome Research 2011. doi:10.1101/gr.113779.110
Thomas Sandmann, Matthias C Vogg, Suthira Owlarn, et al. (2011) The head-regeneration transcriptome of the planarian Schmidtea mediterranea. Genome Biology 2011, 12:R76 doi:10.1186/gb-2011-12-8-r76
I am going to briefly go over the various transcriptomes and give some thoughts on what can be improved.
The genome assembly is provided by University of Washington’s genome institute. Predicted to be ~850 Mb, the planarian genome consists of 43,294 supercontigs. Along with the genome assembly, an additional 77,833 ESTs are available, sequenced from various sources. The majority of the ESTs come from the genome sequencing.
There is also a set of 30,825 gene annotations (SmedGD) done with MAKER using homology and available EST evidence: Sofia M.C. Robb, Eric Ross and Alejandro Sánchez Alvarado (2007) SmedGD: the Schmidtea mediterranea Genome Database Nucleic Acids Research, 36:D599-D606, doi:10.1093/nar/gkm684
The transcriptome assemblies were done using Roche 454, Illumina, and ABI SOLiD. Here is a table of the raw data that went into each assembly.
|Assembly||Roche 454||Illumina||ABI SOLiD|
|Blythe et al||0.58 million SE||X||507 million SE|
|Abril et al||0.58 million SE||X||X|
|Adamidi et al||1.3 million SE||56 million PE, 20 million SE||X|
|Sandmann et al||1.3 million SE||336 million PE||X|
The assembly methods vary with the sequencing platform. De novo assembly is usually done with Newbler on 454 reads. De novo assembly can be done on Illumina reads with various short-read assemblers like SOAPdenovo or Velvet. Reference assembly can also be done on Illumina reads with BWA as a mapper and Cufflinks as an assembler. Currently, the only reliable option for assembling SOLiD reads is a reference assembly, typically done with TopHat/Bowtie and Cufflinks.
The AAA data set was assembled in our lab. I was able to get two other assembled sets of transcripts from BIMSB (Adamidi et al) and Heidelberg (Sandmann et al). Here are some statistics of these three assemblies:
|Statistic||AAA||BIMSB||Heidelberg|
|Number of transcripts||25,052||18,619||28,926|
|Range of lengths||21 – 22,140||8 – 14,735||101 – 17,609|
|Number of bases||23,812,409||23,986,427||31,092,796|
|Cytosine %||15.88 % (3,781,022)||17.03 % (4,085,255)||16.32 % (5,135,878)|
|Guanine %||17.42 % (4,148,181)||17.07 % (4,093,813)||16.75 % (5,207,613)|
|Adenine %||34.10 % (8,120,012)||32.48 % (7,790,708)||30.41 % (9,456,616)|
|Thymine %||32.60 % (7,762,999)||32.55 % (7,808,091)||29.99 % (9,326,216)|
|N %||0.01 % (195)||0.87 % (208,560)||6.32 % (1,966,473)|
|Number mapped to genome*||24,842 (99.16 %)||18,331 (98.45 %)||28,226 (97.58 %)|
|Number of mapped loci*||39,993||28,002||48,513|
*mapping was done with GMAP
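For anyone wanting to reproduce the non-mapping rows of this table on their own assembly, here is a minimal sketch in Python. It is not the script used to generate the numbers above; the function names and the toy input are made up for illustration, and it assumes the assembly is a plain FASTA file.

```python
from collections import Counter
from io import StringIO


def read_fasta(handle):
    """Yield (header, sequence) pairs from an open FASTA file handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def assembly_stats(handle):
    """Summarize transcript count, length range, and base composition."""
    lengths = []
    counts = Counter()
    for _, seq in read_fasta(handle):
        lengths.append(len(seq))
        counts.update(seq.upper())
    total = sum(lengths)
    # For each base: (raw count, percentage of all bases in the assembly).
    composition = {
        base: (counts[base], 100.0 * counts[base] / total if total else 0.0)
        for base in "ACGTN"
    }
    return {
        "transcripts": len(lengths),
        "min_length": min(lengths) if lengths else 0,
        "max_length": max(lengths) if lengths else 0,
        "bases": total,
        "composition": composition,
    }


# Toy input standing in for a real assembly FASTA file (hypothetical data).
toy = StringIO(">t1\nACGTN\nAA\n>t2\nGGC\n")
stats = assembly_stats(toy)
print(stats["transcripts"], stats["min_length"], stats["max_length"], stats["bases"])
```

On a real file you would pass `open("assembly.fasta")` instead of the `StringIO` object. The mapping rows need an aligner such as GMAP and are not covered here.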
The AAA dataset was assembled using 454 and SOLiD reads. The 454 reads and ESTs were de novo assembled using Newbler and the SOLiD reads were reference assembled using BioScope/Cufflinks. Since this assembly was done before TopHat and Bowtie supported mapping of color-space reads, we had to use BioScope to find split-reads.
The 454 assembly was used as the backbone. The SOLiD assembly was used to determine strandedness where possible and to extend the 454 assembly. I believe we have saturated the transcriptome with our SOLiD reads, but because of the variable distribution of SOLiD reads across transcripts, we had a hard time assembling full-length transcripts.
Pros: A more complete coverage of transcribed regions due to the SOLiD sequencing, strandedness can be determined with SOLiD, samples taken from a range of regenerating time-points
Cons: Redundancy in transcripts, not full length, some transcripts containing introns were likely assembled from pre-mRNA
The BIMSB assembly had a fair amount of Illumina 36 bp paired-end reads, which were assembled with SOAPdenovo. They also had a good amount of 454 reads assembled with Newbler. The initial assembly actually produced around 26,000 transcripts. Using BLAT, they were able to determine transcript fusion/fission events and recluster the ~26,000 transcripts into ~18,000.
Pros: Very little redundancy, transcripts are more complete, proteomics data to support the transcripts
Cons: Slightly lower coverage of all genes due to the strict transcript clustering step where the lowest 5% quantile was discarded, unknown sample conditions
The most recent transcriptome assembled is the Heidelberg transcriptome. The raw data contain a good amount of 454 reads from previous studies and a large amount of Illumina 36 bp PE reads. They were able to use the 454 + EST de novo assembly as a scaffold. Velvet + Oases were then used to assemble the Illumina data with the 454 assembly.
Pros: Good coverage of all the genes, more complete transcript lengths
Cons: PE read assembly contains a lot of Ns, same redundancy problem as AAA, all samples were taken from head pieces, which might bias read composition
General issues with assemblies
Multiple sequencing platforms. NGS read assemblers are usually designed for a specific platform. There is a hybrid assembler available (Mira2); however, it is mainly used for genomic assemblies. Most of the planarian transcriptome studies utilize two sequencing platforms, resulting in an initial dual assembly and then a merging step.
The problem with this approach is that since each assembler attempts to deal with the shortcomings of its sequencing platform by various methods, there is no standardized metric for determining the ‘goodness’ of an assembly. When we merge two assemblies from two different platforms, are we compounding the faults of both?
Sample preparation. Read composition of the biological sample could skew assembly statistics depending on the condition of the organism when the sample was taken and on library preparation methods. How do the various assemblers deal with read composition?
Let’s say a transcript, ‘X’, is expressed just enough in one library to pass the threshold for the number of reads required for assembly. But in another library, it is not expressed at all. If we put together the reads of both libraries and assemble them, do we run the risk of discarding ‘X’? Is it better to assemble all the libraries individually and then merge the individual assemblies? How much coverage would we be losing if we did do that?
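To make the question concrete, here is a toy calculation in Python. Everything in it is an assumption for illustration: the library sizes, the read counts for ‘X’, and both cutoffs are invented, and I am not claiming any particular assembler filters this way. It only shows that the answer depends on whether the cutoff is absolute (a minimum number of reads) or relative to library depth (reads per million).

```python
def reads_per_million(transcript_reads, library_reads):
    """Depth-normalized abundance of a transcript in a library."""
    return transcript_reads * 1_000_000 / library_reads


# Hypothetical cutoffs, not taken from any real assembler.
MIN_READS = 10   # absolute cutoff: at least 10 reads
MIN_RPM = 1.0    # relative cutoff: at least 1 read per million

# Hypothetical libraries: X is barely expressed in A and absent in B.
lib_a = {"total": 10_000_000, "x": 12}
lib_b = {"total": 10_000_000, "x": 0}
pooled = {
    "total": lib_a["total"] + lib_b["total"],
    "x": lib_a["x"] + lib_b["x"],
}

for name, lib in [("A alone", lib_a), ("B alone", lib_b), ("pooled", pooled)]:
    rpm = reads_per_million(lib["x"], lib["total"])
    keeps_absolute = lib["x"] >= MIN_READS
    keeps_relative = rpm >= MIN_RPM
    print(f"{name}: reads={lib['x']} rpm={rpm:.2f} "
          f"absolute={keeps_absolute} relative={keeps_relative}")
```

Under these made-up numbers, pooling never hurts with an absolute cutoff, since the reads for ‘X’ are all still there. With a depth-relative cutoff, pooling halves the normalized abundance of ‘X’ and it drops below the threshold.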
SOLiD reference assembly. ABI SOLiD reads are a bitch to work with because they are in color-space. There are currently no reliable de novo assemblers for SOLiD (at least none that can handle the planarian’s AT-rich transcriptome). The available de novo assemblers just convert the color-space reads into nucleotide-space for a de novo assembly.
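To show why naive conversion to nucleotide-space is risky, here is a small Python sketch of the SOLiD two-base encoding. The function names and the example read are mine. It assumes the usual convention where each color encodes the transition between two adjacent bases and the read starts with a known primer base.

```python
BASES = "ACGT"
CODE = {base: i for i, base in enumerate(BASES)}


def encode_colorspace(seq, primer="T"):
    """Encode a nucleotide sequence as primer base followed by color calls."""
    full = primer + seq
    # With A=0, C=1, G=2, T=3, the color is the XOR of adjacent base codes.
    colors = [str(CODE[a] ^ CODE[b]) for a, b in zip(full, full[1:])]
    return primer + "".join(colors)


def decode_colorspace(read):
    """Decode a color-space read (primer base + colors) into nucleotides."""
    prev = read[0]
    out = []
    for color in read[1:]:
        # Each base depends on the previous base and the current color.
        prev = BASES[CODE[prev] ^ int(color)]
        out.append(prev)
    return "".join(out)


seq = "ATGGCA"
read = encode_colorspace(seq)
print(read)                       # T331031
print(decode_colorspace(read))    # ATGGCA

# Introduce a single color error at the third color call (1 -> 2).
bad_read = read[:3] + "2" + read[4:]
print(decode_colorspace(bad_read))  # ATCCGT
```

Because every base is decoded from the one before it, a single miscalled color changes that base and every base after it. That is why mappers that work directly in color-space are preferred over converting the reads first.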
The best we can do is map the reads to the genome and reference assemble them. The reliance on this incomplete genome means we cannot discover anything that isn’t in the genome assembly. Another issue is that the genome is of a sexual strain of planarians. Mapping asexual reads onto a sexual genome is obviously not ideal.
There are 4 separate transcriptomes in planarians right now: 4 independent transcriptome studies within a year of each other. These 4 individual studies consist of a combined: over 2 million 454 reads, over 400 million Illumina reads, and over 500 million SOLiD reads. I think that’s enough said.