Chapter 8. Molecular Tools for Synthetic Biology in Plants

A First-Generation Open Bioinformatics Workshop

Ron Shigeta, Niranjana Nagarajan, Shriram Bharath, Wilifred Tang, Tony Hecht, Alex Alekseyenko, Bryce Wolfe, Corey Hudson, Jamey Kain, Urvish Parikh, and Scott Fay

Abstract

Synthetic biology has had profound effects on human life. It has provided more effective anti-malarial medicine, cheaper insulin, new useful biomaterials, and greener biofuels. However, much remains to be learned in order to synthesize proteins more efficiently. To explore the potential of the DIY biology movement to engage in meaningful synthetic biology bioinformatics research, we developed a bioinformatics workshop to study determinants of protein expression levels in plants. We extracted possible ribosome binding and translation initiation sequences and looked for correlations with experimentally determined protein levels using publicly available datasets for the widely studied plants Oryza sativa and Arabidopsis thaliana. The working group was open to the public and met every other week for three hours, typically starting with a short, relevant presentation followed by hands-on data work. We aim to develop, experimentally validate, and publish our consensus sequences, anticipating that our work will be useful for plant synthetic biology research. We hope our experience will be a model for future community projects that serve the dual purpose of educating curious members of the public while also generating useful scientific results.

Introduction

Advances in sequencing technology have produced an avalanche of biological data over the past 12 years. The bottleneck in discovery has consequently shifted from data generation to data analysis, suggesting that much data is not used to its full potential.[16]

Crowdsourcing is one technique used to gain more insight from existing biological data. Putting the diverse eyes and hands of the general public to the purpose of bioinformatics is not new.[17] Examples include protein[18] and RNA folding, and both paid (Ingenuity Systems) and unpaid[19] curation of literature.

Rather than approach a problem strictly as professionals, we developed an open source DIY workshop where scientists and the public worked together to tackle a synthetic biology project resulting in a publishable outcome. The problem to be solved would need data from completely open sources and not require difficult analysis. A modest goal was set to do a survey of plant translation initiation motifs, aiming to create an open source parts list for controlling translation in metabolic engineering and synthetic biology. Working meetings were posted through Counter Culture Labs and Berkeley Bio Labs (groups with >100 members each) on Meetup.com and met every week or two over three months.

Plants offer many advantages as systems to do fine-tuned biological engineering (e.g., modification to enhance production of economically valuable terpenoids,[20] or modification of lignin biosynthesis to expedite biofuel synthesis[21]). There is a paucity of published information, however, on how to control sets of genes working in concert. Use of small sequence motifs as ribosome binding site (RBS) parts for synthetic biology has been proposed in bacteria,[22] and similar parts have been produced for yeast. Estimates for RBS parts in prokaryotic systems show that the translation level of a gene can be shifted by greater than an order of magnitude, indicating their potential utility in synthetic biology projects. Generating an estimate of the regulatory power of plant translation initiation motifs was thus seen as a useful goal for our project.

In most eukaryotic plant genes, the 5' cap of the mRNA transcript acts as the ribosome binding site and the Kozak sequence acts as the signal for translation initiation. Due to the bacterial origins of the chloroplast, transcripts of genes encoded within the chloroplast genome contain distinct consensus sequences in comparison to transcripts from the nucleus. Instead of the 5' cap, there is a short motif called the Shine-Dalgarno sequence where the ribosome binds and then initiates translation, generally eight nucleotides downstream, though this distance varies. Although there has been some experimental work on ribosome binding sites and Kozak sequences in plants,[23] genomic-scale surveys have not been performed.

Here we use publicly available, combined RNA- and protein-expression data for both nuclear and chloroplast genes to estimate the power of the ribosome binding and translation initiation sequence motifs to initiate translation. These are initial results; experimental confirmation of the motifs will follow.

Methods

Plant Genome Survey and Motif Extraction

A broad survey of the translation initiation motifs from both the TAIR10 Arabidopsis thaliana genome build[24] and the IRGSP 1.0 Japanese Rice Genome[25] was carried out. In order to capture translation initiation motifs as well as possible leader peptide sequences, the gene description GFF files were used to extract the 25 bases before and 18 bases after the start codon of each gene for each genome build. The terms "CDS" or "mRNA" were used to extract protein coding regions. With the data from the rice genome, we were unable to separate coding sequences in all three possible reading frames; therefore, we excluded coding sequences that did not initiate with the canonical "ATG" start codon.
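
The window extraction described above can be sketched as follows. This is a minimal illustration, not the workshop's actual script: the function names are hypothetical, the coordinates are assumed to be the 1-based inclusive start and end of a gene's first coding segment as found in a GFF file, and the chromosome sequence is assumed to be already loaded as a string.

```python
UPSTREAM = 25    # bases kept before the start codon (from the text)
DOWNSTREAM = 18  # bases kept after the start codon (from the text)

_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def initiation_window(chrom_seq, cds_start, cds_end, strand):
    """Return upstream + start codon + downstream for one gene, or None.

    cds_start and cds_end are 1-based inclusive coordinates of the first
    coding segment; strand is '+' or '-'. Genes too close to a chromosome
    end, or not starting with ATG, are skipped by returning None.
    """
    if strand == "+":
        lo = cds_start - 1 - UPSTREAM        # 0-based window start
        hi = cds_start - 1 + 3 + DOWNSTREAM  # 0-based exclusive end
    else:
        # On the minus strand the start codon sits at the high-coordinate end.
        lo = cds_end - 3 - DOWNSTREAM
        hi = cds_end + UPSTREAM
    if lo < 0 or hi > len(chrom_seq):
        return None
    window = chrom_seq[lo:hi]
    if strand == "-":
        window = reverse_complement(window)
    window = window.upper()
    # Keep only genes that begin with the canonical start codon.
    if window[UPSTREAM:UPSTREAM + 3] != "ATG":
        return None
    return window
```

Each returned window is 46 bases long (25 + 3 + 18), with the start codon always at the same offset, so windows from different genes can be stacked directly for a logo or a motif count.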

Chloroplast Survey and Motif Extraction

Because of the small number of genes in the chloroplast, a broad collection of motifs was also extracted for chloroplasts. The GenBank chromosome sequences were scraped from the ChloroplastDB webpage[26] and used to extract motifs using Biopython.[27] This yielded 11,810 initiation motifs from 109 organisms, which gave good consistency in the start codon, with translation initiation generally occurring 8 nucleotides downstream of the ribosome binding site, as expected.
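
One way to check the spacing between a ribosome binding site and the start codon is sketched below. It is an illustration under stated assumptions rather than the method used in the workshop: the GGAGG core and its two shortened variants are an assumed stand-in for the Shine-Dalgarno motif (real chloroplast sites are more degenerate), and the function names are hypothetical.

```python
from collections import Counter

# Assumed Shine-Dalgarno-like cores, longest first so the fullest match wins.
SD_VARIANTS = ("GGAGG", "GGAG", "GAGG")


def sd_spacing(window, start_index):
    """Return (motif, spacing) for an SD-like site upstream of the start codon.

    start_index is the 0-based position of the first base of the start codon
    within window. spacing is the number of nucleotides between the end of
    the motif and the start codon. Returns None when no variant is found.
    """
    upstream = window[:start_index].upper()
    for motif in SD_VARIANTS:
        pos = upstream.rfind(motif)  # rightmost hit is nearest the start codon
        if pos != -1:
            return motif, start_index - (pos + len(motif))
    return None


def spacing_histogram(windows, start_index):
    """Count how often each spacing occurs across many initiation windows."""
    counts = Counter()
    for window in windows:
        hit = sd_spacing(window, start_index)
        if hit is not None:
            counts[hit[1]] += 1
    return counts
```

A histogram peaked near 8 would be consistent with the spacing reported above; a flat histogram would suggest the assumed core is too loose or too strict for the sequences at hand.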

Transcriptome Data

As we could find no publicly available matched proteome/transcriptome datasets, we obtained arrays from Arabidopsis leaf and rice leaf. All replicate arrays for noontime leaf expression in adult plants were obtained from Gene Expression Omnibus,[28] via GEOSearch.[29]

The following table lists the datasets that were chosen in the workshop session.

Experiment   Array designation   Sample description
GSE11966     GSM302918           Expression data from rice leaf rep 1
GSE11966     GSM302919           Expression data from rice leaf rep 2
GSE22788     GSM563421           Rice Kitaake Leaf rep1
GSE22788     GSM563422           Rice Kitaake Leaf rep2
GSE22788     GSM563423           Rice Kitaake Leaf rep3
GSE24048     GSM591761           Control Azucena leaf biologial rep 1
GSE24048     GSM591762           Control Azucena leaf biologial rep 2
GSE24048     GSM591763           Control Azucena leaf biologial rep 3
GSE24048     GSM591764           Control Bala leaf biologial rep 1
GSE24048     GSM591765           Control Bala leaf biologial rep 2
GSE24048     GSM591766           Control Bala leaf biologial rep 3

Overall, 10 arrays, including replicates of 4 separate measurements, were downloaded as Affymetrix Arabidopsis Genome ATH1 and Rice Genome Array CEL files. These were scaled with the MAS5 algorithm[30] as implemented in the affy Bioconductor library[31] in R.[32] The resulting data frame was reduced to mean, median, and standard deviation estimates for each probe set for rice and Arabidopsis. The mean measurement standard deviation was 71%, and the median differed from the mean by 19%, indicating a sample variance that is acceptable where a doubling of intensities is considered significant.
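
The per-probe-set reduction can be sketched in a few lines. The original analysis was done in R; the Python below is only an illustration of the same summary step, with a hypothetical function name and input layout (a mapping from probe set ID to a list of already scaled intensities, one per array). Reading the reported percentages as a standard deviation relative to the mean is our interpretation here, not something the workshop scripts confirm.

```python
from statistics import mean, median, stdev


def summarize_probe_sets(intensities):
    """Reduce replicate intensities to summary statistics per probe set.

    intensities maps a probe set ID to a list of scaled intensities,
    one value per array (at least two values per probe set).
    """
    summary = {}
    for probe_id, values in intensities.items():
        m = mean(values)
        sd = stdev(values)  # sample standard deviation
        summary[probe_id] = {
            "mean": m,
            "median": median(values),
            "sd": sd,
            "relative_sd": sd / m,  # SD expressed as a fraction of the mean
        }
    return summary
```

For example, a probe set measured at 100, 200, and 300 has a mean and median of 200 and a relative standard deviation of 0.5, that is, 50%.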

Proteome Data

The Rice Proteome Project has a comprehensive set of quantitative proteome estimates from 2D SDS PAGE gel including different stages of the plant growth and portions of the plant as well as an organelle survey. Quantitation from gel densitometry, MASCOT scores, and UniProt associations were downloaded as tables.[33]

Only a few hundred measurements were found from multiple sources for Arabidopsis; these did not cover the organelles explicitly and represented less than 10% of the known leaf proteome. As the data proved to be inadequate for this study, the Arabidopsis survey had to be set aside.

UniProt identifiers were mapped to rice probe set identifiers. Many of the UniProt identifiers were directly mappable to probe sets (236 out of 554 identifiers, split among 123 chloroplast, 235 mature leaf, and 196 seedling leaf probes) using the Rice Coexpression Database.[34] The remaining UniProt identifiers were mapped manually by BLASTP searches of the UniProt sequences against the Oryza sativa Nipponbare reference genome.[35]

Translation Initiation Estimation for Motifs

Relating each rice gene to its protein in the proteome set and to its microarray probe set required several data sources. The probe set annotation data for the Rice IVT Expression Array was extracted from the Probe Set Annotation CSV file provided by Affymetrix.[36] Because UniProt accessions drift over time, the Rice Proteome, which was generated circa 2004, had no protein accessions that were in UniProt. Data relationships to gene names and probe sets were assembled through multiple processes. Reviewing archival Rice Genome Array annotations, we were able to find about 50% of the probe set mappings we needed, and the rest were recovered manually by searches of http://uniprot.org and, if necessary, BLAST alignment of nucleotide sequences against the Rice Genome at http://msu.edu.

For genes that had proteome protein concentration estimates, the translational coefficient, Θ, for a given gene was estimated as the ratio of the protein to the mean RNA concentration as estimated by the microarray intensity.

In order to reduce the influence of outliers, the mRNA concentration was taken as proportional to the median of the microarray values from the 10 datasets. This reduced the range of the values by two logs compared to taking the mean microarray probe set intensity.
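
As a minimal sketch of this estimate, assuming protein abundance and microarray intensities are both available as plain numbers in arbitrary units, Θ can be computed as below. The function names are hypothetical, and the example values are invented purely to show why the median is less sensitive to a single outlying array than the mean.

```python
import math
from statistics import median


def translational_coefficient(protein_abundance, array_intensities):
    """Estimate theta as protein abundance over median mRNA intensity."""
    rna = median(array_intensities)
    if rna <= 0:
        raise ValueError("median intensity must be positive")
    return protein_abundance / rna


def log_theta(protein_abundance, array_intensities):
    """Natural log of theta, convenient for histograms spanning many orders."""
    return math.log(translational_coefficient(protein_abundance, array_intensities))


# Invented values: one outlying array (5000) inflates the mean to 1291.25
# but leaves the median at 57.5.
example_intensities = [50.0, 60.0, 55.0, 5000.0]
example_theta = translational_coefficient(115.0, example_intensities)
```

Because both quantities are in arbitrary units, only the relative ordering and spread of Θ across genes is meaningful, not its absolute value.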

Results

Genome Surveys

Surveys of the nuclear chromosomes of Arabidopsis thaliana and Oryza sativa japonica yielded thousands of sequence motifs. A conventional logo survey[37] shows the expected Kozak sequence in the nuclear genes (see Figure 8-1). In the case of the chloroplast chromosome, since the Shine-Dalgarno sequence does not have a fixed location with respect to the start codon,[38] the weblogo does not show any appreciable signal (see Figure 8-2).

Figure 8-1. Sequence logo of chromosome 1 of Oryza sativa japonica, derived from 2,134 sequences, restricted to those initiating with an ATG codon. This logo shows a canonical Kozak motif surrounding the initiating ATG. The x-axis represents the nucleotide position 20 bases upstream and 20 bases downstream of the ATG initiation codon. Some information in the wobble bases (third position) shows in the coding portion of the sequence. The other chromosomes were similar.
Figure 8-2. Sequence logo of chloroplast translation initiation motifs. The x-axis represents the nucleotide position 20 bases upstream and 20 bases downstream of the ATG initiation codon. Bias in the wobble base of the codons is much more pronounced in this logo since only 81 sequences were available to analyze in Arabidopsis chloroplasts.
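
The height of each column in a sequence logo reflects its information content. A simplified version of that calculation is sketched below; it assumes aligned, equal-length sequences containing only A, C, G, and T, uses a hypothetical function name, and omits the small-sample correction that logo tools such as WebLogo apply, so values for small sets (such as the 81 chloroplast sequences) would be overestimated.

```python
import math
from collections import Counter


def information_content(sequences):
    """Return per-column information in bits for aligned DNA sequences.

    A fully conserved column scores 2 bits; a column with all four bases
    equally frequent scores 0 bits. No small-sample correction is applied.
    """
    length = len(sequences[0])
    bits = []
    for i in range(length):
        counts = Counter(seq[i] for seq in sequences)
        total = sum(counts.values())
        entropy = -sum(
            (c / total) * math.log2(c / total) for c in counts.values()
        )
        bits.append(2.0 - entropy)
    return bits
```

A motif at a fixed offset from the start codon, such as the Kozak sequence, produces tall columns, while a motif whose position shifts from gene to gene is smeared across columns and contributes little to any one of them.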

Translation Initiation Estimates

The relative power estimates for the proteome-to-transcript ratio span about 12 natural-log units (see Figure 8-3), which corresponds to a factor of roughly 165,000. The average value on this log scale is –2.4, and the distribution is asymmetrical, with a greater range for enhancements to protein production (Θ > 1).

Figure 8-3. Histogram of relative protein to mRNA abundance ratio (power) for the chloroplast proteome. Using the median value of the microarray intensity, the power varied over a factor of 58,000.

The correlation between mRNA and protein available in the cell turned out to be poor: the mRNA and proteome scores had a correlation of 0.12. This implies that several other factors likely influence both of the numbers that go into Θ and that the model is too simple.
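
For readers who want to reproduce this kind of check on their own matched data, a plain Pearson correlation can be computed as sketched below. The text does not state which correlation measure was used or whether values were log-transformed first, so treating it as a Pearson coefficient on paired lists is an assumption, and the function name is hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient for two equal-length numeric lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 values")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of squared deviations and of cross products around the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)
```

A value near 1 or –1 indicates a strong linear relationship; a value near 0, such as the 0.12 reported above, indicates that mRNA level alone explains little of the variation in protein level.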

Next Steps

Though the workshop has performed some novel analyses, this is a preliminary work. It’s clear the estimate of translation initiation has a tremendous amount of uncertainty associated with it. Microarray probe sets are not directly comparable with each other, as the specific sequences of the probes vary in their target affinity.

An abundance of cell processes can affect the actual amount of protein produced compared to the mRNA reported by a microarray. Just a few of these may include nonsense mediated decay, inhibitory RNA, post-translational editing, protein sorting among cellular compartments, and secondary structure in the mRNA.

Still, for the largest and smallest Θ values, the values determined might give some correlation with strong and weak translation. We will next test the leader sequences associated with the largest translation initiation power in vivo in collaboration with the Glowing Plant project. To this end, we’ll be taking the motifs for the 10 largest and some smaller Θ values and installing each sequence into a plasmid that can be validated in a plant cell by quantitation of fluorescence from a GFP versus a control construct with its current constitutive motif sequence. When their relative strengths have been determined, the parts themselves will be placed in the GoldenBraid public repository.[39]

The collection of motifs will also enable us to examine chloroplast Shine-Dalgarno sequences and their relative effects on translation.

Open Workshop

One of us (Ron Shigeta) initiated the open workshop as an experiment to bring together the populations of curious laymen, experienced wet biologists, and software engineering talent in the East Bay area, and all three of these groups were represented in the attendees. In addition, several working bioinformaticians contributed.

The project was structured to give an introduction and purpose to looking at a variety of publicly available biological data. The first five meetings were each spent on a category of biological data: chromosomal sequences, individual open reading frames, microarray data, quantitative proteomics data in 2D polyacrylamide gel electrophoresis, and quantitative proteome data from gas chromatography/mass spectroscopy. In each of these sessions, data was gathered from public sources, and participants had hands-on experience with the raw data. Attendance ranged from 25 to 30 participants. As a public-scientific interface event, hands-on work with data and computers was quite engaging, and several useful scripts were written to process the data in Python and R.

The following two months of biweekly data analysis sessions were less fully attended, with an average of two to four participants. Possible reasons for this decline include lack of understanding of the subject matter or technical skills needed to fully participate, inability to commit to an extended project, and unclear direction or incentives to continue. The more open-ended nature of data interpretation and analysis is also a difficult process to relate to an introductory course; it was difficult for newcomers to biology to attach to these tasks.

Future workshops may be structured into beginner, intermediate, and advanced levels that would be more accessible to participants from diverse educational backgrounds and will likely be shorter in length to reduce attrition. Another idea is to take on a project with the sole goal of doing that project, rather than anticipating publishable results. As an experiment, the workshop did succeed in bringing together a range of talent and covered a broad set of biological data.

Slides for these sessions are available at http://boundaryconditions.org/biology.html. When we have finished screening the parts, the scripts, collected data, and analysis for this project will be made available at https://bitbucket.org/ronbo/glowingplantparts.

Acknowledgments

The authors would like to thank SudoRoom, a tech makerspace in downtown Oakland, California, for physically hosting the workshop. We would also like to thank the many other individuals who came to the workshop at one time or another: Felicia Betancourt, Ryan Bethencourt, Jack Cunha, Cristina Deptula, A. Dangerfield, Brian Gordon, Carl Gorringe, Rajat Jain, Ahnon Milman, and Heather Wilson.



[16] Lockhart, David J. and Elizabeth A. Winzeler. "Genomics, gene expression and DNA arrays," Nature 405, 2000, 827–836.

[17] See Good, Benjamin M. and Andrew I. Su. "Crowdsourcing for bioinformatics," Bioinformatics 29, 2013, 1925–1933 and Marbach, Daniel et al. "Wisdom of crowds for robust gene network inference," Nature Methods 9, 2012, 796–804.

[18] Lane, Thomas J. et al. "To milliseconds and beyond: challenges in the simulation of protein folding," Current Opinion in Structural Biology, 2012.

[19] Hingamp, Pascal et al. "Metagenome annotation using a distributed grid of undergraduate students," PLOS Biology 6, 2008, e296.

[20] Moses, Tessa et al. "Bioengineering of plant (tri) terpenoids: from metabolic engineering of plants to synthetic biology in vivo and in vitro," New Phytologist, 2013.

[21] Li, Xu et al. "Improvement of biomass through lignin modification," The Plant Journal 54, 2008, 569–581.

[22] Salis, Howard M. et al. "Automated design of synthetic ribosome binding sites to control protein expression," Nature Biotechnology 27, 2009, 946–950. See also http://bit.ly/1iP05Bu.

[23] Shine, J. and L. Dalgarno. "The 3'-Terminal Sequence of Escherichia coli 16S Ribosomal RNA: Complementarity to Nonsense Triplets and Ribosome Binding Sites," Proceedings of the National Academy of Sciences of the United States of America 71, 1974, 1342–1346.

[24] Lamesch, Philippe et al. "The Arabidopsis Information Resource (TAIR): gene structure and function annotation," Nucleic Acids Research 36, 2007, D1009–D1014.

[25] Kawahara, Yoshihiro et al. "Improvement of the Oryza sativa Nipponbare reference genome using next generation sequence and optical map data," Rice 6, 2013, 4.

[26] Cui, Liying et al. "ChloroplastDB: the Chloroplast Genome Database," Nucleic Acids Research 34, 2006, D692–D696.

[27] Cock, Peter J. A. et al. "Biopython: freely available Python tools for computational molecular biology and bioinformatics," Bioinformatics 25, 2009, 1422–1423.

[28] Barrett, Tanya et al. "NCBI GEO: archive for functional genomics data sets–update," Nucleic Acids Research 41, 2013, D991–D995.

[29] Zhu, Yuelin et al. "GEOmetadb: powerful alternative search engine for the Gene Expression Omnibus," Bioinformatics 24, 2008, 2798–2800.

[30] Lim, Wei Keat et al. "Comparative analysis of microarray normalization procedures: effects on reverse engineering gene networks," Bioinformatics 23, 2007, i282–i288.

[31] Gautier, Laurent et al. "affy—analysis of Affymetrix GeneChip data at the probe level," Bioinformatics 20, 2004, 307–315.

[32] R Development Core Team. "R: A Language and Environment for Statistical Computing." Vienna, Austria: R Foundation for Statistical Computing, 2013.

[33] Tanaka, N. et al. "Proteomics of the rice cell: systematic identification of the protein populations in subcellular compartments," Molecular Genetics and Genomics 271, 2004, 566–576.

[34] Sato et al. "RiceXPro Version 3.0: expanding the informatics resource for rice transcriptome," Nucleic Acids Research 41, 2013, D1206–D1213.

[35] Kawahara et al. "Improvement of the Oryza sativa Nipponbare reference genome using next generation sequence and optical map data," Rice 6, 2013.

[36] Liu et al. "NetAffx: Affymetrix probesets and annotations," Nucleic Acids Research 31, 2003, 82–86.

[37] Crooks, Gavin E. "WebLogo: A Sequence Logo Generator," Genome Research 14, 2004, 1188–1190.

[38] Hirose, Tetsuro and Masahiro Sugiura. "Functional Shine-Dalgarno-Like Sequences for Translational Initiation of Chloroplast mRNAs," Plant and Cell Physiology 45, 2004, 114–117.

[39] Sarrion-Perdigones, Alejandro et al. "GoldenBraid 2.0: A Comprehensive DNA Assembly Framework for Plant Synthetic Biology," Plant Physiology 162, 2013, 1618–1631.