1. What Are Shine-Dalgarno Sequences?
To make a protein, a cell must translate a messenger RNA (mRNA) into a chain of amino acids. The molecular machine that does this -- the ribosome -- first needs to find the right starting position on the mRNA. In bacteria, it does this by recognizing a short signal sequence upstream of the start codon.
That signal is the Shine-Dalgarno (SD) sequence, typically GGAGG or close variants. It works by physically base-pairing with a complementary stretch on the ribosome's own RNA (the anti-SD on the 16S ribosomal RNA). This base-pairing anchors the ribosome in the correct position to begin reading the gene.
This is the intended use of the SD motif. But here is the problem: the same short sequence can also appear inside a gene, purely as an accident of codon choice. When it does, three things could go wrong:
- Ribosome pausing: The translating ribosome's own 16S rRNA hybridizes with the internal SD motif, causing it to stall mid-translation.
- Aberrant internal initiation: A free ribosomal subunit (not the one already translating) recognizes the internal SD and starts a new, spurious translation from mid-gene -- especially if there is a downstream AUG codon.
- Frameshifting: The SD-rRNA interaction nudges the ribosome off its reading frame, producing a garbled protein from that point onward.
The question this experiment addresses: do internal SD motifs measurably reduce protein production, and if so, does the effect scale with the number of motifs?
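The SD/anti-SD pairing described in this section can be made concrete with a reverse complement. The sketch below assumes the commonly cited anti-SD core 5'-CCUCC-3' near the 3' end of E. coli 16S rRNA and considers only Watson-Crick pairs (no G-U wobble). The function and constant names are illustrative, not taken from the experiment's code.

```python
# Watson-Crick complements for RNA (G-U wobble pairs are ignored in this sketch).
_COMPLEMENT = str.maketrans("ACGU", "UGCA")

# Assumed anti-SD core on the 16S rRNA, written 5'->3'.
ANTI_SD_CORE = "CCUCC"


def reverse_complement_rna(seq: str) -> str:
    """Return the RNA strand that pairs antiparallel with `seq` (DNA input is accepted)."""
    return seq.upper().replace("T", "U").translate(_COMPLEMENT)[::-1]


sd = "GGAGG"
pairs_perfectly = reverse_complement_rna(sd) == ANTI_SD_CORE
print(sd, "pairs with", reverse_complement_rna(sd), "->", pairs_perfectly)
```

The same check applies to any internal 5-mer: a motif whose reverse complement matches the anti-SD core can in principle hybridize wherever it occurs in the transcript, which is why position alone does not protect a coding sequence.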
2. The Scientific Debate
Whether internal SD motifs actually harm translation is an active controversy with contradictory evidence.
The Original Claim
Li, Oh & Weissman (2012) used ribosome profiling in E. coli and reported that ~70% of strong translation pauses occurred at internal SD-like sequences. They concluded that internal SDs cause the translating ribosome to pause by hybridizing with its own 16S rRNA.
Li et al., Nature 484, 538--541 (2012)
The Methodological Challenge
Mohammad, Green & Buskirk (2019) showed this was a methodological artifact. The original protocol selected for long ribosome footprints (30--35 nt). SD-rRNA base-pairing extends the protected mRNA region, creating artificially longer footprints. By selecting only long footprints, the study enriched signal at SD sites. When all footprint sizes were captured, no SD-associated pauses were observed.
Mohammad et al., eLife 8:e42591 (2019)
What IS Established
- 94% of 187 bacterial genomes show fewer internal SD motifs than expected by chance (Yang et al. 2016, G3).
- Highly expressed genes avoid internal SDs more strongly (p < 10^-18).
- Internal SDs are under negative selection -- they evolve away faster than neutral expectation (Hockenberry et al. 2018, MBE).
- The likely mechanism is aberrant internal initiation, not pausing: an internal SD + downstream AUG recruits free 30S ribosomal subunits.
Yang et al., G3 6(9), 2709--2720 (2016) | Hockenberry et al., MBE 35(10), 2480--2486 (2018)
The Counter-Argument
Yurovsky & Bhatt (2018) argued that apparent SD depletion is actually generic G-rich sequence avoidance, observed even in eukaryotes and organisms that lack an anti-SD sequence entirely.
Yurovsky & Bhatt, PLOS ONE 13(10):e0202768 (2018)
What our experiment adds: A direct functional test in a controlled cell-free system, independent of profiling artifacts or genomic correlations.
3. The Confound Nobody Has Broken
In natural E. coli genomes, high CAI (Codon Adaptation Index -- a measure of how well a gene's codons match the organism's preferred usage) and low internal SD count always co-occur. This is because E. coli's preferred codons happen not to form SD-like patterns.
The consequence is fundamental: from genomic data alone, you cannot disentangle whether expression depends on CAI, SD count, or both. Highly expressed genes tend to have good codons AND few internal SDs; poorly expressed genes tend to have bad codons AND more internal SDs.
To our knowledge, no published study has experimentally varied SD count while holding CAI constant. Our experiment is designed to break this confound.
4. Our Experiment
Design
- 5 synonymous variants of mRFP (231 amino acids, identical protein sequence)
- Internal SD motif count varied: 9, 13, 17, 21, 25
- CAI held near-maximal: 0.947 to 1.000
- 5' region fixed across all constructs (identical MFE = -1.7 kcal/mol)
- Cell-free expression system (Ginkgo Bioworks CFPE)
- 12 replicates per construct (HiBiT luminescence quantification)
Biological Constraint
mRFP contains 7 Glu-Gly dipeptides, 1 Trp-Glu-Ala tripeptide, and 1 Met-Glu-Gly overlap where every possible codon combination creates an SD motif. The minimum achievable internal SD count is therefore 9 (not zero). The 5-mers counted: GGAGG, GAAGG, GAGGG, GGAAG, AGGAG.
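The constraint above can be checked by brute force. The sketch below enumerates every synonymous coding of a short peptide and reports the smallest number of motif hits. It assumes the standard genetic code for the five amino acids involved and counts every 5-mer start position as a hit, including overlapping ones; the authors' exact counting convention is not stated, so treat that choice as an assumption.

```python
from itertools import product

# The five 5-mers scored as SD motifs in this study (DNA alphabet).
SD_MOTIFS = ("GGAGG", "GAAGG", "GAGGG", "GGAAG", "AGGAG")

# Standard-code codons for the amino acids named in the text.
CODONS = {
    "M": ["ATG"],
    "W": ["TGG"],
    "E": ["GAA", "GAG"],
    "G": ["GGT", "GGC", "GGA", "GGG"],
    "A": ["GCT", "GCC", "GCA", "GCG"],
}


def count_sd_motifs(seq, motifs=SD_MOTIFS):
    """Count 5-mer windows that match a motif (overlapping hits count separately)."""
    seq = seq.upper().replace("U", "T")
    return sum(seq[i:i + 5] in motifs for i in range(len(seq) - 4))


def min_motifs_over_codings(peptide):
    """Smallest motif count over all synonymous codings of `peptide`."""
    codings = product(*(CODONS[aa] for aa in peptide))
    return min(count_sd_motifs("".join(coding)) for coding in codings)


for peptide in ("EG", "WEA", "MEG", "EA"):
    print(peptide, min_motifs_over_codings(peptide))
```

Glu-Gly, Trp-Glu-Ala, and Met-Glu-Gly all return a minimum of at least one, while the Glu-Ala control returns zero. Under this overlapping-hit convention Met-Glu-Gly yields two hits per coding, whereas the text treats it as a single overlap site, so absolute counts from this sketch may not match the construct table exactly.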
Construct Relationships
SD-13 is the baseline -- this is the max-CAI sequence with zero codon changes. SD-09 removes 4 motifs from that baseline. SD-17, SD-21, and SD-25 add motifs by introducing synonymous codon substitutions.
| Construct | SD Count | CAI | Codons Changed | 5' MFE (kcal/mol) |
| SD-09 | 9 | 0.989 | 4 (removals) | -1.7 |
| SD-13 | 13 | 1.000 | 0 (baseline) | -1.7 |
| SD-17 | 17 | 0.986 | 3 (additions) | -1.7 |
| SD-21 | 21 | 0.967 | 7 (additions) | -1.7 |
| SD-25 | 25 | 0.947 | 11 (additions) | -1.7 |
5. Results
Expression vs Internal SD Count
Figure: expression (nM) versus internal SD count. Each point represents one of 12 replicate wells. The orange line connects the means. Points are jittered horizontally so all 12 are visible at each SD count.
Key observation: The relationship is non-monotonic -- expression decreases from SD-09 to SD-17 then increases at SD-21 and SD-25. Note the increasing spread (CV) and bimodal clustering at higher SD counts.
Summary Statistics
| SD Count | Mean (nM) | Std Dev (nM) | CV | Min (nM) | Max (nM) |
| 9 | 269.9 | 10.9 | 4.1% | 257.6 | 289.2 |
| 13 | 218.7 | 6.3 | 2.9% | 209.1 | 226.9 |
| 17 | 186.4 | 6.9 | 3.7% | 177.1 | 196.9 |
| 21 | 204.8 | 12.7 | 6.2% | 192.4 | 222.4 |
| 25 | 246.0 | 20.4 | 8.3% | 215.9 | 262.9 |
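The CV column is the standard deviation divided by the mean. As a quick consistency check, the sketch below recomputes it from the rounded means and standard deviations in the table; the dictionary and function names are illustrative.

```python
# (mean nM, standard deviation nM) copied from the summary table, keyed by SD count.
SUMMARY = {
    9: (269.9, 10.9),
    13: (218.7, 6.3),
    17: (186.4, 6.9),
    21: (204.8, 12.7),
    25: (246.0, 20.4),
}


def cv_percent(mean, sd):
    """Coefficient of variation as a percentage."""
    return 100.0 * sd / mean


cv = {count: cv_percent(mean, sd) for count, (mean, sd) in SUMMARY.items()}
for count, value in cv.items():
    print(f"SD-{count:02d}: CV = {value:.1f}%")
```

Because the inputs are already rounded, a recomputed value can differ from the table in the last digit (SD-09 comes out near 4.0% rather than 4.1%); the unrounded replicate data would be needed to reproduce the table exactly.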
Replicate Distribution by Construct
Strip plots for each construct, colored by SD count. The bimodal pattern in SD-21 and SD-25 is clearly visible -- replicates separate into two distinct clusters.
Variability and Plate Effects at Higher SD Counts
The variability across replicates is higher at elevated SD counts (CV: 4.1% at SD-09 vs 8.3% at SD-25), though the pattern is non-monotonic (SD-13 has the lowest CV at 2.9%). The bimodal clustering visible in SD-21 and SD-25 replicates is most likely a plate position artifact -- the high and low clusters correspond to systematic column positions on the assay plate, not biological variation. This means the CV increase at higher SD counts may be entirely driven by plate effects interacting differently with different constructs, rather than by SD biology.
6. The U-Shaped Curve -- Real or Artifact?
The Critical Confound: mRNA Structure
Our sequence analysis revealed that SD-17 has uniquely destabilized mRNA structure in the nt 80--160 region. The 3 codon changes that create SD-17 (GGC to GGG at positions 31 and 33; GGC to GGA at position 51) break a stable hairpin in this region.
This matters because it exposes the only canonical GGAGG motif in SD-17 (at position 150). In the baseline sequence, this stretch of the transcript is sequestered within a stable secondary structure and inaccessible to the ribosome. In SD-17, the hairpin is broken and the motif is exposed.
Critically, the additional codon changes in SD-21 and SD-25 accidentally restore the local structure, re-sequestering the exposed motifs.
| Construct | SD Count | Local MFE (nt 80-160) | GGAGG Count | Structural State |
| SD-09 | 9 | -22.0 kcal/mol | 0 | Structured (motifs sequestered) |
| SD-13 | 13 | -20.3 kcal/mol | 0 | Baseline structure |
| SD-17 | 17 | -15.4 kcal/mol | 1 | DISRUPTED (hairpin broken) |
| SD-21 | 21 | -21.2 kcal/mol | 1 | Restored (accidental) |
| SD-25 | 25 | -21.2 kcal/mol | 2 | Restored |
What this means: The U-shape may not be about SD count at all. SD-17 is the worst expresser not because it has 17 SDs, but because its specific codon changes disrupted a structural element that exposes a strong (canonical GGAGG) SD motif. SD-21 and SD-25 added more SDs but their codon changes happened to restore the structure. This is a fundamental confound of having n=1 construct per condition.
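A local-MFE comparison like the one in the table can be produced with a sliding window over the CDS. The sketch below is a minimal version under stated assumptions: the window and step sizes are illustrative, the toy sequence is not an mRFP variant, and `gc_fraction_proxy` is a crude stand-in that is NOT a free energy. With ViennaRNA's Python bindings installed, `vienna_mfe` (which calls `RNA.fold`) would be passed instead.

```python
def windowed_mfe(seq, fold_fn, window=80, step=20):
    """Apply `fold_fn` to each window and return (start, score) pairs (0-based starts)."""
    last_start = len(seq) - window
    return [
        (start, fold_fn(seq[start:start + window]))
        for start in range(0, last_start + 1, step)
    ]


def vienna_mfe(subseq):
    """MFE in kcal/mol via ViennaRNA; requires the separately installed `RNA` module."""
    import RNA  # imported lazily so the rest of the sketch runs without ViennaRNA
    _structure, mfe = RNA.fold(subseq)
    return mfe


def gc_fraction_proxy(subseq):
    """Placeholder score (negative GC fraction); only for demonstrating the windowing."""
    return -sum(base in "GC" for base in subseq) / len(subseq)


# Toy 693-nt sequence (hypothetical), matching only the CDS length from Methods.
toy_cds = "ATGGCGAGC" * 77
profile = windowed_mfe(toy_cds, gc_fraction_proxy)
print(len(profile), "windows; first:", profile[0])
```

Running the same profile on every candidate construct before synthesis would flag cases like SD-17, where a handful of synonymous changes shifts the local score of one window well away from the baseline.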
7. What IS Robust
Despite the structural confound, three findings hold up to scrutiny:
1. Expression varies 1.45x across CAI-matched constructs (186 to 270 nM). This shows that something beyond CAI affects expression of synonymous variants in this cell-free system: CAI alone does not fully predict expression.
2. SD-09 > SD-13 (the baseline). Removing 4 SD motifs from the max-CAI sequence increased expression by 23% (270 vs 219 nM). This is the cleanest comparison in the dataset: only 4 codon changes, minimal structural perturbation (the removed motifs were already base-paired and sequestered in the original structure). The 5' MFE is identical.
3. SD-17 < SD-13. Adding SD motifs (particularly an exposed GGAGG) reduced expression by 15% (186 vs 219 nM). However, this comparison is confounded by the structural disruption described above -- we cannot separate the effect of "more SD motifs" from "broken hairpin exposing a strong motif."
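The three effect sizes quoted above follow directly from the construct means in the summary statistics. A short arithmetic check, using the rounded means as reported (variable names are illustrative):

```python
# Mean expression (nM) per construct, copied from the summary statistics table.
MEAN_NM = {
    "SD-09": 269.9,
    "SD-13": 218.7,
    "SD-17": 186.4,
    "SD-21": 204.8,
    "SD-25": 246.0,
}

# Finding 1: ratio of the best to the worst expresser.
dynamic_range = max(MEAN_NM.values()) / min(MEAN_NM.values())

# Finding 2: relative gain of SD-09 over the SD-13 baseline.
gain_sd09 = MEAN_NM["SD-09"] / MEAN_NM["SD-13"] - 1.0

# Finding 3: relative loss of SD-17 against the SD-13 baseline.
loss_sd17 = 1.0 - MEAN_NM["SD-17"] / MEAN_NM["SD-13"]

print(f"range = {dynamic_range:.2f}x, SD-09 gain = {gain_sd09:.0%}, SD-17 loss = {loss_sd17:.0%}")
```

Both percentage changes are expressed relative to SD-13, the max-CAI baseline, so they are not symmetric with each other and should not be added together.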
8. Hypotheses and Future Experiments
Ranked by plausibility given current evidence:
1. Positional / Structural Confound (MOST LIKELY)
With n=1 per condition, the specific positions and structural context of codon changes matter more than raw motif count. The U-shape is driven by SD-17's unique structural disruption.
Test: Make 5 independent 17-SD constructs with SDs at different positions.
2. tRNA Pool Mismatch
CAI was matched to E. coli genomic codon usage, but the PURE cell-free system has defined tRNA concentrations that may differ from genomic frequencies. The codon changes in SD-09 might coincidentally match the PURE tRNA pool better.
Test: Compute tAI using actual PURE system tRNA concentrations instead of genomic CAI.
3. Motif Strength Confound
Lower-count constructs may contain only weak SD motifs (e.g. GAAGG) while higher-count constructs contain strong ones (GGAGG). The effect might depend on motif strength, not count.
Test: Make a 25-SD construct using only weak (non-GGAGG) motifs and compare.
4. mRNA Structure Redistribution
Synonymous codon changes alter internal secondary structure throughout the transcript, not just near the changes. Global structural rearrangement could modulate ribosome processivity.
Test: Compute windowed MFE profiles across the full CDS. (Already done -- confirms SD-17 is structurally disrupted in the nt 80-160 window.)
5. Ribosome Queuing Effects
At very high SD density, uniformly slow translation may paradoxically improve throughput by preventing ribosome collisions that cause premature termination.
Test: TASEP (Totally Asymmetric Simple Exclusion Process) simulation with actual SD positions and elongation rates.
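As a starting point for the TASEP test proposed under the queuing hypothesis, the sketch below implements a minimal random-sequential-update TASEP with site-specific hop probabilities. Everything numeric here is hypothetical: the lattice length, the initiation probability, and the position and strength of the slow site are placeholders, not measured elongation rates or real SD positions. Each ribosome occupies one site, and collision-induced termination (the mechanism the hypothesis relies on) is not modelled, so this shows only the basic exclusion dynamics.

```python
import random


def simulate_tasep(hop_probs, init_prob=0.5, steps=200_000, seed=0):
    """Return completed proteins per update attempt for an open-boundary TASEP."""
    rng = random.Random(seed)
    n_sites = len(hop_probs)
    occupied = [False] * n_sites
    completed = 0
    for _ in range(steps):
        site = rng.randrange(-1, n_sites)  # -1 stands for an initiation attempt
        if site == -1:
            if not occupied[0] and rng.random() < init_prob:
                occupied[0] = True
        elif occupied[site] and rng.random() < hop_probs[site]:
            if site == n_sites - 1:
                occupied[site] = False  # ribosome terminates and releases a protein
                completed += 1
            elif not occupied[site + 1]:
                occupied[site] = False  # hop forward only if the next site is free
                occupied[site + 1] = True
    return completed / steps


uniform = [1.0] * 50
one_slow_site = list(uniform)
one_slow_site[25] = 0.05  # hypothetical strong pause at a single codon

flux_uniform = simulate_tasep(uniform)
flux_paused = simulate_tasep(one_slow_site)
print(f"uniform: {flux_uniform:.5f}  one slow site: {flux_paused:.5f}")
```

In this bare model a single slow site lowers throughput by queuing ribosomes behind it. Testing the hypothesis itself would require adding a collision-dependent drop-off term and a multi-codon ribosome footprint, then comparing many evenly spaced slow sites against a few isolated ones.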
9. Context in the Literature
- First CAI-controlled SD dose-response (to our knowledge): No prior study we found has experimentally varied internal SD count while holding CAI constant. This addresses the fundamental confound in all genomic correlation studies.
- First cell-free test of internal SD effects (to our knowledge): Previous functional evidence came from in-vivo systems or ribosome profiling. Cell-free systems eliminate transcriptional regulation and growth-rate effects, and greatly simplify the mRNA degradation landscape.
- Relevant to the Li vs Mohammad debate: Our result shows a functional effect of synonymous changes that correlate with SD content, independent of profiling artifacts. The effect exists even if SD-induced pausing does not.
- The cleanest result -- SD-09 vs SD-13: A 23% expression improvement from removing 4 SD motifs at matched CAI. Minimal codon changes, minimal structural perturbation, same 5' region.
- The U-shape needs more constructs: With n=1 per condition and a demonstrated structural confound, the non-monotonic relationship cannot be confirmed or refuted without additional constructs per SD-count level.
10. Honest Assessment
What We Can Claim
- Internal SD motifs affect cell-free expression independently of CAI (the SD-09 vs SD-13 comparison, p < 0.001)
- The relationship between SD count and expression is not simple -- it is non-monotonic, though structural confounds exist
- Replicate variability is higher at SD-21 and SD-25, although this is most likely a plate-position artifact rather than evidence of more complex translational dynamics
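The p < 0.001 quoted for the SD-09 vs SD-13 comparison can be sanity-checked from the summary statistics alone. The source does not say which test was used; the sketch below assumes Welch's unequal-variance t-test and uses the rounded means, standard deviations, and n = 12 per construct.

```python
import math


def welch_t(mean1, sd1, n1, mean2, sd2, n2):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    var_term1 = sd1 ** 2 / n1
    var_term2 = sd2 ** 2 / n2
    t_stat = (mean1 - mean2) / math.sqrt(var_term1 + var_term2)
    dof = (var_term1 + var_term2) ** 2 / (
        var_term1 ** 2 / (n1 - 1) + var_term2 ** 2 / (n2 - 1)
    )
    return t_stat, dof


# SD-09 vs SD-13, values from the summary statistics table.
t_stat, dof = welch_t(269.9, 10.9, 12, 218.7, 6.3, 12)
print(f"t = {t_stat:.1f}, df = {dof:.1f}")
```

A t statistic near 14 on roughly 18 degrees of freedom is far beyond the two-sided 0.001 critical value (about 3.9 to 4.0), so the stated significance level is consistent with the table. This check treats wells as independent, which the suspected plate-position effects could undermine.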
What We Cannot Claim
- A dose-response relationship (the U-shape is likely confounded by structural effects at n=1 per condition)
- That SD motifs cause ribosome pausing (the pausing signal reported by Li et al. was attributed to a footprint-size selection artifact by Mohammad et al.)
- That the magnitude or direction of SD effects in cell-free systems transfers to in-vivo E. coli
What We Should Do Next
- Multiple independent constructs per SD count (n=5--10 each) to distinguish count effects from positional effects
- Control for internal mRNA structure (compute windowed MFE for all candidates, match it across conditions)
- Test in both cell-free AND in-vivo E. coli to assess transferability
- Vary individual strong SD motifs (GGAGG) at specific positions with structural controls
- Compute tAI using PURE system tRNA concentrations instead of genomic CAI
Methods
- Gene: mRFP, 231 amino acids, 693 nt CDS
- SD motifs scored: 5-mers GGAGG, GAAGG, GAGGG, GGAAG, AGGAG
- Expression system: Ginkgo Bioworks Cell-Free Protein Expression (CFPE). Note: the SD constructs encode a 231-aa mRFP variant, which differs from the 225-aa mRFP used by other constructs in the broader panel. Cross-group comparisons should be interpreted cautiously.
- Quantification: HiBiT luminescence assay (Promega), calibrated to nM protein concentration
- Replicates: 12 wells per construct, single plate
- Structure prediction: ViennaRNA (RNAfold) for MFE calculations
- All constructs share: Identical T7 promoter, RBS (AAGGAG), 5' region (codons 1--10), GGGS linker, HiBiT tag, T7 terminator
- CAI calculation: Sharp & Li method, reference set = E. coli K-12 highly expressed genes
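For reference, the Sharp & Li CAI is the geometric mean of per-codon relative adaptiveness weights, where each weight is a codon's frequency in the reference set divided by the frequency of the most used synonymous codon. The sketch below shows only that final step. The weights are toy placeholders, NOT values from the E. coli K-12 reference set, and skipping codons without a weight (such as Met and Trp) is an assumption about the convention used.

```python
import math

# Hypothetical relative adaptiveness weights for illustration only.
TOY_WEIGHTS = {
    "GAA": 1.0, "GAG": 0.3,
    "GGC": 1.0, "GGT": 0.9, "GGA": 0.1, "GGG": 0.2,
}


def cai(cds, weights):
    """Geometric mean of codon weights; codons without a weight are skipped."""
    usable = len(cds) - len(cds) % 3  # ignore a trailing partial codon
    codons = [cds[i:i + 3].upper() for i in range(0, usable, 3)]
    logs = [math.log(weights[codon]) for codon in codons if codon in weights]
    if not logs:
        raise ValueError("no scorable codons in sequence")
    return math.exp(sum(logs) / len(logs))


print(round(cai("GAAGGC", TOY_WEIGHTS), 3))  # both codons optimal under the toy weights
print(round(cai("GAAGGA", TOY_WEIGHTS), 3))  # one rare codon pulls the score down
```

Because the score is a geometric mean over all codons, a few substitutions in a 231-codon gene move it only slightly, which is how the constructs stay between 0.947 and 1.000 despite up to 11 codon changes.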