Goodfire Bio -- Experiment Deep-Dive

Internal Shine-Dalgarno Motifs and Protein Expression:
A Controlled Dose-Response in Cell-Free Systems

The first CAI-controlled test of whether ribosome-confusing sequences inside a gene affect protein production

1. What Are Shine-Dalgarno Sequences?

To make a protein, a cell must translate a messenger RNA (mRNA) into a chain of amino acids. The molecular machine that does this -- the ribosome -- first needs to find the right starting position on the mRNA. In bacteria, it does this by recognizing a short signal sequence upstream of the start codon.

That signal is the Shine-Dalgarno (SD) sequence, typically GGAGG or close variants. It works by physically base-pairing with a complementary stretch on the ribosome's own RNA (the anti-SD on the 16S ribosomal RNA). This base-pairing anchors the ribosome in the correct position to begin reading the gene.

This is the intended use of the SD motif. But here is the problem: the same short sequence can also appear inside a gene, purely as an accident of codon choice. When it does, three things could go wrong:

The question this experiment addresses: do internal SD motifs measurably reduce protein production, and if so, does the effect scale with the number of motifs?

2. The Scientific Debate

Whether internal SD motifs actually harm translation is an active controversy with contradictory evidence.

The Original Claim

Li, Oh & Weissman (2012) used ribosome profiling in E. coli and reported that ~70% of strong translation pauses occurred at internal SD-like sequences. They concluded that internal SDs cause the translating ribosome to pause by hybridizing with its own 16S rRNA.

Li et al., Nature 484, 538--541 (2012)

The Methodological Challenge

Mohammad, Green & Buskirk (2019) showed that this result was a methodological artifact. The original protocol selected for long ribosome footprints (30--35 nt). SD-rRNA base-pairing extends the protected mRNA region, creating artificially longer footprints, so selecting only long footprints enriched signal at SD sites. When all footprint sizes were captured, no SD-associated pauses were observed.

Mohammad et al., eLife 8:e42591 (2019)

What IS Established

Yang et al., G3 6(9), 2709--2720 (2016) | Hockenberry et al., MBE 35(10), 2480--2486 (2018)

The Counter-Argument

Yurovsky & Bhatt (2018) argued that apparent SD depletion is actually generic G-rich sequence avoidance, observed even in eukaryotes and organisms that lack an anti-SD sequence entirely.

Yurovsky & Bhatt, PLOS ONE 13(10):e0202768 (2018)

What our experiment adds: A direct functional test in a controlled cell-free system, independent of profiling artifacts or genomic correlations.

3. The Confound Nobody Has Broken

In natural E. coli genomes, high CAI (Codon Adaptation Index -- a measure of how well a gene's codons match the organism's preferred usage) and low internal SD count always co-occur. This is because E. coli's preferred codons happen not to form SD-like patterns.

The consequence is fundamental: from genomic data alone, you cannot disentangle whether expression depends on CAI, SD count, or both. Every highly expressed gene has good codons AND few internal SDs. Every poorly expressed gene tends to have bad codons AND more internal SDs.

To our knowledge, no published study has experimentally varied SD count while holding CAI constant. Our experiment is designed to break this confound.
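For readers unfamiliar with the metric, CAI is conventionally the geometric mean of per-codon "relative adaptiveness" weights, where each weight is a codon's usage divided by the usage of the most-used synonymous codon. The sketch below illustrates that definition only. The codon counts are hypothetical toy numbers, not E. coli values, and the function names are ours.

```python
import math

# HYPOTHETICAL usage counts for two synonymous families (not real E. coli data).
TOY_COUNTS = {
    "GAA": 80, "GAG": 40,                          # Glu
    "GGT": 50, "GGC": 100, "GGA": 10, "GGG": 20,   # Gly
}
TOY_FAMILIES = [["GAA", "GAG"], ["GGT", "GGC", "GGA", "GGG"]]


def relative_adaptiveness(counts, families):
    """w(codon) = count(codon) / max count within its synonymous family."""
    weights = {}
    for family in families:
        top = max(counts[c] for c in family)
        for codon in family:
            weights[codon] = counts[codon] / top
    return weights


def cai(cds, weights):
    """Geometric mean of w over the codons of `cds` that have a weight.

    Codons without a weight (e.g. single-codon residues) are skipped.
    Assumes every weight is > 0; a zero count would need a pseudocount.
    """
    usable = len(cds) - len(cds) % 3
    codons = [cds[i:i + 3] for i in range(0, usable, 3)]
    logs = [math.log(weights[c]) for c in codons if c in weights]
    return math.exp(sum(logs) / len(logs))


weights = relative_adaptiveness(TOY_COUNTS, TOY_FAMILIES)
print(cai("GAAGGC", weights))   # both codons are the family optimum
print(cai("GAAGGA", weights))   # one rare Gly codon lowers the score
```

With these toy numbers the first call scores 1.0 and the second scores sqrt(0.1), which shows how a single synonymous swap moves CAI even when the protein is unchanged.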

4. Our Experiment

Design

Biological Constraint

mRFP contains 7 Glu-Gly dipeptides, 1 Trp-Glu-Ala tripeptide, and 1 Met-Glu-Gly overlap where every possible codon combination creates an SD motif. The minimum achievable internal SD count is therefore 9 (not zero). The 5-mers counted: GGAGG, GAAGG, GAGGG, GGAAG, AGGAG.
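The constraint above can be checked by brute force. The sketch below counts the five listed 5-mers in a sequence and tests whether every synonymous encoding of a short peptide contains at least one of them. Counting overlapping hits separately is our assumption (the report does not say how overlaps were scored), and the function names are illustrative.

```python
from itertools import product

# The five SD-like 5-mers counted in this report.
SD_MOTIFS = ("GGAGG", "GAAGG", "GAGGG", "GGAAG", "AGGAG")

# Standard genetic code entries for the residues involved (DNA alphabet).
CODONS = {
    "M": ["ATG"],
    "W": ["TGG"],
    "E": ["GAA", "GAG"],
    "G": ["GGT", "GGC", "GGA", "GGG"],
    "A": ["GCT", "GCC", "GCA", "GCG"],
}


def count_sd_motifs(seq: str) -> int:
    """Count motif hits at every offset, so overlapping hits all count.

    ASSUMPTION: overlaps are scored separately; the source may differ.
    """
    seq = seq.upper().replace("U", "T")
    return sum(seq[i:i + 5] in SD_MOTIFS for i in range(len(seq) - 4))


def always_creates_motif(peptide: str) -> bool:
    """True if every synonymous encoding of `peptide` contains a motif."""
    encodings = product(*(CODONS[aa] for aa in peptide))
    return all(count_sd_motifs("".join(codons)) > 0 for codons in encodings)


for peptide in ("EG", "WEA", "MEG", "GA"):
    print(peptide, always_creates_motif(peptide))
```

Glu-Gly, Trp-Glu-Ala, and Met-Glu-Gly all come back True: both Glu codons followed by any GGN codon spell GAAGG or GAGGG, and TGG followed by either Glu codon spells GGAAG or GGAGG. Gly-Ala is included as a control that can be encoded motif-free.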

Construct Relationships

SD-13 is the baseline -- this is the max-CAI sequence with zero codon changes. SD-09 removes 4 motifs from that baseline. SD-17, SD-21, and SD-25 add motifs by introducing synonymous codon substitutions.

Construct   SD Count   CAI     Codons Changed    5' MFE (kcal/mol)
SD-09       9          0.989   4 (removals)      -1.7
SD-13       13         1.000   0 (baseline)      -1.7
SD-17       17         0.986   3 (additions)     -1.7
SD-21       21         0.967   7 (additions)     -1.7
SD-25       25         0.947   11 (additions)    -1.7

5. Results

Expression vs Internal SD Count

Each point below represents one of 12 replicate wells. The orange line connects the means. Points are jittered horizontally so all 12 are visible at each SD count.

Key observation: The relationship is non-monotonic -- expression decreases from SD-09 to SD-17 then increases at SD-21 and SD-25. Note the increasing spread (CV) and bimodal clustering at higher SD counts.

Summary Statistics

SD Count   Mean (nM)   Std Dev (nM)   CV%    Min (nM)   Max (nM)
9          269.9       10.9           4.1%   257.6      289.2
13         218.7       6.3            2.9%   209.1      226.9
17         186.4       6.9            3.7%   177.1      196.9
21         204.8       12.7           6.2%   192.4      222.4
25         246.0       20.4           8.3%   215.9      262.9
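The CV column is the standard deviation divided by the mean. As a quick consistency check, it can be recomputed from the rounded values in the table (the dictionary below simply copies them; the helper name is ours):

```python
# (mean nM, std dev nM) per SD count, copied from the summary table.
SUMMARY = {
    9: (269.9, 10.9),
    13: (218.7, 6.3),
    17: (186.4, 6.9),
    21: (204.8, 12.7),
    25: (246.0, 20.4),
}


def cv_percent(mean_nm: float, sd_nm: float) -> float:
    """Coefficient of variation as a percentage of the mean."""
    return 100.0 * sd_nm / mean_nm


for sd_count, (mean_nm, sd_nm) in SUMMARY.items():
    print(f"SD-{sd_count:02d}: CV = {cv_percent(mean_nm, sd_nm):.2f}%")
```

The recomputed values agree with the table to within 0.1 percentage point; small differences are expected because the table's means and standard deviations are already rounded.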

Replicate Distribution by Construct

Strip plots for each construct, colored by SD count. The bimodal pattern in SD-21 and SD-25 is clearly visible -- replicates separate into two distinct clusters.

Variability and Plate Effects at Higher SD Counts

The variability across replicates is higher at elevated SD counts (CV: 4.1% at SD-09 vs 8.3% at SD-25), though the pattern is non-monotonic (SD-13 has the lowest CV at 2.9%). The bimodal clustering visible in SD-21 and SD-25 replicates is most likely a plate position artifact: the high and low clusters correspond to systematic column positions on the assay plate, not biological variation. This means the CV increase at higher SD counts may be entirely driven by plate effects interacting differently with different constructs, rather than by SD biology.

6. The U-Shaped Curve -- Real or Artifact?

The Critical Confound: mRNA Structure

Our sequence analysis revealed that SD-17 has uniquely destabilized mRNA structure in the nt 80--160 region. The 3 codon changes that create SD-17 (GGC to GGG at positions 31 and 33; GGC to GGA at position 51) break a stable hairpin in this region.

This matters because it exposes the only canonical GGAGG motif (at position 150). In the baseline sequence, this motif is sequestered within a stable secondary structure and inaccessible to the ribosome. In SD-17, the hairpin is broken and the motif is exposed.

Critically, the additional codon changes in SD-21 and SD-25 accidentally restore the local structure, re-sequestering the exposed motifs.

Construct   SD Count   Local MFE (nt 80-160)   GGAGG Count   Structural State
SD-09       9          -22.0 kcal/mol          0             Structured (motifs sequestered)
SD-13       13         -20.3 kcal/mol          0             Baseline structure
SD-17       17         -15.4 kcal/mol          1             DISRUPTED (hairpin broken)
SD-21       21         -21.2 kcal/mol          1             Restored (accidental)
SD-25       25         -21.2 kcal/mol          2             Restored

What this means: The U-shape may not be about SD count at all. SD-17 is the worst expresser not because it has 17 SDs, but because its specific codon changes disrupted a structural element that exposes a strong (canonical GGAGG) SD motif. SD-21 and SD-25 added more SDs but their codon changes happened to restore the structure. This is a fundamental confound of having n=1 construct per condition.
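One rough way to see this with the numbers already reported is to compare how expression ranks against SD count versus against local MFE. The sketch below computes a Spearman rank correlation by hand (ties share an average rank). With only five constructs this is purely descriptive, not a significance test, and the choice of Spearman is ours rather than part of the original analysis.

```python
from statistics import mean

# Values copied from the two tables in this report, in construct order
# SD-09, SD-13, SD-17, SD-21, SD-25.
SD_COUNT = [9, 13, 17, 21, 25]
LOCAL_MFE = [-22.0, -20.3, -15.4, -21.2, -21.2]   # kcal/mol, nt 80-160
EXPRESSION = [269.9, 218.7, 186.4, 204.8, 246.0]  # mean nM


def average_ranks(values):
    """1-based ranks; tied values share the mean of the ranks they span."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x, y):
    """Pearson correlation of the rank vectors (handles ties)."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


print(f"SD count  vs expression: {spearman(SD_COUNT, EXPRESSION):+.2f}")
print(f"local MFE vs expression: {spearman(LOCAL_MFE, EXPRESSION):+.2f}")
```

On these five points the rank correlation with SD count is about -0.30, while the correlation with local MFE is about -0.82 (less stable structure, lower expression). That pattern is consistent with the structural reading, but five data points cannot establish it.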

7. What IS Robust

Despite the structural confound, three findings hold up to scrutiny:

1. Expression varies 1.45x across CAI-matched constructs (186 to 270 nM). Because CAI spans only 0.947 to 1.000 across the five constructs, this indicates that something beyond CAI affects expression of synonymous variants in cell-free systems. CAI alone does not fully predict expression.

2. SD-09 > SD-13 (the baseline). Removing 4 SD motifs from the max-CAI sequence increased expression by 23% (270 vs 219 nM). This is the cleanest comparison in the dataset: only 4 codon changes, minimal structural perturbation (the removed motifs were already base-paired and sequestered in the original structure). The 5' MFE is identical.

3. SD-17 < SD-13. Adding SD motifs (particularly an exposed GGAGG) reduced expression by 15% (186 vs 219 nM). However, this comparison is confounded by the structural disruption described above -- we cannot separate the effect of "more SD motifs" from "broken hairpin exposing a strong motif."
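The three figures quoted above follow directly from the mean expression values in the summary table. A short check, using the table's rounded means (the helper name is ours):

```python
# Mean expression (nM) per construct, copied from the summary table.
MEAN_NM = {
    "SD-09": 269.9,
    "SD-13": 218.7,
    "SD-17": 186.4,
    "SD-21": 204.8,
    "SD-25": 246.0,
}


def percent_change(value: float, baseline: float) -> float:
    """Signed change of `value` relative to `baseline`, in percent."""
    return 100.0 * (value - baseline) / baseline


dynamic_range = max(MEAN_NM.values()) / min(MEAN_NM.values())
up = percent_change(MEAN_NM["SD-09"], MEAN_NM["SD-13"])
down = percent_change(MEAN_NM["SD-17"], MEAN_NM["SD-13"])

print(f"range across constructs: {dynamic_range:.2f}x")
print(f"SD-09 vs SD-13: {up:+.0f}%")
print(f"SD-17 vs SD-13: {down:+.0f}%")
```

This reproduces the 1.45x range, the 23% increase for SD-09, and the 15% decrease for SD-17, all relative to the SD-13 baseline.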

8. Hypotheses and Future Experiments

Ranked by plausibility given current evidence:

1. Positional / Structural Confound (MOST LIKELY)

With n=1 per condition, the specific positions and structural context of codon changes matter more than raw motif count. The U-shape is driven by SD-17's unique structural disruption.

Test: Make 5 independent 17-SD constructs with SDs at different positions.

2. tRNA Pool Mismatch

CAI was matched to E. coli genomic codon usage, but the PURE cell-free system has defined tRNA concentrations that may differ from genomic frequencies. The codon changes in SD-09 might coincidentally match the PURE tRNA pool better.

Test: Compute tAI using actual PURE system tRNA concentrations instead of genomic CAI.

3. Motif Strength Confound

Lower-count constructs may contain only weak SD motifs (e.g. GAAGG) while higher-count constructs contain strong ones (GGAGG). The effect might depend on motif strength, not count.

Test: Make a 25-SD construct using only weak (non-GGAGG) motifs and compare.

4. mRNA Structure Redistribution

Synonymous codon changes alter internal secondary structure throughout the transcript, not just near the changes. Global structural rearrangement could modulate ribosome processivity.

Test: Compute windowed MFE profiles across the full CDS. (Already done -- confirms SD-17 is structurally disrupted in the nt 80-160 window.)

5. Ribosome Queuing Effects

At very high SD density, uniformly slow translation may paradoxically improve throughput by preventing ribosome collisions that cause premature termination.

Test: TASEP (Totally Asymmetric Simple Exclusion Process) simulation with actual SD positions and elongation rates.
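As a starting point for the TASEP test in hypothesis 5, here is a minimal open-boundary simulation with random-sequential updates. It is a sketch under strong simplifying assumptions: each ribosome occupies a single codon (real footprints span several), premature termination is not modeled (blocked hops are only counted as a collision proxy), and the lattice length, rates, and pause position in the demo are hypothetical rather than taken from the mRFP constructs.

```python
import random


def simulate_tasep(rates, alpha, beta, sweeps, seed=0):
    """Open-boundary TASEP with random-sequential updates.

    rates[i] is the hop probability per attempt from codon i to i+1
    (the last entry is unused; termination is governed by `beta`).
    alpha is the initiation probability, beta the termination probability.
    Returns (completed proteins, blocked hop attempts).
    """
    rng = random.Random(seed)
    n = len(rates)
    lattice = [False] * n
    completed = blocked = 0
    for _ in range(sweeps * (n + 1)):
        i = rng.randrange(-1, n)          # -1 stands for the initiation move
        if i == -1:
            if not lattice[0] and rng.random() < alpha:
                lattice[0] = True
        elif lattice[i]:
            if i == n - 1:
                if rng.random() < beta:
                    lattice[i] = False
                    completed += 1
            elif lattice[i + 1]:
                blocked += 1              # ribosome ahead: a "collision"
            elif rng.random() < rates[i]:
                lattice[i], lattice[i + 1] = False, True
    return completed, blocked


# HYPOTHETICAL demo profiles on a 100-codon lattice.
N_CODONS = 100
uniform_fast = [1.0] * N_CODONS
one_pause = list(uniform_fast)
one_pause[50] = 0.1                       # a single slow site
uniform_slow = [0.1] * N_CODONS           # slow everywhere

for name, profile in [("fast", uniform_fast),
                      ("one pause", one_pause),
                      ("uniformly slow", uniform_slow)]:
    done, hits = simulate_tasep(profile, alpha=0.3, beta=1.0, sweeps=2000)
    print(f"{name:>15}: completed={done}, blocked attempts={hits}")
```

To use this for the actual experiment, `rates` would be built from each construct's SD positions and measured or assumed elongation rates, and the model would need extended particles and a collision-dependent drop-off term before it could speak to the premature-termination idea.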

9. Context in the Literature

10. Honest Assessment

What We Can Claim

  • Internal SD motifs affect cell-free expression independently of CAI (the SD-09 vs SD-13 comparison, p < 0.001)
  • The relationship between SD count and expression is not simple -- it is non-monotonic, though structural confounds exist
  • Variability is higher at the highest SD counts (CV 6.2% at SD-21 and 8.3% at SD-25), but the bimodal replicate clusters track plate column position, so this may reflect plate effects rather than SD biology
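The report does not state which test produced the p-value for the SD-09 vs SD-13 comparison. As one plausible check, the sketch below computes Welch's t statistic and its Welch-Satterthwaite degrees of freedom from the summary statistics alone. It assumes the reported standard deviations are sample standard deviations over the 12 wells, and it treats wells as independent, which the plate-position effects noted earlier may violate.

```python
import math


def welch_t(mean1, sd1, n1, mean2, sd2, n2):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    v1, v2 = sd1 ** 2 / n1, sd2 ** 2 / n2
    t = (mean1 - mean2) / math.sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df


# SD-09 vs SD-13, values from the summary table, 12 replicate wells each.
t_stat, dof = welch_t(269.9, 10.9, 12, 218.7, 6.3, 12)
print(f"t = {t_stat:.1f}, df = {dof:.1f}")
```

This gives a t statistic of about 14 on roughly 18 degrees of freedom, which is far into the tail and in line with the reported p < 0.001, subject to the independence caveat above.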

What We Cannot Claim

  • A dose-response relationship (the U-shape is likely confounded by structural effects at n=1 per condition)
  • That SD motifs cause ribosome pausing (Mohammad et al. attributed the pausing signal reported by Li et al. to a footprint-size selection artifact)
  • That the magnitude or direction of SD effects in cell-free systems transfers to in-vivo E. coli

What We Should Do Next

  • Multiple independent constructs per SD count (n=5--10 each) to distinguish count effects from positional effects
  • Control for internal mRNA structure (compute windowed MFE for all candidates, match it across conditions)
  • Test in both cell-free AND in-vivo E. coli to assess transferability
  • Vary individual strong SD motifs (GGAGG) at specific positions with structural controls
  • Compute tAI using PURE system tRNA concentrations instead of genomic CAI

Methods