The Most Damning DSPy Code: Actual Source Code Exposed

The Smoking Gun #1: They Literally Named It "Random Search"

From /dspy/teleprompt/random_search.py:

python

class BootstrapFewShotWithRandomSearch(Teleprompter):
    def compile(self, student, *, teacher=None, trainset, valset=None, restrict=None, labeled_sample=True):
        scores = []
        
        for seed in range(-3, self.num_candidate_sets):
            # ... create random variations ...
            
            if seed == -3:
                # zero-shot
                program = student.reset_copy()
            elif seed == -2:
                # labels only
                teleprompter = LabeledFewShot(k=self.max_labeled_demos)
                program = teleprompter.compile(student, trainset=trainset_copy, sample=labeled_sample)
            else:
                # RANDOM SHUFFLE AND RANDOM SIZE SELECTION
                random.Random(seed).shuffle(trainset_copy)
                size = random.Random(seed).randint(self.min_num_samples, self.max_num_samples)
                # ... (elided: compile a candidate program from this shuffled, resized demo set) ...
            
            # Evaluate the random variation
            result = evaluate(program)
            score = result.score
            
            # This is their "optimization" - just keep the best random attempt
            if len(scores) == 0 or score > max(scores):
                print("New best score:", score, "for seed", seed)
                best_program = program
            
            scores.append(score)

They're literally:

  1. Trying random seeds (-3 to num_candidate_sets)

  2. Randomly shuffling training data

  3. Randomly selecting sample sizes

  4. Keeping whatever randomly scores highest

  5. Calling this "optimization"
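
Strip away the class name and that loop reduces to a few lines. The sketch below is illustrative, not DSPy's actual API: make_candidate and evaluate are hypothetical stand-ins for "shuffle the demos and build a program" and "score it on the validation set".

python

import random

def random_search_compile(make_candidate, evaluate, num_candidates, base_seed=0):
    """Generate random candidates, score each one, keep the argmax."""
    best_score, best_candidate = float("-inf"), None
    for seed in range(num_candidates):
        rng = random.Random(base_seed + seed)
        candidate = make_candidate(rng)   # e.g. shuffle demos, pick a random subset size
        score = evaluate(candidate)       # one full evaluation pass per candidate
        if score > best_score:            # the "optimization": keep whatever scored highest
            best_score, best_candidate = score, candidate
    return best_candidate, best_score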

The Smoking Gun #2: The Bootstrap "Optimization" Is Just Filtering Accidents

From /dspy/teleprompt/bootstrap.py:

python

def _bootstrap_one_example(self, example, round_idx=0):
    try:
        with dspy.settings.context(trace=[], **self.teacher_settings):
            lm = dspy.settings.lm
            # Use a fresh rollout with temperature=1.0 to bypass caches
            lm = lm.copy(rollout_id=round_idx, temperature=1.0) if round_idx > 0 else lm
            
            # Run the program and see what happens
            prediction = teacher(**example.inputs())
            # (elided: `trace`, used below, is collected via the trace=[] set up in the context above)
            
            # Check if it accidentally worked
            if self.metric:
                metric_val = self.metric(example, prediction, trace)
                if self.metric_threshold:
                    success = metric_val >= self.metric_threshold
                else:
                    success = metric_val
            else:
                success = True  # If no metric, everything "works"!
    except Exception as e:
        success = False
    
    if success:
        # It accidentally worked! Keep it as a "good" example
        for step in trace:
            # ... save the trace that accidentally worked ...

Translation:

  • Run the teacher at temperature=1.0 (full sampling randomness)

  • If it accidentally scores well, assume it's "good"

  • No understanding of WHY it worked
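
The filter-and-keep pattern is easy to reproduce. Here is a minimal sketch of it under assumed names; run_teacher and metric are hypothetical callables standing in for the teacher program and the user's metric, not DSPy functions.

python

def bootstrap_demos(examples, run_teacher, metric, threshold=None):
    """Sample one completion per example and keep the ones the metric happens to accept."""
    kept = []
    for example in examples:
        try:
            prediction = run_teacher(example, temperature=1.0)  # stochastic sampling
        except Exception:
            continue  # failed runs are silently discarded
        score = metric(example, prediction)
        passed = score >= threshold if threshold is not None else bool(score)
        if passed:
            kept.append((example, prediction))  # a run that happened to pass becomes a "demo"
    return kept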

The Smoking Gun #3: The "Bayesian Optimization" That Isn't

From /dspy/teleprompt/mipro_optimizer_v2.py:

python

def _optimize_prompt_parameters(self, ...):
    import optuna
    
    # They claim it's "Bayesian Optimization"
    logger.info("finding the optimal combination using Bayesian Optimization.\n")
    
    # But it's just trying random combinations
    def objective(trial):
        # (elided: loop over the program's predictors, indexed by i)
        # Choose random instructions
        instruction_idx = trial.suggest_categorical(
            f"{i}_predictor_instruction", range(len(instruction_candidates[i]))
        )
        
        # Choose random demos
        if demo_candidates:
            demos_idx = trial.suggest_categorical(
                f"{i}_predictor_demos", range(len(demo_candidates[i]))
            )
        
        # Evaluate this random combination
        score = eval_candidate_program(batch_size, valset, candidate_program, evaluate, self.rng).score
        
        # This is it. That's the "optimization"
        if score > best_score:
            best_score = score
            best_program = candidate_program.deepcopy()
        
        return score
    
    # Use Optuna (which is legit) on a nonsense search space
    sampler = optuna.samplers.TPESampler(seed=seed, multivariate=True)
    study = optuna.create_study(direction="maximize", sampler=sampler)
    study.optimize(objective, n_trials=num_trials)

The Fraud:

  • They use Optuna (a real Bayesian optimization library)

  • But apply it to a handful of categorical index choices, with no continuous search space

  • A surrogate model has almost no structure to exploit in such a tiny grid of options

  • Just trying random combinations of prompt variations
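
To see how small the search space actually is, here is a toy reproduction of the same structure. Everything in it is illustrative: the candidate lists are made up and fake_eval stands in for a full evaluation pass over a validation set.

python

import random

import optuna

instruction_candidates = ["instr_v1", "instr_v2", "instr_v3"]
demo_candidates = ["demo_set_a", "demo_set_b"]

def fake_eval(instruction, demos):
    # stand-in for running the candidate program over a validation set
    return random.random()

def objective(trial):
    i = trial.suggest_categorical("instruction_idx", list(range(len(instruction_candidates))))
    d = trial.suggest_categorical("demos_idx", list(range(len(demo_candidates))))
    return fake_eval(instruction_candidates[i], demo_candidates[d])

sampler = optuna.samplers.TPESampler(seed=0, multivariate=True)
study = optuna.create_study(direction="maximize", sampler=sampler)
study.optimize(objective, n_trials=10)
print(study.best_params, study.best_value)

With three instructions and two demo sets there are only six possible programs; running ten TPE trials over that grid is barely distinguishable from enumerating it.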

The Smoking Gun #4: The Comments Reveal They Know It's BS

From the same files, look at their TODO comments:

python

# TODO: metrics should return an object with __bool__ basically, but fine if they're more complex.

# TODO: the max_rounds via branch_idx to get past the cache, not just temperature.

# TODO: Deal with the (pretty common) case of having a metric for filtering and a separate metric for eval.

# TODO: This function should take a max_budget and max_teacher_budget.
# Progressive elimination sounds about right: after 50 examples, drop bottom third...

They know they're just:

  • Using temperature to get different random outputs

  • Planning "progressive elimination" (i.e., dropping the lowest-scoring random candidates)

  • Not actually optimizing anything

The Most Damning Function: "Getting Past the Cache"

python

# Use a fresh rollout with temperature=1.0 to bypass caches
lm = lm.copy(rollout_id=round_idx, temperature=1.0) if round_idx > 0 else lm

They're literally just:

  1. Setting temperature to 1.0 (full sampling randomness)

  2. Using different "rollout_ids" to avoid caching

  3. Running the same prompt multiple times

  4. Keeping whatever randomly performs better

  5. Calling this "bootstrapping"
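
Why does changing rollout_id and the temperature "bypass caches"? Presumably because the cache key includes the sampling parameters, so any new value forces a fresh call. A minimal sketch of that mechanism, assuming a simple keyed cache (illustrative only, not DSPy's cache code):

python

import random

_cache = {}

def cached_generate(prompt, temperature=0.0, rollout_id=0):
    key = (prompt, temperature, rollout_id)  # rollout_id exists only to vary the key
    if key not in _cache:
        # stand-in for a real LM call; at temperature=1.0 each call is a fresh sample
        _cache[key] = f"{prompt} -> sample {random.random():.3f}"
    return _cache[key]

# Same prompt, different rollout_id: the cache is skipped and a new random sample comes back.
print(cached_generate("Same prompt", temperature=1.0, rollout_id=1))
print(cached_generate("Same prompt", temperature=1.0, rollout_id=2))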

The Cost of This Nonsense

From their actual code comments:

python

# TODO: FIXME: The max number of demos should be determined in part by the LM's tokenizer + max_length.
# As another option, we can just try a wide range and handle failures as penalties on the score.

Translation: "We don't even know how many examples to use, so we'll just try random amounts and see what happens."
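
The fallback that comment describes ("try a wide range and handle failures as penalties on the score") would look something like the sketch below; build_prompt, fits_context, and evaluate are hypothetical helpers, not DSPy code.

python

def pick_num_demos(candidate_counts, build_prompt, fits_context, evaluate, penalty=-1.0):
    """Try each demo count; score it, or penalize it if the prompt doesn't fit."""
    scored = []
    for k in candidate_counts:
        prompt = build_prompt(num_demos=k)
        score = evaluate(prompt) if fits_context(prompt) else penalty
        scored.append((score, k))
    return max(scored)[1]  # the demo count with the best (possibly penalized) score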

The Final Proof: Their "Best Score" Logic

python

if len(scores) == 0 or score > max(scores):
    print("New best score:", score, "for seed", seed)
    best_program = program

scores.append(score)
print(f"Scores so far: {scores}")
print(f"Best score so far: {max(scores)}")

This is literally:

python

if random_variation_scored_higher():
    claim_optimization_success()

What DSPy Actually Is

After examining the actual source code, DSPy is:

  1. Random search (they literally call it that in the class name)

  2. Through semantic noise (temperature=1.0, random shuffling)

  3. Evaluated by accidents (whatever happens to score higher)

  4. Wrapped in academic terminology ("Bayesian optimization", "bootstrapping")

  5. That costs real money (thousands of API calls at temperature=1.0)
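
As a back-of-the-envelope sketch of that cost (every number below is an assumption for illustration, not a measurement of DSPy):

python

candidate_programs = 16   # random-search candidates to evaluate
valset_size = 300         # validation examples scored per candidate
calls_per_example = 2     # LM calls (predictors) per example
compile_calls = candidate_programs * valset_size * calls_per_example
print(compile_calls)      # 9600 API calls for one compile, before any bootstrap sampling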

The code proves what we suspected: they're running if random() > previous_random(): print("optimization working!") and calling it a "framework for programming—not prompting—language models."

It's not optimization. It's expensive random number generation dressed up as computer science.
