SWEBench Automated Code-Repair Evaluation

Comprehensive analysis of 225 automated patch-generation attempts across 6 Python repositories

Key Insight

Just over half of all patch-generation attempts succeed (114/225, 50.7%), but 33.8% fail for non-quality reasons that require infrastructure fixes.

Action: Prioritize infrastructure fixes (token budget increases, source code availability, pipeline robustness) over model retraining to unlock quick wins.

Methodology

Data Source: cohort_outcomes_artifacts/cohort_comprehensive_metrics.csv

Metric Calculation: Outcomes are classified from the holistic_evaluator score and patch presence: patch_present (score ≥5.0 AND patch generated), poor_quality_patch (score <5.0 OR truncated), failed_other (no patch due to infrastructure issues), incomplete_patch (truncated mid-generation), and unknown (edge cases). Percentages are computed as count/225.
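
As a rough illustration, this classification can be reproduced with a short pandas pass over the metrics CSV. The column names (holistic_score, patch_generated, stop_reason) and the rule ordering are assumptions about the artifact schema, not the exact pipeline code:

```python
import pandas as pd

def classify_outcome(row):
    """One possible reading of the outcome rules above; column names are assumed."""
    truncated = row["stop_reason"] == "max_tokens"
    has_score = pd.notna(row["holistic_score"])
    if not row["patch_generated"]:
        return "failed_other"                       # no patch: infrastructure issue
    if truncated and not has_score:
        return "incomplete_patch"                   # cut off mid-generation
    if has_score and row["holistic_score"] >= 5.0 and not truncated:
        return "patch_present"
    if truncated or (has_score and row["holistic_score"] < 5.0):
        return "poor_quality_patch"
    return "unknown"                                # remaining edge cases

df = pd.read_csv("cohort_outcomes_artifacts/cohort_comprehensive_metrics.csv")
df["outcome"] = df.apply(classify_outcome, axis=1)
print(df["outcome"].value_counts(normalize=True).mul(100).round(1))  # % of 225 samples
```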

Visualization Type: Donut chart with center annotation showing success rate (patch_present / total). Uses Paul Tol SRON color palette per metadata requirements.

Metrics Methodology

Key Insight

Repository complexity creates a 51.4pp performance gap: simple schema libraries (marshmallow-code: 88.9%) vastly outperform complex scientific codebases (pyvista: 37.5%).

Action: Implement repository-specific token budgets and exploration strategies: increase budgets for scientific repos (pvlib, pydicom), optimize navigation for CLI tools (sqlfluff), and expand training data for under-sampled repos (pyvista).

Methodology

Data Source: cohort_outcomes_artifacts/cohort_comprehensive_metrics.csv

Aggregation Logic: Samples grouped by repository slug (extracted from instance_id). Pass rate calculated as (patch_present count / total samples per repo). Bars colored by performance tier: green (>70%), yellow (50-70%), red (<50%). Sample counts annotated on each bar.
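
A sketch of this per-repository aggregation, assuming instance_id values of the form 'repo__repo-NNNN' and an outcome column like the one derived in the earlier sketch:

```python
import pandas as pd

df = pd.read_csv("cohort_outcomes_artifacts/cohort_comprehensive_metrics.csv")
# The repository slug is assumed to be the prefix of instance_id before "__"
df["repo"] = df["instance_id"].str.split("__").str[0]

per_repo = (
    df.assign(passed=df["outcome"].eq("patch_present"))
      .groupby("repo")
      .agg(samples=("passed", "size"), pass_rate=("passed", "mean"))
      .sort_values("pass_rate", ascending=False)
)
per_repo["pass_rate"] = (per_repo["pass_rate"] * 100).round(1)

def tier(rate):
    # Bar-coloring tiers: green >70%, yellow 50-70%, red <50%
    return "green" if rate > 70 else ("yellow" if rate >= 50 else "red")

per_repo["tier"] = per_repo["pass_rate"].map(tier)
print(per_repo)
```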

Axes: X-axis shows pass rate percentage (0-100%), Y-axis lists repository names sorted by pass rate descending. 50% threshold line marks target performance level.

Key Insight

Scientific repos (pvlib: 30.2%, pydicom: 28.6%) suffer roughly double the 14.2% dataset-average truncation rate, while simple libraries see essentially none (marshmallow: 0%), directly driving failure rates.

Action: Immediately increase output token limits to 4096+ for scientific/medical repositories (pvlib, pydicom, pyvista) and implement adaptive budgets based on repository complexity.

Methodology

Data Source: cohort_outcomes_artifacts/cohort_comprehensive_metrics.csv (max_tokens_rate column) + improvement_roadmap analysis

Metric Formulas: Truncation rate = (count of samples with stop_reason='max_tokens' / total samples per repo). Failure rate = (fail_count / total samples per repo). The dashed line at 14.2% marks the dataset-wide average truncation rate.
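
The two formulas could be computed per repository along these lines; the stop_reason and outcome columns are assumed names, and the real pipeline may differ:

```python
import pandas as pd

df = pd.read_csv("cohort_outcomes_artifacts/cohort_comprehensive_metrics.csv")
df["repo"] = df["instance_id"].str.split("__").str[0]
g = df.groupby("repo")

rates = pd.DataFrame({
    # Truncation rate: share of a repo's samples whose generation stopped on max_tokens
    "truncation_rate": g["stop_reason"].apply(lambda s: s.eq("max_tokens").mean() * 100),
    # Failure rate: share of a repo's samples that did not yield a usable patch
    "failure_rate": g["outcome"].apply(
        lambda s: s.isin(["failed_other", "poor_quality_patch", "incomplete_patch"]).mean() * 100
    ),
})
dataset_avg = df["stop_reason"].eq("max_tokens").mean() * 100  # dashed reference line (~14.2%)
print(rates.round(1))
print(f"dataset-wide average truncation: {dataset_avg:.1f}%")
```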

Visualization: Grouped bars enable direct comparison of truncation vs overall failure, revealing correlation: repos with high truncation (pvlib: 30.2%) have proportionally high failure rates (44.4%).

Key Insight

Risk scoring enables 35.6% of patches (low-risk/high-quality) to bypass manual review, potentially cutting QA time by 30-40%.

Action: Develop and deploy automated risk-scoring model to route patches: low-risk → fast-track approval, medium-risk → automated tests + spot checks, high-risk → mandatory human review.

Methodology

Data Source: risk_quality_matrix_artifacts/sample_risk_mapping.csv

Axes: X-axis = risk_score (0-1 composite from syntactic_correctness, diff_quality, completeness flags). Y-axis = diff_score (0-1 from diff_quality_analyzer). Quadrant boundaries at risk=0.5 and quality=0.6 define four quadrants, grouped into three review zones: Safe (low risk + high quality), Medium (mixed risk/quality), and High-Risk (high risk + low quality).

Segmentation: Each point represents one sample. Colors indicate quadrant membership. Quadrant counts annotated (Low-Risk/High-Quality: 80 samples = 35.6%; High-Risk/Low-Quality: 86 samples = 38.2%).
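
A minimal sketch of the quadrant assignment, assuming the mapping CSV exposes risk_score and diff_score columns directly:

```python
import pandas as pd

df = pd.read_csv("risk_quality_matrix_artifacts/sample_risk_mapping.csv")

def quadrant(row):
    # Boundaries from the chart: risk_score = 0.5, diff_score (quality) = 0.6
    low_risk = row["risk_score"] < 0.5
    high_quality = row["diff_score"] >= 0.6
    if low_risk and high_quality:
        return "low_risk_high_quality"   # candidate for fast-track approval
    if not low_risk and not high_quality:
        return "high_risk_low_quality"   # mandatory human review
    return "medium"                      # mixed: automated tests + spot checks

df["quadrant"] = df.apply(quadrant, axis=1)
# Expected split per the chart: 80 low-risk/high-quality (35.6%), 86 high-risk/low-quality (38.2%)
print(df["quadrant"].value_counts())
```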

Detailed Analysis

Key Insight

70% of samples achieve at least partial success (correct understanding), but execution gaps (incomplete implementation, truncation) prevent full success.

Action: Focus improvement efforts on execution infrastructure (source code access, token limits, verification prompts) rather than comprehension tuning, which is already performing well.

Methodology

Data Source: failure_taxonomy analysis + canonical_tables artifacts

Hierarchical Structure: Top-level categories (Success, Partial Success, Failure) subdivide into specific patterns. Success splits by score tier (Excellent 9.0-10.0 vs Good 7.6-8.9). Partial Success splits by gap type (incomplete implementation, missing edge cases, integration gaps). Failure splits by root cause (truncation, hallucination, poor analysis).

Rectangle Sizing: Area represents sample count. Percentages computed as (category count / 225 total samples). Color shades indicate severity within each tier (darker = more severe subcategory).
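
A sketch of how the treemap input could be assembled, assuming a hypothetical flat export (failure_taxonomy.csv) with category and pattern columns; the path and column names are illustrative only:

```python
import pandas as pd

# Hypothetical export: one row per sample with its top-level category and subpattern
df = pd.read_csv("failure_taxonomy.csv")

sizes = df.groupby(["category", "pattern"]).size().rename("count").reset_index()
sizes["share"] = (sizes["count"] / len(df) * 100).round(1)  # % of all samples; drives rectangle area
print(sizes.sort_values(["category", "count"], ascending=[True, False]))
```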

Key Insight

marshmallow-code and pylint-dev excel across all dimensions (green column), while pyvista struggles universally (red column)—repository architecture, not domain knowledge, drives variance.

Action: Segment repositories by complexity tier and apply tailored strategies: Tier 1 (simple libs) → maintain current approach; Tier 2 (mid-complexity) → targeted fixes (exploration for sqlfluff, token budget for pvlib); Tier 3 (complex 3D/scientific) → comprehensive intervention (training data, budgets, prompts).

Methodology

Data Source: cohort_outcomes_artifacts/cohort_comprehensive_metrics.csv

Normalization: Each metric column (pass_rate, avg_quality, syntax_correctness, issue_understanding, exploration) normalized to 0-1 scale using min-max normalization: (value - min) / (max - min). This enables cross-metric comparison despite different scales.
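
The normalization step, sketched with pandas under the assumption that the listed metric columns exist in the CSV and are first aggregated per repository:

```python
import pandas as pd

metrics = ["pass_rate", "avg_quality", "syntax_correctness",
           "issue_understanding", "exploration"]  # column names assumed

df = pd.read_csv("cohort_outcomes_artifacts/cohort_comprehensive_metrics.csv")
df["repo"] = df["instance_id"].str.split("__").str[0]
per_repo = df.groupby("repo")[metrics].mean()          # one row per repository

# Min-max normalization: (value - min) / (max - min) maps each metric column to 0-1
norm = (per_repo - per_repo.min()) / (per_repo.max() - per_repo.min())
# Heatmap colors come from `norm`; cell labels display the raw `per_repo` values
```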

Color Scale: Diverging palette from red (0, poor) through white (0.5, medium) to green (1.0, excellent). Cell values show raw (non-normalized) scores for interpretability.

Key Insight

Successful patches cluster near origin (1-5 line changes), while failures scatter widely (10-500+ lines)—minimality correlates strongly with success.

Action: Update prompts to enforce minimality: add constraint "Limit patch to <10 lines unless absolutely necessary" and provide few-shot examples of successful minimal patches (samples 75, 175, 200) to guide model toward surgical fixes.

Methodology

Data Source: Patch statistics extracted from sample-level analysis + canonical_tables

Axes: X-axis = patch_lines_added (log scale), Y-axis = patch_lines_deleted (log scale). Log scaling handles wide range (1-500+ lines) while preserving detail in low-value region where most successes cluster.

Marginal Histograms: Top histogram shows distribution of lines_added, right histogram shows lines_deleted. These reveal that most patches are small (median 1-5 lines) with long tail of large outliers.

Color Coding: Green = Success (holistic >7.5), Yellow = Partial Success (5.0-7.5), Red = Failure (<5.0). Visual clustering reveals minimality-success correlation.
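
A matplotlib sketch of this layout (log-log scatter with marginal histograms and threshold-based colors); the data below are random placeholders standing in for the real patch statistics:

```python
import numpy as np
import matplotlib.pyplot as plt

# Placeholder data; real values come from the sample-level patch statistics.
rng = np.random.default_rng(0)
added = rng.integers(1, 500, 225)
deleted = rng.integers(1, 300, 225)
holistic = rng.uniform(0, 10, 225)

# Thresholds above: >7.5 success (green), 5.0-7.5 partial (yellow), <5.0 failure (red)
colors = np.where(holistic > 7.5, "tab:green",
                  np.where(holistic >= 5.0, "gold", "tab:red"))

fig = plt.figure(figsize=(7, 7))
grid = fig.add_gridspec(2, 2, width_ratios=(4, 1), height_ratios=(1, 4),
                        wspace=0.05, hspace=0.05)
ax = fig.add_subplot(grid[1, 0])
ax_top = fig.add_subplot(grid[0, 0], sharex=ax)
ax_right = fig.add_subplot(grid[1, 1], sharey=ax)

ax.scatter(added, deleted, c=colors, alpha=0.6)
ax.set_xscale("log")
ax.set_yscale("log")
ax.set_xlabel("patch_lines_added")
ax.set_ylabel("patch_lines_deleted")

bins = np.logspace(0, np.log10(500), 30)       # log-spaced bins to match the log axes
ax_top.hist(added, bins=bins, color="lightgray")
ax_right.hist(deleted, bins=bins, orientation="horizontal", color="lightgray")
plt.show()
```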

Key Insight

test_alignment (-37.5%), code_quality (-23.7%), and patch_correctness (-20.8%) degrade significantly over time, while task_understanding remains stable—context window limitations likely cause late-batch quality drops.

Action: Investigate root cause: (1) Analyze sample difficulty distribution across batches to rule out confounding; (2) Test shorter evaluation sessions or context resets between batches; (3) If context is the culprit, implement periodic context pruning or fresh sessions every N samples.

Methodology

Data Source: temporal_evolution_tracker analysis + holistic_evaluator dimension-level scores

Batch Definition: Samples divided into 9 chronological batches (batch 1 = samples 1-25, batch 2 = samples 26-50, etc.). Dimension scores extracted from holistic_evaluator nested JSON fields (test_alignment, patch_correctness, code_quality, task_understanding, approach_quality).

Trend Analysis: Mean score per batch computed for each dimension. Early period (batches 1-4) vs Late period (batches 5-9) comparison reveals degradation patterns. Solid lines = execution metrics (degrade), dashed lines = comprehension metrics (stable).
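
A sketch of the batch-trend computation, assuming a hypothetical flattened export (holistic_dimension_scores.csv) with one chronologically ordered row per sample and one column per dimension:

```python
import pandas as pd

dims = ["test_alignment", "patch_correctness", "code_quality",
        "task_understanding", "approach_quality"]
df = pd.read_csv("holistic_dimension_scores.csv")   # hypothetical flattened export

df["batch"] = df.index // 25 + 1                    # batch 1 = samples 1-25, ..., batch 9 = 201-225
batch_means = df.groupby("batch")[dims].mean()

early = batch_means.loc[1:4].mean()                 # early period: batches 1-4
late = batch_means.loc[5:9].mean()                  # late period: batches 5-9
change_pct = ((late - early) / early * 100).round(1)
print(change_pct.sort_values())                     # negative values indicate degradation
```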

Key Insight

holistic_evaluator produces valid scores only 17.8% of the time (82% failure rate), with 27.6% truncation causing most losses—urgent token budget increase required.

Action: URGENT: Increase holistic_evaluator token limit from current (~2K) to 4K+; optimize prompt to output structured score before verbose analysis; implement score validation to detect null scores and trigger retries; rerun all 185 missing samples after fixes.

Methodology

Data Source: tooling_reliability audit + metadata_health_check

Score Availability: Percentage of samples with non-null score for each evaluator. Calculated as (samples with valid score / 225 total). holistic_evaluator: 17.8% (only 40/225 have scores).

Truncation Rate: Percentage of samples where evaluator hit max_tokens limit. holistic_evaluator: 27.6% (62/225 truncated), causing score loss even when status='completed'.

Error Rate: Percentage of samples with detected errors in evaluator output (malformed JSON, validation failures, etc.).
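
These three reliability rates could be computed from a long-format audit table along these lines; the file name and columns (evaluator, score, stop_reason, error) are assumed:

```python
import pandas as pd

# Hypothetical audit table: one row per (sample, evaluator) pair
audit = pd.read_csv("tooling_reliability_audit.csv")

summary = audit.groupby("evaluator").agg(
    score_availability=("score", lambda s: s.notna().mean() * 100),
    truncation_rate=("stop_reason", lambda s: s.eq("max_tokens").mean() * 100),
    error_rate=("error", lambda s: s.notna().mean() * 100),
).round(1)
print(summary)  # e.g. holistic_evaluator should show ~17.8% availability, ~27.6% truncation
```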

Key Insight

Dataset metadata is 100% complete for all critical fields (lifecycle, display, reasoning), but 19.6% of samples have null exploration_backtracking_inspector scores—targeted rerun needed.

Action: Rerun 44 samples with null exploration_backtracking_inspector scores and 15 samples with null diff_quality_analyzer scores after fixing underlying evaluator issues (see tooling_reliability recommendations). Document null score conditions in schema to clarify when null is expected vs error.

Methodology

Data Source: metadata_health_check artifacts + sample-level schema validation

Field Categories: Six categories of metadata fields: Top-level (index, status), Lifecycle (timestamps), Sample Content (query, answer), Metric Display (formatted values), Metric Reasoning (explanations), Metric Scores (numeric outputs).

Coverage Calculation: Perfect = 100% non-null across 225 samples, Partial = some nulls present, Missing = systematically absent. Metric Scores category shows 80% perfect (most evaluators), 14% partial (exploration_backtracking_inspector: 19.6% null), 6% missing (diff_quality_analyzer: 6.7% null).
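
A sketch of the coverage calculation over a hypothetical flattened sample export; field names are whatever the schema actually exposes:

```python
import pandas as pd

samples = pd.read_json("samples.jsonl", lines=True)    # hypothetical flattened export, 225 rows

coverage = samples.notna().mean() * 100                # % non-null per field

def status(pct):
    # Perfect = no nulls; Partial = some nulls; Missing = all null (or field absent everywhere)
    return "Perfect" if pct == 100 else ("Missing" if pct == 0 else "Partial")

report = pd.DataFrame({"coverage_pct": coverage.round(1),
                       "status": coverage.map(status)})
print(report.sort_values("coverage_pct"))
```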

Key Insight

Excellent samples (28/225, score 9.0+) almost universally exhibit minimality (95%), proper diff format (100%), and surgical precision (98%)—these are the gold standards to replicate.

Action: Update prompts to explicitly require the 4 core success traits (minimality, precision, diff format, comments); include 3-4 Excellent-tier samples as few-shot examples; add post-generation validation to reject patches missing these traits (e.g., flag patches >10 lines, improper diff format).

Methodology

Data Source: failure_taxonomy + success_pattern_extractor qualitative analysis

Dimensions: Six success-pattern traits extracted from top-performing samples: Minimality (1-5 line patches), Surgical Precision (exact root cause targeting), Clear Reasoning (explicit explanations), Proper Diff Format (unified diff with context), No Scope Creep (focused on issue only), Inline Comments (code explanations).

Scoring: For each quality tier (Excellent, Good, Fair, Poor), percentage of samples exhibiting each trait computed. Excellent tier (28 samples, holistic 9.0-10.0): 95% minimal, 98% precise, 100% proper format. Poor tier (<5.0): 30% minimal, 35% precise, showing clear degradation.
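
A sketch of the tier-by-trait scoring, assuming a hypothetical annotation table with boolean trait flags and the holistic score for each sample:

```python
import pandas as pd

traits = ["minimality", "surgical_precision", "clear_reasoning",
          "proper_diff_format", "no_scope_creep", "inline_comments"]
df = pd.read_csv("success_pattern_annotations.csv")   # hypothetical annotation export

def tier(score):
    # Tier boundaries from the holistic score ranges used above
    if score >= 9.0:
        return "Excellent"
    if score >= 7.6:
        return "Good"
    if score >= 5.0:
        return "Fair"
    return "Poor"

df["tier"] = df["holistic_score"].map(tier)
prevalence = df.groupby("tier")[traits].mean().mul(100).round(0)  # % of samples showing each trait
print(prevalence)
```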