Comprehensive analysis of 225 automated patch-generation attempts across 6 Python repositories
Roughly half of all patch-generation attempts succeed (114/225, 50.7%), while 33.8% fail for non-quality reasons that require infrastructure fixes.
Data Source: cohort_outcomes_artifacts/cohort_comprehensive_metrics.csv
Metric Calculation: Outcome classification is based on the holistic_evaluator score and patch presence: patch_present (score ≥5.0 AND a patch was generated), poor_quality_patch (score <5.0 OR truncated), failed_other (no patch due to infrastructure issues), incomplete_patch (truncated mid-generation), unknown (edge cases). Percentages are computed as count/225; a classification sketch follows this figure's description.
Visualization Type: Donut chart with center annotation showing success rate (patch_present / total). Uses Paul Tol SRON color palette per metadata requirements.
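A minimal sketch of the outcome classification described above, in Python with pandas. The column names (holistic_score, patch_generated, truncated) and the precedence of the rules are assumptions about cohort_comprehensive_metrics.csv, not confirmed schema details.

```python
import pandas as pd

df = pd.read_csv("cohort_outcomes_artifacts/cohort_comprehensive_metrics.csv")

def classify(row):
    # Ordering of checks is an assumption; the source only lists the categories.
    if not row.get("patch_generated", False):
        return "failed_other"          # no patch due to infrastructure issues
    if row.get("truncated", False):
        return "incomplete_patch"      # truncated mid-generation
    if pd.isna(row.get("holistic_score")):
        return "unknown"               # edge cases without a score
    if row["holistic_score"] >= 5.0:
        return "patch_present"
    return "poor_quality_patch"        # score < 5.0

outcomes = df.apply(classify, axis=1)
print((outcomes.value_counts() / len(df) * 100).round(1))  # percentages of 225
```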
Repository complexity creates a 51.4pp performance gap: simple schema libraries (marshmallow-code: 88.9%) vastly outperform complex scientific codebases (pyvista: 37.5%).
Data Source: cohort_outcomes_artifacts/cohort_comprehensive_metrics.csv
Aggregation Logic: Samples grouped by repository slug (extracted from instance_id). Pass rate calculated as (patch_present count / total samples per repo). Bars colored by performance tier: green (>70%), yellow (50-70%), red (<50%). Sample counts annotated on each bar.
Axes: X-axis shows pass rate percentage (0-100%), Y-axis lists repository names sorted by pass rate descending. 50% threshold line marks target performance level.
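A sketch of the per-repository aggregation. The instance_id format ("owner__repo-NNN") and the per-sample outcome column are assumptions, not confirmed fields.

```python
import pandas as pd

df = pd.read_csv("cohort_outcomes_artifacts/cohort_comprehensive_metrics.csv")
df["repo"] = df["instance_id"].str.split("__").str[0]          # repository slug (assumed format)

per_repo = df.groupby("repo").agg(
    total=("instance_id", "size"),
    passed=("outcome", lambda s: (s == "patch_present").sum()),
)
per_repo["pass_rate"] = (per_repo["passed"] / per_repo["total"] * 100).round(1)
print(per_repo.sort_values("pass_rate", ascending=False))      # bar order in the chart
```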
Scientific repos (pvlib: 30.2%, pydicom: 28.6%) run at roughly twice the dataset-wide truncation rate of 14.2%, while simple libs see none (marshmallow: 0%); truncation directly drives failure rates.
Data Source: cohort_outcomes_artifacts/cohort_comprehensive_metrics.csv (max_tokens_rate column) + improvement_roadmap analysis
Metric Formulas: Truncation rate = (count of samples with stop_reason='max_tokens' / total samples per repo). Failure rate = (fail_count / total samples). Dashed line at 14.2% represents dataset-wide average truncation rate.
Visualization: Grouped bars enable direct comparison of truncation vs. overall failure and reveal a correlation: repos with high truncation (pvlib: 30.2%) have proportionally high failure rates (44.4%).
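A sketch of the two rates compared in the chart. The stop_reason column comes from the description above; treating everything other than patch_present as a failure is an assumption about how fail_count is defined.

```python
import pandas as pd

df = pd.read_csv("cohort_outcomes_artifacts/cohort_comprehensive_metrics.csv")
df["repo"] = df["instance_id"].str.split("__").str[0]

df["is_truncated"] = df["stop_reason"] == "max_tokens"
df["is_failure"] = df["outcome"] != "patch_present"     # assumed fail_count definition

rates = (df.groupby("repo")[["is_truncated", "is_failure"]].mean() * 100).round(1)
dataset_avg = df["is_truncated"].mean() * 100           # dashed reference line (14.2%)
print(rates)
print(f"dataset-wide truncation: {dataset_avg:.1f}%")
```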
Risk scoring enables 35.6% of patches (low-risk/high-quality) to bypass manual review, potentially cutting QA time by 30-40%.
Data Source: risk_quality_matrix_artifacts/sample_risk_mapping.csv
Axes: X-axis = risk_score (0-1 composite from syntactic_correctness, diff_quality, completeness flags). Y-axis = diff_score (0-1 from diff_quality_analyzer). Quadrant boundaries at risk=0.5 and quality=0.6 define four zones: Safe (low risk + high quality), two Mixed zones (one dimension favorable), and High-Risk (high risk + low quality); a quadrant-assignment sketch follows this figure's description.
Segmentation: Each point represents one sample. Colors indicate quadrant membership. Quadrant counts annotated (Low-Risk/High-Quality: 80 samples = 35.6%; High-Risk/Low-Quality: 86 samples = 38.2%).
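A sketch of the quadrant assignment using the boundaries above; the risk_score and diff_score column names are assumed to match sample_risk_mapping.csv.

```python
import pandas as pd

df = pd.read_csv("risk_quality_matrix_artifacts/sample_risk_mapping.csv")

def quadrant(row, risk_cut=0.5, quality_cut=0.6):
    low_risk = row["risk_score"] < risk_cut
    high_quality = row["diff_score"] >= quality_cut
    if low_risk and high_quality:
        return "low_risk_high_quality"   # candidates to bypass manual review
    if not low_risk and not high_quality:
        return "high_risk_low_quality"
    return "mixed"

df["quadrant"] = df.apply(quadrant, axis=1)
print((df["quadrant"].value_counts(normalize=True) * 100).round(1))
```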
70% of samples achieve at least partial success (correct understanding), but execution gaps (incomplete implementation, truncation) prevent full success.
Data Source: failure_taxonomy analysis + canonical_tables artifacts
Hierarchical Structure: Top-level categories (Success, Partial Success, Failure) subdivide into specific patterns. Success splits by score tier (Excellent 9.0-10.0 vs Good 7.6-8.9). Partial Success splits by gap type (incomplete implementation, missing edge cases, integration gaps). Failure splits by root cause (truncation, hallucination, poor analysis).
Rectangle Sizing: Area represents sample count. Percentages computed as (category count / 225 total samples). Color shades indicate severity within each tier (darker = more severe subcategory).
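A sketch of how the rectangle sizes could be derived, assuming each sample has already been labeled with a category and subcategory by the failure-taxonomy analysis; the labels file name is hypothetical.

```python
import pandas as pd
import plotly.express as px

labeled = pd.read_csv("failure_taxonomy_labels.csv")     # hypothetical file name
sizes = (
    labeled.groupby(["category", "subcategory"]).size()
    .rename("count").reset_index()
)
sizes["pct_of_total"] = (sizes["count"] / len(labeled) * 100).round(1)  # count / 225

# Treemap: area proportional to sample count, nested category -> subcategory.
fig = px.treemap(sizes, path=["category", "subcategory"], values="count")
```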
marshmallow-code and pylint-dev excel across all dimensions (green column), while pyvista struggles universally (red column); repository architecture, not domain knowledge, drives the variance.
Data Source: cohort_outcomes_artifacts/cohort_comprehensive_metrics.csv
Normalization: Each metric column (pass_rate, avg_quality, syntax_correctness, issue_understanding, exploration) normalized to 0-1 scale using min-max normalization: (value - min) / (max - min). This enables cross-metric comparison despite different scales.
Color Scale: Diverging palette from red (0, poor) through white (0.5, medium) to green (1.0, excellent). Cell values show raw (non-normalized) scores for interpretability.
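A sketch of the per-metric min-max normalization. The column names follow the list above, and the per-repository summary file (repo_summary.csv) is a hypothetical stand-in for however the per-repo table is materialized.

```python
import pandas as pd

per_repo = pd.read_csv("repo_summary.csv")   # hypothetical per-repository table
metrics = ["pass_rate", "avg_quality", "syntax_correctness",
           "issue_understanding", "exploration"]

# (value - min) / (max - min) per column; cell colors use the normalized values,
# cell labels use the raw scores.
normalized = per_repo[metrics].apply(
    lambda col: (col - col.min()) / (col.max() - col.min())
)
```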
Successful patches cluster near the origin (1-5 line changes), while failures scatter widely (10-500+ lines); minimality correlates strongly with success.
Data Source: Patch statistics extracted from sample-level analysis + canonical_tables
Axes: X-axis = patch_lines_added (log scale), Y-axis = patch_lines_deleted (log scale). Log scaling handles wide range (1-500+ lines) while preserving detail in low-value region where most successes cluster.
Marginal Histograms: Top histogram shows the distribution of lines_added, right histogram shows lines_deleted. These reveal that most patches are small (median 1-5 lines) with a long tail of large outliers.
Color Coding: Green = Success (holistic >7.5), Yellow = Partial Success (5.0-7.5), Red = Failure (<5.0). Visual clustering reveals the minimality-success correlation.
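A sketch of the tier coloring and log-scale scatter; the per-sample columns (patch_lines_added, patch_lines_deleted, holistic_score) and the file path are assumptions.

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("canonical_tables/patch_statistics.csv")   # hypothetical path

def tier(score):
    if score > 7.5:
        return "success"
    if score >= 5.0:
        return "partial"
    return "failure"

df["tier"] = df["holistic_score"].apply(tier)
colors = {"success": "green", "partial": "gold", "failure": "red"}

fig, ax = plt.subplots()
for name, group in df.groupby("tier"):
    ax.scatter(group["patch_lines_added"], group["patch_lines_deleted"],
               label=name, color=colors[name], alpha=0.6)
ax.set_xscale("log")   # log axes handle the 1-500+ line range
ax.set_yscale("log")
ax.set_xlabel("lines added")
ax.set_ylabel("lines deleted")
ax.legend()
```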
test_alignment (-37.5%), code_quality (-23.7%), and patch_correctness (-20.8%) degrade significantly over time, while task_understanding remains stable; context window limitations likely cause the late-batch quality drops.
Data Source: temporal_evolution_tracker analysis + holistic_evaluator dimension-level scores
Batch Definition: Samples divided into 9 chronological batches (batch 1 = samples 1-25, batch 2 = samples 26-50, etc.). Dimension scores extracted from holistic_evaluator nested JSON fields (test_alignment, patch_correctness, code_quality, task_understanding, approach_quality).
Trend Analysis: Mean score per batch computed for each dimension. Early period (batches 1-4) vs Late period (batches 5-9) comparison reveals degradation patterns. Solid lines = execution metrics (degrade), dashed lines = comprehension metrics (stable).
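A sketch of the batch construction and early-versus-late comparison, assuming rows are already in chronological order and the dimension scores have been flattened into columns (an assumption about the extracted table).

```python
import pandas as pd

df = pd.read_csv("cohort_outcomes_artifacts/cohort_comprehensive_metrics.csv")
dims = ["test_alignment", "patch_correctness", "code_quality",
        "task_understanding", "approach_quality"]

df["batch"] = df.index // 25 + 1                      # 9 chronological batches of 25
per_batch = df.groupby("batch")[dims].mean()          # mean score per batch per dimension

early = per_batch.loc[1:4].mean()                     # batches 1-4
late = per_batch.loc[5:9].mean()                      # batches 5-9
pct_change = ((late - early) / early * 100).round(1)  # e.g. test_alignment around -37.5%
print(pct_change)
```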
holistic_evaluator produces valid scores only 17.8% of the time (an 82.2% failure rate), with 27.6% truncation causing most of the losses; an urgent token budget increase is required.
Data Source: tooling_reliability audit + metadata_health_check
Score Availability: Percentage of samples with non-null score for each evaluator. Calculated as (samples with valid score / 225 total). holistic_evaluator: 17.8% (only 40/225 have scores).
Truncation Rate: Percentage of samples where evaluator hit max_tokens limit. holistic_evaluator: 27.6% (62/225 truncated), causing score loss even when status='completed'.
Error Rate: Percentage of samples with detected errors in evaluator output (malformed JSON, validation failures, etc.).
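A sketch of the three reliability metrics for holistic_evaluator. The column names (holistic_score, holistic_stop_reason, holistic_error) are assumptions about the flattened audit table, not confirmed fields.

```python
import pandas as pd

df = pd.read_csv("cohort_outcomes_artifacts/cohort_comprehensive_metrics.csv")

availability = df["holistic_score"].notna().mean() * 100                # non-null scores
truncation = (df["holistic_stop_reason"] == "max_tokens").mean() * 100  # hit token limit
errors = df["holistic_error"].notna().mean() * 100                      # malformed output, validation failures

print(f"availability {availability:.1f}%, truncation {truncation:.1f}%, errors {errors:.1f}%")
```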
Dataset metadata is 100% complete for all critical fields (lifecycle, display, reasoning), but 19.6% of samples have null exploration_backtracking_inspector scores; a targeted rerun is needed.
Data Source: metadata_health_check artifacts + sample-level schema validation
Field Categories: Six categories of metadata fields: Top-level (index, status), Lifecycle (timestamps), Sample Content (query, answer), Metric Display (formatted values), Metric Reasoning (explanations), Metric Scores (numeric outputs).
Coverage Calculation: Perfect = 100% non-null across 225 samples, Partial = some nulls present, Missing = systematically absent. Metric Scores category shows 80% perfect (most evaluators), 14% partial (exploration_backtracking_inspector: 19.6% null), 6% missing (diff_quality_analyzer: 6.7% null).
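A sketch of the per-field coverage classification; the thresholds follow the definitions above, and the mapping of fields to the six categories is omitted for brevity.

```python
import pandas as pd

df = pd.read_csv("cohort_outcomes_artifacts/cohort_comprehensive_metrics.csv")

def coverage(col: pd.Series) -> str:
    null_rate = col.isna().mean()
    if null_rate == 0.0:
        return "Perfect"            # 100% non-null across all 225 samples
    if null_rate < 1.0:
        return "Partial"            # some nulls present
    return "Missing"                # systematically absent

report = pd.DataFrame({
    "null_pct": (df.isna().mean() * 100).round(1),
    "coverage": df.apply(coverage),   # applied column-wise
})
print(report.sort_values("null_pct", ascending=False))
```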
Excellent samples (28/225, score 9.0+) near-universally exhibit minimality (95%), surgical precision (98%), and proper diff format (100%); these are the gold standards to replicate.
Data Source: failure_taxonomy + success_pattern_extractor qualitative analysis
Dimensions: Six success-pattern traits extracted from top-performing samples: Minimality (1-5 line patches), Surgical Precision (exact root cause targeting), Clear Reasoning (explicit explanations), Proper Diff Format (unified diff with context), No Scope Creep (focused on issue only), Inline Comments (code explanations).
Scoring: For each quality tier (Excellent, Good, Fair, Poor), the percentage of samples exhibiting each trait is computed. Excellent tier (28 samples, holistic 9.0-10.0): 95% minimal, 98% precise, 100% proper format. Poor tier (<5.0): 30% minimal, 35% precise, showing clear degradation.
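A sketch of the per-tier trait prevalence, assuming the success_pattern_extractor emits boolean trait flags per sample; the file path and flag names are hypothetical.

```python
import pandas as pd

labels = pd.read_csv("success_pattern_extractor/sample_traits.csv")  # hypothetical path
traits = ["is_minimal", "is_precise", "clear_reasoning",
          "proper_diff_format", "no_scope_creep", "inline_comments"]  # hypothetical boolean flags

def tier(score):
    # Tier cut points follow the thresholds used elsewhere in this report.
    if score >= 9.0:
        return "Excellent"
    if score >= 7.6:
        return "Good"
    if score >= 5.0:
        return "Fair"
    return "Poor"

labels["tier"] = labels["holistic_score"].apply(tier)
prevalence = (labels.groupby("tier")[traits].mean() * 100).round(0)
print(prevalence)   # one row per tier, one column per trait, values in percent
```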