BrowseComp-Plus Deep-Research Agent Evaluation

Interactive Performance Dashboard – 352 Completed Queries

πŸ“ Metrics Methodology

How Key Metrics Were Computed

  • Answer Correctness Rate: Percentage of queries where the agent's extracted final answer matches the human-verified gold answer. Computed by the browsecomp_failure_analyzer scorer using exact match and semantic equivalence. Target: 80%.
  • Retrieval Recall: Percentage of gold (ground truth) documents successfully retrieved by the agent. Calculated as (# gold docs in retrieved set) / (# total gold docs). The 80% recall threshold separates retrieval errors from reasoning errors.
  • Retrieval Error: Failures where the agent did not retrieve sufficient relevant documents (retrieval recall < 80%). Indicates "didn't find the information" failures.
  • Reasoning Error: Failures where the agent retrieved adequate information (recall ≥ 80%) but failed to synthesize correctly. Indicates "found it but didn't use correctly" failures.
  • Query Complexity: Number of criteria that must be satisfied to answer correctly. Assigned by browsecomp_query_categorizer: Simple (1-2 criteria), Medium (3-4 criteria), Complex (5+ criteria).
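
The snippet below is a minimal sketch of these definitions, assuming per-query records with retrieved document IDs, gold document IDs, a criteria count, and a correctness flag. It is illustrative only and not the actual browsecomp_failure_analyzer or browsecomp_query_categorizer implementation.

```python
# Illustrative only: field names and signatures are assumptions, not the real scorer code.

def retrieval_recall(retrieved_ids: set, gold_ids: set) -> float:
    """(# gold docs in retrieved set) / (# total gold docs)."""
    if not gold_ids:
        return 0.0
    return len(retrieved_ids & gold_ids) / len(gold_ids)

def classify_failure(recall: float, threshold: float = 0.80) -> str:
    """Separate "didn't find it" (retrieval) from "found it but didn't use it" (reasoning)."""
    return "retrieval_error" if recall < threshold else "reasoning_error"

def complexity_bucket(num_criteria: int) -> str:
    """Simple (1-2 criteria), Medium (3-4), Complex (5+)."""
    if num_criteria <= 2:
        return "Simple"
    if num_criteria <= 4:
        return "Medium"
    return "Complex"

def correctness_rate(is_correct_flags: list) -> float:
    """Share of queries whose extracted answer matched the gold answer."""
    return sum(is_correct_flags) / len(is_correct_flags) if is_correct_flags else 0.0
```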

Data Sources

  • Scoring Agents: browsecomp_failure_analyzer (correctness, retrieval recall, error classification), browsecomp_query_categorizer (topic, knowledge type, complexity)
  • Aggregators: 12+ specialized analyzers compute distributions, correlations, and cross-tabulations across taxonomy dimensions
  • Sample Size: 352 completed queries out of 830 total (42.4% coverage)

📈 Executive Summary

Key Insight: Deep-Research Agent achieves 70.74% correctness with strong retrieval (78.63% recall), falling short of the 80% target by 9.3 percentage points.
Action: Prioritize the top 5 improvement recommendations to close the gap to 80% target correctness.

What This Shows

This gauge displays the agent's overall answer correctness rate (70.74%) against a target threshold of 80%. The gauge needle indicates current performance, with color zones representing performance levels: red (<60%), yellow (60-75%), blue (>75%). Data from 352 completed BrowseComp-Plus queries.

Calculation

Correctness Rate = 249 correct / 352 total = 70.74%. Computed by comparing the agent's extracted final answer to the human-verified gold answer using the browsecomp_failure_analyzer scorer. Source: actionable_improvement_priorities.md

Key Insight: Retrieval errors (33.7%) far outnumber reasoning errors (9.5%): the agent more often fails to find the right documents than to synthesize correctly from documents it did retrieve.
Action: Prioritize retrieval improvements (search budgets, query formulation, iterative refinement) to address the larger failure category first.

What This Shows

This donut chart shows the distribution of 155 classified failures: Retrieval Errors (121 failures, 33.7%), where the agent failed to find relevant documents (retrieval recall <80%), and Reasoning Errors (34 failures, 9.5%), where adequate information was retrieved but synthesis failed.

Calculation

Retrieval Error: Failures with retrieval recall < 80%. Reasoning Error: Failures with retrieval recall ≥ 80%. The 80% threshold distinguishes "didn't find it" from "found it but used it wrong." Source: failure_mode_deep_dive.md, failure_mode_summary.csv
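
As a rough sketch of how this split could be tabulated from per-query scores (the record layout here is an assumption; the published numbers come from failure_mode_summary.csv):

```python
# Sketch only: assumes each record has 'correct' (bool) and 'recall' (0-1 float).

def summarize_failures(records, threshold=0.80):
    """Count failures per error type and the average recall within each."""
    buckets = {"retrieval_error": [], "reasoning_error": []}
    for rec in records:
        if rec["correct"]:
            continue  # only failed queries are classified
        kind = "retrieval_error" if rec["recall"] < threshold else "reasoning_error"
        buckets[kind].append(rec["recall"])
    return {
        kind: {
            "count": len(recalls),
            "avg_recall": sum(recalls) / len(recalls) if recalls else 0.0,
        }
        for kind, recalls in buckets.items()
    }
```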

Key Insight: Strong positive correlation: queries with <20% recall have 0% correctness, while queries with 80%+ recall achieve 87.8% correctness.
Action: Implement adaptive search budgets and query refinement strategies to move more queries from low-recall to high-recall buckets.

What This Shows

This bar chart displays answer correctness rate (y-axis) across retrieval recall ranges (x-axis). Each bar represents queries grouped by how many gold documents they successfully retrieved. Bar colors shift from red (poor performance) to green (strong performance). Sample counts shown on bars. Data from 280 queries with correlation r=0.567 (p<0.000001).

Calculation

Correctness Rate per bucket: (# correct answers in bucket) / (# total queries in bucket). Retrieval Recall: (# gold docs retrieved) / (# total gold docs). Source: retrieval_recall_analysis.md, recall_distribution.csv
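
A minimal sketch of the per-bucket aggregation behind this chart, assuming per-query records carrying a recall value and a correctness flag; the bucket width is a parameter here, while the chart's exact binning comes from recall_distribution.csv.

```python
# Sketch only: groups queries into recall ranges and computes correctness per range.

def correctness_by_recall_bucket(records, width=0.20):
    buckets = {}
    n_bins = int(round(1 / width))
    for rec in records:
        idx = min(int(rec["recall"] / width), n_bins - 1)  # clamp recall == 1.0 into the top bin
        lo = round(idx * width, 2)
        buckets.setdefault((lo, round(lo + width, 2)), []).append(rec["correct"])
    return {
        bucket: {"n": len(flags), "correctness": sum(flags) / len(flags)}
        for bucket, flags in sorted(buckets.items())
    }
```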

Key Insight: S1 (search budgets not well-adapted to query complexity) offers the best impact-to-effort trade-off: ~21 additional correct answers at low implementation difficulty.
Action: Implement S1 immediately as a quick win, then proceed to the next highest-priority improvements.

What This Shows

This bubble chart plots the top 5 improvement recommendations by implementation difficulty (x-axis, 1-3 scale) against expected lift in correct answers (y-axis). Bubble size represents the number of queries affected. Color indicates difficulty (green=low, blue=medium, yellow=high). Data from the actionable improvement priorities analysis.

Calculation

Expected Lift: Estimated additional correct answers if improvement is implemented, based on historical patterns and affected query analysis. Priority Score: Weighted combination of expected lift, affected queries, and inverse difficulty. Source: improvement_priorities.json

πŸ” Failure Mode Analysis

Key Insight: Retrieval errors (33.7%) dominate over reasoning errors (9.5%): most failures come from not finding the right documents rather than from misusing documents that were found.
Action: Allocate engineering resources in proportion to the split: roughly 34% to retrieval improvements and 9% to reasoning improvements.

Calculation: Retrieval errors have recall <80% (avg 7.8%), reasoning errors have recall ≥80% (avg 94.1%). The ~86-point gap validates the threshold. Source: failure_mode_summary.csv

Key Insight: Within retrieval errors, 'Insufficient Coverage' dominates (17 failures, 37.0%). Within reasoning errors, 'Information Extraction' is primary (20 failures, 58.8%).
Action: Implement explicit extraction prompts to fix 'Information Extraction' errors, and deploy semantic search + query expansion for 'Insufficient Coverage'.

Calculation: Subcategories identified through qualitative analysis of failure explanations. Rectangle size = failure count. Source: retrieval_error_subcategories.csv, reasoning_error_subcategories.csv

Key Insight: The 80% threshold cleanly separates error types: retrieval errors average 7.8% recall, reasoning errors average 94.1% recall, an 86-percentage-point gap that validates the threshold.
Action: Use the 80% threshold with confidence for error classification in production evaluation pipelines.

Calculation: Average retrieval recall within each outcome group. Correlation r=0.567 (p<0.000001) confirms strong relationship between recall and correctness. Source: failure_mode_summary.csv

📊 Query Characteristics & Performance

Key Insight: Literature (56% correct, 32 queries) and Science & technology (57% correct, 35 queries) are high-volume underperformers.
Action: Focus improvement efforts on Literature and Science & technology topics to maximize impact on overall performance.

Calculation: Correctness rate per topic = (# correct in topic) / (# total queries in topic). Only topics with ≥5 queries shown to avoid noise. Bar color indicates performance zone. Source: topic_performance.csv

Key Insight: Analytical queries show a critical gap: only 25% correct with 30% retrieval recall, but with only 4 samples (1.1% of the dataset) the category is severely underrepresented.
Action: Expand the evaluation dataset with 35-50 Analytical queries to enable comprehensive knowledge type assessment.

Calculation: Metrics aggregated by knowledge_type dimension from query categorization. Knowledge types: Factual (specific facts), Biographical (people's backgrounds), Analytical (comparison/synthesis), Procedural (how-to). Source: knowledge_type_performance.csv

Key Insight: 99.1% of queries are Complex (5+ criteria), 0.9% Medium, and 0% Simple; this extreme imbalance prevents evaluating baseline capabilities.
Action: Immediately add 60-80 Simple and 110-130 Medium complexity queries to achieve balanced evaluation coverage.

Calculation: Complexity assigned by query categorizer based on constraint count. Simple=1-2 criteria, Medium=3-4, Complex=5+. Balance score: 0.072 (critical, target ≥0.6). Source: complexity_distribution.csv

🔎 Retrieval Quality & Impact

Key Insight: Bimodal distribution: 29 queries near 0% recall (complete retrieval failure), 140 queries at 90-100% (strong retrieval), a clear separation.
Action: Focus on the 0-20% bucket (29 queries): implement multi-angle search strategies to rescue complete retrieval failures.

Calculation: Queries grouped by retrieval recall into 10% buckets. Recall = (# gold docs retrieved) / (# total gold docs). Source: recall_distribution.csv

Key Insight: Strong correlation (r=0.567, p<0.000001): queries with higher retrieval recall are significantly more likely to be answered correctly.
Action: Justify investment in retrieval infrastructure using this correlation as evidence that retrieval improvements will translate to correctness gains.

Calculation: Each point is one query plotted by recall (x) and correctness (y, 0 or 1). Trendline shows linear regression. Correlation coefficient r=0.567 means ~32% of correctness variance explained by recall alone. Source: recall_distribution.csv
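
Because correctness is a 0/1 outcome, this is effectively a point-biserial correlation (Pearson r with a binary variable). A self-contained sketch of the computation, with assumed inputs:

```python
import math

# Sketch only: xs = per-query recall values, ys = per-query correctness (0 or 1).
def pearson_r(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)  # assumes neither variable is constant

# r = 0.567 implies r**2 ≈ 0.32, i.e. ~32% of correctness variance
# is explained by retrieval recall alone.
```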

Key Insight: Correctness jumps from 0% (0-20% recall) to 83% (80-100% recall); even partial retrieval significantly improves outcomes.
Action: Set retrieval quality gates: require minimum 40-60% recall before attempting answer generation; target 80%+ recall for high-confidence queries.

Calculation: Queries grouped by recall bucket. Correctness rate = (# correct in bucket) / (# queries in bucket). Shows performance cliff below 40% recall and plateau above 80%. Source: gold_doc_utilization.csv
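
A hedged sketch of such a gate: the 40% and 80% cutoffs are taken from the recommendation above, while everything else (including how recall would be estimated at answer time, since gold documents are unknown in production) is an assumption.

```python
# Sketch only: 'estimated_recall' would have to be a proxy (e.g., a coverage
# estimate), because true recall requires knowing the gold documents.

def retrieval_gate(estimated_recall: float) -> str:
    if estimated_recall < 0.40:
        return "refine_search"          # below the performance cliff: keep retrieving
    if estimated_recall < 0.80:
        return "answer_low_confidence"  # partial coverage: answer, but flag uncertainty
    return "answer_high_confidence"     # plateau region: proceed to synthesis
```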

🎯 Improvement Priorities

Key Insight: The top 10 improvements could add ~173 correct answers (theoretical maximum). S1 offers the highest priority score at low difficulty.
Action: Implement the top 3 improvements (S1, S2, R2) in the next development cycle.
| Priority | Improvement | Queries Affected | Expected Lift | Difficulty | Type |
|---|---|---|---|---|---|
| S1 | Search budgets not well-adapted to query complexity... | 352 | 21.1 | Low | System |
| S2 | Current 80% retrieval recall threshold may be suboptimal... | 110 | 21.7 | Medium | System |
| R2 | Queries with very low retrieval recall (<50%) fail almost entirely... | 50 | 24.9 | Medium | Retrieval |
| RS1 | Queries with high retrieval recall (≥80%) still fail due to synthesis errors... | 41 | 24.6 | Medium | Reasoning |
| S3 | No confidence estimation or uncertainty handling mechanism... | 352 | 10.6 | Medium | System |
| RS5 | Queries with 100% retrieval recall still fail - pure reasoning errors... | 29 | 20.3 | Medium | Reasoning |
| RS4 | Most common failure mode: Information Extraction... | 28 | 14.0 | Medium | Reasoning |
| RS2 | Complex multi-constraint queries fail at synthesis stage despite good retrieval... | 39 | 25.4 | High | Reasoning |
| R3 | Literature queries show systematically lower retrieval recall... | 31 | 3.8 | Medium | Retrieval |
| R4 | Queries with moderate retrieval recall (50-80%) still fail - missing critical do... | 12 | 6.7 | High | Retrieval |

Calculation: Priority score = (expected_lift × affected_queries) / difficulty_factor. Expected lift = estimated additional correct answers based on affected query analysis. Bars show relative magnitude. Source: improvement_priorities.json
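
A sketch of the stated formula; the difficulty-to-factor mapping below is an illustrative assumption (the executive summary describes the score as a weighted combination, so the real analysis may weight the terms differently).

```python
# Illustrative only: the Low/Medium/High factors are assumed, not taken
# from improvement_priorities.json.
DIFFICULTY_FACTOR = {"Low": 1.0, "Medium": 2.0, "High": 3.0}

def priority_score(expected_lift: float, affected_queries: int, difficulty: str) -> float:
    return (expected_lift * affected_queries) / DIFFICULTY_FACTOR[difficulty]

# Example with S1 from the table above: 21.1 expected lift over 352
# affected queries at Low difficulty.
print(priority_score(21.1, 352, "Low"))  # 7427.2 (relative magnitude only)
```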

Key Insight: One improvement sits in the "Quick Wins" quadrant (top-left): S1, which pairs high impact with low effort.
Action: Focus first on Quick Wins quadrant, then proceed to Strategic Projects quadrant once quick wins are validated.

Calculation: 2×2 matrix with quadrants: Quick Wins (low difficulty, high impact), Strategic Projects (medium/high difficulty, high impact), Fill-Ins (low difficulty, low impact), Avoid (high difficulty, low impact). Bubble size = queries affected. Source: improvement_priorities.json
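
A small sketch of the quadrant assignment; the high-impact cutoff is an arbitrary illustrative value, since the chart defines "high impact" relative to the plotted recommendations.

```python
# Sketch only: the expected_lift cutoff of 15 correct answers is assumed.

def quadrant(difficulty: str, expected_lift: float, high_impact_cutoff: float = 15.0) -> str:
    high_impact = expected_lift >= high_impact_cutoff
    low_difficulty = difficulty == "Low"
    if low_difficulty and high_impact:
        return "Quick Wins"
    if high_impact:
        return "Strategic Projects"   # medium/high difficulty, high impact
    if low_difficulty:
        return "Fill-Ins"
    return "Avoid"                    # medium/high difficulty, low impact
```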