Interactive Performance Dashboard – 352 Completed Queries
Scorers: browsecomp_failure_analyzer (correctness, retrieval recall, error classification), which judges answers using exact match and semantic equivalence against a correctness target of 80%, and browsecomp_query_categorizer (topic, knowledge type, complexity), which assigns complexity as Simple (1-2 criteria), Medium (3-4 criteria), or Complex (5+ criteria).
This gauge displays the agent's overall answer correctness rate (70.74%) against the 80% target threshold. The gauge needle indicates current performance, with color zones representing performance levels: red (<60%), yellow (60-75%), blue (>75%). Data from 352 completed BrowseComp-Plus queries.
Correctness Rate = 249 correct / 352 total = 70.74%. Computed by comparing the agent's extracted final answer to the human-verified gold answer using the browsecomp_failure_analyzer scorer. Source: actionable_improvement_priorities.md
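A minimal sketch of the aggregation behind this number, assuming a per-query results export (hypothetical file `results.csv`) with an `is_correct` flag produced by the scorer:

```python
import pandas as pd

# Hypothetical per-query export: one row per completed query, with an
# `is_correct` flag assigned by the browsecomp_failure_analyzer scorer.
results = pd.read_csv("results.csv")

total = len(results)                         # 352 completed queries
correct = int(results["is_correct"].sum())   # 249 judged correct

correctness_rate = correct / total           # 249 / 352 = 0.7074
print(f"Correctness rate: {correctness_rate:.2%} (target: 80%)")
```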
This donut chart shows the distribution of 155 failures: Retrieval Errors (121 failures, 33.7%), where the agent failed to find relevant documents (retrieval recall <80%), and Reasoning Errors (34 failures, 9.5%), where adequate information was retrieved but synthesis still failed.
Retrieval Error: Failures with retrieval recall < 80%. Reasoning Error: Failures with retrieval recall ≥ 80%. The 80% threshold distinguishes "didn't find it" from "found it but used it wrong." Source: failure_mode_deep_dive.md, failure_mode_summary.csv
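A minimal sketch of this classification rule, assuming a per-failure export (hypothetical `failures.csv`) with a `retrieval_recall` value in [0, 1]:

```python
import pandas as pd

RECALL_THRESHOLD = 0.80  # separates "didn't find it" from "found it but used it wrong"

# Hypothetical export of failed queries with a per-query retrieval_recall column.
failures = pd.read_csv("failures.csv")

failures["error_type"] = failures["retrieval_recall"].apply(
    lambda r: "retrieval_error" if r < RECALL_THRESHOLD else "reasoning_error"
)
print(failures["error_type"].value_counts())
```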
This bar chart displays answer correctness rate (y-axis) across retrieval recall ranges (x-axis). Each bar represents queries grouped by how many gold documents they successfully retrieved. Bar colors shift from red (poor performance) to green (strong performance). Sample counts shown on bars. Data from 280 queries with correlation r=0.567 (p<0.000001).
Correctness Rate per bucket: (# correct answers in bucket) / (# total queries in bucket). Retrieval Recall: (# gold docs retrieved) / (# total gold docs). Source: retrieval_recall_analysis.md, recall_distribution.csv
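A sketch of the per-bucket calculation under assumed column names (`gold_docs_retrieved`, `gold_docs_total`, `is_correct`) in a hypothetical per-query export:

```python
import pandas as pd

df = pd.read_csv("queries.csv")  # hypothetical per-query export (column names assumed)

# Retrieval recall: share of gold documents the agent actually retrieved.
df["recall"] = df["gold_docs_retrieved"] / df["gold_docs_total"]

# Group queries into 10%-wide recall buckets and compute correctness per bucket.
bins = [i / 10 for i in range(11)]
df["recall_bucket"] = pd.cut(df["recall"], bins=bins, include_lowest=True)
per_bucket = df.groupby("recall_bucket", observed=True)["is_correct"].agg(
    correctness_rate="mean", n_queries="size"
)
print(per_bucket)
```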
This bubble chart plots the top 5 improvement recommendations: implementation difficulty (x-axis, 1-3 scale) vs expected lift in correct answers (y-axis). Bubble size represents queries affected. Colors indicate difficulty (green=low, blue=medium, yellow=high). Data from actionable improvement priorities analysis.
Expected Lift: Estimated additional correct answers if improvement is implemented, based on historical patterns and affected query analysis. Priority Score: Weighted combination of expected lift, affected queries, and inverse difficulty. Source: improvement_priorities.json
Calculation: Retrieval errors have recall <80% (avg 7.8%), reasoning errors have recall ≥80% (avg 94.1%). The ~86pp gap between these group averages validates the threshold. Source: failure_mode_summary.csv
Calculation: Subcategories identified through qualitative analysis of failure explanations. Rectangle size = failure count. Source: retrieval_error_subcategories.csv, reasoning_error_subcategories.csv
Calculation: Average retrieval recall within each outcome group. Correlation r=0.567 (p<0.000001) confirms strong relationship between recall and correctness. Source: failure_mode_summary.csv
Calculation: Correctness rate per topic = (# correct in topic) / (# total queries in topic). Only topics with ≥5 queries shown to avoid noise. Bar color indicates performance zone. Source: topic_performance.csv
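A sketch of the topic aggregation, including the ≥5-query filter; the file and column names are assumptions:

```python
import pandas as pd

df = pd.read_csv("queries.csv")  # hypothetical export with `topic` and `is_correct` columns

topic_stats = df.groupby("topic")["is_correct"].agg(
    correctness_rate="mean", n_queries="size"
)

# Only report topics with at least 5 queries to avoid noisy estimates.
topic_stats = topic_stats[topic_stats["n_queries"] >= 5]
print(topic_stats.sort_values("correctness_rate"))
```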
Calculation: Metrics aggregated by knowledge_type dimension from query categorization. Knowledge types: Factual (specific facts), Biographical (people's backgrounds), Analytical (comparison/synthesis), Procedural (how-to). Source: knowledge_type_performance.csv
Calculation: Complexity assigned by query categorizer based on constraint count. Simple=1-2 criteria, Medium=3-4, Complex=5+. Balance score: 0.072 (critical, target ≥0.6). Source: complexity_distribution.csv
Calculation: Queries grouped by retrieval recall into 10% buckets. Recall = (# gold docs retrieved) / (# total gold docs). Source: recall_distribution.csv
Calculation: Each point is one query plotted by recall (x) and correctness (y, 0 or 1). Trendline shows linear regression. Correlation coefficient r=0.567 (r² ≈ 0.32) means roughly 32% of the variance in correctness is explained by recall alone. Source: recall_distribution.csv
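A sketch of the correlation check, treating correctness as a 0/1 outcome and recall as a continuous predictor (point-biserial correlation, numerically identical to Pearson's r here); file and column names are assumptions:

```python
import pandas as pd
from scipy import stats

df = pd.read_csv("queries.csv")  # hypothetical per-query export

r, p = stats.pearsonr(df["recall"], df["is_correct"].astype(float))
print(f"r = {r:.3f}, p = {p:.2g}")   # reported values: r = 0.567, p < 1e-6
print(f"r^2 = {r**2:.2f}")           # ~0.32: recall alone explains ~32% of the variance
```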
Calculation: Queries grouped by recall bucket. Correctness rate = (# correct in bucket) / (# queries in bucket). Shows performance cliff below 40% recall and plateau above 80%. Source: gold_doc_utilization.csv
| Priority | Improvement | Queries Affected | Expected Lift | Difficulty | Type |
|---|---|---|---|---|---|
| S1 | Search budgets not well-adapted to query complexity... | 352 | 21.1 | Low | System |
| S2 | Current 80% retrieval recall threshold may be suboptimal... | 110 | 21.7 | Medium | System |
| R2 | Queries with very low retrieval recall (<50%) fail almost entirely... | 50 | 24.9 | Medium | Retrieval |
| RS1 | Queries with high retrieval recall (≥80%) still fail due to synthesis errors... | 41 | 24.6 | Medium | Reasoning |
| S3 | No confidence estimation or uncertainty handling mechanism... | 352 | 10.6 | Medium | System |
| RS5 | Queries with 100% retrieval recall still fail - pure reasoning errors... | 29 | 20.3 | Medium | Reasoning |
| RS4 | Most common failure mode: Information Extraction... | 28 | 14.0 | Medium | Reasoning |
| RS2 | Complex multi-constraint queries fail at synthesis stage despite good retrieval... | 39 | 25.4 | High | Reasoning |
| R3 | Literature queries show systematically lower retrieval recall... | 31 | 3.8 | Medium | Retrieval |
| R4 | Queries with moderate retrieval recall (50-80%) still fail - missing critical do... | 12 | 6.7 | High | Retrieval |
Calculation: Priority score = (expected_lift × affected_queries) / difficulty_factor. Expected lift = estimated additional correct answers based on affected query analysis. Bars show relative magnitude. Source: improvement_priorities.json
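A sketch of the ranking formula above; the mapping from difficulty labels to numeric factors and the JSON field names are assumptions, not taken directly from improvement_priorities.json:

```python
import json

# Assumed mapping from difficulty labels to numeric factors; the source
# does not specify the exact values.
DIFFICULTY_FACTOR = {"Low": 1.0, "Medium": 2.0, "High": 3.0}

with open("improvement_priorities.json") as f:
    priorities = json.load(f)  # list of recommendation dicts (field names assumed)

for rec in priorities:
    rec["priority_score"] = (
        rec["expected_lift"] * rec["affected_queries"]
        / DIFFICULTY_FACTOR[rec["difficulty"]]
    )

for rec in sorted(priorities, key=lambda r: r["priority_score"], reverse=True):
    print(rec["id"], round(rec["priority_score"], 1))
```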
Calculation: 2×2 matrix with quadrants: Quick Wins (low difficulty, high impact), Strategic Projects (medium/high difficulty, high impact), Fill-Ins (low difficulty, low impact), Avoid (high difficulty, low impact). Bubble size = queries affected. Source: improvement_priorities.json
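A sketch of the quadrant assignment; the expected-lift cutoff and the handling of Medium difficulty in the low-impact half are illustrative assumptions, not taken from the source analysis:

```python
def quadrant(difficulty: str, expected_lift: float, lift_cutoff: float = 15.0) -> str:
    """Assign a recommendation to a quadrant of the impact/difficulty matrix.

    'High impact' is approximated here as expected_lift >= lift_cutoff, and only
    'Low' difficulty counts as low-effort; both choices are assumptions.
    """
    high_impact = expected_lift >= lift_cutoff
    if difficulty == "Low":
        return "Quick Wins" if high_impact else "Fill-Ins"
    return "Strategic Projects" if high_impact else "Avoid"

print(quadrant("Low", 21.1))     # S1 -> Quick Wins
print(quadrant("Medium", 24.9))  # R2 -> Strategic Projects
print(quadrant("High", 6.7))     # R4 -> Avoid
```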