Effective observability isn’t just about collecting data—it’s about knowing what questions to ask. With OpsPilot’s natural language interface and OpenTelemetry’s comprehensive telemetry data, you can establish a daily routine that keeps you ahead of issues before they impact users. Here are the essential questions every team should ask OpsPilot daily.
Morning Health Check: System-Wide Status
“What happened overnight?”
This simple question gives you an immediate overview of your system’s behavior during off-hours. OpsPilot will intelligently analyze your OTel traces, metrics, and logs to surface any anomalies, errors, or performance degradations that occurred while your team was offline.
What OpsPilot delivers: A comprehensive analysis that includes:
- Alert Status: Current firing alerts across all monitoring rules
- Service Inventory: Complete list of operational services (e.g., “50+ services running including FusionShop, OpenTelemetry demo services, and various infrastructure components”)
- Resource Utilization: CPU and memory usage with specific percentages (e.g., “~24.9% average CPU utilization, ~33.3% average memory utilization”)
- Error Analysis: Application and server error rates across all services
- Trace Investigation: Examination of error traces with duration and impact details
- Prioritized Findings: Issues organized by severity with clear impact assessment
Real OpsPilot Response Example:
When asked “What happened overnight?”, OpsPilot provided this comprehensive overnight analysis:
🔍 Overnight System Analysis Results
📊 System Health Overview
| Component | Status | Key Finding | Impact |
|---|---|---|---|
| Alerts | ✅ Clean | No firing alerts | No critical issues detected |
| Services | ✅ Operational | 50+ services running | All services up and responding |
| CPU Usage | ✅ Stable | ~24.9% average utilization | Consistent, healthy levels |
| Memory Usage | ✅ Stable | ~33.3% average utilization | Normal resource consumption |
🚨 Issues Detected
| Priority | Component | Issue | Time Range | Impact |
|---|---|---|---|---|
| ⚠️ Low | Grafana Alloy | Kafka connectivity errors | Ongoing | Monitoring component only |
| ⚠️ Low | Load Generator | Timeout errors (demo) | 09:17-09:20 | Expected demo behavior |
| ⚠️ Low | Quote Service | Minor server errors | Intermittent | Low error rate (~0.01%) |
💡 Assessment
✅ Overall System Health: GOOD
The overnight period was largely uneventful with:
- No production issues - All critical services performed normally
- Stable resource usage - CPU and memory remained at healthy levels
- Expected demo errors - The error traces are from the OpenTelemetry demo environment
- Minor monitoring issues - Kafka connectivity errors don't impact core functionality
Why it matters: Issues that develop overnight can compound throughout the day. This comprehensive analysis—which would take 30+ minutes manually across multiple tools—happens in seconds from a single question.
“Have I had any issues with my OTel services in the last 24 hours?”
This targeted question helps you focus on your OpenTelemetry-instrumented services specifically, providing detailed analysis of services using OTel for observability.
Real OpsPilot Analysis Example:
🔍 OpenTelemetry Services Health Analysis - Last 24 Hours
Key Takeaway: Your OpenTelemetry services have been running well overall with only
minor, expected issues in the demo environment and some intermittent errors in the
quote service.
📊 OTel Services Status Overview
| Service | Status | Error Rate | Key Finding |
|---|---|---|---|
| otlp-ad-fr | ✅ Excellent | 0% application errors | Zero errors detected |
| otlp-fraud-detection-fr | ✅ Excellent | 0% application errors | Zero errors detected |
| quote-service-lucee | ⚠️ Minor Issues | ~0.4-1.2% server errors | Intermittent errors |
| OpenTelemetry Demo Services | ✅ Stable | No error traces found | Expected behavior |
🚨 Issues Detected
| Priority | Service | Issue Type | Details | Time Range |
|---|---|---|---|---|
| ⚠️ Low | quote-service-lucee | Server Errors | Intermittent HTTP errors | Throughout 24h |
| ⚠️ Low | OpenTelemetry Demo/Ad | Span Export | "cannot ship spans" (503) | ~17 hours ago |
| ℹ️ Info | OpenTelemetry Demo | Normal Activity | Profile control operations | Recent |
📈 Performance Analysis
Application Error Rates (24-hour period):
- otlp-ad-fr: Perfect performance with 0% application error rate
- otlp-fraud-detection-fr: Perfect performance with 0% application error rate
- quote-service-lucee: Minor server errors ranging from 0.4% to 1.2% at peak
Server Error Patterns:
quote-service-lucee showed variable server error rates:
- Peak errors: ~709 errors/sec (early in period)
- Typical range: 4-20 errors/sec during active periods
- Many periods with zero errors
- Current status: Minimal errors (~114 errors/sec most recently)
🔍 Trace Analysis
Error Traces: No error traces found in the OpenTelemetry demo services over the last
24 hours, indicating stable request processing.
Recommendations:
1. Monitor quote-service-lucee - The server error pattern suggests potential resource
constraints or configuration issues during peak load
2. Verify span export configuration - Ensure OpenTelemetry exporters have proper retry
and backoff configurations
3. Continue monitoring - Current error rates are manageable but trending should be watched
✅ Overall Health: GOOD
Summary: Your OpenTelemetry services have been performing well over the last 24 hours.
The OTLP services are running perfectly with zero errors. The quote-service-lucee has
some intermittent server errors but they're at low levels and don't appear to be
impacting overall functionality.
What OpsPilot did automatically:
- Filtered all metrics to focus specifically on OTel-related services
- Queried Prometheus for application and server error rates across OTel services
- Searched Tempo traces for error traces in OpenTelemetry demo services
- Analyzed Loki logs for errors, exceptions, and failures in OTel services
- Identified service-specific patterns (quote-service-lucee error spikes)
- Distinguished between critical services (otlp-ad-fr, otlp-fraud-detection-fr) and demo environments
- Provided error rate trends (peak vs. typical vs. current)
- Delivered specific recommendations for each identified issue
Why this matters: When you’re specifically instrumenting services with OpenTelemetry, you need focused analysis of how those services are performing. OpsPilot automatically identifies OTel services and provides service-by-service breakdowns with error patterns, trace analysis, and targeted recommendations.
“Show me error rates across all services in the last 24 hours”
Understanding your baseline error rates helps you identify services that may need attention. OpsPilot can query your Prometheus metrics and generate visualizations comparing current error rates against historical patterns.
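If you want to see the underlying numbers yourself, a minimal PromQL sketch along these lines would do it (it assumes the app_error_count counter that appears in the workflow examples later in this post; app_request_count is a hypothetical request counter used only to illustrate the ratio):

```promql
# Errors per second for every service; run as a range query over the
# last 24 hours to see the trend per job.
sum by (job) (rate(app_error_count[5m]))

# Error ratio per service; app_request_count is a hypothetical
# request counter - substitute whatever your instrumentation exposes.
sum by (job) (rate(app_error_count[5m]))
  / sum by (job) (rate(app_request_count[5m]))
```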
What to look for: OpsPilot will help you identify:
- Services with zero errors (healthy baseline like otlp-ad-fr and otlp-fraud-detection-fr)
- Services with minimal errors like “quote-service-lucee showing server errors at 0.4-1.2%”
- Services with concerning error rates requiring immediate attention
- Expected errors from test/demo environments that can be safely ignored
OpsPilot’s intelligence: The OTel services analysis distinguished between:
- Perfect performance (0% error rate)
- Minor issues with low impact (0.4-1.2% error rate)
- Intermittent patterns (variable error rates throughout the day)
- Expected vs. unexpected behavior (demo environment vs. production services)
“Which services had the slowest response times yesterday?”
Performance degradation often starts gradually. This question helps you identify services experiencing latency issues before they become critical problems.
Action item: If the same services appear repeatedly, prioritize investigation and optimization efforts there. OpsPilot can correlate slow response times with resource utilization to help identify root causes.
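If you want to sanity-check OpsPilot's answer by hand, a hedged PromQL sketch might look like the following (it assumes a latency histogram named http_server_duration_seconds_bucket with a job label; metric names vary by instrumentation):

```promql
# Five slowest services by p95 latency for yesterday
# (offset shifts the 1-day window back by one day).
topk(5,
  histogram_quantile(0.95,
    sum by (job, le) (
      rate(http_server_duration_seconds_bucket[1d] offset 1d)
    )
  )
)
```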
Mid-Morning: Trace Analysis
“Show me traces with errors from the last hour”
Once your morning rush begins and traffic increases, this question helps you catch issues as they develop in real-time.
Why OTel matters here: OpenTelemetry’s distributed tracing captures the complete journey of requests across your microservices. OpsPilot uses TraceQL to search through your Tempo data, identifying error traces and providing context about which services and operations were involved.
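For reference, the kind of TraceQL this maps to is roughly the sketch below; the service name is illustrative, and the one-hour window is set by the query's time range rather than inside TraceQL itself:

```traceql
{ status = error }

{ status = error && resource.service.name = "quote-service-lucee" }
```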
Real example from OpsPilot analysis:
Trace Analysis: No error traces found in the OpenTelemetry demo services over the
last 24 hours, indicating stable request processing.
This type of finding is just as valuable as finding errors—knowing that request processing is stable across your OTel-instrumented services gives you confidence in your system health.
“What are the longest-running traces in the past 6 hours?”
Slow transactions frustrate users and consume resources. Identifying these traces helps you understand where optimization efforts will have the biggest impact.
OpsPilot advantage: Instead of manually constructing TraceQL queries, you get instant analysis with context about which services contributed to the delay, typical duration patterns, and whether the slow traces represent anomalies or expected behavior.
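As a rough equivalent, a TraceQL sketch for surfacing slow traces looks like this; the 2s threshold is an arbitrary example to adjust to your latency budget, and the 6-hour window is set in the query's time range:

```traceql
{ duration > 2s }

{ duration > 2s && resource.service.name = "quote-service-lucee" }
```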
Pre-Lunch: Resource Utilization Check
“Show me CPU and memory usage trends across my infrastructure”
Resource exhaustion is a common cause of application failures. Regular monitoring prevents surprise outages.
What OpsPilot provides: Queries against your Prometheus metrics with specific, actionable data:
Real example from overnight analysis:
📈 Performance Metrics Summary
System Resources (12-hour average):
- CPU Utilization: 24.9% - Well within normal operating range
- Memory Utilization: 33.3% - Healthy memory usage with plenty of headroom
- Error Rates: Near zero for production services
OpsPilot not only provides the numbers but interprets them: “Well within normal operating range” and “Healthy memory usage with plenty of headroom” give you immediate confidence that your systems are operating correctly.
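Under the hood, figures like these typically come from queries similar to this PromQL sketch (it assumes node_exporter-style metrics; your environment may expose different names):

```promql
# Average CPU utilization across all nodes, as a percentage.
100 * (1 - avg(rate(node_cpu_seconds_total{mode="idle"}[5m])))

# Memory utilization per node, as a percentage.
100 * (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)
```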
“Are there any services approaching resource limits?”
This proactive question helps you identify capacity issues before they cause problems.
Best practice: Ask this during your daily standup so the entire team has visibility into potential scaling needs. OpsPilot will analyze current utilization against historical baselines to flag concerning trends.
Afternoon: Service-Specific Health Checks
“How are my OpenTelemetry exporters performing?”
If you’re using OpenTelemetry exporters to send data to various backends, this question helps you ensure your observability pipeline is healthy.
What to look for:
- Span export errors (like the "cannot ship spans" batch export failure)
- Exporter throughput and latency
- Dropped or rejected spans
- Backpressure indicators
OpsPilot insight: In the OTel services analysis, OpsPilot identified a span export error from 17 hours ago with the recommendation to “Verify span export configuration – Ensure OpenTelemetry exporters have proper retry and backoff configurations.”
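If you want to track exporter health directly, a hedged PromQL sketch against the OpenTelemetry Collector's self-telemetry could look like this (metric names vary by Collector version; some builds append a _total suffix):

```promql
# Failed span exports per exporter.
sum by (exporter) (rate(otelcol_exporter_send_failed_spans[5m]))

# Failure ratio: failed exports relative to attempted exports.
sum by (exporter) (rate(otelcol_exporter_send_failed_spans[5m]))
  / sum by (exporter) (rate(otelcol_exporter_sent_spans[5m]))
```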
“Which of my services are experiencing the most errors?”
This question helps you prioritize investigation efforts on services with the highest impact.
Real example from OTel analysis:
Server Error Patterns for quote-service-lucee:
- Peak errors: ~709 errors/sec (early in period)
- Typical range: 4-20 errors/sec during active periods
- Many periods with zero errors
- Current status: Minimal errors (~114 errors/sec most recently)
This level of detail shows not just that there are errors, but the pattern of errors throughout the day—helping you understand whether this is a persistent issue or related to specific load patterns.
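A quick way to reproduce this ranking yourself is a topk query over the same server-error counter used in the workflow example later in this post:

```promql
# Ten noisiest services by server errors per second.
topk(10, sum by (job) (rate(app_server_error_count[5m])))
```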
“Compare today’s traffic to the same day last week”
Understanding normal traffic patterns helps you distinguish between expected behavior and genuine anomalies.
Why weekly comparison matters: Many applications have weekly cycles (lower weekend traffic, higher Monday traffic, etc.). Week-over-week comparison accounts for these patterns.
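In PromQL terms, OpsPilot's comparison boils down to something like this sketch (app_request_count is an assumed request counter; swap in your own traffic metric):

```promql
# Current request rate relative to the same moment last week.
# A value of 1.2 means traffic is 20% higher than a week ago.
sum(rate(app_request_count[5m]))
  / sum(rate(app_request_count[5m] offset 1w))
```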
“Show me the top 10 endpoints by request volume today”
Knowing which endpoints receive the most traffic helps you prioritize performance optimization and understand your users’ primary workflows.
OpsPilot insight: This data can reveal unexpected usage patterns or identify endpoints that might benefit from caching or optimization.
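A hedged PromQL sketch for this, assuming a request counter with an http_route-style label (both names depend on your instrumentation):

```promql
# Top 10 endpoints by total requests over the last 24 hours.
topk(10, sum by (http_route) (increase(app_request_count[24h])))
```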
Late Afternoon: Dependency Health
“Are there any external service integrations experiencing issues?”
Modern applications depend on numerous external services. OpsPilot can analyze your OTel data to identify problems with third-party APIs, databases, or other dependencies.
Critical for: SaaS applications, e-commerce platforms, and any system relying on external payment processors, authentication providers, or data services.
Real example from OpsPilot analysis:
OpsPilot identified “Kafka connectivity errors in Grafana Alloy” but intelligently assessed the impact:
| Priority | Component | Issue | Impact |
|---|---|---|---|
| ⚠️ Low | Grafana Alloy | Kafka connectivity errors | Monitoring component only |
Recommendation: The Kafka connectivity errors in Grafana Alloy should be addressed when convenient, but they don't impact your core application monitoring capabilities.
This type of contextual assessment prevents unnecessary alarm while ensuring issues are documented for follow-up.
“Show me database query performance for the last 8 hours”
Database performance often degrades gradually. Regular monitoring helps you catch slow queries before they impact user experience.
What to investigate: Queries taking significantly longer than baseline, queries with high execution frequency, or queries with increasing duration trends.
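If you also scrape your database with something like mysqld_exporter, a couple of hedged PromQL sketches cover the same 8-hour window (metric names are that exporter's defaults; adjust for your setup):

```promql
# Slow queries accumulated over the last 8 hours.
increase(mysql_global_status_slow_queries[8h])

# Current client connections relative to the configured maximum.
mysql_global_status_threads_connected / mysql_global_variables_max_connections
```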
Real OpsPilot Database Performance Analysis:
When asked “Show me database query performance for the last 8 hours”, OpsPilot delivered this comprehensive analysis:
📊 Database Query Performance Analysis - Last 8 Hours
Key Takeaway: Database performance shows significant variability with periods of
excellent performance and some concerning spikes, particularly during peak load
periods around 02:00-04:00 UTC.
🎯 Performance Summary
| Metric | Status | Peak Value | Average | Trend |
|---|---|---|---|---|
| Query Response Time | ⚠️ Variable | 2,034ms (quotes table) | ~200-500ms | High variability |
| MySQL Throughput | ✅ Stable | 3.02 queries/sec | ~2.7 q/sec | Consistent |
| Slow Queries | ✅ Low | 6 total | <1% of queries | Minimal impact |
| Connections | ✅ Healthy | 18 concurrent | 6-8 average | Normal scaling |
📈 Detailed Performance Metrics
Application-Level Database Performance:
| Table | Operation | Best Response | Worst Response | Current Performance |
|---|---|---|---|---|
| customers | INSERT | 3ms | 529ms | 29ms |
| customers | SELECT | 0.2ms | 1,474ms | 1.8ms |
| quotes | SELECT | 0.4ms | 2,035ms | 201ms |
| services | SELECT | 0.2ms | 786ms | 7ms |
🚨 Performance Issues Detected
| Time Period | Issue | Impact | Details |
|---|---|---|---|
| 02:00-04:00 UTC | High query latency | ⚠️ Moderate | Quotes table queries peaked at 2+ seconds |
| 02:40-04:20 UTC | Connection spike | ⚠️ Low | Connections increased to 18 (normal: 6-8) |
| Intermittent | Database errors | ⚠️ Low | Sporadic timeout errors (30-second range) |
🔧 Recommendations:
1. Investigate quotes table indexing - The high variability suggests potential
missing indexes or inefficient queries
2. Review query patterns during 02:00-04:00 UTC - This appears to be a consistent
peak load period
3. Consider connection pooling optimization - The connection spikes correlate with
performance issues
4. Monitor for long-running transactions - The timeout errors suggest some queries
may be holding locks too long
Overall Assessment: Database performance is generally good with stable server metrics, but application-level query performance shows concerning variability that should be investigated, particularly for the quotes table during peak load periods.
Why this matters: Instead of manually querying database metrics, examining slow query logs, and correlating connection pools with performance issues, OpsPilot:
- Analyzed 8 hours of database performance data
- Identified specific tables with performance issues (quotes table)
- Correlated performance spikes with time periods (02:00-04:00 UTC)
- Distinguished between best-case, worst-case, and current performance
- Provided table-by-table and operation-by-operation breakdowns
- Assessed connection pool behavior
- Delivered specific, actionable recommendations
This level of analysis transforms database monitoring from reactive troubleshooting into proactive optimization.
End of Day: Summary and Planning
“Give me a summary of today’s system performance”
This comprehensive question provides a holistic view of your system’s health throughout the day.
OpsPilot’s value: Rather than reviewing multiple dashboards, you get a synthesized summary with clear health assessment:
Example from real OpsPilot analysis:
💡 Assessment
✅ Overall System Health: GOOD
The overnight period was largely uneventful with:
✅ No production issues - All critical services performed normally
✅ Stable resource usage - CPU and memory remained at healthy levels
✅ Expected demo errors - Error traces are from OpenTelemetry demo environment
✅ Minor monitoring issues - Don't impact core functionality
Your systems ran smoothly overnight with no significant incidents or
performance issues requiring immediate attention.
The analysis includes prioritized findings organized by impact level with specific recommendations about what needs attention now versus what can wait.
“What issues occurred today that I should investigate further?”
Not every issue requires immediate attention, but logging them for follow-up ensures nothing falls through the cracks.
Team practice: Share OpsPilot’s prioritized issue list in your daily wrap-up. Examples from real analyses:
Overnight analysis identified:
- Kafka connectivity errors (monitoring component only – low priority)
- Load generator timeout errors (expected demo behavior – informational)
- Quote service minor errors (very low rate at ~0.01% – low priority)
OTel services analysis identified:
- quote-service-lucee server error patterns (potential resource constraints – moderate priority)
- Span export configuration (ensure proper retry/backoff – low priority)
- Continued monitoring of error rate trends (proactive measure)
Database analysis identified:
- Quotes table indexing investigation (high variability indicator – moderate priority)
- Peak load period optimization (02:00-04:00 UTC pattern – moderate priority)
- Connection pooling configuration review (correlation with performance – low priority)
This prioritization helps teams focus on what matters while documenting known issues for later investigation.
“Show me any recurring errors from the past week”
Recurring errors indicate systemic issues rather than transient problems. Identifying patterns helps you prioritize fixes that will have lasting impact.
Follow-up action: Create tickets for recurring issues and track them through resolution. OpsPilot can help distinguish between recurring production errors and expected behavior from test environments.
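A LogQL sketch for spotting recurring errors is shown below; the keyword regex mirrors the one in the workflow example later in this post, while the broad job selector is purely illustrative:

```logql
# Daily error counts per job - run over a 7-day range to see recurrence.
sum by (job) (
  count_over_time({job=~".+"} |~ "(?i)(error|exception|fail)" [1d])
)
```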
Weekly Deep Dive Questions
While daily questions maintain situational awareness, weekly questions provide strategic insights:
“What are the performance trends over the past week?”
Identify gradual degradations that might not be obvious in daily snapshots. For example, database query times gradually increasing or specific time periods (like 02:00-04:00 UTC) consistently showing performance issues.
“Which services had the most errors this week?”
Prioritize engineering efforts on services with the highest error impact. OpsPilot can aggregate error data across services and time periods to identify problem areas like the quote-service-lucee pattern of intermittent errors.
“Show me the 95th percentile response times by service”
Understanding tail latency helps you optimize the experience for all users, not just the average case. The database performance analysis showing worst-case times of 2+ seconds while averages are 200-500ms demonstrates why percentile analysis matters.
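A hedged PromQL sketch for per-service p95 (using the same assumed http_server_duration_seconds_bucket histogram as earlier; graph it across the whole week to watch the tail drift):

```promql
# 95th percentile request duration per service.
histogram_quantile(0.95,
  sum by (job, le) (rate(http_server_duration_seconds_bucket[5m]))
)
```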
“Compare this week’s incident count to last month”
Track whether your reliability is improving or degrading over time.
“How is my OpenTelemetry instrumentation coverage?”
Understand which services are instrumented with OTel and which might need additional coverage for complete observability.
Incident Response Questions
When issues occur, these questions help you respond quickly:
“What changed in the last 2 hours?”
Deployments, configuration changes, and infrastructure modifications often correlate with incidents. OpsPilot can help identify recent changes by analyzing your OTel data.
“Show me all traces containing [specific error message]”
Quickly find all instances of a specific error and understand the context in which it occurs. OpsPilot uses TraceQL to search through your distributed traces efficiently.
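In TraceQL this maps to a query like the sketch below; the statusMessage intrinsic requires a reasonably recent Tempo version, and the text is the span-export error quoted elsewhere in this post, used purely as an example:

```traceql
{ status = error && statusMessage =~ ".*cannot ship spans.*" }
```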
“Which services are calling [problematic service]?”
Understanding upstream dependencies helps you assess the blast radius of an incident.
“What was the error rate during [specific time period]?”
Quantify the impact of an incident for post-mortems and customer communication. OpsPilot can provide specific time ranges like “09:17-09:20” or “02:00-04:00 UTC” for precise incident documentation.
Proactive Capacity Planning Questions
“Show me storage growth over the past month”
Prevent storage exhaustion by monitoring growth trends.
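A hedged PromQL sketch for this, assuming node_exporter filesystem metrics: predict_linear extrapolates the last 30 days of growth to flag filesystems likely to fill up within a week.

```promql
# Filesystems projected to run out of space within 7 days.
predict_linear(
  node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"}[30d], 7 * 24 * 3600
) < 0
```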
“What’s the peak concurrent request count for each service?”
Understand your capacity needs for proper resource allocation and auto-scaling configuration. The database analysis showing connections scaling from 2 to 18 during peak periods demonstrates healthy capacity management.
“Which services scale most frequently?”
Identify services with volatile load patterns that might benefit from optimized scaling policies.
How OpsPilot Processes Your Questions: Behind the Scenes
Understanding OpsPilot’s analytical workflow helps you ask better questions and interpret the results:
Example 1: “What happened overnight?” query workflow:
OpsPilot runs the multi-step analysis summarized earlier: it checks firing alerts, inventories running services, queries Prometheus for error rates and resource utilization, searches Loki logs, and investigates error traces in Tempo before prioritizing findings by severity.
Example 2: “Have I had any issues with my OTel services in the last 24 hours?” workflow:
Step 1: Service Identification
Identify OpenTelemetry-related services using pattern matching:
- Services with "otlp" in the name
- Services with "opentelemetry" in the name
- Services instrumented with OTel SDKs
Result: Found otlp-ad-fr, otlp-fraud-detection-fr, quote-service-lucee,
OpenTelemetry Demo services
Step 2: Application Error Analysis
Query Prometheus
sum(rate(app_error_count[5m])) by (job) and on(job)
label_replace({__name__=~"app_error_count"}, "job", "$1", "job", "(.*opentelemetry.*|.*otlp.*)")
Result:
- otlp-ad-fr: 0% error rate
- otlp-fraud-detection-fr: 0% error rate
- quote-service-lucee: some errors detected
Step 3: Server Error Analysis
Query Prometheus
sum(rate(app_server_error_count[5m])) by (job) and on(job)
label_replace({__name__=~"app_server_error_count"}, "job", "$1", "job", "(.*opentelemetry.*|.*otlp.*)")
Result: quote-service-lucee showing variable server error rates:
- Peak: ~709 errors/sec
- Typical: 4-20 errors/sec
- Current: ~114 errors/sec
Step 4: Trace Investigation
Search Tempo Traces
{.service.name=~"opentelemetry-demo.*"} && {status=error}
Result: No error traces found in OpenTelemetry demo services - stable request processing
Step 5: Log Analysis
Query Loki Logs (First Pass)
{job=~".*opentelemetry.*|.*otlp.*"} |~ "(?i)(error|exception|fail|critical)"
Limit: 30 entries, Direction: backward, Time: last 24 hours
Query Loki Logs (Recent Focus)
{job=~".*opentelemetry.*|.*otlp.*"} |~ "(?i)(error|exception|fail|critical|timeout)"
Limit: 50 entries, Time: last 8 hours
Result:
- Found one span export error: "cannot ship spans" batch failed (503) ~17 hours ago
- Recent logs show normal FusionReactor profiling operations
- No critical errors in recent activity
Step 6: Pattern Recognition
Analyze error patterns across time:
- Identify peak error periods
- Distinguish between persistent and intermittent issues
- Correlate errors across services
- Compare error rates: peak vs typical vs current
Step 7: Impact Assessment
Categorize findings by severity:
- Perfect performance: otlp-ad-fr, otlp-fraud-detection-fr (0% errors)
- Minor issues: quote-service-lucee (0.4-1.2% error rate)
- Informational: Normal demo environment activity
- Low priority: Temporary span export issue
Step 8: Recommendations Generation
Based on patterns identified:
- quote-service-lucee: Monitor for resource constraints during peak load
- Span export: Verify exporter retry/backoff configuration
- Overall: Continue monitoring trends
This multi-layered analysis—spanning service identification, error rate querying, trace searching, log analysis, pattern recognition, and recommendation generation—happens automatically from a single natural language question targeting OpenTelemetry services specifically.
Example 3: “Show me database query performance for the last 8 hours” workflow:
OpsPilot follows a similar multi-step pattern: it queries application-level query timings by table and operation, server-level throughput and slow-query counts, and connection pool behavior, then correlates them by time period to produce the table-level findings and recommendations shown earlier.
Best Practices for Daily OpsPilot Usage
Establish a Routine – Ask the same core questions at the same time each day. Starting with “What happened overnight?” gives you immediate confidence about system health. Follow up with service-specific checks like “Have I had any issues with my OTel services?” and database performance reviews before peak traffic periods.
Document Interesting Findings – When OpsPilot surfaces something unexpected, document it in your team wiki or runbook:
- The 02:00-04:00 UTC peak load pattern
- quote-service-lucee error patterns during peak periods
- Quotes table performance variability
- Normal error rate baselines for each service (otlp services at 0%, quote-service-lucee at 0.4-1.2%)
Share Insights with Your Team – Use OpsPilot’s clear, structured output during standups or team meetings. The organized format with health indicators (✅⚠️ℹ️), priority levels, and specific metrics makes it easy to communicate:
- “Our OTel OTLP services are running perfectly with zero errors”
- “quote-service-lucee has intermittent errors we’re monitoring, but they’re at low levels”
- “Database performance is good overall, but we need to investigate quotes table indexing”
Act on Recommendations – OpsPilot doesn’t just identify issues; it provides specific, prioritized recommendations. Create tickets for medium and high-priority items:
- Investigate quotes table indexing (from database analysis)
- Monitor quote-service-lucee resource constraints (from OTel services analysis)
- Verify span export configuration (from OTel services analysis)
- Schedule investigation time for lower-priority concerns
Refine Your Questions – As you learn more about your system’s patterns, adjust your daily questions:
- If you have multiple service groups (OTel services, legacy services, third-party integrations), ask group-specific questions
- If database performance shows consistent patterns at specific times, add time-specific questions
- If certain services appear frequently in error reports, add service-specific deep dives
Use OpsPilot Hub – Populate OpsPilot Hub with your infrastructure diagrams, known issues, and incident history. This context makes OpsPilot’s responses more accurate and relevant:
- Helps distinguish between expected demo behavior and actual production issues
- Enables understanding of service dependencies for impact assessment
- Provides context about normal error baselines and acceptable thresholds
Trust the Analysis – OpsPilot provides clear health assessments based on comprehensive data analysis:
- “✅ Overall System Health: GOOD” (overnight analysis)
- “✅ Overall Health: GOOD” (OTel services analysis)
- “generally good with stable server metrics” (database analysis)
These assessments come after analyzing multiple data sources and provide confidence levels you can trust.
Review Recommendations Regularly – OpsPilot’s recommendations often reveal systemic issues:
- The database performance analysis suggesting quotes table indexing isn’t just about fixing one slow query—it’s about addressing a pattern affecting user experience
- The OTel services analysis identifying resource constraints in quote-service-lucee points to scaling or optimization needs
- The span export configuration recommendation ensures your observability pipeline remains healthy
Leveraging OTel Data Through OpsPilot
The combination of OpenTelemetry’s comprehensive data collection and OpsPilot’s natural language interface creates a powerful daily workflow:
Complete Visibility – OTel captures metrics (Prometheus), logs (Loki), and traces (Tempo) across your entire stack:
- Application-level metrics (error rates, response times)
- Service-level metrics (throughput, latency by service)
- Infrastructure metrics (CPU, memory, connections)
- Database metrics (query performance, throughput, slow queries)
Accessible Analysis – OpsPilot eliminates the learning curve of complex query languages (PromQL, LogQL, TraceQL), making observability data accessible to everyone on your team, not just monitoring specialists.
Service-Specific Intelligence – OpsPilot can automatically filter and analyze specific service groups:
Query: "Have I had any issues with my OTel services?"
Result: Automatically identifies and analyzes only OpenTelemetry-instrumented services
- Pattern matching on service names (otlp*, opentelemetry*)
- Focused error rate analysis for identified services
- Service-by-service status breakdown
- Comparative analysis (0% vs 0.4-1.2% error rates)
Contextual Understanding – OpsPilot intelligently interprets your data:
- “0% application error rate” becomes “Perfect performance”
- “0.4-1.2% server errors” becomes “Minor issues” with “Low impact” assessment
- “~709 errors/sec peak, 4-20 errors/sec typical, ~114 errors/sec current” shows trends, not just snapshots
- “No error traces found” becomes “stable request processing” – positive confirmation
- Span export errors assessed as temporary vs persistent issues
Multi-Dimensional Analysis – OpsPilot automatically analyzes across dimensions:
For OTel Services:
- Application error rates by service
- Server error rates with trend analysis (peak/typical/current)
- Trace-level error investigation
- Log analysis for exceptions and failures
- Time-based pattern recognition
For Database Performance:
- Application-level query performance (response times by table and operation)
- Server-level performance (throughput, slow query counts)
- Resource utilization (connection pool behavior)
- Time-based patterns (peak periods, gradual degradations)
- Correlation between metrics (connections vs. performance)
Faster Troubleshooting – Questions trigger comprehensive analysis across all your data sources:
“What happened overnight?”
- Alert checking and service inventory
- Metric discovery and querying
- Log searching with proper regex
- Error rate analysis with baselines
- Resource utilization checks
- Trace investigation with TraceQL
“Have I had any issues with my OTel services?”
- Service identification and filtering
- Application and server error rate analysis
- Trace-level error search specific to OTel services
- Log analysis focused on OTel service logs
- Pattern recognition across error types
- Trend analysis (peak vs typical vs current)
“Show me database query performance”
- Performance data across 8 hours
- Table and operation breakdowns
- Time-based correlation
- Connection pool analysis
- Root cause hypothesis
This analysis, which would require dozens of manual queries across multiple tools and significant expertise to correlate and interpret, happens automatically in seconds.
Intelligent Prioritization – OpsPilot doesn’t just show you data; it assesses impact and prioritizes issues:
Priority Indicators:
- ✅ Excellent/Stable: No action needed
- ⚠️ Low: Monitor, address when convenient
- ⚠️ Moderate: Investigate and optimize
- 🚨 High/Critical: Immediate attention required
- ℹ️ Info: Awareness, no action needed
Impact Assessment:
- “No issues” (otlp services at 0% errors)
- “Low impact” (quote-service-lucee at 0.4-1.2%)
- “Monitoring component only” (Kafka connectivity)
- “Demo environment only” (expected test behavior)
- “Moderate” (database performance spikes)
Actionable Recommendations – OpsPilot goes beyond identifying problems to suggest specific solutions:
- “Investigate quotes table indexing” (specific table identified)
- “Monitor quote-service-lucee for resource constraints” (specific service with context)
- “Verify span export configuration for proper retry and backoff” (specific OTel component)
- “Review query patterns during 02:00-04:00 UTC” (specific time period)
- “Consider connection pooling optimization” (correlation identified)
- “Monitor for long-running transactions” (root cause hypothesis)
Conclusion: From Reactive to Proactive
Establishing a daily routine of asking OpsPilot targeted questions transforms your observability practice from reactive firefighting to proactive system management. By leveraging OpenTelemetry’s standardized data collection and OpsPilot’s AI-powered analysis, you can:
- Start confidently: Know exactly what happened overnight with comprehensive analysis covering 50+ services, 12 hours of metrics, and complete trace investigation—in seconds
- Focus effectively: Ask service-specific questions like “Have I had any issues with my OTel services?” to get targeted analysis of specific technology stacks or service groups
- Catch issues early: Distinguish between expected behavior and genuine problems before they impact users, like identifying database performance patterns during peak periods or service error trends
- Build team knowledge: Share clear, structured insights that everyone can understand, from “Overall System Health: GOOD” assessments to specific service-level metrics like “otlp-ad-fr: 0% errors, quote-service-lucee: 0.4-1.2% errors”
- Make data-driven decisions: Base capacity and scaling decisions on precise metrics (connection scaling from 2 to 18 during peaks) and performance trends (02:00-04:00 UTC peak patterns, error rate progressions from 709 to 114 errors/sec)
- Respond faster: Access correlated data from metrics, logs, and traces through simple questions, with automatic analysis spanning application, database, infrastructure, and service-specific layers
- Prioritize effectively: Trust OpsPilot’s prioritization (✅⚠️ℹ️) to work on what matters most, whether that’s investigating quotes table indexing (moderate priority), monitoring quote-service-lucee patterns (low priority), or acknowledging demo environment behavior (informational)
- Optimize proactively: Receive specific, actionable recommendations based on comprehensive pattern analysis rather than reactive alerts
The key is consistency. Start with the morning “What happened overnight?” health check, add service-specific questions like “Have I had any issues with my OTel services?” to focus on critical technology stacks, include mid-day database performance reviews during peak traffic periods, and establish end-of-day summaries. Gradually expand to include the full daily routine and customize based on what matters most to your specific applications and services.
As demonstrated in the real-world examples, well-crafted questions can give you complete visibility into:
System-Wide:
- 50+ services operational status with alert correlation
- 12 hours of system performance metrics with interpretation
- Error analysis across all applications with impact assessment
- Resource utilization trends with capacity insights
Service-Specific (OTel):
- Per-service error rate analysis (0% for otlp services, 0.4-1.2% for quote-service)
- Error pattern trends (peak: 709 errors/sec, typical: 4-20, current: 114)
- Trace-level stability confirmation (“No error traces found”)
- Log analysis with specific error identification (span export issues)
- Service-by-service health assessment with clear status indicators
Database-Specific:
- 8 hours of query performance across multiple tables and operations
- Table-by-table and operation-by-operation performance breakdowns
- Time-based patterns identifying peak load periods (02:00-04:00 UTC)
- Connection pool behavior with scaling analysis (2-18 connections)
- Best/worst/current performance comparison for each operation
Across All Dimensions:
- Prioritized issues with specific, actionable recommendations
- Overall health assessments with confidence levels
- Pattern recognition identifying systemic vs transient issues
- Impact assessment distinguishing critical vs informational findings
Ready to transform your daily observability routine? Start your FusionReactor trial today and discover how OpsPilot’s natural language interface makes OpenTelemetry data accessible, actionable, and essential to your team’s daily workflow.
About FusionReactor
FusionReactor combines comprehensive OpenTelemetry support with OpsPilot AI to deliver observability that’s both powerful and accessible. With native OTel integration supporting Prometheus metrics, Loki logs, and Tempo traces, plus natural language querying capable of analyzing:
- Dozens of services with service-specific filtering and analysis
- Thousands of data points across 24-hour periods
- Eight hours of database performance with table-level granularity
- Per-service error patterns with trend analysis (peak/typical/current)
- Complete trace investigation with TraceQL
- Multi-dimensional log analysis with context-aware filtering
All from single questions like “What happened overnight?”, “Have I had any issues with my OTel services?”, or “Show me database query performance”.
OpsPilot’s intelligent analysis provides clear health assessments (GOOD/CONCERNING/CRITICAL), service-by-service breakdowns with comparative analysis (0% vs 0.4-1.2% error rates), multi-dimensional performance metrics, prioritized findings with visual indicators (✅⚠️ℹ️), and actionable recommendations based on pattern recognition and root cause analysis—transforming raw telemetry data into confident decision-making.
Whether you’re checking overnight system health across all services, investigating OpenTelemetry-instrumented services specifically, or analyzing database performance patterns, OpsPilot delivers comprehensive insights through simple, natural language questions that anyone on your team can ask.
