Observability Model
AgenticAnts provides comprehensive observability for AI agents through traces, metrics, logs, and metadata.
What is Observability?
Observability is the ability to understand the internal state of a system by examining its external outputs. For AI systems, this means:
- Understanding what your agents are doing
- Diagnosing why they behave the way they do
- Optimizing performance and costs
- Ensuring quality and compliance
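Each of these depends on turning raw per-request records into aggregate signals. As a concrete, purely illustrative example (plain Python, independent of any SDK), latency percentiles like the p50/p95/p99 figures shown in the Metrics pillar below are computed by rank-ordering individual request durations:

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: the smallest value >= p% of the samples."""
    ranked = sorted(samples)
    k = max(0, math.ceil(p / 100 * len(ranked)) - 1)
    return ranked[k]

# Per-request latencies in milliseconds, as a tracing backend might store them
latencies = [820, 950, 1200, 1350, 1800, 2400, 3100, 3500, 4700, 5200]

summary = {f"p{p}": percentile(latencies, p) for p in (50, 95, 99)}
print(summary)
```

The same rollup idea applies to throughput, error rate, and cost: each dashboard number is an aggregation over many individual traces.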
The Three Pillars of Observability
1. Traces
Traces show the complete execution path of a request:
Trace ID: trace_abc123
Duration: 2.5s
Status: Success
├─ input-validation (10ms)
├─ context-retrieval (200ms)
│ ├─ fetch-user-data (50ms)
│ ├─ fetch-history (100ms)
│ └─ fetch-preferences (50ms)
├─ llm-inference (2.0s)
│ ├─ prompt-construction (20ms)
│ ├─ api-call (1.95s)
│ └─ response-parsing (30ms)
└─ response-formatting (290ms)
2. Metrics
Metrics are quantitative measurements over time:
// Performance metrics
latency: {
p50: 1200ms,
p95: 3500ms,
p99: 5200ms
}
// Volume metrics
throughput: 45 requests/second
total_requests: 1,234,567
// Quality metrics
error_rate: 0.5%
success_rate: 99.5%
// Cost metrics
total_tokens: 125M
cost_per_request: $0.023
3. Logs
Logs capture discrete events:
[2025-10-23 14:23:45] INFO Agent started: customer-support-agent
[2025-10-23 14:23:45] DEBUG Input received: "Help with my order"
[2025-10-23 14:23:46] INFO Context retrieved: 3 documents
[2025-10-23 14:23:47] DEBUG LLM tokens: prompt=150, completion=200
[2025-10-23 14:23:48] INFO Response sent successfully
[2025-10-23 14:23:48] METRIC Duration: 2.5s, Cost: $0.0105
AgenticAnts Data Model
Hierarchy
Organization
└─ Project(s)
└─ Environment(s)
└─ Agent(s)
└─ Trace(s)
└─ Span(s)
└─ Event(s)
Entities
Organization
Your company or team:
{
id: 'org_abc123',
name: 'Acme Corp',
plan: 'enterprise',
credits: 50000
}
Project
A logical grouping of agents:
{
id: 'proj_xyz789',
name: 'Customer Support',
organization: 'org_abc123',
environments: ['production', 'staging', 'development']
}
Environment
Deployment environment:
{
id: 'env_prod',
name: 'production',
project: 'proj_xyz789'
}
Agent
An AI agent or system:
{
id: 'agent_support',
name: 'customer-support-agent',
version: '1.2.3',
framework: 'langchain',
model: 'gpt-4'
}
Trace
Complete execution of a request:
{
id: 'trace_abc123',
name: 'customer-support-query',
startTime: '2025-10-23T14:23:45Z',
endTime: '2025-10-23T14:23:48Z',
duration: 2500, // ms
status: 'success',
input: 'Help with my order',
output: 'I can help you track your order...',
metadata: {
userId: 'user_123',
sessionId: 'session_abc',
channel: 'web'
},
spans: [...],
tokens: 350,
cost: 0.0105
}
Span
A single unit of work within a trace:
{
id: 'span_xyz',
traceId: 'trace_abc123',
parentId: null, // or parent span ID
name: 'llm-inference',
startTime: '2025-10-23T14:23:46Z',
endTime: '2025-10-23T14:23:48Z',
duration: 2000, // ms
attributes: {
model: 'gpt-4',
temperature: 0.7,
maxTokens: 500
},
events: [...],
status: 'ok'
}
Event
A point-in-time occurrence:
{
id: 'event_123',
spanId: 'span_xyz',
timestamp: '2025-10-23T14:23:47Z',
name: 'token_usage',
attributes: {
promptTokens: 150,
completionTokens: 200,
totalTokens: 350
}
}
Collection Methods
SDK Instrumentation
The most common method is to use our SDKs:
import { AgenticAnts } from '@agenticants/sdk'
const ants = new AgenticAnts({ apiKey: process.env.AGENTICANTS_API_KEY })
// Manual instrumentation
const trace = await ants.trace.create({
name: 'my-agent',
input: userQuery
})
const result = await myAgent.process(userQuery)
await trace.complete({
output: result
})
Auto-Instrumentation
Automatic instrumentation for supported frameworks:
import os
from agenticants import AgenticAnts
from agenticants.integrations import langchain
# Auto-instrument LangChain
ants = AgenticAnts(api_key=os.getenv('AGENTICANTS_API_KEY'))
langchain.auto_instrument()
# Now all LangChain calls are automatically traced
from langchain import OpenAI
llm = OpenAI()
result = llm("What is AI?") # Automatically traced!
OpenTelemetry
Standards-based instrumentation:
import { NodeTracerProvider, BatchSpanProcessor } from '@opentelemetry/sdk-trace-node'
import { AgenticAntsExporter } from '@agenticants/opentelemetry'
const provider = new NodeTracerProvider()
provider.addSpanProcessor(
new BatchSpanProcessor(
new AgenticAntsExporter({
apiKey: process.env.AGENTICANTS_API_KEY
})
)
)
provider.register()
Querying Data
Dashboard UI
Visual exploration of data:
- Live Dashboard: Real-time metrics and traces
- Trace Explorer: Search and filter traces
- Metrics Dashboard: Time-series visualizations
- Agent Analytics: Per-agent insights
REST API
Programmatic access:
# Get traces
curl -G https://api.agenticants.ai/v1/traces \
-H "Authorization: Bearer $API_KEY" \
-d start_time="2025-10-23T00:00:00Z" \
-d end_time="2025-10-23T23:59:59Z"
# Get metrics
curl -G https://api.agenticants.ai/v1/metrics \
-H "Authorization: Bearer $API_KEY" \
-d metric="latency_p95" \
-d agent="customer-support"
Query Language
Rich query capabilities:
// Query traces
const traces = await ants.traces.query({
where: {
agent: 'customer-support',
status: 'error',
duration: { $gt: 5000 } // > 5 seconds
},
orderBy: { timestamp: 'desc' },
limit: 100
})
// Aggregate metrics
const metrics = await ants.metrics.aggregate({
metric: 'cost',
groupBy: ['agent', 'customer'],
period: 'daily',
startDate: '2025-10-01',
endDate: '2025-10-31'
})
Visualization
Real-Time Dashboards
Monitor live metrics:
// Create custom dashboard
await ants.dashboards.create({
name: 'Agent Performance',
widgets: [
{
type: 'time-series',
metric: 'throughput',
title: 'Requests per Second'
},
{
type: 'gauge',
metric: 'error_rate',
title: 'Error Rate',
thresholds: { warning: 1, critical: 5 }
},
{
type: 'histogram',
metric: 'latency',
title: 'Latency Distribution'
}
]
})
Trace Visualization
Flamegraphs and waterfalls:
┌─────────────────────────────────────────────────────────┐
│ customer-support-query (2.5s) │
├─────────────────────────────────────────────────────────┤
│ ▓ input-validation (10ms) │
│ ██ context-retrieval (200ms) │
│ ▓ fetch-user (50ms) │
│ █ fetch-history (100ms) │
│ ▓ fetch-prefs (50ms) │
│ ████████████████████ llm-inference (2.0s) │
│ ▓ construct (20ms) │
│ ███████████████████ api-call (1.95s) │
│ ▓ parse (30ms) │
│ ██ format (290ms) │
└─────────────────────────────────────────────────────────┘
Data Retention
Retention Policies
Configure how long data is kept:
await ants.config.setRetention({
// Raw traces
traces: {
hot: '7d', // Fast access
warm: '30d', // Standard access
cold: '90d' // Archive access
},
// Aggregated metrics
metrics: {
'1m': '7d', // 1-minute resolution for 7 days
'1h': '30d', // 1-hour resolution for 30 days
'1d': '365d' // 1-day resolution for 1 year
},
// Logs
logs: '30d'
})
Data Lifecycle
New Data → Hot Storage (7 days, fast queries)
↓
Warm Storage (30 days, normal queries)
↓
Cold Storage (90 days, slower queries)
↓
Deleted (configurable)
Best Practices
1. Rich Context
Include relevant metadata:
await ants.trace.create({
name: 'agent-execution',
input: query,
metadata: {
// User context
userId: 'user_123',
sessionId: 'session_abc',
// Business context
customerId: 'customer_456',
accountType: 'enterprise',
// Technical context
agentVersion: '1.2.3',
model: 'gpt-4',
region: 'us-east-1'
}
})
2. Consistent Naming
Use clear, hierarchical names:
Good:
customer-support.classify-intent
customer-support.retrieve-context
customer-support.generate-response
Avoid:
func1
process
handler
3. Error Tracking
Always capture errors with context:
import traceback
try:
result = agent.process(query)
trace.complete(output=result)
except Exception as error:
trace.error(
error=str(error),
stack=traceback.format_exc(),
severity='error',
context={
'query': query,
'agent_state': agent.get_state()
}
)
4. Sampling Strategy
Sample intelligently to control costs:
// Always sample errors and slow requests
// Sample 10% of normal requests
const shouldTrace = (request) => {
if (request.error) return true
if (request.duration > 5000) return true
return Math.random() < 0.1
}
Next Steps
- Learn about Tracing - Deep dive into distributed tracing
- Explore SRE - Reliability engineering practices
- Set Up Monitoring - Complete monitoring guide