Observability Model

AgenticAnts provides comprehensive observability for AI agents through traces, metrics, logs, and metadata.

What is Observability?

Observability is the ability to understand the internal state of a system by examining its external outputs. For AI systems, this means:

  • Understanding what your agents are doing
  • Diagnosing why they behave the way they do
  • Optimizing performance and costs
  • Ensuring quality and compliance

The Three Pillars of Observability

1. Traces

Traces show the complete execution path of a request, from the initial input to the final output, as a tree of timed spans.

2. Metrics

Metrics are quantitative measurements over time:

typescript
// Performance metrics
latency: {
  p50: 1200ms,
  p95: 3500ms,
  p99: 5200ms
}

// Volume metrics
throughput: 45 requests/second
total_requests: 1,234,567

// Quality metrics
error_rate: 0.5%
success_rate: 99.5%

// Cost metrics
total_tokens: 125M
cost_per_request: $0.023
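
As a rough illustration of how latency percentiles and an error rate like those above could be derived from raw request records, here is a minimal sketch using the nearest-rank method. `RequestRecord`, `percentile`, and `errorRate` are hypothetical names, not part of the AgenticAnts SDK, and the platform may compute its percentiles differently.

```typescript
// Hypothetical record shape for illustration; not an AgenticAnts type.
interface RequestRecord {
  durationMs: number
  ok: boolean
}

// Nearest-rank percentile: the smallest sample such that at least p% of
// all samples are less than or equal to it.
function percentile(values: number[], p: number): number {
  if (values.length === 0) throw new Error('no samples')
  const sorted = values.slice().sort((a, b) => a - b)
  const rank = Math.ceil((p / 100) * sorted.length)
  return sorted[Math.max(rank, 1) - 1]
}

// Share of failed requests, as a percentage.
function errorRate(records: RequestRecord[]): number {
  if (records.length === 0) return 0
  const failed = records.filter((r) => !r.ok).length
  return (failed / records.length) * 100
}

const records: RequestRecord[] = [
  { durationMs: 900, ok: true },
  { durationMs: 1200, ok: true },
  { durationMs: 1500, ok: true },
  { durationMs: 5200, ok: false },
]
const durations = records.map((r) => r.durationMs)
const p50 = percentile(durations, 50) // 1200
const p95 = percentile(durations, 95) // 5200
const rate = errorRate(records) // 25
```

With only four samples the p95 is simply the slowest request; in practice percentiles only become meaningful over much larger windows.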

3. Logs

Logs capture discrete events:

code
[2025-10-23 14:23:45] INFO   Agent started: customer-support-agent
[2025-10-23 14:23:45] DEBUG  Input received: "Help with my order"
[2025-10-23 14:23:46] INFO   Context retrieved: 3 documents
[2025-10-23 14:23:47] DEBUG  LLM tokens: prompt=150, completion=200
[2025-10-23 14:23:48] INFO   Response sent successfully
[2025-10-23 14:23:48] METRIC Duration: 2.5s, Cost: $0.0105
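
Log lines in the format above can be turned into structured records for filtering or export. The following is a minimal sketch that assumes the `[timestamp] LEVEL message` layout shown in the sample; `LogEntry` and `parseLogLine` are hypothetical names, not SDK functions.

```typescript
// Hypothetical structured form of one log line; field names are illustrative.
interface LogEntry {
  timestamp: string
  level: string
  message: string
}

// Matches "[YYYY-MM-DD HH:MM:SS] LEVEL message", allowing padding after the level.
const LOG_LINE = /^\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\] ([A-Z]+)\s+(.*)$/

// Returns null for lines that do not follow the expected layout.
function parseLogLine(line: string): LogEntry | null {
  const match = LOG_LINE.exec(line)
  if (match === null) return null
  return { timestamp: match[1], level: match[2], message: match[3] }
}

const entry = parseLogLine(
  '[2025-10-23 14:23:47] DEBUG LLM tokens: prompt=150, completion=200'
)
```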

AgenticAnts Data Model

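Hierarchy

The data model is hierarchical: an organization contains projects, a project contains environments, and a trace contains spans, which in turn contain events.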

Entities

Organization

Your company or team:

typescript
{
  id: 'org_abc123',
  name: 'Acme Corp',
  plan: 'enterprise',
  credits: 50000
}

Project

A logical grouping of agents:

typescript
{
  id: 'proj_xyz789',
  name: 'Customer Support',
  organization: 'org_abc123',
  environments: ['production', 'staging', 'development']
}

Environment

Deployment environment:

typescript
{
  id: 'env_prod',
  name: 'production',
  project: 'proj_xyz789'
}

Agent

An AI agent or system:

typescript
{
  id: 'agent_support',
  name: 'customer-support-agent',
  version: '1.2.3',
  framework: 'langchain',
  model: 'gpt-4'
}

Trace

Complete execution of a request:

typescript
{
  id: 'trace_abc123',
  name: 'customer-support-query',
  startTime: '2025-10-23T14:23:45Z',
  endTime: '2025-10-23T14:23:48Z',
  duration: 2500, // ms
  status: 'success',
  input: 'Help with my order',
  output: 'I can help you track your order...',
  metadata: {
    userId: 'user_123',
    sessionId: 'session_abc',
    channel: 'web'
  },
  spans: [...],
  tokens: 350,
  cost: 0.0105
}

Span

Single unit of work within a trace:

typescript
{
  id: 'span_xyz',
  traceId: 'trace_abc123',
  parentId: null, // or parent span ID
  name: 'llm-inference',
  startTime: '2025-10-23T14:23:46Z',
  endTime: '2025-10-23T14:23:48Z',
  duration: 2000, // ms
  attributes: {
    model: 'gpt-4',
    temperature: 0.7,
    maxTokens: 500
  },
  events: [...],
  status: 'ok'
}

Event

Point-in-time occurrence:

typescript
{
  id: 'event_123',
  spanId: 'span_xyz',
  timestamp: '2025-10-23T14:23:47Z',
  name: 'token_usage',
  attributes: {
    promptTokens: 150,
    completionTokens: 200,
    totalTokens: 350
  }
}
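
The `traceId` and `parentId` fields in the Span example imply that the spans of a trace form a tree. As a sketch of how a flat list of spans could be assembled into that tree, consider the following. The `Span` interface keeps only the fields needed here, and `SpanNode` and `buildSpanTree` are hypothetical helpers, not SDK functions. The span IDs `span_root` and `span_ctx` are made up for the example.

```typescript
// Minimal span shape, using field names from the Span example above.
interface Span {
  id: string
  traceId: string
  parentId: string | null
  name: string
}

interface SpanNode {
  span: Span
  children: SpanNode[]
}

// Link each span to its parent via parentId and return the root nodes.
function buildSpanTree(spans: Span[]): SpanNode[] {
  const nodes: { [id: string]: SpanNode } = {}
  for (const span of spans) {
    nodes[span.id] = { span, children: [] }
  }
  const roots: SpanNode[] = []
  for (const span of spans) {
    const node = nodes[span.id]
    const parent = span.parentId === null ? undefined : nodes[span.parentId]
    if (parent) {
      parent.children.push(node)
    } else {
      // No parent, or the parent is missing from this batch: treat as a root.
      roots.push(node)
    }
  }
  return roots
}

const tree = buildSpanTree([
  { id: 'span_root', traceId: 'trace_abc123', parentId: null, name: 'customer-support-query' },
  { id: 'span_ctx', traceId: 'trace_abc123', parentId: 'span_root', name: 'retrieve-context' },
  { id: 'span_xyz', traceId: 'trace_abc123', parentId: 'span_root', name: 'llm-inference' },
])
```

Spans whose parent is not in the batch are kept as roots rather than dropped, so partially ingested traces still render.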

Collection Methods

SDK Instrumentation

The most common method is to use our SDKs:

typescript
const ants = new AgenticAnts({
  apiKey: process.env.AGENTICANTS_API_KEY
})

// Manual instrumentation
const trace = await ants.trace.create({
  name: 'my-agent',
  input: userQuery
})

const result = await myAgent.process(userQuery)

await trace.complete({
  output: result
})
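
The manual example above leaves the trace open if the agent throws. One way to guard against that is a small wrapper that always closes the trace, sketched below. `TraceHandle` and `withTrace` are hypothetical, and the sketch assumes the TypeScript trace object exposes an `error` method comparable to the one used in the Python error-tracking example; check the SDK reference before relying on it.

```typescript
// Hypothetical minimal interface covering only the calls used here;
// the real SDK's trace object may differ.
interface TraceHandle {
  complete(result: { output: string }): Promise<void>
  error(details: { error: string }): Promise<void>
}

// Run an agent call inside a trace, recording either the output or the error.
async function withTrace(
  startTrace: () => Promise<TraceHandle>,
  run: () => Promise<string>
): Promise<string> {
  const trace = await startTrace()
  try {
    const output = await run()
    await trace.complete({ output })
    return output
  } catch (err) {
    const message = err instanceof Error ? err.message : String(err)
    await trace.error({ error: message })
    // Re-throw so callers still see the failure.
    throw err
  }
}
```

Under those assumptions, usage might look like `withTrace(() => ants.trace.create({ name: 'my-agent', input: userQuery }), () => myAgent.process(userQuery))`.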

Auto-Instrumentation

Automatic instrumentation for supported frameworks:

python
import os

from agenticants import AgenticAnts
from agenticants.integrations import langchain

# Auto-instrument LangChain
ants = AgenticAnts(api_key=os.getenv('AGENTICANTS_API_KEY'))
langchain.auto_instrument()

# Now all LangChain calls are automatically traced
from langchain import OpenAI

llm = OpenAI()
result = llm("What is AI?")  # Automatically traced!

OpenTelemetry

Standards-based instrumentation:

typescript
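import { NodeTracerProvider } from '@opentelemetry/sdk-trace-node'
import { BatchSpanProcessor } from '@opentelemetry/sdk-trace-base'
// AgenticAntsExporter comes from the AgenticAnts SDK; see its reference for the import path.

const provider = new NodeTracerProvider()

// Batch spans and export them to AgenticAnts
provider.addSpanProcessor(
  new BatchSpanProcessor(
    new AgenticAntsExporter({
      apiKey: process.env.AGENTICANTS_API_KEY
    })
  )
)

provider.register()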

Querying Data

Dashboard UI

Visual exploration of data:

  • Live Dashboard: Real-time metrics and traces
  • Trace Explorer: Search and filter traces
  • Metrics Dashboard: Time-series visualizations
  • Agent Analytics: Per-agent insights

REST API

Programmatic access:

bash
# Get traces (-G sends the parameters as a query string on a GET request)
curl -G https://api.agenticants.ai/v1/traces \
  -H "Authorization: Bearer $API_KEY" \
  --data-urlencode start_time="2025-10-23T00:00:00Z" \
  --data-urlencode end_time="2025-10-23T23:59:59Z"

# Get metrics
curl -G https://api.agenticants.ai/v1/metrics \
  -H "Authorization: Bearer $API_KEY" \
  --data-urlencode metric="latency_p95" \
  --data-urlencode agent="customer-support"

Query Language

Rich query capabilities:

typescript
// Query traces
const traces = await ants.traces.query({
  where: {
    agent: 'customer-support',
    status: 'error',
    duration: { $gt: 5000 } // > 5 seconds
  },
  orderBy: { timestamp: 'desc' },
  limit: 100
})

// Aggregate metrics
const metrics = await ants.metrics.aggregate({
  metric: 'cost',
  groupBy: ['agent', 'customer'],
  period: 'daily',
  startDate: '2025-10-01',
  endDate: '2025-10-31'
})
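
To make the `where` clause above concrete, the sketch below evaluates the same filter in memory. It assumes that plain values mean equality, that `$gt` means strictly greater than, and that all conditions must hold together; the service's actual semantics may differ. `TraceSummary`, `TraceFilter`, and `matches` are hypothetical names.

```typescript
// Hypothetical trace summary and filter shapes, for illustration only.
interface TraceSummary {
  agent: string
  status: string
  duration: number // ms
}

interface TraceFilter {
  agent?: string
  status?: string
  duration?: { $gt: number }
}

// A trace matches when every condition present in the filter holds.
function matches(trace: TraceSummary, where: TraceFilter): boolean {
  if (where.agent !== undefined && trace.agent !== where.agent) return false
  if (where.status !== undefined && trace.status !== where.status) return false
  if (where.duration !== undefined && !(trace.duration > where.duration.$gt)) return false
  return true
}

const where: TraceFilter = {
  agent: 'customer-support',
  status: 'error',
  duration: { $gt: 5000 },
}
const sample: TraceSummary[] = [
  { agent: 'customer-support', status: 'error', duration: 6200 },
  { agent: 'customer-support', status: 'error', duration: 5000 }, // not strictly greater
  { agent: 'customer-support', status: 'success', duration: 9000 }, // wrong status
]
const slowErrors = sample.filter((t) => matches(t, where))
```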

Visualization

Real-Time Dashboards

Monitor live metrics:

typescript
// Create custom dashboard
await ants.dashboards.create({
  name: 'Agent Performance',
  widgets: [
    {
      type: 'time-series',
      metric: 'throughput',
      title: 'Requests per Second'
    },
    {
      type: 'gauge',
      metric: 'error_rate',
      title: 'Error Rate',
      thresholds: { warning: 1, critical: 5 }
    },
    {
      type: 'histogram',
      metric: 'latency',
      title: 'Latency Distribution'
    }
  ]
})
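
The gauge thresholds in the example above can be read as cut-offs for the widget's status. The sketch below shows one plausible interpretation, assuming the values are percentages and that a reading at or above a threshold triggers that level; `GaugeStatus` and `gaugeStatus` are hypothetical names, and the dashboard may apply different rules.

```typescript
type GaugeStatus = 'ok' | 'warning' | 'critical'

// Assumption: a value at or above a threshold triggers that level.
function gaugeStatus(
  value: number,
  thresholds: { warning: number; critical: number }
): GaugeStatus {
  if (value >= thresholds.critical) return 'critical'
  if (value >= thresholds.warning) return 'warning'
  return 'ok'
}

const thresholds = { warning: 1, critical: 5 }
const statuses = [0.5, 2, 5].map((v) => gaugeStatus(v, thresholds))
// statuses: ['ok', 'warning', 'critical']
```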

Trace Visualization

Traces can be viewed as flamegraphs and waterfall charts.

Data Retention

Retention Policies

Configure how long data is kept:

typescript
await ants.config.setRetention({
  // Raw traces
  traces: {
    hot: '7d',   // Fast access
    warm: '30d', // Standard access
    cold: '90d'  // Archive access
  },

  // Aggregated metrics
  metrics: {
    '1m': '7d',  // 1-minute resolution for 7 days
    '1h': '30d', // 1-hour resolution for 30 days
    '1d': '365d' // 1-day resolution for 1 year
  },

  // Logs
  logs: '30d'
})

Data Lifecycle

code
New Data
    ↓
Hot Storage (7 days, fast queries)
    ↓
Warm Storage (30 days, normal queries)
    ↓
Cold Storage (90 days, slower queries)
    ↓
Deleted (configurable)
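
The lifecycle above can be expressed as a lookup from data age to storage tier. The sketch below assumes each limit is the maximum age, counted from ingestion, at which data still lives in that tier; the source does not say whether the periods are cumulative, so treat that as an assumption. `StorageTier`, `parseDays`, and `tierForAge` are hypothetical names.

```typescript
type StorageTier = 'hot' | 'warm' | 'cold' | 'deleted'

// Parse a duration such as '7d' into a number of days.
function parseDays(duration: string): number {
  const match = /^(\d+)d$/.exec(duration)
  if (match === null) throw new Error('unsupported duration: ' + duration)
  return parseInt(match[1], 10)
}

// Assumption: each limit is the maximum age (in days since ingestion)
// at which data is still held in that tier.
function tierForAge(
  ageDays: number,
  policy: { hot: string; warm: string; cold: string }
): StorageTier {
  if (ageDays <= parseDays(policy.hot)) return 'hot'
  if (ageDays <= parseDays(policy.warm)) return 'warm'
  if (ageDays <= parseDays(policy.cold)) return 'cold'
  return 'deleted'
}

const policy = { hot: '7d', warm: '30d', cold: '90d' }
const tiers = [3, 20, 60, 120].map((age) => tierForAge(age, policy))
// tiers: ['hot', 'warm', 'cold', 'deleted']
```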

Best Practices

1. Rich Context

Include relevant metadata:

typescript
await ants.trace.create({
  name: 'agent-execution',
  input: query,
  metadata: {
    // User context
    userId: 'user_123',
    sessionId: 'session_abc',

    // Business context
    customerId: 'customer_456',
    accountType: 'enterprise',

    // Technical context
    agentVersion: '1.2.3',
    model: 'gpt-4',
    region: 'us-east-1'
  }
})

2. Consistent Naming

Use clear, hierarchical names:

code
Good:
  customer-support.classify-intent
  customer-support.retrieve-context
  customer-support.generate-response

Avoid:
  func1
  process
  handler
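
A naming convention is easier to keep when it is checked automatically. The sketch below encodes one possible reading of the examples above: lowercase, hyphenated segments joined by dots, with at least two segments. The pattern and the `isHierarchicalName` helper are assumptions for illustration, not a rule enforced by AgenticAnts.

```typescript
// Assumed convention: lowercase hyphenated segments joined by dots,
// with at least two segments (for example "agent.step").
const HIERARCHICAL_NAME = /^[a-z0-9]+(-[a-z0-9]+)*(\.[a-z0-9]+(-[a-z0-9]+)*)+$/

function isHierarchicalName(spanName: string): boolean {
  return HIERARCHICAL_NAME.test(spanName)
}

const goodName = isHierarchicalName('customer-support.classify-intent') // true
const badName = isHierarchicalName('func1') // false
```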

3. Error Tracking

Always capture errors with context:

python
import traceback

try:
    result = agent.process(query)
    trace.complete(output=result)
except Exception as error:
    # Record the failure together with enough context to reproduce it
    trace.error(
        error=str(error),
        stack=traceback.format_exc(),
        severity='error',
        context={
            'query': query,
            'agent_state': agent.get_state()
        }
    )

4. Sampling Strategy

Sample intelligently to control costs:

typescript
// Always sample errors and slow requests
// Sample 10% of normal requests
const shouldTrace = (request) => {
  if (request.error) return true
  if (request.duration > 5000) return true
  return Math.random() < 0.1
}
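
Because `Math.random()` gives a different answer on every call, two services handling the same request can disagree about whether to trace it. A possible variation, sketched here, derives the decision from a hash of the trace ID so it is repeatable. This is a swapped-in technique rather than something the platform is documented to do; `SampledRequest`, its `traceId` field, `hashToUnit`, and `shouldTraceDeterministic` are hypothetical.

```typescript
// FNV-1a hash of the trace ID, scaled to the range [0, 1).
function hashToUnit(id: string): number {
  let hash = 0x811c9dc5
  for (let i = 0; i < id.length; i++) {
    hash ^= id.charCodeAt(i)
    hash = Math.imul(hash, 0x01000193) >>> 0
  }
  return hash / 0x100000000
}

// Hypothetical request shape; traceId is an assumed field.
interface SampledRequest {
  traceId: string
  error?: boolean
  duration: number // ms
}

// Same rules as the example above, but the random draw is replaced by a
// value derived from the trace ID, so repeated calls agree.
function shouldTraceDeterministic(request: SampledRequest, rate: number): boolean {
  if (request.error) return true
  if (request.duration > 5000) return true
  return hashToUnit(request.traceId) < rate
}

const normalRequest: SampledRequest = { traceId: 'trace_abc123', duration: 1200 }
const firstDecision = shouldTraceDeterministic(normalRequest, 0.1)
const secondDecision = shouldTraceDeterministic(normalRequest, 0.1)
```

The two decisions always agree for the same trace ID, while errors and slow requests are still traced regardless of the rate.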

Next Steps

Learn About Tracing →

© 2026 ANTS Platform, Inc. · Docs v1.0 · Last updated June 2026