Observability Model
AgenticAnts provides comprehensive observability for AI agents through traces, metrics, logs, and metadata.
What is Observability?
Observability is the ability to understand the internal state of a system by examining its external outputs. For AI systems, this means:
- Understanding what your agents are doing
- Diagnosing why they behave the way they do
- Optimizing performance and costs
- Ensuring quality and compliance
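Each of these depends on turning raw per-request records into aggregate signals. As a concrete, purely illustrative example (plain Python, independent of any SDK), latency percentiles like the p50/p95/p99 figures shown in the Metrics pillar below are computed by rank-ordering individual request durations:

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: the smallest value >= p% of the samples."""
    ranked = sorted(samples)
    k = max(0, math.ceil(p / 100 * len(ranked)) - 1)
    return ranked[k]

# Per-request latencies in milliseconds, as a tracing backend might store them
latencies = [820, 950, 1200, 1350, 1800, 2400, 3100, 3500, 4700, 5200]

summary = {f"p{p}": percentile(latencies, p) for p in (50, 95, 99)}
print(summary)
```

The same rollup idea applies to throughput, error rate, and cost: each dashboard number is an aggregation over many individual traces.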
The Three Pillars of Observability
1. Traces
Traces show the complete execution path of a request:
Trace ID: trace_abc123
Duration: 2.5s
Status: Success
├─ input-validation (10ms)
├─ context-retrieval (200ms)
│ ├─ fetch-user-data (50ms)
│ ├─ fetch-history (100ms)
│ └─ fetch-preferences (50ms)
├─ llm-inference (2.0s)
│ ├─ prompt-construction (20ms)
│ ├─ api-call (1.95s)
│ └─ response-parsing (30ms)
└─ response-formatting (290ms)
2. Metrics
Metrics are quantitative measurements over time:
// Performance metrics
latency: {
p50: 1200ms,
p95: 3500ms,
p99: 5200ms
}
// Volume metrics
throughput: 45 requests/second
total_requests: 1,234,567
// Quality metrics
error_rate: 0.5%
success_rate: 99.5%
// Cost metrics
total_tokens: 125M
cost_per_request: $0.023
3. Logs
Logs capture discrete events:
[2025-10-23 14:23:45] INFO Agent started: customer-support-agent
[2025-10-23 14:23:45] DEBUG Input received: "Help with my order"
[2025-10-23 14:23:46] INFO Context retrieved: 3 documents
[2025-10-23 14:23:47] DEBUG LLM tokens: prompt=150, completion=200
[2025-10-23 14:23:48] INFO Response sent successfully
[2025-10-23 14:23:48] METRIC Duration: 2.5s, Cost: $0.0105
AgenticAnts Data Model
Hierarchy
Organization
└─ Project(s)
└─ Environment(s)
└─ Agent(s)
└─ Trace(s)
└─ Span(s)
└─ Event(s)
Entities
Organization
Your company or team:
{
id: 'org_abc123',
name: 'Acme Corp',
plan: 'enterprise',
credits: 50000
}
Project
A logical grouping of agents:
{
id: 'proj_xyz789',
name: 'Customer Support',
organization: 'org_abc123',
environments: ['production', 'staging', 'development']
}
Environment
Deployment environment:
{
id: 'env_prod',
name: 'production',
project: 'proj_xyz789'
}
Agent
An AI agent or system:
{
id: 'agent_support',
name: 'customer-support-agent',
version: '1.2.3',
framework: 'langchain',
model: 'gpt-4'
}
Trace
Complete execution of a request:
{
id: 'trace_abc123',
name: 'customer-support-query',
startTime: '2025-10-23T14:23:45Z',
endTime: '2025-10-23T14:23:48Z',
duration: 2500, // ms
status: 'success',
input: 'Help with my order',
output: 'I can help you track your order...',
metadata: {
userId: 'user_123',
sessionId: 'session_abc',
channel: 'web'
},
spans: [...],
tokens: 350,
cost: 0.0105
}
Span
A single unit of work within a trace:
{
id: 'span_xyz',
traceId: 'trace_abc123',
parentId: null, // or parent span ID
name: 'llm-inference',
startTime: '2025-10-23T14:23:46Z',
endTime: '2025-10-23T14:23:48Z',
duration: 2000, // ms
attributes: {
model: 'gpt-4',
temperature: 0.7,
maxTokens: 500
},
events: [...],
status: 'ok'
}
Event
A point-in-time occurrence:
{
id: 'event_123',
spanId: 'span_xyz',
timestamp: '2025-10-23T14:23:47Z',
name: 'token_usage',
attributes: {
promptTokens: 150,
completionTokens: 200,
totalTokens: 350
}
}
Collection Methods
SDK Instrumentation
The most common method is to use our SDKs:
import { AgenticAnts } from '@agenticants/sdk'
const ants = new AgenticAnts({ apiKey: process.env.AGENTICANTS_API_KEY })
// Manual instrumentation
const trace = await ants.trace.create({
name: 'my-agent',
input: userQuery
})
const result = await myAgent.process(userQuery)
await trace.complete({
output: result
})
Auto-Instrumentation
Automatic instrumentation for supported frameworks:
import os
from agenticants import AgenticAnts
from agenticants.integrations import langchain
# Auto-instrument LangChain
ants = AgenticAnts(api_key=os.getenv('AGENTICANTS_API_KEY'))
langchain.auto_instrument()
# Now all LangChain calls are automatically traced
from langchain import OpenAI
llm = OpenAI()
result = llm("What is AI?") # Automatically traced!
OpenTelemetry
Standards-based instrumentation:
import { NodeTracerProvider, BatchSpanProcessor } from '@opentelemetry/sdk-trace-node'
import { AgenticAntsExporter } from '@agenticants/opentelemetry'
const provider = new NodeTracerProvider()
provider.addSpanProcessor(
new BatchSpanProcessor(
new AgenticAntsExporter({
apiKey: process.env.AGENTICANTS_API_KEY
})
)
)
provider.register()
Querying Data
Dashboard UI
Visual exploration of data:
- Live Dashboard: Real-time metrics and traces
- Trace Explorer: Search and filter traces
- Metrics Dashboard: Time-series visualizations
- Agent Analytics: Per-agent insights
REST API
Programmatic access:
# Get traces
curl -G https://api.agenticants.ai/v1/traces \
-H "Authorization: Bearer $API_KEY" \
-d start_time="2025-10-23T00:00:00Z" \
-d end_time="2025-10-23T23:59:59Z"
# Get metrics
curl -G https://api.agenticants.ai/v1/metrics \
-H "Authorization: Bearer $API_KEY" \
-d metric="latency_p95" \
-d agent="customer-support"
Query Language
Rich query capabilities:
// Query traces
const traces = await ants.traces.query({
where: {
agent: 'customer-support',
status: 'error',
duration: { $gt: 5000 } // > 5 seconds
},
orderBy: { timestamp: 'desc' },
limit: 100
})
// Aggregate metrics
const metrics = await ants.metrics.aggregate({
metric: 'cost',
groupBy: ['agent', 'customer'],
period: 'daily',
startDate: '2025-10-01',
endDate: '2025-10-31'
})
Visualization
Real-Time Dashboards
Monitor live metrics:
// Create custom dashboard
await ants.dashboards.create({
name: 'Agent Performance',
widgets: [
{
type: 'time-series',
metric: 'throughput',
title: 'Requests per Second'
},
{
type: 'gauge',
metric: 'error_rate',
title: 'Error Rate',
thresholds: { warning: 1, critical: 5 }
},
{
type: 'histogram',
metric: 'latency',
title: 'Latency Distribution'
}
]
})
Trace Visualization
Flamegraphs and waterfalls:
┌─────────────────────────────────────────────────────────┐
│ customer-support-query (2.5s) │
├─────────────────────────────────────────────────────────┤
│ ▓ input-validation (10ms) │
│ ██ context-retrieval (200ms) │
│ ▓ fetch-user (50ms) │
│ █ fetch-history (100ms) │
│ ▓ fetch-prefs (50ms) │
│ ████████████████████ llm-inference (2.0s) │
│ ▓ construct (20ms) │
│ ███████████████████ api-call (1.95s) │
│ ▓ parse (30ms) │
│ ██ format (290ms) │
└─────────────────────────────────────────────────────────┘
Data Retention
Retention Policies
Configure how long data is kept:
await ants.config.setRetention({
// Raw traces
traces: {
hot: '7d', // Fast access
warm: '30d', // Standard access
cold: '90d' // Archive access
},
// Aggregated metrics
metrics: {
'1m': '7d', // 1-minute resolution for 7 days
'1h': '30d', // 1-hour resolution for 30 days
'1d': '365d' // 1-day resolution for 1 year
},
// Logs
logs: '30d'
})
Data Lifecycle
New Data → Hot Storage (7 days, fast queries)
↓
Warm Storage (30 days, normal queries)
↓
Cold Storage (90 days, slower queries)
↓
Deleted (configurable)
Best Practices
1. Rich Context
Include relevant metadata:
await ants.trace.create({
name: 'agent-execution',
input: query,
metadata: {
// User context
userId: 'user_123',
sessionId: 'session_abc',
// Business context
customerId: 'customer_456',
accountType: 'enterprise',
// Technical context
agentVersion: '1.2.3',
model: 'gpt-4',
region: 'us-east-1'
}
})
2. Consistent Naming
Use clear, hierarchical names:
Good:
customer-support.classify-intent
customer-support.retrieve-context
customer-support.generate-response
Avoid:
func1
process
handler
3. Error Tracking
Always capture errors with context:
import traceback
try:
result = agent.process(query)
trace.complete(output=result)
except Exception as error:
trace.error(
error=str(error),
stack=traceback.format_exc(),
severity='error',
context={
'query': query,
'agent_state': agent.get_state()
}
)
4. Sampling Strategy
Sample intelligently to control costs:
// Always sample errors and slow requests
// Sample 10% of normal requests
const shouldTrace = (request) => {
if (request.error) return true
if (request.duration > 5000) return true
return Math.random() < 0.1
}
Next Steps
- Learn about Tracing - Deep dive into distributed tracing
- Explore SRE - Reliability engineering practices
- Set Up Monitoring - Complete monitoring guide