# Tracing Fundamentals
Distributed tracing is essential for understanding AI agent behavior. Learn how AgenticAnts implements tracing for AI systems.
## What is Tracing?
Tracing tracks a request as it flows through your system, capturing (see the sketch after this list):
- What happened
- When it happened
- How long it took
- What data was involved
- Any errors that occurred
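
Concretely, a finished trace is just structured data. Here is a minimal sketch of one possible shape in TypeScript; the field names are illustrative only, not AgenticAnts' actual schema:

```typescript
// Illustrative shape of a captured trace; these field names are
// hypothetical, not the SDK's wire format.
interface SpanRecord {
  spanId: string
  name: string
  durationMs: number
  metadata?: Record<string, unknown>
}

interface TraceRecord {
  traceId: string                               // what happened, uniquely identified
  name: string
  startedAt: string                             // when it happened (ISO timestamp)
  durationMs: number                            // how long it took
  input?: unknown                               // what data was involved
  output?: unknown
  error?: { message: string; stack?: string }   // any errors that occurred
  spans: SpanRecord[]                           // the units of work inside the request
}
```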
## Trace vs Span vs Event
```
Trace (Complete request lifecycle)
└─ Span (Unit of work)
   └─ Event (Point-in-time occurrence)
```

Example:

```
Trace: "Process customer support query"
├─ Span: "Classify intent"
│  └─ Event: "LLM tokens used: 50"
├─ Span: "Retrieve context"
│  ├─ Span: "Query vector DB"
│  │  └─ Event: "Found 3 documents"
│  └─ Span: "Fetch customer history"
├─ Span: "Generate response"
│  └─ Event: "LLM tokens used: 200"
└─ Span: "Send response"
```
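
In code, that hierarchy maps onto the SDK calls used throughout this page. A minimal sketch: `trace.create` and `trace.span` mirror the examples below, while `span.event()` is a hypothetical helper shown only to illustrate where events fit:

```typescript
// Sketch only: span.event() is hypothetical; trace.create and
// trace.span mirror the Basic Trace / Nested Spans examples below.
const trace = await ants.trace.create({ name: 'process-customer-query' })

const span = trace.span('classify-intent')     // a unit of work
span.event('llm-tokens-used', { tokens: 50 })  // a point-in-time occurrence
span.end({ intent: 'billing' })

await trace.complete({ output: 'Resolved billing question' })
```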
## Creating Traces

### Basic Trace
```typescript
import { AgenticAnts } from '@agenticants/sdk'

const ants = new AgenticAnts({ apiKey: process.env.AGENTICANTS_API_KEY })

async function processQuery(query: string) {
  // Start a trace
  const trace = await ants.trace.create({
    name: 'process-customer-query',
    input: query,
    metadata: {
      userId: 'user_123',
      timestamp: new Date().toISOString()
    }
  })

  try {
    // Your agent logic
    const result = await agent.process(query)

    // Complete the trace
    await trace.complete({
      output: result,
      metadata: {
        success: true,
        confidence: 0.95
      }
    })

    return result
  } catch (error) {
    // Record the error on the trace
    await trace.error({
      error: error.message,
      stack: error.stack
    })
    throw error
  }
}
```

### Nested Spans
Add spans for a detailed breakdown:
```typescript
async function processQuery(query: string) {
  const trace = await ants.trace.create({
    name: 'process-query'
  })

  // Span 1: Classification
  const classifySpan = trace.span('classify-intent')
  const intent = await classifyIntent(query)
  classifySpan.end({ intent })

  // Span 2: Retrieval
  const retrievalSpan = trace.span('retrieve-context')
  const context = await retrieveContext(intent)
  retrievalSpan.end({ documents: context.length })

  // Span 3: Generation
  const generationSpan = trace.span('generate-response')
  const response = await generateResponse(query, context)
  generationSpan.end({
    tokens: response.usage.total,
    cost: response.cost
  })

  await trace.complete({ output: response.text })
}
```

## Trace Context
### Propagation
Trace context flows through your system:
```typescript
// Service A starts a trace and passes its context downstream
const trace = await ants.trace.create({ name: 'main-request' })
const traceContext = trace.getContext()

// Pass context to Service B
await fetch('https://service-b.com/api', {
  headers: {
    'x-trace-id': traceContext.traceId,
    'x-span-id': traceContext.spanId
  }
})
```

```typescript
// Service B rebuilds the context from the incoming request headers
// (here from an Express-style req object) and continues the trace
const traceContext = {
  traceId: req.headers['x-trace-id'],
  spanId: req.headers['x-span-id']
}
const trace = ants.trace.fromContext(traceContext)
const span = trace.span('service-b-operation')
// ...
```

### Correlation
Link related traces:
```typescript
// Parent trace
const parentTrace = await ants.trace.create({
  name: 'user-session'
})

// Child traces reference the parent
const childTrace1 = await ants.trace.create({
  name: 'query-1',
  parentTraceId: parentTrace.id
})

const childTrace2 = await ants.trace.create({
  name: 'query-2',
  parentTraceId: parentTrace.id
})
```

## Metadata and Tags
### Adding Metadata
Enrich traces with context:
```typescript
const trace = await ants.trace.create({
  name: 'agent-execution',
  metadata: {
    // User information
    userId: 'user_123',
    userEmail: 'user@example.com',
    userTier: 'premium',

    // Request context
    requestId: 'req_abc',
    sessionId: 'session_xyz',
    ipAddress: '192.168.1.1',

    // Business context
    customerId: 'customer_456',
    accountType: 'enterprise',
    feature: 'customer-support',

    // Technical context
    agentVersion: '1.2.3',
    model: 'gpt-4',
    temperature: 0.7,
    region: 'us-east-1'
  }
})
```

### Using Tags
Categorize and filter traces:
```python
trace = ants.trace.create(
    name='agent-execution',
    tags={
        'environment': 'production',
        'team': 'customer-success',
        'priority': 'high',
        'ab_test': 'variant_b'
    }
)

# Query by tags later
traces = ants.traces.query(
    tags={'environment': 'production', 'priority': 'high'}
)
```

## Sampling Strategies
### Head-Based Sampling
Decide at trace creation:
```typescript
const ants = new AgenticAnts({
  apiKey: process.env.AGENTICANTS_API_KEY,
  sampling: {
    strategy: 'head-based',
    rate: 0.1 // Sample 10% of traces
  }
})

// Or define custom sampling logic
const shouldSample = (request) => {
  // Always sample requests expected to error
  if (request.expectedError) return true

  // Always sample premium users
  if (request.userTier === 'premium') return true

  // Sample 10% of everything else
  return Math.random() < 0.1
}
```

### Tail-Based Sampling
Decide after trace completes:
```python
import os

ants = AgenticAnts(
    api_key=os.getenv('AGENTICANTS_API_KEY'),
    sampling={
        'strategy': 'tail-based',
        'rules': [
            # Keep all errors
            {'condition': 'error = true', 'rate': 1.0},
            # Keep all slow requests (duration in ms)
            {'condition': 'duration > 5000', 'rate': 1.0},
            # Keep 50% of high-value customers
            {'condition': 'customer_tier = "enterprise"', 'rate': 0.5},
            # Keep 10% of everything else
            {'condition': 'true', 'rate': 0.1}
        ]
    }
)
```

## Performance Tracking
### Measuring Latency
```typescript
const trace = await ants.trace.create({ name: 'agent-run' })

// Automatic timing: duration is measured between span() and end()
const llmSpan = trace.span('llm-call')
const result = await llm.generate(prompt)
llmSpan.end() // Duration calculated automatically

// Manual timing: measure yourself and pass the duration explicitly
const opSpan = trace.span('operation')
const start = Date.now()
const opResult = await operation()
const duration = Date.now() - start
opSpan.end({ duration })
```

### Token Tracking
```python
span = trace.span('llm-inference')

response = openai.chat.completions.create(
    model='gpt-4',
    messages=messages
)

span.end({
    'tokens': {
        'prompt': response.usage.prompt_tokens,
        'completion': response.usage.completion_tokens,
        'total': response.usage.total_tokens
    },
    'cost': calculate_cost(response.usage, 'gpt-4')
})
```
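
The `calculate_cost` helper above is left undefined. Here is a minimal sketch of the idea in TypeScript, with an illustrative (not authoritative) per-1K-token price table:

```typescript
// Minimal sketch of a per-model cost calculator; the price table
// is illustrative only, and real prices change often.
const PRICES_PER_1K: Record<string, { prompt: number; completion: number }> = {
  'gpt-4': { prompt: 0.03, completion: 0.06 }
}

function calculateCost(
  usage: { prompt_tokens: number; completion_tokens: number },
  model: string
): number {
  const price = PRICES_PER_1K[model]
  if (!price) return 0 // unknown model: report zero rather than guess
  return (
    (usage.prompt_tokens / 1000) * price.prompt +
    (usage.completion_tokens / 1000) * price.completion
  )
}
```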
## Error Tracking

### Recording Errors
```typescript
try {
  const result = await riskyOperation()
  span.end({ output: result })
} catch (error) {
  span.error({
    error: error.message,
    stack: error.stack,
    code: error.code,
    severity: 'error',
    context: {
      operation: 'riskyOperation',
      inputs: { /* ... */ }
    }
  })
  throw error
}
```

### Error Categories
```python
# Classify errors by severity; these exception types are examples
# from your application or client libraries
if isinstance(error, ValidationError):
    severity = 'warning'
elif isinstance(error, RateLimitError):
    severity = 'warning'
elif isinstance(error, NetworkError):
    severity = 'error'
else:
    severity = 'critical'

span.error(
    error=str(error),
    severity=severity,
    recoverable=isinstance(error, RetryableError)
)
```

## Visualizing Traces
### Trace Timeline
```
Time →
├─ 0ms    : Trace started
├─ 10ms   : Input validated
├─ 50ms   : Context retrieval started
│  ├─ 60ms   : Database query started
│  └─ 150ms  : Database query completed
├─ 200ms  : Context retrieval completed
├─ 220ms  : LLM inference started
│  ├─ 230ms  : Prompt constructed
│  ├─ 250ms  : API call started
│  ├─ 2200ms : API call completed
│  └─ 2220ms : Response parsed
├─ 2220ms : LLM inference completed
├─ 2250ms : Response formatted
└─ 2300ms : Trace completed
```
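
A timeline like this is derived purely from span start times and durations. A toy TypeScript sketch of that derivation, not an SDK feature:

```typescript
// Toy renderer: turns span timings into a timeline like the one above.
type TimedSpan = { name: string; startMs: number; endMs: number }

function renderTimeline(spans: TimedSpan[]): string {
  return [...spans]
    .sort((a, b) => a.startMs - b.startMs)
    .map((s) => `├─ ${s.startMs}ms : ${s.name} (${s.endMs - s.startMs}ms)`)
    .join('\n')
}

console.log(renderTimeline([
  { name: 'validate-input', startMs: 0, endMs: 10 },
  { name: 'retrieve-context', startMs: 50, endMs: 200 },
  { name: 'llm-inference', startMs: 220, endMs: 2220 }
]))
```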
### Flamegraph

```
┌────────────────────────────────────────────┐
│ agent-execution (2.3s)                     │
├────────────────────────────────────────────┤
│▓ validate (10ms)                           │
│██ retrieve-context (150ms)                 │
│ █ db-query (90ms)                          │
│████████████████████ llm-inference (2.0s)  │
│ ▓ construct (10ms)                         │
│ ███████████████████ api-call (1.98s)      │
│ ▓ parse (10ms)                             │
│▓ format (30ms)                             │
└────────────────────────────────────────────┘
```

## Best Practices
### 1. Meaningful Names
```typescript
// Good
span('llm-inference')
span('database-query')
span('vector-search')

// Avoid
span('step1')
span('process')
span('func')
```

### 2. Rich Metadata
```python
trace.complete(
    output=response,
    metadata={
        'model': 'gpt-4',
        'tokens': 350,
        'cost': 0.0105,
        'confidence': 0.95,
        'cache_hit': False,
        'retries': 0
    }
)
```

### 3. Proper Error Handling
Always record errors in traces:
```typescript
try {
  // ... your agent logic ...
} catch (error) {
  await trace.error({
    error: error.message,
    stack: error.stack,
    severity: 'error',
    context: { /* relevant data */ }
  })
  throw error // Still rethrow after recording
}
```

### 4. Smart Sampling
Balance coverage and cost:
```python
import random

# Sample strategically
def should_trace(request):
    # Always trace errors
    if has_error(request):
        return True

    # Always trace slow requests
    if is_slow(request):
        return True

    # Sample by user tier
    if request.user_tier == 'enterprise':
        return random.random() < 0.5  # 50%
    return random.random() < 0.1      # 10%
```

## Next Steps
- Explore SRE - Advanced monitoring and reliability
- Learn about Metrics - Time-series monitoring
- Set Up Alerts - Proactive notifications