# Tracing Fundamentals
Distributed tracing is essential for understanding AI agent behavior. Learn how AgenticAnts implements tracing for AI systems.
## What is Tracing?
Tracing tracks a request as it flows through your system, capturing (see the sketch after this list):
- What happened
- When it happened
- How long it took
- What data was involved
- Any errors that occurred
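
Concretely, a finished trace is just structured data. Here is a minimal sketch of one possible shape in TypeScript; the field names are illustrative only, not AgenticAnts' actual schema:

```typescript
// Illustrative shape of a captured trace; these field names are
// hypothetical, not the SDK's wire format.
interface SpanRecord {
  spanId: string
  name: string
  durationMs: number
  metadata?: Record<string, unknown>
}

interface TraceRecord {
  traceId: string                               // what happened, uniquely identified
  name: string
  startedAt: string                             // when it happened (ISO timestamp)
  durationMs: number                            // how long it took
  input?: unknown                               // what data was involved
  output?: unknown
  error?: { message: string; stack?: string }   // any errors that occurred
  spans: SpanRecord[]                           // the units of work inside the request
}
```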
## Trace vs Span vs Event
```
Trace (Complete request lifecycle)
└─ Span (Unit of work)
   └─ Event (Point-in-time occurrence)
```

Example:

```
Trace: "Process customer support query"
├─ Span: "Classify intent"
│  └─ Event: "LLM tokens used: 50"
├─ Span: "Retrieve context"
│  ├─ Span: "Query vector DB"
│  │  └─ Event: "Found 3 documents"
│  └─ Span: "Fetch customer history"
├─ Span: "Generate response"
│  └─ Event: "LLM tokens used: 200"
└─ Span: "Send response"
```
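
In code, that hierarchy maps onto the SDK calls used throughout this page. A minimal sketch: `trace.create` and `trace.span` mirror the examples below, while `span.event()` is a hypothetical helper shown only to illustrate where events fit:

```typescript
// Sketch only: span.event() is hypothetical; trace.create and
// trace.span mirror the Basic Trace / Nested Spans examples below.
const trace = await ants.trace.create({ name: 'process-customer-query' })

const span = trace.span('classify-intent')     // a unit of work
span.event('llm-tokens-used', { tokens: 50 })  // a point-in-time occurrence
span.end({ intent: 'billing' })

await trace.complete({ output: 'Resolved billing question' })
```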
## Creating Traces

### Basic Trace
```typescript
import { AgenticAnts } from '@agenticants/sdk'

const ants = new AgenticAnts({ apiKey: process.env.AGENTICANTS_API_KEY })

async function processQuery(query: string) {
  // Start a trace
  const trace = await ants.trace.create({
    name: 'process-customer-query',
    input: query,
    metadata: {
      userId: 'user_123',
      timestamp: new Date().toISOString()
    }
  })

  try {
    // Your agent logic
    const result = await agent.process(query)

    // Complete the trace
    await trace.complete({
      output: result,
      metadata: {
        success: true,
        confidence: 0.95
      }
    })

    return result
  } catch (error) {
    // Record the error on the trace
    await trace.error({
      error: error.message,
      stack: error.stack
    })
    throw error
  }
}
```

### Nested Spans
Add spans for a detailed breakdown:
```typescript
async function processQuery(query: string) {
  const trace = await ants.trace.create({
    name: 'process-query'
  })

  // Span 1: Classification
  const classifySpan = trace.span('classify-intent')
  const intent = await classifyIntent(query)
  classifySpan.end({ intent })

  // Span 2: Retrieval
  const retrievalSpan = trace.span('retrieve-context')
  const context = await retrieveContext(intent)
  retrievalSpan.end({ documents: context.length })

  // Span 3: Generation
  const generationSpan = trace.span('generate-response')
  const response = await generateResponse(query, context)
  generationSpan.end({
    tokens: response.usage.total,
    cost: response.cost
  })

  await trace.complete({ output: response.text })
}
```

## Trace Context
### Propagation
Trace context flows through your system:
```typescript
// Service A starts a trace and passes its context downstream
const trace = await ants.trace.create({ name: 'main-request' })
const traceContext = trace.getContext()

// Pass context to Service B
await fetch('https://service-b.com/api', {
  headers: {
    'x-trace-id': traceContext.traceId,
    'x-span-id': traceContext.spanId
  }
})
```

```typescript
// Service B rebuilds the context from the incoming request headers
// (here from an Express-style req object) and continues the trace
const traceContext = {
  traceId: req.headers['x-trace-id'],
  spanId: req.headers['x-span-id']
}
const trace = ants.trace.fromContext(traceContext)
const span = trace.span('service-b-operation')
// ...
```

### Correlation
Link related traces:
```typescript
// Parent trace
const parentTrace = await ants.trace.create({
  name: 'user-session'
})

// Child traces reference the parent
const childTrace1 = await ants.trace.create({
  name: 'query-1',
  parentTraceId: parentTrace.id
})

const childTrace2 = await ants.trace.create({
  name: 'query-2',
  parentTraceId: parentTrace.id
})
```

## Metadata and Tags
### Adding Metadata
Enrich traces with context:
```typescript
const trace = await ants.trace.create({
  name: 'agent-execution',
  metadata: {
    // User information
    userId: 'user_123',
    userEmail: 'user@example.com',
    userTier: 'premium',

    // Request context
    requestId: 'req_abc',
    sessionId: 'session_xyz',
    ipAddress: '192.168.1.1',

    // Business context
    customerId: 'customer_456',
    accountType: 'enterprise',
    feature: 'customer-support',

    // Technical context
    agentVersion: '1.2.3',
    model: 'gpt-4',
    temperature: 0.7,
    region: 'us-east-1'
  }
})
```

### Using Tags
Categorize and filter traces:
```python
trace = ants.trace.create(
    name='agent-execution',
    tags={
        'environment': 'production',
        'team': 'customer-success',
        'priority': 'high',
        'ab_test': 'variant_b'
    }
)

# Query by tags later
traces = ants.traces.query(
    tags={'environment': 'production', 'priority': 'high'}
)
```

## Sampling Strategies
### Head-Based Sampling
Decide at trace creation:
```typescript
const ants = new AgenticAnts({
  apiKey: process.env.AGENTICANTS_API_KEY,
  sampling: {
    strategy: 'head-based',
    rate: 0.1 // Sample 10% of traces
  }
})

// Or define custom sampling logic
const shouldSample = (request) => {
  // Always sample requests expected to error
  if (request.expectedError) return true

  // Always sample premium users
  if (request.userTier === 'premium') return true

  // Sample 10% of everything else
  return Math.random() < 0.1
}
```

### Tail-Based Sampling
Decide after trace completes:
```python
import os

ants = AgenticAnts(
    api_key=os.getenv('AGENTICANTS_API_KEY'),
    sampling={
        'strategy': 'tail-based',
        'rules': [
            # Keep all errors
            {'condition': 'error = true', 'rate': 1.0},
            # Keep all slow requests (duration in ms)
            {'condition': 'duration > 5000', 'rate': 1.0},
            # Keep 50% of high-value customers
            {'condition': 'customer_tier = "enterprise"', 'rate': 0.5},
            # Keep 10% of everything else
            {'condition': 'true', 'rate': 0.1}
        ]
    }
)
```

## Performance Tracking
### Measuring Latency
```typescript
const trace = await ants.trace.create({ name: 'agent-run' })

// Automatic timing: duration is measured between span() and end()
const llmSpan = trace.span('llm-call')
const result = await llm.generate(prompt)
llmSpan.end() // Duration calculated automatically

// Manual timing: measure yourself and pass the duration explicitly
const opSpan = trace.span('operation')
const start = Date.now()
const opResult = await operation()
const duration = Date.now() - start
opSpan.end({ duration })
```

### Token Tracking
```python
span = trace.span('llm-inference')

response = openai.chat.completions.create(
    model='gpt-4',
    messages=messages
)

span.end({
    'tokens': {
        'prompt': response.usage.prompt_tokens,
        'completion': response.usage.completion_tokens,
        'total': response.usage.total_tokens
    },
    'cost': calculate_cost(response.usage, 'gpt-4')
})
```
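
The `calculate_cost` helper above is left undefined. Here is a minimal sketch of the idea in TypeScript, with an illustrative (not authoritative) per-1K-token price table:

```typescript
// Minimal sketch of a per-model cost calculator; the price table
// is illustrative only, and real prices change often.
const PRICES_PER_1K: Record<string, { prompt: number; completion: number }> = {
  'gpt-4': { prompt: 0.03, completion: 0.06 }
}

function calculateCost(
  usage: { prompt_tokens: number; completion_tokens: number },
  model: string
): number {
  const price = PRICES_PER_1K[model]
  if (!price) return 0 // unknown model: report zero rather than guess
  return (
    (usage.prompt_tokens / 1000) * price.prompt +
    (usage.completion_tokens / 1000) * price.completion
  )
}
```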
## Error Tracking

### Recording Errors
```typescript
try {
  const result = await riskyOperation()
  span.end({ output: result })
} catch (error) {
  span.error({
    error: error.message,
    stack: error.stack,
    code: error.code,
    severity: 'error',
    context: {
      operation: 'riskyOperation',
      inputs: { /* ... */ }
    }
  })
  throw error
}
```

### Error Categories
```python
# Classify errors by severity; these exception types are examples
# from your application or client libraries
if isinstance(error, ValidationError):
    severity = 'warning'
elif isinstance(error, RateLimitError):
    severity = 'warning'
elif isinstance(error, NetworkError):
    severity = 'error'
else:
    severity = 'critical'

span.error(
    error=str(error),
    severity=severity,
    recoverable=isinstance(error, RetryableError)
)
```

## Visualizing Traces
### Trace Timeline
```
Time →
├─ 0ms    : Trace started
├─ 10ms   : Input validated
├─ 50ms   : Context retrieval started
│  ├─ 60ms   : Database query started
│  └─ 150ms  : Database query completed
├─ 200ms  : Context retrieval completed
├─ 220ms  : LLM inference started
│  ├─ 230ms  : Prompt constructed
│  ├─ 250ms  : API call started
│  ├─ 2200ms : API call completed
│  └─ 2220ms : Response parsed
├─ 2220ms : LLM inference completed
├─ 2250ms : Response formatted
└─ 2300ms : Trace completed
```
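
A timeline like this is derived purely from span start times and durations. A toy TypeScript sketch of that derivation, not an SDK feature:

```typescript
// Toy renderer: turns span timings into a timeline like the one above.
type TimedSpan = { name: string; startMs: number; endMs: number }

function renderTimeline(spans: TimedSpan[]): string {
  return [...spans]
    .sort((a, b) => a.startMs - b.startMs)
    .map((s) => `├─ ${s.startMs}ms : ${s.name} (${s.endMs - s.startMs}ms)`)
    .join('\n')
}

console.log(renderTimeline([
  { name: 'validate-input', startMs: 0, endMs: 10 },
  { name: 'retrieve-context', startMs: 50, endMs: 200 },
  { name: 'llm-inference', startMs: 220, endMs: 2220 }
]))
```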
### Flamegraph

```
┌────────────────────────────────────────────┐
│ agent-execution (2.3s)                     │
├────────────────────────────────────────────┤
│▓ validate (10ms)                           │
│██ retrieve-context (150ms)                 │
│ █ db-query (90ms)                          │
│████████████████████ llm-inference (2.0s)  │
│ ▓ construct (10ms)                         │
│ ███████████████████ api-call (1.98s)      │
│ ▓ parse (10ms)                             │
│▓ format (30ms)                             │
└────────────────────────────────────────────┘
```

## Best Practices
### 1. Meaningful Names
```typescript
// Good
span('llm-inference')
span('database-query')
span('vector-search')

// Avoid
span('step1')
span('process')
span('func')
```

### 2. Rich Metadata
```python
trace.complete(
    output=response,
    metadata={
        'model': 'gpt-4',
        'tokens': 350,
        'cost': 0.0105,
        'confidence': 0.95,
        'cache_hit': False,
        'retries': 0
    }
)
```

### 3. Proper Error Handling
Always record errors in traces:
```typescript
try {
  // ... your agent logic ...
} catch (error) {
  await trace.error({
    error: error.message,
    stack: error.stack,
    severity: 'error',
    context: { /* relevant data */ }
  })
  throw error // Still rethrow after recording
}
```

### 4. Smart Sampling
Balance coverage and cost:
```python
import random

# Sample strategically
def should_trace(request):
    # Always trace errors
    if has_error(request):
        return True

    # Always trace slow requests
    if is_slow(request):
        return True

    # Sample by user tier
    if request.user_tier == 'enterprise':
        return random.random() < 0.5  # 50%
    return random.random() < 0.1      # 10%
```

## Next Steps
- Explore SRE - Advanced monitoring and reliability
- Learn about Metrics - Time-series monitoring
- Set Up Alerts - Proactive notifications