Tracing Fundamentals
Distributed tracing is essential for understanding AI agent behavior. Learn how AgenticAnts implements tracing for AI systems.
What is Tracing?
Tracing tracks a request as it flows through your system, capturing:
- What happened
- When it happened
- How long it took
- What data was involved
- Any errors that occurred
Trace vs Span vs Event
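In general tracing terms, a trace is the complete record of one request from start to finish. A span is a single timed operation within a trace, such as an LLM call or a database query, and spans can be nested. An event is a point-in-time occurrence recorded inside a span, such as a retry or a cache miss.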
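One way to picture how the three relate is a small in-memory model. This is a sketch for illustration only: the `Trace`, `Span`, and `TraceEvent` shapes and the helper functions are hypothetical and are not the AgenticAnts SDK types.

```typescript
// Hypothetical in-memory model for illustration; not the AgenticAnts SDK types.
interface TraceEvent {
  name: string
  timestampMs: number // a single point in time, no duration
}

interface Span {
  name: string
  startMs: number
  endMs: number
  events: TraceEvent[]
  children: Span[] // spans can nest
}

interface Trace {
  id: string
  name: string
  rootSpans: Span[]
}

// Duration of a span in milliseconds.
function spanDuration(span: Span): number {
  return span.endMs - span.startMs
}

// Count every span in a list, including nested ones.
function countSpans(spans: Span[]): number {
  return spans.reduce((n, s) => n + 1 + countSpans(s.children), 0)
}

const exampleTrace: Trace = {
  id: 'trace_1',
  name: 'process-customer-query',
  rootSpans: [
    {
      name: 'generate-response',
      startMs: 0,
      endMs: 1200,
      events: [{ name: 'retry', timestampMs: 300 }],
      children: [
        { name: 'llm-call', startMs: 350, endMs: 1150, events: [], children: [] }
      ]
    }
  ]
}
```

Here the trace contains one root span with a nested child span, and the root span carries a single event that has a timestamp but no duration.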
Creating Traces
Basic Trace
typescript
import { AgenticAnts } from '@agenticants/sdk'

const ants = new AgenticAnts({ apiKey: process.env.AGENTICANTS_API_KEY })

async function processQuery(query: string) {
  // Start a trace
  const trace = await ants.trace.create({
    name: 'process-customer-query',
    input: query,
    metadata: {
      userId: 'user_123',
      timestamp: new Date().toISOString()
    }
  })

  try {
    // Your agent logic (`agent` is your own agent instance)
    const result = await agent.process(query)

    // Complete the trace
    await trace.complete({
      output: result,
      metadata: {
        success: true,
        confidence: 0.95
      }
    })
    return result
  } catch (error) {
    // A caught value is `unknown` in strict TypeScript, so normalize it first
    const err = error instanceof Error ? error : new Error(String(error))

    // Record error
    await trace.error({
      error: err.message,
      stack: err.stack
    })
    throw error
  }
}
Nested Spans
Add spans to break a trace into individually timed steps:
typescript
async function processQuery(query: string) {
  const trace = await ants.trace.create({
    name: 'process-query'
  })

  // Span 1: Classification
  const classifySpan = trace.span('classify-intent')
  const intent = await classifyIntent(query)
  classifySpan.end({ intent })

  // Span 2: Retrieval
  const retrievalSpan = trace.span('retrieve-context')
  const context = await retrieveContext(intent)
  retrievalSpan.end({ documents: context.length })

  // Span 3: Generation
  const generationSpan = trace.span('generate-response')
  const response = await generateResponse(query, context)
  generationSpan.end({
    tokens: response.usage.total,
    cost: response.cost
  })

  await trace.complete({ output: response.text })
  return response.text
}
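In the example above, a span is left open if the step it wraps throws. One way to guard against that is a small wrapper that ends the span on both paths. This is a minimal sketch: `withSpan`, `SpanLike`, `TraceLike`, and the `success` field are assumptions made for illustration, and the real SDK types may differ.

```typescript
// Minimal shapes assumed for this sketch; the real SDK types may differ.
interface SpanLike {
  end(data?: Record<string, unknown>): void
}
interface TraceLike {
  span(name: string): SpanLike
}

// Run `fn` inside a span and end the span whether `fn` resolves or throws.
async function withSpan<T>(
  trace: TraceLike,
  name: string,
  fn: () => Promise<T>
): Promise<T> {
  const span = trace.span(name)
  try {
    const result = await fn()
    span.end({ success: true })
    return result
  } catch (error) {
    span.end({ success: false })
    throw error // the caller still sees the original error
  }
}

// A stand-in trace that records span lifecycle, used only to demonstrate.
function createRecordingTrace() {
  const log: string[] = []
  const trace: TraceLike = {
    span(name) {
      log.push(`start:${name}`)
      return {
        end: data => {
          log.push(`end:${name}:${String(data?.success)}`)
        }
      }
    }
  }
  return { trace, log }
}
```

With a helper like this, a step could be written as `const intent = await withSpan(trace, 'classify-intent', () => classifyIntent(query))`, assuming the SDK's trace object is compatible with `TraceLike`.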
Trace Context
Propagation
To continue one trace across services, pass its context (trace ID and span ID) along with each outbound request:
typescript
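// Service A starts a trace
const trace = await ants.trace.create({ name: 'main-request' })
const traceContext = trace.getContext()

// Pass context to Service B
await fetch('https://service-b.com/api', {
  headers: {
    'x-trace-id': traceContext.traceId,
    'x-span-id': traceContext.spanId
  }
})

// Service B (a separate process) rebuilds the context from the incoming
// request headers; `req` stands for your framework's request object
const incomingContext = {
  traceId: req.headers['x-trace-id'],
  spanId: req.headers['x-span-id']
}
const continuedTrace = ants.trace.fromContext(incomingContext)
const span = continuedTrace.span('service-b-operation')
// ...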
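The header handling on both sides can be kept in one place so the sender and receiver cannot drift apart. A minimal sketch, reusing the `x-trace-id` and `x-span-id` header names from the example above; the `TraceContext` shape and the `injectContext` and `extractContext` helpers are assumptions, not SDK functions.

```typescript
// Context shape assumed for this sketch.
interface TraceContext {
  traceId: string
  spanId: string
}

const TRACE_ID_HEADER = 'x-trace-id'
const SPAN_ID_HEADER = 'x-span-id'

// Sender side: turn a context into outbound headers.
function injectContext(ctx: TraceContext): Record<string, string> {
  return { [TRACE_ID_HEADER]: ctx.traceId, [SPAN_ID_HEADER]: ctx.spanId }
}

// Receiver side: rebuild the context, or return null if either header is
// missing so the caller can start a fresh trace instead.
// Lookup uses lowercase names, which is how Node's http module exposes them.
function extractContext(
  headers: Record<string, string | string[] | undefined>
): TraceContext | null {
  const first = (v: string | string[] | undefined) => (Array.isArray(v) ? v[0] : v)
  const traceId = first(headers[TRACE_ID_HEADER])
  const spanId = first(headers[SPAN_ID_HEADER])
  return traceId && spanId ? { traceId, spanId } : null
}
```

Returning `null` for a partial context is a design choice: continuing a trace with only half of its identifiers would produce spans that cannot be attached to a parent.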
Correlation
Link related traces:
typescript
// Parent trace
const parentTrace = await ants.trace.create({
  name: 'user-session'
})

// Child traces reference parent
const childTrace1 = await ants.trace.create({
  name: 'query-1',
  parentTraceId: parentTrace.id
})
const childTrace2 = await ants.trace.create({
  name: 'query-2',
  parentTraceId: parentTrace.id
})
Metadata and Tags
Adding Metadata
Enrich traces with context:
typescript
const trace = await ants.trace.create({
  name: 'agent-execution',
  metadata: {
    // User information
    userId: 'user_123',
    userEmail: 'user@example.com',
    userTier: 'premium',

    // Request context
    requestId: 'req_abc',
    sessionId: 'session_xyz',
    ipAddress: '192.168.1.1',

    // Business context
    customerId: 'customer_456',
    accountType: 'enterprise',
    feature: 'customer-support',

    // Technical context
    agentVersion: '1.2.3',
    model: 'gpt-4',
    temperature: 0.7,
    region: 'us-east-1'
  }
})
Using Tags
Categorize and filter traces:
python
trace = ants.trace.create(
    name='agent-execution',
    tags={
        'environment': 'production',
        'team': 'customer-success',
        'priority': 'high',
        'ab_test': 'variant_b'
    }
)

# Query by tags later
traces = ants.traces.query(
    tags={'environment': 'production', 'priority': 'high'}
)
Sampling Strategies
Head-Based Sampling
Head-based sampling decides whether to keep a trace when it is created, before its outcome is known:
typescript
const ants = new AgenticAnts({
  apiKey: process.env.AGENTICANTS_API_KEY,
  sampling: {
    strategy: 'head-based',
    rate: 0.1 // Sample 10% of traces
  }
})

// Or custom logic
const shouldSample = (request: { expectedError?: boolean; userTier?: string }) => {
  // Always sample requests already flagged as likely to fail
  // (actual errors are not known yet at creation time)
  if (request.expectedError) return true

  // Always sample premium users
  if (request.userTier === 'premium') return true

  // Sample 10% of others
  return Math.random() < 0.1
}
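`Math.random()` makes an independent decision on every call, so two services handling the same trace could disagree about whether to keep it. A common alternative is to derive the decision from the trace ID. A minimal sketch, assuming string trace IDs; `hashToUnitInterval` and `shouldSampleTrace` are illustrative names and not part of the SDK.

```typescript
// FNV-1a style 32-bit hash over the string's UTF-16 code units,
// mapped to a number in [0, 1).
function hashToUnitInterval(id: string): number {
  let hash = 0x811c9dc5
  for (let i = 0; i < id.length; i++) {
    hash ^= id.charCodeAt(i)
    hash = Math.imul(hash, 0x01000193) // 32-bit multiply by the FNV prime
  }
  return (hash >>> 0) / 0x100000000
}

// The same trace ID always yields the same decision for a given rate.
function shouldSampleTrace(traceId: string, rate: number): boolean {
  return hashToUnitInterval(traceId) < rate
}
```

Because the decision depends only on the ID and the rate, every service that sees the same trace ID keeps or drops it consistently, as long as they all use the same rate.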
Tail-Based Sampling
Tail-based sampling decides after the trace completes, so rules can use its outcome, such as errors and duration:
python
import os

ants = AgenticAnts(
    api_key=os.getenv('AGENTICANTS_API_KEY'),
    sampling={
        'strategy': 'tail-based',
        'rules': [
            # Keep all errors
            {'condition': 'error = true', 'rate': 1.0},
            # Keep all slow requests
            {'condition': 'duration > 5000', 'rate': 1.0},
            # Keep 50% of high-value customers
            {'condition': 'customer_tier = "enterprise"', 'rate': 0.5},
            # Keep 10% of everything else
            {'condition': 'true', 'rate': 0.1}
        ]
    }
)
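To make the effect of a rule list like the one above concrete, here is a sketch of how such rules could be evaluated against a finished trace. It assumes the first matching rule wins, which is an assumption about evaluation order rather than documented behavior. The `CompletedTrace` fields, `SamplingRule`, and `keepTrace` are hypothetical names, and predicates stand in for the condition strings.

```typescript
// Summary of a finished trace; field names are assumptions for this sketch.
interface CompletedTrace {
  error: boolean
  durationMs: number
  customerTier?: string
}

interface SamplingRule {
  matches: (t: CompletedTrace) => boolean
  rate: number // probability of keeping a matching trace
}

// Mirrors the rule list above, using predicates instead of condition strings.
const rules: SamplingRule[] = [
  { matches: t => t.error, rate: 1.0 },
  { matches: t => t.durationMs > 5000, rate: 1.0 },
  { matches: t => t.customerTier === 'enterprise', rate: 0.5 },
  { matches: () => true, rate: 0.1 }
]

// First matching rule wins (an assumption about evaluation order).
// `random` is injectable so the decision can be tested deterministically.
function keepTrace(
  t: CompletedTrace,
  ruleList: SamplingRule[] = rules,
  random: () => number = Math.random
): boolean {
  const rule = ruleList.find(r => r.matches(t))
  return rule !== undefined && random() < rule.rate
}
```

Under first-match semantics the order matters: an enterprise trace that also errored is always kept, because the error rule is checked before the 50% enterprise rule.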
Performance Tracking
Measuring Latency
typescript
const trace = await ants.trace.create({ name: 'agent-run' })

// Automatic timing
const span = trace.span('llm-call')
const result = await llm.generate(prompt)
span.end() // Duration calculated automatically

// Manual timing (use a separate span so the first one is not ended twice)
const manualSpan = trace.span('custom-operation')
const start = Date.now()
const opResult = await operation()
const duration = Date.now() - start
manualSpan.end({ duration })
Token Tracking
python
span = trace.span('llm-inference')

response = openai.chat.completions.create(
    model='gpt-4',
    messages=messages
)

span.end({
    'tokens': {
        'prompt': response.usage.prompt_tokens,
        'completion': response.usage.completion_tokens,
        'total': response.usage.total_tokens
    },
    'cost': calculate_cost(response.usage, 'gpt-4')
})
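The `calculate_cost` helper above is not defined on this page. One plausible shape for it, shown in TypeScript, multiplies prompt and completion tokens by per-1K-token prices. The `TokenUsage` and `ModelPricing` types, the `example-model` key, and the prices are placeholders for illustration and do not reflect any provider's actual rates.

```typescript
interface TokenUsage {
  prompt: number
  completion: number
}

interface ModelPricing {
  promptPer1K: number
  completionPer1K: number
}

// Placeholder prices for illustration only; look up your provider's current rates.
const EXAMPLE_PRICING: Record<string, ModelPricing> = {
  'example-model': { promptPer1K: 0.03, completionPer1K: 0.06 }
}

// Cost in the pricing table's currency. Throws on an unknown model so that
// a missing price is noticed instead of being recorded as zero cost.
function calculateCost(usage: TokenUsage, model: string): number {
  const pricing = EXAMPLE_PRICING[model]
  if (!pricing) throw new Error(`No pricing configured for model: ${model}`)
  return (
    (usage.prompt / 1000) * pricing.promptPer1K +
    (usage.completion / 1000) * pricing.completionPer1K
  )
}
```

Prompt and completion tokens are priced separately here because many providers charge different rates for each.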
Error Tracking
Recording Errors
typescript
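try {
  const result = await riskyOperation()
  span.end({ output: result })
} catch (error) {
  // A caught value is `unknown` in strict TypeScript, so normalize it first
  const err = error instanceof Error ? error : new Error(String(error))
  span.error({
    error: err.message,
    stack: err.stack,
    code: (err as Error & { code?: string }).code, // not every error carries a code
    severity: 'error',
    context: {
      operation: 'riskyOperation',
      inputs: { /* ... */ }
    }
  })
  throw error
}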
Error Categories
python
# Classify errors
if isinstance(error, ValidationError):
    severity = 'warning'
elif isinstance(error, RateLimitError):
    severity = 'warning'
elif isinstance(error, NetworkError):
    severity = 'error'
else:
    severity = 'critical'

span.error(
    error=str(error),
    severity=severity,
    recoverable=isinstance(error, RetryableError)
)
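The same classification can be written as a single function in TypeScript. This is a sketch under assumptions: the error classes are application-defined stand-ins that mirror the names above, and treating rate-limit and network errors as recoverable is an illustrative choice, not something the SDK prescribes.

```typescript
// Base class for application-defined errors.
class AppError extends Error {
  constructor(message: string) {
    super(message)
    // Keeps `instanceof` working if the code is compiled to ES5
    Object.setPrototypeOf(this, new.target.prototype)
  }
}
class ValidationError extends AppError {}
class RateLimitError extends AppError {}
class NetworkError extends AppError {}

type Severity = 'warning' | 'error' | 'critical'

interface ErrorClassification {
  severity: Severity
  recoverable: boolean // whether a retry is worth attempting (assumed mapping)
}

// Map an unknown thrown value to a severity and a retry hint.
function classifyError(error: unknown): ErrorClassification {
  if (error instanceof ValidationError) return { severity: 'warning', recoverable: false }
  if (error instanceof RateLimitError) return { severity: 'warning', recoverable: true }
  if (error instanceof NetworkError) return { severity: 'error', recoverable: true }
  return { severity: 'critical', recoverable: false } // anything unrecognized
}
```

The result could then be spread into the error payload, for example `span.error({ error: String(error), ...classifyError(error) })`, assuming the SDK accepts those fields as in the Python example.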
Visualizing Traces
Trace Timeline
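A timeline lays spans out on a shared time axis so that ordering, overlap, and gaps are visible at a glance. As a rough text-only illustration of that idea, the sketch below renders span timings as bars. The `SpanTiming` shape and `renderTimeline` function are hypothetical and are not part of the AgenticAnts SDK or its dashboard.

```typescript
interface SpanTiming {
  name: string
  startMs: number
  endMs: number
}

// Render each span as a row: padded name, then a bar offset by its start time
// and sized by its duration, both scaled to `width` characters.
function renderTimeline(spans: SpanTiming[], width = 40): string[] {
  if (spans.length === 0) return []
  const t0 = Math.min(...spans.map(s => s.startMs))
  const t1 = Math.max(...spans.map(s => s.endMs))
  const total = Math.max(t1 - t0, 1) // avoid division by zero
  const nameWidth = Math.max(...spans.map(s => s.name.length))
  return spans.map(s => {
    const offset = Math.round(((s.startMs - t0) / total) * width)
    const length = Math.max(1, Math.round(((s.endMs - s.startMs) / total) * width))
    const bar = ' '.repeat(offset) + '#'.repeat(length)
    return `${s.name.padEnd(nameWidth)} |${bar} ${s.endMs - s.startMs}ms`
  })
}

// Example timings for the three spans used earlier on this page (made-up values).
const rows = renderTimeline([
  { name: 'classify-intent', startMs: 0, endMs: 120 },
  { name: 'retrieve-context', startMs: 120, endMs: 400 },
  { name: 'generate-response', startMs: 400, endMs: 1600 }
])
console.log(rows.join('\n'))
```

In output like this, the longest bar identifies the step that dominates total latency.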
Flamegraph
Best Practices
1. Meaningful Names
typescript
// Good
span('llm-inference')
span('database-query')
span('vector-search')
// Avoid
span('step1')
span('process')
span('func')
2. Rich Metadata
python
trace.complete(
    output=response,
    metadata={
        'model': 'gpt-4',
        'tokens': 350,
        'cost': 0.0105,
        'confidence': 0.95,
        'cache_hit': False,
        'retries': 0
    }
)
3. Proper Error Handling
Always record errors in traces:
typescript
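try {
  // ... agent logic
} catch (error) {
  // A caught value is `unknown` in strict TypeScript, so normalize it first
  const err = error instanceof Error ? error : new Error(String(error))
  await trace.error({
    error: err.message,
    stack: err.stack,
    severity: 'error',
    context: { /* relevant data */ }
  })
  throw error // Still throw after recording
}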
4. Smart Sampling
Balance coverage and cost:
python
# Sample strategically
def should_trace(request):
# Always trace errors
if has_error(request):
return True
# Always trace slow requests
if is_slow(request):
return True
# Sample by user tier
if request.user_tier == 'enterprise':
return random.random() < 0.5 # 50%
else:
return random.random() < 0.1 # 10%
Next Steps
- Explore SRE - Advanced monitoring and reliability
- Learn about Metrics - Time-series monitoring
- Set Up Alerts - Proactive notifications