Three Pillars of LLMOps

AgenticAnts implements LLMOps (Large Language Model Operations) through three foundational pillars that work together to provide comprehensive AI operations management.

Overview

LLMOps is the overarching discipline that covers the entire lifecycle of LLM operations, from development to production. Our three-pillar approach is designed to cover the full range of AI operational needs.

LLMOps Framework

LLMOps provides the comprehensive framework for managing LLM operations:

Model Lifecycle Management - Selection, versioning, deployment, and retirement
Prompt Operations - Prompt engineering, versioning, and optimization
Performance Optimization - Latency, throughput, and cost optimization
Model Governance - Policies, compliance, and risk management
Versioning & Deployment - CI/CD pipelines and rollback strategies

Learn more about LLMOps →

Pillar 1: AI Cost (FinOps)

Cost Visibility, Allocation, Optimization & Accountability

What is AI Cost (FinOps)?

AI Cost (FinOps) helps organizations understand, control, and optimize AI spending. Cost is the primary indicator and measurable outcome of FinOps, and this pillar provides:

Cost Visibility: See where every dollar is spent in real time
Cost Attribution: Track costs by customer, team, or product
Cost Optimization: Identify and eliminate waste
Cost Accountability: Budget allocation and forecasting

Key Capabilities

Token Usage Monitoring

Track every token consumed by your AI systems:

typescript

// Automatically tracked
const response = await openai.chat.completions.create({
  model: 'gpt-4',
  messages: [{ role: 'user', content: query }]
})

// AgenticAnts records:
// - Model used: gpt-4
// - Tokens: prompt=150, completion=200, total=350
// - Cost: $0.0105 (based on current pricing)
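As a rough illustration of how a per-call cost can be derived from token counts, here is a minimal sketch. The rate table and `estimate_cost` are hypothetical and not part of the AgenticAnts SDK. The flat rate of $0.03 per 1K tokens is an assumption chosen only because it reproduces the $0.0105 figure in the record above; providers commonly price prompt and completion tokens differently, which is why the table keeps the two rates separate.

```python
from decimal import Decimal

# Hypothetical per-1K-token rates in USD (assumption for this sketch only).
PRICE_PER_1K = {
    "gpt-4": {"prompt": Decimal("0.03"), "completion": Decimal("0.03")},
}


def estimate_cost(model: str, prompt_tokens: int, completion_tokens: int) -> Decimal:
    """Return the estimated USD cost of one call from its token counts."""
    rates = PRICE_PER_1K[model]
    total = prompt_tokens * rates["prompt"] + completion_tokens * rates["completion"]
    return total / 1000


cost = estimate_cost("gpt-4", prompt_tokens=150, completion_tokens=200)
print(f"Estimated cost: ${cost}")
```

`Decimal` is used instead of `float` so that small per-call amounts add up without binary rounding drift.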

Cost Per Customer Query

Understand the economics of your AI operations:

python

# View cost breakdown
customer_costs = ants.metrics.get_customer_costs(
    start_date='2025-10-01',
    end_date='2025-10-31',
    group_by='customer'
)

# Results:
# customer_123: $45.50 (450 queries)
# customer_456: $89.20 (920 queries)
# customer_789: $12.30 (95 queries)
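The totals become more useful once they are normalized per query. The sketch below hard-codes the example results above rather than calling the SDK, since the shape of the real return value is not documented here; `cost_per_query` is a hypothetical helper.

```python
# Figures copied from the example results: (total cost in USD, query count).
customer_costs = {
    "customer_123": (45.50, 450),
    "customer_456": (89.20, 920),
    "customer_789": (12.30, 95),
}


def cost_per_query(total_cost: float, queries: int) -> float:
    """Average cost of a single query; 0.0 when there were no queries."""
    return total_cost / queries if queries else 0.0


unit_costs = {cid: cost_per_query(cost, n) for cid, (cost, n) in customer_costs.items()}
most_expensive = max(unit_costs, key=unit_costs.get)

# List customers from most to least expensive per query.
for cid, unit in sorted(unit_costs.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{cid}: ${unit:.4f} per query")
```

In this data the customer with the smallest total bill has the highest cost per query, which is the kind of signal a total-only view hides.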

Budget Management

Set budgets and receive alerts:

typescript

await ants.budgets.create({
  name: 'Q4 AI Spending',
  amount: 10000,
  period: 'quarterly',
  alerts: [
    { threshold: 0.7, type: 'warning' },  // 70%
    { threshold: 0.9, type: 'critical' }  // 90%
  ]
})
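One plausible way to evaluate such thresholds is to compare the fraction of budget spent against each one and report the most severe alert reached. This is a sketch of that logic only; `budget_alert` is a hypothetical function, and how AgenticAnts evaluates alerts internally is not stated in this document.

```python
from typing import List, Optional, Tuple

# Thresholds mirror the example budget: (fraction of budget, alert type).
ALERTS: List[Tuple[float, str]] = [(0.7, "warning"), (0.9, "critical")]


def budget_alert(spent: float, amount: float,
                 alerts: List[Tuple[float, str]] = ALERTS) -> Optional[str]:
    """Return the alert type for the highest threshold reached, or None."""
    if amount <= 0:
        raise ValueError("budget amount must be positive")
    used = spent / amount
    # Sort ascending so the last triggered entry is the most severe one.
    triggered = [kind for threshold, kind in sorted(alerts) if used >= threshold]
    return triggered[-1] if triggered else None


for spent in (6500, 7500, 9500):
    print(spent, budget_alert(spent, 10000))
```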

ROI Analytics

Measure the business impact of AI investments:

typescript

const roi = await ants.analytics.calculateROI({
  costs: 5000,     // AI costs
  revenue: 25000,  // Revenue attributed to AI
  timePeriod: 'month'
})

// ROI: 400% (5x return on investment)
// Cost per conversion: $2.50
// Customer lifetime value: $500
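The 400% figure is consistent with the standard ROI formula, (revenue - costs) / costs. The sketch below works that arithmetic through; `roi_percent` is a hypothetical helper, and whether `calculateROI` uses exactly this formula is an assumption. The conversion and lifetime-value figures depend on inputs not shown in the example, so they are not reproduced here.

```python
def roi_percent(costs: float, revenue: float) -> float:
    """Return on investment as a percentage: net gain divided by cost."""
    if costs <= 0:
        raise ValueError("costs must be positive")
    return (revenue - costs) / costs * 100


roi = roi_percent(costs=5000, revenue=25000)
multiple = 25000 / 5000  # revenue as a multiple of cost
print(f"ROI: {roi:.0f}% ({multiple:.0f}x revenue per dollar spent)")
```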

AI Cost Best Practices

Tag Everything: Use consistent tagging for cost attribution
Set Budgets: Define spending limits for teams and projects
Monitor Regularly: Review costs weekly, not monthly
Optimize Models: Use smaller models where appropriate
Cache Responses: Reduce redundant LLM calls
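To make the "Cache Responses" practice concrete, here is a minimal in-memory sketch. `ResponseCache` and the `call_llm` callback are hypothetical names, not AgenticAnts APIs. The sketch assumes identical prompts may safely return identical answers, which holds for deterministic lookups but not for every use case.

```python
import hashlib
import json
from typing import Callable, Dict


class ResponseCache:
    """In-memory cache keyed by a hash of (model, prompt)."""

    def __init__(self) -> None:
        self._store: Dict[str, str] = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(model: str, prompt: str) -> str:
        # Serialize deterministically so equal inputs always hash the same.
        payload = json.dumps({"model": model, "prompt": prompt}, sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def get_or_call(self, model: str, prompt: str,
                    call_llm: Callable[[str, str], str]) -> str:
        """Return a cached response, calling the model only on a miss."""
        key = self._key(model, prompt)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        result = call_llm(model, prompt)
        self._store[key] = result
        return result


def fake_llm(model: str, prompt: str) -> str:
    """Stand-in for a real model call."""
    return f"[{model}] answer to: {prompt}"


cache = ResponseCache()
first = cache.get_or_call("gpt-4", "What is LLMOps?", fake_llm)
second = cache.get_or_call("gpt-4", "What is LLMOps?", fake_llm)
print(cache.hits, cache.misses)
```

A production cache would also need an eviction policy and a time-to-live, both omitted here for brevity.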

Learn more about AI Cost (FinOps) →

Pillar 2: AI Resilient (SRE)

Latency, Performance, Availability & Reliability

What is AI Resilient (SRE)?

AI Resilient (SRE) applies SRE principles to AI systems, with an emphasis on latency, performance, and operational health:

Low Latency: Fast response times and optimized performance
High Performance: Maximum throughput and efficiency
Reliable: High availability and fault tolerance with error budgets
Observable: Complete visibility into system behavior
SLAs/SLOs: Service level agreements and objectives, and compliance with them

Key Capabilities

End-to-End Tracing

Follow requests through your entire AI stack:

typescript

// Trace shows complete execution path
Trace: customer-support-query (2.3s)
├─ Span: input-validation (10ms)
├─ Span: retrieve-customer-context (150ms)
│  └─ Span: database-query (145ms)
├─ Span: vector-search (200ms)
│  ├─ Span: embedding-generation (50ms)
│  └─ Span: similarity-search (150ms)
├─ Span: llm-inference (1.8s)
│  ├─ Span: prompt-construction (5ms)
│  ├─ Span: api-call (1.78s)
│  └─ Span: response-parsing (15ms)
└─ Span: response-formatting (140ms)
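A trace like this is a tree of timed spans, and the usual first question is where the time went. The sketch below models the example trace with a hypothetical `Span` dataclass (not the AgenticAnts data model) and finds the slowest leaf span. It assumes child spans run sequentially without overlapping, which the example durations are consistent with.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Span:
    name: str
    duration_ms: int
    children: List["Span"] = field(default_factory=list)

    def self_time_ms(self) -> int:
        """Time spent in this span that is not covered by its children."""
        return self.duration_ms - sum(c.duration_ms for c in self.children)


def slowest_leaf(span: Span) -> Span:
    """Return the leaf span with the longest duration."""
    if not span.children:
        return span
    return max((slowest_leaf(c) for c in span.children), key=lambda s: s.duration_ms)


# Durations copied from the example trace, converted to milliseconds.
trace = Span("customer-support-query", 2300, [
    Span("input-validation", 10),
    Span("retrieve-customer-context", 150, [Span("database-query", 145)]),
    Span("vector-search", 200, [
        Span("embedding-generation", 50),
        Span("similarity-search", 150),
    ]),
    Span("llm-inference", 1800, [
        Span("prompt-construction", 5),
        Span("api-call", 1780),
        Span("response-parsing", 15),
    ]),
    Span("response-formatting", 140),
])

bottleneck = slowest_leaf(trace)
print(f"Slowest leaf: {bottleneck.name} ({bottleneck.duration_ms}ms)")
```

Here the external model call dominates the request, so optimizing retrieval or formatting would barely change end-to-end latency.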

Performance Monitoring

Track key performance metrics:

python

# View performance metrics
metrics = ants.metrics.get_performance({
    'agent': 'customer-support',
    'period': 'last_24h'
})

print(f"Latency p50: {metrics.latency.p50}ms")  # 1,200ms
print(f"Latency p95: {metrics.latency.p95}ms")  # 3,500ms
print(f"Latency p99: {metrics.latency.p99}ms")  # 5,200ms
print(f"Error rate: {metrics.error_rate}%")     # 0.5%
print(f"Throughput: {metrics.throughput}/s")    # 45 req/s
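For readers unfamiliar with p50/p95/p99, the sketch below computes percentiles from raw latency samples using the nearest-rank method. This is one common definition among several; which method AgenticAnts uses is not stated here, and the sample values are made up for illustration.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of
    all samples at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in the range 1..100")
    ordered = sorted(samples)
    # Multiply before dividing so whole-number ranks stay exact in floating point.
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


# Hypothetical latency samples in milliseconds.
latencies_ms = [900, 1100, 1200, 1300, 1250, 3500, 1000, 950, 5200, 1150]

for pct in (50, 95, 99):
    print(f"p{pct}: {percentile(latencies_ms, pct)}ms")
```

With only ten samples, p95 and p99 both land on the single slowest request, which is why tail percentiles are only meaningful over a large enough window.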

Automated Alerting

Get notified when things go wrong:

typescript

await ants.alerts.create({
  name: 'High Error Rate',
  condition: 'error_rate > 5%',
  window: '5m',
  channels: ['slack', 'pagerduty'],
  severity: 'critical'
})

await ants.alerts.create({
  name: 'Slow Response Time',
  condition: 'p95_latency > 5000ms',
  window: '10m',
  channels: ['email'],
  severity: 'warning'
})
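A condition such as `error_rate > 5%` over a `5m` window can be pictured as a sliding-window check. The sketch below is one way to implement that idea; `ErrorRateAlert` is a hypothetical class, and the actual evaluation strategy behind `ants.alerts` is not described in this document.

```python
from collections import deque
from typing import Deque, Tuple


class ErrorRateAlert:
    """Fires when the error rate over a sliding time window exceeds a threshold."""

    def __init__(self, threshold: float = 0.05, window_seconds: float = 300.0) -> None:
        self.threshold = threshold
        self.window_seconds = window_seconds
        self._events: Deque[Tuple[float, bool]] = deque()  # (timestamp, is_error)

    def record(self, timestamp: float, is_error: bool) -> bool:
        """Add one request outcome and return True if the alert should fire."""
        self._events.append((timestamp, is_error))
        # Drop events that have fallen out of the window.
        cutoff = timestamp - self.window_seconds
        while self._events and self._events[0][0] <= cutoff:
            self._events.popleft()
        errors = sum(1 for _, err in self._events if err)
        return errors / len(self._events) > self.threshold


alert = ErrorRateAlert()
results = [alert.record(float(t), is_error=False) for t in range(19)]
at_limit = alert.record(19.0, is_error=True)    # 1 error in 20 requests
over_limit = alert.record(20.0, is_error=True)  # 2 errors in 21 requests
print(at_limit, over_limit)
```

The comparison is strict, so a rate of exactly 5% does not fire, matching the `>` in the example condition.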

Incident Response

Quickly diagnose and resolve issues:

python

# Get incident details
incident = ants.incidents.get('inc-123')

# View timeline
for event in incident.timeline:
    print(f"{event.time}: {event.description}")

# Identify root cause
root_cause = ants.incidents.analyze_root_cause('inc-123')
print(f"Root cause: {root_cause.description}")

# View similar incidents
similar = ants.incidents.find_similar('inc-123')

AI Resilient Best Practices

Set SLOs: Define Service Level Objectives for latency and availability
Monitor Proactively: Don't wait for users to report performance issues
Automate Responses: Auto-remediate common issues
Learn from Incidents: Conduct post-mortems
Test Resilience: Implement chaos engineering
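Setting an SLO implies an error budget: the amount of unavailability the objective still permits. The sketch below shows the arithmetic for an availability SLO. The function names, the 30-day period, and the 99.9% target are illustrative assumptions, not AgenticAnts defaults.

```python
def error_budget_minutes(slo: float, period_days: int = 30) -> float:
    """Allowed minutes of unavailability for an availability SLO over the period."""
    if not 0 < slo < 1:
        raise ValueError("slo must be a fraction between 0 and 1, e.g. 0.999")
    return (1 - slo) * period_days * 24 * 60


def budget_consumed(downtime_minutes: float, slo: float, period_days: int = 30) -> float:
    """Fraction of the error budget already used; above 1.0 the SLO is breached."""
    return downtime_minutes / error_budget_minutes(slo, period_days)


budget = error_budget_minutes(0.999)  # 99.9% availability over 30 days
used = budget_consumed(10.8, 0.999)   # 10.8 minutes of downtime so far
print(f"Budget: {budget:.1f} min, consumed: {used:.0%}")
```

Tracking the consumed fraction gives teams a simple rule for when to slow down risky changes, such as pausing deployments once most of the budget is spent.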

Learn more about AI Resilient (SRE) →

Pillar 3: AI Governance

Compliance, Risk Management & Policy Enforcement

What is AI Governance?

AI Governance provides oversight of AI/LLM operations through policy enforcement, compliance, risk management, and security:

AI Governance: Policy enforcement and model usage controls
Compliance: Meet regulatory requirements (SOC 2, GDPR, HIPAA, EU AI Act)
Risk Management: Continuous assessment and mitigation
Data Protection: Prevent sensitive data leaks
Access Control: Manage who can access what with RBAC
Audit Trails: Complete logs for compliance and forensics

Key Capabilities

PII Detection & Protection

Automatically identify and protect sensitive data:

typescript

// AgenticAnts automatically detects PII
const trace = await ants.trace.create({
  name: 'customer-query',
  input: 'My SSN is 123-45-6789 and email is john@example.com'
})

// Dashboard shows:
// - PII detected: SSN, Email
// - Automatically redacted in storage
// - Alert sent to security team
// - Audit log created
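To give a feel for what detection and redaction involve, here is a deliberately simple regex-based sketch applied to the same input. It is not how AgenticAnts detects PII, which this document does not describe. The patterns, the `redact_pii` helper, and the `[REDACTED:<type>]` marker format are all assumptions, and real detectors need far broader coverage than two regular expressions.

```python
import re
from typing import List, Tuple

# Toy patterns for US Social Security numbers and email addresses.
PII_PATTERNS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "Email": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
}


def redact_pii(text: str) -> Tuple[str, List[str]]:
    """Replace each match with a typed marker and list the PII types found."""
    found: List[str] = []
    for label, pattern in PII_PATTERNS.items():
        text, count = pattern.subn(f"[REDACTED:{label}]", text)
        if count:
            found.append(label)
    return text, found


redacted, detected = redact_pii("My SSN is 123-45-6789 and email is john@example.com")
print(redacted)
print("PII detected:", ", ".join(detected))
```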

Security Guardrails

Prevent harmful or policy-violating outputs:

python

# Configure guardrails
ants.guardrails.create({
    'name': 'content-policy',
    'rules': [
        {'type': 'no_pii', 'action': 'redact'},
        {'type': 'no_toxic_content', 'action': 'block'},
        {'type': 'no_financial_advice', 'action': 'warn'}
    ]
})

# Automatically enforced on all outputs
response = agent.run(query)  # Checked against guardrails
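The three actions in this configuration behave differently: `redact` rewrites the output, `block` suppresses it, and `warn` lets it through while flagging it. The sketch below illustrates that dispatch with toy regex detectors. The `Rule` type, the `enforce` function, and every pattern are hypothetical stand-ins; real toxicity or financial-advice detection would typically rely on classifiers rather than keywords.

```python
import re
from typing import List, NamedTuple, Optional, Tuple


class Rule(NamedTuple):
    type: str
    action: str          # "redact", "block", or "warn"
    pattern: re.Pattern  # toy detector, an assumption for this sketch


RULES = [
    Rule("no_pii", "redact", re.compile(r"\b\d{3}-\d{2}-\d{4}\b")),
    Rule("no_toxic_content", "block", re.compile(r"\bidiot\b", re.IGNORECASE)),
    Rule("no_financial_advice", "warn", re.compile(r"\byou should buy\b", re.IGNORECASE)),
]


def enforce(output: str, rules: List[Rule] = RULES) -> Tuple[Optional[str], List[str]]:
    """Return the output (redacted if needed, None if blocked) and the rules triggered."""
    triggered: List[str] = []
    for rule in rules:
        if not rule.pattern.search(output):
            continue
        triggered.append(rule.type)
        if rule.action == "block":
            return None, triggered
        if rule.action == "redact":
            output = rule.pattern.sub("[REDACTED]", output)
        # "warn" leaves the output unchanged; the caller can log `triggered`.
    return output, triggered


print(enforce("Your SSN 123-45-6789 is on file."))
print(enforce("You should buy bonds."))
```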

Compliance Reporting

Generate compliance reports automatically:

typescript

// Generate SOC2 compliance report
const report = await ants.compliance.generate({
  framework: 'SOC2',
  period: 'Q4-2025',
  controls: [
    'access-control',
    'data-encryption',
    'audit-logging',
    'incident-response'
  ]
})

// Download GDPR data export
const gdprExport = await ants.compliance.exportData({
  userId: 'user-123',
  format: 'json'
})

RBAC & Access Control

Fine-grained permissions management:

python

# Create role
ants.roles.create({
    'name': 'data-scientist',
    'permissions': [
        'traces.read',
        'metrics.read',
        'dashboards.read',
        'projects.list'
    ],
    'resources': ['project-123', 'project-456']
})

# Assign to user
ants.users.assign_role('user-789', 'data-scientist')
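Conceptually, a role like this is checked on every request: access is granted only if one of the user's roles includes both the permission and the resource. The sketch below shows that deny-by-default check using the role from the example. The `Role` dataclass, the lookup tables, and `is_allowed` are hypothetical, not the AgenticAnts authorization model.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Set


@dataclass(frozen=True)
class Role:
    name: str
    permissions: FrozenSet[str]
    resources: FrozenSet[str]


ROLES: Dict[str, Role] = {
    "data-scientist": Role(
        name="data-scientist",
        permissions=frozenset(
            {"traces.read", "metrics.read", "dashboards.read", "projects.list"}
        ),
        resources=frozenset({"project-123", "project-456"}),
    ),
}

USER_ROLES: Dict[str, Set[str]] = {"user-789": {"data-scientist"}}


def is_allowed(user_id: str, permission: str, resource: str) -> bool:
    """Deny by default; allow only if an assigned role grants both items."""
    for role_name in USER_ROLES.get(user_id, set()):
        role = ROLES.get(role_name)
        if role and permission in role.permissions and resource in role.resources:
            return True
    return False


print(is_allowed("user-789", "traces.read", "project-123"))
print(is_allowed("user-789", "traces.write", "project-123"))
```

Unknown users, unknown roles, and unlisted resources all fall through to `False`, which keeps the check aligned with the principle of least privilege.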

Audit Trails

Complete logging of all activities:

typescript

// Query audit logs
const logs = await ants.audit.query({
  action: 'data.export',
  startDate: '2025-10-01',
  endDate: '2025-10-31'
})

for (const log of logs) {
  console.log(`${log.timestamp}: ${log.user} ${log.action}`)
  console.log(`  Resource: ${log.resource}`)
  console.log(`  IP: ${log.ip}`)
  console.log(`  Status: ${log.status}`)
}

AI Governance Best Practices

Principle of Least Privilege: Give minimum necessary access
Policy Enforcement: Define and enforce AI/LLM usage policies
Regular Audits: Review access, compliance, and activities regularly
Risk Assessment: Continuously assess and mitigate AI risks
Incident Response Plan: Have a plan before incidents occur
