Batch Inference Limits
Provider-specific requirements and limits for batch inference in the AI Optimizer.
Batch inference can be up to 50% cheaper than real-time inference when running large numbers of traces for comparison. However, each provider imposes its own minimum request counts and other constraints.
Provider Comparison
| Provider | Min Requests | Max Requests | Max Input Size | Deployment Requirement |
|---|---|---|---|---|
| AWS Bedrock | See quotas | See quotas | 1 GB/file, 5 GB total | Model access enabled in region |
| Google Vertex AI | No minimum | 200,000 | Varies by model | Model available in selected location |
| Azure OpenAI | 1 | 100,000 | 200 MB | Global Batch deployment type |
| Anthropic | 1 | 100,000 | 256 MB | Standard API access |
| OpenAI | 1 | 50,000 | 200 MB | Standard API access |
Provider-Specific Notes
AWS Bedrock
- Batch limits are expressed as file size (1 GB per file, 5 GB cumulative) rather than record counts — check AWS Service Quotas (opens in a new tab) for your account's specific limits
- S3 bucket is automatically created for batch input/output
- Service role ARN is required for S3 access
- See Bedrock Batch Inference Setup for IAM configuration
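Because Bedrock expresses batch limits as file sizes rather than record counts, it can help to pre-check input files before submission. The sketch below is a hypothetical helper (the function name is illustrative, not part of any SDK) that enforces the documented 1 GB per-file and 5 GB per-job limits:

```python
# Hypothetical pre-flight check mirroring the documented Bedrock batch limits.
MAX_FILE_BYTES = 1 * 1024**3    # 1 GB per input file
MAX_TOTAL_BYTES = 5 * 1024**3   # 5 GB across all input files for one job

def validate_batch_sizes(file_sizes_bytes: list[int]) -> None:
    """Raise ValueError if any file, or the total, exceeds the limits."""
    for size in file_sizes_bytes:
        if size > MAX_FILE_BYTES:
            raise ValueError(f"file of {size} bytes exceeds the 1 GB per-file limit")
    if sum(file_sizes_bytes) > MAX_TOTAL_BYTES:
        raise ValueError("combined input exceeds the 5 GB per-job limit")
```

Note that your account's actual quotas may be lower; check AWS Service Quotas before relying on these ceilings.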
Google Vertex AI
- No minimum batch size — but very small batches lose the cost advantage
- Up to 200,000 requests per batch job
- Requires Vertex AI API enabled and a service account with the aiplatform.user role
- Batch jobs run in the configured location (defaults to us-central1) — supports all Gemini-available regions
- See Vertex AI Batch Inference Setup for GCP configuration
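Vertex AI batch jobs for Gemini models take a JSONL input where each line wraps one request. A minimal sketch, assuming the common Gemini batch request shape (verify the exact schema against your model's documentation), that also enforces the 200,000-request cap from the table above:

```python
import json

MAX_REQUESTS = 200_000  # per-job cap from the provider comparison table

def build_vertex_batch_jsonl(prompts: list[str]) -> str:
    """One JSONL line per request; the "request"/"contents" shape is an
    assumption based on the Gemini batch prediction input format."""
    if len(prompts) > MAX_REQUESTS:
        raise ValueError(f"{len(prompts)} requests exceeds the 200,000 cap")
    lines = [
        json.dumps({"request": {"contents": [
            {"role": "user", "parts": [{"text": p}]}]}})
        for p in prompts
    ]
    return "\n".join(lines)
```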
Azure OpenAI
- Models must be deployed with the "Global Batch" deployment type
- Standard or provisioned deployments will fail for batch jobs
- See Azure documentation (opens in a new tab) for batch deployment setup
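Since standard or provisioned deployments fail for batch jobs, a guard at submission time saves a wasted round trip. This is a hypothetical check; the "GlobalBatch" SKU string is an assumption, so confirm the exact value in your deployment's properties:

```python
# Hypothetical guard that a deployment uses the Global Batch type before
# submitting a batch job; the SKU name "GlobalBatch" is an assumption.
def assert_global_batch(deployment_sku: str) -> None:
    if deployment_sku != "GlobalBatch":
        raise ValueError(
            f"deployment SKU {deployment_sku!r} will fail for batch jobs; "
            "redeploy the model with the Global Batch deployment type")
```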
Anthropic
- Uses the Anthropic Message Batches API
- Standard API key with batch access is sufficient
- No special deployment configuration needed
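A batch for the Message Batches API is a list of entries, each pairing a custom_id with the usual message parameters. The helper below is a sketch of that request shape (the model name and max_tokens value are placeholders; actual submission via the SDK is omitted), capped at the 100,000-request limit from the table:

```python
def build_anthropic_batch(prompts: dict[str, str], model: str) -> list[dict]:
    """Map custom_id -> prompt into Message Batches request entries."""
    if len(prompts) > 100_000:
        raise ValueError("Anthropic batches are capped at 100,000 requests")
    return [
        {
            "custom_id": cid,
            "params": {
                "model": model,
                "max_tokens": 1024,  # placeholder; tune per workload
                "messages": [{"role": "user", "content": text}],
            },
        }
        for cid, text in prompts.items()
    ]
```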
OpenAI
- Uses the OpenAI Batch API
- Standard API key is sufficient
- No special deployment configuration needed
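The OpenAI Batch API takes an uploaded JSONL file where each line names a custom_id, an endpoint, and a request body. A minimal sketch of that file format (the model name is a placeholder; the upload and job-creation calls are omitted), enforcing the 50,000-request cap from the table:

```python
import json

def build_openai_batch_jsonl(prompts: dict[str, str], model: str) -> str:
    """One JSONL line per request, per the OpenAI Batch API file format."""
    if len(prompts) > 50_000:
        raise ValueError("OpenAI batches are capped at 50,000 requests")
    lines = [
        json.dumps({
            "custom_id": cid,
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {"model": model,
                     "messages": [{"role": "user", "content": text}]},
        })
        for cid, text in prompts.items()
    ]
    return "\n".join(lines)
```

The resulting file must also stay under the 200 MB input limit listed above.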
General Limitations
- Latency optimization is not supported for batch inference jobs — latency metrics are not meaningful for batch processing
- Job completion times vary — batch jobs may take minutes to hours depending on queue depth and batch size
- Results are not real-time — batch jobs are queued and processed asynchronously
- Cost savings scale with volume — the more traces you compare, the greater the cost benefit
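To make the volume point concrete, a back-of-the-envelope estimator (the function name is illustrative, and the 50% discount is the documented upper bound, not a guarantee; real discounts vary by provider and model):

```python
def estimate_batch_cost(realtime_cost_per_trace: float,
                        num_traces: int,
                        discount: float = 0.5) -> tuple[float, float]:
    """Return (real-time total, batch total) for a comparison run.
    discount=0.5 models the best-case 50% batch saving."""
    realtime_total = realtime_cost_per_trace * num_traces
    return realtime_total, realtime_total * (1 - discount)
```

At small volumes the absolute saving is negligible, which is why very small batches lose the cost advantage.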
Choosing Batch vs Real-Time
| Factor | Batch Inference | Real-Time Inference |
|---|---|---|
| Cost | Up to 50% cheaper | Standard pricing |
| Speed | Minutes to hours | Seconds |
| Latency metrics | Not available | Full P50/P95/P99 |
| Min requests | Varies by provider | 1 |
| Best for | Large-scale cost comparisons | Quick evaluations, latency optimization |
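The table's decision logic can be sketched as a small helper; the function and parameter names are illustrative, not part of the product's API:

```python
def choose_inference_mode(needs_latency_metrics: bool, num_traces: int,
                          provider_min_requests: int = 1) -> str:
    """Pick batch vs real-time per the comparison table."""
    if needs_latency_metrics:
        return "real-time"   # batch jobs report no latency metrics
    if num_traces < provider_min_requests:
        return "real-time"   # below the provider's batch minimum
    return "batch"           # large-scale cost comparisons
```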
⚠️
If you need latency optimization results, use real-time inference. Batch inference only supports cost optimization comparisons.