Batch Inference Limits
Provider-specific requirements and limits for batch inference in the AI Optimizer.
Batch inference can be up to 50% cheaper than real-time inference when running large numbers of traces for comparison. However, each provider imposes its own minimum request counts and other constraints.
Provider Comparison
| Provider | Min Requests | Max Requests | Max Input Size | Deployment Requirement |
|---|---|---|---|---|
| AWS Bedrock | See quotas | See quotas | 1 GB/file, 5 GB total | Model access enabled in region |
| Google Vertex AI | No minimum | 200,000 | Varies by model | Model available in selected location |
| Azure OpenAI | 1 | 100,000 | 200 MB | Global Batch deployment type |
| Anthropic | 1 | 100,000 | 256 MB | Standard API access |
| OpenAI | 1 | 50,000 | 200 MB | Standard API access |
Provider-Specific Notes
AWS Bedrock
- Batch limits are expressed as file size (1 GB per file, 5 GB cumulative) rather than record counts — check AWS Service Quotas (opens in a new tab) for your account's specific limits
- S3 bucket is automatically created for batch input/output
- Service role ARN is required for S3 access
- See Bedrock Batch Inference Setup for IAM configuration
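Because Bedrock expresses batch limits as file sizes rather than record counts, it can help to pre-check input files before submission. The sketch below is a hypothetical helper (the function name is illustrative, not part of any SDK) that enforces the documented 1 GB per-file and 5 GB per-job limits:

```python
# Hypothetical pre-flight check mirroring the documented Bedrock batch limits.
MAX_FILE_BYTES = 1 * 1024**3    # 1 GB per input file
MAX_TOTAL_BYTES = 5 * 1024**3   # 5 GB across all input files for one job

def validate_batch_sizes(file_sizes_bytes: list[int]) -> None:
    """Raise ValueError if any file, or the total, exceeds the limits."""
    for size in file_sizes_bytes:
        if size > MAX_FILE_BYTES:
            raise ValueError(f"file of {size} bytes exceeds the 1 GB per-file limit")
    if sum(file_sizes_bytes) > MAX_TOTAL_BYTES:
        raise ValueError("combined input exceeds the 5 GB per-job limit")
```

Note that your account's actual quotas may be lower; check AWS Service Quotas before relying on these ceilings.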
Google Vertex AI
- No minimum batch size — but very small batches lose the cost advantage
- Up to 200,000 requests per batch job
- Requires Vertex AI API enabled and a service account with the aiplatform.user role
- Batch jobs run in the configured location (defaults to us-central1) — supports all Gemini-available regions
- See Vertex AI Batch Inference Setup for GCP configuration
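Vertex AI batch jobs for Gemini models take a JSONL input where each line wraps one request. A minimal sketch, assuming the common Gemini batch request shape (verify the exact schema against your model's documentation), that also enforces the 200,000-request cap from the table above:

```python
import json

MAX_REQUESTS = 200_000  # per-job cap from the provider comparison table

def build_vertex_batch_jsonl(prompts: list[str]) -> str:
    """One JSONL line per request; the "request"/"contents" shape is an
    assumption based on the Gemini batch prediction input format."""
    if len(prompts) > MAX_REQUESTS:
        raise ValueError(f"{len(prompts)} requests exceeds the 200,000 cap")
    lines = [
        json.dumps({"request": {"contents": [
            {"role": "user", "parts": [{"text": p}]}]}})
        for p in prompts
    ]
    return "\n".join(lines)
```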
Azure OpenAI
- Models must be deployed with the "Global Batch" deployment type
- Standard or provisioned deployments will fail for batch jobs
- See Azure documentation (opens in a new tab) for batch deployment setup
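Since standard or provisioned deployments fail for batch jobs, a guard at submission time saves a wasted round trip. This is a hypothetical check; the "GlobalBatch" SKU string is an assumption, so confirm the exact value in your deployment's properties:

```python
# Hypothetical guard that a deployment uses the Global Batch type before
# submitting a batch job; the SKU name "GlobalBatch" is an assumption.
def assert_global_batch(deployment_sku: str) -> None:
    if deployment_sku != "GlobalBatch":
        raise ValueError(
            f"deployment SKU {deployment_sku!r} will fail for batch jobs; "
            "redeploy the model with the Global Batch deployment type")
```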
Anthropic
- Uses the Anthropic Message Batches API
- Standard API key with batch access is sufficient
- No special deployment configuration needed
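A batch for the Message Batches API is a list of entries, each pairing a custom_id with the usual message parameters. The helper below is a sketch of that request shape (the model name and max_tokens value are placeholders; actual submission via the SDK is omitted), capped at the 100,000-request limit from the table:

```python
def build_anthropic_batch(prompts: dict[str, str], model: str) -> list[dict]:
    """Map custom_id -> prompt into Message Batches request entries."""
    if len(prompts) > 100_000:
        raise ValueError("Anthropic batches are capped at 100,000 requests")
    return [
        {
            "custom_id": cid,
            "params": {
                "model": model,
                "max_tokens": 1024,  # placeholder; tune per workload
                "messages": [{"role": "user", "content": text}],
            },
        }
        for cid, text in prompts.items()
    ]
```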
OpenAI
- Uses the OpenAI Batch API
- Standard API key is sufficient
- No special deployment configuration needed
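The OpenAI Batch API takes an uploaded JSONL file where each line names a custom_id, an endpoint, and a request body. A minimal sketch of that file format (the model name is a placeholder; the upload and job-creation calls are omitted), enforcing the 50,000-request cap from the table:

```python
import json

def build_openai_batch_jsonl(prompts: dict[str, str], model: str) -> str:
    """One JSONL line per request, per the OpenAI Batch API file format."""
    if len(prompts) > 50_000:
        raise ValueError("OpenAI batches are capped at 50,000 requests")
    lines = [
        json.dumps({
            "custom_id": cid,
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {"model": model,
                     "messages": [{"role": "user", "content": text}]},
        })
        for cid, text in prompts.items()
    ]
    return "\n".join(lines)
```

The resulting file must also stay under the 200 MB input limit listed above.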
General Limitations
- Latency optimization is not supported for batch inference jobs — latency metrics are not meaningful for batch processing
- Job completion times vary — batch jobs may take minutes to hours depending on queue depth and batch size
- Results are not real-time — batch jobs are queued and processed asynchronously
- Cost savings scale with volume — the more traces you compare, the greater the cost benefit
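To make the volume point concrete, a back-of-the-envelope estimator (the function name is illustrative, and the 50% discount is the documented upper bound, not a guarantee; real discounts vary by provider and model):

```python
def estimate_batch_cost(realtime_cost_per_trace: float,
                        num_traces: int,
                        discount: float = 0.5) -> tuple[float, float]:
    """Return (real-time total, batch total) for a comparison run.
    discount=0.5 models the best-case 50% batch saving."""
    realtime_total = realtime_cost_per_trace * num_traces
    return realtime_total, realtime_total * (1 - discount)
```

At small volumes the absolute saving is negligible, which is why very small batches lose the cost advantage.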
Choosing Batch vs Real-Time
| Factor | Batch Inference | Real-Time Inference |
|---|---|---|
| Cost | Up to 50% cheaper | Standard pricing |
| Speed | Minutes to hours | Seconds |
| Latency metrics | Not available | Full P50/P95/P99 |
| Min requests | Varies by provider | 1 |
| Best for | Large-scale cost comparisons | Quick evaluations, latency optimization |
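The table's decision logic can be sketched as a small helper; the function and parameter names are illustrative, not part of the product's API:

```python
def choose_inference_mode(needs_latency_metrics: bool, num_traces: int,
                          provider_min_requests: int = 1) -> str:
    """Pick batch vs real-time per the comparison table."""
    if needs_latency_metrics:
        return "real-time"   # batch jobs report no latency metrics
    if num_traces < provider_min_requests:
        return "real-time"   # below the provider's batch minimum
    return "batch"           # large-scale cost comparisons
```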
⚠️
If you need latency optimization results, use real-time inference. Batch inference only supports cost optimization comparisons.