Batch Inference Limits

Provider-specific requirements and limits for batch inference in the AI Optimizer.

Batch inference can be up to 50% cheaper than real-time inference when comparing large numbers of traces. However, each provider has minimum trace requirements and specific constraints.

Provider Comparison

| Provider | Min Requests | Max Requests | Max Input Size | Deployment Requirement |
|---|---|---|---|---|
| AWS Bedrock | See quotas | See quotas | 1 GB/file, 5 GB total | Model access enabled in region |
| Google Vertex AI | No minimum | 200,000 | Varies by model | Model available in selected location |
| Azure OpenAI | 1 | 100,000 | 200 MB | Global Batch deployment type |
| Anthropic | 1 | 100,000 | 256 MB | Standard API access |
| OpenAI | 1 | 50,000 | 200 MB | Standard API access |
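The request-count limits in the table can be expressed as a simple pre-flight check. This is an illustrative sketch, not part of the Optimizer itself; the function name is hypothetical, and AWS Bedrock is omitted because its limits are file-size based rather than request-count based.

```python
# Hypothetical pre-flight check mirroring the comparison table above.
# Bedrock is excluded: its limits are expressed in bytes, not requests.
PROVIDER_LIMITS = {
    "vertex_ai": {"min_requests": 0, "max_requests": 200_000},
    "azure_openai": {"min_requests": 1, "max_requests": 100_000},
    "anthropic": {"min_requests": 1, "max_requests": 100_000},
    "openai": {"min_requests": 1, "max_requests": 50_000},
}


def check_batch_size(provider: str, num_requests: int) -> bool:
    """Return True if num_requests fits the provider's batch limits."""
    limits = PROVIDER_LIMITS[provider]
    return limits["min_requests"] <= num_requests <= limits["max_requests"]
```

A check like this catches limit violations before a job is submitted, rather than waiting for the provider to reject the batch.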

Provider-Specific Notes

AWS Bedrock

  • Batch limits are expressed as file size (1 GB per file, 5 GB cumulative) rather than record counts — check AWS Service Quotas (opens in a new tab) for your account's specific limits
  • S3 bucket is automatically created for batch input/output
  • Service role ARN is required for S3 access
  • See Bedrock Batch Inference Setup for IAM configuration
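Because Bedrock's limits are size-based, large inputs must be split across files. The following is a minimal sketch of that splitting logic, assuming JSONL input; the function name and structure are illustrative, not part of any AWS SDK.

```python
# Illustrative sketch: group JSONL lines into batch input files that
# respect Bedrock's size limits (1 GB per file, 5 GB cumulative).
GB = 1024 ** 3
MAX_FILE_BYTES = 1 * GB
MAX_TOTAL_BYTES = 5 * GB


def split_records(records: list[str],
                  max_file_bytes: int = MAX_FILE_BYTES,
                  max_total_bytes: int = MAX_TOTAL_BYTES) -> list[list[str]]:
    """Group JSONL lines into files, each under max_file_bytes,
    raising if the cumulative payload exceeds max_total_bytes."""
    files, current, current_size, total = [], [], 0, 0
    for line in records:
        size = len(line.encode("utf-8")) + 1  # +1 for the trailing newline
        total += size
        if total > max_total_bytes:
            raise ValueError("cumulative input exceeds the total size limit")
        if current and current_size + size > max_file_bytes:
            files.append(current)
            current, current_size = [], 0
        current.append(line)
        current_size += size
    if current:
        files.append(current)
    return files
```

The limits are parameters so the logic can be exercised with small values; in practice the defaults mirror the quotas above.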

Google Vertex AI

  • No minimum batch size — but very small batches lose the cost advantage
  • Up to 200,000 requests per batch job
  • Requires Vertex AI API enabled and a service account with aiplatform.user role
  • Batch jobs run in the configured location (defaults to us-central1) — supports all Gemini-available regions
  • See Vertex AI Batch Inference Setup for GCP configuration
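If a workload exceeds the 200,000-request ceiling, it has to be split across multiple batch jobs. A minimal chunking sketch (the helper name is an assumption, not an Optimizer API):

```python
# Illustrative sketch: split a request list into Vertex AI batch jobs
# of at most 200,000 requests each (the per-job maximum noted above).
MAX_REQUESTS_PER_JOB = 200_000


def chunk_requests(requests: list, chunk_size: int = MAX_REQUESTS_PER_JOB):
    """Yield successive chunks, each small enough for one batch job."""
    for i in range(0, len(requests), chunk_size):
        yield requests[i:i + chunk_size]
```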

Azure OpenAI

  • Requires a deployment with the Global Batch deployment type
  • Up to 100,000 requests per batch job, with a 200 MB maximum input file size

Anthropic

  • Uses the Anthropic Message Batches API
  • Standard API key with batch access is sufficient
  • No special deployment configuration needed
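The Message Batches API accepts a list of requests, each pairing a `custom_id` (used to match results back to inputs) with standard Messages API parameters. A minimal sketch of building that payload; the model name, prompts, and helper function are placeholders:

```python
# Illustrative sketch of a Message Batches payload: each request pairs a
# custom_id with ordinary Messages API params. Model/prompts are placeholders.
def build_batch_requests(prompts: list[str], model: str) -> list[dict]:
    return [
        {
            "custom_id": f"trace-{i}",
            "params": {
                "model": model,
                "max_tokens": 1024,
                "messages": [{"role": "user", "content": prompt}],
            },
        }
        for i, prompt in enumerate(prompts)
    ]
```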

OpenAI

  • Uses the OpenAI Batch API
  • Standard API key is sufficient
  • No special deployment configuration needed
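The OpenAI Batch API takes a JSONL input file where each line names an endpoint and a request body. A minimal sketch of producing that file's contents; the model name and helper are placeholders:

```python
import json


# Illustrative sketch: serialize prompts into OpenAI Batch API JSONL,
# one request per line. Model name and prompts are placeholders.
def to_batch_jsonl(prompts: list[str], model: str) -> str:
    lines = [
        json.dumps({
            "custom_id": f"trace-{i}",
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {"model": model,
                     "messages": [{"role": "user", "content": p}]},
        })
        for i, p in enumerate(prompts)
    ]
    return "\n".join(lines)
```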

General Limitations

  • Latency optimization is not supported for batch inference jobs — latency metrics are not meaningful for batch processing
  • Job completion times vary — batch jobs may take minutes to hours depending on queue depth and batch size
  • Results are not real-time — batch jobs are queued and processed asynchronously
  • Cost savings scale with volume — the more traces you compare, the greater the cost benefit
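To see why savings scale with volume, compare total cost at the up-to-50% batch discount mentioned above. The per-trace price here is a made-up figure for illustration only:

```python
# Hypothetical cost comparison at a 50% batch discount.
# The default $0.002 per-trace price is an illustrative assumption.
def cost_savings(num_traces: int, price_per_trace: float = 0.002,
                 batch_discount: float = 0.5) -> float:
    """Absolute savings of batch over real-time inference."""
    realtime = num_traces * price_per_trace
    batch = realtime * (1 - batch_discount)
    return realtime - batch
```

At these assumed prices, 1,000 traces save about $1, while 100,000 traces save about $100: the absolute benefit grows linearly with volume, which is why batch is best suited to large-scale comparisons.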

Choosing Batch vs Real-Time

| Factor | Batch Inference | Real-Time Inference |
|---|---|---|
| Cost | Up to 50% cheaper | Standard pricing |
| Speed | Minutes to hours | Seconds |
| Latency metrics | Not available | Full P50/P95/P99 |
| Min requests | Varies by provider | 1 |
| Best for | Large-scale cost comparisons | Quick evaluations, latency optimization |
⚠️ If you need latency optimization results, use real-time inference. Batch inference only supports cost optimization comparisons.