Going from "I have an API key" to a production-ready AI integration involves more than just calling an endpoint. This guide covers the patterns that prevent outages, cost overruns, and security incidents.
Quick reference for the most popular AI API providers.
| Provider | SDKs | Auth | Rate Limits |
|---|---|---|---|
| OpenAI | Python, Node.js, REST | Bearer token (API key) | TPM + RPM limits (tier-based) |
| Anthropic | Python, TypeScript, REST | x-api-key header | TPM + RPM limits (tier-based) |
| Google (Gemini) | Python, Node.js, REST | API key or OAuth | QPM limits (free tier + paid) |
| OpenRouter | OpenAI-compatible | Bearer token | Per-model limits |
| Groq | OpenAI-compatible | Bearer token | Model-specific TPM |
All providers support SSE streaming. OpenRouter and Groq expose OpenAI-compatible APIs, so existing OpenAI code works with only a base-URL change. (TPM = tokens per minute, RPM = requests per minute, QPM = queries per minute.)
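That base-URL swap can be sketched as a simple config lookup. This assumes the OpenAI Python SDK's `base_url`/`api_key` constructor arguments; the URLs below are the providers' documented endpoints, but verify them against current docs:

```python
# Base URLs for OpenAI-compatible providers; verify against each provider's docs.
OPENAI_COMPATIBLE_BASE_URLS = {
    "openai": "https://api.openai.com/v1",
    "openrouter": "https://openrouter.ai/api/v1",
    "groq": "https://api.groq.com/openai/v1",
}

def client_kwargs(provider: str, api_key: str) -> dict:
    """Constructor kwargs for an OpenAI-compatible client.
    Switching providers changes only base_url; the rest of the code is untouched."""
    return {"base_url": OPENAI_COMPATIBLE_BASE_URLS[provider], "api_key": api_key}
```

With the OpenAI SDK installed, `OpenAI(**client_kwargs("groq", key))` would target Groq with otherwise unchanged code.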
Five areas where most AI integrations break in production, and how to avoid each one.
How you handle API keys determines whether your app stays secure.
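A minimal sketch of the safe pattern: read the key from the environment (or a secrets manager) rather than hardcoding it, and redact it anywhere it might be logged. The env-var name and helper names here are illustrative:

```python
import os

def load_api_key(env_var: str = "OPENAI_API_KEY") -> str:
    """Read the key from the environment and fail fast if it's missing,
    instead of hardcoding it in source or committing it to version control."""
    key = os.environ.get(env_var)
    if not key:
        raise RuntimeError(f"{env_var} is not set; export it or use a secrets manager.")
    return key

def redact(key: str) -> str:
    """Keep only the last four characters when a key must appear in logs."""
    return "****" + key[-4:]
```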
AI APIs fail: timeouts, 429s, 5xx errors, overloaded models. Your code should degrade gracefully instead of crashing.
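One way to sketch graceful degradation is a wrapper that returns a fallback message instead of propagating the failure to the user. `call` stands in for any provider SDK invocation (an assumption for illustration):

```python
def safe_complete(call, fallback="Sorry, something went wrong. Please try again."):
    """Run an API call; on failure, return a graceful fallback instead of crashing.

    `call` is any zero-argument callable wrapping a provider SDK request.
    """
    try:
        return call()
    except Exception:
        # In real code: log the error, and distinguish retryable failures
        # (429, 5xx) from non-retryable ones (400, 401) before giving up.
        return fallback
```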
Streaming shows output as it is generated, dramatically reducing perceived latency.
Every provider has rate limits. Plan for them from day one.
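Planning for rate limits can be as simple as capping how many requests are in flight at once. A sketch using an asyncio semaphore (the limit value is illustrative; tune it to your provider tier):

```python
import asyncio

async def bounded_gather(coros, limit: int = 5):
    """Run coroutines with at most `limit` in flight, so a burst of
    requests doesn't blow past the provider's RPM limit."""
    sem = asyncio.Semaphore(limit)

    async def run(coro):
        async with sem:  # wait for a free slot before sending the request
            return await coro

    return await asyncio.gather(*(run(c) for c in coros))
```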
Build for the inevitable: API outages, model deprecations, and price changes.
A quick decision tree for picking the right SDK.
See detailed endpoint documentation, pricing, and model availability for each provider.
OpenAI and Anthropic have the best developer experience. Both offer well-documented SDKs, clear error messages, and generous free trial credits. If you want access to multiple providers through one API, OpenRouter is the easiest starting point.
Implement exponential backoff with jitter. When you receive a 429 response, wait the duration specified in the `Retry-After` header before retrying. Use a request queue with concurrency limits to prevent hitting rate limits in the first place. Most providers offer higher limits at higher payment tiers.
For user-facing chat interfaces, always use streaming. Responses feel dramatically faster because users see tokens as soon as they are generated rather than waiting for the full completion. For batch processing, background tasks, or structured output (JSON), non-streaming is simpler and works fine.
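The core of a streaming consumer is just: render each delta the moment it arrives. Here `chunks` mimics the events an SSE stream yields; the dict shape is an assumption for illustration, as each SDK wraps its events differently:

```python
def render_stream(chunks, write=print):
    """Write each text delta as it arrives instead of waiting for the
    full response, then return the assembled text.

    `chunks` stands in for an SSE event stream; real SDKs yield typed
    event objects rather than plain dicts.
    """
    parts = []
    for chunk in chunks:
        delta = chunk.get("delta", "")
        if delta:
            parts.append(delta)
            write(delta)  # the user sees tokens immediately
    return "".join(parts)
```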
Use an abstraction layer. Options include OpenRouter (drop-in API compatibility), LiteLLM (Python library), or build your own thin wrapper. The key is keeping provider-specific code behind an interface so swapping models only requires changing a config value.
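A thin wrapper of your own can be as small as a provider registry behind one `complete()` entry point. All names here are illustrative, not any library's API:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class ModelConfig:
    provider: str
    model: str

# Registry of provider call functions; each real entry would wrap an SDK client.
PROVIDERS: dict[str, Callable[[str, str], str]] = {}

def register(name: str):
    """Decorator registering a provider's completion function."""
    def deco(fn):
        PROVIDERS[name] = fn
        return fn
    return deco

def complete(cfg: ModelConfig, prompt: str) -> str:
    """Single entry point: swapping providers or models is a config change,
    not a code change."""
    return PROVIDERS[cfg.provider](cfg.model, prompt)
```

Application code only ever calls `complete(cfg, prompt)`; moving traffic to another provider means editing the `ModelConfig`, not the call sites.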