exo-cuda-inference
Exo CUDA Inference API provides distributed GPU-accelerated inference for AI models using NVIDIA CUDA. ## Features - OpenAI-compatible API (use OpenAI SDK directly) - Support for Llama 3, Mistral, Stable Diffusion XL, Whisper, and more - NVIDIA GPU acceleration for fast inference - Scalable infrastructure handling thousands of requests - JSON & streaming response support - Multiple model…
exo-cuda-inference endpoints
| Method | Endpoint | Description |
|---|---|---|
| POST |
chat /v1/chat/completions |
OpenAI-compatible chat completion endpoint. Send a conversation and receive AI-generated text response. |
| POST |
completions /v1/completions |
Text completion endpoint. Generate AI text completions for given prompts. |
| POST |
embeddings /v1/embeddings |
Generate vector embeddings for text. Useful for semantic search and similarity. |
| POST |
images /v1/images/generations |
Generate images from text prompts using Stable Diffusion XL. |
| POST |
transcriptions /v1/audio/transcriptions |
Convert audio to text. Upload audio file and get transcription. |
| GET |
models /v1/models |
List all available AI models on this inference API. |
exo-cuda-inference pricing
| Plan | Price | Rate limit | Quotas |
|---|---|---|---|
| BASIC | Free | — |
|
| PRO | $19 / month | — |
|
| ULTRA | $49 / month | — |
|
| MEGA | $149 / month | — |
|