AI & ML

Running LLMs Locally: When Ollama Makes More Sense Than OpenAI

Zyptr Admin
6 May 2024
8 min read

Not Everything Needs GPT-4

Let us be clear upfront: for most production applications, OpenAI's API is the right choice. GPT-4o is incredibly capable, the API is reliable, and the cost per token keeps dropping. But we've found a growing set of use cases where running a local LLM via Ollama is not just viable — it's actually preferable.

We started experimenting with Ollama about eight months ago, initially just for development (saving on API costs during prototyping). But we've since deployed local LLMs in three production scenarios, and we're planning more.

Use Case 1: Data-Sensitive Environments

A healthcare client flat-out refused to send patient data to any external API. Not OpenAI, not Azure OpenAI, not even with a BAA in place. Their compliance team said no, and that was that. We deployed Llama 3 8B on their on-premise servers via Ollama and built a clinical note summarization system. The quality isn't as good as GPT-4 — it's maybe 80% as capable for this specific task — but 80% of GPT-4 quality with zero data leaving the premises was exactly what the client needed.

Running Llama 3 8B requires about 8GB of VRAM. We used an NVIDIA A10G GPU on their existing infrastructure. Inference latency is about 30 tokens/second, which is acceptable for batch processing (summarizing notes overnight) but not for real-time chat. For real-time use cases, we'd need a beefier GPU or a smaller model.
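That batch-window math is worth sanity-checking. A rough estimator, using the ~30 tokens/second figure above; the 200-token average summary length and 8-hour window are illustrative assumptions, not numbers from the deployment:

```python
def notes_per_window(tokens_per_second: float, summary_tokens: int, window_hours: float) -> int:
    """Estimate how many notes fit in an overnight batch window,
    assuming sequential generation at a fixed decode speed."""
    seconds_per_note = summary_tokens / tokens_per_second
    return int(window_hours * 3600 / seconds_per_note)

# At 30 tok/s, a 200-token summary takes ~6.7 s, so an 8-hour
# window covers roughly 4,300 notes on a single GPU.
print(notes_per_window(30, 200, 8))  # → 4320
```

Real-time chat needs a different budget: at 30 tok/s a user waits several seconds for even a short reply, which is why a faster GPU or smaller model would be needed there.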

Use Case 2: High-Volume, Low-Complexity Tasks

For a client processing 500,000+ customer reviews per day for sentiment analysis and categorization, the API costs were prohibitive — about $2,800/month even with GPT-3.5-turbo. We switched to Mistral 7B running locally on a 2x A10G setup, and the monthly cost dropped to about $180 (just the GPU instance cost). Accuracy was within 2% of GPT-3.5-turbo on our eval set. At 500K requests/day, that math works out overwhelmingly in favor of local.
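The break-even point is easy to compute from those two numbers. A sketch, assuming a 30-day month and that the GPU cost is flat regardless of volume:

```python
def breakeven_requests_per_day(api_monthly: float, api_daily_volume: int, gpu_monthly: float) -> float:
    """Daily request volume at which a flat-rate GPU instance
    beats per-request API pricing."""
    cost_per_request = api_monthly / (api_daily_volume * 30)  # assume 30-day month
    return (gpu_monthly / 30) / cost_per_request

# With $2,800/mo at 500K req/day vs a $180/mo GPU instance,
# local wins above roughly 32K requests/day — well under the
# actual 500K/day volume.
print(round(breakeven_requests_per_day(2800, 500_000, 180)))  # → 32143
```

Anything above that threshold and the flat GPU cost amortizes in your favor; below it, the API's pay-per-use pricing wins.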

Use Case 3: Development and Testing

This is where Ollama has become indispensable for our team. Every engineer runs Ollama locally during development. Instead of burning through API credits while iterating on prompts, they test against Llama 3 or Mistral locally. It's not identical to GPT-4, but it catches most logic issues in prompt design. We estimate it saves us $800-1,200/month in development API costs across the team.

The setup is dead simple: `brew install ollama`, then `ollama pull llama3`. It runs in the background and exposes an OpenAI-compatible API on localhost. Most LLM libraries (LangChain, LlamaIndex, Vercel AI SDK) can point to it with a one-line config change.
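Because Ollama exposes an OpenAI-compatible endpoint, pointing existing code at it is mostly a base-URL swap. A minimal stdlib-only sketch (the endpoint path and `llama3` model name assume a default local install with the model already pulled):

```python
import json
import urllib.request

# Ollama's OpenAI-compatible chat endpoint on a default local install
OLLAMA_URL = "http://localhost:11434/v1/chat/completions"

def chat_payload(prompt: str, model: str = "llama3") -> dict:
    """Build an OpenAI-style chat request for a local Ollama server."""
    return {"model": model, "messages": [{"role": "user", "content": prompt}]}

def chat(prompt: str) -> str:
    """Send one chat turn to the local server and return the reply text."""
    req = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(chat_payload(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

# Usage (requires a running Ollama server):
# print(chat("Summarize: the patient reports mild headaches."))
```

With a library like LangChain or the official OpenAI client, the equivalent change is setting the base URL to `http://localhost:11434/v1` and the model name to a local one.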

When Local Absolutely Doesn't Work

Complex reasoning tasks. Local models can't match anything that requires GPT-4-level intelligence: nuanced analysis, complex code generation, multi-step planning. We tried Llama 3 70B for a contract analysis task and its error rate was 3x GPT-4's. For tasks where quality directly impacts business outcomes, the API cost is worth it.

Also, if you need the latest capabilities — vision, function calling with complex schemas, very long context windows — the open-source models are typically 6-12 months behind the frontier. GPT-4o's 128K context window and multimodal capabilities don't have good open-source equivalents yet.

Our Decision Framework

We ask three questions: Does the data need to stay on-premise? Is the volume high enough that API costs exceed GPU costs? Is the task simple enough that a 7-13B parameter model can handle it reliably? If you answer yes to any of these, it's worth testing a local deployment. If all three are no, just use the API. Don't make your life harder than it needs to be.
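The three questions reduce to a simple OR. A toy encoding of the framework, just to make the logic explicit:

```python
def deployment_choice(data_must_stay_onprem: bool,
                      volume_exceeds_gpu_cost: bool,
                      small_model_handles_task: bool) -> str:
    """Mirror the three-question framework: a 'yes' to any question
    means a local deployment is worth testing; otherwise use the API."""
    if data_must_stay_onprem or volume_exceeds_gpu_cost or small_model_handles_task:
        return "test-local"
    return "api"

print(deployment_choice(False, True, True))    # → test-local
print(deployment_choice(False, False, False))  # → api
```

Note the asymmetry: a single "yes" only justifies *testing* local, not committing to it; three "no"s is a clear signal to stay on the API.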

Tags: ollama, llama, local-llm, open-source-ai