All GitHub repositories, licenses, and parameter counts verified — May 2026
The open-source LLM landscape has exploded. In 2024, there were a handful of usable open models. Today, developers can choose from dozens of production-grade LLMs — many rivaling closed-source alternatives — and run them locally on a MacBook with zero API costs.
But here's the catch most guides skip: almost none of these models are actually open source. They're open weight. The distinction matters — for licensing, for trust, and for your business. This guide covers every major model a developer should know about, the tools to run them locally, the agents they power, and the license traps to avoid.
What's Inside
- Open source vs. open weight — why the distinction matters for your stack
- 15 model families compared: Llama 4, DeepSeek R1, Qwen 3, Mistral, Phi-4, Gemma, Granite, Falcon, and more
- Master comparison table — license, parameters, context length, Apple Silicon viability
- Best models for coding agents (Aider, Continue.dev, OpenCode, OpenClaw)
- 6 ways to run models locally: Ollama, llama.cpp, LM Studio, Foundry Local, vLLM, Apple MLX
- OpenClaw — the personal AI assistant turning local LLMs into a cross-platform agent
- Apple Silicon deployment matrix — what runs on your Mac hardware
- License warning box — commercial use pitfalls and thresholds
Open Source vs. Open Weight — The Distinction That Matters
The industry uses "open source" loosely. Here's what the terms actually mean:
Open source means the model weights, training code, training data, evaluation pipelines, and intermediate checkpoints are all publicly available under a permissive license. You can reproduce the model from scratch. As of May 2026, only two model families meet this bar:
- OLMo (Allen Institute for AI) — Apache 2.0, full training code, data, configs, and every checkpoint
- StarCoder 2 (BigCode / Hugging Face + ServiceNow) — Open RAIL-M, training code + The Stack v2 dataset public
Open weight means the trained model weights are downloadable, but training data, training code, or both are proprietary. This includes Llama 4, DeepSeek, Qwen 3, Mistral, Phi-4, Gemma, and most others. Some use genuinely permissive licenses (MIT, Apache 2.0). Others use custom licenses with restrictions you must read carefully.
⚖️ License Warning Box
- MIT / Apache 2.0 — Use commercially, modify, redistribute. Best for production. (Phi-4, Granite, Qwen 3, Mistral Small 3.1, DeepSeek R1)
- Llama 4 Community License — Free for companies under 700M monthly active users. Must include "Llama" in derivative names.
- Gemma Terms of Use — Commercial use allowed, but Google can revoke; prohibits using weights to train competing foundation models.
- Codestral MNPL — Non-production only. Cannot use in commercial products without a paid Mistral license.
- CC-BY-NC (Cohere) — Non-commercial only. No production use without a Cohere contract.
- DeepSeek Model Agreement — V3 weights governed by PRC law. R1 distilled models are MIT.
Best Model by Use Case — Quick Picks
Before the deep dive, here are the top picks by scenario:
Quick Recommendation Matrix
| Use Case | Top Pick | Why |
|---|---|---|
| Coding agent (Aider, OpenCode) | Qwen3-32B | Best function calling + tool use, Apache 2.0 |
| Complex reasoning / debugging | DeepSeek-R1-Distill-32B | o1-mini class reasoning, MIT license |
| Local Mac inference (16GB) | Phi-4 14B (Q4) | Best reasoning per parameter, MIT |
| Enterprise / regulated | IBM Granite 3.3 | Apache 2.0, GRC-vetted training data |
| Multilingual / global | Qwen3-235B-A22B | 100+ languages, Apache 2.0 |
| Research transparency | OLMo 2 | Fully open code, data, and checkpoints; Apache 2.0 |
| Code completion (FIM) | Qwen3-Coder / StarCoder 2 | 600+ languages, fill-in-middle support |
| Ultra-long context (1M+) | Llama 4 Scout | 10M token context window |
| Inference speed / throughput | Falcon H1 | Hybrid SSM architecture, 4–8× faster |
| Personal AI assistant | Any model + OpenClaw | WhatsApp, Telegram, Slack, Discord, Signal |
The Models: A Developer's Guide
Every GitHub URL, license, and parameter count below has been verified against official repositories and model cards. Links go to canonical sources — not third-party mirrors.
Meta Llama 4
GitHub: meta-llama/llama-models · License: Llama 4 Community License (custom) · Type: Open Weight
Meta's Llama 4 introduced mixture-of-experts (MoE) architecture to the Llama family. Scout activates 17B parameters across 16 experts with a staggering 10M token context window. Maverick scales to 128 experts with 1M context. Both use the same 17B active parameter footprint per token.
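To make "active parameters" concrete, here is a toy top-k routing sketch in Python (PyTorch). This is an illustration of the MoE idea only, not Meta's implementation; all names and sizes are invented:

```python
# Toy sketch of MoE routing (illustrative only, not Meta's implementation).
# A gate scores all experts per token and routes to the top-k, so only those
# experts' parameters are "active" even though every expert stays in memory.
import torch
import torch.nn as nn

class ToyMoE(nn.Module):
    def __init__(self, dim=64, n_experts=16, k=2):
        super().__init__()
        self.gate = nn.Linear(dim, n_experts)
        self.experts = nn.ModuleList(nn.Linear(dim, dim) for _ in range(n_experts))
        self.k = k

    def forward(self, x):                      # x: (tokens, dim)
        scores = self.gate(x).softmax(dim=-1)  # (tokens, n_experts)
        weights, idx = scores.topk(self.k, dim=-1)
        out = torch.zeros_like(x)
        for t in range(x.size(0)):             # weighted sum of each token's k experts
            for w, i in zip(weights[t], idx[t]):
                out[t] += w * self.experts[int(i)](x[t])
        return out

print(ToyMoE()(torch.randn(4, 64)).shape)      # torch.Size([4, 64])
```

Note that all experts must stay resident in memory even though only k run per token. That's why Llama 4 Scout still needs multi-GPU setups at full precision despite its 17B active footprint.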
| Model | Architecture | Context | Released |
|---|---|---|---|
| Llama 4 Scout | 17B active / 16 experts (MoE) | 10M tokens | Apr 2025 |
| Llama 4 Maverick | 17B active / 128 experts (MoE) | 1M tokens | Apr 2025 |
| Llama 3.3 | 70B dense | 128K | Dec 2024 |
| Llama 3.2 | 1B, 3B dense | 128K | Sep 2024 |
| Llama 3.1 | 8B, 70B, 405B dense | 128K | Jul 2024 |
Strengths: Largest community ecosystem. Thousands of fine-tunes on Hugging Face. Broadest tool support (Ollama, llama.cpp, vLLM, LM Studio). Llama 3.2 1B/3B run on any Apple Silicon Mac. Llama 4 Scout's 10M context is state-of-the-art.
Watch out: The Llama 4 Community License is not OSI-approved. Companies with over 700M monthly active users need a separate commercial license from Meta. Derivatives must include "Llama" in the name. Llama 4 models need multi-GPU setups at full precision.
Apple Silicon: Llama 3.2 1B/3B on any M-series Mac (8GB). Llama 3.1 8B quantized on M2/M3 Pro (16GB). 70B+ requires M2/M3 Ultra or Mac Studio.
DeepSeek V3 & R1
GitHub: deepseek-ai/DeepSeek-V3 · deepseek-ai/DeepSeek-R1 · License: V3 — custom DeepSeek License; R1 — MIT · Type: Open Weight
DeepSeek V3 is a 671B parameter MoE model that activates only 37B per token, trained for just 2.664M H800 GPU hours — remarkable efficiency. R1 is their reasoning model, built on the same architecture, trained via pure reinforcement learning without supervised fine-tuning. R1's distilled variants bring frontier reasoning to smaller models.
| Model | Total / Active Params | Context | License |
|---|---|---|---|
| DeepSeek-V3 | 671B / 37B (MoE) | 128K | DeepSeek License |
| DeepSeek-R1 | 671B / 37B (MoE) | 128K | MIT |
| R1-Distill-Qwen-32B | 32B dense | 32K | MIT |
| R1-Distill-Llama-70B | 70B dense | 128K | MIT |
| R1-Distill-Qwen-7B | 7B dense | 32K | MIT |
Strengths: R1-Distill-Qwen-32B outperforms OpenAI o1-mini on reasoning benchmarks. V3 competes with GPT-4o on coding and math. MIT license on R1 is exceptionally permissive for a frontier model. Pioneer of MLA (Multi-head Latent Attention) and FP8 training.
Watch out: V3's DeepSeek Model Agreement is governed by PRC law — a concern for regulated industries. The full 671B model requires an H100 cluster. R1 can produce verbose, repetitive chain-of-thought outputs.
Apple Silicon: R1-Distill-Qwen-7B on M2 Pro (16GB). R1-Distill 14B quantized on M2 Max (32GB). R1-Distill-32B quantized needs M2/M3 Ultra (64GB+). Full V3/R1 — not viable locally.
Qwen 3 (Alibaba)
GitHub: QwenLM/Qwen3 · License: Apache 2.0 · Type: Open Weight
Qwen 3 is the most versatile open-weight model family available. It spans from a tiny 0.6B model that runs on a phone to a 235B MoE flagship. The killer feature: a thinking mode toggle. Set enable_thinking=True for chain-of-thought reasoning, or False for fast chat — in one unified model. Qwen3 was explicitly designed for AI agents with native function calling and tool use.
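Here is what the toggle looks like in practice with Hugging Face Transformers, following the pattern on the Qwen3 model cards. The model name and prompt are illustrative:

```python
# Sketch: toggling Qwen 3's thinking mode via Hugging Face Transformers.
# Assumes a transformers version with Qwen3 support.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen3-8B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto", device_map="auto")

messages = [{"role": "user", "content": "How many prime numbers are below 30?"}]

# enable_thinking=True emits a chain-of-thought block before the answer;
# set it to False for fast, direct chat responses from the same weights.
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True,
)
inputs = tokenizer([text], return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=1024)
print(tokenizer.decode(outputs[0][inputs.input_ids.shape[-1]:], skip_special_tokens=True))
```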
| Model | Architecture | Context | Best For |
|---|---|---|---|
| Qwen3-235B-A22B | 235B / 22B active (MoE, 128 experts) | 32K–131K | Flagship, agents, multilingual |
| Qwen3-32B | 32B dense | 32K–131K | Coding agents, reasoning |
| Qwen3-14B | 14B dense | 32K–131K | Local inference, fine-tuning |
| Qwen3-8B | 8B dense | 32K–131K | Lightweight agents |
| Qwen3-30B-A3B | 30B / 3B active (MoE) | 32K–131K | Efficient edge inference |
| Qwen3-Coder | Various | 32K–131K | Code completion, FIM |
Strengths: Apache 2.0 for most sizes. Best-in-class agent and tool-calling capabilities among open models. 100+ languages. Thinking/non-thinking mode in a single model. The July 2025 Qwen3-2507 update significantly improved instruction following and coding.
Watch out: The native 32K context (extendable to 131K with YaRN) is far smaller than Llama 4 Scout's 10M. Very large models (the 235B MoE) need substantial infrastructure. Some enterprises have data sovereignty concerns with Alibaba-origin models.
Apple Silicon: 0.6B–8B on any M-series. 14B quantized on M2/M3 Pro (16GB). 32B requires M2/M3 Ultra.
Mistral (Small 3.1, Mixtral, Codestral)
GitHub: mistralai/mistral-inference · License: Varies by model · Type: Open Weight
Mistral AI from France offers models across a wide range of sizes and capabilities. The critical nuance: licenses vary dramatically by model. Mistral Small 3.1 and Mistral 7B are Apache 2.0 (genuinely permissive). Codestral uses a non-production license. Mistral Large is research-only.
| Model | Params | License | Commercial Use |
|---|---|---|---|
| Mistral Small 3.1 | 24B dense, multimodal, 128K ctx | Apache 2.0 | ✅ Yes |
| Mistral 7B v0.3 | 7B dense | Apache 2.0 | ✅ Yes |
| Mixtral 8x7B | 46.7B / 12.9B active (MoE) | Apache 2.0 | ✅ Yes |
| Mixtral 8x22B | 141B / 39B active (MoE) | Apache 2.0 | ✅ Yes |
| Codestral 22B | 22B dense | MNPL (non-production) | ❌ No |
| Mistral Large 2 | 123B dense | MRL (research only) | ❌ No |
Strengths: Mistral Small 3.1 (24B) fits on a single RTX 4090 or a 32GB MacBook when quantized. It adds vision understanding and 128K context. Native function calling. Excellent multilingual support. Apache 2.0 on the models that matter most.
Watch out: Codestral's MNPL license is a trap for commercial builders — it looks open but prohibits production use. Mistral Large 2 is research-only. No 70B+ Apache model in their lineup. Training data is not disclosed.
Apple Silicon: Mistral 7B on any M-series (8GB). Small 3.1 24B quantized on M2 Max / M3 Pro (32GB). Mixtral 8x7B quantized on M2 Max (32GB+).
Microsoft Phi-4
Hugging Face: microsoft/phi-4 · License: MIT · Type: Open Weight
Phi-4 proves that small models can punch above their weight. At just 14B parameters with a dense decoder-only transformer, it outperforms Llama 3.3 70B on several reasoning benchmarks. Microsoft achieved this through a training methodology focused on synthetic, textbook-quality data — 9.8T tokens of carefully curated content.
Strengths: MIT license — genuinely permissive, use anywhere. Exceptional reasoning for its size. Outstanding on STEM tasks (math, coding, science). Ideal for memory-constrained and latency-bound scenarios. One of the best "reasoning per parameter" ratios available.
Watch out: 16K context window — much smaller than the 128K standard. English-focused. Only one size available (14B). Not a general knowledge model — can hallucinate on factual questions. No dedicated GitHub model repository (Hugging Face only).
Apple Silicon: Phi-4 14B quantized runs well on M2/M3 Pro (16GB). Phi-3 mini (3.8B) runs on any M-series Mac.
Google Gemma 3
GitHub: google-deepmind/gemma · License: Gemma Terms of Use (custom) · Type: Open Weight
Gemma 3 from Google DeepMind brings multimodal capabilities (text + image) to a compact model family. The 27B model competes with much larger alternatives and supports 128K context with 140+ languages. A Gemma 4 generation with MoE architecture is emerging.
| Model | Params | Context | Features |
|---|---|---|---|
| Gemma 3 27B | 27B dense | 128K | Text + image, 140+ languages |
| Gemma 3 12B | 12B dense | 128K | Text + image |
| Gemma 3 4B | 4B dense | 128K | On-device, text + image |
| Gemma 3 1B | 1B dense | 32K | Edge / mobile |
Strengths: Google-quality pre-training. Strong safety alignment. Multimodal out of the box. Runs on CPU/GPU/TPU. 1B and 4B models excellent for edge deployments. Well-integrated with JAX, Keras, PyTorch, and Transformers.
Watch out: The Gemma Terms of Use is not OSI-approved. Google can revoke the license. Prohibits using weights to train competing foundation models. Smaller community than Llama/Qwen. JAX is the primary implementation — Hugging Face Transformers support is secondary.
Apple Silicon: Gemma 1B/4B on any M-series. 12B quantized on M2 Pro (16GB). 27B quantized on M2 Max (32GB+).
IBM Granite 3.3
GitHub: ibm-granite/granite-3.3-language-models · ibm-granite/granite-code-models · License: Apache 2.0 · Type: Open Weight
Granite is the most enterprise-vetted model family available. IBM applies full governance, risk, and compliance (GRC) screening to all training data — including legal review, ClamAV scanning, PII redaction, and license verification for every code file. For organizations in regulated industries, this audit trail matters.
| Model | Params | License | Best For |
|---|---|---|---|
| Granite 3.3 Language | 2B, 8B (dense) | Apache 2.0 | Enterprise RAG, function calling |
| Granite 3.0 MoE | 1B (400M active), 3B (800M active) | Apache 2.0 | Edge, constrained compute |
| Granite Code | 3B, 8B, 20B, 34B | Apache 2.0 | Code gen, 116 languages |
Strengths: Apache 2.0. Most transparent training data governance in the industry. IBM enterprise support. FIM (fill-in-middle) for code. Separated thinking/answer in reasoning tasks. StarCoder tokenizer compatibility. Granite Code covers 116 programming languages.
Watch out: Smaller parameter counts (max 34B for code, 8B for language) compared to frontier models. Less benchmark-competitive than Qwen 3 or DeepSeek at similar sizes. Not a general-purpose frontier model — positioned for enterprise use cases.
Apple Silicon: All Granite models (2B–8B language, 3B–34B code) run on Apple Silicon. 8B on M2 Pro (16GB).
Falcon H1 (TII)
GitHub: tiiuae/Falcon-H1 · License: Apache 2.0 (smaller models), Falcon-LLM License (34B) · Type: Open Weight
Falcon H1 introduces a novel hybrid SSM+Attention architecture — combining Mamba2 state-space models with transformer attention in parallel. The result: 4× input throughput and 8× output throughput compared to pure transformer models of similar size. This architectural innovation is the most significant departure from standard transformer design in this guide.
| Model | Params | Context | Architecture |
|---|---|---|---|
| Falcon-H1-34B | 34B dense | 256K | Hybrid SSM+Attention |
| Falcon-H1-7B | 7B dense | 256K | Hybrid SSM+Attention |
| Falcon-H1-3B | 3B dense | 256K | Hybrid SSM+Attention |
| Falcon-H1-1.5B | 1.5B dense | 256K | Hybrid SSM+Attention |
| Falcon-H1-0.5B | 0.5B dense | 256K | Hybrid SSM+Attention |
Strengths: 256K context window across all sizes. Falcon-H1-34B competes with Qwen2.5-72B and Llama 3.3-70B at half the parameters. The 0.5B model delivers performance on par with typical 2024-era 7B models. Natively integrated into Apple MLX and explicitly demonstrated on a MacBook M4 Max. Also supports llama.cpp, vLLM, and SGLang.
Watch out: The 34B model uses the custom Falcon-LLM License (verify commercial terms). Less community adoption than Llama or Qwen. Newer architecture means fewer ecosystem integrations and fine-tuning recipes. From TII (Technology Innovation Institute, Abu Dhabi).
Apple Silicon: Falcon-H1-1.5B confirmed running on MacBook M4 Max — natively integrated into Apple MLX. 0.5B–7B on any M-series. 34B quantized on M2/M3 Ultra.
StarCoder 2 (BigCode)
GitHub: bigcode-project/starcoder2 · License: BigCode Open RAIL-M v1 · Type: Mostly Open
StarCoder 2 is the most transparent code model available. Developed by the BigCode project (a Hugging Face + ServiceNow collaboration), it trains on The Stack v2 — a publicly available dataset spanning 600+ programming languages with an opt-out mechanism for code authors.
Sizes: 3B, 7B, 15B, trained on 3T to 4T+ tokens. 16K context with 4K sliding-window attention.
Strengths: Most transparent training data of any code model. 600+ language coverage. Well-evaluated on the BigCode leaderboard. Fill-in-middle (FIM) support. Fine-tuning-friendly. Permissive RAIL license allows commercial use.
Watch out: Not instruction-tuned — it's a completion model, not a chat model. 16K context is limited for large codebases. Outperformed by newer code models (Qwen3-Coder, DeepSeek) on benchmarks. Best used as a base for fine-tuning or code completion, not conversational coding agents.
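For the code-completion use case, a FIM prompt wraps the cursor position in special tokens. A minimal Transformers sketch, assuming the StarCoder-style FIM token vocabulary (verify against the model card before relying on it):

```python
# Sketch: fill-in-the-middle with StarCoder 2. The <fim_*> tokens follow the
# StarCoder convention; confirm them against the tokenizer's special tokens.
from transformers import pipeline

generator = pipeline("text-generation", model="bigcode/starcoder2-3b")
prompt = (
    "<fim_prefix>def average(nums):\n"
    "    <fim_suffix>\n"
    "    return total / len(nums)<fim_middle>"
)
print(generator(prompt, max_new_tokens=16)[0]["generated_text"])
```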
OLMo 2/3 (AI2) — The Truly Open Source LLM
GitHub: allenai/OLMo-core · License: Apache 2.0 · Type: Truly Open Source
OLMo from the Allen Institute for AI is the gold standard for open-source LLMs. Not just the weights — the full training code, every intermediate checkpoint (every 1,000 steps), training configs, data provenance CSVs, evaluation pipelines, and WandB training runs are all public. If reproducibility and transparency matter to your organization, OLMo is the benchmark.
Sizes: OLMo-2: 1B, 7B, 13B, 32B. OLMo-3: 7B, 32B.
Strengths: The most genuinely open-source LLM. Apache 2.0 on everything. Full 2-stage training recipe (OLMo-mix-1124 → Dolmino-mix-1124). Explicitly supports training on Apple Silicon. Ideal for research, reproducibility, and understanding LLM training dynamics.
Watch out: Not a frontier model — competitive at size but trails Qwen 3 and DeepSeek on leaderboards. Primary use case is research, not production. 32B is the largest available. Smaller community than Llama.
Apple Silicon: OLMo-2 1B/7B viable via Hugging Face Transformers. Apple Silicon training is explicitly documented in the README.
Other Notable Models
Yi (01.AI) — Apache 2.0, 6B/9B/34B, strong bilingual Chinese/English, 200K context variants. Development has slowed versus Qwen and DeepSeek — review benchmarks before choosing Yi for new projects.
Code Llama (facebookresearch/codellama) — Superseded. Based on Llama 2 architecture. Newer Llama 3.x and Qwen3-Coder models outperform it on most benchmarks. Migrate to Llama 3.3/4 or Qwen3-Coder for new projects.
DBRX (Databricks) — Superseded. 132B MoE, released March 2024. No GitHub model repo. Surpassed by Qwen 3, DeepSeek V3, and Llama 4 for most use cases.
Cohere Command R+ — 104B, purpose-built for RAG with grounding citations. CC-BY-NC license prohibits commercial use without a paid Cohere contract. Strong at retrieval and multi-step tool calling, but the license makes it impractical for most developers.
Building an AI-powered product and not sure which model fits? We help startups and growing businesses across Western Canada evaluate LLM options, licensing, and deployment strategies.
Book a Free Strategy Call
Master Comparison Table
All models side by side. Scroll horizontally on mobile.
| Model | Org | Params | License | Context | Commercial | Apple Silicon |
|---|---|---|---|---|---|---|
| Llama 4 Scout | Meta | 17B active (MoE) | Llama 4 Community | 10M | ✅ <700M MAU | ❌ Multi-GPU |
| Llama 3.3 | Meta | 70B | Llama 3.3 Community | 128K | ✅ <700M MAU | Ultra only |
| Llama 3.2 | Meta | 1B, 3B | Llama 3.2 Community | 128K | ✅ <700M MAU | ✅ Any Mac |
| DeepSeek-V3 | DeepSeek | 671B/37B (MoE) | DeepSeek License | 128K | ✅ w/ restrictions | ❌ Cluster |
| DeepSeek-R1 | DeepSeek | 671B/37B (MoE) | MIT | 128K | ✅ Free | ❌ Cluster |
| R1-Distill-32B | DeepSeek | 32B | MIT | 32K | ✅ Free | Ultra (64GB) |
| R1-Distill-7B | DeepSeek | 7B | MIT | 32K | ✅ Free | ✅ M2 Pro |
| Qwen3-235B-A22B | Alibaba | 235B/22B (MoE) | Apache 2.0 | 32K–131K | ✅ Free | ❌ Cluster |
| Qwen3-32B | Alibaba | 32B | Apache 2.0 | 32K–131K | ✅ Free | Ultra (64GB) |
| Qwen3-8B | Alibaba | 8B | Apache 2.0 | 32K–131K | ✅ Free | ✅ M2 Pro |
| Mistral Small 3.1 | Mistral AI | 24B | Apache 2.0 | 128K | ✅ Free | M2 Max (32GB) |
| Codestral 22B | Mistral AI | 22B | MNPL | 32K | ❌ No | M2 Max (32GB) |
| Phi-4 | Microsoft | 14B | MIT | 16K | ✅ Free | ✅ M2 Pro |
| Gemma 3 27B | Google | 27B | Gemma ToU | 128K | ✅ w/ restrictions | M2 Max (32GB) |
| Gemma 3 4B | Google | 4B | Gemma ToU | 128K | ✅ w/ restrictions | ✅ Any Mac |
| Granite 3.3 8B | IBM | 8B | Apache 2.0 | 128K | ✅ Free | ✅ M2 Pro |
| Granite Code 34B | IBM | 34B | Apache 2.0 | 128K | ✅ Free | Ultra (64GB) |
| Falcon-H1-34B | TII | 34B | Falcon-LLM | 256K | ⚠️ Check terms | Ultra (64GB) |
| Falcon-H1-1.5B | TII | 1.5B | Apache 2.0 | 256K | ✅ Free | ✅ Any Mac (MLX) |
| StarCoder 2 15B | BigCode | 15B | Open RAIL-M | 16K | ✅ w/ use restrictions | ✅ M2 Pro |
| OLMo-2 32B | AI2 | 32B | Apache 2.0 | 128K | ✅ Free | Ultra (64GB) |
| OLMo-2 7B | AI2 | 7B | Apache 2.0 | 128K | ✅ Free | ✅ M2 Pro |
Which Models Work Best for Coding Agents
Not all models are equal for agentic coding. Agents need strong instruction following, reliable function calling, ability to generate and apply patches, and multi-step reasoning. Here's what works best with popular coding tools:
Aider — The most popular open-source coding agent. Works best with DeepSeek R1 and V3 (per their README), Qwen3-32B, and Mistral Small 3.1. Over 6.8M installs and 15B tokens per week processed.
Continue.dev — Apache 2.0 licensed IDE extension. Now includes CI-enforceable code checks via .continue/checks/ markdown files. Works with any OpenAI-compatible API — pair with Ollama for fully local inference using any model above.
OpenCode → Crush — Note: the original opencode-ai/opencode repository has been archived. The project continues as charmbracelet/crush by the original author and the Charm team. If you're using Kodra macOS or referencing OpenCode, update to Crush.
GitHub Copilot CLI — Works with GitHub's models out of the box, but can also be paired with local models via compatible backends for privacy-sensitive work.
Coding Agent Model Rankings
| Rank | Model | Why |
|---|---|---|
| 1 | Qwen3-32B | Best function calling, tool use, explicit agent design. Apache 2.0. |
| 2 | DeepSeek-R1-Distill-32B | Best reasoning for complex debugging. MIT license. |
| 3 | Mistral Small 3.1 (24B) | Best-in-class function calling. Apache 2.0. Fits on MacBook. |
| 4 | Llama 3.3 70B | Strong general coding. Massive ecosystem. Best community support. |
| 5 | Qwen3-Coder | Dedicated code model. Excellent FIM and multi-language support. |
How to Run Models Locally
Six tools dominate local LLM inference. Each has a distinct sweet spot.
Ollama — The Easiest Starting Point
GitHub: ollama/ollama · License: MIT · Backend: llama.cpp
Ollama is the Docker of LLMs. One command to install, one command to run any model. It handles model downloads, quantization, and serves an OpenAI-compatible REST API on localhost:11434.
```bash
# Install
curl -fsSL https://ollama.com/install.sh | sh

# Run models
ollama run qwen3
ollama run deepseek-r1:7b
ollama run gemma3
ollama run phi4

# Launch coding integrations
ollama launch claude
ollama launch openclaw

# REST API
curl http://localhost:11434/api/chat -d '{
  "model": "qwen3",
  "messages": [{"role": "user", "content": "Explain MoE architecture"}]
}'
```
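Because the API is OpenAI-compatible, existing SDK code can target Ollama directly. A minimal Python sketch, assuming `pip install openai` and that the model has already been pulled:

```python
# Sketch: the same chat call via the OpenAI Python SDK, pointed at Ollama's
# OpenAI-compatible /v1 endpoint. The api_key is unused but must be non-empty.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
resp = client.chat.completions.create(
    model="qwen3",
    messages=[{"role": "user", "content": "Explain MoE architecture"}],
)
print(resp.choices[0].message.content)
```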
Ollama supports macOS, Linux, Windows, and Docker. It integrates directly with Claude Code, OpenCode/Crush, Codex, GitHub Copilot CLI, OpenClaw, and dozens of other tools. For most developers, Ollama is the right starting point.
llama.cpp — Maximum Control
GitHub: ggml-org/llama.cpp · License: MIT
The foundational inference engine that Ollama runs on top of. If you need maximum control — custom quantization, fine-tuned GGUF models, batch inference, embedding generation, or a built-in web UI — llama.cpp is the direct path. It supports Metal acceleration on Apple Silicon, CUDA on NVIDIA, and Vulkan on AMD.
Recent additions include multimodal support in llama-server, native GPT-OSS model support with MXFP4 format, and VS Code/Vim extensions for FIM code completions.
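From Python, the community llama-cpp-python bindings wrap the same engine. A minimal sketch; the GGUF path is illustrative:

```python
# Sketch: local GGUF inference via the community llama-cpp-python bindings.
# n_gpu_layers=-1 offloads all layers to Metal (Apple Silicon) or CUDA.
from llama_cpp import Llama

llm = Llama(model_path="./models/phi-4-Q4_K_M.gguf", n_ctx=8192, n_gpu_layers=-1)
out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Write a haiku about quantization."}],
    max_tokens=128,
)
print(out["choices"][0]["message"]["content"])
```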
LM Studio — GUI for Everyone
Website: lmstudio.ai · License: Free for personal use
LM Studio provides a polished desktop app for discovering, downloading, and chatting with LLMs. It uses Apple's MLX framework on Mac for native Apple Silicon acceleration. Key features include an OpenAI-compatible local server, one-click model downloads from Hugging Face, and built-in quantization. Ideal for developers who want a GUI experience or need to demo local AI to non-technical stakeholders.
Microsoft Foundry Local — Embedded AI Runtime
GitHub: microsoft/foundry-local · License: MIT · Size: ~20MB
Foundry Local is Microsoft's answer to embedded local AI. At just ~20MB, it provides a complete runtime with native SDKs for C#, JavaScript, Python, and Rust. It uses ONNX Runtime under the hood with automatic hardware acceleration — detecting whether to use NPU, GPU, or CPU on each device.
```bash
# Python quickstart
pip install foundry-local-sdk

# JavaScript quickstart
npm install foundry-local-sdk
```
The curated catalog includes optimized variants of Qwen, DeepSeek, Mistral, Phi, and Whisper. Every model goes through quantization and compression testing to balance quality and performance. An OpenAI-compatible API means existing OpenAI SDK code works with minimal changes. No Azure subscription required — everything runs on-device.
Foundry Local is the best choice when you're building a product that ships with embedded AI. The lightweight runtime, auto hardware detection, and native SDKs across four languages make it ideal for desktop applications, offline tools, and privacy-sensitive use cases.
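A minimal Python sketch following the SDK's quickstart pattern. The model alias is illustrative, and the exact helper names should be checked against the microsoft/foundry-local docs:

```python
# Sketch: Foundry Local via its Python SDK plus the OpenAI client.
# Alias and helper names follow the published quickstart; verify before use.
import openai
from foundry_local import FoundryLocalManager

alias = "phi-3.5-mini"
manager = FoundryLocalManager(alias)  # starts the service, downloads/loads the model
client = openai.OpenAI(base_url=manager.endpoint, api_key=manager.api_key)
response = client.chat.completions.create(
    model=manager.get_model_info(alias).id,
    messages=[{"role": "user", "content": "Why is the sky blue?"}],
)
print(response.choices[0].message.content)
```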
vLLM & SGLang — Production Serving
vLLM (GitHub) and SGLang (GitHub) are high-throughput inference servers for production deployments. Use these when serving models to multiple users or applications. vLLM pioneered PagedAttention for efficient KV-cache memory management; SGLang adds RadixAttention prefix caching on a similar paged design. Both support continuous batching and tensor parallelism.
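A minimal offline batch-inference sketch with vLLM; the model name is illustrative:

```python
# Sketch: offline batch generation with vLLM (pip install vllm).
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen3-8B")
params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["Summarize PagedAttention in two sentences."], params)
print(outputs[0].outputs[0].text)
```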
Apple MLX — Native Apple Silicon
GitHub: ml-explore/mlx · License: MIT
Apple's own machine learning framework, designed specifically for Apple Silicon. Unified memory architecture means no CPU↔GPU data copying. LM Studio uses MLX under the hood on Mac. Falcon H1 has native MLX integration. If you're building Apple-first applications, MLX gives the best performance per watt on M-series chips.
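A minimal sketch using the companion mlx-lm package (`pip install mlx-lm`); the model repo is illustrative, and any MLX-converted model from Hugging Face works:

```python
# Sketch: text generation with mlx-lm on Apple Silicon.
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/Mistral-7B-Instruct-v0.3-4bit")
text = generate(
    model,
    tokenizer,
    prompt="Explain unified memory on M-series chips.",
    verbose=True,  # streams tokens and prints generation stats
)
```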
Which Runtime Should You Use?
| Scenario | Best Tool | Why |
|---|---|---|
| Getting started / experimenting | Ollama | One-command install, broadest model support |
| Custom quantization / GGUF models | llama.cpp | Maximum control, all quantization formats |
| GUI / non-technical demos | LM Studio | Polished desktop app, one-click downloads |
| Embedded in your product | Foundry Local | 20MB runtime, 4 language SDKs, auto hardware |
| Production serving (multi-user) | vLLM / SGLang | High throughput, continuous batching |
| Apple-native development | MLX | Unified memory, best perf/watt on M-series |
OpenClaw — Your Personal AI Assistant
GitHub: openclaw/openclaw · License: MIT · Stars: 372k+
OpenClaw turns local LLMs into a personal AI assistant that meets you on the channels you already use: WhatsApp, Telegram, Slack, Discord, Signal, iMessage, Google Chat, Microsoft Teams, Matrix, and 15+ more platforms.
It runs as a lightweight gateway on your own device. You point it at any LLM provider — Ollama for local inference, or cloud providers for heavier models. The result is a single-user AI assistant that's always on, runs locally, and responds across every messaging platform simultaneously.
```bash
# Install
npm install -g openclaw@latest

# Guided setup (auth, channels, skills)
openclaw onboard --install-daemon

# Launch directly from Ollama
ollama launch openclaw

# Send a message
openclaw agent --message "Ship checklist" --thinking high
```
The OpenClaw ecosystem includes ClawHub (skill directory), gogcli (Google Workspace in your terminal), mcporter (MCP-to-TypeScript bridge), and Peekaboo (macOS screenshot MCP server for AI agents). It's MIT-licensed, supports voice on macOS/iOS/Android, and renders a live Canvas you control.
For developers who want their AI assistant to work across every platform without giving data to a cloud provider, OpenClaw + Ollama is a powerful combination.
Linux Foundation AI & Data
The Linux Foundation AI & Data (LF AI) doesn't host LLM model weights — it hosts the infrastructure ecosystem around AI and ML. Key projects relevant to developers working with open models:
- ONNX — Open neural network exchange format. The backbone of Microsoft Foundry Local's inference engine.
- MLflow — Experiment tracking, model registry, deployment. Essential for managing LLM fine-tuning workflows.
- KServe — Kubernetes-native model serving. Deploy LLMs at scale on your own infrastructure.
- BeeAI — Open-source framework for building production-ready agents (LF AI graduated project, 2025).
- Data Prep Kit — Simplifies unstructured data preparation for LLM training and fine-tuning (IBM-contributed).
- AI Fairness 360 / AI Explainability 360 — IBM-contributed toolkits for trustworthy, auditable AI systems.
- Adversarial Robustness Toolbox (ART) — ML model security evaluation and defense.
The LF AI ecosystem is the glue. Models get the headlines, but ONNX, MLflow, and KServe are what make local and production LLM deployments actually work. If you're evaluating AI tooling for your organization, these projects deserve as much attention as the models themselves.
Apple Silicon Deployment Matrix
What actually runs on your Mac, and what's comfortable versus technically-possible-but-painful.
| Mac Hardware | RAM | Comfortable Models | Stretch (Slow but Works) |
|---|---|---|---|
| M1/M2 base | 8GB | Llama 3.2 1B/3B, Gemma 1B, Falcon-H1 0.5B | Phi-3 mini 3.8B (Q4) |
| M2/M3 Pro | 16GB | Phi-4 14B (Q4), Qwen3-8B, R1-Distill-7B, Granite 8B | Mistral 7B, StarCoder 2 15B (Q4) |
| M2/M3 Max | 32GB | Mistral Small 3.1 24B (Q4), Qwen3-14B, Gemma 3 27B (Q4) | Qwen3-32B (Q3), Falcon-H1-34B (Q4) |
| M2/M3/M4 Ultra | 64GB+ | Qwen3-32B, R1-Distill-32B, Granite Code 34B | Llama 3.3 70B (Q4), R1-Distill-70B (Q3) |
| Mac Studio Ultra | 128GB+ | Llama 3.3 70B, R1-Distill-70B, Mixtral 8x22B (Q4) | Qwen3-235B-A22B MoE (Q4, experimental) |
Key assumptions: "Comfortable" means reasonable generation speed (10+ tokens/sec) with enough headroom for context. "Stretch" means it loads and runs but expect slower generation (3–8 tokens/sec) with reduced context windows. All assume Q4_K_M quantization via Ollama or llama.cpp.
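These tiers follow from simple arithmetic: weights take roughly params × bits-per-weight ÷ 8 bytes, plus overhead for the KV cache and runtime buffers. A rough Python estimator, as a rule of thumb only (real usage varies with context length and runtime):

```python
# Rough memory estimate for a quantized model. Q4_K_M is ~4.8 bits/weight
# effective; the 25% overhead for KV cache and buffers is an assumption.
def est_gb(params_b: float, bits: float = 4.8, overhead: float = 1.25) -> float:
    return params_b * (bits / 8) * overhead

for name, p in [("Phi-4 14B", 14), ("Qwen3-32B", 32), ("Llama 3.3 70B", 70)]:
    print(f"{name}: ~{est_gb(p):.1f} GB at Q4_K_M")
# Phi-4 14B: ~10.5 GB  -> comfortable on 16GB
# Qwen3-32B: ~24.0 GB  -> needs 32GB+ (stretch) or 64GB (comfortable)
# Llama 3.3 70B: ~52.5 GB -> 64GB+ territory
```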
🔒 Security Considerations for Local AI
- Local inference ≠ automatically private. Agents calling cloud APIs still send data externally.
- Coding agents can execute shell commands. Sandbox them — never run with root/admin privileges.
- Secrets can leak into prompts and model logs. Scrub environment variables before piping context.
- Audit logs matter. Track what models generate, especially in regulated environments.
- Don't run production inference on consumer hardware without monitoring and rate limiting.
Frequently Asked Questions
What is the difference between open-source and open-weight LLMs?
Open-source LLMs release everything: model weights, training code, training data, and evaluation pipelines under a permissive license. Open-weight models release only the trained weights, often with custom licenses. Most models marketed as "open-source" — including Llama 4, DeepSeek, and Qwen 3 — are actually open-weight. Only OLMo (AI2) and StarCoder 2 (BigCode) qualify as truly open source.
Which open-weight LLM is best for coding agents?
For coding agents like Aider, Continue.dev, and OpenCode/Crush, the top recommendations are: Qwen3-32B for best overall agent and tool-calling capabilities, DeepSeek-R1-Distill-Qwen-32B for complex reasoning and debugging (MIT license), and Mistral Small 3.1 for function calling with Apache 2.0 licensing. All three run locally on Apple Silicon Macs with 32GB+ RAM when quantized.
Can I run LLMs locally on a MacBook?
Yes. Tools like Ollama, llama.cpp, LM Studio, and Microsoft Foundry Local make it straightforward. On a 16GB MacBook, you can run 7B–14B parameter models comfortably. With 32GB, models up to 24B–32B quantized work well. Check the Apple Silicon deployment matrix for specific recommendations by hardware.
Is Ollama free and open source?
Yes. Ollama is licensed under MIT and is completely free. It provides a simple CLI and REST API for downloading, running, and managing LLMs locally. It supports macOS, Linux, Windows, and Docker, and integrates with coding agents like Claude Code, OpenCode/Crush, GitHub Copilot CLI, and OpenClaw.
What is Microsoft Foundry Local?
Foundry Local is Microsoft's lightweight (~20MB) local AI runtime with native SDKs for C#, JavaScript, Python, and Rust. It auto-detects hardware (NPU, GPU, CPU) and uses ONNX Runtime for inference. It includes a curated catalog of optimized models and an OpenAI-compatible API. No Azure subscription is required; all inference runs on-device, so there are no network round-trips.
The Bottom Line
The open-weight LLM ecosystem is the most vibrant it's ever been. Developers have genuine choices — from MIT-licensed reasoning models (Phi-4, DeepSeek R1) to Apache 2.0 agent-first models (Qwen 3, Granite) to truly open-source research models (OLMo). The runtimes to deploy them locally — Ollama, Foundry Local, llama.cpp — are mature, free, and getting better every month.
But read the licenses carefully. "Open" doesn't always mean what you think. Codestral looks open until you try to use it in production. Gemma looks permissive until Google's terms restrict fine-tuning for competing models. Llama 4 is free unless your company has 700M+ users. The open-source vs. open-weight distinction isn't academic — it has real implications for your codebase, your product, and your business.
If you found this guide useful, share it with your team or join the Code To Cloud Discord community where developers across Alberta and Western Canada discuss LLMs, agents, and developer tooling every week.
Need Help Choosing the Right LLM for Your Product?
50+ advisory engagements across Alberta & Western Canada
We help startups and growing businesses evaluate open-weight models, navigate licensing, build local inference pipelines, and ship AI-powered products — without vendor lock-in.