Open-Source LLMs for Developers: The Complete Guide to Models, Agents, and Running AI Locally

Compare 15+ open-weight and open-source LLMs for coding agents and local inference. Covers Llama 4, DeepSeek R1, Qwen 3, Ollama, Foundry Local, and more — with license traps, Apple Silicon benchmarks, and everything a developer needs to choose the right model.

15+ Models Reviewed
$0 Local Inference Cost
6 Runtime Options
128K+ Context Windows

All GitHub repositories, licenses, and parameter counts verified — May 2026

The open-source LLM landscape has exploded. In 2024, there were a handful of usable open models. Today, developers can choose from dozens of production-grade LLMs — many rivaling closed-source alternatives — and run them locally on a MacBook with zero API costs.

But here's the catch most guides skip: almost none of these models are actually open source. They're open weight. The distinction matters — for licensing, for trust, and for your business. This guide covers every major model a developer should know about, the tools to run them locally, the agents they power, and the license traps to avoid.

What's Inside

  • Open source vs. open weight — why the distinction matters for your stack
  • 15 model families compared: Llama 4, DeepSeek R1, Qwen 3, Mistral, Phi-4, Gemma, Granite, Falcon, and more
  • Master comparison table — license, parameters, context length, Apple Silicon viability
  • Best models for coding agents (Aider, Continue.dev, OpenCode, OpenClaw)
  • 6 ways to run models locally: Ollama, llama.cpp, LM Studio, Foundry Local, vLLM, Apple MLX
  • OpenClaw — the personal AI assistant turning local LLMs into a cross-platform agent
  • Apple Silicon deployment matrix — what runs on your Mac hardware
  • License warning box — commercial use pitfalls and thresholds

Open Source vs. Open Weight — The Distinction That Matters

The industry uses "open source" loosely. Here's what the terms actually mean:

Open source means the model weights, training code, training data, evaluation pipelines, and intermediate checkpoints are all publicly available under a permissive license. You can reproduce the model from scratch. As of May 2026, only two model families meet this bar: OLMo (AI2) and StarCoder 2 (BigCode).

Open weight means the trained model weights are downloadable, but training data, training code, or both are proprietary. This includes Llama 4, DeepSeek, Qwen 3, Mistral, Phi-4, Gemma, and most others. Some use genuinely permissive licenses (MIT, Apache 2.0). Others use custom licenses with restrictions you must read carefully.

⚖️ License Warning Box

  • MIT / Apache 2.0 — Use commercially, modify, redistribute. Best for production. (Phi-4, Granite, Qwen 3, Mistral Small 3.1, DeepSeek R1)
  • Llama 4 Community License — Free for companies under 700M monthly active users. Must include "Llama" in derivative names.
  • Gemma Terms of Use — Commercial use allowed, but Google can revoke; prohibits using weights to train competing foundation models.
  • Codestral MNPL — Non-production only. Cannot use in commercial products without a paid Mistral license.
  • CC-BY-NC (Cohere) — Non-commercial only. No production use without a Cohere contract.
  • DeepSeek Model Agreement — V3 weights governed by PRC law. R1 distilled models are MIT.
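
In CI, the warning box above can become a pre-flight check that fails the build before a restricted model ships. A minimal sketch (the model keys and flags are transcribed from the box; the helper function itself is hypothetical):

```python
# Hypothetical license pre-flight check encoding the warning box above.
# Flag meanings: True = unconditionally permissive, "conditional" = read the
# terms (MAU caps, revocation clauses), False = non-production / non-commercial.
LICENSES = {
    "phi-4": ("MIT", True),
    "granite-3.3": ("Apache 2.0", True),
    "qwen3": ("Apache 2.0", True),
    "mistral-small-3.1": ("Apache 2.0", True),
    "deepseek-r1": ("MIT", True),
    "llama-4": ("Llama 4 Community License", "conditional"),  # <700M MAU
    "gemma-3": ("Gemma Terms of Use", "conditional"),         # revocable
    "codestral": ("MNPL", False),                             # non-production
    "command-r-plus": ("CC-BY-NC", False),                    # non-commercial
}

def commercial_ok(model: str) -> bool:
    """True only for unconditionally permissive licenses (MIT / Apache 2.0)."""
    _license_name, flag = LICENSES[model]
    return flag is True

print(sorted(m for m in LICENSES if commercial_ok(m)))
```

The point of the coarse `"conditional"` flag is to force a human to read the terms rather than let a build script silently approve a revocable license.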

Best Model by Use Case — Quick Picks

Before the deep dive, here are the top picks by scenario:

Quick Recommendation Matrix

| Use Case | Top Pick | Why |
|---|---|---|
| Coding agent (Aider, OpenCode) | Qwen3-32B | Best function calling + tool use, Apache 2.0 |
| Complex reasoning / debugging | DeepSeek-R1-Distill-32B | o1-mini class reasoning, MIT license |
| Local Mac inference (16GB) | Phi-4 14B (Q4) | Best reasoning per parameter, MIT |
| Enterprise / regulated | IBM Granite 3.3 | Apache 2.0, GRC-vetted training data |
| Multilingual / global | Qwen3-235B-A22B | 100+ languages, Apache 2.0 |
| Research transparency | OLMo 2 | Only truly open-source LLM, Apache 2.0 |
| Code completion (FIM) | Qwen3-Coder / StarCoder 2 | 600+ languages, fill-in-middle support |
| Ultra-long context (1M+) | Llama 4 Scout | 10M token context window |
| Inference speed / throughput | Falcon H1 | Hybrid SSM architecture, 4–8× faster |
| Personal AI assistant | Any model + OpenClaw | WhatsApp, Telegram, Slack, Discord, Signal |

The Models: A Developer's Guide

Every GitHub URL, license, and parameter count below has been verified against official repositories and model cards. Links go to canonical sources — not third-party mirrors.

Meta Llama 4

GitHub: meta-llama/llama-models · License: Llama 4 Community License (custom) · Type: Open Weight

Meta's Llama 4 introduced mixture-of-experts (MoE) architecture to the Llama family. Scout activates 17B parameters across 16 experts with a staggering 10M token context window. Maverick scales to 128 experts with 1M context. Both use the same 17B active parameter footprint per token.

| Model | Architecture | Context | Released |
|---|---|---|---|
| Llama 4 Scout | 17B active / 16 experts (MoE) | 10M tokens | Apr 2025 |
| Llama 4 Maverick | 17B active / 128 experts (MoE) | 1M tokens | Apr 2025 |
| Llama 3.3 | 70B dense | 128K | Dec 2024 |
| Llama 3.2 | 1B, 3B dense | 128K | Sep 2024 |
| Llama 3.1 | 8B, 70B, 405B dense | 128K | Jul 2024 |

Strengths: Largest community ecosystem. Thousands of fine-tunes on Hugging Face. Broadest tool support (Ollama, llama.cpp, vLLM, LM Studio). Llama 3.2 1B/3B run on any Apple Silicon Mac. Llama 4 Scout's 10M context is state-of-the-art.

Watch out: The Llama 4 Community License is not OSI-approved. Companies with over 700M monthly active users need a separate commercial license from Meta. Derivatives must include "Llama" in the name. Llama 4 models need multi-GPU setups at full precision.

Apple Silicon: Llama 3.2 1B/3B on any M-series Mac (8GB). Llama 3.1 8B quantized on M2/M3 Pro (16GB). 70B+ requires M2/M3 Ultra or Mac Studio.

DeepSeek V3 & R1

GitHub: deepseek-ai/DeepSeek-V3 · deepseek-ai/DeepSeek-R1 · License: V3 — custom DeepSeek License; R1 — MIT · Type: Open Weight

DeepSeek V3 is a 671B parameter MoE model that activates only 37B per token, trained for just 2.664M H800 GPU hours — remarkable efficiency. R1 is their reasoning model, built on the same architecture, trained via pure reinforcement learning without supervised fine-tuning. R1's distilled variants bring frontier reasoning to smaller models.
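
The efficiency claim is easy to sanity-check: in an MoE, only a fraction of the weights participate in each forward pass, but all of them must stay resident in memory. A back-of-the-envelope sketch (treating FP8 as roughly 1 byte per parameter is an approximation; real deployments add KV-cache and activation overhead):

```python
total_params = 671e9   # DeepSeek-V3/R1 total parameters
active_params = 37e9   # parameters activated per token

# Fraction of the network doing work on any given token
active_fraction = active_params / total_params
print(f"{active_fraction:.1%} of weights active per token")

# All 671B weights must still fit in accelerator memory, even though
# only 37B compute per token — hence the H100-cluster requirement.
fp8_weight_gb = total_params * 1 / 1e9   # ~1 byte/param at FP8
print(f"~{fp8_weight_gb:.0f} GB of weights at FP8")
```

That ~5.5% active fraction is why per-token compute resembles a 37B dense model while memory footprint resembles a 671B one.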

| Model | Total / Active Params | Context | License |
|---|---|---|---|
| DeepSeek-V3 | 671B / 37B (MoE) | 128K | DeepSeek License |
| DeepSeek-R1 | 671B / 37B (MoE) | 128K | MIT |
| R1-Distill-Qwen-32B | 32B dense | 32K | MIT |
| R1-Distill-Llama-70B | 70B dense | 128K | MIT |
| R1-Distill-Qwen-7B | 7B dense | 32K | MIT |

Strengths: R1-Distill-Qwen-32B outperforms OpenAI o1-mini on reasoning benchmarks. V3 competes with GPT-4o on coding and math. MIT license on R1 is exceptionally permissive for a frontier model. Pioneer of MLA (Multi-head Latent Attention) and FP8 training.

Watch out: V3's DeepSeek Model Agreement is governed by PRC law — a concern for regulated industries. The full 671B model requires an H100 cluster. R1 can produce verbose, repetitive chain-of-thought outputs.

Apple Silicon: R1-Distill-Qwen-7B on M2 Pro (16GB). R1-Distill 14B quantized on M2 Max (32GB). R1-Distill-32B quantized needs M2/M3 Ultra (64GB+). Full V3/R1 — not viable locally.

Qwen 3 (Alibaba)

GitHub: QwenLM/Qwen3 · License: Apache 2.0 · Type: Open Weight

Qwen 3 is the most versatile open-weight model family available. It spans from a tiny 0.6B model that runs on a phone to a 235B MoE flagship. The killer feature: a thinking mode toggle. Set enable_thinking=True for chain-of-thought reasoning, or False for fast chat — in one unified model. Qwen3 was explicitly designed for AI agents with native function calling and tool use.
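
In thinking mode, Qwen 3 (like DeepSeek R1) emits its chain of thought inside `<think>…</think>` tags ahead of the final answer, and an agent harness usually wants to strip that reasoning before parsing the reply. A minimal sketch (the tag format is the commonly documented one; verify it against the exact checkpoint you deploy):

```python
import re

THINK_BLOCK = re.compile(r"<think>(.*?)</think>\s*", re.DOTALL)

def split_thinking(reply: str) -> tuple[str, str]:
    """Separate a reasoning-mode reply into (thinking, answer).

    Replies without a <think> block come back unchanged as the answer.
    """
    match = THINK_BLOCK.search(reply)
    thinking = match.group(1).strip() if match else ""
    answer = THINK_BLOCK.sub("", reply).strip()
    return thinking, answer

raw = "<think>The user wants one word.</think>Paris."
thinking, answer = split_thinking(raw)
print(answer)  # Paris.
```

Keeping the two halves separate also lets you log the reasoning for audits without showing it to end users.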

| Model | Architecture | Context | Best For |
|---|---|---|---|
| Qwen3-235B-A22B | 235B / 22B active (MoE, 128 experts) | 32K–131K | Flagship, agents, multilingual |
| Qwen3-32B | 32B dense | 32K–131K | Coding agents, reasoning |
| Qwen3-14B | 14B dense | 32K–131K | Local inference, fine-tuning |
| Qwen3-8B | 8B dense | 32K–131K | Lightweight agents |
| Qwen3-30B-A3B | 30B / 3B active (MoE) | 32K–131K | Efficient edge inference |
| Qwen3-Coder | Various | 32K–131K | Code completion, FIM |

Strengths: Apache 2.0 for most sizes. Best-in-class agent and tool-calling capabilities among open models. 100+ languages. Thinking/non-thinking mode in a single model. The July 2025 Qwen3-2507 update significantly improved instruction following and coding.

Watch out: Native context is 32K (131K with YaRN extension), far smaller than Llama 4 Scout's 10M. Very large models (the 235B MoE) need substantial infrastructure. Some enterprises have data sovereignty concerns with Alibaba-origin models.

Apple Silicon: 0.6B–8B on any M-series. 14B quantized on M2/M3 Pro (16GB). 32B requires M2/M3 Ultra.

Mistral (Small 3.1, Mixtral, Codestral)

GitHub: mistralai/mistral-inference · License: Varies by model · Type: Open Weight

Mistral AI from France offers models across a wide range of sizes and capabilities. The critical nuance: licenses vary dramatically by model. Mistral Small 3.1 and Mistral 7B are Apache 2.0 (genuinely permissive). Codestral uses a non-production license. Mistral Large is research-only.

| Model | Params | License | Commercial Use |
|---|---|---|---|
| Mistral Small 3.1 | 24B dense, multimodal, 128K ctx | Apache 2.0 | ✅ Yes |
| Mistral 7B v0.3 | 7B dense | Apache 2.0 | ✅ Yes |
| Mixtral 8x7B | 46.7B / 12.9B active (MoE) | Apache 2.0 | ✅ Yes |
| Mixtral 8x22B | 141B / 39B active (MoE) | Apache 2.0 | ✅ Yes |
| Codestral 22B | 22B dense | MNPL (non-production) | ❌ No |
| Mistral Large 2 | 123B dense | MRL (research only) | ❌ No |

Strengths: Mistral Small 3.1 (24B) fits on a single RTX 4090 or a 32GB MacBook when quantized. It adds vision understanding and 128K context. Native function calling. Excellent multilingual support. Apache 2.0 on the models that matter most.

Watch out: Codestral's MNPL license is a trap for commercial builders — it looks open but prohibits production use. Mistral Large 2 is research-only. No 70B+ Apache model in their lineup. Training data is not disclosed.

Apple Silicon: Mistral 7B on any M-series (8GB). Small 3.1 24B quantized on M2 Max / M3 Pro (32GB). Mixtral 8x7B quantized on M2 Max (32GB+).

Microsoft Phi-4

Hugging Face: microsoft/phi-4 · License: MIT · Type: Open Weight

Phi-4 proves that small models can punch above their weight. At just 14B parameters with a dense decoder-only transformer, it outperforms Llama 3.3 70B on several reasoning benchmarks. Microsoft achieved this through a training methodology focused on synthetic, textbook-quality data — 9.8T tokens of carefully curated content.

Strengths: MIT license — genuinely permissive, use anywhere. Exceptional reasoning for its size. Outstanding on STEM tasks (math, coding, science). Ideal for memory-constrained and latency-bound scenarios. One of the best "reasoning per parameter" ratios available.

Watch out: 16K context window — much smaller than the 128K standard. English-focused. Only one size available (14B). Not a general knowledge model — can hallucinate on factual questions. No dedicated GitHub model repository (Hugging Face only).

Apple Silicon: Phi-4 14B quantized runs well on M2/M3 Pro (16GB). Phi-3 mini (3.8B) runs on any M-series Mac.

Google Gemma 3

GitHub: google-deepmind/gemma · License: Gemma Terms of Use (custom) · Type: Open Weight

Gemma 3 from Google DeepMind brings multimodal capabilities (text + image) to a compact model family. The 27B model competes with much larger alternatives and supports 128K context with 140+ languages. A Gemma 4 generation with MoE architecture is emerging.

| Model | Params | Context | Features |
|---|---|---|---|
| Gemma 3 27B | 27B dense | 128K | Text + image, 140+ languages |
| Gemma 3 12B | 12B dense | 128K | Text + image |
| Gemma 3 4B | 4B dense | 128K | On-device, text + image |
| Gemma 3 1B | 1B dense | 32K | Edge / mobile |

Strengths: Google-quality pre-training. Strong safety alignment. Multimodal out of the box. Runs on CPU/GPU/TPU. 1B and 4B models excellent for edge deployments. Well-integrated with JAX, Keras, PyTorch, and Transformers.

Watch out: The Gemma Terms of Use is not OSI-approved. Google can revoke the license. Prohibits using weights to train competing foundation models. Smaller community than Llama/Qwen. JAX is the primary implementation — Hugging Face Transformers support is secondary.

Apple Silicon: Gemma 1B/4B on any M-series. 12B quantized on M2 Pro (16GB). 27B quantized on M2 Max (32GB+).

IBM Granite 3.3

GitHub: ibm-granite/granite-3.3-language-models · ibm-granite/granite-code-models · License: Apache 2.0 · Type: Open Weight

Granite is the most enterprise-vetted model family available. IBM applies full governance, risk, and compliance (GRC) screening to all training data — including legal review, ClamAV scanning, PII redaction, and license verification for every code file. For organizations in regulated industries, this audit trail matters.

| Model | Params | License | Best For |
|---|---|---|---|
| Granite 3.3 Language | 2B, 8B (dense) | Apache 2.0 | Enterprise RAG, function calling |
| Granite 3.0 MoE | 1B (400M active), 3B (800M active) | Apache 2.0 | Edge, constrained compute |
| Granite Code | 3B, 8B, 20B, 34B | Apache 2.0 | Code gen, 116 languages |

Strengths: Apache 2.0. Most transparent training data governance in the industry. IBM enterprise support. FIM (fill-in-middle) for code. Separated thinking/answer in reasoning tasks. StarCoder tokenizer compatibility. Granite Code covers 116 programming languages.

Watch out: Smaller parameter counts (max 34B for code, 8B for language) compared to frontier models. Less benchmark-competitive than Qwen 3 or DeepSeek at similar sizes. Not a general-purpose frontier model — positioned for enterprise use cases.

Apple Silicon: All Granite models (2B–8B language, 3B–34B code) run on Apple Silicon. 8B on M2 Pro (16GB).

Falcon H1 (TII)

GitHub: tiiuae/Falcon-H1 · License: Apache 2.0 (smaller models), Falcon-LLM License (34B) · Type: Open Weight

Falcon H1 introduces a novel hybrid SSM+Attention architecture — combining Mamba2 state-space models with transformer attention in parallel. The result: 4× input throughput and 8× output throughput compared to pure transformer models of similar size. This architectural innovation is the most significant departure from standard transformer design in this guide.

| Model | Params | Context | Architecture |
|---|---|---|---|
| Falcon-H1-34B | 34B dense | 256K | Hybrid SSM+Attention |
| Falcon-H1-7B | 7B dense | 256K | Hybrid SSM+Attention |
| Falcon-H1-3B | 3B dense | 256K | Hybrid SSM+Attention |
| Falcon-H1-1.5B | 1.5B dense | 256K | Hybrid SSM+Attention |
| Falcon-H1-0.5B | 0.5B dense | 256K | Hybrid SSM+Attention |

Strengths: 256K context window across all sizes. Falcon-H1-34B competes with Qwen2.5-72B and Llama 3.3-70B at half the parameters. The 0.5B model delivers typical 2024-era 7B performance. Natively integrated into Apple MLX — explicitly demonstrated on MacBook M4 Max. Also supports llama.cpp, vLLM, SGLang.

Watch out: The 34B model uses the custom Falcon-LLM License (verify commercial terms). Less community adoption than Llama or Qwen. Newer architecture means fewer ecosystem integrations and fine-tuning recipes. From TII (Technology Innovation Institute, Abu Dhabi).

Apple Silicon: Falcon-H1-1.5B confirmed running on MacBook M4 Max — natively integrated into Apple MLX. 0.5B–7B on any M-series. 34B quantized on M2/M3 Ultra.

StarCoder 2 (BigCode)

GitHub: bigcode-project/starcoder2 · License: BigCode Open RAIL-M v1 · Type: Mostly Open

StarCoder 2 is the most transparent code model available. Developed by the BigCode project (a Hugging Face + ServiceNow collaboration), it trains on The Stack v2 — a publicly available dataset spanning 600+ programming languages with an opt-out mechanism for code authors.

Sizes: 3B, 7B, 15B. All trained on 3–4T+ tokens. 16K context with sliding window attention (4K).

Strengths: Most transparent training data of any code model. 600+ language coverage. Well-evaluated on the BigCode leaderboard. Fill-in-middle (FIM) support. Fine-tuning-friendly. Permissive RAIL license allows commercial use.
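
Fill-in-middle works by rearranging the document around sentinel tokens so the model completes the gap rather than appending to the end. A sketch using the StarCoder-style markers (`<fim_prefix>`, `<fim_suffix>`, `<fim_middle>`; confirm the exact token names in the tokenizer config of the checkpoint you use):

```python
def build_fim_prompt(prefix: str, suffix: str) -> str:
    """Arrange code around StarCoder-style FIM sentinels.

    The model generates the missing middle after <fim_middle>.
    Sentinel names are checkpoint-specific: check tokenizer_config.json.
    """
    return f"<fim_prefix>{prefix}<fim_suffix>{suffix}<fim_middle>"

prompt = build_fim_prompt(
    prefix="def add(a, b):\n    return ",
    suffix="\n\nprint(add(2, 3))\n",
)
print(prompt)
```

This prefix/suffix/middle ordering is what lets editor extensions complete code in the middle of a file instead of only at the cursor's end.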

Watch out: Not instruction-tuned — it's a completion model, not a chat model. 16K context is limited for large codebases. Outperformed by newer code models (Qwen3-Coder, DeepSeek) on benchmarks. Best used as a base for fine-tuning or code completion, not conversational coding agents.

OLMo 2/3 (AI2) — The Truly Open Source LLM

GitHub: allenai/OLMo-core · License: Apache 2.0 · Type: Truly Open Source

OLMo from the Allen Institute for AI is the gold standard for open-source LLMs. Not just the weights — the full training code, every intermediate checkpoint (every 1,000 steps), training configs, data provenance CSVs, evaluation pipelines, and WandB training runs are all public. If reproducibility and transparency matter to your organization, OLMo is the benchmark.

Sizes: OLMo-2: 1B, 7B, 13B, 32B. OLMo-3: 7B, 32B.

Strengths: The most genuinely open-source LLM. Apache 2.0 on everything. Full 2-stage training recipe (OLMo-mix-1124 → Dolmino-mix-1124). Explicitly supports Mac silicon training. Ideal for research, reproducibility, and understanding LLM training dynamics.

Watch out: Not a frontier model — competitive at size but trails Qwen 3 and DeepSeek on leaderboards. Primary use case is research, not production. 32B is the largest available. Smaller community than Llama.

Apple Silicon: OLMo-2 1B/7B viable via Hugging Face Transformers. Mac silicon training explicitly documented in the README.

Other Notable Models

Yi (01.AI) — Apache 2.0, 6B/9B/34B, strong bilingual Chinese/English, 200K context variants. Development has slowed versus Qwen and DeepSeek — review benchmarks before choosing Yi for new projects.

Code Llama (facebookresearch/codellama) — Superseded. Based on Llama 2 architecture. Newer Llama 3.x and Qwen3-Coder models outperform it on most benchmarks. Migrate to Llama 3.3/4 or Qwen3-Coder for new projects.

DBRX (Databricks) — Superseded. 132B MoE, released March 2024. No GitHub model repo. Surpassed by Qwen 3, DeepSeek V3, and Llama 4 for most use cases.

Cohere Command R+ — 104B, purpose-built for RAG with grounding citations. CC-BY-NC license prohibits commercial use without a paid Cohere contract. Strong at retrieval and multi-step tool calling, but the license makes it impractical for most developers.

Building an AI-powered product and not sure which model fits? We help startups and growing businesses across Western Canada evaluate LLM options, licensing, and deployment strategies.

Book a Free Strategy Call

Master Comparison Table

All models side by side.

| Model | Org | Params | License | Context | Commercial | Apple Silicon |
|---|---|---|---|---|---|---|
| Llama 4 Scout | Meta | 17B active (MoE) | Llama 4 Community | 10M | ✅ <700M MAU | ❌ Multi-GPU |
| Llama 3.3 | Meta | 70B | Llama 3.3 Community | 128K | ✅ <700M MAU | Ultra only |
| Llama 3.2 | Meta | 1B, 3B | Llama 3.2 Community | 128K | ✅ <700M MAU | ✅ Any Mac |
| DeepSeek-V3 | DeepSeek | 671B/37B (MoE) | DeepSeek License | 128K | ✅ w/ restrictions | ❌ Cluster |
| DeepSeek-R1 | DeepSeek | 671B/37B (MoE) | MIT | 128K | ✅ Free | ❌ Cluster |
| R1-Distill-32B | DeepSeek | 32B | MIT | 32K | ✅ Free | Ultra (64GB) |
| R1-Distill-7B | DeepSeek | 7B | MIT | 32K | ✅ Free | ✅ M2 Pro |
| Qwen3-235B-A22B | Alibaba | 235B/22B (MoE) | Apache 2.0 | 32K–131K | ✅ Free | ❌ Cluster |
| Qwen3-32B | Alibaba | 32B | Apache 2.0 | 32K–131K | ✅ Free | Ultra (64GB) |
| Qwen3-8B | Alibaba | 8B | Apache 2.0 | 32K–131K | ✅ Free | ✅ M2 Pro |
| Mistral Small 3.1 | Mistral AI | 24B | Apache 2.0 | 128K | ✅ Free | M2 Max (32GB) |
| Codestral 22B | Mistral AI | 22B | MNPL | 32K | ❌ No | M2 Max (32GB) |
| Phi-4 | Microsoft | 14B | MIT | 16K | ✅ Free | ✅ M2 Pro |
| Gemma 3 27B | Google | 27B | Gemma ToU | 128K | ✅ w/ restrictions | M2 Max (32GB) |
| Gemma 3 4B | Google | 4B | Gemma ToU | 128K | ✅ w/ restrictions | ✅ Any Mac |
| Granite 3.3 8B | IBM | 8B | Apache 2.0 | 128K | ✅ Free | ✅ M2 Pro |
| Granite Code 34B | IBM | 34B | Apache 2.0 | 128K | ✅ Free | Ultra (64GB) |
| Falcon-H1-34B | TII | 34B | Falcon-LLM | 256K | ⚠️ Check terms | Ultra (64GB) |
| Falcon-H1-1.5B | TII | 1.5B | Apache 2.0 | 256K | ✅ Free | ✅ Any Mac (MLX) |
| StarCoder 2 15B | BigCode | 15B | Open RAIL-M | 16K | ✅ w/ use restrictions | ✅ M2 Pro |
| OLMo-2 32B | AI2 | 32B | Apache 2.0 | 128K | ✅ Free | Ultra (64GB) |
| OLMo-2 7B | AI2 | 7B | Apache 2.0 | 128K | ✅ Free | ✅ M2 Pro |

Which Models Work Best for Coding Agents

Not all models are equal for agentic coding. Agents need strong instruction following, reliable function calling, ability to generate and apply patches, and multi-step reasoning. Here's what works best with popular coding tools:

Aider — The most popular open-source coding agent. Works best with DeepSeek R1 and V3 (per their README), Qwen3-32B, and Mistral Small 3.1. Over 6.8M installs and 15B tokens per week processed.

Continue.dev — Apache 2.0 licensed IDE extension. Now includes CI-enforceable code checks via .continue/checks/ markdown files. Works with any OpenAI-compatible API — pair with Ollama for fully local inference using any model above.

OpenCode → Crush — Note: the original opencode-ai/opencode repository has been archived. The project continues as charmbracelet/crush by the original author and the Charm team. If you're using Kodra macOS or referencing OpenCode, update to Crush.

GitHub Copilot CLI — Works with GitHub's models out of the box, but can also be paired with local models via compatible backends for privacy-sensitive work.

Coding Agent Model Rankings

| Rank | Model | Why |
|---|---|---|
| 1 | Qwen3-32B | Best function calling, tool use, explicit agent design. Apache 2.0. |
| 2 | DeepSeek-R1-Distill-32B | Best reasoning for complex debugging. MIT license. |
| 3 | Mistral Small 3.1 (24B) | Best-in-class function calling. Apache 2.0. Fits on MacBook. |
| 4 | Llama 3.3 70B | Strong general coding. Massive ecosystem. Best community support. |
| 5 | Qwen3-Coder | Dedicated code model. Excellent FIM and multi-language support. |

How to Run Models Locally

Six tools dominate local LLM inference. Each has a distinct sweet spot.

Ollama — The Easiest Starting Point

GitHub: ollama/ollama · License: MIT · Backend: llama.cpp

Ollama is the Docker of LLMs. One command to install, one command to run any model. It handles model downloads, quantization, and serves an OpenAI-compatible REST API on localhost:11434.

```shell
# Install
curl -fsSL https://ollama.com/install.sh | sh

# Run models
ollama run qwen3
ollama run deepseek-r1:7b
ollama run gemma3
ollama run phi4

# Launch coding integrations
ollama launch claude
ollama launch openclaw

# REST API
curl http://localhost:11434/api/chat -d '{
  "model": "qwen3",
  "messages": [{"role": "user", "content": "Explain MoE architecture"}]
}'
```

Ollama supports macOS, Linux, Windows, and Docker. It integrates directly with Claude Code, OpenCode/Crush, Codex, GitHub Copilot CLI, OpenClaw, and dozens of other tools. For most developers, Ollama is the right starting point.
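
Because the server speaks plain JSON over HTTP, calling it from code needs nothing beyond the standard library. A sketch (assumes an Ollama server on the default `localhost:11434` with the `qwen3` model already pulled; the request shape mirrors the curl example above):

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/chat"

def build_request(prompt: str, model: str = "qwen3") -> dict:
    """Payload matching the curl example; stream=False returns one JSON reply."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
    }

def chat(prompt: str, model: str = "qwen3") -> str:
    """POST to a running Ollama server and return the assistant's text."""
    req = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(build_request(prompt, model)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["message"]["content"]

# chat("Explain MoE architecture in one sentence.")  # requires Ollama running
```

Swapping `model` for any name from `ollama list` is the only change needed to target a different local model.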

llama.cpp — Maximum Control

GitHub: ggml-org/llama.cpp · License: MIT

The foundational inference engine that Ollama runs on top of. If you need maximum control — custom quantization, fine-tuned GGUF models, batch inference, embedding generation, or a built-in web UI — llama.cpp is the direct path. It supports Metal acceleration on Apple Silicon, CUDA on NVIDIA, and Vulkan on AMD.

Recent additions include multimodal support in llama-server, native GPT-OSS model support with MXFP4 format, and VS Code/Vim extensions for FIM code completions.

LM Studio — GUI for Everyone

Website: lmstudio.ai · License: Free for personal use

LM Studio provides a polished desktop app for discovering, downloading, and chatting with LLMs. It uses Apple's MLX framework on Mac for native Apple Silicon acceleration. Key features include an OpenAI-compatible local server, one-click model downloads from Hugging Face, and built-in quantization. Ideal for developers who want a GUI experience or need to demo local AI to non-technical stakeholders.

Microsoft Foundry Local — Embedded AI Runtime

GitHub: microsoft/foundry-local · License: MIT · Size: ~20MB

Foundry Local is Microsoft's answer to embedded local AI. At just ~20MB, it provides a complete runtime with native SDKs for C#, JavaScript, Python, and Rust. It uses ONNX Runtime under the hood with automatic hardware acceleration — detecting whether to use NPU, GPU, or CPU on each device.

```shell
# Python quickstart
pip install foundry-local-sdk

# JavaScript quickstart
npm install foundry-local-sdk
```

The curated catalog includes optimized variants of Qwen, DeepSeek, Mistral, Phi, and Whisper. Every model goes through quantization and compression testing to balance quality and performance. An OpenAI-compatible API means existing OpenAI SDK code works with minimal changes. No Azure subscription required — everything runs on-device.

Foundry Local is the best choice when you're building a product that ships with embedded AI. The lightweight runtime, auto hardware detection, and native SDKs across four languages make it ideal for desktop applications, offline tools, and privacy-sensitive use cases.
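
The "minimal changes" claim for the OpenAI-compatible API amounts to swapping the endpoint and key your client is constructed with while the rest of the code stays put. A sketch of that design (the local endpoint and key values are placeholders, not Foundry Local's actual defaults; the runtime reports the real values):

```python
def client_config(local: bool,
                  local_endpoint: str = "http://localhost:5273/v1") -> dict:
    """OpenAI-SDK constructor kwargs for on-device vs. cloud inference.

    The local endpoint/key here are illustrative placeholders; query the
    Foundry Local runtime for the real values. Everything downstream of the
    client (messages, tools, streaming) is unchanged between the two modes.
    """
    if local:
        return {"base_url": local_endpoint, "api_key": "not-needed-locally"}
    return {"base_url": "https://api.openai.com/v1", "api_key": "$OPENAI_API_KEY"}

print(client_config(local=True))
```

Keeping the switch in one place makes it trivial to A/B the same product flow against local and hosted models.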

vLLM & SGLang — Production Serving

vLLM (GitHub) and SGLang (GitHub) are high-throughput inference servers for production deployments. Use these when serving models to multiple users or applications. vLLM pioneered PagedAttention for efficient KV-cache memory management; both support continuous batching and tensor parallelism.

Apple MLX — Native Apple Silicon

GitHub: ml-explore/mlx · License: MIT

Apple's own machine learning framework, designed specifically for Apple Silicon. Unified memory architecture means no CPU↔GPU data copying. LM Studio uses MLX under the hood on Mac. Falcon H1 has native MLX integration. If you're building Apple-first applications, MLX gives the best performance per watt on M-series chips.

Which Runtime Should You Use?

| Scenario | Best Tool | Why |
|---|---|---|
| Getting started / experimenting | Ollama | One-command install, broadest model support |
| Custom quantization / GGUF models | llama.cpp | Maximum control, all quantization formats |
| GUI / non-technical demos | LM Studio | Polished desktop app, one-click downloads |
| Embedded in your product | Foundry Local | 20MB runtime, 4 language SDKs, auto hardware |
| Production serving (multi-user) | vLLM / SGLang | High throughput, continuous batching |
| Apple-native development | MLX | Unified memory, best perf/watt on M-series |

OpenClaw — Your Personal AI Assistant

GitHub: openclaw/openclaw · License: MIT · Stars: 372k+

OpenClaw turns local LLMs into a personal AI assistant that meets you on the channels you already use: WhatsApp, Telegram, Slack, Discord, Signal, iMessage, Google Chat, Microsoft Teams, Matrix, and 15+ more platforms.

It runs as a lightweight gateway on your own device. You point it at any LLM provider — Ollama for local inference, or cloud providers for heavier models. The result is a single-user AI assistant that's always on, runs locally, and responds across every messaging platform simultaneously.

```shell
# Install
npm install -g openclaw@latest

# Guided setup (auth, channels, skills)
openclaw onboard --install-daemon

# Launch directly from Ollama
ollama launch openclaw

# Send a message
openclaw agent --message "Ship checklist" --thinking high
```

The OpenClaw ecosystem includes ClawHub (skill directory), gogcli (Google Workspace in your terminal), mcporter (MCP-to-TypeScript bridge), and Peekaboo (macOS screenshot MCP server for AI agents). It's MIT-licensed, supports voice on macOS/iOS/Android, and renders a live Canvas you control.

For developers who want their AI assistant to work across every platform without giving data to a cloud provider, OpenClaw + Ollama is a powerful combination.

Linux Foundation AI & Data

The Linux Foundation AI & Data (LF AI) doesn't host LLM model weights — it hosts the infrastructure ecosystem around AI and ML, including projects like ONNX, MLflow, and KServe that matter to developers working with open models.

The LF AI ecosystem is the glue. Models get the headlines, but ONNX, MLflow, and KServe are what make local and production LLM deployments actually work. If you're evaluating AI tooling for your organization, these projects deserve as much attention as the models themselves.

Apple Silicon Deployment Matrix

What actually runs on your Mac, and what's comfortable versus technically-possible-but-painful.

| Mac Hardware | RAM | Comfortable Models | Stretch (Slow but Works) |
|---|---|---|---|
| M1/M2 base | 8GB | Llama 3.2 1B/3B, Gemma 1B, Falcon-H1 0.5B | Phi-3 mini 3.8B (Q4) |
| M2/M3 Pro | 16GB | Phi-4 14B (Q4), Qwen3-8B, R1-Distill-7B, Granite 8B | Mistral 7B, StarCoder 2 15B (Q4) |
| M2/M3 Max | 32GB | Mistral Small 3.1 24B (Q4), Qwen3-14B, Gemma 3 27B (Q4) | Qwen3-32B (Q3), Falcon-H1-34B (Q4) |
| M2/M3/M4 Ultra | 64GB+ | Qwen3-32B, R1-Distill-32B, Granite Code 34B | Llama 3.3 70B (Q4), R1-Distill-70B (Q3) |
| Mac Studio Ultra | 128GB+ | Llama 3.3 70B, R1-Distill-70B, Mixtral 8x22B (Q4) | Qwen3-235B-A22B MoE (Q4, experimental) |

Key assumptions: "Comfortable" means reasonable generation speed (10+ tokens/sec) with enough headroom for context. "Stretch" means it loads and runs but expect slower generation (3–8 tokens/sec) with reduced context windows. All assume Q4_K_M quantization via Ollama or llama.cpp.
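
The RAM tiers in the matrix follow from simple arithmetic: quantized weight size is roughly parameters × bits-per-weight ÷ 8, plus headroom for KV cache, runtime buffers, and the OS. A rough sketch (the ~4.85 bits/weight figure for Q4_K_M and the 1.25× headroom factor are approximations, not measurements):

```python
def est_memory_gb(params_b: float, bits_per_weight: float = 4.85,
                  overhead: float = 1.25) -> float:
    """Rough resident-memory estimate for a quantized model.

    params_b: parameter count in billions. 4.85 bpw approximates Q4_K_M;
    overhead approximates KV cache, buffers, and context. Both are estimates.
    """
    weights_gb = params_b * bits_per_weight / 8
    return weights_gb * overhead

for p in (8, 14, 32, 70):
    print(f"{p}B @ Q4_K_M ~ {est_memory_gb(p):.1f} GB")
```

Running the numbers reproduces the matrix's tiers: a 14B model lands around 10–11 GB (comfortable on 16GB), while 70B lands past 50 GB (Ultra territory).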

🔒 Security Considerations for Local AI

  • Local inference ≠ automatically private. Agents calling cloud APIs still send data externally.
  • Coding agents can execute shell commands. Sandbox them — never run with root/admin privileges.
  • Secrets can leak into prompts and model logs. Scrub environment variables before piping context.
  • Audit logs matter. Track what models generate, especially in regulated environments.
  • Don't run production inference on consumer hardware without monitoring and rate limiting.

Frequently Asked Questions

What is the difference between open-source and open-weight LLMs?

Open-source LLMs release everything: model weights, training code, training data, and evaluation pipelines under a permissive license. Open-weight models release only the trained weights, often with custom licenses. Most models marketed as "open-source" — including Llama 4, DeepSeek, and Qwen 3 — are actually open-weight. Only OLMo (AI2) and StarCoder 2 (BigCode) qualify as truly open source.

Which open-weight LLM is best for coding agents?

For coding agents like Aider, Continue.dev, and OpenCode/Crush, the top recommendations are: Qwen3-32B for best overall agent and tool-calling capabilities, DeepSeek-R1-Distill-Qwen-32B for complex reasoning and debugging (MIT license), and Mistral Small 3.1 for function calling with Apache 2.0 licensing. All three run locally on Apple Silicon Macs with 32GB+ RAM when quantized.

Can I run LLMs locally on a MacBook?

Yes. Tools like Ollama, llama.cpp, LM Studio, and Microsoft Foundry Local make it straightforward. On a 16GB MacBook, you can run 7B–14B parameter models comfortably. With 32GB, models up to 24B–32B quantized work well. Check the Apple Silicon deployment matrix for specific recommendations by hardware.

Is Ollama free and open source?

Yes. Ollama is licensed under MIT and is completely free. It provides a simple CLI and REST API for downloading, running, and managing LLMs locally. It supports macOS, Linux, Windows, and Docker, and integrates with coding agents like Claude Code, OpenCode/Crush, GitHub Copilot CLI, and OpenClaw.

What is Microsoft Foundry Local?

Foundry Local is Microsoft's lightweight (~20MB) local AI runtime with native SDKs for C#, JavaScript, Python, and Rust. It auto-detects hardware (NPU, GPU, CPU) and uses ONNX Runtime for inference. It includes a curated catalog of optimized models and an OpenAI-compatible API. No Azure subscription required — all inference runs on-device with zero latency.

The Bottom Line

The open-weight LLM ecosystem is the most vibrant it's ever been. Developers have genuine choices — from MIT-licensed reasoning models (Phi-4, DeepSeek R1) to Apache 2.0 agent-first models (Qwen 3, Granite) to truly open-source research models (OLMo). The runtimes to deploy them locally — Ollama, Foundry Local, llama.cpp — are mature, free, and getting better every month.

But read the licenses carefully. "Open" doesn't always mean what you think. Codestral looks open until you try to use it in production. Gemma looks permissive until Google's terms restrict fine-tuning for competing models. Llama 4 is free unless your company has 700M+ users. The open-source vs. open-weight distinction isn't academic — it has real implications for your codebase, your product, and your business.

If you found this guide useful, share it with your team or join the Code To Cloud Discord community where developers across Alberta and Western Canada discuss LLMs, agents, and developer tooling every week.

Need Help Choosing the Right LLM for Your Product?

50+ advisory engagements across Alberta & Western Canada

We help startups and growing businesses evaluate open-weight models, navigate licensing, build local inference pipelines, and ship AI-powered products — without vendor lock-in.

Kevin Evans

Fractional CTO & Founder, Code To Cloud Inc.

Kevin Evans is a fractional CTO and technology advisor based in Calgary, Alberta. He helps startups and growing businesses across Western Canada make smart technology decisions — from developer environment setup to AI agent strategy. More about Kevin