All GitHub repositories, licenses, and parameter counts verified — May 2026
The open-source LLM landscape has exploded. In 2024, there were a handful of usable open models. Today, developers can choose from dozens of production-grade LLMs — many rivaling closed-source alternatives — and run them locally on a MacBook with zero API costs.
But here's the catch most guides skip: almost none of these models are actually open source. They're open weight. The distinction matters — for licensing, for trust, and for your business. This guide covers every major model a developer should know about, the tools to run them locally, the agents they power, and the license traps to avoid.
What's Inside
- Open source vs. open weight — why the distinction matters for your stack
- 15 model families compared: Llama 4, DeepSeek R1, Qwen 3, Mistral, Phi-4, Gemma, Granite, Falcon, and more
- Master comparison table — license, parameters, context length, Apple Silicon viability
- Best models for coding agents (Aider, Continue.dev, OpenCode, OpenClaw)
- 6 ways to run models locally: Ollama, llama.cpp, LM Studio, Foundry Local, vLLM, Apple MLX
- OpenClaw — the personal AI assistant turning local LLMs into a cross-platform agent
- Apple Silicon deployment matrix — what runs on your Mac hardware
- License warning box — commercial use pitfalls and thresholds
Open Source vs. Open Weight — The Distinction That Matters
The industry uses "open source" loosely. Here's what the terms actually mean:
Open source means the model weights, training code, training data, evaluation pipelines, and intermediate checkpoints are all publicly available under a permissive license. You can reproduce the model from scratch. As of May 2026, only two model families meet this bar:
- OLMo (Allen Institute for AI) — Apache 2.0, full training code, data, configs, and every checkpoint
- StarCoder 2 (BigCode / Hugging Face + ServiceNow) — Open RAIL-M, training code + The Stack v2 dataset public
Open weight means the trained model weights are downloadable, but training data, training code, or both are proprietary. This includes Llama 4, DeepSeek, Qwen 3, Mistral, Phi-4, Gemma, and most others. Some use genuinely permissive licenses (MIT, Apache 2.0). Others use custom licenses with restrictions you must read carefully.
⚖️ License Warning Box
- MIT / Apache 2.0 — Use commercially, modify, redistribute. Best for production. (Phi-4, Granite, Qwen 3, Mistral Small 3.1, DeepSeek R1)
- Llama 4 Community License — Free for companies under 700M monthly active users. Must include "Llama" in derivative names.
- Gemma Terms of Use — Commercial use allowed, but Google can revoke; prohibits using weights to train competing foundation models.
- Codestral MNPL — Non-production only. Cannot use in commercial products without a paid Mistral license.
- CC-BY-NC (Cohere) — Non-commercial only. No production use without a Cohere contract.
- DeepSeek Model Agreement — V3 weights governed by PRC law. R1 distilled models are MIT.
Best Model by Use Case — Quick Picks
Before the deep dive, here are the top picks by scenario:
Quick Recommendation Matrix
| Use Case | Top Pick | Why |
|---|---|---|
| Coding agent (Aider, OpenCode) | Qwen3-32B | Best function calling + tool use, Apache 2.0 |
| Complex reasoning / debugging | DeepSeek-R1-Distill-32B | o1-mini class reasoning, MIT license |
| Local Mac inference (16GB) | Phi-4 14B (Q4) | Best reasoning per parameter, MIT |
| Enterprise / regulated | IBM Granite 3.3 | Apache 2.0, GRC-vetted training data |
| Multilingual / global | Qwen3-235B-A22B | 100+ languages, Apache 2.0 |
| Research transparency | OLMo 2 | Fully open code, data, and checkpoints; Apache 2.0 |
| Code completion (FIM) | Qwen3-Coder / StarCoder 2 | 600+ languages, fill-in-middle support |
| Ultra-long context (1M+) | Llama 4 Scout | 10M token context window |
| Inference speed / throughput | Falcon H1 | Hybrid SSM architecture, 4–8× faster |
| Personal AI assistant | Any model + OpenClaw | WhatsApp, Telegram, Slack, Discord, Signal |
The Models: A Developer's Guide
Every GitHub URL, license, and parameter count below has been verified against official repositories and model cards. Links go to canonical sources — not third-party mirrors.
Meta Llama 4
GitHub: meta-llama/llama-models · License: Llama 4 Community License (custom) · Type: Open Weight
Meta's Llama 4 introduced mixture-of-experts (MoE) architecture to the Llama family. Scout activates 17B parameters across 16 experts with a staggering 10M token context window. Maverick scales to 128 experts with 1M context. Both use the same 17B active parameter footprint per token.
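To make "active parameters" concrete, here is a toy top-k routing sketch in Python (PyTorch). This is an illustration of the MoE idea only, not Meta's implementation; all names and sizes are invented:

```python
# Toy sketch of MoE routing (illustrative only, not Meta's implementation).
# A gate scores all experts per token and routes to the top-k, so only those
# experts' parameters are "active" even though every expert stays in memory.
import torch
import torch.nn as nn

class ToyMoE(nn.Module):
    def __init__(self, dim=64, n_experts=16, k=2):
        super().__init__()
        self.gate = nn.Linear(dim, n_experts)
        self.experts = nn.ModuleList(nn.Linear(dim, dim) for _ in range(n_experts))
        self.k = k

    def forward(self, x):                      # x: (tokens, dim)
        scores = self.gate(x).softmax(dim=-1)  # (tokens, n_experts)
        weights, idx = scores.topk(self.k, dim=-1)
        out = torch.zeros_like(x)
        for t in range(x.size(0)):             # weighted sum of each token's k experts
            for w, i in zip(weights[t], idx[t]):
                out[t] += w * self.experts[int(i)](x[t])
        return out

print(ToyMoE()(torch.randn(4, 64)).shape)      # torch.Size([4, 64])
```

Note that all experts must stay resident in memory even though only k run per token. That's why Llama 4 Scout still needs multi-GPU setups at full precision despite its 17B active footprint.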
| Model | Architecture | Context | Released |
|---|---|---|---|
| Llama 4 Scout | 17B active / 16 experts (MoE) | 10M tokens | Apr 2025 |
| Llama 4 Maverick | 17B active / 128 experts (MoE) | 1M tokens | Apr 2025 |
| Llama 3.3 | 70B dense | 128K | Dec 2024 |
| Llama 3.2 | 1B, 3B dense | 128K | Sep 2024 |
| Llama 3.1 | 8B, 70B, 405B dense | 128K | Jul 2024 |
Strengths: Largest community ecosystem. Thousands of fine-tunes on Hugging Face. Broadest tool support (Ollama, llama.cpp, vLLM, LM Studio). Llama 3.2 1B/3B run on any Apple Silicon Mac. Llama 4 Scout's 10M context is state-of-the-art.
Watch out: The Llama 4 Community License is not OSI-approved. Companies with over 700M monthly active users need a separate commercial license from Meta. Derivatives must include "Llama" in the name. Llama 4 models need multi-GPU setups at full precision.
Apple Silicon: Llama 3.2 1B/3B on any M-series Mac (8GB). Llama 3.1 8B quantized on M2/M3 Pro (16GB). 70B+ requires M2/M3 Ultra or Mac Studio.
DeepSeek V3 & R1
GitHub: deepseek-ai/DeepSeek-V3 · deepseek-ai/DeepSeek-R1 · License: V3 — custom DeepSeek License; R1 — MIT · Type: Open Weight
DeepSeek V3 is a 671B parameter MoE model that activates only 37B per token, trained for just 2.664M H800 GPU hours — remarkable efficiency. R1 is their reasoning model, built on the same architecture, trained via pure reinforcement learning without supervised fine-tuning. R1's distilled variants bring frontier reasoning to smaller models.
| Model | Total / Active Params | Context | License |
|---|---|---|---|
| DeepSeek-V3 | 671B / 37B (MoE) | 128K | DeepSeek License |
| DeepSeek-R1 | 671B / 37B (MoE) | 128K | MIT |
| R1-Distill-Qwen-32B | 32B dense | 32K | MIT |
| R1-Distill-Llama-70B | 70B dense | 128K | MIT |
| R1-Distill-Qwen-7B | 7B dense | 32K | MIT |
Strengths: R1-Distill-Qwen-32B outperforms OpenAI o1-mini on reasoning benchmarks. V3 competes with GPT-4o on coding and math. MIT license on R1 is exceptionally permissive for a frontier model. Pioneer of MLA (Multi-head Latent Attention) and FP8 training.
Watch out: V3's DeepSeek Model Agreement is governed by PRC law — a concern for regulated industries. The full 671B model requires an H100 cluster. R1 can produce verbose, repetitive chain-of-thought outputs.
Apple Silicon: R1-Distill-Qwen-7B on M2 Pro (16GB). R1-Distill 14B quantized on M2 Max (32GB). R1-Distill-32B quantized needs M2/M3 Ultra (64GB+). Full V3/R1 — not viable locally.
Qwen 3 (Alibaba)
GitHub: QwenLM/Qwen3 · License: Apache 2.0 · Type: Open Weight
Qwen 3 is the most versatile open-weight model family available. It spans from a tiny 0.6B model that runs on a phone to a 235B MoE flagship. The killer feature: a thinking mode toggle. Set enable_thinking=True for chain-of-thought reasoning, or False for fast chat — in one unified model. Qwen3 was explicitly designed for AI agents with native function calling and tool use.
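Here is what the toggle looks like in practice with Hugging Face Transformers, following the pattern on the Qwen3 model cards. The model name and prompt are illustrative:

```python
# Sketch: toggling Qwen 3's thinking mode via Hugging Face Transformers.
# Assumes a transformers version with Qwen3 support.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen3-8B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto", device_map="auto")

messages = [{"role": "user", "content": "How many prime numbers are below 30?"}]

# enable_thinking=True emits a chain-of-thought block before the answer;
# set it to False for fast, direct chat responses from the same weights.
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True,
)
inputs = tokenizer([text], return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=1024)
print(tokenizer.decode(outputs[0][inputs.input_ids.shape[-1]:], skip_special_tokens=True))
```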
| Model | Architecture | Context | Best For |
|---|---|---|---|
| Qwen3-235B-A22B | 235B / 22B active (MoE, 128 experts) | 32K–131K | Flagship, agents, multilingual |
| Qwen3-32B | 32B dense | 32K–131K | Coding agents, reasoning |
| Qwen3-14B | 14B dense | 32K–131K | Local inference, fine-tuning |
| Qwen3-8B | 8B dense | 32K–131K | Lightweight agents |
| Qwen3-30B-A3B | 30B / 3B active (MoE) | 32K–131K | Efficient edge inference |
| Qwen3-Coder | Various | 32K–131K | Code completion, FIM |
Strengths: Apache 2.0 for most sizes. Best-in-class agent and tool-calling capabilities among open models. 100+ languages. Thinking/non-thinking mode in a single model. The July 2025 Qwen3-2507 update significantly improved instruction following and coding.
Watch out: The native 32K context (extendable to 131K with YaRN) is far smaller than Llama 4 Scout's 10M. Very large models (the 235B MoE) need substantial infrastructure. Some enterprises have data sovereignty concerns with Alibaba-origin models.
Apple Silicon: 0.6B–8B on any M-series. 14B quantized on M2/M3 Pro (16GB). 32B requires M2/M3 Ultra.
Mistral (Small 3.1, Mixtral, Codestral)
GitHub: mistralai/mistral-inference · License: Varies by model · Type: Open Weight
Mistral AI from France offers models across a wide range of sizes and capabilities. The critical nuance: licenses vary dramatically by model. Mistral Small 3.1 and Mistral 7B are Apache 2.0 (genuinely permissive). Codestral uses a non-production license. Mistral Large is research-only.
| Model | Params | License | Commercial Use |
|---|---|---|---|
| Mistral Small 3.1 | 24B dense, multimodal, 128K ctx | Apache 2.0 | ✅ Yes |
| Mistral 7B v0.3 | 7B dense | Apache 2.0 | ✅ Yes |
| Mixtral 8x7B | 46.7B / 12.9B active (MoE) | Apache 2.0 | ✅ Yes |
| Mixtral 8x22B | 141B / 39B active (MoE) | Apache 2.0 | ✅ Yes |
| Codestral 22B | 22B dense | MNPL (non-production) | ❌ No |
| Mistral Large 2 | 123B dense | MRL (research only) | ❌ No |
Strengths: Mistral Small 3.1 (24B) fits on a single RTX 4090 or a 32GB MacBook when quantized. It adds vision understanding and 128K context. Native function calling. Excellent multilingual support. Apache 2.0 on the models that matter most.
Watch out: Codestral's MNPL license is a trap for commercial builders — it looks open but prohibits production use. Mistral Large 2 is research-only. No 70B+ Apache model in their lineup. Training data is not disclosed.
Apple Silicon: Mistral 7B on any M-series (8GB). Small 3.1 24B quantized on M2 Max / M3 Pro (32GB). Mixtral 8x7B quantized on M2 Max (32GB+).
Microsoft Phi-4
Hugging Face: microsoft/phi-4 · License: MIT · Type: Open Weight
Phi-4 proves that small models can punch above their weight. At just 14B parameters with a dense decoder-only transformer, it outperforms Llama 3.3 70B on several reasoning benchmarks. Microsoft achieved this through a training methodology focused on synthetic, textbook-quality data — 9.8T tokens of carefully curated content.
Strengths: MIT license — genuinely permissive, use anywhere. Exceptional reasoning for its size. Outstanding on STEM tasks (math, coding, science). Ideal for memory-constrained and latency-bound scenarios. One of the best "reasoning per parameter" ratios available.
Watch out: 16K context window — much smaller than the 128K standard. English-focused. Only one size available (14B). Not a general knowledge model — can hallucinate on factual questions. No dedicated GitHub model repository (Hugging Face only).
Apple Silicon: Phi-4 14B quantized runs well on M2/M3 Pro (16GB). Phi-3 mini (3.8B) runs on any M-series Mac.
Google Gemma 3
GitHub: google-deepmind/gemma · License: Gemma Terms of Use (custom) · Type: Open Weight
Gemma 3 from Google DeepMind brings multimodal capabilities (text + image) to a compact model family. The 27B model competes with much larger alternatives and supports 128K context with 140+ languages. A Gemma 4 generation with MoE architecture is emerging.
| Model | Params | Context | Features |
|---|---|---|---|
| Gemma 3 27B | 27B dense | 128K | Text + image, 140+ languages |
| Gemma 3 12B | 12B dense | 128K | Text + image |
| Gemma 3 4B | 4B dense | 128K | On-device, text + image |
| Gemma 3 1B | 1B dense | 32K | Edge / mobile |
Strengths: Google-quality pre-training. Strong safety alignment. Multimodal out of the box. Runs on CPU/GPU/TPU. 1B and 4B models excellent for edge deployments. Well-integrated with JAX, Keras, PyTorch, and Transformers.
Watch out: The Gemma Terms of Use is not OSI-approved. Google can revoke the license. Prohibits using weights to train competing foundation models. Smaller community than Llama/Qwen. JAX is the primary implementation — Hugging Face Transformers support is secondary.
Apple Silicon: Gemma 1B/4B on any M-series. 12B quantized on M2 Pro (16GB). 27B quantized on M2 Max (32GB+).
IBM Granite 3.3
GitHub: ibm-granite/granite-3.3-language-models · ibm-granite/granite-code-models · License: Apache 2.0 · Type: Open Weight
Granite is the most enterprise-vetted model family available. IBM applies full governance, risk, and compliance (GRC) screening to all training data — including legal review, ClamAV scanning, PII redaction, and license verification for every code file. For organizations in regulated industries, this audit trail matters.
| Model | Params | License | Best For |
|---|---|---|---|
| Granite 3.3 Language | 2B, 8B (dense) | Apache 2.0 | Enterprise RAG, function calling |
| Granite 3.0 MoE | 1B (400M active), 3B (800M active) | Apache 2.0 | Edge, constrained compute |
| Granite Code | 3B, 8B, 20B, 34B | Apache 2.0 | Code gen, 116 languages |
Strengths: Apache 2.0. Most transparent training data governance in the industry. IBM enterprise support. FIM (fill-in-middle) for code. Separated thinking/answer in reasoning tasks. StarCoder tokenizer compatibility. Granite Code covers 116 programming languages.
Watch out: Smaller parameter counts (max 34B for code, 8B for language) compared to frontier models. Less benchmark-competitive than Qwen 3 or DeepSeek at similar sizes. Not a general-purpose frontier model — positioned for enterprise use cases.
Apple Silicon: All Granite models (2B–8B language, 3B–34B code) run on Apple Silicon. 8B on M2 Pro (16GB).
Falcon H1 (TII)
GitHub: tiiuae/Falcon-H1 · License: Apache 2.0 (smaller models), Falcon-LLM License (34B) · Type: Open Weight
Falcon H1 introduces a novel hybrid SSM+Attention architecture — combining Mamba2 state-space models with transformer attention in parallel. The result: 4× input throughput and 8× output throughput compared to pure transformer models of similar size. This architectural innovation is the most significant departure from standard transformer design in this guide.
| Model | Params | Context | Architecture |
|---|---|---|---|
| Falcon-H1-34B | 34B dense | 256K | Hybrid SSM+Attention |
| Falcon-H1-7B | 7B dense | 256K | Hybrid SSM+Attention |
| Falcon-H1-3B | 3B dense | 256K | Hybrid SSM+Attention |
| Falcon-H1-1.5B | 1.5B dense | 256K | Hybrid SSM+Attention |
| Falcon-H1-0.5B | 0.5B dense | 256K | Hybrid SSM+Attention |
Strengths: 256K context window across all sizes. Falcon-H1-34B competes with Qwen2.5-72B and Llama 3.3-70B at half the parameters. The 0.5B model delivers performance on par with typical 2024-era 7B models. Natively integrated into Apple MLX and explicitly demonstrated on a MacBook M4 Max. Also supports llama.cpp, vLLM, and SGLang.
Watch out: The 34B model uses the custom Falcon-LLM License (verify commercial terms). Less community adoption than Llama or Qwen. Newer architecture means fewer ecosystem integrations and fine-tuning recipes. From TII (Technology Innovation Institute, Abu Dhabi).
Apple Silicon: Falcon-H1-1.5B confirmed running on MacBook M4 Max — natively integrated into Apple MLX. 0.5B–7B on any M-series. 34B quantized on M2/M3 Ultra.
StarCoder 2 (BigCode)
GitHub: bigcode-project/starcoder2 · License: BigCode Open RAIL-M v1 · Type: Mostly Open
StarCoder 2 is the most transparent code model available. Developed by the BigCode project (a Hugging Face + ServiceNow collaboration), it trains on The Stack v2 — a publicly available dataset spanning 600+ programming languages with an opt-out mechanism for code authors.
Sizes: 3B, 7B, 15B, trained on 3T to 4T+ tokens. 16K context with 4K sliding-window attention.
Strengths: Most transparent training data of any code model. 600+ language coverage. Well-evaluated on the BigCode leaderboard. Fill-in-middle (FIM) support. Fine-tuning-friendly. Permissive RAIL license allows commercial use.
Watch out: Not instruction-tuned — it's a completion model, not a chat model. 16K context is limited for large codebases. Outperformed by newer code models (Qwen3-Coder, DeepSeek) on benchmarks. Best used as a base for fine-tuning or code completion, not conversational coding agents.
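For the code-completion use case, a FIM prompt wraps the cursor position in special tokens. A minimal Transformers sketch, assuming the StarCoder-style FIM token vocabulary (verify against the model card before relying on it):

```python
# Sketch: fill-in-the-middle with StarCoder 2. The <fim_*> tokens follow the
# StarCoder convention; confirm them against the tokenizer's special tokens.
from transformers import pipeline

generator = pipeline("text-generation", model="bigcode/starcoder2-3b")
prompt = (
    "<fim_prefix>def average(nums):\n"
    "    <fim_suffix>\n"
    "    return total / len(nums)<fim_middle>"
)
print(generator(prompt, max_new_tokens=16)[0]["generated_text"])
```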
OLMo 2/3 (AI2) — The Truly Open Source LLM
GitHub: allenai/OLMo-core · License: Apache 2.0 · Type: Truly Open Source
OLMo from the Allen Institute for AI is the gold standard for open-source LLMs. Not just the weights — the full training code, every intermediate checkpoint (every 1,000 steps), training configs, data provenance CSVs, evaluation pipelines, and WandB training runs are all public. If reproducibility and transparency matter to your organization, OLMo is the benchmark.
Sizes: OLMo-2: 1B, 7B, 13B, 32B. OLMo-3: 7B, 32B.
Strengths: The most genuinely open-source LLM. Apache 2.0 on everything. Full 2-stage training recipe (OLMo-mix-1124 → Dolmino-mix-1124). Explicitly supports training on Apple Silicon. Ideal for research, reproducibility, and understanding LLM training dynamics.
Watch out: Not a frontier model — competitive at size but trails Qwen 3 and DeepSeek on leaderboards. Primary use case is research, not production. 32B is the largest available. Smaller community than Llama.
Apple Silicon: OLMo-2 1B/7B viable via Hugging Face Transformers. Apple Silicon training is explicitly documented in the README.
Other Notable Models
Yi (01.AI) — Apache 2.0, 6B/9B/34B, strong bilingual Chinese/English, 200K context variants. Development has slowed versus Qwen and DeepSeek — review benchmarks before choosing Yi for new projects.
Code Llama (facebookresearch/codellama) — Superseded. Based on Llama 2 architecture. Newer Llama 3.x and Qwen3-Coder models outperform it on most benchmarks. Migrate to Llama 3.3/4 or Qwen3-Coder for new projects.
DBRX (Databricks) — Superseded. 132B MoE, released March 2024. No GitHub model repo. Surpassed by Qwen 3, DeepSeek V3, and Llama 4 for most use cases.
Cohere Command R+ — 104B, purpose-built for RAG with grounding citations. CC-BY-NC license prohibits commercial use without a paid Cohere contract. Strong at retrieval and multi-step tool calling, but the license makes it impractical for most developers.
Building an AI-powered product and not sure which model fits? We help startups and growing businesses across Western Canada evaluate LLM options, licensing, and deployment strategies.
Book a Free Strategy Call
Master Comparison Table
All models side by side. Scroll horizontally on mobile.
| Model | Org | Params | License | Context | Commercial | Apple Silicon |
|---|---|---|---|---|---|---|
| Llama 4 Scout | Meta | 17B active (MoE) | Llama 4 Community | 10M | ✅ <700M MAU | ❌ Multi-GPU |
| Llama 3.3 | Meta | 70B | Llama 3.3 Community | 128K | ✅ <700M MAU | Ultra only |
| Llama 3.2 | Meta | 1B, 3B | Llama 3.2 Community | 128K | ✅ <700M MAU | ✅ Any Mac |
| DeepSeek-V3 | DeepSeek | 671B/37B (MoE) | DeepSeek License | 128K | ✅ w/ restrictions | ❌ Cluster |
| DeepSeek-R1 | DeepSeek | 671B/37B (MoE) | MIT | 128K | ✅ Free | ❌ Cluster |
| R1-Distill-32B | DeepSeek | 32B | MIT | 32K | ✅ Free | Ultra (64GB) |
| R1-Distill-7B | DeepSeek | 7B | MIT | 32K | ✅ Free | ✅ M2 Pro |
| Qwen3-235B-A22B | Alibaba | 235B/22B (MoE) | Apache 2.0 | 32K–131K | ✅ Free | ❌ Cluster |
| Qwen3-32B | Alibaba | 32B | Apache 2.0 | 32K–131K | ✅ Free | Ultra (64GB) |
| Qwen3-8B | Alibaba | 8B | Apache 2.0 | 32K–131K | ✅ Free | ✅ M2 Pro |
| Mistral Small 3.1 | Mistral AI | 24B | Apache 2.0 | 128K | ✅ Free | M2 Max (32GB) |
| Codestral 22B | Mistral AI | 22B | MNPL | 32K | ❌ No | M2 Max (32GB) |
| Phi-4 | Microsoft | 14B | MIT | 16K | ✅ Free | ✅ M2 Pro |
| Gemma 3 27B | Google | 27B | Gemma ToU | 128K | ✅ w/ restrictions | M2 Max (32GB) |
| Gemma 3 4B | Google | 4B | Gemma ToU | 128K | ✅ w/ restrictions | ✅ Any Mac |
| Granite 3.3 8B | IBM | 8B | Apache 2.0 | 128K | ✅ Free | ✅ M2 Pro |
| Granite Code 34B | IBM | 34B | Apache 2.0 | 128K | ✅ Free | Ultra (64GB) |
| Falcon-H1-34B | TII | 34B | Falcon-LLM | 256K | ⚠️ Check terms | Ultra (64GB) |
| Falcon-H1-1.5B | TII | 1.5B | Apache 2.0 | 256K | ✅ Free | ✅ Any Mac (MLX) |
| StarCoder 2 15B | BigCode | 15B | Open RAIL-M | 16K | ✅ w/ use restrictions | ✅ M2 Pro |
| OLMo-2 32B | AI2 | 32B | Apache 2.0 | 128K | ✅ Free | Ultra (64GB) |
| OLMo-2 7B | AI2 | 7B | Apache 2.0 | 128K | ✅ Free | ✅ M2 Pro |
Which Models Work Best for Coding Agents
Not all models are equal for agentic coding. Agents need strong instruction following, reliable function calling, ability to generate and apply patches, and multi-step reasoning. Here's what works best with popular coding tools:
Aider — The most popular open-source coding agent. Works best with DeepSeek R1 and V3 (per their README), Qwen3-32B, and Mistral Small 3.1. Over 6.8M installs and 15B tokens per week processed.
Continue.dev — Apache 2.0 licensed IDE extension. Now includes CI-enforceable code checks via .continue/checks/ markdown files. Works with any OpenAI-compatible API — pair with Ollama for fully local inference using any model above.
OpenCode → Crush — Note: the original opencode-ai/opencode repository has been archived. The project continues as charmbracelet/crush by the original author and the Charm team. If you're using Kodra macOS or referencing OpenCode, update to Crush.
GitHub Copilot CLI — Works with GitHub's models out of the box, but can also be paired with local models via compatible backends for privacy-sensitive work.
Coding Agent Model Rankings
| Rank | Model | Why |
|---|---|---|
| 1 | Qwen3-32B | Best function calling, tool use, explicit agent design. Apache 2.0. |
| 2 | DeepSeek-R1-Distill-32B | Best reasoning for complex debugging. MIT license. |
| 3 | Mistral Small 3.1 (24B) | Best-in-class function calling. Apache 2.0. Fits on MacBook. |
| 4 | Llama 3.3 70B | Strong general coding. Massive ecosystem. Best community support. |
| 5 | Qwen3-Coder | Dedicated code model. Excellent FIM and multi-language support. |
How to Run Models Locally
Six tools dominate local LLM inference. Each has a distinct sweet spot.
Ollama — The Easiest Starting Point
GitHub: ollama/ollama · License: MIT · Backend: llama.cpp
Ollama is the Docker of LLMs. One command to install, one command to run any model. It handles model downloads, quantization, and serves an OpenAI-compatible REST API on localhost:11434.
```bash
# Install
curl -fsSL https://ollama.com/install.sh | sh

# Run models
ollama run qwen3
ollama run deepseek-r1:7b
ollama run gemma3
ollama run phi4

# Launch coding integrations
ollama launch claude
ollama launch openclaw

# REST API
curl http://localhost:11434/api/chat -d '{
  "model": "qwen3",
  "messages": [{"role": "user", "content": "Explain MoE architecture"}]
}'
```
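Because the API is OpenAI-compatible, existing SDK code can target Ollama directly. A minimal Python sketch, assuming `pip install openai` and that the model has already been pulled:

```python
# Sketch: the same chat call via the OpenAI Python SDK, pointed at Ollama's
# OpenAI-compatible /v1 endpoint. The api_key is unused but must be non-empty.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
resp = client.chat.completions.create(
    model="qwen3",
    messages=[{"role": "user", "content": "Explain MoE architecture"}],
)
print(resp.choices[0].message.content)
```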
Ollama supports macOS, Linux, Windows, and Docker. It integrates directly with Claude Code, OpenCode/Crush, Codex, GitHub Copilot CLI, OpenClaw, and dozens of other tools. For most developers, Ollama is the right starting point.
llama.cpp — Maximum Control
GitHub: ggml-org/llama.cpp · License: MIT
The foundational inference engine that Ollama runs on top of. If you need maximum control — custom quantization, fine-tuned GGUF models, batch inference, embedding generation, or a built-in web UI — llama.cpp is the direct path. It supports Metal acceleration on Apple Silicon, CUDA on NVIDIA, and Vulkan on AMD.
Recent additions include multimodal support in llama-server, native GPT-OSS model support with MXFP4 format, and VS Code/Vim extensions for FIM code completions.
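From Python, the community llama-cpp-python bindings wrap the same engine. A minimal sketch; the GGUF path is illustrative:

```python
# Sketch: local GGUF inference via the community llama-cpp-python bindings.
# n_gpu_layers=-1 offloads all layers to Metal (Apple Silicon) or CUDA.
from llama_cpp import Llama

llm = Llama(model_path="./models/phi-4-Q4_K_M.gguf", n_ctx=8192, n_gpu_layers=-1)
out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Write a haiku about quantization."}],
    max_tokens=128,
)
print(out["choices"][0]["message"]["content"])
```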
LM Studio — GUI for Everyone
Website: lmstudio.ai · License: Free for personal use
LM Studio provides a polished desktop app for discovering, downloading, and chatting with LLMs. It uses Apple's MLX framework on Mac for native Apple Silicon acceleration. Key features include an OpenAI-compatible local server, one-click model downloads from Hugging Face, and built-in quantization. Ideal for developers who want a GUI experience or need to demo local AI to non-technical stakeholders.
Microsoft Foundry Local — Embedded AI Runtime
GitHub: microsoft/foundry-local · License: MIT · Size: ~20MB
Foundry Local is Microsoft's answer to embedded local AI. At just ~20MB, it provides a complete runtime with native SDKs for C#, JavaScript, Python, and Rust. It uses ONNX Runtime under the hood with automatic hardware acceleration — detecting whether to use NPU, GPU, or CPU on each device.
```bash
# Python quickstart
pip install foundry-local-sdk

# JavaScript quickstart
npm install foundry-local-sdk
```
The curated catalog includes optimized variants of Qwen, DeepSeek, Mistral, Phi, and Whisper. Every model goes through quantization and compression testing to balance quality and performance. An OpenAI-compatible API means existing OpenAI SDK code works with minimal changes. No Azure subscription required — everything runs on-device.
Foundry Local is the best choice when you're building a product that ships with embedded AI. The lightweight runtime, auto hardware detection, and native SDKs across four languages make it ideal for desktop applications, offline tools, and privacy-sensitive use cases.
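A minimal Python sketch following the SDK's quickstart pattern. The model alias is illustrative, and the exact helper names should be checked against the microsoft/foundry-local docs:

```python
# Sketch: Foundry Local via its Python SDK plus the OpenAI client.
# Alias and helper names follow the published quickstart; verify before use.
import openai
from foundry_local import FoundryLocalManager

alias = "phi-3.5-mini"
manager = FoundryLocalManager(alias)  # starts the service, downloads/loads the model
client = openai.OpenAI(base_url=manager.endpoint, api_key=manager.api_key)
response = client.chat.completions.create(
    model=manager.get_model_info(alias).id,
    messages=[{"role": "user", "content": "Why is the sky blue?"}],
)
print(response.choices[0].message.content)
```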
vLLM & SGLang — Production Serving
vLLM (GitHub) and SGLang (GitHub) are high-throughput inference servers for production deployments. Use these when serving models to multiple users or applications. vLLM pioneered PagedAttention for efficient KV-cache memory management; SGLang adds RadixAttention prefix caching on a similar paged design. Both support continuous batching and tensor parallelism.
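A minimal offline batch-inference sketch with vLLM; the model name is illustrative:

```python
# Sketch: offline batch generation with vLLM (pip install vllm).
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen3-8B")
params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["Summarize PagedAttention in two sentences."], params)
print(outputs[0].outputs[0].text)
```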
Apple MLX — Native Apple Silicon
GitHub: ml-explore/mlx · License: MIT
Apple's own machine learning framework, designed specifically for Apple Silicon. Unified memory architecture means no CPU↔GPU data copying. LM Studio uses MLX under the hood on Mac. Falcon H1 has native MLX integration. If you're building Apple-first applications, MLX gives the best performance per watt on M-series chips.
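A minimal sketch using the companion mlx-lm package (`pip install mlx-lm`); the model repo is illustrative, and any MLX-converted model from Hugging Face works:

```python
# Sketch: text generation with mlx-lm on Apple Silicon.
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/Mistral-7B-Instruct-v0.3-4bit")
text = generate(
    model,
    tokenizer,
    prompt="Explain unified memory on M-series chips.",
    verbose=True,  # streams tokens and prints generation stats
)
```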
Which Runtime Should You Use?
| Scenario | Best Tool | Why |
|---|---|---|
| Getting started / experimenting | Ollama | One-command install, broadest model support |
| Custom quantization / GGUF models | llama.cpp | Maximum control, all quantization formats |
| GUI / non-technical demos | LM Studio | Polished desktop app, one-click downloads |
| Embedded in your product | Foundry Local | 20MB runtime, 4 language SDKs, auto hardware |
| Production serving (multi-user) | vLLM / SGLang | High throughput, continuous batching |
| Apple-native development | MLX | Unified memory, best perf/watt on M-series |
OpenClaw — Your Personal AI Assistant
GitHub: openclaw/openclaw · License: MIT · Stars: 372k+
OpenClaw turns local LLMs into a personal AI assistant that meets you on the channels you already use: WhatsApp, Telegram, Slack, Discord, Signal, iMessage, Google Chat, Microsoft Teams, Matrix, and 15+ more platforms.
It runs as a lightweight gateway on your own device. You point it at any LLM provider — Ollama for local inference, or cloud providers for heavier models. The result is a single-user AI assistant that's always on, runs locally, and responds across every messaging platform simultaneously.
```bash
# Install
npm install -g openclaw@latest

# Guided setup (auth, channels, skills)
openclaw onboard --install-daemon

# Launch directly from Ollama
ollama launch openclaw

# Send a message
openclaw agent --message "Ship checklist" --thinking high
```
The OpenClaw ecosystem includes ClawHub (skill directory), gogcli (Google Workspace in your terminal), mcporter (MCP-to-TypeScript bridge), and Peekaboo (macOS screenshot MCP server for AI agents). It's MIT-licensed, supports voice on macOS/iOS/Android, and renders a live Canvas you control.
For developers who want their AI assistant to work across every platform without giving data to a cloud provider, OpenClaw + Ollama is a powerful combination.
Linux Foundation AI & Data
The Linux Foundation AI & Data (LF AI) doesn't host LLM model weights — it hosts the infrastructure ecosystem around AI and ML. Key projects relevant to developers working with open models:
- ONNX — Open neural network exchange format. The backbone of Microsoft Foundry Local's inference engine.
- MLflow — Experiment tracking, model registry, deployment. Essential for managing LLM fine-tuning workflows.
- KServe — Kubernetes-native model serving. Deploy LLMs at scale on your own infrastructure.
- BeeAI — Open-source framework for building production-ready agents (LF AI graduated project, 2025).
- Data Prep Kit — Simplifies unstructured data preparation for LLM training and fine-tuning (IBM-contributed).
- AI Fairness 360 / AI Explainability 360 — IBM-contributed toolkits for trustworthy, auditable AI systems.
- Adversarial Robustness Toolbox (ART) — ML model security evaluation and defense.
The LF AI ecosystem is the glue. Models get the headlines, but ONNX, MLflow, and KServe are what make local and production LLM deployments actually work. If you're evaluating AI tooling for your organization, these projects deserve as much attention as the models themselves.
Apple Silicon Deployment Matrix
What actually runs on your Mac, and what's comfortable versus technically-possible-but-painful.
| Mac Hardware | RAM | Comfortable Models | Stretch (Slow but Works) |
|---|---|---|---|
| M1/M2 base | 8GB | Llama 3.2 1B/3B, Gemma 1B, Falcon-H1 0.5B | Phi-3 mini 3.8B (Q4) |
| M2/M3 Pro | 16GB | Phi-4 14B (Q4), Qwen3-8B, R1-Distill-7B, Granite 8B | Mistral 7B, StarCoder 2 15B (Q4) |
| M2/M3 Max | 32GB | Mistral Small 3.1 24B (Q4), Qwen3-14B, Gemma 3 27B (Q4) | Qwen3-32B (Q3), Falcon-H1-34B (Q4) |
| M2/M3/M4 Ultra | 64GB+ | Qwen3-32B, R1-Distill-32B, Granite Code 34B | Llama 3.3 70B (Q4), R1-Distill-70B (Q3) |
| Mac Studio Ultra | 128GB+ | Llama 3.3 70B, R1-Distill-70B, Mixtral 8x22B (Q4) | Qwen3-235B-A22B MoE (Q4, experimental) |
Key assumptions: "Comfortable" means reasonable generation speed (10+ tokens/sec) with enough headroom for context. "Stretch" means it loads and runs but expect slower generation (3–8 tokens/sec) with reduced context windows. All assume Q4_K_M quantization via Ollama or llama.cpp.
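These tiers follow from simple arithmetic: weights take roughly params × bits-per-weight ÷ 8 bytes, plus overhead for the KV cache and runtime buffers. A rough Python estimator, as a rule of thumb only (real usage varies with context length and runtime):

```python
# Rough memory estimate for a quantized model. Q4_K_M is ~4.8 bits/weight
# effective; the 25% overhead for KV cache and buffers is an assumption.
def est_gb(params_b: float, bits: float = 4.8, overhead: float = 1.25) -> float:
    return params_b * (bits / 8) * overhead

for name, p in [("Phi-4 14B", 14), ("Qwen3-32B", 32), ("Llama 3.3 70B", 70)]:
    print(f"{name}: ~{est_gb(p):.1f} GB at Q4_K_M")
# Phi-4 14B: ~10.5 GB  -> comfortable on 16GB
# Qwen3-32B: ~24.0 GB  -> needs 32GB+ (stretch) or 64GB (comfortable)
# Llama 3.3 70B: ~52.5 GB -> 64GB+ territory
```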
🔒 Security Considerations for Local AI
- Local inference ≠ automatically private. Agents calling cloud APIs still send data externally.
- Coding agents can execute shell commands. Sandbox them — never run with root/admin privileges.
- Secrets can leak into prompts and model logs. Scrub environment variables before piping context.
- Audit logs matter. Track what models generate, especially in regulated environments.
- Don't run production inference on consumer hardware without monitoring and rate limiting.
Frequently Asked Questions
What is the difference between open-source and open-weight LLMs?
Open-source LLMs release everything: model weights, training code, training data, and evaluation pipelines under a permissive license. Open-weight models release only the trained weights, often with custom licenses. Most models marketed as "open-source" — including Llama 4, DeepSeek, and Qwen 3 — are actually open-weight. Only OLMo (AI2) and StarCoder 2 (BigCode) qualify as truly open source.
Which open-weight LLM is best for coding agents?
For coding agents like Aider, Continue.dev, and OpenCode/Crush, the top recommendations are: Qwen3-32B for best overall agent and tool-calling capabilities, DeepSeek-R1-Distill-Qwen-32B for complex reasoning and debugging (MIT license), and Mistral Small 3.1 for function calling with Apache 2.0 licensing. All three run locally on Apple Silicon Macs with 32GB+ RAM when quantized.
Can I run LLMs locally on a MacBook?
Yes. Tools like Ollama, llama.cpp, LM Studio, and Microsoft Foundry Local make it straightforward. On a 16GB MacBook, you can run 7B–14B parameter models comfortably. With 32GB, models up to 24B–32B quantized work well. Check the Apple Silicon deployment matrix for specific recommendations by hardware.
Is Ollama free and open source?
Yes. Ollama is licensed under MIT and is completely free. It provides a simple CLI and REST API for downloading, running, and managing LLMs locally. It supports macOS, Linux, Windows, and Docker, and integrates with coding agents like Claude Code, OpenCode/Crush, GitHub Copilot CLI, and OpenClaw.
What is Microsoft Foundry Local?
Foundry Local is Microsoft's lightweight (~20MB) local AI runtime with native SDKs for C#, JavaScript, Python, and Rust. It auto-detects hardware (NPU, GPU, CPU) and uses ONNX Runtime for inference. It includes a curated catalog of optimized models and an OpenAI-compatible API. No Azure subscription is required; all inference runs on-device, so there are no network round-trips.
The Bottom Line
The open-weight LLM ecosystem is the most vibrant it's ever been. Developers have genuine choices — from MIT-licensed reasoning models (Phi-4, DeepSeek R1) to Apache 2.0 agent-first models (Qwen 3, Granite) to truly open-source research models (OLMo). The runtimes to deploy them locally — Ollama, Foundry Local, llama.cpp — are mature, free, and getting better every month.
But read the licenses carefully. "Open" doesn't always mean what you think. Codestral looks open until you try to use it in production. Gemma looks permissive until Google's terms restrict fine-tuning for competing models. Llama 4 is free unless your company has 700M+ users. The open-source vs. open-weight distinction isn't academic — it has real implications for your codebase, your product, and your business.
If you found this guide useful, share it with your team or join the Code To Cloud Discord community where developers across Alberta and Western Canada discuss LLMs, agents, and developer tooling every week.
Need Help Choosing the Right LLM for Your Product?
50+ advisory engagements across Alberta & Western Canada
We help startups and growing businesses evaluate open-weight models, navigate licensing, build local inference pipelines, and ship AI-powered products — without vendor lock-in.