System Overview · April 2026
North AI
north-ml.space

north-ml — open inference infrastructure
An open-source platform for running open-weight language models at low latency on Google TPU v6e-16 hardware, with an OpenAI-compatible API and a built-in AI coding assistant.
north-ml is an open-source inference platform built on Google's v6e-16 TPU hardware. It provides access to a curated set of high-quality open-weight language models — including Llama 4, Qwen 2.5, Gemma 3, and North AI's own Wind Edge — through a unified REST API, a web-based chat interface, and a VS Code extension. The platform itself is open source; the models are open-weight releases from Meta, Alibaba, Google, and North AI. This document describes the platform architecture, model selection, inference stack, and design philosophy.
The rapid proliferation of capable open-weight language models — including Meta's Llama 4, Qwen 2.5, Gemma 3, and North AI's own Wind Edge series — has created an opportunity to build inference infrastructure that is transparent, reproducible, and community-driven.
Existing inference providers are largely proprietary, opaque in their model routing, and subject to pricing and availability changes outside the user's control. north-ml addresses this with a fully open alternative: the platform code is public and auditable, and requests go to an explicitly listed set of models rather than through opaque routing.
north-ml is built on Next.js 16 with the App Router, deployed on Vercel for the web layer, with inference requests routed to TPU-backed endpoints. The system consists of three main components:
| Component | Technology | Purpose |
|---|---|---|
| Web Interface | Next.js 16 / React | Chat UI, model selection, workspace |
| API Layer | Next.js Route Handlers | Request handling, streaming, auth |
| Inference Backend | TPU v6e-16 (OpenAI-compatible) | Model execution, token generation |
| Auth | Firebase + NextAuth | User identity, session management |
| State | Zustand | Client-side chat and UI state |
| VS Code Extension | TypeScript / VS Code API | Editor-integrated AI assistant |
Inference requests are forwarded from the Next.js API layer to a TPU v6e-16 endpoint that exposes an OpenAI-compatible API (`/v1/chat/completions`). Responses are streamed back to the client using the Vercel AI SDK's streaming primitives, providing token-level streaming with low time-to-first-token.
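On the client side, each frame of the resulting SSE stream carries a JSON chunk whose `choices[0].delta.content` field holds the next text fragment. As a minimal sketch (the helper name `parseSSELine` is illustrative, not part of the platform's API):

```typescript
// Minimal parser for OpenAI-style SSE chunks as described in this section.
// Returns the text delta carried by one "data:" line, or null for control
// lines such as the "data: [DONE]" end-of-stream sentinel.
function parseSSELine(line: string): string | null {
  if (!line.startsWith("data: ")) return null;   // skip comments / blank lines
  const payload = line.slice("data: ".length).trim();
  if (payload === "[DONE]") return null;         // end-of-stream sentinel
  try {
    const chunk = JSON.parse(payload);
    return chunk.choices?.[0]?.delta?.content ?? null;
  } catch {
    return null;                                 // malformed frame: ignore
  }
}
```

A streaming reader would split the response body on newlines and feed each line through this function, appending every non-null result to the visible message.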
The TPU v6e-16 provides exceptional throughput for transformer inference due to its high-bandwidth memory (HBM) and matrix multiply units optimized for bfloat16 operations — the native precision of all models served by north-ml.
```
POST /api/chat
Content-Type: application/json

{
  "model": "llama-4-maverick",
  "messages": [...],
  "stream": true
}

→ Response: text/event-stream (SSE)

data: {"choices":[{"delta":{"content":"..."}}]}
data: [DONE]
```

north-ml serves four open-weight models, selected for their strong performance-to-size ratio and permissive licensing:
| Model | Parameters | Strengths | Provider |
|---|---|---|---|
| Llama 4 Maverick | 17B active (MoE) | Instruction following, chat, reasoning | Meta |
| Qwen 2.5 35B | 35B | Code, multilingual, long context | Alibaba |
| Gemma 3 27B | 27B | Strong reasoning, safety-tuned | Google |
| Wind Edge 1.6 | 1.6B MoE | Edge deployment, low latency | North AI |
All models are served in bfloat16 precision. Wind Edge 1.6 is North AI's own model, developed specifically for edge and embedded deployment — see the companion technical report for full details.
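The model lineup above lends itself to a small typed registry keyed by the `model` field of the chat API. A sketch only: the ids and field names here are assumptions for illustration, not the platform's actual schema.

```typescript
// Illustrative typed registry for the four served models. The ids and
// field names are assumptions for this sketch, not north-ml's real schema.
interface ModelInfo {
  id: string;       // value used in the "model" field of /api/chat
  provider: string;
  params: string;   // human-readable parameter count
  moe: boolean;     // sparse mixture-of-experts architecture
}

const MODELS: ModelInfo[] = [
  { id: "llama-4-maverick", provider: "Meta",     params: "17B active", moe: true },
  { id: "qwen-2.5-35b",     provider: "Alibaba",  params: "35B",        moe: false },
  { id: "gemma-3-27b",      provider: "Google",   params: "27B",        moe: false },
  { id: "wind-edge-1.6",    provider: "North AI", params: "1.6B",       moe: true },
];

// Resolve an incoming model id, e.g. to validate a chat request.
function findModel(id: string): ModelInfo | undefined {
  return MODELS.find((m) => m.id === id);
}
```

Validating the id against such a registry before forwarding keeps unknown model names from ever reaching the TPU backend.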
The north-ml web interface is a minimal, keyboard-friendly AI assistant built for developers; the design favors minimalism, keyboard-driven workflows, and fast streaming responses.
Summit is north-ml's agentic mode, allowing the AI to use tools (web search, code execution, file operations) across multi-step tasks. Summit is configured per-session via the workspace panel and integrates with the same TPU inference backend as the base chat interface.
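The tool-use loop described above can be pictured with a small interface: the model names a tool, the runtime dispatches it, and the result feeds the next step. This shape mirrors common function-calling schemas and is an assumption for illustration, not Summit's actual internal API.

```typescript
// Sketch of a tool interface for an agentic mode like Summit.
// The shape is an assumption, not Summit's real internal API.
interface Tool {
  name: string;
  description: string;
  run(args: Record<string, unknown>): Promise<string>;
}

// A trivially testable stand-in for a real tool (web search, code
// execution, file operations).
const echoTool: Tool = {
  name: "echo",
  description: "Returns its input unchanged (illustrative stand-in)",
  run: async (args) => String(args.text ?? ""),
};

// One step of an agent loop: dispatch a model-requested tool call by name.
async function dispatch(
  tools: Tool[],
  name: string,
  args: Record<string, unknown>
): Promise<string> {
  const tool = tools.find((t) => t.name === name);
  if (!tool) throw new Error(`unknown tool: ${name}`);
  return tool.run(args);
}
```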
The north-ml VS Code extension provides editor-integrated AI assistance, connecting to the same API backend as the web interface.
The extension is written in TypeScript and targets VS Code API 1.85+. It is distributed as a .vsix package and will be published to the VS Code Marketplace.
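Because the extension reuses the web interface's API, an editor action reduces to building the same `/api/chat` request body shown earlier, wrapping the current selection into the messages. The helpers below are illustrative, not the extension's actual code.

```typescript
// Sketch of how an editor client could build a /api/chat request body.
// The payload shape follows the API example earlier in this document;
// the helpers themselves are illustrative, not the extension's real code.
interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

function buildChatRequest(model: string, messages: ChatMessage[]): string {
  return JSON.stringify({ model, messages, stream: true });
}

// Wrap an editor selection into a request, e.g. for "explain this code".
function requestForSelection(
  model: string,
  selection: string,
  question: string
): string {
  const messages: ChatMessage[] = [
    { role: "system", content: "You are a coding assistant inside an editor." },
    { role: "user", content: `${question}\n\n${selection}` },
  ];
  return buildChatRequest(model, messages);
}
```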
The v6e-16 TPU provides 16 TPU chips with 128GB HBM3 total, 918 TFLOPS of bfloat16 compute, and 4.6 TB/s memory bandwidth. This makes it well-suited for serving large language models at low latency, particularly sparse MoE architectures where memory bandwidth is the primary bottleneck.
| Spec | v6e-16 |
|---|---|
| TPU chips | 16 |
| HBM3 capacity | 128 GB |
| bf16 FLOPS | 918 TFLOPS |
| Memory bandwidth | 4.6 TB/s |
| Interconnect | ICI (inter-chip) |
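The bandwidth-bound claim can be made concrete with a back-of-envelope calculation using the table's own figures: in single-stream decoding, each generated token must read the active weights once from HBM, so bandwidth divided by active bytes gives an upper limit on tokens per second. This sketch ignores KV-cache traffic, interconnect overhead, and batching, so it is a ceiling, not a benchmark.

```typescript
// Back-of-envelope decode bound from the spec table above: per generated
// token, the active parameters are read once from HBM in bf16 (2 bytes
// each). Ignores KV-cache traffic, interconnect overhead, and batching.
const HBM_BANDWIDTH_BYTES_PER_S = 4.6e12; // 4.6 TB/s from the table
const ACTIVE_PARAMS = 17e9;               // Llama 4 Maverick active params
const BYTES_PER_PARAM = 2;                // bfloat16

const maxTokensPerSecond =
  HBM_BANDWIDTH_BYTES_PER_S / (ACTIVE_PARAMS * BYTES_PER_PARAM);

console.log(Math.round(maxTokensPerSecond)); // → 135 tokens/s upper bound
```

The same arithmetic explains the document's emphasis on sparse MoE models: only the active experts are read per token, so a 17B-active mixture decodes far faster than a dense model of equal total size.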
The Next.js web application is deployed on Vercel with edge-optimized routing. API routes use Vercel's serverless functions with a 60-second timeout for streaming chat completions. Static assets and the marketing pages are served from Vercel's CDN. The domain north-ml.space is managed through Vercel DNS.
The north-ml platform code is fully open source — the web interface, API layer, and VS Code extension are available on GitHub under the Apache 2.0 license. The models served by north-ml are open-weight releases from their respective organizations (Meta, Alibaba, Google, North AI), each under their own permissive licenses. We welcome contributions, bug reports, and feature requests.
Our commitment to open source reflects our belief that AI infrastructure should be transparent, auditable, and community-owned. By open-sourcing the platform and publishing Wind Edge model weights, we aim to accelerate the development of open AI tooling and reduce dependence on proprietary systems.