System Overview · April 2026

north-ml: Open-Source Inference Platform for Open-Weight Models on TPU v6e-16

North AI

north-ml.space

north-ml — open inference infrastructure

An open-source platform for running open-weight language models at low latency on Google TPU v6e-16 hardware, with an OpenAI-compatible API and a built-in AI coding assistant.

Abstract

north-ml is an open-source inference platform built on Google's v6e-16 TPU hardware. It provides access to a curated set of high-quality open-weight language models — including Llama 4, Qwen 2.5, Gemma 3, and North AI's own Wind Edge — through a unified REST API, a web-based chat interface, and a VS Code extension. The platform itself is open source; the models are open-weight releases from Meta, Alibaba, Google, and North AI. This document describes the platform architecture, model selection, inference stack, and design philosophy.

1. Motivation

The rapid proliferation of capable open-weight language models — including Meta's Llama 4, Alibaba's Qwen 2.5, Google's Gemma 3, and North AI's own Wind Edge series — has created an opportunity to build inference infrastructure that is transparent, reproducible, and community-driven.

Existing inference providers are largely proprietary, opaque in their model routing, and subject to pricing and availability changes outside the user's control. north-ml addresses this by:

  • Running exclusively on open-weight models with published weights and architecture specifications.
  • Deploying on dedicated TPU v6e-16 hardware for consistent, low-latency inference.
  • Providing a fully open-source platform stack (web app, API, VS Code extension).
  • Offering a simple, unified REST API for developer integrations.

2. Platform Architecture

2.1 Overview

north-ml is built on Next.js 16 with the App Router, deployed on Vercel for the web layer, with inference requests routed to TPU-backed endpoints. The system consists of the following components:

Component          | Technology                      | Purpose
Web Interface      | Next.js 16 / React              | Chat UI, model selection, workspace
API Layer          | Next.js Route Handlers          | Request handling, streaming, auth
Inference Backend  | TPU v6e-16 (OpenAI-compatible)  | Model execution, token generation
Auth               | Firebase + NextAuth             | User identity, session management
State              | Zustand                         | Client-side chat and UI state
VS Code Extension  | TypeScript / VS Code API        | Editor-integrated AI assistant

2.2 Inference Stack

Inference requests are forwarded from the Next.js API layer to a TPU v6e-16 endpoint that exposes an OpenAI-compatible API (/v1/chat/completions). Responses are streamed back to the client using the Vercel AI SDK's streaming primitives, providing token-level streaming with low time-to-first-token.
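A minimal sketch of this forwarding path, assuming a Next.js App Router route handler and a TPU_ENDPOINT environment variable for the backend host (both names are illustrative; the actual route layout and the AI SDK glue are not shown in this document):

```typescript
// Hypothetical app/api/chat/route.ts — forwards a chat request to the
// OpenAI-compatible endpoint on the TPU host and pipes the SSE stream back.
const TPU_ENDPOINT = process.env.TPU_ENDPOINT ?? "http://tpu-backend:8000";

// Build the OpenAI-compatible request body from the client payload.
export function buildUpstreamBody(model: string, messages: unknown[]): string {
  return JSON.stringify({ model, messages, stream: true });
}

export async function POST(req: Request): Promise<Response> {
  const { model, messages } = await req.json();
  const upstream = await fetch(`${TPU_ENDPOINT}/v1/chat/completions`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: buildUpstreamBody(model, messages),
  });
  // Pass the upstream SSE byte stream straight through to the client.
  return new Response(upstream.body, {
    headers: { "Content-Type": "text/event-stream" },
  });
}
```

Streaming the upstream body through unchanged keeps the route handler stateless; any per-token processing happens on the client.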

The TPU v6e-16 provides exceptional throughput for transformer inference due to its high-bandwidth memory (HBM) and matrix multiply units optimized for bfloat16 operations — the native precision of all models served by north-ml.

2.3 Streaming Protocol

POST /api/chat
Content-Type: application/json

{
  "model": "llama-4-maverick",
  "messages": [...],
  "stream": true
}

→ Response: text/event-stream (SSE)
  data: {"choices":[{"delta":{"content":"..."}}]}
  data: [DONE]
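The SSE framing above can be consumed with a few lines of client-side parsing. This is an illustrative sketch, not the platform's actual client (which uses the Vercel AI SDK's streaming primitives):

```typescript
// Extract the content delta from one SSE line, or null for non-data lines
// and the [DONE] sentinel. Field names follow the OpenAI-compatible shape
// shown in the protocol example above.
export function parseSseLine(line: string): string | null {
  if (!line.startsWith("data: ")) return null;
  const payload = line.slice("data: ".length).trim();
  if (payload === "[DONE]") return null; // end-of-stream sentinel
  const event = JSON.parse(payload);
  return event.choices?.[0]?.delta?.content ?? null;
}

// Read a streaming response and invoke onToken for each content delta.
export async function readChat(res: Response, onToken: (t: string) => void) {
  const reader = res.body!.getReader();
  const decoder = new TextDecoder();
  let buffer = "";
  for (;;) {
    const { done, value } = await reader.read();
    if (done) break;
    buffer += decoder.decode(value, { stream: true });
    const lines = buffer.split("\n");
    buffer = lines.pop()!; // keep any partial line for the next chunk
    for (const line of lines) {
      const token = parseSseLine(line);
      if (token !== null) onToken(token);
    }
  }
}
```

Buffering the trailing partial line matters: network chunks do not align with SSE line boundaries, so a naive split per chunk would drop tokens.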

3. Model Lineup

north-ml serves four open-weight models, selected for their strong performance-to-size ratio and permissive licensing:

Model             | Parameters        | Strengths                              | Provider
Llama 4 Maverick  | 17B active (MoE)  | Instruction following, chat, reasoning | Meta
Qwen 2.5 35B      | 35B               | Code, multilingual, long context       | Alibaba
Gemma 3 27B       | 27B               | Strong reasoning, safety-tuned         | Google
Wind Edge 1.6     | 1.6B (MoE)        | Edge deployment, low latency           | North AI

All models are served in bfloat16 precision. Wind Edge 1.6 is North AI's own model, developed specifically for edge and embedded deployment — see the companion technical report for full details.
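For illustration, the lineup could be represented as a small registry on the API layer. The slug strings here are assumptions modeled on the llama-4-maverick id used elsewhere in this document, not published identifiers:

```typescript
// Hypothetical model registry mirroring the lineup above.
interface ModelInfo {
  slug: string;     // API model id (illustrative)
  provider: string;
  params: string;
  moe: boolean;     // sparse mixture-of-experts architecture
}

export const MODELS: ModelInfo[] = [
  { slug: "llama-4-maverick", provider: "Meta",     params: "17B active", moe: true },
  { slug: "qwen-2.5-35b",     provider: "Alibaba",  params: "35B",        moe: false },
  { slug: "gemma-3-27b",      provider: "Google",   params: "27B",        moe: false },
  { slug: "wind-edge-1.6",    provider: "North AI", params: "1.6B",       moe: true },
];

export const findModel = (slug: string): ModelInfo | undefined =>
  MODELS.find((m) => m.slug === slug);
```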

4. Web Interface

The north-ml web interface is a minimal, keyboard-friendly AI assistant built for developers. Key design principles:

  • Monospace-first design — the entire UI uses a monospace typeface, reflecting the developer-oriented audience.
  • Dark-by-default — optimized for long coding sessions.
  • Model switching — users can switch between all four models mid-conversation without losing context.
  • Tool support — the interface supports function calling for models that expose it, with a visual tool call block renderer.
  • Workspace panels — collapsible sidebar panels for Summit AI agent configuration, tool builder, and GitHub repository integration.

4.1 Summit

Summit is north-ml's agentic mode, allowing the AI to use tools (web search, code execution, file operations) across multi-step tasks. Summit is configured per-session via the workspace panel and integrates with the same TPU inference backend as the base chat interface.
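A multi-step tool loop like Summit's can be sketched as follows, assuming the backend returns OpenAI-style tool_calls. The control flow is illustrative; the actual agent configuration is richer than this:

```typescript
// A tool maps parsed JSON arguments to a string result for the model.
type ToolFn = (args: Record<string, unknown>) => Promise<string>;

// Loop: call the model, execute any requested tools, feed results back,
// and stop when the model answers without tool calls (or a step cap hits).
export async function runAgent(
  callModel: (messages: object[]) => Promise<any>,
  tools: Record<string, ToolFn>,
  messages: object[],
  maxSteps = 8,
): Promise<string> {
  for (let step = 0; step < maxSteps; step++) {
    const msg = await callModel(messages);
    messages.push(msg);
    if (!msg.tool_calls?.length) return msg.content; // model is done
    for (const call of msg.tool_calls) {
      const result = await tools[call.function.name](
        JSON.parse(call.function.arguments),
      );
      messages.push({ role: "tool", tool_call_id: call.id, content: result });
    }
  }
  throw new Error("max agent steps exceeded");
}
```

The step cap is the essential safety valve: without it, a model that keeps requesting tools would loop indefinitely.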

5. VS Code Extension

The north-ml VS Code extension provides editor-integrated AI assistance. It connects to the same API backend as the web interface and supports:

  • Inline chat panel via a webview sidebar.
  • Context menu commands: Ask about selection, Explain selection, Fix selection.
  • Model selection via VS Code settings (northml.model).
  • Streaming responses with token-level display.
  • Configurable API endpoint for self-hosted deployments.

The extension is written in TypeScript and targets VS Code API 1.85+. It is distributed as a .vsix package and will be published to the VS Code Marketplace.
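As a sketch of how the three context-menu commands might share one request builder (the helper name and prompt wording are hypothetical; the northml.model setting and command list come from above):

```typescript
export type Command = "ask" | "explain" | "fix";

// Wrap the editor selection into an OpenAI-compatible chat request for the
// model configured in the northml.model setting.
export function buildChatRequest(
  command: Command,
  selection: string,
  model: string,
) {
  const prompts: Record<Command, string> = {
    ask: "Answer a question about this code:",
    explain: "Explain what this code does:",
    fix: "Fix any bugs in this code and return the corrected version:",
  };
  return {
    model,
    stream: true,
    messages: [
      { role: "user", content: `${prompts[command]}\n\n${selection}` },
    ],
  };
}
```

Keeping the builder free of any vscode import makes it unit-testable outside the editor host.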

6. Infrastructure

6.1 TPU v6e-16

The v6e-16 slice comprises 16 TPU chips, each providing 32 GB of HBM (512 GB total), 918 TFLOPS of bfloat16 compute, and 1.6 TB/s of memory bandwidth. This makes it well-suited for serving large language models at low latency, particularly sparse MoE architectures where memory bandwidth is the primary bottleneck.

Spec              | v6e-16
TPU chips         | 16
HBM capacity      | 32 GB per chip (512 GB total)
bf16 compute      | 918 TFLOPS per chip
Memory bandwidth  | 1.6 TB/s per chip
Interconnect      | ICI (inter-chip)

6.2 Deployment

The Next.js web application is deployed on Vercel with edge-optimized routing. API routes use Vercel's serverless functions with a 60-second timeout for streaming chat completions. Static assets and the marketing pages are served from Vercel's CDN. The domain north-ml.space is managed through Vercel DNS.
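On Vercel, the 60-second streaming window corresponds to a route segment config exported from the chat route. A minimal sketch (the runtime choice is an assumption):

```typescript
// Route segment config for the streaming chat route (Next.js App Router).
export const maxDuration = 60;    // seconds; Vercel serverless function timeout
export const runtime = "nodejs";  // assumption: Node runtime for streaming
```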

7. Open Source

The north-ml platform code is fully open source — the web interface, API layer, and VS Code extension are available on GitHub under the Apache 2.0 license. The models served by north-ml are open-weight releases from their respective organizations (Meta, Alibaba, Google, North AI), each under their own permissive licenses. We welcome contributions, bug reports, and feature requests.

Our commitment to open source reflects our belief that AI infrastructure should be transparent, auditable, and community-owned. By open-sourcing the platform and publishing Wind Edge model weights, we aim to accelerate the development of open AI tooling and reduce dependence on proprietary systems.

© 2026 North AI · north-ml.space · Apache 2.0 License