Technical Report · April 2026
Arthur Yousif
North AI · north-ml.space
We present Wind Edge 1.6, a compact Mixture-of-Experts (MoE) language model designed for efficient inference on edge hardware. With 1.6 billion total parameters and only ~400M activated per token, Wind Edge 1.6 achieves competitive performance on code generation, instruction following, and reasoning benchmarks while maintaining low latency on consumer-grade and embedded devices. We describe the model architecture, training procedure, dataset composition, and evaluation results. Wind Edge 1.6 is released under a permissive open-source license to support research and on-device AI applications.
Large language models (LLMs) have demonstrated remarkable capabilities across a wide range of tasks, but their deployment remains constrained by computational and memory requirements. While cloud-hosted models offer high capability, they introduce latency, cost, and privacy concerns that make them unsuitable for many real-world applications.
Edge deployment — running models on local devices such as laptops, mobile phones, and embedded systems — requires models that are both capable and efficient. Sparse Mixture-of-Experts (MoE) architectures offer a compelling solution: by activating only a subset of parameters per token, MoE models reduce inference FLOPs while maintaining a large effective parameter count for capacity.
Wind Edge 1.6 is our first public release in the Wind Edge series, targeting the sub-2B parameter regime. It is trained on a diverse multilingual corpus with emphasis on code, mathematics, and instruction following. The model is designed to run efficiently on CPUs and low-end GPUs, making it practical for offline and privacy-preserving applications.
Wind Edge 1.6 follows a transformer-based architecture with sparse MoE feed-forward layers. The key design choices are summarized in the table below.
| Component | Configuration |
|---|---|
| Total parameters | 1.6B |
| Active parameters per token | ~400M |
| Layers | 24 |
| Hidden dimension | 2048 |
| Attention heads | 16 |
| KV heads (GQA) | 4 |
| Experts per MoE layer | 8 |
| Top-k experts (active) | 2 |
| Expert hidden dim | 1024 |
| MoE layer frequency | Every other layer |
| Context length | 8192 tokens |
| Vocabulary size | 65,536 |
| Positional encoding | RoPE (θ = 500,000) |
| Normalization | RMSNorm |
| Activation function | SwiGLU |
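To illustrate the rotary positional encoding at θ = 500,000, the sketch below (a minimal, illustrative version applied to a single attention head, not the training implementation) rotates pairs of channels by position-dependent angles:

```python
import numpy as np

def rope(x, theta=500_000.0):
    """Apply rotary position embedding to x of shape (seq_len, head_dim)."""
    seq_len, head_dim = x.shape
    # One frequency per channel pair; lower pairs rotate faster.
    inv_freq = theta ** (-np.arange(0, head_dim, 2) / head_dim)   # (head_dim/2,)
    angles = np.outer(np.arange(seq_len), inv_freq)               # (seq_len, head_dim/2)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, 0::2], x[:, 1::2]                               # interleaved channel pairs
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin                            # 2D rotation per pair
    out[:, 1::2] = x1 * sin + x2 * cos
    return out
```

Because each channel pair undergoes a pure rotation, vector norms are preserved and position 0 is left unchanged.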
Each MoE layer contains 8 expert networks. For each token, a learned router network computes a distribution over experts and selects the top-2 by score. The token is processed by both selected experts, and their outputs are combined via a weighted sum:
y = Σ_{i ∈ top-2} softmax(G(x))_i · E_i(x)

where G(x) = x · W_g are the router logits and E_i is the i-th SwiGLU feed-forward expert.

To prevent expert collapse, we apply an auxiliary load-balancing loss during training with coefficient α = 0.01, following standard MoE practice.
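The routing computation can be sketched in NumPy as follows (a minimal illustration, not the training implementation; the expert networks are stand-ins, and the auxiliary loss follows the standard Switch-style formulation):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def moe_layer(x, W_g, experts, top_k=2):
    """Top-k MoE forward pass. x: (tokens, d_model), W_g: (d_model, n_experts)."""
    probs = softmax(x @ W_g)                       # softmax(G(x)) over all experts
    top = np.argsort(-probs, axis=-1)[:, :top_k]   # indices of the top-k experts per token
    y = np.zeros_like(x)
    for t in range(x.shape[0]):
        for i in top[t]:
            # Weighted sum y = Σ_i softmax(G(x))_i · E_i(x)
            y[t] += probs[t, i] * experts[i](x[t])
    return y, probs, top

def load_balancing_loss(probs, top, n_experts, alpha=0.01):
    """Auxiliary loss penalizing uneven expert utilization (expert collapse)."""
    f = np.bincount(top.ravel(), minlength=n_experts) / top.size  # routed fraction per expert
    p = probs.mean(axis=0)                                        # mean router probability
    return alpha * n_experts * np.sum(f * p)
```

Under perfectly balanced routing, f and p are both uniform and the auxiliary loss attains its minimum, so the gradient pushes the router toward even expert usage.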
We use Grouped Query Attention (GQA) with 16 query heads and 4 key-value heads, reducing KV cache memory by 4× during inference compared to multi-head attention. This is particularly important for edge deployment where memory bandwidth is constrained.
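The 4× KV-cache reduction follows directly from the head counts. A quick arithmetic sketch using the architecture table's values (fp16/bf16 cache at the full 8,192-token context; head_dim = 2048 / 16 = 128):

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, dtype_bytes=2):
    # Two cached tensors (K and V) per layer, each of shape (n_kv_heads, seq_len, head_dim).
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * dtype_bytes

# Wind Edge 1.6: 24 layers, hidden 2048 over 16 query heads -> head_dim = 128.
gqa = kv_cache_bytes(n_layers=24, n_kv_heads=4,  head_dim=128, seq_len=8192)
mha = kv_cache_bytes(n_layers=24, n_kv_heads=16, head_dim=128, seq_len=8192)
print(gqa / 2**20)   # GQA cache in MiB
print(mha / gqa)     # MHA would need 4x the memory
```

With 4 KV heads the full-context cache fits in 384 MiB, versus 1.5 GiB for standard multi-head attention.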
Wind Edge 1.6 uses a custom BPE tokenizer with a vocabulary of 65,536 tokens, trained on the same multilingual corpus as the model. The tokenizer includes dedicated tokens for code, mathematics (LaTeX), and structured data formats. Byte-level fallback ensures lossless encoding of arbitrary text.
The pretraining dataset consists of approximately 1.2 trillion tokens drawn from the following sources:
| Source | Tokens | Weight |
|---|---|---|
| Web text (filtered Common Crawl) | 600B | 50% |
| Code (GitHub, Stack Exchange) | 240B | 20% |
| Books and long-form text | 120B | 10% |
| Mathematics (arXiv, MATH, AoPS) | 96B | 8% |
| Wikipedia + Wikidata | 60B | 5% |
| Scientific papers | 48B | 4% |
| Multilingual web (non-English) | 36B | 3% |
Web text was filtered using a combination of classifier-based quality scoring, deduplication via MinHash LSH, and heuristic rules to remove low-quality, toxic, and personally identifiable content. Code data was deduplicated at the file level with exact SHA-256 matching and near-duplicate removal.
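The exact-match stage of the code deduplication can be sketched as follows (a minimal illustration of content-hash dedup; the actual pipeline also performs near-duplicate removal, which is not shown):

```python
import hashlib

def exact_dedup(files):
    """Keep the first file seen for each distinct SHA-256 content hash.

    files: iterable of (path, content) pairs; returns the list of kept paths.
    """
    seen, kept = set(), []
    for path, content in files:
        digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(path)
    return kept
```

Hashing file contents rather than comparing them pairwise keeps the pass linear in corpus size, which matters at the 240B-token scale of the code subset.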
Pretraining proceeded in three phases. The core hyperparameters are summarized below:
| Hyperparameter | Value |
|---|---|
| Optimizer | AdamW |
| Peak learning rate | 3e-4 |
| LR schedule | Cosine with warmup |
| Warmup steps | 2,000 |
| Batch size (tokens) | 4M |
| Weight decay | 0.1 |
| Gradient clip | 1.0 |
| Precision | bfloat16 |
| Parallelism | Expert + data parallel |
| Hardware | v6e-16 TPU |
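The cosine-with-warmup schedule from the table can be sketched as follows. The peak rate (3e-4) and warmup length (2,000 steps) come from the table; the total step count and the decay floor are illustrative assumptions, not reported values:

```python
import math

def learning_rate(step, peak=3e-4, warmup=2000, total_steps=300_000, min_lr=0.0):
    """Linear warmup to peak, then cosine decay to min_lr.

    total_steps and min_lr are placeholder values for illustration.
    """
    if step < warmup:
        return peak * step / warmup                      # linear warmup
    progress = (step - warmup) / max(1, total_steps - warmup)
    return min_lr + 0.5 * (peak - min_lr) * (1.0 + math.cos(math.pi * progress))
```

The schedule reaches the peak exactly at the end of warmup and decays smoothly to the floor at the final step.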
We evaluated Wind Edge 1.6 on standard benchmarks against comparable models in the sub-2B parameter class.
| Benchmark | Wind Edge 1.6 | Phi-2 (2.7B) | Qwen2-1.5B | Gemma-2B |
|---|---|---|---|---|
| MMLU (5-shot) | 62.4 | 57.7 | 56.5 | 42.3 |
| HumanEval (pass@1) | 51.8 | 47.1 | 34.1 | 22.0 |
| GSM8K (8-shot) | 58.3 | 57.2 | 46.9 | 17.7 |
| ARC-Challenge | 55.1 | 54.7 | 53.0 | 48.5 |
| HellaSwag | 71.2 | 73.1 | 66.6 | 71.4 |
| TruthfulQA | 44.6 | 44.0 | 43.2 | 33.1 |
Table 2. Zero/few-shot benchmark comparisons. Wind Edge 1.6 achieves competitive results despite having fewer active parameters (~400M) than dense models of similar total size.
We also measured inference throughput and peak memory usage on representative hardware:

| Device | Tokens/sec | Memory (peak) |
|---|---|---|
| v6e-16 TPU (bf16) | ~320 | ~3.2 GB |
| RTX 4090 (fp16) | ~210 | ~3.8 GB |
| M3 Pro (Metal, int8) | ~85 | ~2.1 GB |
| Intel i7-13700 (int4) | ~22 | ~1.1 GB |
Wind Edge 1.6 is a research preview and has known limitations.
Wind Edge 1.6 weights and tokenizer are released under the Apache 2.0 license. Model weights, evaluation code, and conversion scripts are available at north-ml.space/research/wind-edge. We release bf16, fp16, int8 (GPTQ), and int4 (AWQ) quantized variants.
Inference is available via the north-ml API at api.north-ml.space/v1 using an OpenAI-compatible endpoint. The model is deployed on v6e-16 TPU infrastructure for low-latency cloud inference.
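Since the endpoint is OpenAI-compatible, a request can be constructed as below. The model identifier `wind-edge-1.6` is a placeholder assumption (the actual id is not stated above), and the request shape follows the standard chat-completions convention:

```python
import json

def build_chat_request(prompt, model="wind-edge-1.6", api_key="YOUR_API_KEY"):
    """Build an OpenAI-compatible chat-completions request for the north-ml API.

    The model id and /chat/completions path are assumptions based on the
    OpenAI-compatible convention; check the API docs for the exact values.
    """
    url = "https://api.north-ml.space/v1/chat/completions"
    headers = {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    }
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    })
    return url, headers, body
    # To send: requests.post(url, headers=headers, data=body)
```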