
Technical Report · April 2026

Wind Edge 1.6: A Compact Mixture-of-Experts Language Model for Edge Deployment

Arthur Yousif

North AI · north-ml.space


Figure 1. Wind Edge 1.6 sparse MoE routing — 8 experts per layer, top-2 active per token.

Abstract

We present Wind Edge 1.6, a compact Mixture-of-Experts (MoE) language model designed for efficient inference on edge hardware. With 1.6 billion total parameters and only ~400M activated per token, Wind Edge 1.6 achieves competitive performance on code generation, instruction following, and reasoning benchmarks while maintaining low latency on consumer-grade and embedded devices. We describe the model architecture, training procedure, dataset composition, and evaluation results. Wind Edge 1.6 is released under a permissive open-source license to support research and on-device AI applications.

1. Introduction

Large language models (LLMs) have demonstrated remarkable capabilities across a wide range of tasks, but their deployment remains constrained by computational and memory requirements. While cloud-hosted models offer high capability, they introduce latency, cost, and privacy concerns that make them unsuitable for many real-world applications.

Edge deployment — running models on local devices such as laptops, mobile phones, and embedded systems — requires models that are both capable and efficient. Sparse Mixture-of-Experts (MoE) architectures offer a compelling solution: by activating only a subset of parameters per token, MoE models reduce inference FLOPs while maintaining a large effective parameter count for capacity.

Wind Edge 1.6 is our first public release in the Wind Edge series, targeting the sub-2B parameter regime. It is trained on a diverse multilingual corpus with emphasis on code, mathematics, and instruction following. The model is designed to run efficiently on CPUs and low-end GPUs, making it practical for offline and privacy-preserving applications.

2. Model Architecture

Wind Edge 1.6 follows a transformer-based architecture with sparse MoE feed-forward layers. The key design choices are summarized in the table below.

| Component | Configuration |
| --- | --- |
| Total parameters | 1.6B |
| Active parameters per token | ~400M |
| Layers | 24 |
| Hidden dimension | 2048 |
| Attention heads | 16 |
| KV heads (GQA) | 4 |
| Experts per MoE layer | 8 |
| Top-k experts (active) | 2 |
| Expert hidden dim | 1024 |
| MoE layer frequency | Every other layer |
| Context length | 8192 tokens |
| Vocabulary size | 65,536 |
| Positional encoding | RoPE (θ = 500,000) |
| Normalization | RMSNorm |
| Activation function | SwiGLU |

2.1 Sparse MoE Layer

Each MoE layer contains 8 expert networks. For each token, a learned router network computes a distribution over experts and selects the top-2 by score. The token is processed by both selected experts, and their outputs are combined via a weighted sum:

y = Σ_{i ∈ top2} softmax(G(x))_i · E_i(x)

where G(x) = x · W_g  (router logits)
      E_i(x) = SwiGLU feed-forward expert i

To prevent expert collapse, we apply an auxiliary load-balancing loss during training with coefficient α = 0.01, following standard MoE practice.
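The routing equation above can be sketched in a few lines of NumPy. This is an illustrative per-token version, not the production kernel: real implementations batch tokens, fuse expert dispatch, and compute the load-balancing loss alongside; the `experts` callables here stand in for the SwiGLU feed-forward networks.

```python
import numpy as np

def moe_forward(x, W_g, experts, top_k=2):
    """Sparse MoE forward pass for a single token vector x (sketch).

    W_g: router weight matrix, hidden_dim x num_experts.
    experts: list of callables E_i(x) -> vector (stand-ins for SwiGLU FFNs).
    """
    logits = x @ W_g                       # router logits G(x) = x . W_g
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                   # softmax over all experts
    top = np.argsort(probs)[-top_k:]       # indices of the top-k experts
    # weighted sum of the selected experts' outputs
    return sum(probs[i] * experts[i](x) for i in top)
```

Note that, per the formula above, the gate weights are softmax scores over all 8 experts; some MoE variants instead renormalize over the selected top-2 only.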

2.2 Grouped Query Attention

We use Grouped Query Attention (GQA) with 16 query heads and 4 key-value heads, reducing KV cache memory by 4× during inference compared to multi-head attention. This is particularly important for edge deployment where memory bandwidth is constrained.
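The saving is easy to quantify with a back-of-the-envelope sketch, assuming an fp16 cache (2 bytes per element) and a head dimension of 128 (hidden dimension 2048 / 16 query heads, from the architecture table):

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, dtype_bytes=2):
    """Peak KV cache size: K and V tensors of shape [kv_heads, seq_len, head_dim]
    per layer, in dtype_bytes-sized elements (fp16/bf16 by default)."""
    return 2 * layers * kv_heads * head_dim * seq_len * dtype_bytes

mha = kv_cache_bytes(24, 16, 128, 8192)  # full multi-head attention: 16 KV heads
gqa = kv_cache_bytes(24, 4, 128, 8192)   # GQA: 4 shared KV heads
```

At the full 8192-token context this works out to roughly 384 MiB for GQA versus ~1.5 GiB for full multi-head attention, which is the 4× reduction cited above.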

2.3 Tokenizer

Wind Edge 1.6 uses a custom BPE tokenizer with a vocabulary of 65,536 tokens, trained on the same multilingual corpus as the model. The tokenizer includes dedicated tokens for code, mathematics (LaTeX), and structured data formats. Byte-level fallback ensures lossless encoding of arbitrary text.
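The byte-level fallback can be illustrated with a toy encoder. This sketch uses greedy longest-match lookup in place of actual BPE merge rules, and the `byte_base` id layout for the 256 byte tokens is a hypothetical convention, not the tokenizer's real one:

```python
def encode_with_fallback(text, vocab, byte_base):
    """Greedy longest-match encoding with byte-level fallback (sketch).

    vocab: dict mapping token strings to ids.
    byte_base: id of the token for byte 0x00; bytes occupy byte_base..byte_base+255.
    """
    ids, i = [], 0
    while i < len(text):
        # try the longest vocab match starting at position i
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                ids.append(vocab[text[i:j]])
                i = j
                break
        else:
            # no vocab match: emit the character's raw UTF-8 bytes (lossless)
            ids.extend(byte_base + b for b in text[i].encode("utf-8"))
            i += 1
    return ids
```

Because every byte has a dedicated fallback token, any input string round-trips even when it contains characters the learned vocabulary never saw.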

3. Training

3.1 Dataset

The pretraining dataset consists of approximately 1.2 trillion tokens drawn from the following sources:

| Source | Tokens | Weight |
| --- | --- | --- |
| Web text (filtered Common Crawl) | 600B | 50% |
| Code (GitHub, Stack Exchange) | 240B | 20% |
| Books and long-form text | 120B | 10% |
| Mathematics (arXiv, MATH, AoPS) | 96B | 8% |
| Wikipedia + Wikidata | 60B | 5% |
| Scientific papers | 48B | 4% |
| Multilingual web (non-English) | 36B | 3% |

Web text was filtered using a combination of classifier-based quality scoring, deduplication via MinHash LSH, and heuristic rules to remove low-quality, toxic, and personally identifiable content. Code data was deduplicated at the file level with exact SHA-256 matching and near-duplicate removal.
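The exact-match step of the code-deduplication pipeline is simple to sketch; this covers only the SHA-256 pass, not the near-duplicate removal (MinHash LSH), and the `files` iterable of (name, content) pairs is an illustrative interface:

```python
import hashlib

def dedupe_exact(files):
    """File-level exact deduplication by SHA-256 content hash (sketch).

    files: iterable of (name, content) pairs; keeps the first file
    for each distinct content hash.
    """
    seen, kept = set(), []
    for name, content in files:
        digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(name)
    return kept
```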

3.2 Training Phases

Pretraining proceeded in three phases:

  1. Phase 1 — Base pretraining (1.0T tokens): Standard next-token prediction on the full dataset with 4096-token context windows.
  2. Phase 2 — Long-context annealing (150B tokens): Context extended to 8192 tokens. Learning rate cosine-decayed to 10% of peak. Dataset reweighted toward long documents.
  3. Phase 3 — Instruction tuning (50B tokens): Supervised fine-tuning on curated instruction-response pairs spanning code, chat, math, and tool use.
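The Phase 2 schedule (cosine decay to 10% of peak, after the 2,000-step warmup from Section 3.3) can be sketched as follows. The total step count of 250,000 is a derived assumption here, from 1.0T Phase-1 tokens at 4M tokens per batch:

```python
import math

def lr_at(step, peak=3e-4, warmup=2000, total=250_000, floor_frac=0.10):
    """Linear warmup, then cosine decay to floor_frac * peak (sketch)."""
    if step < warmup:
        return peak * step / warmup          # linear warmup from 0 to peak
    t = (step - warmup) / max(1, total - warmup)
    floor = floor_frac * peak
    # cosine anneal from peak down to the floor, clamped past `total`
    return floor + 0.5 * (peak - floor) * (1 + math.cos(math.pi * min(t, 1.0)))
```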

3.3 Hyperparameters

| Hyperparameter | Value |
| --- | --- |
| Optimizer | AdamW |
| Peak learning rate | 3e-4 |
| LR schedule | Cosine with warmup |
| Warmup steps | 2,000 |
| Batch size (tokens) | 4M |
| Weight decay | 0.1 |
| Gradient clip | 1.0 |
| Precision | bfloat16 |
| Parallelism | Expert + data parallel |
| Hardware | v6e-16 TPU |

4. Evaluation

We evaluated Wind Edge 1.6 on standard benchmarks against comparable models in the sub-2B parameter class.

| Benchmark | Wind Edge 1.6 | Phi-2 (2.7B) | Qwen2-1.5B | Gemma-2B |
| --- | --- | --- | --- | --- |
| MMLU (5-shot) | 62.4 | 57.7 | 56.5 | 42.3 |
| HumanEval (pass@1) | 51.8 | 47.1 | 34.1 | 22.0 |
| GSM8K (8-shot) | 58.3 | 57.2 | 46.9 | 17.7 |
| ARC-Challenge | 55.1 | 54.7 | 53.0 | 48.5 |
| HellaSwag | 71.2 | 73.1 | 66.6 | 71.4 |
| TruthfulQA | 44.6 | 44.0 | 43.2 | 33.1 |

Table 2. Zero/few-shot benchmark comparisons. Wind Edge 1.6 achieves competitive results despite having fewer active parameters (~400M) than dense models of similar total size.

4.1 Inference Performance

| Device | Tokens/sec | Memory (peak) |
| --- | --- | --- |
| v6e-16 TPU (bf16) | ~320 | ~3.2 GB |
| RTX 4090 (fp16) | ~210 | ~3.8 GB |
| M3 Pro (Metal, int8) | ~85 | ~2.1 GB |
| Intel i7-13700 (int4) | ~22 | ~1.1 GB |

5. Limitations

Wind Edge 1.6 is a research preview and has known limitations:

  • Performance on low-resource languages is limited by training data imbalance.
  • Long-context coherence degrades beyond ~6K tokens, despite the 8K-token training context.
  • The model may produce factually incorrect or outdated information and should not be used for high-stakes decisions without human oversight.
  • Like all language models, Wind Edge 1.6 can produce biased, harmful, or misleading content under adversarial prompting.
  • Quantized variants (int4/int8) show a 2–5 point accuracy drop on math and code benchmarks.

6. Release

Wind Edge 1.6 weights and tokenizer are released under the Apache 2.0 license. Model weights, evaluation code, and conversion scripts are available at north-ml.space/research/wind-edge. We release bf16, fp16, int8 (GPTQ), and int4 (AWQ) quantized variants.

Inference is available via the north-ml API at api.north-ml.space/v1 using an OpenAI-compatible endpoint. The model is deployed on v6e-16 TPU infrastructure for low-latency cloud inference.
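A request against the OpenAI-compatible endpoint can be sketched with the standard library alone. The `/chat/completions` path and the `wind-edge-1.6` model identifier are assumptions following the OpenAI API convention, not confirmed values; this builds the request object without sending it:

```python
import json
import urllib.request

def build_chat_request(prompt, model="wind-edge-1.6", api_key="YOUR_KEY"):
    """Build (but do not send) an OpenAI-compatible chat completion request.

    The model name and endpoint path are illustrative assumptions.
    """
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }).encode("utf-8")
    return urllib.request.Request(
        "https://api.north-ml.space/v1/chat/completions",
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )
```

Sending the request (e.g. via `urllib.request.urlopen`) requires a valid API key.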

References

  1. Shazeer, N. et al. (2017). Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. ICLR 2017.
  2. Fedus, W., Zoph, B., & Shazeer, N. (2022). Switch Transformers: Scaling to trillion parameter models with simple and efficient sparsity. JMLR.
  3. Ainslie, J. et al. (2023). GQA: Training generalized multi-query transformer models from multi-head checkpoints. EMNLP 2023.
  4. Su, J. et al. (2021). RoFormer: Enhanced transformer with rotary position embedding. arXiv:2104.09864.
  5. Touvron, H. et al. (2023). Llama 2: Open foundation and fine-tuned chat models. arXiv:2307.09288.
  6. Jiang, A. Q. et al. (2024). Mixtral of experts. arXiv:2401.04088.
  7. Hendrycks, D. et al. (2021). Measuring massive multitask language understanding. ICLR 2021.
  8. Chen, M. et al. (2021). Evaluating large language models trained on code. arXiv:2107.03374.
© 2026 North AI · north-ml.space · Apache 2.0 License