
Technical Report · April 2026

Wind Edge 1.6: A Compact Mixture-of-Experts Language Model for Edge Deployment

Arthur Yousif

North AI · north-ml.space


Figure 1. Wind Edge 1.6 sparse MoE routing — 8 experts per layer, top-2 active per token.

Abstract

We present Wind Edge 1.6, a compact Mixture-of-Experts (MoE) language model designed for efficient inference on edge hardware. With 1.6 billion total parameters and only ~400M activated per token, Wind Edge 1.6 achieves competitive performance on code generation, instruction following, and reasoning benchmarks while maintaining low latency on consumer-grade and embedded devices. We describe the model architecture, training procedure, dataset composition, and evaluation results. Wind Edge 1.6 is released under a permissive open-source license to support research and on-device AI applications.

1. Introduction

Large language models (LLMs) have demonstrated remarkable capabilities across a wide range of tasks, but their deployment remains constrained by computational and memory requirements. While cloud-hosted models offer high capability, they introduce latency, cost, and privacy concerns that make them unsuitable for many real-world applications.

Edge deployment — running models on local devices such as laptops, mobile phones, and embedded systems — requires models that are both capable and efficient. Sparse Mixture-of-Experts (MoE) architectures offer a compelling solution: by activating only a subset of parameters per token, MoE models reduce inference FLOPs while maintaining a large effective parameter count for capacity.

Wind Edge 1.6 is our first public release in the Wind Edge series, targeting the sub-2B parameter regime. It is trained on a diverse multilingual corpus with emphasis on code, mathematics, and instruction following. The model is designed to run efficiently on CPUs and low-end GPUs, making it practical for offline and privacy-preserving applications.

2. Model Architecture

Wind Edge 1.6 follows a transformer-based architecture with sparse MoE feed-forward layers. The key design choices are summarized in the table below.

| Component | Configuration |
| --- | --- |
| Total parameters | 1.6B |
| Active parameters per token | ~400M |
| Layers | 24 |
| Hidden dimension | 2048 |
| Attention heads | 16 |
| KV heads (GQA) | 4 |
| Experts per MoE layer | 8 |
| Top-k experts (active) | 2 |
| Expert hidden dim | 1024 |
| MoE layer frequency | Every other layer |
| Context length | 8192 tokens |
| Vocabulary size | 65,536 |
| Positional encoding | RoPE (θ = 500,000) |
| Normalization | RMSNorm |
| Activation function | SwiGLU |

2.1 Sparse MoE Layer

Each MoE layer contains 8 expert networks. For each token, a learned router network computes a distribution over experts and selects the top-2 by score. The token is processed by both selected experts, and their outputs are combined via a weighted sum:

y = Σ_{i ∈ top2} softmax(G(x))_i · E_i(x)

where G(x) = x · W_g  (router logits)
      E_i(x) = SwiGLU feed-forward expert i

To prevent expert collapse, we apply an auxiliary load-balancing loss during training with coefficient α = 0.01, following standard MoE practice.
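The routing equation above can be sketched in a few lines of NumPy. This is an illustrative per-token version, not the production kernel: real implementations batch tokens, fuse expert dispatch, and compute the load-balancing loss alongside; the `experts` callables here stand in for the SwiGLU feed-forward networks.

```python
import numpy as np

def moe_forward(x, W_g, experts, top_k=2):
    """Sparse MoE forward pass for a single token vector x (sketch).

    W_g: router weight matrix, hidden_dim x num_experts.
    experts: list of callables E_i(x) -> vector (stand-ins for SwiGLU FFNs).
    """
    logits = x @ W_g                       # router logits G(x) = x . W_g
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                   # softmax over all experts
    top = np.argsort(probs)[-top_k:]       # indices of the top-k experts
    # weighted sum of the selected experts' outputs
    return sum(probs[i] * experts[i](x) for i in top)
```

Note that, per the formula above, the gate weights are softmax scores over all 8 experts; some MoE variants instead renormalize over the selected top-2 only.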

2.2 Grouped Query Attention

We use Grouped Query Attention (GQA) with 16 query heads and 4 key-value heads, reducing KV cache memory by 4× during inference compared to multi-head attention. This is particularly important for edge deployment where memory bandwidth is constrained.
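The saving is easy to quantify with a back-of-the-envelope sketch, assuming an fp16 cache (2 bytes per element) and a head dimension of 128 (hidden dimension 2048 / 16 query heads, from the architecture table):

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, dtype_bytes=2):
    """Peak KV cache size: K and V tensors of shape [kv_heads, seq_len, head_dim]
    per layer, in dtype_bytes-sized elements (fp16/bf16 by default)."""
    return 2 * layers * kv_heads * head_dim * seq_len * dtype_bytes

mha = kv_cache_bytes(24, 16, 128, 8192)  # full multi-head attention: 16 KV heads
gqa = kv_cache_bytes(24, 4, 128, 8192)   # GQA: 4 shared KV heads
```

At the full 8192-token context this works out to roughly 384 MiB for GQA versus ~1.5 GiB for full multi-head attention, which is the 4× reduction cited above.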

2.3 Tokenizer

Wind Edge 1.6 uses a custom BPE tokenizer with a vocabulary of 65,536 tokens, trained on the same multilingual corpus as the model. The tokenizer includes dedicated tokens for code, mathematics (LaTeX), and structured data formats. Byte-level fallback ensures lossless encoding of arbitrary text.
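The byte-level fallback can be illustrated with a toy encoder. This sketch uses greedy longest-match lookup in place of actual BPE merge rules, and the `byte_base` id layout for the 256 byte tokens is a hypothetical convention, not the tokenizer's real one:

```python
def encode_with_fallback(text, vocab, byte_base):
    """Greedy longest-match encoding with byte-level fallback (sketch).

    vocab: dict mapping token strings to ids.
    byte_base: id of the token for byte 0x00; bytes occupy byte_base..byte_base+255.
    """
    ids, i = [], 0
    while i < len(text):
        # try the longest vocab match starting at position i
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                ids.append(vocab[text[i:j]])
                i = j
                break
        else:
            # no vocab match: emit the character's raw UTF-8 bytes (lossless)
            ids.extend(byte_base + b for b in text[i].encode("utf-8"))
            i += 1
    return ids
```

Because every byte has a dedicated fallback token, any input string round-trips even when it contains characters the learned vocabulary never saw.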

3. Training

3.1 Dataset

The pretraining dataset consists of approximately 1.2 trillion tokens drawn from the following sources:

| Source | Tokens | Weight |
| --- | --- | --- |
| Web text (filtered Common Crawl) | 600B | 50% |
| Code (GitHub, Stack Exchange) | 240B | 20% |
| Books and long-form text | 120B | 10% |
| Mathematics (arXiv, MATH, AoPS) | 96B | 8% |
| Wikipedia + Wikidata | 60B | 5% |
| Scientific papers | 48B | 4% |
| Multilingual web (non-English) | 36B | 3% |

Web text was filtered using a combination of classifier-based quality scoring, deduplication via MinHash LSH, and heuristic rules to remove low-quality, toxic, and personally identifiable content. Code data was deduplicated at the file level with exact SHA-256 matching and near-duplicate removal.
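The exact-match step of the code-deduplication pipeline is simple to sketch; this covers only the SHA-256 pass, not the near-duplicate removal (MinHash LSH), and the `files` iterable of (name, content) pairs is an illustrative interface:

```python
import hashlib

def dedupe_exact(files):
    """File-level exact deduplication by SHA-256 content hash (sketch).

    files: iterable of (name, content) pairs; keeps the first file
    for each distinct content hash.
    """
    seen, kept = set(), []
    for name, content in files:
        digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(name)
    return kept
```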

3.2 Training Phases

Pretraining proceeded in three phases:

  1. Phase 1 — Base pretraining (1.0T tokens): Standard next-token prediction on the full dataset with 4096-token context windows.
  2. Phase 2 — Long-context annealing (150B tokens): Context extended to 8192 tokens. Learning rate cosine-decayed to 10% of peak. Dataset reweighted toward long documents.
  3. Phase 3 — Instruction tuning (50B tokens): Supervised fine-tuning on curated instruction-response pairs spanning code, chat, math, and tool use.
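The Phase 2 schedule (cosine decay to 10% of peak, after the 2,000-step warmup from Section 3.3) can be sketched as follows. The total step count of 250,000 is a derived assumption here, from 1.0T Phase-1 tokens at 4M tokens per batch:

```python
import math

def lr_at(step, peak=3e-4, warmup=2000, total=250_000, floor_frac=0.10):
    """Linear warmup, then cosine decay to floor_frac * peak (sketch)."""
    if step < warmup:
        return peak * step / warmup          # linear warmup from 0 to peak
    t = (step - warmup) / max(1, total - warmup)
    floor = floor_frac * peak
    # cosine anneal from peak down to the floor, clamped past `total`
    return floor + 0.5 * (peak - floor) * (1 + math.cos(math.pi * min(t, 1.0)))
```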

3.3 Hyperparameters

| Hyperparameter | Value |
| --- | --- |
| Optimizer | AdamW |
| Peak learning rate | 3e-4 |
| LR schedule | Cosine with warmup |
| Warmup steps | 2,000 |
| Batch size (tokens) | 4M |
| Weight decay | 0.1 |
| Gradient clip | 1.0 |
| Precision | bfloat16 |
| Parallelism | Expert + data parallel |
| Hardware | v6e-16 TPU |

4. Evaluation

We evaluated Wind Edge 1.6 on standard benchmarks against comparable models in the sub-2B parameter class.

| Benchmark | Wind Edge 1.6 | Phi-2 (2.7B) | Qwen2-1.5B | Gemma-2B |
| --- | --- | --- | --- | --- |
| MMLU (5-shot) | 62.4 | 57.7 | 56.5 | 42.3 |
| HumanEval (pass@1) | 51.8 | 47.1 | 34.1 | 22.0 |
| GSM8K (8-shot) | 58.3 | 57.2 | 46.9 | 17.7 |
| ARC-Challenge | 55.1 | 54.7 | 53.0 | 48.5 |
| HellaSwag | 71.2 | 73.1 | 66.6 | 71.4 |
| TruthfulQA | 44.6 | 44.0 | 43.2 | 33.1 |

Table 2. Zero/few-shot benchmark comparisons. Wind Edge 1.6 achieves competitive results despite having fewer active parameters (~400M) than dense models of similar total size.

4.1 Inference Performance

| Device | Tokens/sec | Memory (peak) |
| --- | --- | --- |
| v6e-16 TPU (bf16) | ~320 | ~3.2 GB |
| RTX 4090 (fp16) | ~210 | ~3.8 GB |
| M3 Pro (Metal, int8) | ~85 | ~2.1 GB |
| Intel i7-13700 (int4) | ~22 | ~1.1 GB |

5. Limitations

Wind Edge 1.6 is a research preview and has known limitations:

  • Performance on low-resource languages is limited by training data imbalance.
  • Long-context coherence degrades beyond ~6K tokens, despite the 8K-token training context.
  • The model may produce factually incorrect or outdated information and should not be used for high-stakes decisions without human oversight.
  • Like all language models, Wind Edge 1.6 can produce biased, harmful, or misleading content under adversarial prompting.
  • Quantized variants (int4/int8) show a 2–5 point accuracy drop on math and code benchmarks.

6. Release

Wind Edge 1.6 weights and tokenizer are released under the Apache 2.0 license. Model weights, evaluation code, and conversion scripts are available at north-ml.space/research/wind-edge. We release bf16, fp16, int8 (GPTQ), and int4 (AWQ) quantized variants.

Inference is available via the north-ml API at api.north-ml.space/v1 using an OpenAI-compatible endpoint. The model is deployed on v6e-16 TPU infrastructure for low-latency cloud inference.
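A request against the OpenAI-compatible endpoint can be sketched with the standard library alone. The `/chat/completions` path and the `wind-edge-1.6` model identifier are assumptions following the OpenAI API convention, not confirmed values; this builds the request object without sending it:

```python
import json
import urllib.request

def build_chat_request(prompt, model="wind-edge-1.6", api_key="YOUR_KEY"):
    """Build (but do not send) an OpenAI-compatible chat completion request.

    The model name and endpoint path are illustrative assumptions.
    """
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }).encode("utf-8")
    return urllib.request.Request(
        "https://api.north-ml.space/v1/chat/completions",
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )
```

Sending the request (e.g. via `urllib.request.urlopen`) requires a valid API key.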

References

  1. Shazeer, N. et al. (2017). Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. ICLR 2017.
  2. Fedus, W., Zoph, B., & Shazeer, N. (2022). Switch Transformers: Scaling to trillion parameter models with simple and efficient sparsity. JMLR.
  3. Ainslie, J. et al. (2023). GQA: Training generalized multi-query transformer models from multi-head checkpoints. EMNLP 2023.
  4. Su, J. et al. (2021). RoFormer: Enhanced transformer with rotary position embedding. arXiv:2104.09864.
  5. Touvron, H. et al. (2023). Llama 2: Open foundation and fine-tuned chat models. arXiv:2307.09288.
  6. Jiang, A. Q. et al. (2024). Mixtral of experts. arXiv:2401.04088.
  7. Hendrycks, D. et al. (2021). Measuring massive multitask language understanding. ICLR 2021.
  8. Chen, M. et al. (2021). Evaluating large language models trained on code. arXiv:2107.03374.
© 2026 North AI · north-ml.space · Apache 2.0 License