[Deep Dive] Qwen 3.5 Brings Native Multimodality and Long Context to Small Open Models
Alibaba’s Qwen 3.5 release packs native multimodal reasoning and 262K-token context into models as small as 0.8 billion parameters, enabled by a hybrid attention architecture that makes long-context inference feasible on consumer hardware.
That combination matters because it collapses three capabilities that previously required separate, large models into a single architecture small enough to run on a phone: long context, visual understanding, and efficient inference.
What changed from Qwen 3 to Qwen 3.5
Qwen 3 shipped dense models with 32K context at smaller sizes and 128K for the larger variants and MoE configurations. The model family supported 119 languages, introduced hybrid thinking modes (with explicit “thinking” and “non-thinking” toggles), and relied on standard full attention throughout the stack.
Qwen 3.5 changes the picture in several ways. The release spans eight models covering a wide range of deployment targets:
Qwen3.5-397B-A17B (397B total, 17B activated per token)
Qwen3.5-122B-A10B (122B total, 10B activated)
Qwen3.5-35B-A3B (35B total, 3B activated)
Qwen3.5-27B (dense)
Qwen3.5-9B, 4B, 2B, 0.8B (dense, small model series)
The “AxxB” suffix on the MoE variants indicates activated parameters per token, which better approximates per-token inference cost than the total parameter count.
Context length jumps to 262K natively across the full model lineup, including the 0.8B, 2B, 4B, and 9B variants. Language coverage expands from 119 to 201 languages and dialects. The architecture shifts to a hybrid design combining Gated Delta Networks (a linear attention mechanism) with sparse Mixture-of-Experts (MoE) routing. And multimodality moves from a bolt-on capability to a foundational training objective.
One notable behavioral change: Qwen 3.5 drops official support for the /think and /no_think soft-switch tags that Qwen 3 used to toggle reasoning modes turn by turn.
How Gated Delta Networks enable 262K context
Standard transformer attention computes a score between every pair of tokens in a sequence. For n tokens, this requires O(n²) operations and, when the score matrix is materialized, O(n²) memory. At 262K tokens, that is roughly 68.7 billion score entries per head, or about 2.2 trillion for a single attention layer with 32 heads. For a 32-layer model on consumer hardware, that math simply does not work.
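The arithmetic is easy to check. The sketch below counts pairwise scores for a naive attention layer, assuming "262K" means 2**18 = 262,144 tokens; the function name is made up for illustration.

```python
def attention_score_entries(seq_len: int, num_heads: int = 1) -> int:
    """Pairwise scores a naive attention layer computes: one n x n grid per head."""
    return num_heads * seq_len * seq_len


n = 262_144  # assumption: "262K" taken as 2**18 tokens

per_head = attention_score_entries(n)
per_layer = attention_score_entries(n, num_heads=32)

print(f"{per_head / 1e9:.1f} billion scores per head")            # 68.7 billion
print(f"{per_layer / 1e12:.1f} trillion scores per 32-head layer")  # 2.2 trillion
```

Optimized kernels avoid storing the full grid at once, but the number of operations still grows with the square of the sequence length.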
Gated Delta Networks take a different approach. Instead of maintaining an ever-growing attention map, each GDN layer keeps a fixed-size state matrix with dimensions proportional to the head dimension squared (e.g., 128 × 128), independent of sequence length. New tokens update this state incrementally, and the output for each token is produced by querying the current state. The cost becomes O(n · d²): linear in sequence length, quadratic only in the small, fixed head dimension.
The “delta” in the name refers to the delta rule, an error-correcting update mechanism. Vanilla linear attention accumulates key-value associations by simple addition, with no way to revise or overwrite stale information. Over long sequences, this causes memory contamination. The delta rule fixes this by checking what the memory currently predicts for a given key, computing the error between the predicted and desired value, and writing only the correction. The model can overwrite outdated information rather than endlessly piling up associations.
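A toy example makes the difference concrete. This is a minimal sketch in plain Python, not the model's implementation: the state is a 2 × 2 matrix, the key is a unit vector, and the helper names are invented for illustration. The same key is written twice with two different values.

```python
def matvec(S, k):
    """Read from memory: what the state currently predicts for key k."""
    return [sum(s * k_j for s, k_j in zip(row, k)) for row in S]


def add_outer(S, u, k, scale=1.0):
    """Return S + scale * outer(u, k) as a new matrix."""
    return [[S[i][j] + scale * u[i] * k[j] for j in range(len(k))]
            for i in range(len(u))]


def additive_write(S, k, v):
    """Vanilla linear attention: accumulate the association by simple addition."""
    return add_outer(S, v, k)


def delta_write(S, k, v, beta=1.0):
    """Delta rule: write only the error between desired and predicted value."""
    predicted = matvec(S, k)
    error = [v_i - p_i for v_i, p_i in zip(v, predicted)]
    return add_outer(S, error, k, scale=beta)


d = 2
empty = [[0.0] * d for _ in range(d)]
key = [1.0, 0.0]
old_value, new_value = [1.0, 2.0], [5.0, 7.0]

S_add = additive_write(additive_write(empty, key, old_value), key, new_value)
S_delta = delta_write(delta_write(empty, key, old_value), key, new_value)

print(matvec(S_add, key))    # [6.0, 9.0]: old and new values piled together
print(matvec(S_delta, key))  # [5.0, 7.0]: the stale value was overwritten
```

With additive updates the read returns the sum of both writes; with the delta rule it returns only the most recent value.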
The “gated” component adds two learned, input-dependent gates: a decay gate that controls how much of the previous state to retain, and an update gate that controls how aggressively to write corrections. These gates give each token fine-grained control over the memory: whether to remember, forget, or revise. This selective read/write mechanism is conceptually similar to the gating in LSTMs, but applied to the state matrix of a linear attention layer.
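One way to write the gated update is S_new = alpha * S + beta * (v - alpha * S k) kᵀ, where alpha is the decay gate and beta the update gate. The sketch below follows that form under simplifying assumptions: the gates are passed in as constants rather than computed from the token, and all names are illustrative.

```python
def gated_delta_step(S, k, v, alpha, beta):
    """One state update: decay the old state, then write a scaled correction.

    alpha in [0, 1] is the decay gate, beta in [0, 1] is the update gate.
    In a real layer both would be computed from the current token; here they
    are plain arguments (an assumption made for illustration).
    """
    decayed = [[alpha * s for s in row] for row in S]
    predicted = [sum(s * k_j for s, k_j in zip(row, k)) for row in decayed]
    return [
        [decayed[i][j] + beta * (v[i] - predicted[i]) * k[j] for j in range(len(k))]
        for i in range(len(v))
    ]


def read(S, k):
    """Query the state with key k."""
    return [sum(s * k_j for s, k_j in zip(row, k)) for row in S]


S = [[1.0, 0.0], [2.0, 0.0]]  # currently stores value [1, 2] under key [1, 0]

# Half-forget the old contents while writing a new pair under key [0, 1].
S_next = gated_delta_step(S, k=[0.0, 1.0], v=[3.0, 4.0], alpha=0.5, beta=1.0)
print(read(S_next, [0.0, 1.0]))  # [3.0, 4.0]
print(read(S_next, [1.0, 0.0]))  # [0.5, 1.0]

# With the update gate closed, the step only applies decay.
S_decay_only = gated_delta_step(S, k=[0.0, 1.0], v=[3.0, 4.0], alpha=0.5, beta=0.0)
print(S_decay_only)  # [[0.5, 0.0], [1.0, 0.0]]
```

Setting alpha = 1 and beta = 1 recovers the ungated delta rule; alpha = 0 wipes the memory before writing.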
GDN layers train efficiently through a chunk-wise parallel algorithm: the sequence is split into fixed-size chunks, a small local attention-like computation runs in parallel within each chunk, and the recurrent state propagates across chunk boundaries.
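The chunking idea can be sketched on a deliberately simplified case: plain additive linear attention with no gates and no delta rule. Production kernels for the gated delta rule need extra machinery that is omitted here, so treat this only as an illustration of why splitting into chunks gives the same answer as the token-by-token recurrence. All names are hypothetical.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def read_state(S, q):
    """Query the state matrix: one output value per row."""
    return [dot(row, q) for row in S]


def write_state(S, k, v):
    """Plain additive write, in place: S += outer(v, k)."""
    for i, v_i in enumerate(v):
        for j, k_j in enumerate(k):
            S[i][j] += v_i * k_j


def recurrent_linear_attention(qs, ks, vs):
    """Reference: strictly sequential, one token at a time."""
    S = [[0.0] * len(ks[0]) for _ in vs[0]]
    outputs = []
    for q, k, v in zip(qs, ks, vs):
        write_state(S, k, v)
        outputs.append(read_state(S, q))
    return outputs


def chunked_linear_attention(qs, ks, vs, chunk_size):
    S = [[0.0] * len(ks[0]) for _ in vs[0]]
    outputs = []
    for start in range(0, len(qs), chunk_size):
        end = min(start + chunk_size, len(qs))
        # Every token in the chunk reads the same carried-in state, so the
        # iterations of this loop are independent and could run in parallel.
        for t in range(start, end):
            out = read_state(S, qs[t])  # contribution of earlier chunks
            for s in range(start, t + 1):  # causal, attention-like, local
                score = dot(ks[s], qs[t])
                out = [o + score * v_i for o, v_i in zip(out, vs[s])]
            outputs.append(out)
        # Only now does the recurrent state cross the chunk boundary.
        for s in range(start, end):
            write_state(S, ks[s], vs[s])
    return outputs


qs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0], [0.0, 2.0]]
ks = [[1.0, 1.0], [0.0, 1.0], [1.0, 0.0], [1.0, 2.0], [2.0, 0.0]]
vs = [[1.0, 2.0], [3.0, 1.0], [0.0, 1.0], [2.0, 2.0], [1.0, 0.0]]

sequential = recurrent_linear_attention(qs, ks, vs)
chunked = chunked_linear_attention(qs, ks, vs, chunk_size=2)
```

Here `sequential` and `chunked` hold the same outputs; the chunked version simply reorganizes the work so that only one state hand-off per chunk is sequential.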
Periodic full attention preserves retrieval precision
Pure linear attention, even with delta rule gating, has a fundamental limitation. The fixed-size state matrix is a compressed representation with finite capacity. Tasks requiring exact retrieval over very long ranges (finding a specific fact buried in thousands of tokens, copying a precise sequence, performing multi-hop lookups across distant passages) degrade because fine-grained information gets lost in the compression.
Qwen 3.5 addresses this with a hybrid layout that interleaves GDN layers with standard full softmax attention layers at regular intervals. The 0.8B model card spells out the block structure explicitly:
6 × (3 × (Gated DeltaNet → FFN) → 1 × (Gated Attention → FFN))
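Expanding that formula into a flat list shows where the full attention layers sit. The sketch below only unrolls the pattern as written (the FFN after each mixer is omitted); the layer-type strings are illustrative labels, not identifiers from the model's code.

```python
def build_layer_pattern(num_blocks=6, gdn_per_block=3, attn_per_block=1):
    """Unroll num_blocks x (gdn_per_block x GDN -> attn_per_block x attention)."""
    pattern = []
    for _ in range(num_blocks):
        pattern += ["gated_deltanet"] * gdn_per_block
        pattern += ["gated_attention"] * attn_per_block
    return pattern


layers = build_layer_pattern()
attention_positions = [i for i, kind in enumerate(layers) if kind == "gated_attention"]

print(len(layers))                      # 24
print(layers.count("gated_deltanet"))   # 18
print(layers.count("gated_attention"))  # 6
print(attention_positions)              # [3, 7, 11, 15, 19, 23]
```

Every fourth layer is full attention, so three quarters of the token-mixing layers run in linear time.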
Most layers use the efficient GDN formulation. Periodic full attention layers act as precision checkpoints: they can attend to any position in the full context with exact scoring, resolving ambiguities or retrieving specific facts that the compressed state might have degraded.
The savings are substantial. If a 32-layer model uses 4 full attention layers and 28 GDN layers, the quadratic attention compute drops by roughly 8x (32/4) compared to a fully quadratic model at 262K tokens, since the linear cost of the GDN layers is small by comparison. Under the 3:1 layout above, where one layer in four is full attention, the same reasoning gives roughly 4x. The KV cache (which stores key-value pairs for every prior token and represents the primary memory bottleneck at long contexts) is needed only for those few full attention layers. GDN layers maintain their fixed-size state matrices at a cost of roughly 32 KB per head per layer (a 128 × 128 state at 2 bytes per entry), totaling tens of megabytes instead of tens of gigabytes.
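A back-of-the-envelope calculation reproduces these orders of magnitude. Every configuration value below is an assumption chosen for illustration, not a published spec: 16-bit storage, head dimension 128, 32 GDN heads, and 8 key-value heads for the full attention layers.

```python
BYTES_PER_VALUE = 2  # assumption: 16-bit storage


def kv_cache_bytes(num_layers, seq_len, num_kv_heads, head_dim):
    """Keys plus values for every prior token, in every full attention layer."""
    return 2 * num_layers * seq_len * num_kv_heads * head_dim * BYTES_PER_VALUE


def gdn_state_bytes(num_layers, num_heads, head_dim):
    """One head_dim x head_dim state per head, independent of sequence length."""
    return num_layers * num_heads * head_dim * head_dim * BYTES_PER_VALUE


SEQ_LEN = 262_144  # assumption: "262K" taken as 2**18 tokens

one_state = gdn_state_bytes(num_layers=1, num_heads=1, head_dim=128)
gdn_total = gdn_state_bytes(num_layers=28, num_heads=32, head_dim=128)
kv_if_full = kv_cache_bytes(num_layers=28, seq_len=SEQ_LEN, num_kv_heads=8, head_dim=128)

print(one_state // 1024, "KiB per head per layer")                       # 32
print(gdn_total // 2**20, "MiB for 28 GDN layers")                       # 28
print(kv_if_full // 2**30, "GiB if those 28 layers kept a KV cache")     # 28
```

Under these assumptions the 28 GDN layers hold about 28 MiB of state, where the same layers as full attention would need about 28 GiB of KV cache at full context.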
Early fusion trains vision into the backbone from scratch
Previous Qwen vision models (Qwen-VL, Qwen2-VL, Qwen2.5-VL, and Qwen 3-VL) used a late fusion strategy. A separate Vision Transformer processed images, a projection layer compressed the visual features into the language model’s embedding space, and the language model received these projected tokens as additional inputs. The language backbone was first trained on text alone. Vision was added afterward in a multi-stage process.
Qwen 3.5 replaces this with early fusion (sometimes called “native multimodal” training). Visual tokens and text tokens enter the same transformer backbone from the earliest layers and are processed jointly throughout the entire depth of the network. The model trains on interleaved image-text data from the beginning of pretraining.
In a late-fusion model, only the upper layers of the language model have learned to interpret visual information. The lower layers were shaped entirely by text during pretraining, and the adapter projecting ViT features into the LLM’s space becomes an information bottleneck, compressing rich spatial and semantic visual details into a format the text-focused backbone can accept.
In an early-fusion model, every layer from bottom to top develops cross-modal representations. Visual and textual concepts co-evolve. A chart’s axis labels are understood in the same representational space as the numerical data they describe. This deeper integration produces measurable gains on tasks requiring tight text-vision coupling: document understanding, OCR, chart reasoning, visual mathematics, and spatial reasoning.
This approach follows a trajectory established by Google’s Gemini (December 2023) and Meta’s Chameleon (2024), both of which demonstrated that native multimodal training produces stronger cross-modal reasoning than post-hoc fusion. What Qwen 3.5 adds is the combination of early fusion with the efficient hybrid attention architecture, enabling native multimodal capabilities at model sizes previously considered too small for serious vision-language work.
The tradeoff is cost. Early fusion requires training the entire model from scratch on massive multimodal corpora. You cannot reuse an existing high-quality text model as a starting point. The 0.8B model’s vision capabilities represent a deliberate training investment, which partly explains why the small models’ multimodal performance is qualitatively different from what adapter-based approaches achieve at similar parameter counts.
What 262K context and vision at sub-10B unlocks
The convergence of long context, native multimodality, and small parameter counts opens deployment scenarios that no single prior open model addressed well.
A 0.8B model quantized to 4-bit precision requires roughly 500-600 MB of memory. This fits comfortably on mid-range smartphones, Raspberry Pi 5 boards, and virtually any laptop manufactured in the past five years. With native vision, it can process camera inputs, screenshots, or scanned documents without a cloud round-trip. With 262K context, it can ingest and reason over substantial documents (roughly 400 pages) on-device.
The 2B and 4B models occupy a sweet spot for laptop and edge-server deployment. At 4-bit quantization, the 4B model requires approximately 2.5-3 GB, fitting within the unified memory of Apple Silicon Macs and the VRAM of entry-level discrete GPUs. These sizes are large enough for meaningful multimodal reasoning while remaining fast enough for interactive use.
The 9B model, quantized to 4-bit, requires roughly 5-6 GB and targets enthusiast desktops, workstation-class hardware, and edge servers.
One nuance with MoE variants: total parameter count exceeds active parameter count, meaning the disk and memory footprint is larger than a dense model with equivalent active compute. A model with 9B active parameters but 25-30B total parameters needs more storage and memory for weight loading, even though each forward pass activates only the 9B subset. This tradeoff (more memory for better quality-per-FLOP) is generally favorable for edge deployment where inference speed matters more than absolute memory minimization.
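The footprint figures above start from a simple floor: parameters times bits per weight. The sketch below computes that floor only, and the function name is hypothetical. Real deployments sit somewhat above it, presumably because of quantization metadata, tensors kept at higher precision, and runtime buffers, none of which are modeled here.

```python
def weight_gb(params_billion, bits_per_weight=4):
    """Raw weight storage in GB (10**9 bytes), before any runtime overhead."""
    return params_billion * bits_per_weight / 8


# Dense small models: the whole model is both stored and used per token.
for name, params in [("0.8B", 0.8), ("4B", 4.0), ("9B", 9.0)]:
    print(f"{name}: {weight_gb(params):.2f} GB of weights at 4-bit")

# MoE example from the lineup above: 35B total, 3B activated per token.
moe_total, moe_active = 35.0, 3.0
print(f"35B-A3B: {weight_gb(moe_total):.2f} GB to hold, "
      f"{weight_gb(moe_active):.2f} GB of weights used per token")
```

The MoE line is the point of the nuance: memory scales with the total parameter count, while per-token compute scales with the activated count.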
For agentic and RAG workflows, the 262K context window changes the engineering trade space. Traditional RAG pipelines at the sub-10B scale required chunking documents, maintaining a vector store, retrieving relevant passages, and hoping the small model could synthesize fragments coherently. With 262K tokens of context, many documents can be loaded directly into the context window, reducing reliance on the retrieval layer. Combined with vision, the model can process scanned PDFs, photographs of whiteboards, or annotated diagrams as part of the same input, enabling genuinely multimodal local agents.
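In practice this turns the first step of a pipeline into a budget check. The sketch below is a hypothetical routing helper, not part of any Qwen tooling: token counts are assumed to come from the model's tokenizer, and the amount reserved for the prompt and the generated answer is an arbitrary illustrative number.

```python
def plan_ingestion(doc_token_counts, context_limit=262_144, reserved_tokens=8_192):
    """Decide whether documents fit in context directly or need retrieval.

    reserved_tokens is an assumed allowance for instructions and the model's
    output; tune it for the actual workload.
    """
    budget = context_limit - reserved_tokens
    if sum(doc_token_counts) <= budget:
        return "load_all"
    return "retrieve"


print(plan_ingestion([120_000, 90_000]))  # load_all
print(plan_ingestion([200_000, 90_000]))  # retrieve
```

A retrieval layer remains the fallback for corpora that exceed the window, so the long context reduces how often it is needed rather than removing it.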
The small model series also supports tool use as a first-class capability. The 0.8B model card includes server launch flags and examples for enabling tool parsing, making these models viable as lightweight agent cores with tool-calling support out of the box.
Where Qwen 3.5 fits in the open model landscape
The open-weight models that currently matter for the cost-capability Pareto frontier are a small group: Moonshot AI’s Kimi K2.5, Zhipu AI’s GLM-5, and MiniMax M2.5. Each occupies a different point on the tradeoff curve between capability, throughput, and deployment cost. Kimi K2.5 brings strong coding and reasoning with flexible routing configurations. GLM-5 competes on general capability at competitive inference costs. MiniMax M2.5 targets high-throughput generation with efficient serving.
Qwen 3.5 enters this field with a differentiated bet: native multimodality and very long context pushed down to parameter counts where none of these competitors offer comparable coverage. At the 0.8B to 4B range, no current open model combines vision, 262K context, and competitive quality in a single architecture. The hybrid attention design is what makes this possible; without it, 262K context at these sizes would be computationally infeasible with standard attention. For teams building multi-model routing stacks, Qwen 3.5’s small models fill a gap that the flagship-focused competition leaves open: a lightweight, multimodal, long-context component that can handle vision-heavy or context-heavy subtasks locally while larger models handle the high-judgment work.
Limitations worth understanding
Retrieval at the edges of context. GDN layers compress all prior context into a fixed-size state. Tasks requiring precise needle-in-a-haystack retrieval depend on the periodic full attention layers. Those layers can attend to every position, so the limit is not positional coverage but how few of them exist to recover details that the intervening GDN layers have already compressed away. The effective context length for precise recall tasks may be shorter than the nominal 262K, particularly for the smallest models where fewer total layers mean fewer full attention checkpoints.
MoE at small scales. With fewer total parameters, the risk of expert collapse (where multiple experts converge to similar representations, eliminating the diversity benefit) increases. Routing decisions are sensitive to quantization, making aggressive weight compression trickier than with dense models.
Benchmark caveats. Self-reported scores may not always translate to real-world performance, and concerns about data contamination are difficult to fully dispel without independent verification. Community benchmarks like the Open LLM Leaderboard, LiveBench, and LMArena provide more independent signals but may not yet have comprehensive Qwen 3.5 results across all model sizes.
Capacity constraints at small scales. A 0.8B model, regardless of architecture, cannot match larger models on complex visual reasoning tasks requiring deep world knowledge. Early fusion helps these models make better use of their limited parameters, but the gap to 70B+ models on benchmarks like MMMU (which tests college-level multimodal reasoning) will remain substantial.
Framework support lag. Standard transformers enjoy mature optimization across every major inference framework (vLLM, TensorRT-LLM, llama.cpp, Ollama). Gated Delta Network layers require new kernel implementations and may not be fully optimized on all platforms at launch. Users should expect an initial period where inference speed may not yet reflect the architecture’s theoretical advantages, a gap that typically closes within weeks to months as the ecosystem adapts.
What this signals
Qwen 3.5’s significance lies in the convergence of three architectural bets: Gated Delta Networks for efficient long context, sparse MoE for quality-per-FLOP, and early fusion for native multimodality, all packaged at parameter counts that change where these capabilities can physically run.
The hybrid attention layout (periodic full attention checkpoints among linear-time recurrent layers) is a practical engineering solution to a real mathematical constraint: running quadratic attention in every layer at 262K tokens is impractical on a phone, but it becomes feasible when most layers are linear and only a few are quadratic.
The shift from late fusion to early fusion multimodality marks a transition from “text model with vision attached” to “multimodal model from birth,” a distinction that shows up most clearly in tasks requiring tight text-image integration, precisely the tasks where small models have historically underperformed.
Whether Qwen 3.5’s implementations fully deliver on these architectural promises will be settled by independent benchmarks and real-world deployment experience. But the design choices represent a clear thesis about where the open model frontier is heading: toward smaller, natively multimodal models that trade architectural complexity for deployment flexibility.






The hybrid attention architecture detail is the bit I haven't seen covered well elsewhere. Most coverage fixates on parameter counts and misses why the GDN + MoE combo actually matters for consumer hardware. I wrote a deep-dive recently on the MoE vs dense distinction specifically because people keep comparing 4B dense to 80B MoE without understanding what's actually happening at inference time. Short version: the 4B dense model and the 80B-A3B model are both activating roughly the same number of parameters per token. The real win is memory footprint, not raw capability: https://reading.sh/your-laptop-is-an-ai-server-now-370bad238461