Changelog

All notable changes to Lucid are documented here, following the Keep a Changelog format.

Added — Torch-grade quantization subsystem (PTQ · dynamic · QAT · graph-mode)

Weight-only int4/int8 acceleration through `convert()`. QuantizedLinearMLX runs MLX's genuine group-wise low-precision GEMM (quantized_matmul) on Metal, giving ~2.5–3.2× faster decode and ~3.55× smaller weights. It is *device-transparent*: the GEMM runs on Metal and the result returns to the input's device, so it accelerates layers inside an otherwise-CPU model. It is wired into convert / quantize_dynamic when the MLX backend is active; the dequantize-to-float path stays the reference-accurate default.
QFunctional (quantized add / mul / cat / add_relu) — residual skip-adds and Transformer residuals now quantize end-to-end (ResNet / BERT).
Full observer suite — FixedQParamsObserver, MovingAveragePerChannelMinMaxObserver, PlaceholderObserver, NoopObserver, alongside the existing min/max, moving-average, per-channel, and histogram observers.
QAT & fusion completeness — rank-generic ConvBn{1,2,3}d / ConvBnReLU{1,2,3}d (BN folded in-forward so BN keeps training under the STE), QAT fused ConvReLU{1,2,3}d / LinearReLU / Embedding, and fuse_modules_qat.
Quantized module breadth — ConvTranspose{1,2,3}d, EmbeddingBag (sum/mean/max), and quantized activations (Sigmoid / Hardswish / Hardsigmoid / Tanh / ELU / LeakyReLU).
Dynamic quantization — dynamic.Linear + dynamic.LSTM (int8 weights, runtime activation quant, calibration-free) for Linear-/RNN-heavy inference.
Calibration safety — convert / prepare / quantize_dynamic now warn on an uncalibrated observer or a zero-match QConfigMapping instead of silently collapsing to a scale = eps degenerate grid.
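The calibration-safety entry above can be pictured with a small pure-Python sketch. It is not Lucid's observer code: the class name `MinMaxObserverSketch`, the `EPS` floor value, and the uint8 range are assumptions made for illustration. It shows how an observer that never saw data collapses to scale = eps, and where a warning fits.

```python
import math
import warnings

EPS = 1.1920929e-07  # assumed scale floor (float32 machine epsilon)


class MinMaxObserverSketch:
    """Hypothetical min/max observer; illustrative only, not Lucid's class."""

    def __init__(self):
        # Sentinels: min > max means no data has been observed yet.
        self.min_val = math.inf
        self.max_val = -math.inf

    def observe(self, values):
        for v in values:
            self.min_val = min(self.min_val, v)
            self.max_val = max(self.max_val, v)

    def is_calibrated(self):
        return self.min_val <= self.max_val

    def calculate_qparams(self, qmin=0, qmax=255):
        if not self.is_calibrated():
            # Without this warning the caller silently gets a grid whose
            # step is eps, which cannot represent any realistic activation.
            warnings.warn(
                "observer was never calibrated; falling back to scale = eps",
                RuntimeWarning,
            )
            return EPS, qmin
        # Affine quantization: the range is widened to include zero.
        lo = min(self.min_val, 0.0)
        hi = max(self.max_val, 0.0)
        scale = max((hi - lo) / (qmax - qmin), EPS)
        zero_point = round(qmin - lo / scale)
        zero_point = max(qmin, min(qmax, zero_point))
        return scale, zero_point
```

Calling `calculate_qparams()` on a fresh instance warns and returns the degenerate grid; after `observe([-1.0, 3.0])` it returns a usable scale of 4/255.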

Fixed — quantization correctness (EmbeddingBag kernels, Parameter deepcopy)

`EmbeddingBag` CPU kernel read indices as int32 (the default is int64) and seeded max pooling with 0 (masking all-negative bags) — both fixed with a dtype-aware index read, a lowest() seed, and an empty-bag → 0 guard.
`EmbeddingBag` GPU kernel was rewritten scatter-free (segment one-hot → matmul for sum/mean, masked-max for max) after an MLX scatter_add rank mismatch.
`Parameter` now survives `deepcopy` — prepare / QAT deep-copy no longer demotes a Parameter to a plain tensor (via Parameter.__deepcopy__); the quantized _C.engine type stub also gained its typed quantized submodule.
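The max-pooling seed bug in the EmbeddingBag entry above is easy to reproduce in a few lines of plain Python. The function below is a hypothetical sketch, not the CPU kernel: `embedding_bag_max`, its `seed` parameter, and the use of `-inf` as a stand-in for `lowest()` are assumptions for illustration.

```python
import math


def embedding_bag_max(weight, indices, offsets, seed=-math.inf):
    """Max-pool rows of `weight` per bag.

    Bag i covers indices[offsets[i]:offsets[i + 1]]; the last bag runs to
    the end of `indices`. `seed` is the starting value of the running max.
    """
    dim = len(weight[0])
    bounds = list(offsets) + [len(indices)]
    out = []
    for start, end in zip(bounds[:-1], bounds[1:]):
        if start == end:
            # Empty-bag guard: emit zeros rather than the seed value.
            out.append([0.0] * dim)
            continue
        acc = [seed] * dim
        for i in indices[start:end]:
            acc = [max(a, r) for a, r in zip(acc, weight[i])]
        out.append(acc)
    return out


weight = [[-1.0, -5.0], [-3.0, -2.0], [4.0, 0.5]]
# Bags: rows {0, 1}, an empty bag, and row {2}.
pooled = embedding_bag_max(weight, [0, 1, 2], [0, 2, 2])
# A zero seed masks the all-negative first bag.
masked = embedding_bag_max(weight, [0, 1], [0], seed=0.0)
```

With the lowest-value seed the all-negative bag pools to `[-1.0, -2.0]`; with a zero seed the same bag comes back as `[0.0, 0.0]`, which is the masking described above.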

Fixed — degenerate matmul crashes, integer remainder, integer cumsum overflow

Degenerate matmul no longer crashes. Realizing an *empty-output* matmul (e.g. (0,4) @ (4,3)) via .numpy() / repr SIGSEGV'd on Metal — the GPU download ran contiguous() + eval() before the numel == 0 short-circuit; the guard is now hoisted above the MLX ops. A *zero-contraction* matmul ((2,0) @ (0,3), K = 0) crashed the CPU path with a BLAS error (lda must be >= MAX(K,1)); the backend now skips cblas_*gemm for M/N/K = 0 (the zeroed output is already correct).
Integer `remainder` is now the floored modulo. It previously followed the sign of the dividend (acting like fmod) because integer division truncates toward zero: remainder([-7], [3]) gave -1 instead of 2. It now routes through the integer-correct floored division, so the result has the sign of the divisor and (a // b) * b + remainder == a. Float behaviour is unchanged.
Integer `cumsum` / `cumprod` promote to int64. They accumulated in the input integer dtype and silently overflowed (int32 wrapped around), whereas sum and prod already promote; the cumulative ops now match that behaviour.
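The two integer fixes above come down to simple arithmetic, sketched here in plain Python. None of these helpers are Lucid functions: `trunc_remainder`, `floored_remainder`, `wrap_int32`, and `cumsum_sketch` are hypothetical names that only model the before and after behaviour.

```python
def trunc_remainder(a, b):
    """Remainder from truncating division: sign follows the dividend."""
    q = abs(a) // abs(b)
    if (a < 0) != (b < 0):
        q = -q  # truncate toward zero
    return a - q * b


def floored_remainder(a, b):
    """Remainder from floored division: sign follows the divisor."""
    return a - (a // b) * b  # Python's // already floors


def wrap_int32(x):
    """Model two's-complement wraparound of a 32-bit signed integer."""
    return (x + 2**31) % 2**32 - 2**31


def cumsum_sketch(values, wrap=None):
    """Running sum; `wrap` models a fixed-width accumulator if given."""
    total, out = 0, []
    for v in values:
        total += v
        if wrap is not None:
            total = wrap(total)
        out.append(total)
    return out


old_rem = trunc_remainder(-7, 3)    # -1, the previous behaviour
new_rem = floored_remainder(-7, 3)  # 2, the floored modulo

int32_max = 2**31 - 1
overflowed = cumsum_sketch([int32_max, 1], wrap=wrap_int32)
promoted = cumsum_sketch([int32_max, 1])  # wide accumulator, no wrap
```

Here `overflowed` ends at -2147483648 while `promoted` ends at 2147483648, which is the difference a wider accumulator makes.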

Changed — `lucid.compile(..., dynamic=True)`: symbolic batch by default, never crashes

gate clears the graph → one symbolic executable, reused for every batch size (no recompile when the batch changes). Transformers, CNNs, MLPs and hand-written arithmetic (x*0.5, where, manual LayerNorm) all qualify (parity ≤ 7e-7 across batch sizes).
gate rejects it, or the symbolic lowering otherwise fails (off-dim-0 view) → robust per-shape static caching. The gate rejects a graph that bakes the batch into a constant MPSGraph can't infer: an explicit broadcast / expand / repeat, a concat/stack on the batch axis, or a batch-shaped factory like zeros_like(x) or an RNN's zero hidden init. This path is still correct; it just recompiles per distinct shape. So dynamic=True never crashes or silently mis-shapes a real model: it shares one executable where provably safe and falls back to per-shape static otherwise. LUCID_COMPILE_DYNAMIC=0 forces pure static (no symbolic attempt). The compiled training step (make_step) is always per-shape static, even with make_step(..., dynamic=True), because the backward graph of common reductions aborts under a symbolic batch.
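The share-or-fall-back policy described above can be sketched as a cache keyed either symbolically or per shape. This is a toy model under stated assumptions, not how lucid.compile is implemented: `CompileCacheSketch`, its `gate_ok` flag, and the placeholder `_compile` step are all hypothetical. Only the LUCID_COMPILE_DYNAMIC variable name comes from the entry itself.

```python
import os


class CompileCacheSketch:
    """Hypothetical cache: one shared entry if the gate passes, else per shape."""

    def __init__(self, fn, gate_ok, dynamic=True):
        self.fn = fn
        env_allows = os.environ.get("LUCID_COMPILE_DYNAMIC", "1") != "0"
        # Symbolic batch only when requested, allowed, and judged safe.
        self.symbolic = dynamic and env_allows and gate_ok
        self.cache = {}
        self.compiles = 0

    def _compile(self, key):
        # Placeholder for a real lowering step; here it just counts.
        self.compiles += 1
        return self.fn

    def __call__(self, batch):
        n_rows = len(batch)
        n_cols = len(batch[0]) if batch else 0
        if self.symbolic:
            key = ("symbolic", n_cols)        # batch size left out of the key
        else:
            key = ("static", n_rows, n_cols)  # one entry per distinct shape
        if key not in self.cache:
            self.cache[key] = self._compile(key)
        return self.cache[key](batch)


def double(batch):
    return [[2 * x for x in row] for row in batch]
```

With `gate_ok=True`, calls at batch sizes 1 and 3 share one compile. With `gate_ok=False`, or with LUCID_COMPILE_DYNAMIC=0 set, each distinct batch size compiles once and repeat shapes hit the cache.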