Telexed

TELEXED// solo-operator signal radar · Issue 843

AI news through a solo-operator lens: only what changes your day

Sat, May 23 · 1 dispatch

#0843 · Other · r/LocalLLaMA · last week
`Qwen3.6 27B` pure `Q4_K_M` GGUF fits in **16GB VRAM**
40 · radar
Qwen3.6: Open LLM, active GGUF ecosystem for local inference
Pure quantization trims enough size to keep the whole model on a consumer GPU. Useful for local agent tests, but quality loss is real and benchmark depth is thin.
- The Q4_K_M build with MTP (multi-token prediction) is 15.4GB and the non-MTP build is 15.1GB; comparable builds listed at 16.5-18GB often spill past 16GB cards.
- MTP reaches 40 tok/s token generation (tg) but only 195 tok/s prompt processing (pp); non-MTP flips the trade-off at 715 tok/s pp and 24 tok/s tg.
- Perplexity delta is larger than Unsloth's quant: +0.1707 vs +0.0553 on MTP, so the size win buys speed/fit at some quality cost.
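
To see when each build wins, the pp/tg figures above can be turned into a rough per-request latency estimate. This is a minimal sketch, assuming latency is simply prompt time plus generation time at the quoted rates; the function names and the two sample workloads are illustrative, not from the original post.

```python
# Throughput figures quoted in the dispatch, in tokens per second.
MTP = {"pp": 195.0, "tg": 40.0}
NON_MTP = {"pp": 715.0, "tg": 24.0}


def request_seconds(build, prompt_tokens, output_tokens):
    """Rough latency: prompt processing time plus generation time."""
    return prompt_tokens / build["pp"] + output_tokens / build["tg"]


def break_even_ratio(a, b):
    """Prompt-to-output token ratio at which builds a and b take equal time.

    Solves P/a.pp + G/a.tg == P/b.pp + G/b.tg for P/G.
    """
    return (1 / b["tg"] - 1 / a["tg"]) / (1 / a["pp"] - 1 / b["pp"])


# Hypothetical workloads: a short chat turn and a long-context job.
for prompt, output in [(500, 1000), (20000, 500)]:
    mtp = request_seconds(MTP, prompt, output)
    plain = request_seconds(NON_MTP, prompt, output)
    print(f"prompt={prompt} output={output}: MTP {mtp:.1f}s, non-MTP {plain:.1f}s")

print(f"break-even prompt/output ratio: {break_even_ratio(MTP, NON_MTP):.2f}")
```

Under these assumptions the break-even sits near 4.5 prompt tokens per generated token: generation-heavy requests favor MTP, while prompt-heavy ones favor the non-MTP build. Real runs will vary with context length and hardware.
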
Source: www.reddit.com/r/LocalLLaMA/comments/1tkzk9e/qwen36_27b_

Thu, May 21 · 1 dispatch

#0842 · Other · r/LocalLLaMA · last week
`ik_llama.cpp` pushes `Qwen3.6 35B A3B` near 110 tok/s on 12GB VRAM
40 · radar
ik_llama.cpp: llama.cpp fork, optimized CPU offload and quantization
MTP plus CPU offload can make a local MoE model feel interactive on consumer hardware. Useful for private coding or batch jobs, but still a setup-specific benchmark.
- Same IQ4_XS quant averaged 89.76 tok/s on regular llama.cpp; ik_llama.cpp samples reached roughly 105-110 tok/s.
- Hardware was RTX 4070 Super 12GB, Ryzen 7 9700X, and 48GB DDR5. CPU offload quality matters as much as VRAM.
- Benchmark used `--ctx-size 131072`, q8 KV cache, and draft-mtp; long-context local workflows remain memory-sensitive.
- Treat it as a tuning lead, not a buying guide. Kernel, quant, and fork versions can swing results hard.
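
As a quick sanity check on the numbers above, the relative gain over regular llama.cpp can be computed directly. A minimal sketch using only the throughput figures quoted in the bullets; the helper name is illustrative.

```python
def speedup_percent(baseline_tps, new_tps):
    """Relative throughput gain of new_tps over baseline_tps, in percent."""
    return (new_tps / baseline_tps - 1.0) * 100.0


BASELINE_TPS = 89.76  # regular llama.cpp average, same IQ4_XS quant
low = speedup_percent(BASELINE_TPS, 105.0)   # lower ik_llama.cpp sample
high = speedup_percent(BASELINE_TPS, 110.0)  # upper ik_llama.cpp sample

print(f"ik_llama.cpp gain: {low:.1f}% to {high:.1f}%")
```

That works out to a gain of roughly 17% to 23% on this one setup, which is consistent with treating the result as a tuning lead.
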
Source: www.reddit.com/r/LocalLLaMA/comments/1tjh7az/110_toks_wi

Sun, May 17 · 1 dispatch

#0841 · Other · r/LocalLLaMA · 2 weeks ago
`llama.cpp` fork enables quantized KV cache with tensor split
50 · radar
llama.cpp: Local LLM inference engine, supports GGUF and CUDA backends
Tensor parallelism becomes usable with quantized KV cache on dual GPUs. Still a fork with MoE caveats, so it is a test-only local inference tweak.
- Generation on Qwen3.5 27B Q4_K_M benchmarked at 30.05 tok/s with `-sm tensor` vs 21.22 tok/s without it.
- The command uses `-ctk q8_0 -ctv q8_0`, removing the old tensor-split tradeoff of falling back to non-quantized KV cache.
- Author reports real use rising from about 25 tok/s to 40 tok/s on 3060 12GB + 4070 Super 12GB.
- MoE models currently break with `-sm tensor`; dense models like Qwen 27B/9B are the safer test target.
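
For anyone scripting a test run, the flags mentioned above can be assembled with a guard for the MoE caveat. This is a sketch only: the `-sm tensor`, `-ctk q8_0`, and `-ctv q8_0` flags come from the dispatch, while the `llama-server` binary name, the model path, and the `build_args` helper are assumptions, and the fork's actual binary or extra flags may differ.

```python
def build_args(model_path, is_moe, binary="llama-server"):
    """Assemble an argv list for a tensor-split run with quantized KV cache.

    binary and model_path are placeholders; only the -sm/-ctk/-ctv flags
    are taken from the dispatch.
    """
    if is_moe:
        # The dispatch reports that MoE models currently break with -sm tensor.
        raise ValueError("-sm tensor is reported to break on MoE models")
    return [
        binary,
        "-m", model_path,   # path to a dense GGUF model
        "-sm", "tensor",    # tensor split across GPUs (fork feature)
        "-ctk", "q8_0",     # quantized K cache
        "-ctv", "q8_0",     # quantized V cache
    ]


# Hypothetical dense model file; pass the list to subprocess.run to launch.
args = build_args("qwen-27b-q4_k_m.gguf", is_moe=False)
print(" ".join(args))
```

Keeping the MoE check in the helper makes the dense-only limitation explicit instead of surfacing as a runtime failure in the fork.
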
Source: www.reddit.com/r/LocalLLaMA/comments/1tflngz/dual_gpu_ll