On 2025/01/30 13:27, Dave Voutila wrote:
> Stuart Henderson <s...@spacehopper.org> writes:
>
> > On 2025/01/30 08:15, Dave Voutila wrote:
> >>
> >> FWIW we should be able to include Vulkan support as its in ports. I've
> >> played with llama.cpp locally with it, but I don't have a GPU that's
> >> worth a damn top see if it's an improvement over pure CPU-based
> >> inferencing.
> >
> > Makes sense, though I think it would be better to commit without and
> > add that later.
> >
> >> Also should this be arm64 and amd64 specific? I'm not a ports person so
> >> not sure :)
> >
> > Do you mean for llama.cpp at all, or just the vulkan support?
> > (If it's "at all", afaik the original intention was that - like
> > whisper.cpp - it would run without anything special).
>
> I think some of its cpu-based inferencing relies on specific cpu
> extensions, like AVX. Not sure it's truly cross-platform. I may be
> wrong.
I _think_ it should be ok:

- Plain C/C++ implementation without any dependencies

as well as

- Apple silicon is a first-class citizen - optimized via ARM NEON,
  Accelerate and Metal frameworks
- AVX, AVX2, AVX512 and AMX support for x86 architectures
- 1.5-bit, 2-bit, 3-bit, 4-bit, 5-bit, 6-bit, and 8-bit integer
  quantization for faster inference and reduced memory use
- Custom CUDA kernels for running LLMs on NVIDIA GPUs (support for AMD
  GPUs via HIP and Moore Threads MTT GPUs via MUSA)
- Vulkan and SYCL backend support
- CPU+GPU hybrid inference to partially accelerate models larger than
  the total VRAM capacity
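
To sanity-check the AVX concern above: a quick standalone way to see
which x86 SIMD extensions a given host CPU advertises, using the
GCC/Clang __builtin_cpu_supports() builtin. This is only an
illustrative sketch for the discussion, not code from llama.cpp or the
port; per the list above, the plain C/C++ implementation is what you
fall back to when none of these are present.

/* cpufeat.c - print which x86 SIMD extensions this CPU advertises
 * (hypothetical helper for this thread, not part of llama.cpp) */
#include <stdio.h>

int main(void)
{
#if defined(__x86_64__) || defined(__i386__)
	__builtin_cpu_init();
	printf("avx:     %s\n", __builtin_cpu_supports("avx")     ? "yes" : "no");
	printf("avx2:    %s\n", __builtin_cpu_supports("avx2")    ? "yes" : "no");
	printf("avx512f: %s\n", __builtin_cpu_supports("avx512f") ? "yes" : "no");
#else
	/* arm64 etc.: NEON or the generic C paths would be used instead */
	printf("not x86, skipping AVX checks\n");
#endif
	return 0;
}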