* ggml/ggml-vulkan/test-backend-ops: adds CONV_2D for Vulkan
* ggml-vulkan: adds an f32 scalar shader that computes the 2D convolution directly
as a GEMM (no im2col needed)
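A minimal CPU sketch of that approach (illustration only, not the shader itself): the convolution is evaluated as one GEMM between the [OC x IC*KH*KW] kernel matrix and a virtual [IC*KH*KW x N*OH*OW] input matrix whose elements are gathered on the fly, so no im2col buffer is ever materialized. The function name, NCHW layout, and the square stride/padding/dilation parameters are assumptions made for the example.
```cpp
// Hypothetical CPU reference of conv2d-as-implicit-GEMM (NCHW layouts assumed).
static void conv2d_implicit_gemm(
        const float * src, const float * ker, float * dst,
        int N, int IC, int IH, int IW, int OC, int KH, int KW,
        int OH, int OW, int stride, int pad, int dil) {
    const int K = IC * KH * KW;                        // shared GEMM dimension
    for (int oc = 0; oc < OC; ++oc) {                  // GEMM rows: output channels
        for (int col = 0; col < N * OH * OW; ++col) {  // GEMM columns: output pixels
            const int n  =  col / (OH * OW);
            const int oy = (col / OW) % OH;
            const int ox =  col % OW;
            float acc = 0.0f;
            for (int k = 0; k < K; ++k) {              // gather the im2col column on the fly
                const int ic =  k / (KH * KW);
                const int ky = (k / KW) % KH;
                const int kx =  k % KW;
                const int iy = oy * stride + ky * dil - pad;
                const int ix = ox * stride + kx * dil - pad;
                if (iy < 0 || iy >= IH || ix < 0 || ix >= IW) {
                    continue;                          // implicit zero padding
                }
                acc += ker[((oc * IC + ic) * KH + ky) * KW + kx] *
                       src[((n  * IC + ic) * IH + iy) * IW + ix];
            }
            dst[((n * OC + oc) * OH + oy) * OW + ox] = acc;
        }
    }
}
```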
* test-backend-ops: adds test_case_ref to check the validity/performance of ops
against reference implementations that use different graphs; adds tests
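A sketch of the kind of check this enables (names are illustrative, not the test-backend-ops API): the op graph and the reference graph are evaluated separately, and only their final outputs are compared, here via a normalized mean squared error.
```cpp
#include <vector>

// Hypothetical helper: compare the output of an op against a reference
// implementation produced by a different graph.
static double nmse(const std::vector<float> & out, const std::vector<float> & ref) {
    double err  = 0.0;
    double norm = 0.0;
    for (size_t i = 0; i < out.size(); ++i) {
        const double d = (double) out[i] - (double) ref[i];
        err  += d * d;
        norm += (double) ref[i] * (double) ref[i];
    }
    return norm > 0.0 ? err / norm : err;  // pass if below a small tolerance
}
```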
* Performance fixes: minimized branch divergence, used collectives to
eliminate redundant calculations, removed macros.
* Kernel shared memory size check
* Updates test-backend-ops to support graphs for performance
measurement.
* Apple/Win32 compile errors fixed
* Subgroup size used to determine tile size -> fixes llvmpipe errors.
* Collectives disabled by default.
* Intel support is disabled as the performance is poor.
* Conv2d enabled for Intel with collectives disabled, disabled for Apple
* test-backend-ops modifications are reverted
* Trailing spaces and missing override fixed.
* Triggering pipeline relaunch.
* Code formatted with .clang-format.
* imatrix : allow processing multiple chunks per batch
* perplexity : simplify filling the batch
* imatrix : fix segfault when using a single chunk per batch
* imatrix : use GGUF to store imatrix data
* imatrix : fix conversion problems
* imatrix : use FMA and sort tensor names
* py : add requirements for legacy imatrix convert script
* perplexity : revert changes
* py : include imatrix converter requirements in toplevel requirements
* imatrix : avoid using designated initializers in C++
* imatrix : remove unused n_entries
* imatrix : allow loading mis-ordered tensors
Sums and counts tensors no longer need to be consecutive.
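A sketch of the idea, with hypothetical suffixes for the per-tensor data (the real GGUF tensor names may differ): pairing the sums and counts tensors through a map keyed by the base name makes their order in the file irrelevant.
```cpp
#include <map>
#include <string>
#include "ggml.h"

struct imatrix_entry {                 // hypothetical in-memory pairing
    const ggml_tensor * sums   = nullptr;
    const ggml_tensor * counts = nullptr;
};

static void collect(std::map<std::string, imatrix_entry> & entries, const ggml_tensor * t) {
    const std::string name = t->name;
    const auto ends_with = [&](const std::string & suffix) {
        return name.size() >= suffix.size() &&
               name.compare(name.size() - suffix.size(), suffix.size(), suffix) == 0;
    };
    // ".in_sum2" and ".counts" are assumed suffixes for this illustration
    if (ends_with(".in_sum2")) {
        entries[name.substr(0, name.size() - 8)].sums   = t;
    } else if (ends_with(".counts")) {
        entries[name.substr(0, name.size() - 7)].counts = t;
    }
}
```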
* imatrix : more sanity checks when loading multiple imatrix files
* imatrix : use ggml_format_name instead of std::string concatenation
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
* quantize : use unused imatrix chunk_size with LLAMA_TRACE
* common : use GGUF for imatrix output by default
* imatrix : two-way conversion between old format and GGUF
* convert : remove imatrix to gguf python script
* imatrix : use the function name in more error messages
* imatrix : don't use FMA explicitly
This should make comparisons between the formats easier
because this matches the behavior of the previous version.
* imatrix : avoid returning from void function save_imatrix
* imatrix : support 3d tensors with MUL_MAT
* quantize : fix dataset name loading from gguf imatrix
* common : move string_remove_suffix from quantize and imatrix
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* imatrix : add warning when legacy format is written
* imatrix : warn when writing partial data, to help guess dataset coverage
Also make the legacy format store partial data
by using neutral values for missing data.
This matches what is done at read-time for the new format,
and so should get the same quality in case the old format is still used.
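A sketch of what "neutral values" could mean when writing the legacy format (the 1.0f weight is an assumption for illustration): entries that never saw an activation get a weight that leaves quantization unbiased, mirroring what the new-format reader does for missing data.
```cpp
#include <cstdio>
#include <vector>

// Hypothetical legacy writer: convert accumulated sums/counts to per-column
// averages, substituting a neutral weight where no data was collected.
static void write_legacy_values(FILE * f,
                                const std::vector<float> & sums,
                                const std::vector<float> & counts) {
    std::vector<float> values(sums.size());
    for (size_t i = 0; i < sums.size(); ++i) {
        values[i] = counts[i] > 0.0f ? sums[i] / counts[i]
                                     : 1.0f;  // assumed neutral value for missing data
    }
    fwrite(values.data(), sizeof(float), values.size(), f);
}
```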
* imatrix : avoid loading model to convert or combine imatrix
* imatrix : avoid using imatrix.dat in README
---------
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Fix Gemma3n not executed as CUDA_GRAPH on NVGPUs
Gemma3n uses matrix-matrix addition as part of its input processing, which
wrongly triggered CUDA_GRAPH disablement on NVGPUs even when a batch size
of 1 is used.
* Exclude `project_per_layer_input` by matching node names
This ensures that all other graphs which don't exhibit this pattern do
not have their behavior changed.
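A minimal sketch of the exclusion (the real heuristic in ggml-cuda differs in detail): ADD nodes whose names match the Gemma3n `project_per_layer_input` pattern are exempted from the check that would otherwise disable CUDA graph capture.
```cpp
#include <cstring>
#include "ggml.h"

// Hypothetical helper: should this ADD node disable CUDA graph capture?
static bool add_node_disables_cuda_graph(const ggml_tensor * node) {
    if (node->op != GGML_OP_ADD) {
        return false;
    }
    // known-safe Gemma3n pattern: keep CUDA graphs enabled for these nodes
    if (strstr(node->name, "project_per_layer_input") != nullptr) {
        return false;
    }
    // otherwise treat a wide second operand as a true matrix-matrix addition
    return node->src[1] != nullptr && node->src[1]->ne[1] > 1;
}
```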
* Revert unnecessary formatting changes
* Minimal setup of webgpu backend with Dawn. Just prints out the adapter and segfaults
* Initialize webgpu device
* Making progress on setting up the backend
* Finish more boilerplate/utility functions
* Organize file and work on alloc buffer
* Add webgpu_context to prepare for actually running some shaders
* Work on memset and add shader loading
* Work on memset polyfill
* Implement set_tensor as webgpu WriteBuffer, remove host_buffer stubs since webgpu doesn't support them
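A sketch of the upload path this refers to (the struct and function names are made up for the example): host-to-device copies go through wgpuQueueWriteBuffer on the device queue rather than through a mapped host buffer, which is why the host_buffer stubs could be dropped.
```cpp
#include <webgpu/webgpu.h>
#include <cstddef>
#include <cstdint>

struct webgpu_buffer_ctx {      // hypothetical per-buffer context
    WGPUQueue  queue;
    WGPUBuffer buffer;
};

// Hypothetical set_tensor hook: copy `size` bytes of host data into the GPU
// buffer at the tensor's offset via the queue.
static void webgpu_set_tensor(webgpu_buffer_ctx * ctx, uint64_t tensor_offset,
                              const void * data, size_t size) {
    wgpuQueueWriteBuffer(ctx->queue, ctx->buffer, tensor_offset, data, size);
}
```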
* Implement get_tensor and buffer_clear
* Finish rest of setup
* Start work on compute graph
* Basic mat mul working
* Work on emscripten build
* Basic WebGPU backend instructions
* Use EMSCRIPTEN flag
* Work on passing ci, implement 4d tensor multiplication
* Pass thread safety test
* Implement permuting for mul_mat and cpy
* minor cleanups
* Address feedback
* Remove division by type size in cpy op
* Fix formatting and add GitHub Actions workflows for the WebGPU backend on Vulkan and Metal (M-series)
* Fix name
* Fix macos dawn prefix path
* Support diffusion models: Add Dream 7B
* Move diffusion to examples
* Move stuff to examples. Add patch to not use kv-cache
* Address review comments
* Make sampling fast
* llama: remove diffusion functions
* Add basic timings + cleanup
* More cleanup
* Review comments: better formatting, use LOG instead of std::cerr, re-use batch, use ubatch instead of max_length
* fixup!
* Review: move everything to diffusion-cli for now
Add LLAMA_API to fix the run-time error with llama-cpp-python in a Windows env:
AttributeError: function 'llama_kv_self_seq_div' not found.
Did you mean: 'llama_kv_self_seq_add'?
Although llama_kv_self_seq_div() is marked deprecated, it still needs to be
exported to keep llama-cpp-python happy.
Observed software version:
OS: windows
compiler: MSVC
llama-cpp-python: tag: v0.3.12-cu124
llama.cpp: tag: b5833
Signed-off-by: Min-Hua Chen <minhuadotchen@gmail.com>
Co-authored-by: Min-Hua Chen <minhua.chen@neuchips.ai>
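A sketch of the kind of declaration involved in the LLAMA_API change above (the deprecation hint text is an assumption): adding LLAMA_API ensures the symbol is exported from the Windows DLL, so llama-cpp-python can still resolve the deprecated function by name at run time.
```cpp
// llama.h-style declaration (illustration only)
DEPRECATED(LLAMA_API void llama_kv_self_seq_div(
               struct llama_context * ctx,
                        llama_seq_id   seq_id,
                           llama_pos   p0,
                           llama_pos   p1,
                                 int   d),
    "use llama_memory_seq_div instead");
```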
* Add PLaMo-2 model using hybrid memory module
* Fix z shape
* Add cmath to include from llama-vocab.h
* Explicitly dequantize normalization weights before RoPE apply
* Revert unnecessary cast because the problem can be solved by excluding attn_k, attn_q when quantizing
* Use ATTN_K/Q_NORM for k,q weights to prevent quantization
* Remove SSM_BCDT, which is not used anywhere
* Do not duplicate embedding weights for output.weight
* Fix tokenizer encoding problem for multibyte strings
* Apply suggestion from @CISC
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update src/llama-model.cpp
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Use LLM_FFN_SWIGLU instead of splitting ffn_gate and ffn_up
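For reference, what the fused LLM_FFN_SWIGLU path computes per element (a sketch in plain C++, not the ggml graph code): the gate activation is SiLU, and its output scales the up projection before the down projection is applied.
```cpp
#include <cmath>
#include <vector>

// Reference of silu(gate(x)) * up(x) on already-projected rows.
static std::vector<float> swiglu(const std::vector<float> & gate,
                                 const std::vector<float> & up) {
    std::vector<float> out(gate.size());
    for (size_t i = 0; i < gate.size(); ++i) {
        const float silu = gate[i] / (1.0f + std::exp(-gate[i]));
        out[i] = silu * up[i];
    }
    return out;
}
```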
* Remove unnecessary part for Grouped Query Attention
* Fix how special token ids are loaded into the gguf
* Remove unused tensor mapping
* Update src/llama-model.cpp
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Remove llama_vocab_plamo2 class and replace it with llm_tokenizer_plamo2_session to follow the other tokenizer implementations
* Update src/llama-vocab.cpp
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Update convert_hf_to_gguf.py
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update src/llama-model.cpp
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update src/llama-model.cpp
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update convert_hf_to_gguf.py
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update convert_hf_to_gguf.py
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Fix plamo2 tokenizer session to prevent multiple calls of build()
---------
Co-authored-by: Francis Couture-Harpin <git@compilade.net>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Remove unnecessary templates from the class definition and packing functions
Reduce deeply nested conditionals and if-else switching in the mnpack function
Replace repetitive code with inline functions in the packing functions
2~7% improvement in Q8 models
15~50% improvement in Q4 models
Signed-off-by: Shalini Salomi Bodapati <Shalini.Salomi.Bodapati@ibm.com>