
llama.cpp (Cortex)

Overview

Jan uses llama.cpp for running local AI models. You can find its settings in Settings > Local Engine > llama.cpp:


These settings are intended for advanced users. You may want to adjust them when:

  • Your AI models are running slowly or not working
  • You've installed new hardware (like a graphics card)
  • You want to tinker & test performance with different backends

Engine Version and Updates

  • Engine Version: View the current version of the llama.cpp engine
  • Check Updates: Check for a newer version and install it when available

Available Backends

Jan offers different backend variants for llama.cpp based on your operating system and hardware. You can:

  • Download different backends as needed
  • Switch between backends for different hardware configurations
  • View currently installed backends in the list
⚠️ Choose the backend that matches your hardware. Using the wrong variant may cause performance issues or prevent models from loading.

CUDA Support (NVIDIA GPUs)

  • llama.cpp-avx-cuda-11-7
  • llama.cpp-avx-cuda-12-0
  • llama.cpp-avx2-cuda-11-7
  • llama.cpp-avx2-cuda-12-0
  • llama.cpp-avx512-cuda-11-7
  • llama.cpp-avx512-cuda-12-0
  • llama.cpp-noavx-cuda-11-7
  • llama.cpp-noavx-cuda-12-0
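
CUDA backends require an NVIDIA driver that supports the matching CUDA version. One way to check what your driver supports is to read the header that nvidia-smi prints. The Python sketch below does this; the helper and its parsing are illustrative, not part of Jan:

```python
import subprocess

# Illustrative sketch (not part of Jan): read the CUDA version the
# installed NVIDIA driver supports from nvidia-smi's header line.
def driver_cuda_version():
    try:
        out = subprocess.run(["nvidia-smi"], capture_output=True,
                             text=True, check=True).stdout
    except (FileNotFoundError, subprocess.CalledProcessError):
        return None  # no NVIDIA driver/GPU available
    marker = "CUDA Version:"
    if marker not in out:
        return None
    return out.split(marker, 1)[1].split()[0]  # e.g. "12.2"

version = driver_cuda_version()
if version is None:
    print("No NVIDIA GPU detected; use a CPU or Vulkan backend.")
elif float(version) >= 12.0:
    print("Pick a *-cuda-12-0 backend.")
else:
    print("Pick a *-cuda-11-7 backend.")
```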

CPU Only

  • llama.cpp-avx
  • llama.cpp-avx2
  • llama.cpp-avx512
  • llama.cpp-noavx

Other Accelerators

  • llama.cpp-vulkan

Notes

  • For detailed hardware compatibility, please visit our guide for Windows.
  • AVX, AVX2, and AVX-512 are CPU instruction sets. For best performance, use the most advanced instruction set your CPU supports (see the sketch below).
  • CUDA versions should match your installed NVIDIA drivers.
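
If you are unsure which instruction sets your CPU supports, you can inspect its feature flags. A minimal sketch, assuming Linux (it reads /proc/cpuinfo, which does not exist on Windows or macOS; the function name is illustrative):

```python
def best_cpu_backend() -> str:
    # /proc/cpuinfo lists CPU feature flags on Linux.
    flags = set()
    with open("/proc/cpuinfo") as f:
        for line in f:
            if line.startswith("flags"):
                flags = set(line.split(":")[1].split())
                break
    # Prefer the most advanced instruction set the CPU supports.
    if "avx512f" in flags:
        return "llama.cpp-avx512"
    if "avx2" in flags:
        return "llama.cpp-avx2"
    if "avx" in flags:
        return "llama.cpp-avx"
    return "llama.cpp-noavx"

print(best_cpu_backend())
```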

Performance Settings

| Setting | Description | Default |
|---------|-------------|---------|
| Continuous Batching | Processes multiple requests simultaneously, improving throughput for concurrent operations | Enabled |
| Parallel Operations | Number of prompts to run in parallel; affects model inference speed | 4 |
| CPU Threads | Number of CPU cores to use when running without a GPU; higher thread counts may improve performance but increase CPU usage | Auto-detected based on your system's capabilities |
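
Jan applies these settings through its bundled engine, so you normally never set them by hand. For illustration only, the same knobs exist as flags on llama.cpp's llama-server; the invocation below is an assumption (binary name and model path included), not how Jan actually launches the engine:

```python
import subprocess

# Illustrative only: Jan launches and configures the engine itself.
subprocess.run([
    "llama-server",
    "-m", "./models/example.gguf",  # hypothetical model path
    "--cont-batching",    # Continuous Batching
    "--parallel", "4",    # Parallel Operations: 4 prompts at once
    "--threads", "8",     # CPU Threads when running without a GPU
])
```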

Memory Settings

| Setting | Description | Default |
|---------|-------------|---------|
| Flash Attention | Optimizes attention computation and reduces memory usage; recommended for most cases | Enabled |
| Caching | Stores recent prompts and responses to improve response time for repeated prompts | Enabled |
| KV Cache Type | KV cache implementation type; controls the memory usage and precision trade-off. Options: f16 (most stable), q8_0 (balanced), q4_0 (lowest memory) | f16 |
| mmap | Enables memory-mapped model loading, reducing memory usage; recommended for large models | Enabled |
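
As with the performance settings, these map onto llama.cpp server flags. A hedged sketch (binary name and model path are assumptions; Jan manages all of this internally):

```python
import subprocess

# Illustrative only: flag names are llama.cpp server options.
subprocess.run([
    "llama-server",
    "-m", "./models/example.gguf",  # hypothetical model path
    "--flash-attn",             # Flash Attention
    "--cache-type-k", "q8_0",   # KV Cache Type for keys
    "--cache-type-v", "q8_0",   # KV Cache Type for values
    # mmap is enabled by default; add --no-mmap to turn it off
])
```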

Best Practices

  • Start with default settings
  • Adjust based on your hardware capabilities
  • Monitor system performance (see the sketch below)
  • Test changes with your specific use case

Performance impact varies by hardware, model size, and usage patterns.
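
For the "monitor system performance" step, the third-party psutil package offers a quick way to watch CPU and RAM while a model runs; this sketch assumes psutil is installed (pip install psutil):

```python
import psutil  # third-party: pip install psutil

# Print CPU and RAM usage once per second for ten seconds.
for _ in range(10):
    cpu = psutil.cpu_percent(interval=1)  # samples over ~1 second
    mem = psutil.virtual_memory()
    print(f"CPU {cpu:5.1f}%  RAM {mem.percent:5.1f}% "
          f"({mem.used / 2**30:.1f} GiB used)")
```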