
Local AI Engine (llama.cpp)

What is llama.cpp?

llama.cpp is the engine that runs AI models locally on your computer. Think of it as the software that takes an AI model file and makes it actually work on your hardware - whether that's your CPU, graphics card, or Apple's M-series chips.

Originally created by Georgi Gerganov, llama.cpp is designed to run large language models efficiently on consumer hardware without requiring specialized AI accelerators or cloud connections.

Why This Matters

  • Privacy: Your conversations never leave your computer
  • Cost: No monthly subscription fees or API costs
  • Speed: No internet required once models are downloaded
  • Control: Choose exactly which models to run and how they behave

Accessing Engine Settings

Find llama.cpp settings at Settings > Local Engine > llama.cpp:

These are advanced settings. You typically only need to adjust them if models aren't working properly or you want to optimize performance for your specific hardware.

Engine Management

| Feature | What It Does | When You Need It |
|---------|--------------|------------------|
| Engine Version | Shows which version of llama.cpp you're running | Check compatibility with newer models |
| Check Updates | Downloads newer engine versions | When new models require an updated engine |
| Backend Selection | Choose the version optimized for your hardware | After installing a new graphics card or when performance is poor |

Hardware Backends

Jan offers different backend versions optimized for your specific hardware. Think of these as different "drivers" - each one is tuned for particular processors or graphics cards.

⚠️ Using the wrong backend can make models run slowly or fail to load. Pick the one that matches your hardware.

NVIDIA Graphics Cards (Recommended for Speed)

Choose based on your CUDA version (check NVIDIA Control Panel):

For CUDA 12.0:

  • llama.cpp-avx2-cuda-12-0 (most common)
  • llama.cpp-avx512-cuda-12-0 (newer Intel/AMD CPUs)

For CUDA 11.7:

  • llama.cpp-avx2-cuda-11-7 (most common)
  • llama.cpp-avx512-cuda-11-7 (newer Intel/AMD CPUs)

CPU Only (No Graphics Card Acceleration)

  • llama.cpp-avx2 (most modern CPUs)
  • llama.cpp-avx512 (newer Intel/AMD CPUs)
  • llama.cpp-avx (older CPUs)
  • llama.cpp-noavx (very old CPUs)
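
Not sure which instruction sets your CPU supports? You can inspect its feature flags. Here's a minimal Python sketch for Linux (it reads /proc/cpuinfo; on Windows or macOS, check your CPU's spec sheet or a tool like CPU-Z instead):

```python
# Rough guide to which llama.cpp CPU backend variant should work.
# Linux-only sketch: reads the CPU feature flags from /proc/cpuinfo.
def cpu_flags():
    with open("/proc/cpuinfo") as f:
        for line in f:
            if line.startswith("flags"):
                return set(line.split(":", 1)[1].split())
    return set()

flags = cpu_flags()
if "avx512f" in flags:
    print("avx512 build should work (avx2/avx also fine)")
elif "avx2" in flags:
    print("avx2 build recommended")
elif "avx" in flags:
    print("avx build recommended")
else:
    print("use the noavx build")
```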

Other Graphics Cards

  • llama.cpp-vulkan (AMD, Intel Arc, some others)

Quick Test: Start with avx2-cuda-12-0 if you have an NVIDIA card, or avx2 for CPU-only. If it doesn't work, try the avx variant.

Performance Settings

These control how efficiently models run:

| Setting | What It Does | Recommended Value | Impact |
|---------|--------------|-------------------|--------|
| Continuous Batching | Process multiple requests at once | Enabled | Faster when using multiple tools or having multiple conversations |
| Parallel Operations | How many requests to handle simultaneously | 4 | Higher = more multitasking, but uses more memory |
| CPU Threads | How many processor cores to use | Auto-detected | More threads can speed up CPU processing |
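
To see Continuous Batching and Parallel Operations in action, you can send several requests at once to Jan's local API server (it must be enabled in Settings first). The address, port, and model id below are placeholders; substitute the values shown in your own Local API Server settings:

```python
# Fire several chat requests concurrently at a local OpenAI-compatible endpoint.
# With Continuous Batching enabled and Parallel Operations >= the number of
# requests, the engine works on them together rather than strictly one by one.
import concurrent.futures
import requests

URL = "http://localhost:1337/v1/chat/completions"  # placeholder; check your Local API Server settings
MODEL = "your-model-id"                             # placeholder model id

def ask(question: str) -> str:
    resp = requests.post(URL, json={
        "model": MODEL,
        "messages": [{"role": "user", "content": question}],
    }, timeout=300)
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]

questions = [
    "Summarize llama.cpp in one sentence.",
    "What is a KV cache?",
    "Name three GGUF quantization types.",
]

with concurrent.futures.ThreadPoolExecutor(max_workers=len(questions)) as pool:
    for answer in pool.map(ask, questions):
        print(answer[:80])
```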

Memory Settings

These control how models use your computer's memory:

| Setting | What It Does | Recommended Value | When to Change |
|---------|--------------|-------------------|----------------|
| Flash Attention | More efficient memory usage | Enabled | Leave enabled unless you have problems |
| Caching | Remember recent conversations | Enabled | Speeds up follow-up questions |
| KV Cache Type | Memory precision trade-off | f16 | Change to q8_0 or q4_0 if running out of memory |
| mmap | Load models more efficiently | Enabled | Helps with large models |
| Context Shift | Handle very long conversations | Disabled | Enable for very long chats or multiple tool calls |

KV Cache Types Explained

  • f16: Most stable, uses more memory
  • q8_0: Balanced memory usage and quality
  • q4_0: Uses least memory, slight quality loss
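
To get a feel for the trade-off, here's a rough back-of-the-envelope estimate of KV cache size. The model dimensions below are illustrative (roughly an 8B-class model), and the bytes-per-value figures ignore the small block overhead of the quantized formats:

```python
# Rough KV cache size:
#   2 (keys + values) * layers * context * KV heads * head dim * bytes per value
# Illustrative dimensions; real models vary.
n_layers, n_kv_heads, head_dim = 32, 8, 128
n_ctx = 8192  # context length in tokens

bytes_per_value = {"f16": 2.0, "q8_0": 1.0, "q4_0": 0.5}  # approximate

for cache_type, b in bytes_per_value.items():
    size = 2 * n_layers * n_ctx * n_kv_heads * head_dim * b
    print(f"{cache_type}: ~{size / 1024**3:.2f} GiB")

# Prints roughly 1.00 GiB for f16, 0.50 GiB for q8_0, and 0.25 GiB for q4_0:
# each step down halves the cache memory at some cost in precision.
```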

Troubleshooting Common Issues

Models won't load:

  • Try a different backend (switch from CUDA to CPU or vice versa)
  • Check if you have enough RAM/VRAM
  • Update to latest engine version

Very slow performance:

  • Make sure you're using GPU acceleration (CUDA/Metal/Vulkan backend)
  • Increase GPU Layers in model settings
  • Close other memory-intensive programs

Out of memory errors:

  • Reduce Context Size in model settings
  • Switch KV Cache Type to q8_0 or q4_0
  • Try a smaller model variant

Random crashes:

  • Switch to a more stable backend (try avx instead of avx2)
  • Disable overclocking if you have it enabled
  • Update graphics drivers

Quick Setup Guide

For most users:

  1. Use the default backend that Jan installs
  2. Leave all performance settings at defaults
  3. Only adjust if you experience problems

If you have an NVIDIA graphics card:

  1. Download the appropriate CUDA backend
  2. Make sure GPU Layers is set high in model settings
  3. Enable Flash Attention

If models are too slow:

  1. Check you're using GPU acceleration
  2. Try enabling Continuous Batching
  3. Close other applications using memory

If running out of memory:

  1. Change KV Cache Type to q8_0
  2. Reduce Context Size in model settings
  3. Try a smaller model

Most users can run Jan successfully without changing any of these settings. The defaults are chosen to work well on typical hardware.