Local AI Engine (llama.cpp)

llama.cpp is the engine that runs AI models locally on your computer. It's what makes Jan work without needing internet or cloud services.

Accessing Engine Settings

Find llama.cpp settings at Settings > Local Engine > llama.cpp:

Most users don't need to change these settings. Jan picks good defaults for your hardware automatically.

When to Adjust Settings

You might need to modify these settings if:

  • Models load slowly or don't work
  • You've installed new hardware (like a graphics card)
  • You want to optimize performance for your specific setup

Engine Management

| Feature | What It Does | When You Need It |
|---|---|---|
| Engine Version | Shows current llama.cpp version | Check compatibility with newer models |
| Check Updates | Downloads engine updates | When new models require an updated engine |
| Backend Selection | Choose a hardware-optimized version | After hardware changes or performance issues |

Hardware Backends

Different backends are optimized for different hardware. Pick the one that matches your computer (a quick way to check your CPU's instruction-set support is sketched after the lists below):

NVIDIA Graphics Cards (Fastest)

For CUDA 12.0:

  • llama.cpp-avx2-cuda-12-0 (most common)
  • llama.cpp-avx512-cuda-12-0 (CPUs with AVX-512 support)

For CUDA 11.7:

  • llama.cpp-avx2-cuda-11-7 (older drivers)

CPU Only

  • llama.cpp-avx2 (modern CPUs)
  • llama.cpp-avx (older CPUs)
  • llama.cpp-noavx (very old CPUs)

Other Graphics Cards

  • llama.cpp-vulkan (AMD, Intel Arc)
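
Not sure which instruction sets your CPU supports? On Linux you can check directly. A minimal sketch that reads /proc/cpuinfo (Linux-only; on Windows or macOS use a tool like CPU-Z or `sysctl -a` instead):

```python
# Check CPU instruction-set flags on Linux to help pick a llama.cpp backend.
from pathlib import Path

flags = set()
for line in Path("/proc/cpuinfo").read_text().splitlines():
    if line.startswith("flags"):
        flags = set(line.split(":", 1)[1].split())
        break

# Test the strongest instruction set first.
for feature, hint in [
    ("avx512f", "an avx512 build"),
    ("avx2", "an avx2 build"),
    ("avx", "the avx build"),
]:
    if feature in flags:
        print(f"CPU supports {feature}: use {hint}")
        break
else:
    print("No AVX support detected: use the noavx build")
```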

Performance Settings

| Setting | What It Does | Recommended | Impact |
|---|---|---|---|
| Continuous Batching | Handles multiple requests simultaneously | Enabled | Faster when using tools or multiple chats |
| Parallel Operations | Number of concurrent requests | 4 | Higher = more multitasking, uses more memory |
| CPU Threads | Processor cores to use | Auto | More threads can speed up CPU processing |
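
The easiest way to see Continuous Batching and Parallel Operations at work is to send several requests to Jan's local API server at once. A minimal sketch, assuming the server is running with its OpenAI-compatible API enabled; the URL, port, and model id below are placeholders, so adjust them to your setup:

```python
# Send several chat requests concurrently; with Continuous Batching enabled,
# total wall-clock time should be well under 4x the single-request time.
import concurrent.futures
import time

import requests

URL = "http://localhost:1337/v1/chat/completions"  # assumed address; adjust
MODEL = "llama3.2-1b-instruct"                     # hypothetical model id

def ask(question: str) -> str:
    resp = requests.post(URL, json={
        "model": MODEL,
        "messages": [{"role": "user", "content": question}],
        "max_tokens": 64,
    }, timeout=120)
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]

questions = ["What is AVX2?", "What is a KV cache?",
             "What is Vulkan?", "What is mmap?"]
start = time.time()
with concurrent.futures.ThreadPoolExecutor(max_workers=4) as pool:
    for answer in pool.map(ask, questions):
        print(answer[:60].replace("\n", " "), "...")
print(f"4 concurrent requests took {time.time() - start:.1f}s")
```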

Memory Settings

| Setting | What It Does | Recommended | When to Change |
|---|---|---|---|
| Flash Attention | Efficient memory usage | Enabled | Leave enabled unless problems occur |
| Caching | Remembers recent conversations | Enabled | Speeds up follow-up questions |
| KV Cache Type | Memory vs. quality trade-off | f16 | Change to q8_0 if low on memory |
| mmap | Efficient model loading | Enabled | Helps with large models |
| Context Shift | Handles very long conversations | Disabled | Enable for very long chats |

Memory Options Explained

  • f16: Best quality, uses more memory
  • q8_0: Balanced memory and quality
  • q4_0: Least memory, slight quality reduction
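
A back-of-the-envelope calculation shows why this setting matters. The sketch below estimates KV cache size for an illustrative 8B-class model with grouped-query attention; the layer count, head count, and head size are assumptions (check your model's card for real values), and the small per-block scale overhead of the quantized formats is ignored:

```python
# Estimate KV cache memory for one model/context combination.
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_len, bytes_per_value):
    # 2x for the separate Key and Value tensors in every layer
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_value

# Illustrative dimensions, roughly an 8B Llama-style model.
for name, bpv in [("f16", 2.0), ("q8_0", 1.0), ("q4_0", 0.5)]:
    size = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128,
                          context_len=8192, bytes_per_value=bpv)
    print(f"{name}: ~{size / 1024**3:.2f} GiB at 8k context")
# f16: ~1.00 GiB, q8_0: ~0.50 GiB, q4_0: ~0.25 GiB
```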

Quick Troubleshooting

Models won't load:

  • Try a different backend
  • Check available RAM/VRAM (a quick RAM check is sketched below)
  • Update engine version
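
For the RAM check, a small script can compare free memory against the model file before you try loading it. A minimal sketch assuming `psutil` is installed (`pip install psutil`); the model path is a placeholder:

```python
# Compare free RAM against a GGUF model file before trying to load it.
from pathlib import Path

import psutil

model_path = Path("~/models/my-model.gguf").expanduser()  # hypothetical path
available = psutil.virtual_memory().available
model_size = model_path.stat().st_size

print(f"Model: {model_size / 1024**3:.1f} GiB, "
      f"free RAM: {available / 1024**3:.1f} GiB")
if model_size * 1.2 > available:  # rough headroom for KV cache and buffers
    print("Probably too large to load fully into RAM; "
          "try a smaller quantization or offload layers to the GPU.")
```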

Slow performance:

  • Verify GPU acceleration is active (a quick check is sketched below)
  • Close memory-intensive applications
  • Increase GPU Layers in model settings
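
To verify GPU acceleration is active, query the GPU while a model is generating. The sketch below is NVIDIA-only, assumes `nvidia-smi` is on your PATH, and reads only the first GPU; near-zero VRAM usage with a model loaded usually means inference is running on the CPU:

```python
# Ask nvidia-smi for utilization and VRAM usage (first GPU only).
import subprocess

out = subprocess.run(
    ["nvidia-smi",
     "--query-gpu=utilization.gpu,memory.used,memory.total",
     "--format=csv,noheader,nounits"],
    capture_output=True, text=True, check=True,
).stdout.strip().splitlines()[0]

util, used, total = (int(x) for x in out.split(", "))
print(f"GPU utilization: {util}%, VRAM: {used}/{total} MiB")
if used < 1024:
    print("Very little VRAM in use: the model may be on the CPU; "
          "check the GPU Layers setting.")
```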

Out of memory:

  • Change KV Cache Type to q8_0
  • Reduce Context Size in model settings
  • Try a smaller model

Crashes or errors:

  • Switch to a more stable backend (avx instead of avx2)
  • Update graphics drivers
  • Check system temperature

Quick Setup Guide

Most users:

  1. Use default settings
  2. Only change if problems occur

NVIDIA GPU users:

  1. Download CUDA backend
  2. Ensure GPU Layers is set high
  3. Enable Flash Attention

Performance optimization:

  1. Enable Continuous Batching
  2. Use appropriate backend for hardware
  3. Monitor memory usage

The default settings work well for most hardware. Only adjust these if you're experiencing specific issues or want to optimize for your particular setup.