llama.cpp

Overview

Jan ships with a default C++ inference server built on top of llama.cpp. The server provides an OpenAI-compatible API, request queuing, and scaling on top of llama.cpp's capabilities.
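Because the server exposes an OpenAI-compatible API, any OpenAI-style client can talk to a model loaded in Jan. Below is a minimal sketch using Python's requests library; the port (1337), endpoint path, and model ID are assumptions that depend on your local API server settings and installed models.

# Minimal sketch: send a chat completion request to the local
# OpenAI-compatible server. Port, path, and model ID are assumptions;
# adjust them to your own setup.
import requests

response = requests.post(
    "http://localhost:1337/v1/chat/completions",
    json={
        "model": "mistral-ins-7b-q4",  # assumed model ID; use one installed in Jan
        "messages": [
            {"role": "user", "content": "Hello! What can you do?"},
        ],
    },
    timeout=120,
)
response.raise_for_status()
print(response.json()["choices"][0]["message"]["content"])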

llama.cpp Extension

Nitro is the default AI engine in Jan and is downloaded automatically; no additional setup is needed.

This guide walks you through customizing your engine settings by configuring the nitro.json file.

Prerequisites

  • Make sure you're using an Intel-based Mac. For a complete list of supported Intel CPUs, please see here.
  • On Intel-based Macs, smaller models are recommended.

Intel-based Macs run inference on the CPU by default; no GPU acceleration option is available.

CPU Instruction

The prebuilt Nitro binaries for the llama.cpp inference engine must match the instruction set your CPU supports. AVX2 is the default CPU instruction set in Jan and is available on most recent Intel and AMD processors. However, older CPUs that only support AVX, or CPUs that support AVX-512, may require a build other than the default.

The CPU instruction configuration is only applicable to the Windows and Linux versions.

To update the CPU instruction set Jan uses (for example, to AVX), follow these steps:

  1. Check which instruction sets your CPU supports (see the sketch after these steps). For AVX, please see the list of supported CPUs here.
  2. Navigate to your Jan data folder, usually found at C:/Users/<name>/jan (Windows) or ~/jan (Linux).
  3. Download the appropriate Nitro version for your CPU, GPU, and system specifications from here.
  4. Unzip the downloaded Nitro file.
  5. Move the new Nitro files to the correct location in your Jan folder. Make sure they are not nested in another folder.
  6. Restart the Jan app.
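
As a quick way to perform step 1, you can read the CPU flags yourself. The sketch below checks /proc/cpuinfo on Linux and falls back to the third-party py-cpuinfo package elsewhere; the fallback package is an assumption about your environment.

# Minimal sketch: report whether the CPU advertises AVX, AVX2, or AVX-512.
# On Linux this reads /proc/cpuinfo; elsewhere it falls back to the
# third-party py-cpuinfo package (pip install py-cpuinfo).
from pathlib import Path

flags = set()
cpuinfo_path = Path("/proc/cpuinfo")
if cpuinfo_path.exists():  # Linux
    for line in cpuinfo_path.read_text().splitlines():
        if line.startswith("flags"):
            flags = set(line.split(":", 1)[1].split())
            break
else:  # Windows (or other platforms)
    import cpuinfo  # py-cpuinfo
    flags = set(cpuinfo.get_cpu_info().get("flags", []))

for name in ("avx", "avx2", "avx512f"):
    print(f"{name}: {'yes' if name in flags else 'no'}")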

GPU Acceleration Options

Enable the GPU acceleration option available within the Jan application by following the steps in the Hardware Setup guide.

Step-by-step Guide

  1. Navigate to the App Settings > Advanced > Open App Directory > ~/jan/engines folder.
  2. Modify the nitro.json file based on your needs. The default settings are shown below.

You can also edit the model.json file in the model folder; its settings override those in nitro.json.


{
  "ctx_len": 2048,
  "ngl": 100,
  "cpu_threads": 1,
  "cont_batching": false,
  "embedding": false
}
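
If you prefer to script the change instead of editing the file by hand, the sketch below sets cpu_threads to the number of logical cores reported by the operating system and writes nitro.json back; the file path assumes the default Jan data folder and should be adjusted if yours differs.

# Minimal sketch: set cpu_threads in nitro.json to the machine's logical
# core count. The path assumes the default Jan data folder (~/jan).
import json
import os
from pathlib import Path

config_path = Path.home() / "jan" / "engines" / "nitro.json"

config = json.loads(config_path.read_text())
config["cpu_threads"] = os.cpu_count() or 1  # fall back to 1 if undetectable
config_path.write_text(json.dumps(config, indent=2))

print(f"Updated {config_path}: cpu_threads = {config['cpu_threads']}")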

The table below describes the parameters in the nitro.json file.

Parameter | Type | Description
--------- | ---- | -----------
ctx_len | Integer | The context length for model operations. The default of 2048 provides ample context, comparable to models like GPT-3.5. (Minimum: 1, Maximum: 4096)
ngl | Integer | The number of model layers offloaded to the GPU. Defaults to 100, which offloads all layers.
cpu_threads | Integer | The number of CPU threads used for inference; the maximum depends on your hardware and OS.
cont_batching | Boolean | Enables continuous batching, which improves throughput for LLM inference.
embedding | Boolean | Enables embedding output for tasks like document-enhanced chat in RAG-based applications.

  • By default, ngl is set to 100, which offloads all model layers to the GPU. To offload only about half of the layers, set ngl to around 15, since most Mistral and Llama models have roughly 30 layers.
  • To use the embedding feature, include the JSON parameter "embedding": true. This enables Nitro to process inference requests with embedding capabilities (see the sketch below). For a more detailed explanation, refer to Embedding in the Nitro documentation.
  • To use continuous batching, which boosts throughput and reduces latency for large language model (LLM) inference, include "cont_batching": true. For details, refer to Continuous Batching in the Nitro documentation.
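
For example, once "embedding": true is set and the engine restarted, an OpenAI-style embeddings request against the local server might look like the sketch below; the port, endpoint path, and model ID are assumptions that depend on your setup.

# Minimal sketch: request an embedding from the local OpenAI-compatible
# server. Requires "embedding": true in nitro.json. Port, path, and model
# ID are assumptions; adjust them to your own setup.
import requests

response = requests.post(
    "http://localhost:1337/v1/embeddings",
    json={
        "model": "mistral-ins-7b-q4",  # assumed model ID
        "input": "Jan runs large language models locally.",
    },
    timeout=60,
)
response.raise_for_status()
embedding = response.json()["data"][0]["embedding"]
print(f"Embedding length: {len(embedding)}")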

If you have questions, please join our Discord community for support, updates, and discussions.