
llama.cpp (Default)

Overview

Jan ships with a default C++ inference server built on top of llama.cpp. The server provides an OpenAI-compatible API, request queuing, scaling, and additional features on top of llama.cpp's wide range of capabilities.
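
Because the server is OpenAI-compatible, requests use the standard chat-completions format. The snippet below is a minimal sketch of a request body only; the model value is assumed to match the id field of an installed model (here, the TinyLlama example used later in this guide), and the host, port, and endpoint path depend on your local server settings:

{
  "model": "tinyllama-1.1b",
  "messages": [
    { "role": "system", "content": "You are a helpful assistant." },
    { "role": "user", "content": "Hello, what can you do?" }
  ],
  "temperature": 0.7,
  "stream": true
}

Standard sampling parameters such as temperature, top_p, and max_tokens can typically be overridden per request in this body; otherwise the defaults from the model's model.json apply.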

llama.cpp Extension

This guide shows you how to modify your engine's behavior by adjusting its settings in the model.json file.

Prerequisites

  • Make sure you're using an Intel-based Mac. For a complete list of supported Intel CPUs, please see here.
  • On Intel-based Macs, smaller models are recommended.

Intel-based Macs run inference on the CPU by default; no GPU acceleration option is available.

GPU Acceleration Options

Enable GPU acceleration in the Jan application by following the Installation Setup guide.
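
Once GPU acceleration is enabled, the ngl setting in a model's model.json (described in the parameter table under Step 2) controls how many layers are offloaded to the GPU. Below is a minimal sketch of a settings block with full offload, based on the TinyLlama example used later in this guide; the values are illustrative, not recommendations:

"settings": {
  "ctx_len": 4096,
  "ngl": 100,
  "prompt_template": "<|system|>\n{system_message}<|user|>\n{prompt}<|assistant|>",
  "llama_model_path": "tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf"
}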

Step-by-step Guide

Step 1: Open the model.json

  1. Navigate to Threads.
  2. Click the three dots (⋮) icon next to the Model.
  3. Select Edit Global Defaults for the <model_name>.
  4. This opens the model.json file. For example, the model.json of TinyLlama Chat 1.1B Q4 is shown below:

{
  "sources": [
    {
      "filename": "tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf",
      "url": "https://huggingface.co/TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF/resolve/main/tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf"
    }
  ],
  "id": "tinyllama-1.1b",
  "object": "model",
  "name": "TinyLlama Chat 1.1B Q4",
  "version": "1.0",
  "description": "TinyLlama is a tiny model with only 1.1B. It's a good model for less powerful computers.",
  "format": "gguf",
  "settings": {
    "ctx_len": 4096,
    "prompt_template": "<|system|>\n{system_message}<|user|>\n{prompt}<|assistant|>",
    "llama_model_path": "tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf"
  },
  "parameters": {
    "temperature": 0.7,
    "top_p": 0.95,
    "stream": true,
    "max_tokens": 2048,
    "stop": [],
    "frequency_penalty": 0,
    "presence_penalty": 0
  },
  "metadata": {
    "author": "TinyLlama",
    "tags": [
      "Tiny",
      "Foundation Model"
    ],
    "size": 669000000
  },
  "engine": "nitro"
}

Step 2: Modify the model.json

  1. Modify the model's engine settings under the settings object. You can adjust the following parameters (an illustrative edit is shown after this list):
Parameter | Type | Description
ctx_len | Integer | Sets the context length available to the model. The default value is 2048 (Minimum: 1, Maximum: 4096).
prompt_template | String | Defines the template used to format prompts.
llama_model_path | String | Specifies the path to the model's .gguf file.
ngl | Integer | Determines how many layers are offloaded to the GPU. The default value is 100.
cpu_threads | Integer | Determines the number of CPU threads used for inference (the maximum is limited by your hardware and OS).
cont_batching | Integer | Controls continuous batching, which improves throughput for LLM inference.
embedding | Integer | Enables embeddings for tasks like document-enhanced chat in RAG-based applications.
  2. Save the model.json file.
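
For example, to reduce memory use on a less powerful machine, you might lower the context length and cap the number of CPU threads. The following is an illustrative edit of the TinyLlama settings block shown above; the values are examples, not recommendations:

"settings": {
  "ctx_len": 2048,
  "prompt_template": "<|system|>\n{system_message}<|user|>\n{prompt}<|assistant|>",
  "llama_model_path": "tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf",
  "cpu_threads": 4
}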

These settings apply only to the currently selected model. If you use a different model, you must configure it separately.

Step 3: Start the Model

  1. Restart the Jan application to apply your settings.
  2. Navigate to Threads.
  3. Chat with your model.

If you have questions, please join our Discord community for support, updates, and discussions.