
TensorRT-LLM

Overview

Users with Nvidia GPUs can get 20-40% faster token generation than with the llama.cpp engine on their laptops or desktops by using TensorRT-LLM. More significantly, TensorRT-LLM runs models in FP16, which is more accurate than quantized models.


This guide walks you through installing Jan's official TensorRT-LLM Extension. This extension uses Nitro-TensorRT-LLM as the AI engine instead of the default Nitro-Llama-CPP. It includes an efficient C++ server that natively executes the TRT-LLM C++ runtime, along with features and performance improvements such as OpenAI compatibility, tokenizer improvements, and queues.

  • This feature is currently only available for Windows users; Linux support is coming soon.

  • Additionally, we have prebuilt only a few demo models. You can always build your desired models directly on your machine. For more information, please see here.

Prerequisites

  • A Windows PC
  • Nvidia GPU(s): Ada or Ampere series (i.e. RTX 40- and 30-series cards). More will be supported soon.
  • 3GB+ of disk space to download TRT-LLM artifacts and a Nitro binary
  • Jan v0.4.9+ or Jan v0.4.8-321+ (nightly)
  • Nvidia Driver v535+ (for an installation guide, please see here)
  • CUDA Toolkit v12.2+ (for an installation guide, please see here); you can verify both versions with the snippet below
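
To confirm your machine meets the driver and CUDA requirements, you can check the installed versions from a terminal:

nvidia-smi --query-gpu=driver_version --format=csv,noheader
# The reported driver version should be v535 or newer
nvcc --version
# The reported CUDA Toolkit release should be v12.2 or newer
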
⚠️ If you are using our nightly builds, you may have to reinstall the TensorRT-LLM extension each time you update the app. We're working on better extension lifecycles - stay tuned.

Step 1: Install the TensorRT-LLM Extension

  1. Go to Settings > Extensions.
  2. Click Install next to the TensorRT-LLM Extension.
  3. Check that the files were downloaded correctly (from PowerShell):

ls ~/jan/extensions/@janhq/tensorrt-llm-extension/dist/bin
# Your extension folder should now include `nitro.exe`, among other artifacts needed to run TRT-LLM

Step 2: Download a Compatible Model

TensorRT-LLM can only run models in TensorRT format. These models, known as "TensorRT engines", are prebuilt for each target OS and GPU architecture.
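
Since engines are architecture-specific, it helps to confirm which GPU you have before downloading an engine built for it:

# Print the GPU model name; RTX 40-series cards are Ada, RTX 30-series are Ampere
nvidia-smi --query-gpu=name --format=csv,noheader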

We offer a handful of precompiled models for Ampere and Ada cards that you can immediately download and play with:

  1. Restart the application and go to the Hub.
  2. Look for models with the TensorRT-LLM label in the recommended models list and click Download.

This step might take some time. 🙏


  3. Click Use and start chatting!
  4. You may need to allow Nitro through your network firewall when prompted; the snippet below shows a quick way to verify the server is reachable.
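
Once a model is loaded, you can sanity-check that the local server responds. This is a sketch, not an official check: Nitro's default port of 3928 and the OpenAI-compatible /v1/chat/completions path are assumptions here, so adjust for your setup. In PowerShell, call curl.exe explicitly so the command is not intercepted by the Invoke-WebRequest alias:

# Hypothetical smoke test; port 3928 and the endpoint path are assumed defaults
curl.exe http://localhost:3928/v1/chat/completions -H "Content-Type: application/json" -d '{"messages":[{"role":"user","content":"Hello!"}]}'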


Step 3: Configure Settings

You can customize the default parameters for how Jan runs TensorRT-LLM.
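
The exact parameters vary by model. Each downloaded model ships with a model.json describing its defaults; this assumes the standard ~/jan data folder, and the folder name below is a hypothetical placeholder for whichever model you downloaded:

# List downloaded models, then inspect one model's default parameters
ls ~/jan/models
cat ~/jan/models/<model-folder>/model.json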

Troubleshooting

Incompatible Extension and Engine Versions

For now, model (engine) versions are pinned to extension versions, so a model built for one extension version may not load with another. If a model stops loading after an update, try re-downloading it.
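
If you suspect a mismatch, you can check which extension version is installed. This assumes the extension path from Step 1 and that the extension ships a standard package.json:

# Show the installed extension's manifest, which includes its version
cat ~/jan/extensions/@janhq/tensorrt-llm-extension/package.json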

Uninstall Extension

To uninstall the extension, follow the steps below:

  1. Quit the app.
  2. Go to Settings > Extensions.
  3. Delete the entire extensions folder (by default, ~/jan/extensions; see the snippet after this list).
  4. Reopen the app. Only the default extensions should be restored.
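
If you prefer the command line, the folder can be removed directly. Make sure Jan is fully closed first; this assumes the default Jan data folder:

# With Jan closed: remove the extensions folder (PowerShell's rm alias also works)
rm -r ~/jan/extensions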

Install Nitro-TensorRT-LLM Manually

You can reference the source code to manually build the artifacts needed to run the server and TensorRT-LLM. Read here.