Cortex
info
Cortex is a Local AI API Platform that is used to run and customize LLMs.
Key Features:
- Straightforward CLI (inspired by Ollama)
- Full C++ implementation, packageable into Desktop and Mobile apps
- Pull from Huggingface, or Cortex Built-in Model Library
- Models stored in universal file formats (vs blobs)
- Swappable Inference Backends (default:
llamacpp
, future:ONNXRuntime
,TensorRT-LLM
) - Cortex can be deployed as a standalone API server, or integrated into apps like Jan.ai
Cortex's roadmap is to implement the full OpenAI API including Tools, Runs, Multi-modal and Realtime APIs.
Inference Backends
- Default: llama.cpp: cross-platform, supports most laptops, desktops and OSes
- Future: ONNX Runtime: supports Windows Copilot+ PCs & NPUs
- Future: TensorRT-LLM: supports Nvidia GPUs
If GPU hardware is available, Cortex is GPU accelerated by default.
Models
Cortex.cpp allows users to pull models from multiple Model Hubs, offering flexibility and extensive model access.
Note: As a very general guide: You should have >8 GB of RAM available to run the 7B models, 16 GB to run the 14B models, and 32 GB to run the 32B models.
Cortex Built-in Models & Quantizations
Model /Engine | llama.cpp | Command |
---|---|---|
phi-3.5 | ✅ | cortex run phi3.5 |
llama3.2 | ✅ | cortex run llama3.2 |
llama3.1 | ✅ | cortex run llama3.1 |
codestral | ✅ | cortex run codestral |
gemma2 | ✅ | cortex run gemma2 |
mistral | ✅ | cortex run mistral |
ministral | ✅ | cortex run ministral |
qwen2 | ✅ | cortex run qwen2.5 |
openhermes-2.5 | ✅ | cortex run openhermes-2.5 |
tinyllama | ✅ | cortex run tinyllama |
View all Cortex Built-in Models.
Cortex supports multiple quantizations for each model.