Jan v1 for Deep Research: System Prompts & Setup Guide
This cookbook will transform your Jan-V1 from a basic Q&A tool into a comprehensive research assistant. By the end of this guide, you'll have a custom-configured model that generates detailed reports with proper citations instead of surface-level answers.
Key Points
- Jan-V1 includes a default chat template that's automatically embedded in its Hugging Face configuration
- Use the default prompt for daily tasks requiring short, accurate answers
- Use the research prompt for report generation and comprehensive research tasks
- Always specify dates when asking time-sensitive questions (e.g., "What's the world population in 2023?")
Introduction
Jan-V1 is the first release in the Jan Family, designed for agentic reasoning and problem-solving within the Jan App. Based on our Lucy model, Jan-v1 achieves improved performance through model scaling.
Jan-v1 uses the Qwen3-4B-thinking model to provide enhanced reasoning capabilities and tool utilization. This architecture delivers better performance on complex agentic tasks.
This guide explores how to optimize Jan-V1 for research-intensive tasks by switching from the default prompt to a specialized research system prompt. Through our internal testing, we've found significant improvements in report generation quality, depth of analysis, and citation accuracy when using the research-optimized system prompt.
Findings
Jan-V1 comes with a built-in search and scrape prompt in its default template. This prompt is optimized for SimpleQA-style questions and daily tasks that require short, accurate responses. However, our testing revealed that when tasked with generating comprehensive long-form reports or conducting in-depth research, the model's performance was inconsistent with the default prompt.
We experimented with Jan-V1 using a new system prompt optimized for research-intensive tasks and found significant improvements in report generation capability. The research prompt produces more comprehensive, well-structured reports with proper citations and deeper analysis compared to the default prompt.
Output comparison between Default System Prompt and Research Prompt on report generation task
Example 1: Long-Context Benchmark Comparison
User Request:
Compare all published benchmarks for long-context performance across LLMs released in 2024-2025
Default Prompt Response
I've analyzed multiple benchmarks for long-context LLM performance published in 2024-2025 through comprehensive scraping and search results. Here's a structured comparison of the major benchmarks:
1. Loong (2024)
- Purpose: Evaluates long-context understanding in realistic multi-document scenarios where all documents are relevant to the answer (no document can be ignored)
- Key Features:
- Four task types: Spotlight Locating, Comparison, Clustering, Chain of Reasoning
- Context lengths: 10k-250k tokens across 4 sets
- Documents sourced from financial reports, legal cases, and academic papers (2024)
- Notable Findings:
- Top models (e.g., Gemini-1.5-pro) still struggle with ultra-long contexts (performance drops significantly at >100k tokens)
- RAG shows minimal improvement over native models for these tasks
- Human performance on hardest tasks: 53.7% accuracy with 15-minute time limit
2. Ada-LEval (2024)
- Purpose: Length-adaptable benchmark for evaluating LLMs' long-context capabilities across varying context lengths
- Key Features:
- Two subsets: TSort (text sorting) and BestAnswer
- Supports context lengths up to 128k tokens
- Designed to test models in "ultra-long" settings (100k+ tokens)
- Notable Findings:
- Reveals significant gaps in models' capabilities at extreme lengths
- Shows that even top models (e.g., GPT-4o) have limitations in ultra-long contexts
3. LongBench v2 (2025)
- Purpose: Assesses LLMs' ability to handle long-context problems requiring deep understanding across real-world multitasks
- Key Features:
- 503 multiple-choice questions across 6 categories
- Context lengths: 8k-2M words (majority under 128k)
- Human expert validation (53.7% accuracy on hardest questions)
- Notable Findings:
- Best-performing model (o1-preview): 57.7% accuracy (vs. human baseline of 53.7%)
- Clear performance degradation at longer contexts (e.g., 128k+ tokens)
- Leaderboard shows models like Gemini-2.5-Pro and DeepSeek-R1 outperform others
4. Vellum AI LLM Leaderboard (2025)
- Purpose: Comparative benchmark of LLMs' capabilities, price, and context window sizes
- Key Features:
- Includes models from 2024-2025 (e.g., GPT-4o, Gemini-2.5-Pro, Qwen3-235B)
- Focuses on practical metrics like context window size and real-world performance
- Notable Findings:
- Highlights the gap between claimed context window sizes and actual performance
- Shows that most models (even those with "1M" context windows) show performance drops beyond 128k tokens
Cross-Benchmark Comparison Summary
| Benchmark | Release Year | Key Innovation | Context Length Range |
|---|---|---|---|
| Loong | 2024 | No-document-ignored multi-document QA | 10k-250k tokens |
| Ada-LEval | 2024 | Length-adaptive task subsets | Up to 128k tokens |
| LongBench v2 | 2025 | Real-world multitask evaluation | 8k-2M words |
| Vellum Leaderboard | 2025 | Context window vs. real performance | Varies |
Critical Insights from All Benchmarks
- Performance Decline at Scale: All benchmarks show significant performance degradation beyond 100-128k tokens, indicating that current models don't fully leverage their claimed context windows
- Realism Gap: Benchmarks like Loong and LongBench v2 reveal that models struggle with realistic multi-document scenarios (where evidence is scattered across documents) more than synthetic benchmarks suggest
- RAG Limitations: RAG implementations show minimal benefit for these tasks (Loong results indicate RAG often fails to retrieve all relevant documents)
- Model Variability: Larger models (e.g., 235B+ parameter) show better performance but still fall short in ultra-long contexts compared to human capabilities
This comparison demonstrates that while 2024-2025 benchmarks have advanced long-context evaluation, there remains a significant gap between theoretical context window capabilities and practical performance in real-world multi-document scenarios.
Research Prompt Response
Comparative Analysis of Long-Context LLM Benchmarks (2024-2025)
Overview
This report provides a comprehensive comparison of published long-context LLM benchmarks released during 2024-2025. Long-context capabilities have become a critical differentiator among large language models (LLMs), particularly for applications requiring processing of extensive documents, multi-step reasoning across lengthy inputs, and complex information integration. As model context windows have expanded from traditional 128K to 2M+ tokens, the need for rigorous, standardized evaluation frameworks has intensified.
Key Benchmarks Identified
The following four benchmarks represent the most comprehensive and widely cited long-context evaluation frameworks from 2024-2025:
| Benchmark | Release Date | Context Length Range | Primary Focus |
|---|---|---|---|
| LongBench v2 | May 2025 | 8k - 2M words | Real-world long-context understanding |
| MIR-Bench | Feb 2025 | Variable (up to 10k+ shots) | Many-shot in-context inductive reasoning |
| GSM-∞ | Feb 2025 | 0 - 16M+ tokens | Infinitely scalable reasoning complexity |
| Vellum AI Leaderboard 2025 | April 2025 | Up to 2M tokens | Cross-benchmark model comparison |
Key Findings and Implications
Based on the comprehensive analysis of these benchmarks, several important findings emerge:
- Sigmoid performance pattern: Across all benchmarks (MIR-Bench, GSM-∞), LLM performance shows a consistent sigmoid decline as reasoning complexity increases. This suggests fundamental limitations in scaling LLMs for highly complex long-context tasks.
- Context length ≠ performance: While longer context windows correlate with better performance in some benchmarks (Vellum), the relationship is not linear. GSM-∞ shows that beyond a certain point, adding more context does not improve performance proportionally.
- CoT has diminishing returns: MIR-Bench findings indicate that Chain-of-Thought techniques often decrease performance in many-shot inductive reasoning tasks, contrary to their benefits in simpler tasks.
- Noise robustness matters: GSM-∞ demonstrates that models struggle to distinguish relevant information from noise in extremely long contexts, highlighting a critical challenge for real-world applications.
- Real-world relevance: LongBench v2 shows that models perform significantly better on real-world tasks than on synthetic ones, indicating that current benchmarks may not fully capture practical long-context capabilities.
Conclusion
The landscape of long-context LLM benchmarks in 2024-2025 reveals both progress and persistent challenges. While benchmarks like LongBench v2, MIR-Bench, GSM-∞, and the Vellum Leaderboard provide increasingly sophisticated ways to evaluate long-context capabilities, significant gaps remain in understanding how LLMs scale for highly complex, real-world tasks. Future work should focus on:
- Developing benchmarks that better represent the full spectrum of real-world long-context challenges
- Creating evaluation frameworks that measure both context length and information density
- Establishing standardized methods to assess noise robustness and inductive reasoning capabilities
These benchmarks collectively provide the foundation for continued progress in long-context LLM development, with their findings guiding researchers and practitioners toward more effective model design and evaluation.
References
[1] Bai, Y., Tu, S., Zhang, J., et al. (2025). LongBench v2: Towards Deeper Understanding and Reasoning on Realistic Long-context Multitasks. https://longbench2.github.io/
[2] Yan, K., Chen, Z., & Tian, Y. (2025). MIR-Bench: Benchmarking LLM's Long-Context Intelligence via Many-Shot In-Context Inductive Reasoning. arXiv preprint arXiv:2502.09933.
[3] Liu, H., Chen, Z., & Tian, Y. (2025). GSM-∞: How Do Your LLMs Behave over Infinitely Increasing Context Length and Reasoning Complexity? arXiv preprint arXiv:2502.05252.
[4] Vellum AI. (2025). LLM Leaderboard 2025. https://www.vellum.ai/llm-leaderboard
Example 2: AI Model Comparison
User Request:
Explain the difference between different AI models: 1) Jan-Nano 2) Lucy 3) Jan-V1
Default Prompt Response
The differences between Jan-Nano, Lucy, and Jan-V1 can be summarized as follows based on available information:
- Jan-Nano
  - A 4-billion parameter model developed by Menlo Research for deep research tasks.
  - Achieves 83.2% accuracy on the SimpleQA benchmark (per arXiv technical report).
  - Optimized for local deployment with tools like MCP (Multi-Channel Protocol) for web search.
  - Has variants like Jan-Nano-128k, which features a native 128k context window for improved research capabilities.
- Lucy
  - A 1.7-billion parameter model created by Menlo Research as a lightweight, mobile-friendly variant.
  - Designed for agentic web search and lightweight browsing (e.g., using tools like Serper and Crawl4AI).
  - Benchmarked at 78.3% accuracy on SimpleQA with MCP, slightly below Jan-Nano (80.7%), but optimized for mobile/CPU efficiency.
  - Built on Qwen3-1.7B with RLVR (Reinforcement Learning with Value Functions) and "machine-generated task vectors" to enhance reasoning.
- Jan-V1
  - A 4-billion parameter model from Menlo Research's Jan Family, explicitly built based on Lucy.
  - Achieves 91.1% accuracy on SimpleQA, outperforming Jan-Nano by ~8 percentage points.
  - Focuses on agentic reasoning and problem-solving via enhanced tool integration (e.g., for complex tasks).
  - Targets seamless integration with the Jan App for local, high-performance research workflows.
Key Distinctions:
- Size & Use Case: Lucy is the smallest (1.7B) and mobile-optimized; Jan-Nano and Jan-V1 are both 4B but Jan-V1 is a scaled-up version of Lucy for higher performance.
- Performance: Jan-V1 > Jan-Nano > Lucy in SimpleQA benchmarks (91.1% → 83.2% → 78.3%).
- Technical Approach: Lucy uses RLVR with task vectors for efficiency; Jan-V1 inherits this but scales it for better accuracy and tool utilization.
- Context Window: Jan-Nano-128k offers a larger context window (128k) than base Jan-Nano, but Lucy is designed for mobile constraints.
This hierarchy reflects Menlo Research's progression from lightweight mobile models (Lucy) to scalable research tools (Jan-Nano and Jan-V1).
Research Prompt Response
Differences Between Jan-Nano, Lucy, and Jan-V1 AI Models: A Comparative Analysis
Overview
This report provides a detailed comparison of three AI language models: Jan-Nano, Lucy, and Jan-V1. All three are part of the broader Jan AI ecosystem, though they serve distinct purposes and have different technical specifications. The analysis synthesizes information from authoritative sources to clarify their architectures, capabilities, intended use cases, and relationships.
Key Findings
1. Jan-Nano
- Parameters: 4 billion (4B)
- Primary Purpose: Deep research tasks
- Notable Feature: Native 128k context window in its enhanced version (Jan-Nano 128k)
2. Lucy
- Parameters: 1.7 billion (1.7B)
- Primary Purpose: Mobile-capable search and lightweight browsing tasks
- Notable Feature: Designed as a "toy model" for search and browsing with minimal resource requirements
3. Jan-V1
- Parameters: 4 billion (4B)
- Primary Purpose: Agentic reasoning and problem-solving via web search and tool invocation
- Notable Feature: Open-source alternative to Perplexity Pro with 91% SimpleQA accuracy
Technical and Functional Differences
| Model | Parameter Count | Specialized For | Key Differentiator |
|---|---|---|---|
| Jan-Nano | 4B | Deep research tasks | 128k context window (enhanced version) |
| Lucy | 1.7B | Mobile search and lightweight browsing | Ultra-low resource footprint; "toy model" design |
| Jan-V1 | 4B | Agentic reasoning & web search | Built on Lucy; 91% SimpleQA accuracy; open-source |
Performance and Contextual Capabilities
- Jan-Nano's 128k context window allows it to process longer documents or complex queries than standard models.
- Lucy's 1.7B size makes it significantly lighter than Jan-Nano or Jan-V1, enabling deployment on mobile devices.
- Jan-V1's 91% SimpleQA accuracy (from benchmarks) demonstrates strong performance in structured knowledge retrieval tasks.
References
- Jan-Nano: Menlo/Jan-nano on Hugging Face
- Lucy: Reddit discussion on Lucy as a 1.7B model
- Jan-V1: Jan-v1 on Hugging Face
- Jan-V1: Jan's X announcement about Jan-v1
- Jan-V1: Reddit comparison of Jan-v1 with Qwen3 and Perplexity
Example 3: Software Comparison
User Request: "Compare Signal, Telegram, WhatsApp for secure messaging"
Default Prompt Response
Security Overview:
- Signal: E2EE by default, open-source
- Telegram: E2EE only in Secret Chats
- WhatsApp: E2EE but Meta data collection
- Basic privacy comparison
- 200 words, single citation
Strengths: Clear recommendations
Weaknesses: Limited analysis depth
Research Prompt Response
Comprehensive Security Analysis:
- Full report with methodology and detailed comparison table
- Academic research citations (IACR 2023 study)
- Privacy practices analysis with breach history
- User experience and practical considerations
- Final recommendations table by use case
- 1000+ words, 5 authoritative references
Strengths: Evidence-based analysis, academic rigor
Result: Professional security assessment format
How to use Research Prompt with Jan-V1
Jan-V1 comes with an embedded chat template that automatically applies a default system prompt, so every user is opted in to it by default. To optimize for research tasks, you need to extend this default prompt in the Jan app settings.
Step 1: Access assistant settings
Navigate to the Jan app and access the assistant settings icon ⚙️ on the top left of the screen.
When you open the assistant settings, you'll notice that the system prompt field appears empty. This is expected: Jan-V1's default system prompt is embedded directly in the chat template rather than displayed in the Jan app's UI. The empty field doesn't mean there is no system prompt; it simply isn't visible in the interface.
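If you want to confirm this yourself, one way is to render the chat template outside the Jan app. The sketch below uses the Hugging Face transformers tokenizer; the repository id is an assumption, so substitute the Jan-v1 checkpoint you actually use.

```python
# A minimal sketch for inspecting the default system prompt that ships inside
# Jan-v1's chat template. The repo id below is an assumption; replace it with
# the Jan-v1 checkpoint you downloaded.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("janhq/Jan-v1-4B")

# Render a trivial conversation without an explicit system message. If the
# default prompt is embedded in the template, it shows up in the rendered text
# even though the Jan app's system prompt field looks empty.
messages = [{"role": "user", "content": "What is the capital of France?"}]
rendered = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
print(rendered)
```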
Step 2: Understanding the Default System Prompt
Before switching to the research prompt, it's helpful to understand what the default Jan-V1 system prompt provides:
In this environment you have access to a set of tools you can use to answer the user's question. You can use one tool per message, and will receive the result of that tool use in the user's response. You use tools step-by-step to accomplish a given task, with each tool use informed by the result of the previous tool use.

Tool Use Rules

Here are the rules you should always follow to solve your task:
1. Always use the right arguments for the tools. Never use variable names as the action arguments, use the value instead.
2. Call a tool only when needed: do not call the search agent if you do not need information, try to solve the task yourself.
3. If no tool call is needed, just answer the question directly.
4. Never re-do a tool call that you previously did with the exact same parameters.
5. For tool use, MARK SURE use XML tag format as shown in the examples above. Do not use any other format.

Now Begin! If you solve the task correctly, you will receive a reward of $1,000,000.
Step 3: Implementing the Research Prompt
To switch to the research-optimized prompt, replace the default system prompt with the following research template:
You are a **research agent** designed to conduct **in-depth, methodical investigations** into user questions. Your goal is to produce a **comprehensive, well-structured, and accurately cited report** using **authoritative sources**. You will use available tools to gather detailed information, analyze it, and synthesize a final response.

### **Tool Use Rules (Strictly Enforced)**

1. **Use correct arguments**: Always use actual values — never pass variable names (e.g., use "Paris" not {city}).
2. **Call tools only when necessary**: If you can answer from prior results, do so — **do not search unnecessarily**. However, All cited **url in the report must be visited**, and all **entities (People, Organization, Location, etc.) mentioned on the report must be searched/visited**.
3. **Terminate When Full Coverage Is Achieved**: Conclude tool usage and deliver a final response only when the investigation has achieved **comprehensive coverage** of the topic. This means not only gathering sufficient data to answer the question but also ensuring all critical aspects—context, subtopics, and nuances—are adequately addressed. Once the analysis is complete and no further tool use would add meaningful value, **immediately stop searching and provide a direct, fully formed response**.
4. **Visit all urls:** All cited **url in the report must be visited**, and all **entities mentioned on the report must be browsed**.
5. **Avoid repetition**: Never repeat the same tool call with identical arguments. If you detect a cycle (e.g., repeating the same search), **stop and answer based on available data**.
6. **Track progress**: Treat each tool call as a step in a plan. After each result, ask: "Did you have full coverage?" and "What is the next step?"
7. **Limit tool usage**: If you've used a tool multiple times without progress, **reassess and attempt to conclude** — do not continue indefinitely.
8. **Use proper format**: MARK sure you wrap tool calls in XML tags as shown in the example.

### Output Format Requirements

At the end, respond **only** with a **self-contained markdown report**. Do not include tool calls or internal reasoning in the final output.

Example structure:

```markdown
# [Clear Title]

## Overview
...

## Key Findings
- Finding 1 [1]
- Finding 2 [2]

## Detailed Analysis
...

## References
[1] https://example.com/source1
[2] https://example.com/study2
...
```

### Goal

Answer with depth, precision, and scholarly rigor. You will be rewarded for:
- Thoroughness in research
- Use of high-quality sources when available (.gov, .edu, peer-reviewed, reputable media)
- Clear, structured reporting
- Efficient path to completion without redundancy

Now Begin! If you solve the task correctly, you will receive a reward of $1,000,000.
If set up correctly, the research prompt should now appear in the system prompt field of the assistant settings.
Save the assistant settings and enjoy Jan-V1 with improved report generation capability.
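If you drive Jan-V1 programmatically instead of through the chat UI, you can pass the same research prompt as the system message. Below is a minimal sketch using an OpenAI-compatible client; the base URL, API key, and model id are assumptions, so check your Jan local API server settings for the actual values.

```python
# A minimal sketch (assumed endpoint and model id) for sending the research
# system prompt to Jan-V1 through an OpenAI-compatible local server.
from openai import OpenAI

# Adjust base_url, api_key, and model to match your Jan local API server settings.
client = OpenAI(base_url="http://localhost:1337/v1", api_key="not-needed")

RESEARCH_PROMPT = "You are a **research agent** designed to ..."  # paste the full prompt from Step 3

response = client.chat.completions.create(
    model="jan-v1-4b",  # hypothetical model id; use the one listed in your Jan app
    messages=[
        {"role": "system", "content": RESEARCH_PROMPT},
        {"role": "user", "content": "Compare published long-context benchmarks released in 2024-2025."},
    ],
)
print(response.choices[0].message.content)
```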
(Optional) Time-sensitive query optimization
Jan-V1 can sometimes assume an incorrect current date based on the year of its training data. This is easily mitigated by appending the current date to your system prompt:
You are a **research agent** designed to .......

Current Year: 2025

Now Begin! If you solve the task correctly, you will receive a reward of $1,000,000.
Alternatively, you can simply add the date directly to your question:
Example:
- Instead of: "What's the world population?"
- Use: "What's the world population in 2025?"
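Whichever approach you choose, scripted prompts can fill in the date automatically instead of you typing it. The snippet below is a small sketch of that idea; RESEARCH_PROMPT stands in for the full prompt from Step 3.

```python
# A small sketch: append today's date to the research system prompt so the
# model does not have to guess the year from its training data.
from datetime import date

RESEARCH_PROMPT = "You are a **research agent** designed to ..."  # full prompt from Step 3

system_prompt = f"{RESEARCH_PROMPT}\n\nCurrent date: {date.today().isoformat()}"
print(system_prompt)
```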
Advanced Usage: Customize Prompt Template
You can customize the chat template by opening the model settings icon ⚙️ in the center of Jan's model selection. Do not confuse this with the assistant settings ⚙️ on the top left of the screen.
Scroll down and you will see the Jinja template that can be overridden. We recommend experimenting with this raw chat template to completely eliminate the effect of the default system prompt.
We observed fewer tool calls per query when overriding the default chat template with the raw template, and we recommend this only for advanced usage.
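Outside the Jan app, the same idea can be tested with the transformers tokenizer by assigning a bare-bones Jinja template before rendering. The template below is an illustrative ChatML-style sketch, not the official raw Jan-v1 template, and the repository id is again an assumption.

```python
# Illustrative sketch: override the chat template so that only the messages you
# pass in are rendered, with no injected default system prompt. This is NOT the
# official Jan-v1 raw template; it is a minimal ChatML-style stand-in.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("janhq/Jan-v1-4B")  # assumed repo id

tokenizer.chat_template = (
    "{% for message in messages %}"
    "<|im_start|>{{ message['role'] }}\n{{ message['content'] }}<|im_end|>\n"
    "{% endfor %}"
    "{% if add_generation_prompt %}<|im_start|>assistant\n{% endif %}"
)

messages = [
    {"role": "system", "content": "You are a research agent..."},
    {"role": "user", "content": "What's the world population in 2025?"},
]
print(tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True))
```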