The Problem — Too Much to Read, Too Little Time

Every morning starts the same way. You open your favorite news channels, blog feeds, and Telegram groups — and the wall of unread posts hits you. Dozens of articles, updates, announcements. You need to understand what happened overnight, but reading everything is not realistic. You skim, miss context, and occasionally discover three days later that something important slipped through.

What if you could just ask? Not a search engine — a conversational assistant that has already read everything and can answer questions about it. “What were the key announcements yesterday?” “Did anyone mention breaking changes in the new Kafka release?” “Summarize the posts about Kubernetes security.”

This is a natural fit for RAG — Retrieval-Augmented Generation. Instead of relying on the LLM’s training data (which is frozen in time), you feed it your own content at query time. The model answers based on what it retrieves from your knowledge base, not what it memorized during training.

In this article, I will walk through building exactly this kind of assistant using LangChain4J, Qdrant vector database, and a local LLM served through an OpenAI-compatible API. We will cover the full pipeline: configuring the LLM, setting up embeddings, ingesting documents into Qdrant, and wiring up the RAG retrieval so the assistant can answer questions grounded in your actual content.


Architecture Overview

Before diving into code, let’s look at the overall architecture. The system has three main flows: ingestion (getting content into the vector database), retrieval (finding relevant content at query time), and generation (producing an answer using the LLM).

Ingestion flow — runs on a schedule to keep the knowledge base fresh:

sequenceDiagram
    participant Scheduler
    participant Sources as Data Sources
    participant Parser as Document Parser<br/>(Apache Tika)
    participant EM as Embedding Model<br/>(all-MiniLM-L6-v2)
    participant Qdrant

    Scheduler->>Qdrant: Create new collection
    Scheduler->>Sources: Fetch posts, articles, PDFs
    Sources-->>Parser: Raw content
    Parser-->>EM: Parsed text
    EM-->>Qdrant: Vectors + metadata
    Scheduler->>Qdrant: Switch alias to new collection
    Scheduler->>Qdrant: Delete old collection
    

Query flow — happens in real time when a user asks a question:

sequenceDiagram
    participant User
    participant AiServices
    participant EM as Embedding Model<br/>(all-MiniLM-L6-v2)
    participant Qdrant
    participant LLM

    User->>AiServices: "What happened with Kafka this week?"
    AiServices->>EM: Embed question
    EM-->>AiServices: Question vector [384 dims]
    AiServices->>Qdrant: Similarity search (top 20, min 0.5)
    Qdrant-->>AiServices: Relevant document chunks
    AiServices->>LLM: System prompt + memory + context + question
    LLM-->>AiServices: Grounded response
    AiServices-->>User: "Based on this week's posts..."
    

The ingestion pipeline pulls content from your data sources, parses documents (including PDFs), generates embeddings, and stores them in Qdrant. The query pipeline is synchronous: when a user asks a question, LangChain4J embeds the question, retrieves relevant document chunks from Qdrant, and passes them as context to the LLM alongside the user’s message.


Dependencies

The project uses Spring Boot 3.5 with LangChain4J 1.10.0. Here are the key dependencies in build.gradle.kts:

dependencies {
    // LangChain4J core
    implementation("dev.langchain4j:langchain4j:1.10.0")

    // OpenAI-compatible model integration
    implementation("dev.langchain4j:langchain4j-open-ai:1.10.0")
    implementation("dev.langchain4j:langchain4j-http-client-jdk:1.10.0")

    // Qdrant vector store
    implementation("dev.langchain4j:langchain4j-qdrant:1.10.0-beta18")

    // Local embedding model (no API calls needed)
    implementation("dev.langchain4j:langchain4j-embeddings-all-minilm-l6-v2:1.10.0-beta18")

    // Document parsing (PDF, text, etc.)
    implementation("dev.langchain4j:langchain4j-easy-rag:1.10.0-beta18")

    // Spring Boot
    implementation("org.springframework.boot:spring-boot-starter-webflux")
}

A few things to note:

  • langchain4j — core framework: agents, tools, memory, and RAG abstractions
  • langchain4j-open-ai — chat model integration via an OpenAI-compatible API (works with vLLM, LiteLLM, Ollama)
  • langchain4j-http-client-jdk — JDK 11+ HTTP client for LangChain4J (replaces the default OkHttp)
  • langchain4j-qdrant — Qdrant vector store integration
  • langchain4j-embeddings-all-minilm-l6-v2 — in-process embedding model; no external API calls, runs on CPU
  • langchain4j-easy-rag — document loaders and parsers (includes Apache Tika for PDF parsing)

The langchain4j-open-ai module is the key integration point. It uses the OpenAI API format, which means it works with any backend that speaks this protocol — not just OpenAI itself. If you have a local LLM running behind vLLM or LiteLLM (as covered in my previous article), you can point it at your local endpoint. No cloud API required.


Configuring the LLM

LangChain4J’s OpenAiChatModel connects to any OpenAI-compatible endpoint. First, define the configuration properties as a record:

@ConfigurationProperties(prefix = "dev.alimov.llm")
public record LlmProperties(
        String url,
        String model,
        String apiKey,
        @DefaultValue("65536") int maxTokens,
        @DefaultValue("500") int maxCompletionTokens,
        @DefaultValue("O200K_BASE") String tokenEncoding
) {}

The corresponding application properties:

dev.alimov.llm.url=http://localhost:4000/v1
dev.alimov.llm.model=qwen3-8b
dev.alimov.llm.api-key=api-key
dev.alimov.llm.max-tokens=65536
dev.alimov.llm.max-completion-tokens=500
dev.alimov.llm.token-encoding=O200K_BASE

Then wire it into a configuration class that creates the model beans:

@Configuration
@EnableConfigurationProperties(LlmProperties.class)
public class LlmConfiguration {

    @Bean
    public ChatModel chatModel(LlmProperties properties) {
        return OpenAiChatModel.builder()
                .baseUrl(properties.url())
                .modelName(properties.model())
                .apiKey(properties.apiKey())
                .httpClientBuilder(httpClientBuilder())
                .maxTokens(properties.maxTokens())
                .maxCompletionTokens(properties.maxCompletionTokens())
                .temperature(0.0)
                .parallelToolCalls(true)
                .logRequests(true)
                .logResponses(true)
                .build();
    }

    @Bean
    public StreamingChatModel streamingChatModel(LlmProperties properties) {
        return OpenAiStreamingChatModel.builder()
                .baseUrl(properties.url())
                .modelName(properties.model())
                .apiKey(properties.apiKey())
                .httpClientBuilder(httpClientBuilder())
                .maxTokens(properties.maxTokens())
                .maxCompletionTokens(properties.maxCompletionTokens())
                .temperature(0.0)
                .logRequests(true)
                .logResponses(true)
                .build();
    }

    private static JdkHttpClientBuilder httpClientBuilder() {
        return JdkHttpClient.builder()
                .connectTimeout(Duration.ofSeconds(30))
                .readTimeout(Duration.ofSeconds(60))
                .httpClientBuilder(HttpClient.newBuilder()
                        .connectTimeout(Duration.ofSeconds(30))
                        .version(HttpClient.Version.HTTP_1_1));
    }
}

Using a record for @ConfigurationProperties is clean — Spring Boot binds the properties directly to constructor parameters, giving you immutability and no boilerplate getters. The @DefaultValue annotation provides fallback values when a property is not set.

Key configuration decisions:

  • temperature=0.0 — deterministic responses. For a news assistant that should answer factually based on retrieved content, you want consistency, not creativity.
  • parallelToolCalls=true — allows the model to invoke multiple tools in a single turn, which speeds up responses when the model needs to look up several pieces of information.
  • maxCompletionTokens=500 — keeps responses concise. For a news digest assistant, you want summaries, not essays.
  • Custom HTTP client with HTTP_1_1 — this is not optional when working with local backends. Java’s default HTTP client may attempt to negotiate HTTP/2, which causes requests to fail silently or hang. Local LLM servers like vLLM and LM Studio do not support HTTP/2 — they only speak HTTP/1.1. Setting HttpClient.Version.HTTP_1_1 explicitly avoids this. The langchain4j-http-client-jdk module (GitHub issue #2882) gives you direct control over the JDK’s HttpClient builder, and also eliminates the OkHttp dependency.

The StreamingChatModel bean provides the same model configuration but with token-by-token streaming — useful if you are building a chat interface where users see the response as it generates.


Embeddings

Before we can do any retrieval, we need to understand embeddings. An embedding is a vector — an array of numbers — that represents the semantic meaning of a piece of text. Texts with similar meanings produce vectors that are close together in vector space. This is what makes semantic search possible: instead of matching keywords, you match meanings.
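To make "close together in vector space" concrete, here is a self-contained sketch of cosine similarity — the same measure used for retrieval later in this article. The three-dimensional vectors are toy values of my own, not real model output:

```java
public class CosineDemo {

    // Cosine similarity: dot(a, b) / (|a| * |b|) — ranges from -1 to 1,
    // where values near 1 mean the vectors point in nearly the same direction.
    static double cosine(double[] a, double[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        // Toy 3-dimensional "embeddings" (real ones have 384 dimensions).
        double[] kafkaPost   = {0.9, 0.1, 0.0};
        double[] kafkaUpdate = {0.8, 0.2, 0.1};
        double[] cookingBlog = {0.0, 0.1, 0.9};

        System.out.printf("kafka vs kafka-update: %.3f%n", cosine(kafkaPost, kafkaUpdate));
        System.out.printf("kafka vs cooking:      %.3f%n", cosine(kafkaPost, cookingBlog));
    }
}
```

The two Kafka-related vectors score far higher against each other than either does against the unrelated one — that gap is what the retriever exploits.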

LangChain4J supports multiple embedding models. For this project, we use all-MiniLM-L6-v2 — a lightweight model from the sentence-transformers family that runs entirely in-process on CPU. No API calls, no network latency, no external dependencies.

@Bean
public EmbeddingModel embeddingModel() {
    return new AllMiniLmL6V2EmbeddingModel();
}

That is the entire configuration. The model is bundled as a Maven dependency (langchain4j-embeddings-all-minilm-l6-v2) and runs via ONNX Runtime inside your JVM. It produces 384-dimensional vectors.

  • Model: all-MiniLM-L6-v2
  • Vector dimension: 384
  • Runtime: ONNX (in-process, CPU)
  • Latency: ~1-5 ms per embedding
  • External API: none required
  • Use case: short-to-medium text (articles, posts, paragraphs)

Why not use a larger, more capable embedding model served via API? For a news assistant, the content is typically short — post titles, article summaries, paragraph-length descriptions. The all-MiniLM-L6-v2 model handles this well. It is fast, free, and eliminates an external dependency. If you later need better embedding quality for longer or more complex documents, you can swap in a model like bge-large or e5-large-v2 served through an embedding API — LangChain4J makes this a configuration change, not an architecture change.


Setting Up Qdrant

Qdrant is a vector database purpose-built for similarity search. It stores your document embeddings and lets you query them efficiently using approximate nearest neighbor (ANN) algorithms. It supports metadata filtering, which is critical for multi-tenant setups where you need to scope searches to a specific data source.

Running Qdrant

The simplest way to run Qdrant is with Docker:

services:
  qdrant:
    image: qdrant/qdrant:v1.14.1
    container_name: qdrant
    ports:
      - "6333:6333"   # REST API
      - "6334:6334"   # gRPC API
    volumes:
      - qdrant_data:/qdrant/storage
    environment:
      QDRANT__SERVICE__GRPC_PORT: 6334

volumes:
  qdrant_data:

Spring Boot Configuration

Define the connection properties:

@ConfigurationProperties(prefix = "dev.alimov.qdrant")
public record QdrantProperties(
        @DefaultValue("localhost") String host,
        @DefaultValue("6334") int port,
        @DefaultValue("news-embeddings") String collectionName,
        @DefaultValue("") String apiKey,
        @DefaultValue("PT1H") Duration reindexPeriod
) {}

The corresponding application properties:
dev.alimov.qdrant.host=${QDRANT_HOST:localhost}
dev.alimov.qdrant.port=${QDRANT_PORT:6334}
dev.alimov.qdrant.collection-name=${QDRANT_COLLECTION:news-embeddings}
dev.alimov.qdrant.api-key=${QDRANT_API_KEY:}
dev.alimov.qdrant.reindex-period=${QDRANT_REINDEX_PERIOD:PT1H}
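The reindex-period value is an ISO-8601 duration: Spring Boot binds it to the java.time.Duration component of the record, and @Scheduled(fixedDelayString = …) also accepts this format. A quick sanity check of the notation:

```java
import java.time.Duration;

public class ReindexPeriodDemo {
    public static void main(String[] args) {
        // ISO-8601 duration strings, as used for dev.alimov.qdrant.reindex-period
        Duration hourly     = Duration.parse("PT1H");   // every hour (the default)
        Duration halfHourly = Duration.parse("PT30M");  // every 30 minutes
        Duration daily      = Duration.parse("P1D");    // once a day (24 hours)

        System.out.println(hourly.toMillis()); // 3600000
        System.out.println(daily.toHours());   // 24
    }
}
```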

Qdrant Client and Embedding Store

The Qdrant configuration creates the client, ensures the collection exists, and provides a QdrantEmbeddingStore for LangChain4J:

@Configuration
@EnableConfigurationProperties(QdrantProperties.class)
public class QdrantConfiguration {

    @Bean
    public EmbeddingModel embeddingModel() {
        return new AllMiniLmL6V2EmbeddingModel();
    }

    @Bean
    public QdrantClient qdrantClient(QdrantProperties properties) {
        QdrantGrpcClient.Builder grpcBuilder = QdrantGrpcClient.newBuilder(
                properties.host(), properties.port(), false);

        if (properties.apiKey() != null && !properties.apiKey().isBlank()) {
            grpcBuilder.withApiKey(properties.apiKey());
        }

        return new QdrantClient(grpcBuilder.build());
    }

    @Bean
    public QdrantEmbeddingStore qdrantEmbeddingStore(
            QdrantProperties properties,
            QdrantClient qdrantClient,
            EmbeddingModel embeddingModel) {

        String aliasName = properties.collectionName();
        ensureAliasExists(qdrantClient, aliasName, embeddingModel.dimension());

        return QdrantEmbeddingStore.builder()
                .client(qdrantClient)
                .collectionName(aliasName)
                .build();
    }
}

Notice that we use a Qdrant alias rather than pointing directly at a collection. This is important for the reindexing strategy we will cover next — it allows zero-downtime collection swaps.

The ensureAliasExists method handles the initial bootstrap: if the alias does not exist yet, it creates an initial collection with the correct vector dimensions and distance metric, sets up a payload index for filtering, and points the alias at it.

private void ensureAliasExists(QdrantClient qdrantClient,
                                String aliasName,
                                int vectorDimension) {
    try {
        List<AliasDescription> aliases = qdrantClient.listAliasesAsync().get();

        boolean aliasExists = aliases.stream()
                .anyMatch(alias -> alias.getAliasName().equals(aliasName));

        if (aliasExists) {
            return;
        }

        String initialCollectionName = aliasName + "-initial";

        qdrantClient.createCollectionAsync(initialCollectionName,
                VectorParams.newBuilder()
                        .setSize(vectorDimension)
                        .setDistance(Distance.Cosine)
                        .build())
                .get();

        qdrantClient.createPayloadIndexAsync(initialCollectionName,
                "source_id", PayloadSchemaType.Keyword,
                null, true, null, null)
                .get();

        qdrantClient.createAliasAsync(aliasName, initialCollectionName).get();
    } catch (InterruptedException e) {
        // get() throws checked exceptions; restore the interrupt flag and fail fast
        Thread.currentThread().interrupt();
        throw new IllegalStateException("Interrupted while bootstrapping Qdrant alias", e);
    } catch (ExecutionException e) {
        throw new IllegalStateException("Could not bootstrap Qdrant alias " + aliasName, e);
    }
}

Key decisions in the collection setup:

  • Distance: Cosine — the standard choice for text embeddings. Cosine similarity measures the angle between vectors, ignoring magnitude, which works well for normalized embedding outputs.
  • Payload index on source_id — a keyword index that enables filtered searches. When you have multiple data sources (different channels, blogs, or feeds), you can scope retrieval to a specific source.
  • Vector size: 384 — matches the output dimension of our all-MiniLM-L6-v2 embedding model.

Ingesting Data into Qdrant

The ingestion pipeline is the heart of the system. It takes your raw content — posts, articles, PDF attachments — and converts it into embeddings stored in Qdrant. LangChain4J provides EmbeddingStoreIngestor, which handles the document-to-embedding-to-storage pipeline.

Blue-Green Reindexing Strategy

A naive approach would be to clear the existing collection and re-insert everything. But this means your assistant has no context during reindexing — a gap that can last minutes for large datasets. Instead, we use a blue-green strategy with Qdrant aliases:

graph LR
    subgraph Before["Before Reindex"]
        A1["Alias: news-embeddings"] --> C1["Collection: news-embeddings-1711100000"]
    end

    subgraph During["During Reindex"]
        A2["Alias: news-embeddings"] --> C2["Collection: news-embeddings-1711100000"]
        I["Ingestor"] --> C3["Collection: news-embeddings-1711200000"]
    end

    subgraph After["After Reindex"]
        A3["Alias: news-embeddings"] --> C4["Collection: news-embeddings-1711200000"]
        C5["Collection: news-embeddings-1711100000<br/><i>deleted</i>"]
    end

    Before ~~~ During ~~~ After
    
  1. Create a new collection with a timestamp-based name
  2. Ingest all documents into the new collection
  3. Atomically switch the alias to point to the new collection
  4. Delete the old collection

The application always reads through the alias, so queries are never interrupted — they seamlessly switch from the old data to the new data in a single atomic alias update.
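Why the swap is safe can be illustrated in plain Java, with an AtomicReference standing in for the alias. This is a conceptual sketch of the indirection only — not how Qdrant implements aliases internally:

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicReference;

public class AliasSwapDemo {
    public static void main(String[] args) {
        // The "alias": readers always dereference it, never a collection directly.
        AtomicReference<List<String>> alias =
                new AtomicReference<>(List.of("old-post-1", "old-post-2"));

        // An in-flight reader sees a complete, consistent snapshot.
        List<String> beforeSwap = alias.get();

        // Ingestion builds the new "collection" off to the side...
        List<String> newCollection = List.of("new-post-1", "new-post-2", "new-post-3");

        // ...and the switch is a single atomic pointer update.
        alias.set(newCollection);

        System.out.println(beforeSwap);   // old data, still intact for in-flight reads
        System.out.println(alias.get());  // new data for every subsequent read
    }
}
```

At no point does a reader observe a half-built or empty collection — exactly the property the alias gives the query pipeline during reindexing.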

The Reindex Job

Here is the scheduled reindexing job:

@Component
public class QdrantReindexJob {

    static final String METADATA_SOURCE_ID = "source_id";
    private static final DocumentParser DOCUMENT_PARSER = new ApacheTikaDocumentParser();

    private final ContentService contentService;
    private final QdrantClient qdrantClient;
    private final EmbeddingModel embeddingModel;
    private final QdrantProperties properties;

    @Scheduled(fixedDelayString = "${dev.alimov.qdrant.reindex-period}")
    public void reindex() {
        String aliasName = properties.collectionName();
        String newCollectionName = aliasName + "-" + Instant.now().toEpochMilli();

        try {
            // 1. Create new collection
            qdrantClient.createCollectionAsync(newCollectionName,
                    VectorParams.newBuilder()
                            .setSize(embeddingModel.dimension())
                            .setDistance(Distance.Cosine)
                            .build())
                    .get();

            // 2. Create payload index for filtering
            qdrantClient.createPayloadIndexAsync(newCollectionName,
                    METADATA_SOURCE_ID, PayloadSchemaType.Keyword,
                    null, true, null, null)
                    .get();

            // 3. Build temporary embedding store for the new collection
            QdrantEmbeddingStore newStore = QdrantEmbeddingStore.builder()
                    .client(qdrantClient)
                    .collectionName(newCollectionName)
                    .build();

            // 4. Ingest all content
            ingestContent(newStore);

            // 5. Resolve old collection and switch alias
            String oldCollectionName = resolveAliasCollection(aliasName);
            qdrantClient.createAliasAsync(aliasName, newCollectionName).get();

            // 6. Delete old collection
            if (oldCollectionName != null) {
                qdrantClient.deleteCollectionAsync(oldCollectionName).get();
            }

        } catch (Exception e) {
            // Best-effort cleanup on failure; keep the original error as the cause
            try {
                qdrantClient.deleteCollectionAsync(newCollectionName).get();
            } catch (Exception cleanupFailure) {
                e.addSuppressed(cleanupFailure);
            }
            throw new RuntimeException("Reindex failed", e);
        }
    }
}
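The job references a resolveAliasCollection helper that is not shown above. One possible implementation, assuming the Qdrant Java client's listAliasesAsync / AliasDescription API — treat this as a sketch, not the article's exact code:

```java
// Looks up which physical collection the alias currently points at,
// or returns null if the alias does not resolve yet (e.g., first run).
private String resolveAliasCollection(String aliasName) throws Exception {
    return qdrantClient.listAliasesAsync().get().stream()
            .filter(alias -> alias.getAliasName().equals(aliasName))
            .map(AliasDescription::getCollectionName)
            .findFirst()
            .orElse(null);
}
```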

Ingesting Documents

The actual document ingestion uses LangChain4J’s EmbeddingStoreIngestor. It takes a Document, splits it (if a splitter is configured), generates embeddings, and stores them in the embedding store — all in one call.

private void ingestContent(QdrantEmbeddingStore store) {
    String sourceId = "news-feed";

    EmbeddingStoreIngestor ingestor = EmbeddingStoreIngestor.builder()
            .embeddingModel(embeddingModel)
            .embeddingStore(store)
            .documentTransformer(document -> {
                document.metadata().put(METADATA_SOURCE_ID, sourceId);
                return document;
            })
            .build();

    // Ingest text posts
    for (Post post : contentService.findAllPosts()) {
        StringBuilder sb = new StringBuilder();
        sb.append("News post: ").append(post.getName());
        if (post.getDescription() != null && !post.getDescription().isBlank()) {
            sb.append("\n").append(post.getDescription());
        }

        Document postDoc = Document.from(sb.toString(),
                Metadata.from(METADATA_SOURCE_ID, sourceId));
        ingestor.ingest(postDoc);
    }

    // Ingest PDF/text attachments
    for (Attachment attachment : contentService.findAttachments()) {
        MediaType mediaType = MediaType.parseMediaType(attachment.getMimeType());

        if (mediaType.getType().equals(MediaType.TEXT_PLAIN.getType())
                || mediaType.equals(MediaType.APPLICATION_PDF)) {

            byte[] content = contentService.downloadContent(attachment.getId());
            Document document = DOCUMENT_PARSER.parse(
                    new ByteArrayInputStream(content));
            ingestor.ingest(document);
        }
    }
}

Key points about the ingestion:

  • documentTransformer — attaches a source_id metadata field to every document before it is embedded. This metadata is stored alongside the vector in Qdrant and can be used for filtered retrieval later.
  • ApacheTikaDocumentParser — handles both plain text and PDF documents. Tika extracts text content from PDFs, making them searchable through embeddings.
  • Post content — each post is ingested as a single document with its title and description. For longer articles, you would typically add a DocumentSplitter to chunk them into smaller pieces, improving retrieval precision.

Adding Document Splitting

For longer documents, you should split them into chunks before embedding. LangChain4J provides several splitters:

EmbeddingStoreIngestor ingestor = EmbeddingStoreIngestor.builder()
        .embeddingModel(embeddingModel)
        .embeddingStore(store)
        .documentSplitter(DocumentSplitters.recursive(500, 50))
        .documentTransformer(document -> {
            document.metadata().put(METADATA_SOURCE_ID, sourceId);
            return document;
        })
        .build();

The DocumentSplitters.recursive(500, 50) call splits documents into chunks of up to roughly 500 characters with a 50-character overlap between consecutive chunks (the two-argument overload measures size in characters; pass a token count estimator as a third argument for token-based sizing). The overlap ensures that context is not lost at chunk boundaries — a sentence that spans two chunks will appear in both.
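To see why the overlap matters, here is a deliberately naive fixed-size chunker of my own — LangChain4J's recursive splitter is smarter about paragraph and sentence boundaries, but the overlap mechanics are the same:

```java
import java.util.ArrayList;
import java.util.List;

public class OverlapChunker {

    // Naive fixed-size chunking with overlap: each chunk starts
    // (chunkSize - overlap) characters after the previous one,
    // so the last `overlap` characters of a chunk reappear in the next.
    static List<String> chunk(String text, int chunkSize, int overlap) {
        List<String> chunks = new ArrayList<>();
        int step = chunkSize - overlap; // must be positive
        for (int start = 0; start < text.length(); start += step) {
            int end = Math.min(start + chunkSize, text.length());
            chunks.add(text.substring(start, end));
            if (end == text.length()) break;
        }
        return chunks;
    }

    public static void main(String[] args) {
        System.out.println(chunk("abcdefghij", 4, 2)); // [abcd, cdef, efgh, ghij]
    }
}
```

Every boundary character appears in two chunks, so a sentence cut at a boundary is still retrievable whole from one of them.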


Querying Data — The RAG Pipeline

Now for the payoff: wiring up retrieval so the LLM can answer questions based on your ingested content. LangChain4J makes this remarkably straightforward through EmbeddingStoreContentRetriever.

Content Retriever

The content retriever is the bridge between your user’s question and the vector database:

EmbeddingStoreContentRetriever contentRetriever =
        EmbeddingStoreContentRetriever.builder()
                .embeddingStore(qdrantEmbeddingStore)
                .embeddingModel(embeddingModel)
                .filter(MetadataFilterBuilder.metadataKey("source_id")
                        .isEqualTo("news-feed"))
                .maxResults(20)
                .minScore(0.5)
                .build();

The builder parameters:

  • embeddingStore (QdrantEmbeddingStore) — where to search
  • embeddingModel (AllMiniLmL6V2EmbeddingModel) — embeds the user's question into the same vector space as the documents
  • filter (source_id = "news-feed") — scopes the search to a specific data source
  • maxResults (20) — maximum number of document chunks to retrieve
  • minScore (0.5) — minimum similarity threshold; low-relevance results are ignored

When a user asks “What happened with Kafka yesterday?”, the retriever:

  1. Embeds the question using all-MiniLM-L6-v2 → produces a 384-dimensional vector
  2. Searches Qdrant for the 20 most similar vectors, filtered by source_id
  3. Discards any results with cosine similarity below 0.5
  4. Returns the matching document chunks as context for the LLM

The minScore threshold is important — without it, the retriever returns the top maxResults matches even when none of them are actually relevant to the question. A threshold of 0.5 is a reasonable starting point; you may need to tune it based on your content.
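The interplay of maxResults and minScore can be sketched with a plain stream over (chunk, score) pairs. The scores below are illustrative values of my own, not real Qdrant output:

```java
import java.util.List;
import java.util.Map;

public class MinScoreDemo {

    // Keep at most maxResults entries whose score clears minScore,
    // best matches first — the same contract the retriever applies.
    static List<String> retrieve(Map<String, Double> scored,
                                 int maxResults, double minScore) {
        return scored.entrySet().stream()
                .filter(e -> e.getValue() >= minScore)
                .sorted(Map.Entry.<String, Double>comparingByValue().reversed())
                .limit(maxResults)
                .map(Map.Entry::getKey)
                .toList();
    }

    public static void main(String[] args) {
        Map<String, Double> scored = Map.of(
                "Kafka 3.8 released", 0.82,
                "Kubernetes CVE roundup", 0.41,   // below threshold: dropped
                "Kafka breaking changes", 0.74);

        System.out.println(retrieve(scored, 20, 0.5));
        // [Kafka 3.8 released, Kafka breaking changes]
    }
}
```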

Building the AI Assistant

LangChain4J’s AiServices brings everything together — the LLM, content retriever, tools, chat memory, and system prompt:

public interface NewsAssistant {

    Response<AiMessage> chat(
            @MemoryId String memoryId,
            @UserMessage UserMessage userMessage);

    TokenStream chatStream(
            @MemoryId String memoryId,
            @UserMessage UserMessage userMessage);
}

The implementation of this interface is generated by AiServices from the builder configuration:
NewsAssistant assistant = AiServices.builder(NewsAssistant.class)
        .chatModel(chatModel)
        .streamingChatModel(streamingChatModel)
        .contentRetriever(contentRetriever)
        .tools(postServiceToolProxy)
        .chatMemoryProvider(chatMemoryProvider)
        .systemMessageProvider(memoryId -> systemPrompt)
        .build();

graph TD
    UM["User Message"] --> AS["AiServices"]

    AS --> SP["System Prompt<br/><i>role and guidelines</i>"]
    AS --> CM["Chat Memory<br/><i>conversation history</i>"]
    AS --> CR["Content Retriever<br/><i>Qdrant semantic search</i>"]
    AS --> TC["Tool Calls<br/><i>list posts, get details</i>"]

    SP --> LLM["LLM"]
    CM --> LLM
    CR --> LLM
    TC --> LLM

    LLM --> R["Response"]
    

The AiServices proxy handles all the orchestration automatically:

  1. Load the system prompt via systemMessageProvider
  2. Retrieve conversation history via chatMemoryProvider
  3. Run the content retriever to find relevant documents
  4. Combine everything into a single prompt and call the LLM
  5. If the LLM invokes tools, execute them and feed results back
  6. Return the final response and update chat memory

Tool Integration

Beyond RAG retrieval, the assistant can also invoke tools — structured actions the LLM can call to get specific data. For a news assistant, this means the model can list posts or fetch full article details:

public class PostServiceToolProxy {

    private final ContentService contentService;

    @Tool(name = "news_post_find",
          value = """
              Retrieve full details of a specific article/post by its ID.
              Returns the complete article including title, full content,
              and metadata.
              """)
    public Post findById(
            @ToolMemoryId String userId,
            @P("Article/post ID") String id) {
        return contentService.findById(UUID.fromString(id));
    }

    @Tool(name = "news_post_list",
          value = """
              Browse and list all published articles/posts.
              Use when the user asks what articles are available.
              Returns titles and short descriptions.
              """)
    public List<Post> list(
            @ToolMemoryId String userId,
            @P(value = "Filter by name", required = false) String name,
            @P(value = "Filter by description", required = false) String description) {
        return contentService.findPosts(name, description);
    }
}

The distinction between RAG and tools is important:

  • RAG (content retriever) provides contextual background — the model gets relevant document chunks automatically with every query. This is implicit; the user does not need to ask for it.
  • Tools provide explicit actions — the model decides when to call them based on the user’s request. “Show me all posts about Kafka” would trigger news_post_list; “What’s the general sentiment about the new release?” would rely on RAG-retrieved context.

Together, they give the assistant both broad contextual knowledge and the ability to fetch specific, structured data.


System Prompt — Guiding the Assistant

The system prompt defines the assistant’s personality, capabilities, and constraints. For a news/blog assistant, it should emphasize accuracy, grounding in retrieved content, and honest acknowledgment of gaps:

You are a friendly and knowledgeable assistant for "{{serviceName}}",
a blog platform.

Your role:
- Help readers discover and navigate blog content
- Answer questions about articles, summarize posts, suggest related reading
- When asked about a topic covered in the blog, reference relevant articles

Guidelines:
- Always stay on topic with the blog's content and domain
- If you don't know the answer or the blog hasn't covered a topic, say so honestly
- Do not fabricate article titles, quotes, or statistics
- Keep responses concise unless the user asks for detailed explanation
- When the user asks to list or browse articles, ALWAYS call the news_post_list
  tool first — do NOT answer from memory

Grounding and accuracy:
- ONLY use information from the provided context (documents, knowledge base,
  and tool results). Never fabricate or guess information
- If a tool is available to answer a question, you MUST use it rather
  than guessing the answer

The prompt explicitly instructs the model to use tools rather than guessing, which prevents hallucination. This is especially important with smaller local models that are more prone to making things up when they do not have the answer.


Chat Memory — Maintaining Context

For a conversational assistant, you need chat memory — the ability to remember what was said earlier in the conversation. LangChain4J provides TokenWindowChatMemory, which maintains a sliding window of recent messages, capped by token count rather than message count.

ChatMemoryProvider chatMemoryProvider = memoryId ->
        TokenWindowChatMemory.builder()
                .id(memoryId)
                .maxTokens(2048, tokenCountEstimator)
                .chatMemoryStore(chatMemoryStore)
                .build();

The TokenCountEstimator uses jtokkit for BPE token counting, matching the tokenization scheme of the target model:

public class ChatTokenCountEstimator implements TokenCountEstimator {

    private static final EncodingRegistry ENCODING_REGISTRY =
            Encodings.newDefaultEncodingRegistry();

    private final Encoding encoding;

    public ChatTokenCountEstimator(EncodingType encodingType) {
        this.encoding = ENCODING_REGISTRY.getEncoding(encodingType);
    }

    @Override
    public int estimateTokenCountInText(String text) {
        return encoding.countTokensOrdinary(text);
    }

    @Override
    public int estimateTokenCountInMessage(ChatMessage message) {
        int tokenCount = 4; // role + per-message overhead
        if (message instanceof SystemMessage sm) {
            tokenCount += estimateTokenCountInText(sm.text());
        } else if (message instanceof UserMessage um) {
            for (Content content : um.contents()) {
                if (content instanceof TextContent tc) {
                    tokenCount += estimateTokenCountInText(tc.text());
                }
            }
        } else if (message instanceof AiMessage ai) {
            if (ai.text() != null) {
                tokenCount += estimateTokenCountInText(ai.text());
            }
            if (ai.hasToolExecutionRequests()) {
                tokenCount += 6;
                for (var req : ai.toolExecutionRequests()) {
                    tokenCount += 7 + estimateTokenCountInText(req.name());
                    if (req.arguments() != null) {
                        tokenCount += estimateTokenCountInText(req.arguments());
                    }
                }
            }
        }
        return tokenCount;
    }
}

Why a custom token estimator instead of LangChain4J’s built-in one? The built-in OpenAiTokenCountEstimator is hardcoded to specific OpenAI tokenizers. When you are running a local model like Qwen, the tokenization differs. Using jtokkit with a configurable EncodingType (e.g., O200K_BASE) gives you accurate token counting for your specific model.

The chatMemoryStore handles persistence — storing conversation history so it survives service restarts. You can implement it with any database; the ChatMemoryStore interface requires just three methods: getMessages(memoryId), updateMessages(memoryId, messages), and deleteMessages(memoryId).


Putting It All Together

Here is the complete flow when a user asks “What were the key announcements this week?”:

sequenceDiagram
    participant U as User
    participant A as AiServices
    participant CM as Chat Memory
    participant CR as Content Retriever
    participant Q as Qdrant
    participant EM as Embedding Model
    participant LLM as LLM

    U->>A: "What were the key announcements this week?"
    A->>CM: Load conversation history
    CM-->>A: Previous messages (if any)

    A->>EM: Embed user question
    EM-->>A: Question vector [384 dims]
    A->>CR: Retrieve relevant documents
    CR->>Q: Vector similarity search<br/>(filter: source_id, top 20, min 0.5)
    Q-->>CR: Matching document chunks
    CR-->>A: Retrieved context

    A->>LLM: System prompt + memory + context + question
    LLM-->>A: "Based on this week's posts, the key announcements were..."

    A->>CM: Save updated conversation
    A-->>U: Response with grounded answer
    

The entire RAG pipeline — embedding the question, searching Qdrant, assembling context, calling the LLM — happens transparently. Your application code just calls assistant.chat(memoryId, userMessage) and gets back a grounded response.


Production Considerations

Tuning Retrieval Quality

The two most impactful parameters for retrieval quality are:

  • maxResults — more results mean more context for the LLM, but also more noise and higher token usage. Start with 10-20 and adjust based on your content density.
  • minScore — the similarity threshold. Too low and irrelevant documents leak in; too high and relevant documents get filtered out. Monitor the scores of retrieved documents to find the right threshold for your data.

Reindex Frequency

How often you reindex depends on how frequently your content changes. For a news feed that updates every few hours, a 1-hour reindex period works well. For a blog with weekly posts, daily reindexing is sufficient. The blue-green strategy ensures zero downtime regardless of frequency.

Scaling

  • Qdrant scales well — it supports sharding, replicas, and can handle millions of vectors on modest hardware.
  • Embeddings are CPU-bound with the in-process model. For high-throughput ingestion, consider a GPU-accelerated embedding model or batch processing.
  • LLM inference is typically the bottleneck. If you are running locally, ensure your GPU has enough VRAM for the model you chose.

Conclusion

RAG with LangChain4J and Qdrant is a practical, production-ready approach to building AI assistants grounded in your own data. The key pieces:

  1. LangChain4J provides the framework — model integration, RAG abstractions, tool calling, and memory management, all in idiomatic Java.
  2. Qdrant stores and retrieves embeddings efficiently, with metadata filtering for multi-source setups.
  3. In-process embeddings (all-MiniLM-L6-v2) eliminate external dependencies for the embedding step.
  4. Blue-green reindexing ensures zero-downtime data updates.
  5. AiServices wires everything together — content retrieval, tools, memory, and system prompts — behind a simple interface.

The result is an assistant that answers questions based on what it actually knows — your content — rather than what it hallucinated from training data. Whether you are building a news digest bot, a documentation assistant, or a customer support agent, the pattern is the same: ingest, embed, retrieve, generate.


References

  1. LangChain4J Documentation
  2. Qdrant Documentation
  3. all-MiniLM-L6-v2 on Hugging Face
  4. LangChain4J RAG Guide
  5. Running Local LLMs for Coding and Private Agents