Streaming LLM Responses to Telegram with Reactive Draft Messages

The Problem with Waiting

Large language models are slow. A typical response takes seconds — sometimes tens of seconds — to generate. During that time, the user stares at an empty chat window, wondering if anything is happening at all. Every major LLM chat interface solved this the same way: stream tokens as they arrive. ChatGPT, Claude, Gemini — they all render partial text while the model is still thinking. The experience feels responsive even when the full response takes 15 seconds. ...
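The excerpt doesn't show the draft-message mechanics, but the core loop is easy to sketch: post one placeholder message, then repeatedly call the Bot API's editMessageText as tokens arrive, throttling edits so you stay under Telegram's rate limits. Below is a minimal Java sketch under those assumptions; the bot token is a placeholder, and the onToken callback stands in for whatever streaming client the post actually wires up.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

// Stream tokens into a single Telegram message by editing it in place.
// BOT_TOKEN is a placeholder; chatId/messageId come from the earlier
// sendMessage call that posted the empty "draft" message.
public class DraftStreamer {
    private static final HttpClient HTTP = HttpClient.newHttpClient();
    private static final String BOT_TOKEN = "123456:REPLACE_ME"; // hypothetical
    private static final long EDIT_INTERVAL_MS = 700; // throttle edits to avoid 429s

    private final long chatId;
    private final long messageId;
    private final StringBuilder buffer = new StringBuilder();
    private long lastEditAt;

    public DraftStreamer(long chatId, long messageId) {
        this.chatId = chatId;
        this.messageId = messageId;
    }

    // Invoked for every token the LLM streams back.
    public void onToken(String token) {
        buffer.append(token);
        long now = System.currentTimeMillis();
        if (now - lastEditAt >= EDIT_INTERVAL_MS) {
            lastEditAt = now;
            edit(buffer.toString());
        }
    }

    // Flush whatever is left once the stream completes.
    public void onComplete() {
        edit(buffer.toString());
    }

    private void edit(String text) {
        String body = "chat_id=" + chatId + "&message_id=" + messageId
                + "&text=" + URLEncoder.encode(text, StandardCharsets.UTF_8);
        HttpRequest req = HttpRequest.newBuilder()
                .uri(URI.create("https://api.telegram.org/bot" + BOT_TOKEN + "/editMessageText"))
                .header("Content-Type", "application/x-www-form-urlencoded")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();
        HTTP.sendAsync(req, HttpResponse.BodyHandlers.discarding()); // fire-and-forget for brevity
    }
}
```

The throttle matters: editing on every token would hit Telegram's per-chat rate limits almost immediately, so batching a few hundred milliseconds of tokens per edit keeps the "typing" effect without the 429s.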

March 29, 2026 · 12 min · Alisher Alimov

Building a RAG-Powered News Assistant with LangChain4J and Qdrant

The Problem — Too Much to Read, Too Little Time

Every morning starts the same way. You open your favorite news channels, blog feeds, and Telegram groups — and the wall of unread posts hits you. Dozens of articles, updates, announcements. You need to understand what happened overnight, but reading everything is not realistic. You skim, miss context, and occasionally discover three days later that something important slipped through. What if you could just ask? Not a search engine — a conversational assistant that has already read everything and can answer questions about it. “What were the key announcements yesterday?” “Did anyone mention breaking changes in the new Kafka release?” “Summarize the posts about Kubernetes security.” ...
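Whatever the stack, the retrieval loop behind such an assistant follows one pattern: embed the question, pull the most similar chunks from the vector store, and let the model answer from that context only. Here is a minimal Java sketch of that loop; the interfaces are hypothetical stand-ins for the real building blocks (LangChain4J's embedding model, a Qdrant-backed store, a chat model), not the post's actual code.

```java
import java.util.List;

// Hypothetical interfaces standing in for the real components:
// an embedding model, a Qdrant-backed vector store, and a chat model.
interface Embedder { float[] embed(String text); }
interface VectorStore { List<String> topK(float[] query, int k); }
interface ChatModel { String answer(String prompt); }

public class NewsAssistant {
    private final Embedder embedder;
    private final VectorStore store;
    private final ChatModel chat;

    public NewsAssistant(Embedder embedder, VectorStore store, ChatModel chat) {
        this.embedder = embedder;
        this.store = store;
        this.chat = chat;
    }

    public String ask(String question) {
        // 1. Embed the question into the same vector space as the ingested posts.
        float[] queryVector = embedder.embed(question);
        // 2. Retrieve the most similar post chunks (the vector store does the similarity search).
        List<String> context = store.topK(queryVector, 5);
        // 3. Ground the answer: the model only sees what retrieval surfaced.
        String prompt = "Answer using only the context below.\n\nContext:\n"
                + String.join("\n---\n", context)
                + "\n\nQuestion: " + question;
        return chat.answer(prompt);
    }
}
```

The point of step 3 is why this beats a search engine for the morning-catchup use case: the model synthesizes across the retrieved posts instead of handing you ten links to skim.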

March 23, 2026 · 19 min · Alisher Alimov

Running Local LLMs for Coding and Private Agents

Why Run Models Locally?

Cloud-hosted LLMs are convenient, but they come with trade-offs that matter when you are writing code or building private tools. Every prompt you send to a hosted API leaves your machine — your proprietary code, internal architecture details, database schemas, and business logic all travel to a third-party server. For personal projects this might be acceptable, but for client work, internal tooling, or anything commercially sensitive, it raises real concerns. ...
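As one concrete point of reference, a local runtime such as Ollama serves models over an HTTP API bound to localhost, so the prompt never crosses the network boundary. A short sketch, assuming Ollama is running locally with a llama3 model pulled (the post may cover other runtimes as well):

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Query a local model through Ollama's /api/generate endpoint.
// The prompt, including any proprietary code it contains, stays on this machine.
public class LocalAsk {
    public static void main(String[] args) throws Exception {
        String body = """
                {"model": "llama3", "prompt": "Review this schema for obvious issues: ...", "stream": false}""";
        HttpRequest req = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:11434/api/generate"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();
        HttpResponse<String> resp = HttpClient.newHttpClient()
                .send(req, HttpResponse.BodyHandlers.ofString());
        // The reply is a JSON object whose "response" field holds the model output.
        System.out.println(resp.body());
    }
}
```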

March 16, 2026 · 15 min · Alisher Alimov