Frequently asked questions
Architectural questions, in two parts: how the system behaves for users, and why the architecture is wired the way it is.
For users
Do users need a separate app on their phone?
No. The assistant lives inside Telegram, on the phone the user already carries. The architecture treats the messaging app as the channel; an operation that runs the architecture can extend it to Teams, Slack, or SMS by routing inbound messages through the same retrieve-and-cite pipeline.
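A minimal sketch of what "routing inbound messages through the same pipeline" could look like. The `Inbound` type, the adapter names, and `route` are hypothetical; only the payload fields read here (`message.from.id` and `message.text` for a Telegram update, `event.user` and `event.text` for a Slack message event) follow the public shapes of those APIs.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass(frozen=True)
class Inbound:
    """Channel-neutral message handed to the retrieve-and-cite pipeline."""
    channel: str
    user_id: str
    text: str


def from_telegram(update: dict) -> Inbound:
    # Telegram Bot API update: sender id and text live under "message".
    msg = update["message"]
    return Inbound("telegram", str(msg["from"]["id"]), msg.get("text", ""))


def from_slack(payload: dict) -> Inbound:
    # Slack Events API message event: sender and text live under "event".
    event = payload["event"]
    return Inbound("slack", str(event["user"]), event.get("text", ""))


# Hypothetical registry: adding a channel means adding one adapter here.
ADAPTERS: Dict[str, Callable[[dict], Inbound]] = {
    "telegram": from_telegram,
    "slack": from_slack,
}


def route(channel: str, payload: dict, pipeline: Callable[[Inbound], str]) -> str:
    """Normalize the payload, then run the one shared pipeline."""
    return pipeline(ADAPTERS[channel](payload))
```

The pipeline only ever sees `Inbound`, so the constraints downstream do not need to know which channel a question came from.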
Can the assistant invent an answer?
The architecture is built so it cannot return an answer without a supporting fragment from the approved corpus. Every reply ships with a citation to document, section, and page. When retrieval misses, the answer is "not found in approved documents." A post-processing service drops uncited or mismatched payloads before they reach the user. Citation is the gate, not decoration.
What happens with off-topic questions?
The assistant operates as an HVAC specialist tied to the user's role. Off-scope topics — legal advice, lifestyle questions, anything outside the company's approved corpus — return "out of scope for this assistant." This is one of the four constraints: fixed role.
About the architecture
Where does the model run?
On a single workstation in the company's own office. The model is an open multilingual foundation model (Qwen3-32B on the pilot tier, Qwen2.5-72B on the production tier, both with documented Russian-language coverage) served under vLLM on either a Mac Studio M3 Ultra (about $4,000) or an NVIDIA RTX PRO 6000 Blackwell workstation (about $19,000). The retrieval index, the citation gate, and the audit log share that machine. A foreman's question travels the company's network to the local server, gets answered, and returns. The corpus does not leave the building on the way to an answer. Telegram still carries the question text and the final answer text, because it is the channel, but retrieval, generation, and gate verification all happen on the company's side. See /controls-and-data for the full perimeter description.
Why local instead of a commercial API with a no-train clause?
A no-train clause and a SOC 2 report are real artifacts, but they do not answer the questions that on-prem answers. Control: the company decides who reaches the model, when it updates, and when it shuts down; a vendor outage, pricing change, or terms-of-service revision does not affect it. Audit-trail locality: the log lives on the same disk as the model, so a dispute review happens at the company's keyboard, not through a vendor support ticket. Predictable cost: one capital purchase, no per-token meter, no overage. The economic case for on-prem is rarely token cost; it is control of where the data sits.
What hardware does it actually need?
Two configurations cover most fifty-to-two-hundred-employee operations. Pilot tier: Apple Mac Studio M3 Ultra, 96 GB unified memory, around $4,000 — runs Qwen3-32B with two-to-three-second latency on grounded HVAC queries, one to three concurrent users, silent, fits under a desk. Production tier: NVIDIA RTX PRO 6000 Blackwell workstation (96 GB GPU), Threadripper or Xeon chassis, around $19,000 — runs Qwen2.5-72B with sub-second latency on grounded HVAC queries, ten to twenty concurrent users, requires a dedicated circuit and someone who administers Linux. A multi-GPU cluster (Qwen3-235B-A22B class) becomes appropriate only above a few thousand queries a day or for shared backend across multiple branch offices.
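A back-of-envelope way to check whether a model fits a given memory budget, offered as a rough sketch. The parameter counts are the nominal figures in the model names, the bit widths are illustrative (the text above does not state which precision each tier uses), and the 0.8 usable fraction is an assumed allowance for KV cache and runtime overhead.

```python
def weight_gb(params_billion: float, bits_per_weight: int) -> float:
    """Approximate size of the weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8


def fits(params_billion: float, bits_per_weight: int,
         memory_gb: float, usable_fraction: float = 0.8) -> bool:
    """True if the weights fit in the assumed usable share of memory."""
    return weight_gb(params_billion, bits_per_weight) <= memory_gb * usable_fraction


if __name__ == "__main__":
    # Both tiers described above have 96 GB of memory available to the model.
    for name, params in [("Qwen3-32B", 32), ("Qwen2.5-72B", 72)]:
        for bits in (16, 8, 4):
            size = weight_gb(params, bits)
            verdict = "fits" if fits(params, bits, 96) else "does not fit"
            print(f"{name} at {bits}-bit: ~{size:.0f} GB, {verdict} in 96 GB")
```

Under these assumptions a 72B-class model at 16-bit needs about 144 GB for weights alone, so precision is part of the sizing decision and worth confirming against the actual deployment.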
Why exactly four constraints?
Each one removes a degree of freedom that public chat keeps open. Public ChatGPT picks its role from how the question is phrased, draws on general knowledge, decides for itself what to do under uncertainty, and accepts any input source. Fixed role, fixed source, fixed scope, and citation-or-refusal are the four constraints found necessary to make the assistant predictable enough for HVAC compliance work. Three left a gap; a fifth proved redundant.
Why Telegram and not Teams or Slack?
Telegram treats voice messages as a first-class input, works on every phone, and carries no per-seat licensing, and the field crew picked it up in a single shift. The architecture is channel-neutral: the same four constraints apply to any inbound channel. Teams or Slack is a reasonable extension for an operation already running on one of them.
Why not just configure Microsoft Copilot Studio with the "Allow ungrounded responses" toggle off?
That toggle changes the system prompt. The same outcome here is enforced at three architectural layers: retrieval binds to the approved corpus, lint-at-ingest blocks malformed source documents, and a post-processing service drops uncited payloads. A jailbreak in the user's question, or a long context that pushes the system prompt out of attention, can defeat a single configurable setting. It does not bypass these three layers.
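As an illustration of the lint-at-ingest layer, the sketch below rejects a chunk that could not later be cited by document, section, and page. What counts as "malformed" is not specified above, so the field names and rules here are assumptions chosen only to show the shape of such a check.

```python
# Hypothetical metadata every chunk must carry before it is indexed.
REQUIRED_FIELDS = ("document", "section", "page", "text")


def lint_chunk(chunk: dict) -> list:
    """Return a list of problems; an empty list means the chunk may be indexed."""
    problems = [f"missing {name}" for name in REQUIRED_FIELDS if not chunk.get(name)]
    page = chunk.get("page")
    # A page that is present but not a positive integer cannot anchor a citation.
    if page and (not isinstance(page, int) or page < 1):
        problems.append("page must be a positive integer")
    return problems


def ingest(chunks: list) -> tuple:
    """Split chunks into (accepted, rejected); rejected items keep their reasons."""
    accepted, rejected = [], []
    for chunk in chunks:
        problems = lint_chunk(chunk)
        if problems:
            rejected.append((chunk, problems))
        else:
            accepted.append(chunk)
    return accepted, rejected
```

Because the check runs before indexing, a document that fails it never becomes retrievable, whatever the prompt says.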
How does the architecture handle stale or superseded documents?
The knowledge base versions documents and tracks supersession explicitly. When a newer revision exists, the older one is archived but not deleted. The system surfaces the version it cited so a reviewer can verify the answer matches the current approved record. The assistant never decides which version is current — that decision belongs to the operations lead.
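One way to picture explicit supersession tracking is the sketch below. The `DocVersion` record and both helper functions are hypothetical; the point is that the supersession pointer is data entered by a person, and the code only reports it alongside the cited version.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class DocVersion:
    doc_id: str
    revision: int
    # Set by the operations lead when a newer revision is approved.
    # Archived revisions stay in the list; nothing is deleted.
    superseded_by: Optional[int] = None


def current_revision(versions: List[DocVersion], doc_id: str) -> Optional[DocVersion]:
    """Return the single revision not marked superseded, or None if unclear."""
    live = [v for v in versions if v.doc_id == doc_id and v.superseded_by is None]
    return live[0] if len(live) == 1 else None


def citation_note(versions: List[DocVersion], cited: DocVersion) -> str:
    """Describe the cited version so a reviewer can compare it to the record."""
    label = f"{cited.doc_id} rev {cited.revision}"
    current = current_revision(versions, cited.doc_id)
    if current is None:
        return f"{label} (current revision unclear; refer to operations lead)"
    if current.revision == cited.revision:
        return f"{label} (current)"
    return f"{label} (archived; current is rev {current.revision})"
```

If zero or several revisions are unmarked, the helper returns None and does not guess, which keeps the decision with the operations lead.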
How is the citation gate actually enforced?
The gate is a separate service running between model output and reply send. The model returns a structured response with the answer, cited chunk references, and a confidence value. The service verifies each cited chunk was actually returned by the retrieval call for this turn — using a per-turn nonce to make the check replay-resistant. Mismatch or absent citation drops the payload before the user sees it. The model never sees the gate code.
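The check described above can be sketched roughly as follows. The class and field names are hypothetical, and the way the nonce travels (issued with the retrieval result and attached to the response envelope by the orchestrator) is an assumption; the text only states that a per-turn nonce is used.

```python
import hmac
import secrets
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Turn:
    """What retrieval returned for one user turn, plus a fresh nonce."""
    chunk_ids: frozenset
    nonce: str = field(default_factory=lambda: secrets.token_hex(16))


@dataclass
class ModelResponse:
    """Structured model output as described above, wrapped with the turn nonce."""
    answer: str
    cited_chunks: List[str]
    confidence: float
    nonce: str


def gate(turn: Turn, response: ModelResponse) -> Optional[str]:
    """Return the answer if it passes; return None to drop the payload."""
    # A response carrying another turn's nonce is a replay: drop it.
    if not hmac.compare_digest(turn.nonce, response.nonce):
        return None
    # No citation at all: drop it.
    if not response.cited_chunks:
        return None
    # Every cited chunk must be one that retrieval returned for this turn.
    if any(c not in turn.chunk_ids for c in response.cited_chunks):
        return None
    return response.answer
```

The gate reads only the retrieval result and the structured response, so it can run as its own process and needs nothing from the prompt.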
Why not train employees on prompt engineering instead?
Training works for some employees, and unevenly across a workforce. Even trained senior officials at agencies with mandatory security training have leaked sensitive documents into public ChatGPT; the summer 2025 CISA case is on the context page. Discipline that fails for trained federal officials fails harder for an apprentice on a deadline. Architecture scales; discipline does not.
What was deliberately not built?
Project management features (the architecture sits next to existing operations platforms, not on top of them), document authoring (the assistant retrieves and cites; document authorship belongs to the operations lead), automatic supersession decisions (too consequential to delegate to the model), full-text web search fallback (defeats the fixed-source constraint), and a generic conversational mode (defeats the fixed-role constraint).
What if a different trade — plumbing, electrical, mechanical service — wants to build this?
The four constraints are industry-neutral. The role taxonomy and approved corpus change with the trade. Plumbing might use IPC instead of IMC; electrical adds NEC sections; service companies add equipment-specific troubleshooting trees. The retrieval pipeline, the post-processing gate, and the role-and-scope filter are unchanged.