Where the model lives, and what that buys
The architecture moves through three deployment tiers — current commercial API, intermediate private-hosted, and full on-prem target. This page names what changes at each tier, and what the company controls when the model and the data finally share one perimeter inside its own office.
The architectural contract — fixed role, fixed source, fixed scope, citation-or-refusal — is enforced at every tier. What changes between tiers is where the data sits and who else can read it on the way through. The path is named explicitly so a company evaluating the architecture can see today's reality, the next step, and the destination on the same page.
Where we are on the path
Three deployment tiers, named in order. The architectural contract is the same across all three; the perimeter is what moves.
Tier 1: Commercial-API inference under a no-train clause
The four-constraint contract already runs: retrieval binds to an approved corpus, the citation gate verifies every answer, refusal-by-default is the dominant failure mode. Inference itself is a call to a commercial foundation-model provider under a no-train contract; hosting is on US-region cloud infrastructure with encryption at rest and in transit.
What this tier accepts: the question text, the retrieved chunks, and the model output transit a vendor's network on the way through. The vendor contract restricts use of that data, but the data is briefly outside the company's perimeter for the round-trip.
Tier 2: Inference in a single-tenant cloud the company controls
Inference moves to a dedicated tenant in cloud infrastructure the company itself owns — the company's own AWS, Azure, or GCP account, customer-managed encryption keys, no shared compute. The same retrieval, citation gate, and audit log run inside that tenant. Telegram still carries the channel layer.
What this tier buys: it removes the shared-tenant blast radius, and the no-train clause stops being the only safety mechanism because the encryption keys belong to the company. What it does not buy: the data still transits a cloud network; an outage at the cloud provider still takes the assistant offline.
Tier 3: Foundation model on a single workstation in the company's own office
The model, the retrieval index, the citation gate, and the audit log share one machine inside the company's office. A foreman's question travels the company's network to that server, gets answered, and returns. The corpus does not leave the building. The vendor contract becomes one fewer dependency, because the vendor is no longer in the inference path at all.
What this tier buys: control of the model layer and the audit trail. Predictable capex instead of metered token cost. A simpler answer to discovery, surety, and procurement questions. The rest of this page describes Tier 3 in detail because that is the deployment this site documents end to end.
Why move along the path
Each step removes a boundary the company cannot directly observe. Tier 1 → 2 removes the shared-tenant exposure. Tier 2 → 3 removes the cloud transit. The economic case for moving rarely turns on token cost — it turns on what the company wants to be able to prove about who reads its data, and on whose books the inference cost lives. Companies that never face that proof problem may stop at Tier 1 and the architecture still does its job. Companies that need the proof move forward when the operational cost is justified by the boundary they want to remove.
Where the model lives (Tier 3)
The sections that follow describe Tier 3 in detail — the on-prem target the architecture is moving toward. Where a section describes mechanics that change between tiers, the difference is named.
The default path for public AI assistants: a foreman types a question, the request travels to a vendor server in another state, the model processes it, the answer comes back. Between those points the company's data crosses the public internet — drawings, takeoffs, jobsite photos, customer names, contract pricing. Where that data settles, and who reads it, is a contract question, not an architectural one.
SMACNews, August 2025: cloud AI carries “the risk of exposing proprietary fabrication methods, client details, and competitive advantages.” That is not our framing. That is the trade body of sheet-metal contractors naming the threat in the contractor's own language.
At Tier 3, this installation is built differently. The foundation model (open, multilingual, Qwen3-32B class, or Qwen2.5-72B on the upper hardware tier) runs on a single workstation in the company's own office: under vLLM on the production tier, under MLX or LM Studio on the Mac pilot tier. Retrieval, citation gate, and audit log share the same machine. The foreman's question travels the company's own network, hits the local server, gets processed, and returns. The corpus does not leave the building on the way to an answer.
That is why our server sits in our office, not theirs.
What sits on the box
Four components on one machine.
1. Foundation model
Qwen3-32B-Instruct on the pilot hardware tier; Qwen2.5-72B-Instruct on the upper hardware tier, in fp8 or 4-bit quantization to fit the hardware. The production workstation serves the model under vLLM; the pilot Mac serves it under MLX or LM Studio's headless server. Both are open multilingual checkpoints with documented Russian-language coverage, which matters for crew voice notes. Egress to the public internet is firewalled at the box. Version pinned in the vllm.serve config.
2. Retrieval
A local vector store (Qdrant or pgvector) over the approved corpus: SOPs, install standards, OEM manuals, contract documents, jobsite photos. Results are filtered by user role and project assignment before similarity search runs. The role filter is a data-layer predicate (a SQL WHERE clause under pgvector, a payload filter under Qdrant), not a prompt instruction.
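The filter-before-search order can be sketched in plain Python. This is a minimal illustration, not the production retriever: the Chunk fields, the numeric role ranks, and the sample chunk IDs are hypothetical, and a real deployment would push the same predicate into the vector store's own filter rather than filtering a Python list.

```python
from dataclasses import dataclass
from math import sqrt


@dataclass(frozen=True)
class Chunk:
    chunk_id: str
    project: str
    min_role: int                 # hypothetical rank: 1 = apprentice, 2 = foreman/PM, 3 = owner
    embedding: tuple              # toy 2-d vectors for the example


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(chunks, query_vec, user_role, user_projects, k=3):
    # Step 1: data-layer filter. Out-of-scope chunks are removed before
    # any ranking happens, so the model can never be handed them.
    allowed = [
        c for c in chunks
        if c.project in user_projects and c.min_role <= user_role
    ]
    # Step 2: similarity search runs over the allowed set only.
    ranked = sorted(allowed, key=lambda c: cosine(c.embedding, query_vec), reverse=True)
    return [c.chunk_id for c in ranked[:k]]


CORPUS = [
    Chunk("sop-duct-014", "school-job", 1, (1.0, 0.0)),
    Chunk("vrf-tower-commissioning-002", "vrf-tower", 1, (0.9, 0.1)),
    Chunk("school-job-change-order-007", "school-job", 2, (0.8, 0.2)),
]

# An apprentice assigned only to the school job.
apprentice_hits = retrieve(CORPUS, (1.0, 0.0), user_role=1, user_projects={"school-job"})
print(apprentice_hits)
```

The VRF tower chunk is a close vector match for the query but never reaches the ranking step, because the project check runs first. No prompt wording is involved in that exclusion.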
3. Citation gate
A small Python service between model output and reply send. It reads the model's structured JSON, verifies that each cited chunk ID matches one returned by retrieval this turn, and drops uncited or mismatched payloads. The gate runs in code the model never sees. Median gate latency is under fifteen milliseconds.
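A minimal sketch of such a gate, under stated assumptions: the JSON field names "answer" and "citations" and the reply format are hypothetical, chosen only to make the check concrete. The actual service may structure its payload differently.

```python
import json

REFUSAL = "not found in approved documents"


def citation_gate(model_output: str, retrieved_ids: set) -> str:
    """Return the answer only if every citation matches a chunk retrieved this turn."""
    try:
        payload = json.loads(model_output)
    except json.JSONDecodeError:
        return REFUSAL                      # malformed output is treated as uncited
    if not isinstance(payload, dict):
        return REFUSAL

    answer = payload.get("answer")          # hypothetical field name
    citations = payload.get("citations")    # hypothetical field name

    if not isinstance(answer, str) or not answer.strip():
        return REFUSAL
    if not isinstance(citations, list) or not citations:
        return REFUSAL                      # uncited payloads are dropped
    if not all(isinstance(c, str) and c in retrieved_ids for c in citations):
        return REFUSAL                      # cited ID was not returned by retrieval
    return f"{answer} [{', '.join(citations)}]"


retrieved = {"sop-014#3", "oem-manual-221#7"}
print(citation_gate('{"answer": "Use 26 ga.", "citations": ["sop-014#3"]}', retrieved))
print(citation_gate('{"answer": "Use 26 ga.", "citations": ["made-up-id"]}', retrieved))
```

The gate compares IDs, not meaning: it proves the citation points at something retrieval actually returned this turn, which is the property the contract depends on.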
4. Audit log
One JSON line per turn — user, timestamp, project tag, retrieved chunk IDs, model output, gate verdict, citation. Same disk as the index. Daily rotation. Filename pattern: audit-YYYY-MM-DD.jsonl.
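The one-line-per-turn format with daily rotation can be sketched as follows. The function name, the JSON key names, and the choice of UTC for the date in the filename are assumptions for illustration; only the field list and the filename pattern come from the description above.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def append_audit_row(log_dir, *, user, project, chunk_ids, model_output,
                     verdict, citation, now=None):
    """Append one JSON line for this turn to audit-YYYY-MM-DD.jsonl."""
    now = now or datetime.now(timezone.utc)   # assumption: UTC; a shop may prefer local time
    row = {
        "user": user,
        "timestamp": now.isoformat(),
        "project": project,
        "retrieved_chunk_ids": list(chunk_ids),
        "model_output": model_output,
        "gate_verdict": verdict,
        "citation": citation,
    }
    # Daily rotation falls out of the filename: a new date opens a new file.
    path = Path(log_dir) / f"audit-{now:%Y-%m-%d}.jsonl"
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a", encoding="utf-8") as fh:
        # ensure_ascii=False keeps Russian-language text readable in the log.
        fh.write(json.dumps(row, ensure_ascii=False) + "\n")
    return path
```

Because each row is a complete JSON object on its own line, the log can be searched with ordinary line-oriented tools and loaded row by row without parsing the whole file.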
The hardware
Two configurations cover most fifty-to-two-hundred-employee operations. Both are office boxes, not datacenter installs.
Pilot tier — Apple Mac Studio M3 Ultra, ~$4,000
A 96 GB unified-memory machine running Qwen3-32B under MLX or LM Studio's headless server. Latency for a grounded HVAC query lands around two to three seconds end to end. One to three concurrent users without queueing.
Silent, fits under a desk, draws under 200 W. The configuration where an owner can put the architecture in front of a crew and watch the citation contract hold up against real questions.
Production tier — NVIDIA RTX PRO 6000 Blackwell workstation, ~$19,000
A 96 GB GPU in a Threadripper or Xeon chassis running Qwen2.5-72B-Instruct under vLLM. Latency drops below one second for typical retrieval-grounded answers, on par with mid-tier cloud frontier models (e.g., Claude Sonnet 4.6) on the same retrieval-grounded prompts. Ten to twenty concurrent users.
Trade-off: noise, heat, 1500 W under load, and a Linux server someone administers. The production tier for a crew of fifty plus.
A multi-GPU cluster — Qwen3-235B-A22B class or DeepSeek-V3 class — becomes appropriate only above a few thousand queries a day or for a shared backend across multiple branch offices. Below that line, a cluster is operationally disproportionate.
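A back-of-envelope check on why quantization is needed to fit the 96 GB tiers. This is a rough sketch: it uses nominal parameter counts (32B, 72B), counts weights only, and ignores KV cache, activations, and runtime overhead, all of which need additional headroom in practice.

```python
MEMORY_GB = 96  # both tiers described above


def weight_gb(params_billion: float, bits_per_param: int) -> float:
    """Approximate weight footprint in GB (1 GB = 1e9 bytes), weights only."""
    return params_billion * bits_per_param / 8


CONFIGS = [
    ("32B at 4-bit", 32, 4),
    ("32B at fp8", 32, 8),
    ("72B at 4-bit", 72, 4),
    ("72B at fp8", 72, 8),
    ("72B at 16-bit", 72, 16),
]

for label, params, bits in CONFIGS:
    need = weight_gb(params, bits)
    verdict = "fits" if need < MEMORY_GB else "does not fit"
    print(f"{label}: ~{need:.0f} GB of weights, {verdict} in {MEMORY_GB} GB")
```

Under these assumptions the 72B model at 16-bit needs about 144 GB for weights alone, while fp8 brings it to about 72 GB, leaving limited room for KV cache on a 96 GB card. That remaining headroom is one of the things that bounds concurrent users.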
For scale: $4,000 once is roughly an annual office-copier lease. $19,000 once is roughly one Ford F-250 for a crew. Shop-tool capex, not ERP capex.
What the architecture buys
Not marketing. Numbers a broker and a surety analyst can verify.
Cyber insurance — 2026 renewal questionnaires
After the Mercor breach (September 2025) and the wave of shadow-AI claims through 2025, several major carriers added an AI-governance section to renewal questionnaires. Typical question categories: AI tool inventory, DLP for AI prompts, sub-processor disclosure, audit-log retention, training-data leakage, ransomware impact on AI infrastructure, and AI-use policy. The architectural answer (one server, one corpus, audit trail on the company's disk) addresses most of those in a sentence rather than several procedures. Brokers familiar with the segment indicate that a premium credit in the single-digit to low-double-digit percent range is plausible for a clean architectural posture, versus a meaningful loading or sub-limit for “ChatGPT plus a written policy.” Carriers vary: a contractor's actual delta depends on the specific carrier's questionnaire scoring and the contractor's overall risk profile, not on this page.
Surety bonding capacity
Sureties increasingly include systems maturity in capacity analysis — documented processes, reduced key-person dependency, recoverable institutional knowledge. A mid-size contractor with strong financials but knowledge concentrated in two PMs and a senior super is often bond-capped by key-person risk, not balance sheet. An indexed corpus and audit trail demonstrated on actual jobs over twelve to eighteen months is direct evidence that institutional knowledge is recoverable. The effect on capacity is one input among several the surety analyst weighs — capital, work-in-progress schedule, and historical loss ratio dominate. Real outcomes vary by surety, contractor history, and balance sheet.
A concrete dispute scenario
Illustrative scenario, constructed to show the math; not drawn from a single real dispute. Settlement ranges and dollar swings reflect commercial-construction dispute economics in the Pacific Northwest in 2024–2026.
Backcharge of $385K on a $12M institutional project. The GC issues “delayed sheet metal rough-in caused GC overhead extension and follow-trade re-mobilization.” The mechanical contractor disputes it: the delay was a 19-day RFI sitting with the architect on chase dimensions. Without an audit trail: the PM digs through emails; settlements typically land at 50–65% of the backcharge (roughly $190–250K). With the audit trail: every question the foreman asked over those 19 days is timestamped with the documents cited and the answers given. The foreman asked three times about chase dimensions, got “not found in approved documents” each time, and escalated to the PM; the RFI is logged. Settlement lands at 0–15% ($0–58K) or the backcharge is withdrawn. Net swing: roughly $200K per dispute. On a $40M-revenue contractor running three to five such disputes a year, that is roughly $600K–$1M annually in avoided cost. Before counting OSHA inquiries and warranty callbacks.
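The per-dispute swing can be reproduced from the scenario's own inputs. This is a worked sketch of the illustrative arithmetic only: the backcharge and the percentage ranges are the scenario's stated values, and taking midpoints is an assumption about how to summarize the two ranges.

```python
BACKCHARGE = 385_000  # dollars, from the illustrative scenario


def settlement_range(low_pct: int, high_pct: int, backcharge: int = BACKCHARGE):
    """Settlement range in whole dollars for a percentage band of the backcharge."""
    return backcharge * low_pct // 100, backcharge * high_pct // 100


def midpoint(rng):
    return sum(rng) / 2


without_log = settlement_range(50, 65)   # (192500, 250250)
with_log = settlement_range(0, 15)       # (0, 57750)

swing_mid = midpoint(without_log) - midpoint(with_log)   # midpoint to midpoint
swing_min = without_log[0] - with_log[1]                 # best case without vs worst case with
swing_max = without_log[1] - with_log[0]                 # worst case without vs best case with

print(f"without audit trail: ${without_log[0]:,} to ${without_log[1]:,}")
print(f"with audit trail:    ${with_log[0]:,} to ${with_log[1]:,}")
print(f"swing: ${swing_min:,} to ${swing_max:,}, midpoint ${swing_mid:,.0f}")
```

The midpoint swing comes out at $192,500, which is where the "roughly $200K per dispute" figure sits. The full range is wide, from about $135K to about $250K, so the per-dispute number is best read as a central estimate rather than a floor.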
Prequalification scoring
Public agencies and large general contractors began adding AI-data-handling questions to prequalification packages in late 2025. A contractor with an architectural answer — corpus loaded only with approved company documents, access role-scoped, audit trail locally controlled, no project data leaving the perimeter — supplies a one-paragraph answer that compares favorably against “ChatGPT plus a written policy.” On tight prequal scores this is a measurable differentiator; the page does not estimate the points because each agency scores differently. The signal: the questions are appearing where they did not appear two years ago.
What we are not claiming (the legal floor)
A specific percentage premium discount. A guarantee of bonding-capacity expansion. Litigation immunity. A tamper-proof audit log without foundation testimony. Safe harbor under HB 1071, OCPA, or CCPA. Those statements are between the contractor, their counsel, and their broker — not this page.
Where the data lives (Tier 3)
At Tier 3, every document, every retrieval call, every model completion, every audit row stays on the same machine. The perimeter is the building's network. There is no cloud tenant to isolate from, no third-party processor to contract with, no encryption-in-transit story across an external hop — because there is no external hop. (At Tier 1 and Tier 2, those concerns are real; at Tier 3 they fall away by topology.)
One perimeter
Model, corpus, gate, log on one machine on one LAN. Disk encryption: FileVault on macOS, LUKS on Linux.
No external processor
Inference does not run under a commercial API contract because inference does not leave the box. There is no no-train clause to enforce — there is nothing to train on.
Owner-controlled retention
The audit log lives on the company's disk. Retention is whatever the company sets. A wipe is shred on the relevant files (or, for a full-disk wipe, destroying the LUKS / FileVault key) plus a signed line in the log. There is no vendor portal to file a deletion request through.
Who can see what
Access is role-based and scoped to the project a person is assigned to. An apprentice on a school job sees documents and conversations tied to that job — not the payroll folder, not the VRF tower across town. Foremen and PMs see their projects. Owners see everything in the workspace. Every document open, every question, every export is written to the audit log with user, timestamp, and project tag. The owner can review the log at any time, from the same shell that runs the rest of the system.
What the AI can and cannot see
The model only reads what has been approved and loaded into the workspace. Anything outside that scope is invisible to it.
What the AI processes
- Approved company documents — SOPs, install standards, safety procedures, commissioning checklists
- Industry codes and OEM manuals loaded by the owner
- Project-level files — submittals, drawings, RFIs, daily logs uploaded to a given job
- Questions sent by crew through Telegram and the answers returned
- Photos, voice notes, and inspection results attached to a project record
What the AI never sees
- Payroll records, pay rates, or timecards
- Personal employee information — SSN, home address, medical files
- Bank accounts, invoices paid, or financial ledger data
- Email inboxes, calendars, or other systems not explicitly connected
- Anything that was not uploaded to the workspace
Constraints as architecture, not policy
Public ChatGPT and Claude default to permissive: the model picks role, sources, and uncertainty fallback on its own. The architecture inverts that default through four constraints that do not depend on system prompts to hold:
1. Retrieval binding (architectural)
The retriever pulls only from the approved corpus indexed on the local box. Not “instruction said so.”
2. Output gate (architectural)
A post-processing service runs between model output and reply send. It checks that the response contains a citation token matching a chunk ID returned by retrieval. Uncited payloads are dropped; the user receives the refusal string instead.
3. Role and scope filter (architectural)
Document access is filtered by user role and project assignment before retrieval runs. Filtering happens at the data layer, not in the prompt.
4. Refusal by default
When retrieval misses, the response is “not found in approved documents.” There is no fallback path to general model knowledge — the gate refuses; the model is never trusted to refuse.
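The "no fallback path" property is a control-flow property, and can be sketched as a turn handler in which the refusal is decided in code. This is an assumed shape, not the documented implementation: the function name and the three callables (retrieve, generate, gate) are hypothetical, and skipping the model call entirely on a retrieval miss is one possible way to realize the rule.

```python
REFUSAL = "not found in approved documents"


def answer_turn(question, retrieve, generate, gate):
    """One turn: retrieval first, then generation, then the gate. No other path exists."""
    chunks = retrieve(question)            # expected: {chunk_id: text}, already role-filtered
    if not chunks:
        # Retrieval miss: the refusal is issued by code. The model is not
        # asked whether it happens to know the answer.
        return REFUSAL
    draft = generate(question, chunks)
    # The gate has the last word on what is sent.
    return gate(draft, set(chunks))


# Stubs to show the miss path; a real deployment wires in the actual services.
def empty_retriever(question):
    return {}


def generator_that_must_not_run(question, chunks):
    raise AssertionError("model was consulted on a retrieval miss")


reply = answer_turn("What is the chase dimension on level 3?",
                    empty_retriever, generator_that_must_not_run, gate=None)
print(reply)
```

The stub generator raises if it is ever called, which makes the guarantee checkable: on a miss, the reply is the refusal string and the model's general knowledge is never in play.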
Why training is not the fix
Training works for some employees, unevenly across a workforce. The summer 2025 case of the acting director of CISA — the U.S. federal cybersecurity agency — uploading For Official Use Only documents into public ChatGPT (reported January 2026 by TechCrunch, Ars Technica, MeriTalk) shows that even disciplined senior officials at agencies with mandatory security training, dedicated DLP, and regular tabletop exercises leak data under operational pressure.
The same SMACNews issue from August 2025 that lays out six AI-governance recommendations for contractors names cloud AI directly as the “risk of exposing proprietary fabrication methods, client details, and competitive advantages.” A local-model architecture moves this concern out of the training-compliance layer and into the network-topology layer — where it can be held by tools any sheet-metal contractor already operates.
The fix is not another training round. The fix is architecture: constraints around the assistant that hold without relying on individual discipline.
What we are not claiming
This is not air-gap. The boundary is described honestly because otherwise it breaks under the first real question.
Telegram transports questions and answers
A foreman types on his phone; the message routes through Telegram's own infrastructure (operated outside US contractor control) before reaching the company LAN. The bot is reachable that way; the corpus is not. What does not transit Telegram: retrieval, generation, gate verification, audit logging — those happen on the company's side. Telegram carries the question text and the final answer text, plus the metadata Telegram retains by design (sender, timestamp, group membership). Nothing else.
Model weights and patches enter from outside
Qwen weights pulled from Hugging Face, vLLM from PyPI, OS patches from Canonical, drivers from NVIDIA. Each is a supply-chain trust the company inherits. The update window is scheduled, by a named admin, with signature verification.
One box is one point of failure
A GPU throws an ECC error on a Tuesday; the building loses power for six hours. The assistant is offline. There is no automatic cloud failover — that fallback would void the perimeter. For shops that cannot tolerate outage, the supported answer is active/passive, two boxes. Cost roughly doubles. Named, not deferred.
Not “zero data leaves the network”
NTP, monitoring, and signed updates cross the boundary by design. What does not cross: questions, answers, corpus chunks. On the path to an answer, none of those leave.
What the company commits to
Named on the page, not deferred to a contract.
Patches. A named on-call admin. OS security: weekly, automatic, reboot Sunday 02:00. vLLM and model weights: monthly, manual.
Reboots after power events. A systemd unit; whoever flips the breaker checks the dashboard.
Telegram token rotation. Every 90 days, by the owner. Replace the token in the bot service config, restart, revoke the old token at BotFather.
Quarterly disaster-recovery drill. The named admin restores from restic to a spare box and verifies retrieval and inference end to end. If that does not happen, you do not have backups; you have hopes.
Kill switch. The owner can stop the bot, freeze the corpus, or revoke a user role with one shell command or one dashboard click.
If a company cannot name these humans by role today, on-prem is the wrong tier and this page says so.
Infrastructure trade-offs
Open model versus frontier. Qwen2.5-72B answers grounded HVAC questions well, with citation discipline that holds under the gate. It is not GPT-class on open-ended reasoning. For retrieval-bound answers, which is what this is built for, the gap is small. For freeform chain-of-thought, the gap is real. The architecture leans into the bound; questions outside the corpus return the refusal string.
Capex versus opex. Mac is $4,000 once. NVIDIA workstation is $19,000 once. Cloud Claude at typical query volume — 12k input, 700 output tokens, ~$0.05 a query — breaks even against the Mac at roughly 80,000 queries and against the NVIDIA at roughly 380,000. Ops cost is not in those numbers. The economic case for on-prem is rarely token cost; it is control of where the data sits.
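The break-even figures follow directly from the hardware price divided by the per-query cloud cost. A small worked sketch, using the page's own numbers for capex and cost per query; the 200-queries-a-day volume in the second part is a purely hypothetical input, included only to turn query counts into calendar time.

```python
CENTS_PER_QUERY = 5  # ~$0.05 per cloud query, as stated above

CAPEX_DOLLARS = {
    "Mac Studio pilot tier": 4_000,
    "NVIDIA production tier": 19_000,
}


def break_even_queries(capex_dollars: int, cents_per_query: int = CENTS_PER_QUERY) -> int:
    """Queries at which cumulative cloud spend equals the one-time hardware cost."""
    return capex_dollars * 100 // cents_per_query


def days_to_break_even(capex_dollars: int, queries_per_day: int) -> float:
    return break_even_queries(capex_dollars) / queries_per_day


HYPOTHETICAL_QUERIES_PER_DAY = 200  # assumption for illustration, not a measured volume

for tier, capex in CAPEX_DOLLARS.items():
    q = break_even_queries(capex)
    d = days_to_break_even(capex, HYPOTHETICAL_QUERIES_PER_DAY)
    print(f"{tier}: {q:,} queries, about {d:,.0f} days at {HYPOTHETICAL_QUERIES_PER_DAY}/day")
```

Integer cents are used to avoid floating-point rounding in the division. As the text notes, these numbers exclude ops cost, so they understate the true break-even point.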
Latency parity. The NVIDIA tier matches cloud Claude on grounded round-trip. The Mac tier is two to three times slower. Both are fast enough for a Telegram exchange. Neither is fast enough for sub-100ms streaming UX.
Out of scope. Questions that require frontier reasoning, current-events grounding, or anything outside the indexed corpus return the refusal string. The architecture trades breadth for boundary. That is the deal.
When the system gets it wrong
Refusal-by-default is the dominant failure mode: the system answers “not found in approved documents” when retrieval misses. This is intentional. A user can re-ask, expand the question, or escalate to a person.
A more dangerous failure mode is a citation against a stale or superseded document. The knowledge base versions documents and tracks supersession explicitly. When a newer revision exists, the older one is archived but not deleted, and the system surfaces the version it cited so a reviewer can verify the answer matches the current company record.
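Supersession tracking of this kind can be sketched with a small version record. The DocVersion fields and function names are hypothetical, chosen to show the idea: old revisions stay in the record, exactly one revision per document is current, and a citation is flagged when it points at anything else.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DocVersion:
    doc_id: str
    revision: int
    superseded_by: Optional[int] = None   # revision that replaced this one; None if current


def current_revision(versions, doc_id: str) -> int:
    live = [v for v in versions if v.doc_id == doc_id and v.superseded_by is None]
    if len(live) != 1:
        raise ValueError(f"{doc_id}: expected exactly one current revision, found {len(live)}")
    return live[0].revision


def check_citation(versions, doc_id: str, cited_revision: int) -> dict:
    """Report the cited revision next to the current one so a reviewer can compare."""
    current = current_revision(versions, doc_id)
    return {
        "doc_id": doc_id,
        "cited_revision": cited_revision,
        "current_revision": current,
        "stale": cited_revision != current,
    }


VERSIONS = [
    DocVersion("install-standard-duct", 1, superseded_by=2),  # archived, not deleted
    DocVersion("install-standard-duct", 2),
]

print(check_citation(VERSIONS, "install-standard-duct", cited_revision=1))
print(check_citation(VERSIONS, "install-standard-duct", cited_revision=2))
```

Keeping the archived revision in the record is what makes the check possible after the fact: a reviewer can see both what the answer cited and what the company record said at review time.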
If a wrong answer ships, the audit trail names the model output, the chunk retrieved, the document version, and the timestamp — every piece needed to determine whether the failure was retrieval, document staleness, or the user's question. Failures are recoverable artifacts, not undisclosed events.
Why move to Tier 3
The standard alternative to moving forward is to stay at Tier 1 with a no-train clause plus a SOC 2 Type II report. Both are real artifacts. Neither answers the question Tier 3 answers.
Control. When the model runs on the company's box, the company decides who reaches it, when it updates, and when it shuts down. A vendor outage cannot take it offline. A vendor pricing change cannot reprice it. A vendor terms-of-service revision cannot retroactively change what gets logged.
Audit-trail locality. The audit log lives on the same disk as the index and the model. A dispute review happens at the company's keyboard, not through a vendor support ticket. Subpoena response time drops from weeks to an afternoon.
Predictable cost. Capex is one number, paid once, depreciable on the company's books. There is no per-token meter, no overage tier, no surprise bill the month a foreman asks the assistant a thousand questions.
A SOC 2 report attests to controls a vendor operates on the company's behalf. On-prem removes the vendor. The cost-versus-control trade is explicit: capex up front buys the company the right to keep its own data on its own disk.
Certification status
The architecture does not carry SOC 2 certification today. The architectural controls — per-machine isolation, encryption at rest, role-and-scope retrieval filtering, structured audit logging on the company's own disk, on-premise hosting — exist as code and are described on this page. The audit log is structured and locally controlled; it is not held out as a tamper-proof artifact for legal-grade evidence (that requires foundation testimony about how the log is generated and protected, which is a question for the company's counsel, not this page). A SOC 2 Type II audit would attest those controls operate consistently for twelve months; it would not change what they are. A company building this architecture in-house can pursue certification on its own timeline; the controls are not contingent on the badge.
Data you own and can export
The operating company owns the documents, the project history, and the question-and-answer record. Everything is exportable. Documents come back in their original formats. Conversation history, project logs, and answer citations export to CSV and JSON. A workspace can be fully wiped: shred on the relevant files, or destruction of the FileVault / LUKS key for a full-disk wipe, plus a signed line in the audit log once the wipe completes. The architecture sits on top of the company's own record; the data is exportable end to end, not held behind an opaque path.
If you can power off the box, the data stops moving — that is the only privacy promise that holds up.