docs: en wiki — strip apologetic framing; state design choices with their drawbacks

Editorial pass per user direction: stop justifying the architecture.
For every "why this design works" passage, name the costs the design
imposes, not as a parenthetical aside but as substantive critical
analysis. Each major architectural-claim page now carries an explicit
drawbacks/costs section.

Pages revised:

concepts/thesis.md
- "The reward" → "What the design enables (and what each enabler still
  costs)": for each promised benefit (single codebase, PMs evolve
  without engineering, customisations layered cleanly), name the limit.
  Added closing observation that data-driven design redistributes
  complexity to people and tools the framework can't compile-check.
- "When it breaks down": rewrote to call out that "bypassing the
  framework" via 18 customer dirs makes the data-driven thesis partial,
  not complete.

concepts/semantic-fk.md
- "Why xly disabled FKs": added critical analysis. Both reasons could
  be addressed surgically; the chosen "no FKs anywhere" is the trade
  for DB-enforced integrity, paid every day the system runs.

concepts/master-slave.md
- "Slave naming caveat": stop framing retention as wise pragmatism.
  The naming was a poor choice; preservation has a real ongoing cost.

concepts/modules-forms-vtables.md
- "Three nouns, one engine": the universal dispatch path concentrates
  3,500+ lines + edge cases + special-case hardcodes in one class.
  Naming the trade.

concepts/multi-tenancy.md
- "How the design scales" → "How the design scales — and where it
  doesn't": shared schema = shared contention; tenant-filter index
  discipline; no physical hard-delete; rigid (sBrandsId,
  sSubsidiaryId) tenancy unit.

concepts/customization-channels.md
- Soften "90%+ should live here" claim: that's an aspirational target,
  not a measured fact. The 18 customer override directories are
  evidence the channel-2 demand is non-trivial.

concepts/api-surface.md
- "Why three tiers, not one" → "Why three tiers (and what splitting
  them costs)": three WARs to deploy, duplicate code, no shared
  session, three reverse-proxy entries. Note the alternative
  (single-WAR with package boundaries) and what that would cost vs
  gain.

reference/maintainer/proc-dispatch.md
- "Why dynamic proc dispatch matters": added five concrete costs (no
  compile-time check, no type safety, no call-site discoverability, no
  static analysis, broken stack traces). Reframed: dynamic dispatch
  made it cheap to keep adding procs, which made the pile grow, which
  made the pile harder to audit.

reference/maintainer/cache-invalidation.md
- New "Drawbacks of this design" section: confusing co-named systems,
  eviction in same transaction as write (silent corruption on Redis
  outage), allEntries=true blunt eviction, no batching, direct DB
  writes bypass everything. Also fixed the "if cache is local" hedge
  in section 3 (we've now empirically confirmed Redis-backed, so cache
  is shared).

reference/maintainer/bi-engine.md
- New "Drawbacks of the homebrewed approach" section: every chart
  needs a SQL author, charts run heavy SQL on OLTP DB, no semantic
  consistency between charts, no drill-down, customer-divergent KPI
  logic. Also dedup'd the duplicated "What this is not" section.

reference/maintainer/sql-templates.md
- "Why this is a 'template' library and not a code generator" → added
  costs: no enforcement, no regeneration, no template-origin tracking,
  customer overrides drift from scaffold. The 1,687 procs the schema
  carries are the evidence that "discipline rather than enforcement"
  doesn't fully hold.

reference/maintainer/activiti.md
- "Why this design works for xly's audience" → "Why xly avoided
  Activiti — and what that costs": scattered workflow logic, no
  central audit trail, no parallel-branch/reassignment, invisible
  flow-graph evolution, idle Activiti engine paying boot cost anyway.
- "Why xly bothered with Activiti at all" → "Why xly bothered with
  Activiti — and whether it was worth it": named the costs (second
  engine, second schema, second auth surface, modeler UI to learn) and
  the damning fact that on this dev DB the engine is idle. A future
  cleanup could plausibly remove Activiti entirely.

reference/maintainer/runtime.md
- New "What 'universal CRUD' means in practice" section: 3,500-line
  single-point-of-failure class, no type system on
  Map<String,Object>, poor discoverability ("what endpoints write to
  table X" is unanswerable). The trade: adding a module is essentially
  free, touching the runtime essentially never is.
- Updated cache-invalidation cross-link to drop the "open question"
  hedge (now empirically resolved).

slices/04-custom-field.md
- "Why it works without code changes" → "Why it works without code
  changes — and what that costs": merge runs on every request, three
  near-empty tables on every schema, display-only extension (real
  persisted fields still need ALTER TABLE), debuggability requires
  diffing 3 overlay tables.

slices/05-customer-sql-override.md
- Added drawbacks: no version control on the deployed body, no
  type-safety bridge, compounds the BI problem. Reframed the "right
  rule of thumb": 18 customer override directories suggest the
  channel-2 demand is structural, not exceptional; that's evidence the
  metadata model isn't expressive enough, not a celebration of the
  escape hatch.

slices/06-hardware.md
- "The cleanest story xly tells about an awkward problem" → removed
  the "cleanest" framing. Added costs of "DB as the only contract": no
  backpressure, no request/response, bridge-side state invisible to
  the framework, three layers of polling multiply latency, hardest
  code (byte protocols) gets least CI. A real-time-aware architecture
  would use streaming end-to-end; xly's choice trades latency,
  observability, flow control for operational simplicity. Liveable for
  press tempo, not for faster shop-floor signals.

zichun
1 parent ff2ee55a
Showing 16 changed files with 499 additions and 126 deletions
en/docs/concepts/api-surface.md
en/docs/concepts/customization-channels.md
en/docs/concepts/master-slave.md
en/docs/concepts/modules-forms-vtables.md
en/docs/concepts/multi-tenancy.md
en/docs/concepts/semantic-fk.md
en/docs/concepts/thesis.md
en/docs/reference/maintainer/activiti.md
en/docs/reference/maintainer/bi-engine.md
en/docs/reference/maintainer/cache-invalidation.md
en/docs/reference/maintainer/proc-dispatch.md
en/docs/reference/maintainer/runtime.md
en/docs/reference/maintainer/sql-templates.md
en/docs/slices/04-custom-field.md
en/docs/slices/05-customer-sql-override.md
en/docs/slices/06-hardware.md
@@ -17,22 +17,51 @@ database is what makes their separation work — internal-API writes show
 up to external-API reads automatically because both run against the
 same schema.
-## Why three tiers, not one
+## Why three tiers (and what splitting them costs)
-Each tier answers a different question, and bundling them would
-sacrifice clarity:
+Each tier was originally split off to answer a different question:
 - **Internal** is large (universal CRUD over all metadata-driven
   modules), volatile (changes when the framework changes), and
   intentionally untyped (the SPA decides what to ask for, server obeys).
 - **External** is curated (only the endpoints integrators are allowed
   to use), versioned by `sApiCode`, and authenticated with bearer
-  tokens — it survives across framework changes precisely because it's
-  small and explicit.
+  tokens.
 - **Inbound webhooks** receive untrusted bodies from third-party
   systems and route them to xly handlers. The Swagger UI lives here
   because that audience benefits most from interactive documentation.
+The split has real costs that the wiki should not gloss over:
+
+- **Three WARs to deploy, monitor, and version-pin.** A new release
+  has to ship coordinated builds of `xlyEntry`, `xlyApi`, and
+  `xlyInterface`. Mismatches (e.g., a schema change in `xlyEntry`
+  that `xlyApi` hasn't picked up) are silent until the call path
+  hits them.
+- **Duplicate code.** `RequestAddParamUtil` exists in both
+  `xlyPersist` (for `xlyEntry`) and `xlyApi` (near-identical 56-vs-57
+  line copy). `InterfaceController` exists in both `xlyApi` and
+  `xlyInterface` with overlapping `/interfaceDefine/callthirdparty/*`
+  endpoints. Keeping the two halves in sync is operational discipline,
+  not a compile-time guarantee.
+- **No shared session.** A user authenticated in BACK has no
+  session in `xlyApi` — external callers fetch a separate bearer
+  token. This is correct for *external* integrators but means
+  internal cross-WAR calls (rare in practice, common in temptation)
+  have to go through the public token flow.
+- **Three context-paths means three reverse-proxy entries.**
+  The mapping from `BACK=:8597` and `FROUNT=:8598` to the actual
+  WARs lives in nginx config that isn't in this repo. Misconfigured
+  proxies are a common failure mode the codebase can't catch.
+
+Could the split have been a single deployable with internal package
+boundaries? Yes — Spring Boot supports it. The benefit of that
+alternative would be: one build, one set of dependencies, one
+session story, no duplicate utility classes. The cost: harder to
+scale tiers independently, harder to rate-limit external callers
+without affecting the SPA. xly chose the deployment-time isolation;
+the wiki's job is to acknowledge what that choice traded away.
+
 ## What each tier looks like at runtime
 - **Internal** — see [the five-key read](../reference/maintainer/runtime.md#the-five-key-read). One
@@ -24,8 +24,12 @@ They are visible in the BACK UI so a PM can audit them. The framework's
 runtime reads them on every request (with caching). The Java code is
 unchanged; the application's behaviour is what those rows say it is.
-This is the default path. **90%+ of customer customizations should live
-here.**
+This is the path the architecture intends customers to use. Whether
+the actual ratio is 90/10 in favour of Channel 1 isn't measured
+anywhere; the empirical signal is that 18 customer directories
+under `script/客户/` exist, which is a non-trivial slice of the
+customer base needing what Channel 1 can't express. Take "90%+
+should live here" as an aspirational target, not a measured fact.
 ## Channel 2 — Per-customer SQL overrides
@@ -64,13 +64,16 @@ appears verbatim in 14k+ table and column names.
 ## "Slave" — naming caveat
 The term carries connotations in English that are absent from the
-Chinese 主表 / 从表. The wiki preserves "slave" because:
+Chinese 主表 / 从表. The wiki preserves "slave" verbatim because the
+codebase, schema, and auto-catalog use it in 14k+ identifiers and any
+translation would diverge from what developers actually grep.
-1. Renaming would break every cross-reference into the codebase, the
-   schema, and the auto-catalog (14k+ identifiers).
-2. Mapping every occurrence to "detail" or "child" would distort
-   searchability and produce wiki text that diverges from what
-   developers actually grep.
-
-Future xly versions may rebrand to "detail" / "header"; until then, the
-wiki uses the in-codebase term verbatim and notes it once here.
+That preservation has a cost. The naming was a poor choice in the
+first place — `主表 / 从表` translates straightforwardly as
+`master / detail` or `header / line`, both of which would have
+matched both English convention and the actual relational
+semantics. The cost of retaining "slave" is borne by every English-
+speaking maintainer who has to type or read the term, and by any
+future rebrand effort that has to do the schema-wide rename xly
+should have done at the start. The wiki documenting it once here
+doesn't remove the cost; it just acknowledges it.
@@ -87,10 +87,21 @@ sub-tabs.
 ## Three nouns, one engine
 The runtime — `BusinessBaseController` and `BusinessBaseServiceImpl`,
-documented in [Slice 1](../slices/01-hello-world.md) — knows how to
-render any module / form / virtual-table combination. There is no
-per-module Java code. PMs creating new modules are creating new rows;
-they are not creating new code paths.
+documented in [Slice 1](../slices/01-hello-world.md) — handles every
+module / form / virtual-table combination through one universal
+dispatch path. There is no per-module Java; PMs creating new modules
+are creating new rows.
+
+The flip side: that one engine has accumulated 3,500+ lines in
+`BusinessBaseServiceImpl` alone, plus another 800+ in
+`BusinessGdsconfigformsServiceImpl`. Edge cases, special-case
+table handling (e.g., the `mftproductionplanslave` hardcode at
+`BusinessBaseServiceImpl.java:1768`), per-tenant overlay merge
+logic, and the multi-tenant scope-bypass list all live in this
+single class. Adding a new feature that the universal dispatch
+doesn't handle means either expanding this class or writing
+custom code that bypasses it — both of which erode the "one
+engine handles everything" property the design promised.
 ## Business-data table prefixes
@@ -69,7 +69,7 @@ tenant. That's the catastrophic data-leak case. Three places to watch:
    doesn't validate the supplied table against the form's authorised
    tables, this is a privilege-escalation surface. See [Slice 1](../slices/01-hello-world.md#4-user-edits-a-row-clicks-save).
-## How the design scales
+## How the design scales — and where it doesn't
 The framework's multi-tenancy design scales by **row count**, not by
 code. A small SaaS deployment with one brand and one subsidiary uses
@@ -77,3 +77,31 @@ exactly the same Java, MyBatis mappers, and stored procedures as a
 deployment with dozens of brands × dozens of subsidiaries × several
 editions; only the row distributions in `gdsmodule`, `sisversionflow`,
 and the business-data tables differ.
+
+Scaling by row count is operationally simple but has limits the
+wiki should not paper over:
+
+- **Shared physical schema means shared resource contention.**
+  Every tenant's queries hit the same MySQL instance, same tables,
+  same indexes. A heavy report on tenant A's data competes for
+  buffer-pool space and CPU with tenant B's order entry. There is
+  no per-tenant resource isolation.
+- **Tenant filters in every WHERE clause.** Every read query
+  carries `sBrandsId = ? AND sSubsidiaryId = ?`. Indexes have to
+  lead with these columns to be useful — and almost all xly tables
+  do, by convention, but a maintainer adding a new index has to
+  remember. Forgetting produces a query plan that scans across all
+  tenants' rows and silently slows down once the table gets large.
+- **No physical hard-delete boundary.** A tenant offboarding does
+  not drop a database; it leaves the rows where they are
+  (sometimes marked `bInvalid`, sometimes deleted, sometimes
+  untouched). Permanent removal requires a custom cleanup script
+  per tenant. From a GDPR / data-residency angle, "this tenant
+  is gone" is hard to prove.
+- **`sBrandsId` / `sSubsidiaryId` everywhere is an inflexible
+  tenancy unit.** "Tenant" means exactly the `(sBrandsId,
+  sSubsidiaryId)` tuple. Alternate cuts (e.g., per-region access,
+  per-department access without sub-tenanting) don't fit the model
+  and would require parallel scoping columns. The model assumed
+  this shape would always be right for every customer; in
+  practice, it has been, but it's a hard commitment.
@@ -13,7 +13,7 @@ before reading any further.
 ## Why xly disabled FKs
-Two reasons given by the architecture, both pragmatic:
+Two reasons the architecture gives:
 1. **Bulk-write performance.** Mass inserts (work-order calculation,
    month-end closures, batch imports) write hundreds of thousands of
@@ -23,8 +23,28 @@ Two reasons given by the architecture, both pragmatic:
 2. **Schema-migration agility.** xly evolves quickly: new modules,
    new fields, new tables. With FKs, every schema change has to consider
    the constraint graph; without them, a `CREATE TABLE` or `ALTER TABLE`
-   is a local operation. The cost of that agility is borne at runtime
-   by the application code.
+   is a local operation.
+
+Both are real considerations, but neither is a slam-dunk argument for
+"zero FKs across the entire schema":
+
+- **Bulk-write performance** can be addressed surgically: disable
+  constraints during the batch (`SET FOREIGN_KEY_CHECKS = 0`),
+  re-enable after, validate. xly's choice was instead to not have
+  FKs at all, which means *every* read also pays the cost of trusting
+  ad-hoc proc validation rather than DB-enforced integrity.
+- **Schema-migration agility** is improved by no-FKs, but at the
+  price of moving every referential check into application code (or
+  forgetting it). In practice this means the integrity work an FK
+  would do automatically is now duplicated across hundreds of stored
+  procedures, with no compile-time guarantee any given proc actually
+  does the check (see Failure modes below).
+
+A more honest framing: the system traded **DB-enforced integrity**
+for **operational convenience at write time and DDL time**. The
+bug surface that trade introduced (orphan rows, cross-tenant
+references that go undetected, integrity bugs surfacing weeks
+later) is the cost paid every day the system runs.
 ## What a "semantic FK" is
@@ -43,28 +43,61 @@ Three costs are baked into this design and worth being explicit about:
    similar joins) is a [semantic FK](semantic-fk.md). Orphan rows are
    possible.
-## The reward
-
-In exchange xly gets:
+## What the design enables (and what each enabler still costs)
 - **One codebase serves dozens of customers.** Each customer's tenant
-  has its own metadata rows; the Java is identical.
+  has its own metadata rows; the Java is identical. — *Limit:* it
+  *doesn't* serve all customers. The 18 directories under
+  `script/客户/` (see [Slice 5](../slices/05-customer-sql-override.md))
+  are the wall the data-driven design hits — when a customer needs
+  different procedural logic, "single codebase" stops being true and
+  becomes "single Java codebase + a fan-out of customer-specific SQL
+  the database carries silently".
 - **PMs evolve the application without engineering time.** They open
   BACK, add a module, define a form, set permissions, and the next user
-  load shows the change.
-- **Customizations are layered cleanly** ([Slice 4](../slices/04-custom-field.md)):
+  load shows the change. — *Limit:* the PM's effective vocabulary is
+  whatever `gdsconfigformmaster` / `gdsconfigformslave` columns
+  expose. Anything genuinely new (a custom calculation, a non-standard
+  validation, a different save path) requires a stored procedure —
+  which takes engineering time again, just in SQL instead of Java. And
+  PMs without DB access can't reason about why their metadata change
+  produced wrong output, because the procedural side is invisible from
+  BACK.
+- **Customizations are layered "cleanly"** ([Slice 4](../slices/04-custom-field.md)):
   per-tenant overrides sit *on top of* the shared base without forking.
+  — *Limit:* the cleanliness is a Java-side property. The runtime
+  merge logic in `BusinessBaseServiceImpl` is non-trivial (3,500+
+  lines), debugging "why does this tenant see field X but not Y"
+  involves chasing through `gdsconfigformpersonalize` +
+  `gdsconfigformcustomslave` + `gdsconfigformuserslave` interactions.
+  And the overlay model can't `ALTER TABLE` — adding a real new
+  column still needs a coordinated schema migration.
+
+A more candid reading: the data-driven design **shifts complexity
+out of Java and into the database and the PM-built metadata**. The
+total complexity isn't lower; it's redistributed to people and tools
+the framework can't compile-check.
 ## When it breaks down
 Data-driven works until a customer needs behaviour that can't be expressed
 as metadata — different SQL, different procedure body, an aggregation rule
-that doesn't fit the framework's vocabulary. xly's escape hatch for that
-case is the [per-customer SQL override channel](../slices/05-customer-sql-override.md):
+that doesn't fit the framework's vocabulary. xly's response is the
+[per-customer SQL override channel](../slices/05-customer-sql-override.md):
 hand-written SQL committed to `script/客户/<customer>/` and applied
 directly to that customer's schema, bypassing the framework entirely.
-That channel is real and used. It is also the most expensive form of
-customization to maintain.
+
+It's worth being blunt about what this means. "Bypassing the framework"
+makes the entire data-driven thesis a *partial* property of the system.
+For the 18 customers under `script/客户/` the runtime is **no longer
+single-codebase** — the Java is shared but the actual proc bodies
+running on each customer's DB diverge, with no automated way to
+detect drift. A reviewer reading `Sp_SalSalesCheck` in source has no
+guarantee it's what runs in production for any given customer. The
+"escape hatch" framing is generous; in practice the override channel
+has become the standard answer for material business-logic
+differences, which is the failure mode the data-driven design was
+supposed to prevent.
 ## What this means for reading the wiki
@@ -169,26 +169,47 @@ emit audit entries via a custom `sp_add_flow_log`. This is the
 empirically-observed customisation channel — Activiti deployment
 is not seen in any `script/客户/` directory.
-### Why this design works for xly's audience
+### Why xly avoided Activiti — and what that costs
 The printing-industry ERP customers run rule-driven business
 processes (quote → order → production → delivery → invoice → payment)
-where each step is **its own document with its own form** by
-convention. A user expects "Now I open the next form and fill it in"
-rather than "the system tells me a task is waiting for me." For
-that audience:
-
-- Path 1 + Path 2 cover every observed scenario in this dev DB.
-- Path 3's value (BPMN modeling, reassignment, parallel gateways) is
-  reserved for the rare tenant whose approval graph genuinely needs
-  it.
-
-The trade-off: workflow logic is **scattered across stored procedures**
-rather than declarable in one place. Adding a new step to a flow
-means writing or editing one or more procs, not editing a BPMN
-diagram. For complex, frequently-changing flows, this is brittle.
-For the printing-shop reality (quote-to-cash chain that doesn't
-change much per customer), it's pragmatic.
+where each step is conventionally its own document with its own form.
+The audience-fit argument: a user expects "Now I open the next form
+and fill it in" rather than "the system tells me a task is waiting
+for me," so Path 1 + Path 2 cover every observed scenario in this
+dev DB, and Path 3 is held in reserve.
+
+The costs of going proc-based instead of BPMN-based:
+
+- **Workflow logic is scattered across stored procedures, not
+  declarable in one place.** Adding a step to "what happens after a
+  quote is approved" means writing or editing one or more `Sp_*` procs,
+  re-grepping every other proc that references the affected document,
+  and hoping nothing was missed. A BPMN engine would have one diagram
+  to look at.
+- **No central audit trail of who approved what when.** `bCheck = 1`
+  records that *some* approval happened, plus who approved it via the
+  `sCheckPerson` column — but the *path the document took* (which
+  steps, in which order, with what comments) lives only in proc-side
+  status flags, not in a queryable workflow history.
+- **No parallel-branch or reassignment semantics.** Path 1 + 2 cover
+  linear single-approver flows. The first time a customer needs
+  "two people must approve in parallel", or "if person A is on
+  vacation, route to person B", the system has to either fall back
+  to Path 3 (Activiti, currently disabled) or hand-code the routing
+  in stored procs.
+- **Flow-graph evolution is invisible.** Changing the steps of a
+  workflow means editing procs and document chains. There is no
+  diff that says "the order-approval flow changed from N steps to
+  N+1 steps on date X" — only commit history of individual procs.
+- **The Activiti engine is on the classpath and booted at runtime
+  for nothing.** Memory + JAR + schema (24 `act_*` base tables + 3
+  identity views) are paid for in every deployment whether they're
+  used or not.
+
+For the printing-shop reality the trade has been viable. It would
+not scale to a domain with frequently-changing approval flows or
+strict audit requirements.
 ## Activiti is wired — engine ON
@@ -320,21 +341,46 @@ For a flow to actually run, in roughly this order:
    transitions; downstream queries that filter on `bCheck = 1` start
    seeing it.
-## Why xly bothered with Activiti at all
+## Why xly bothered with Activiti — and whether it was worth it
 The codebase has its own `biz_flow` / `biz_todo_item` tables that
-*could* implement a hand-rolled approval system. The decision to put
-Activiti behind them buys:
+*could* implement a hand-rolled approval system. The arguments for
+putting Activiti behind them:
 - Standard BPMN modeling (the JS modeler pulls the same stencilset as
   Activiti Explorer).
-- Free state-machine semantics — the engine handles "task A done →
-  task B available" without xly maintaining the FSM in SQL.
+- Engine-managed state-machine semantics — "task A done → task B
+  available" without xly maintaining the FSM in SQL.
 - Diagram rendering (the page-as-PNG in `ProcessActController`).
-The cost: a second engine running in the JVM, a second DB schema with
-its own DDL drift, a second authentication surface (which xly papers
-over via the `act_id_*` views).
+The costs are not minor:
+
+- A second engine running in the JVM, with its own startup cost,
+  memory footprint, and operational surface.
+- A second DB schema (24 `act_*` tables + 3 identity views) that
+  diverges from xly's `gds*`/`biz*` conventions and needs its own
+  DDL migrations across Activiti versions (and indeed: see the
+  5.17 vs 6.0 version skew elsewhere on this page).
+- A second authentication surface that xly papers over via the
+  `act_id_*` views projecting xly's own users into Activiti's shape
+  — a hack that works but creates two-way coupling between user-
+  table changes and Activiti correctness.
+- A modeler UI (Angular 1.x era) that maintainers have to learn
+  separately from BACK.
+- And — the most damning cost — **on this dev DB the engine is
+  idle**. The `act_re_procdef` and `biz_flow` tables are empty, and
+  Path 1 / Path 2 handle every observed workflow scenario. The
+  Activiti dependency is paid for at every startup whether it's
+  exercised or not.
+
+A more honest framing: Activiti was bet on as the "real" workflow
+solution; in practice the simpler proc-driven paths covered the
+actual demand. The wiring stayed because removing it isn't free
+either, but the value the engine delivers in the current deployment
+is approximately zero. A future cleanup could plausibly remove
+Activiti entirely and consolidate on the document-chain pattern,
+trading away the *option* of BPMN-style flows for a smaller
+codebase and one fewer schema to maintain.
 ## What this page is *not*
@@ -148,21 +148,44 @@ several `Sp_SalesOrder_Kpi*` procs (matches the
 [per-customer SQL override channel](../../slices/05-customer-sql-override.md)
 — customers who want different KPI rules ship their own proc).
-## Why this matters
-
-xly's BI layer demonstrates the data-driven thesis at scale:
-
-1. **Adding a new dashboard card requires no Java change** — a PM
-   inserts a `gdsconfigcharmaster` row pointing at a `Sp_chart_*` proc,
-   sets `sCharType` and `iWidth`, the SPA picks it up on the next
-   `getModelBysId` cache miss.
-2. **Adding a new chart proc** does require a SQL author (the proc
-   has to follow the standard tenant-scoped shape so generic dispatch
-   can call it through `CharServiceImpl`).
-3. **No OLAP cube, no MDX, no semantic layer.** Each chart is a
-   purpose-built SQL stored procedure. This trades reusability for
-   simplicity — perfect-fit aggregations, no general-purpose ad-hoc
-   query builder.
+## Drawbacks of the homebrewed approach
+
+The metadata + per-chart-proc design is consistent with xly's data-
+driven thesis, and it avoids carrying a heavy OLAP engine. The costs:
+
+1. **Every new chart needs a SQL author.** "PM adds a metadata row"
+   is true *after* an engineer has written the matching `Sp_chart_*`
+   proc. There is no aggregation builder, no field-picker, no auto-
+   generated query — every metric is a hand-coded stored procedure
+   the engineering team has to write, review, and maintain. The
+   20-proc catalogue and 11 chart types are the **whole** set of
+   shapes the system can render today.
+2. **Charts run heavy SQL on the OLTP DB.** No warehouse, no
+   pre-aggregation, no incremental rollup. A "today's profit"
+   chart is a SELECT against the live transactional schema.
+   Heavy customers will see chart loads contend with order-entry
+   load on the same MySQL instance. Caching helps, but only on hit;
+   the first load after metadata change pays full cost.
+3. **No semantic consistency between charts.** Each `Sp_chart_*`
+   proc decides for itself how to compute "monthly profit", "today's
+   sales", etc. Two charts purporting to show the same metric can
+   silently disagree because they're separate proc bodies. A real
+   semantic layer would prevent that; the homebrewed model can't.
+4. **No drill-down, no slice-and-dice.** Each chart is a frozen
+   query shape. Users can't pivot on different dimensions or drill
+   from a summary card into the underlying transactions without an
+   engineer authoring a separate proc for each path.
+5. **Customer-divergent KPI logic.** Customers under
+   `script/客户/` ship their own `spKPImodule` and
+   `Sp_SalesOrder_Kpi*` overrides — different KPI math per
+   customer, in code that lives only on that customer's DB. This
+   makes "what does this KPI mean" depend on which schema the
+   reader is connected to.
+
+The simpler design is fine for "show me the same 20 cards xly has
+always shown". It is not fine if the goal is ad-hoc analytics or
+self-service reporting — those would require a separate semantic /
+warehouse layer that xly does not have.
 ## What this is *not*
@@ -117,14 +117,47 @@ against the DB does **not** trigger any cleaner. The cache will serve
 stale metadata until either:
 1. The cache TTL expires (check the cache config for the actual TTL).
-2. A bounce of the application servers (one node at a time if the
-   cache is local; once if shared).
+2. A bounce of the application servers (one bounce suffices since the
+   cache is Redis-backed and shared — see above).
 3. A manual call to one of the
    `BusinessCleanRedisDataImpl.delCleanRedisDataByTableName(<table>, …)`
-   methods is invoked from inside the application (e.g., via a
-   maintenance endpoint). Note this clears whatever the local
-   `CacheManager` is bound to; if that turns out to be in-memory,
-   the cleanup must run on every node.
+   methods is invoked from inside the application — once, on any
+   node, since it clears the shared Redis store.
+
+## Drawbacks of this design
+
+The synchronous `@CacheEvict`-during-save model is operationally
+simple and (with Redis backing) genuinely cross-node coherent. It is
+also fragile in ways worth naming:
+
+- **Two systems with confusingly similar names.** The JMS path
+  `CHANGE_GDS_MODULE` + `ConsumerChangeGdsModuleThread` *sounds*
+  like it should be cache invalidation but isn't. This page exists
+  partly because that conflation is a recurring source of bugs and
+  reader confusion. A renaming pass (proc and queue → e.g.
+  `MERGE_BASE_GDS_MODULE`) would help, but isn't free.
+- **Eviction runs inside the save, but is not atomic with it.** If
+  the Redis call fails mid-save, the row commits but the cache stays
+  stale. The framework does not detect or recover from this; a
+  Redis outage during save silently leaves the affected entries
+  stale until TTL expiry.
+- **Eviction is "all or nothing per cache region".** Most
+  `@CacheEvict` annotations on `CleanRedisServiceImpl` use
+  `allEntries=true`, which dumps the entire cache region rather
+  than the affected key. Heavy save throughput causes high
+  cache-miss rates immediately afterwards — fine for small
+  metadata caches, expensive when dropping a region with thousands
+  of entries.
+- **No invalidation budget / batching.** Bulk metadata changes
+  (e.g., editing 100 form fields) trigger 100 `@CacheEvict` fires,
+  each one round-tripping to Redis. There is no mechanism to
+  coalesce evictions into one batch.
+- **Direct DB writes bypass everything.** Any tooling that touches
+  the schema outside `BusinessBaseServiceImpl` — including database
+  admin scripts, `script/客户/` overrides applied via `mysql`
+  command line, and Channel-2 SQL replacements — leaves the cache
+  stale until manually invalidated. This is a real operational
+  hazard for the deployment pattern xly actually uses.
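The "no batching" drawback above can be illustrated with a minimal sketch of what coalescing would look like: collapse a batch of table writes into the distinct cache regions they touch, then evict each region once. The region names and the table-to-region mapping below are hypothetical, not taken from `CleanRedisServiceImpl`.

```python
def coalesce_evictions(changed_tables, table_to_regions):
    """Collapse a batch of table writes into the distinct cache regions
    that need evicting, preserving first-seen order."""
    regions = []
    seen = set()
    for table in changed_tables:
        # Tables with no mapping invalidate nothing (the "unmapped table" case).
        for region in table_to_regions.get(table, ()):
            if region not in seen:
                seen.add(region)
                regions.append(region)
    return regions


# Hypothetical mapping; the real one lives in the Java eviction service.
mapping = {
    "gdsconfigformcustomslave": ["formModel"],
    "gdsmodule": ["formModel", "moduleTree"],
}
# 100 field edits plus one module edit collapse to two region evictions.
batch = ["gdsconfigformcustomslave"] * 100 + ["gdsmodule"]
print(coalesce_evictions(batch, mapping))
```

With `allEntries=true` semantics, one eviction per region is equivalent to the 101 individual fires, so a coalescing step of this kind would only save Redis round trips, not change what gets dropped.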
 ## Common bug: the cache is the bug
@@ -135,10 +168,11 @@ old value", check (in this order):
 2. Did the change go through a path that invokes
    `BusinessCleanRedisData`? (Direct DB writes or controllers that
    bypass `BusinessBaseServiceImpl` won't.)
-3. Is the cache shared across nodes (Redis-backed) or local
-   (`ConcurrentMapCacheManager`)? Confirm by inspecting the active
-   `CacheManager` bean on a running node.
-4. If the cache is local, did every node get the eviction call?
+3. Was Redis reachable when the save committed? A failed eviction
+   does not roll back the save.
+4. Is the change in a cache region that's evicted by the table that
+   was written? `CleanRedisServiceImpl` maps writes to specific
+   regions; an unmapped table will not invalidate its readers.
 The five-key composite returned by
 [`getModelBysId` in Slice 1](../../slices/01-hello-world.md)
@@ -43,9 +43,38 @@ by name lets the framework call any proc the metadata names without a
 code change. The framework treats the proc as a black box: name in,
 parameters in, result out.
-The downside: the runtime cannot statically know which procs exist or
-what their effects are. A typo in `gdsmodule.sSaveProName` produces a
-runtime "proc not found" error, not a compile error.
+That convenience comes with substantial costs that are worth being
+explicit about:
+
+- **No compile-time check** on proc names. A typo in
+  `gdsmodule.sSaveProName` produces a runtime "proc not found"
+  error, not a compile error. Refactoring a proc name requires
+  hand-grepping the metadata; the IDE can't help.
+- **No type safety on parameters.** The framework binds parameters
+  positionally from a `Map<String, Object>`. A proc whose signature
+  changed but whose callers didn't is a runtime crash with no IDE
+  warning.
+- **No call-site discoverability.** "Which Java code calls
+  `Sp_SalSalesCheck`?" can't be answered by IDE find-usages because
+  no Java code does — `gdsmodule` rows do. Maintainers must search
+  *both* metadata tables *and* the SQL bodies of other procs that
+  may invoke this one.
+- **Effectively no static analysis.** Side effects of any given
+  proc are invisible to anyone who hasn't read the proc body. A
+  `Sp_SalSalesCheck` named in `gdsmodule.sProcName` could be a
+  read-only SELECT or could be doing INSERTs and UPDATEs across a
+  dozen tables; the framework treats them identically.
+- **Stack traces that stop at the boundary.** Errors raised inside
+  a proc surface in Java as a generic `BadSqlGrammarException` or
+  `MySQLSyntaxErrorException`. To get the real error you have to
+  enable MyBatis SQL logging and re-run.
+
+A more honest framing: hard-wiring 1000+ procs in Java would be
+painful, but most of that pain comes from xly *having* 1000+ procs
+in the first place. Dynamic dispatch made it cheap to keep adding
+them, which made the pile grow, which made the pile harder to
+audit. The mechanism is what it is; the *amount* of behaviour
+pushed into the SQL layer is the more interesting design question.
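Because proc names are only checked at runtime, the typo case described above is something an offline audit could catch. The following is a minimal sketch under stated assumptions: metadata rows are plain dicts keyed by an assumed `sId` column, and the list of existing procs has been fetched separately (for example from MySQL's `information_schema.ROUTINES`). It is not part of the framework.

```python
def find_dangling_proc_refs(metadata_rows, existing_procs,
                            columns=("sSaveProName", "sProcName")):
    """Return (row id, column, proc name) for every metadata reference
    that names a proc missing from existing_procs."""
    # MySQL stored-routine names are case-insensitive, so compare casefolded.
    known = {name.casefold() for name in existing_procs}
    dangling = []
    for row in metadata_rows:
        for column in columns:
            name = (row.get(column) or "").strip()
            if name and name.casefold() not in known:
                dangling.append((row["sId"], column, name))
    return dangling


# Hypothetical sample data standing in for gdsmodule rows and the proc list.
rows = [
    {"sId": "m1", "sSaveProName": "Sp_SalSalesCheck", "sProcName": None},
    {"sId": "m2", "sSaveProName": "Sp_SalSalesChek", "sProcName": ""},
]
procs = ["sp_salsalescheck"]
print(find_dangling_proc_refs(rows, procs))
```

A check like this covers only the first cost in the list (missing names). Parameter shape, side effects, and proc-to-proc calls would still need the proc bodies to be read.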
 ## The conventions procs follow
@@ -221,6 +221,38 @@ Two flagged in slices that belong here permanently:
    load entirely for `UserType.ADMIN`. ADMIN account governance must
    come from outside the app.
+## What "universal CRUD" means in practice
+
+The "one controller writes any row in any table" pattern is the
+core data-driven move. It also concentrates risk:
+
+- **`BusinessBaseServiceImpl` is ~3,500 lines** of tightly
+  intertwined logic: per-tenant scope-bypass list, special-case
+  table hardcodes (`mftproductionplanslave` at line 1768),
+  pre/post-save hook dispatch, sTable-driven write routing. Every
+  bug fix has to navigate the whole class.
+- **The class is the single point of failure for the entire
+  business runtime.** A regression in `addUpdateDelBusinessData`
+  breaks save for every form in every tenant simultaneously.
+  Module-specific controllers would localise the blast radius;
+  the universal one cannot.
+- **No type system on `Map<String, Object>`.** The frontend ships
+  a bag of (key, value) pairs. The runtime trusts the keys
+  match column names and the values cast to the column types.
+  Mismatches surface as `BadSqlGrammarException` at the DAO layer
+  — far from where the wrong value originated. There is no
+  schema-aware request validation.
+- **Discoverability is poor.** "What endpoints write to
+  `mftproductionplanslave`?" can't be answered by IDE find-usages
+  — the answer is "any controller that calls
+  `BusinessBaseServiceImpl.addBusinessData` with `sTable` set to
+  `mftproductionplanslave`", which is everything.
+
+The universal pattern is what makes the data-driven thesis work.
+It is also the reason adding a new module is essentially free
+*and* the reason that touching the runtime is essentially never
+free.
+
 ## Cache invalidation
 When BACK saves a metadata change, the save service synchronously
@@ -229,5 +261,5 @@ calls `BusinessCleanRedisData.delCleanRedisData*`, which fires
 A separate JMS path (`ConsumerChangeGdsModuleThread`) exists with a
 similar name but does base-data merging via stored proc, not cache
 invalidation. See [cache invalidation on metadata change](cache-invalidation.md)
-for the full story (including the open question about cross-node
-coherence).
+for the full story (the cache is confirmed Redis-backed, so
+cross-node coherence is no longer an open question).
@@ -61,20 +61,39 @@ the target schema.
   document family the proc operates on.
 - Other placeholders depending on the scaffold.
-## Why this is a "template" library and not a code generator
+## "Template" library, not a code generator — and what that costs
 The framework does **not** auto-generate procs from these templates
-based on metadata. The scaffolds exist because xly's procs follow a
-common conventional shape; copying the scaffold ensures the new proc:
+based on metadata. The scaffolds are convention-enforcing copy-paste
+starters, nothing more. They exist to nudge a new proc into the
+shape that [generic dispatch](proc-dispatch.md) can call:
-- Accepts the standard parameter list `(sGuid, sFormGuid, sLoginId, sBrId, sSuId)`
-  that [generic dispatch](proc-dispatch.md) can call.
-- Returns success/error via the standard `OUT sCode INT, OUT sReturn LONGTEXT`.
+- Standard parameter list `(sGuid, sFormGuid, sLoginId, sBrId, sSuId)`.
+- Returns success/error via `OUT sCode INT, OUT sReturn LONGTEXT`.
 - Honours the multi-tenant filter `sBrandsId = sBrId AND sSubsidiaryId = sSuId`.
-A proc that *doesn't* follow these conventions cannot be invoked
-through generic dispatch and would have to be called from custom Java
-code instead.
+Costs of staying at "template" instead of "generator":
+
+- **No enforcement.** A proc that drifts from the convention compiles
+  fine. The framework discovers the mismatch at runtime as a
+  `BadSqlGrammarException` or wrong-shaped result. There is no
+  pre-merge check.
+- **No regeneration.** When the convention itself changes (e.g., a
+  new standard `OUT` param), the existing procs do not update.
+  Engineers have to grep + rewrite, with no automation.
+- **No knowledge of which proc came from which template.** A proc in
+  the live DB doesn't record its origin scaffold; understanding what
+  was customised away requires diffing against the scaffold by hand.
+- **Customer overrides under `script/客户/` can — and do — diverge
+  from the scaffold shape.** This is reasonable per customer but
+  means the conventions are observed by social contract, not by
+  any mechanical check.
+
+A real code-generation pipeline (template + metadata → emitted SQL,
+checked in or applied at deploy time) would catch these. The
+trade xly made: less tooling to maintain, but discipline-rather-
+than-enforcement on proc shapes — visible in the 1,687 procs the
+schema currently carries, not all of which follow the conventions.
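The "no pre-merge check" gap could be narrowed with a purely textual lint over proc source files. This is a sketch, assuming procs are available as `.sql` text and that the three conventions listed above are written literally. The regexes, the sample proc, and its parameter types are illustrative assumptions, and the lint cannot prove a proc is correct.

```python
import re

# One pattern per convention named in the scaffold description.
REQUIRED_PATTERNS = {
    "OUT sCode INT": re.compile(r"\bOUT\s+sCode\s+INT\b", re.IGNORECASE),
    "OUT sReturn LONGTEXT": re.compile(
        r"\bOUT\s+sReturn\s+LONGTEXT\b", re.IGNORECASE),
    "tenant filter": re.compile(
        r"(?:\w+\.)?sBrandsId\s*=\s*sBrId\s+AND\s+"
        r"(?:\w+\.)?sSubsidiaryId\s*=\s*sSuId",
        re.IGNORECASE),
}


def lint_proc(sql_text):
    """Return the labels of the conventions the proc text does not match."""
    return [label for label, pattern in REQUIRED_PATTERNS.items()
            if not pattern.search(sql_text)]


# Hypothetical proc bodies; names and column types are made up.
GOOD = """
CREATE PROCEDURE Sp_Example(
    IN sGuid VARCHAR(50), IN sFormGuid VARCHAR(50), IN sLoginId VARCHAR(50),
    IN sBrId VARCHAR(50), IN sSuId VARCHAR(50),
    OUT sCode INT, OUT sReturn LONGTEXT)
BEGIN
    SELECT * FROM demo t
    WHERE t.sBrandsId = sBrId AND t.sSubsidiaryId = sSuId;
END
"""
BAD = """
CREATE PROCEDURE Sp_Drifted(IN sGuid VARCHAR(50), OUT sCode INT)
BEGIN
    SELECT * FROM demo;
END
"""
print(lint_proc(GOOD))
print(lint_proc(BAD))
```

A lint of this kind only flags files that are checked in. Overrides applied directly to a customer database would still escape it.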
 ## Two loaders
@@ -112,15 +112,36 @@ ignored at merge time. A maintainer audit script that flags such orphans
 is on the [Maintainer Reference](../reference/maintainer/runtime.md)'s
 TODO list.
-## Why it works without code changes
-
-The end-customer never asks an engineer for a new column. They open the
-BACK builder, add the row, the field appears in FROUNT for their tenant
-only. The system's other tenants are untouched. That single-codebase
-property is what xly's data-driven thesis ([Concepts → Thesis](../concepts/thesis.md))
-buys — at the cost of the runtime cost of merging metadata on every
-request, plus the schema bloat of three customization tables that most
-forms never use.
+## Why it works without code changes — and what that costs
+
+For the *display* side, the end-customer never asks an engineer for
+a new column. They open the BACK builder, add the row, and the field
+appears in FROUNT for their tenant only. The system's other tenants
+are untouched.
+
+The price for that property:
+
+- **The merge runs on every request** (not just on overlay-row
+  changes). Even tenants with zero `gdsconfigformcustomslave` rows
+  pay the runtime cost of checking: the framework can't tell upfront
+  whether a tenant has overrides, so the merge code path always runs.
+- **Three near-empty tables on every schema.** The three customization
+  tables exist whether the tenant uses them or not. In this dev DB
+  `gdsconfigformcustomslave` has 0 rows; the table is still indexed,
+  backed up, and queried.
+- **Display extension only.** The overlay can render an extra field;
+  it cannot store its value unless the underlying physical table
+  already has the column. So "no code change for a new field" is true
+  only for *display-only* fields. Real new persisted fields still
+  need a coordinated `ALTER TABLE` (Slice 5 territory) — which means
+  the wins from "no code change" don't apply to the cases that
+  actually move business value.
+- **Debuggability gets worse.** "Why does tenant A see this field
+  but tenant B doesn't?" requires diffing
+  `gdsconfigformcustomslave` + `gdsconfigformpersonalize` +
+  `gdsconfigformuserslave` rows for both tenants. The merge logic in
+  `BusinessBaseServiceImpl` is non-trivial; reproducing the exact
+  layout a user sees often means re-running the merge by hand.
 ## Concepts this slice introduces
@@ -90,8 +90,9 @@ framework doesn't know; the framework can't tell.
 This makes overrides:
-- **Powerful.** Anything you can write in MySQL stored-procedure SQL,
-  you can use to replace standard behaviour.
+- **Capable in the technical sense.** Anything you can write in MySQL
+  stored-procedure SQL can replace standard behaviour. (This isn't a
+  good thing per se — see drawbacks below.)
 - **Operationally fragile.** The override must be re-applied (or kept
   alive) whenever the customer's schema is rebuilt, restored, or
   migrated. It does not travel with backups of the codebase, only with
@@ -101,10 +102,25 @@ This makes overrides:
   the proc on the live DB is a different piece of code with the same
   name. Stack traces and "what does this proc do" depend on which
   schema you're connected to.
-
-The right rule of thumb: prefer Slice-4 metadata customization. Reach
-for Slice-5 SQL overrides only when the metadata model genuinely cannot
-express what the customer needs.
+- **No version control on the deployed body.** The `.sql` file in
+  `script/客户/` shows what *should* have been applied. There is no
+  audit trail confirming what *was* applied (or when, or by whom),
+  and no automated re-apply on schema rebuild.
+- **No type-safety bridge.** When the override changes a result-set
+  shape, every Java caller that reads from `Sp_SalSalesCheck` may
+  break for that one customer: loudly, with a `BadSqlGrammarException`,
+  or (worse) silently, with a wrong-shaped row that propagates as a wrong number.
+- **Compounds the BI problem.** Charts on customers with overridden
+  procs ([bi-engine.md](../reference/maintainer/bi-engine.md))
+  will silently disagree across tenants because the underlying data
+  is computed by different SQL.
+
+The "prefer Slice 4, reach for Slice 5 only as last resort" advice is
+correct in principle, but the existence of 18 customer directories
+suggests that in practice this channel has become the standard answer
+for material business-logic differences. That's a signal the metadata
+model isn't expressive enough for the actual customer-customisation
+demand the system encounters — not a celebration of the escape hatch.
 ## Worked-example: 重庆展印's `Sp_SalSalesCheck` vs the standard
@@ -83,25 +83,50 @@ other data:
 ## The framework / hardware boundary
-This is the cleanest story xly tells about an awkward problem:
+xly's response to the press-PLC problem is a strict separation:
 - **Above the line (xlyEntry, xlyApi, all the metadata machinery):
   generic framework.** No knowledge of presses, PLCs, byte protocols.
 - **Below the line (xlyPlc): hardware-specific.** Knows how to talk to a
   press.
-The two communicate only through the database. The bridge writes rows;
-the framework reads rows. There's no RPC, no shared in-process state,
-no callback. This makes xlyPlc:
-
-- Independently deployable (and several customers run it on a machine
-  next to the press, separate from the central ERP server).
-- Independently failable: if the bridge crashes, the framework keeps
-  running on stale machine-state data. If the framework is down, the
-  bridge keeps writing — when the framework comes back, it sees the
-  buffered rows.
-- Hard to test end-to-end without an actual press. Most CI tests stub
-  the PLC reads.
+The two communicate only through the database — the bridge writes rows,
+the framework reads rows. No RPC, no shared in-process state, no
+callback. The benefits:
+
+- Independently deployable; some customers run xlyPlc on a machine next
+  to the press, separate from the central ERP server.
+- Independently failable: if the bridge crashes the framework serves
+  stale machine-state data; if the framework is down the bridge keeps
+  writing and the framework picks up the buffered rows on recovery.
+
+The costs of "DB as the only contract" are real and worth naming:
+
+- **No backpressure.** If the bridge writes faster than xly can ingest
+  (or if a slow `mftProduceReportMachineState` index update piles up),
+  the bridge has no signal to slow down — it just blocks on the next
+  INSERT. There is no flow-control message between the two halves.
+- **No request/response semantics.** The framework cannot ask the
+  bridge "is the press alive right now?" — it can only read whatever
+  the bridge last wrote, which may be seconds-to-minutes old depending
+  on the cron cadence.
+- **Bridge-side state is invisible to the framework.** "Why is the
+  bridge not writing?" requires logging into the bridge host to read
+  its log; the framework UI shows only the absence of new rows.
+- **Polling at every hop.** xlyPlc polls the press; the framework
+  polls the DB; the SPA polls the framework. Worst-case latency from
+  "press state changes" to "user sees it" is therefore the sum of
+  the three polling intervals.
+- **Hard to test end-to-end without an actual press.** Most CI tests
+  stub the PLC reads, which means the bridge's most error-prone code
+  (byte protocol per press model) gets the least automated coverage.
+
+A real-time-aware architecture would use a streaming channel
+(MQTT / Kafka / WebSocket) end-to-end instead of cron + DB. xly's
+choice is operationally simpler but trades off latency, observability,
+and flow control. For the printing-press tempo (machine state changes
+every few seconds, reports every minute) the trade is liveable; for
+faster shop-floor signals it would not be.
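The latency cost of stacked polling can be put in numbers with a small worked example. The intervals below are hypothetical, not measured from any deployment, and the calculation ignores processing and network time.

```python
def worst_case_staleness(press_poll_s, db_poll_s, spa_poll_s):
    """Upper bound: each layer has just missed the change and waits
    one full interval before it polls again."""
    return press_poll_s + db_poll_s + spa_poll_s


def mean_staleness(press_poll_s, db_poll_s, spa_poll_s):
    """Expected delay if each poller's phase is independent and uniform:
    each layer adds half its interval on average."""
    return (press_poll_s + db_poll_s + spa_poll_s) / 2


# Hypothetical cadences: bridge every 5 s, framework job every 60 s,
# SPA refresh every 30 s.
print(worst_case_staleness(5, 60, 30))  # 95
print(mean_staleness(5, 60, 30))        # 47.5
```

When all three intervals are equal, the worst case reduces to three times the interval. The slowest poller dominates either figure, so shortening only the fast layers buys little.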
 ## Concepts this slice introduces