Commit ab4cf6c6c484a4b0d35edbd4966cd96b564fcc4b

Authored by zichun
1 parent ff2ee55a

docs: en wiki — strip apologetic framing; state design choices with their drawbacks

Editorial pass per user direction: stop justifying the architecture.
For every "why this design works" passage, name the costs the design
imposes — not as a parenthetical aside but as substantive critical
analysis. Each major architectural-claim page now carries an explicit
drawbacks/costs section.

Pages revised:

concepts/thesis.md
- "The reward" → "What the design enables (and what each enabler still costs)":
  for each promised benefit (single codebase, PMs evolve without
  engineering, customisations layered cleanly), name the limit. Added
  closing observation that data-driven design redistributes complexity
  to people and tools the framework can't compile-check.
- "When it breaks down": rewrote to call out that "bypassing the
  framework" via 18 customer dirs makes the data-driven thesis
  partial, not complete.

concepts/semantic-fk.md
- "Why xly disabled FKs": added critical analysis. Both reasons
  could be addressed surgically; the chosen "no FKs anywhere" is the
  trade for DB-enforced integrity, paid every day the system runs.

concepts/master-slave.md
- "Slave naming caveat": stop framing retention as wise pragmatism.
  The naming was a poor choice; preservation has a real ongoing cost.

concepts/modules-forms-vtables.md
- "Three nouns, one engine": the universal dispatch path concentrates
  3,500+ lines + edge cases + special-case hardcodes in one class.
  Naming the trade.

concepts/multi-tenancy.md
- "How the design scales" → "How the design scales — and where it
  doesn't": shared schema = shared contention; tenant-filter index
  discipline; no physical hard-delete; rigid (sBrandsId,
  sSubsidiaryId) tenancy unit.

concepts/customization-channels.md
- Soften "90%+ should live here" claim — that's an aspirational
  target, not a measured fact. The 18 customer override directories
  are evidence the channel-2 demand is non-trivial.

concepts/api-surface.md
- "Why three tiers, not one" → "Why three tiers (and what splitting
  them costs)": three WARs to deploy, duplicate code, no shared
  session, three reverse-proxy entries. Note the alternative
  (single-WAR with package boundaries) and what that would cost
  vs gain.

reference/maintainer/proc-dispatch.md
- "Why dynamic proc dispatch matters": added five concrete costs
  (no compile-time check, no type safety, no call-site discoverability,
  no static analysis, broken stack traces). Reframed: dynamic
  dispatch made it cheap to keep adding procs, which made the pile
  grow, which made the pile harder to audit.

reference/maintainer/cache-invalidation.md
- New "Drawbacks of this design" section: confusing co-named systems,
  eviction in same transaction as write (silent corruption on
  Redis outage), allEntries=true blunt eviction, no batching,
  direct DB writes bypass everything. Also fixed the "if cache is
  local" hedge in section 3 (we've now empirically confirmed Redis-
  backed, so cache is shared).

reference/maintainer/bi-engine.md
- New "Drawbacks of the homebrewed approach" section: every chart
  needs a SQL author, charts run heavy SQL on OLTP DB, no semantic
  consistency between charts, no drill-down, customer-divergent KPI
  logic. Also dedup'd the duplicated "What this is not" section.

reference/maintainer/sql-templates.md
- "Why this is a 'template' library and not a code generator" →
  added costs: no enforcement, no regeneration, no template-origin
  tracking, customer overrides drift from scaffold. The 1,687 procs
  the schema carries are the evidence that "discipline rather than
  enforcement" doesn't fully hold.

reference/maintainer/activiti.md
- "Why this design works for xly's audience" → "Why xly avoided
  Activiti — and what that costs": scattered workflow logic, no
  central audit trail, no parallel-branch/reassignment, invisible
  flow-graph evolution, idle Activiti engine paying boot cost
  anyway.
- "Why xly bothered with Activiti at all" → "Why xly bothered with
  Activiti — and whether it was worth it": named the costs (second
  engine, second schema, second auth surface, modeler UI to learn)
  and the damning fact that on this dev DB the engine is idle. A
  future cleanup could plausibly remove Activiti entirely.

reference/maintainer/runtime.md
- New "What 'universal CRUD' means in practice" section: 3,500-line
  single-point-of-failure class, no type system on Map<String,Object>,
  poor discoverability ("what endpoints write to table X" is
  unanswerable). The trade: adding a module is essentially free,
  touching the runtime essentially never is.
- Updated cache-invalidation cross-link to drop the "open question"
  hedge (now empirically resolved).

slices/04-custom-field.md
- "Why it works without code changes" → "Why it works without code
  changes — and what that costs": merge runs on every request,
  three near-empty tables on every schema, display-only extension
  (real persisted fields still need ALTER TABLE), debuggability
  requires diffing 3 overlay tables.

slices/05-customer-sql-override.md
- Added drawbacks: no version control on the deployed body, no
  type-safety bridge, compounds the BI problem. Reframed the
  "right rule of thumb": 18 customer override directories suggest
  the channel-2 demand is structural, not exceptional — that's
  evidence the metadata model isn't expressive enough, not a
  celebration of the escape hatch.

slices/06-hardware.md
- "The cleanest story xly tells about an awkward problem" →
  removed the "cleanest" framing. Added costs of "DB as the only
  contract": no backpressure, no request/response, bridge-side
  state invisible to the framework, three layers of polling
  multiply latency, hardest code (byte protocols) gets least CI.
  A real-time-aware architecture would use streaming end-to-end;
  xly's choice trades latency, observability, flow control for
  operational simplicity. Liveable for press tempo, not for
  faster shop-floor signals.
en/docs/concepts/api-surface.md
@@ -17,22 +17,51 @@ database is what makes their separation work — internal-API writes show
17 up to external-API reads automatically because both run against the 17 up to external-API reads automatically because both run against the
18 same schema. 18 same schema.
19 19
20 -## Why three tiers, not one 20 +## Why three tiers (and what splitting them costs)
21 21
22 -Each tier answers a different question, and bundling them would  
23 -sacrifice clarity: 22 +Each tier was originally split off to answer a different question:
24 23
25 - **Internal** is large (universal CRUD over all metadata-driven 24 - **Internal** is large (universal CRUD over all metadata-driven
26 modules), volatile (changes when the framework changes), and 25 modules), volatile (changes when the framework changes), and
27 intentionally untyped (the SPA decides what to ask for, server obeys). 26 intentionally untyped (the SPA decides what to ask for, server obeys).
28 - **External** is curated (only the endpoints integrators are allowed 27 - **External** is curated (only the endpoints integrators are allowed
29 to use), versioned by `sApiCode`, and authenticated with bearer 28 to use), versioned by `sApiCode`, and authenticated with bearer
30 - tokens — it survives across framework changes precisely because it's  
31 - small and explicit. 29 + tokens.
32 - **Inbound webhooks** receive untrusted bodies from third-party 30 - **Inbound webhooks** receive untrusted bodies from third-party
33 systems and route them to xly handlers. The Swagger UI lives here 31 systems and route them to xly handlers. The Swagger UI lives here
34 because that audience benefits most from interactive documentation. 32 because that audience benefits most from interactive documentation.
35 33
  34 +The split has real costs that the wiki should not gloss over:
  35 +
  36 +- **Three WARs to deploy, monitor, and version-pin.** A new release
  37 + has to ship coordinated builds of `xlyEntry`, `xlyApi`, and
  38 + `xlyInterface`. Mismatches (e.g., a schema change in `xlyEntry`
  39 + that `xlyApi` hasn't picked up) are silent until the call path
  40 + hits them.
  41 +- **Duplicate code.** `RequestAddParamUtil` exists in both
  42 + `xlyPersist` (for `xlyEntry`) and `xlyApi` (near-identical 56-vs-57
  43 + line copy). `InterfaceController` exists in both `xlyApi` and
  44 + `xlyInterface` with overlapping `/interfaceDefine/callthirdparty/*`
  45 + endpoints. Keeping the two halves in sync is operational discipline,
  46 + not a compile-time guarantee.
  47 +- **No shared session.** A user authenticated in BACK has no
  48 + session in `xlyApi` — external callers fetch a separate bearer
  49 + token. This is correct for *external* integrators but means
  50 + internal cross-WAR calls (rare in practice, common in temptation)
  51 + have to go through the public token flow.
  52 +- **Three context-paths means three reverse-proxy entries.**
  53 + The mapping from `BACK=:8597` and `FROUNT=:8598` to the actual
  54 + WARs lives in nginx config that isn't in this repo. Misconfigured
  55 + proxies are a common failure mode the codebase can't catch.
  56 +
  57 +Could the split have been a single deployable with internal package
  58 +boundaries? Yes — Spring Boot supports it. The benefit of that
  59 +alternative would be: one build, one set of dependencies, one
  60 +session story, no duplicate utility classes. The cost: harder to
  61 +scale tiers independently, harder to rate-limit external callers
  62 +without affecting the SPA. xly chose the deployment-time isolation;
  63 +the wiki's job is to acknowledge what that choice traded away.
  64 +
36 ## What each tier looks like at runtime 65 ## What each tier looks like at runtime
37 66
38 - **Internal** — see [the five-key read](../reference/maintainer/runtime.md#the-five-key-read). One 67 - **Internal** — see [the five-key read](../reference/maintainer/runtime.md#the-five-key-read). One
en/docs/concepts/customization-channels.md
@@ -24,8 +24,12 @@ They are visible in the BACK UI so a PM can audit them. The framework's
24 runtime reads them on every request (with caching). The Java code is 24 runtime reads them on every request (with caching). The Java code is
25 unchanged; the application's behaviour is what those rows say it is. 25 unchanged; the application's behaviour is what those rows say it is.
26 26
27 -This is the default path. **90%+ of customer customizations should live  
28 -here.** 27 +This is the path the architecture intends customers to use. Whether
  28 +the actual ratio is 90/10 in favour of Channel 1 isn't measured
  29 +anywhere; the empirical signal is that 18 customer directories
  30 +under `script/客户/` exist, which is a non-trivial slice of the
  31 +customer base needing what Channel 1 can't express. Take "90%+
  32 +should live here" as an aspirational target, not a measured fact.
29 33
30 ## Channel 2 — Per-customer SQL overrides 34 ## Channel 2 — Per-customer SQL overrides
31 35
en/docs/concepts/master-slave.md
@@ -64,13 +64,16 @@ appears verbatim in 14k+ table and column names.
64 ## "Slave" — naming caveat 64 ## "Slave" — naming caveat
65 65
66 The term carries connotations in English that are absent from the 66 The term carries connotations in English that are absent from the
67 -Chinese 主表 / 从表. The wiki preserves "slave" because: 67 +Chinese 主表 / 从表. The wiki preserves "slave" verbatim because the
  68 +codebase, schema, and auto-catalog use it in 14k+ identifiers and any
  69 +translation would diverge from what developers actually grep.
68 70
69 -1. Renaming would break every cross-reference into the codebase, the  
70 - schema, and the auto-catalog (14k+ identifiers).  
71 -2. Mapping every occurrence to "detail" or "child" would distort  
72 - searchability and produce wiki text that diverges from what  
73 - developers actually grep.  
74 -  
75 -Future xly versions may rebrand to "detail" / "header"; until then, the  
76 -wiki uses the in-codebase term verbatim and notes it once here. 71 +That preservation has a cost. The naming was a poor choice in the
  72 +first place — `主表 / 从表` translates straightforwardly as
  73 +`master / detail` or `header / line`, both of which would have
  74 +matched both English convention and the actual relational
  75 +semantics. The cost of retaining "slave" is borne by every English-
  76 +speaking maintainer who has to type or read the term, and by any
  77 +future rebrand effort that has to do the schema-wide rename xly
  78 +should have done at the start. The wiki documenting it once here
  79 +doesn't remove the cost; it just acknowledges it.
en/docs/concepts/modules-forms-vtables.md
@@ -87,10 +87,21 @@ sub-tabs.
87 ## Three nouns, one engine 87 ## Three nouns, one engine
88 88
89 The runtime — `BusinessBaseController` and `BusinessBaseServiceImpl`, 89 The runtime — `BusinessBaseController` and `BusinessBaseServiceImpl`,
90 -documented in [Slice 1](../slices/01-hello-world.md) — knows how to  
91 -render any module / form / virtual-table combination. There is no  
92 -per-module Java code. PMs creating new modules are creating new rows;  
93 -they are not creating new code paths. 90 +documented in [Slice 1](../slices/01-hello-world.md) — handles every
  91 +module / form / virtual-table combination through one universal
  92 +dispatch path. There is no per-module Java; PMs creating new modules
  93 +are creating new rows.
  94 +
  95 +The flip side: that one engine has accumulated 3,500+ lines in
  96 +`BusinessBaseServiceImpl` alone, plus another 800+ in
  97 +`BusinessGdsconfigformsServiceImpl`. Edge cases, special-case
  98 +table handling (e.g., the `mftproductionplanslave` hardcode at
  99 +`BusinessBaseServiceImpl.java:1768`), per-tenant overlay merge
  100 +logic, and the multi-tenant scope-bypass list all live in this
  101 +single class. Adding a new feature that the universal dispatch
  102 +doesn't handle means either expanding this class or writing
  103 +custom code that bypasses it — both of which erode the "one
  104 +engine handles everything" property the design promised.
94 105
95 ## Business-data table prefixes 106 ## Business-data table prefixes
96 107
en/docs/concepts/multi-tenancy.md
@@ -69,7 +69,7 @@ tenant. That's the catastrophic data-leak case. Three places to watch:
69 doesn't validate the supplied table against the form's authorised 69 doesn't validate the supplied table against the form's authorised
70 tables, this is a privilege-escalation surface. See [Slice 1](../slices/01-hello-world.md#4-user-edits-a-row-clicks-save). 70 tables, this is a privilege-escalation surface. See [Slice 1](../slices/01-hello-world.md#4-user-edits-a-row-clicks-save).
71 71
72 -## How the design scales 72 +## How the design scales — and where it doesn't
73 73
74 The framework's multi-tenancy design scales by **row count**, not by 74 The framework's multi-tenancy design scales by **row count**, not by
75 code. A small SaaS deployment with one brand and one subsidiary uses 75 code. A small SaaS deployment with one brand and one subsidiary uses
@@ -77,3 +77,31 @@ exactly the same Java, MyBatis mappers, and stored procedures as a
77 deployment with dozens of brands × dozens of subsidiaries × several 77 deployment with dozens of brands × dozens of subsidiaries × several
78 editions; only the row distributions in `gdsmodule`, `sisversionflow`, 78 editions; only the row distributions in `gdsmodule`, `sisversionflow`,
79 and the business-data tables differ. 79 and the business-data tables differ.
  80 +
  81 +Scaling by row count is operationally simple but has limits the
  82 +wiki should not paper over:
  83 +
  84 +- **Shared physical schema means shared resource contention.**
  85 + Every tenant's queries hit the same MySQL instance, same tables,
  86 + same indexes. A heavy report on tenant A's data competes for
  87 + buffer-pool space and CPU with tenant B's order entry. There is
  88 + no per-tenant resource isolation.
  89 +- **Tenant filters in every WHERE clause.** Every read query
  90 + carries `sBrandsId = ? AND sSubsidiaryId = ?`. Indexes have to
  91 + lead with these columns to be useful — and almost all xly tables
  92 + do, by convention, but a maintainer adding a new index has to
  93 + remember. Forgetting produces a query plan that scans across all
  94 + tenants' rows and silently slows down once the table gets large.
  95 +- **No physical hard-delete boundary.** A tenant offboarding does
  96 + not drop a database; it leaves the rows where they are
  97 + (sometimes marked `bInvalid`, sometimes deleted, sometimes
  98 + untouched). Permanent removal requires a custom cleanup script
  99 + per tenant. From a GDPR / data-residency angle, "this tenant
  100 + is gone" is hard to prove.
  101 +- **`sBrandsId` / `sSubsidiaryId` everywhere is an inflexible
  102 + tenancy unit.** "Tenant" means exactly the `(sBrandsId,
  103 + sSubsidiaryId)` tuple. Alternate cuts (e.g., per-region access,
  104 + per-department access without sub-tenanting) don't fit the model
  105 + and would require parallel scoping columns. The model assumed
  106 + this shape would always be right for every customer; in
  107 + practice, it has been, but it's a hard commitment.
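
The index-discipline point above can be illustrated with a small sketch. It uses SQLite through Python's `sqlite3` module as a stand-in for MySQL, so the plan text differs from what `EXPLAIN` would print on the real schema. The table `salorder_demo` and both index names are hypothetical; only the `sBrandsId` / `sSubsidiaryId` columns come from the text.

```python
import sqlite3


def plan_for(index_ddl: str) -> str:
    """Create a throwaway table with one index and return the query plan
    for a tenant-scoped read (SQLite stand-in, not the real xly schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE salorder_demo ("  # hypothetical table name
        " sId TEXT PRIMARY KEY, sBrandsId TEXT, sSubsidiaryId TEXT,"
        " sOrderNo TEXT, dAmount REAL)"
    )
    conn.execute(index_ddl)
    rows = conn.execute(
        "EXPLAIN QUERY PLAN SELECT * FROM salorder_demo"
        " WHERE sBrandsId = ? AND sSubsidiaryId = ?",
        ("b1", "s1"),
    ).fetchall()
    conn.close()
    # The human-readable plan step is the fourth column of each row.
    return " | ".join(row[3] for row in rows)


# Index that leads with the tenant columns: the filter can seek into it.
tenant_led = plan_for(
    "CREATE INDEX idx_tenant_order"
    " ON salorder_demo (sBrandsId, sSubsidiaryId, sOrderNo)"
)

# Index that forgets the tenant columns: the same filter scans every tenant's rows.
not_tenant_led = plan_for(
    "CREATE INDEX idx_order_only ON salorder_demo (sOrderNo)"
)

print("tenant-led index:    ", tenant_led)
print("non-tenant-led index:", not_tenant_led)
```

Both queries return the same rows, which is why the second case fails silently: nothing breaks, the plan just degrades to a full scan as the table grows.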
en/docs/concepts/semantic-fk.md
@@ -13,7 +13,7 @@ before reading any further.
13 13
14 ## Why xly disabled FKs 14 ## Why xly disabled FKs
15 15
16 -Two reasons given by the architecture, both pragmatic: 16 +Two reasons the architecture gives:
17 17
18 1. **Bulk-write performance.** Mass inserts (work-order calculation, 18 1. **Bulk-write performance.** Mass inserts (work-order calculation,
19 month-end closures, batch imports) write hundreds of thousands of 19 month-end closures, batch imports) write hundreds of thousands of
@@ -23,8 +23,28 @@ Two reasons given by the architecture, both pragmatic:
23 2. **Schema-migration agility.** xly evolves quickly: new modules, 23 2. **Schema-migration agility.** xly evolves quickly: new modules,
24 new fields, new tables. With FKs, every schema change has to consider 24 new fields, new tables. With FKs, every schema change has to consider
25 the constraint graph; without them, a `CREATE TABLE` or `ALTER TABLE` 25 the constraint graph; without them, a `CREATE TABLE` or `ALTER TABLE`
26 - is a local operation. The cost of that agility is borne at runtime  
27 - by the application code. 26 + is a local operation.
  27 +
  28 +Both are real considerations, but neither is a slam-dunk argument for
  29 +"zero FKs across the entire schema":
  30 +
  31 +- **Bulk-write performance** can be addressed surgically: disable
  32 + constraints during the batch (`SET FOREIGN_KEY_CHECKS = 0`),
  33 + re-enable after, validate. xly's choice was instead to not have
  34 + FKs at all, which means *every* read also pays the cost of trusting
  35 + ad-hoc proc validation rather than DB-enforced integrity.
  36 +- **Schema-migration agility** is improved by no-FKs, but at the
  37 + price of moving every referential check into application code (or
  38 + forgetting it). In practice this means the integrity work an FK
  39 + would do automatically is now duplicated across hundreds of stored
  40 + procedures, with no compile-time guarantee any given proc actually
  41 + does the check (see Failure modes below).
  42 +
  43 +A more honest framing: the system traded **DB-enforced integrity**
  44 +for **operational convenience at write time and DDL time**. The
  45 +bug surface that trade introduced (orphan rows, cross-tenant
  46 +references that go undetected, integrity bugs surfacing weeks
  47 +later) is the cost paid every day the system runs.
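
The "disable, load, re-enable, validate" alternative mentioned above can be sketched as follows. This is an illustration on SQLite via Python's `sqlite3`, not xly's MySQL schema: `PRAGMA foreign_keys` plays the role of `SET FOREIGN_KEY_CHECKS`, and the `demo_master` / `demo_slave` tables are hypothetical. The explicit validation step matters on MySQL as well, because turning `FOREIGN_KEY_CHECKS` back on does not re-scan rows written while it was off.

```python
import sqlite3

# Autocommit mode: SQLite ignores PRAGMA foreign_keys inside an open transaction.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("PRAGMA foreign_keys = ON")

# Hypothetical master/slave pair with a declared FK.
conn.execute("CREATE TABLE demo_master (sId TEXT PRIMARY KEY)")
conn.execute(
    "CREATE TABLE demo_slave ("
    " sId TEXT PRIMARY KEY,"
    " sParentId TEXT REFERENCES demo_master(sId))"
)
conn.execute("INSERT INTO demo_master VALUES ('m1')")

# 1. Switch enforcement off for the bulk load only.
conn.execute("PRAGMA foreign_keys = OFF")
conn.executemany(
    "INSERT INTO demo_slave VALUES (?, ?)",
    [("s1", "m1"), ("s2", "missing-parent")],  # second row is an orphan
)

# 2. Switch enforcement back on for normal traffic.
conn.execute("PRAGMA foreign_keys = ON")

# 3. Validate what the batch wrote: the engine reports the violating rows.
violations = conn.execute("PRAGMA foreign_key_check").fetchall()

# With no FK declared at all, the same check has to be hand-written
# per relationship, for example as an anti-join:
orphans = [
    row[0]
    for row in conn.execute(
        "SELECT s.sId FROM demo_slave s"
        " LEFT JOIN demo_master m ON m.sId = s.sParentId"
        " WHERE m.sId IS NULL"
    )
]

print(violations)
print(orphans)
```

The anti-join at the end is the shape of check that a schema without FKs needs for every relationship, written and remembered by hand.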
28 48
29 ## What a "semantic FK" is 49 ## What a "semantic FK" is
30 50
en/docs/concepts/thesis.md
@@ -43,28 +43,61 @@ Three costs are baked into this design and worth being explicit about:
43 similar joins) is a [semantic FK](semantic-fk.md). Orphan rows are 43 similar joins) is a [semantic FK](semantic-fk.md). Orphan rows are
44 possible. 44 possible.
45 45
46 -## The reward  
47 -  
48 -In exchange xly gets: 46 +## What the design enables (and what each enabler still costs)
49 47
50 - **One codebase serves dozens of customers.** Each customer's tenant 48 - **One codebase serves dozens of customers.** Each customer's tenant
51 - has its own metadata rows; the Java is identical. 49 + has its own metadata rows; the Java is identical. — *Limit:* it
  50 + *doesn't* serve all customers. The 18 directories under
  51 + `script/客户/` (see [Slice 5](../slices/05-customer-sql-override.md))
  52 + are the wall the data-driven design hits — when a customer needs
  53 + different procedural logic, "single codebase" stops being true and
  54 + becomes "single Java codebase + a fan-out of customer-specific SQL
  55 + the database carries silently".
52 - **PMs evolve the application without engineering time.** They open 56 - **PMs evolve the application without engineering time.** They open
53 BACK, add a module, define a form, set permissions, and the next user 57 BACK, add a module, define a form, set permissions, and the next user
54 - load shows the change.  
55 -- **Customizations are layered cleanly** ([Slice 4](../slices/04-custom-field.md)): 58 + load shows the change. — *Limit:* the PM's effective vocabulary is
  59 + whatever `gdsconfigformmaster` / `gdsconfigformslave` columns
  60 + expose. Anything genuinely new (a custom calculation, a non-standard
  61 + validation, a different save path) requires a stored procedure —
  62 + which takes engineering time again, just in SQL instead of Java. And
  63 + PMs without DB access can't reason about why their metadata change
  64 + produced wrong output, because the procedural side is invisible from
  65 + BACK.
  66 +- **Customizations are layered "cleanly"** ([Slice 4](../slices/04-custom-field.md)):
56 per-tenant overrides sit *on top of* the shared base without forking. 67 per-tenant overrides sit *on top of* the shared base without forking.
  68 + — *Limit:* the cleanliness is a Java-side property. The runtime
  69 + merge logic in `BusinessBaseServiceImpl` is non-trivial (3,500+
  70 + lines), debugging "why does this tenant see field X but not Y"
  71 + involves chasing through `gdsconfigformpersonalize` +
  72 + `gdsconfigformcustomslave` + `gdsconfigformuserslave` interactions.
  73 + And the overlay model can't `ALTER TABLE` — adding a real new
  74 + column still needs a coordinated schema migration.
  75 +
  76 +A more candid reading: the data-driven design **shifts complexity
  77 +out of Java and into the database and the PM-built metadata**. The
  78 +total complexity isn't lower; it's redistributed to people and tools
  79 +the framework can't compile-check.
57 80
58 ## When it breaks down 81 ## When it breaks down
59 82
60 Data-driven works until a customer needs behaviour that can't be expressed 83 Data-driven works until a customer needs behaviour that can't be expressed
61 as metadata — different SQL, different procedure body, an aggregation rule 84 as metadata — different SQL, different procedure body, an aggregation rule
62 -that doesn't fit the framework's vocabulary. xly's escape hatch for that  
63 -case is the [per-customer SQL override channel](../slices/05-customer-sql-override.md): 85 +that doesn't fit the framework's vocabulary. xly's response is the
  86 +[per-customer SQL override channel](../slices/05-customer-sql-override.md):
64 hand-written SQL committed to `script/客户/<customer>/` and applied 87 hand-written SQL committed to `script/客户/<customer>/` and applied
65 directly to that customer's schema, bypassing the framework entirely. 88 directly to that customer's schema, bypassing the framework entirely.
66 -That channel is real and used. It is also the most expensive form of  
67 -customization to maintain. 89 +
  90 +It's worth being blunt about what this means. "Bypassing the framework"
  91 +makes the entire data-driven thesis a *partial* property of the system.
  92 +For the 18 customers under `script/客户/` the runtime is **no longer
  93 +single-codebase** — the Java is shared but the actual proc bodies
  94 +running on each customer's DB diverge, with no automated way to
  95 +detect drift. A reviewer reading `Sp_SalSalesCheck` in source has no
  96 +guarantee it's what runs in production for any given customer. The
  97 +"escape hatch" framing is generous; in practice the override channel
  98 +has become the standard answer for material business-logic
  99 +differences, which is the failure mode the data-driven design was
  100 +supposed to prevent.
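
One way such drift could be surfaced is to fingerprint each procedure body in source and compare it with what a customer's database reports. The sketch below is hypothetical tooling, not something the repo is known to contain: the function names, the whitespace-normalisation rule, and the sample bodies are all assumptions. In a real check the deployed bodies would have to be read from the database (for example via MySQL's `SHOW CREATE PROCEDURE`), which is left out here.

```python
import hashlib
import re


def fingerprint(proc_body: str) -> str:
    """Hash a procedure body after collapsing whitespace, so that
    re-indentation alone does not count as drift (assumed rule)."""
    normalised = re.sub(r"\s+", " ", proc_body).strip()
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()


def find_drift(source: dict[str, str], deployed: dict[str, str]) -> dict[str, str]:
    """Compare proc bodies by name and describe every mismatch."""
    report: dict[str, str] = {}
    for name in sorted(set(source) | set(deployed)):
        if name not in deployed:
            report[name] = "missing in deployment"
        elif name not in source:
            report[name] = "not in source"
        elif fingerprint(source[name]) != fingerprint(deployed[name]):
            report[name] = "body differs"
    return report


# Made-up bodies for illustration; only the proc name comes from the text.
source_procs = {
    "Sp_SalSalesCheck": "BEGIN\n  UPDATE t SET bCheck = 1;\nEND",
    "Sp_Demo": "BEGIN END",
}
deployed_procs = {
    "Sp_SalSalesCheck": "BEGIN UPDATE t SET bCheck = 1; UPDATE t SET x = 2; END",
    "Sp_Demo": "BEGIN   END",          # whitespace-only difference, not drift
    "Sp_CustomerOnly": "BEGIN END",    # exists only on the customer's DB
}

drift = find_drift(source_procs, deployed_procs)
print(drift)
```

Collapsing whitespace also alters whitespace inside string literals, so this can miss a literal-only change. A stricter comparison would need a SQL-aware tokenizer.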
68 101
69 ## What this means for reading the wiki 102 ## What this means for reading the wiki
70 103
en/docs/reference/maintainer/activiti.md
@@ -169,26 +169,47 @@ emit audit entries via a custom `sp_add_flow_log`. This is the
169 empirically-observed customisation channel — Activiti deployment 169 empirically-observed customisation channel — Activiti deployment
170 is not seen in any `script/客户/` directory. 170 is not seen in any `script/客户/` directory.
171 171
172 -### Why this design works for xly's audience 172 +### Why xly avoided Activiti — and what that costs
173 173
174 The printing-industry ERP customers run rule-driven business 174 The printing-industry ERP customers run rule-driven business
175 processes (quote → order → production → delivery → invoice → payment) 175 processes (quote → order → production → delivery → invoice → payment)
176 -where each step is **its own document with its own form** by  
177 -convention. A user expects "Now I open the next form and fill it in"  
178 -rather than "the system tells me a task is waiting for me." For  
179 -that audience:  
180 -  
181 -- Path 1 + Path 2 cover every observed scenario in this dev DB.  
182 -- Path 3's value (BPMN modeling, reassignment, parallel gateways) is  
183 - reserved for the rare tenant whose approval graph genuinely needs  
184 - it.  
185 -  
186 -The trade-off: workflow logic is **scattered across stored procedures**  
187 -rather than declarable in one place. Adding a new step to a flow  
188 -means writing or editing one or more procs, not editing a BPMN  
189 -diagram. For complex, frequently-changing flows, this is brittle.  
190 -For the printing-shop reality (quote-to-cash chain that doesn't  
191 -change much per customer), it's pragmatic. 176 +where each step is conventionally its own document with its own form.
  177 +The audience-fit argument: a user expects "Now I open the next form
  178 +and fill it in" rather than "the system tells me a task is waiting
  179 +for me," so Path 1 + Path 2 cover every observed scenario in this
  180 +dev DB, and Path 3 is held in reserve.
  181 +
  182 +The costs of going proc-based instead of BPMN-based:
  183 +
  184 +- **Workflow logic is scattered across stored procedures, not
  185 + declarable in one place.** Adding a step to "what happens after a
  186 + quote is approved" means writing or editing one or more `Sp_*` procs,
  187 + re-grepping every other proc that references the affected document,
  188 + and hoping nothing was missed. A BPMN engine would have one diagram
  189 + to look at.
  190 +- **No central audit trail of who approved what when.** `bCheck = 1`
  191 + records that *some* approval happened, plus who approved it via the
  192 + `sCheckPerson` column — but the *path the document took* (which
  193 + steps, in which order, with what comments) lives only in proc-side
  194 + status flags, not in a queryable workflow history.
  195 +- **No parallel-branch or reassignment semantics.** Path 1 + 2 cover
  196 + linear single-approver flows. The first time a customer needs
  197 + "two people must approve in parallel", or "if person A is on
  198 + vacation, route to person B", the system has to either fall back
  199 + to Path 3 (Activiti, currently disabled) or hand-code the routing
  200 + in stored procs.
  201 +- **Flow-graph evolution is invisible.** Changing the steps of a
  202 + workflow means editing procs and document chains. There is no
  203 + diff that says "the order-approval flow changed from N steps to
  204 + N+1 steps on date X" — only commit history of individual procs.
  205 +- **The Activiti engine is on the classpath and booted at runtime
  206 + for nothing.** Memory + JAR + schema (24 `act_*` base tables + 3
  207 + identity views) are paid for in every deployment whether they're
  208 + used or not.
  209 +
  210 +For the printing-shop reality the trade has been viable. It would
  211 +not scale to a domain with frequently-changing approval flows or
  212 +strict audit requirements.
192 213
193 ## Activiti is wired — engine ON 214 ## Activiti is wired — engine ON
194 215
@@ -320,21 +341,46 @@ For a flow to actually run, in roughly this order:
320 transitions; downstream queries that filter on `bCheck = 1` start 341 transitions; downstream queries that filter on `bCheck = 1` start
321 seeing it. 342 seeing it.
322 343
323 -## Why xly bothered with Activiti at all 344 +## Why xly bothered with Activiti — and whether it was worth it
324 345
325 The codebase has its own `biz_flow` / `biz_todo_item` tables that 346 The codebase has its own `biz_flow` / `biz_todo_item` tables that
326 -*could* implement a hand-rolled approval system. The decision to put  
327 -Activiti behind them buys: 347 +*could* implement a hand-rolled approval system. The arguments for
  348 +putting Activiti behind them:
328 349
329 - Standard BPMN modeling (the JS modeler pulls the same stencilset as 350 - Standard BPMN modeling (the JS modeler pulls the same stencilset as
330 Activiti Explorer). 351 Activiti Explorer).
331 -- Free state-machine semantics — the engine handles "task A done →  
332 - task B available" without xly maintaining the FSM in SQL. 352 +- Engine-managed state-machine semantics — "task A done → task B
  353 + available" without xly maintaining the FSM in SQL.
333 - Diagram rendering (the page-as-PNG in `ProcessActController`). 354 - Diagram rendering (the page-as-PNG in `ProcessActController`).
334 355
335 -The cost: a second engine running in the JVM, a second DB schema with  
336 -its own DDL drift, a second authentication surface (which xly papers  
337 -over via the `act_id_*` views). 356 +The costs are not minor:
  357 +
  358 +- A second engine running in the JVM, with its own startup cost,
  359 + memory footprint, and operational surface.
  360 +- A second DB schema (24 `act_*` tables + 3 identity views) that
  361 + diverges from xly's `gds*`/`biz*` conventions and needs its own
  362 + DDL migrations across Activiti versions (and indeed: see the
  363 + 5.17 vs 6.0 version skew elsewhere on this page).
  364 +- A second authentication surface that xly papers over via the
  365 + `act_id_*` views projecting xly's own users into Activiti's shape
  366 + — a hack that works but creates two-way coupling between user-
  367 + table changes and Activiti correctness.
  368 +- A modeler UI (Angular 1.x era) that maintainers have to learn
  369 + separately from BACK.
  370 +- And — the most damning cost — **on this dev DB the engine is
  371 + idle**. The `act_re_procdef` and `biz_flow` tables are empty, and
  372 + Path 1 / Path 2 handle every observed workflow scenario. The
  373 + Activiti dependency is paid for at every startup whether it's
  374 + exercised or not.
  375 +
  376 +A more honest framing: Activiti was bet on as the "real" workflow
  377 +solution; in practice the simpler proc-driven paths covered the
  378 +actual demand. The wiring stayed because removing it isn't free
  379 +either, but the value the engine delivers in the current deployment
  380 +is approximately zero. A future cleanup could plausibly remove
  381 +Activiti entirely and consolidate on the document-chain pattern,
  382 +trading away the *option* of BPMN-style flows for a smaller
  383 +codebase and one fewer schema to maintain.
338 384
339 ## What this page is *not* 385 ## What this page is *not*
340 386
en/docs/reference/maintainer/bi-engine.md
@@ -148,21 +148,44 @@ several `Sp_SalesOrder_Kpi*` procs (matches the @@ -148,21 +148,44 @@ several `Sp_SalesOrder_Kpi*` procs (matches the
148 [per-customer SQL override channel](../../slices/05-customer-sql-override.md) 148 [per-customer SQL override channel](../../slices/05-customer-sql-override.md)
149 — customers who want different KPI rules ship their own proc). 149 — customers who want different KPI rules ship their own proc).
150 150
151 -## Why this matters  
152 -  
153 -xly's BI layer demonstrates the data-driven thesis at scale:  
154 -  
155 -1. **Adding a new dashboard card requires no Java change** — a PM  
156 - inserts a `gdsconfigcharmaster` row pointing at a `Sp_chart_*` proc,  
157 - sets `sCharType` and `iWidth`, the SPA picks it up on the next  
158 - `getModelBysId` cache miss.  
159 -2. **Adding a new chart proc** does require a SQL author (the proc  
160 - has to follow the standard tenant-scoped shape so generic dispatch  
161 - can call it through `CharServiceImpl`).  
162 -3. **No OLAP cube, no MDX, no semantic layer.** Each chart is a  
163 - purpose-built SQL stored procedure. This trades reusability for  
164 - simplicity — perfect-fit aggregations, no general-purpose ad-hoc  
165 - query builder. 151 +## Drawbacks of the homebrewed approach
  152 +
  153 +The metadata + per-chart-proc design is consistent with xly's data-
  154 +driven thesis, and it avoids carrying a heavy OLAP engine. The costs:
  155 +
  156 +1. **Every new chart needs a SQL author.** "PM adds a metadata row"
  157 + is true *after* an engineer has written the matching `Sp_chart_*`
  158 + proc. There is no aggregation builder, no field-picker, no auto-
  159 + generated query — every metric is a hand-coded stored procedure
  160 + the engineering team has to write, review, and maintain. The
  161 + 20-proc catalogue and 11 chart types are the **whole** set of
  162 + shapes the system can render today.
  163 +2. **Charts run heavy SQL on the OLTP DB.** No warehouse, no
  164 + pre-aggregation, no incremental rollup. A "today's profit"
  165 + chart is a SELECT against the live transactional schema.
  166 + Heavy customers will see chart loads contend with order-entry
  167 + load on the same MySQL instance. Caching helps, but only on hit;
  168 + the first load after metadata change pays full cost.
  169 +3. **No semantic consistency between charts.** Each `Sp_chart_*`
  170 + proc decides for itself how to compute "monthly profit", "today's
  171 + sales", etc. Two charts purporting to show the same metric can
  172 + silently disagree because they're separate proc bodies. A real
  173 + semantic layer would prevent that; the homebrewed model can't.
  174 +4. **No drill-down, no slice-and-dice.** Each chart is a frozen
  175 + query shape. Users can't pivot on different dimensions or drill
  176 + from a summary card into the underlying transactions without an
  177 + engineer authoring a separate proc for each path.
  178 +5. **Customer-divergent KPI logic.** Customers under
  179 + `script/客户/` ship their own `spKPImodule` and
  180 + `Sp_SalesOrder_Kpi*` overrides — different KPI math per
  181 + customer, in code that lives only on that customer's DB. This
  182 + makes "what does this KPI mean" depend on which schema the
  183 + reader is connected to.
  184 +
  185 +The simpler design is fine for "show me the same 20 cards xly has
  186 +always shown". It is not fine if the goal is ad-hoc analytics or
  187 +self-service reporting — those would require a separate semantic /
  188 +warehouse layer that xly does not have.
166 189
167 ## What this is *not* 190 ## What this is *not*
168 191
en/docs/reference/maintainer/cache-invalidation.md
@@ -117,14 +117,47 @@ against the DB does **not** trigger any cleaner. The cache will serve @@ -117,14 +117,47 @@ against the DB does **not** trigger any cleaner. The cache will serve
117 stale metadata until either: 117 stale metadata until either:
118 118
119 1. The cache TTL expires (check the cache config for the actual TTL). 119 1. The cache TTL expires (check the cache config for the actual TTL).
120 -2. A bounce of the application servers (one node at a time if the  
121 - cache is local; once if shared). 120 +2. A bounce of the application servers (one bounce suffices since the
  121 + cache is Redis-backed and shared — see above).
122 3. A manual call to one of the 122 3. A manual call to one of the
123 `BusinessCleanRedisDataImpl.delCleanRedisDataByTableName(<table>, …)` 123 `BusinessCleanRedisDataImpl.delCleanRedisDataByTableName(<table>, …)`
124 - methods is invoked from inside the application (e.g., via a  
125 - maintenance endpoint). Note this clears whatever the local  
126 - `CacheManager` is bound to; if that turns out to be in-memory,  
127 - the cleanup must run on every node. 124 + methods is invoked from inside the application — once, on any
  125 + node, since it clears the shared Redis store.
  126 +
  127 +## Drawbacks of this design
  128 +
  129 +The synchronous `@CacheEvict`-during-save model is operationally
  130 +simple and (with Redis backing) genuinely cross-node coherent. It is
  131 +also fragile in ways worth naming:
  132 +
  133 +- **Two systems with confusingly similar names.** The JMS path
  134 + `CHANGE_GDS_MODULE` + `ConsumerChangeGdsModuleThread` *sounds*
  135 + like it should be cache invalidation but isn't. This page exists
  136 + partly because that conflation is a recurring source of bugs and
  137 + reader confusion. A renaming pass (proc and queue → e.g.
  138 + `MERGE_BASE_GDS_MODULE`) would help, but isn't free.
  139 +- **Eviction is synchronous with the write but not atomic with it.** If the
  140 + Redis call fails mid-save, the row commits but the cache stays
  141 + stale. The framework does not detect or recover from this; a
  142 + Redis outage during save silently corrupts the cache for
  143 + affected rows until TTL expiry.
  144 +- **Eviction is "all or nothing per cache region".** Most
  145 + `@CacheEvict` annotations on `CleanRedisServiceImpl` use
  146 + `allEntries=true`, which dumps the entire cache region rather
  147 + than the affected key. Heavy save throughput causes high
  148 + cache-miss rates immediately afterwards — fine for small
  149 + metadata caches, expensive when dropping a region with thousands
  150 + of entries.
  151 +- **No invalidation budget / batching.** Bulk metadata changes
  152 + (e.g., editing 100 form fields) trigger 100 `@CacheEvict` fires,
  153 + each one round-tripping to Redis. There is no mechanism to
  154 + coalesce evictions into one batch.
  155 +- **Direct DB writes bypass everything.** Any tooling that touches
  156 + the schema outside `BusinessBaseServiceImpl` — including database
  157 + admin scripts, `script/客户/` overrides applied via `mysql`
  158 + command line, and Channel-2 SQL replacements — leaves the cache
  159 + stale until manually invalidated. This is a real operational
  160 + hazard for the deployment pattern xly actually uses.
128 161
129 ## Common bug: the cache is the bug 162 ## Common bug: the cache is the bug
130 163
@@ -135,10 +168,11 @@ old value", check (in this order): @@ -135,10 +168,11 @@ old value", check (in this order):
135 2. Did the change go through a path that invokes 168 2. Did the change go through a path that invokes
136 `BusinessCleanRedisData`? (Direct DB writes or controllers that 169 `BusinessCleanRedisData`? (Direct DB writes or controllers that
137 bypass `BusinessBaseServiceImpl` won't.) 170 bypass `BusinessBaseServiceImpl` won't.)
138 -3. Is the cache shared across nodes (Redis-backed) or local  
139 - (`ConcurrentMapCacheManager`)? Confirm by inspecting the active  
140 - `CacheManager` bean on a running node.  
141 -4. If the cache is local, did every node get the eviction call? 171 +3. Was Redis reachable when the save committed? A failed eviction
  172 + does not roll back the save.
  173 +4. Is the change in a cache region that's evicted by the table that
  174 + was written? `CleanRedisServiceImpl` maps writes to specific
  175 + regions; an unmapped table will not invalidate its readers.
142 176
143 The five-key composite returned by 177 The five-key composite returned by
144 [`getModelBysId` in Slice 1](../../slices/01-hello-world.md) 178 [`getModelBysId` in Slice 1](../../slices/01-hello-world.md)
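Review note on the cache-invalidation.md hunk above: the "no invalidation budget / batching" drawback could be addressed by collecting the affected regions during a bulk save and evicting each one once at the end. This is a minimal sketch of that idea, not code from the repository. It assumes region-wide eviction (as with `allEntries=true`), so repeated evictions of the same region are redundant. The class name, method names, and region names are hypothetical.

```python
class EvictionBatch:
    """Collects cache regions touched during a bulk change and evicts each once."""

    def __init__(self, evict_region):
        # evict_region: callable that clears one whole cache region (assumed).
        self._evict_region = evict_region
        self._pending = set()

    def mark(self, region):
        # Record that a region is stale; duplicates collapse in the set.
        self._pending.add(region)

    def flush(self):
        # Evict every pending region exactly once and return what was evicted.
        regions = sorted(self._pending)
        self._pending.clear()
        for region in regions:
            self._evict_region(region)
        return regions


# Usage: 100 field edits touching one region cost a single eviction call.
calls = []
batch = EvictionBatch(calls.append)
for _ in range(100):
    batch.mark("formFields")  # hypothetical region name
batch.mark("module")          # hypothetical region name
evicted = batch.flush()
```

With region-wide eviction the batch loses nothing: the second through hundredth evictions of the same region clear an already empty region.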
en/docs/reference/maintainer/proc-dispatch.md
@@ -43,9 +43,38 @@ by name lets the framework call any proc the metadata names without a @@ -43,9 +43,38 @@ by name lets the framework call any proc the metadata names without a
43 code change. The framework treats the proc as a black box: name in, 43 code change. The framework treats the proc as a black box: name in,
44 parameters in, result out. 44 parameters in, result out.
45 45
46 -The downside: the runtime cannot statically know which procs exist or  
47 -what their effects are. A typo in `gdsmodule.sSaveProName` produces a  
48 -runtime "proc not found" error, not a compile error. 46 +That convenience comes with substantial costs that are worth being
  47 +explicit about:
  48 +
  49 +- **No compile-time check** on proc names. A typo in
  50 + `gdsmodule.sSaveProName` produces a runtime "proc not found"
  51 + error, not a compile error. Refactoring a proc name requires
  52 + hand-grepping the metadata; the IDE can't help.
  53 +- **No type safety on parameters.** The framework binds parameters
  54 + positionally from a `Map<String, Object>`. A proc whose signature
  55 + changed but whose callers didn't is a runtime crash with no IDE
  56 + warning.
  57 +- **No call-site discoverability.** "Which Java code calls
  58 + `Sp_SalSalesCheck`?" can't be answered by IDE find-usages because
  59 + no Java code does — `gdsmodule` rows do. Maintainers must search
  60 + *both* metadata tables *and* the SQL bodies of other procs that
  61 + may invoke this one.
  62 +- **Effectively no static analysis.** Side effects of any given
  63 + proc are invisible to anyone who hasn't read the proc body. A
  64 + `Sp_SalSalesCheck` named in `gdsmodule.sProcName` could be a
  65 + read-only SELECT or could be doing INSERTs and UPDATEs across a
  66 + dozen tables; the framework treats them identically.
  67 +- **Stack traces that stop at the boundary.** Errors raised
  68 + inside a proc surface in Java as a generic `BadSqlGrammarException`
  69 + or `MySQLSyntaxErrorException`. To get the real error you have
  70 + to enable MyBatis SQL logging and re-run.
  71 +
  72 +A more honest framing: hard-wiring 1000+ procs in Java would be
  73 +painful, but most of that pain comes from xly *having* 1000+ procs
  74 +in the first place. Dynamic dispatch made it cheap to keep adding
  75 +them, which made the pile grow, which made the pile harder to
  76 +audit. The mechanism is what it is; the *amount* of behaviour
  77 +pushed into the SQL layer is the more interesting design question.
49 78
50 ## The conventions procs follow 79 ## The conventions procs follow
51 80
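Review note on the proc-dispatch.md hunk above: the missing compile-time check on proc names can be partly offset by an audit that compares metadata references with the routines that actually exist. The following is a minimal sketch, not code from the repository. It assumes the `gdsmodule` rows and the routine names (for example from MySQL's `information_schema.ROUTINES`) have already been fetched. The function name and the `sId` key are hypothetical; `sSaveProName` and `sProcName` are the columns named in the hunk.

```python
def find_dangling_procs(metadata_rows, existing_procs):
    """Return (module id, column, proc name) for every metadata reference
    that names a proc absent from the schema."""
    # MySQL treats stored routine names case-insensitively, so compare lowercased.
    known = {name.lower() for name in existing_procs}
    dangling = []
    for row in metadata_rows:
        for column in ("sSaveProName", "sProcName"):
            name = row.get(column)
            if name and name.lower() not in known:
                dangling.append((row["sId"], column, name))
    return dangling


# Usage with made-up rows: the second row carries a typo in the proc name.
rows = [
    {"sId": "m1", "sSaveProName": "Sp_SalSalesCheck", "sProcName": None},
    {"sId": "m2", "sSaveProName": "Sp_SalSalesChek"},
]
missing = find_dangling_procs(rows, ["sp_salsalescheck"])
```

Run as a pre-deploy step, a check like this would surface the typo case as a list of dangling references before a user hits the runtime error. It does not cover procs invoked from inside other proc bodies.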
en/docs/reference/maintainer/runtime.md
@@ -221,6 +221,38 @@ Two flagged in slices that belong here permanently: @@ -221,6 +221,38 @@ Two flagged in slices that belong here permanently:
221 load entirely for `UserType.ADMIN`. ADMIN account governance must 221 load entirely for `UserType.ADMIN`. ADMIN account governance must
222 come from outside the app. 222 come from outside the app.
223 223
  224 +## What "universal CRUD" means in practice
  225 +
  226 +The "one controller writes any row in any table" pattern is the
  227 +core data-driven move. It also concentrates risk:
  228 +
  229 +- **`BusinessBaseServiceImpl` is ~3,500 lines** of tightly
  230 + intertwined logic: per-tenant scope-bypass list, special-case
  231 + table hardcodes (`mftproductionplanslave` at line 1768),
  232 + pre/post-save hook dispatch, sTable-driven write routing. Every
  233 + bug fix has to navigate the whole class.
  234 +- **The class is the single point of failure for the entire
  235 + business runtime.** A regression in `addUpdateDelBusinessData`
  236 + breaks save for every form in every tenant simultaneously.
  237 + Module-specific controllers would localise the blast radius;
  238 + the universal one cannot.
  239 +- **No type system on `Map<String, Object>`.** The frontend ships
  240 + a bag of (key, value) pairs. The runtime trusts the keys
  241 + match column names and the values cast to the column types.
  242 + Mismatches surface as `BadSqlGrammarException` at the DAO layer
  243 + — far from where the wrong value originated. There is no
  244 + schema-aware request validation.
  245 +- **Discoverability is poor.** "What endpoints write to
  246 + `mftproductionplanslave`?" can't be answered by IDE find-usages
  247 + — the answer is "any controller that calls
  248 + `BusinessBaseServiceImpl.addBusinessData` with `sTable` set to
  249 + `mftproductionplanslave`", which is everything.
  250 +
  251 +The universal pattern is what makes the data-driven thesis work.
  252 +It is also the reason adding a new module is essentially free
  253 +*and* the reason that touching the runtime is essentially never
  254 +free.
  255 +
224 ## Cache invalidation 256 ## Cache invalidation
225 257
226 When BACK saves a metadata change, the save service synchronously 258 When BACK saves a metadata change, the save service synchronously
@@ -229,5 +261,5 @@ calls `BusinessCleanRedisData.delCleanRedisData*`, which fires @@ -229,5 +261,5 @@ calls `BusinessCleanRedisData.delCleanRedisData*`, which fires
229 A separate JMS path (`ConsumerChangeGdsModuleThread`) exists with a 261 A separate JMS path (`ConsumerChangeGdsModuleThread`) exists with a
230 similar name but does base-data merging via stored proc, not cache 262 similar name but does base-data merging via stored proc, not cache
231 invalidation. See [cache invalidation on metadata change](cache-invalidation.md) 263 invalidation. See [cache invalidation on metadata change](cache-invalidation.md)
232 -for the full story (including the open question about cross-node  
233 -coherence). 264 +for the full story (cross-node coherence is empirically Redis-backed,
  265 +no longer an open question).
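Review note on the "universal CRUD" section added to runtime.md above: the "no schema-aware request validation" point could be illustrated with a check that runs before the payload reaches the DAO layer. This is a minimal sketch under stated assumptions, not the framework's behaviour. It assumes the column map comes from something like `information_schema.COLUMNS`. The coarse type tags, the function name, and the column names in the usage lines are hypothetical.

```python
from decimal import Decimal

# Python types accepted for each coarse column type tag (assumed mapping).
ACCEPTED = {
    "int": (int,),
    "decimal": (int, float, Decimal),
    "str": (str,),
}


def validate_payload(payload, columns):
    """Check a (key, value) bag against a table's columns; return error strings."""
    errors = []
    for key, value in payload.items():
        kind = columns.get(key)
        if kind is None:
            errors.append(f"{key}: not a column of the target table")
        elif value is None:
            continue  # nullability checks are out of scope for this sketch
        elif isinstance(value, bool) or not isinstance(value, ACCEPTED[kind]):
            errors.append(f"{key}: expected {kind}, got {type(value).__name__}")
    return errors


# Usage with hypothetical columns: one wrong type, one unknown key.
cols = {"sId": "str", "dQty": "decimal", "iOrder": "int"}
problems = validate_payload({"dQty": "lots", "sColour": "red"}, cols)
```

Rejecting the request at this point reports the offending key by name, instead of a `BadSqlGrammarException` far from where the wrong value originated.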
en/docs/reference/maintainer/sql-templates.md
@@ -61,20 +61,39 @@ the target schema. @@ -61,20 +61,39 @@ the target schema.
61 document family the proc operates on. 61 document family the proc operates on.
62 - Other placeholders depending on the scaffold. 62 - Other placeholders depending on the scaffold.
63 63
64 -## Why this is a "template" library and not a code generator 64 +## "Template" library, not a code generator — and what that costs
65 65
66 The framework does **not** auto-generate procs from these templates 66 The framework does **not** auto-generate procs from these templates
67 -based on metadata. The scaffolds exist because xly's procs follow a  
68 -common conventional shape; copying the scaffold ensures the new proc: 67 +based on metadata. The scaffolds are convention-enforcing copy-paste
  68 +starters, nothing more. They exist to nudge a new proc into the
  69 +shape that [generic dispatch](proc-dispatch.md) can call:
69 70
70 -- Accepts the standard parameter list `(sGuid, sFormGuid, sLoginId, sBrId, sSuId)`  
71 - that [generic dispatch](proc-dispatch.md) can call.  
72 -- Returns success/error via the standard `OUT sCode INT, OUT sReturn LONGTEXT`. 71 +- Standard parameter list `(sGuid, sFormGuid, sLoginId, sBrId, sSuId)`.
  72 +- Returns success/error via `OUT sCode INT, OUT sReturn LONGTEXT`.
73 - Honours the multi-tenant filter `sBrandsId = sBrId AND sSubsidiaryId = sSuId`. 73 - Honours the multi-tenant filter `sBrandsId = sBrId AND sSubsidiaryId = sSuId`.
74 74
75 -A proc that *doesn't* follow these conventions cannot be invoked  
76 -through generic dispatch and would have to be called from custom Java  
77 -code instead. 75 +Costs of staying at "template" instead of "generator":
  76 +
  77 +- **No enforcement.** A proc that drifts from the convention compiles
  78 + fine. The framework discovers the mismatch at runtime as a
  79 + `BadSqlGrammarException` or wrong-shaped result. There is no
  80 + pre-merge check.
  81 +- **No regeneration.** When the convention itself changes (e.g., a
  82 + new standard `OUT` param), the existing procs do not update.
  83 + Engineers have to grep + rewrite, with no automation.
  84 +- **No knowledge of which proc came from which template.** A proc in
  85 + the live DB doesn't record its origin scaffold; understanding what
  86 + was customised away requires diffing against the scaffold by hand.
  87 +- **Customer overrides under `script/客户/` can — and do — diverge
  88 + from the scaffold shape.** This is reasonable per customer but
  89 + means the conventions are observed by social contract, not by
  90 + any mechanical check.
  91 +
  92 +A real code-generation pipeline (template + metadata → emitted SQL,
  93 +checked in or applied at deploy time) would catch these. The
  94 +trade xly made: less tooling to maintain, but discipline-rather-
  95 +than-enforcement on proc shapes — visible in the 1,687 procs the
  96 +schema currently carries, not all of which follow the conventions.
78 97
79 ## Two loaders 98 ## Two loaders
80 99
en/docs/slices/04-custom-field.md
@@ -112,15 +112,36 @@ ignored at merge time. A maintainer audit script that flags such orphans @@ -112,15 +112,36 @@ ignored at merge time. A maintainer audit script that flags such orphans
112 is on the [Maintainer Reference](../reference/maintainer/runtime.md)'s 112 is on the [Maintainer Reference](../reference/maintainer/runtime.md)'s
113 TODO list. 113 TODO list.
114 114
115 -## Why it works without code changes  
116 -  
117 -The end-customer never asks an engineer for a new column. They open the  
118 -BACK builder, add the row, the field appears in FROUNT for their tenant  
119 -only. The system's other tenants are untouched. That single-codebase  
120 -property is what xly's data-driven thesis ([Concepts → Thesis](../concepts/thesis.md))  
121 -buys — at the cost of the runtime cost of merging metadata on every  
122 -request, plus the schema bloat of three customization tables that most  
123 -forms never use. 115 +## Why it works without code changes — and what that costs
  116 +
  117 +The end-customer never asks an engineer for a new column for the
  118 +*display* side. They open the BACK builder, add the row, the field
  119 +appears in FROUNT for their tenant only. The system's other tenants
  120 +are untouched.
  121 +
  122 +The price for that property:
  123 +
  124 +- **The merge runs on every request** (not just on overlay-row
  125 + changes). Even tenants with zero `gdsconfigformcustomslave` rows
  126 + pay the runtime cost of checking — the framework can't tell upfront
  127 + whether a tenant has overrides, so the merge code path runs always.
  128 +- **Three near-empty tables on every schema.** The three customization
  129 + tables exist whether the tenant uses them or not. In this dev DB
  130 + `gdsconfigformcustomslave` has 0 rows; the table is still indexed,
  131 + backed up, and queried.
  132 +- **Display extension only.** The overlay can render an extra field;
  133 + it cannot store its value unless the underlying physical table
  134 + already has the column. So "no code change for a new field" is true
  135 + only for *display-only* fields. Real new persisted fields still
  136 + need a coordinated `ALTER TABLE` (Slice 5 territory) — which means
  137 + the wins from "no code change" don't apply to the cases that
  138 + actually move business value.
  139 +- **Debuggability gets worse.** "Why does tenant A see this field
  140 + but tenant B doesn't?" requires diffing
  141 + `gdsconfigformcustomslave` + `gdsconfigformpersonalize` +
  142 + `gdsconfigformuserslave` rows for both tenants. The merge logic in
  143 + `BusinessBaseServiceImpl` is non-trivial; reproducing the exact
  144 + layout a user sees often means re-running the merge by hand.
124 145
125 ## Concepts this slice introduces 146 ## Concepts this slice introduces
126 147
en/docs/slices/05-customer-sql-override.md
@@ -90,8 +90,9 @@ framework doesn't know; the framework can't tell. @@ -90,8 +90,9 @@ framework doesn't know; the framework can't tell.
90 90
91 This makes overrides: 91 This makes overrides:
92 92
93 -- **Powerful.** Anything you can write in MySQL stored-procedure SQL,  
94 - you can use to replace standard behaviour. 93 +- **Capable in the technical sense.** Anything you can write in MySQL
  94 + stored-procedure SQL can replace standard behaviour. (This isn't a
  95 + good thing per se — see drawbacks below.)
95 - **Operationally fragile.** The override must be re-applied (or kept 96 - **Operationally fragile.** The override must be re-applied (or kept
96 alive) whenever the customer's schema is rebuilt, restored, or 97 alive) whenever the customer's schema is rebuilt, restored, or
97 migrated. It does not travel with backups of the codebase, only with 98 migrated. It does not travel with backups of the codebase, only with
@@ -101,10 +102,25 @@ This makes overrides: @@ -101,10 +102,25 @@ This makes overrides:
101 the proc on the live DB is a different piece of code with the same 102 the proc on the live DB is a different piece of code with the same
102 name. Stack traces and "what does this proc do" depend on which 103 name. Stack traces and "what does this proc do" depend on which
103 schema you're connected to. 104 schema you're connected to.
104 -  
105 -The right rule of thumb: prefer Slice-4 metadata customization. Reach  
106 -for Slice-5 SQL overrides only when the metadata model genuinely cannot  
107 -express what the customer needs. 105 +- **No version control on the deployed body.** The `.sql` file in
  106 + `script/客户/` shows what *should* have been applied. There is no
  107 + audit trail confirming what *was* applied (or when, or by whom),
  108 + and no automated re-apply on schema rebuild.
  109 +- **No type-safety bridge.** When the override changes a result-set
  110 + shape, every Java caller that reads from `Sp_SalSalesCheck` may
  111 + silently break for that one customer with a `BadSqlGrammarException`
  112 + or — worse — a wrong-shaped row that propagates as a wrong number.
  113 +- **Compounds the BI problem.** Charts on customers with overridden
  114 + procs ([bi-engine.md](../reference/maintainer/bi-engine.md))
  115 + will silently disagree across tenants because the underlying data
  116 + is computed by different SQL.
  117 +
  118 +The "prefer Slice 4, reach for Slice 5 only as last resort" advice is
  119 +correct in principle, but the existence of 18 customer directories
  120 +suggests that in practice this channel has become the standard answer
  121 +for material business-logic differences. That's a signal the metadata
  122 +model isn't expressive enough for the actual customer-customisation
  123 +demand the system encounters — not a celebration of the escape hatch.
108 124
109 ## Worked-example: 重庆展印's `Sp_SalSalesCheck` vs the standard 125 ## Worked-example: 重庆展印's `Sp_SalSalesCheck` vs the standard
110 126
en/docs/slices/06-hardware.md
@@ -83,25 +83,50 @@ other data: @@ -83,25 +83,50 @@ other data:
83 83
84 ## The framework / hardware boundary 84 ## The framework / hardware boundary
85 85
86 -This is the cleanest story xly tells about an awkward problem: 86 +xly's response to the press-PLC problem is a strict separation:
87 87
88 - **Above the line (xlyEntry, xlyApi, all the metadata machinery): 88 - **Above the line (xlyEntry, xlyApi, all the metadata machinery):
89 generic framework.** No knowledge of presses, PLCs, byte protocols. 89 generic framework.** No knowledge of presses, PLCs, byte protocols.
90 - **Below the line (xlyPlc): hardware-specific.** Knows how to talk to a 90 - **Below the line (xlyPlc): hardware-specific.** Knows how to talk to a
91 press. 91 press.
92 92
93 -The two communicate only through the database. The bridge writes rows;  
94 -the framework reads rows. There's no RPC, no shared in-process state,  
95 -no callback. This makes xlyPlc:  
96 -  
97 -- Independently deployable (and several customers run it on a machine  
98 - next to the press, separate from the central ERP server).  
99 -- Independently failable: if the bridge crashes, the framework keeps  
100 - running on stale machine-state data. If the framework is down, the  
101 - bridge keeps writing — when the framework comes back, it sees the  
102 - buffered rows.  
103 -- Hard to test end-to-end without an actual press. Most CI tests stub  
104 - the PLC reads. 93 +The two communicate only through the database — the bridge writes rows,
  94 +the framework reads rows. No RPC, no shared in-process state, no
  95 +callback. The benefits:
  96 +
  97 +- Independently deployable; some customers run xlyPlc on a machine next
  98 + to the press, separate from the central ERP server.
  99 +- Independently failable: if the bridge crashes the framework serves
  100 + stale machine-state data; if the framework is down the bridge keeps
  101 + writing and the framework picks up the buffered rows on recovery.
  102 +
  103 +The costs of "DB as the only contract" are real and worth naming:
  104 +
  105 +- **No backpressure.** If the bridge writes faster than xly can ingest
  106 + (or if a slow `mftProduceReportMachineState` index update piles up),
  107 + the bridge has no signal to slow down — it just blocks on the next
  108 + INSERT. There is no flow-control message between the two halves.
  109 +- **No request/response semantics.** The framework cannot ask the
  110 + bridge "is the press alive right now?" — it can only read whatever
  111 + the bridge last wrote, which may be seconds-to-minutes old depending
  112 + on the cron cadence.
  113 +- **Bridge-side state is invisible to the framework.** "Why is the
  114 + bridge not writing?" requires logging into the bridge host to read
  115 + its log; the framework UI shows only the absence of new rows.
  116 +- **Polling at every layer.** xlyPlc polls the press; the
  117 + framework polls the DB; the SPA polls the framework. Three layers
  118 + of polling means latency from "press state changes" to "user sees
  119 + it" is the sum of the three polling intervals in the worst case.
  120 +- **Hard to test end-to-end without an actual press.** Most CI tests
  121 + stub the PLC reads, which means the bridge's most error-prone code
  122 + (byte protocol per press model) gets the least automated coverage.
  123 +
  124 +A real-time-aware architecture would use a streaming channel
  125 +(MQTT / Kafka / WebSocket) end-to-end instead of cron + DB. xly's
  126 +choice is operationally simpler but trades off latency, observability,
  127 +and flow control. For the printing-press tempo (machine state changes
  128 +every few seconds, reports every minute) the trade is liveable; for
  129 +faster shop-floor signals it would not be.
105 130
106 ## Concepts this slice introduces 131 ## Concepts this slice introduces
107 132
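Review note on the 06-hardware.md hunk above: the latency cost of stacked polling can be made concrete with a small calculation. This is a sketch under simplifying assumptions: each layer polls at a fixed interval, processing time is ignored, and the intervals in the usage lines are invented for illustration rather than measured from xly. A change that lands just after a poll waits a full interval at that layer, so the worst case is the sum of the intervals; on average each layer adds half its interval.

```python
def polling_latency(intervals_s):
    """Return (worst_case, expected) end-to-end delay in seconds for a chain
    of polling layers, ignoring processing time at each hop."""
    worst_case = sum(intervals_s)
    expected = sum(interval / 2 for interval in intervals_s)
    return worst_case, expected


# Hypothetical intervals: bridge polls press every 5 s, framework reads the DB
# every 60 s, SPA refreshes every 10 s.
worst, mean = polling_latency([5, 60, 10])
```

The slowest layer dominates both figures, so shortening only the fastest poll buys almost nothing.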