From ab4cf6c6c484a4b0d35edbd4966cd96b564fcc4b Mon Sep 17 00:00:00 2001
From: zichun
Date: Mon, 11 May 2026 09:46:34 +0800
Subject: [PATCH] docs: en wiki — strip apologetic framing; state design choices with their drawbacks

---
 en/docs/concepts/api-surface.md                    | 39 ++++++++++++++++++++++++++++++++++-----
 en/docs/concepts/customization-channels.md         |  8 ++++++--
 en/docs/concepts/master-slave.md                   | 21 ++++++++++++---------
 en/docs/concepts/modules-forms-vtables.md          | 19 +++++++++++++++----
 en/docs/concepts/multi-tenancy.md                  | 30 +++++++++++++++++++++++++++++-
 en/docs/concepts/semantic-fk.md                    | 26 +++++++++++++++++++++++---
 en/docs/concepts/thesis.md                         | 53 +++++++++++++++++++++++++++++++++++++++++++----------
 en/docs/reference/maintainer/activiti.md           | 96 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-------------------------
 en/docs/reference/maintainer/bi-engine.md          | 53 ++++++++++++++++++++++++++++++++++++++---------------
 en/docs/reference/maintainer/cache-invalidation.md | 54 ++++++++++++++++++++++++++++++++++++++++++++----------
 en/docs/reference/maintainer/proc-dispatch.md      | 35 ++++++++++++++++++++++++++++++++---
 en/docs/reference/maintainer/runtime.md            | 36 ++++++++++++++++++++++++++++++++++--
 en/docs/reference/maintainer/sql-templates.md      | 37 ++++++++++++++++++++++++++++---------
 en/docs/slices/04-custom-field.md                  | 39 ++++++++++++++++++++++++++++++---------
 en/docs/slices/05-customer-sql-override.md         | 28 ++++++++++++++++++++++------
 en/docs/slices/06-hardware.md                      | 51 ++++++++++++++++++++++++++++++++++++++-------------
 16 files changed, 499 insertions(+), 126 deletions(-)

diff --git a/en/docs/concepts/api-surface.md b/en/docs/concepts/api-surface.md
index 05aab4b..e89fb9a 100644
--- a/en/docs/concepts/api-surface.md
+++ b/en/docs/concepts/api-surface.md
@@ -17,22 +17,51 @@ database is what makes their separation work — internal-API writes show up to external-API reads automatically because both run against the same schema. 
-## Why three tiers, not one +## Why three tiers (and what splitting them costs) -Each tier answers a different question, and bundling them would -sacrifice clarity: +Each tier was originally split off to answer a different question: - **Internal** is large (universal CRUD over all metadata-driven modules), volatile (changes when the framework changes), and intentionally untyped (the SPA decides what to ask for, server obeys). - **External** is curated (only the endpoints integrators are allowed to use), versioned by `sApiCode`, and authenticated with bearer - tokens — it survives across framework changes precisely because it's - small and explicit. + tokens. - **Inbound webhooks** receive untrusted bodies from third-party systems and route them to xly handlers. The Swagger UI lives here because that audience benefits most from interactive documentation. +The split has real costs that the wiki should not gloss over: + +- **Three WARs to deploy, monitor, and version-pin.** A new release + has to ship coordinated builds of `xlyEntry`, `xlyApi`, and + `xlyInterface`. Mismatches (e.g., a schema change in `xlyEntry` + that `xlyApi` hasn't picked up) are silent until the call path + hits them. +- **Duplicate code.** `RequestAddParamUtil` exists in both + `xlyPersist` (for `xlyEntry`) and `xlyApi` (near-identical 56-vs-57 + line copy). `InterfaceController` exists in both `xlyApi` and + `xlyInterface` with overlapping `/interfaceDefine/callthirdparty/*` + endpoints. Keeping the two halves in sync is operational discipline, + not a compile-time guarantee. +- **No shared session.** A user authenticated in BACK has no + session in `xlyApi` — external callers fetch a separate bearer + token. This is correct for *external* integrators but means + internal cross-WAR calls (rare in practice, common in temptation) + have to go through the public token flow. 
+- **Three context-paths means three reverse-proxy entries.** + The mapping from `BACK=:8597` and `FROUNT=:8598` to the actual + WARs lives in nginx config that isn't in this repo. Misconfigured + proxies are a common failure mode the codebase can't catch. + +Could the split have been a single deployable with internal package +boundaries? Yes — Spring Boot supports it. The benefit of that +alternative would be: one build, one set of dependencies, one +session story, no duplicate utility classes. The cost: harder to +scale tiers independently, harder to rate-limit external callers +without affecting the SPA. xly chose the deployment-time isolation; +the wiki's job is to acknowledge what that choice traded away. + ## What each tier looks like at runtime - **Internal** — see [the five-key read](../reference/maintainer/runtime.md#the-five-key-read). One diff --git a/en/docs/concepts/customization-channels.md b/en/docs/concepts/customization-channels.md index 9ce50b5..3732a01 100644 --- a/en/docs/concepts/customization-channels.md +++ b/en/docs/concepts/customization-channels.md @@ -24,8 +24,12 @@ They are visible in the BACK UI so a PM can audit them. The framework's runtime reads them on every request (with caching). The Java code is unchanged; the application's behaviour is what those rows say it is. -This is the default path. **90%+ of customer customizations should live -here.** +This is the path the architecture intends customers to use. Whether +the actual ratio is 90/10 in favour of Channel 1 isn't measured +anywhere; the empirical signal is that 18 customer directories +under `script/客户/` exist, which is a non-trivial slice of the +customer base needing what Channel 1 can't express. Take "90%+ +should live here" as an aspirational target, not a measured fact. 
## Channel 2 — Per-customer SQL overrides diff --git a/en/docs/concepts/master-slave.md b/en/docs/concepts/master-slave.md index e647689..e412b0f 100644 --- a/en/docs/concepts/master-slave.md +++ b/en/docs/concepts/master-slave.md @@ -64,13 +64,16 @@ appears verbatim in 14k+ table and column names. ## "Slave" — naming caveat The term carries connotations in English that are absent from the -Chinese 主表 / 从表. The wiki preserves "slave" because: +Chinese 主表 / 从表. The wiki preserves "slave" verbatim because the +codebase, schema, and auto-catalog use it in 14k+ identifiers and any +translation would diverge from what developers actually grep. -1. Renaming would break every cross-reference into the codebase, the - schema, and the auto-catalog (14k+ identifiers). -2. Mapping every occurrence to "detail" or "child" would distort - searchability and produce wiki text that diverges from what - developers actually grep. - -Future xly versions may rebrand to "detail" / "header"; until then, the -wiki uses the in-codebase term verbatim and notes it once here. +That preservation has a cost. The naming was a poor choice in the +first place — `主表 / 从表` translates straightforwardly as +`master / detail` or `header / line`, both of which would have +matched both English convention and the actual relational +semantics. The cost of retaining "slave" is borne by every English- +speaking maintainer who has to type or read the term, and by any +future rebrand effort that has to do the schema-wide rename xly +should have done at the start. The wiki documenting it once here +doesn't remove the cost; it just acknowledges it. diff --git a/en/docs/concepts/modules-forms-vtables.md b/en/docs/concepts/modules-forms-vtables.md index 2f8144b..02a46f8 100644 --- a/en/docs/concepts/modules-forms-vtables.md +++ b/en/docs/concepts/modules-forms-vtables.md @@ -87,10 +87,21 @@ sub-tabs. 
## Three nouns, one engine The runtime — `BusinessBaseController` and `BusinessBaseServiceImpl`, -documented in [Slice 1](../slices/01-hello-world.md) — knows how to -render any module / form / virtual-table combination. There is no -per-module Java code. PMs creating new modules are creating new rows; -they are not creating new code paths. +documented in [Slice 1](../slices/01-hello-world.md) — handles every +module / form / virtual-table combination through one universal +dispatch path. There is no per-module Java; PMs creating new modules +are creating new rows. + +The flip side: that one engine has accumulated 3,500+ lines in +`BusinessBaseServiceImpl` alone, plus another 800+ in +`BusinessGdsconfigformsServiceImpl`. Edge cases, special-case +table handling (e.g., the `mftproductionplanslave` hardcode at +`BusinessBaseServiceImpl.java:1768`), per-tenant overlay merge +logic, and the multi-tenant scope-bypass list all live in this +single class. Adding a new feature that the universal dispatch +doesn't handle means either expanding this class or writing +custom code that bypasses it — both of which erode the "one +engine handles everything" property the design promised. ## Business-data table prefixes diff --git a/en/docs/concepts/multi-tenancy.md b/en/docs/concepts/multi-tenancy.md index cf80bb4..25633a2 100644 --- a/en/docs/concepts/multi-tenancy.md +++ b/en/docs/concepts/multi-tenancy.md @@ -69,7 +69,7 @@ tenant. That's the catastrophic data-leak case. Three places to watch: doesn't validate the supplied table against the form's authorised tables, this is a privilege-escalation surface. See [Slice 1](../slices/01-hello-world.md#4-user-edits-a-row-clicks-save). -## How the design scales +## How the design scales — and where it doesn't The framework's multi-tenancy design scales by **row count**, not by code. 
A small SaaS deployment with one brand and one subsidiary uses @@ -77,3 +77,31 @@ exactly the same Java, MyBatis mappers, and stored procedures as a deployment with dozens of brands × dozens of subsidiaries × several editions; only the row distributions in `gdsmodule`, `sisversionflow`, and the business-data tables differ. + +Scaling by row count is operationally simple but has limits the +wiki should not paper over: + +- **Shared physical schema means shared resource contention.** + Every tenant's queries hit the same MySQL instance, same tables, + same indexes. A heavy report on tenant A's data competes for + buffer-pool space and CPU with tenant B's order entry. There is + no per-tenant resource isolation. +- **Tenant filters in every WHERE clause.** Every read query + carries `sBrandsId = ? AND sSubsidiaryId = ?`. Indexes have to + lead with these columns to be useful — and almost all xly tables + do, by convention, but a maintainer adding a new index has to + remember. Forgetting produces a query plan that scans across all + tenants' rows and silently slows down once the table gets large. +- **No physical hard-delete boundary.** A tenant offboarding does + not drop a database; it leaves the rows where they are + (sometimes marked `bInvalid`, sometimes deleted, sometimes + untouched). Permanent removal requires a custom cleanup script + per tenant. From a GDPR / data-residency angle, "this tenant + is gone" is hard to prove. +- **`sBrandsId` / `sSubsidiaryId` everywhere is an inflexible + tenancy unit.** "Tenant" means exactly the `(sBrandsId, + sSubsidiaryId)` tuple. Alternate cuts (e.g., per-region access, + per-department access without sub-tenanting) don't fit the model + and would require parallel scoping columns. The model assumed + this shape would always be right for every customer; in + practice, it has been, but it's a hard commitment. 
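
Reviewer sketch (not part of the patch): the "tenant filters in every WHERE clause" bullet in the multi-tenancy hunk above says a new index has to lead with `sBrandsId` and `sSubsidiaryId`, and that a maintainer has to remember this. A minimal sketch of how that convention could be checked mechanically, assuming index definitions are available as plain column lists. The function names, the index names, and the `tCreateDate` column are hypothetical and do not come from the xly codebase; either order of the two tenant columns is accepted, since both are equality predicates.

```python
from typing import Dict, List, Sequence

# Tenant scoping columns named in the multi-tenancy text above.
TENANT_COLUMNS = ("sBrandsId", "sSubsidiaryId")


def leads_with_tenant_columns(index_columns: Sequence[str]) -> bool:
    """Return True when the index's leading columns are the tenant pair.

    Comparison is case-insensitive because MySQL column names are.
    Either ordering of the two tenant columns is accepted.
    """
    wanted = {name.lower() for name in TENANT_COLUMNS}
    prefix = {name.lower() for name in index_columns[: len(TENANT_COLUMNS)]}
    return prefix == wanted


def flag_unscoped_indexes(indexes: Dict[str, Sequence[str]]) -> List[str]:
    """Return the sorted names of indexes that do not lead with the tenant pair."""
    return sorted(
        name
        for name, columns in indexes.items()
        if not leads_with_tenant_columns(columns)
    )


if __name__ == "__main__":
    # Hypothetical index definitions, for illustration only.
    candidate_indexes = {
        "idx_tenant_created": ["sBrandsId", "sSubsidiaryId", "tCreateDate"],
        "idx_created_only": ["tCreateDate"],
    }
    for index_name in flag_unscoped_indexes(candidate_indexes):
        print(f"index {index_name} does not lead with the tenant columns")
```

A check like this only covers the indexing half of the bullet; it says nothing about whether a given query actually carries the tenant predicate.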
diff --git a/en/docs/concepts/semantic-fk.md b/en/docs/concepts/semantic-fk.md index 593f510..8a5b9c1 100644 --- a/en/docs/concepts/semantic-fk.md +++ b/en/docs/concepts/semantic-fk.md @@ -13,7 +13,7 @@ before reading any further. ## Why xly disabled FKs -Two reasons given by the architecture, both pragmatic: +Two reasons the architecture gives: 1. **Bulk-write performance.** Mass inserts (work-order calculation, month-end closures, batch imports) write hundreds of thousands of @@ -23,8 +23,28 @@ Two reasons given by the architecture, both pragmatic: 2. **Schema-migration agility.** xly evolves quickly: new modules, new fields, new tables. With FKs, every schema change has to consider the constraint graph; without them, a `CREATE TABLE` or `ALTER TABLE` - is a local operation. The cost of that agility is borne at runtime - by the application code. + is a local operation. + +Both are real considerations, but neither is a slam-dunk argument for +"zero FKs across the entire schema": + +- **Bulk-write performance** can be addressed surgically: disable + constraints during the batch (`SET FOREIGN_KEY_CHECKS = 0`), + re-enable after, validate. xly's choice was instead to not have + FKs at all, which means *every* read also pays the cost of trusting + ad-hoc proc validation rather than DB-enforced integrity. +- **Schema-migration agility** is improved by no-FKs, but at the + price of moving every referential check into application code (or + forgetting it). In practice this means the integrity work an FK + would do automatically is now duplicated across hundreds of stored + procedures, with no compile-time guarantee any given proc actually + does the check (see Failure modes below). + +A more honest framing: the system traded **DB-enforced integrity** +for **operational convenience at write time and DDL time**. 
The +bug surface that trade introduced (orphan rows, cross-tenant +references that go undetected, integrity bugs surfacing weeks +later) is the cost paid every day the system runs. ## What a "semantic FK" is diff --git a/en/docs/concepts/thesis.md b/en/docs/concepts/thesis.md index dbcbf78..ba89ebe 100644 --- a/en/docs/concepts/thesis.md +++ b/en/docs/concepts/thesis.md @@ -43,28 +43,61 @@ Three costs are baked into this design and worth being explicit about: similar joins) is a [semantic FK](semantic-fk.md). Orphan rows are possible. -## The reward - -In exchange xly gets: +## What the design enables (and what each enabler still costs) - **One codebase serves dozens of customers.** Each customer's tenant - has its own metadata rows; the Java is identical. + has its own metadata rows; the Java is identical. — *Limit:* it + *doesn't* serve all customers. The 18 directories under + `script/客户/` (see [Slice 5](../slices/05-customer-sql-override.md)) + are the wall the data-driven design hits — when a customer needs + different procedural logic, "single codebase" stops being true and + becomes "single Java codebase + a fan-out of customer-specific SQL + the database carries silently". - **PMs evolve the application without engineering time.** They open BACK, add a module, define a form, set permissions, and the next user - load shows the change. -- **Customizations are layered cleanly** ([Slice 4](../slices/04-custom-field.md)): + load shows the change. — *Limit:* the PM's effective vocabulary is + whatever `gdsconfigformmaster` / `gdsconfigformslave` columns + expose. Anything genuinely new (a custom calculation, a non-standard + validation, a different save path) requires a stored procedure — + which takes engineering time again, just in SQL instead of Java. And + PMs without DB access can't reason about why their metadata change + produced wrong output, because the procedural side is invisible from + BACK. 
+- **Customizations are layered "cleanly"** ([Slice 4](../slices/04-custom-field.md)): per-tenant overrides sit *on top of* the shared base without forking. + — *Limit:* the cleanliness is a Java-side property. The runtime + merge logic in `BusinessBaseServiceImpl` is non-trivial (3,500+ + lines), debugging "why does this tenant see field X but not Y" + involves chasing through `gdsconfigformpersonalize` + + `gdsconfigformcustomslave` + `gdsconfigformuserslave` interactions. + And the overlay model can't `ALTER TABLE` — adding a real new + column still needs a coordinated schema migration. + +A more candid reading: the data-driven design **shifts complexity +out of Java and into the database and the PM-built metadata**. The +total complexity isn't lower; it's redistributed to people and tools +the framework can't compile-check. ## When it breaks down Data-driven works until a customer needs behaviour that can't be expressed as metadata — different SQL, different procedure body, an aggregation rule -that doesn't fit the framework's vocabulary. xly's escape hatch for that -case is the [per-customer SQL override channel](../slices/05-customer-sql-override.md): +that doesn't fit the framework's vocabulary. xly's response is the +[per-customer SQL override channel](../slices/05-customer-sql-override.md): hand-written SQL committed to `script/客户//` and applied directly to that customer's schema, bypassing the framework entirely. -That channel is real and used. It is also the most expensive form of -customization to maintain. + +It's worth being blunt about what this means. "Bypassing the framework" +makes the entire data-driven thesis a *partial* property of the system. +For the 18 customers under `script/客户/` the runtime is **no longer +single-codebase** — the Java is shared but the actual proc bodies +running on each customer's DB diverge, with no automated way to +detect drift. 
A reviewer reading `Sp_SalSalesCheck` in source has no +guarantee it's what runs in production for any given customer. The +"escape hatch" framing is generous; in practice the override channel +has become the standard answer for material business-logic +differences, which is the failure mode the data-driven design was +supposed to prevent. ## What this means for reading the wiki diff --git a/en/docs/reference/maintainer/activiti.md b/en/docs/reference/maintainer/activiti.md index d4861df..b4cddd6 100644 --- a/en/docs/reference/maintainer/activiti.md +++ b/en/docs/reference/maintainer/activiti.md @@ -169,26 +169,47 @@ emit audit entries via a custom `sp_add_flow_log`. This is the empirically-observed customisation channel — Activiti deployment is not seen in any `script/客户/` directory. -### Why this design works for xly's audience +### Why xly avoided Activiti — and what that costs The printing-industry ERP customers run rule-driven business processes (quote → order → production → delivery → invoice → payment) -where each step is **its own document with its own form** by -convention. A user expects "Now I open the next form and fill it in" -rather than "the system tells me a task is waiting for me." For -that audience: - -- Path 1 + Path 2 cover every observed scenario in this dev DB. -- Path 3's value (BPMN modeling, reassignment, parallel gateways) is - reserved for the rare tenant whose approval graph genuinely needs - it. - -The trade-off: workflow logic is **scattered across stored procedures** -rather than declarable in one place. Adding a new step to a flow -means writing or editing one or more procs, not editing a BPMN -diagram. For complex, frequently-changing flows, this is brittle. -For the printing-shop reality (quote-to-cash chain that doesn't -change much per customer), it's pragmatic. +where each step is conventionally its own document with its own form. 
+The audience-fit argument: a user expects "Now I open the next form +and fill it in" rather than "the system tells me a task is waiting +for me," so Path 1 + Path 2 cover every observed scenario in this +dev DB, and Path 3 is held in reserve. + +The costs of going proc-based instead of BPMN-based: + +- **Workflow logic is scattered across stored procedures, not + declarable in one place.** Adding a step to "what happens after a + quote is approved" means writing or editing one or more `Sp_*` procs, + re-grepping every other proc that references the affected document, + and hoping nothing was missed. A BPMN engine would have one diagram + to look at. +- **No central audit trail of who approved what when.** `bCheck = 1` + records that *some* approval happened, plus who approved it via the + `sCheckPerson` column — but the *path the document took* (which + steps, in which order, with what comments) lives only in proc-side + status flags, not in a queryable workflow history. +- **No parallel-branch or reassignment semantics.** Path 1 + 2 cover + linear single-approver flows. The first time a customer needs + "two people must approve in parallel", or "if person A is on + vacation, route to person B", the system has to either fall back + to Path 3 (Activiti, currently disabled) or hand-code the routing + in stored procs. +- **Flow-graph evolution is invisible.** Changing the steps of a + workflow means editing procs and document chains. There is no + diff that says "the order-approval flow changed from N steps to + N+1 steps on date X" — only commit history of individual procs. +- **The Activiti engine is on the classpath and booted at runtime + for nothing.** Memory + JAR + schema (24 `act_*` base tables + 3 + identity views) are paid for in every deployment whether they're + used or not. + +For the printing-shop reality the trade has been viable. It would +not scale to a domain with frequently-changing approval flows or +strict audit requirements. 
## Activiti is wired — engine ON @@ -320,21 +341,46 @@ For a flow to actually run, in roughly this order: transitions; downstream queries that filter on `bCheck = 1` start seeing it. -## Why xly bothered with Activiti at all +## Why xly bothered with Activiti — and whether it was worth it The codebase has its own `biz_flow` / `biz_todo_item` tables that -*could* implement a hand-rolled approval system. The decision to put -Activiti behind them buys: +*could* implement a hand-rolled approval system. The arguments for +putting Activiti behind them: - Standard BPMN modeling (the JS modeler pulls the same stencilset as Activiti Explorer). -- Free state-machine semantics — the engine handles "task A done → - task B available" without xly maintaining the FSM in SQL. +- Engine-managed state-machine semantics — "task A done → task B + available" without xly maintaining the FSM in SQL. - Diagram rendering (the page-as-PNG in `ProcessActController`). -The cost: a second engine running in the JVM, a second DB schema with -its own DDL drift, a second authentication surface (which xly papers -over via the `act_id_*` views). +The costs are not minor: + +- A second engine running in the JVM, with its own startup cost, + memory footprint, and operational surface. +- A second DB schema (24 `act_*` tables + 3 identity views) that + diverges from xly's `gds*`/`biz*` conventions and needs its own + DDL migrations across Activiti versions (and indeed: see the + 5.17 vs 6.0 version skew elsewhere on this page). +- A second authentication surface that xly papers over via the + `act_id_*` views projecting xly's own users into Activiti's shape + — a hack that works but creates two-way coupling between user- + table changes and Activiti correctness. +- A modeler UI (Angular 1.x era) that maintainers have to learn + separately from BACK. +- And — the most damning cost — **on this dev DB the engine is + idle**. 
The `act_re_procdef` and `biz_flow` tables are empty, and + Path 1 / Path 2 handle every observed workflow scenario. The + Activiti dependency is paid for at every startup whether it's + exercised or not. + +A more honest framing: Activiti was bet on as the "real" workflow +solution; in practice the simpler proc-driven paths covered the +actual demand. The wiring stayed because removing it isn't free +either, but the value the engine delivers in the current deployment +is approximately zero. A future cleanup could plausibly remove +Activiti entirely and consolidate on the document-chain pattern, +trading away the *option* of BPMN-style flows for a smaller +codebase and one fewer schema to maintain. ## What this page is *not* diff --git a/en/docs/reference/maintainer/bi-engine.md b/en/docs/reference/maintainer/bi-engine.md index 27a61b3..5fdadcd 100644 --- a/en/docs/reference/maintainer/bi-engine.md +++ b/en/docs/reference/maintainer/bi-engine.md @@ -148,21 +148,44 @@ several `Sp_SalesOrder_Kpi*` procs (matches the [per-customer SQL override channel](../../slices/05-customer-sql-override.md) — customers who want different KPI rules ship their own proc). -## Why this matters - -xly's BI layer demonstrates the data-driven thesis at scale: - -1. **Adding a new dashboard card requires no Java change** — a PM - inserts a `gdsconfigcharmaster` row pointing at a `Sp_chart_*` proc, - sets `sCharType` and `iWidth`, the SPA picks it up on the next - `getModelBysId` cache miss. -2. **Adding a new chart proc** does require a SQL author (the proc - has to follow the standard tenant-scoped shape so generic dispatch - can call it through `CharServiceImpl`). -3. **No OLAP cube, no MDX, no semantic layer.** Each chart is a - purpose-built SQL stored procedure. This trades reusability for - simplicity — perfect-fit aggregations, no general-purpose ad-hoc - query builder. 
+## Drawbacks of the homebrewed approach + +The metadata + per-chart-proc design is consistent with xly's data- +driven thesis, and it avoids carrying a heavy OLAP engine. The costs: + +1. **Every new chart needs a SQL author.** "PM adds a metadata row" + is true *after* an engineer has written the matching `Sp_chart_*` + proc. There is no aggregation builder, no field-picker, no auto- + generated query — every metric is a hand-coded stored procedure + the engineering team has to write, review, and maintain. The + 20-proc catalogue and 11 chart types are the **whole** set of + shapes the system can render today. +2. **Charts run heavy SQL on the OLTP DB.** No warehouse, no + pre-aggregation, no incremental rollup. A "today's profit" + chart is a SELECT against the live transactional schema. + Heavy customers will see chart loads contend with order-entry + load on the same MySQL instance. Caching helps, but only on hit; + the first load after metadata change pays full cost. +3. **No semantic consistency between charts.** Each `Sp_chart_*` + proc decides for itself how to compute "monthly profit", "today's + sales", etc. Two charts purporting to show the same metric can + silently disagree because they're separate proc bodies. A real + semantic layer would prevent that; the homebrewed model can't. +4. **No drill-down, no slice-and-dice.** Each chart is a frozen + query shape. Users can't pivot on different dimensions or drill + from a summary card into the underlying transactions without an + engineer authoring a separate proc for each path. +5. **Customer-divergent KPI logic.** Customers under + `script/客户/` ship their own `spKPImodule` and + `Sp_SalesOrder_Kpi*` overrides — different KPI math per + customer, in code that lives only on that customer's DB. This + makes "what does this KPI mean" depend on which schema the + reader is connected to. + +The simpler design is fine for "show me the same 20 cards xly has +always shown". 
It is not fine if the goal is ad-hoc analytics or +self-service reporting — those would require a separate semantic / +warehouse layer that xly does not have. ## What this is *not* diff --git a/en/docs/reference/maintainer/cache-invalidation.md b/en/docs/reference/maintainer/cache-invalidation.md index 8bdd53d..bde1d5d 100644 --- a/en/docs/reference/maintainer/cache-invalidation.md +++ b/en/docs/reference/maintainer/cache-invalidation.md @@ -117,14 +117,47 @@ against the DB does **not** trigger any cleaner. The cache will serve stale metadata until either: 1. The cache TTL expires (check the cache config for the actual TTL). -2. A bounce of the application servers (one node at a time if the - cache is local; once if shared). +2. A bounce of the application servers (one bounce suffices since the + cache is Redis-backed and shared — see above). 3. A manual call to one of the `BusinessCleanRedisDataImpl.delCleanRedisDataByTableName(, …)` - methods is invoked from inside the application (e.g., via a - maintenance endpoint). Note this clears whatever the local - `CacheManager` is bound to; if that turns out to be in-memory, - the cleanup must run on every node. + methods is invoked from inside the application — once, on any + node, since it clears the shared Redis store. + +## Drawbacks of this design + +The synchronous `@CacheEvict`-during-save model is operationally +simple and (with Redis backing) genuinely cross-node coherent. It is +also fragile in ways worth naming: + +- **Two systems with confusingly similar names.** The JMS path + `CHANGE_GDS_MODULE` + `ConsumerChangeGdsModuleThread` *sounds* + like it should be cache invalidation but isn't. This page exists + partly because that conflation is a recurring source of bugs and + reader confusion. A renaming pass (proc and queue → e.g. + `MERGE_BASE_GDS_MODULE`) would help, but isn't free. 
+- **Eviction is in the same transaction as the write.** If the + Redis call fails mid-save, the row commits but the cache stays + stale. The framework does not detect or recover from this; a + Redis outage during save silently corrupts the cache for + affected rows until TTL expiry. +- **Eviction is "all or nothing per cache region".** Most + `@CacheEvict` annotations on `CleanRedisServiceImpl` use + `allEntries=true`, which dumps the entire cache region rather + than the affected key. Heavy save throughput causes high + cache-miss rates immediately afterwards — fine for small + metadata caches, expensive when dropping a region with thousands + of entries. +- **No invalidation budget / batching.** Bulk metadata changes + (e.g., editing 100 form fields) trigger 100 `@CacheEvict` fires, + each one round-tripping to Redis. There is no mechanism to + coalesce evictions into one batch. +- **Direct DB writes bypass everything.** Any tooling that touches + the schema outside `BusinessBaseServiceImpl` — including database + admin scripts, `script/客户/` overrides applied via `mysql` + command line, and Channel-2 SQL replacements — leaves the cache + stale until manually invalidated. This is a real operational + hazard for the deployment pattern xly actually uses. ## Common bug: the cache is the bug @@ -135,10 +168,11 @@ old value", check (in this order): 2. Did the change go through a path that invokes `BusinessCleanRedisData`? (Direct DB writes or controllers that bypass `BusinessBaseServiceImpl` won't.) -3. Is the cache shared across nodes (Redis-backed) or local - (`ConcurrentMapCacheManager`)? Confirm by inspecting the active - `CacheManager` bean on a running node. -4. If the cache is local, did every node get the eviction call? +3. Was Redis reachable when the save committed? A failed eviction + does not roll back the save. +4. Is the change in a cache region that's evicted by the table that + was written? 
`CleanRedisServiceImpl` maps writes to specific + regions; an unmapped table will not invalidate its readers. The five-key composite returned by [`getModelBysId` in Slice 1](../../slices/01-hello-world.md) diff --git a/en/docs/reference/maintainer/proc-dispatch.md b/en/docs/reference/maintainer/proc-dispatch.md index 3aeb7af..ec6d767 100644 --- a/en/docs/reference/maintainer/proc-dispatch.md +++ b/en/docs/reference/maintainer/proc-dispatch.md @@ -43,9 +43,38 @@ by name lets the framework call any proc the metadata names without a code change. The framework treats the proc as a black box: name in, parameters in, result out. -The downside: the runtime cannot statically know which procs exist or -what their effects are. A typo in `gdsmodule.sSaveProName` produces a -runtime "proc not found" error, not a compile error. +That convenience comes with substantial costs that are worth being +explicit about: + +- **No compile-time check** on proc names. A typo in + `gdsmodule.sSaveProName` produces a runtime "proc not found" + error, not a compile error. Refactoring a proc name requires + hand-grepping the metadata; the IDE can't help. +- **No type safety on parameters.** The framework binds parameters + positionally from a `Map`. A proc whose signature + changed but whose callers didn't is a runtime crash with no IDE + warning. +- **No call-site discoverability.** "Which Java code calls + `Sp_SalSalesCheck`?" can't be answered by IDE find-usages because + no Java code does — `gdsmodule` rows do. Maintainers must search + *both* metadata tables *and* the SQL bodies of other procs that + may invoke this one. +- **Effectively no static analysis.** Side effects of any given + proc are invisible to anyone who hasn't read the proc body. A + `Sp_SalSalesCheck` named in `gdsmodule.sProcName` could be a + read-only SELECT or could be doing INSERTs and UPDATEs across a + dozen tables; the framework treats them identically. 
+- **Stack traces that stop at the boundary.** SQL errors raised + inside a proc surface in Java as a generic `BadSqlGrammarException` + or `MySQLSyntaxErrorException`. To get the real error you have + to enable MyBatis SQL logging and re-run. + +A more honest framing: hard-wiring 1000+ procs in Java would be +painful, but most of that pain comes from xly *having* 1000+ procs +in the first place. Dynamic dispatch made it cheap to keep adding +them, which made the pile grow, which made the pile harder to +audit. The mechanism is what it is; the *amount* of behaviour +pushed into the SQL layer is the more interesting design question. ## The conventions procs follow diff --git a/en/docs/reference/maintainer/runtime.md b/en/docs/reference/maintainer/runtime.md index 1e5146e..db8005d 100644 --- a/en/docs/reference/maintainer/runtime.md +++ b/en/docs/reference/maintainer/runtime.md @@ -221,6 +221,38 @@ Two flagged in slices that belong here permanently: load entirely for `UserType.ADMIN`. ADMIN account governance must come from outside the app. +## What "universal CRUD" means in practice + +The "one controller writes any row in any table" pattern is the +core data-driven move. It also concentrates risk: + +- **`BusinessBaseServiceImpl` is ~3,500 lines** of tightly + intertwined logic: per-tenant scope-bypass list, special-case + table hardcodes (`mftproductionplanslave` at line 1768), + pre/post-save hook dispatch, sTable-driven write routing. Every + bug fix has to navigate the whole class. +- **The class is the single point of failure for the entire + business runtime.** A regression in `addUpdateDelBusinessData` + breaks save for every form in every tenant simultaneously. + Module-specific controllers would localise the blast radius; + the universal one cannot. +- **No type system on `Map`.** The frontend ships + a bag of (key, value) pairs. The runtime trusts the keys + match column names and the values cast to the column types.
+ Mismatches surface as `BadSqlGrammarException` at the DAO layer + — far from where the wrong value originated. There is no + schema-aware request validation. +- **Discoverability is poor.** "What endpoints write to + `mftproductionplanslave`?" can't be answered by IDE find-usages + — the answer is "any controller that calls + `BusinessBaseServiceImpl.addBusinessData` with `sTable` set to + `mftproductionplanslave`", which is everything. + +The universal pattern is what makes the data-driven thesis work. +It is also the reason adding a new module is essentially free +*and* the reason that touching the runtime is essentially never +free. + ## Cache invalidation When BACK saves a metadata change, the save service synchronously @@ -229,5 +261,5 @@ calls `BusinessCleanRedisData.delCleanRedisData*`, which fires A separate JMS path (`ConsumerChangeGdsModuleThread`) exists with a similar name but does base-data merging via stored proc, not cache invalidation. See [cache invalidation on metadata change](cache-invalidation.md) -for the full story (including the open question about cross-node -coherence). +for the full story (cross-node coherence is empirically Redis-backed, +no longer an open question). diff --git a/en/docs/reference/maintainer/sql-templates.md b/en/docs/reference/maintainer/sql-templates.md index 5a29636..4f17fb4 100644 --- a/en/docs/reference/maintainer/sql-templates.md +++ b/en/docs/reference/maintainer/sql-templates.md @@ -61,20 +61,39 @@ the target schema. document family the proc operates on. - Other placeholders depending on the scaffold. -## Why this is a "template" library and not a code generator +## "Template" library, not a code generator — and what that costs The framework does **not** auto-generate procs from these templates -based on metadata. The scaffolds exist because xly's procs follow a -common conventional shape; copying the scaffold ensures the new proc: +based on metadata. 
The scaffolds are convention-encoding copy-paste +starters, nothing more. They exist to nudge a new proc into the +shape that [generic dispatch](proc-dispatch.md) can call: -- Accepts the standard parameter list `(sGuid, sFormGuid, sLoginId, sBrId, sSuId)` - that [generic dispatch](proc-dispatch.md) can call. -- Returns success/error via the standard `OUT sCode INT, OUT sReturn LONGTEXT`. +- Accepts the standard parameter list `(sGuid, sFormGuid, sLoginId, sBrId, sSuId)`. +- Returns success/error via `OUT sCode INT, OUT sReturn LONGTEXT`. - Honours the multi-tenant filter `sBrandsId = sBrId AND sSubsidiaryId = sSuId`. -A proc that *doesn't* follow these conventions cannot be invoked -through generic dispatch and would have to be called from custom Java -code instead. +Costs of staying at "template" instead of "generator": + +- **No enforcement.** A proc that drifts from the convention compiles + fine. The framework discovers the mismatch at runtime as a + `BadSqlGrammarException` or wrong-shaped result. There is no + pre-merge check. +- **No regeneration.** When the convention itself changes (e.g., a + new standard `OUT` param), the existing procs do not update. + Engineers have to grep and rewrite by hand, with no automation. +- **No knowledge of which proc came from which template.** A proc in + the live DB doesn't record its origin scaffold; understanding what + was customised away requires diffing against the scaffold by hand. +- **Customer overrides under `script/客户/` can — and do — diverge + from the scaffold shape.** This is reasonable per customer but + means the conventions are observed by social contract, not by + any mechanical check. + +A real code-generation pipeline (template + metadata → emitted SQL, +checked in or applied at deploy time) would catch these. The +trade xly made: less tooling to maintain, but discipline rather +than enforcement on proc shapes, visible in the 1,687 procs the +schema currently carries, not all of which follow the conventions.
## Two loaders diff --git a/en/docs/slices/04-custom-field.md b/en/docs/slices/04-custom-field.md index 154dae7..a85d527 100644 --- a/en/docs/slices/04-custom-field.md +++ b/en/docs/slices/04-custom-field.md @@ -112,15 +112,36 @@ ignored at merge time. A maintainer audit script that flags such orphans is on the [Maintainer Reference](../reference/maintainer/runtime.md)'s TODO list. -## Why it works without code changes - -The end-customer never asks an engineer for a new column. They open the -BACK builder, add the row, the field appears in FROUNT for their tenant -only. The system's other tenants are untouched. That single-codebase -property is what xly's data-driven thesis ([Concepts → Thesis](../concepts/thesis.md)) -buys — at the cost of the runtime cost of merging metadata on every -request, plus the schema bloat of three customization tables that most -forms never use. +## Why it works without code changes — and what that costs + +The end-customer never asks an engineer for a new column for the +*display* side. They open the BACK builder, add the row, the field +appears in FROUNT for their tenant only. The system's other tenants +are untouched. + +The price for that property: + +- **The merge runs on every request** (not just on overlay-row + changes). Even tenants with zero `gdsconfigformcustomslave` rows + pay the runtime cost of checking — the framework can't tell upfront + whether a tenant has overrides, so the merge code path runs always. +- **Three near-empty tables on every schema.** The three customization + tables exist whether the tenant uses them or not. In this dev DB + `gdsconfigformcustomslave` has 0 rows; the table is still indexed, + backed up, and queried. +- **Display extension only.** The overlay can render an extra field; + it cannot store its value unless the underlying physical table + already has the column. So "no code change for a new field" is true + only for *display-only* fields. 
Real new persisted fields still + need a coordinated `ALTER TABLE` (Slice 5 territory) — which means + the wins from "no code change" don't apply to the cases that + actually move business value. +- **Debuggability gets worse.** "Why does tenant A see this field + but tenant B doesn't?" requires diffing + `gdsconfigformcustomslave` + `gdsconfigformpersonalize` + + `gdsconfigformuserslave` rows for both tenants. The merge logic in + `BusinessBaseServiceImpl` is non-trivial; reproducing the exact + layout a user sees often means re-running the merge by hand. ## Concepts this slice introduces diff --git a/en/docs/slices/05-customer-sql-override.md b/en/docs/slices/05-customer-sql-override.md index 878cbe9..ec5e86c 100644 --- a/en/docs/slices/05-customer-sql-override.md +++ b/en/docs/slices/05-customer-sql-override.md @@ -90,8 +90,9 @@ framework doesn't know; the framework can't tell. This makes overrides: -- **Powerful.** Anything you can write in MySQL stored-procedure SQL, - you can use to replace standard behaviour. +- **Capable in the technical sense.** Anything you can write in MySQL + stored-procedure SQL can replace standard behaviour. (This isn't a + good thing per se — see drawbacks below.) - **Operationally fragile.** The override must be re-applied (or kept alive) whenever the customer's schema is rebuilt, restored, or migrated. It does not travel with backups of the codebase, only with @@ -101,10 +102,25 @@ This makes overrides: the proc on the live DB is a different piece of code with the same name. Stack traces and "what does this proc do" depend on which schema you're connected to. - -The right rule of thumb: prefer Slice-4 metadata customization. Reach -for Slice-5 SQL overrides only when the metadata model genuinely cannot -express what the customer needs. +- **No version control on the deployed body.** The `.sql` file in + `script/客户/` shows what *should* have been applied. 
There is no + audit trail confirming what *was* applied (or when, or by whom), + and no automated re-apply on schema rebuild. +- **No type-safety bridge.** When the override changes a result-set + shape, every Java caller that reads from `Sp_SalSalesCheck` may + silently break for that one customer with a `BadSqlGrammarException` + or — worse — a wrong-shaped row that propagates as a wrong number. +- **Compounds the BI problem.** Charts on customers with overridden + procs ([bi-engine.md](../reference/maintainer/bi-engine.md)) + will silently disagree across tenants because the underlying data + is computed by different SQL. + +The "prefer Slice 4, reach for Slice 5 only as last resort" advice is +correct in principle, but the existence of 18 customer directories +suggests that in practice this channel has become the standard answer +for material business-logic differences. That's a signal the metadata +model isn't expressive enough for the actual customer-customisation +demand the system encounters — not a celebration of the escape hatch. ## Worked-example: 重庆展印's `Sp_SalSalesCheck` vs the standard diff --git a/en/docs/slices/06-hardware.md b/en/docs/slices/06-hardware.md index e3eddf4..391e17f 100644 --- a/en/docs/slices/06-hardware.md +++ b/en/docs/slices/06-hardware.md @@ -83,25 +83,50 @@ other data: ## The framework / hardware boundary -This is the cleanest story xly tells about an awkward problem: +xly's response to the press-PLC problem is a strict separation: - **Above the line (xlyEntry, xlyApi, all the metadata machinery): generic framework.** No knowledge of presses, PLCs, byte protocols. - **Below the line (xlyPlc): hardware-specific.** Knows how to talk to a press. -The two communicate only through the database. The bridge writes rows; -the framework reads rows. There's no RPC, no shared in-process state, -no callback. 
This makes xlyPlc: - -- Independently deployable (and several customers run it on a machine - next to the press, separate from the central ERP server). -- Independently failable: if the bridge crashes, the framework keeps - running on stale machine-state data. If the framework is down, the - bridge keeps writing — when the framework comes back, it sees the - buffered rows. -- Hard to test end-to-end without an actual press. Most CI tests stub - the PLC reads. +The two communicate only through the database — the bridge writes rows, +the framework reads rows. No RPC, no shared in-process state, no +callback. The benefits: + +- Independently deployable; some customers run xlyPlc on a machine next + to the press, separate from the central ERP server. +- Independently failable: if the bridge crashes the framework serves + stale machine-state data; if the framework is down the bridge keeps + writing and the framework picks up the buffered rows on recovery. + +The costs of "DB as the only contract" are real and worth naming: + +- **No backpressure.** If the bridge writes faster than xly can ingest + (or if a slow `mftProduceReportMachineState` index update piles up), + the bridge has no signal to slow down — it just blocks on the next + INSERT. There is no flow-control message between the two halves. +- **No request/response semantics.** The framework cannot ask the + bridge "is the press alive right now?" — it can only read whatever + the bridge last wrote, which may be seconds-to-minutes old depending + on the cron cadence. +- **Bridge-side state is invisible to the framework.** "Why is the + bridge not writing?" requires logging into the bridge host to read + its log; the framework UI shows only the absence of new rows. +- **Polling at every layer.** xlyPlc polls the press; the + framework polls the DB; the SPA polls the framework. Worst-case + latency from "press state changes" to "user sees it" is the sum of + the three polling intervals (`cron interval * 3` at equal cadence).
+- **Hard to test end-to-end without an actual press.** Most CI tests + stub the PLC reads, which means the bridge's most error-prone code + (byte protocol per press model) gets the least automated coverage. + +A real-time-aware architecture would use a streaming channel +(MQTT / Kafka / WebSocket) end-to-end instead of cron + DB. xly's +choice is operationally simpler but trades off latency, observability, +and flow control. For the printing-press tempo (machine state changes +every few seconds, reports every minute) the trade is liveable; for +faster shop-floor signals it would not be. ## Concepts this slice introduces -- libgit2 0.22.2