Commit 047d358b89a6c15a9d25a3428303717e59ff3bac

Authored by zichun
1 parent 0aee42c4

search: keep code identifiers whole (drop CamelCase splitter)

User flagged that searches for `BusinessBaseServiceImpl` and `MyBatis`
were returning thousands of irrelevant matches. Cause: the search
plugin's separator regex included `(?!\b)(?=[A-Z][a-z])`, which split
every CamelCase boundary at INDEX time AND at QUERY time. So the
indexed token stream for `BusinessBaseServiceImpl` was [Business, Base,
Service, Impl] (each a common, low-relevance token), and the same
4-token expansion happened on the query — every page that mentioned
"service" matched.

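The split described above can be reproduced with a small sketch. It uses Python's `re.split` as a stand-in for the search plugin's tokenizer, which is an assumption: the real pipeline also lowercases and stems tokens, and its regex engine may differ in edge cases. The two patterns are copied from the old and new `separator:` values in the diff; `tokenize` is a hypothetical helper.

```python
import re

# Separator regexes copied from the old and new `separator:` values in mkdocs.yml.
OLD_SEPARATOR = r'[\s\-,;:!=\[\]()"`/]+|(?!\b)(?=[A-Z][a-z])|\.(?!\d)|&[lg]t;|[\u3000-\u303f\uff00-\uffef]'
NEW_SEPARATOR = r'[\s\-,;:!=\[\]()"`/]+|\.(?!\d)|&[lg]t;|[\u3000-\u303f\uff00-\uffef]'


def tokenize(separator: str, text: str) -> list[str]:
    """Split text on the separator and drop empty pieces (hypothetical helper)."""
    # Zero-width matches split the string on Python 3.7+, which is what the
    # CamelCase alternative `(?!\b)(?=[A-Z][a-z])` relies on.
    return [piece for piece in re.split(separator, text) if piece]


for name in ("BusinessBaseServiceImpl", "MyBatis"):
    print(name, tokenize(OLD_SEPARATOR, name), tokenize(NEW_SEPARATOR, name))
```

With the old pattern each identifier breaks at every uppercase-then-lowercase boundary; with the new pattern it stays a single token.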
Remove the CamelCase splitter so identifiers stay whole. Verified
via search_index.json: 18 docs contain the intact
`BusinessBaseServiceImpl` token (down from 119 spurious matches), and
`MyBatis` queries no longer collide with `MySQL` / `My...`. Lunr
still supports trailing wildcards (`Service*`) for prefix search
if a maintainer wants it.
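One way to repeat the search_index.json check is sketched below. It is not necessarily how the count in this message was produced. It assumes the usual MkDocs index layout (a top-level `docs` list whose entries carry `location`, `title`, and `text`), strips HTML tags crudely, and compares case-insensitively. The function name and the index path are hypothetical.

```python
import json
import re
from pathlib import Path

# New separator from mkdocs.yml; identifiers are no longer split on CamelCase.
SEPARATOR = r'[\s\-,;:!=\[\]()"`/]+|\.(?!\d)|&[lg]t;|[\u3000-\u303f\uff00-\uffef]'


def docs_containing_token(index: dict, token: str, separator: str = SEPARATOR) -> list[str]:
    """Return locations of docs whose title or text holds `token` as a whole token."""
    wanted = token.lower()
    hits = []
    for doc in index.get("docs", []):
        raw = f"{doc.get('title', '')} {doc.get('text', '')}"
        # Crude tag stripping (assumption: the index text may contain HTML markup).
        plain = re.sub(r"<[^>]+>", " ", raw)
        tokens = {piece.lower() for piece in re.split(separator, plain) if piece}
        if wanted in tokens:
            hits.append(doc.get("location", ""))
    return hits


if __name__ == "__main__":
    # Assumed build output path; adjust to wherever the site is generated.
    index_path = Path("site/search/search_index.json")
    if index_path.exists():
        index = json.loads(index_path.read_text(encoding="utf-8"))
        print(len(docs_containing_token(index, "BusinessBaseServiceImpl")))
```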
Showing 1 changed file with 7 additions and 3 deletions
en/mkdocs.yml
@@ -36,13 +36,17 @@ theme:
         icon: material/brightness-4
         name: Switch to light mode
 
-# CJK-aware search: regex separator includes word boundaries plus CJK punctuation;
-# for true Chinese tokenization, jieba is invoked by the catalog generator at index time
+# Search separator: whitespace + common punctuation + dots + HTML entities + CJK punctuation.
+# CamelCase splitter removed - code-identifier searches like "BusinessBaseServiceImpl" or
+# "MyBatis" now match the whole identifier instead of being chopped into [Business, Base,
+# Service, Impl] (which produced 1.9k spurious matches and lost the ranked exact hit).
+# Lunr supports wildcard suffixes (e.g. `Service*`) for partial-token search if needed.
+# For true Chinese tokenization, jieba is invoked by the catalog generator at index time
 # (see scripts/gen_catalog.py). Mid-term improvement: a custom mkdocs plugin to feed
 # jieba-segmented terms into lunr.
 plugins:
   - search:
-      separator: '[\s\-,;:!=\[\]()"`/]+|(?!\b)(?=[A-Z][a-z])|\.(?!\d)|&[lg]t;|[\u3000-\u303f\uff00-\uffef]'
+      separator: '[\s\-,;:!=\[\]()"`/]+|\.(?!\d)|&[lg]t;|[\u3000-\u303f\uff00-\uffef]'
 
 markdown_extensions:
   - admonition
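
The wildcard note in the config comment can be illustrated with a rough emulation. This is not Lunr itself: it uses Python's `fnmatch` for glob-style matching over lowercased tokens, and it assumes queries are lowercased before matching. The `wildcard_match` helper and the token list are hypothetical.

```python
from fnmatch import fnmatchcase


def wildcard_match(pattern: str, tokens: list[str]) -> list[str]:
    """Emulate glob-style term matching over lowercased tokens (not Lunr itself)."""
    needle = pattern.lower()
    return [token for token in tokens if fnmatchcase(token.lower(), needle)]


# Hypothetical whole-identifier tokens, as indexed without the CamelCase splitter.
tokens = ["BusinessBaseServiceImpl", "ServiceLocator", "service", "MyBatis"]

print(wildcard_match("Service*", tokens))   # tokens that start with "service"
print(wildcard_match("*Service*", tokens))  # tokens that contain "service" anywhere
```

Under this emulation a trailing wildcard such as `Service*` only reaches tokens that begin with the prefix, so an identifier like `BusinessBaseServiceImpl` would need a leading wildcard as well. Lunr's documented syntax allows `*` anywhere in a term, but whether the site's search box passes such queries through unchanged is an assumption to check.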