Commit 047d358b89a6c15a9d25a3428303717e59ff3bac

Authored by zichun
1 parent 0aee42c4

search: keep code identifiers whole (drop CamelCase splitter)

A user reported that searches for `BusinessBaseServiceImpl` and `MyBatis`
were returning thousands of irrelevant matches. Cause: the search
plugin's separator regex included `(?!\b)(?=[A-Z][a-z])`, which split
at every CamelCase boundary at INDEX time AND at QUERY time. So the
indexed token stream for `BusinessBaseServiceImpl` was [Business, Base,
Service, Impl] (each a common, low-relevance token), and the same
4-token expansion happened on the query, so every page that mentioned
"service" matched.

Remove the CamelCase splitter so identifiers stay whole. Verified
via search_index.json: 18 docs now contain the intact
`BusinessBaseServiceImpl` token (down from 119 spurious matches);
`MyBatis` queries no longer collide with `MySQL` / `My...`. Lunr
still supports wildcard suffixes (`Service*`) for partial-token
search if a maintainer wants it.
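
For illustration, a rough Python sketch of the splitting step under the old
and new separator patterns (both copied from the diff below). This is only an
approximation: Lunr applies the separator in JavaScript and normalizes tokens
further, and `tokenize` is a hypothetical helper, not part of mkdocs or Lunr.

```python
import re

# Separator patterns copied verbatim from the mkdocs.yml change in this commit.
OLD_SEPARATOR = r'[\s\-,;:!=\[\]()"`/]+|(?!\b)(?=[A-Z][a-z])|\.(?!\d)|&[lg]t;|[\u3000-\u303f\uff00-\uffef]'
NEW_SEPARATOR = r'[\s\-,;:!=\[\]()"`/]+|\.(?!\d)|&[lg]t;|[\u3000-\u303f\uff00-\uffef]'


def tokenize(text, separator):
    """Hypothetical helper: split on the separator and drop empty pieces."""
    # Python 3.7+ also splits on zero-width matches such as the CamelCase lookahead.
    return [tok for tok in re.split(separator, text) if tok]


for ident in ("BusinessBaseServiceImpl", "MyBatis"):
    print(ident)
    print("  old:", tokenize(ident, OLD_SEPARATOR))
    print("  new:", tokenize(ident, NEW_SEPARATOR))
```

With the old pattern the identifiers come out as several short tokens; with
the new pattern each stays a single token.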
Showing 1 changed file with 7 additions and 3 deletions
en/mkdocs.yml
... ... @@ -36,13 +36,17 @@ theme:
36 36 icon: material/brightness-4
37 37 name: Switch to light mode
38 38  
39   -# CJK-aware search: regex separator includes word boundaries plus CJK punctuation;
40   -# for true Chinese tokenization, jieba is invoked by the catalog generator at index time
  39 +# Search separator: whitespace + common punctuation + dots + HTML entities + CJK punctuation.
  40 +# CamelCase splitter removed: code-identifier searches like "BusinessBaseServiceImpl" or
  41 +# "MyBatis" now match the whole identifier instead of being chopped into [Business, Base,
  42 +# Service, Impl] (which produced 1.9k spurious matches and lost the ranked exact hit).
  43 +# Lunr supports wildcard suffixes (e.g. `Service*`) for partial-token search if needed.
  44 +# For true Chinese tokenization, jieba is invoked by the catalog generator at index time
41 45 # (see scripts/gen_catalog.py). Mid-term improvement: a custom mkdocs plugin to feed
42 46 # jieba-segmented terms into lunr.
43 47 plugins:
44 48 - search:
45   - separator: '[\s\-,;:!=\[\]()"`/]+|(?!\b)(?=[A-Z][a-z])|\.(?!\d)|&[lg]t;|[\u3000-\u303f\uff00-\uffef]'
  49 + separator: '[\s\-,;:!=\[\]()"`/]+|\.(?!\d)|&[lg]t;|[\u3000-\u303f\uff00-\uffef]'
46 50  
47 51 markdown_extensions:
48 52 - admonition
... ...