search: keep code identifiers whole (drop CamelCase splitter)

A user reported that searches for `BusinessBaseServiceImpl` and `MyBatis` returned thousands of irrelevant matches.

Cause: the search plugin's separator regex included `(?!\b)(?=[A-Z][a-z])`, which split at every CamelCase boundary at both index time and query time. The indexed token stream for `BusinessBaseServiceImpl` was therefore [Business, Base, Service, Impl], each a common, low-relevance token, and the query underwent the same 4-token expansion, so every page that mentioned "service" matched.

Fix: remove the CamelCase splitter so identifiers stay whole.

Verified via search_index.json: 18 docs now contain the intact `BusinessBaseServiceImpl` token (down from 119 spurious matches), and `MyBatis` queries no longer collide with `MySQL` / `My...`. Lunr still supports wildcard suffixes (`Service*`) for partial-token search if a maintainer wants it.
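
The effect of the CamelCase alternative can be reproduced outside the site build. The sketch below is a rough illustration only: it copies the old and new separator patterns from this change and applies them with Python's `re` module, which is assumed here to behave like the JavaScript regex engine lunr uses for these particular patterns. The `tokenize` helper is hypothetical and is not part of MkDocs or lunr. Splitting on zero-width matches requires Python 3.7 or later.

```python
import re

# Separator patterns copied from the mkdocs.yml change. Python's `re` is a
# stand-in for lunr's JavaScript regex engine (an assumption for illustration).
OLD_SEPARATOR = re.compile(
    r'[\s\-,;:!=\[\]()"`/]+|(?!\b)(?=[A-Z][a-z])|\.(?!\d)|&[lg]t;|[\u3000-\u303f\uff00-\uffef]'
)
NEW_SEPARATOR = re.compile(
    r'[\s\-,;:!=\[\]()"`/]+|\.(?!\d)|&[lg]t;|[\u3000-\u303f\uff00-\uffef]'
)


def tokenize(text, separator):
    """Hypothetical helper: split on the separator and drop empty fragments."""
    return [token for token in separator.split(text) if token]


# The old pattern splits at every lowercase-to-CamelCase boundary.
old_tokens = tokenize("BusinessBaseServiceImpl", OLD_SEPARATOR)
# The new pattern leaves the identifier intact.
new_tokens = tokenize("BusinessBaseServiceImpl", NEW_SEPARATOR)

print(old_tokens)  # ['Business', 'Base', 'Service', 'Impl']
print(new_tokens)  # ['BusinessBaseServiceImpl']
```

Dots still separate tokens under the new pattern unless a digit follows, so a dotted package path is still split into its segments while each CamelCase segment stays whole.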

zichun
1 parent 0aee42c4
Showing 1 changed file with 7 additions and 3 deletions
en/mkdocs.yml
@@ -36,13 +36,17 @@ theme:
         icon: material/brightness-4
         name: Switch to light mode
-# CJK-aware search: regex separator includes word boundaries plus CJK punctuation;
-# for true Chinese tokenization, jieba is invoked by the catalog generator at index time
+# Search separator: whitespace + common punctuation + dots + HTML entities + CJK punctuation.
+# CamelCase splitter removed: code-identifier searches like "BusinessBaseServiceImpl" or
+# "MyBatis" now match the whole identifier instead of being chopped into [Business, Base,
+# Service, Impl] (which produced 1.9k spurious matches and lost the ranked exact hit).
+# Lunr supports wildcard suffixes (e.g. `Service*`) for partial-token search if needed.
+# For true Chinese tokenization, jieba is invoked by the catalog generator at index time
 # (see scripts/gen_catalog.py). Mid-term improvement: a custom mkdocs plugin to feed
 # jieba-segmented terms into lunr.
 plugins:
   - search:
-      separator: '[\s\-,;:!=\[\]()"`/]+|(?!\b)(?=[A-Z][a-z])|\.(?!\d)|&[lg]t;|[\u3000-\u303f\uff00-\uffef]'
+      separator: '[\s\-,;:!=\[\]()"`/]+|\.(?!\d)|&[lg]t;|[\u3000-\u303f\uff00-\uffef]'
 markdown_extensions:
   - admonition