Commit 047d358b89a6c15a9d25a3428303717e59ff3bac
1 parent 0aee42c4
search: keep code identifiers whole (drop CamelCase splitter)
A user flagged that searches for `BusinessBaseServiceImpl` and `MyBatis` were returning thousands of irrelevant matches.

Cause: the search plugin's separator regex included `(?!\b)(?=[A-Z][a-z])`, which split at every CamelCase boundary at both index time and query time. The indexed token stream for `BusinessBaseServiceImpl` was therefore [Business, Base, Service, Impl], each a common, low-relevance token, and the query was expanded into the same four tokens, so every page that mentioned "service" matched.

Fix: remove the CamelCase splitter so identifiers stay whole.

Verified via search_index.json: 18 docs now contain the intact `BusinessBaseServiceImpl` token (down from 119 spurious matches); `MyBatis` queries no longer collide with `MySQL` / `My...`. Lunr still supports wildcard suffixes (`Service*`) for partial-token search if a maintainer wants it.
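The effect of the splitter can be approximated outside the browser. The sketch below is an illustration only: it applies the old and new separator patterns with Python's `re.split`, which is not the JavaScript tokenizer the search plugin actually runs and does none of its lowercasing or stemming. The `tokenize` helper and the constant names are hypothetical, and splitting on zero-width matches assumes Python 3.7 or later.

```python
import re

# Separator before this commit (includes the CamelCase lookahead).
OLD_SEPARATOR = r'[\s\-,;:!=\[\]()"`/]+|(?!\b)(?=[A-Z][a-z])|\.(?!\d)|&[lg]t;|[\u3000-\u303f\uff00-\uffef]'
# Separator after this commit (CamelCase lookahead removed).
NEW_SEPARATOR = r'[\s\-,;:!=\[\]()"`/]+|\.(?!\d)|&[lg]t;|[\u3000-\u303f\uff00-\uffef]'


def tokenize(text, separator):
    """Hypothetical helper: split on the separator and drop empty pieces.

    Relies on re.split honouring zero-width matches (Python 3.7+).
    """
    return [piece for piece in re.split(separator, text) if piece]


if __name__ == "__main__":
    for term in ("BusinessBaseServiceImpl", "MyBatis", "MySQL"):
        print(term)
        print("  old:", tokenize(term, OLD_SEPARATOR))
        print("  new:", tokenize(term, NEW_SEPARATOR))
```

Under these assumptions the old pattern yields `['Business', 'Base', 'Service', 'Impl']` for `BusinessBaseServiceImpl` and `['My', 'Batis']` for `MyBatis`, while the new pattern leaves both identifiers as single tokens.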
Showing 1 changed file with 7 additions and 3 deletions
en/mkdocs.yml
| @@ -36,13 +36,17 @@ theme: | @@ -36,13 +36,17 @@ theme: | ||
| 36 | icon: material/brightness-4 | 36 | icon: material/brightness-4 |
| 37 | name: Switch to light mode | 37 | name: Switch to light mode |
| 38 | 38 | ||
| 39 | -# CJK-aware search: regex separator includes word boundaries plus CJK punctuation; | ||
| 40 | -# for true Chinese tokenization, jieba is invoked by the catalog generator at index time | 39 | +# Search separator: whitespace + common punctuation + dots + HTML entities + CJK punctuation. |
| 40 | +# CamelCase splitter removed; code-identifier searches like "BusinessBaseServiceImpl" or | ||
| 41 | +# "MyBatis" now match the whole identifier instead of being chopped into [Business, Base, | ||
| 42 | +# Service, Impl] (which produced 1.9k spurious matches and lost the ranked exact hit). | ||
| 43 | +# Lunr supports wildcard suffixes (e.g. `Service*`) for partial-token search if needed. | ||
| 44 | +# For true Chinese tokenization, jieba is invoked by the catalog generator at index time | ||
| 41 | # (see scripts/gen_catalog.py). Mid-term improvement: a custom mkdocs plugin to feed | 45 | # (see scripts/gen_catalog.py). Mid-term improvement: a custom mkdocs plugin to feed |
| 42 | # jieba-segmented terms into lunr. | 46 | # jieba-segmented terms into lunr. |
| 43 | plugins: | 47 | plugins: |
| 44 | - search: | 48 | - search: |
| 45 | - separator: '[\s\-,;:!=\[\]()"`/]+|(?!\b)(?=[A-Z][a-z])|\.(?!\d)|&[lg]t;|[\u3000-\u303f\uff00-\uffef]' | 49 | + separator: '[\s\-,;:!=\[\]()"`/]+|\.(?!\d)|&[lg]t;|[\u3000-\u303f\uff00-\uffef]' |
| 46 | 50 | ||
| 47 | markdown_extensions: | 51 | markdown_extensions: |
| 48 | - admonition | 52 | - admonition |