Commit 047d358b89a6c15a9d25a3428303717e59ff3bac
1 parent 0aee42c4
search: keep code identifiers whole (drop CamelCase splitter)
A user reported that searches for `BusinessBaseServiceImpl` and `MyBatis` were returning thousands of irrelevant matches.

Cause: the search plugin's separator regex included `(?!\b)(?=[A-Z][a-z])`, which split at every CamelCase boundary at both index time and query time. The indexed token stream for `BusinessBaseServiceImpl` was therefore [Business, Base, Service, Impl], each a common, low-relevance token, and the query went through the same four-token expansion, so every page that mentioned "service" matched.

Fix: remove the CamelCase splitter so identifiers stay whole. Verified via search_index.json: 18 docs now contain the intact `BusinessBaseServiceImpl` token (down from 119 spurious matches), and `MyBatis` queries no longer collide with `MySQL` / `My...`. Lunr still supports wildcard suffixes (`Service*`) for partial-token search if a maintainer wants it.
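The splitting behaviour described above can be reproduced outside the search plugin. The sketch below is a rough illustration only: it uses Python's `re.split` as a stand-in for the real tokenizer, so it assumes the two engines treat this pattern the same way, and `tokenize` is a hypothetical helper, not part of MkDocs or Lunr. The two patterns are copied from the removed and added `separator` lines of en/mkdocs.yml.

```python
import re
from typing import List

# Separator before this commit (copied from the removed line in en/mkdocs.yml).
OLD_SEPARATOR = r'[\s\-,;:!=\[\]()"`/]+|(?!\b)(?=[A-Z][a-z])|\.(?!\d)|&[lg]t;|[\u3000-\u303f\uff00-\uffef]'
# Separator after this commit: the same pattern minus the CamelCase alternative.
NEW_SEPARATOR = r'[\s\-,;:!=\[\]()"`/]+|\.(?!\d)|&[lg]t;|[\u3000-\u303f\uff00-\uffef]'


def tokenize(text: str, separator: str) -> List[str]:
    """Hypothetical helper: split on the separator and drop empty pieces.

    Needs Python 3.7+, where re.split also splits on zero-width matches.
    """
    return [piece for piece in re.split(separator, text) if piece]


if __name__ == "__main__":
    for term in ("BusinessBaseServiceImpl", "MyBatis", "MySQL"):
        print(term, tokenize(term, OLD_SEPARATOR), tokenize(term, NEW_SEPARATOR))
```

Under Python's `re`, the old pattern breaks `BusinessBaseServiceImpl` into the four tokens listed above and `MyBatis` into `My` and `Batis`, while the new pattern leaves both identifiers whole.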
Showing 1 changed file with 7 additions and 3 deletions
en/mkdocs.yml
| ... | ... | @@ -36,13 +36,17 @@ theme: |
| 36 | 36 | icon: material/brightness-4 |
| 37 | 37 | name: Switch to light mode |
| 38 | 38 | |
| 39 | -# CJK-aware search: regex separator includes word boundaries plus CJK punctuation; | |
| 40 | -# for true Chinese tokenization, jieba is invoked by the catalog generator at index time | |
| 39 | +# Search separator: whitespace + common punctuation + dots + HTML entities + CJK punctuation. | |
| 40 | +# CamelCase splitter removed: code-identifier searches like "BusinessBaseServiceImpl" or | |
| 41 | +# "MyBatis" now match the whole identifier instead of being chopped into [Business, Base, | |
| 42 | +# Service, Impl] (which produced 1.9k spurious matches and lost the ranked exact hit). | |
| 43 | +# Lunr supports wildcard suffixes (e.g. `Service*`) for partial-token search if needed. | |
| 44 | +# For true Chinese tokenization, jieba is invoked by the catalog generator at index time | |
| 41 | 45 | # (see scripts/gen_catalog.py). Mid-term improvement: a custom mkdocs plugin to feed |
| 42 | 46 | # jieba-segmented terms into lunr. |
| 43 | 47 | plugins: |
| 44 | 48 | - search: |
| 45 | - separator: '[\s\-,;:!=\[\]()"`/]+|(?!\b)(?=[A-Z][a-z])|\.(?!\d)|&[lg]t;|[\u3000-\u303f\uff00-\uffef]' | |
| 49 | + separator: '[\s\-,;:!=\[\]()"`/]+|\.(?!\d)|&[lg]t;|[\u3000-\u303f\uff00-\uffef]' | |
| 46 | 50 | |
| 47 | 51 | markdown_extensions: |
| 48 | 52 | - admonition | ... | ... |
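The commit message mentions verifying the result against search_index.json. How that check was actually run is not recorded; one possible way to do a similar count is sketched below. It assumes the index has the usual top-level `docs` list with `location`, `title`, and `text` fields, and that the built site lives under `site/`. Both the `docs_containing` helper and the path are illustrative assumptions. Note that it scans the raw document text rather than Lunr's token stream, and entries in `docs` may be per section rather than per page, so its counts need not match the figures quoted in the commit.

```python
import json
import re
from pathlib import Path
from typing import List


def docs_containing(index: dict, token: str) -> List[str]:
    """Hypothetical helper: locations of index entries whose title or text
    contains `token` as a whole word (not as part of a longer identifier)."""
    pattern = re.compile(r"(?<!\w)" + re.escape(token) + r"(?!\w)")
    return [
        doc.get("location", "")
        for doc in index.get("docs", [])
        if pattern.search(doc.get("title", "")) or pattern.search(doc.get("text", ""))
    ]


if __name__ == "__main__":
    # Assumed location of the built index; adjust to the project's site_dir.
    index_path = Path("site/search/search_index.json")
    if index_path.exists():
        index = json.loads(index_path.read_text(encoding="utf-8"))
        hits = docs_containing(index, "BusinessBaseServiceImpl")
        print(len(hits), "index entries contain the intact identifier")
```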