Word segmentation is a core technology that search engines depend on.
Take query processing: when you type something into the search box, say “Kung Pao Chicken Soup”, the search engine has to segment your query into a series of words, then analyze those words and match them against the terms in its index.
When building an inverted index, word segmentation is needed too.
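To make the two uses above concrete, here is a minimal sketch of an inverted index and a query lookup that both go through the same segmentation step. The names `build_inverted_index`, `search`, and `whitespace_segment` are hypothetical, and the whitespace segmenter is an assumption that only works for space-delimited languages such as English.

```python
from collections import defaultdict


def whitespace_segment(text):
    """Assumed segmenter for this sketch: lowercase and split on spaces."""
    return text.lower().split()


def build_inverted_index(docs, segment):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in segment(text):
            index[term].add(doc_id)
    return index


def search(query, index, segment):
    """Return ids of documents containing every term of the segmented query."""
    terms = segment(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


docs = {
    1: "Kung Pao Chicken",
    2: "Chicken Soup recipe",
    3: "Kung Pao Chicken Soup",
}
index = build_inverted_index(docs, whitespace_segment)
print(search("Kung Pao Chicken Soup", index, whitespace_segment))
```

The index and the query must be segmented the same way; otherwise the query terms will not line up with the indexed terms.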
This is not very challenging for English and other alphabet-based languages, because most of the time the words are clearly separated from each other by spaces.
For example, the sentence “I love the smell of napalm in the morning” from Apocalypse Now can be easily segmented into I / love / the / smell / of / napalm / in / the / morning.
But for a language with a different writing system, such as Chinese, word segmentation becomes much harder, because written Chinese puts no spaces between words: all the characters and words run together.
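The contrast can be seen in a couple of lines. In this small sketch, the Chinese string is an assumed rendering of “Kung Pao Chicken Soup” chosen for illustration; it is not taken from any search engine.

```python
english = "I love the smell of napalm in the morning"
chinese = "宫保鸡丁汤"  # assumed Chinese rendering of "Kung Pao Chicken Soup"

english_words = english.split()  # splitting on spaces yields nine words
chinese_words = chinese.split()  # no spaces, so the whole string comes back as one piece

print(" / ".join(english_words))
print(chinese_words)
```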
A little about the Chinese language:
1. Unlike English, where the word is the smallest unit, the most basic grammatical unit in Chinese is the character;
2. Characters form words, words form phrases, and words and phrases form sentences;
3. Most Chinese words contain 1, 2 or 3 characters. And within a multi-character word, for example a 2-character word, it is very common for each character to be usable as an independent word too.
All this makes word segmentation considerably more complex for Chinese.
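A small sketch can show where the complexity comes from: the same string of characters may split into lexicon words in more than one way. The function `all_segmentations` and the tiny lexicon below are hypothetical, made up for illustration.

```python
def all_segmentations(text, lexicon):
    """Return every way to split `text` into words found in `lexicon`."""
    if not text:
        return [[]]
    results = []
    for end in range(1, len(text) + 1):
        head = text[:end]
        if head in lexicon:
            # Keep this word and segment the remainder recursively.
            for rest in all_segmentations(text[end:], lexicon):
                results.append([head] + rest)
    return results


# Toy lexicon (an assumption for this example, not a real dictionary).
lexicon = {"研究", "研究生", "生命", "命", "起源"}
candidates = all_segmentations("研究生命起源", lexicon)
for words in candidates:
    print(" / ".join(words))
```

With this lexicon the string has two valid readings, 研究 / 生命 / 起源 and 研究生 / 命 / 起源, and a segmenter has to choose between them.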
1. String Matching Based Word Segmentation:
In this method, a sentence is divided according to the terms in a predefined lexicon. If a string in the sentence matches a term in the lexicon, the match succeeds and the string is recognized as a word.
String matching based word segmentation can be categorized:
- By scan direction: forward (obverse) matching and reverse matching;
- By string length: maximum matching and minimum matching.
Some common matching methods:
- Forward maximum matching (scan from left to right)
- Reverse maximum matching (scan from right to left)
- Minimum word count matching (minimize the number of words in the sentence)
- Bi-directional maximum matching (scan from both ends of the sentence)
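The four methods listed above can be sketched roughly as follows. This is a minimal illustration, not how any particular search engine implements them. The function names, the `max_len` window, the toy lexicon, the fallback of treating an unmatched character as a one-character word, and the tie-breaking rule in the bi-directional version are all assumptions of this sketch.

```python
def forward_max_match(text, lexicon, max_len=4):
    """Scan left to right, always taking the longest lexicon match."""
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in lexicon:  # unmatched single character falls through
                words.append(piece)
                i += size
                break
    return words


def reverse_max_match(text, lexicon, max_len=4):
    """Scan right to left, always taking the longest lexicon match."""
    words, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = text[j - size:j]
            if size == 1 or piece in lexicon:
                words.append(piece)
                j -= size
                break
    words.reverse()
    return words


def min_word_count(text, lexicon, max_len=4):
    """Dynamic programming: best[i] is a fewest-words split of text[:i]."""
    best = [None] * (len(text) + 1)
    best[0] = []
    for i in range(1, len(text) + 1):
        for size in range(1, min(max_len, i) + 1):
            piece = text[i - size:i]
            if size == 1 or piece in lexicon:
                candidate = best[i - size] + [piece]
                if best[i] is None or len(candidate) < len(best[i]):
                    best[i] = candidate
    return best[len(text)]


def bidirectional_match(text, lexicon, max_len=4):
    """Run both scans and keep the result with fewer words.

    Preferring the reverse result on a tie is an assumption of this sketch.
    """
    fwd = forward_max_match(text, lexicon, max_len)
    rev = reverse_max_match(text, lexicon, max_len)
    return fwd if len(fwd) < len(rev) else rev


lexicon = {"研究", "研究生", "生命", "起源"}  # toy lexicon, for illustration only
sentence = "研究生命起源"
print(forward_max_match(sentence, lexicon))  # ['研究生', '命', '起源']
print(reverse_max_match(sentence, lexicon))  # ['研究', '生命', '起源']
print(bidirectional_match(sentence, lexicon))
print(min_word_count(sentence, lexicon))
```

On this sentence the two scan directions disagree, which is why comparing both directions can help catch ambiguous spots.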
There is also an improved method called breakpoint filter: based on statistics, a machine first finds the Chinese characters that have a very high probability of being used as single-character words within a sentence and treats them as breakpoints. The breakpoints cut the sentence into shorter strings, which are then segmented by string matching.
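The breakpoint step might look something like this. It is a sketch under stated assumptions: `split_at_breakpoints`, the probability table, and the 0.9 threshold are all hypothetical, and the probability value is invented for the example rather than measured from a corpus.

```python
def split_at_breakpoints(text, single_word_prob, threshold=0.9):
    """Cut `text` at characters that are very likely single-character words."""
    chunks, current = [], ""
    for ch in text:
        if single_word_prob.get(ch, 0.0) >= threshold:
            if current:
                chunks.append(current)  # close the piece before the breakpoint
                current = ""
            chunks.append(ch)           # the breakpoint character stands alone
        else:
            current += ch
    if current:
        chunks.append(current)
    return chunks


# Invented probability, for illustration only.
single_word_prob = {"的": 0.95}
chunks = split_at_breakpoints("美丽的花园", single_word_prob)
print(chunks)  # ['美丽', '的', '花园']
```

Each piece between breakpoints is shorter than the original sentence, so a lexicon matcher applied afterwards has fewer ways to go wrong.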
2. Semantic Understanding Based Word Segmentation:
This method works in the same way as the semantic algorithms commonly used for English.
However, due to the complexity of the Chinese language, so far this method is still at an experimental stage.
3. Statistical Probability Based Word Segmentation:
This method relies on the probability that a series of Chinese characters (usually 2) occur next to each other as a word. It usually involves unsupervised machine learning algorithms.
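As a rough illustration of the idea, the sketch below counts adjacent character pairs in a tiny unlabeled corpus and joins two characters when they co-occur strongly. The source does not say which statistic is used; pointwise mutual information (PMI) is swapped in here as one common choice. The corpus, the function names, and the threshold of 2.0 are assumptions made for the example.

```python
import math
from collections import Counter


def train_counts(corpus):
    """Count single characters and adjacent character pairs in raw text."""
    chars, pairs = Counter(), Counter()
    for line in corpus:
        chars.update(line)
        pairs.update(line[i:i + 2] for i in range(len(line) - 1))
    return chars, pairs


def pmi(pair, chars, pairs):
    """Pointwise mutual information of two adjacent characters."""
    if pairs[pair] == 0:
        return float("-inf")  # never seen together
    p_xy = pairs[pair] / sum(pairs.values())
    p_x = chars[pair[0]] / sum(chars.values())
    p_y = chars[pair[1]] / sum(chars.values())
    return math.log(p_xy / (p_x * p_y))


def segment(text, chars, pairs, threshold):
    """Join neighbouring characters whose PMI reaches the threshold."""
    if not text:
        return []
    words = [text[0]]
    for i in range(1, len(text)):
        if pmi(text[i - 1:i + 1], chars, pairs) >= threshold:
            words[-1] += text[i]   # strong association: extend the current word
        else:
            words.append(text[i])  # weak association: start a new word
    return words


# Tiny unlabeled corpus, invented for illustration.
corpus = ["我爱北京", "北京很大", "我去北京", "他爱上海", "上海很大"]
chars, pairs = train_counts(corpus)
print(segment("我爱北京", chars, pairs, threshold=2.0))  # ['我', '爱', '北京']
print(segment("上海很大", chars, pairs, threshold=2.0))  # ['上海', '很大']
```

No lexicon or labeled data is involved: the word boundaries come only from how often characters appear side by side. On a corpus this small the scores are fragile (rare characters get inflated PMI), so a realistic system would need far more text.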
Today’s Chinese search engines mostly use a combination of Methods 1 and 3.
- Ambiguity Detection
- New Word Detection