MySQL ngram Full-Text Parser
The ngram full-text parser in MySQL is a specialized parser for full-text indexing and searching. It is aimed at non-Latin text and at languages that do not separate words with spaces, such as Chinese, Japanese, and Korean. The parser divides text into n-grams, which are contiguous sequences of n characters, so MySQL can index and search text even when it is not separated by spaces or other word delimiters.
What Is an n-gram?
An n-gram is a contiguous sequence of n items (characters or words) from a given sample of text. The ngram parser works at the character level: it breaks text into overlapping chunks of n characters, which can then be indexed and searched in full-text fields.
For example:
- For a sentence:
"The quick brown fox"
- 2-grams (bigrams):
['Th', 'he', 'e ', ' q', 'qu', 'ui', 'ic', 'ck', 'k ', ' b', 'br', 'ro', 'ow', 'wn', 'n ', ' f', 'fo', 'ox']
- 3-grams (trigrams):
['The', 'he ', 'e q', ' qu', 'qui', 'uic', 'ick', 'ck ', 'k b', ' br', 'bro', 'row', 'own', 'wn ', 'n f', ' fo', 'fox']
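The goal of the ngram parser is to index text by character sequences, so that MySQL can retrieve rows containing a given run of characters. Note that MySQL's ngram parser eliminates spaces while tokenizing, so the grams containing spaces in the lists above illustrate the general idea rather than the exact tokens MySQL stores.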
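To make the splitting concrete, the following sketch generates the bigrams of the sample sentence with a recursive common table expression. It assumes MySQL 8.0 or later (recursive CTEs are not available in 5.7), and the names src, grams, s, and n are illustrative only. It demonstrates generic n-gram splitting, not the internal tokenizer of the ngram parser.

```sql
-- Generic character n-gram generation (illustration only).
WITH RECURSIVE
  src AS (
    -- s is the input text, n is the n-gram length.
    SELECT 'The quick brown fox' AS s, 2 AS n
  ),
  grams (pos, gram) AS (
    -- First n-gram starts at character position 1.
    SELECT 1, SUBSTRING(s, 1, n) FROM src
    UNION ALL
    -- Slide the window one character to the right while a full n-gram still fits.
    SELECT g.pos + 1, SUBSTRING(src.s, g.pos + 1, src.n)
    FROM grams AS g
    CROSS JOIN src
    WHERE g.pos + src.n <= CHAR_LENGTH(src.s)
  )
-- Brackets make leading and trailing spaces visible in the output.
SELECT pos, CONCAT('[', gram, ']') AS gram
FROM grams
ORDER BY pos;
```

Changing n to 3 in the src row produces the trigrams instead.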
When to Use the ngram Full-Text Parser
The ngram full-text parser is typically used in the following situations:
- Languages Without Word Boundaries: It is most useful for languages like Chinese, Japanese, or Korean, where spaces are not used to separate words.
- Partial Matching: It allows partial matching of words or characters by breaking text into small character-based segments.
- Improved Search for Character Sequences: In cases where word-based full-text search may not work efficiently (e.g., when you need to match parts of words), the ngram parser can be used.
How to Use the ngram Full-Text Parser in MySQL
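Enable ngram Parser: The ngram full-text parser is built into MySQL 5.7.6 and later, so no separate installation is needed; you only have to make sure the server is at least that version. The length of each n-gram is controlled by the ngram_token_size system variable, which defaults to 2.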
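A quick way to check the server before creating an index is sketched below. The statements only read server metadata; the option-file fragment in the trailing comment is an example value, not a recommendation.

```sql
-- Confirm the server version (the ngram parser is built in from MySQL 5.7.6).
SELECT VERSION();

-- List the full-text parser plugins known to the server; ngram should be among them.
SELECT PLUGIN_NAME, PLUGIN_STATUS
FROM INFORMATION_SCHEMA.PLUGINS
WHERE PLUGIN_TYPE = 'FTPARSER';

-- Show the configured n-gram length (default 2, allowed range 1 to 10).
SHOW VARIABLES LIKE 'ngram_token_size';

-- ngram_token_size cannot be changed at runtime. It is set at server startup,
-- for example in the option file:
--   [mysqld]
--   ngram_token_size=2
```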
Create a Table with a Full-Text Index Using ngram: You can specify the ngram parser when creating a full-text index on a column.
- The WITH PARSER ngram clause tells MySQL to use the ngram parser for the full-text index.
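As a minimal sketch of this step, the statement below creates a table with an ngram full-text index. The table name articles and its columns are hypothetical and chosen only for illustration.

```sql
-- Hypothetical table; only the FULLTEXT ... WITH PARSER ngram clause is essential.
CREATE TABLE articles (
  id    INT UNSIGNED AUTO_INCREMENT NOT NULL PRIMARY KEY,
  title VARCHAR(200),
  body  TEXT,
  -- One full-text index covering both columns, tokenized by the ngram parser.
  FULLTEXT (title, body) WITH PARSER ngram
) ENGINE=InnoDB CHARACTER SET utf8mb4;
```

utf8mb4 is used here so that Chinese, Japanese, and Korean text can be stored without loss.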
Populate the Table with Data: Insert some text into the table that will be indexed by the ngram parser.
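For instance, assuming a hypothetical table articles(title, body) with an ngram full-text index on both columns, rows could be inserted like this. The sample rows are made up for illustration.

```sql
-- Make sure the connection can carry multi-byte characters.
SET NAMES utf8mb4;

-- New rows are tokenized into n-grams and added to the full-text index automatically.
INSERT INTO articles (title, body) VALUES
  ('数据库管理', '在本教程中我将向你展示如何管理数据库'),
  ('数据库应用开发', '学习开发数据库应用程序'),
  ('Exam preparation', 'Notes for the database exam');
```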
Perform Full-Text Search Using MATCH...AGAINST: Once you have created the table with the full-text index using the ngram parser, you can perform searches using the MATCH...AGAINST syntax.
- For a search query such as 'exam', the parser (with the default token size of 2) splits the query into the character sequences ex, xa, and am, and matches those against the n-grams stored in the index.
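The two common search modes can be sketched as follows, again assuming a hypothetical table articles(title, body) with an ngram full-text index on both columns. Results depend on server settings such as ngram_token_size and the stopword list, so no output is shown.

```sql
-- Natural language mode: the search term is split into n-grams and a row
-- matches if it contains any of them.
SELECT id, title
FROM articles
WHERE MATCH (title, body) AGAINST ('exam' IN NATURAL LANGUAGE MODE);

-- Boolean mode: the search term is treated as an n-gram phrase, so a row
-- matches only if the n-grams appear consecutively, in the same order.
SELECT id, title
FROM articles
WHERE MATCH (title, body) AGAINST ('exam' IN BOOLEAN MODE);
```

The column list inside MATCH() must be the same list that was used when the full-text index was defined.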
Advantages of Using the ngram Parser
- Partial Match for Non-Latin Text: The ngram parser is highly beneficial for indexing and searching text in languages without word boundaries like Chinese, Japanese, or Korean.
- Indexed Substring Search: Because character sequences are stored in the full-text index, searches for partial words or fragments of long strings can use the index instead of scanning every row with a pattern such as LIKE '%...%'.
- Character-based Indexing: Text is indexed as overlapping character sequences rather than as words, which is especially useful when exact word boundaries aren’t clear.
Disadvantages of Using the ngram Parser
- Increased Index Size: The ngram parser can produce large indexes because it breaks text into multiple small sequences (n-grams). This increases storage requirements and can affect performance when dealing with large volumes of text.
- Limited to Character-Based Matching: Since the ngram parser works with character sequences, it may not always match well with words or semantic meaning in natural language.
- No Whole-Word Semantics: Unlike word-based full-text search, the ngram parser has no notion of a complete word. A search for a full word can also return rows where its n-grams appear inside other words or, in natural language mode, rows that contain only some of its n-grams.
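One way to see how many tokens a table contributes to the index is to look at InnoDB's full-text diagnostic tables. The sketch below assumes an InnoDB table named documents in a schema named test (both names are assumptions) and an account with sufficient privileges to set a global variable and read these INFORMATION_SCHEMA tables.

```sql
-- Point the InnoDB full-text diagnostic tables at the table to inspect.
-- The format is 'schema_name/table_name'; 'test' is an assumed schema name.
SET GLOBAL innodb_ft_aux_table = 'test/documents';

-- Count the tokens currently held in the full-text index cache
-- (tokens from recently inserted rows that have not been flushed to disk yet).
SELECT COUNT(*) AS cached_tokens
FROM INFORMATION_SCHEMA.INNODB_FT_INDEX_CACHE;

-- List the individual tokens with the documents and positions they occur at.
SELECT WORD, DOC_ID, POSITION
FROM INFORMATION_SCHEMA.INNODB_FT_INDEX_CACHE
ORDER BY DOC_ID, POSITION;
```

Comparing the token count with the number of rows gives a rough feel for how quickly the index grows with text length.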
Example: ngram Full-Text Search
Let’s say you have a table named documents where each row has an id and a text column. You can create a full-text index using the ngram parser as follows:
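A minimal sketch of that setup is shown below. The column types and the index name ft_text are assumptions; only the table name and the id and text columns come from the description above.

```sql
-- Example table; the column named text is quoted with backticks for clarity,
-- since TEXT is also a data type name.
CREATE TABLE documents (
  id     INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
  `text` TEXT
) ENGINE=InnoDB CHARACTER SET utf8mb4;

-- Add an ngram full-text index to the existing table.
CREATE FULLTEXT INDEX ft_text ON documents (`text`) WITH PARSER ngram;
```

The same index could also be added with ALTER TABLE documents ADD FULLTEXT INDEX ft_text (`text`) WITH PARSER ngram.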
Next, insert some text into the documents table:
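For example, with made-up sample rows and assuming the documents table has an auto-increment id and a text column:

```sql
-- Sample rows for illustration; the id values are generated automatically.
INSERT INTO documents (`text`) VALUES
  ('This is a test document'),
  ('Another row for testing the ngram parser'),
  ('Nothing relevant here');
```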
You can search for a sequence of characters (for example, test):
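A query along these lines would run the search, assuming the documents table has an ngram full-text index on its text column:

```sql
-- Natural language mode is the default; it is spelled out here for clarity.
SELECT id, `text`
FROM documents
WHERE MATCH (`text`) AGAINST ('test' IN NATURAL LANGUAGE MODE);
```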
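With the default token size of 2, the query test is split into the sequences te, es, and st. In natural language mode, the result includes any document that contains at least one of these sequences, which covers documents containing the full string test.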
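To see how strongly each row matches, the relevance value returned by MATCH...AGAINST can be selected alongside the row. This is a sketch under the same assumptions (a documents table with an ngram full-text index on its text column); score is just a column alias, and the actual values depend on the data and server settings.

```sql
-- MATCH ... AGAINST returns a relevance value when used in the select list.
SELECT id,
       `text`,
       MATCH (`text`) AGAINST ('test' IN NATURAL LANGUAGE MODE) AS score
FROM documents
WHERE MATCH (`text`) AGAINST ('test' IN NATURAL LANGUAGE MODE)
ORDER BY score DESC;
```

Switching both occurrences to IN BOOLEAN MODE restricts the match to rows where te, es, and st appear consecutively.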
Conclusion
The ngram full-text parser in MySQL is a tool for handling non-Latin languages and text without obvious word boundaries. By breaking text into n-grams, it allows indexed searches on character sequences, making it particularly useful for languages like Chinese, Japanese, or Korean. However, the increased index size and the lack of whole-word semantics should be weighed when deciding whether to use this parser in your database.