MySQL ngram Full-Text Parser

The ngram full-text parser in MySQL is a built-in parser for full-text indexing and searching. It is designed for languages that do not separate words with spaces, such as Chinese, Japanese, and Korean (CJK). The parser divides text into n-grams, which are contiguous sequences of n characters, so MySQL can index and search text that has no spaces or other word delimiters.

What Is an n-gram?

An n-gram is a contiguous sequence of n items (characters or words) taken from a text. The ngram parser works on characters: it slides a window of n characters across the text and indexes each window as a token. The window size is set by the ngram_token_size system variable (default 2, allowed range 1 to 10).

For example:

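For a sentence: "The quick brown fox"
- 2-grams (bigrams): ['Th', 'he', 'e ', ' q', 'qu', 'ui', 'ic', 'ck', 'k ', ' b', 'br', 'ro', 'ow', 'wn', 'n ', ' f', 'fo', 'ox']
- 3-grams (trigrams): ['The', 'he ', 'e q', ' qu', 'qui', 'uic', 'ick', 'ck ', 'k b', ' br', 'bro', 'row', 'own', 'wn ', 'n f', ' fo', 'fox']

These lists show generic character n-grams with spaces included. MySQL's ngram parser drops spaces while parsing, so its tokens never span a space: "ab cd" is parsed to 'ab' and 'cd'.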

The goal of the ngram parser is to index text by character sequences, so that MySQL can retrieve rows containing a given sequence of characters without relying on word boundaries.
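To see which tokens MySQL actually stores, the InnoDB full-text diagnostic tables can be queried. The following is a minimal sketch, assuming an InnoDB table, an existing schema named test, and an account with administrative privileges; the names test, ngram_demo, and content_ft are hypothetical and chosen only for illustration.

```sql
-- Hypothetical schema and table names, used only for illustration.
CREATE TABLE test.ngram_demo (
    id INT PRIMARY KEY AUTO_INCREMENT,
    content TEXT,
    FULLTEXT KEY content_ft (content) WITH PARSER ngram
) ENGINE=InnoDB;

INSERT INTO test.ngram_demo (content) VALUES ('quick fox');

-- Point the full-text diagnostic tables at the demo table.
-- The value has the form 'schema_name/table_name'.
SET GLOBAL innodb_ft_aux_table = 'test/ngram_demo';

-- List the tokens currently held in the full-text index cache,
-- in the order they appear in each document.
SELECT word, doc_id, position
FROM INFORMATION_SCHEMA.INNODB_FT_INDEX_CACHE
ORDER BY doc_id, position;
```

The exact token list depends on ngram_token_size and on the stopword list in effect, since the ngram parser skips tokens that contain a stopword. The output may therefore be shorter than a hand-built list of bigrams.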

When to Use the ngram Full-Text Parser

The ngram full-text parser is typically used in the following situations:

Languages Without Word Boundaries: It is most useful for languages like Chinese, Japanese, or Korean, where spaces are not used to separate words.
Partial Matching: It allows partial matching of words or characters by breaking text into small character-based segments.
Improved Search for Character Sequences: In cases where word-based full-text search may not work efficiently (e.g., when you need to match parts of words), the ngram parser can be used.
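As a rough illustration of the first case, the sketch below indexes a few short Chinese strings and searches for a three-character term. The table name cjk_notes and its columns are hypothetical, and the utf8mb4 character set and InnoDB engine are assumptions made for this example.

```sql
-- Hypothetical table; utf8mb4 is assumed so CJK text is stored correctly.
CREATE TABLE cjk_notes (
    id INT PRIMARY KEY AUTO_INCREMENT,
    body TEXT,
    FULLTEXT KEY body_ft (body) WITH PARSER ngram
) ENGINE=InnoDB CHARACTER SET utf8mb4;

INSERT INTO cjk_notes (body) VALUES
    ('数据库管理'),
    ('学习开发数据库应用程序'),
    ('全文检索');

-- With 2-character tokens, the search term is split into 数据 and 据库.
SELECT id, body
FROM cjk_notes
WHERE MATCH(body) AGAINST('数据库' IN NATURAL LANGUAGE MODE);
```

With the default token size of 2, the first two rows are the ones expected to match, because only they contain the bigrams of the search term.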

How to Use the ngram Full-Text Parser in MySQL

Check Your MySQL Version: The ngram parser is built in and needs no separate installation, but it requires MySQL 5.7.6 or later. It is supported with both InnoDB and MyISAM tables.
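Before creating an index, it can help to confirm what the server supports. The statements below are a small sketch of such a check; they only read server state and change nothing.

```sql
-- Confirm the server version (the ngram parser needs 5.7.6 or later).
SELECT VERSION();

-- Show the token size in effect; the default is 2.
SHOW VARIABLES LIKE 'ngram_token_size';

-- List the full-text parser plugins known to the server and their status.
SELECT PLUGIN_NAME, PLUGIN_STATUS
FROM INFORMATION_SCHEMA.PLUGINS
WHERE PLUGIN_TYPE = 'FTPARSER';
```

Note that ngram_token_size is read-only at runtime. It has to be set at server startup, for example with ngram_token_size=2 under the [mysqld] section of the option file, and existing ngram indexes should be rebuilt after the value changes.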
Create a Table with a Full-Text Index Using ngram: You can specify the ngram parser when creating a full-text index on a column.
```
CREATE TABLE articles (
    id INT PRIMARY KEY AUTO_INCREMENT,
    content TEXT,
    FULLTEXT KEY content_index (content) 
    WITH PARSER ngram
);
```
- The WITH PARSER ngram clause tells MySQL to use the ngram parser for the full-text index.
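The same clause also works when the table already exists. The sketch below uses a hypothetical table named notes with a hypothetical index name body_ft to show both ways of adding the index afterwards.

```sql
-- Hypothetical existing table without a full-text index.
CREATE TABLE notes (
    id INT PRIMARY KEY AUTO_INCREMENT,
    body TEXT
) ENGINE=InnoDB;

-- Option 1: add the ngram index with ALTER TABLE.
ALTER TABLE notes ADD FULLTEXT INDEX body_ft (body) WITH PARSER ngram;

-- Option 2 (equivalent; use one or the other, not both):
-- CREATE FULLTEXT INDEX body_ft ON notes (body) WITH PARSER ngram;
```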
Populate the Table with Data: Insert some text into the table that will be indexed by the ngram parser.
```
INSERT INTO articles (content) 
VALUES ('This is an example article content.');
```
Perform Full-Text Search Using MATCH...AGAINST: Once you have created the table with the full-text index using the ngram parser, you can perform searches using the MATCH...AGAINST syntax.
```
SELECT * 
FROM articles
WHERE MATCH(content) AGAINST('exam' IN NATURAL LANGUAGE MODE);
```
- With the default ngram_token_size of 2, MySQL splits the search string 'exam' into the bigrams ex, xa, and am. In natural language mode, a row can match if it contains any of these tokens.
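The search mode changes how the n-grams of the query are combined. The sketch below runs the same search term against the articles table from the steps above in both modes, to make the difference visible.

```sql
-- Natural language mode: 'exam' becomes a union of its n-grams,
-- so a row can match if it contains any of the indexed tokens.
SELECT id, content
FROM articles
WHERE MATCH(content) AGAINST('exam' IN NATURAL LANGUAGE MODE);

-- Boolean mode: 'exam' is treated as a phrase of consecutive n-grams,
-- so only rows containing the full sequence are returned.
SELECT id, content
FROM articles
WHERE MATCH(content) AGAINST('exam' IN BOOLEAN MODE);
```

If natural language mode returns too many loosely related rows, boolean mode is usually the stricter choice.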

Advantages of Using the ngram Parser

Partial Match for Non-Latin Text: The ngram parser is highly beneficial for indexing and searching text in languages without word boundaries like Chinese, Japanese, or Korean.
Indexed Substring Search: Because character sequences are stored in the full-text index, MySQL can find partial words and substrings through the index instead of scanning every row, as a LIKE '%...%' pattern typically requires.
Character-based Indexing: It indexes fixed-length character sequences rather than whole words (single characters if ngram_token_size is set to 1), which is especially useful when exact word boundaries aren’t clear.

Disadvantages of Using the ngram Parser

Increased Index Size: The ngram parser can produce large indexes because it breaks text into multiple small sequences (n-grams). This increases storage requirements and can affect performance when dealing with large volumes of text.
Limited to Character-Based Matching: Since the ngram parser works with character sequences, it may not always match well with words or semantic meaning in natural language.
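Weaker Whole-Word Matching: Unlike the default word-based parser, the ngram parser has no notion of words. A natural language search for a full word also matches rows that share only some of its n-grams, which can return unexpected results. Boolean mode, which treats the search string as a phrase of n-grams, is stricter.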

Example: ngram Full-Text Search

Let’s say you have a table of documents where each row has an id and a text column. You can create a full-text index using the ngram parser as follows:

```
CREATE TABLE documents (
    id INT PRIMARY KEY AUTO_INCREMENT,
    text TEXT,
    FULLTEXT KEY text_index (text) WITH PARSER ngram
);
```

Now, if you insert text into the documents table:

```
INSERT INTO documents (text)
VALUES ('This is a test document.');
```

You can search for a sequence of characters (for example, test):

```
SELECT * FROM documents
WHERE MATCH(text) AGAINST ('test' IN NATURAL LANGUAGE MODE);
```

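With the default token size of 2, the query is split into the bigrams te, es, and st. The result includes any document containing at least one of them, not only documents containing the whole string test.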
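Since partial matches are returned too, it is often useful to rank rows by relevance. The following sketch reuses the documents table from this example and exposes the relevance value as a column; the alias score is a name chosen here for illustration.

```sql
-- Return matching rows with their relevance value, best matches first.
SELECT id,
       text,
       MATCH(text) AGAINST('test' IN NATURAL LANGUAGE MODE) AS score
FROM documents
WHERE MATCH(text) AGAINST('test' IN NATURAL LANGUAGE MODE)
ORDER BY score DESC;
```

Rows that share more n-grams with the search term generally receive a higher score, which makes loosely matching rows easier to spot or filter out.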

Conclusion

The ngram full-text parser in MySQL is a useful tool for handling non-Latin languages and other text without obvious word boundaries. By breaking text into n-grams, it supports indexed searches on character sequences, which makes it particularly suited to languages like Chinese, Japanese, or Korean. However, weigh the increased index size and the looser word-level matching before choosing this parser for your database.

Souy Soeng
