MySQL ngram Full-Text Parser

The ngram full-text parser in MySQL is a built-in parser for full-text indexing and searching. It is designed for text that does not separate words with spaces or other delimiters, as in Chinese, Japanese, and Korean (CJK). Instead of splitting text into words, the ngram parser splits it into n-grams, which are contiguous sequences of n characters. This lets MySQL index and search text even when it has no word boundaries.

What Is an n-gram?

An n-gram is a contiguous sequence of n items (characters or words) from a given sample of text. MySQL's ngram parser works on characters: it breaks text into overlapping tokens of n characters each, where n is set by the ngram_token_size system variable (default 2, allowed range 1 to 10).

For example:

  • For a sentence: "The quick brown fox"
    • 2-grams (bigrams): ['Th', 'he', 'e ', ' q', 'qu', 'ui', 'ic', 'ck', 'k ', ' b', 'br', 'ro', 'ow', 'wn', 'n ', ' f', 'fo', 'ox']
    • 3-grams (trigrams): ['The', 'he ', 'e q', ' qu', 'qui', 'uic', 'ick', 'ck ', 'k b', ' br', 'bro', 'row', 'own', 'wn ', 'n f', ' fo', 'fox']

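The ngram parser indexes text by these character sequences, so MySQL can find rows that contain a searched sequence even when the text has no word delimiters. Note that the lists above are generic character n-grams. MySQL's ngram parser discards spaces while parsing, so the tokens it actually indexes never span a space: 'ab cd' is tokenized as 'ab' and 'cd'.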
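
To see which tokens the parser actually stores, InnoDB exposes the contents of a full-text index through INFORMATION_SCHEMA. The following is a minimal sketch, assuming an InnoDB table in a schema named test (the schema name and the table name ngram_demo are placeholders for illustration) and an account that is allowed to set global variables:

```sql
-- Hypothetical demo table; the schema `test` and the name `ngram_demo` are assumptions.
CREATE TABLE ngram_demo (
  id INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
  content TEXT,
  FULLTEXT KEY content_index (content) WITH PARSER ngram
) ENGINE=InnoDB;

INSERT INTO ngram_demo (content) VALUES ('The quick brown fox');

-- Point the INFORMATION_SCHEMA full-text tables at the demo table ("schema/table").
SET GLOBAL innodb_ft_aux_table = 'test/ngram_demo';

-- Newly inserted rows are tokenized into the index cache; list the stored tokens in order.
SELECT *
FROM INFORMATION_SCHEMA.INNODB_FT_INDEX_CACHE
ORDER BY doc_id, position;
```

According to the MySQL documentation, the parser drops spaces and also excludes tokens that contain a stopword, so the stored list is typically shorter than the generic bigram list shown above.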

When to Use the ngram Full-Text Parser

The ngram full-text parser is typically used in the following situations:

  1. Languages Without Word Boundaries: It is most useful for languages like Chinese, Japanese, or Korean, where spaces are not used to separate words.
  2. Partial Matching: It allows partial matching of words or characters by breaking text into small character-based segments.
  3. Improved Search for Character Sequences: In cases where word-based full-text search may not work efficiently (e.g., when you need to match parts of words), the ngram parser can be used.
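
The first situation can be sketched as follows. The table name cjk_articles and the sample sentences are hypothetical and chosen only for illustration; the example assumes the default ngram_token_size of 2 and a utf8mb4 character set:

```sql
-- Hypothetical table and sample text, used only to illustrate CJK search.
CREATE TABLE cjk_articles (
  id INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
  content TEXT,
  FULLTEXT KEY content_index (content) WITH PARSER ngram
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;

INSERT INTO cjk_articles (content) VALUES
  ('数据库管理系统'),  -- "database management system"
  ('全文检索功能');    -- "full-text search feature"

-- With a token size of 2, the search term '数据' is exactly one bigram token.
SELECT id, content
FROM cjk_articles
WHERE MATCH(content) AGAINST('数据' IN NATURAL LANGUAGE MODE);
```

The query searches for the two-character sequence 数据, which occurs only in the first sample row. A word-based parser would have no delimiter to split that sentence on.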

How to Use the ngram Full-Text Parser in MySQL

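  1. Check Your MySQL Version: The ngram parser is built in and does not need to be installed or enabled separately. It is available from MySQL 5.7.6 and can be used with InnoDB and MyISAM tables. The token length is controlled by the ngram_token_size system variable (default 2), which is read-only at runtime and must be set at server startup.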

  2. Create a Table with a Full-Text Index Using ngram: You can specify the ngram parser when creating a full-text index on a column.

    CREATE TABLE articles (
      id INT PRIMARY KEY AUTO_INCREMENT,
      content TEXT,
      FULLTEXT KEY content_index (content) WITH PARSER ngram
    );
    • The WITH PARSER ngram clause tells MySQL to use the ngram parser for the full-text index.
  3. Populate the Table with Data: Insert some text into the table that will be indexed by the ngram parser.

    INSERT INTO articles (content) VALUES ('This is an example article content.');
  4. Perform Full-Text Search Using MATCH...AGAINST: Once you have created the table with the full-text index using the ngram parser, you can perform searches using the MATCH...AGAINST syntax.

    SELECT * FROM articles WHERE MATCH(content) AGAINST('exam' IN NATURAL LANGUAGE MODE);
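    • With the default ngram_token_size of 2, the search term 'exam' is split into the bigrams ex, xa, and am. In natural language mode the search is a union of these n-grams, so a row can match if it contains any one of them.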
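
The steps above create the index together with the table. As a complement, here is a hedged sketch of checking the token size in effect and adding an ngram index to a table that already exists. The table notes and the index name body_ngram are hypothetical:

```sql
-- Check the token size in effect (default 2, allowed range 1 to 10).
SELECT @@ngram_token_size;

-- ngram_token_size cannot be changed at runtime. Set it at startup instead,
-- for example in the option file:
--   [mysqld]
--   ngram_token_size=2

-- Hypothetical existing table without a full-text index.
CREATE TABLE notes (
  id INT AUTO_INCREMENT PRIMARY KEY,
  body TEXT
) ENGINE=InnoDB;

-- Add an ngram full-text index afterwards.
ALTER TABLE notes ADD FULLTEXT INDEX body_ngram (body) WITH PARSER ngram;

-- Equivalent alternative syntax:
--   CREATE FULLTEXT INDEX body_ngram ON notes (body) WITH PARSER ngram;
```

If the token size is changed, existing ngram indexes were built with the old value, so plan to rebuild them after the restart.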

Advantages of Using the ngram Parser

  1. Partial Match for Non-Latin Text: The ngram parser makes it possible to index and search text in languages without word boundaries, such as Chinese, Japanese, or Korean.
  2. Indexed Substring Search: Searches for character sequences and partial words are answered from the full-text index, which usually scales better than scanning every row with a LIKE '%...%' pattern.
  3. Character-based Indexing: Text is indexed as fixed-length character sequences rather than words, which is especially useful when exact word boundaries aren’t clear.

Disadvantages of Using the ngram Parser

  1. Increased Index Size: The ngram parser can produce large indexes because it breaks text into multiple small sequences (n-grams). This increases storage requirements and can affect performance when dealing with large volumes of text.
  2. Limited to Character-Based Matching: Since the ngram parser works with character sequences, it may not always match well with words or semantic meaning in natural language.
  3. Imprecise Full-Word Searches: In natural language mode a search term is split into n-grams and a row can match if it contains any of them, so a search for a full word may return rows that only share a fragment of it. Boolean mode is stricter because it treats the search term as an n-gram phrase.

Example: ngram Full-Text Search

Let’s say you have a table of documents where each row has an id and a text column. You can create a full-text index using the ngram parser as follows:

CREATE TABLE documents (
  id INT PRIMARY KEY AUTO_INCREMENT,
  text TEXT,
  FULLTEXT KEY text_index (text) WITH PARSER ngram
);

Now, if you insert text into the documents table:

INSERT INTO documents (text) VALUES ('This is a test document.');

You can search for a sequence of characters (for example, test):

SELECT * FROM documents WHERE MATCH(text) AGAINST ('test' IN NATURAL LANGUAGE MODE);

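With the default ngram_token_size of 2, the term test is split into the bigrams te, es, and st. In natural language mode the query can match any document that contains at least one of these bigrams, not only documents that contain the full string test.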
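
The difference between the two search modes can be sketched on the same table. This assumes ngram_token_size=2; the second sample sentence is hypothetical and was picked because it contains te, es, and st but not the word test:

```sql
-- Same definition as the documents table in this example.
CREATE TABLE IF NOT EXISTS documents (
  id INT PRIMARY KEY AUTO_INCREMENT,
  text TEXT,
  FULLTEXT KEY text_index (text) WITH PARSER ngram
);

-- Hypothetical extra row: contains the bigrams te, es, st, but never "test".
INSERT INTO documents (text) VALUES ('The best estimate so far.');

-- Natural language mode: 'test' is searched as the union of te, es, st.
SELECT id, text,
       MATCH(text) AGAINST('test' IN NATURAL LANGUAGE MODE) AS score
FROM documents
WHERE MATCH(text) AGAINST('test' IN NATURAL LANGUAGE MODE);

-- Boolean mode: 'test' is searched as the n-gram phrase "te es st",
-- so the bigrams have to appear consecutively.
SELECT id, text
FROM documents
WHERE MATCH(text) AGAINST('test' IN BOOLEAN MODE);
```

In natural language mode the extra row is a candidate match because it shares bigrams with the search term. In boolean mode only rows where the bigrams appear back to back, that is, rows containing test, should qualify. Comparing the score column across rows is a quick way to see how loosely a given row matched.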

Conclusion

The ngram full-text parser in MySQL is a useful tool for handling non-Latin languages or text without obvious word boundaries. By breaking text into n-grams, it allows indexed searches on character sequences, making it particularly useful for languages like Chinese, Japanese, or Korean. However, weigh the increased index size and the imprecision of word-based searches when deciding whether to use this parser in your database.

Soeng Souy
