MySQL ngram Full-Text Parser
Summary: in this tutorial, you will learn how to use the MySQL ngram full-text parser to support full-text searches for ideographic languages such as Chinese, Japanese, and Korean.
Introduction to MySQL ngram full-text parser
The built-in MySQL full-text parser determines the beginning and end of words using white space. This approach breaks down for ideographic languages such as Chinese, Japanese, and Korean, which do not use word delimiters.
To address this issue, MySQL provides the ngram full-text parser. Since version 5.7.6, MySQL has included the ngram full-text parser as a built-in server plugin, meaning that MySQL loads this plugin automatically when the database server starts. MySQL supports the ngram full-text parser for both the InnoDB and MyISAM storage engines.
By definition, an ngram is a contiguous sequence of n characters from a sequence of text. The main function of the ngram full-text parser is to tokenize a sequence of text into contiguous sequences of n characters.
The following illustrates how the ngram full-text parser tokenizes the text 'mysql' for different values of n:
n = 1: 'm', 'y', 's', 'q', 'l'
n = 2: 'my', 'ys', 'sq', 'ql'
n = 3: 'mys', 'ysq', 'sql'
n = 4: 'mysq', 'ysql'
n = 5: 'mysql'
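The tokenization above can be reproduced with a few lines of Python. This is only a sketch of the idea, not MySQL's implementation; the function name ngrams is made up for this illustration.

```python
def ngrams(text, n):
    """Return every contiguous run of n characters in text, left to right."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # A text of length L yields L - n + 1 tokens (none if the text is shorter than n).
    return [text[i:i + n] for i in range(len(text) - n + 1)]


# Print the tokens of 'mysql' for each token size from 1 to 5.
for size in range(1, 6):
    print("n =", size, ngrams("mysql", size))
```

The same function works for text without word delimiters: ngrams('全文搜索', 2) yields the three overlapping two-character tokens of that string.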
Creating FULLTEXT indexes with ngram parser
To create a FULLTEXT index that uses the ngram full-text parser, you add the WITH PARSER ngram clause to the CREATE TABLE, ALTER TABLE, or CREATE INDEX statement.
Consider the following example.
First, create a new posts table and add the title and body columns to a FULLTEXT index that uses the ngram full-text parser.
DROP TABLE IF EXISTS posts;
CREATE TABLE posts (
id INT PRIMARY KEY AUTO_INCREMENT,
title VARCHAR(255),
body TEXT,
FULLTEXT ( title , body ) WITH PARSER NGRAM
) ENGINE=INNODB CHARACTER SET UTF8MB4;
Second, use the SET NAMES statement to set the connection character set to utf8mb4.
SET NAMES utf8mb4;
Third, insert two new rows into the posts table:
INSERT INTO posts(title,body)
VALUES('MySQL全文搜索','MySQL提供了具有许多好的功能的内置全文搜索'),
('MySQL教程','学习MySQL快速,简单和有趣');
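Fourth, to see how the ngram parser tokenizes the text, use the following statements (innodb_ft_aux_table takes the form database_name/table_name, so this example assumes the posts table is in the test database):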
SET GLOBAL innodb_ft_aux_table="test/posts";
SELECT
*
FROM
information_schema.innodb_ft_index_cache
ORDER BY
doc_id ,
position;
This query is useful for troubleshooting. For example, if a word does not appear in the search results, it may not have been indexed because it is a stopword, or for some other reason.
Setting ngram token size
As you can see in the previous example, the default token size (n) of the ngram parser is 2. To change the token size, you use the ngram_token_size configuration option, which accepts a value between 1 and 10.
Note that a smaller token size produces a smaller full-text search index and faster searches. Typically, you set the token size to the length of the largest token that you want to search for.
Because ngram_token_size is a read-only variable, you can set its value in only two ways:
First, on the command line when starting the server:
mysqld --ngram_token_size=1
Second, in the configuration file:
[mysqld]
ngram_token_size=1
ngram parser phrase search
MySQL converts a phrase search into an ngram phrase search. For example, with a token size of 2, the phrase "abc" is converted into "ab bc", which matches documents that contain "abc" or "ab bc".
The following example shows you how to search for the phrase 搜索 in the posts table:
SELECT
id, title, body
FROM
posts
WHERE
MATCH (title , body) AGAINST ('搜索' );
Processing search results with ngram
Natural language mode
In NATURAL LANGUAGE MODE searches, the search term is converted to a union of ngram tokens. For example, with a token size of 2 (bigram), the search term mysql is converted to my, ys, sq, and ql, and a document matches if it contains any of these tokens.
SELECT
*
FROM
posts
WHERE
MATCH (title , body)
AGAINST ('简单和有趣' IN NATURAL LANGUAGE MODE);
Boolean mode
In BOOLEAN MODE searches, the search term is converted to an ngram phrase search, so the tokens must appear in the document consecutively and in the same order. For example:
SELECT
*
FROM
posts
WHERE
MATCH (title , body)
AGAINST ('简单和有趣' IN BOOLEAN MODE);
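The difference between the two modes can be sketched in Python. This is a simplified model rather than MySQL's actual matching logic: it ignores relevance ranking, stopwords, and punctuation handling, and the function names and the sample documents 'ab' and 'abc' are made up for this illustration.

```python
def ngrams(text, n=2):
    """Split text into overlapping tokens of n characters."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def matches_natural(document, term, n=2):
    """Union semantics: the document matches if it shares any token with the term."""
    doc_tokens = set(ngrams(document, n))
    return any(token in doc_tokens for token in ngrams(term, n))


def matches_phrase(document, term, n=2):
    """Phrase semantics: the term's tokens must appear consecutively, in order."""
    doc_tokens = ngrams(document, n)
    term_tokens = ngrams(term, n)
    if not term_tokens:
        return False
    width = len(term_tokens)
    return any(
        doc_tokens[i:i + width] == term_tokens
        for i in range(len(doc_tokens) - width + 1)
    )


# Search for 'abc' (tokens 'ab' and 'bc') in two sample documents.
for doc in ("ab", "abc"):
    print(doc, matches_natural(doc, "abc"), matches_phrase(doc, "abc"))
```

In this model, the natural language search matches both documents because each contains the token 'ab', while the phrase search matches only the document that contains 'ab' followed by 'bc'.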
ngram wildcard search
The ngram FULLTEXT index contains only ngrams, so it does not know where terms begin. When you perform wildcard searches, it may therefore return unexpected results.
The following rules apply to wildcard searches that use ngram FULLTEXT indexes:
If the prefix term in the wildcard is shorter than the ngram token size, the query returns all documents that contain ngram tokens starting with the prefix term. Note that the wildcard operator (*) is recognized only in BOOLEAN MODE. For example, with the default token size of 2:
SELECT
id,
title,
body
FROM
posts
WHERE
MATCH (title , body)
AGAINST ('m*' IN BOOLEAN MODE);
If the prefix term in the wildcard is longer than the ngram token size, MySQL converts the prefix term into an ngram phrase and ignores the wildcard operator. See the following example:
SELECT
id,
title,
body
FROM
posts
WHERE
MATCH (title , body)
AGAINST ('mysqld*' IN BOOLEAN MODE);
In this example, the term "mysqld" is converted into the ngram phrase "my ys sq ql ld" and the wildcard is ignored. Therefore, the query behaves like a phrase search for mysqld rather than a prefix search.
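The two wildcard rules can be summarized in a short Python sketch. It is an illustration of the rules as described in this section, not MySQL's implementation. The function name plan_wildcard and its return format are made up, and treating a prefix whose length equals the token size like the longer case is an assumption, since the rules above only cover shorter and longer prefixes.

```python
def ngrams(text, n=2):
    """Split text into overlapping tokens of n characters."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def plan_wildcard(prefix, indexed_tokens, n=2):
    """Describe how a 'prefix*' search is handled under the two rules."""
    if len(prefix) < n:
        # Rule 1: match every indexed token that starts with the prefix.
        hits = sorted(t for t in set(indexed_tokens) if t.startswith(prefix))
        return ("prefix", hits)
    # Rule 2 (assumed to cover equal length too): drop the wildcard and
    # turn the prefix into a sequence of ngram tokens.
    return ("phrase", ngrams(prefix, n))


indexed = ngrams("mysql")  # tokens the index would hold for the text 'mysql'
print(plan_wildcard("m", indexed))
print(plan_wildcard("mysqld", indexed))
```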
Handling stopwords
The ngram parser excludes tokens that contain a stopword from the stopword list. For example, suppose ngram_token_size is 2 and the document contains "abc". The ngram parser tokenizes the document into "ab" and "bc". If "b" is a stopword, the parser excludes both "ab" and "bc" because they contain "b".
Note that the default stopword list contains English stopwords, so you must define your own stopword list if the language is other than English. In addition, stopwords that are longer than ngram_token_size are ignored.
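The stopword rules can also be sketched in Python. This is a simplified model of the behavior described above, not MySQL's implementation; the function name index_tokens and the sample stopwords are made up for this illustration.

```python
def ngrams(text, n=2):
    """Split text into overlapping tokens of n characters."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def index_tokens(text, stopwords, n=2):
    """Return the tokens that would be indexed after stopword filtering."""
    # Stopwords longer than the token size are ignored.
    usable = [word for word in stopwords if 0 < len(word) <= n]
    # A token is dropped if it contains any usable stopword.
    return [
        token for token in ngrams(text, n)
        if not any(word in token for word in usable)
    ]


print(index_tokens("abc", ["b"]))    # both tokens contain 'b'
print(index_tokens("abc", ["abc"]))  # the stopword is longer than the token size
```

With "b" as a stopword, nothing from "abc" is indexed. With "abc" as a stopword, both "ab" and "bc" are kept because the stopword is longer than the token size.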
In this tutorial, you have learned how to use the MySQL ngram full-text parser to handle full-text searches for ideographic languages.