feat: impl `NgramIndex` for `FuseTable`, improve like query performance #17852

KKould · 2025-04-25T03:16:56Z

I hereby agree to the terms of the CLA available at: https://docs.databend.com/dev/policies/cla/

Summary

part of: #17724

Implement an Ngram Index to speed up LIKE queries.

It works by splitting each String value into overlapping substrings (n-grams) and inserting those substrings into a BloomFilter. When a LIKE query runs, the pattern is split into n-grams in the same way; if any of them is missing from a Block's BloomFilter, that Block cannot contain a matching row and is filtered out in advance.
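The principle above can be sketched in a few lines. This is a minimal Python illustration, not the Rust code in this PR: the class `NgramBloomFilter`, the helper `block_may_match`, the BLAKE2b-based hashing, and the choice of three hash functions are all assumptions made for the example. It only handles plain `%` and `_` wildcards and ignores escape sequences.

```python
import hashlib
import re


class NgramBloomFilter:
    """Toy Bloom filter over character n-grams (hypothetical, for illustration)."""

    def __init__(self, gram_size: int, bitmap_size: int, num_hashes: int = 3):
        self.gram_size = gram_size
        self.bitmap_size = bitmap_size  # treated as a number of bits here
        self.num_hashes = num_hashes    # assumption: not taken from the PR
        self.bits = bytearray((bitmap_size + 7) // 8)

    def _positions(self, gram: str):
        # Derive num_hashes bit positions from salted BLAKE2b digests.
        for seed in range(self.num_hashes):
            digest = hashlib.blake2b(
                gram.encode("utf-8"),
                digest_size=8,
                salt=seed.to_bytes(8, "little"),
            ).digest()
            yield int.from_bytes(digest, "little") % self.bitmap_size

    def ngrams(self, text: str):
        # Overlapping substrings of length gram_size; empty if text is shorter.
        n = self.gram_size
        return [text[i:i + n] for i in range(len(text) - n + 1)]

    def insert(self, text: str) -> None:
        for gram in self.ngrams(text):
            for pos in self._positions(gram):
                self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, gram: str) -> bool:
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(gram)
        )


def block_may_match(bloom: NgramBloomFilter, like_pattern: str) -> bool:
    """Return False only when no row in the block can match the pattern."""
    # Literal runs between wildcards must each appear verbatim in a matching row.
    for literal in re.split(r"[%_]", like_pattern):
        for gram in bloom.ngrams(literal):
            if not bloom.might_contain(gram):
                return False  # a required n-gram is absent: skip the block
    return True  # possibly a false positive; the block still has to be scanned


# One filter per block; every row of the indexed column is inserted.
block = NgramBloomFilter(gram_size=4, bitmap_size=1 << 16)
for row in [
    "The first track with Chris Botti is beautiful",
    "Great value for the price",
]:
    block.insert(row)

print(block_may_match(block, "%Chris Botti%"))     # True: all 4-grams were inserted
print(block_may_match(block, "%saxophone solo%"))  # expected False: grams never inserted
```

In this sketch a literal shorter than `gram_size` yields no n-grams, so the block is conservatively kept. A Bloom filter never produces false negatives, so pruning is safe; false positives only cost an unnecessary scan.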

As a result, inserts take longer when an Ngram Index is in use, because every string has to be split into n-grams. The extra cost depends on the length of each string and on the total number of rows.

Storage

Ngram Index is essentially a Bloom Index whose entries are the n-grams of the data rather than whole values. It therefore shares its Meta with Bloom Index and is stored in the same file.

Benchmark

The benchmark uses the amazon_reviews dataset: 39.2 GB in total, of which the review_body column is 17 GB.

CREATE OR REPLACE TABLE `amazon_reviews_ngram` (
    `review_date` int(11) NULL,
    `marketplace` varchar(20) NULL,
    `customer_id` bigint(20) NULL,
    `review_id` varchar(40) NULL,
    `product_id` varchar(10) NULL,
    `product_parent` bigint(20) NULL,
    `product_title` varchar(500) NULL,
    `product_category` varchar(50) NULL,
    `star_rating` smallint(6) NULL,
    `helpful_votes` int(11) NULL,
    `total_votes` int(11) NULL,
    `vine` boolean NULL,
    `verified_purchase` boolean NULL,
    `review_headline` varchar(500) NULL,
    `review_body` string NULL,
    NGRAM INDEX idx1 (review_body) gram_size = 10 bitmap_size = 2097152
) Engine = Fuse bloom_index_columns='review_body';

copy into amazon_reviews_ngram from @data/ngram_test/amazon_reviews_2010.snappy.parquet file_format = (type = PARQUET);
copy into amazon_reviews_ngram from @data/ngram_test/amazon_reviews_2011.snappy.parquet file_format = (type = PARQUET);
copy into amazon_reviews_ngram from @data/ngram_test/amazon_reviews_2012.snappy.parquet file_format = (type = PARQUET);
copy into amazon_reviews_ngram from @data/ngram_test/amazon_reviews_2013.snappy.parquet file_format = (type = PARQUET);
copy into amazon_reviews_ngram from @data/ngram_test/amazon_reviews_2014.snappy.parquet file_format = (type = PARQUET);
copy into amazon_reviews_ngram from @data/ngram_test/amazon_reviews_2015.snappy.parquet file_format = (type = PARQUET);

With the table and data loaded by the SQL above, the BloomFilter files total 1.5 GB.

Query:

SELECT
    product_id,
    any(product_title),
    AVG(star_rating) AS rating,
    COUNT() AS count
FROM
    amazon_reviews_ngram
WHERE
    review_body LIKE '%The first track with Chris Botti is beautiful%'
GROUP BY
    product_id
ORDER BY
    count DESC,
    rating DESC,
    product_id
LIMIT 5;

Ngram:

1 row read in 1.126 sec. Processed 786.43 thousand row, 444.15 MiB (698.43 thousand rows/s, 394.45 MiB/s)

Without Ngram:

1 row read in 13.045 sec. Processed 135.59 million row, 52.91 GiB (10.39 million rows/s, 4.06 GiB/s)
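For a quick read of the two result lines, the short Python calculation below derives the query speedup and the share of rows that still had to be scanned. It uses only the figures reported above; the variable names are made up for the example.

```python
# Figures copied from the "Ngram" and "Without Ngram" result lines above.
with_ngram = {"seconds": 1.126, "rows": 786.43e3}
without_ngram = {"seconds": 13.045, "rows": 135.59e6}

# How many times faster the query ran with the index.
speedup = without_ngram["seconds"] / with_ngram["seconds"]

# Fraction of rows that survived block pruning and were actually read.
rows_scanned_fraction = with_ngram["rows"] / without_ngram["rows"]

print(f"query speedup: {speedup:.1f}x")
print(f"rows still scanned: {rows_scanned_fraction:.2%}")
```

That works out to roughly an 11.6x faster query, with about 0.58% of the rows read.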

Insert:

Ngram:

2010: 38.227 sec
2011: 46.212 sec
2012: 67.140 sec
2013: 112.430 sec
2014: 132.978 sec
2015: 102.655 sec

Without Ngram:

2010: 6.090 sec
2011: 6.468 sec
2012: 9.751 sec
2013: 15.562 sec
2014: 23.374 sec
2015: 14.587 sec
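The insert overhead can be read off the two lists above. The Python snippet below computes the per-file and overall slowdown from those timings; nothing here is measured beyond what is already reported, and the variable names are illustrative.

```python
# Load times in seconds per yearly file, copied from the lists above.
with_ngram = {
    2010: 38.227, 2011: 46.212, 2012: 67.140,
    2013: 112.430, 2014: 132.978, 2015: 102.655,
}
without_ngram = {
    2010: 6.090, 2011: 6.468, 2012: 9.751,
    2013: 15.562, 2014: 23.374, 2015: 14.587,
}

# Slowdown factor for each file.
ratios = {year: with_ngram[year] / without_ngram[year] for year in with_ngram}
for year, ratio in ratios.items():
    print(f"{year}: {ratio:.2f}x slower")

# Slowdown over the whole load.
total_ratio = sum(with_ngram.values()) / sum(without_ngram.values())
print(f"overall: {total_ratio:.2f}x slower")
```

With these parameters the load is between about 5.7x and 7.2x slower per file, and about 6.6x slower overall.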

Tip: the following factors affect insertion time:

The length of each row of data
Number of data rows
BloomFilter Bitmap Size
N (gram_size) of Ngram

The parameters used in this benchmark were chosen to favor query performance. In real workloads, users need to weigh insertion speed against filtering effectiveness.
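To make the trade-off concrete, here is a rough Python estimate of how the factors listed above interact. It uses the textbook Bloom filter false-positive approximation, (1 - e^(-k*n/m))^k, and makes several assumptions that are not confirmed by this PR: `bitmap_size` is treated as a number of bits, the number of hash functions `k` is picked arbitrarily, and the count of distinct n-grams per block is a made-up input.

```python
import math


def ngram_count(row_length: int, gram_size: int) -> int:
    """Number of overlapping n-grams produced by one string of the given length."""
    return max(row_length - gram_size + 1, 0)


def false_positive_rate(distinct_grams: int, bitmap_bits: int, num_hashes: int) -> float:
    """Standard approximation for a Bloom filter holding distinct_grams items."""
    return (1.0 - math.exp(-num_hashes * distinct_grams / bitmap_bits)) ** num_hashes


# The benchmark's search literal, split with gram_size = 10.
pattern = "The first track with Chris Botti is beautiful"
print(ngram_count(len(pattern), 10), "n-grams to probe per block")

# Hypothetical block with one million distinct 10-grams and k = 3 (both assumed).
for bitmap_bits in (2_097_152, 4_194_304, 8_388_608):
    fpr = false_positive_rate(1_000_000, bitmap_bits, num_hashes=3)
    print(f"bitmap_bits={bitmap_bits}: per-gram false positive rate ~ {fpr:.3f}")
```

Longer rows and more rows mean more n-grams to hash on insert, and a larger bitmap lowers the false-positive rate at the cost of more index data to write and store. Because a block is pruned as soon as any one n-gram of the pattern is missing, a long search literal can prune well even when the per-gram false-positive rate is fairly high.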

Tests

Unit Test
Logic Test
Benchmark Test
No Test - Explain why

Type of change

Bug Fix (non-breaking change which fixes an issue)
New Feature (non-breaking change which adds functionality)
Breaking Change (fix or feature that could cause existing functionality not to work as expected)
Documentation Update
Refactoring
Performance Improvement
Other (please describe):

tests/sqllogictests/suites/ee/04_ee_inverted_index/04_0000_inverted_index_base.test

tests/sqllogictests/suites/base/09_fuse_engine/09_0006_func_fuse_history.test

src/query/sql/src/planner/binder/ddl/index.rs

src/query/storages/common/index/src/bloom_index.rs