feat: impl NgramIndex for FuseTable, improve like query performance #17852

Open: wants to merge 13 commits into base: main

Conversation

KKould
Member

@KKould KKould commented Apr 25, 2025

I hereby agree to the terms of the CLA available at: https://docs.databend.com/dev/policies/cla/

Summary

part of: #17724

Implement an Ngram Index to speed up LIKE queries.

It works by splitting each String value into overlapping n-gram substrings and inserting them into a BloomFilter. For a LIKE query, the pattern is split into n-grams in the same way; if any of those n-grams is absent from a Block's BloomFilter, that Block cannot contain a matching row and is filtered out in advance.

As a result, inserts take longer when an Ngram Index is used, because every string must be split into n-grams (the cost depends on the length of each string and the total number of rows).
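The principle described above can be sketched in a few lines of Python. This is only an illustration under assumptions, not the code in this PR: `ToyBloomFilter`, `build_block_filter`, and `block_may_match` are hypothetical names, the hash scheme (SHA-256 with a seed prefix) is chosen for simplicity, and the pattern is assumed to be a single literal between two `%` wildcards.

```python
import hashlib


def ngrams(text: str, n: int) -> list[str]:
    """Return every overlapping substring of length n (character n-grams)."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


class ToyBloomFilter:
    """Hypothetical toy filter: k hash positions per item in a fixed bit array."""

    def __init__(self, num_bits: int, num_hashes: int = 3) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item: str):
        # Derive num_hashes positions by hashing the item with different seeds.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item: str) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


def build_block_filter(rows: list[str], n: int, num_bits: int) -> ToyBloomFilter:
    """Insert every n-gram of every row of one block into one filter."""
    bf = ToyBloomFilter(num_bits)
    for row in rows:
        for gram in ngrams(row, n):
            bf.add(gram)
    return bf


def block_may_match(bf: ToyBloomFilter, literal: str, n: int) -> bool:
    """False means the block certainly has no row containing `literal`."""
    grams = ngrams(literal, n)
    if not grams:
        return True  # literal shorter than n: the filter cannot prune anything
    return all(bf.might_contain(g) for g in grams)


block = ["The first track with Chris Botti is beautiful", "Great album overall"]
bf = build_block_filter(block, n=4, num_bits=1 << 16)
print(block_may_match(bf, "Chris Botti", n=4))  # True: inserted n-grams always hit
print(block_may_match(bf, "zzzzqqqq", n=4))     # a miss here lets the block be skipped
```

A Bloom filter has no false negatives, so a block that really contains the literal is never skipped; a false positive only costs an unnecessary scan of that block.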

Storage

Ngram Index is essentially a Bloom Index built over data that has been segmented into n-grams. It therefore shares Meta with Bloom Index and uses the same storage file.

Benchmark

The benchmark uses the amazon_reviews dataset: the total data size is 39.2 GB, of which review_body accounts for 17 GB.

CREATE OR REPLACE TABLE `amazon_reviews_ngram` (
    `review_date` int(11) NULL,
    `marketplace` varchar(20) NULL,
    `customer_id` bigint(20) NULL,
    `review_id` varchar(40) NULL,
    `product_id` varchar(10) NULL,
    `product_parent` bigint(20) NULL,
    `product_title` varchar(500) NULL,
    `product_category` varchar(50) NULL,
    `star_rating` smallint(6) NULL,
    `helpful_votes` int(11) NULL,
    `total_votes` int(11) NULL,
    `vine` boolean NULL,
    `verified_purchase` boolean NULL,
    `review_headline` varchar(500) NULL,
    `review_body` string NULL,
    NGRAM INDEX idx1 (review_body) gram_size = 10 bitmap_size = 2097152
) Engine = Fuse bloom_index_columns='review_body';

copy into amazon_reviews_ngram from @data/ngram_test/amazon_reviews_2010.snappy.parquet file_format = (type = PARQUET);
copy into amazon_reviews_ngram from @data/ngram_test/amazon_reviews_2011.snappy.parquet file_format = (type = PARQUET);
copy into amazon_reviews_ngram from @data/ngram_test/amazon_reviews_2012.snappy.parquet file_format = (type = PARQUET);
copy into amazon_reviews_ngram from @data/ngram_test/amazon_reviews_2013.snappy.parquet file_format = (type = PARQUET);
copy into amazon_reviews_ngram from @data/ngram_test/amazon_reviews_2014.snappy.parquet file_format = (type = PARQUET);
copy into amazon_reviews_ngram from @data/ngram_test/amazon_reviews_2015.snappy.parquet file_format = (type = PARQUET);

With the table created and loaded by the SQL above, the BloomFilter files total 1.5 GB.

Query:

SELECT
    product_id,
    any(product_title),
    AVG(star_rating) AS rating,
    COUNT() AS count
FROM
    amazon_reviews_ngram
WHERE
    review_body LIKE '%The first track with Chris Botti is beautiful%'
GROUP BY
    product_id
ORDER BY
    count DESC,
    rating DESC,
    product_id
LIMIT 5;

Ngram:

  • 1 row read in 1.126 sec. Processed 786.43 thousand row, 444.15 MiB (698.43 thousand rows/s, 394.45 MiB/s)

Without Ngram:

  • 1 row read in 13.045 sec. Processed 135.59 million row, 52.91 GiB (10.39 million rows/s, 4.06 GiB/s)
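To put the two query results side by side, the short calculation below derives the speedup and the share of rows and bytes that still had to be processed with the Ngram Index. It only reuses the figures reported above; the variable names are illustrative.

```python
# Figures copied from the two query results above.
ngram_secs, scan_secs = 1.126, 13.045
ngram_rows, scan_rows = 786.43e3, 135.59e6
ngram_mib, scan_mib = 444.15, 52.91 * 1024  # 52.91 GiB expressed in MiB

speedup = scan_secs / ngram_secs
rows_fraction = ngram_rows / scan_rows
bytes_fraction = ngram_mib / scan_mib

print(f"speedup: {speedup:.1f}x")
print(f"rows processed with Ngram: {rows_fraction:.2%} of the full scan")
print(f"bytes processed with Ngram: {bytes_fraction:.2%} of the full scan")
```

For this query that is roughly an 11.6x speedup, with well under 1% of the rows and bytes being read.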

Insert:

Ngram:

  • 2010: 38.227 sec
  • 2011: 46.212 sec
  • 2012: 67.140 sec
  • 2013: 112.430 sec
  • 2014: 132.978 sec
  • 2015: 102.655 sec

Without Ngram:

  • 2010: 6.090 sec
  • 2011: 6.468 sec
  • 2012: 9.751 sec
  • 2013: 15.562 sec
  • 2014: 23.374 sec
  • 2015: 14.587 sec
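The insert overhead can be quantified the same way. The snippet below computes the per-file and overall slowdown from the timings listed above; nothing beyond those timings is assumed.

```python
# Insert timings in seconds, copied from the lists above.
with_ngram = {2010: 38.227, 2011: 46.212, 2012: 67.140,
              2013: 112.430, 2014: 132.978, 2015: 102.655}
without_ngram = {2010: 6.090, 2011: 6.468, 2012: 9.751,
                 2013: 15.562, 2014: 23.374, 2015: 14.587}

# Slowdown factor per yearly file.
ratios = {year: with_ngram[year] / without_ngram[year] for year in with_ngram}
for year, ratio in ratios.items():
    print(f"{year}: {ratio:.1f}x slower")

# Overall slowdown across all six files.
total_ratio = sum(with_ngram.values()) / sum(without_ngram.values())
print(f"total: {total_ratio:.1f}x slower")
```

With these parameters, loading is between about 5.7x and 7.2x slower per file, and about 6.6x slower overall.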

Tips: The factors that affect the insertion time are as follows:

  • The length of each row of data
  • Number of data rows
  • BloomFilter Bitmap Size
  • N (gram_size) of Ngram

The parameters in this benchmark were chosen to favor query performance. In real workloads, users need to weigh insertion speed against filtering effectiveness.
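A rough way to reason about this trade-off is sketched below. It uses the standard false-positive estimate for a classic Bloom filter, p ≈ (1 − e^(−kn/m))^k, and counts the n-grams produced per row. Several things here are assumptions rather than facts from this PR: `bitmap_size` is treated as a number of bits, the number of hash functions `num_hashes` is picked arbitrarily, and the distinct n-gram counts are made-up inputs.

```python
import math


def ngram_count(row_len: int, gram_size: int) -> int:
    """Number of n-grams produced by one row of the given length."""
    return max(row_len - gram_size + 1, 0)


def bloom_false_positive_rate(num_items: int, num_bits: int, num_hashes: int) -> float:
    """Standard estimate for a classic Bloom filter: (1 - e^(-k*n/m))^k."""
    return (1.0 - math.exp(-num_hashes * num_items / num_bits)) ** num_hashes


bitmap_size = 2_097_152  # from the benchmark DDL; assumed here to be a bit count
gram_size = 10           # from the benchmark DDL

# Insert cost grows with row length: longer rows yield more n-grams to hash.
for row_len in (50, 500, 5000):
    print(f"row of {row_len} chars -> {ngram_count(row_len, gram_size)} n-grams")

# Filtering quality drops as more distinct n-grams share the same bitmap.
for distinct_grams in (100_000, 500_000, 2_000_000):  # hypothetical per-block counts
    p = bloom_false_positive_rate(distinct_grams, bitmap_size, num_hashes=3)
    print(f"{distinct_grams} distinct n-grams -> false-positive rate ~ {p:.4f}")
```

A larger bitmap lowers the false-positive rate, so more blocks can be pruned, at the price of bigger index files and more work during insertion.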

Tests

  • Unit Test
  • Logic Test
  • Benchmark Test
  • No Test - Explain why

Type of change

  • Bug Fix (non-breaking change which fixes an issue)
  • New Feature (non-breaking change which adds functionality)
  • Breaking Change (fix or feature that could cause existing functionality not to work as expected)
  • Documentation Update
  • Refactoring
  • Performance Improvement
  • Other (please describe):


@github-actions github-actions bot added the pr-feature this PR introduces a new feature to the codebase label Apr 25, 2025
@KKould KKould force-pushed the feat/ngram_index branch 5 times, most recently from ba2213e to 8aabb9e Compare April 25, 2025 07:10
@KKould KKould force-pushed the feat/ngram_index branch from 8aabb9e to c88200b Compare April 25, 2025 07:36
@KKould KKould force-pushed the feat/ngram_index branch from c88200b to 69d798f Compare April 25, 2025 08:54
@KKould KKould marked this pull request as ready for review April 25, 2025 09:27
@b41sh b41sh self-requested a review April 25, 2025 11:22
@KKould KKould force-pushed the feat/ngram_index branch 3 times, most recently from 28f2ae4 to af65547 Compare April 25, 2025 17:03
KKould added 5 commits April 26, 2025 23:34
Signed-off-by: Kould <kould2333@gmail.com>
Signed-off-by: Kould <kould2333@gmail.com>
Signed-off-by: Kould <kould2333@gmail.com>
Signed-off-by: Kould <kould2333@gmail.com>
@KKould KKould force-pushed the feat/ngram_index branch from af65547 to 67e6f07 Compare April 26, 2025 15:34
Signed-off-by: Kould <kould2333@gmail.com>
@KKould KKould force-pushed the feat/ngram_index branch from ef734e0 to 762d575 Compare April 28, 2025 06:13
@KKould
Member Author

KKould commented Apr 28, 2025

Please note that the filter has been changed: the original BloomFilter has been removed in favor of Xor8Filter, and the size is now controlled by taking the remainder. This may significantly affect the benchmark results, which still need to be re-tested.

Updated to Readme

Signed-off-by: Kould <kould2333@gmail.com>
@KKould KKould force-pushed the feat/ngram_index branch from 762d575 to ef24126 Compare April 28, 2025 06:19
Signed-off-by: Kould <kould2333@gmail.com>
Signed-off-by: Kould <kould2333@gmail.com>
@KKould KKould force-pushed the feat/ngram_index branch 2 times, most recently from 6c72b5c to 88f3cd4 Compare April 29, 2025 12:16
Signed-off-by: Kould <kould2333@gmail.com>
@KKould KKould force-pushed the feat/ngram_index branch from 88f3cd4 to 9d37617 Compare April 29, 2025 15:05
Signed-off-by: Kould <kould2333@gmail.com>
@KKould KKould force-pushed the feat/ngram_index branch 2 times, most recently from b806534 to 1d304e8 Compare April 30, 2025 04:09
Signed-off-by: Kould <kould2333@gmail.com>
@KKould KKould force-pushed the feat/ngram_index branch from 1d304e8 to 07623b4 Compare April 30, 2025 05:54
…byte

Signed-off-by: Kould <kould2333@gmail.com>
Labels
pr-feature this PR introduces a new feature to the codebase
3 participants