[core] Fix scan metric report for extra file-index files #5937

Open
wants to merge 3 commits into
base: master

Akwangg
Contributor

@Akwangg Akwangg commented Jul 22, 2025

Purpose

Reading table files consists of a scan phase and a partition read phase. When we use a bloom filter or other file indexes, the scan metrics report the total number of data files in the table, which makes the file index seem ineffective.

After analysis, we found that in the scan phase, file index evaluation is effective only for the embedded file index; it does not apply to the extra file index. The extra file index actually takes effect in the partition read phase, which results in inaccurate reported metrics.
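
The mismatch described above can be illustrated with a minimal, self-contained sketch. It is a toy model, not Paimon's actual code: `ScanMetricSketch`, `DataFile`, `IndexKind`, `scanPhase`, and `readPhase` are hypothetical names, and the index result is reduced to a single boolean.

```java
import java.util.List;
import java.util.stream.Collectors;

// Toy model of the two phases; all names here are hypothetical, not Paimon APIs.
public class ScanMetricSketch {

    /** Where a file's index lives: inside the manifest entry, or in a separate file. */
    enum IndexKind { EMBEDDED, EXTRA }

    /** A data file plus the (precomputed) answer its index gives for the predicate. */
    static final class DataFile {
        final String name;
        final IndexKind kind;
        final boolean indexMayMatch;

        DataFile(String name, IndexKind kind, boolean indexMayMatch) {
            this.name = name;
            this.kind = kind;
            this.indexMayMatch = indexMayMatch;
        }
    }

    /** Scan phase: only embedded indexes are consulted, so extra-index files always pass. */
    static List<DataFile> scanPhase(List<DataFile> files) {
        return files.stream()
                .filter(f -> f.kind != IndexKind.EMBEDDED || f.indexMayMatch)
                .collect(Collectors.toList());
    }

    /** Read phase: the extra index is loaded here and may skip the file. */
    static List<DataFile> readPhase(List<DataFile> scanned) {
        return scanned.stream()
                .filter(f -> f.kind != IndexKind.EXTRA || f.indexMayMatch)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<DataFile> files = List.of(
                new DataFile("f1", IndexKind.EXTRA, false),
                new DataFile("f2", IndexKind.EXTRA, false),
                new DataFile("f3", IndexKind.EXTRA, true));

        List<DataFile> scanned = scanPhase(files);
        List<DataFile> read = readPhase(scanned);

        // The scan metric counts every file, although the read phase skips two of them.
        System.out.println("files reported by scan: " + scanned.size());
        System.out.println("files actually read:    " + read.size());
    }
}
```

In this model, a metric taken after `scanPhase` reports all three files, while only one survives `readPhase`, which is the discrepancy this PR targets.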

Tests

API and Format

Documentation

@JingsongLi JingsongLi closed this Jul 22, 2025
@JingsongLi JingsongLi reopened this Jul 22, 2025
@Tan-JiaLiang
Contributor

@Akwangg Thank you for your contribution! Did you mean evaluating the extra file index during the scanning phase?

@Akwangg
Contributor Author

Akwangg commented Jul 23, 2025

Yes. The scan phase evaluates only the embedded file index, so the scan result is incorrect for the extra file index.

@Tan-JiaLiang
Contributor

Tan-JiaLiang commented Jul 23, 2025

I think this was done on purpose. During the scanning phase, we aim to evaluate as quickly as possible, so we only evaluate the embedded file index, which is stored in the manifest. The extra file index, however, is stored independently and requires additional I/O to load, which may slow down the scanning process.

@Akwangg
Contributor Author

Akwangg commented Jul 23, 2025

If so, the reported scan metrics cover all of the table's files, but some files are filtered out when the data is actually read, which can easily give the mistaken impression that the index is not effective. Is there a better solution?

@Tan-JiaLiang
Contributor

Tan-JiaLiang commented Jul 23, 2025

We could introduce some metrics in the file index evaluation phase. WDYT?
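
As a rough sketch of what such metrics might look like, the class below counts how many files the index evaluation examined and how many it skipped. Everything here is an assumption for illustration: `FileIndexEvalMetricsSketch`, `filter`, `evaluatedFiles`, and `skippedFiles` are hypothetical names, and the index check is passed in as a plain `Predicate` rather than a real file index reader.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.Predicate;
import java.util.stream.Collectors;

// Hypothetical counters around file index evaluation; not an existing Paimon class.
public class FileIndexEvalMetricsSketch {

    private final AtomicLong evaluated = new AtomicLong();
    private final AtomicLong skipped = new AtomicLong();

    /** Applies the index check to each file and records how many files it skipped. */
    <T> List<T> filter(List<T> files, Predicate<T> indexMayMatch) {
        return files.stream()
                .filter(f -> {
                    evaluated.incrementAndGet();
                    boolean keep = indexMayMatch.test(f);
                    if (!keep) {
                        skipped.incrementAndGet();
                    }
                    return keep;
                })
                .collect(Collectors.toList());
    }

    /** Number of files whose index was consulted. */
    long evaluatedFiles() {
        return evaluated.get();
    }

    /** Number of files the index ruled out. */
    long skippedFiles() {
        return skipped.get();
    }

    public static void main(String[] args) {
        FileIndexEvalMetricsSketch metrics = new FileIndexEvalMetricsSketch();
        // Stand-in predicate: pretend only "f3" may contain matching rows.
        List<String> kept = metrics.filter(List.of("f1", "f2", "f3"), name -> name.equals("f3"));
        System.out.println("kept=" + kept
                + " evaluated=" + metrics.evaluatedFiles()
                + " skipped=" + metrics.skippedFiles());
    }
}
```

Reporting the skipped count alongside the existing scan metrics would leave the scan phase untouched while still showing that the extra file index is taking effect during reads.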
