[core] Fix scan metric report for extra file-index files #5937
base: master
Conversation
@Akwangg Thank you for your contribution! Do you mean evaluating the extra file index during the scan phase?
Yes. The scan phase evaluates only the embedded file index, so the scan result is incorrect when an extra file index is used.
I think this was done on purpose. During the scan phase we aim to evaluate as quickly as possible, so we only evaluate the embedded file index, which is stored in the manifest. The extra file index, however, is stored independently and requires additional I/O to load, which may slow down the scan.
If so, the reported scan metrics count all data files in the table, yet some of those files are filtered out when the data is actually read. This can easily give the misleading impression that the index is not effective. Is there a better solution?
We could introduce some metrics in the file index evaluation phase. WDYT?
Purpose
Reading a table consists of a scan phase and a partition read phase. When using a bloom filter or other file indexes, we found that the scan metrics report the total number of data files in the table, which makes the file index appear ineffective.
After analysis, we found that in the scan phase, file index evaluation takes effect only for the embedded file index, not for the extra file index. The extra file index does take effect in the partition read phase, which makes the reported metrics inaccurate.
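To make the mismatch concrete, here is a minimal, self-contained sketch of the two phases. It is only an illustration under stated assumptions: `FileIndexMetricsSketch`, `IndexKind`, `DataFile`, `scan`, `read`, and `sampleTable` are hypothetical names invented for this example and are not Paimon classes or APIs. The `indexMayMatch` flag stands in for whatever a real index would answer for the query predicate.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class FileIndexMetricsSketch {

    /** Where a file's index lives. Hypothetical enum, not a Paimon type. */
    enum IndexKind { EMBEDDED, EXTRA }

    /** Hypothetical stand-in for a data file entry. */
    static final class DataFile {
        final String name;
        final IndexKind indexKind;
        final boolean indexMayMatch; // assumed answer of the index for the query predicate

        DataFile(String name, IndexKind indexKind, boolean indexMayMatch) {
            this.name = name;
            this.indexKind = indexKind;
            this.indexMayMatch = indexMayMatch;
        }
    }

    /** Scan phase: only the embedded index (kept in the manifest) is evaluated. */
    static List<DataFile> scan(List<DataFile> files) {
        List<DataFile> kept = new ArrayList<>();
        for (DataFile f : files) {
            boolean pruned = f.indexKind == IndexKind.EMBEDDED && !f.indexMayMatch;
            if (!pruned) {
                kept.add(f);
            }
        }
        return kept;
    }

    /** Read phase: the extra index is loaded (additional I/O) and evaluated. */
    static List<DataFile> read(List<DataFile> scanned) {
        List<DataFile> kept = new ArrayList<>();
        for (DataFile f : scanned) {
            boolean pruned = f.indexKind == IndexKind.EXTRA && !f.indexMayMatch;
            if (!pruned) {
                kept.add(f);
            }
        }
        return kept;
    }

    /** Four files: one matching and one non-matching file per index kind. */
    static List<DataFile> sampleTable() {
        return Arrays.asList(
                new DataFile("f1", IndexKind.EMBEDDED, true),
                new DataFile("f2", IndexKind.EMBEDDED, false),
                new DataFile("f3", IndexKind.EXTRA, true),
                new DataFile("f4", IndexKind.EXTRA, false));
    }

    public static void main(String[] args) {
        List<DataFile> scanned = scan(sampleTable());
        List<DataFile> actuallyRead = read(scanned);
        System.out.println("files reported by scan: " + scanned.size());
        System.out.println("files actually read: " + actuallyRead.size());
        System.out.println("skipped by extra index: " + (scanned.size() - actuallyRead.size()));
    }
}
```

In this toy table, `f4` survives the scan phase because its index is not consulted there, and is only dropped in the read phase. A scan-only metric therefore counts it as a scanned file, which is the discrepancy described above; counting the files skipped during the read phase is one possible way to surface it.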
Tests
API and Format
Documentation