[core] Fix scan metric report for extra file-index files #5937

Akwangg · 2025-07-22T09:24:36Z

Purpose

Reading a table involves two phases: a scan phase and a partition read phase. When using a bloom filter or another file index, we found that the scan metrics report the total number of data files in the table, which makes the file index seem ineffective.

After analysis, we found that in the scan phase, file index evaluation only takes effect for the embedded file index; it does not apply to the extra file index. The extra file index is only applied later, in the partition read phase, which makes the reported scan metrics inaccurate.
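To make the discrepancy concrete, here is a minimal, self-contained sketch. It does not use Paimon's actual classes: `DataFile`, `IndexKind`, `scanPhase`, and `readPhase` are hypothetical names that only model the behavior described above, assuming the scan phase prunes by the embedded index alone and the read phase prunes by the extra index.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical model of the two phases; none of these names come from Paimon.
final class ScanMetricsSketch {

    enum IndexKind { NONE, EMBEDDED, EXTRA }

    /** A data file plus the (assumed) outcome of evaluating its index against the predicate. */
    static final class DataFile {
        final String name;
        final IndexKind kind;
        final boolean indexMayMatch;

        DataFile(String name, IndexKind kind, boolean indexMayMatch) {
            this.name = name;
            this.kind = kind;
            this.indexMayMatch = indexMayMatch;
        }
    }

    /** Scan phase: only an embedded index can prune a file here. */
    static List<DataFile> scanPhase(List<DataFile> files) {
        List<DataFile> kept = new ArrayList<>();
        for (DataFile f : files) {
            boolean pruned = f.kind == IndexKind.EMBEDDED && !f.indexMayMatch;
            if (!pruned) {
                kept.add(f);
            }
        }
        return kept;
    }

    /** Read phase: the extra index is loaded and can prune a file only at this point. */
    static List<DataFile> readPhase(List<DataFile> scanned) {
        List<DataFile> kept = new ArrayList<>();
        for (DataFile f : scanned) {
            boolean pruned = f.kind == IndexKind.EXTRA && !f.indexMayMatch;
            if (!pruned) {
                kept.add(f);
            }
        }
        return kept;
    }

    static List<DataFile> sampleFiles() {
        return Arrays.asList(
                new DataFile("f1", IndexKind.EMBEDDED, false), // pruned during scan
                new DataFile("f2", IndexKind.EXTRA, false),    // pruned only during read
                new DataFile("f3", IndexKind.EXTRA, true),     // kept
                new DataFile("f4", IndexKind.NONE, true));     // no index, always kept
    }

    public static void main(String[] args) {
        List<DataFile> scanned = scanPhase(sampleFiles());
        List<DataFile> read = readPhase(scanned);
        System.out.println("files reported by scan metrics: " + scanned.size());
        System.out.println("files actually read: " + read.size());
    }
}
```

With this sample input, the scan phase reports 3 files while only 2 are actually read, because `f2` is filtered by its extra index after the scan metrics have already been recorded.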

Tests

API and Format

Documentation

Tan-JiaLiang · 2025-07-23T08:02:24Z

@Akwangg Thank you for your contribution! Did you mean evaluating the extra file index during the scanning phase?

Akwangg · 2025-07-23T08:23:17Z

> @Akwangg Thank you for your contribution! Did you mean evaluating the extra file index during the scanning phase?

Yes. The scan phase only evaluates the embedded file index, so the scan result is incorrect for the extra file index.

Tan-JiaLiang · 2025-07-23T09:00:35Z

> > @Akwangg Thank you for your contribution! Did you mean evaluating the extra file index during the scanning phase?
>
> Yes. The scan phase only evaluates the embedded file index, so the scan result is incorrect for the extra file index.

I think this was done on purpose. During the scanning phase, we aim to evaluate as quickly as possible, so we only evaluate the embedded file index, which is stored in the manifest. The extra file index, however, is stored independently and requires additional I/O to load, so evaluating it may slow down the scanning process.

Akwangg · 2025-07-23T09:31:11Z

> > > @Akwangg Thank you for your contribution! Did you mean evaluating the extra file index during the scanning phase?
> >
> > Yes. The scan phase only evaluates the embedded file index, so the scan result is incorrect for the extra file index.
>
> I think this was done on purpose. During the scanning phase, we aim to evaluate as quickly as possible, so we only evaluate the embedded file index, which is stored in the manifest. The extra file index, however, is stored independently and requires additional I/O to load, so evaluating it may slow down the scanning process.

If so, the reported scan metrics show the total number of table files, while some of those files are only filtered out when the data is actually read. This can easily give the misleading impression that the index is not effective. Is there a better solution?

Tan-JiaLiang · 2025-07-23T09:55:47Z

> If so, the reported scan metrics show the total number of table files, while some of those files are only filtered out when the data is actually read. This can easily give the misleading impression that the index is not effective. Is there a better solution?

We could introduce some metrics in the file index evaluation phase. WDYT?
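As a rough illustration of what such metrics could look like, here is a minimal sketch of a counter pair recorded wherever the extra file index is evaluated. The class and method names (`FileIndexEvalMetricsSketch`, `record`, `evaluatedFiles`, `skippedFiles`) are assumptions made for this example and are not existing Paimon APIs; a real implementation would presumably hook into the project's own metrics framework.

```java
import java.util.concurrent.atomic.LongAdder;

// Hypothetical counters for the file index evaluation phase; not a Paimon API.
final class FileIndexEvalMetricsSketch {

    private final LongAdder evaluated = new LongAdder();
    private final LongAdder skipped = new LongAdder();

    /** Record one extra-file-index evaluation; fileMayMatch == false means the file is skipped. */
    void record(boolean fileMayMatch) {
        evaluated.increment();
        if (!fileMayMatch) {
            skipped.increment();
        }
    }

    long evaluatedFiles() {
        return evaluated.sum();
    }

    long skippedFiles() {
        return skipped.sum();
    }

    public static void main(String[] args) {
        FileIndexEvalMetricsSketch metrics = new FileIndexEvalMetricsSketch();
        metrics.record(true);   // index says the file may contain matching rows
        metrics.record(false);  // index rules the file out
        metrics.record(false);
        System.out.println("evaluated=" + metrics.evaluatedFiles()
                + ", skipped=" + metrics.skippedFiles());
    }
}
```

Reporting these counters next to the existing scan metrics would keep the scan phase free of extra I/O while still showing how many files the extra file index filtered out at read time.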

JingsongLi closed this Jul 22, 2025

JingsongLi reopened this Jul 22, 2025

Akwangg added 2 commits July 22, 2025 19:54

add extra file-index file check

59649d2

fix test

648b16b

Akwangg force-pushed the file-index-fix branch from 9321331 to 648b16b Compare July 22, 2025 12:02

fix failed test

ce959af
