[BUG] Double file scan with stats skipping #1073

@Kimahriman

Bug

Describe the problem

Data skipping based on file stats causes two filesForScan operations for a single query.

Steps to reproduce

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

# Write a small Delta table with one struct column, nested, holding two string fields.
df = spark.range(5).select(F.struct(F.lit('test').alias('test'), F.col('id').cast("string").alias('id')).alias('nested'))
df.write.format('delta').save('test')

# Filter on a nested field and count; this query shows the double scan.
spark.read.format('delta').load('test').filter('nested.id = "2"').count()

Observed results

There are two scan stages, caused by hitting the line of code that logs the message below. Example output:

Prepared scan does not match actual filters. Reselecting files to query.
Prepared: ExpressionSet((nested#457.id = 2), isnotnull(nested#457))
Actual: ExpressionSet(isnotnull(nested#457), (nested#457.id = 2))

Expected results

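The two expression sets contain the same expressions and differ only in the order in which they are printed, so they should compare as equal and should not trigger a second file scan.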
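As a rough illustration of the expectation above, here is a minimal Python sketch that compares the two printed expression sets from the log output. It only models the expressions as plain strings, and same_filters is a hypothetical helper, not part of Delta Lake or Spark. The real comparison happens on Catalyst expressions, so the actual mismatch may come from something that is not visible in the printed form.

```python
# Printed forms of the two expression sets, copied from the log output above.
prepared = ["(nested#457.id = 2)", "isnotnull(nested#457)"]
actual = ["isnotnull(nested#457)", "(nested#457.id = 2)"]


def same_filters(a, b):
    """Order-insensitive comparison of two filter lists (hypothetical helper)."""
    return frozenset(a) == frozenset(b)


# An order-sensitive comparison sees a difference...
ordered_equal = prepared == actual
# ...while a set-based comparison treats the two as the same filters.
set_equal = same_filters(prepared, actual)

print(ordered_equal, set_equal)
```

If the printed forms are equal as sets, as they are here, the difference that triggers the reselect is presumably in some detail of the expressions that the string output hides.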

Further details

Environment information

  • Delta Lake version: 1.2.0
  • Spark version: 3.2.1
  • Scala version: 2.12

Willingness to contribute

The Delta Lake Community encourages bug fix contributions. Would you or another member of your organization be willing to contribute a fix for this bug to the Delta Lake code base?

  • Yes. I can contribute a fix for this bug independently.
  • Yes. I would be willing to contribute a fix for this bug with guidance from the Delta Lake community.
  • No. I cannot contribute a bug fix at this time.

Metadata

Labels

bug: Something isn't working
