Labels: bug (Something isn't working)
Description
Bug
Describe the problem
File stats skipping is causing two filesForScan operations.
Steps to reproduce
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

# Write a Delta table with a nested struct column.
df = spark.range(5).select(
    F.struct(
        F.lit('test').alias('test'),
        F.col('id').cast('string').alias('id'),
    ).alias('nested')
)
df.write.format('delta').save('test')

# Filtering on a field of the nested struct triggers the duplicate file scan.
spark.read.format('delta').load('test').filter('nested.id = "2"').count()
Observed results
There are two scan stages, caused by hitting this line of code. Example output:
Prepared scan does not match actual filters. Reselecting files to query.
Prepared: ExpressionSet((nested#457.id = 2), isnotnull(nested#457))
Actual: ExpressionSet(isnotnull(nested#457), (nested#457.id = 2))
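The two ExpressionSets in the log contain the same predicates, only in a different order. A minimal sketch (plain Python, not Delta Lake code; the predicate strings are copied from the log above) of why an order-sensitive comparison would report a spurious mismatch while an order-insensitive one would not:

```python
# Hypothetical illustration of the mismatch: same predicates, different order.
prepared = ["(nested#457.id = 2)", "isnotnull(nested#457)"]
actual = ["isnotnull(nested#457)", "(nested#457.id = 2)"]

# Compared as ordered sequences, they look different...
assert prepared != actual

# ...but compared as sets they are equal, so no rescan should be needed.
assert set(prepared) == set(actual)
```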
Expected results
The expression sets contain the same predicates, so the comparison should not trigger a second file scan.
Further details
Environment information
- Delta Lake version: 1.2.0
- Spark version: 3.2.1
- Scala version: 2.12
Willingness to contribute
The Delta Lake Community encourages bug fix contributions. Would you or another member of your organization be willing to contribute a fix for this bug to the Delta Lake code base?
- Yes. I can contribute a fix for this bug independently.
- Yes. I would be willing to contribute a fix for this bug with guidance from the Delta Lake community.
- No. I cannot contribute a bug fix at this time.