HIVE-28910: Remove redundant IS NOT NULL predicates when expanding SEARCH #5795

Open · wants to merge 7 commits into base: master
@@ -58,14 +58,16 @@ public class SearchTransformer<C extends Comparable<C>> {
private final RexBuilder rexBuilder;
private final RexNode ref;
private final Sarg<C> sarg;
private final RexUnknownAs unknownContext;
protected final RelDataType type;

public SearchTransformer(RexBuilder rexBuilder, RexCall search) {
public SearchTransformer(RexBuilder rexBuilder, RexCall search, final RexUnknownAs unknownContext) {
this.rexBuilder = rexBuilder;
ref = search.getOperands().get(0);
RexLiteral literal = (RexLiteral) search.operands.get(1);
sarg = Objects.requireNonNull(literal.getValueAs(Sarg.class), "Sarg");
type = literal.getType();
this.unknownContext = unknownContext;
}

public RexNode transform() {
@@ -76,7 +78,7 @@ public RexNode transform() {
RangeSets.forEach(sarg.rangeSet, consumer);

List<RexNode> orList = new ArrayList<>();
if (sarg.nullAs == RexUnknownAs.TRUE) {
if (sarg.nullAs == RexUnknownAs.TRUE && unknownContext != RexUnknownAs.TRUE) {
orList.add(rexBuilder.makeCall(SqlStdOperatorTable.IS_NULL, ref));
}
switch (consumer.inLiterals.size()) {
@@ -94,7 +96,7 @@ public RexNode transform() {
orList.addAll(consumer.nodes);
RexNode x = RexUtil.composeDisjunction(rexBuilder, orList);

if (sarg.nullAs == RexUnknownAs.FALSE) {
if (sarg.nullAs == RexUnknownAs.FALSE && unknownContext != RexUnknownAs.FALSE) {
RexNode notNull = rexBuilder.makeCall(SqlStdOperatorTable.IS_NOT_NULL, ref);
x = RexUtil.composeConjunction(rexBuilder, Arrays.asList(notNull, x));
}
@@ -104,9 +106,11 @@

public static class Shuttle extends RexShuttle {
private final RexBuilder rexBuilder;
private RexUnknownAs unknownContext;

public Shuttle(final RexBuilder rexBuilder) {
public Shuttle(RexBuilder rexBuilder, RexUnknownAs unknownContext) {
this.rexBuilder = rexBuilder;
this.unknownContext = unknownContext;
}

@Override public RexNode visitCall(RexCall call) {
@@ -128,8 +132,23 @@ public Shuttle(final RexBuilder rexBuilder) {
} else {
return call;
}
case IS_NULL:
case IS_NOT_NULL:
case CASE:
case COALESCE:
// Everything below must be transformed with the UNKNOWN handler
Member:

Can you elaborate on why you can't use the outermost handler here?

In the Jira ticket you seem to lean in the opposite direction, namely taking into account that in WHERE clauses filters obey unknownAsFalse semantics.

I haven't reviewed the Calcite upgrade which introduced some of the code you touch here, so maybe it should be clear from the context.

Member Author:

There are some operators with special semantics for which we need to retain the 3-valued logic even when we are inside the WHERE clause.

Example: COALESCE(SEARCH(...), true)

  1. If SEARCH -> true then COALESCE -> true
  2. If SEARCH -> false then COALESCE -> false
  3. If SEARCH -> null then COALESCE -> true

If we consider unknownAsFalse inside the COALESCE then we will incorrectly drop the IS NOT NULL predicate during the expansion and the COALESCE will fall into the third case returning true instead of false.

Basically, if the SEARCH is below/inside an operator that is sensitive to NULL values, we cannot safely perform the simplification.

Note that RexSimplify also changes the default unknown semantics for certain operators (see simplifyIs2, simplifyCoalesce, etc.).
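The failure mode above can be sketched in plain Java, using a nullable `Boolean` (with `null` standing for SQL UNKNOWN) to model the three-valued logic. All names below are hypothetical; this is a model of the semantics, not Hive/Calcite code:

```java
// Models why SEARCH under COALESCE must keep its IS NOT NULL guard.
// java.lang.Boolean null stands for SQL UNKNOWN.
public class ThreeValuedCoalesce {

    // SQL COALESCE: the first non-NULL argument wins.
    static Boolean coalesce(Boolean first, boolean fallback) {
        return first != null ? first : fallback;
    }

    // Expansion of a nullAs=FALSE SEARCH that keeps the IS NOT NULL guard:
    // a NULL input short-circuits to FALSE.
    static Boolean searchWithNullGuard(Integer x) {
        if (x == null) {
            return false; // IS NOT NULL($ref) AND ... collapses to FALSE
        }
        return x >= 1 && x <= 3; // e.g. $ref BETWEEN 1 AND 3
    }

    // The same expansion after (incorrectly) assuming unknownAsFalse and
    // dropping the guard: a NULL input now yields UNKNOWN instead of FALSE.
    static Boolean searchWithoutNullGuard(Integer x) {
        if (x == null) {
            return null; // NULL >= 1 evaluates to UNKNOWN in SQL
        }
        return x >= 1 && x <= 3;
    }

    public static void main(String[] args) {
        // Correct: COALESCE(FALSE, TRUE) = FALSE, the row is filtered out.
        System.out.println(coalesce(searchWithNullGuard(null), true));
        // Guard dropped: COALESCE(UNKNOWN, TRUE) = TRUE,
        // the row incorrectly survives the filter.
        System.out.println(coalesce(searchWithoutNullGuard(null), true));
    }
}
```

With the guard the NULL row evaluates to FALSE and COALESCE returns FALSE; without it the SEARCH yields UNKNOWN and COALESCE falls through to TRUE, which is exactly case 3 above.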

Member:


I suspected something similar, thanks for the clarification.

Since this is a rather deep technical detail that most people unfamiliar with CBO will have trouble with, I'd suggest a more explicit comment like "// These operator kinds require the UNKNOWN handler to correctly handle NULL values" or something similar.

I know it's still a draft PR but for the rest LGTM already, feel free to ping me for a review if needed.

Member Author:


I added more comments and also made the approach a bit more defensive, in the sense that if SEARCH happens to be under any operator other than AND/OR we don't perform the simplification.

RexUnknownAs previousContext = this.unknownContext;
this.unknownContext = RexUnknownAs.UNKNOWN;
clonedOperands = visitList(call.operands, update);
// Restore the original handler once we finish with the operands.
this.unknownContext = previousContext;
if (update[0]) {
return rexBuilder.makeCall(call.op, clonedOperands);
} else {
return call;
}
case SEARCH:
return new SearchTransformer<>(rexBuilder, call).transform();
return new SearchTransformer<>(rexBuilder, call, this.unknownContext).transform();
default:
return super.visitCall(call);
}
@@ -29,6 +29,7 @@
import org.apache.calcite.rex.RexCall;
import org.apache.calcite.rex.RexNode;
import org.apache.calcite.rex.RexShuttle;
import org.apache.calcite.rex.RexUnknownAs;
import org.apache.calcite.rex.RexUtil;
import org.apache.calcite.sql.fun.SqlStdOperatorTable;
import org.apache.hadoop.hive.ql.optimizer.calcite.HiveRelFactories;
@@ -152,7 +153,7 @@ private RexInBetweenExpander(RexBuilder rexBuilder) {
public RexNode visitCall(final RexCall call) {
switch (call.getKind()) {
case SEARCH: {
return new SearchTransformer<>(rexBuilder, call).transform().accept(this);
return new SearchTransformer<>(rexBuilder, call, RexUnknownAs.UNKNOWN).transform().accept(this);
}
case AND: {
boolean[] update = {false};
@@ -23,6 +23,9 @@
import org.apache.hadoop.hive.ql.optimizer.calcite.reloperators.HiveJoin;
import org.apache.hadoop.hive.ql.optimizer.calcite.reloperators.HiveProject;

import static org.apache.calcite.rex.RexUnknownAs.FALSE;
import static org.apache.calcite.rex.RexUnknownAs.UNKNOWN;

/**
* A holder class for rules related to the SEARCH operator.
*/
@@ -32,17 +35,17 @@ private HiveSearchRules() {
}

public static final RelOptRule PROJECT_SEARCH_EXPAND =
new HiveRexShuttleTransformRule.Config().withRexShuttle(SearchTransformer.Shuttle::new)
new HiveRexShuttleTransformRule.Config().withRexShuttle(x -> new SearchTransformer.Shuttle(x, UNKNOWN))
.withDescription("HiveProjectSearchExpandRule")
.withOperandSupplier(o -> o.operand(HiveProject.class).anyInputs())
.toRule();
public static final RelOptRule FILTER_SEARCH_EXPAND =
new HiveRexShuttleTransformRule.Config().withRexShuttle(SearchTransformer.Shuttle::new)
new HiveRexShuttleTransformRule.Config().withRexShuttle(x -> new SearchTransformer.Shuttle(x, FALSE))
.withDescription("HiveFilterSearchExpandRule")
.withOperandSupplier(o -> o.operand(HiveFilter.class).anyInputs())
.toRule();
public static final RelOptRule JOIN_SEARCH_EXPAND =
new HiveRexShuttleTransformRule.Config().withRexShuttle(SearchTransformer.Shuttle::new)
new HiveRexShuttleTransformRule.Config().withRexShuttle(x -> new SearchTransformer.Shuttle(x, FALSE))
.withDescription("HiveJoinSearchExpandRule")
.withOperandSupplier(o -> o.operand(HiveJoin.class).anyInputs())
.toRule();
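The choice of FALSE for the filter and join rules can also be sketched in plain Java: in a WHERE clause only rows whose predicate is definitely TRUE are kept, so UNKNOWN and FALSE are indistinguishable there and the IS NOT NULL guard becomes redundant. Names below are hypothetical; this models the semantics rather than the actual rule code:

```java
// Models why IS NOT NULL is redundant when unknownContext = FALSE:
// a filter keeps a row only when the predicate is TRUE, so UNKNOWN
// (modeled as null) behaves exactly like FALSE.
public class FilterContextRedundancy {

    // A filter keeps a row only when the predicate is definitely TRUE.
    static boolean keepsRow(Boolean predicate) {
        return Boolean.TRUE.equals(predicate); // null (UNKNOWN) drops the row
    }

    // Expansion with the IS NOT NULL guard: NULL input -> FALSE.
    static Boolean guardedBetween(Integer x) {
        return x != null && x >= 25 && x <= 28;
    }

    // Expansion without the guard: NULL input -> UNKNOWN (null).
    static Boolean bareBetween(Integer x) {
        return x == null ? null : (x >= 25 && x <= 28);
    }

    public static void main(String[] args) {
        // For a NULL input both forms drop the row: the guard adds nothing.
        System.out.println(keepsRow(guardedBetween(null)));
        System.out.println(keepsRow(bareBetween(null)));
        // For non-NULL inputs the two forms agree as well.
        System.out.println(keepsRow(guardedBetween(26)));
        System.out.println(keepsRow(bareBetween(26)));
    }
}
```

This is why the `.q.out` diffs below show `IS NOT NULL` conjuncts disappearing from `HiveFilter` conditions while the surviving predicates are unchanged.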
@@ -35,6 +35,7 @@
import org.apache.calcite.rex.RexInputRef;
import org.apache.calcite.rex.RexLiteral;
import org.apache.calcite.rex.RexNode;
import org.apache.calcite.rex.RexUnknownAs;
import org.apache.calcite.rex.RexUtil;
import org.apache.calcite.rex.RexVisitorImpl;
import org.apache.calcite.sql.SqlKind;
@@ -106,7 +107,7 @@ public Double visitCall(RexCall call) {
break;
}
case SEARCH:
return new SearchTransformer<>(rexBuilder, call).transform().accept(this);
return new SearchTransformer<>(rexBuilder, call, RexUnknownAs.FALSE).transform().accept(this);
case OR: {
selectivity = computeDisjunctionSelectivity(call);
break;
@@ -37,6 +37,7 @@
import org.apache.calcite.rex.RexLiteral;
import org.apache.calcite.rex.RexNode;
import org.apache.calcite.rex.RexOver;
import org.apache.calcite.rex.RexUnknownAs;
import org.apache.calcite.rex.RexUtil;
import org.apache.calcite.rex.RexVisitorImpl;
import org.apache.calcite.rex.RexWindow;
@@ -200,7 +201,7 @@ public ExprNodeDesc visitCall(RexCall call) {
args.add(operand.accept(this));
}
} else if (call.getKind() == SqlKind.SEARCH) {
return new SearchTransformer<>(rexBuilder, call).transform().accept(this);
return new SearchTransformer<>(rexBuilder, call, RexUnknownAs.UNKNOWN).transform().accept(this);
} else {
for (RexNode operand : call.operands) {
args.add(operand.accept(this));
@@ -15,7 +15,7 @@ HiveSortLimit(sort0=[$0], sort1=[$1], sort2=[$2], sort3=[$3], dir0=[ASC], dir1=[
HiveFilter(condition=[AND(IS NOT NULL($4), IS NOT NULL($2), IS NOT NULL($6), IS NOT NULL($22))])
HiveTableScan(table=[[default, store_sales]], table:alias=[store_sales])
HiveProject(d_date_sk=[$0])
HiveFilter(condition=[AND(IN($6, 2000, 2001, 2002), OR(BETWEEN(false, $9, 1, 3), BETWEEN(false, $9, 25, 28)), IS NOT NULL($9))])
HiveFilter(condition=[AND(IN($6, 2000, 2001, 2002), OR(BETWEEN(false, $9, 1, 3), BETWEEN(false, $9, 25, 28)))])
HiveTableScan(table=[[default, date_dim]], table:alias=[date_dim])
HiveProject(hd_demo_sk=[$0])
HiveFilter(condition=[AND(>($4, 0), IN($2, _UTF-16LE'>10000', _UTF-16LE'unknown'), CASE(>($4, 0), >(/(CAST($3):DOUBLE, CAST($4):DOUBLE), 1.2), false))])
@@ -10,7 +10,7 @@ HiveAggregate(group=[{}], agg#0=[sum($7)])
HiveTableScan(table=[[default, customer_demographics]], table:alias=[customer_demographics])
HiveJoin(condition=[=($3, $7)], joinType=[inner], algorithm=[none], cost=[not available])
HiveProject(ss_cdemo_sk=[$3], ss_addr_sk=[$5], ss_quantity=[$9], ss_sold_date_sk=[$22], EXPR$0=[BETWEEN(false, $21, 0:DECIMAL(12, 2), 2000:DECIMAL(12, 2))], EXPR$1=[BETWEEN(false, $21, 150:DECIMAL(12, 2), 3000:DECIMAL(12, 2))], EXPR$2=[BETWEEN(false, $21, 50:DECIMAL(12, 2), 25000:DECIMAL(12, 2))])
HiveFilter(condition=[AND(BETWEEN(false, $12, 50:DECIMAL(3, 0), 200:DECIMAL(3, 0)), IS NOT NULL($21), IS NOT NULL($12), IS NOT NULL($3), IS NOT NULL($5), IS NOT NULL($6), IS NOT NULL($22))])
HiveFilter(condition=[AND(BETWEEN(false, $12, 50:DECIMAL(3, 0), 200:DECIMAL(3, 0)), IS NOT NULL($21), IS NOT NULL($3), IS NOT NULL($5), IS NOT NULL($6), IS NOT NULL($22))])
HiveTableScan(table=[[default, store_sales]], table:alias=[store_sales])
HiveProject(d_date_sk=[$0])
HiveFilter(condition=[=($6, 1998)])
@@ -23,10 +23,10 @@ HiveProject(_o__c0=[$0], _o__c1=[$1], _o__c2=[$2], _o__c3=[$3])
HiveFilter(condition=[AND(IN($8, _UTF-16LE'GA', _UTF-16LE'IN', _UTF-16LE'KY', _UTF-16LE'MO', _UTF-16LE'MT', _UTF-16LE'NM', _UTF-16LE'OR', _UTF-16LE'WI', _UTF-16LE'WV'), =($10, _UTF-16LE'United States'))])
HiveTableScan(table=[[default, customer_address]], table:alias=[customer_address])
HiveProject(cd_demo_sk=[$0], cd_marital_status=[$2], cd_education_status=[$3], EXPR$0=[=($2, _UTF-16LE'M')], EXPR$1=[=($3, _UTF-16LE'4 yr Degree')], EXPR$2=[=($2, _UTF-16LE'D')], EXPR$3=[=($3, _UTF-16LE'Primary')], EXPR$4=[=($2, _UTF-16LE'U')], EXPR$5=[=($3, _UTF-16LE'Advanced Degree')])
HiveFilter(condition=[AND(IN($2, _UTF-16LE'D', _UTF-16LE'M', _UTF-16LE'U'), IN($3, _UTF-16LE'4 yr Degree', _UTF-16LE'Advanced Degree', _UTF-16LE'Primary'), IS NOT NULL($2), IS NOT NULL($3))])
HiveFilter(condition=[AND(IN($2, _UTF-16LE'D', _UTF-16LE'M', _UTF-16LE'U'), IN($3, _UTF-16LE'4 yr Degree', _UTF-16LE'Advanced Degree', _UTF-16LE'Primary'))])
HiveTableScan(table=[[default, customer_demographics]], table:alias=[cd1])
HiveProject(cd_demo_sk=[$0], cd_marital_status=[$2], cd_education_status=[$3])
HiveFilter(condition=[AND(IN($2, _UTF-16LE'D', _UTF-16LE'M', _UTF-16LE'U'), IN($3, _UTF-16LE'4 yr Degree', _UTF-16LE'Advanced Degree', _UTF-16LE'Primary'), IS NOT NULL($2), IS NOT NULL($3))])
HiveFilter(condition=[AND(IN($2, _UTF-16LE'D', _UTF-16LE'M', _UTF-16LE'U'), IN($3, _UTF-16LE'4 yr Degree', _UTF-16LE'Advanced Degree', _UTF-16LE'Primary'))])
HiveTableScan(table=[[default, customer_demographics]], table:alias=[cd2])
HiveProject(r_reason_sk=[$0], r_reason_desc=[$2])
HiveTableScan(table=[[default, reason]], table:alias=[reason])
@@ -109,10 +109,10 @@ STAGE PLANS:
Map Operator Tree:
TableScan
alias: date_dim
filterExpr: ((d_year) IN (2000, 2001, 2002) and (d_dom BETWEEN 1 AND 3 or d_dom BETWEEN 25 AND 28) and d_dom is not null) (type: boolean)
filterExpr: ((d_year) IN (2000, 2001, 2002) and (d_dom BETWEEN 1 AND 3 or d_dom BETWEEN 25 AND 28)) (type: boolean)
Statistics: Num rows: 73049 Data size: 1168784 Basic stats: COMPLETE Column stats: COMPLETE
Filter Operator
predicate: ((d_year) IN (2000, 2001, 2002) and (d_dom BETWEEN 1 AND 3 or d_dom BETWEEN 25 AND 28) and d_dom is not null) (type: boolean)
predicate: ((d_year) IN (2000, 2001, 2002) and (d_dom BETWEEN 1 AND 3 or d_dom BETWEEN 25 AND 28)) (type: boolean)
Statistics: Num rows: 249 Data size: 3984 Basic stats: COMPLETE Column stats: COMPLETE
Select Operator
expressions: d_date_sk (type: bigint)
ql/src/test/results/clientpositive/perf/tpcds30tb/tez/query48.q.out (10 additions, 10 deletions)
@@ -15,16 +15,16 @@ STAGE PLANS:
Map Operator Tree:
TableScan
alias: store_sales
filterExpr: (ss_sales_price BETWEEN 50 AND 200 and ss_net_profit is not null and ss_sales_price is not null and ss_cdemo_sk is not null and ss_addr_sk is not null and ss_store_sk is not null) (type: boolean)
probeDecodeDetails: cacheKey:HASH_MAP_MAPJOIN_66_container, bigKeyColName:ss_addr_sk, smallTablePos:1, keyRatio:3.245580464030688E-4
filterExpr: (ss_sales_price BETWEEN 50 AND 200 and ss_net_profit is not null and ss_cdemo_sk is not null and ss_addr_sk is not null and ss_store_sk is not null) (type: boolean)
probeDecodeDetails: cacheKey:HASH_MAP_MAPJOIN_66_container, bigKeyColName:ss_addr_sk, smallTablePos:1, keyRatio:3.323843839779123E-4
Statistics: Num rows: 82510879939 Data size: 20962809999708 Basic stats: COMPLETE Column stats: COMPLETE
Filter Operator
predicate: (ss_sales_price BETWEEN 50 AND 200 and ss_net_profit is not null and ss_sales_price is not null and ss_cdemo_sk is not null and ss_addr_sk is not null and ss_store_sk is not null) (type: boolean)
Statistics: Num rows: 54925699257 Data size: 13954486953516 Basic stats: COMPLETE Column stats: COMPLETE
predicate: (ss_sales_price BETWEEN 50 AND 200 and ss_net_profit is not null and ss_cdemo_sk is not null and ss_addr_sk is not null and ss_store_sk is not null) (type: boolean)
Statistics: Num rows: 56250168542 Data size: 14290983158452 Basic stats: COMPLETE Column stats: COMPLETE
Select Operator
expressions: ss_cdemo_sk (type: bigint), ss_addr_sk (type: bigint), ss_quantity (type: int), ss_sold_date_sk (type: bigint), ss_net_profit BETWEEN 0 AND 2000 (type: boolean), ss_net_profit BETWEEN 150 AND 3000 (type: boolean), ss_net_profit BETWEEN 50 AND 25000 (type: boolean)
outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6
Statistics: Num rows: 54925699257 Data size: 2171111781128 Basic stats: COMPLETE Column stats: COMPLETE
Statistics: Num rows: 56250168542 Data size: 2223465613804 Basic stats: COMPLETE Column stats: COMPLETE
Map Join Operator
condition map:
Inner Join 0 to 1
@@ -34,7 +34,7 @@ STAGE PLANS:
outputColumnNames: _col0, _col1, _col2, _col4, _col5, _col6
input vertices:
1 Map 3
Statistics: Num rows: 11039131354 Data size: 327336014176 Basic stats: COMPLETE Column stats: COMPLETE
Statistics: Num rows: 11305327153 Data size: 335229341020 Basic stats: COMPLETE Column stats: COMPLETE
Map Join Operator
condition map:
Inner Join 0 to 1
@@ -44,7 +44,7 @@ STAGE PLANS:
outputColumnNames: _col1, _col2, _col4, _col5, _col6
input vertices:
1 Map 4
Statistics: Num rows: 315403755 Data size: 3784845072 Basic stats: COMPLETE Column stats: COMPLETE
Statistics: Num rows: 323009350 Data size: 3876112212 Basic stats: COMPLETE Column stats: COMPLETE
Map Join Operator
condition map:
Inner Join 0 to 1
@@ -54,14 +54,14 @@ STAGE PLANS:
outputColumnNames: _col2, _col4, _col5, _col6, _col10, _col11, _col12
input vertices:
1 Map 5
Statistics: Num rows: 26779570 Data size: 642709684 Basic stats: COMPLETE Column stats: COMPLETE
Statistics: Num rows: 27425328 Data size: 658207876 Basic stats: COMPLETE Column stats: COMPLETE
Filter Operator
predicate: ((_col10 and _col4) or (_col11 and _col5) or (_col12 and _col6)) (type: boolean)
Statistics: Num rows: 20084676 Data size: 482032228 Basic stats: COMPLETE Column stats: COMPLETE
Statistics: Num rows: 20568996 Data size: 493655908 Basic stats: COMPLETE Column stats: COMPLETE
Select Operator
expressions: _col2 (type: int)
outputColumnNames: _col2
Statistics: Num rows: 20084676 Data size: 482032228 Basic stats: COMPLETE Column stats: COMPLETE
Statistics: Num rows: 20568996 Data size: 493655908 Basic stats: COMPLETE Column stats: COMPLETE
Group By Operator
aggregations: sum(_col2)
minReductionHashAggr: 0.99
@@ -197,10 +197,10 @@ STAGE PLANS:
Map Operator Tree:
TableScan
alias: cd1
filterExpr: ((cd_marital_status) IN ('D', 'M', 'U') and (cd_education_status) IN ('4 yr Degree ', 'Advanced Degree ', 'Primary ') and cd_marital_status is not null and cd_education_status is not null) (type: boolean)
filterExpr: ((cd_marital_status) IN ('D', 'M', 'U') and (cd_education_status) IN ('4 yr Degree ', 'Advanced Degree ', 'Primary ')) (type: boolean)
Statistics: Num rows: 1920800 Data size: 359189600 Basic stats: COMPLETE Column stats: COMPLETE
Filter Operator
predicate: ((cd_marital_status) IN ('D', 'M', 'U') and (cd_education_status) IN ('4 yr Degree ', 'Advanced Degree ', 'Primary ') and cd_marital_status is not null and cd_education_status is not null) (type: boolean)
predicate: ((cd_marital_status) IN ('D', 'M', 'U') and (cd_education_status) IN ('4 yr Degree ', 'Advanced Degree ', 'Primary ')) (type: boolean)
Statistics: Num rows: 493920 Data size: 92363040 Basic stats: COMPLETE Column stats: COMPLETE
Select Operator
expressions: cd_demo_sk (type: bigint), cd_marital_status (type: char(1)), cd_education_status (type: char(20)), (cd_marital_status = 'M') (type: boolean), (cd_education_status = '4 yr Degree ') (type: boolean), (cd_marital_status = 'D') (type: boolean), (cd_education_status = 'Primary ') (type: boolean), (cd_marital_status = 'U') (type: boolean), (cd_education_status = 'Advanced Degree ') (type: boolean)