-
Notifications
You must be signed in to change notification settings - Fork 253
feat: implement stddev with ddof parameter support #4949
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Greptile Summary
This PR implements comprehensive support for the ddof
(Delta Degrees of Freedom) parameter in standard deviation calculations throughout the Daft framework. The changes enable users to calculate both population standard deviation (ddof=0, default) and sample standard deviation (ddof=1), which is essential statistical functionality that distinguishes between dividing by N versus N-1 in the calculation formula.
The implementation introduces a new StddevParams
struct that wraps both the child expression and the ddof parameter, replacing the simple ExprRef
previously used in AggExpr::Stddev
. This change propagates through the entire execution pipeline:
- Python API layer: Updated
DataFrame.stddev()
,GroupedDataFrame.stddev()
, andExpression.stddev()
methods to accept the ddof parameter - DSL layer: Added
stddev_with_ddof()
method and updated expression handling - SQL layer: Registered both
stddev
(ddof=0) andstddev_samp
(ddof=1) functions for SQL compliance - Physical planning: Implemented distinct mathematical formulas for population vs sample standard deviation in both single-stage and multi-stage aggregation scenarios
- Type system: Added support for
DataType::Null
inputs to return consistentFloat64
results
The changes maintain backward compatibility by defaulting ddof to 0, preserving existing behavior while adding new statistical capabilities. Comprehensive test coverage validates the mathematical correctness across various scenarios including grouped operations, multiple partitions, and edge cases like insufficient sample sizes.
Confidence score: 4/5
- This PR is safe to merge with moderate risk due to comprehensive implementation across multiple system layers
- Score reflects thorough mathematical implementation and extensive test coverage, but complexity of changes across distributed aggregation pipeline requires careful validation
- Pay close attention to protobuf serialization logic in
src/daft-ir/src/proto/rex.rs
which may not preserve ddof during round-trip serialization
11 files reviewed, 2 comments
"stddev" => Self::Stddev(StddevParams { | ||
child: arg, | ||
ddof: 0, | ||
}), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
logic: The ddof parameter is hardcoded to 0 during deserialization, which means any stddev operation from protobuf will use population standard deviation regardless of the original ddof value.
Self::Stddev(StddevParams { child: expr, .. }) => { | ||
let child_id = expr.semantic_id(schema); | ||
FieldID::new(format!("{child_id}.local_stddev()")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
logic: Missing ddof parameter in semantic_id generation - this could cause issues with plan caching/optimization when different ddof values should produce different semantic IDs
Self::Stddev(StddevParams { child: expr, .. }) => { | |
let child_id = expr.semantic_id(schema); | |
FieldID::new(format!("{child_id}.local_stddev()")) | |
Self::Stddev(StddevParams { child: expr, ddof }) => { | |
let child_id = expr.semantic_id(schema); | |
FieldID::new(format!("{child_id}.local_stddev(ddof={})", ddof)) |
@venkateshdb I'm going to make this PR a draft since CI is failing. Can you open up the PR for review once you're ready? |
Changes Made
Related Issues
Checklist
docs/mkdocs.yml
navigation