Skip to content

feat: implement stddev with ddof parameter support #4949

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Draft
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

venkateshdb
Copy link

Changes Made

  • Added ddof param to stddev in DataFrame, GroupedDataFrame, and Expression (default 0 = population; 1 = sample).
  • Introduced StddevParams and updated AggExpr::Stddev to carry ddof through DSL, IR, logical, and physical plans.
  • Implemented population vs sample stddev formulas in planners, updated Python bindings to stddev(ddof) and SQL to support stddev (ddof=0) and stddev_samp (ddof=1).
  • Allowed Null input type for stddev supertype.

Related Issues

Checklist

  • Documented in API Docs (if applicable)
  • Documented in User Guide (if applicable)
  • If adding a new documentation page, doc is added to docs/mkdocs.yml navigation
  • Documentation builds and is formatted properly (tag @/ccmao1130 for docs review)

@github-actions github-actions bot added the feat label Aug 11, 2025
Copy link
Contributor

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Greptile Summary

This PR implements comprehensive support for the ddof (Delta Degrees of Freedom) parameter in standard deviation calculations throughout the Daft framework. The changes enable users to calculate both population standard deviation (ddof=0, default) and sample standard deviation (ddof=1), which is essential statistical functionality that distinguishes between dividing by N versus N-1 in the calculation formula.

The implementation introduces a new StddevParams struct that wraps both the child expression and the ddof parameter, replacing the simple ExprRef previously used in AggExpr::Stddev. This change propagates through the entire execution pipeline:

  • Python API layer: Updated DataFrame.stddev(), GroupedDataFrame.stddev(), and Expression.stddev() methods to accept the ddof parameter
  • DSL layer: Added stddev_with_ddof() method and updated expression handling
  • SQL layer: Registered both stddev (ddof=0) and stddev_samp (ddof=1) functions for SQL compliance
  • Physical planning: Implemented distinct mathematical formulas for population vs sample standard deviation in both single-stage and multi-stage aggregation scenarios
  • Type system: Added support for DataType::Null inputs to return consistent Float64 results

The changes maintain backward compatibility by defaulting ddof to 0, preserving existing behavior while adding new statistical capabilities. Comprehensive test coverage validates the mathematical correctness across various scenarios including grouped operations, multiple partitions, and edge cases like insufficient sample sizes.

Confidence score: 4/5

  • This PR is safe to merge with moderate risk due to comprehensive implementation across multiple system layers
  • Score reflects thorough mathematical implementation and extensive test coverage, but complexity of changes across distributed aggregation pipeline requires careful validation
  • Pay close attention to protobuf serialization logic in src/daft-ir/src/proto/rex.rs which may not preserve ddof during round-trip serialization

11 files reviewed, 2 comments

Edit Code Review Bot Settings | Greptile

Comment on lines +353 to +356
"stddev" => Self::Stddev(StddevParams {
child: arg,
ddof: 0,
}),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

logic: The ddof parameter is hardcoded to 0 during deserialization, which means any stddev operation from protobuf will use population standard deviation regardless of the original ddof value.

Comment on lines +544 to 546
Self::Stddev(StddevParams { child: expr, .. }) => {
let child_id = expr.semantic_id(schema);
FieldID::new(format!("{child_id}.local_stddev()"))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

logic: Missing ddof parameter in semantic_id generation - this could cause issues with plan caching/optimization when different ddof values should produce different semantic IDs

Suggested change
Self::Stddev(StddevParams { child: expr, .. }) => {
let child_id = expr.semantic_id(schema);
FieldID::new(format!("{child_id}.local_stddev()"))
Self::Stddev(StddevParams { child: expr, ddof }) => {
let child_id = expr.semantic_id(schema);
FieldID::new(format!("{child_id}.local_stddev(ddof={})", ddof))

@venkateshdb venkateshdb changed the title feat: implement stddev with ddof parameter support feat: implement stddev with ddof parameter support #4464 Aug 11, 2025
@venkateshdb venkateshdb changed the title feat: implement stddev with ddof parameter support #4464 feat: implement stddev with ddof parameter support Aug 11, 2025
@srilman
Copy link
Contributor

srilman commented Aug 11, 2025

@venkateshdb I'm going to make this PR a draft since CI is failing. Can you open up the PR for review once you're ready?

@srilman srilman marked this pull request as draft August 11, 2025 18:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants