-
Notifications
You must be signed in to change notification settings - Fork 253
feat: support count(1) in dataframe #4977
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Greptile Summary
This PR adds support for SQL-compatible count(1)
syntax to Daft DataFrames. The change extends the existing count method to recognize integer arguments as equivalent to count()
or count('*')
, performing a total row count operation rather than counting non-null values in specific columns.
The implementation modifies the count method in daft/dataframe/dataframe.py
by adding a new condition or (len(cols) == 1 and isinstance(cols[0], int))
to the existing special-case handling for empty arguments and asterisk strings. This allows integer literals like 1
to be treated the same way as the wildcard *
syntax.
In SQL databases, COUNT(1)
, COUNT(*)
, and COUNT()
are functionally equivalent operations that count all rows regardless of null values, which differs from COUNT(column_name)
that only counts non-null values in specific columns. This change brings Daft's DataFrame API closer to standard SQL behavior, making it more intuitive for users familiar with SQL conventions.
The change integrates seamlessly with Daft's existing expression system, where the integer literal gets processed through the same aggregation pipeline as other count operations. A corresponding test case in tests/dataframe/test_count.py
validates that df.count(1)
correctly returns the total row count.
Confidence score: 4/5
- This PR is safe to merge with minimal risk as it's a straightforward feature addition with clear SQL semantics
- Score reflects simple logic changes that extend existing functionality without breaking backward compatibility
- Pay close attention to
daft/dataframe/dataframe.py
to ensure the new condition integrates properly with existing count logic
2 files reviewed, 1 comment
if ( | ||
len(cols) == 0 | ||
or (len(cols) == 1 and isinstance(cols[0], str) and cols[0] == "*") | ||
or (len(cols) == 1 and isinstance(cols[0], int)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just out of curiosity, does count(1) have to select a specific column, or is it also selected according to the optimizer?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From the following code, it can be seen that the first field is taken
def count(self) -> LogicalPlanBuilder:
# TODO(Clark): Add dedicated logical/physical ops when introducing metadata-based count optimizations.
first_col = col(self.schema().column_names()[0])
builder = self._builder.aggregate([first_col.count(CountMode.All)._expr], [])
builder = builder.select([first_col.alias("count")._expr])
return LogicalPlanBuilder(builder)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I agree. The method of selecting fields needs to be optimized.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In SQL, COUNT(<lit>)
will return the total number of rows regardless of the <lit>
value. I would translate any literal value to COUNT(*)
and this PR would be complete.
I agree that choosing the column with the smallest "size" makes sense, like choose an int32
column instead of some binary
. I'm not sure this is really necessary or does actually improve perf? Maybe worth testing.
Schema (PostgreSQL v17)
CREATE TABLE T (x INT);
INSERT INTO T VALUES (1), (2), (3);
Query #1
SELECT COUNT(1.0), COUNT('xyz'), COUNT(-1), COUNT(false) FROM T;
count | count | count | count |
---|---|---|---|
3 | 3 | 3 | 3 |
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #4977 +/- ##
==========================================
- Coverage 76.46% 73.92% -2.55%
==========================================
Files 917 916 -1
Lines 127012 130215 +3203
==========================================
- Hits 97124 96263 -861
- Misses 29888 33952 +4064
🚀 New features to boost your workflow:
|
Changes Made
Not count(1) in dataframe is not supported:
Using this pr:
(Showing first 1 of 1 rows)
Related Issues
Checklist
docs/mkdocs.yml
navigation