Skip to content

feat: support count(1) in dataframe #4977

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Draft
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

huleilei
Copy link
Contributor

@huleilei huleilei commented Aug 14, 2025

Changes Made

Not count(1) in dataframe is not supported:

In [49]: df.count(1).show()
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[49], line 1
----> 1 df.count(1).show()

    [... skipping hidden 1 frame]

File /data00/code/tmp2/Daft/daft/dataframe/dataframe.py:3011, in DataFrame.count(self, *cols)
   3008     raise ValueError("Cannot call count() with both * and column names")
   3010 # Otherwise, perform a column-wise count on the specified columns
-> 3011 return self._apply_agg_fn(Expression.count, cols)

File /data00/code/tmp2/Daft/daft/dataframe/dataframe.py:2854, in DataFrame._apply_agg_fn(self, fn, cols, group_by)
   2852     groupby_name_set = set() if group_by is None else group_by.to_name_set()
   2853     cols = tuple(c for c in self.column_names if c not in groupby_name_set)
-> 2854 exprs = self._wildcard_inputs_to_expressions(cols)
   2855 return self._agg([fn(c) for c in exprs], group_by)

File /data00/code/tmp2/Daft/daft/dataframe/dataframe.py:1596, in DataFrame._wildcard_inputs_to_expressions(self, columns)
   1594 """Handles wildcard argument column inputs."""
   1595 column_input: Iterable[ColumnInputType] = columns[0] if len(columns) == 1 else columns  # type: ignore
-> 1596 return column_inputs_to_expressions(column_input)

File /data00/code/tmp2/Daft/daft/utils.py:126, in column_inputs_to_expressions(columns)
    123 from daft.expressions import col
    125 column_iter: Iterable[ColumnInputType] = [columns] if is_column_input(columns) else columns  # type: ignore
--> 126 return [col(c) if isinstance(c, str) else c for c in column_iter]

TypeError: 'int' object is not iterable

Using this pr:

In [5]: df.count(1).show()
╭────────╮
│ count  │
│ ---    │
│ UInt64 │
╞════════╡
│ 4      │
╰────────╯

(Showing first 1 of 1 rows)

Related Issues

Checklist

  • Documented in API Docs (if applicable)
  • Documented in User Guide (if applicable)
  • If adding a new documentation page, doc is added to docs/mkdocs.yml navigation
  • Documentation builds and is formatted properly (tag @/ccmao1130 for docs review)

@github-actions github-actions bot added the feat label Aug 14, 2025
@huleilei huleilei marked this pull request as draft August 14, 2025 08:52
Copy link
Contributor

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Greptile Summary

This PR adds support for SQL-compatible count(1) syntax to Daft DataFrames. The change extends the existing count method to recognize integer arguments as equivalent to count() or count('*'), performing a total row count operation rather than counting non-null values in specific columns.

The implementation modifies the count method in daft/dataframe/dataframe.py by adding a new condition or (len(cols) == 1 and isinstance(cols[0], int)) to the existing special-case handling for empty arguments and asterisk strings. This allows integer literals like 1 to be treated the same way as the wildcard * syntax.

In SQL databases, COUNT(1), COUNT(*), and COUNT() are functionally equivalent operations that count all rows regardless of null values, which differs from COUNT(column_name) that only counts non-null values in specific columns. This change brings Daft's DataFrame API closer to standard SQL behavior, making it more intuitive for users familiar with SQL conventions.

The change integrates seamlessly with Daft's existing expression system, where the integer literal gets processed through the same aggregation pipeline as other count operations. A corresponding test case in tests/dataframe/test_count.py validates that df.count(1) correctly returns the total row count.

Confidence score: 4/5

  • This PR is safe to merge with minimal risk as it's a straightforward feature addition with clear SQL semantics
  • Score reflects simple logic changes that extend existing functionality without breaking backward compatibility
  • Pay close attention to daft/dataframe/dataframe.py to ensure the new condition integrates properly with existing count logic

2 files reviewed, 1 comment

Edit Code Review Bot Settings | Greptile

if (
len(cols) == 0
or (len(cols) == 1 and isinstance(cols[0], str) and cols[0] == "*")
or (len(cols) == 1 and isinstance(cols[0], int))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just out of curiosity, does count(1) have to select a specific column, or is it also selected according to the optimizer?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From the following code, it can be seen that the first field is taken

    def count(self) -> LogicalPlanBuilder:
        # TODO(Clark): Add dedicated logical/physical ops when introducing metadata-based count optimizations.
        first_col = col(self.schema().column_names()[0])
        builder = self._builder.aggregate([first_col.count(CountMode.All)._expr], [])
        builder = builder.select([first_col.alias("count")._expr])
        return LogicalPlanBuilder(builder)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@huleilei Yes, so is it possible to make an optimization here? It seems more appropriate to select a column with the min size and read the count. WDYT cc @rchowell

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I agree. The method of selecting fields needs to be optimized.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In SQL, COUNT(<lit>) will return the total number of rows regardless of the <lit> value. I would translate any literal value to COUNT(*) and this PR would be complete.

I agree that choosing the column with the smallest "size" makes sense, like choose an int32 column instead of some binary. I'm not sure this is really necessary or does actually improve perf? Maybe worth testing.

Schema (PostgreSQL v17)

CREATE TABLE T (x INT);
INSERT INTO T VALUES (1), (2), (3);

Query #1

SELECT COUNT(1.0), COUNT('xyz'), COUNT(-1), COUNT(false) FROM T;
count count count count
3 3 3 3

View on DB Fiddle

Copy link

codecov bot commented Aug 14, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 73.92%. Comparing base (23549cd) to head (711a89d).
⚠️ Report is 2 commits behind head on main.

Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #4977      +/-   ##
==========================================
- Coverage   76.46%   73.92%   -2.55%     
==========================================
  Files         917      916       -1     
  Lines      127012   130215    +3203     
==========================================
- Hits        97124    96263     -861     
- Misses      29888    33952    +4064     
Files with missing lines Coverage Δ
daft/dataframe/dataframe.py 86.75% <100.00%> (ø)

... and 77 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants