perf: URL Operators #4871
Conversation
@@ -160,6 +160,12 @@ impl ToFromProto for ir::rel::LogicalPlan {
            let udf_project = udf_project.to_proto()?.into();
            proto::RelVariant::UdfProject(udf_project)
        }
        Self::UrlDownload(url_download) => {
Couldn't figure out how to serialize the IOConfig, so leaving as a followup. Other opts use the same macro
I'm particularly curious about what prompted the performance optimization for URL handling here—could you elaborate on the core motivation behind this design? @srilman
@Jay-ju this is actually an old PR that I wanted to get through to unblock some newer features. Right now, URL downloads and uploads are done somewhat in a blocking way. The idea behind this PR is to get rid of those blocking operations by having dedicated operators that execute in a fully async manner. In addition (what I'm working on right now), we can collect more & better statistics by having a dedicated operator, such as # of failed requests, waiting time, etc.
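As a rough illustration of the statistics idea mentioned above, here is a hedged sketch of a per-operator metrics struct; the field and method names are illustrative and not from this PR:

```rust
use std::time::Duration;

/// Hedged sketch of dedicated per-operator URL statistics (illustrative names,
/// not the PR's actual metrics struct).
#[derive(Default, Debug)]
struct UrlOpStats {
    requests_started: u64,
    requests_failed: u64,
    total_wait_time: Duration,
}

impl UrlOpStats {
    fn record_completion(&mut self, waited: Duration, failed: bool) {
        self.requests_started += 1;
        if failed {
            self.requests_failed += 1;
        }
        self.total_wait_time += waited;
    }
}

fn main() {
    let mut stats = UrlOpStats::default();
    stats.record_completion(Duration::from_millis(120), false);
    stats.record_completion(Duration::from_millis(300), true);
    println!("{stats:?}");
}
```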
Greptile Summary
This pull request introduces custom Swordfish operators for URL downloads and uploads that implement streaming sink functionality, representing a significant performance optimization over the previous blocking approach. The core innovation allows starting connections for larger blocks of files and returning the first `max_connections` files that complete, enabling the pipeline to continue processing while additional downloads/uploads complete in the background.

The implementation adds new logical plan operators (`UrlDownload` and `UrlUpload`) that integrate with Daft's execution framework through streaming sinks. These operators support configurable connection limits, error handling modes, and passthrough columns for column pruning optimization. The streaming architecture uses async templates with `JoinSet` for concurrent task management, proper backpressure control through input size buffering, and sophisticated memory management.
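A minimal sketch of that streaming pattern, assuming tokio's `JoinSet`; the `UrlDownloadState` struct, `fetch` stub, and method names are illustrative stand-ins rather than the PR's actual Swordfish operator code:

```rust
use tokio::task::JoinSet;

/// Hypothetical operator state: the JoinSet outlives a single execute call, so
/// downloads that have not finished yet keep running in the background.
struct UrlDownloadState {
    tasks: JoinSet<Result<Vec<u8>, String>>,
    max_connections: usize,
}

impl UrlDownloadState {
    /// Spawn downloads for a new block of URLs without waiting on any of them.
    fn submit(&mut self, urls: Vec<String>) {
        for url in urls {
            // `fetch` stands in for the real async HTTP / object-store GET.
            self.tasks.spawn(async move { fetch(&url).await });
        }
    }

    /// Hand back up to `max_connections` completed downloads so the pipeline
    /// can keep streaming while the rest stay in flight.
    async fn drain_ready(&mut self) -> Vec<Result<Vec<u8>, String>> {
        let mut ready = Vec::new();
        while ready.len() < self.max_connections {
            match self.tasks.join_next().await {
                Some(joined) => ready.push(joined.unwrap_or_else(|e| Err(e.to_string()))),
                None => break, // nothing left in flight
            }
        }
        ready
    }
}

async fn fetch(url: &str) -> Result<Vec<u8>, String> {
    let _ = url; // placeholder for an actual request
    Ok(Vec::new())
}

#[tokio::main]
async fn main() {
    let mut state = UrlDownloadState { tasks: JoinSet::new(), max_connections: 2 };
    state.submit(vec!["https://example.com/a".into(), "https://example.com/b".into()]);
    let first_chunk = state.drain_ready().await;
    println!("got {} results while any remaining tasks keep running", first_chunk.len());
}
```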
Key architectural changes include:
- New `StreamingSinkFinalizeOutput` enum that supports multi-stage finalization with `HasMoreOutput` and `Finished` variants
- Async operation template framework in `streaming_sink/async_ops/` with reusable patterns for concurrent I/O operations
- Updated physical and logical plan translation layers to handle URL operations
- Integration with the distributed pipeline system for scalable execution
- API simplifications across the codebase, particularly removing error handling from `size_bytes()` methods that are now infallible
The changes also introduce a new configuration parameter `url_ops_bytes_buffer` (default 256MB) for memory management during streaming operations, and update test files to handle non-deterministic result ordering using `sort_pydict` utilities. The implementation follows established patterns in the codebase while providing substantial performance improvements for URL-based data operations.
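A hedged sketch of how a byte budget like `url_ops_bytes_buffer` could gate admissions; the accounting below is an assumption for illustration, not the PR's actual buffering logic:

```rust
/// Hypothetical backpressure sketch: track how many input bytes are currently
/// held by in-flight URL operations and stop admitting new morsels once the
/// configured budget (e.g. the 256MB default) would be exceeded.
struct ByteBudget {
    in_flight_bytes: usize,
    limit_bytes: usize, // e.g. url_ops_bytes_buffer
}

impl ByteBudget {
    fn new(limit_bytes: usize) -> Self {
        Self { in_flight_bytes: 0, limit_bytes }
    }

    /// Returns true if a new input of `size` bytes may be admitted now.
    fn try_admit(&mut self, size: usize) -> bool {
        if self.in_flight_bytes + size > self.limit_bytes {
            return false; // caller should drain some finished tasks first
        }
        self.in_flight_bytes += size;
        true
    }

    /// Release bytes when a download/upload task completes.
    fn release(&mut self, size: usize) {
        self.in_flight_bytes = self.in_flight_bytes.saturating_sub(size);
    }
}

fn main() {
    let mut budget = ByteBudget::new(256 * 1024 * 1024);
    assert!(budget.try_admit(64 * 1024 * 1024));
    budget.release(64 * 1024 * 1024);
}
```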
Confidence score: 3/5
- This PR introduces complex streaming functionality with potential edge cases in async operation management and memory handling that may cause issues under high load or error conditions
- Score reflects the significant architectural changes involving async operations, complex state management, and distributed execution that increase the risk of subtle bugs
- Pay close attention to streaming sink implementations, async operation templates, and error handling in URL download/upload operations
89 files reviewed, 14 comments
let columns = columns
    .iter()
    .map(|col| self.columns[col.index].clone())
    .collect::<Vec<_>>();
style: Variable name collision: the parameter `columns` is shadowed by the local variable on line 28. Consider renaming the local variable to `selected_columns` for clarity.
Suggested change:
- let columns = columns
+ let selected_columns = columns
      .iter()
      .map(|col| self.columns[col.index].clone())
      .collect::<Vec<_>>();
pub fn select_columns(&self, columns: &[BoundColumn]) -> crate::Result<Self> {
    let new_schema = Arc::new(Schema::new(
        columns.iter().map(|col| self.schema[col.index].clone()),
    ));
    let columns = columns
        .iter()
        .map(|col| self.columns[col.index].clone())
        .collect::<Vec<_>>();
    Ok(Self {
        columns,
        schema: new_schema,
    })
}
style: Variable name collision: the parameter `columns` shadows the struct field `columns`. Consider renaming the parameter to `selected_columns` or `column_refs` for clarity.
Suggested change:
- pub fn select_columns(&self, columns: &[BoundColumn]) -> crate::Result<Self> {
+ pub fn select_columns(&self, selected_columns: &[BoundColumn]) -> crate::Result<Self> {
      let new_schema = Arc::new(Schema::new(
-         columns.iter().map(|col| self.schema[col.index].clone()),
+         selected_columns.iter().map(|col| self.schema[col.index].clone()),
      ));
-     let columns = columns
+     let columns = selected_columns
          .iter()
          .map(|col| self.columns[col.index].clone())
          .collect::<Vec<_>>();
      Ok(Self {
          columns,
          schema: new_schema,
      })
  }
//!
//! On StreamingSink::finalize
//! 1. Poll for finished tasks up to finalize_batch_size()
//! 2. If num_active_tasks() >= 0, return HasMoreOutput to loop again
logic: Logic error: condition should be `> 0` not `>= 0`, since we only want to continue when there are active tasks.
Suggested change:
- //! 2. If num_active_tasks() >= 0, return HasMoreOutput to loop again
+ //! 2. If num_active_tasks() > 0, return HasMoreOutput to loop again
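A minimal sketch of the finalize protocol with the corrected check; the `StreamingSinkFinalizeOutput` variant names come from this PR's summary, while `FinalizeState`, `poll_finished`, and the `String` payloads are illustrative stand-ins:

```rust
/// Mirrors the multi-stage finalization described above: the sink keeps getting
/// polled while it still has work, and signals completion otherwise.
enum StreamingSinkFinalizeOutput {
    HasMoreOutput(Vec<String>), // partial results, call finalize again
    Finished(Vec<String>),      // last batch of results
}

struct FinalizeState {
    active_tasks: usize,
    finished: Vec<String>,
}

impl FinalizeState {
    fn num_active_tasks(&self) -> usize {
        self.active_tasks
    }

    /// Illustrative stand-in for "poll for finished tasks up to finalize_batch_size()".
    fn poll_finished(&mut self, batch_size: usize) -> Vec<String> {
        let n = batch_size.min(self.finished.len());
        self.finished.drain(..n).collect()
    }

    fn finalize(&mut self, batch_size: usize) -> StreamingSinkFinalizeOutput {
        let batch = self.poll_finished(batch_size);
        // Corrected condition: only loop again while tasks are still active.
        if self.num_active_tasks() > 0 {
            StreamingSinkFinalizeOutput::HasMoreOutput(batch)
        } else {
            StreamingSinkFinalizeOutput::Finished(batch)
        }
    }
}

fn main() {
    let mut state = FinalizeState { active_tasks: 1, finished: vec!["row-batch-0".to_string()] };
    assert!(matches!(state.finalize(8), StreamingSinkFinalizeOutput::HasMoreOutput(_)));
    state.active_tasks = 0;
    assert!(matches!(state.finalize(8), StreamingSinkFinalizeOutput::Finished(_)));
}
```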
let size = pickle_dumps(c)
    .expect("Failed to pickle Python array for size estimation")
    .len();
logic: Using `expect()` will cause a panic if pickle serialization fails. Consider if this is the desired behavior - it changes from graceful error handling to a potential crash.
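If graceful handling is preferred, one hedged option (which would make this size-estimation path fallible again) looks like the sketch below; `PyArrayStandIn`, the `String` error type, and the `pickle_dumps` stub are stand-ins for the real helpers:

```rust
// Hedged sketch: propagate the pickle failure instead of panicking, assuming
// the enclosing size-estimation function is allowed to return a Result.
fn estimated_pickled_size(c: &PyArrayStandIn) -> Result<usize, String> {
    let bytes = pickle_dumps(c)
        .map_err(|e| format!("Failed to pickle Python array for size estimation: {e}"))?;
    Ok(bytes.len())
}

struct PyArrayStandIn;

fn pickle_dumps(_c: &PyArrayStandIn) -> Result<Vec<u8>, String> {
    Ok(Vec::new()) // placeholder for the real pickle call
}

fn main() {
    assert!(estimated_pickled_size(&PyArrayStandIn).is_ok());
}
```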
@@ -7,6 +7,7 @@ common-scan-info = {path = "../common/scan-info", default-features = false}
common-treenode = {path = "../common/treenode", default-features = false}
daft-core = {path = "../daft-core", default-features = false}
daft-dsl = {path = "../daft-dsl", default-features = false}
daft-functions-uri = {path = "../daft-functions-uri", default-features = false}
logic: Missing python feature propagation for daft-functions-uri dependency
{
    panic!("Python feature is required for io_config");
}
logic: `panic!` will crash the program. Consider returning a proper error instead.
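A hedged sketch of the suggested alternative; the function name and `String` error type are illustrative, and the real code would presumably use Daft's own error type:

```rust
// Hedged sketch: return a descriptive error instead of panicking when io_config
// support is requested but the crate was built without the python feature.
fn require_python_feature(python_feature_enabled: bool) -> Result<(), String> {
    if !python_feature_enabled {
        return Err("Python feature is required for io_config".to_string());
    }
    Ok(())
}

fn main() {
    // Surfaces an Err the caller can handle, rather than aborting the process.
    assert!(require_python_feature(false).is_err());
}
```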
if let Expr::ScalarFn(ScalarFn::Builtin(BuiltinScalarFn { udf, inputs })) =
    child.as_ref()
    && udf.as_ref().type_id() == TypeId::of::<UrlDownload>()
{
    changed = true;

    // Split and save child expression
    // Child may not have an alias, so we need to generate a new one
    // TODO: Remove with ordinals
    let child_name = format!("id-{}", uuid::Uuid::new_v4());

    let args: UrlDownloadArgs<ExprRef> = inputs.clone().try_into()?;
    split_exprs.push(SplitExpr::UrlDownload {
        child_name: child_name.clone(),
        args,
    });

    new_children[idx] = resolved_col(child_name);
} else if let Expr::ScalarFn(ScalarFn::Builtin(BuiltinScalarFn {
    udf,
    inputs,
})) = child.as_ref()
    && udf.as_ref().type_id() == TypeId::of::<UrlUpload>()
{
    changed = true;

    // Split and save child expression
    // Child may not have an alias, so we need to generate a new one
    // TODO: Remove with ordinals
    let child_name = format!("id-{}", uuid::Uuid::new_v4());
    let child = child.alias(child_name.clone());
    split_exprs.push(child);

    let args: UrlUploadArgs<ExprRef> = inputs.clone().try_into()?;
    split_exprs.push(SplitExpr::UrlUpload {
        child_name: child_name.clone(),
        args,
    });

    new_children[idx] = resolved_col(child_name);
}
style: The pattern matching logic here is quite complex with nested conditionals. Consider extracting this into a helper method to improve readability.
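A hedged sketch of the extraction this comment suggests: a small helper that classifies the UDF by its concrete type, so the rewrite rule only needs one branch body per kind. The types below are simplified stand-ins for the real expression types:

```rust
use std::any::{Any, TypeId};

// Simplified stand-ins for the real UDF types in this diff.
struct UrlDownload;
struct UrlUpload;

enum UrlOpKind {
    Download,
    Upload,
}

/// Hedged helper: classify a scalar UDF by its concrete type so the rewrite
/// rule above can collapse its two nearly identical branches into one.
fn classify_url_op(udf: &dyn Any) -> Option<UrlOpKind> {
    let id = udf.type_id();
    if id == TypeId::of::<UrlDownload>() {
        Some(UrlOpKind::Download)
    } else if id == TypeId::of::<UrlUpload>() {
        Some(UrlOpKind::Upload)
    } else {
        None
    }
}

fn main() {
    let udf = UrlDownload;
    assert!(matches!(classify_url_op(&udf), Some(UrlOpKind::Download)));
}
```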
SplitExpr::UrlDownload { child_name, args } => {
    out_names.insert(child_name.clone());
    out_exprs.push(resolved_col(child_name.clone()));
    let passthrough_columns = last_child
        .schema()
        .field_names()
        .map(name_to_col)
        .collect::<Vec<_>>();
    last_child = Arc::new(LogicalPlan::UrlDownload(UrlDownloadOp::new(
        last_child,
        args,
        child_name,
        passthrough_columns,
    )));
}
SplitExpr::UrlUpload { child_name, args } => {
    out_names.insert(child_name.clone());
    out_exprs.push(resolved_col(child_name.clone()));
    let passthrough_columns = last_child
        .schema()
        .field_names()
        .map(name_to_col)
        .collect::<Vec<_>>();
    last_child = Arc::new(LogicalPlan::UrlUpload(UrlUploadOp::new(
        last_child,
        args,
        child_name,
        passthrough_columns,
    )));
}
style: Both UrlDownload and UrlUpload branches have nearly identical logic for creating passthrough columns and plan nodes. Consider extracting a common helper function to reduce duplication.
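A hedged sketch of the deduplication this comment suggests: the shared bookkeeping (registering the output name, collecting passthrough columns) lives in one helper, and a closure builds the variant-specific node. Types and names are simplified stand-ins:

```rust
use std::sync::Arc;

// Simplified stand-in for the real plan node types.
struct PlanNode {
    name: String,
    passthrough_columns: Vec<String>,
}

/// Hedged helper: the shared bookkeeping from both match arms above lives here;
/// only the node construction differs between download and upload.
fn push_url_op<F>(
    child_name: &str,
    child_schema_fields: &[String],
    out_names: &mut Vec<String>,
    make_node: F,
) -> Arc<PlanNode>
where
    F: FnOnce(Vec<String>) -> PlanNode,
{
    out_names.push(child_name.to_string());
    let passthrough_columns = child_schema_fields.to_vec();
    Arc::new(make_node(passthrough_columns))
}

fn main() {
    let mut out_names = Vec::new();
    let node = push_url_op("id-123", &["a".into(), "b".into()], &mut out_names, |cols| PlanNode {
        name: "url_download".to_string(),
        passthrough_columns: cols,
    });
    assert_eq!(node.passthrough_columns.len(), 2);
    assert_eq!(out_names, vec!["id-123".to_string()]);
    let _ = &node.name;
}
```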
actual.explain(show_all=True)
actual = actual.to_pydict()
style: Debug statement left in test code. Should be removed before merging.
Suggested change:
- actual.explain(show_all=True)
  actual = actual.to_pydict()
) -> SchemaRef {
    let mut fields = passthrough_columns
        .iter()
        .map(|c| input_schema.get_field(&c.name()).unwrap())
logic: using `.unwrap()` here could panic if the column doesn't exist in the schema - consider using proper error handling.
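A hedged sketch of the error propagation this comment asks for, using simplified stand-ins for the real `Schema`/field types:

```rust
use std::collections::HashMap;

// Simplified stand-ins for the real Schema/Field types.
struct Schema {
    fields: HashMap<String, String>, // name -> dtype, as a stand-in
}

impl Schema {
    fn get_field(&self, name: &str) -> Result<&String, String> {
        self.fields
            .get(name)
            .ok_or_else(|| format!("Column {name} not found in schema"))
    }
}

/// Hedged sketch: collect the passthrough fields, propagating a missing-column
/// error instead of panicking via `.unwrap()`.
fn passthrough_fields(schema: &Schema, passthrough_columns: &[String]) -> Result<Vec<String>, String> {
    passthrough_columns
        .iter()
        .map(|c| schema.get_field(c).map(|f| f.clone()))
        .collect()
}

fn main() {
    let schema = Schema { fields: HashMap::from([("a".to_string(), "Int64".to_string())]) };
    assert!(passthrough_fields(&schema, &["a".to_string()]).is_ok());
    assert!(passthrough_fields(&schema, &["missing".to_string()]).is_err());
}
```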
CodSpeed Performance Report
Merging #4871 will degrade performance by 93.79%.
Benchmarks breakdown
Changes Made

Custom Swordfish operator for URL downloads and uploads that can consume in a streaming sink fashion. Rather than blocking on downloading an entire block, we can start connections for a larger block of files and return the first `max_connections` files that we get back, continuing in the pipeline until we get more.

Checklist
- `docs/mkdocs.yml` navigation