Commit 5c006b4

Support copying from glob patterns
Closes #112.
1 parent 6b5a0d5 commit 5c006b4

14 files changed: +589 -122 lines changed

.devcontainer/.env

Lines changed: 1 addition & 1 deletion
@@ -20,7 +20,7 @@ AZURE_TEST_READ_WRITE_SAS="se=2100-05-05&sp=rcw&sv=2022-11-02&sr=c&sig=TPz2jEz0t
 
 # http(s) tests
 ALLOW_HTTP=true
-HTTP_ENDPOINT=http://localhost:8080
+HTTP_ENDPOINT=http://localhost:9999
 
 # GCS tests
 GOOGLE_TEST_BUCKET=testbucket

.devcontainer/docker-compose.yml

Lines changed: 2 additions & 2 deletions
@@ -53,13 +53,13 @@ services:
 
   webdav:
     image: rclone/rclone
-    command: ["serve", "webdav", "/data", "--addr", ":8080"]
+    command: ["serve", "webdav", "/data", "--addr", ":9999"]
     env_file:
       - .env
     network_mode: host
     restart: unless-stopped
     healthcheck:
-      test: ["CMD", "nc", "-z", "localhost", "8080"]
+      test: ["CMD", "nc", "-z", "localhost", "9999"]
       interval: 6s
       timeout: 2s
       retries: 3

.github/workflows/ci.yml

Lines changed: 1 addition & 1 deletion
@@ -146,7 +146,7 @@ jobs:
         run: |
           docker run -d \
             --env-file .devcontainer/.env \
-            -p 8080:80 \
+            -p 9999:80 \
             rclone/rclone serve webdav /data --addr :80
 
           while ! curl $HTTP_ENDPOINT; do
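
The compose healthcheck (`nc -z localhost 9999`) and the `while ! curl $HTTP_ENDPOINT` loop in this workflow both wait until the WebDAV server accepts connections on the new port. A minimal Rust sketch of the same readiness probe, using only the standard library; `wait_for_port` is a hypothetical helper and is not part of this commit:

```rust
use std::net::{SocketAddr, TcpStream};
use std::thread::sleep;
use std::time::{Duration, Instant};

/// Hypothetical helper: polls `addr` until a TCP connection succeeds.
/// Returns `true` on the first successful connect, or `false` once
/// `deadline` has elapsed without one.
pub fn wait_for_port(addr: SocketAddr, deadline: Duration) -> bool {
    let start = Instant::now();
    loop {
        // A successful connect means something is listening, like `nc -z`.
        if TcpStream::connect_timeout(&addr, Duration::from_secs(2)).is_ok() {
            return true;
        }
        if start.elapsed() >= deadline {
            return false;
        }
        // Back off briefly before the next attempt.
        sleep(Duration::from_millis(100));
    }
}
```

A test harness could call this with `127.0.0.1:9999` before running the http(s) tests; the port has to match the value in `.devcontainer/.env`, `docker-compose.yml`, and the `-p` mapping above.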

Cargo.lock

Lines changed: 1 addition & 0 deletions
Some generated files are not rendered by default.

Cargo.toml

Lines changed: 1 addition & 0 deletions
@@ -27,6 +27,7 @@ aws-config = { version = "=1.5.18", default-features = false, features = ["rustl
 aws-credential-types = {version = "=1.2.1", default-features = false}
 azure_storage = {version = "0.21", default-features = false}
 futures = "0.3"
+glob = "0.3"
 home = "0.5"
 object_store = {version = "0.12", default-features = false, features = ["aws", "azure", "fs", "gcp", "http"]}
 once_cell = "1"

README.md

Lines changed: 36 additions & 1 deletion
@@ -22,6 +22,7 @@ COPY table FROM 's3://mybucket/data.parquet' WITH (format 'parquet');
 - [Inspect Parquet schema](#inspect-parquet-schema)
 - [Inspect Parquet metadata](#inspect-parquet-metadata)
 - [Inspect Parquet column statistics](#inspect-parquet-column-statistics)
+- [List and read Parquet files from uri pattern](#list-and-read-parquet-files-from-uri-pattern)
 - [Object Store Support](#object-store-support)
 - [Copy Options](#copy-options)
 - [Configuration](#configuration)
@@ -192,6 +193,40 @@ SELECT * FROM parquet.column_stats('/tmp/product_example.parquet')
 (13 rows)
 ```
 
+### List and read Parquet files from uri pattern
+
+You can call `SELECT * FROM parquet.list(<uri_pattern>)` to see all uris that matches with the uri pattern.
+Uri pattern can resolve `**` for directories and `*` for words in the uri.
+
+
+```sql
+COPY (SELECT i FROM generate_series(1, 1000000) i) TO '/tmp/some/test.parquet' with (file_size_bytes '1MB');
+COPY 1000000
+
+SELECT * FROM parquet.list('/tmp/some/**/*.parquet');
+                  uri                  |  size
+---------------------------------------+---------
+ /tmp/some/test.parquet/data_4.parquet |  100162
+ /tmp/some/test.parquet/data_3.parquet | 1486916
+ /tmp/some/test.parquet/data_2.parquet | 1486916
+ /tmp/some/test.parquet/data_0.parquet | 1486920
+ /tmp/some/test.parquet/data_1.parquet | 1486916
+(5 rows)
+
+```
+
+Uri pattern is also supported by `COPY FROM` for all supported object stores except `http(s)` endpoints.
+```sql
+COPY (SELECT i FROM generate_series(1, 1000000) i) TO 's3://testbucket/some/test.parquet' with (file_size_bytes '1MB');
+COPY 1000000
+
+CREATE TABLE test(a int);
+CREATE TABLE
+
+COPY test FROM 's3://testbucket/some/**/*.parquet';
+COPY 1000000
+```
+
 ## Object Store Support
 `pg_parquet` supports reading and writing Parquet files from/to `S3`, `Azure Blob Storage`, `http(s)` and `Google Cloud Storage` object stores.
 
@@ -278,7 +313,7 @@ Supported authorization methods' priority order is shown below:
 
 #### Http(s) Storage
 
-`Https` uris are supported by default. You can set `ALLOW_HTTP` environment variable to allow `http` uris.
+Only `Https` uris are supported by default. You can set `ALLOW_HTTP` environment variable to allow `http` uris.
 
 #### Google Cloud Storage
 
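
The `**` and `*` semantics described in the new README section can be illustrated with a small Rust sketch that uses only the standard library. It is not pg_parquet's implementation: this commit adds the `glob` crate as a dependency, and that crate's matching rules may differ in details. The names `uri_matches`, `segments_match`, and `segment_matches` are hypothetical. The sketch assumes that `**` as a whole path segment matches zero or more segments, and that `*` matches any run of characters within a single segment.

```rust
/// Matches one path segment against a pattern segment in which `*`
/// stands for any (possibly empty) run of characters.
fn segment_matches(pattern: &str, text: &str) -> bool {
    let p: Vec<char> = pattern.chars().collect();
    let t: Vec<char> = text.chars().collect();
    let (mut pi, mut ti) = (0usize, 0usize);
    let mut star: Option<usize> = None; // position of the last `*` seen
    let mut mark = 0usize; // text position where that `*` started matching
    while ti < t.len() {
        if pi < p.len() && p[pi] == '*' {
            star = Some(pi);
            mark = ti;
            pi += 1;
        } else if pi < p.len() && p[pi] == t[ti] {
            pi += 1;
            ti += 1;
        } else if let Some(s) = star {
            // Backtrack: let the last `*` absorb one more character.
            pi = s + 1;
            mark += 1;
            ti = mark;
        } else {
            return false;
        }
    }
    // Any trailing `*` in the pattern may match the empty string.
    while pi < p.len() && p[pi] == '*' {
        pi += 1;
    }
    pi == p.len()
}

/// Matches path segments against pattern segments; a `**` segment is
/// assumed to match zero or more whole segments.
fn segments_match(pattern: &[&str], path: &[&str]) -> bool {
    match pattern.split_first() {
        None => path.is_empty(),
        Some((first, rest)) if *first == "**" => {
            (0..=path.len()).any(|skip| segments_match(rest, &path[skip..]))
        }
        Some((first, rest)) => match path.split_first() {
            Some((head, tail)) => segment_matches(first, head) && segments_match(rest, tail),
            None => false,
        },
    }
}

/// Hypothetical entry point: splits both strings on `/` and compares segments.
pub fn uri_matches(pattern: &str, uri: &str) -> bool {
    let p: Vec<&str> = pattern.split('/').collect();
    let u: Vec<&str> = uri.split('/').collect();
    segments_match(&p, &u)
}
```

Under these assumptions, `uri_matches("/tmp/some/**/*.parquet", "/tmp/some/test.parquet/data_0.parquet")` is true, while a `.csv` file under the same directory is not matched, and `/tmp/some/*.parquet` does not reach into subdirectories.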