
Commit e2c70ef

v3 (#73)
* move folder to v2 and v3 tests for checking proxy behavior
* arch work, copy icedb file and remove postgres
* tombstone cleanup
* inactive file cleanups with tombstones
* conflicts
* tombstone cleanup
* typo fix
* remove old paragraph
* updates and fixes
* reading the log files
* clarify timestamp usage
* typo
* update copy
* update
* update
* log and package structure
* file append work
* write log file, start reading
* time tracking, testing, log logic
* comments
* remove comment
* use ms time and host in file name, remove ksuid
* time travel
* cleanup insert operation untested
* cleanup, fixes, and working icedb insert test
* rename signature
* rename signature
* add custom merge query
* improved testing with assert
* start merging, return log files from read, make files printable
* fix tombstones, add tombstone validation to icedb test
* rename
* styling, make static
* basic test for conflicting types
* start read backward, support pagination in listing from s3, merge and append return file meta
* merge timestamps in log file name, read backward
* arch docs typo, remove merge timestamp from meta as it's in name already
* reverse read with timestamp merge, use reverse read
* remove timestamp for reverse read
* remove timestamp for read
* helper function for reading list of relevant log files forward
* typos, clean up reading the log forward helper, add virtual field for log marker to track the source log file when read
* fix virtual source log file
* rename signature
* comments
* merge only rewrites logs of log files involved in merge
* remove reverse read from test and mark reverse read deprecated
* remove merged marker in log file as not needed
* fix merge tombstone cleanup
* remove todos
* scale and pagination test
* scale and pagination test, fix merge and tombstone clean bugs, fix merge bug duplicating contents of merged files (ignored tombstones and copied them in merge), readme notes on performance
* readme
* exports
* remove v2 exports
* export format
* fix import
* remove merge timestamp
* add schema introspection
* test s3 proxy
1 parent 9cc3f45 commit e2c70ef

15 files changed: +1469 −782 lines changed

.gitignore

Lines changed: 1 addition & 0 deletions
@@ -1,3 +1,4 @@
 .env
 .parquet
 __pycache__
+.idea

ARCHITECTURE.md

Lines changed: 102 additions & 0 deletions
@@ -0,0 +1,102 @@
# IceDB v3 Architecture

## Log file(s)

IceDB keeps track of the active files and the schema in a log, much like other database systems. This log is stored in S3 and is append-only. The log can also be truncated via the tombstone cleanup process described below.

Both the schema and the active files are tracked within the same log, and within each individual log file.

### Log file structure

The log file is newline-delimited JSON, with the first line being special. The first line follows this schema (TypeScript format):

```ts
interface {
  v: string // the version number
  t: number // unix ms timestamp of file creation. For merges, this is the timestamp after listing ends and merging logic begins; for append operations, it's the moment metadata is created. Tombstone cleanup leaves the current value when replacing a file
  sch: number // line number that the accumulated schema begins at
  f: number // line number that the list of file markers begins at
  tmb?: number // line number that the list of log file tombstones starts at
}
```

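For example, the header of a hypothetical log file whose accumulated schema is on line 2, whose file markers start on line 3, and whose log file tombstones (present only after a merge) start on line 5 might look like this (values are illustrative, assuming 1-based line numbers):

```
{"v": "3", "t": 1690000000000, "sch": 2, "f": 3, "tmb": 5}
```
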
#### Schema (sch)

There is only one schema line per log file, taking the form:

```ts
interface {
  [column: string]: string // example: "user_id": "VARCHAR"
}
```

Columns are never removed from the schema; it is always the union of the schemas found across log files.

If data type conflicts are found (e.g. log file A has a column as a VARCHAR, but log file B has the same column as a BIGINT), the merge fails and errors are logged. This can be mitigated by having ingestion instances read the schema periodically and cache it in memory (they only need to read log files up through the last schema line, then they can abort the request). One could also use a transactionally-secure schema catalog, have data sources declare their schema ahead of time, and more to validate the schema. Ultimately it is not up to IceDB to verify the schema during inserts.

#### Log file tombstones (tmb)

These are the log files that were merged into a new log file. If log files A and B were merged into C, not all data part files listed in A and B were necessarily merged into the new data parts recorded in C. Because of this, files that existed in A and B but were not part of the merge are copied into log file C in the alive state. Any files that were merged are marked not alive via a tombstone reference.

Because log files A and B were merged into C, "tombstones" are created for log files A and B. Tombstones are tracked so that a background cleaning process can remove the merged log files after some grace period (for example, files older than the max query timeout * 2). This is why it's important to insert infrequently and in large batches.

They take the format:

```ts
interface {
  "p": string // the file path, e.g. /some/prefixed/_log/ksuid.jsonl
  "t": number // the timestamp when the tombstone was created (when this log file was first part of a merge)
}
```

#### File marker (f)

There is at least one file marker per log file, taking the form:

```ts
interface {
  "p": string // the file path, e.g. /some/prefixed/file.parquet
  "b": number // the size in bytes
  "t": number // created timestamp in milliseconds
  "tmb"?: number // present if the file is not alive; the unix ms timestamp when the file was tombstoned
}
```

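Putting the pieces together, a log file produced by a merge might look like the following (a hypothetical sketch; paths, timestamps, and the ordering of the marker and tombstone sections are illustrative):

```
{"v": "3", "t": 1690000002000, "sch": 2, "f": 3, "tmb": 6}
{"user_id": "VARCHAR", "event": "VARCHAR", "ts": "BIGINT"}
{"p": "/tbl/cust=test/d=2023-02-11/merged.parquet", "b": 104857, "t": 1690000002000}
{"p": "/tbl/cust=test/d=2023-02-11/part-a.parquet", "b": 51234, "t": 1690000000000, "tmb": 1690000002000}
{"p": "/tbl/cust=test/d=2023-02-11/part-b.parquet", "b": 49876, "t": 1690000001000, "tmb": 1690000002000}
{"p": "/tbl/_log/1690000000000_host-a.jsonl", "t": 1690000002000}
{"p": "/tbl/_log/1690000001000_host-b.jsonl", "t": 1690000002000}
```

Line 1 is the header; line 2 is the accumulated schema; lines 3–5 are file markers (the two original parts were consumed by the merge and carry tombstone timestamps); lines 6–7 are the tombstones for the two log files that were merged.
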
### Reading the log files

To get a snapshot-consistent view of the database, a reader must perform the following actions:

1. List all files in the `_log` prefix for the table
2. Read each found log file sequentially (they are sorted by time), removing known data parts as file markers with tombstone references are found, and accumulating the current schema (handling schema conflicts if found)
3. Return the final list of active files and the accumulated schema

A stable timestamp can optionally be used to "time travel" *(this should not be older than the `tmb_grace_sec` to prevent missing data)*. This can be used for repeatable reads of the same view of the data.

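Below is a minimal sketch of that read path in Python. It assumes hypothetical `list_log_files` and `read_lines` helpers wrapping the S3 client, 1-based line numbers in the header, and that file markers precede tombstones within each log file; the real package exposes `IceLogIO` for this.

```python
import json


def read_state(list_log_files, read_lines, as_of_ms=None):
    """Accumulate the schema and the list of alive data files from the log."""
    schema = {}  # union of column -> type across all log files
    alive = {}   # data file path -> file marker
    for log_path in sorted(list_log_files("_log/")):  # names start with a ms timestamp, so name order == time order
        lines = read_lines(log_path)
        meta = json.loads(lines[0])
        if as_of_ms is not None and meta["t"] > as_of_ms:
            continue  # "time travel": skip log files newer than the stable timestamp
        # accumulate the schema, failing on type conflicts
        for col, typ in json.loads(lines[meta["sch"] - 1]).items():
            if schema.get(col, typ) != typ:
                raise ValueError(f"schema conflict on column {col}: {schema[col]} vs {typ}")
            schema[col] = typ
        # walk the file markers; a marker with a tombstone removes that data part
        end = meta["tmb"] - 1 if "tmb" in meta else len(lines)
        for line in lines[meta["f"] - 1:end]:
            marker = json.loads(line)
            if "tmb" in marker:
                alive.pop(marker["p"], None)  # merged away in this or an earlier log file
            else:
                alive[marker["p"]] = marker
    return schema, list(alive.values())
```
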
## Merging

Merging requires coordination via an exclusive lock on a table.

When a merge occurs, both data parts and log files are merged. A newly created log file is the combination of:

1. New data parts created in the merge (should be 1) (`f`)
2. Files that were part of the merge, marked with tombstone references (`f`)
3. Files that were not part of the merge, marked alive (`f`)
4. Tombstones of the log files involved in the merge (`tmb`)

The reason for copying the state of untouched files is that the new log file represents a new view of the modified data. If log files A and B were merged into C, then A and B represent a stale version of the data and only exist so that in-flight list and query operations can still find their files.

Merged log files are not immediately deleted, to prevent issues with in-flight operations, and are marked as merged in the new log file so they can be cleaned up later. You must ensure that files are only cleaned long after they could still be in use (say, multiple times the max end-to-end query timeout, including list operation times). This is why it's important to insert infrequently and in large batches, to prevent too many files from building up before they can be deleted.

Data part files that were part of the data merge are marked with tombstones so that if a list operation sees log files A, B, and C, it knows that the old data files were merged and should be removed from the resulting list of active files. If it only ends up seeing A and B, it just gets a stale view of the data. This is why it's important to ensure that a single query gets a consistent point-in-time view of the database, so nested queries do not operate on an inconsistent view of the data.

Tombstones include the timestamp when they were first merged, for the tombstone cleanup worker. When files merge, they must always carry forward any found tombstones. Tombstone cleanup is idempotent, so if a merge occurs concurrently with tombstone cleanup there is no risk of data loss or duplication. Merging must also always carry forward file markers that have tombstones, as these are also removed by the tombstone cleanup process.

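A rough sketch of how the new log file's contents could be assembled from the four items above (hypothetical helper, not the package's actual merge routine; `candidates` are `(log_path, marker)` pairs read from the log files involved in the merge, `merged_paths` the data parts rewritten into `new_part`, and `prior_tombstones` any tombstones carried forward from those log files):

```python
import json
import time


def build_merged_log(schema, candidates, merged_paths, new_part, prior_tombstones=()):
    """Assemble the newline-delimited JSON body of a merge's new log file."""
    now_ms = int(time.time() * 1000)
    markers = [new_part]  # 1. the new data part created by the merge
    for _, marker in candidates:
        if marker["p"] in merged_paths and "tmb" not in marker:
            marker = {**marker, "tmb": now_ms}  # 2. parts consumed by the merge get tombstone references
        markers.append(marker)  # 3. untouched parts (and already-tombstoned ones) are carried forward
    tombstones = list(prior_tombstones) + [
        {"p": p, "t": now_ms}  # 4. tombstones for the log files involved in this merge
        for p in sorted({log_path for log_path, _ in candidates})
    ]
    header = {
        "v": "3",
        "t": now_ms,  # for merges: the timestamp after listing ends and merge logic begins
        "sch": 2,
        "f": 3,
        "tmb": 3 + len(markers),  # 1-based line numbers, assuming schema, then markers, then tombstones
    }
    return "\n".join(json.dumps(x) for x in [header, schema, *markers, *tombstones])
```
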
## Tombstone cleanup

The second level of coordination, with its own exclusive lock, is tombstone cleaning. A separate `tmb_grace_sec` parameter controls how long tombstoned files are kept.

When tombstone cleanup occurs, the entire state of the log is read. Any tombstones found that are older than `tmb_grace_sec` are deleted from S3.

When the cleaning process finds a log file with tombstones, it first deletes those files from S3. If that is successful (not-found errors pass idempotently), then the log file is replaced with the same contents, minus the tombstones and any file markers that referenced them.
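
A minimal sketch of that cleanup pass, assuming hypothetical `read_log`, `write_log`, and `s3_delete` helpers (the real implementation lives in the package; this only illustrates the order of operations):

```python
import time


def clean_tombstones(log_paths, read_log, write_log, s3_delete, tmb_grace_sec):
    """Delete expired tombstoned files and rewrite the log files that referenced them."""
    cutoff_ms = int(time.time() * 1000) - tmb_grace_sec * 1000
    for path in log_paths:
        meta, schema, markers, tombstones = read_log(path)
        expired = [t for t in tombstones if t["t"] < cutoff_ms]
        if not expired:
            continue
        # delete the merged-away log files first; a "not found" error is an idempotent pass
        for t in expired:
            s3_delete(t["p"])
        # delete data parts whose markers carry an expired tombstone timestamp
        dead = [m for m in markers if "tmb" in m and m["tmb"] < cutoff_ms]
        for m in dead:
            s3_delete(m["p"])
        # replace the log file with the same contents, minus what was just removed;
        # the header timestamp (meta["t"]) keeps its current value
        keep_markers = [m for m in markers if m not in dead]
        keep_tombstones = [t for t in tombstones if t not in expired]
        write_log(path, meta, schema, keep_markers, keep_tombstones)
```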

README.md

Lines changed: 97 additions & 0 deletions
@@ -8,6 +8,103 @@ _Massive WIP_

See https://blog.danthegoodman.com/icedb-v2

<!-- TOC -->
* [IceDB](#icedb)
* [Performance test](#performance-test)
* [KNOWN GOTCHAS](#known-gotchas)
* [Examples](#examples)
* [Usage](#usage)
* [`partitionStrategy`](#partitionstrategy)
* [`sortOrder`](#sortorder)
* [`formatRow`](#formatrow)
* [`unique_row_key`](#uniquerowkey)
* [Pre-installing extensions](#pre-installing-extensions)
* [Merging](#merging)
* [Concurrent merges](#concurrent-merges)
* [Cleaning Merged Files](#cleaning-merged-files)
* [Custom Merge Query (ADVANCED USAGE)](#custom-merge-query-advanced-usage)
* [Handling `_row_id`](#handling-rowid)
* [Deduplicating Data on Merge](#deduplicating-data-on-merge)
* [Replacing Data on Merge](#replacing-data-on-merge)
* [Aggregating Data on Merge](#aggregating-data-on-merge)
* [Multiple Tables](#multiple-tables)
* [Meta Store Schema](#meta-store-schema)
<!-- TOC -->

## Performance test

The test below inserts 2,000 times with 2 parts each, showing performance against S3 for inserting and for reading in the state and schema:

```
============== insert hundreds ==============
this will take a while...
inserted 200
inserted hundreds in 11.283345699310303
reading in the state
read hundreds in 0.6294591426849365
files 405 logs 202
verify expected results
got 405 alive files
[(406, 'a'), (203, 'b')] in 0.638556957244873
merging it
merged partition cust=test/d=2023-02-11 with 203 files in 1.7919442653656006
read post merge state in 0.5759727954864502
files 406 logs 203
verify expected results
got 203 alive files
[(406, 'a'), (203, 'b')] in 0.5450308322906494
merging many more times to verify
merged partition cust=test/d=2023-06-07 with 200 files in 2.138633966445923
merged partition cust=test/d=2023-06-07 with 3 files in 0.638775110244751
merged partition None with 0 files in 0.5988118648529053
merged partition None with 0 files in 0.6049611568450928
read post merge state in 0.6064021587371826
files 408 logs 205
verify expected results
got 2 alive files
[(406, 'a'), (203, 'b')] in 0.0173952579498291
tombstone clean it
tombstone cleaned 4 cleaned log files, 811 deleted log files, 1012 data files in 4.3332929611206055
read post tombstone clean state in 0.0069119930267333984
verify expected results
got 2 alive files
[(406, 'a'), (203, 'b')] in 0.015745878219604492

============== insert thousands ==============
this will take a while...
inserted 2000
inserted thousands in 107.14211988449097
reading in the state
read thousands in 7.370793104171753
files 4005 logs 2002
verify expected results
[(4006, 'a'), (2003, 'b')] in 6.49034309387207
merging it
breaking on marker count
merged 2000 in 16.016802072525024
read post merge state in 6.011193037033081
files 4006 logs 2003
verify expected results
[(4006, 'a'), (2003, 'b')] in 6.683710098266602
# laptop became unstable around here
```

Some notes:

1. Very impressive state read performance with so many files (remember it has to open each one and accumulate the state!)
2. Merging happens very quickly
3. Tombstone cleaning happens super quickly as well
4. DuckDB performs surprisingly well with so many files (albeit they are one or two rows each)
5. At hundreds of log files and partitions (where most tables should live), performance was exceptional
6. Going from hundreds to thousands, performance is nearly perfectly linear, sometimes even super-linear (merges)!

Having such a large number of log files (merged but not tombstone cleaned) is very unrealistic. Chances are, worst case, you have <100 log files and hundreds or low thousands of data files. Otherwise you are either not merging/cleaning enough, or your partition scheme is far too granular.

The stability of my laptop struggled during the thousands test, so I only showed where I could consistently get to.

## KNOWN GOTCHAS

There is a bug in duckdb right now where `read_parquet` will fire a table macro twice, and show the file twice when listing them, but this doesn't affect actual query results: https://github.com/duckdb/duckdb/issues/7897

docker-compose.yml

Lines changed: 18 additions & 18 deletions
@@ -4,15 +4,15 @@ volumes:
   minio_storage: null
   crdb_storage: null
 services:
-  crdb:
-    container_name: crdb
-    image: cockroachdb/cockroach
-    ports:
-      - "26257:26257"
-      - "8080:8080"
-    command: start-single-node --insecure
-    volumes:
-      - crdb_storage:/cockroach/cockroach-data
+#  crdb:
+#    container_name: crdb
+#    image: cockroachdb/cockroach
+#    ports:
+#      - "26257:26257"
+#      - "8080:8080"
+#    command: start-single-node --insecure
+#    volumes:
+#      - crdb_storage:/cockroach/cockroach-data
   minio:
     image: minio/minio
     ports:
@@ -35,12 +35,12 @@ services:
       /usr/bin/mc mb myminio/testbucket;
       exit 0;
       "
-  clickhouse:
-    image: clickhouse/clickhouse-server
-    depends_on:
-      - minio
-      - crdb
-    container_name: ch
-    volumes:
-      - ./ch/user_scripts:/var/lib/clickhouse/user_scripts:0777
-      - /workspaces/icedb/ch/functions/get_files_function.xml:/etc/clickhouse-server/get_files_function.xml
+#  clickhouse:
+#    image: clickhouse/clickhouse-server:latest
+#    depends_on:
+#      - minio
+#      - crdb
+#    container_name: ch
+#    volumes:
+#      - ./ch/user_scripts:/var/lib/clickhouse/user_scripts:0777
+#      - /workspaces/icedb/ch/functions/get_files_function.xml:/etc/clickhouse-server/get_files_function.xml

examples/clickhouse.md

Lines changed: 7 additions & 0 deletions
@@ -26,3 +26,10 @@ docker exec ch clickhouse-client -q "SELECT sum(JSONExtractInt(properties, 'numt
```

This will show the same results as found in the final query of `examples/simple.py`

You can create a parameterized view for a nicer query experience like:
```
docker exec ch clickhouse-client -q "create view icedb as select * from s3(get_files(toYear({start_date:Date}), toMonth({start_date:Date}), toDate({start_date:Date}), toYear({end_date:Date}), toMonth({end_date:Date}), toDate({end_date:Date})), 'user', 'password', 'Parquet')"

docker exec ch clickhouse-client -q "SELECT sum(JSONExtractInt(properties, 'numtime')), user_id from icedb where start_date = '2023-02-01' and end_date = '2023-08-01' and event = 'page_load' group by user_id FORMAT Pretty;"
```

icedb/__init__.py

Lines changed: 5 additions & 1 deletion
@@ -1 +1,5 @@
-from .icedb import IceDB, PartitionFunctionType
+from .log import (
+    IceLogIO, Schema, LogMetadata, LogTombstone, NoLogFilesException, FileMarker, S3Client,
+    LogMetadataFromJSON, FileMarkerFromJSON, LogTombstoneFromJSON, SchemaConflictException, get_log_file_info
+)
+from .icedb import IceDBv3, PartitionFunctionType, FormatRowType

icedb/ch-test.sql

Lines changed: 4 additions & 0 deletions
@@ -0,0 +1,4 @@
select * FROM s3('https://webhook.site/1d7527f0-be57-4e48-aea1-f988b6ff62f5/ookla-open-data/parquet/performance/type=*/year=*/quarter=*/*.parquet', 'Parquet', 'quadkey Nullable(String), tile Nullable(String), avg_d_kbps Nullable(Int64), avg_u_kbps Nullable(Int64), avg_lat_ms Nullable(Int64), tests Nullable(Int64), devices Nullable(Int64)')


select * FROM s3('https://webhook.site/1d7527f0-be57-4e48-aea1-f988b6ff62f5/ookla-open-data/parquet/performance/a.parquet', 'Parquet', 'quadkey Nullable(String), tile Nullable(String), avg_d_kbps Nullable(Int64), avg_u_kbps Nullable(Int64), avg_lat_ms Nullable(Int64), tests Nullable(Int64), devices Nullable(Int64)')

icedb/ddb_test.py

Lines changed: 38 additions & 0 deletions
@@ -0,0 +1,38 @@
import duckdb
import pandas as pd

ddb = duckdb.connect()

# print(ddb.sql('''
# load httpfs
# '''))
# print(ddb.sql('''
# load parquet
# '''))
# print(ddb.sql('''
# SET s3_region='us-east-1'
# '''))
# print(ddb.sql('''
# SET s3_endpoint='webhook.site'
# '''))
# print(ddb.sql('''
# SET s3_url_style='path'
# '''))
# try:
#     print(ddb.sql('''
#     select * from read_parquet('s3://1d7527f0-be57-4e48-aea1-f988b6ff62f5/ookla-open-data/parquet/performance/type=*/year=*/quarter=*/*.parquet')
#     '''))
# except:
#     pass
# try:
#     print(ddb.sql('''
#     select * from read_parquet('s3://1d7527f0-be57-4e48-aea1-f988b6ff62f5/blah.parquet')
#     '''))
# except:
#     pass

# introspect the DataFrame's schema as DuckDB sees it
df = pd.DataFrame([{'a': 123, 'b': 1.2, 'c': 'hey', 'd': ['hey']}])
ddb.execute("describe select * from df")
res = ddb.df()
print(res['column_name'].tolist(), res['column_type'].tolist())
print('/'.join([None, 'hey', 'ho']))  # note: raises TypeError, join() requires str items
