### Apache Iceberg version

None

### Query engine

None
### Please describe the bug 🐞

The `iceberg-rest-fixture` REST catalog server uses `JdbcCatalog` with SQLite (`jdbc:sqlite::memory:`) as the underlying database. Under high concurrent load, the catalog server fails with a `SQLiteException` such as:

```
Caused by: org.sqlite.SQLiteException: [SQLITE_ERROR] SQL error or missing database (no such table: iceberg_tables)
```
This issue arises because `jdbc:sqlite::memory:` creates a distinct, ephemeral in-memory database for each new JDBC connection, while `JdbcClientPool` defaults to a client pool size of 2. When handling more than 2 concurrent connections, the pool can open new connections, and thus new in-memory databases, without running the initialization logic. Subsequent database operations then fail because the `iceberg_tables` table is missing in these newly created databases.
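The per-connection behavior can be demonstrated with Python's built-in `sqlite3` module (not the JDBC driver the fixture uses, but the same underlying SQLite semantics):

```python
import sqlite3

# Each connection to ":memory:" gets its own private, empty database.
conn1 = sqlite3.connect(":memory:")
conn1.execute("CREATE TABLE iceberg_tables (catalog_name TEXT, table_name TEXT)")

# A second ":memory:" connection opens a brand-new database,
# so the table created on conn1 is not visible here.
conn2 = sqlite3.connect(":memory:")
try:
    conn2.execute("SELECT * FROM iceberg_tables")
except sqlite3.OperationalError as e:
    print(e)  # no such table: iceberg_tables
```

This is exactly the failure mode the pooled JDBC connections hit: any connection other than the one that ran the schema initialization sees an empty database.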
### Reproduce

1. Start an IRC (Iceberg REST catalog) server using `iceberg-rest-fixture`. I'm using pyiceberg's integration docker compose file by running `make test-integration-setup` in the pyiceberg repo.
2. Create concurrent requests to the IRC. In `poetry run python`, run this script:
```python
from concurrent.futures import ThreadPoolExecutor

from pyiceberg.catalog import load_catalog


def run_rest_catalog():
    catalog = load_catalog(
        "rest",
        **{
            "type": "rest",
            "uri": "http://localhost:8181",
            "s3.endpoint": "http://localhost:9000",
            "s3.access-key-id": "admin",
            "s3.secret-access-key": "password",
        },
    )
    try:
        ns = catalog.list_namespaces()
        print(ns)
    except Exception as e:
        print(e)


def run_catalog_in_parallel(workers, runs):
    # Run in parallel using threads
    with ThreadPoolExecutor(max_workers=workers) as executor:
        futures = [executor.submit(run_rest_catalog) for _ in range(runs)]
        for future in futures:
            _ = future.result()


run_catalog_in_parallel(20, 200)
```
### Solution

In `iceberg-rest-fixture`'s Dockerfile, set:

```
CATALOG_URI=jdbc:sqlite:file::memory:?cache=shared
```

- Use `file::memory:` to explicitly indicate an in-memory database that supports URI parameters.
- Add the `?cache=shared` parameter so that SQLite uses a shared in-memory cache, meaning all connections access the same underlying in-memory database.
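The effect of the shared-cache URI can again be sketched with Python's `sqlite3` module (a stand-in for the JDBC driver, but the same SQLite behavior):

```python
import sqlite3

# With cache=shared, every connection to this URI attaches to the same
# in-memory database, which lives as long as at least one connection is open.
uri = "file::memory:?cache=shared"

conn1 = sqlite3.connect(uri, uri=True)
conn1.execute("CREATE TABLE iceberg_tables (catalog_name TEXT, table_name TEXT)")

# A second connection now sees the table created on conn1:
conn2 = sqlite3.connect(uri, uri=True)
print(conn2.execute("SELECT count(*) FROM iceberg_tables").fetchone()[0])  # 0 — empty, but the table exists, so no error
```

Note that the shared in-memory database is destroyed once the last connection to it closes, which is acceptable for a test fixture that keeps a connection pool alive.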
EDIT: Fokko suggested a better solution: limit SQLite to just 1 client connection.

Additionally, we should update all subprojects that use the original `jdbc:sqlite::memory:` connection string.
### Willingness to contribute
- I can contribute a fix for this bug independently
- I would be willing to contribute a fix for this bug with guidance from the Iceberg community
- I cannot contribute a fix for this bug at this time