Skip to main content

Object Stores

Object stores let you query remote data via URL schemes. The pattern is: register the store, then register tables using URLs instead of local paths.

URL Schemes

SchemeStore
file:///Local filesystem
s3://bucketAmazon S3 (and compatible)
az://containerAzure Blob Storage
gs://bucketGoogle Cloud Storage
memory://In-memory (ephemeral)

Local Filesystem

context.RegisterLocalFileSystem("file:///");

await context.RegisterCsvAsync("orders", "file:///data/orders.csv");
using var df = await context.SqlAsync("SELECT * FROM orders");

Amazon S3

// Public bucket (anonymous access)
context.RegisterS3ObjectStore("s3://arrow-datasets", new S3ObjectStoreOptions
{
BucketName = "arrow-datasets",
Region = "us-east-2",
SkipSignature = true,
});

await context.RegisterParquetAsync("data", "s3://arrow-datasets/data.parquet");

S3 with Credentials

context.RegisterS3ObjectStore("s3://my-bucket", new S3ObjectStoreOptions
{
BucketName = "my-bucket",
Region = "us-west-2",
AccessKeyId = "AKIA...",
SecretAccessKey = "secret",
});

S3-Compatible (MinIO)

context.RegisterS3ObjectStore("s3://my-bucket", new S3ObjectStoreOptions
{
BucketName = "my-bucket",
Endpoint = "http://localhost:9000",
AccessKeyId = "minioadmin",
SecretAccessKey = "minioadmin",
AllowHttp = true,
});

S3ObjectStoreOptions

PropertyTypeRequiredDescription
BucketNamestringYesS3 bucket name
Regionstring?NoAWS region (e.g., us-east-1)
AccessKeyIdstring?NoAWS access key ID
SecretAccessKeystring?NoAWS secret access key
Endpointstring?NoCustom endpoint for S3-compatible services
Tokenstring?NoSession token for temporary credentials
AllowHttpbool?NoAllow HTTP (non-TLS) connections
VirtualHostedStyleRequestbool?NoUse virtual hosted style requests
SkipSignaturebool?NoSkip request signing (for public buckets)

Azure Blob Storage

context.RegisterAzureBlobStorage("az://my-container", new AzureBlobStorageOptions
{
ContainerName = "my-container",
AccountName = "myaccount",
AccessKey = "base64key...",
});

await context.RegisterParquetAsync("data", "az://my-container/path/data.parquet");

Azure with SAS Token

context.RegisterAzureBlobStorage("az://my-container", new AzureBlobStorageOptions
{
ContainerName = "my-container",
AccountName = "myaccount",
SasKey = "...",
});

Azure with Service Principal

context.RegisterAzureBlobStorage("az://my-container", new AzureBlobStorageOptions
{
ContainerName = "my-container",
ClientId = "client-id",
ClientSecret = "client-secret",
TenantId = "tenant-id",
});

Azurite Emulator

context.RegisterAzureBlobStorage("az://my-container", new AzureBlobStorageOptions
{
ContainerName = "my-container",
UseEmulator = true,
AllowHttp = true,
});

AzureBlobStorageOptions

PropertyTypeRequiredDescription
ContainerNamestringYesAzure container name
AccountNamestring?NoStorage account name
AccessKeystring?NoStorage access key
BearerTokenstring?NoBearer token
ClientIdstring?NoService principal client ID
ClientSecretstring?NoService principal client secret
TenantIdstring?NoService principal tenant ID
SasKeystring?NoShared Access Signature key
Endpointstring?NoCustom endpoint URL
UseEmulatorbool?NoUse Azurite emulator
AllowHttpbool?NoAllow HTTP connections
SkipSignaturebool?NoSkip request signing

Google Cloud Storage

// With service account credentials file
context.RegisterGoogleCloudStorage("gs://my-bucket", new GoogleCloudStorageOptions
{
BucketName = "my-bucket",
CredentialsPath = "/path/to/service-account.json",
});

await context.RegisterParquetAsync("data", "gs://my-bucket/data.parquet");

Anonymous Access

context.RegisterGoogleCloudStorage("gs://public-bucket", new GoogleCloudStorageOptions
{
BucketName = "public-bucket",
SkipSignature = true,
});

GoogleCloudStorageOptions

PropertyTypeRequiredDescription
BucketNamestringYesGCS bucket name
ProjectIdstring?NoGCP project ID
CredentialsPathstring?NoPath to service account JSON key file
Credentialsstring?NoService account JSON key as string
ServiceAccountEmailstring?NoService account email
CustomEndpointstring?NoCustom endpoint URL
AllowHttpbool?NoAllow HTTP connections
SkipSignaturebool?NoSkip request signing

In-Memory Object Store

The in-memory object store keeps data entirely in memory. It is useful for tests, short-lived pipelines, or any scenario where you want to load data from an external source (database, HTTP API, message queue, etc.) and query it with SQL without writing temporary files.

Basic Usage

using var runtime = DataFusionRuntime.Create();
using var context = runtime.CreateSessionContext();

// 1. Create the store
using var store = runtime.CreateInMemoryStore();

// 2. Put data into the store
var csvBytes = await File.ReadAllBytesAsync("data/customers.csv");
await store.PutAsync("customers.csv", csvBytes);

// 3. Register the store with a URL scheme
context.RegisterInMemoryObjectStore("memory://", store);

// 4. Register a table using the memory:// URL and query it
await context.RegisterCsvAsync("customers", "memory:///customers.csv");

using var df = await context.SqlAsync("SELECT * FROM customers");
await df.ShowAsync();

Any file format works — CSV, JSON, or Parquet:

await store.PutAsync("events.parquet", parquetBytes);
context.RegisterInMemoryObjectStore("memory://", store);
await context.RegisterParquetAsync("events", "memory:///events.parquet");

Deleting Data

Remove an object from the store with DeleteAsync. Subsequent queries against the table will return no rows (the table registration remains, but the underlying data is gone):

await store.DeleteAsync("customers.csv");

Zero-Copy Put

PutAsStaticAsync avoids copying data into native memory. The caller must pin the memory and keep it alive for the entire lifetime of the store:

var csvBytes = await File.ReadAllBytesAsync("data/customers.csv");
using var pin = csvBytes.AsMemory().Pin();

await store.PutAsStaticAsync("customers.csv", pin, csvBytes.Length);

Warning: If the pinned memory is released or modified while the store is still in use, behaviour is undefined.

Multiple Stores

You can register several in-memory stores on the same session by giving each one a distinct URL scheme:

using var storeA = runtime.CreateInMemoryStore();
using var storeB = runtime.CreateInMemoryStore();

await storeA.PutAsync("customers.csv", customersCsv);
await storeB.PutAsync("orders.csv", ordersCsv);

context.RegisterInMemoryObjectStore("mem-a://", storeA);
context.RegisterInMemoryObjectStore("mem-b://", storeB);

await context.RegisterCsvAsync("customers", "mem-a:///customers.csv");
await context.RegisterCsvAsync("orders", "mem-b:///orders.csv");

using var df = await context.SqlAsync(
"SELECT c.customer_name, o.order_id FROM customers c JOIN orders o ON c.customer_id = o.customer_id");

Sharing a Store Across Sessions

A single store can be registered in multiple session contexts. Each session maintains its own table catalog, but they share the underlying data:

context.RegisterInMemoryObjectStore("memory://", store);
contextB.RegisterInMemoryObjectStore("memory://", store);

InMemoryObjectStore API

MethodDescription
PutAsync(string path, Memory<byte> data)Copies data into the store at path.
PutAsStaticAsync(string path, MemoryHandle handle, int length)Stores data without copying. The pinned memory must stay alive.
DeleteAsync(string path)Removes the object at path.
Dispose()Releases all native resources held by the store.

Deregistering Object Stores

Remove a previously registered object store by its URL scheme. After deregistration, tables that reference the store's URLs can no longer access data.

context.DeregisterObjectStore("s3://my-bucket");