Reading Data

Register files as named tables, then query them with SQL.

CSV

// Simple registration
await context.RegisterCsvAsync("orders", "data/orders.csv");

// With custom options
await context.RegisterCsvAsync("orders", "data/orders.csv", new CsvReadOptions
{
    HasHeader = true,
    Delimiter = ';',
    SchemaInferMaxRecords = 1000,
});

CsvReadOptions

| Property | Type | Description |
| --- | --- | --- |
| HasHeader | bool? | Whether the file has a header row (default: true) |
| Delimiter | char? | Column delimiter character (default: ',') |
| Quote | char? | Quote character (default: '"') |
| Escape | char? | Escape character |
| Terminator | char? | Line terminator character |
| Comment | char? | Comment character |
| NewlinesInValues | bool? | Support newlines inside quoted values |
| Schema | Schema? | Explicit Arrow schema |
| SchemaInferMaxRecords | ulong? | Max rows for schema inference |
| FileExtension | string? | File extension filter (default: .csv) |
| FileCompressionType | CompressionType? | Compression type |
| NullRegex | string? | Regex pattern for null values |
| TruncatedRows | bool? | Allow truncated rows |
| TablePartitionCols | IReadOnlyList<PartitionColumn>? | Hive partition columns |
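
The Schema option can be used to bypass inference when the column types are known up front. The following is a minimal sketch, not a definitive recipe: the column names and types are hypothetical, `context` is assumed to be the session context used in the examples above, and the binding's own namespace import is omitted because this page does not name it.

```csharp
using Apache.Arrow;
using Apache.Arrow.Types;

// Hypothetical column layout for data/orders.csv (an assumption for illustration).
var ordersSchema = new Schema.Builder()
    .Field(new Field("order_id", Int64Type.Default, false))
    .Field(new Field("status", StringType.Default, true))
    .Build();

var csvOptions = new CsvReadOptions
{
    HasHeader = true,
    Schema = ordersSchema, // explicit schema, so types are not inferred from the rows
};

// `context` is assumed to exist already, as in the examples above.
await context.RegisterCsvAsync("orders", "data/orders.csv", csvOptions);
```

With an explicit schema, SchemaInferMaxRecords should have nothing left to do, so it is left unset here.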

Parquet

await context.RegisterParquetAsync("events", "data/events.parquet");

// With options
await context.RegisterParquetAsync("events", "data/events.parquet", new ParquetReadOptions
{
    ParquetPruning = true,
    SkipMetadata = false,
});

ParquetReadOptions

| Property | Type | Description |
| --- | --- | --- |
| Schema | Schema? | Explicit Arrow schema |
| FileExtension | string? | File extension filter (default: .parquet) |
| TablePartitionCols | IReadOnlyList<PartitionColumn>? | Hive partition columns |
| ParquetPruning | bool? | Prune row groups using predicates |
| SkipMetadata | bool? | Skip metadata in file schema |

JSON

await context.RegisterJsonAsync("logs", "data/logs.json");

// With options
await context.RegisterJsonAsync("logs", "data/logs.json", new JsonReadOptions
{
    SchemaInferMaxRecords = 500,
});

JsonReadOptions

| Property | Type | Description |
| --- | --- | --- |
| Schema | Schema? | Explicit Arrow schema |
| SchemaInferMaxRecords | ulong? | Max rows for schema inference |
| FileExtension | string? | File extension filter (default: .json) |
| FileCompressionType | CompressionType? | Compression type |
| TablePartitionCols | IReadOnlyList<PartitionColumn>? | Hive partition columns |

RecordBatch

Register an in-memory Arrow RecordBatch as a queryable table:

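using Apache.Arrow;
using Apache.Arrow.Types; // Int64Type and StringType live in this namespace

// Build one Arrow array per column.
var idArray = new Int64Array.Builder().Append(1).Append(2).Build();
var nameArray = new StringArray.Builder().Append("Alice").Append("Bob").Build();

// Describe the columns: "id" is non-nullable, "name" is nullable.
var schema = new Schema.Builder()
    .Field(new Field("id", Int64Type.Default, false))
    .Field(new Field("name", StringType.Default, true))
    .Build();

// The final argument is the row count.
using var batch = new RecordBatch(schema, [idArray, nameArray], 2);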

context.RegisterBatch("users", batch);

This is useful when your data is already in Arrow format, or when you need to inject programmatically created data into SQL queries. For example, you can join an in-memory lookup table with file-based data:

await context.RegisterCsvAsync("orders", "data/orders.csv");
context.RegisterBatch("statuses", statusBatch);

using var df = await context.SqlAsync(
    "SELECT o.order_id, s.description FROM orders o JOIN statuses s ON o.status = s.name");

Note: RegisterBatch is synchronous, unlike the async file-based registration methods.
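
The statusBatch passed to RegisterBatch in the join example is not defined on this page. One way such a lookup batch might be built is sketched below, using the same Apache.Arrow builders as the users example. The status values and descriptions are made up for illustration; only the column names (name, description) come from the join query.

```csharp
using Apache.Arrow;
using Apache.Arrow.Types;

// Hypothetical status codes and descriptions, for illustration only.
var statusNames = new StringArray.Builder()
    .Append("new").Append("shipped").Append("cancelled").Build();
var statusDescriptions = new StringArray.Builder()
    .Append("Order received")
    .Append("Order handed to carrier")
    .Append("Order cancelled")
    .Build();

// Column names match the join query: s.name and s.description.
var statusSchema = new Schema.Builder()
    .Field(new Field("name", StringType.Default, false))
    .Field(new Field("description", StringType.Default, true))
    .Build();

// Three rows, one per status.
using var statusBatch = new RecordBatch(statusSchema, [statusNames, statusDescriptions], 3);
```

The resulting batch can then be registered with context.RegisterBatch("statuses", statusBatch) as shown in the join example.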

CompressionType

Applies to CsvReadOptions.FileCompressionType and JsonReadOptions.FileCompressionType:

Gzip, Bzip2, Xz, Zstd, Uncompressed (default)
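
As a rough sketch of how a compressed file might be registered: the example below assumes the values listed above are members of a CompressionType enum, that `context` is the session context used earlier on this page, and that data/orders.csv.gz is a hypothetical gzip-compressed CSV file. Setting FileExtension to match the compressed file name is also an assumption about how the extension filter treats compressed files.

```csharp
// Options for a gzip-compressed CSV file (hypothetical path and settings).
var gzipCsvOptions = new CsvReadOptions
{
    FileCompressionType = CompressionType.Gzip,
    FileExtension = ".csv.gz", // assumption: the extension filter should match the real file name
};

// `context` is assumed to exist already, as in the examples above.
await context.RegisterCsvAsync("orders", "data/orders.csv.gz", gzipCsvOptions);
```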

Deregistering Tables

await context.DeregisterTableAsync("orders");

Registering Directories

All file-based registration methods (RegisterCsvAsync, RegisterParquetAsync, RegisterJsonAsync) accept either a file path or a directory path. When given a directory, DataFusion reads every file in it that matches the FileExtension filter. This is especially useful with Hive-style partitioning.
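
A minimal sketch of registering a directory, assuming a hypothetical Hive-style layout and the same `context` as in the earlier examples. Exposing the partition keys (year, month) as columns is presumably what TablePartitionCols is for; this page does not show how a PartitionColumn is constructed, so the sketch leaves it out and only queries across all files.

```csharp
// Hypothetical directory layout:
//   data/events/year=2024/month=01/part-0.parquet
//   data/events/year=2024/month=02/part-0.parquet

// Register the directory itself rather than a single file.
// `context` is assumed to exist already, as in the examples above.
await context.RegisterParquetAsync("events", "data/events/");

// The query runs over the rows of every matching .parquet file in the directory.
using var df = await context.SqlAsync("SELECT COUNT(*) AS event_count FROM events");
```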