Working with Apache Arrow
What is Apache Arrow?
Apache Arrow is a columnar in-memory data format designed for efficient analytics. It enables zero-copy data sharing between systems and languages.
DataFusion uses Arrow as its native data format — all query results are returned as Arrow data structures.
Columnar vs Row-based
Traditional databases return data row by row. Arrow stores data column by column — all values of a single column are stored contiguously in memory. This layout is optimized for analytical queries that scan and aggregate columns.
When you execute a query with DataFusionSharp, you don't get row objects. Instead, you get typed column arrays where each column is a separate array of values:
Row-based:
┌───────┬─────┐
│ Alice │ 100 │
│ Bob │ 250 │
│ Carol │ 75 │
└───────┴─────┘
Columnar (Arrow):
names: ["Alice", "Bob", "Carol"]
amounts: [100, 250, 75]
RecordBatch
A RecordBatch is Arrow's unit of data — a group of rows stored in columnar form. Each batch has:
- A Schema describing column names and types
- A set of typed column arrays, one per column
Query results may arrive as multiple batches. Use CollectAsync() to get all batches at once, or ExecuteStreamAsync() to process them as a stream:
// All at once
using var result = await df.CollectAsync();
foreach (var batch in result.Batches)
{
// process batch
}
// Streaming
await using var stream = await df.ExecuteStreamAsync();
await foreach (var batch in stream)
{
// process batch
}
Why You Need to Cast
batch.Column() returns IArrowArray — a generic interface. To read actual values, you cast to the concrete array type:
var ids = (Int64Array)batch.Column("id");
long? firstId = ids.GetValue(0);
The cast is necessary because Arrow arrays are strongly typed at the storage level, but the RecordBatch API returns the common interface. Without casting, you can't access typed values.