Storage

Columnar file I/O, splayed tables, date-partitioned storage, symbol table persistence, and CSV import/export — everything for getting data in and out of Rayforce.

Columnar .col Files

The .col format is Rayforce's native binary representation for a single vector. Each file stores a 32-byte header followed by the raw element data and an optional null bitmap. The format is designed for direct memory mapping — no deserialization needed.

File Structure

/*
 * .col file layout:
 *
 * Bytes  0-31:   ray_t header (type, attrs, len, etc.)
 * Bytes 32-N:    element data (len * elem_size bytes)
 * Bytes N-M:     null bitmap (if RAY_ATTR_HAS_NULLS + RAY_ATTR_NULLMAP_EXT)
 *                  — (len + 7) / 8 bytes
 */
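
The layout above reduces to a little offset arithmetic. The sketch below computes where each region starts for a given element count and element size. The `col_layout_t` struct and `col_layout()` helper are hypothetical names used only for illustration and are not part of the Rayforce API; the 32-byte header and the `(len + 7) / 8` bitmap size come from the layout comment.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { COL_HEADER_SIZE = 32 };  /* header size stated in the layout above */

/* Hypothetical summary of where each region of a .col file starts. */
typedef struct {
    size_t data_off;    /* first byte of element data */
    size_t bitmap_off;  /* first byte of the null bitmap (== total if absent) */
    size_t total;       /* total file size in bytes */
} col_layout_t;

/* Compute region offsets for `len` elements of `elem_size` bytes each. */
static col_layout_t col_layout(uint64_t len, size_t elem_size, int has_ext_nulls)
{
    assert(elem_size > 0);
    col_layout_t l;
    l.data_off   = COL_HEADER_SIZE;
    l.bitmap_off = l.data_off + (size_t)len * elem_size;
    l.total      = l.bitmap_off + (has_ext_nulls ? (size_t)((len + 7) / 8) : 0);
    return l;
}
```

For a 1,000,000-element F64 vector with an external null bitmap this gives 32 + 8,000,000 + 125,000 = 8,125,032 bytes.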

C API

Function                 Description
ray_col_save(vec, path)  Write a vector to a .col file. Handles slices, external null bitmaps, and string pools transparently.
ray_col_load(path)       Read a .col file into a heap-allocated vector. The file is read entirely into memory.
ray_col_mmap(path)       Memory-map a .col file for zero-copy access. The returned vector points directly into the mapped file pages. Ideal for large datasets that exceed available RAM.

// Save a vector
ray_t* prices = ray_vec_from_raw(RAY_F64, data, 1000000);
ray_err_t err = ray_col_save(prices, "db/trades/price.col");

// Load into memory
ray_t* loaded = ray_col_load("db/trades/price.col");

// Memory-map for zero-copy access
ray_t* mapped = ray_col_mmap("db/trades/price.col");
// mapped->data points into file pages — no allocation
Memory-mapped vectors have mmod = 1 in their header, distinguishing them from heap-allocated vectors. The buddy allocator skips them during free. Slices of mmap'd vectors retain a reference to the parent mapping.
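
To make the zero-copy idea concrete, here is a minimal POSIX-only sketch of mapping a file and reading F64 elements straight out of the mapped pages. It is not how `ray_col_mmap` is implemented; `col_map_t` and the `col_map_*` helpers are hypothetical, and the sketch skips the 32-byte header without interpreting it, since the header fields are not described here.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

enum { COL_HEADER_SIZE = 32 };

typedef struct {
    void  *base;   /* start of the mapping (the header) */
    size_t size;   /* mapped length in bytes */
} col_map_t;

/* Map `path` read-only. Returns 0 on success, -1 on failure. */
static int col_map_open(const char *path, col_map_t *out)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0) return -1;

    struct stat st;
    if (fstat(fd, &st) != 0 || st.st_size < COL_HEADER_SIZE) {
        close(fd);
        return -1;
    }

    void *p = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);                      /* the mapping outlives the descriptor */
    if (p == MAP_FAILED) return -1;

    out->base = p;
    out->size = (size_t)st.st_size;
    return 0;
}

/* Element data begins right after the header; nothing is copied. */
static const double *col_map_f64(const col_map_t *m)
{
    assert(m->base != NULL);
    return (const double *)((const char *)m->base + COL_HEADER_SIZE);
}

static void col_map_close(col_map_t *m)
{
    munmap(m->base, m->size);
    m->base = NULL;
    m->size = 0;
}
```

Pages are faulted in by the operating system only when they are first touched, which is why a mapped column can be larger than available RAM.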

Splayed Tables

A splayed table stores each column as a separate .col file in a directory. This is the standard on-disk representation for a Rayforce table. The schema (column names and types) is stored alongside the data files.

Directory Layout

db/trades/
  .schema.col        — I64 vector of column name symbol IDs
  sym.col            — SYM column (stock tickers)
  price.col          — F64 column (trade prices)
  qty.col            — I64 column (quantities)
  time.col           — TIMESTAMP column

C API

Function                            Description
ray_splay_save(tbl, dir, sym_path)  Save a table as a splayed directory. Each column becomes a .col file named after its column symbol. Pass sym_path to also save the symbol table.
ray_splay_load(dir, sym_path)       Load a splayed table from a directory. Columns are memory-mapped by default. Pass sym_path to load the associated symbol table.

// Save table to disk
ray_err_t err = ray_splay_save(table, "db/trades", "db/sym");

// Load table (columns are mmap'd)
ray_t* trades = ray_splay_load("db/trades", "db/sym");

; Rayfall: save and load splayed tables
ray> (splay-save t "db/trades")
ray> (set trades (splay-load "db/trades"))

Date-Partitioned Tables

For large time-series datasets, Rayforce supports date-partitioned storage. Data is split into directories named by date, each containing a splayed table for that day's data.

Directory Layout

db/trades/
  sym                 — shared symbol table
  2024.01.15/
    sym.col
    price.col
    qty.col
    time.col
  2024.01.16/
    sym.col
    price.col
    qty.col
    time.col
  2024.01.17/
    ...

Loading Partitioned Data

The ray_part_load() function scans all partition directories, memory-maps every column file, and assembles them into a single logical table with parted columns and a virtual MAPCOMMON date column.

// C API: load all partitions
ray_t* trades = ray_part_load("db", "trades");

// The result is a single table with:
// - A MAPCOMMON 'date' column derived from directory names
// - Parted columns (RAY_PARTED_BASE + base_type) for each data column
// - All segments are memory-mapped — no data copy

; Rayfall: load partitioned table
ray> (set trades (part-load "db" "trades"))

; Filter on date — optimizer prunes partitions
ray> (select {from:trades where: (= date 2024.01.15)})

; Range filter — only relevant partitions are scanned
ray> (select {from:trades where: (and (>= date 2024.01.15) (<= date 2024.01.17))})

Partition Pruning

The query optimizer recognizes predicates on the MAPCOMMON column and eliminates entire partitions from the scan plan. With daily partitions, a query that filters on a single date in a year of data touches roughly 1/365th of the column files on disk, and the pruned partitions incur no per-row cost.
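
The pruning idea can be sketched independently of the optimizer: turn each partition directory name into a sortable key, then keep only the partitions whose key falls inside the requested date range. The helpers below (`part_date_key`, `prune_partitions`) are hypothetical and only illustrate the principle; the actual optimizer works on the query plan, not on directory names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Parse "YYYY.MM.DD" into a sortable key YYYYMMDD.
 * Returns -1 if the name does not look like a date (e.g. the "sym" file). */
static int part_date_key(const char *name)
{
    int y, m, d, used = 0;
    if (sscanf(name, "%4d.%2d.%2d%n", &y, &m, &d, &used) != 3) return -1;
    if (used != 10 || name[used] != '\0') return -1;
    if (m < 1 || m > 12 || d < 1 || d > 31) return -1;
    return y * 10000 + m * 100 + d;
}

/* Write the indices of partitions whose key lies in [lo, hi] to `out`.
 * Returns how many were kept; the others never need to be opened. */
static size_t prune_partitions(const char *const *names, size_t n,
                               int lo, int hi, size_t *out)
{
    assert(names != NULL && out != NULL);
    size_t kept = 0;
    for (size_t i = 0; i < n; i++) {
        int key = part_date_key(names[i]);
        if (key >= lo && key <= hi) out[kept++] = i;
    }
    return kept;
}
```

The cost of pruning is proportional to the number of partitions, not the number of rows, which is what makes a single-date filter over a long history cheap.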

Symbol Table Persistence

The global symbol intern table maps strings to integer IDs. When saving data to disk, the symbol table must be persisted so that symbol vectors can be correctly interpreted when reloaded.

Append-Only .sym Files

Symbol files use an append-only format. New symbols are appended to the end of the file without rewriting existing entries. This makes concurrent writes safe and enables incremental updates.
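
As an illustration of the append-only idea, the sketch below appends only the symbols that are not yet on disk and leaves earlier records untouched. The record format (a 4-byte length in host byte order followed by the raw bytes) and the `sym_append` / `sym_count` helpers are assumptions made for this example; the actual .sym encoding is not specified here.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Append symbols[first_new..count) to the file; earlier records are untouched.
 * Assumed record format: 4-byte length followed by the raw bytes. */
static int sym_append(const char *path, const char *const *symbols,
                      size_t first_new, size_t count)
{
    assert(first_new <= count);
    FILE *f = fopen(path, "ab");        /* "a": every write goes to the end */
    if (!f) return -1;
    for (size_t i = first_new; i < count; i++) {
        uint32_t len = (uint32_t)strlen(symbols[i]);
        if (fwrite(&len, sizeof len, 1, f) != 1 ||
            fwrite(symbols[i], 1, len, f) != len) {
            fclose(f);
            return -1;
        }
    }
    return fclose(f) == 0 ? 0 : -1;
}

/* Count the records in the file; a record's index is its symbol ID.
 * This sketch does not detect a truncated final record. */
static long sym_count(const char *path)
{
    FILE *f = fopen(path, "rb");
    if (!f) return -1;
    long n = 0;
    uint32_t len;
    while (fread(&len, sizeof len, 1, f) == 1) {
        if (fseek(f, (long)len, SEEK_CUR) != 0) { fclose(f); return -1; }
        n++;
    }
    fclose(f);
    return n;
}
```

Because a symbol's position in the file never changes, IDs written by an earlier save remain valid after later appends.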

Function                  Description
ray_sym_save(path)        Persist the current global symbol table to a .sym file
ray_sym_load(path)        Load a symbol table from disk, merging with any existing entries
ray_sym_intern(str, len)  Intern a string, returning its integer ID
ray_sym_find(str, len)    Look up a string without interning (returns -1 if absent)
ray_sym_str(id)           Resolve an ID back to its string

// Save symbol table alongside data
ray_sym_save("db/sym");

// On startup, load symbols before loading data
ray_sym_load("db/sym");

Concurrency and Integrity

CSV Import and Export

Rayforce includes a high-performance CSV loader with parallel parsing, automatic type inference, and null handling. No external libraries are used — the parser operates directly on memory-mapped file contents.

Reading CSV Files

; Basic CSV load — auto-detect types, comma delimiter, header row
ray> (set data (read-csv "trades.csv"))
sym  price   qty  date
----------------------------
AAPL 150.25  100  2024.01.15
GOOG 140.50  200  2024.01.15
MSFT 380.00   50  2024.01.15

C API

Function                                          Description
ray_read_csv(path)                                Load a CSV file with default options: comma delimiter, first row as header, automatic type inference, "" as null.
ray_read_csv_opts(path, delim, header, null_str)  Load with custom options: delimiter character, whether first row is a header, and null string representation.
ray_write_csv(table, path)                        Write a table to a CSV file with header row and comma delimiter.

// Default options
ray_t* data = ray_read_csv("trades.csv");

// Tab-delimited, no header, "NA" as null
ray_t* tsv = ray_read_csv_opts("data.tsv", '\t', false, "NA");

// Write results back
ray_write_csv(result, "output.csv");

Type Inference

The CSV loader samples values in each column and infers types in priority order:

  1. BOOL — true/false, 1/0
  2. I64 — integer values within 64-bit range
  3. F64 — floating-point values
  4. DATE — YYYY.MM.DD or YYYY-MM-DD format
  5. TIMESTAMP — date + time with nanosecond precision
  6. SYM — short repeated strings (auto-interned as symbols)
  7. STR — fallback for everything else
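
One way to read the priority order above is "pick the first type that every sampled value satisfies". The sketch below implements that rule for BOOL, I64, F64, and DATE with STR as the fallback; TIMESTAMP and SYM detection are left out to keep it short. All names (`csv_type_t`, `infer_column`, the `is_*` predicates) are hypothetical, and the loader's actual rules may differ in detail.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

typedef enum { T_BOOL, T_I64, T_F64, T_DATE, T_STR } csv_type_t;

static int is_bool(const char *s)
{
    return !strcmp(s, "true") || !strcmp(s, "false") ||
           !strcmp(s, "1") || !strcmp(s, "0");
}

/* Whole field must parse as a base-10 integer that fits in 64 bits. */
static int is_i64(const char *s)
{
    char *end;
    errno = 0;
    (void)strtoll(s, &end, 10);
    return *s != '\0' && *end == '\0' && errno != ERANGE;
}

/* Whole field must parse as a floating-point number. */
static int is_f64(const char *s)
{
    char *end;
    (void)strtod(s, &end);
    return *s != '\0' && *end == '\0';
}

/* Accepts YYYY.MM.DD or YYYY-MM-DD with a basic month/day range check. */
static int is_date(const char *s)
{
    int y, m, d, used = 0;
    if (strlen(s) != 10) return 0;
    if (sscanf(s, "%4d.%2d.%2d%n", &y, &m, &d, &used) != 3 || used != 10) {
        used = 0;
        if (sscanf(s, "%4d-%2d-%2d%n", &y, &m, &d, &used) != 3 || used != 10)
            return 0;
    }
    return m >= 1 && m <= 12 && d >= 1 && d <= 31;
}

typedef int (*pred_t)(const char *);

/* Return the first type in priority order that every sample satisfies. */
static csv_type_t infer_column(const char *const *samples, size_t n)
{
    static const pred_t preds[] = { is_bool, is_i64, is_f64, is_date };
    assert(n > 0);
    for (int t = T_BOOL; t < T_STR; t++) {
        size_t i = 0;
        while (i < n && preds[t](samples[i])) i++;
        if (i == n) return (csv_type_t)t;
    }
    return T_STR;
}
```

Under this rule a column containing 1, 42, and -7 is not BOOL (42 fails) but is I64, and a column mixing 150.25 with 380 resolves to F64.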

Parallel Parsing

The CSV file is memory-mapped and split into chunks. Multiple threads parse chunks in parallel, with a merge step that reconciles column types and combines partial results. For large files (100 MB+), this delivers near-linear speedup with core count.
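
The splitting step can be sketched as follows: pick nominal boundaries at equal byte offsets, then push each one forward to just past the next newline so that no row straddles two chunks. `chunk_t` and `split_chunks` are hypothetical names, and this sketch assumes no newlines inside quoted fields, which a production parser has to handle separately.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef struct { size_t begin, end; } chunk_t;   /* half-open byte range */

/* Split buf[0..len) into at most `want` chunks that each end just after
 * a '\n' (or at the end of the buffer). Returns the number of chunks. */
static size_t split_chunks(const char *buf, size_t len, size_t want, chunk_t *out)
{
    assert(want > 0);
    size_t n = 0, begin = 0;
    for (size_t k = 1; k <= want && begin < len; k++) {
        size_t end = len;                        /* last chunk takes the rest */
        if (k < want) {
            size_t target = len / want * k;      /* nominal boundary */
            if (target < begin) target = begin;
            const char *nl = memchr(buf + target, '\n', len - target);
            end = nl ? (size_t)(nl - buf) + 1 : len;
        }
        if (end > begin) {
            out[n].begin = begin;
            out[n].end = end;
            n++;
            begin = end;
        }
    }
    return n;
}
```

Each resulting range can be handed to a separate worker thread, since every chunk starts at the beginning of a row and ends at the end of one.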

Null Handling

Empty fields and fields matching the null string (default: "") are recognized as null values. The loader sets the appropriate null bitmap bits on the resulting vectors and marks them with RAY_ATTR_HAS_NULLS.
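
A minimal sketch of the null-marking step: test each field against the null string and set the corresponding bit in a bitmap of `(n + 7) / 8` bytes. The helper names are hypothetical, and the bit convention used here (1 = null, least significant bit first) is an assumption rather than a statement about Rayforce's bitmap encoding.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* A field is null when it is empty or equals the configured null string. */
static int field_is_null(const char *field, const char *null_str)
{
    return field[0] == '\0' || strcmp(field, null_str) == 0;
}

/* Assumed convention: bit i set means element i is null, LSB first. */
static void bitmap_set(uint8_t *bm, size_t i)
{
    bm[i >> 3] |= (uint8_t)(1u << (i & 7));
}

static int bitmap_get(const uint8_t *bm, size_t i)
{
    return (bm[i >> 3] >> (i & 7)) & 1;
}

/* Clear and fill a bitmap of (n + 7) / 8 bytes; returns the null count.
 * A nonzero result is when the vector would be flagged as having nulls. */
static size_t mark_nulls(const char *const *fields, size_t n,
                         const char *null_str, uint8_t *bm)
{
    assert(null_str != NULL);
    size_t nulls = 0;
    memset(bm, 0, (n + 7) / 8);
    for (size_t i = 0; i < n; i++) {
        if (field_is_null(fields[i], null_str)) {
            bitmap_set(bm, i);
            nulls++;
        }
    }
    return nulls;
}
```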

Symbol Merge

When loading CSV data with symbol columns, the loader interns all unique strings into the global symbol table. If a symbol table was previously loaded from disk, existing IDs are preserved and new symbols are appended.
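
The merge behaviour described above follows naturally from how interning works: a lookup returns the existing ID if the string is already present, and otherwise the string is appended and receives the next ID. The toy table below uses a linear scan for brevity (a real intern table would use a hash); `symtab_t` and its functions are hypothetical and are not the Rayforce implementation.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

typedef struct {
    char  **strs;         /* strs[id] is the string for that ID */
    size_t  count, cap;
} symtab_t;

/* Look up without interning; returns -1 if absent. */
static long symtab_find(const symtab_t *t, const char *s)
{
    for (size_t i = 0; i < t->count; i++)
        if (strcmp(t->strs[i], s) == 0) return (long)i;
    return -1;
}

/* Return the existing ID, or append the string and return its new ID. */
static long symtab_intern(symtab_t *t, const char *s)
{
    assert(s != NULL);
    long id = symtab_find(t, s);
    if (id >= 0) return id;                  /* existing IDs are preserved */

    if (t->count == t->cap) {
        size_t cap = t->cap ? t->cap * 2 : 8;
        char **p = realloc(t->strs, cap * sizeof *p);
        if (!p) return -1;
        t->strs = p;
        t->cap = cap;
    }
    size_t len = strlen(s) + 1;
    char *copy = malloc(len);
    if (!copy) return -1;
    memcpy(copy, s, len);
    t->strs[t->count] = copy;
    return (long)t->count++;                 /* new symbols go at the end */
}

static void symtab_free(symtab_t *t)
{
    for (size_t i = 0; i < t->count; i++) free(t->strs[i]);
    free(t->strs);
    t->strs = NULL;
    t->count = t->cap = 0;
}
```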

Cross-Platform File I/O

All file operations go through a portable abstraction layer in src/store/fileio.{h,c} that handles platform differences:

Feature         POSIX     Windows
File locking    flock()   LockFileEx()
Sync to disk    fsync()   FlushFileBuffers()
Atomic rename   rename()  MoveFileEx(MOVEFILE_REPLACE_EXISTING)
Memory mapping  mmap()    CreateFileMapping() + MapViewOfFile()

Atomic writes: When saving column files, Rayforce writes to a temporary file first, calls fsync(), then atomically renames it to the target path. This prevents data corruption if the process is interrupted during a write.
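
The write-then-rename pattern can be sketched with plain POSIX calls. This is an illustration of the technique, not the code in src/store/fileio.c: the `atomic_write` helper and the `.tmp` suffix are assumptions, and the sketch does not retry on EINTR or sync the containing directory.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <stdio.h>
#include <unistd.h>

/* Write `len` bytes to "<path>.tmp", flush them to disk, then rename the
 * temporary file over `path`. Returns 0 on success, -1 on failure. */
static int atomic_write(const char *path, const void *data, size_t len)
{
    assert(path != NULL && (data != NULL || len == 0));

    char tmp[4096];
    int n = snprintf(tmp, sizeof tmp, "%s.tmp", path);
    if (n < 0 || (size_t)n >= sizeof tmp) return -1;

    int fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) return -1;

    const char *p = data;
    size_t left = len;
    while (left > 0) {                        /* write() may be partial */
        ssize_t w = write(fd, p, left);
        if (w < 0) { close(fd); unlink(tmp); return -1; }
        p += w;
        left -= (size_t)w;
    }

    if (fsync(fd) != 0) { close(fd); unlink(tmp); return -1; }
    if (close(fd) != 0) { unlink(tmp); return -1; }

    /* rename() replaces the target in one step on POSIX file systems, so a
     * reader sees either the old file or the complete new one. */
    if (rename(tmp, path) != 0) { unlink(tmp); return -1; }
    return 0;
}
```

If the process dies before the rename, the target path still holds the previous complete file and only the temporary file is left behind.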