String Operations
Complete reference for string manipulation in Rayforce — from basic transforms to pattern matching, covering both RAY_SYM and RAY_STR column types.
RAY_SYM vs RAY_STR
Rayforce provides two distinct string column representations, each optimized for different workloads. Choosing the right type is critical for performance.
RAY_SYM — Dictionary-Encoded Symbols
RAY_SYM columns store strings as integer indices into a global intern table. Ideal for low-cardinality categorical data (country codes, status flags, product categories).
- Adaptive-width indices — 8, 16, 32, or 64-bit integers depending on dictionary size
- Equality comparison is a single integer compare — O(1)
- Group-by on SYM columns uses direct index lookup instead of hashing
- Memory efficient — each row stores only 1–8 bytes regardless of string length
- Global intern table — symbols are shared across all columns and tables
; Create a table with a SYM column (default for short repeated strings in CSV)
ray> (set t (read-csv "trades.csv"))
; region column is automatically SYM — only 4 unique values across 1M rows
RAY_STR — Variable-Length Strings
RAY_STR columns store variable-length strings with a hybrid inline/pool layout. Best for high-cardinality or unique text data (names, descriptions, URLs).
- SSO (Small String Optimization) — strings of 12 bytes or fewer are stored inline in the 16-byte
ray_str_telement, requiring zero indirection - Pool storage — strings longer than 12 bytes are written to a per-vector pool; the element stores a 4-byte offset and 4-byte length, plus a 4-byte prefix copied inline for fast comparison rejection
- Fast comparison rejection — the 4-byte prefix allows most unequal comparisons to short-circuit without following the pool pointer
- Per-vector pool — each RAY_STR vector has its own pool;
col_propagate_str_pool()shares pools between source and destination during execution
; STR columns are used for unique/high-cardinality text
ray> (set names (vec-str ["Alice" "Bob" "Charlie"]))
; "Alice" (5 bytes) → stored inline (SSO)
; "A longer description here" (26 bytes) → stored in pool with 4-byte prefix
RAY_SYM for columns with fewer than ~65K unique values (status codes, categories, tickers). Use RAY_STR for free-text, names, addresses, or any column where most values are unique. The CSV reader auto-detects: columns with a high repeat ratio become SYM, others become STR.
Null Propagation
All string operations in Rayforce follow strict null propagation semantics:
- Null input produces null output — if any required input row is null, the output row is null
- CONCAT is null if any argument is null —
concat("hello", NULL, "world")returns NULL, not"helloworld" - Null propagation applies uniformly to both RAY_SYM and RAY_STR columns
- Null bitmaps are carried through the execution pipeline per morsel (1024 elements)
ray> (set t (table [name] (list ["Alice" 0N "Charlie"])))
ray> (select {from:t cols: {upper_name: (upper name)}})
upper_name
----------
ALICE
0N
CHARLIE
String Functions
upper
lower
strlen
trim
concat
substr
start (0-based) with length len. If start is beyond the string length, returns an empty string. If start + len exceeds the string, returns everything from start to the end. Null input produces null output.replace
from with to in each string element. If from is not found, the original string is returned unchanged. Null input produces null output.like
% (match any sequence of characters) and _ (match any single character). Works on both RAY_SYM and RAY_STR columns.ilike
like (% and _) but ignores character case during comparison.split
format
{} are replaced with stringified arguments in order. Useful for building display strings or log messages.String Operations in the DAG
When using the C API, string operations are available as DAG opcodes. These are fused into morsel-driven execution alongside arithmetic and comparison operations.
| Opcode | C API | Description |
|---|---|---|
OP_UPPER | ray_upper(g, a) | Uppercase transform |
OP_LOWER | ray_lower(g, a) | Lowercase transform |
OP_STRLEN | ray_strlen(g, a) | String byte length |
OP_TRIM | ray_trim_op(g, a) | Strip leading/trailing whitespace |
OP_SUBSTR | ray_substr(g, str, start, len) | Extract substring by position |
OP_REPLACE | ray_replace(g, str, from, to) | Replace all occurrences |
OP_CONCAT | ray_concat(g, args, n) | Concatenate N strings |
OP_LIKE | ray_like(g, input, pattern) | Case-sensitive pattern match |
OP_ILIKE | ray_ilike(g, input, pattern) | Case-insensitive pattern match |
C API Example
/* Filter rows where upper(name) LIKE "A%" and compute strlen */
ray_graph_t* g = ray_graph_new(table);
ray_op_t* name = ray_scan(g, "name");
ray_op_t* up_name = ray_upper(g, name);
ray_op_t* pattern = ray_const_str(g, "A%", 2);
ray_op_t* pred = ray_like(g, up_name, pattern);
ray_op_t* filt_name = ray_filter(g, name, pred);
ray_op_t* name_len = ray_strlen(g, filt_name);
/* Execute — upper, like, filter, strlen all fused into one morsel pass */
ray_t* result = ray_execute(g, ray_optimize(g, name_len));
String Pool Internals
Understanding the internal layout helps explain performance characteristics of string operations.
ray_str_t Element Layout (16 bytes)
| Bytes | Inline (SSO) | Pool Reference |
|---|---|---|
| 0–3 | String data [0..3] | 4-byte prefix (first 4 bytes of string) |
| 4–7 | String data [4..7] | Pool offset (uint32_t) |
| 8–11 | String data [8..11] | String length (uint32_t) |
| 12–15 | Length + flag | Length + flag (high bit = 1 for pool) |
ray_str_vec_get() returns a pointer to the inline data or pool data transparently.
Hash and Comparison
String hashing uses ray_str_t_hash() which operates directly on the element bytes. Comparison via ray_str_t_cmp() / ray_str_t_eq() first compares the 4-byte prefix for fast rejection, then falls through to a full byte comparison only when prefixes match. This makes hash joins and group-by on string columns significantly faster than naive approaches.
Dictionary-Encoded Symbol Width
RAY_SYM columns use adaptive-width integer indices to minimize memory:
| Dictionary Size | Index Width | Bytes per Row |
|---|---|---|
| ≤ 255 | 8-bit | 1 |
| ≤ 65,535 | 16-bit | 2 |
| ≤ 4,294,967,295 | 32-bit | 4 |
| Larger | 64-bit | 8 |
Width is set at column creation via ray_sym_vec_new(sym_width, capacity) where sym_width is 1, 2, 4, or 8. The CSV reader picks the narrowest width that fits the observed cardinality.