String Operations

Complete reference for string manipulation in Rayforce — from basic transforms to pattern matching, covering both RAY_SYM and RAY_STR column types.

RAY_SYM vs RAY_STR

Rayforce provides two distinct string column representations, each optimized for different workloads. Choosing the right type is critical for performance.

RAY_SYM — Dictionary-Encoded Symbols

RAY_SYM columns store strings as integer indices into a global intern table. Ideal for low-cardinality categorical data (country codes, status flags, product categories).

; Create a table with a SYM column (default for short repeated strings in CSV)
ray> (set t (read-csv "trades.csv"))
; region column is automatically SYM — only 4 unique values across 1M rows
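The interning idea can be sketched in plain C. This is an illustration only, not Rayforce's implementation: `sym_table_t`, `sym_intern`, and `sym_encode` are hypothetical names, the capacity is arbitrary, and a linear scan stands in for whatever lookup structure the real intern table uses.

```c
#include <stdint.h>
#include <string.h>

#define SYM_CAP 1024  /* arbitrary capacity for this sketch */

/* Hypothetical intern table: not a Rayforce type. */
typedef struct {
    const char *strs[SYM_CAP];  /* one borrowed pointer per distinct string */
    uint32_t    count;          /* number of distinct strings interned so far */
} sym_table_t;

/* Return the index of s, adding it on first sight.
   Returns UINT32_MAX when the table is full. */
uint32_t sym_intern(sym_table_t *t, const char *s) {
    for (uint32_t i = 0; i < t->count; i++)
        if (strcmp(t->strs[i], s) == 0) return i;
    if (t->count == SYM_CAP) return UINT32_MAX;
    t->strs[t->count] = s;
    return t->count++;
}

/* Encode n row values as indices; returns the resulting dictionary size. */
uint32_t sym_encode(sym_table_t *t, const char **rows, uint32_t n, uint32_t *out) {
    for (uint32_t i = 0; i < n; i++) out[i] = sym_intern(t, rows[i]);
    return t->count;
}
```

Once a column is encoded this way, comparing two rows for equality is an integer comparison, which is why low-cardinality columns benefit most.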

RAY_STR — Variable-Length Strings

RAY_STR columns store variable-length strings with a hybrid inline/pool layout. Best for high-cardinality or unique text data (names, descriptions, URLs).

; STR columns are used for unique/high-cardinality text
ray> (set names (vec-str ["Alice" "Bob" "Charlie"]))
; "Alice" (5 bytes) → stored inline (SSO)
; "A longer description here" (25 bytes) → stored in pool with 4-byte prefix

When to use which? Use RAY_SYM for columns with fewer than ~65K unique values (status codes, categories, tickers). Use RAY_STR for free text, names, addresses, or any column where most values are unique. The CSV reader auto-detects the type: columns with a high repeat ratio become SYM, all others become STR.

Null Propagation

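All string operations in Rayforce follow strict null propagation: a null (0N) input element always yields a null output element, while the remaining elements are processed normally: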

ray> (set t (table [name] (list ["Alice" 0N "Charlie"])))
ray> (select {from:t cols: {upper_name: (upper name)}})
upper_name
----------
ALICE
0N
CHARLIE
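The same rule can be sketched for an element-wise kernel in C. This is a minimal illustration, assuming a NULL pointer stands in for a null (0N) element; `upper_column` is a hypothetical name and Rayforce's actual null representation is not described here.

```c
#include <ctype.h>
#include <stddef.h>

/* Uppercase each element in place and return how many non-null elements
   were transformed. A NULL pointer models a null element (an assumption
   of this sketch): it is skipped, so null in means null out. */
size_t upper_column(char **col, size_t n) {
    size_t touched = 0;
    for (size_t i = 0; i < n; i++) {
        if (col[i] == NULL) continue;  /* propagate the null unchanged */
        for (char *p = col[i]; *p; p++)
            *p = (char)toupper((unsigned char)*p);
        touched++;
    }
    return touched;
}
```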

String Functions

upper

(upper x) unary · element-wise
Converts every character in each string element to uppercase. Works on both RAY_SYM and RAY_STR columns. Null inputs produce null outputs.
ray> (upper "hello world")
"HELLO WORLD"
ray> (select {from:t cols: {name: (upper name)}})

lower

(lower x) unary · element-wise
Converts every character in each string element to lowercase. Works on both RAY_SYM and RAY_STR columns. Null inputs produce null outputs.
ray> (lower "HELLO WORLD")
"hello world"
ray> (select {from:t cols: {email: (lower email)}})

strlen

(strlen x) unary · element-wise
Returns the byte length of each string element as an I64 vector. Null inputs produce null outputs.
ray> (strlen "hello")
5
ray> (select {from:t cols: {name:name len: (strlen name)}})
name    len
-----------
Alice   5
Bob     3
Charlie 7
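Because strlen reports bytes, the result can exceed the visible character count for multi-byte text. The sketch below contrasts the two in C, assuming the strings are UTF-8 (an assumption; the encoding is not stated in this reference). `utf8_codepoints` is a hypothetical helper, not a Rayforce function.

```c
#include <stddef.h>
#include <string.h>  /* strlen() gives the byte length */

/* Count UTF-8 code points by counting every byte that is not a
   continuation byte (continuation bytes have the form 10xxxxxx). */
size_t utf8_codepoints(const char *s) {
    size_t n = 0;
    for (; *s; s++)
        if (((unsigned char)*s & 0xC0) != 0x80) n++;
    return n;
}
```

For the UTF-8 string "café", where "é" is encoded as the two bytes 0xC3 0xA9, the byte length is 5 while the code-point count is 4.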

trim

(trim x) unary · element-wise
Removes leading and trailing whitespace (spaces, tabs, newlines) from each string element. Works on both RAY_SYM and RAY_STR. Null inputs produce null outputs.
ray> (trim " hello ")
"hello"
ray> (select {from:t cols: {clean: (trim raw_name)}})

concat

(concat a b ...) variadic · element-wise
Concatenates two or more string arguments element-wise and accepts any number of arguments. Returns null if any argument is null. This strict null propagation matches the SQL || operator and MySQL's CONCAT; PostgreSQL's and SQL Server's CONCAT, by contrast, skip null arguments.
ray> (concat "hello" " " "world")
"hello world"
; Null propagation: any null argument makes the result null
ray> (concat "hi" 0N "there")
0N
ray> (select {from:t cols: {full: (concat first " " last)}})

substr

(substr str start len) ternary · element-wise
Extracts a substring starting at byte position start (0-based) with length len. If start is beyond the string length, returns an empty string. If start + len exceeds the string, returns everything from start to the end. Null input produces null output.
ray> (substr "hello world" 6 5)
"world"
ray> (select {from:t cols: {prefix: (substr code 0 3)}})
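The clamping rules for start and len can be sketched in C as follows. `substr_bytes` is a hypothetical helper written for illustration; it works on NUL-terminated strings and does not model null inputs.

```c
#include <stddef.h>
#include <string.h>

/* Copy the byte range starting at start (0-based) with length len from src
   into dst, clamped to the end of src. dst must have room for
   min(len, strlen(src)) + 1 bytes. Returns the number of bytes copied. */
size_t substr_bytes(const char *src, size_t start, size_t len, char *dst) {
    size_t n = strlen(src);
    if (start >= n) {              /* start at or beyond the end: empty string */
        dst[0] = '\0';
        return 0;
    }
    size_t avail = n - start;
    if (len > avail) len = avail;  /* start + len overshoots: take the tail */
    memcpy(dst, src + start, len);
    dst[len] = '\0';
    return len;
}
```

With this sketch, `substr_bytes("hello", 3, 100, buf)` copies the 2-byte tail "lo", and `substr_bytes("hello", 10, 3, buf)` yields an empty string.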

replace

(replace str from to) ternary · element-wise
Replaces all occurrences of substring from with to in each string element. If from is not found, the original string is returned unchanged. Null input produces null output.
ray> (replace "hello world" "world" "rayforce")
"hello rayforce"
ray> (select {from:t cols: {clean: (replace path "/" "-")}})
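A replace-all pass can be sketched in C as a single left-to-right scan. `replace_all` is a hypothetical helper; the handling of overlapping matches (non-overlapping, leftmost first) and of an empty search string (matches nothing) are assumptions of this sketch, not documented Rayforce behavior.

```c
#include <stddef.h>
#include <string.h>

/* Replace every non-overlapping occurrence of from with to, scanning left
   to right. Writes at most cap - 1 bytes plus a terminator into dst and
   returns the full result length, so a return value >= cap signals
   truncation. An empty from matches nothing. */
size_t replace_all(const char *src, const char *from, const char *to,
                   char *dst, size_t cap) {
    size_t flen = strlen(from), tlen = strlen(to), out = 0;
    while (*src) {
        if (flen > 0 && strncmp(src, from, flen) == 0) {
            for (size_t i = 0; i < tlen; i++, out++)
                if (out + 1 < cap) dst[out] = to[i];
            src += flen;           /* skip past the matched text */
        } else {
            if (out + 1 < cap) dst[out] = *src;
            out++;
            src++;
        }
    }
    if (cap > 0) dst[out < cap ? out : cap - 1] = '\0';
    return out;
}
```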

like

(like str pattern) binary · element-wise
Case-sensitive SQL LIKE pattern matching. Returns a BOOL vector. Supports % (match any sequence of characters) and _ (match any single character). Works on both RAY_SYM and RAY_STR columns.
ray> (like "hello world" "%world")
1b
ray> (select {from:t where: (like name "A%")})
; Returns all rows where name starts with "A"
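The wildcard semantics can be sketched in C with a standard backtracking matcher. `like_match` is a hypothetical function, not Rayforce's implementation. This sketch assumes `_` matches a single byte and omits escape characters, neither of which is specified in this reference.

```c
#include <stdbool.h>
#include <stddef.h>

/* Case-sensitive LIKE over NUL-terminated byte strings.
   '%' matches any run of bytes (including none); '_' matches exactly one. */
bool like_match(const char *s, const char *p) {
    const char *star_p = NULL;  /* pattern position just after the last '%' */
    const char *star_s = NULL;  /* input position where that '%' started */
    while (*s) {
        if (*p == '%') {
            star_p = ++p;       /* remember the wildcard, try matching zero bytes */
            star_s = s;
        } else if (*p == '_' || *p == *s) {
            p++;
            s++;
        } else if (star_p) {
            p = star_p;         /* backtrack: let '%' absorb one more byte */
            s = ++star_s;
        } else {
            return false;
        }
    }
    while (*p == '%') p++;      /* trailing '%' can match the empty string */
    return *p == '\0';
}
```

A case-insensitive variant could fold both bytes with `tolower` before the `*p == *s` comparison.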

ilike

(ilike str pattern) binary · element-wise
Case-insensitive SQL LIKE pattern matching. Returns a BOOL vector. Same wildcard syntax as like (% and _) but ignores character case during comparison.
ray> (ilike "Hello World" "%hello%")
1b
ray> (select {from:t where: (ilike email "%@gmail.com")})

split

(split str delimiter) binary · element-wise
Splits each string element by the given delimiter and returns a list of string vectors. Each element in the result is a vector of the split parts. Null input produces null output.
ray> (split "a,b,c" ",")
["a" "b" "c"]
ray> (split "hello world" " ")
["hello" "world"]

format

(format fmt ...args) variadic
Formats values into a string using a format template. Placeholders {} are replaced with stringified arguments in order. Useful for building display strings or log messages.
ray> (format "Hello, {}!" "world")
"Hello, world!"
ray> (format "{} + {} = {}" 1 2 3)
"1 + 2 = 3"
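In-order placeholder substitution can be sketched in C as below. `format_braces` and `put` are hypothetical helpers. The sketch takes arguments that are already stringified, and it copies surplus `{}` placeholders through literally; both choices are assumptions, since this reference does not say how Rayforce stringifies values or handles a placeholder without an argument.

```c
#include <stddef.h>
#include <string.h>

/* Append n bytes of src at offset out, never writing past cap - 1.
   Returns the new logical offset even when output is truncated. */
static size_t put(char *dst, size_t cap, size_t out, const char *src, size_t n) {
    for (size_t i = 0; i < n; i++, out++)
        if (out + 1 < cap) dst[out] = src[i];
    return out;
}

/* Replace each "{}" in fmt with the next argument, in order.
   Returns the full result length; a value >= cap signals truncation. */
size_t format_braces(char *dst, size_t cap, const char *fmt,
                     const char **args, size_t nargs) {
    size_t out = 0, next = 0;
    while (*fmt) {
        if (fmt[0] == '{' && fmt[1] == '}' && next < nargs) {
            out = put(dst, cap, out, args[next], strlen(args[next]));
            next++;
            fmt += 2;
        } else {
            out = put(dst, cap, out, fmt, 1);  /* literal byte */
            fmt++;
        }
    }
    if (cap > 0) dst[out < cap ? out : cap - 1] = '\0';
    return out;
}
```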

String Operations in the DAG

When using the C API, string operations are available as DAG opcodes. These are fused into morsel-driven execution alongside arithmetic and comparison operations.

Opcode       C API                            Description
----------   ------------------------------   ---------------------------------
OP_UPPER     ray_upper(g, a)                  Uppercase transform
OP_LOWER     ray_lower(g, a)                  Lowercase transform
OP_STRLEN    ray_strlen(g, a)                 String byte length
OP_TRIM      ray_trim_op(g, a)                Strip leading/trailing whitespace
OP_SUBSTR    ray_substr(g, str, start, len)   Extract substring by position
OP_REPLACE   ray_replace(g, str, from, to)    Replace all occurrences
OP_CONCAT    ray_concat(g, args, n)           Concatenate N strings
OP_LIKE      ray_like(g, input, pattern)      Case-sensitive pattern match
OP_ILIKE     ray_ilike(g, input, pattern)     Case-insensitive pattern match

C API Example

/* Filter rows where upper(name) LIKE "A%" and compute strlen */
ray_graph_t* g = ray_graph_new(table);

ray_op_t* name    = ray_scan(g, "name");
ray_op_t* up_name = ray_upper(g, name);
ray_op_t* pattern = ray_const_str(g, "A%", 2);
ray_op_t* pred    = ray_like(g, up_name, pattern);

ray_op_t* filt_name = ray_filter(g, name, pred);
ray_op_t* name_len  = ray_strlen(g, filt_name);

/* Execute — upper, like, filter, strlen all fused into one morsel pass */
ray_t* result = ray_execute(g, ray_optimize(g, name_len));

String Pool Internals

Understanding the internal layout helps explain performance characteristics of string operations.

ray_str_t Element Layout (16 bytes)

Bytes   Inline (SSO)          Pool Reference
-----   -------------------   ---------------------------------------
0–3     String data [0..3]    4-byte prefix (first 4 bytes of string)
4–7     String data [4..7]    Pool offset (uint32_t)
8–11    String data [8..11]   String length (uint32_t)
12–15   Length + flag         Length + flag (high bit = 1 for pool)

SSO threshold: Strings of 12 bytes or fewer are stored entirely within the 16-byte element — no heap allocation, no pointer chase. The majority of real-world strings (tickers, codes, short names) benefit from this optimization. Access via ray_str_vec_get() returns a pointer to the inline data or pool data transparently.
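The layout can be modelled in C with a union of the two variants. This is a sketch that follows the table literally: `str_elem_t` and its helper functions are hypothetical names, not the real ray_str_t definition, and the pool here is a plain caller-owned byte buffer.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define SSO_MAX   12u          /* inline capacity in bytes */
#define POOL_FLAG 0x80000000u  /* high bit of the last word marks a pool reference */

/* Hypothetical 16-byte element mirroring the layout table. */
typedef union {
    struct { char data[12]; uint32_t len_flag; } inl;
    struct { char prefix[4]; uint32_t offset; uint32_t len; uint32_t len_flag; } pool;
} str_elem_t;

/* Build an element for the n bytes at s. Strings longer than SSO_MAX are
   appended to pool at *pool_used; the caller guarantees the pool has room. */
str_elem_t str_elem_make(const char *s, uint32_t n, char *pool, uint32_t *pool_used) {
    str_elem_t e;
    memset(&e, 0, sizeof e);
    if (n <= SSO_MAX) {
        memcpy(e.inl.data, s, n);         /* whole string lives in the element */
        e.inl.len_flag = n;
    } else {
        memcpy(e.pool.prefix, s, 4);      /* cached prefix for fast comparison */
        e.pool.offset = *pool_used;
        e.pool.len = n;
        e.pool.len_flag = n | POOL_FLAG;
        memcpy(pool + *pool_used, s, n);
        *pool_used += n;
    }
    return e;
}

bool str_elem_is_pooled(const str_elem_t *e) {
    return (e->inl.len_flag & POOL_FLAG) != 0;
}

/* Return a pointer to the string bytes, inline or pooled, and the length via *n. */
const char *str_elem_get(const str_elem_t *e, const char *pool, uint32_t *n) {
    *n = e->inl.len_flag & ~POOL_FLAG;
    return str_elem_is_pooled(e) ? pool + e->pool.offset : e->inl.data;
}
```

In this model a short value such as "Alice" never touches the pool, while a longer string costs one pool append on write and one offset lookup on read.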

Hash and Comparison

String hashing uses ray_str_t_hash(), which operates directly on the element bytes. Comparison via ray_str_t_cmp() / ray_str_t_eq() first compares the 4-byte prefix for fast rejection, then falls through to a full byte comparison only when the prefixes match. This speeds up hash joins and group-bys on string columns, because most unequal pairs are rejected by the prefix check without a pool lookup.
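The prefix-first comparison can be sketched in C as follows. `str_view_t`, `str_view`, and `str_view_eq` are hypothetical names over a simplified view of an element. The length check before the prefix check is an addition of this sketch, not something stated about ray_str_t_eq().

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Simplified view of a string element: a cached 4-byte prefix
   (zero-padded for short strings), the length, and the full bytes. */
typedef struct {
    char        prefix[4];
    uint32_t    len;
    const char *bytes;
} str_view_t;

str_view_t str_view(const char *s) {
    str_view_t v = { {0, 0, 0, 0}, (uint32_t)strlen(s), s };
    memcpy(v.prefix, s, v.len < 4 ? v.len : 4);
    return v;
}

/* Equality with fast rejection: the cheap length and prefix checks run
   first, and the full byte comparison runs only when both agree. */
bool str_view_eq(const str_view_t *a, const str_view_t *b) {
    if (a->len != b->len) return false;
    if (memcmp(a->prefix, b->prefix, 4) != 0) return false;
    return memcmp(a->bytes, b->bytes, a->len) == 0;
}
```

Values that differ in their first four bytes, such as "MSFT.US" and "AAPL.US", are rejected without reading the remaining bytes. Values that share a prefix, such as "AAPL.US" and "AAPL.UK", still need the full comparison.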

Dictionary-Encoded Symbol Width

RAY_SYM columns use adaptive-width integer indices to minimize memory:

Dictionary Size   Index Width   Bytes per Row
---------------   -----------   -------------
≤ 255             8-bit         1
≤ 65,535          16-bit        2
≤ 4,294,967,295   32-bit        4
Larger            64-bit        8

Width is set at column creation via ray_sym_vec_new(sym_width, capacity) where sym_width is 1, 2, 4, or 8. The CSV reader picks the narrowest width that fits the observed cardinality.
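Picking the narrowest width from an observed cardinality can be sketched in C as below. `sym_width_for` is a hypothetical helper that applies the thresholds from the table; how the CSV reader actually makes this choice is not shown in this reference.

```c
#include <stdint.h>

/* Narrowest index width in bytes (1, 2, 4, or 8) for a dictionary with
   dict_size entries, using the thresholds from the width table. */
uint8_t sym_width_for(uint64_t dict_size) {
    if (dict_size <= UINT8_MAX)  return 1;
    if (dict_size <= UINT16_MAX) return 2;
    if (dict_size <= UINT32_MAX) return 4;
    return 8;
}
```

For a column of 1M rows with 4 unique values, this selects a 1-byte index, so the index data takes about 1 MB instead of the 8 MB a fixed 64-bit index would need.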