21  Training Data in R

Reading Feature Store exports — semi-structured types and ADBC paths

Keywords

snowflake, R, RStudio, Posit, VS Code, workspace notebooks, snowflakeR, RSnowflake, mlops

21.1 Overview

This chapter covers getting Feature Store training data into R for modeling — not authoring Feature Views (see Authoring Feature Views). It is organized in three layers:

  1. Materialize — Feature Store joins (save_as, Datasets) and RSnowflake read transport (ADBC vs SQL API).
  2. Read techniques — flatten in Snowflake, unpack Arrow list columns in R, or parse JSON as a fallback; plus VECTOR layout reference.
  3. End-to-end pipelines (A–E) — recipes that combine materialization + read technique for each feature shape (scalar, wide scalar, ARRAY, OBJECT, Python decode).

Start with Feature Store training data in R if you are new to exports. Author features first in Authoring Feature Views; jump to Choose a pipeline if views are already registered.

It complements Feature Store (catalog and training APIs), Authoring Feature Views (warehouse SQL/dbplyr for feature_df), and RSnowflake: Query (dbplyr push-down, compute(), bulk I/O).

21.1.1 Scalars first, semi-structured when you need them

Use scalar columns: Most Feature Views and training data sets use ordinary scalar columns, one Snowflake column per feature, point-in-time join via sfr_generate_training_data(), read with RSnowflake + ADBC as native R types (Pipeline A). That is the default and works well for up to a few hundred columns after multi–Feature View joins.

Use OBJECT, ARRAY, or VECTOR when:

Situation Why semi-structured helps
Wide tables (hundreds–thousands of features) Few physical columns; lighter SQL compile; efficient transport via packed ARRAY/VECTOR (Pipeline B)
Feature / schema evolution Add keys to an OBJECT or slots in an ARRAY without ALTER TABLE … ADD COLUMN for every new feature
Ultra-wide or sparse result sets Store present values only; carry null masks; decode on the client
Embeddings ARRAY or VECTOR instead of one column per dimension (Pipeline C)

For authoring features (SQL/dbplyr feature_df, simple through wide/temporal), see Authoring Feature Views. For platform background (hybrid OBJECT payloads, compile cost, sparse arrays, DT cascades), read the Feature Store Implementation Guide — Advanced patterns §12.10 — especially §12.10.3 Semi-structured types in Feature Views and §12.10.4 Hybrid storage. This chapter focuses on reading materialized training data into R once Feature Views exist.

21.1.2 SQL and dbplyr tabs

Warehouse logic appears as SQL or dbplyr (RSnowflake lazy tables) wherever both are practical — the same idea as the RSnowflake: Query §dbplyr (group="query-sql-dplyr"). Examples here use tabs where your tab choice persists while you read this chapter. Use dbplyr when you already work in R with tbl(); use SQL when you share DDL with other teams or generate views from a feature manifest.

21.2 Learning Objectives

  • Navigate the chapter layers: materialize → read technique → pipeline workflow
  • Decide when scalar columns suffice vs OBJECT/ARRAY/VECTOR exports
  • Materialize Feature Store training data (save_as, Datasets) before large RSnowflake reads
  • Flatten OBJECT/ARRAY/VECTOR in SQL when possible; use ADBC decode or JSON parse when not
  • Choose pipeline A–E by how features are stored in Snowflake

21.3 How this chapter is organized

Layer Sections Question it answers
Transport Semi-structured in R, ADBC setup What column types do I get over SQL API vs ADBC?
Materialize Feature Store training data How does data leave Feature Store (save_as, Dataset, direct SQL)?
Author Authoring Feature Views How do I define feature_df SQL/dbplyr before export?
Read techniques Extract in Snowflake, ADBC decode, JSON parse, VECTOR layout After materialization, how do I turn OBJECT/ARRAY/VECTOR into a modeling frame?
Choose workflow Wide training sets, Pipelines A–E Which full story matches how my features are stored?

The pipelines are not a separate topic from extract/export — they are composed workflows. Pipeline steps link back to the read techniques above instead of redefining them.


21.4 Semi-structured columns in R

How semi-structured columns appear in R depends on which RSnowflake transport you use:

Transport When Semi-structured in R (typical)
SQL API (default) RSnowflake.backend = "auto" and ADBC not installed, or ADBC failed VARIANT, OBJECT, ARRAY, VECTOR → character (JSON text per cell)
ADBC adbcsnowflake installed + options(RSnowflake.backend = "adbc") Scalars native; VECTOR often list columns of numbers; typed ARRAY / structured OBJECT may be native if account flag is on (below)

Scalar columns (DOUBLE, NUMBER, timestamps, …) are native in both paths. Feature Store training tables that are mostly scalars still benefit from ADBC for volume (columnar transfer, less parsing overhead on numerics).

21.4.1 Recommendation for Feature Store workloads

For retrieving training data into R when views use OBJECT, ARRAY, or VECTOR:

  1. Materialize in Snowflakesfr_generate_training_data(..., save_as = "MY_TRAINING") or a managed table / export view (Feature Store training data).
  2. Read with RSnowflake + ADBC — install adbcsnowflake (CRE or bootstrap), then:
SELECT *
FROM MY_DB.ML.MY_TRAINING
LIMIT 100000;
options(RSnowflake.backend = "adbc")   # force Arrow path for reads

library(dplyr)
training <- tbl(con, in_schema("ML", "MY_TRAINING")) |>
  head(100000) |>
  collect()
  1. Prefer unpacking in SQL or dbplyr when you can — see Extract in Snowflake — so collect() returns scalars even on the SQL API path.
  2. Reserve client-side JSON parsing for VARIANT/OBJECT bundles you cannot flatten in SQL, or for validation samples (LIMIT).

sfr_generate_training_data() without save_as still goes through snowflakeR → pandas → R; it does not use RSnowflake ADBC. Use save_as + RSnowflake when semi-structured or wide training pulls are the bottleneck.

21.4.2 Why Feature Store users care

Snowflake recommends OBJECT or ARRAY columns for very wide feature sets — see Scalars first and the Feature Store Implementation Guide — Wide & sparse. Joining multiple Feature Views can still produce hundreds of scalar columns even when each view is modest (Wide training sets).

Pattern In Snowflake SQL API → R ADBC → R (with account settings)
Many scalar columns One column per feature Native types Native types (faster at scale)
VARIANT / untyped OBJECT Flexible JSON blob JSON character JSON character (no native Arrow)
Typed ARRAY / structured OBJECT Declared element types JSON character Native list/struct if account flag enabled
VECTOR packs Fixed-length numerics JSON character Often list column → matrix (no account flag required)

RSnowflake helps you build bundles on the warehouse (object_agg, array_agg, ::VECTOR); how you read them is a separate choice between SQL flattening, ADBC, and client-side decode.

Use How this chapter is organized to navigate; Choose a pipeline comes after the read techniques.


21.5 Feature Store training data in R

Feature Store training flows through snowflakeR, not RSnowflake alone — but the data types you get in R depend on which read path you choose.

21.5.1 Path 1 — sfr_generate_training_data() (transient join)

Joins your spine to one or more Feature Views and returns an R data.frame:

training <- sfr_generate_training_data(
  fs,
  spine    = spine_sql,
  features = list(fv1, fv2),
  spine_timestamp_col = "event_ts",
  spine_label_cols    = "label"
)

What happens under the hood: the Python Feature Store SDK runs the join in Snowflake, materializes a Snowpark result, converts it to pandas, then snowflakeR transfers columns into R (today via a JSON round-trip for compatibility with Workspace). Scalar feature columns arrive as normal R types. Any OBJECT/ARRAY/VECTOR columns in the join result follow the pandas/JSON path — treat them like SQL API character columns if they appear as strings.

Materialize without pulling everything to R:

sfr_generate_training_data(
  fs,
  spine    = spine_sql,
  features = list(fv1, fv2),
  save_as  = "CHURN_TRAINING_V1",   # table in Feature Store schema
  spine_timestamp_col = "event_ts",
  spine_label_cols    = "label"
)

# Inspect or model from Snowflake with RSnowflake
library(dplyr)
training_tbl <- tbl(con, in_schema("FEATURE_STORE", "CHURN_TRAINING_V1"))
sample       <- training_tbl |> head(10000) |> collect()

Use save_as for wide or large training sets: validate joins in Snowflake, then read with RSnowflake + ADBC (ADBC setup):

SELECT *
FROM MY_DB.FEATURE_STORE.CHURN_TRAINING_V1
LIMIT 100000;
options(RSnowflake.backend = "adbc")
training_tbl <- tbl(con, in_schema("FEATURE_STORE", "CHURN_TRAINING_V1"))
sample       <- training_tbl |> head(100000) |> collect()

21.5.2 Path 2 — sfr_generate_dataset() (versioned Dataset + lineage)

Creates an immutable Snowflake ML Dataset (Parquet files on a stage — the same object Python reads with fsspec):

training <- sfr_generate_dataset(
  fs,
  name     = "CHURN_TRAINING",
  version  = "v1",
  spine    = spine_sql,
  features = list(fv1, fv2),
  spine_label_cols = "label"
)

attr(training, "dataset_name")     # for sfr_log_model() lineage
attr(training, "dataset_version")

What R receives today: the bridge reads the Dataset through Dataset.read.to_snowpark_dataframe().to_pandas() — not direct Parquet file access. That matches demo and moderate row counts; it is the same semantic data Python would get from an in-session read, not the maximum-throughput fsspec path.

What Python does at scale: after generate_dataset(), the SDK exposes Parquet files on internal storage:

# Python (snowflake-ml) — not wrapped in snowflakeR yet
ds = session.dataset.load("MY_DB.ML.CHURN_TRAINING").select_version("v1")
files = ds.read.files()          # fsspec-compatible paths
# pd.read_parquet(files) or pyarrow.dataset

See Snowflake ML Dataset — direct file access.

R options for large Datasets today:

  1. sfr_read_dataset(conn, name, version) — same pandas → R path as sfr_generate_dataset(); suitable when the Dataset already exists.
  2. save_as table + RSnowflake + ADBC — recommended for wide/scalar-heavy training frames at scale (ADBC setup). Dataset Parquet still stores VARIANT as JSON strings on disk — same limitation as Python unless columns are typed ARRAY / scalars.
  3. Python decode + reticulatePipeline E for three-VECTOR export at maximum throughput.

A future snowflakeR release may add format = "parquet" on dataset reads (fsspec → arrow::read_parquet()). Until then, assume Dataset APIs in R are lineage- and moderate-scale-friendly, not a substitute for a full Parquet export pipeline.

Full export workflows: Reading materialized training data, then Choose a pipeline (A–E). Define features first? Authoring Feature Views.

21.5.3 Path 3 — RSnowflake only (custom SQL / dbplyr)

When you own the SQL (external Feature View table, export view, or save_as output), use RSnowflake directly:

SELECT *
FROM MY_DB.ML.CHURN_TRAINING_V1;
options(RSnowflake.backend = "adbc")   # recommended for materialized training reads

training_tbl <- tbl(con, in_schema("ML", "CHURN_TRAINING_V1"))
training     <- training_tbl |> collect()
# Scalars: native types. VARIANT/OBJECT: chr → [JSON parse](#sec-parse-json).
# VECTOR / typed ARRAY: list → [ADBC decode](#sec-parse-adbc).
# Or flatten first: [Extract in Snowflake](#sec-sql-extract).

This path gives you the most control over which columns cross the wire — important when joins are wide.

21.6 ADBC, Arrow, and account settings

ADBC (adbcsnowflake) uses Snowflake’s Arrow result path. That is the recommended read transport for large materialized training tables and for VECTOR / typed ARRAY columns — not a magic bypass for every semi-structured type.

21.6.1 Enable ADBC in Workspace

Step Action
Packages adbcsnowflake, adbcdrivermanager (CRE image or bootstrap languages.r.addons.adbc: true)
Connection Same dbConnect(Snowflake()) — backend is lazy-initialised
Reads options(RSnowflake.backend = "adbc") or leave "auto" when ADBC is installed

See RSnowflake in Workspace and WORKSPACE_ADBC.md.

21.6.2 Account parameter for native structured types

Snowflake can return typed ARRAY, structured OBJECT, and MAP as native Arrow types instead of JSON strings. This is controlled at account scope (not session):

-- Accountadmin — one-time per account; confirm with your admin
ALTER ACCOUNT SET ENABLE_STRUCTURED_TYPES_IN_CLIENT_RESPONSE = TRUE;
Column type Native Arrow over ADBC?
Scalars (DOUBLE, INT, …) Yes
VECTOR(FLOAT/INT, n) Yes — fixed_size_list (does not require the account flag)
Typed ARRAY(…) Yes with account flag; otherwise JSON strings
Structured OBJECT (OBJECT(f1 DOUBLE, …)) Yes with account flag
VARIANT / untyped OBJECT No — always JSON strings in Arrow

Ask your account team whether the flag is already enabled before relying on native ARRAY/struct columns in R.

Warning

SQL API has no server-side Arrow. dbGetQuery() on the REST path always uses JSON. dbGetQueryArrow() on the SQL API path builds Arrow client-side from JSON — interface compatibility only, not the same as ADBC transport.

For Feature Store exports you install adbcsnowflake in the Workspace R environment (CRE or bootstrap), materialize training with save_as, then read that table through RSnowflake with options(RSnowflake.backend = "adbc"). That is a client-side ADBC driver talking Arrow to Snowflake — not dbGetQueryArrow() on the REST API.

21.6.3 Inspect what you received

After collect(), inspect column classes before choosing a decoder:

SELECT * FROM MY_DB.ML.MY_TRAINING LIMIT 100;
options(RSnowflake.backend = "adbc")
sample <- dbGetQuery(con, "SELECT * FROM MY_DB.ML.MY_TRAINING LIMIT 100")
str(sample, max.level = 1)
# PACKED_FLOATS   : List of 100  | num [1:128]  ...  → VECTOR over ADBC
# FEATURE_OBJECT  : chr [1:100]  ...                 → VARIANT/OBJECT JSON
# AVG_SPEND       : num ...                          → scalar

21.7 Reading materialized training data

Once training data is a table or view in Snowflake (save_as, export VIEW, or custom SQL), you choose how semi-structured columns cross the wire. These three strategies apply across pipelines — Pipeline A uses scalars only; B–E combine repack in SQL with decode on the client.

Priority Strategy When Section
1 (preferred) Extract in Snowflake Known OBJECT keys or ARRAY slots; moderate width Extract in Snowflake
2 ADBC unpack in R VECTOR packs or typed ARRAY with account flag; large row counts ADBC decode
3 (fallback) JSON parse in R SQL API path, VARIANT blobs, validation samples (LIMIT) JSON parse

For wide numeric exports, see VECTOR layout (three-column pack) — used by Pipeline B and Pipeline E.


21.8 Extract in Snowflake

The most portable and often fastest approach: project features in Snowflake (SQL or dbplyr), then collect() scalars only. Works with SQL API or ADBC and avoids JSON parsing in R entirely.

21.8.1 OBJECT path access

SELECT
  customer_id,
  event_ts,
  label,
  feature_object:avg_spend::FLOAT  AS avg_spend,
  feature_object:n_orders::INTEGER AS n_orders
FROM MY_DB.ML.CUSTOMER_FEATURES_BUNDLED;
library(dplyr)

training <- tbl(con, in_schema("ML", "CUSTOMER_FEATURES_BUNDLED")) |>
  mutate(
    avg_spend = sql("FEATURE_OBJECT:avg_spend::FLOAT"),
    n_orders  = sql("FEATURE_OBJECT:n_orders::INTEGER")
  ) |>
  select(-FEATURE_OBJECT) |>
  collect()

21.8.2 ARRAY → scalars (known positions)

SELECT
  entity_id,
  feature_array[0]::FLOAT AS f1,
  feature_array[1]::FLOAT AS f2
FROM MY_DB.ML.TRAINING_ARRAY_PACK;
training <- tbl(con, "TRAINING_ARRAY_PACK") |>
  mutate(
    f1 = sql("feature_array[0]::FLOAT"),
    f2 = sql("feature_array[1]::FLOAT")
  ) |>
  select(-FEATURE_ARRAY) |>
  collect()

For hundreds of features, generate the mutate() expressions from a feature manifest in R, or build a VIEW in Snowflake that flattens once and register that as the Feature View backing table.

21.8.3 VECTOR packs — keep packed for modeling

If the model expects a matrix, you may keep VECTOR/ARRAY columns through collect() and decode in R (ADBC decode) rather than exploding to hundreds of scalar columns in SQL.


21.9 ADBC: unpack native Arrow columns in R

Use this path when collect() returns list columns (VECTOR or typed ARRAY with account flag), not JSON character.

21.9.1 VECTOR or ARRAY as list column → matrix

SELECT
  entity_id,
  features_float_arr::VECTOR(FLOAT, 128) AS features_float
FROM MY_DB.ML.TRAINING_VECTOR_PACK
LIMIT 50000;
options(RSnowflake.backend = "adbc")

pack <- dbGetQuery(con, "
  SELECT entity_id,
         features_float_arr::VECTOR(FLOAT, 128) AS features_float
  FROM MY_DB.ML.TRAINING_VECTOR_PACK
  LIMIT 50000
")

stopifnot(is.list(pack$FEATURES_FLOAT))
X <- do.call(rbind, pack$FEATURES_FLOAT)
training <- data.frame(entity_id = pack$ENTITY_ID, X)

If a column is still character, fall back to JSON parse — the warehouse or connection may not be using the Arrow path for that query.

21.9.2 Typed ARRAY with account flag

SELECT entity_id, feature_array
FROM MY_DB.ML.TRAINING_TYPED_ARRAY   -- ARRAY(DOUBLE) column
LIMIT 10000;
arr <- dbGetQuery(con, "
  SELECT entity_id, feature_array
  FROM MY_DB.ML.TRAINING_TYPED_ARRAY
  LIMIT 10000
")

if (is.list(arr$FEATURE_ARRAY)) {
  X <- do.call(rbind, arr$FEATURE_ARRAY)
} else {
  X <- matrix(
    vapply(arr$FEATURE_ARRAY, jsonlite::fromJSON, numeric(0), simplify = TRUE),
    nrow = nrow(arr), byrow = TRUE
  )
}

21.9.3 Mix scalar + packed columns

SELECT
  entity_id,
  LABEL::FLOAT AS label_num,
  features_float_arr::VECTOR(FLOAT, 128) AS features_float,
  feature_object
FROM MY_DB.ML.MY_TRAINING;
library(dplyr)

training <- tbl(con, in_schema("ML", "MY_TRAINING")) |>
  mutate(label_num = sql("LABEL::FLOAT")) |>
  collect()   # ADBC: scalars native; packs may be list or character

if ("FEATURES_FLOAT" %in% names(training) && is.list(training$FEATURES_FLOAT)) {
  X_pack <- do.call(rbind, training$FEATURES_FLOAT)
}
if ("FEATURE_OBJECT" %in% names(training) && is.character(training$FEATURE_OBJECT)) {
  feats <- vapply(training$FEATURE_OBJECT, jsonlite::fromJSON, numeric(0), simplify = FALSE)
}

For very large n, consider the arrow package on exported Parquet (Dataset files or COPY INTO) instead of expanding wide matrices in memory — see Feature Store training data.


21.10 SQL API: parse JSON in R

When results arrive as character JSON (SQL API path, or VARIANT/OBJECT over ADBC), use jsonlite. Prefer vapply() over lapply() + rbind() for thousands of rows.

21.10.1 OBJECT column → named features per row

SELECT entity_id, feature_object
FROM MY_DB.ML.CUSTOMER_FEATURES_BUNDLED
LIMIT 5000;
library(jsonlite)

sample <- dbGetQuery(con, "
  SELECT entity_id, feature_object
  FROM MY_DB.ML.CUSTOMER_FEATURES_BUNDLED
  LIMIT 5000
")

fromJSON(sample$FEATURE_OBJECT[1])

feat_list <- vapply(
  sample$FEATURE_OBJECT,
  fromJSON,
  FUN.VALUE = numeric(0),
  simplify = FALSE
)
feat_mat <- do.call(rbind, lapply(feat_list, `[`, unique(names(feat_list[[1]]))))
training <- cbind(entity_id = sample$ENTITY_ID, as.data.frame(feat_mat))

If keys differ by row, normalize with tidyr::unnest_wider() after tibble::enframe(feat_list) or keep features in SQL with feature_object:avg_spend::FLOAT style extracts (Extract in Snowflake).

21.10.2 ARRAY column → numeric matrix

SELECT entity_id, feature_array
FROM MY_DB.ML.TRAINING_ARRAY_PACK
LIMIT 5000;
library(jsonlite)

arr <- dbGetQuery(con, "
  SELECT entity_id, feature_array
  FROM MY_DB.ML.TRAINING_ARRAY_PACK
  LIMIT 5000
")

rows <- vapply(
  arr$FEATURE_ARRAY,
  function(s) fromJSON(s, simplifyVector = TRUE),
  FUN.VALUE = numeric(0)
)
X <- matrix(unlist(rows), nrow = nrow(arr), byrow = TRUE)

Check ncol(X) once against your declared array length; mismatches usually mean NULL or ragged arrays in source data.

21.10.3 VECTOR column → numeric matrix

VECTOR values also arrive as JSON text on the SQL API path. The decode step is the same as ARRAY:

SELECT
  entity_id,
  packed_floats::VECTOR(FLOAT, 128) AS packed_floats
FROM MY_DB.ML.TRAINING_VECTOR_PACK
LIMIT 5000;
library(jsonlite)

vec <- dbGetQuery(con, "
  SELECT entity_id,
         packed_floats::VECTOR(FLOAT, 128) AS packed_floats
  FROM MY_DB.ML.TRAINING_VECTOR_PACK
  LIMIT 5000
")

float_rows <- vapply(
  vec$PACKED_FLOATS,
  function(s) fromJSON(s, simplifyVector = TRUE),
  FUN.VALUE = numeric(0)
)
X_float <- matrix(unlist(float_rows), nrow = nrow(vec), byrow = TRUE)

For production-sized training sets, do not rely on row-wise fromJSON over millions of rows unless columns are VARIANT/OBJECT with no SQL flatten path. Prefer Extract in Snowflake, ADBC + list columns (ADBC decode), or Parquet — see Feature Store training data.


21.11 VECTOR columns for wide feature exports

Snowflake VECTOR types store a fixed-length sequence of numbers (up to 4,096 elements per column). They are useful when you export hundreds or thousands of features but want the query planner to see only a small number of columns — not one expression per feature.

This section documents the three-column VECTOR layout referenced by Pipeline B step 3 and Pipeline E. For the full repack story (wide ASOF → ARRAY → VECTOR → decode), follow Pipeline B; this section is the reference for cast dimensions and client decode snippets.

RSnowflake runs SQL that references VECTOR(FLOAT, n) and VECTOR(INT, n) without restriction. There are no dplyr helpers yet for VECTOR_AVG / VECTOR_SUM; write those in SQL with sql() or dbGetQuery().

After collect() on the SQL API path, VECTOR columns are character (JSON) — use JSON parse. With ADBC, VECTOR is often a list column — use ADBC decode.

21.11.1 Packed export layout (three columns)

A common layout for mixed numeric and categorical features stores data as ARRAY in tables, then casts to VECTOR at read time:

Column Type Role
features_float VECTOR(FLOAT, n) Imputed numerics
features_int VECTOR(INT, n) Ordinal-encoded categoricals
null_mask VECTOR(INT, ceil(n/32)) Bitmask of which features were NULL

The export SQL stays short regardless of feature count — the warehouse compiles three expressions instead of hundreds:

SELECT
  entity_id,
  features_float_arr::VECTOR(FLOAT, 1000) AS features_float,
  features_int_arr::VECTOR(INT, 400)     AS features_int,
  null_mask_arr::VECTOR(INT, 32)         AS null_mask
FROM MY_DB.ML.TRAINING_EXPORT_VIEW
WHERE train_date = '2026-01-01'
LIMIT 10000;
library(dplyr)

options(RSnowflake.backend = "adbc")

export <- tbl(con, in_schema("ML", "TRAINING_EXPORT_VIEW")) |>
  filter(sql("train_date = '2026-01-01'")) |>
  mutate(
    features_float = sql("features_float_arr::VECTOR(FLOAT, 1000)"),
    features_int   = sql("features_int_arr::VECTOR(INT, 400)"),
    null_mask      = sql("null_mask_arr::VECTOR(INT, 32)")
  ) |>
  select(entity_id, features_float, features_int, null_mask) |>
  head(10000) |>
  collect()

if (is.list(export$FEATURES_FLOAT)) {
  X_num <- do.call(rbind, export$FEATURES_FLOAT)
} else {
  X_num <- matrix(
    vapply(export$FEATURES_FLOAT, jsonlite::fromJSON, numeric(0), simplify = TRUE),
    nrow = nrow(export), byrow = TRUE
  )
}

Define features_*_arr in the Feature View or a downstream Dynamic Table — not row-by-row in R. See the Implementation Guide §12.10 for platform patterns.

Note

Parquet and VECTOR: Snowflake cannot write VECTOR columns directly to Parquet. For file-based pipelines, export typed ARRAY columns (or scalars) instead; VECTOR is for in-warehouse query and SQL API/ADBC reads.


21.12 Wide training sets: views, joins, and column count

This section summarises why width appears and which pipeline to pick. Full step-by-step workflows are in End-to-end pipelines (next section).

Width is not only “one Feature View with 500 columns.” It also accumulates when you join several Feature Views into one training set:

training_data <- sfr_generate_training_data(
  fs,
  spine    = spine_sql,
  features = list(
    list(name = "CUSTOMER_DEMOGRAPHICS", version = "v1"),  # 40 cols
    list(name = "CUSTOMER_TRANSACTIONS", version = "v1"), # 120 cols
    list(name = "PRODUCT_EMBEDDINGS",    version = "v1")  # 256 cols
  ),
  spine_timestamp_col = "event_ts",
  spine_label_cols    = "label",
  auto_prefix         = TRUE   # CUSTOMER_TRANSACTIONS_avg_spend, ...
)

Each view adds its feature columns (plus keys and optional timestamp columns). auto_prefix = TRUE avoids name collisions but increases total width — plan for hundreds of columns even when each view is modest.

Situation Pipeline
Few views, up to ~300 total scalar columns after join A — scalar direct
Several views → ~300–500+ scalar columns B — scalar-wide → three VECTOR packs
Features already ARRAY in views C — ARRAY-native
Features as OBJECT / VARIANT D — OBJECT-native (flatten SQL or B-style repack)
Maximum packed decode throughput E — Python → R
Multi-view name collisions auto_prefix = TRUE; consider fewer views or pre-joined SQL

21.13 End-to-end training export pipelines

Each pipeline is a composed workflow — not a new read technique. Before diving in, you should understand:

Pipeline steps link to those sections for extract/decode details rather than repeating them. Steps assume a multilang Workspace notebook: snowflakeR for Feature Store APIs, RSnowflake for reads, optional ADBC in R (ADBC setup).

21.13.1 Choose a pipeline

How features live today Combined ASOF / training width Recommended pipeline
Scalar columns per feature up to a few hundred (~300) columns total Pipeline A — scalar, direct
Scalar columns per feature ~300+ columns after multi–Feature View join Pipeline B — scalar-wide → three VECTOR packs
ARRAY columns (typed packs) Any Pipeline C — ARRAY-native
OBJECT / VARIANT JSON per row Any Pipeline D — OBJECT-native
Packed export; maximum decode speed in notebook Large row × width Pipeline E — Python decode → R

flowchart TB
  subgraph fs [Snowflake — Feature Store]
    SP[Spine + labels]
    FV[Feature Views]
    JOIN[generate_training_data ASOF join]
    MAT[(Materialized table or export VIEW)]
    SP --> JOIN
    FV --> JOIN
    JOIN --> MAT
  end

  subgraph pack [Optional repack in SQL]
    WIDE[Wide scalar table]
    ARR[Three ARRAY columns]
    VEC[VECTOR cast at read time]
    WIDE --> ARR --> VEC
  end

  subgraph client [Workspace client]
    ADBC[RSnowflake ADBC collect]
    R[R data.frame / matrix]
    PY[Python Arrow decode]
    RET[reticulate py_to_r]
    ADBC --> R
    PY --> RET --> R
  end

  MAT --> WIDE
  MAT --> ARR
  ARR --> VEC
  MAT --> ADBC
  VEC --> ADBC
  VEC --> PY


21.13.2 Pipeline A — Scalar features, moderate width

When: Each feature is its own column; total width after the ASOF join stays in a moderate range (rough guide: up to a few hundred — ~300 — scalar feature columns with RSnowflake + ADBC).

flowchart LR
  A1[Spine + Feature Views] --> A2[sfr_generate_training_data]
  A2 --> A3{Row count?}
  A3 -->|Demo / small| A4[Return R data.frame]
  A3 -->|Large or reuse| A5[save_as table]
  A5 --> A6[RSnowflake ADBC collect]
  A6 --> A7[Model in R]
  A4 --> A7

1 — Register and join (R, snowflakeR)

library(snowflakeR)
conn <- sfr_connect()
fs   <- sfr_feature_store(conn, database = "MY_DB", schema = "ML")

fv1 <- sfr_get_feature_view(fs, "CUSTOMER_FEATURES", "v1")
fv2 <- sfr_get_feature_view(fs, "CUSTOMER_ACTIVITY", "v1")

spine_sql <- "
  SELECT customer_id, event_ts, churned AS label
  FROM MY_DB.ML.CHURN_SPINE
  WHERE event_ts >= '2024-01-01'
"

2a — Small pull: straight to R

training <- sfr_generate_training_data(
  fs,
  spine               = spine_sql,
  features            = list(fv1, fv2),
  spine_timestamp_col = "event_ts",
  spine_label_cols    = "label",
  auto_prefix         = TRUE
)
# Scalar columns → ready for glm / tidymodels

2b — Large or repeated reads: materialize, then ADBC

sfr_generate_training_data(
  fs,
  spine               = spine_sql,
  features            = list(fv1, fv2),
  spine_timestamp_col = "event_ts",
  spine_label_cols    = "label",
  auto_prefix         = TRUE,
  save_as             = "CHURN_TRAINING_SCALAR"
)
SELECT *
FROM MY_DB.ML.CHURN_TRAINING_SCALAR;
%%R
library(DBI)
library(RSnowflake)
library(dplyr)

options(RSnowflake.backend = "adbc")
con <- dbConnect(Snowflake())

training <- tbl(con, in_schema("ML", "CHURN_TRAINING_SCALAR")) |>
  collect()

Read transport: ADBC setup · Lineage Dataset (optional): Feature Store Path 2


21.13.3 Pipeline B — Scalar-wide ASOF → persist as three ARRAYs → VECTOR read → R

When: The point-in-time join is correct but wide (many scalar columns from one or more Feature Views). You want compile-friendly exports and efficient transport without pulling 500+ columns through the client.

Idea: Materialize the wide ASOF result once, then build an export view (or table) with exactly three ARRAY columns — float numerics, int ordinals/categoricals, null bitmask — and cast to VECTOR only in the SELECT you use for training reads.

flowchart LR
  B1[ASOF join save_as wide table] --> B2[Export VIEW: ARRAY_CONSTRUCT x3]
  B3[Read SQL: ::VECTOR casts] --> B4[ADBC collect]
  B4 --> B5[List columns to matrix in R]
  B2 --> B3

1 — Wide ASOF materialization (R, snowflakeR)

sfr_generate_training_data(
  fs,
  spine               = spine_sql,
  features            = list(fv1, fv2, fv3),   # many scalar columns per view
  spine_timestamp_col = "event_ts",
  spine_label_cols    = "label",
  auto_prefix         = TRUE,
  save_as             = "CHURN_TRAINING_WIDE"
)

2 — Repack to three ARRAY columns (Snowflake)

Maintain a feature manifest (which columns are float vs categorical, imputation rules). Below is a minimal pattern; production pipelines generate this SQL from metadata (see Implementation Guide §12.10).

CREATE OR REPLACE VIEW MY_DB.ML.CHURN_TRAINING_EXPORT AS
WITH base AS (
  SELECT * FROM MY_DB.ML.CHURN_TRAINING_WIDE
),
typed AS (
  SELECT
    customer_id,
    event_ts,
    label,
    -- example: three scalar features from the wide table
    CUSTOMER_FEATURES_V1_avg_order_total::FLOAT AS f_avg_spend,
    CUSTOMER_FEATURES_V1_order_count::INT       AS f_order_count,
    CUSTOMER_ACTIVITY_V1_page_views::FLOAT      AS f_page_views
  FROM base
),
packed AS (
  SELECT
    customer_id,
    event_ts,
    label,
    ARRAY_CONSTRUCT(
      COALESCE(f_avg_spend, 0),
      COALESCE(f_page_views, 0)
    ) AS features_float_arr,
    ARRAY_CONSTRUCT(
      COALESCE(f_order_count, 0)
      -- add ordinal-encoded categoricals here as INT elements
    ) AS features_int_arr,
    ARRAY_CONSTRUCT(
      IFF(f_avg_spend IS NULL, 1, 0),
      IFF(f_order_count IS NULL, 1, 0),
      IFF(f_page_views IS NULL, 1, 0)
    ) AS null_mask_arr
  FROM typed
)
SELECT * FROM packed;
library(dplyr)

packed <- tbl(con, in_schema("ML", "CHURN_TRAINING_WIDE")) |>
  mutate(
    f_avg_spend   = sql("CUSTOMER_FEATURES_V1_avg_order_total::FLOAT"),
    f_order_count = sql("CUSTOMER_FEATURES_V1_order_count::INT"),
    f_page_views  = sql("CUSTOMER_ACTIVITY_V1_page_views::FLOAT"),
    features_float_arr = sql(
      "ARRAY_CONSTRUCT(
         COALESCE(CUSTOMER_FEATURES_V1_avg_order_total::FLOAT, 0),
         COALESCE(CUSTOMER_ACTIVITY_V1_page_views::FLOAT, 0)
       )"
    ),
    features_int_arr = sql(
      "ARRAY_CONSTRUCT(COALESCE(CUSTOMER_FEATURES_V1_order_count::INT, 0))"
    ),
    null_mask_arr = sql(
      "ARRAY_CONSTRUCT(
         IFF(CUSTOMER_FEATURES_V1_avg_order_total::FLOAT IS NULL, 1, 0),
         IFF(CUSTOMER_FEATURES_V1_order_count::INT IS NULL, 1, 0),
         IFF(CUSTOMER_ACTIVITY_V1_page_views::FLOAT IS NULL, 1, 0)
       )"
    )
  ) |>
  select(
    customer_id, event_ts, label,
    features_float_arr, features_int_arr, null_mask_arr
  )

# Materialize (permanent table) — or render SQL and CREATE VIEW
packed |>
  compute(name = in_schema("ML", "CHURN_TRAINING_EXPORT"), temporary = FALSE)

ARRAY_CONSTRUCT and IFF are expressed with sql() inside dplyr — RSnowflake does not wrap every semi-structured builder yet (RSnowflake: Query §dbplyr).

3 — Read with VECTOR cast + ADBC (R, RSnowflake)

Cast at read time so the stored table stays ARRAY (Parquet-friendly); the planner sees three expressions regardless of feature count.

SELECT
  customer_id,
  event_ts,
  label,
  features_float_arr::VECTOR(FLOAT, 2) AS features_float,
  features_int_arr::VECTOR(INT, 1)     AS features_int,
  null_mask_arr::VECTOR(INT, 3)        AS null_mask
FROM MY_DB.ML.CHURN_TRAINING_EXPORT;
options(RSnowflake.backend = "adbc")

export <- tbl(con, in_schema("ML", "CHURN_TRAINING_EXPORT")) |>
  mutate(
    features_float = sql("features_float_arr::VECTOR(FLOAT, 2)"),
    features_int   = sql("features_int_arr::VECTOR(INT, 1)"),
    features_mask  = sql("null_mask_arr::VECTOR(INT, 3)")
  ) |>
  select(customer_id, event_ts, label, features_float, features_int, features_mask) |>
  collect()

4 — Decode VECTOR list columns in R

This step applies ADBC decode to the packed export from steps 2–3. Below is the minimal three-feature walkthrough; production pipelines load the manifest from the same metadata that generated the repack SQL.

collect() returns each VECTOR as a list column of numeric vectors (one vector per row). Turn those into matrices, then join labels and feature names for modeling.

A feature manifest is metadata you maintain alongside the export SQL — typically a small table or CSV with one row per original wide-column feature: name, whether it lands in the float or int pack, its index inside that pack, and how to treat NULLs. It must stay in sync with the order of IFF(... IS NULL, 1, 0) expressions in null_mask_arr (see step 2 above: f_avg_spend, f_order_count, f_page_views).

The null mask does not store feature values — it stores 1 where the wide table had NULL before COALESCE(..., 0), 0 where the value was present. Use it to put NA back on imputed slots, drop masked columns, or pass the mask to a model that supports missingness.

library(dplyr)
library(tibble)

X_float <- do.call(rbind, export$FEATURES_FLOAT)  # nrow x 2
X_int   <- do.call(rbind, export$FEATURES_INT)    # nrow x 1
mask    <- do.call(rbind, export$FEATURES_MASK)   # nrow x 3 (same order as step 2 null_mask_arr)

# Manifest for the minimal three-feature export (production: hundreds of rows)
manifest <- tibble(
  feature   = c("f_avg_spend", "f_order_count", "f_page_views"),
  mask_ix   = 1:3,
  float_ix  = c(1L, NA_integer_, 2L),
  int_ix    = c(NA_integer_, 1L, NA_integer_)
)

# Optional: undo COALESCE(0) on float slots that were NULL in Snowflake
for (i in seq_len(nrow(manifest))) {
  ix <- manifest$float_ix[i]
  if (!is.na(ix)) {
    was_null <- mask[, manifest$mask_ix[i]] == 1L
    X_float[was_null, ix] <- NA_real_
  }
}

training <- tibble(
  customer_id = export$CUSTOMER_ID,
  event_ts    = export$EVENT_TS,
  label       = export$LABEL,
  f_avg_spend   = X_float[, 1],
  f_page_views  = X_float[, 2],
  f_order_count = X_int[, 1]
)
# training is ready for glm(), tidymodels, etc.

Decode details: ADBC client decode · Python variant: Pipeline E


21.13.4 Pipeline C — Features already stored as ARRAY columns

When: Feature Views (or ETL) already expose typed ARRAY(...) packs — no wide scalar table to repack.

flowchart LR
  C1[Feature View with ARRAY cols] --> C2[ASOF join save_as]
  C2 --> C3[SELECT with optional ::VECTOR]
  C3 --> C4[ADBC collect + list to matrix]

1 — Feature View SQL (ARRAY in the definition)

-- Backing query for Feature View CUSTOMER_FEATURES_ARRAY v1
SELECT
  customer_id,
  ARRAY_CONSTRUCT(avg_order_total, page_views) AS features_float_arr,
  ARRAY_CONSTRUCT(order_count)                 AS features_int_arr
FROM analytics.customer_rollup;
fv_arrays <- sfr_create_feature_view(
  fs,
  name       = "CUSTOMER_FEATURES_ARRAY",
  version    = "v1",
  entities   = customer,
  feature_df = "
    SELECT
      customer_id,
      ARRAY_CONSTRUCT(avg_order_total, page_views) AS features_float_arr,
      ARRAY_CONSTRUCT(order_count)                 AS features_int_arr
    FROM analytics.customer_rollup
  "
)

2 — Point-in-time training materialization

sfr_generate_training_data(
  fs,
  spine               = spine_sql,
  features            = list(fv_arrays),
  spine_timestamp_col = "event_ts",
  spine_label_cols    = "label",
  save_as             = "CHURN_TRAINING_ARRAY"
)

3 — Read (R + ADBC)

With account flag ENABLE_STRUCTURED_TYPES_IN_CLIENT_RESPONSE = TRUE, typed ARRAY may arrive as list columns without JSON parsing (ADBC setup). Otherwise treat as character JSON (JSON parse).

SELECT
  customer_id,
  label,
  features_float_arr::VECTOR(FLOAT, 2) AS features_float,
  features_int_arr::VECTOR(INT, 1)     AS features_int
FROM MY_DB.ML.CHURN_TRAINING_ARRAY
LIMIT 100000;
options(RSnowflake.backend = "adbc")

rows <- tbl(con, in_schema("ML", "CHURN_TRAINING_ARRAY")) |>
  mutate(
    features_float = sql("features_float_arr::VECTOR(FLOAT, 2)"),
    features_int   = sql("features_int_arr::VECTOR(INT, 1)")
  ) |>
  select(customer_id, label, features_float, features_int) |>
  head(100000) |>
  collect()
X <- do.call(rbind, rows$FEATURES_FLOAT)

Parquet / Dataset note: Prefer typed ARRAY in storage for file export; untyped semi-structured ARRAY may serialize as JSON in Parquet (VECTOR — Parquet).


21.13.5 Pipeline D — Features stored as OBJECT / VARIANT

When: One OBJECT (or VARIANT) column per entity row — flexible schema, sparse keys, Feature Store “wide & sparse” pattern.

flowchart LR
  D1[OBJECT in Feature View or object_agg] --> D2[ASOF join save_as]
  D2 --> D3{Export strategy}
  D3 -->|R-friendly scalars| D4[SQL path extract in VIEW]
  D3 -->|Scale| D5[OBJECT to VECTOR export SQL]
  D3 -->|Sample only| D6[jsonlite in R]
  D4 --> D7[ADBC collect]
  D5 --> D7

1 — Feature View with OBJECT bundle

-- Backing query for Feature View CUSTOMER_FEATURES_OBJECT v1
SELECT
  customer_id,
  OBJECT_CONSTRUCT(
    'avg_spend', AVG(order_total),
    'n_orders',  COUNT(*)
  ) AS feature_object
FROM orders
GROUP BY customer_id;
fv_obj <- sfr_create_feature_view(
  fs,
  name       = "CUSTOMER_FEATURES_OBJECT",
  version    = "v1",
  entities   = customer,
  feature_df = "
    SELECT customer_id,
           OBJECT_CONSTRUCT(
             'avg_spend', AVG(order_total),
             'n_orders',  COUNT(*)
           ) AS feature_object
    FROM orders
    GROUP BY customer_id
  "
)
library(dplyr)

bundled <- tbl(con, in_schema("ANALYTICS", "ORDERS")) |>
  group_by(CUSTOMER_ID) |>
  summarise(
    feature_object = sql("OBJECT_CONSTRUCT(
      'avg_spend', AVG(order_total),
      'n_orders',  COUNT(*)
    )"),
    .groups = "drop"
  )
# Register the rendered query (show_query()) as Feature View backing SQL

Or build with dplyr + object_agg() (RSnowflake Query §dbplyr) before registration.

2 — ASOF join

sfr_generate_training_data(
  fs,
  spine               = spine_sql,
  features            = list(fv_obj),
  spine_timestamp_col = "event_ts",
  spine_label_cols    = "label",
  save_as             = "CHURN_TRAINING_OBJECT"
)

3a — Export for R: flatten in SQL (recommended)

ADBC does not turn VARIANT into native structs. Project keys in an export view or dplyr pipeline:

CREATE OR REPLACE VIEW MY_DB.ML.CHURN_TRAINING_OBJECT_FLAT AS
SELECT
  customer_id,
  event_ts,
  label,
  feature_object:avg_spend::FLOAT AS avg_spend,
  feature_object:n_orders::INT  AS n_orders
FROM MY_DB.ML.CHURN_TRAINING_OBJECT;
library(dplyr)

training <- tbl(con, in_schema("ML", "CHURN_TRAINING_OBJECT")) |>
  mutate(
    avg_spend = sql("feature_object:avg_spend::FLOAT"),
    n_orders  = sql("feature_object:n_orders::INT")
  ) |>
  select(customer_id, event_ts, label, avg_spend, n_orders) |>
  collect()
options(RSnowflake.backend = "adbc")
training <- tbl(con, in_schema("ML", "CHURN_TRAINING_OBJECT_FLAT")) |> collect()
# or: training <- <dbplyr pipeline above> |> collect()

More extract patterns: Extract in Snowflake.

3b — Export at scale: OBJECT → three VECTOR packs

For hundreds of keys, generate export SQL from a feature manifest (path extract → encode → ARRAY_CONSTRUCT::VECTOR) — same pattern as Pipeline B step 2, starting from feature_object:field::TYPE instead of wide scalars. Then read with Pipeline B step 3 or Pipeline E.

3c — Validation sample: JSON in R

SELECT customer_id, label, feature_object
FROM MY_DB.ML.CHURN_TRAINING_OBJECT
LIMIT 5000;
sample <- dbGetQuery(con, "
  SELECT customer_id, label, feature_object
  FROM MY_DB.ML.CHURN_TRAINING_OBJECT
  LIMIT 5000
")
feats <- vapply(sample$FEATURE_OBJECT, jsonlite::fromJSON, FUN.VALUE = numeric(0), simplify = FALSE)

Do not use 3c for full training volume.


21.13.6 Pipeline E — Python Arrow decode, R modeling (multilang)

When: You use Pipeline B or D→B export views with three VECTOR columns and want maximum client throughput (NumPy zero-copy on fixed_size_list), then model in R.

Step Language Action
1 R Feature Store registration + save_as / export VIEW (Pipelines B or D)
2 Python session.sql(...) on export VIEW → Arrow/pandas → decode mask + packs → X, y
3 R reticulate::py_to_r() or pass matrices only

2 — Python: read VECTOR export, decode with manifest

Snowpark returns VECTOR columns as NumPy arrays per row (Arrow fixed_size_list). Stack them into matrices, then use the same feature manifest as Pipeline B step 4 — metadata that maps each wide-column feature to its index in the float/int packs and its bit in null_mask_arr. “Wide-table metadata” means that manifest, not anything stored inside the export table.

Dimensions in the ::VECTOR(type, n) casts must match your export VIEW (production might be 1000 + 400 + 32 mask bits; the minimal Pipeline B demo uses 2 + 1 + 3).

import numpy as np

# Python cell — same Snowpark session as snowflakeR
df = session.sql("""
  SELECT customer_id, label,
         features_float_arr::VECTOR(FLOAT, 2) AS features_float,
         features_int_arr::VECTOR(INT, 1)     AS features_int,
         null_mask_arr::VECTOR(INT, 3)        AS null_mask
  FROM MY_DB.ML.CHURN_TRAINING_EXPORT
""").to_pandas()

X_float = np.vstack(df["FEATURES_FLOAT"].to_numpy())  # (n_rows, n_float)
X_int   = np.vstack(df["FEATURES_INT"].to_numpy())    # (n_rows, n_int)
mask    = np.vstack(df["NULL_MASK"].to_numpy())       # (n_rows, n_wide_features)

# Same manifest as Pipeline B step 4 (production: load from CSV / registry table)
manifest = [
    {"name": "f_avg_spend",   "mask_ix": 0, "float_ix": 0},
    {"name": "f_order_count", "mask_ix": 1, "int_ix": 0},
    {"name": "f_page_views",  "mask_ix": 2, "float_ix": 1},
]

for spec in manifest:
    ix = spec.get("float_ix")
    if ix is not None:
        was_null = mask[:, spec["mask_ix"]] == 1
        X_float[was_null, ix] = np.nan  # undo COALESCE(..., 0) where wide value was NULL

X = np.hstack([X_float, X_int.astype(np.float64)])  # design matrix for R
y = df["LABEL"].to_numpy(dtype=np.float64)
# reticulate R cell below reads py$X and py$y

3 — R: model on decoded matrices

library(reticulate)
y <- py$y
X <- py_to_r(py$X)

Contrast: Default sfr_generate_training_data() without save_as does not run this decode path (Feature Store reads).


21.13.7 Shared prerequisites

Requirement Pipelines
snowflakeR + sfr_connect() A–E
sfr_generate_training_data(..., save_as = ...) A–E (materialize)
adbcsnowflake in R (CRE/bootstrap) A–D (ADBC reads)
ENABLE_STRUCTURED_TYPES_IN_CLIENT_RESPONSE (account) C (typed ARRAY as native Arrow)
Python + reticulate E
Feature manifest for repack / decode B, D→B, E

Related chapters