21 Training Data in R

Reading Feature Store exports — semi-structured types and ADBC paths

Keywords

snowflake, R, RStudio, Posit, VS Code, workspace notebooks, snowflakeR, RSnowflake, mlops

21.1 Overview

This chapter covers getting Feature Store training data into R for modeling — not authoring Feature Views (see Authoring Feature Views). It is organized in three layers:

Materialize — Feature Store joins (save_as, Datasets) and RSnowflake read transport (ADBC vs SQL API).
Read techniques — flatten in Snowflake, unpack Arrow list columns in R, or parse JSON as a fallback; plus VECTOR layout reference.
End-to-end pipelines (A–E) — recipes that combine materialization + read technique for each feature shape (scalar, wide scalar, ARRAY, OBJECT, Python decode).

Start with Feature Store training data in R if you are new to exports. Author features first in Authoring Feature Views; jump to Choose a pipeline if views are already registered.

It complements Feature Store (catalog and training APIs), Authoring Feature Views (warehouse SQL/dbplyr for feature_df), and RSnowflake: Query (dbplyr push-down, compute(), bulk I/O).

21.1.1 Scalars first, semi-structured when you need them

Use scalar columns: Most Feature Views and training data sets use ordinary scalar columns, one Snowflake column per feature, point-in-time join via sfr_generate_training_data(), read with RSnowflake + ADBC as native R types (Pipeline A). That is the default and works well for up to a few hundred columns after multi–Feature View joins.

Use OBJECT, ARRAY, or VECTOR when:

Situation	Why semi-structured helps
Wide tables (hundreds–thousands of features)	Few physical columns; lighter SQL compile; efficient transport via packed ARRAY/VECTOR (Pipeline B)
Feature / schema evolution	Add keys to an OBJECT or slots in an ARRAY without `ALTER TABLE … ADD COLUMN` for every new feature
Ultra-wide or sparse result sets	Store present values only; carry null masks; decode on the client
Embeddings	ARRAY or VECTOR instead of one column per dimension (Pipeline C)

For authoring features (SQL/dbplyr feature_df, simple through wide/temporal), see Authoring Feature Views. For platform background (hybrid OBJECT payloads, compile cost, sparse arrays, DT cascades), read the Feature Store Implementation Guide — Advanced patterns §12.10 — especially §12.10.3 Semi-structured types in Feature Views and §12.10.4 Hybrid storage. This chapter focuses on reading materialized training data into R once Feature Views exist.

21.1.2 SQL and dbplyr tabs

Warehouse logic appears as SQL or dbplyr (RSnowflake lazy tables) wherever both are practical — the same idea as the RSnowflake: Query §dbplyr (group="query-sql-dplyr"). Examples here use tabs where your tab choice persists while you read this chapter. Use dbplyr when you already work in R with tbl(); use SQL when you share DDL with other teams or generate views from a feature manifest.

21.2 Learning Objectives

Navigate the chapter layers: materialize → read technique → pipeline workflow
Decide when scalar columns suffice vs OBJECT/ARRAY/VECTOR exports
Materialize Feature Store training data (save_as, Datasets) before large RSnowflake reads
Flatten OBJECT/ARRAY/VECTOR in SQL when possible; use ADBC decode or JSON parse when not
Choose pipeline A–E by how features are stored in Snowflake

21.3 How this chapter is organized

Layer	Sections	Question it answers
Transport	Semi-structured in R, ADBC setup	What column types do I get over SQL API vs ADBC?
Materialize	Feature Store training data	How does data leave Feature Store (`save_as`, Dataset, direct SQL)?
Author	Authoring Feature Views	How do I define `feature_df` SQL/dbplyr before export?
Read techniques	Extract in Snowflake, ADBC decode, JSON parse, VECTOR layout	After materialization, how do I turn OBJECT/ARRAY/VECTOR into a modeling frame?
Choose workflow	Wide training sets, Pipelines A–E	Which full story matches how my features are stored?

The pipelines are not a separate topic from extract/export — they are composed workflows. Pipeline steps link back to the read techniques above instead of redefining them.

21.4 Semi-structured columns in R

How semi-structured columns appear in R depends on which RSnowflake transport you use:

Transport	When	Semi-structured in R (typical)
SQL API (default)	`RSnowflake.backend = "auto"` and ADBC not installed, or ADBC failed	VARIANT, OBJECT, ARRAY, VECTOR → character (JSON text per cell)
ADBC	`adbcsnowflake` installed + `options(RSnowflake.backend = "adbc")`	Scalars native; VECTOR often list columns of numbers; typed ARRAY / structured OBJECT may be native if account flag is on (below)

Scalar columns (DOUBLE, NUMBER, timestamps, …) are native in both paths. Feature Store training tables that are mostly scalars still benefit from ADBC for volume (columnar transfer, less parsing overhead on numerics).

21.4.1 Recommendation for Feature Store workloads

For retrieving training data into R when views use OBJECT, ARRAY, or VECTOR:

Materialize in Snowflake — sfr_generate_training_data(..., save_as = "MY_TRAINING") or a managed table / export view (Feature Store training data).
Read with RSnowflake + ADBC — install adbcsnowflake (CRE or bootstrap), then:

SQL
dbplyr

SELECT *
FROM MY_DB.ML.MY_TRAINING
LIMIT 100000;

options(RSnowflake.backend = "adbc")   # force Arrow path for reads

library(dplyr)
training <- tbl(con, in_schema("ML", "MY_TRAINING")) |>
  head(100000) |>
  collect()

Prefer unpacking in SQL or dbplyr when you can — see Extract in Snowflake — so collect() returns scalars even on the SQL API path.
Reserve client-side JSON parsing for VARIANT/OBJECT bundles you cannot flatten in SQL, or for validation samples (LIMIT).

sfr_generate_training_data() without save_as still goes through snowflakeR → pandas → R; it does not use RSnowflake ADBC. Use save_as + RSnowflake when semi-structured or wide training pulls are the bottleneck.

21.4.2 Why Feature Store users care

Snowflake recommends OBJECT or ARRAY columns for very wide feature sets — see Scalars first and the Feature Store Implementation Guide — Wide & sparse. Joining multiple Feature Views can still produce hundreds of scalar columns even when each view is modest (Wide training sets).

Pattern	In Snowflake	SQL API → R	ADBC → R (with account settings)
Many scalar columns	One column per feature	Native types	Native types (faster at scale)
VARIANT / untyped OBJECT	Flexible JSON blob	JSON character	JSON character (no native Arrow)
Typed ARRAY / structured OBJECT	Declared element types	JSON character	Native list/struct if account flag enabled
VECTOR packs	Fixed-length numerics	JSON character	Often list column → `matrix` (no account flag required)

RSnowflake helps you build bundles on the warehouse (object_agg, array_agg, ::VECTOR); how you read them is a separate choice between SQL flattening, ADBC, and client-side decode.

Use How this chapter is organized to navigate; Choose a pipeline comes after the read techniques.

21.5 Feature Store training data in R

Feature Store training flows through snowflakeR, not RSnowflake alone — but the data types you get in R depend on which read path you choose.

21.5.1 Path 1 — `sfr_generate_training_data()` (transient join)

Joins your spine to one or more Feature Views and returns an R data.frame:

training <- sfr_generate_training_data(
  fs,
  spine    = spine_sql,
  features = list(fv1, fv2),
  spine_timestamp_col = "event_ts",
  spine_label_cols    = "label"
)

What happens under the hood: the Python Feature Store SDK runs the join in Snowflake, materializes a Snowpark result, converts it to pandas, then snowflakeR transfers columns into R (today via a JSON round-trip for compatibility with Workspace). Scalar feature columns arrive as normal R types. Any OBJECT/ARRAY/VECTOR columns in the join result follow the pandas/JSON path — treat them like SQL API character columns if they appear as strings.

Materialize without pulling everything to R:

sfr_generate_training_data(
  fs,
  spine    = spine_sql,
  features = list(fv1, fv2),
  save_as  = "CHURN_TRAINING_V1",   # table in Feature Store schema
  spine_timestamp_col = "event_ts",
  spine_label_cols    = "label"
)

# Inspect or model from Snowflake with RSnowflake
library(dplyr)
training_tbl <- tbl(con, in_schema("FEATURE_STORE", "CHURN_TRAINING_V1"))
sample       <- training_tbl |> head(10000) |> collect()

Use save_as for wide or large training sets: validate joins in Snowflake, then read with RSnowflake + ADBC (ADBC setup):

SQL
dbplyr

SELECT *
FROM MY_DB.FEATURE_STORE.CHURN_TRAINING_V1
LIMIT 100000;

options(RSnowflake.backend = "adbc")
training_tbl <- tbl(con, in_schema("FEATURE_STORE", "CHURN_TRAINING_V1"))
sample       <- training_tbl |> head(100000) |> collect()

21.5.2 Path 2 — `sfr_generate_dataset()` (versioned Dataset + lineage)

Creates an immutable Snowflake ML Dataset (Parquet files on a stage — the same object Python reads with fsspec):

training <- sfr_generate_dataset(
  fs,
  name     = "CHURN_TRAINING",
  version  = "v1",
  spine    = spine_sql,
  features = list(fv1, fv2),
  spine_label_cols = "label"
)

attr(training, "dataset_name")     # for sfr_log_model() lineage
attr(training, "dataset_version")

What R receives today: the bridge reads the Dataset through Dataset.read.to_snowpark_dataframe().to_pandas() — not direct Parquet file access. That matches demo and moderate row counts; it is the same semantic data Python would get from an in-session read, not the maximum-throughput fsspec path.

What Python does at scale: after generate_dataset(), the SDK exposes Parquet files on internal storage:

# Python (snowflake-ml) — not wrapped in snowflakeR yet
ds = session.dataset.load("MY_DB.ML.CHURN_TRAINING").select_version("v1")
files = ds.read.files()          # fsspec-compatible paths
# pd.read_parquet(files) or pyarrow.dataset

See Snowflake ML Dataset — direct file access.

R options for large Datasets today:

sfr_read_dataset(conn, name, version) — same pandas → R path as sfr_generate_dataset(); suitable when the Dataset already exists.
save_as table + RSnowflake + ADBC — recommended for wide/scalar-heavy training frames at scale (ADBC setup). Dataset Parquet still stores VARIANT as JSON strings on disk — same limitation as Python unless columns are typed ARRAY / scalars.
Python decode + reticulate — Pipeline E for three-VECTOR export at maximum throughput.

A future snowflakeR release may add format = "parquet" on dataset reads (fsspec → arrow::read_parquet()). Until then, assume Dataset APIs in R are lineage- and moderate-scale-friendly, not a substitute for a full Parquet export pipeline.

Full export workflows: Reading materialized training data, then Choose a pipeline (A–E). Define features first? Authoring Feature Views.

21.5.3 Path 3 — RSnowflake only (custom SQL / dbplyr)

When you own the SQL (external Feature View table, export view, or save_as output), use RSnowflake directly:

SQL
dbplyr

SELECT *
FROM MY_DB.ML.CHURN_TRAINING_V1;

options(RSnowflake.backend = "adbc")   # recommended for materialized training reads

training_tbl <- tbl(con, in_schema("ML", "CHURN_TRAINING_V1"))
training     <- training_tbl |> collect()
# Scalars: native types. VARIANT/OBJECT: chr → [JSON parse](#sec-parse-json).
# VECTOR / typed ARRAY: list → [ADBC decode](#sec-parse-adbc).
# Or flatten first: [Extract in Snowflake](#sec-sql-extract).

This path gives you the most control over which columns cross the wire — important when joins are wide.

21.6 ADBC, Arrow, and account settings

ADBC (adbcsnowflake) uses Snowflake’s Arrow result path. That is the recommended read transport for large materialized training tables and for VECTOR / typed ARRAY columns — not a magic bypass for every semi-structured type.

21.6.1 Enable ADBC in Workspace

Step	Action
Packages	`adbcsnowflake`, `adbcdrivermanager` (CRE image or bootstrap `languages.r.addons.adbc: true`)
Connection	Same `dbConnect(Snowflake())` — backend is lazy-initialised
Reads	`options(RSnowflake.backend = "adbc")` or leave `"auto"` when ADBC is installed

See RSnowflake in Workspace and WORKSPACE_ADBC.md.

21.6.2 Account parameter for native structured types

Snowflake can return typed ARRAY, structured OBJECT, and MAP as native Arrow types instead of JSON strings. This is controlled at account scope (not session):

-- Accountadmin — one-time per account; confirm with your admin
ALTER ACCOUNT SET ENABLE_STRUCTURED_TYPES_IN_CLIENT_RESPONSE = TRUE;

Column type	Native Arrow over ADBC?
Scalars (DOUBLE, INT, …)	Yes
VECTOR(FLOAT/INT, n)	Yes — `fixed_size_list` (does not require the account flag)
Typed ARRAY(…)	Yes with account flag; otherwise JSON strings
Structured OBJECT (`OBJECT(f1 DOUBLE, …)`)	Yes with account flag
VARIANT / untyped OBJECT	No — always JSON strings in Arrow

Ask your account team whether the flag is already enabled before relying on native ARRAY/struct columns in R.

Warning

SQL API has no server-side Arrow. dbGetQuery() on the REST path always uses JSON. dbGetQueryArrow() on the SQL API path builds Arrow client-side from JSON — interface compatibility only, not the same as ADBC transport.

For Feature Store exports you install adbcsnowflake in the Workspace R environment (CRE or bootstrap), materialize training with save_as, then read that table through RSnowflake with options(RSnowflake.backend = "adbc"). That is a client-side ADBC driver talking Arrow to Snowflake — not dbGetQueryArrow() on the REST API.

21.6.3 Inspect what you received

After collect(), inspect column classes before choosing a decoder:

SQL
R (inspect)

SELECT * FROM MY_DB.ML.MY_TRAINING LIMIT 100;

options(RSnowflake.backend = "adbc")
sample <- dbGetQuery(con, "SELECT * FROM MY_DB.ML.MY_TRAINING LIMIT 100")
str(sample, max.level = 1)
# PACKED_FLOATS   : List of 100  | num [1:128]  ...  → VECTOR over ADBC
# FEATURE_OBJECT  : chr [1:100]  ...                 → VARIANT/OBJECT JSON
# AVG_SPEND       : num ...                          → scalar

21.7 Reading materialized training data

Once training data is a table or view in Snowflake (save_as, export VIEW, or custom SQL), you choose how semi-structured columns cross the wire. These three strategies apply across pipelines — Pipeline A uses scalars only; B–E combine repack in SQL with decode on the client.

Priority	Strategy	When	Section
1 (preferred)	Extract in Snowflake	Known OBJECT keys or ARRAY slots; moderate width	Extract in Snowflake
2	ADBC unpack in R	VECTOR packs or typed ARRAY with account flag; large row counts	ADBC decode
3 (fallback)	JSON parse in R	SQL API path, VARIANT blobs, validation samples (`LIMIT`)	JSON parse

For wide numeric exports, see VECTOR layout (three-column pack) — used by Pipeline B and Pipeline E.

21.8 Extract in Snowflake

The most portable and often fastest approach: project features in Snowflake (SQL or dbplyr), then collect() scalars only. Works with SQL API or ADBC and avoids JSON parsing in R entirely.

SELECT
  customer_id,
  event_ts,
  label,
  feature_object:avg_spend::FLOAT  AS avg_spend,
  feature_object:n_orders::INTEGER AS n_orders
FROM MY_DB.ML.CUSTOMER_FEATURES_BUNDLED;

library(dplyr)

training <- tbl(con, in_schema("ML", "CUSTOMER_FEATURES_BUNDLED")) |>
  mutate(
    avg_spend = sql("FEATURE_OBJECT:avg_spend::FLOAT"),
    n_orders  = sql("FEATURE_OBJECT:n_orders::INTEGER")
  ) |>
  select(-FEATURE_OBJECT) |>
  collect()

21.8.2 ARRAY → scalars (known positions)

SQL
dbplyr

SELECT
  entity_id,
  feature_array[0]::FLOAT AS f1,
  feature_array[1]::FLOAT AS f2
FROM MY_DB.ML.TRAINING_ARRAY_PACK;

training <- tbl(con, "TRAINING_ARRAY_PACK") |>
  mutate(
    f1 = sql("feature_array[0]::FLOAT"),
    f2 = sql("feature_array[1]::FLOAT")
  ) |>
  select(-FEATURE_ARRAY) |>
  collect()

For hundreds of features, generate the mutate() expressions from a feature manifest in R, or build a VIEW in Snowflake that flattens once and register that as the Feature View backing table.

21.8.3 VECTOR packs — keep packed for modeling

If the model expects a matrix, you may keep VECTOR/ARRAY columns through collect() and decode in R (ADBC decode) rather than exploding to hundreds of scalar columns in SQL.

21.9 ADBC: unpack native Arrow columns in R

Use this path when collect() returns list columns (VECTOR or typed ARRAY with account flag), not JSON character.

21.9.1 VECTOR or ARRAY as list column → matrix

SQL
R (ADBC decode)

SELECT
  entity_id,
  features_float_arr::VECTOR(FLOAT, 128) AS features_float
FROM MY_DB.ML.TRAINING_VECTOR_PACK
LIMIT 50000;

options(RSnowflake.backend = "adbc")

pack <- dbGetQuery(con, "
  SELECT entity_id,
         features_float_arr::VECTOR(FLOAT, 128) AS features_float
  FROM MY_DB.ML.TRAINING_VECTOR_PACK
  LIMIT 50000
")

stopifnot(is.list(pack$FEATURES_FLOAT))
X <- do.call(rbind, pack$FEATURES_FLOAT)
training <- data.frame(entity_id = pack$ENTITY_ID, X)

If a column is still character, fall back to JSON parse — the warehouse or connection may not be using the Arrow path for that query.

21.9.2 Typed ARRAY with account flag

SQL
R (ADBC decode)

SELECT entity_id, feature_array
FROM MY_DB.ML.TRAINING_TYPED_ARRAY   -- ARRAY(DOUBLE) column
LIMIT 10000;

arr <- dbGetQuery(con, "
  SELECT entity_id, feature_array
  FROM MY_DB.ML.TRAINING_TYPED_ARRAY
  LIMIT 10000
")

if (is.list(arr$FEATURE_ARRAY)) {
  X <- do.call(rbind, arr$FEATURE_ARRAY)
} else {
  X <- matrix(
    vapply(arr$FEATURE_ARRAY, jsonlite::fromJSON, numeric(0), simplify = TRUE),
    nrow = nrow(arr), byrow = TRUE
  )
}

21.9.3 Mix scalar + packed columns

SQL
dbplyr

SELECT
  entity_id,
  LABEL::FLOAT AS label_num,
  features_float_arr::VECTOR(FLOAT, 128) AS features_float,
  feature_object
FROM MY_DB.ML.MY_TRAINING;

library(dplyr)

training <- tbl(con, in_schema("ML", "MY_TRAINING")) |>
  mutate(label_num = sql("LABEL::FLOAT")) |>
  collect()   # ADBC: scalars native; packs may be list or character

if ("FEATURES_FLOAT" %in% names(training) && is.list(training$FEATURES_FLOAT)) {
  X_pack <- do.call(rbind, training$FEATURES_FLOAT)
}
if ("FEATURE_OBJECT" %in% names(training) && is.character(training$FEATURE_OBJECT)) {
  feats <- vapply(training$FEATURE_OBJECT, jsonlite::fromJSON, numeric(0), simplify = FALSE)
}

For very large n, consider the arrow package on exported Parquet (Dataset files or COPY INTO) instead of expanding wide matrices in memory — see Feature Store training data.

21.10 SQL API: parse JSON in R

When results arrive as character JSON (SQL API path, or VARIANT/OBJECT over ADBC), use jsonlite. Prefer vapply() over lapply() + rbind() for thousands of rows.

21.10.1 OBJECT column → named features per row

SQL
R (parse JSON)

SELECT entity_id, feature_object
FROM MY_DB.ML.CUSTOMER_FEATURES_BUNDLED
LIMIT 5000;

library(jsonlite)

sample <- dbGetQuery(con, "
  SELECT entity_id, feature_object
  FROM MY_DB.ML.CUSTOMER_FEATURES_BUNDLED
  LIMIT 5000
")

fromJSON(sample$FEATURE_OBJECT[1])

feat_list <- vapply(
  sample$FEATURE_OBJECT,
  fromJSON,
  FUN.VALUE = numeric(0),
  simplify = FALSE
)
feat_mat <- do.call(rbind, lapply(feat_list, `[`, unique(names(feat_list[[1]]))))
training <- cbind(entity_id = sample$ENTITY_ID, as.data.frame(feat_mat))

If keys differ by row, normalize with tidyr::unnest_wider() after tibble::enframe(feat_list) or keep features in SQL with feature_object:avg_spend::FLOAT style extracts (Extract in Snowflake).

21.10.2 ARRAY column → numeric matrix

SQL
R (parse JSON)

SELECT entity_id, feature_array
FROM MY_DB.ML.TRAINING_ARRAY_PACK
LIMIT 5000;

library(jsonlite)

arr <- dbGetQuery(con, "
  SELECT entity_id, feature_array
  FROM MY_DB.ML.TRAINING_ARRAY_PACK
  LIMIT 5000
")

rows <- vapply(
  arr$FEATURE_ARRAY,
  function(s) fromJSON(s, simplifyVector = TRUE),
  FUN.VALUE = numeric(0)
)
X <- matrix(unlist(rows), nrow = nrow(arr), byrow = TRUE)

Check ncol(X) once against your declared array length; mismatches usually mean NULL or ragged arrays in source data.

21.10.3 VECTOR column → numeric matrix

VECTOR values also arrive as JSON text on the SQL API path. The decode step is the same as ARRAY:

SQL
R (parse JSON)

SELECT
  entity_id,
  packed_floats::VECTOR(FLOAT, 128) AS packed_floats
FROM MY_DB.ML.TRAINING_VECTOR_PACK
LIMIT 5000;

library(jsonlite)

vec <- dbGetQuery(con, "
  SELECT entity_id,
         packed_floats::VECTOR(FLOAT, 128) AS packed_floats
  FROM MY_DB.ML.TRAINING_VECTOR_PACK
  LIMIT 5000
")

float_rows <- vapply(
  vec$PACKED_FLOATS,
  function(s) fromJSON(s, simplifyVector = TRUE),
  FUN.VALUE = numeric(0)
)
X_float <- matrix(unlist(float_rows), nrow = nrow(vec), byrow = TRUE)

For production-sized training sets, do not rely on row-wise fromJSON over millions of rows unless columns are VARIANT/OBJECT with no SQL flatten path. Prefer Extract in Snowflake, ADBC + list columns (ADBC decode), or Parquet — see Feature Store training data.

21.11 VECTOR columns for wide feature exports

Snowflake VECTOR types store a fixed-length sequence of numbers (up to 4,096 elements per column). They are useful when you export hundreds or thousands of features but want the query planner to see only a small number of columns — not one expression per feature.

This section documents the three-column VECTOR layout referenced by Pipeline B step 3 and Pipeline E. For the full repack story (wide ASOF → ARRAY → VECTOR → decode), follow Pipeline B; this section is the reference for cast dimensions and client decode snippets.

RSnowflake runs SQL that references VECTOR(FLOAT, n) and VECTOR(INT, n) without restriction. There are no dplyr helpers yet for VECTOR_AVG / VECTOR_SUM; write those in SQL with sql() or dbGetQuery().

After collect() on the SQL API path, VECTOR columns are character (JSON) — use JSON parse. With ADBC, VECTOR is often a list column — use ADBC decode.

21.11.1 Packed export layout (three columns)

A common layout for mixed numeric and categorical features stores data as ARRAY in tables, then casts to VECTOR at read time:

Column	Type	Role
`features_float`	`VECTOR(FLOAT, n)`	Imputed numerics
`features_int`	`VECTOR(INT, n)`	Ordinal-encoded categoricals
`null_mask`	`VECTOR(INT, ceil(n/32))`	Bitmask of which features were NULL

The export SQL stays short regardless of feature count — the warehouse compiles three expressions instead of hundreds:

SQL
dbplyr

SELECT
  entity_id,
  features_float_arr::VECTOR(FLOAT, 1000) AS features_float,
  features_int_arr::VECTOR(INT, 400)     AS features_int,
  null_mask_arr::VECTOR(INT, 32)         AS null_mask
FROM MY_DB.ML.TRAINING_EXPORT_VIEW
WHERE train_date = '2026-01-01'
LIMIT 10000;

library(dplyr)

options(RSnowflake.backend = "adbc")

export <- tbl(con, in_schema("ML", "TRAINING_EXPORT_VIEW")) |>
  filter(sql("train_date = '2026-01-01'")) |>
  mutate(
    features_float = sql("features_float_arr::VECTOR(FLOAT, 1000)"),
    features_int   = sql("features_int_arr::VECTOR(INT, 400)"),
    null_mask      = sql("null_mask_arr::VECTOR(INT, 32)")
  ) |>
  select(entity_id, features_float, features_int, null_mask) |>
  head(10000) |>
  collect()

if (is.list(export$FEATURES_FLOAT)) {
  X_num <- do.call(rbind, export$FEATURES_FLOAT)
} else {
  X_num <- matrix(
    vapply(export$FEATURES_FLOAT, jsonlite::fromJSON, numeric(0), simplify = TRUE),
    nrow = nrow(export), byrow = TRUE
  )
}

Define features_*_arr in the Feature View or a downstream Dynamic Table — not row-by-row in R. See the Implementation Guide §12.10 for platform patterns.

Note

Parquet and VECTOR: Snowflake cannot write VECTOR columns directly to Parquet. For file-based pipelines, export typed ARRAY columns (or scalars) instead; VECTOR is for in-warehouse query and SQL API/ADBC reads.

21.12 Wide training sets: views, joins, and column count

This section summarises why width appears and which pipeline to pick. Full step-by-step workflows are in End-to-end pipelines (next section).

Width is not only “one Feature View with 500 columns.” It also accumulates when you join several Feature Views into one training set:

training_data <- sfr_generate_training_data(
  fs,
  spine    = spine_sql,
  features = list(
    list(name = "CUSTOMER_DEMOGRAPHICS", version = "v1"),  # 40 cols
    list(name = "CUSTOMER_TRANSACTIONS", version = "v1"), # 120 cols
    list(name = "PRODUCT_EMBEDDINGS",    version = "v1")  # 256 cols
  ),
  spine_timestamp_col = "event_ts",
  spine_label_cols    = "label",
  auto_prefix         = TRUE   # CUSTOMER_TRANSACTIONS_avg_spend, ...
)

Each view adds its feature columns (plus keys and optional timestamp columns). auto_prefix = TRUE avoids name collisions but increases total width — plan for hundreds of columns even when each view is modest.

Situation	Pipeline
Few views, up to ~300 total scalar columns after join	A — scalar direct
Several views → ~300–500+ scalar columns	B — scalar-wide → three VECTOR packs
Features already ARRAY in views	C — ARRAY-native
Features as OBJECT / VARIANT	D — OBJECT-native (flatten SQL or B-style repack)
Maximum packed decode throughput	E — Python → R
Multi-view name collisions	`auto_prefix = TRUE`; consider fewer views or pre-joined SQL

21.13 End-to-end training export pipelines

Each pipeline is a composed workflow — not a new read technique. Before diving in, you should understand:

Feature Store materialization (save_as, Datasets)
Read strategies (Extract in Snowflake, ADBC decode, JSON parse)

Pipeline steps link to those sections for extract/decode details rather than repeating them. Steps assume a multilang Workspace notebook: snowflakeR for Feature Store APIs, RSnowflake for reads, optional ADBC in R (ADBC setup).

21.13.1 Choose a pipeline

How features live today	Combined ASOF / training width	Recommended pipeline
Scalar columns per feature	up to a few hundred (~300) columns total	Pipeline A — scalar, direct
Scalar columns per feature	~300+ columns after multi–Feature View join	Pipeline B — scalar-wide → three VECTOR packs
ARRAY columns (typed packs)	Any	Pipeline C — ARRAY-native
OBJECT / VARIANT JSON per row	Any	Pipeline D — OBJECT-native
Packed export; maximum decode speed in notebook	Large row × width	Pipeline E — Python decode → R

flowchart TB
  subgraph fs [Snowflake — Feature Store]
    SP[Spine + labels]
    FV[Feature Views]
    JOIN[generate_training_data ASOF join]
    MAT[(Materialized table or export VIEW)]
    SP --> JOIN
    FV --> JOIN
    JOIN --> MAT
  end

  subgraph pack [Optional repack in SQL]
    WIDE[Wide scalar table]
    ARR[Three ARRAY columns]
    VEC[VECTOR cast at read time]
    WIDE --> ARR --> VEC
  end

  subgraph client [Workspace client]
    ADBC[RSnowflake ADBC collect]
    R[R data.frame / matrix]
    PY[Python Arrow decode]
    RET[reticulate py_to_r]
    ADBC --> R
    PY --> RET --> R
  end

  MAT --> WIDE
  MAT --> ARR
  ARR --> VEC
  MAT --> ADBC
  VEC --> ADBC
  VEC --> PY

21.13.2 Pipeline A — Scalar features, moderate width

When: Each feature is its own column; total width after the ASOF join stays in a moderate range (rough guide: up to a few hundred — ~300 — scalar feature columns with RSnowflake + ADBC).

flowchart LR
  A1[Spine + Feature Views] --> A2[sfr_generate_training_data]
  A2 --> A3{Row count?}
  A3 -->|Demo / small| A4[Return R data.frame]
  A3 -->|Large or reuse| A5[save_as table]
  A5 --> A6[RSnowflake ADBC collect]
  A6 --> A7[Model in R]
  A4 --> A7

1 — Register and join (R, snowflakeR)

library(snowflakeR)
conn <- sfr_connect()
fs   <- sfr_feature_store(conn, database = "MY_DB", schema = "ML")

fv1 <- sfr_get_feature_view(fs, "CUSTOMER_FEATURES", "v1")
fv2 <- sfr_get_feature_view(fs, "CUSTOMER_ACTIVITY", "v1")

spine_sql <- "
  SELECT customer_id, event_ts, churned AS label
  FROM MY_DB.ML.CHURN_SPINE
  WHERE event_ts >= '2024-01-01'
"

2a — Small pull: straight to R

training <- sfr_generate_training_data(
  fs,
  spine               = spine_sql,
  features            = list(fv1, fv2),
  spine_timestamp_col = "event_ts",
  spine_label_cols    = "label",
  auto_prefix         = TRUE
)
# Scalar columns → ready for glm / tidymodels

2b — Large or repeated reads: materialize, then ADBC

sfr_generate_training_data(
  fs,
  spine               = spine_sql,
  features            = list(fv1, fv2),
  spine_timestamp_col = "event_ts",
  spine_label_cols    = "label",
  auto_prefix         = TRUE,
  save_as             = "CHURN_TRAINING_SCALAR"
)

SQL
dbplyr

SELECT *
FROM MY_DB.ML.CHURN_TRAINING_SCALAR;

%%R
library(DBI)
library(RSnowflake)
library(dplyr)

options(RSnowflake.backend = "adbc")
con <- dbConnect(Snowflake())

training <- tbl(con, in_schema("ML", "CHURN_TRAINING_SCALAR")) |>
  collect()

Read transport: ADBC setup · Lineage Dataset (optional): Feature Store Path 2

21.13.3 Pipeline B — Scalar-wide ASOF → persist as three ARRAYs → VECTOR read → R

When: The point-in-time join is correct but wide (many scalar columns from one or more Feature Views). You want compile-friendly exports and efficient transport without pulling 500+ columns through the client.

Idea: Materialize the wide ASOF result once, then build an export view (or table) with exactly three ARRAY columns — float numerics, int ordinals/categoricals, null bitmask — and cast to VECTOR only in the SELECT you use for training reads.

flowchart LR
  B1[ASOF join save_as wide table] --> B2[Export VIEW: ARRAY_CONSTRUCT x3]
  B3[Read SQL: ::VECTOR casts] --> B4[ADBC collect]
  B4 --> B5[List columns to matrix in R]
  B2 --> B3

1 — Wide ASOF materialization (R, snowflakeR)

sfr_generate_training_data(
  fs,
  spine               = spine_sql,
  features            = list(fv1, fv2, fv3),   # many scalar columns per view
  spine_timestamp_col = "event_ts",
  spine_label_cols    = "label",
  auto_prefix         = TRUE,
  save_as             = "CHURN_TRAINING_WIDE"
)

2 — Repack to three ARRAY columns (Snowflake)

Maintain a feature manifest (which columns are float vs categorical, imputation rules). Below is a minimal pattern; production pipelines generate this SQL from metadata (see Implementation Guide §12.10).

SQL
dbplyr

CREATE OR REPLACE VIEW MY_DB.ML.CHURN_TRAINING_EXPORT AS
WITH base AS (
  SELECT * FROM MY_DB.ML.CHURN_TRAINING_WIDE
),
typed AS (
  SELECT
    customer_id,
    event_ts,
    label,
    -- example: three scalar features from the wide table
    CUSTOMER_FEATURES_V1_avg_order_total::FLOAT AS f_avg_spend,
    CUSTOMER_FEATURES_V1_order_count::INT       AS f_order_count,
    CUSTOMER_ACTIVITY_V1_page_views::FLOAT      AS f_page_views
  FROM base
),
packed AS (
  SELECT
    customer_id,
    event_ts,
    label,
    ARRAY_CONSTRUCT(
      COALESCE(f_avg_spend, 0),
      COALESCE(f_page_views, 0)
    ) AS features_float_arr,
    ARRAY_CONSTRUCT(
      COALESCE(f_order_count, 0)
      -- add ordinal-encoded categoricals here as INT elements
    ) AS features_int_arr,
    ARRAY_CONSTRUCT(
      IFF(f_avg_spend IS NULL, 1, 0),
      IFF(f_order_count IS NULL, 1, 0),
      IFF(f_page_views IS NULL, 1, 0)
    ) AS null_mask_arr
  FROM typed
)
SELECT * FROM packed;

library(dplyr)

packed <- tbl(con, in_schema("ML", "CHURN_TRAINING_WIDE")) |>
  mutate(
    f_avg_spend   = sql("CUSTOMER_FEATURES_V1_avg_order_total::FLOAT"),
    f_order_count = sql("CUSTOMER_FEATURES_V1_order_count::INT"),
    f_page_views  = sql("CUSTOMER_ACTIVITY_V1_page_views::FLOAT"),
    features_float_arr = sql(
      "ARRAY_CONSTRUCT(
         COALESCE(CUSTOMER_FEATURES_V1_avg_order_total::FLOAT, 0),
         COALESCE(CUSTOMER_ACTIVITY_V1_page_views::FLOAT, 0)
       )"
    ),
    features_int_arr = sql(
      "ARRAY_CONSTRUCT(COALESCE(CUSTOMER_FEATURES_V1_order_count::INT, 0))"
    ),
    null_mask_arr = sql(
      "ARRAY_CONSTRUCT(
         IFF(CUSTOMER_FEATURES_V1_avg_order_total::FLOAT IS NULL, 1, 0),
         IFF(CUSTOMER_FEATURES_V1_order_count::INT IS NULL, 1, 0),
         IFF(CUSTOMER_ACTIVITY_V1_page_views::FLOAT IS NULL, 1, 0)
       )"
    )
  ) |>
  select(
    customer_id, event_ts, label,
    features_float_arr, features_int_arr, null_mask_arr
  )

# Materialize (permanent table) — or render SQL and CREATE VIEW
packed |>
  compute(name = in_schema("ML", "CHURN_TRAINING_EXPORT"), temporary = FALSE)

ARRAY_CONSTRUCT and IFF are expressed with sql() inside dplyr — RSnowflake does not wrap every semi-structured builder yet (RSnowflake: Query §dbplyr).

3 — Read with VECTOR cast + ADBC (R, RSnowflake)

Cast at read time so the stored table stays ARRAY (Parquet-friendly); the planner sees three expressions regardless of feature count.

SQL
dbplyr

SELECT
  customer_id,
  event_ts,
  label,
  features_float_arr::VECTOR(FLOAT, 2) AS features_float,
  features_int_arr::VECTOR(INT, 1)     AS features_int,
  null_mask_arr::VECTOR(INT, 3)        AS null_mask
FROM MY_DB.ML.CHURN_TRAINING_EXPORT;

options(RSnowflake.backend = "adbc")

export <- tbl(con, in_schema("ML", "CHURN_TRAINING_EXPORT")) |>
  mutate(
    features_float = sql("features_float_arr::VECTOR(FLOAT, 2)"),
    features_int   = sql("features_int_arr::VECTOR(INT, 1)"),
    features_mask  = sql("null_mask_arr::VECTOR(INT, 3)")
  ) |>
  select(customer_id, event_ts, label, features_float, features_int, features_mask) |>
  collect()

4 — Decode VECTOR list columns in R

This step applies ADBC decode to the packed export from steps 2–3. Below is the minimal three-feature walkthrough; production pipelines load the manifest from the same metadata that generated the repack SQL.

collect() returns each VECTOR as a list column of numeric vectors (one vector per row). Turn those into matrices, then join labels and feature names for modeling.

A feature manifest is metadata you maintain alongside the export SQL — typically a small table or CSV with one row per original wide-column feature: name, whether it lands in the float or int pack, its index inside that pack, and how to treat NULLs. It must stay in sync with the order of IFF(... IS NULL, 1, 0) expressions in null_mask_arr (see step 2 above: f_avg_spend, f_order_count, f_page_views).

The null mask does not store feature values — it stores 1 where the wide table had NULL before COALESCE(..., 0), 0 where the value was present. Use it to put NA back on imputed slots, drop masked columns, or pass the mask to a model that supports missingness.

library(dplyr)
library(tibble)

X_float <- do.call(rbind, export$FEATURES_FLOAT)  # nrow x 2
X_int   <- do.call(rbind, export$FEATURES_INT)    # nrow x 1
mask    <- do.call(rbind, export$FEATURES_MASK)   # nrow x 3 (same order as step 2 null_mask_arr)

# Manifest for the minimal three-feature export (production: hundreds of rows)
manifest <- tibble(
  feature   = c("f_avg_spend", "f_order_count", "f_page_views"),
  mask_ix   = 1:3,
  float_ix  = c(1L, NA_integer_, 2L),
  int_ix    = c(NA_integer_, 1L, NA_integer_)
)

# Optional: undo COALESCE(0) on float slots that were NULL in Snowflake
for (i in seq_len(nrow(manifest))) {
  ix <- manifest$float_ix[i]
  if (!is.na(ix)) {
    was_null <- mask[, manifest$mask_ix[i]] == 1L
    X_float[was_null, ix] <- NA_real_
  }
}

training <- tibble(
  customer_id = export$CUSTOMER_ID,
  event_ts    = export$EVENT_TS,
  label       = export$LABEL,
  f_avg_spend   = X_float[, 1],
  f_page_views  = X_float[, 2],
  f_order_count = X_int[, 1]
)
# training is ready for glm(), tidymodels, etc.

Decode details: ADBC client decode · Python variant: Pipeline E

21.13.4 Pipeline C — Features already stored as ARRAY columns

When: Feature Views (or ETL) already expose typed ARRAY(...) packs — no wide scalar table to repack.

flowchart LR
  C1[Feature View with ARRAY cols] --> C2[ASOF join save_as]
  C2 --> C3[SELECT with optional ::VECTOR]
  C3 --> C4[ADBC collect + list to matrix]

1 — Feature View SQL (ARRAY in the definition)

SQL
snowflakeR

-- Backing query for Feature View CUSTOMER_FEATURES_ARRAY v1
SELECT
  customer_id,
  ARRAY_CONSTRUCT(avg_order_total, page_views) AS features_float_arr,
  ARRAY_CONSTRUCT(order_count)                 AS features_int_arr
FROM analytics.customer_rollup;

fv_arrays <- sfr_create_feature_view(
  fs,
  name       = "CUSTOMER_FEATURES_ARRAY",
  version    = "v1",
  entities   = customer,
  feature_df = "
    SELECT
      customer_id,
      ARRAY_CONSTRUCT(avg_order_total, page_views) AS features_float_arr,
      ARRAY_CONSTRUCT(order_count)                 AS features_int_arr
    FROM analytics.customer_rollup
  "
)

2 — Point-in-time training materialization

sfr_generate_training_data(
  fs,
  spine               = spine_sql,
  features            = list(fv_arrays),
  spine_timestamp_col = "event_ts",
  spine_label_cols    = "label",
  save_as             = "CHURN_TRAINING_ARRAY"
)

3 — Read (R + ADBC)

With account flag ENABLE_STRUCTURED_TYPES_IN_CLIENT_RESPONSE = TRUE, typed ARRAY may arrive as list columns without JSON parsing (ADBC setup). Otherwise treat as character JSON (JSON parse).

SQL
dbplyr

SELECT
  customer_id,
  label,
  features_float_arr::VECTOR(FLOAT, 2) AS features_float,
  features_int_arr::VECTOR(INT, 1)     AS features_int
FROM MY_DB.ML.CHURN_TRAINING_ARRAY
LIMIT 100000;

options(RSnowflake.backend = "adbc")

rows <- tbl(con, in_schema("ML", "CHURN_TRAINING_ARRAY")) |>
  mutate(
    features_float = sql("features_float_arr::VECTOR(FLOAT, 2)"),
    features_int   = sql("features_int_arr::VECTOR(INT, 1)")
  ) |>
  select(customer_id, label, features_float, features_int) |>
  head(100000) |>
  collect()

X <- do.call(rbind, rows$FEATURES_FLOAT)

Parquet / Dataset note: Prefer typed ARRAY in storage for file export; untyped semi-structured ARRAY may serialize as JSON in Parquet (VECTOR — Parquet).

21.13.5 Pipeline D — Features stored as OBJECT / VARIANT

When: One OBJECT (or VARIANT) column per entity row — flexible schema, sparse keys, Feature Store “wide & sparse” pattern.

flowchart LR
  D1[OBJECT in Feature View or object_agg] --> D2[ASOF join save_as]
  D2 --> D3{Export strategy}
  D3 -->|R-friendly scalars| D4[SQL path extract in VIEW]
  D3 -->|Scale| D5[OBJECT to VECTOR export SQL]
  D3 -->|Sample only| D6[jsonlite in R]
  D4 --> D7[ADBC collect]
  D5 --> D7

1 — Feature View with OBJECT bundle

-- Backing query for Feature View CUSTOMER_FEATURES_OBJECT v1
SELECT
  customer_id,
  OBJECT_CONSTRUCT(
    'avg_spend', AVG(order_total),
    'n_orders',  COUNT(*)
  ) AS feature_object
FROM orders
GROUP BY customer_id;

fv_obj <- sfr_create_feature_view(
  fs,
  name       = "CUSTOMER_FEATURES_OBJECT",
  version    = "v1",
  entities   = customer,
  feature_df = "
    SELECT customer_id,
           OBJECT_CONSTRUCT(
             'avg_spend', AVG(order_total),
             'n_orders',  COUNT(*)
           ) AS feature_object
    FROM orders
    GROUP BY customer_id
  "
)

library(dplyr)

bundled <- tbl(con, in_schema("ANALYTICS", "ORDERS")) |>
  group_by(CUSTOMER_ID) |>
  summarise(
    feature_object = sql("OBJECT_CONSTRUCT(
      'avg_spend', AVG(order_total),
      'n_orders',  COUNT(*)
    )"),
    .groups = "drop"
  )
# Register the rendered query (show_query()) as Feature View backing SQL

Or build with dplyr + object_agg() (RSnowflake Query §dbplyr) before registration.

2 — ASOF join

sfr_generate_training_data(
  fs,
  spine               = spine_sql,
  features            = list(fv_obj),
  spine_timestamp_col = "event_ts",
  spine_label_cols    = "label",
  save_as             = "CHURN_TRAINING_OBJECT"
)

3a — Export for R: flatten in SQL (recommended)

ADBC does not turn VARIANT into native structs. Project keys in an export view or dplyr pipeline:

SQL
dbplyr

CREATE OR REPLACE VIEW MY_DB.ML.CHURN_TRAINING_OBJECT_FLAT AS
SELECT
  customer_id,
  event_ts,
  label,
  feature_object:avg_spend::FLOAT AS avg_spend,
  feature_object:n_orders::INT  AS n_orders
FROM MY_DB.ML.CHURN_TRAINING_OBJECT;

library(dplyr)

training <- tbl(con, in_schema("ML", "CHURN_TRAINING_OBJECT")) |>
  mutate(
    avg_spend = sql("feature_object:avg_spend::FLOAT"),
    n_orders  = sql("feature_object:n_orders::INT")
  ) |>
  select(customer_id, event_ts, label, avg_spend, n_orders) |>
  collect()

options(RSnowflake.backend = "adbc")
training <- tbl(con, in_schema("ML", "CHURN_TRAINING_OBJECT_FLAT")) |> collect()
# or: training <- <dbplyr pipeline above> |> collect()

More extract patterns: Extract in Snowflake.

3b — Export at scale: OBJECT → three VECTOR packs

For hundreds of keys, generate export SQL from a feature manifest (path extract → encode → ARRAY_CONSTRUCT → ::VECTOR) — same pattern as Pipeline B step 2, starting from feature_object:field::TYPE instead of wide scalars. Then read with Pipeline B step 3 or Pipeline E.

3c — Validation sample: JSON in R

SQL
R (DBI)

SELECT customer_id, label, feature_object
FROM MY_DB.ML.CHURN_TRAINING_OBJECT
LIMIT 5000;

sample <- dbGetQuery(con, "
  SELECT customer_id, label, feature_object
  FROM MY_DB.ML.CHURN_TRAINING_OBJECT
  LIMIT 5000
")
feats <- vapply(sample$FEATURE_OBJECT, jsonlite::fromJSON, FUN.VALUE = numeric(0), simplify = FALSE)

Do not use 3c for full training volume.

21.13.6 Pipeline E — Python Arrow decode, R modeling (multilang)

When: You use Pipeline B or D→B export views with three VECTOR columns and want maximum client throughput (NumPy zero-copy on fixed_size_list), then model in R.

Step	Language	Action
1	R	Feature Store registration + `save_as` / export VIEW (Pipelines B or D)
2	Python	`session.sql(...)` on export VIEW → Arrow/pandas → decode mask + packs → `X`, `y`
3	R	`reticulate::py_to_r()` or pass matrices only

2 — Python: read VECTOR export, decode with manifest

Snowpark returns VECTOR columns as NumPy arrays per row (Arrow fixed_size_list). Stack them into matrices, then use the same feature manifest as Pipeline B step 4 — metadata that maps each wide-column feature to its index in the float/int packs and its bit in null_mask_arr. “Wide-table metadata” means that manifest, not anything stored inside the export table.

Dimensions in the ::VECTOR(type, n) casts must match your export VIEW (production might be 1000 + 400 + 32 mask bits; the minimal Pipeline B demo uses 2 + 1 + 3).

import numpy as np

# Python cell — same Snowpark session as snowflakeR
df = session.sql("""
  SELECT customer_id, label,
         features_float_arr::VECTOR(FLOAT, 2) AS features_float,
         features_int_arr::VECTOR(INT, 1)     AS features_int,
         null_mask_arr::VECTOR(INT, 3)        AS null_mask
  FROM MY_DB.ML.CHURN_TRAINING_EXPORT
""").to_pandas()

X_float = np.vstack(df["FEATURES_FLOAT"].to_numpy())  # (n_rows, n_float)
X_int   = np.vstack(df["FEATURES_INT"].to_numpy())    # (n_rows, n_int)
mask    = np.vstack(df["NULL_MASK"].to_numpy())       # (n_rows, n_wide_features)

# Same manifest as Pipeline B step 4 (production: load from CSV / registry table)
manifest = [
    {"name": "f_avg_spend",   "mask_ix": 0, "float_ix": 0},
    {"name": "f_order_count", "mask_ix": 1, "int_ix": 0},
    {"name": "f_page_views",  "mask_ix": 2, "float_ix": 1},
]

for spec in manifest:
    ix = spec.get("float_ix")
    if ix is not None:
        was_null = mask[:, spec["mask_ix"]] == 1
        X_float[was_null, ix] = np.nan  # undo COALESCE(..., 0) where wide value was NULL

X = np.hstack([X_float, X_int.astype(np.float64)])  # design matrix for R
y = df["LABEL"].to_numpy(dtype=np.float64)
# reticulate R cell below reads py$X and py$y

3 — R: model on decoded matrices

library(reticulate)
y <- py$y
X <- py_to_r(py$X)

Contrast: Default sfr_generate_training_data() without save_as does not run this decode path (Feature Store reads).

21.13.7 Shared prerequisites

Requirement	Pipelines
snowflakeR + `sfr_connect()`	A–E
`sfr_generate_training_data(..., save_as = ...)`	A–E (materialize)
`adbcsnowflake` in R (CRE/bootstrap)	A–D (ADBC reads)
`ENABLE_STRUCTURED_TYPES_IN_CLIENT_RESPONSE` (account)	C (typed ARRAY as native Arrow)
Python + reticulate	E
Feature manifest for repack / decode	B, D→B, E

Related chapters

Feature Store — entities, views, sfr_generate_training_data(), Datasets
Authoring Feature Views — SQL/dbplyr for feature_df (simple → wide/temporal)
RSnowflake: Query — dbplyr, compute(), bulk writes
Feature Store — Wide & sparse — OBJECT/ARRAY in views
RSnowflake in Workspace — install and enable ADBC
Model Registry — training_dataset lineage from sfr_generate_dataset()

--- title: "Training Data in R" subtitle: "Reading Feature Store exports — semi-structured types and ADBC paths" --- ## Overview This chapter covers **getting Feature Store training data into R** for modeling — not authoring Feature Views (see [Authoring Feature Views](../17b_authoring_feature_views/index.qmd)). It is organized in three layers: 1. **Materialize** — Feature Store joins (`save_as`, Datasets) and RSnowflake read transport (ADBC vs SQL API). 2. **Read techniques** — flatten in Snowflake, unpack Arrow list columns in R, or parse JSON as a fallback; plus VECTOR layout reference. 3. **End-to-end pipelines (A–E)** — recipes that combine materialization + read technique for each feature shape (scalar, wide scalar, ARRAY, OBJECT, Python decode). Start with [Feature Store training data in R](#sec-feature-store-read) if you are new to exports. Author features first in [Authoring Feature Views](../17b_authoring_feature_views/index.qmd); jump to [Choose a pipeline](#sec-e2e-pipelines) if views are already registered. It complements [Feature Store](../17_feature_store/index.qmd) (catalog and training APIs), [Authoring Feature Views](../17b_authoring_feature_views/index.qmd) (warehouse SQL/dbplyr for `feature_df`), and [RSnowflake: Query](../13_rsnowflake_query/index.qmd) (dbplyr push-down, `compute()`, bulk I/O). ### Scalars first, semi-structured when you need them {#sec-scalar-vs-semi} **Use scalar columns:** Most Feature Views and training data sets use ordinary scalar columns, one Snowflake column per feature, point-in-time join via `sfr_generate_training_data()`, read with RSnowflake + ADBC as native R types ([Pipeline A](#sec-pipeline-a)). That is the default and works well for **up to a few hundred columns** after multi–Feature View joins. **Use OBJECT, ARRAY, or VECTOR when:** | Situation | Why semi-structured helps | |-----------|---------------------------| | **Wide tables** (hundreds–thousands of features) | Few physical columns; lighter SQL compile; efficient transport via packed ARRAY/VECTOR ([Pipeline B](#sec-pipeline-b)) | | **Feature / schema evolution** | Add keys to an OBJECT or slots in an ARRAY without `ALTER TABLE … ADD COLUMN` for every new feature | | **Ultra-wide or sparse result sets** | Store present values only; carry null masks; decode on the client | | **Embeddings** | ARRAY or VECTOR instead of one column per dimension ([Pipeline C](#sec-pipeline-c)) | For **authoring** features (SQL/dbplyr `feature_df`, simple through wide/temporal), see [Authoring Feature Views](../17b_authoring_feature_views/index.qmd). For platform background (hybrid OBJECT payloads, compile cost, sparse arrays, DT cascades), read the **[Feature Store Implementation Guide — Advanced patterns §12.10](https://snowflake-labs.github.io/snowflake-featurestore-imp-guide/12_advanced_patterns/index.html#sec-wide-sparse)** — especially [§12.10.3 Semi-structured types in Feature Views](https://snowflake-labs.github.io/snowflake-featurestore-imp-guide/12_advanced_patterns/index.html#sec-12-10-3-semi-structured-types-in-feature-views) and [§12.10.4 Hybrid storage](https://snowflake-labs.github.io/snowflake-featurestore-imp-guide/12_advanced_patterns/index.html#sec-12-10-4-hybrid-storage-key-columns--object-payload). **This chapter** focuses on **reading materialized training data into R** once Feature Views exist. ### SQL and dbplyr tabs {#sec-sql-dplyr-tabs} Warehouse logic appears as **SQL** or **dbplyr** (RSnowflake lazy tables) wherever both are practical — the same idea as the [RSnowflake: Query §dbplyr](../13_rsnowflake_query/index.qmd#sec-dbplyr) (`group="query-sql-dplyr"`). Examples here use **tabs** where your tab choice persists while you read this chapter. Use **dbplyr** when you already work in R with `tbl()`; use **SQL** when you share DDL with other teams or generate views from a feature manifest. ## Learning Objectives - Navigate the chapter layers: materialize → read technique → pipeline workflow - Decide when scalar columns suffice vs OBJECT/ARRAY/VECTOR exports - Materialize Feature Store training data (`save_as`, Datasets) before large RSnowflake reads - Flatten OBJECT/ARRAY/VECTOR in SQL when possible; use ADBC decode or JSON parse when not - Choose pipeline A–E by how features are stored in Snowflake --- ## How this chapter is organized {#sec-chapter-map} | Layer | Sections | Question it answers | |-------|----------|---------------------| | **Transport** | [Semi-structured in R](#sec-semi-structured-read), [ADBC setup](#sec-adbc-semistructured) | What column types do I get over SQL API vs ADBC? | | **Materialize** | [Feature Store training data](#sec-feature-store-read) | How does data leave Feature Store (`save_as`, Dataset, direct SQL)? | | **Author** | [Authoring Feature Views](../17b_authoring_feature_views/index.qmd) | How do I define `feature_df` SQL/dbplyr before export? | | **Read techniques** | [Extract in Snowflake](#sec-sql-extract), [ADBC decode](#sec-parse-adbc), [JSON parse](#sec-parse-json), [VECTOR layout](#sec-vector) | After materialization, how do I turn OBJECT/ARRAY/VECTOR into a modeling frame? | | **Choose workflow** | [Wide training sets](#sec-wide-training), [Pipelines A–E](#sec-e2e-pipelines) | Which full story matches how my features are stored? | The **pipelines** are not a separate topic from extract/export — they are **composed workflows**. Pipeline steps link back to the read techniques above instead of redefining them. --- ## Semi-structured columns in R {#sec-semi-structured-read} How semi-structured columns appear in R depends on **which RSnowflake transport** you use: | Transport | When | Semi-structured in R (typical) | |-----------|------|--------------------------------| | **SQL API** (default) | `RSnowflake.backend = "auto"` and ADBC not installed, or ADBC failed | VARIANT, OBJECT, ARRAY, VECTOR → **character** (JSON text per cell) | | **ADBC** | `adbcsnowflake` installed + `options(RSnowflake.backend = "adbc")` | Scalars native; **VECTOR** often **list** columns of numbers; typed ARRAY / structured OBJECT may be native **if** account flag is on (below) | Scalar columns (`DOUBLE`, `NUMBER`, timestamps, …) are native in both paths. Feature Store training tables that are mostly scalars still benefit from ADBC for **volume** (columnar transfer, less parsing overhead on numerics). ### Recommendation for Feature Store workloads For **retrieving training data into R** when views use OBJECT, ARRAY, or VECTOR: 1. **Materialize in Snowflake** — `sfr_generate_training_data(..., save_as = "MY_TRAINING")` or a managed table / export view ([Feature Store training data](#sec-feature-store-read)). 2. **Read with RSnowflake + ADBC** — install `adbcsnowflake` (CRE or bootstrap), then: ::: {.panel-tabset group="fe-sql-dplyr"} ##### SQL ```sql SELECT * FROM MY_DB.ML.MY_TRAINING LIMIT 100000; ``` ##### dbplyr ```r options(RSnowflake.backend = "adbc") # force Arrow path for reads library(dplyr) training <- tbl(con, in_schema("ML", "MY_TRAINING")) |> head(100000) |> collect() ``` ::: 3. **Prefer unpacking in SQL or dbplyr** when you can — see [Extract in Snowflake](#sec-sql-extract) — so `collect()` returns scalars even on the SQL API path. 4. **Reserve client-side JSON parsing** for VARIANT/OBJECT bundles you cannot flatten in SQL, or for validation samples (`LIMIT`). `sfr_generate_training_data()` without `save_as` still goes through **snowflakeR → pandas → R**; it does not use RSnowflake ADBC. Use `save_as` + RSnowflake when semi-structured or wide training pulls are the bottleneck. ### Why Feature Store users care Snowflake recommends **OBJECT** or **ARRAY** columns for very wide feature sets — see [Scalars first](#sec-scalar-vs-semi) and the [Feature Store Implementation Guide — Wide & sparse](https://snowflake-labs.github.io/snowflake-featurestore-imp-guide/12_advanced_patterns/#sec-wide-sparse). Joining **multiple Feature Views** can still produce hundreds of scalar columns even when each view is modest ([Wide training sets](#sec-wide-training)). | Pattern | In Snowflake | SQL API → R | ADBC → R (with account settings) | |---------|--------------|-------------|----------------------------------| | Many scalar columns | One column per feature | Native types | Native types (faster at scale) | | VARIANT / untyped OBJECT | Flexible JSON blob | JSON **character** | JSON **character** (no native Arrow) | | Typed ARRAY / structured OBJECT | Declared element types | JSON **character** | Native list/struct **if** account flag enabled | | VECTOR packs | Fixed-length numerics | JSON **character** | Often **list** column → `matrix` (no account flag required) | RSnowflake helps you **build** bundles on the warehouse (`object_agg`, `array_agg`, `::VECTOR`); **how** you read them is a separate choice between SQL flattening, ADBC, and client-side decode. Use [How this chapter is organized](#sec-chapter-map) to navigate; [Choose a pipeline](#sec-e2e-pipelines) comes after the read techniques. --- ## Feature Store training data in R {#sec-feature-store-read} Feature Store training flows through **snowflakeR**, not RSnowflake alone — but the data types you get in R depend on **which read path** you choose. ### Path 1 — `sfr_generate_training_data()` (transient join) Joins your spine to one or more Feature Views and returns an R **`data.frame`**: ```r training <- sfr_generate_training_data( fs, spine = spine_sql, features = list(fv1, fv2), spine_timestamp_col = "event_ts", spine_label_cols = "label" ) ``` **What happens under the hood:** the Python Feature Store SDK runs the join in Snowflake, materializes a Snowpark result, converts it to **pandas**, then snowflakeR transfers columns into R (today via a JSON round-trip for compatibility with Workspace). Scalar feature columns arrive as normal R types. Any OBJECT/ARRAY/VECTOR columns in the join result follow the pandas/JSON path — treat them like SQL API character columns if they appear as strings. **Materialize without pulling everything to R:** ```r sfr_generate_training_data( fs, spine = spine_sql, features = list(fv1, fv2), save_as = "CHURN_TRAINING_V1", # table in Feature Store schema spine_timestamp_col = "event_ts", spine_label_cols = "label" ) # Inspect or model from Snowflake with RSnowflake library(dplyr) training_tbl <- tbl(con, in_schema("FEATURE_STORE", "CHURN_TRAINING_V1")) sample <- training_tbl |> head(10000) |> collect() ``` Use `save_as` for wide or large training sets: validate joins in Snowflake, then read with RSnowflake + **ADBC** ([ADBC setup](#sec-adbc-semistructured)): ::: {.panel-tabset group="fe-sql-dplyr"} ##### SQL ```sql SELECT * FROM MY_DB.FEATURE_STORE.CHURN_TRAINING_V1 LIMIT 100000; ``` ##### dbplyr ```r options(RSnowflake.backend = "adbc") training_tbl <- tbl(con, in_schema("FEATURE_STORE", "CHURN_TRAINING_V1")) sample <- training_tbl |> head(100000) |> collect() ``` ::: ### Path 2 — `sfr_generate_dataset()` (versioned Dataset + lineage) Creates an immutable **Snowflake ML Dataset** (Parquet files on a stage — the same object Python reads with **fsspec**): ```r training <- sfr_generate_dataset( fs, name = "CHURN_TRAINING", version = "v1", spine = spine_sql, features = list(fv1, fv2), spine_label_cols = "label" ) attr(training, "dataset_name") # for sfr_log_model() lineage attr(training, "dataset_version") ``` **What R receives today:** the bridge reads the Dataset through **`Dataset.read.to_snowpark_dataframe().to_pandas()`** — not direct Parquet file access. That matches demo and moderate row counts; it is the same semantic data Python would get from an in-session read, not the maximum-throughput fsspec path. **What Python does at scale:** after `generate_dataset()`, the SDK exposes Parquet files on internal storage: ```python # Python (snowflake-ml) — not wrapped in snowflakeR yet ds = session.dataset.load("MY_DB.ML.CHURN_TRAINING").select_version("v1") files = ds.read.files() # fsspec-compatible paths # pd.read_parquet(files) or pyarrow.dataset ``` See [Snowflake ML Dataset — direct file access](https://docs.snowflake.com/en/developer-guide/snowflake-ml/dataset#direct-file-access). **R options for large Datasets today:** 1. **`sfr_read_dataset(conn, name, version)`** — same pandas → R path as `sfr_generate_dataset()`; suitable when the Dataset already exists. 2. **`save_as` table + RSnowflake + ADBC** — recommended for wide/scalar-heavy training frames at scale ([ADBC setup](#sec-adbc-semistructured)). Dataset Parquet still stores VARIANT as JSON strings on disk — same limitation as Python unless columns are typed ARRAY / scalars. 3. **Python decode + reticulate** — [Pipeline E](#sec-pipeline-e) for three-VECTOR export at maximum throughput. A future snowflakeR release may add **`format = "parquet"`** on dataset reads (fsspec → `arrow::read_parquet()`). Until then, assume Dataset APIs in R are **lineage- and moderate-scale-friendly**, not a substitute for a full Parquet export pipeline. **Full export workflows:** [Reading materialized training data](#sec-read-strategies), then [Choose a pipeline](#sec-e2e-pipelines) (A–E). **Define features first?** [Authoring Feature Views](../17b_authoring_feature_views/index.qmd). ### Path 3 — RSnowflake only (custom SQL / dbplyr) When you own the SQL (external Feature View table, export view, or `save_as` output), use RSnowflake directly: ::: {.panel-tabset group="fe-sql-dplyr"} ##### SQL ```sql SELECT * FROM MY_DB.ML.CHURN_TRAINING_V1; ``` ##### dbplyr ```r options(RSnowflake.backend = "adbc") # recommended for materialized training reads training_tbl <- tbl(con, in_schema("ML", "CHURN_TRAINING_V1")) training <- training_tbl |> collect() # Scalars: native types. VARIANT/OBJECT: chr → [JSON parse](#sec-parse-json). # VECTOR / typed ARRAY: list → [ADBC decode](#sec-parse-adbc). # Or flatten first: [Extract in Snowflake](#sec-sql-extract). ``` ::: This path gives you the most control over **which columns** cross the wire — important when joins are wide. ## ADBC, Arrow, and account settings {#sec-adbc-semistructured} ADBC (`adbcsnowflake`) uses Snowflake's **Arrow** result path. That is the recommended read transport for **large** materialized training tables and for **VECTOR** / typed **ARRAY** columns — not a magic bypass for every semi-structured type. ### Enable ADBC in Workspace | Step | Action | |------|--------| | Packages | `adbcsnowflake`, `adbcdrivermanager` (CRE image or bootstrap `languages.r.addons.adbc: true`) | | Connection | Same `dbConnect(Snowflake())` — backend is lazy-initialised | | Reads | `options(RSnowflake.backend = "adbc")` or leave `"auto"` when ADBC is installed | See [RSnowflake in Workspace](../14_rsnowflake_workspace/index.qmd#sec-adbc) and [WORKSPACE_ADBC.md](https://github.com/Snowflake-Labs/RSnowflake/blob/main/WORKSPACE_ADBC.md). ### Account parameter for native structured types Snowflake can return **typed ARRAY**, **structured OBJECT**, and **MAP** as native Arrow types instead of JSON strings. This is controlled at **account** scope (not session): ```sql -- Accountadmin — one-time per account; confirm with your admin ALTER ACCOUNT SET ENABLE_STRUCTURED_TYPES_IN_CLIENT_RESPONSE = TRUE; ``` | Column type | Native Arrow over ADBC? | |-------------|---------------------------| | Scalars (DOUBLE, INT, …) | Yes | | **VECTOR(FLOAT/INT, n)** | Yes — `fixed_size_list` (does **not** require the account flag) | | Typed **ARRAY(...)** | Yes **with** account flag; otherwise JSON strings | | **Structured OBJECT** (`OBJECT(f1 DOUBLE, …)`) | Yes **with** account flag | | **VARIANT** / untyped OBJECT | **No** — always JSON strings in Arrow | Ask your account team whether the flag is already enabled before relying on native ARRAY/struct columns in R. ::: {.callout-warning} **SQL API has no server-side Arrow.** `dbGetQuery()` on the REST path always uses JSON. `dbGetQueryArrow()` on the SQL API path builds Arrow **client-side** from JSON — interface compatibility only, not the same as ADBC transport. For Feature Store exports you install **`adbcsnowflake` in the Workspace R environment** (CRE or bootstrap), materialize training with `save_as`, then read that table through RSnowflake with `options(RSnowflake.backend = "adbc")`. That is a **client-side ADBC driver** talking Arrow to Snowflake — not `dbGetQueryArrow()` on the REST API. ::: ### Inspect what you received After `collect()`, inspect column classes before choosing a decoder: ::: {.panel-tabset group="fe-sql-dplyr"} ##### SQL ```sql SELECT * FROM MY_DB.ML.MY_TRAINING LIMIT 100; ``` ##### R (inspect) ```r options(RSnowflake.backend = "adbc") sample <- dbGetQuery(con, "SELECT * FROM MY_DB.ML.MY_TRAINING LIMIT 100") str(sample, max.level = 1) # PACKED_FLOATS : List of 100 | num [1:128] ... → VECTOR over ADBC # FEATURE_OBJECT : chr [1:100] ... → VARIANT/OBJECT JSON # AVG_SPEND : num ... → scalar ``` ::: --- ## Reading materialized training data {#sec-read-strategies} Once training data is a table or view in Snowflake (`save_as`, export VIEW, or custom SQL), you choose **how semi-structured columns cross the wire**. These three strategies apply across pipelines — [Pipeline A](#sec-pipeline-a) uses scalars only; [B–E](#sec-e2e-pipelines) combine repack in SQL with decode on the client. | Priority | Strategy | When | Section | |----------|----------|------|---------| | 1 (preferred) | **Extract in Snowflake** | Known OBJECT keys or ARRAY slots; moderate width | [Extract in Snowflake](#sec-sql-extract) | | 2 | **ADBC unpack in R** | VECTOR packs or typed ARRAY with account flag; large row counts | [ADBC decode](#sec-parse-adbc) | | 3 (fallback) | **JSON parse in R** | SQL API path, VARIANT blobs, validation samples (`LIMIT`) | [JSON parse](#sec-parse-json) | For **wide numeric exports**, see [VECTOR layout](#sec-vector) (three-column pack) — used by [Pipeline B](#sec-pipeline-b) and [Pipeline E](#sec-pipeline-e). --- ## Extract in Snowflake {#sec-sql-extract} The most portable and often fastest approach: **project features in Snowflake** (SQL or dbplyr), then `collect()` scalars only. Works with **SQL API or ADBC** and avoids JSON parsing in R entirely. ### OBJECT path access ::: {.panel-tabset group="fe-sql-dplyr"} ##### SQL ```sql SELECT customer_id, event_ts, label, feature_object:avg_spend::FLOAT AS avg_spend, feature_object:n_orders::INTEGER AS n_orders FROM MY_DB.ML.CUSTOMER_FEATURES_BUNDLED; ``` ##### dbplyr ```r library(dplyr) training <- tbl(con, in_schema("ML", "CUSTOMER_FEATURES_BUNDLED")) |> mutate( avg_spend = sql("FEATURE_OBJECT:avg_spend::FLOAT"), n_orders = sql("FEATURE_OBJECT:n_orders::INTEGER") ) |> select(-FEATURE_OBJECT) |> collect() ``` ::: ### ARRAY → scalars (known positions) ::: {.panel-tabset group="fe-sql-dplyr"} ##### SQL ```sql SELECT entity_id, feature_array[0]::FLOAT AS f1, feature_array[1]::FLOAT AS f2 FROM MY_DB.ML.TRAINING_ARRAY_PACK; ``` ##### dbplyr ```r training <- tbl(con, "TRAINING_ARRAY_PACK") |> mutate( f1 = sql("feature_array[0]::FLOAT"), f2 = sql("feature_array[1]::FLOAT") ) |> select(-FEATURE_ARRAY) |> collect() ``` ::: For hundreds of features, generate the `mutate()` expressions from a feature manifest in R, or build a **VIEW** in Snowflake that flattens once and register that as the Feature View backing table. ### VECTOR packs — keep packed for modeling If the model expects a matrix, you may **keep** VECTOR/ARRAY columns through `collect()` and decode in R ([ADBC decode](#sec-parse-adbc)) rather than exploding to hundreds of scalar columns in SQL. --- ## ADBC: unpack native Arrow columns in R {#sec-parse-adbc} Use this path when `collect()` returns **list** columns (VECTOR or typed ARRAY with account flag), not JSON character. ### VECTOR or ARRAY as list column → matrix ::: {.panel-tabset group="fe-sql-dplyr"} ##### SQL ```sql SELECT entity_id, features_float_arr::VECTOR(FLOAT, 128) AS features_float FROM MY_DB.ML.TRAINING_VECTOR_PACK LIMIT 50000; ``` ##### R (ADBC decode) ```r options(RSnowflake.backend = "adbc") pack <- dbGetQuery(con, " SELECT entity_id, features_float_arr::VECTOR(FLOAT, 128) AS features_float FROM MY_DB.ML.TRAINING_VECTOR_PACK LIMIT 50000 ") stopifnot(is.list(pack$FEATURES_FLOAT)) X <- do.call(rbind, pack$FEATURES_FLOAT) training <- data.frame(entity_id = pack$ENTITY_ID, X) ``` ::: If a column is still **character**, fall back to [JSON parse](#sec-parse-json) — the warehouse or connection may not be using the Arrow path for that query. ### Typed ARRAY with account flag ::: {.panel-tabset group="fe-sql-dplyr"} ##### SQL ```sql SELECT entity_id, feature_array FROM MY_DB.ML.TRAINING_TYPED_ARRAY -- ARRAY(DOUBLE) column LIMIT 10000; ``` ##### R (ADBC decode) ```r arr <- dbGetQuery(con, " SELECT entity_id, feature_array FROM MY_DB.ML.TRAINING_TYPED_ARRAY LIMIT 10000 ") if (is.list(arr$FEATURE_ARRAY)) { X <- do.call(rbind, arr$FEATURE_ARRAY) } else { X <- matrix( vapply(arr$FEATURE_ARRAY, jsonlite::fromJSON, numeric(0), simplify = TRUE), nrow = nrow(arr), byrow = TRUE ) } ``` ::: ### Mix scalar + packed columns ::: {.panel-tabset group="fe-sql-dplyr"} ##### SQL ```sql SELECT entity_id, LABEL::FLOAT AS label_num, features_float_arr::VECTOR(FLOAT, 128) AS features_float, feature_object FROM MY_DB.ML.MY_TRAINING; ``` ##### dbplyr ```r library(dplyr) training <- tbl(con, in_schema("ML", "MY_TRAINING")) |> mutate(label_num = sql("LABEL::FLOAT")) |> collect() # ADBC: scalars native; packs may be list or character if ("FEATURES_FLOAT" %in% names(training) && is.list(training$FEATURES_FLOAT)) { X_pack <- do.call(rbind, training$FEATURES_FLOAT) } if ("FEATURE_OBJECT" %in% names(training) && is.character(training$FEATURE_OBJECT)) { feats <- vapply(training$FEATURE_OBJECT, jsonlite::fromJSON, numeric(0), simplify = FALSE) } ``` ::: For very large `n`, consider the **`arrow`** package on exported Parquet (Dataset files or `COPY INTO`) instead of expanding wide matrices in memory — see [Feature Store training data](#sec-feature-store-read). --- ## SQL API: parse JSON in R {#sec-parse-json} When results arrive as **character** JSON (SQL API path, or VARIANT/OBJECT over ADBC), use **`jsonlite`**. Prefer **`vapply()`** over `lapply()` + `rbind()` for thousands of rows. ### OBJECT column → named features per row ::: {.panel-tabset group="fe-sql-dplyr"} ##### SQL ```sql SELECT entity_id, feature_object FROM MY_DB.ML.CUSTOMER_FEATURES_BUNDLED LIMIT 5000; ``` ##### R (parse JSON) ```r library(jsonlite) sample <- dbGetQuery(con, " SELECT entity_id, feature_object FROM MY_DB.ML.CUSTOMER_FEATURES_BUNDLED LIMIT 5000 ") fromJSON(sample$FEATURE_OBJECT[1]) feat_list <- vapply( sample$FEATURE_OBJECT, fromJSON, FUN.VALUE = numeric(0), simplify = FALSE ) feat_mat <- do.call(rbind, lapply(feat_list, `[`, unique(names(feat_list[[1]])))) training <- cbind(entity_id = sample$ENTITY_ID, as.data.frame(feat_mat)) ``` ::: If keys differ by row, normalize with **`tidyr::unnest_wider()`** after `tibble::enframe(feat_list)` or keep features in SQL with `feature_object:avg_spend::FLOAT` style extracts ([Extract in Snowflake](#sec-sql-extract)). ### ARRAY column → numeric matrix ::: {.panel-tabset group="fe-sql-dplyr"} ##### SQL ```sql SELECT entity_id, feature_array FROM MY_DB.ML.TRAINING_ARRAY_PACK LIMIT 5000; ``` ##### R (parse JSON) ```r library(jsonlite) arr <- dbGetQuery(con, " SELECT entity_id, feature_array FROM MY_DB.ML.TRAINING_ARRAY_PACK LIMIT 5000 ") rows <- vapply( arr$FEATURE_ARRAY, function(s) fromJSON(s, simplifyVector = TRUE), FUN.VALUE = numeric(0) ) X <- matrix(unlist(rows), nrow = nrow(arr), byrow = TRUE) ``` ::: Check `ncol(X)` once against your declared array length; mismatches usually mean NULL or ragged arrays in source data. ### VECTOR column → numeric matrix VECTOR values also arrive as JSON text on the SQL API path. The decode step is the same as ARRAY: ::: {.panel-tabset group="fe-sql-dplyr"} ##### SQL ```sql SELECT entity_id, packed_floats::VECTOR(FLOAT, 128) AS packed_floats FROM MY_DB.ML.TRAINING_VECTOR_PACK LIMIT 5000; ``` ##### R (parse JSON) ```r library(jsonlite) vec <- dbGetQuery(con, " SELECT entity_id, packed_floats::VECTOR(FLOAT, 128) AS packed_floats FROM MY_DB.ML.TRAINING_VECTOR_PACK LIMIT 5000 ") float_rows <- vapply( vec$PACKED_FLOATS, function(s) fromJSON(s, simplifyVector = TRUE), FUN.VALUE = numeric(0) ) X_float <- matrix(unlist(float_rows), nrow = nrow(vec), byrow = TRUE) ``` ::: For production-sized training sets, **do not** rely on row-wise `fromJSON` over millions of rows unless columns are VARIANT/OBJECT with no SQL flatten path. Prefer [Extract in Snowflake](#sec-sql-extract), **ADBC + list columns** ([ADBC decode](#sec-parse-adbc)), or Parquet — see [Feature Store training data](#sec-feature-store-read). --- ## VECTOR columns for wide feature exports {#sec-vector} Snowflake **VECTOR** types store a fixed-length sequence of numbers (up to **4,096** elements per column). They are useful when you export hundreds or thousands of features but want the query planner to see only a **small number of columns** — not one expression per feature. This section documents the **three-column VECTOR layout** referenced by [Pipeline B](#sec-pipeline-b) step 3 and [Pipeline E](#sec-pipeline-e). For the full repack story (wide ASOF → ARRAY → VECTOR → decode), follow Pipeline B; this section is the **reference** for cast dimensions and client decode snippets. RSnowflake runs SQL that references `VECTOR(FLOAT, n)` and `VECTOR(INT, n)` without restriction. There are no dplyr helpers yet for `VECTOR_AVG` / `VECTOR_SUM`; write those in SQL with `sql()` or `dbGetQuery()`. After `collect()` on the **SQL API** path, VECTOR columns are **character** (JSON) — use [JSON parse](#sec-parse-json). With **ADBC**, VECTOR is often a **list** column — use [ADBC decode](#sec-parse-adbc). ### Packed export layout (three columns) A common layout for mixed numeric and categorical features stores data as ARRAY in tables, then casts to VECTOR at read time: | Column | Type | Role | |--------|------|------| | `features_float` | `VECTOR(FLOAT, n)` | Imputed numerics | | `features_int` | `VECTOR(INT, n)` | Ordinal-encoded categoricals | | `null_mask` | `VECTOR(INT, ceil(n/32))` | Bitmask of which features were NULL | The export SQL stays short regardless of feature count — the warehouse compiles three expressions instead of hundreds: ::: {.panel-tabset group="fe-sql-dplyr"} ##### SQL ```sql SELECT entity_id, features_float_arr::VECTOR(FLOAT, 1000) AS features_float, features_int_arr::VECTOR(INT, 400) AS features_int, null_mask_arr::VECTOR(INT, 32) AS null_mask FROM MY_DB.ML.TRAINING_EXPORT_VIEW WHERE train_date = '2026-01-01' LIMIT 10000; ``` ##### dbplyr ```r library(dplyr) options(RSnowflake.backend = "adbc") export <- tbl(con, in_schema("ML", "TRAINING_EXPORT_VIEW")) |> filter(sql("train_date = '2026-01-01'")) |> mutate( features_float = sql("features_float_arr::VECTOR(FLOAT, 1000)"), features_int = sql("features_int_arr::VECTOR(INT, 400)"), null_mask = sql("null_mask_arr::VECTOR(INT, 32)") ) |> select(entity_id, features_float, features_int, null_mask) |> head(10000) |> collect() if (is.list(export$FEATURES_FLOAT)) { X_num <- do.call(rbind, export$FEATURES_FLOAT) } else { X_num <- matrix( vapply(export$FEATURES_FLOAT, jsonlite::fromJSON, numeric(0), simplify = TRUE), nrow = nrow(export), byrow = TRUE ) } ``` ::: Define `features_*_arr` in the Feature View or a downstream Dynamic Table — not row-by-row in R. See the [Implementation Guide §12.10](https://snowflake-labs.github.io/snowflake-featurestore-imp-guide/12_advanced_patterns/#sec-wide-sparse) for platform patterns. ::: {.callout-note} **Parquet and VECTOR:** Snowflake cannot write VECTOR columns directly to Parquet. For file-based pipelines, export **typed ARRAY** columns (or scalars) instead; VECTOR is for in-warehouse query and SQL API/ADBC reads. ::: --- ## Wide training sets: views, joins, and column count {#sec-wide-training} This section summarises **why** width appears and **which pipeline** to pick. Full step-by-step workflows are in [End-to-end pipelines](#sec-e2e-pipelines) (next section). Width is not only “one Feature View with 500 columns.” It also accumulates when you **join several Feature Views** into one training set: ```r training_data <- sfr_generate_training_data( fs, spine = spine_sql, features = list( list(name = "CUSTOMER_DEMOGRAPHICS", version = "v1"), # 40 cols list(name = "CUSTOMER_TRANSACTIONS", version = "v1"), # 120 cols list(name = "PRODUCT_EMBEDDINGS", version = "v1") # 256 cols ), spine_timestamp_col = "event_ts", spine_label_cols = "label", auto_prefix = TRUE # CUSTOMER_TRANSACTIONS_avg_spend, ... ) ``` Each view adds its feature columns (plus keys and optional timestamp columns). **`auto_prefix = TRUE`** avoids name collisions but **increases total width** — plan for hundreds of columns even when each view is modest. | Situation | Pipeline | |-----------|----------| | Few views, up to ~300 **total** scalar columns after join | [A — scalar direct](#sec-pipeline-a) | | Several views → **~300–500+ scalar columns** | [B — scalar-wide → three VECTOR packs](#sec-pipeline-b) | | Features already **ARRAY** in views | [C — ARRAY-native](#sec-pipeline-c) | | Features as **OBJECT** / VARIANT | [D — OBJECT-native](#sec-pipeline-d) (flatten SQL or B-style repack) | | Maximum packed decode throughput | [E — Python → R](#sec-pipeline-e) | | Multi-view name collisions | `auto_prefix = TRUE`; consider fewer views or pre-joined SQL | --- ## End-to-end training export pipelines {#sec-e2e-pipelines} Each pipeline is a **composed workflow** — not a new read technique. Before diving in, you should understand: - [Feature Store materialization](#sec-feature-store-read) (`save_as`, Datasets) - [Read strategies](#sec-read-strategies) ([Extract in Snowflake](#sec-sql-extract), [ADBC decode](#sec-parse-adbc), [JSON parse](#sec-parse-json)) Pipeline steps **link to those sections** for extract/decode details rather than repeating them. Steps assume a **multilang Workspace notebook**: snowflakeR for Feature Store APIs, RSnowflake for reads, optional **ADBC** in R ([ADBC setup](#sec-adbc-semistructured)). ### Choose a pipeline | How features live today | Combined ASOF / training width | Recommended pipeline | |-------------------------|--------------------------------|----------------------| | Scalar columns per feature | up to a few hundred (~300) columns total | [Pipeline A — scalar, direct](#sec-pipeline-a) | | Scalar columns per feature | ~300+ columns after multi–Feature View join | [Pipeline B — scalar-wide → three VECTOR packs](#sec-pipeline-b) | | **ARRAY** columns (typed packs) | Any | [Pipeline C — ARRAY-native](#sec-pipeline-c) | | **OBJECT** / VARIANT JSON per row | Any | [Pipeline D — OBJECT-native](#sec-pipeline-d) | | Packed export; maximum decode speed in notebook | Large row × width | [Pipeline E — Python decode → R](#sec-pipeline-e) | ```{mermaid} flowchart TB subgraph fs [Snowflake — Feature Store] SP[Spine + labels] FV[Feature Views] JOIN[generate_training_data ASOF join] MAT[(Materialized table or export VIEW)] SP --> JOIN FV --> JOIN JOIN --> MAT end subgraph pack [Optional repack in SQL] WIDE[Wide scalar table] ARR[Three ARRAY columns] VEC[VECTOR cast at read time] WIDE --> ARR --> VEC end subgraph client [Workspace client] ADBC[RSnowflake ADBC collect] R[R data.frame / matrix] PY[Python Arrow decode] RET[reticulate py_to_r] ADBC --> R PY --> RET --> R end MAT --> WIDE MAT --> ARR ARR --> VEC MAT --> ADBC VEC --> ADBC VEC --> PY ``` --- ### Pipeline A — Scalar features, moderate width {#sec-pipeline-a} **When:** Each feature is its own column; total width after the ASOF join stays in a **moderate** range (rough guide: up to a few hundred — ~300 — scalar feature columns with RSnowflake + ADBC). ```{mermaid} flowchart LR A1[Spine + Feature Views] --> A2[sfr_generate_training_data] A2 --> A3{Row count?} A3 -->|Demo / small| A4[Return R data.frame] A3 -->|Large or reuse| A5[save_as table] A5 --> A6[RSnowflake ADBC collect] A6 --> A7[Model in R] A4 --> A7 ``` **1 — Register and join (R, snowflakeR)** ```r library(snowflakeR) conn <- sfr_connect() fs <- sfr_feature_store(conn, database = "MY_DB", schema = "ML") fv1 <- sfr_get_feature_view(fs, "CUSTOMER_FEATURES", "v1") fv2 <- sfr_get_feature_view(fs, "CUSTOMER_ACTIVITY", "v1") spine_sql <- " SELECT customer_id, event_ts, churned AS label FROM MY_DB.ML.CHURN_SPINE WHERE event_ts >= '2024-01-01' " ``` **2a — Small pull: straight to R** ```r training <- sfr_generate_training_data( fs, spine = spine_sql, features = list(fv1, fv2), spine_timestamp_col = "event_ts", spine_label_cols = "label", auto_prefix = TRUE ) # Scalar columns → ready for glm / tidymodels ``` **2b — Large or repeated reads: materialize, then ADBC** ```r sfr_generate_training_data( fs, spine = spine_sql, features = list(fv1, fv2), spine_timestamp_col = "event_ts", spine_label_cols = "label", auto_prefix = TRUE, save_as = "CHURN_TRAINING_SCALAR" ) ``` ::: {.panel-tabset group="fe-sql-dplyr"} ##### SQL ```sql SELECT * FROM MY_DB.ML.CHURN_TRAINING_SCALAR; ``` ##### dbplyr ```r %%R library(DBI) library(RSnowflake) library(dplyr) options(RSnowflake.backend = "adbc") con <- dbConnect(Snowflake()) training <- tbl(con, in_schema("ML", "CHURN_TRAINING_SCALAR")) |> collect() ``` ::: **Read transport:** [ADBC setup](#sec-adbc-semistructured) · **Lineage Dataset (optional):** [Feature Store Path 2](#sec-feature-store-read) --- ### Pipeline B — Scalar-wide ASOF → persist as three ARRAYs → VECTOR read → R {#sec-pipeline-b} **When:** The point-in-time join is correct but **wide** (many scalar columns from one or more Feature Views). You want compile-friendly exports and efficient transport without pulling 500+ columns through the client. **Idea:** Materialize the wide ASOF result once, then build an **export view** (or table) with exactly **three** `ARRAY` columns — float numerics, int ordinals/categoricals, null bitmask — and cast to `VECTOR` only in the `SELECT` you use for training reads. ```{mermaid} flowchart LR B1[ASOF join save_as wide table] --> B2[Export VIEW: ARRAY_CONSTRUCT x3] B3[Read SQL: ::VECTOR casts] --> B4[ADBC collect] B4 --> B5[List columns to matrix in R] B2 --> B3 ``` **1 — Wide ASOF materialization (R, snowflakeR)** ```r sfr_generate_training_data( fs, spine = spine_sql, features = list(fv1, fv2, fv3), # many scalar columns per view spine_timestamp_col = "event_ts", spine_label_cols = "label", auto_prefix = TRUE, save_as = "CHURN_TRAINING_WIDE" ) ``` **2 — Repack to three ARRAY columns (Snowflake)** Maintain a **feature manifest** (which columns are float vs categorical, imputation rules). Below is a **minimal** pattern; production pipelines generate this SQL from metadata (see [Implementation Guide §12.10](https://snowflake-labs.github.io/snowflake-featurestore-imp-guide/12_advanced_patterns/#sec-wide-sparse)). ::: {.panel-tabset group="fe-sql-dplyr"} ##### SQL ```sql CREATE OR REPLACE VIEW MY_DB.ML.CHURN_TRAINING_EXPORT AS WITH base AS ( SELECT * FROM MY_DB.ML.CHURN_TRAINING_WIDE ), typed AS ( SELECT customer_id, event_ts, label, -- example: three scalar features from the wide table CUSTOMER_FEATURES_V1_avg_order_total::FLOAT AS f_avg_spend, CUSTOMER_FEATURES_V1_order_count::INT AS f_order_count, CUSTOMER_ACTIVITY_V1_page_views::FLOAT AS f_page_views FROM base ), packed AS ( SELECT customer_id, event_ts, label, ARRAY_CONSTRUCT( COALESCE(f_avg_spend, 0), COALESCE(f_page_views, 0) ) AS features_float_arr, ARRAY_CONSTRUCT( COALESCE(f_order_count, 0) -- add ordinal-encoded categoricals here as INT elements ) AS features_int_arr, ARRAY_CONSTRUCT( IFF(f_avg_spend IS NULL, 1, 0), IFF(f_order_count IS NULL, 1, 0), IFF(f_page_views IS NULL, 1, 0) ) AS null_mask_arr FROM typed ) SELECT * FROM packed; ``` ##### dbplyr ```r library(dplyr) packed <- tbl(con, in_schema("ML", "CHURN_TRAINING_WIDE")) |> mutate( f_avg_spend = sql("CUSTOMER_FEATURES_V1_avg_order_total::FLOAT"), f_order_count = sql("CUSTOMER_FEATURES_V1_order_count::INT"), f_page_views = sql("CUSTOMER_ACTIVITY_V1_page_views::FLOAT"), features_float_arr = sql( "ARRAY_CONSTRUCT( COALESCE(CUSTOMER_FEATURES_V1_avg_order_total::FLOAT, 0), COALESCE(CUSTOMER_ACTIVITY_V1_page_views::FLOAT, 0) )" ), features_int_arr = sql( "ARRAY_CONSTRUCT(COALESCE(CUSTOMER_FEATURES_V1_order_count::INT, 0))" ), null_mask_arr = sql( "ARRAY_CONSTRUCT( IFF(CUSTOMER_FEATURES_V1_avg_order_total::FLOAT IS NULL, 1, 0), IFF(CUSTOMER_FEATURES_V1_order_count::INT IS NULL, 1, 0), IFF(CUSTOMER_ACTIVITY_V1_page_views::FLOAT IS NULL, 1, 0) )" ) ) |> select( customer_id, event_ts, label, features_float_arr, features_int_arr, null_mask_arr ) # Materialize (permanent table) — or render SQL and CREATE VIEW packed |> compute(name = in_schema("ML", "CHURN_TRAINING_EXPORT"), temporary = FALSE) ``` ::: `ARRAY_CONSTRUCT` and `IFF` are expressed with `sql()` inside dplyr — RSnowflake does not wrap every semi-structured builder yet ([RSnowflake: Query §dbplyr](../13_rsnowflake_query/index.qmd#sec-snowflake-dbplyr)). **3 — Read with VECTOR cast + ADBC (R, RSnowflake)** Cast at **read time** so the stored table stays ARRAY (Parquet-friendly); the planner sees three expressions regardless of feature count. ::: {.panel-tabset group="fe-sql-dplyr"} ##### SQL ```sql SELECT customer_id, event_ts, label, features_float_arr::VECTOR(FLOAT, 2) AS features_float, features_int_arr::VECTOR(INT, 1) AS features_int, null_mask_arr::VECTOR(INT, 3) AS null_mask FROM MY_DB.ML.CHURN_TRAINING_EXPORT; ``` ##### dbplyr ```r options(RSnowflake.backend = "adbc") export <- tbl(con, in_schema("ML", "CHURN_TRAINING_EXPORT")) |> mutate( features_float = sql("features_float_arr::VECTOR(FLOAT, 2)"), features_int = sql("features_int_arr::VECTOR(INT, 1)"), features_mask = sql("null_mask_arr::VECTOR(INT, 3)") ) |> select(customer_id, event_ts, label, features_float, features_int, features_mask) |> collect() ``` ::: **4 — Decode VECTOR list columns in R** This step applies [ADBC decode](#sec-parse-adbc) to the packed export from steps 2–3. Below is the minimal three-feature walkthrough; production pipelines load the manifest from the same metadata that generated the repack SQL. `collect()` returns each VECTOR as a **list column** of numeric vectors (one vector per row). Turn those into matrices, then join labels and feature names for modeling. A **feature manifest** is metadata you maintain alongside the export SQL — typically a small table or CSV with one row per **original wide-column** feature: name, whether it lands in the float or int pack, its index inside that pack, and how to treat NULLs. It must stay in sync with the order of `IFF(... IS NULL, 1, 0)` expressions in `null_mask_arr` (see step 2 above: `f_avg_spend`, `f_order_count`, `f_page_views`). The **null mask** does not store feature values — it stores **1 where the wide table had NULL** before `COALESCE(..., 0)`, **0 where the value was present**. Use it to put `NA` back on imputed slots, drop masked columns, or pass the mask to a model that supports missingness. ```r library(dplyr) library(tibble) X_float <- do.call(rbind, export$FEATURES_FLOAT) # nrow x 2 X_int <- do.call(rbind, export$FEATURES_INT) # nrow x 1 mask <- do.call(rbind, export$FEATURES_MASK) # nrow x 3 (same order as step 2 null_mask_arr) # Manifest for the minimal three-feature export (production: hundreds of rows) manifest <- tibble( feature = c("f_avg_spend", "f_order_count", "f_page_views"), mask_ix = 1:3, float_ix = c(1L, NA_integer_, 2L), int_ix = c(NA_integer_, 1L, NA_integer_) ) # Optional: undo COALESCE(0) on float slots that were NULL in Snowflake for (i in seq_len(nrow(manifest))) { ix <- manifest$float_ix[i] if (!is.na(ix)) { was_null <- mask[, manifest$mask_ix[i]] == 1L X_float[was_null, ix] <- NA_real_ } } training <- tibble( customer_id = export$CUSTOMER_ID, event_ts = export$EVENT_TS, label = export$LABEL, f_avg_spend = X_float[, 1], f_page_views = X_float[, 2], f_order_count = X_int[, 1] ) # training is ready for glm(), tidymodels, etc. ``` **Decode details:** [ADBC client decode](#sec-parse-adbc) · **Python variant:** [Pipeline E](#sec-pipeline-e) --- ### Pipeline C — Features already stored as ARRAY columns {#sec-pipeline-c} **When:** Feature Views (or ETL) already expose **typed `ARRAY(...)`** packs — no wide scalar table to repack. ```{mermaid} flowchart LR C1[Feature View with ARRAY cols] --> C2[ASOF join save_as] C2 --> C3[SELECT with optional ::VECTOR] C3 --> C4[ADBC collect + list to matrix] ``` **1 — Feature View SQL (ARRAY in the definition)** ::: {.panel-tabset group="fe-sql-dplyr"} ##### SQL ```sql -- Backing query for Feature View CUSTOMER_FEATURES_ARRAY v1 SELECT customer_id, ARRAY_CONSTRUCT(avg_order_total, page_views) AS features_float_arr, ARRAY_CONSTRUCT(order_count) AS features_int_arr FROM analytics.customer_rollup; ``` ##### snowflakeR ```r fv_arrays <- sfr_create_feature_view( fs, name = "CUSTOMER_FEATURES_ARRAY", version = "v1", entities = customer, feature_df = " SELECT customer_id, ARRAY_CONSTRUCT(avg_order_total, page_views) AS features_float_arr, ARRAY_CONSTRUCT(order_count) AS features_int_arr FROM analytics.customer_rollup " ) ``` ::: **2 — Point-in-time training materialization** ```r sfr_generate_training_data( fs, spine = spine_sql, features = list(fv_arrays), spine_timestamp_col = "event_ts", spine_label_cols = "label", save_as = "CHURN_TRAINING_ARRAY" ) ``` **3 — Read (R + ADBC)** With account flag `ENABLE_STRUCTURED_TYPES_IN_CLIENT_RESPONSE = TRUE`, typed ARRAY may arrive as **list** columns without JSON parsing ([ADBC setup](#sec-adbc-semistructured)). Otherwise treat as character JSON ([JSON parse](#sec-parse-json)). ::: {.panel-tabset group="fe-sql-dplyr"} ##### SQL ```sql SELECT customer_id, label, features_float_arr::VECTOR(FLOAT, 2) AS features_float, features_int_arr::VECTOR(INT, 1) AS features_int FROM MY_DB.ML.CHURN_TRAINING_ARRAY LIMIT 100000; ``` ##### dbplyr ```r options(RSnowflake.backend = "adbc") rows <- tbl(con, in_schema("ML", "CHURN_TRAINING_ARRAY")) |> mutate( features_float = sql("features_float_arr::VECTOR(FLOAT, 2)"), features_int = sql("features_int_arr::VECTOR(INT, 1)") ) |> select(customer_id, label, features_float, features_int) |> head(100000) |> collect() ``` ::: ```r X <- do.call(rbind, rows$FEATURES_FLOAT) ``` **Parquet / Dataset note:** Prefer **typed ARRAY** in storage for file export; untyped semi-structured ARRAY may serialize as JSON in Parquet ([VECTOR — Parquet](#sec-vector)). --- ### Pipeline D — Features stored as OBJECT / VARIANT {#sec-pipeline-d} **When:** One **OBJECT** (or VARIANT) column per entity row — flexible schema, sparse keys, Feature Store “wide & sparse” pattern. ```{mermaid} flowchart LR D1[OBJECT in Feature View or object_agg] --> D2[ASOF join save_as] D2 --> D3{Export strategy} D3 -->|R-friendly scalars| D4[SQL path extract in VIEW] D3 -->|Scale| D5[OBJECT to VECTOR export SQL] D3 -->|Sample only| D6[jsonlite in R] D4 --> D7[ADBC collect] D5 --> D7 ``` **1 — Feature View with OBJECT bundle** ::: {.panel-tabset group="fe-sql-dplyr"} ##### SQL ```sql -- Backing query for Feature View CUSTOMER_FEATURES_OBJECT v1 SELECT customer_id, OBJECT_CONSTRUCT( 'avg_spend', AVG(order_total), 'n_orders', COUNT(*) ) AS feature_object FROM orders GROUP BY customer_id; ``` ##### snowflakeR ```r fv_obj <- sfr_create_feature_view( fs, name = "CUSTOMER_FEATURES_OBJECT", version = "v1", entities = customer, feature_df = " SELECT customer_id, OBJECT_CONSTRUCT( 'avg_spend', AVG(order_total), 'n_orders', COUNT(*) ) AS feature_object FROM orders GROUP BY customer_id " ) ``` ##### dbplyr ```r library(dplyr) bundled <- tbl(con, in_schema("ANALYTICS", "ORDERS")) |> group_by(CUSTOMER_ID) |> summarise( feature_object = sql("OBJECT_CONSTRUCT( 'avg_spend', AVG(order_total), 'n_orders', COUNT(*) )"), .groups = "drop" ) # Register the rendered query (show_query()) as Feature View backing SQL ``` ::: Or build with dplyr + `object_agg()` ([RSnowflake Query §dbplyr](../13_rsnowflake_query/index.qmd#sec-snowflake-dbplyr)) before registration. **2 — ASOF join** ```r sfr_generate_training_data( fs, spine = spine_sql, features = list(fv_obj), spine_timestamp_col = "event_ts", spine_label_cols = "label", save_as = "CHURN_TRAINING_OBJECT" ) ``` **3a — Export for R: flatten in SQL (recommended)** ADBC does **not** turn VARIANT into native structs. Project keys in an export view or dplyr pipeline: ::: {.panel-tabset group="fe-sql-dplyr"} ##### SQL ```sql CREATE OR REPLACE VIEW MY_DB.ML.CHURN_TRAINING_OBJECT_FLAT AS SELECT customer_id, event_ts, label, feature_object:avg_spend::FLOAT AS avg_spend, feature_object:n_orders::INT AS n_orders FROM MY_DB.ML.CHURN_TRAINING_OBJECT; ``` ##### dbplyr ```r library(dplyr) training <- tbl(con, in_schema("ML", "CHURN_TRAINING_OBJECT")) |> mutate( avg_spend = sql("feature_object:avg_spend::FLOAT"), n_orders = sql("feature_object:n_orders::INT") ) |> select(customer_id, event_ts, label, avg_spend, n_orders) |> collect() ``` ::: ```r options(RSnowflake.backend = "adbc") training <- tbl(con, in_schema("ML", "CHURN_TRAINING_OBJECT_FLAT")) |> collect() # or: training <- <dbplyr pipeline above> |> collect() ``` More extract patterns: [Extract in Snowflake](#sec-sql-extract). **3b — Export at scale: OBJECT → three VECTOR packs** For hundreds of keys, generate export SQL from a **feature manifest** (path extract → encode → `ARRAY_CONSTRUCT` → `::VECTOR`) — same pattern as [Pipeline B](#sec-pipeline-b) step 2, starting from `feature_object:field::TYPE` instead of wide scalars. Then read with Pipeline B step 3 or [Pipeline E](#sec-pipeline-e). **3c — Validation sample: JSON in R** ::: {.panel-tabset group="fe-sql-dplyr"} ##### SQL ```sql SELECT customer_id, label, feature_object FROM MY_DB.ML.CHURN_TRAINING_OBJECT LIMIT 5000; ``` ##### R (DBI) ```r sample <- dbGetQuery(con, " SELECT customer_id, label, feature_object FROM MY_DB.ML.CHURN_TRAINING_OBJECT LIMIT 5000 ") feats <- vapply(sample$FEATURE_OBJECT, jsonlite::fromJSON, FUN.VALUE = numeric(0), simplify = FALSE) ``` ::: Do not use 3c for full training volume. --- ### Pipeline E — Python Arrow decode, R modeling (multilang) {#sec-pipeline-e} **When:** You use [Pipeline B](#sec-pipeline-b) or [D→B](#sec-pipeline-d) export views with three VECTOR columns and want **maximum** client throughput (NumPy zero-copy on `fixed_size_list`), then model in R. | Step | Language | Action | |------|----------|--------| | 1 | R | Feature Store registration + `save_as` / export VIEW (Pipelines B or D) | | 2 | Python | `session.sql(...)` on export VIEW → Arrow/pandas → decode mask + packs → `X`, `y` | | 3 | R | `reticulate::py_to_r()` or pass matrices only | **2 — Python: read VECTOR export, decode with manifest** Snowpark returns VECTOR columns as **NumPy arrays per row** (Arrow `fixed_size_list`). Stack them into matrices, then use the same **feature manifest** as [Pipeline B step 4](#sec-pipeline-b) — metadata that maps each wide-column feature to its index in the float/int packs and its bit in `null_mask_arr`. "Wide-table metadata" means that manifest, not anything stored inside the export table. Dimensions in the `::VECTOR(type, n)` casts must match your export VIEW (production might be 1000 + 400 + 32 mask bits; the minimal [Pipeline B](#sec-pipeline-b) demo uses 2 + 1 + 3). ```python import numpy as np # Python cell — same Snowpark session as snowflakeR df = session.sql(""" SELECT customer_id, label, features_float_arr::VECTOR(FLOAT, 2) AS features_float, features_int_arr::VECTOR(INT, 1) AS features_int, null_mask_arr::VECTOR(INT, 3) AS null_mask FROM MY_DB.ML.CHURN_TRAINING_EXPORT """).to_pandas() X_float = np.vstack(df["FEATURES_FLOAT"].to_numpy()) # (n_rows, n_float) X_int = np.vstack(df["FEATURES_INT"].to_numpy()) # (n_rows, n_int) mask = np.vstack(df["NULL_MASK"].to_numpy()) # (n_rows, n_wide_features) # Same manifest as Pipeline B step 4 (production: load from CSV / registry table) manifest = [ {"name": "f_avg_spend", "mask_ix": 0, "float_ix": 0}, {"name": "f_order_count", "mask_ix": 1, "int_ix": 0}, {"name": "f_page_views", "mask_ix": 2, "float_ix": 1}, ] for spec in manifest: ix = spec.get("float_ix") if ix is not None: was_null = mask[:, spec["mask_ix"]] == 1 X_float[was_null, ix] = np.nan # undo COALESCE(..., 0) where wide value was NULL X = np.hstack([X_float, X_int.astype(np.float64)]) # design matrix for R y = df["LABEL"].to_numpy(dtype=np.float64) # reticulate R cell below reads py$X and py$y ``` **3 — R: model on decoded matrices** ```r library(reticulate) y <- py$y X <- py_to_r(py$X) ``` **Contrast:** Default `sfr_generate_training_data()` without `save_as` does **not** run this decode path ([Feature Store reads](#sec-feature-store-read)). --- ### Shared prerequisites | Requirement | Pipelines | |-------------|-----------| | snowflakeR + `sfr_connect()` | A–E | | `sfr_generate_training_data(..., save_as = ...)` | A–E (materialize) | | `adbcsnowflake` in R (CRE/bootstrap) | A–D (ADBC reads) | | `ENABLE_STRUCTURED_TYPES_IN_CLIENT_RESPONSE` (account) | C (typed ARRAY as native Arrow) | | Python + reticulate | E | | Feature manifest for repack / decode | B, D→B, E | --- ::: {.callout-tip} ## Related chapters - [Feature Store](../17_feature_store/index.qmd) — entities, views, `sfr_generate_training_data()`, Datasets - [Authoring Feature Views](../17b_authoring_feature_views/index.qmd) — SQL/dbplyr for `feature_df` (simple → wide/temporal) - [RSnowflake: Query](../13_rsnowflake_query/index.qmd) — dbplyr, `compute()`, bulk writes - [Feature Store — Wide & sparse](../17_feature_store/index.qmd#sec-wide-sparse) — OBJECT/ARRAY in views - [RSnowflake in Workspace](../14_rsnowflake_workspace/index.qmd) — install and enable ADBC - [Model Registry](../18_model_registry/index.qmd) — `training_dataset` lineage from `sfr_generate_dataset()` :::

21.1 Overview

21.1.1 Scalars first, semi-structured when you need them

21.1.2 SQL and dbplyr tabs

21.2 Learning Objectives

21.3 How this chapter is organized

21.4 Semi-structured columns in R

21.4.1 Recommendation for Feature Store workloads

21.4.2 Why Feature Store users care

21.5 Feature Store training data in R

21.5.1 Path 1 — sfr_generate_training_data() (transient join)

21.5.2 Path 2 — sfr_generate_dataset() (versioned Dataset + lineage)

21.5.3 Path 3 — RSnowflake only (custom SQL / dbplyr)

21.6 ADBC, Arrow, and account settings

21.6.1 Enable ADBC in Workspace

21.6.2 Account parameter for native structured types

21.6.3 Inspect what you received

21.7 Reading materialized training data

21.8 Extract in Snowflake

21.8.1 OBJECT path access

21.8.2 ARRAY → scalars (known positions)

21.8.3 VECTOR packs — keep packed for modeling

21.9 ADBC: unpack native Arrow columns in R

21.9.1 VECTOR or ARRAY as list column → matrix

21.9.2 Typed ARRAY with account flag

21.9.3 Mix scalar + packed columns

21.10 SQL API: parse JSON in R

21.10.1 OBJECT column → named features per row

21.10.2 ARRAY column → numeric matrix

21.10.3 VECTOR column → numeric matrix

21.11 VECTOR columns for wide feature exports

21.11.1 Packed export layout (three columns)

21.12 Wide training sets: views, joins, and column count

21.13 End-to-end training export pipelines

21.13.1 Choose a pipeline

21.13.2 Pipeline A — Scalar features, moderate width

21.13.3 Pipeline B — Scalar-wide ASOF → persist as three ARRAYs → VECTOR read → R

21.13.4 Pipeline C — Features already stored as ARRAY columns

21.13.5 Pipeline D — Features stored as OBJECT / VARIANT

21.13.6 Pipeline E — Python Arrow decode, R modeling (multilang)

21.13.7 Shared prerequisites

21.5.1 Path 1 — `sfr_generate_training_data()` (transient join)

21.5.2 Path 2 — `sfr_generate_dataset()` (versioned Dataset + lineage)