20 Authoring Feature Views
SQL and dbplyr patterns from simple aggregates to wide and temporal features
snowflake, R, RStudio, Posit, VS Code, workspace notebooks, snowflakeR, RSnowflake, mlops
20.1 Overview
This chapter covers defining features on the Snowflake warehouse and registering them as Feature Views from R — the step before materializing training data or reading into R.
The snowflakeR Feature Store trilogy in this guide:
| Chapter | Question |
|---|---|
| Feature Store | What is the store, and how do I register entities, join spines, and generate training? |
| Authoring Feature Views (this chapter) | How do I write the SQL or dbplyr that produces features? |
| Training Data in R | After save_as / export, how do I get OBJECT/ARRAY/VECTOR into R efficiently? |
Sources can be any tabular object (table, view, Dynamic Table, dbt output). The feature_df you register may pass through that source, transform it in SQL/dbplyr, or combine both — see Feature Store — Architecture.
Platform background on hybrid OBJECT payloads, compile cost, and sparse arrays: Implementation Guide — Advanced patterns §12.10.
20.1.1 SQL and dbplyr tabs
Examples use SQL and dbplyr tabs (group="fe-sql-dplyr") — the same pattern as RSnowflake: Query and Training Data in R. Use dbplyr when you already work in R with tbl(); use SQL when you share DDL with other teams or generate views from a manifest.
20.2 Learning Objectives
- Choose managed vs external Feature Views and draft-then-register vs one-step creation
- Author entity-level aggregates in SQL and dbplyr as
feature_df - Add event-level features with
timestamp_colfor point-in-time joins - Build rollup Feature Views with
sfr_feature()andfeature_granularity - Explain when OBJECT / ARRAY bundles beat wide scalar columns
- Compose multi-step feature SQL with dbplyr
compute() - Validate lazy pipelines before registration
20.3 Prerequisites
- Feature Store — Setup and at least one entity
- RSnowflake + dbplyr (RSnowflake: Query)
- Optional:
workspace_feature_store.ipynb
20.4 Progression: simple → complex
| Stage | Feature shape | Registration hook | Section |
|---|---|---|---|
| 1 | One row per entity; scalar columns | SQL or dbplyr summarise() |
Entity aggregates |
| 2 | You maintain the table (dbt/ETL) | External view, no refresh_freq |
External views |
| 3 | One row per entity per event time | timestamp_col |
Event-level features |
| 4 | Rollup / tiled aggregates over event time | features, feature_granularity |
Rollup and tiled Feature Views |
| 5 | Multi-step rollups, joins, windows | dbplyr + compute() |
Multi-step pipelines |
| 6 | Hundreds of features | OBJECT / ARRAY / typed packs | Wide and sparse bundles |
After registration: Generate training data → Training Data in R.
20.5 Registration patterns
| Pattern | API | When |
|---|---|---|
| One-step | sfr_create_feature_view(fs, ..., version = "v1") |
Fastest path; draft + register in one call |
| Draft then register | sfr_feature_view() → sfr_register_feature_view() |
Inspect draft metadata before materializing |
| Managed | refresh_freq = "1 hour" (etc.) |
Feature Store maintains backing Dynamic Table |
| External | omit refresh_freq |
dbt/ETL owns the physical table; FS registers metadata |
Both feature_df forms work everywhere below:
- SQL string — passed to Snowflake as-is
- dbplyr lazy table — translated to SQL at registration (no
collect()first)
20.6 Entity-level aggregates
One feature row per entity — classic customer rollups: AVG(order_total), COUNT(*), MAX(order_date). Entity keys must match your entity definition.
20.6.1 Managed view — SQL
fv <- sfr_create_feature_view(
fs,
name = "CUSTOMER_FEATURES",
version = "v1",
entities = customer,
feature_df = "
SELECT customer_id,
AVG(order_total) AS avg_order_total,
COUNT(*) AS order_count,
MAX(order_date) AS last_order_date
FROM analytics.orders
GROUP BY customer_id
",
refresh_freq = "1 hour",
desc = "Customer order aggregates"
)customer_features <- tbl(con, in_schema("ANALYTICS", "ORDERS")) |>
group_by(CUSTOMER_ID) |>
summarise(
avg_order_total = mean(ORDER_TOTAL, na.rm = TRUE),
order_count = n(),
last_order_date = max(ORDER_DATE, na.rm = TRUE),
.groups = "drop"
)
fv <- sfr_create_feature_view(
fs,
name = "CUSTOMER_FEATURES_DBPLYR",
version = "v1",
entities = customer,
feature_df = customer_features,
refresh_freq = "1 hour"
)20.6.2 Draft then register
Useful when you want to inspect the draft object before Snowflake materializes the backing table:
fv_draft <- sfr_feature_view(
name = "CUSTOMER_FEATURES",
entities = customer,
feature_df = "
SELECT customer_id, AVG(order_total) AS avg_order_total
FROM analytics.orders GROUP BY 1
",
refresh_freq = "1 hour"
)
fv <- sfr_register_feature_view(fs, fv_draft, version = "v1")Preview generated SQL from dbplyr before registering:
20.7 External views
When dbt, a standalone Dynamic Table, or another ETL job already maintains the feature table, register a pass-through Feature View — Feature Store holds metadata and lineage; you own refresh.
Timestamp on external tables: if rows can change over time, set timestamp_col to a column your ETL updates whenever feature values change. Stale timestamps cause future leakage at ASOF join time — the store thinks old values were valid longer than they were.
Hybrid: pass-through from a maintained table plus SQL enrich in the same feature_df:
SELECT
d.customer_id,
d.region,
OBJECT_CONSTRUCT(
'avg_spend', o.avg_order_total,
'n_orders', o.order_count
) AS feature_object
FROM analytics.customer_demographics d
JOIN analytics.customer_order_rollups o USING (customer_id);Register with timestamp_col on the driving table if either side is time-varying.
20.8 Event-level features
When features change over time (activity logs, sessions, transactions as events), the offline store holds one row per entity per timestamp — not one row per entity. Set timestamp_col so training and inference ASOF joins pick values known at each spine row.
| Shape | Rows per customer | timestamp_col |
Training join |
|---|---|---|---|
| Entity aggregate | 1 | optional / NULL | Latest or snapshot semantics |
| Event-level | many | required | ASOF on spine_timestamp_col |
Spine alignment: training spines must use the same timestamp semantics — spine_timestamp_col in sfr_generate_training_data() matches the view’s timestamp_col. See Feature Store — Point-in-time join.
For advanced temporal patterns (lags, session windows, sequence features), see the Implementation Guide — Temporal features.
20.9 Rollup and tiled Feature Views
Rollup Feature Views let you declare aggregates over lookback windows on event-level source data. Snowflake tiles events at feature_granularity and maintains rollups on refresh — you do not hand-write a separate window function for every feature.
Use when you need rolling SUM / AVG / COUNT over fixed windows (e.g. 1 hour, 1 day, 7 days) on a stream of events.
20.9.1 Source data
Rollups require an event-level feature_df with timestamp_col — typically raw or lightly filtered events:
20.9.2 Declare rollup features
Each sfr_feature() names a source column, dtype, aggregation, and lookback window:
rollup_feats <- list(
sfr_feature("AMOUNT", "FLOAT", agg = "SUM", window = "1 hour"),
sfr_feature("AMOUNT", "FLOAT", agg = "AVG", window = "1 day"),
sfr_feature("AMOUNT", "FLOAT", agg = "SUM", window = "7 day")
)
fv_txn_rollup <- sfr_create_feature_view(
fs,
name = "TXN_ROLLUPS",
version = "v1",
entities = customer,
feature_df = "
SELECT customer_id, event_ts, amount
FROM analytics.transactions
",
features = rollup_feats,
feature_granularity = "1 DAY",
timestamp_col = "event_ts",
refresh_freq = "1 hour"
)| Argument | Role |
|---|---|
features |
List of sfr_feature() — column, dtype, agg func, window string |
feature_granularity |
Tile size for the rollup engine (e.g. "1 DAY", "1 HOUR") |
timestamp_col |
Event time column in source data |
refresh_freq |
Managed refresh schedule |
20.9.3 Rollups vs hand-written windows
| Approach | Best for |
|---|---|
Rollup Feature View (sfr_feature + feature_granularity) |
Standard rolling agg/count over fixed windows on event streams |
SQL OVER (...) / dbplyr windows |
Custom frame bounds, ROW_NUMBER, joins, non-standard logic — Multi-step pipelines, RSnowflake Query — windows |
Entity-level GROUP BY |
Lifetime aggregates with no time dimension — Entity aggregates |
20.10 Multi-step feature pipelines
Complex features often need named intermediate tables — daily rollups before 90-day sums, joins across sources, or window filters. Build the pipeline with dbplyr + compute() on the warehouse, then pass the final lazy table as feature_df.
CREATE TEMPORARY TABLE daily_orders_tmp AS
SELECT
customer_id,
order_date,
COUNT(*) AS n_orders,
SUM(amount) AS revenue
FROM analytics.orders
WHERE order_date >= '2024-01-01'
GROUP BY customer_id, order_date;
CREATE TEMPORARY TABLE customer_90d_tmp AS
SELECT
customer_id,
SUM(n_orders) AS orders_90d,
SUM(revenue) AS revenue_90d,
SUM(revenue) / NULLIF(SUM(n_orders), 0) AS avg_order_90d
FROM daily_orders_tmp
GROUP BY customer_id;
-- Register customer_90d_tmp logic as feature_df (permanent view in practice)
SELECT * FROM customer_90d_tmp;orders <- tbl(con, in_schema("ANALYTICS", "ORDERS")) |>
filter(ORDER_DATE >= "2024-01-01")
daily <- orders |>
group_by(CUSTOMER_ID, ORDER_DATE) |>
summarise(
n_orders = n(),
revenue = sum(AMOUNT, na.rm = TRUE),
.groups = "drop"
) |>
compute(name = "daily_orders_tmp", temporary = TRUE)
customer_90d <- daily |>
group_by(CUSTOMER_ID) |>
summarise(
orders_90d = sum(n_orders),
revenue_90d = sum(revenue),
avg_order_90d = sum(revenue) / sum(n_orders),
.groups = "drop"
)
customer_90d |> show_query()
fv_90d <- sfr_create_feature_view(
fs,
name = "CUSTOMER_90D_ROLLUPS",
version = "v1",
entities = customer,
feature_df = customer_90d,
refresh_freq = "1 hour"
)For production, prefer permanent tables or views (temporary = FALSE, fully qualified names) when the intermediate must survive beyond the session. See RSnowflake Query — Multi-step ETL.
20.11 Wide and sparse bundles
When feature count reaches hundreds or thousands, wide scalar Feature Views become expensive to compile, join, and export. Pack features into OBJECT or typed ARRAY columns; keep entity keys (and usually timestamp_col) as scalars.
20.11.1 Why bundles vs wide scalars
| Concern | Many scalar columns | OBJECT / ARRAY bundle |
|---|---|---|
| Schema evolution | New column per feature | New OBJECT key or ARRAY slot; sparse rows omit keys |
| SQL compile / planner | One expression per feature | Few physical columns |
| Multi–Feature View joins | Width grows with auto_prefix |
Pack per view; flatten at export |
| Embeddings | Impractical | ARRAY / VECTOR packs |
Bundles are an authoring choice. Reads may flatten in SQL (Extract in Snowflake) or stay packed (Pipelines B–C).
20.11.2 Fixed keys — OBJECT_CONSTRUCT
When feature names are known at registration time:
fv_obj <- sfr_create_feature_view(
fs,
name = "CUSTOMER_FEATURES_OBJECT",
version = "v1",
entities = customer,
feature_df = "
SELECT customer_id,
OBJECT_CONSTRUCT(
'avg_spend', AVG(order_total),
'n_orders', COUNT(*)
) AS feature_object
FROM analytics.orders
GROUP BY customer_id
",
refresh_freq = "1 hour"
)bundled <- tbl(con, in_schema("ANALYTICS", "ORDERS")) |>
group_by(CUSTOMER_ID) |>
summarise(
feature_object = sql("OBJECT_CONSTRUCT(
'avg_spend', AVG(order_total),
'n_orders', COUNT(*)
)"),
.groups = "drop"
)
fv_obj <- sfr_create_feature_view(
fs,
name = "CUSTOMER_FEATURES_OBJECT",
version = "v1",
entities = customer,
feature_df = bundled,
refresh_freq = "1 hour"
)20.11.3 Dynamic keys — OBJECT_AGG from long/narrow source
When features arrive as name/value pairs (one row per entity per feature):
See RSnowflake Query — OBJECT bundles for object_agg() / array_agg() details.
20.11.4 Hybrid: scalar keys + OBJECT payload
Keep join keys and timestamps as columns; pack evolving features into OBJECT (Implementation Guide §12.10.4):
SELECT
customer_id,
event_ts,
OBJECT_CONSTRUCT(
'page_views', page_views,
'clicks', clicks
) AS activity_object
FROM analytics.customer_activity;Register with timestamp_col = "event_ts" for ASOF training.
20.11.5 Typed ARRAY packs (embeddings)
For fixed-length numeric vectors (embeddings, slot-based features):
Export and decode paths: Training Data in R — Pipeline C.
20.11.6 Feature manifest
Maintain a small manifest (CSV or table) mapping OBJECT keys or ARRAY indices to model input names — used when flattening for training (Training Data in R — Pipeline D) or when building VECTOR export SQL (Pipeline B).
20.12 Validate before register
Catch schema and key issues before Snowflake materializes a managed Dynamic Table.
20.12.1 Preview lazy SQL
20.12.2 Entity key uniqueness
Entity-level views should have at most one row per entity key (unless using event-level + timestamp_col):
Non-empty results mean duplicate keys — fix GROUP BY or add timestamp_col for event grain.
20.12.3 After registration
20.12.4 Multi-view training joins
When several views join into one training set, enable auto_prefix = TRUE in sfr_generate_training_data() to avoid column collisions — plan short, distinct feature column names at authoring time where possible.