19  Feature Store

Entities, feature views, datasets, and training data from R

Keywords

snowflake, R, RStudio, Posit, VS Code, workspace notebooks, snowflakeR, RSnowflake, mlops

19.1 Overview

Machine learning features — customer aggregates, rolling transaction sums, product embeddings — are usually built in one-off SQL scripts or notebook cells, then copied into training CSVs. That breaks down quickly:

  • Training/serving skew — batch SQL for training differs from the app query at inference time
  • Leakage — historical labels joined to features computed with future data
  • No reuse — each team re-implements the same AVG(order_total) logic
  • Weak lineage — auditors cannot trace which feature definition trained which model

The Snowflake Feature Store is a governed catalog of feature definitions: entities (join keys), versioned feature views (SQL transformations), and APIs for point-in-time-correct training joins and consistent inference retrieval. snowflakeR exposes the full Python Feature Store API from R — objects you create in R appear identically in Snowsight and in Python notebooks.

Platform deep dive (recommended)

For temporal joins, Dynamic Tables vs external views, online serving, and operations, read the Snowflake Feature Store Implementation Guide — especially Concepts, Feature Views, and Temporal Features. Examples there are Python; the objects and semantics are identical when created from R.

See also MLOps on Snowflake for where Feature Store sits in the lifecycle, Authoring Feature Views for warehouse SQL/dbplyr feature definitions, Training Data in R for reading exports (OBJECT/ARRAY/VECTOR → R), and Model Registry for logging models trained on Feature Store datasets.

19.2 Learning Objectives

  • Explain why point-in-time joins matter and how leakage happens without them
  • Describe entities, feature views, spines, and datasets
  • Register managed vs external feature views from R (SQL or dbplyr)
  • Generate training data and retrieve inference-time features
  • Connect Feature Store outputs to Model Registry lineage

19.3 The problem Feature Store solves

Consider churn modeling:

Anti-pattern What goes wrong
Join labels to today’s customer feature table Model sees future behaviour at training time — inflated metrics
Different SQL in training notebook vs production app Training/serving skew — production accuracy collapses
Ad hoc feature tables per project No shared definition of avg_order_total across teams
Export 10M rows to R with collect() Slow, expensive, breaks governance

Feature Store centralizes definitions and enforces temporal correctness when building training sets.

19.3.1 Training vs inference (different operations)

Phase Question Feature Store API (R)
Training “What did we know about entity X at time T?” sfr_generate_training_data() / sfr_generate_dataset()
Inference (latest) “What do we know now about entity X?” sfr_retrieve_features() without spine_timestamp_col, or OFT / online read
Inference (as-of batch) “What did we know at each row’s timestamp?” (incremental backfill, replay scoring) sfr_retrieve_features(..., spine_timestamp_col = ...) — same ASOF logic as training, without labels

Training uses a spine (entity keys + event timestamps + labels). Inference often uses latest feature values, but batch pipelines can pass a timestamp spine for point-in-time-correct feature lookup on the offline store.


19.4 Architecture

A Feature View connects tabular sources in Snowflake to registered metadata and a materialized offline store (table or Dynamic Table). The feature_df you pass at registration time can be as simple or as complex as your pipeline requires.

19.4.1 Sources and definitions

Layer Options
Source data Any tabular object — table, view, Dynamic Table, dbt model output, stage-backed table, etc.
feature_df definition Pass-throughSELECT * or column subset from an externally maintained table (external Feature View). Transform — aggregations, joins, windows, bundles entirely in SQL or dbplyr (managed Feature View with refresh_freq). Hybrid — read from a maintained table and enrich with SQL (e.g. keys from ETL + OBJECT_CONSTRUCT in the same query).
Registration mode External — no refresh_freq; you own refresh. Managedrefresh_freq → platform maintains backing Dynamic Table (or equivalent).

See Authoring Feature Views for SQL/dbplyr patterns from simple aggregates through rollup tiles and OBJECT/ARRAY bundles.

flowchart TB
  subgraph sources [Tabular sources in Snowflake]
    direction LR
    TBL[(Table)]
    VW[View]
    DTs[(Dynamic Table)]
    TBL ~~~ VW ~~~ DTs
  end

  subgraph define [feature_df at registration]
    direction TB
    PASS[Pass-through SELECT from source]
    XFORM[Transform SQL or dbplyr aggregates joins bundles]
    HYBRID[Hybrid source table plus SQL enrich]
  end

  subgraph store [Feature Store offline]
    FV[Feature View versioned metadata]
    OFF[(Offline materialization table or DT)]
    FV --> OFF
  end

  subgraph consume [Consumption paths]
    direction TB
    TRAIN[sfr_generate_training_data spine plus ASOF]
    INFER_L[sfr_retrieve_features latest keys]
    INFER_T[sfr_retrieve_features timestamp spine ASOF]
    ONLN[Online Feature Table key lookup]
  end

  sources --> define
  PASS --> FV
  XFORM --> FV
  HYBRID --> FV
  OFF --> TRAIN
  OFF --> INFER_L
  OFF --> INFER_T
  FV --> ONLN

Compute note: Feature view registration and refresh use the warehouse you pass to sfr_feature_store(). Feature data materializes in Snowflake — not in your R session until you explicitly read or generate training data.


19.5 Core concepts

Concept Definition Think of it as
Feature Store Dedicated schema/database area for entities, views, metadata Catalog + governance layer
Entity Business object + join keys (CUSTOMER_ID) “How features attach to real-world things”
Feature View Versioned SQL (or dbplyr) transformation producing features for entity(ies) Reusable feature definition
Managed view Has refresh_freq → Snowflake maintains backing Dynamic Table Auto-refreshed pipeline
External view No refresh_freq → you maintain source table (dbt, ETL) Registered SQL over existing table
Spine Rows: entity keys + event timestamp + optional labels “One training example per row”
Point-in-time join For each spine row, features as-of that timestamp Prevents future leakage
Dataset Immutable, versioned snapshot of joined training data Reproducible input to sfr_log_model()

19.5.1 Point-in-time join — concrete example

Wrong (leakage):

-- Labels from 2024, features computed TODAY
SELECT l.customer_id, l.churned, f.avg_order_total
FROM churn_labels l
JOIN customer_features_today f ON l.customer_id = f.customer_id

If avg_order_total includes orders placed after the churn event, the model cheats.

Right (Feature Store):

For each row in the spine (customer_id, event_ts, churned), the store joins feature values known at event_ts — orders after that timestamp are excluded automatically.

sequenceDiagram
  participant SP as Spine row C001 event_ts=T
  participant FS as Feature Store
  participant FV as CUSTOMER_ACTIVITY view

  SP->>FS: Need features for C001 at time T
  FS->>FV: Point-in-time filter event_ts <= T
  FV-->>FS: page_views clicks as of T
  FS-->>SP: Feature row attached to label


19.6 Setup

Connect to (or create) a Feature Store in a dedicated schema:

library(snowflakeR)
conn <- sfr_connect()
conn <- sfr_load_notebook_config(conn)

fs <- sfr_feature_store(
  conn,
  database  = "ML_DB",
  schema    = "FEATURES",
  warehouse = "ML_WH",
  create    = TRUE   # creates schema + tags if missing
)

First-time admin (roles, grants, tags):

sfr_setup_feature_store(conn, "ML_DB", "FEATURES", "ML_WH")

Requires privileges per Feature Store RBAC. Data engineers often run setup once; data scientists use sfr_feature_store(..., create = FALSE) against an existing schema.


19.7 Workflow phases

Step Action Typical owner
1 Define entities (join keys) ML engineer + domain expert
2 Author feature view SQL or dbplyr ML engineer
3 Register view with version sfr_create_feature_view()
4 Build spine from labels table Same or analytics team
5 Generate training data or Dataset sfr_generate_*()
6 Train in R (tidymodels, glm, …) Data scientist
7 Log model with training_dataset Model Registry

19.8 Entities

Entities define how features join to business objects. Join keys are immutable after creation — plan compound keys carefully.

customer <- sfr_create_entity(
  fs, "CUSTOMER", "CUSTOMER_ID",
  desc = "Customer entity"
)

product <- sfr_create_entity(
  fs, "PRODUCT", "PRODUCT_ID",
  desc = "Product entity"
)

sfr_list_entities(fs)
sfr_get_entity(fs, "CUSTOMER")

Multi-key entities (e.g. store + SKU) pass multiple columns — see vignette feature-store.


19.9 Feature views

A feature view is a versioned transformation tied to one or more entities. The feature_df argument accepts:

  • SQL string — any SELECT over tabular sources (table, view, Dynamic Table, …)
  • dbplyr lazy table — dplyr pipeline translated to SQL (RSnowflake)

The backing query may pass through an externally maintained table, compute features entirely in SQL/dbplyr, or combine both (hybrid). See Architecture — Sources and definitions and Authoring Feature Views.

19.9.1 Managed views (automatic refresh)

When you set refresh_freq, Snowflake registers a Dynamic Table (or equivalent managed object) that refreshes on schedule:

fv <- sfr_create_feature_view(
  fs,
  name         = "CUSTOMER_FEATURES",
  version      = "v1",
  entities     = customer,
  feature_df   = "
    SELECT customer_id,
           AVG(order_total) AS avg_order_total,
           COUNT(*) AS order_count,
           MAX(order_date) AS last_order_date
    FROM orders
    GROUP BY customer_id
  ",
  refresh_freq = "1 hour",
  desc         = "Customer aggregates from orders"
)

Use managed views when Feature Store should own refresh — aggregates rebuilt hourly from raw orders.

19.9.2 External views (you maintain the table)

Omit refresh_freq when dbt, Dynamic Tables elsewhere, or ETL already maintains the physical table:

fv_external <- sfr_create_feature_view(
  fs,
  name       = "CUSTOMER_DEMOGRAPHICS",
  version    = "v1",
  entities   = customer,
  feature_df = "SELECT customer_id, age_band, region FROM analytics.customer_demo"
)

Feature Store registers metadata and lineage; refresh is your pipeline’s job. An external process maintains the contents of analytics.customer_demo, and Feature Store will use the latest values from that table, or if defined and maintained with a timestamp column, the values as-of the timestamp. When a timestamp is defined it is critical that the external process maintains the timestamp column and that it is updated with the current timestamp when the data is updated to avoid future leakage.

19.9.3 Draft then register (two-step)

Inspect locally before materializing — mirrors Python API:

fv_draft <- sfr_feature_view(
  name         = "CUSTOMER_FEATURES",
  entities     = customer,
  feature_df   = "SELECT customer_id, AVG(order_total) AS avg_order_total FROM orders GROUP BY 1",
  refresh_freq = "1 hour"
)

fv <- sfr_register_feature_view(fs, fv_draft, version = "v1")

19.9.4 dbplyr as feature_df

Keep feature logic in R dplyr while SQL runs on the warehouse:

library(dplyr)
dbi_con <- sfr_dbi_connection(conn)

customer_features_query <- tbl(dbi_con, "ORDERS") |>
  group_by(customer_id) |>
  summarise(
    avg_order_total = mean(order_total, na.rm = TRUE),
    order_count     = n(),
    .groups         = "drop"
  )

fv <- sfr_create_feature_view(
  fs,
  name       = "CUSTOMER_FEATURES_DBPLYR",
  version    = "v1",
  entities   = customer,
  feature_df = customer_features_query
)

snowflakeR translates the lazy table to SQL — no collect() before registration.

19.9.5 Wide and sparse feature sets

When a feature view would need hundreds of scalar columns, prefer OBJECT or ARRAY bundles (and, for training export at scale, ARRAY → ::VECTOR columns) instead of a wide table.

Why bundles help:

  • Schema evolution — add feature keys or array slots without a new column per feature (especially sparse or slowly growing catalogs).
  • Scale — fewer columns for the planner and for cross-view joins; optional flatten only at export time.
  • Hybrid layouts — keep join keys and timestamps as scalars; pack high-cardinality or evolving payloads into OBJECT (Implementation Guide §12.10.4).

Snowflake documents platform patterns in Implementation Guide §12.10 — Wide & sparse. Authoring detail: Authoring Feature Views — Wide and sparse bundles.

From R, build bundles on the warehouse with RSnowflake-only dplyr helpers such as object_agg() and array_agg() — see RSnowflake Query — dbplyr and Authoring Feature Views.

For end-to-end training export (scalar-wide ASOF → three ARRAY/VECTOR packs, ARRAY-native views, OBJECT flattening, Python decode), use the pipeline table in Training Data in R — End-to-end pipelines (Pipelines A–E).

19.9.6 Time-series (event-level) features

When features change over time (not one row per entity), set timestamp_col:

fv_ts <- sfr_create_feature_view(
  fs,
  name          = "CUSTOMER_ACTIVITY",
  version       = "v1",
  entities      = customer,
  feature_df    = "SELECT customer_id, event_ts, page_views, clicks FROM activity",
  timestamp_col = "event_ts",
  refresh_freq  = "30 minutes"
)

Required for correct point-in-time joins — the store uses timestamps to pick feature values as-of each spine row.

19.9.7 Rollup and tiled aggregates

For rollup Feature Views over event streams (platform-managed tiles and lookback windows — e.g. sum of amount over 1 day), declare aggregates with sfr_feature() and set feature_granularity instead of writing every window in SQL. See Authoring Feature Views — Rollup and tiled Feature Views and the feature-store vignette.


19.10 Managing views

sfr_list_feature_views(fs)
sfr_get_feature_view(fs, "CUSTOMER_FEATURES", "v1")
sfr_show_fv_versions(fs, "CUSTOMER_FEATURES")

feature_data <- sfr_read_feature_view(fs, "CUSTOMER_FEATURES", "v1")

sfr_refresh_feature_view(fs, "CUSTOMER_FEATURES", "v1")
sfr_get_refresh_history(fs, "CUSTOMER_FEATURES", "v1")
sfr_suspend_feature_view(fs, "CUSTOMER_FEATURES", "v1")
sfr_resume_feature_view(fs, "CUSTOMER_FEATURES", "v1")

Refresh history shows Dynamic Table run status — use it when training data looks stale. Suspend/resume pauses automatic refresh during pipeline migrations.


19.11 Generate training data

The spine drives training. Each row needs entity key(s), an event timestamp, and label column(s):

training_data <- sfr_generate_training_data(
  fs,
  spine = "
    SELECT customer_id, churn_date AS event_ts, churned AS label
    FROM churn_labels
    WHERE churn_date >= '2024-01-01'
  ",
  features = list(
    list(name = "CUSTOMER_FEATURES", version = "v1"),
    list(name = "CUSTOMER_ACTIVITY", version = "v1")
  ),
  spine_timestamp_col = "event_ts",
  spine_label_cols    = "label"
)

Or pass registered view objects:

fv1 <- sfr_get_feature_view(fs, "CUSTOMER_FEATURES", "v1")
fv2 <- sfr_get_feature_view(fs, "CUSTOMER_ACTIVITY", "v1")

training_data <- sfr_generate_training_data(
  fs,
  spine    = "SELECT customer_id, churn_date AS event_ts, churned FROM churn_labels",
  features = list(fv1, fv2),
  spine_timestamp_col = "event_ts",
  spine_label_cols    = "churned"
)

Materialize to a table for reuse:

training_data <- sfr_generate_training_data(
  fs,
  spine    = "...",
  features = list(fv1, fv2),
  save_as  = "CHURN_TRAINING_SET_V1"
)

19.11.1 Datasets (immutable snapshots + lineage)

For Model Registry lineage (visible in Snowsight as Source → Feature View → Dataset → Model):

training <- sfr_generate_dataset(
  fs,
  name     = "CHURN_TRAINING",
  spine    = "SELECT customer_id, label FROM labels",
  features = list(list(name = "CUSTOMER_FEATURES", version = "v1")),
  version  = "v1",
  spine_label_cols = "label"
)

model <- glm(label ~ ., data = training, family = binomial)

mv <- sfr_log_model(
  reg,
  model            = model,
  model_name       = "CHURN_MODEL",
  training_dataset = training,
  input_cols       = sfr_input_cols(training, exclude = "label"),
  output_cols      = list(prediction = "double")
)

See Model Registry — lineage.


19.12 Inference-time features (offline store)

For batch scoring against offline Feature Views, use sfr_retrieve_features() to join a spine to feature values.

19.12.1 Latest values (no timestamp spine)

When each scoring row needs current offline features, pass entity keys only — omit spine_timestamp_col:

features_for_scoring <- sfr_retrieve_features(
  fs,
  spine    = "SELECT customer_id FROM customers_to_score",
  features = list(fv1, fv2)
)

19.12.2 Point-in-time batch inference (timestamp spine)

For incremental batch inference, backfills, or replay scoring where each row has its own as-of time, pass spine_timestamp_col — the same ASOF join logic as training, without labels:

features_for_scoring <- sfr_retrieve_features(
  fs,
  spine = "
    SELECT customer_id, score_as_of_ts
    FROM incremental_score_queue
  ",
  features            = list(fv1, fv2),
  spine_timestamp_col = "score_as_of_ts"
)

Use this when the scoring queue carries an event or decision timestamp and features must match what was knowable at that time — not necessarily “latest” across the whole table.

Pass results to predict() locally or to a batch scoring pipeline. Align column names with input_cols declared at model registration.

For sub-minute, key-based serving of latest values, use the online Hybrid Table — Online serving (sfr_read_feature_view(..., store_type = "ONLINE") or SQL/dbplyr on the OFT).


19.13 Online serving

Online Feature Tables (OFTs) expose the latest feature values per entity key on a Hybrid Table, refreshed from the offline Feature View on a schedule driven by target_lag (Stream + Task + MERGE). They are for low-latency inference, not point-in-time training — use offline views and Inference-time features / sfr_generate_training_data() for history.

Requires snowflake-ml-python ≥ 1.18.0 and a platform edition that supports online serving. Platform detail: Implementation Guide — Online Features.

19.13.1 Enable online on a Feature View

fv_online <- sfr_create_feature_view(
  fs,
  name          = "CUSTOMER_ONLINE",
  version       = "v1",
  entities      = customer,
  feature_df    = "SELECT customer_id, credit_score FROM credit_data",
  refresh_freq  = "5 minutes",
  online_config = sfr_online_config(enable = TRUE, target_lag = "2 minutes")
)

# Or add online to an existing view
sfr_update_feature_view(
  fs, "CUSTOMER_FEATURES", "v1",
  online_config = sfr_online_config(enable = TRUE, target_lag = "5 minutes")
)

Snowflake creates the OFT (Hybrid Table + sync Task). target_lag on the online config is independent of offline refresh_freq — end-to-end staleness is roughly both paths combined.

19.13.2 Retrieve values — snowflakeR → snowflake-ml-python

Use sfr_read_feature_view() with store_type = "ONLINE" (wraps fs.read_feature_view(..., StoreType.ONLINE)). Pass entity keys and optional feature column names; batch keys in one call when scoring many entities.

# Single customer lookup
online_row <- sfr_read_feature_view(
  fs,
  name          = "CUSTOMER_ONLINE",
  version       = "v1",
  store_type    = "ONLINE",
  keys          = list(CUSTOMER_ID = "C001"),
  feature_names = c("CREDIT_SCORE")
)

# Batch: many keys in one round trip (preferred over a loop)
ids <- c("C001", "C002", "C003")
online_batch <- sfr_read_feature_view(
  fs,
  name          = "CUSTOMER_ONLINE",
  version       = "v1",
  store_type    = "ONLINE",
  keys          = list(CUSTOMER_ID = ids),
  feature_names = c("CREDIT_SCORE")
)
API Store Use when
sfr_read_feature_view(..., store_type = "ONLINE") Hybrid Table (OFT) Real-time lookup by entity key(s)
sfr_read_feature_view(..., store_type = "OFFLINE") Offline DT / table Inspect materialized offline features
sfr_retrieve_features() Offline Feature Views Join spine to features — latest (no timestamp) or ASOF (spine_timestamp_col) — not the OFT API

For multiple Feature Views at inference, the platform typically does one OFT read per view (or a pre-joined Feature View registered for online) — see the Implementation Guide — multi-FV online reads.

Operational helpers:

sfr_refresh_feature_view(fs, "CUSTOMER_ONLINE", "v1", store_type = "ONLINE")
sfr_get_refresh_history(fs, "CUSTOMER_ONLINE", "v1", store_type = "ONLINE")

19.13.3 Retrieve values — SQL or dbplyr on the Online Feature Table

The OFT is a queryable Hybrid Table in your Feature Store schema (naming is managed by the platform — often a *_OFT suffix). Discover the exact name in Snowsight (Feature Store UI) or metadata commands.

Prefer the Feature Store API for serving

For low-latency key lookup, sfr_read_feature_view(..., store_type = "ONLINE") is the supported path (batch keys in one call). Direct SQL or dbplyr against the OFT is useful for warehouse-side joins, debugging, or services that already speak RSnowflake — not as a substitute for the online API in high-QPS serving tiers.

-- Illustrative: replace with your FS database.schema and OFT name
SELECT customer_id, credit_score
FROM MY_ML_DB.FEATURE_STORE.CUSTOMER_ONLINE_OFT
WHERE customer_id IN ('C001', 'C002', 'C003');
library(dplyr)

oft <- tbl(con, in_schema("FEATURE_STORE", "CUSTOMER_ONLINE_OFT")) |>
  filter(CUSTOMER_ID %in% c("C001", "C002", "C003")) |>
  select(CUSTOMER_ID, CREDIT_SCORE)

features <- oft |> collect()
library(DBI)
library(RSnowflake)

con <- dbConnect(Snowflake())
features <- dbGetQuery(con, "
  SELECT customer_id, credit_score
  FROM MY_ML_DB.FEATURE_STORE.CUSTOMER_ONLINE_OFT
  WHERE customer_id = ?
", params = list("C001"))

Use bind parameters (?) for repeated lookups so Snowflake can reuse plans (Implementation Guide — parameter binding).

Offline vs online retrieval
  • Training / point-in-time: spine + sfr_generate_training_data() or sfr_retrieve_features() with spine_timestamp_coloffline store only.
  • Batch inference (latest): sfr_retrieve_features() without spine_timestamp_col → current offline values per key.
  • Batch inference (ASOF): sfr_retrieve_features(..., spine_timestamp_col = ...) → features as-of each spine row (Inference-time features).
  • Real-time scoring: sfr_read_feature_view(..., ONLINE) or SQL/dbplyr on the OFT → latest values per key, bounded by target_lag.

19.14 Common pitfalls

Issue Mitigation
Leakage Always use spine + spine_timestamp_col; never join “current” features to historical labels
Column case mismatch Snowflake UPPERCASE vs R lowercase — align in SQL and input_cols
Missing warehouse Pass warehouse= to sfr_feature_store()
Huge collect() before FS Register views in SQL/dbplyr; let platform materialize
External view stale Monitor your dbt/ETL — Feature Store won’t refresh for you
Wrong entity keys Keys in spine must match entity definition exactly

19.15 Companion artifacts

Resource Content
Vignette feature-store Full API, tile aggregation, advanced options
workspace_feature_store.ipynb Runnable Workspace walkthrough
FS Implementation Guide Platform authority

19.16 Next steps

Authoring Feature Views — SQL and dbplyr patterns for feature_df (simple aggregates through wide/temporal).

Training Data in R — materialize and read training exports into R.

Model Registry — log models with lineage from Feature Store datasets.

End-to-end pipeline