flowchart TB PY[Snowpark Python DataFrame lazy ops] --> PLAN[SQL query plan] PLAN --> REG[Feature View registration] REG --> CDT[CREATE DYNAMIC TABLE from generated SQL] CDT --> RUN[Refresh cycles execute SQL only]
Appendix D — Snowpark to Dynamic Table
How Python features become SQL Dynamic Tables
snowflake, feature store, ml, machine learning, mlops
D.1 Overview
A common question when evaluating Snowflake Feature Store is: “How does Python feature logic actually run inside a Dynamic Table?” This appendix explains the end-to-end path from Snowpark DataFrame to Dynamic Table, the limitations that arise from that translation, and optimization strategies for production pipelines.
Examples that need a database and schema use the same canonical names as the rest of this guide: database FEATURE_STORE_DEMO, Feature Store schema FEATURE_STORE, source schema CLICKSTREAM_DATA, warehouse FS_DEV_WH, and Feature View versions in V01 format (dynamic table suffix $V01).
D.2 The Translation Path
When you define feature transformations in Python using Snowpark DataFrames, no Python code ships to the Dynamic Table. The pipeline works as follows:
D.2.1 Step 1 – Snowpark DataFrame (Lazy Evaluation)
Snowpark DataFrames are lazily evaluated. Each operation (filter, select, join, group_by, agg, etc.) appends to a logical query plan but does not execute immediately. The Python syntax is a thin wrapper around SQL semantics.
from snowflake.ml.feature_store import FeatureView, Entity
import snowflake.snowpark.functions as F
user_purchase_df = (
session.table("FEATURE_STORE_DEMO.CLICKSTREAM_DATA.ORDERS")
.group_by("USER_ID")
.agg(
F.count_distinct("ORDER_ID").alias("ORDER_CNT"),
F.sum("AMOUNT").alias("TOTAL_SPEND"),
F.avg("AMOUNT").alias("AVG_ORDER_AMT"),
F.max("ORDER_TS").alias("LAST_ORDER_TS"),
)
)D.2.2 Step 2 – Query Pushdown (Python-to-SQL Translation)
When the DataFrame is consumed – either by an action like collect() or by passing it to a Feature View – Snowpark translates the entire chain of operations into a single optimized SQL statement. You can inspect the generated SQL at any time:
This produces the equivalent:
SELECT USER_ID,
COUNT(DISTINCT ORDER_ID) AS ORDER_CNT,
SUM(AMOUNT) AS TOTAL_SPEND,
AVG(AMOUNT) AS AVG_ORDER_AMT,
MAX(ORDER_TS) AS LAST_ORDER_TS
FROM FEATURE_STORE_DEMO.CLICKSTREAM_DATA.ORDERS
GROUP BY USER_IDThe translation is direct and deterministic: F.sum() becomes SUM(), F.col("X") / F.col("Y") becomes X / Y, .filter() becomes WHERE, .join() becomes JOIN, and so on. You can also write SQL directly via session.sql("SELECT ...") – the end result is identical.
D.2.3 Step 3 – Feature View Registration Creates a Dynamic Table
When you register a Feature View with a refresh_freq, the Feature Store API issues a CREATE DYNAMIC TABLE ... AS <generated-SQL>. From that point forward, the Dynamic Table is a pure SQL object – no Python interpreter is involved in the refresh cycle.
user_purchase_fv = FeatureView(
name="USER_PURCHASE_STATS",
entities=[user_entity],
feature_df=user_purchase_df,
timestamp_col="LAST_ORDER_TS",
refresh_freq="15 minutes",
desc="User purchase statistics"
)
fs.register_feature_view(feature_view=user_purchase_fv, version="V01")Users who prefer to express feature-engineering logic directly in SQL can do so by wrapping a raw SQL SELECT statement in session.sql() and passing the resulting DataFrame as the feature_df. The outcome is identical – the SQL text is used as-is for the Dynamic Table definition:
user_purchase_df = session.sql("""
SELECT USER_ID,
COUNT(DISTINCT ORDER_ID) AS ORDER_CNT,
SUM(AMOUNT) AS TOTAL_SPEND,
AVG(AMOUNT) AS AVG_ORDER_AMT,
MAX(ORDER_TS) AS LAST_ORDER_TS
FROM FEATURE_STORE_DEMO.CLICKSTREAM_DATA.ORDERS
GROUP BY USER_ID
""")
user_purchase_fv = FeatureView(
name="USER_PURCHASE_STATS",
entities=[user_entity],
feature_df=user_purchase_df,
timestamp_col="LAST_ORDER_TS",
refresh_freq="15 minutes",
desc="User purchase statistics"
)This is often preferable when teams already have well-tested SQL transformations, or when the SQL is complex enough that the Snowpark DataFrame API machine generated SQL adds verbosity without added clarity.
Key takeaway: Python is a development-time convenience. At runtime, the Dynamic Table refreshes by re-executing the generated SQL with no Python involvement. Whether you author with Snowpark DataFrames or raw SQL via
session.sql(), the end result is the same.
When a Feature View includes a StorageConfig with StorageFormat.ICEBERG, the registration creates a Dynamic Iceberg Table instead of a standard Dynamic Table. The translation path above is identical – the only difference is that the materialized output is written as Parquet files in Iceberg format on external cloud storage rather than in Snowflake’s internal format. All optimization strategies and limitations described in this appendix apply equally to Dynamic Iceberg Tables. See Chapter 10: Iceberg-Backed Feature Views for details.
D.3 Limitations: What Can and Cannot Be Expressed
Because the Dynamic Table is ultimately a SQL object, the constraints come from what Dynamic Tables support in their query definition.
D.3.1 Incremental vs Full Refresh
Dynamic Tables support three refresh modes:
- Incremental – analyses the query to determine what changed since the last refresh and merges only those changes into the table.
- Full – re-executes the entire query from scratch and replaces the previously materialized results.
- AUTO – Snowflake evaluates the query at creation time and selects whichever mode it expects to be more cost- and time-effective.
Full refresh does not always imply slower or more expensive. The relative cost depends on several factors:
- Incrementalization overhead. Incremental refresh must track changes, compute deltas, and merge them – this bookkeeping has its own cost. For simple queries over small-to-moderate tables, a straightforward full recomputation can be faster than the incremental machinery.
- Volume of changed data. When a large proportion of rows change between refreshes (e.g., more than 5% of grouping keys), incremental refresh can end up doing more work than a full refresh because it recomputes each affected group individually.
- Data locality. Operators like
GROUP BY,DISTINCT,JOIN, and window functions are sensitive to how well the source data is clustered by the relevant keys. Poor locality degrades incremental performance disproportionately. - Query complexity. Highly complex queries with many joins, subqueries, or nested CTEs may produce incremental plans that are harder for the optimizer to execute efficiently.
When using AUTO mode, Snowflake takes these trade-offs into account at table creation time. If you are explicitly choosing a mode, consider benchmarking both to see which performs better for your specific workload.
Several SQL constructs fall back to full refresh or are unsupported entirely:
| Construct | Incremental | Full | Notes |
|---|---|---|---|
SELECT, WHERE, FROM |
Supported | Supported | Performs consistently well |
GROUP BY |
Supported | Supported | Performance depends on data locality |
INNER JOIN, OUTER JOIN |
Supported | Supported | Best when one side is small or changes infrequently |
UNION ALL |
Supported | Supported | |
DISTINCT |
Supported | Supported | Locality-sensitive; prefer QUALIFY (see below) |
Window functions (ROW_NUMBER, RANK, etc.) |
Supported | Supported | Must include PARTITION BY; cluster source by partition keys |
LATERAL FLATTEN |
Supported | Supported | |
| Cortex AI/LLM functions | Supported (SELECT clause) | Supported | |
| Immutable UDFs (Python/Java/Scala) | Supported | Supported | Must not be marked VOLATILE |
| UDTFs (Python/Java/Scala) | Supported | Supported | SQL UDTFs not supported; must explicitly name output columns (no SELECT *) |
PIVOT / UNPIVOT |
Not supported | Not supported | |
SAMPLE / TABLESAMPLE |
Not supported | Not supported | |
| External functions | Not supported | Not supported | |
Sequences (SEQ1, SEQ2) |
Not supported | Supported | |
Set operators (MINUS, EXCEPT, INTERSECT) |
Not supported | Supported | UNION supported incrementally |
Subqueries outside FROM |
Not supported | Supported | |
WITH RECURSIVE |
Not supported | Supported | |
VOLATILE UDFs |
Not supported | Supported |
The matrix above describes what happens within a single DT. As of April 2026, downstream DTs can refresh incrementally even when reading from an upstream DT that uses full refresh, provided the upstream DT has a system-derived unique key. Snowflake automatically derives unique keys from GROUP BY columns and QUALIFY ROW_NUMBER() = 1 filters. Additionally, base tables refreshed via INSERT OVERWRITE can enable incremental downstream processing by declaring a primary key with RELY (ALTER TABLE ... ADD PRIMARY KEY (...) RELY). The downstream DT must explicitly set REFRESH_MODE = INCREMENTAL (AUTO does not resolve to incremental in this case). Verify with SHOW UNIQUE KEYS IN <upstream_dt>. These are SQL-level capabilities not yet exposed through the Feature Store Python API. See Understanding primary keys in dynamic tables.
Reference: Supported queries for dynamic tables
D.3.2 Python UDFs and UDTFs in Dynamic Tables
UDFs and UDTFs provide a way to embed custom logic – including Python with third-party libraries – inside a Dynamic Table’s SQL definition. They are row-level transformations: they operate on input columns from a single row and produce output for that row. They do not aggregate across rows or join between tables.
UDF (scalar function): Takes one or more input columns from a single row and returns a single output column. The output can be a scalar type (FLOAT, STRING, INT, etc.) or a compound type (VARIANT, OBJECT, ARRAY) when multiple values need to be packed into one column. Typical feature-engineering uses include:
- Applying a pre-trained scoring model (n feature columns in, one score out)
- Encoding or decoding a field using proprietary logic
- Computing a derived value from multiple columns that cannot be expressed with native SQL functions
- Parsing semi-structured data into a typed value
UDTF (table function): Takes one or more input columns from a single row and returns multiple output columns (or multiple output rows) per input row. Useful for exploding a VARIANT into structured columns or reshaping row data.
Vectorized execution. UDFs and UDTFs can be declared as vectorized, which means Snowflake sends batches of rows as Pandas DataFrames to the Python runtime rather than invoking the function row by row. This is an execution-level optimization – the logical contract is still row-in, value(s)-out, but throughput improves substantially because batch processing avoids per-row Python overhead.
If existing Python logic cannot be expressed as Snowpark DataFrame operations (e.g., proprietary libraries, complex numerical algorithms), you can:
- Register a Python UDF (or UDTF) in Snowflake.
- Call it from within the Dynamic Table’s SQL definition.
According to the Snowflake documentation, both UDFs and UDTFs are supported for incremental refresh (the docs list them as separate entries in the support matrix). The DT refresh process applies them only to changed rows, just as it would any other expression in the SELECT clause. Neither inherently forces a full refresh.
UDF caveats:
- The UDF must be declared
IMMUTABLE(notVOLATILE).VOLATILEUDFs are not supported for incremental refresh. - Replacing an
IMMUTABLEUDF viaCREATE OR REPLACE FUNCTIONwhile it is in use by an incremental-refresh DT will cause refresh failures. The DT must be recreated after replacing the UDF. This does not apply to Model Registry –MODEL(name)!PREDICT()handles version rotation gracefully (the DT triggers aREINITIALIZEand resumes incremental mode). See Chapter 10: UDF Immutability for the full comparison. - UDFs that import from an external stage are not supported.
- SQL UDFs that contain subqueries are not supported for incremental refresh.
- Vectorized UDFs (processing Pandas DataFrames in batches) perform significantly better than scalar UDFs for large datasets.
UDTF caveats:
- SQL UDTFs are not supported in Dynamic Tables (only Python, Java, and Scala UDTFs).
- SELECT blocks that read from a UDTF must explicitly specify columns –
SELECT *from a UDTF is not allowed. - As with UDFs, vectorized UDTFs are recommended for throughput.
from snowflake.snowpark.types import PandasSeries, FloatType
import snowflake.snowpark.functions as F
from snowflake.snowpark.functions import udf, call_udf
import numpy as np
@udf(name="compute_custom_feature", is_permanent=True,
packages=["pandas", "numpy"], replace=True)
def compute_custom_feature(amount: PandasSeries[float],
quantity: PandasSeries[int]) -> PandasSeries[float]:
return amount * np.log1p(quantity)
feature_df = session.table("FEATURE_STORE_DEMO.CLICKSTREAM_DATA.ORDERS").select(
F.col("USER_ID"),
call_udf("compute_custom_feature",
F.col("AMOUNT"), F.col("QTY")).alias("CUSTOM_FEATURE"),
)Bottom line: Existing Python logic can run via
IMMUTABLEUDFs within a Dynamic Table and still benefit from incremental refresh. Prefer vectorized UDFs for performance. For model inference, prefer Model Registry (MODEL(name)!PREDICT()) over custom UDFs – MR handles version rotation gracefully (REINITIALIZE + incremental resume) whereas replacing a standard UDF breaks the DT. If you must use custom UDFs, plan for DT recreation when the function changes.
D.3.3 Common Pattern: Incremental Batch Inference in a DT
For model inference in a DT, there are two approaches. Model Registry is strongly preferred for models that will be retrained and versioned over time:
| Approach | Version rotation | DT impact | Best for |
|---|---|---|---|
Model Registry MODEL(name)!PREDICT(...) |
ALTER MODEL SET DEFAULT_VERSION |
REINITIALIZE (one-time full refresh), then incremental resumes | Production models with regular retraining |
| Custom IMMUTABLE UDF | CREATE OR REPLACE FUNCTION |
Breaks the DT — must CREATE OR REPLACE DYNAMIC TABLE |
Static logic that rarely changes, or non-ML transformations |
Upstream DTs prepare and aggregate the features incrementally; a final DT calls the model for inference. Because the function is IMMUTABLE and the DT supports incremental refresh, only the rows whose upstream features changed are scored – giving you incremental batch inference without any external orchestration.
# Upstream Feature View: features are incrementally maintained
user_features_fv = FeatureView(
name="USER_FEATURES",
entities=[user_entity],
feature_df=user_features_df,
refresh_freq="DOWNSTREAM",
)
# Final Feature View: scores new/changed rows via UDF
scored_df = (
session.table("FEATURE_STORE_DEMO.FEATURE_STORE.USER_FEATURES$V01")
.select(
F.col("USER_ID"),
F.col("TOTAL_SPEND"),
F.col("ORDER_CNT"),
call_udf("predict_churn_score",
F.col("TOTAL_SPEND"),
F.col("ORDER_CNT")).alias("CHURN_SCORE"),
)
)
churn_score_fv = FeatureView(
name="USER_CHURN_SCORES",
entities=[user_entity],
feature_df=scored_df,
refresh_freq="1 hour",
desc="Churn propensity scores - incremental batch inference"
)This pattern is well-suited to use cases like propensity scoring, risk rating, or recommendation ranking where predictions need to stay fresh as features update, but a full re-score of the entire population each cycle would be wasteful.
The example above covers single-FV scoring. For models that consume features from multiple Feature Views, or that require point-in-time correct inference, see Chapter 10: Scoring Feature Views which covers the multi-FV pre-join pattern and orchestrated Task/Sproc alternative.
D.3.4 Constructs That Prevent Incremental Refresh Entirely
The following patterns force every refresh to be a full recomputation:
- Non-deterministic functions in SELECT (e.g.,
RANDOM(),CURRENT_TIMESTAMP()outside WHERE/HAVING/QUALIFY) - Self-referential queries (a DT reading its own previous state)
VOLATILEUDFs – even if the logic is deterministic, theVOLATILEmarker tells the optimizer it cannot be trusted- Downstream incremental DTs below upstream full-refresh DTs – historically, an incremental DT could not sit below a full-refresh DT. As of April 2026, this limitation is relaxed when the upstream full-refresh DT has a system-derived unique key (e.g., from
GROUP BYcolumns orQUALIFY ROW_NUMBER() = 1). In that case, the downstream DT can useREFRESH_MODE = INCREMENTALexplicitly and Snowflake will compute only the actual row-level changes between upstream full refreshes.AUTOmode does not resolve to incremental in this scenario – you must opt in. Verify withSHOW UNIQUE KEYS IN <upstream_dt>. See Understanding primary keys in dynamic tables for details. Note: This is a SQL-level capability; the Feature Store Python API does not yet expose it. Use the sameALTER DYNAMIC TABLE/ plain-DT pattern described in View-Based Feature Views when the API lags behind SQL features.
D.4 View-Based Feature Views: The On-Demand Alternative
Not every Feature View needs a Dynamic Table. When you omit refresh_freq, the Feature Store creates a view instead. The transformation SQL is stored as a view definition and executed on-the-fly each time data is retrieved.
product_attributes_fv = FeatureView(
name="PRODUCT_ATTRIBUTES",
entities=[product_entity],
feature_df=product_df,
timestamp_col="UPDATED_TS",
# No refresh_freq → View-based Feature View
desc="Product attributes - computed on query"
)D.4.1 When View-Based Feature Views Make Sense
- Prototyping and development. During early iteration you want to change feature definitions rapidly without waiting for DT materialization or paying continuous refresh costs. A view lets you test transformations immediately.
- Rarely accessed Feature Views. If a Feature View is only consumed during periodic training runs (e.g., weekly or monthly), the cost of computing features on-the-fly at query time may be significantly lower than continuously maintaining a Dynamic Table.
- Slowly changing or static dimensions. Reference data (product catalogs, region mappings) that changes infrequently often does not justify a DT.
D.4.2 Spine Filter Push-Down
A key advantage of view-based Feature Views during dataset generation is spine filter push-down. When the Feature Store performs an ASOF join against a view-based Feature View, the filters from the spine (entity keys and timestamps) are pushed down into the view query. This means the view does not need to materialize its entire result set – only the rows relevant to the spine are computed. This can dramatically reduce compute cost for large source tables when the spine is selective.
D.4.3 Escaping Dynamic Table Constraints with External Pipelines
If a transformation pipeline is too complex or time-consuming to be processed efficiently in Dynamic Tables – for example, pipelines that involve unsupported constructs, extensive Python library dependencies, or orchestration logic that exceeds what a single SQL query can express – the transformations can be expressed in any external tool (dbt, Airflow, Dagster, custom scripts, etc.). The resulting output table is then registered as a view-based Feature View that simply references the externally maintained table:
# dbt (or any external tool) maintains the output table
# Register it as a view-based Feature View for FS integration
dbt_features_df = session.table("FEATURE_STORE_DEMO.FEATURE_STORE.USER_STATS")
dbt_fv = FeatureView(
name="USER_STATS_DBT",
entities=[user_entity],
feature_df=dbt_features_df,
timestamp_col="_DBT_UPDATED_TS",
# No refresh_freq → View wrapping the externally managed table
desc="User statistics - maintained by dbt pipeline"
)
fs.register_feature_view(feature_view=dbt_fv, version="V01")This gives you the full Feature Store benefits – point-in-time-correct dataset generation, entity-based retrieval, Model Registry lineage – while keeping the transformation logic in the tool best suited for it. The external tool owns the refresh schedule; the Feature Store owns the feature contract.
D.5 Optimization Strategies
D.5.1 Break Complex Pipelines into Chained Dynamic Tables
Long, complex Snowpark pipelines translate to long, complex SQL, which can degrade incremental refresh performance. The recommended pattern is to decompose into multiple Feature Views (each backed by its own Dynamic Table) chained together. When you still have a large generated statement to inspect (for example in QUERY_HISTORY or a colleague’s review), pretty-printing and light structural rewrites can make it readable; see Formatting machine-generated SQL.
# Stage 1: Cleaning and parsing
clean_df = session.table("RAW_EVENTS").select(
F.col("USER_ID"),
F.to_timestamp("EVENT_TS").alias("EVENT_TS"),
F.col("AMOUNT").cast("FLOAT").alias("AMOUNT"),
)
clean_fv = FeatureView(
name="EVENTS_CLEAN",
entities=[user_entity],
feature_df=clean_df,
refresh_freq="DOWNSTREAM",
)
fs.register_feature_view(feature_view=clean_fv, version="V01")
# Stage 2: Aggregation (reads from Stage 1's DT)
agg_df = (
session.table("FEATURE_STORE_DEMO.FEATURE_STORE.EVENTS_CLEAN$V01")
.group_by("USER_ID")
.agg(F.sum("AMOUNT").alias("TOTAL_SPEND"))
)
agg_fv = FeatureView(
name="USER_SPEND_AGGREGATES",
entities=[user_entity],
feature_df=agg_df,
refresh_freq="15 minutes",
)
fs.register_feature_view(feature_view=agg_fv, version="V01")Benefits:
- Better incremental refresh – simpler SQL per DT means the optimizer can reason about changes more effectively.
- Independent monitoring – each stage has its own refresh history and health metrics.
- Mixed refresh strategies – fast refresh for one stage, slower for another.
- Full-refresh propagation awareness – historically, any DT using full refresh forced all downstream DTs to also use full refresh. As of April 2026, this constraint is relaxed: if the full-refresh DT has a system-derived unique key (from
GROUP BYorQUALIFY ROW_NUMBER() = 1), downstream DTs can opt intoREFRESH_MODE = INCREMENTALand process only actual row changes (see Constructs That Prevent Incremental Refresh above). When this is not possible, design your chain so that stages using full-refresh constructs sit at the end of the pipeline, not the beginning. If neither option works, consider whether the full-refresh stages should instead be handled by an external pipeline and wrapped as a view-based Feature View (see View-Based Feature Views below).
D.5.2 Optimize for Incremental Refresh
The Snowflake documentation provides specific guidance on operator-level optimization:
Cluster source tables by grouping/join/partition keys. Data locality is the single biggest factor for incremental refresh performance. When changes affect a small portion of grouping keys, the DT only recomputes those groups.
Keep changes to fewer than 5% of grouping keys per refresh. When more than 5% of keys change, incremental refresh may be slower than full refresh.
Use QUALIFY ROW_NUMBER() ... = 1 instead of DISTINCT for deduplication. The QUALIFY pattern has an optimized incremental path that performs consistently faster than DISTINCT, which is equivalent to GROUP BY ALL.
SQL:
-- Prefer this (optimized incremental path)
SELECT customer_id, customer_name, email, event_time
FROM customer_events
QUALIFY ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY event_time DESC) = 1
-- Over this (locality-sensitive, potentially slow)
SELECT DISTINCT customer_id, customer_name, email
FROM customer_eventsSnowpark equivalent:
from snowflake.snowpark.functions import row_number
from snowflake.snowpark import Window
# Prefer this (optimized incremental path)
window = Window.partition_by("CUSTOMER_ID").order_by(F.col("EVENT_TIME").desc())
deduped_df = (
session.table("CUSTOMER_EVENTS")
.select("CUSTOMER_ID", "CUSTOMER_NAME", "EMAIL", "EVENT_TIME")
.with_column("RN", row_number().over(window))
.filter(F.col("RN") == 1)
.drop("RN")
)
# Over this (locality-sensitive, potentially slow)
distinct_df = (
session.table("CUSTOMER_EVENTS")
.select("CUSTOMER_ID", "CUSTOMER_NAME", "EMAIL")
.distinct()
)Avoid compound expressions in GROUP BY. Materialise computed columns in an upstream DT first, then group on the simple column reference in a downstream DT.
SQL:
-- Upstream DT: materialize the expression
CREATE DYNAMIC TABLE transactions_with_minute ...
AS SELECT DATE_TRUNC('minute', ts) AS ts_minute, amount
FROM transactions;
-- Downstream DT: group on simple column
CREATE DYNAMIC TABLE minute_sums ...
AS SELECT ts_minute, SUM(amount)
FROM transactions_with_minute
GROUP BY 1;Snowpark equivalent (as two chained Feature Views):
# Upstream Feature View: materialize the expression
stage1_df = session.table("TRANSACTIONS").select(
F.date_trunc("minute", F.col("TS")).alias("TS_MINUTE"),
F.col("AMOUNT"),
)
stage1_fv = FeatureView(
name="TRANSACTIONS_WITH_MINUTE",
entities=[txn_entity],
feature_df=stage1_df,
refresh_freq="DOWNSTREAM",
)
fs.register_feature_view(feature_view=stage1_fv, version="V01")
# Downstream Feature View: group on simple column
stage2_df = (
session.table("FEATURE_STORE_DEMO.FEATURE_STORE.TRANSACTIONS_WITH_MINUTE$V01")
.group_by("TS_MINUTE")
.agg(F.sum("AMOUNT").alias("TOTAL_AMOUNT"))
)
stage2_fv = FeatureView(
name="MINUTE_SUMS",
entities=[txn_entity],
feature_df=stage2_df,
refresh_freq="15 minutes",
)
fs.register_feature_view(feature_view=stage2_fv, version="V01")For joins, cluster the dimension table (the side that changes less often) by the join key. For OUTER JOINs, put the table that changes more frequently on the LEFT side. Clustering is configured at the table level via SQL (ALTER TABLE ... CLUSTER BY ...) and applies regardless of whether you author the Feature View in Snowpark or SQL.
Always include PARTITION BY in window functions. Window functions without PARTITION BY result in full recomputation of the entire result set on every refresh.
# Good: partition_by included
window = Window.partition_by("REGION").order_by(F.col("AMOUNT").desc())
ranked_df = sales_df.with_column("SALES_RANK", F.rank().over(window))
# Bad: no partition_by -- forces full recomputation every refresh
window_all = Window.order_by(F.col("AMOUNT").desc())
ranked_df = sales_df.with_column("GLOBAL_RANK", F.rank().over(window_all))D.5.3 Reduce Refresh Cost with Immutability Constraints
Dynamic Tables with large historical datasets can incur significant compute costs during refresh – especially when dimension table changes force recomputation of the entire result set, or when queries that cannot incrementalize well (e.g., multiple LEFT OUTER JOINs) fall back to full refresh.
The IMMUTABLE WHERE clause lets you declare that rows matching a condition will never change, so Snowflake skips them entirely during refresh. Only the mutable region is reprocessed.
CREATE DYNAMIC TABLE USER_EVENT_FEATURES
TARGET_LAG = '1 hour'
WAREHOUSE = FS_DEV_WH
IMMUTABLE WHERE (LAST_EVENT_TS < CURRENT_TIMESTAMP() - INTERVAL '7 days')
AS
SELECT USER_ID,
COUNT(EVENT_ID) AS EVENT_CNT,
MAX(EVENT_TS) AS LAST_EVENT_TS
FROM FEATURE_STORE_DEMO.CLICKSTREAM_DATA.EVENTS
GROUP BY USER_ID;With this definition, events older than 7 days are locked. If an upstream dimension table changes, only the last 7 days of aggregates are recomputed rather than the full history.
Key use cases for Feature Store pipelines:
| Scenario | How IMMUTABLE WHERE helps |
|---|---|
| Dimension table updates | Changes to a dimension (e.g., product category rename) don’t trigger recomputation of historical fact aggregates |
| Expensive UDF scoring | Model inference or LLM calls on historical rows are preserved; only new rows are scored |
| Full-refresh DTs | Queries that cannot incrementalize (multiple OUTER JOINs, UDTFs) only reprocess the mutable window instead of the entire table |
| Source data retention | Delete old source data to save storage while retaining historical aggregates in the DT |
| Downstream incremental refresh | A downstream DT can use incremental refresh even if its upstream DT uses full refresh, provided both declare IMMUTABLE WHERE |
Backfill from existing data. IMMUTABLE WHERE pairs with BACKFILL FROM to copy historical data from an existing table (or a clone of a prior DT version) into a new DT without recomputing it. This is particularly valuable for schema evolution – adding new feature columns to a large Feature View without paying the cost of a full historical recompute. See Chapter 04 for the detailed workflow.
You can check whether a row is immutable via the METADATA$IS_IMMUTABLE pseudo-column:
Note:
IMMUTABLE WHEREandBACKFILL FROMare Dynamic Table SQL features and are not currently exposed through the Feature Store Python API. Other SQL-only DT features –INITIALIZATION_WAREHOUSE,ROW_TIMESTAMP,DATA_RETENTION_TIME_IN_DAYS, andMAX_DATA_EXTENSION_TIME_IN_DAYS– are also valuable for Feature Store pipelines; see Chapter 11: Dynamic Table SQL Features Beyond the API for details. To use any of these with Feature Store, create the DT via SQL and wrap it as a view-based Feature View, or useALTER DYNAMIC TABLEafter the API creates the DT. Check the Feature Store release notes for future API integration.
Reference: Use immutability constraints | Introducing Immutability for Dynamic Tables (engineering blog)
D.5.4 Choose an Appropriate Target Lag
Dynamic Tables do not blindly re-execute on every refresh cycle. The scheduler checks for upstream changes and only triggers warehouse compute when there is new data to process. This change-detection check is near-zero cost, so a short target lag on a source that updates infrequently does not waste significant compute – the DT simply checks more often and finds nothing to do.
That said, target lag is still worth considering:
- Shorter target lag means the DT is ready to process changes sooner, giving lower latency from source change to feature availability. The compute cost is driven by the volume and frequency of actual data changes, not the target lag itself.
- Longer target lag allows Snowflake to batch multiple small changes into fewer, larger refresh operations, which can be more efficient for high-frequency sources where micro-batching would otherwise create overhead.
Choose your target lag based on how quickly downstream consumers need to see updated features, rather than trying to match it exactly to the source update cadence.
| Use Case | Suggested Target Lag | Rationale |
|---|---|---|
| Real-time fraud / personalization | 1 minute |
Features must reflect latest activity promptly |
| Operational ML features | 5-15 minutes |
Balance of freshness and batching efficiency |
| Standard reporting / training features | 1 hour |
Consumers tolerate moderate lag |
| Slowly changing dimensions | 1 day or greater |
Source changes are infrequent |
D.6 Formatting machine-generated SQL
Snowpark, Spark SQL, R packages such as dbplyr (dbplyr::sql_render()), and other dataframe-style APIs emit SQL that is correct but often hard to read: dense one-liners, deep nesting, or redundant derived tables. That friction shows up when you compare DataFrame.queries (Snowpark) or rendered SQL from other stacks against worksheets, documentation, or EXPLAIN output in Snowsight.
Third-party parsers and formatters address readability. Treat them as developer aids, not as something Snowflake requires at runtime: the engine optimizes from its own parse tree, and reformatting does not improve warehouse performance. Also assume round-trip risk: any tool that parses and re-emits SQL can reject uncommon syntax, normalize identifiers, or reorder clauses. If you paste formatted SQL back into production objects, run it in a non-production worksheet first.
D.6.1 Tool landscape (summary)
| Tool | Primary role | Snowflake fit | Strengths | Limitations |
|---|---|---|---|---|
| SQLGlot | Parser, transpiler, optional rewrite rules | First-class read="snowflake" / dialect support |
pretty=True indentation; dialect-aware transpile; eliminate_subqueries hoists derived tables in FROM to WITH CTEs and deduplicates identical subqueries |
Cutting-edge Snowflake keywords can lag the live product until the grammar catches up; correlated subqueries and LATERAL are not always rewritten |
| SQLFluff | Linter + auto-fix + formatter | Snowflake ruleset | Strong for team style and CI on .sql / dbt projects; fixes many layout and some logic-smell issues |
Heavier dependency and configuration; severely mangled generator output may need manual cleanup before lint passes |
| sqlfmt | Opinionated formatter (dbt-first) | Polyglot / Snowflake-friendly style | Extremely fast; minimal configuration; works well on dbt models with Jinja | Enforces sqlfmt’s style (e.g. lowercase keywords); not a semantic transpiler |
| sqlparse | Lexer-based split / reindent | Dialect-agnostic | Small library; good for quick indentation in Python | No real dialect model; can mis-parenthesize or break on Snowflake-specific constructs |
| pgFormatter, sql-formatter (JS), IDE formatters | Pretty-print | Often PostgreSQL- or ANSI-leaning | Easy editor integration | Snowflake extensions (QUALIFY, staged file functions, etc.) may format oddly or fail to parse |
For one-off inspection of Snowpark output, SQLGlot is usually the best balance of Snowflake awareness and programmatic control. For repository-wide SQL style enforcement, SQLFluff or sqlfmt are common complements to (not replacements for) code review.
D.6.2 Example: pretty-print and optional CTE hoisting (SQLGlot)
eliminate_subqueries lifts inline derived tables into a WITH clause (it does not rewrite every correlated predicate subquery). Combine it with transpile(..., pretty=True) for indentation.
from sqlglot import parse_one, transpile
from sqlglot.optimizer.eliminate_subqueries import eliminate_subqueries
def format_sql(query: str, hoist_derived_tables_to_cte: bool = False) -> str:
"""Pretty-print SQL; optionally hoist FROM subqueries to CTEs."""
expression = parse_one(query, read="snowflake")
if hoist_derived_tables_to_cte:
expression = eliminate_subqueries(expression)
return transpile(expression, read="snowflake", pretty=True)[0]Usage with a Snowpark plan string (conceptually):
raw_sql = feature_df.queries["queries"][0] # illustrative; structure may vary by Snowpark version
print(format_sql(raw_sql, hoist_derived_tables_to_cte=True))In R, a typical pattern is dbplyr::sql_render(lazy_query) to obtain the text, then the same Python helper via reticulate, a saved .sql file processed in CI, or any standalone formatter above.
D.7 Summary
| Question | Answer |
|---|---|
| How does Python become a Dynamic Table? | Snowpark translates DataFrame operations to SQL at registration time. The DT is a pure SQL object – no Python runs during refresh. |
| Can we use existing Python code? | Yes, via IMMUTABLE UDFs called within the DT definition. For model inference, prefer Model Registry (MODEL(name)!PREDICT()) – it handles version rotation gracefully. Standard UDFs work for static logic but replacing one breaks the DT. |
| What are the limitations? | Certain SQL constructs don’t support incremental refresh (PIVOT, external functions, VOLATILE UDFs, set operators other than UNION/UNION ALL). See the supported queries documentation for the full matrix. |
| What about view-based Feature Views? | Omit refresh_freq to create a view instead of a DT. The query runs on-the-fly at retrieval time with spine filter push-down. Ideal for prototyping, rarely accessed features, or wrapping externally managed tables (dbt, Airflow, etc.). |
| How do we optimize? | Break long pipelines into chained DTs, cluster source data by grouping/join keys, use QUALIFY over DISTINCT, choose target lag based on consumer freshness needs, and use IMMUTABLE WHERE to skip recomputation of historical rows. |
| How do I read ugly generated SQL? | Use a dialect-aware formatter (often SQLGlot for Snowpark strings); optionally hoist derived tables to CTEs for review only. See Formatting machine-generated SQL. |
D.8 References
- Snowpark DataFrames – how lazy evaluation and query pushdown work
- Dynamic Tables overview – architecture and refresh mechanics
- Supported queries for Dynamic Tables – full incremental/full support matrix
- Optimize queries for incremental refresh – operator-level tuning guidance
- Use immutability constraints – IMMUTABLE WHERE and BACKFILL FROM
- Chapter 04: Feature Views – DT vs View Feature View types
- Chapter 05: Feature Pipelines – pipeline architecture patterns
- SQLGlot – Python parser / transpiler used for pretty-printing and optional CTE hoisting (
eliminate_subqueriesAPI)