15  RSnowflake: Query

DBI, dbplyr, bulk I/O, and Arrow

Keywords

snowflake, R, RStudio, Posit, VS Code, workspace notebooks, snowflakeR, RSnowflake, mlops

15.1 Overview

Once connected (Connect chapter), RSnowflake gives you three layers of data access:

  1. DBI — standard R database interface (dbGetQuery, dbWriteTable, …)
  2. dbplyr — dplyr verbs translated to SQL and executed in Snowflake
  3. Bulk I/O — optimized paths for large reads/writes (Snowpark, ADBC, or literal SQL)

The design goal is push computation to the data. Your R session sends SQL; the warehouse aggregates, filters, and joins close to storage. You collect() only what you need for modeling or visualization.

15.2 Learning Objectives

  • Run queries, DDL, and table operations with DBI
  • Build lazy dplyr pipelines that execute on the warehouse
  • Use RSnowflake-only dbplyr translations (semi-structured types, OBJECT_AGG, approximate aggregates)
  • Materialize intermediate results with compute() (session temp tables) vs pull summaries with collect()
  • Combine window functions, joins, and multi-step ETL entirely in Snowflake SQL via dbplyr
  • Understand how RSnowflake chooses bulk write/read backends
  • Know SQL API limits that affect stage I/O

15.3 Query execution model

flowchart LR
  R[R session RSnowflake]
  API[SQL REST API or ADBC]
  WH[Warehouse]
  STG[(Tables / stages)]
  R --> API --> WH --> STG

Step Where it runs
dplyr pipeline before collect() SQL built in R; not executed yet
collect(), dbGetQuery() Warehouse executes SQL
Result transfer JSON (SQL API) or Arrow (ADBC) to R

Bulk writes from R (dbWriteTable, dbAppendTable) use separate routing — see Bulk write routing. Prefer not loading huge tables into R only to write them back; materialize with SQL/dbplyr when possible.


15.4 DBI basics

DBI examples below are already the SQL path: you send warehouse SQL as strings (dbGetQuery, dbExecute). Pair them with dbplyr tabs when you want the same logic as lazy dplyr.

15.4.1 Reads and parameterized queries

df <- dbGetQuery(con, "SELECT * FROM MY_DB.MY_SCHEMA.MY_TABLE LIMIT 100")

df <- dbGetQuery(con,
  "SELECT * FROM users WHERE age > ?",
  params = list(21L)
)

Parameterized queries avoid SQL injection and help the optimizer reuse plans.

15.4.2 DDL and DML

dbExecute(con, "CREATE TABLE test_tbl (id INT, name VARCHAR)")
dbExecute(con, "INSERT INTO test_tbl VALUES (1, 'Alice')")

dbExecute() returns rows affected; use for statements without result sets.

15.4.3 Table round-trip

dbWriteTable(con, "iris_copy", iris, overwrite = TRUE)
iris_back <- dbReadTable(con, "iris_copy")
dbAppendTable(con, "iris_copy", iris[1:10, ])
dbRemoveTable(con, "iris_copy")
Note

Use fully qualified names when Workspace session context is unset. RSnowflake uppercases unquoted identifiers by default — see Connect — case.


15.5 dbplyr: lazy SQL from dplyr

dbplyr turns dplyr into a SQL compiler. Tables become lazy; verbs append to a query plan until you collect(). Where the same logic is practical in both forms, examples use SQL and dbplyr tabs (group="query-sql-dplyr").

SELECT
  REGION,
  COUNT(*)                    AS n,
  SUM(AMOUNT)                 AS revenue
FROM MY_DB.MY_SCHEMA.ORDERS
WHERE STATUS = 'OPEN'
  AND ORDER_DATE >= '2024-01-01'
GROUP BY REGION
ORDER BY revenue DESC;
library(dplyr)

tbl(con, in_schema("MY_SCHEMA", "ORDERS")) |>
  filter(STATUS == "OPEN", ORDER_DATE >= "2024-01-01") |>
  group_by(REGION) |>
  summarise(
    n       = n(),
    revenue = sum(AMOUNT, na.rm = TRUE),
    .groups = "drop"
  ) |>
  arrange(desc(revenue)) |>
  collect()   # single SQL round trip

Why lazy matters: filter() and summarise() do not pull data to R. A 10-billion-row table stays in Snowflake until the final aggregation is computed on the warehouse.

Warning

dbplyr keeps Snowflake’s UPPERCASE column names. snowflakeR’s sfr_query() lowercases — do not mix blindly in joins between APIs.

Generic Snowflake translations (floor_date()DATE_TRUNC, paste0()ARRAY_TO_STRING, ifelse()IFF, and many more) come from dbplyr via dbplyr::simulate_snowflake(). RSnowflake adds the Snowflake-only functions below — they are not available when you use ODBC/dbplyr against Snowflake from other drivers.

Full tables: RSnowflake getting-started vignette · Appendix H — getting-started HTML.


15.6 RSnowflake-only dplyr translations

Use these inside mutate(), filter(), and summarise() on tbl(con, ...). They compile to Snowflake SQL and run on the warehouse.

15.6.1 Semi-structured scalars

R function Snowflake SQL Typical use
parse_json(x) PARSE_JSON(x) VARCHAR → VARIANT
try_parse_json(x) TRY_PARSE_JSON(x) Safe parse (NULL on bad JSON)
typeof(x) TYPEOF(x) Inspect VARIANT element type
is_object(x) IS_OBJECT(x) Branch on OBJECT payloads
is_array(x) IS_ARRAY(x) Branch on ARRAY payloads
is_integer(x) IS_INTEGER(x) VARIANT holds integer

15.6.2 Array scalars

R function Snowflake SQL Notes
array_size(x) ARRAY_SIZE(x) Length of ARRAY
array_contains(arr, val) ARRAY_CONTAINS(val, arr) RSnowflake swaps arg order to match Snowflake
array_slice(arr, from, to) ARRAY_SLICE(arr, from, to) Sub-array

15.6.3 Strings and hashing

R function Snowflake SQL Notes
regexp_substr(x, pattern, ...) REGEXP_SUBSTR(...) Snowflake-specific args (pos, occ, params)
hash(...) HASH(...) Stable hash across columns

15.6.4 Aggregates (inside summarise())

R function Snowflake SQL Typical use
object_agg(key, value) OBJECT_AGG(key, value) Bundle many rows into one OBJECT per group
array_agg(x) ARRAY_AGG(x) Positional feature vector as ARRAY
array_unique_agg(x) ARRAY_UNIQUE_AGG(x) Distinct ARRAY elements
approx_count_distinct(x) APPROX_COUNT_DISTINCT(x) Cheap cardinality on large groups
approx_percentile(x, p) APPROX_PERCENTILE(x, p) Approximate quantiles
mode_val(x) MODE(x) Statistical mode (mode_val avoids base::mode())

Window variants of the aggregate functions above are also registered for analytic queries.

15.6.5 Example: events with JSON payloads

SELECT
  EVENT_TYPE,
  ARRAY_UNIQUE_AGG(USER_ID)          AS user_ids,
  APPROX_COUNT_DISTINCT(USER_ID)     AS n_users
FROM (
  SELECT
    EVENT_TYPE,
    USER_ID,
    TRY_PARSE_JSON(RAW_JSON) AS payload
  FROM CLICKSTREAM_DATA.EVENTS
) e
WHERE IS_OBJECT(payload)
GROUP BY EVENT_TYPE;
library(dplyr)

tbl(con, in_schema("CLICKSTREAM_DATA", "EVENTS")) |>
  mutate(payload = parse_json(RAW_JSON)) |>
  filter(is_object(payload)) |>
  group_by(EVENT_TYPE) |>
  summarise(
    user_ids   = array_unique_agg(USER_ID),
    n_users    = approx_count_distinct(USER_ID),
    .groups    = "drop"
  ) |>
  collect()

15.6.6 Example: bundle sparse features into OBJECT (Feature Store pattern)

Snowflake’s Feature Store advanced patterns recommend OBJECT (or ARRAY) columns to avoid thousand-column tables. You can build those bundles in dplyr on the warehouse, then register the result as an external feature view from Feature Store or materialize with compute().

-- Long/narrow source: one row per (entity, feature_name, value)
SELECT
  ENTITY_ID,
  OBJECT_AGG(FEATURE_NAME, FEATURE_VALUE) AS feature_object
FROM FEATURE_STORE.ENTITY_FEATURES_LONG
GROUP BY ENTITY_ID;
# Long/narrow source: one row per (entity, feature_name, value)
bundled <- tbl(con, in_schema("FEATURE_STORE", "ENTITY_FEATURES_LONG")) |>
  group_by(ENTITY_ID) |>
  summarise(
    FEATURE_OBJECT = object_agg(FEATURE_NAME, FEATURE_VALUE),
    .groups = "drop"
  )

bundled |> show_query()

# Persist for reuse (session temp table) or inspect a sample
feature_sample <- bundled |> head(100) |> collect()

For fixed keys (known feature names), OBJECT_CONSTRUCT in SQL is often clearer — use dbGetQuery() or mutate(feature_object = sql("OBJECT_CONSTRUCT('avg_spend', AVG_SPEND, 'n_orders', N_ORDERS)")) after a summarise() step.

Path access (payload:field::FLOAT, bracket notation) is not wrapped as dplyr helpers today. Use dbGetQuery() or mutate(x = sql("FEATURE_OBJECT:avg_spend::FLOAT")) for extraction expressions.



15.7 Push-down analytics: dplyr on the warehouse

The power of dbplyr is not “R that talks SQL” — it is keeping heavy work in Snowflake and using R for orchestration, inspection, and modest result sets (plots, model matrices, samples).

Verb / action Runs in Typical use
filter(), mutate(), group_by(), summarise() Warehouse (lazy) Feature engineering, aggregations
show_query() R only Debug generated SQL before execution
compute() Warehouse (CREATE TEMP TABLE … AS) Persist an intermediate; reuse in branches
collect() Warehouse execute + transfer to R Final table slice, ggplot input, head()
Tip

Feature Store training exports (OBJECT/ARRAY/VECTOR pipelines, ADBC decode, wide ASOF joins) live in Training Data in R — not here. Feature View authoring SQL/dbplyr is in Authoring Feature Views. This chapter focuses on everyday dbplyr analytics.

15.7.1 Multi-step ETL with temp tables

Break a pipeline into named steps on the warehouse instead of one nested query or repeated collect():

-- Step 1 — session temp table (dbplyr compute() generates similar DDL)
CREATE TEMPORARY TABLE daily_orders_tmp AS
SELECT
  CUSTOMER_ID,
  ORDER_DATE,
  COUNT(*)              AS n_orders,
  SUM(AMOUNT)           AS revenue
FROM ANALYTICS.ORDERS
WHERE ORDER_DATE >= '2024-01-01'
GROUP BY CUSTOMER_ID, ORDER_DATE;

-- Step 2 — aggregate from materialized step
SELECT
  CUSTOMER_ID,
  SUM(n_orders)  AS orders_90d,
  SUM(revenue)   AS revenue_90d,
  SUM(revenue) / NULLIF(SUM(n_orders), 0) AS avg_order
FROM daily_orders_tmp
GROUP BY CUSTOMER_ID;

-- Inspect row counts without pulling data to R
SELECT COUNT(*) AS n FROM daily_orders_tmp;
library(dplyr)

orders <- tbl(con, in_schema("ANALYTICS", "ORDERS")) |>
  filter(ORDER_DATE >= "2024-01-01")

# Step 1 — materialize daily rollups (session temp table)
daily <- orders |>
  group_by(CUSTOMER_ID, ORDER_DATE) |>
  summarise(
    n_orders = n(),
    revenue  = sum(AMOUNT, na.rm = TRUE),
    .groups  = "drop"
  ) |>
  compute(name = "daily_orders_tmp")

# Step 2 — continue from temp table; still lazy until collect()
customer_rollup <- daily |>
  group_by(CUSTOMER_ID) |>
  summarise(
    orders_90d   = sum(n_orders),
    revenue_90d  = sum(revenue),
    avg_order    = revenue_90d / orders_90d,
    .groups      = "drop"
  )

customer_rollup |> show_query()

dbGetQuery(con, "SELECT COUNT(*) AS n FROM daily_orders_tmp")

15.7.2 Joins and window functions

dbplyr translates joins and window verbs to Snowflake SQL — including qualify-style patterns via filter() on windowed columns:

CREATE TEMPORARY TABLE top5_orders_tmp AS
SELECT
  CUSTOMER_ID,
  ORDER_ID,
  ORDER_TS,
  AMOUNT,
  running_spend
FROM (
  SELECT
    CUSTOMER_ID,
    ORDER_ID,
    ORDER_TS,
    AMOUNT,
    SUM(AMOUNT) OVER (
      PARTITION BY CUSTOMER_ID ORDER BY ORDER_TS
      ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
    ) AS running_spend,
    ROW_NUMBER() OVER (
      PARTITION BY CUSTOMER_ID ORDER BY ORDER_TS DESC
    ) AS order_rank
  FROM SALES.LINE_ITEMS
)
WHERE order_rank <= 5;
library(dplyr)

line_items <- tbl(con, in_schema("SALES", "LINE_ITEMS"))

ranked <- line_items |>
  group_by(CUSTOMER_ID) |>
  mutate(
    order_rank = row_number(desc(ORDER_TS)),
    running_spend = sum(AMOUNT)
  ) |>
  filter(order_rank <= 5L)   # top 5 orders per customer — computed in Snowflake

top5_per_customer <- ranked |>
  select(CUSTOMER_ID, ORDER_ID, ORDER_TS, AMOUNT, running_spend) |>
  compute(name = "top5_orders_tmp", temporary = TRUE)

For production pipelines, set temporary = FALSE and a fully qualified table name when the artifact should outlive the session.

15.7.3 Collect only what you need

Anti-pattern: collect() on a billion-row table to aggregate in R.

Better: aggregate in SQL, collect() the summary (or a stratified sample):

-- Summary for a dashboard — tiny result set
SELECT REGION, SUM(AMOUNT) AS revenue
FROM SALES.ORDERS
GROUP BY REGION;

-- Stratified sample for exploratory plots (see dbplyr show_query() for exact SAMPLE)
SELECT *
FROM SALES.ORDERS
WHERE ORDER_DATE >= '2025-01-01'
SAMPLE (5000 ROWS);
# Summary for a dashboard — tiny result set
by_region <- tbl(con, in_schema("SALES", "ORDERS")) |>
  group_by(REGION) |>
  summarise(revenue = sum(AMOUNT), .groups = "drop") |>
  collect()

# Optional: ggplot2 on the aggregated frame
# library(ggplot2)
# ggplot(by_region, aes(REGION, revenue)) + geom_col()

sample_for_plot <- tbl(con, in_schema("SALES", "ORDERS")) |>
  filter(ORDER_DATE >= "2025-01-01") |>
  slice_sample(n = 5000) |>   # Snowflake TABLESAMPLE / random — see show_query()
  collect()

15.7.4 Semi-structured columns (brief)

RSnowflake can build OBJECT/ARRAY bundles with object_agg() and array_agg() in dplyr (RSnowflake-only translations). Register bundles as Feature Views in Authoring Feature Views. How those columns round-trip into R (JSON vs Arrow list columns, VECTOR packs, Feature Store save_as) is covered in Training Data in R.


15.7.5 Materialize on the warehouse (compute())

Anti-pattern: collect() after every dplyr verb — data shuttles to R repeatedly.

Better: treat Snowflake as the execution engine. Build lazy pipelines; materialize intermediate results as tables when a step is expensive, reused, or hard to read as one nested query.

Function What it does When to use
collect() Run SQL + pull all rows to R Final modest result (modeling, plot, head())
compute() CREATE TEMPORARY TABLE … AS + return lazy handle Multi-step ETL; break up joins; debug with show_query() per step
copy_to() Upload local data.frame to a Snowflake table Seed pipeline from R object
dbExecute() + sql_render() Run explicit DDL you control Permanent tables, CREATE TABLE AS with custom options
CREATE TEMPORARY TABLE daily_events_tmp AS
SELECT customer_id, event_date, COUNT(*) AS n_events
FROM RAW.EVENTS
WHERE event_date >= '2024-01-01'
GROUP BY customer_id, event_date;

-- Reuse materialized step; final features as permanent table
CREATE TABLE MY_DB.MY_SCHEMA.CUSTOMER_EVENT_FEATURES AS
SELECT customer_id, AVG(n_events) AS avg_daily
FROM daily_events_tmp
GROUP BY customer_id;
library(dplyr)

base <- tbl(con, in_schema("RAW", "EVENTS")) |>
  filter(event_date >= "2024-01-01")

# Materialize once on the warehouse (session-scoped temp table)
daily <- base |>
  group_by(customer_id, event_date) |>
  summarise(n_events = n(), .groups = "drop") |>
  compute(name = "daily_events_tmp")

# Continue from the materialized table — still lazy until collect()
features <- daily |>
  group_by(customer_id) |>
  summarise(avg_daily = mean(n_events), .groups = "drop")

# Permanent table (set session schema or use a fully qualified name)
features |>
  compute(name = "CUSTOMER_EVENT_FEATURES", temporary = FALSE)

# Or inspect SQL without executing
features |> show_query()

RSnowflake’s sql_query_save generates Snowflake CREATE TEMPORARY TABLE (or permanent when temporary = FALSE). Temp tables live for the session — ideal for notebook pipelines; use qualified permanent names for shared artifacts.

When to materialize vs one big query: materialize when (1) an intermediate is reused in multiple branches, (2) the generated SQL is deeply nested or slow to optimize, or (3) you want to inspect row counts with SELECT COUNT(*) on a named table. Otherwise a single pipeline + one collect() is fine.

See Platform Primer — storage-compute for the mental model.


15.8 Bulk write routing

Small tables use batched literal INSERT over the SQL API. Above RSnowflake.bulk_write_threshold (cells = rows × columns; default 200,000 in Workspace, 50,000 elsewhere), "auto" picks a bulk backend.

15.8.1 How "auto" chooses a path (current RSnowflake behaviour)

Environment Order when above threshold
Workspace ADBC write_adbc (internal gateway) → Snowpark write_pandas (fallback) → literal INSERT
Local / Posit ADBC → literal INSERT

This differs from older docs that listed Snowpark as the Workspace default. After internal-gateway ADBC work (2026), ADBC is preferred in Workspace when adbcsnowflake is installed; Snowpark remains an explicit opt-in or fallback. See WORKSPACE_ADBC.md.

flowchart TD
  WT[dbWriteTable or dbAppendTable]
  AUTO{upload_method}
  BIG{cells >= threshold?}
  ADBC{ADBC installed?}
  SP{Snowpark plus reticulate?}
  ADBC_W[ADBC write_adbc from R]
  SP_W[reticulate r_to_py then write_pandas]
  LIT[Literal INSERT]

  WT --> AUTO
  AUTO -->|literal or small| LIT
  AUTO -->|auto and large| BIG
  BIG -->|no| LIT
  BIG -->|yes| ADBC
  ADBC -->|yes| ADBC_W
  ADBC -->|no| SP
  SP -->|yes| SP_W
  SP -->|no| LIT

Method What actually happens When to use
"auto" Routing table above Default
"adbc" R data.frame → Snowflake driver PUT + COPY INTO (no SQL API PUT from R) Force Arrow path; bake into CRE to skip per-session Go install
"snowpark" R → reticulate::r_to_py() → pandas → active Snowpark session write_pandas() Fallback in Workspace; explicit override
"literal" SQL INSERT batches via REST Small tables, debugging, no extra packages
options(RSnowflake.verbose = TRUE)   # log conversion and write timings
options(RSnowflake.upload_method = "auto")
dbWriteTable(con, "big_table", large_df)

options(RSnowflake.bulk_write_threshold = 100000L)  # rows × cols

15.8.2 Is the Snowpark path copying the whole data frame through reticulate?

Yes, when Snowpark is used. RSnowflake converts the R data.frame in memory with reticulate::r_to_py(), then calls the Workspace kernel’s Snowpark write_pandas(). In RSnowflake benchmarks (~50k rows), that R→pandas step is on the order of ~0.2 seconds — small next to network upload, but it still duplicates the dataset in process memory (R + Python) for the duration of the write.

ADBC avoids the Python hop: write_adbc() sends the R table through the Arrow/Snowflake driver (PUT + COPY under the hood), staying in the R process with no pandas intermediate.

15.8.3 For very large data, prefer not to round-trip through R

Approach Why
dbplyr / SQLcompute(), CREATE TABLE AS, load from existing Snowflake tables Data never leaves the warehouse
Python Snowpark cell — write from a DataFrame already in Python, or session.write_pandas from a file you read in Python No R→Python copy for the bulk of the bytes
Stage + COPY INTO (SQL) Classic warehouse pattern; RSnowflake’s SQL API driver cannot run client PUT — so “write CSV in R, then PUT” is not a built-in RSnowflake path; use a Python cell, Snowpark, or ADBC instead
dbWriteTable(large_df) from R Convenient up to modest sizes; for millions of rows, memory and reticulate/ADBC cost dominate

Writing CSV/Parquet locally and loading via Python is a valid manual pattern in a mixed notebook, but RSnowflake does not automatically spool to disk first — it passes the in-memory data.frame to ADBC or r_to_py().

15.8.4 Workspace benchmarks (order of magnitude)

From RSnowflake testing (~50k rows, Mar 2026 — your account may differ):

Path Typical time
Snowpark write_pandas (after r_to_py) ~2–3 s
ADBC via internal SNOWFLAKE_HOST ~7 s
Literal INSERT via SQL API ~45 s

So Snowpark can be faster when both are available, but ADBC is the default in "auto" because it avoids the reticulate dependency and matches bulk read auth. Use options(RSnowflake.upload_method = "snowpark") when you have measured Snowpark winning for your shape of data.

Reads follow a similar "auto" / "adbc" backend via RSnowflake.backend.


15.9 Arrow and ADBC

What is ADBC?

ADBC (Arrow Database Connectivity) is a columnar database access standard (Apache Arrow ecosystem). RSnowflake uses the Snowflake ADBC driver for bulk I/O; routine queries use the SQL API. Glossary · SQL API chapter.

The Snowflake ADBC driver uses the Go Snowflake driver for columnar bulk transfer:

requireNamespace("adbcsnowflake", quietly = TRUE)

options(RSnowflake.backend = "adbc")
df <- dbGetQuery(con, "SELECT * FROM big_table")
options(RSnowflake.backend = "auto")

Workspace considerations:

  • Installing adbcsnowflake may require Go compiler and EAI for module downloads (~2 min cold bootstrap)
  • ADBC can use internal SNOWFLAKE_HOST gateway with OAuth — no public URL required
  • Bulk writes may still prefer Snowpark in "auto" even when ADBC is installed

Bake ADBC into a CRE to avoid repeat installs. Deep dive: WORKSPACE_ADBC.md.

dbGetQueryArrow() provides DBI Arrow compatibility via nanoarrow. SQL API v2 returns JSON today — native server-side Arrow is on the roadmap; for large reads, dbGetQuery() with ADBC backend is the practical path.


15.10 SQL API limitations

RSnowflake uses the SQL REST API. These affect how you I/O, not whether you can query:

Unsupported Implication
GET / PUT No client-side stage transfer over REST
USE DATABASE / SCHEMA Set context in connection or qualify names
ALTER SESSION Set params via supported alternatives

SPCS workers with RSnowflake must use stage volume mounts for file I/O — see Parallel doSnowflake and Troubleshooting.


15.11 Choosing an access pattern

Goal Pattern
Ad hoc SQL dbGetQuery()
ETL-style transforms dbplyr pipeline + compute() for heavy steps + collect() at end
Large local export ADBC backend + aggregate in SQL first
Workspace notebook write "auto" or "snowpark"
ML platform objects snowflakeR sfr_* APIs (may share DBI via sfr_dbi_connection())

15.12 Next steps

RSnowflake in Workspace — bootstrap order, gateway, ADBC in notebooks.