flowchart LR R[R session RSnowflake] API[SQL REST API or ADBC] WH[Warehouse] STG[(Tables / stages)] R --> API --> WH --> STG
15 RSnowflake: Query
DBI, dbplyr, bulk I/O, and Arrow
snowflake, R, RStudio, Posit, VS Code, workspace notebooks, snowflakeR, RSnowflake, mlops
15.1 Overview
Once connected (Connect chapter), RSnowflake gives you three layers of data access:
- DBI — standard R database interface (
dbGetQuery,dbWriteTable, …) - dbplyr — dplyr verbs translated to SQL and executed in Snowflake
- Bulk I/O — optimized paths for large reads/writes (Snowpark, ADBC, or literal SQL)
The design goal is push computation to the data. Your R session sends SQL; the warehouse aggregates, filters, and joins close to storage. You collect() only what you need for modeling or visualization.
15.2 Learning Objectives
- Run queries, DDL, and table operations with DBI
- Build lazy dplyr pipelines that execute on the warehouse
- Use RSnowflake-only dbplyr translations (semi-structured types,
OBJECT_AGG, approximate aggregates) - Materialize intermediate results with
compute()(session temp tables) vs pull summaries withcollect() - Combine window functions, joins, and multi-step ETL entirely in Snowflake SQL via dbplyr
- Understand how RSnowflake chooses bulk write/read backends
- Know SQL API limits that affect stage I/O
15.3 Query execution model
| Step | Where it runs |
|---|---|
dplyr pipeline before collect() |
SQL built in R; not executed yet |
collect(), dbGetQuery() |
Warehouse executes SQL |
| Result transfer | JSON (SQL API) or Arrow (ADBC) to R |
Bulk writes from R (dbWriteTable, dbAppendTable) use separate routing — see Bulk write routing. Prefer not loading huge tables into R only to write them back; materialize with SQL/dbplyr when possible.
15.4 DBI basics
DBI examples below are already the SQL path: you send warehouse SQL as strings (dbGetQuery, dbExecute). Pair them with dbplyr tabs when you want the same logic as lazy dplyr.
15.4.1 Reads and parameterized queries
df <- dbGetQuery(con, "SELECT * FROM MY_DB.MY_SCHEMA.MY_TABLE LIMIT 100")
df <- dbGetQuery(con,
"SELECT * FROM users WHERE age > ?",
params = list(21L)
)Parameterized queries avoid SQL injection and help the optimizer reuse plans.
15.4.2 DDL and DML
dbExecute(con, "CREATE TABLE test_tbl (id INT, name VARCHAR)")
dbExecute(con, "INSERT INTO test_tbl VALUES (1, 'Alice')")dbExecute() returns rows affected; use for statements without result sets.
15.4.3 Table round-trip
dbWriteTable(con, "iris_copy", iris, overwrite = TRUE)
iris_back <- dbReadTable(con, "iris_copy")
dbAppendTable(con, "iris_copy", iris[1:10, ])
dbRemoveTable(con, "iris_copy")Use fully qualified names when Workspace session context is unset. RSnowflake uppercases unquoted identifiers by default — see Connect — case.
15.5 dbplyr: lazy SQL from dplyr
dbplyr turns dplyr into a SQL compiler. Tables become lazy; verbs append to a query plan until you collect(). Where the same logic is practical in both forms, examples use SQL and dbplyr tabs (group="query-sql-dplyr").
Why lazy matters: filter() and summarise() do not pull data to R. A 10-billion-row table stays in Snowflake until the final aggregation is computed on the warehouse.
dbplyr keeps Snowflake’s UPPERCASE column names. snowflakeR’s sfr_query() lowercases — do not mix blindly in joins between APIs.
Generic Snowflake translations (floor_date() → DATE_TRUNC, paste0() → ARRAY_TO_STRING, ifelse() → IFF, and many more) come from dbplyr via dbplyr::simulate_snowflake(). RSnowflake adds the Snowflake-only functions below — they are not available when you use ODBC/dbplyr against Snowflake from other drivers.
Full tables: RSnowflake getting-started vignette · Appendix H — getting-started HTML.
15.6 RSnowflake-only dplyr translations
Use these inside mutate(), filter(), and summarise() on tbl(con, ...). They compile to Snowflake SQL and run on the warehouse.
15.6.1 Semi-structured scalars
| R function | Snowflake SQL | Typical use |
|---|---|---|
parse_json(x) |
PARSE_JSON(x) |
VARCHAR → VARIANT |
try_parse_json(x) |
TRY_PARSE_JSON(x) |
Safe parse (NULL on bad JSON) |
typeof(x) |
TYPEOF(x) |
Inspect VARIANT element type |
is_object(x) |
IS_OBJECT(x) |
Branch on OBJECT payloads |
is_array(x) |
IS_ARRAY(x) |
Branch on ARRAY payloads |
is_integer(x) |
IS_INTEGER(x) |
VARIANT holds integer |
15.6.2 Array scalars
| R function | Snowflake SQL | Notes |
|---|---|---|
array_size(x) |
ARRAY_SIZE(x) |
Length of ARRAY |
array_contains(arr, val) |
ARRAY_CONTAINS(val, arr) |
RSnowflake swaps arg order to match Snowflake |
array_slice(arr, from, to) |
ARRAY_SLICE(arr, from, to) |
Sub-array |
15.6.3 Strings and hashing
| R function | Snowflake SQL | Notes |
|---|---|---|
regexp_substr(x, pattern, ...) |
REGEXP_SUBSTR(...) |
Snowflake-specific args (pos, occ, params) |
hash(...) |
HASH(...) |
Stable hash across columns |
15.6.4 Aggregates (inside summarise())
| R function | Snowflake SQL | Typical use |
|---|---|---|
object_agg(key, value) |
OBJECT_AGG(key, value) |
Bundle many rows into one OBJECT per group |
array_agg(x) |
ARRAY_AGG(x) |
Positional feature vector as ARRAY |
array_unique_agg(x) |
ARRAY_UNIQUE_AGG(x) |
Distinct ARRAY elements |
approx_count_distinct(x) |
APPROX_COUNT_DISTINCT(x) |
Cheap cardinality on large groups |
approx_percentile(x, p) |
APPROX_PERCENTILE(x, p) |
Approximate quantiles |
mode_val(x) |
MODE(x) |
Statistical mode (mode_val avoids base::mode()) |
Window variants of the aggregate functions above are also registered for analytic queries.
15.6.5 Example: events with JSON payloads
15.6.6 Example: bundle sparse features into OBJECT (Feature Store pattern)
Snowflake’s Feature Store advanced patterns recommend OBJECT (or ARRAY) columns to avoid thousand-column tables. You can build those bundles in dplyr on the warehouse, then register the result as an external feature view from Feature Store or materialize with compute().
# Long/narrow source: one row per (entity, feature_name, value)
bundled <- tbl(con, in_schema("FEATURE_STORE", "ENTITY_FEATURES_LONG")) |>
group_by(ENTITY_ID) |>
summarise(
FEATURE_OBJECT = object_agg(FEATURE_NAME, FEATURE_VALUE),
.groups = "drop"
)
bundled |> show_query()
# Persist for reuse (session temp table) or inspect a sample
feature_sample <- bundled |> head(100) |> collect()For fixed keys (known feature names), OBJECT_CONSTRUCT in SQL is often clearer — use dbGetQuery() or mutate(feature_object = sql("OBJECT_CONSTRUCT('avg_spend', AVG_SPEND, 'n_orders', N_ORDERS)")) after a summarise() step.
Path access (payload:field::FLOAT, bracket notation) is not wrapped as dplyr helpers today. Use dbGetQuery() or mutate(x = sql("FEATURE_OBJECT:avg_spend::FLOAT")) for extraction expressions.
15.7 Push-down analytics: dplyr on the warehouse
The power of dbplyr is not “R that talks SQL” — it is keeping heavy work in Snowflake and using R for orchestration, inspection, and modest result sets (plots, model matrices, samples).
| Verb / action | Runs in | Typical use |
|---|---|---|
filter(), mutate(), group_by(), summarise() |
Warehouse (lazy) | Feature engineering, aggregations |
show_query() |
R only | Debug generated SQL before execution |
compute() |
Warehouse (CREATE TEMP TABLE … AS) |
Persist an intermediate; reuse in branches |
collect() |
Warehouse execute + transfer to R | Final table slice, ggplot input, head() |
Feature Store training exports (OBJECT/ARRAY/VECTOR pipelines, ADBC decode, wide ASOF joins) live in Training Data in R — not here. Feature View authoring SQL/dbplyr is in Authoring Feature Views. This chapter focuses on everyday dbplyr analytics.
15.7.1 Multi-step ETL with temp tables
Break a pipeline into named steps on the warehouse instead of one nested query or repeated collect():
-- Step 1 — session temp table (dbplyr compute() generates similar DDL)
CREATE TEMPORARY TABLE daily_orders_tmp AS
SELECT
CUSTOMER_ID,
ORDER_DATE,
COUNT(*) AS n_orders,
SUM(AMOUNT) AS revenue
FROM ANALYTICS.ORDERS
WHERE ORDER_DATE >= '2024-01-01'
GROUP BY CUSTOMER_ID, ORDER_DATE;
-- Step 2 — aggregate from materialized step
SELECT
CUSTOMER_ID,
SUM(n_orders) AS orders_90d,
SUM(revenue) AS revenue_90d,
SUM(revenue) / NULLIF(SUM(n_orders), 0) AS avg_order
FROM daily_orders_tmp
GROUP BY CUSTOMER_ID;
-- Inspect row counts without pulling data to R
SELECT COUNT(*) AS n FROM daily_orders_tmp;library(dplyr)
orders <- tbl(con, in_schema("ANALYTICS", "ORDERS")) |>
filter(ORDER_DATE >= "2024-01-01")
# Step 1 — materialize daily rollups (session temp table)
daily <- orders |>
group_by(CUSTOMER_ID, ORDER_DATE) |>
summarise(
n_orders = n(),
revenue = sum(AMOUNT, na.rm = TRUE),
.groups = "drop"
) |>
compute(name = "daily_orders_tmp")
# Step 2 — continue from temp table; still lazy until collect()
customer_rollup <- daily |>
group_by(CUSTOMER_ID) |>
summarise(
orders_90d = sum(n_orders),
revenue_90d = sum(revenue),
avg_order = revenue_90d / orders_90d,
.groups = "drop"
)
customer_rollup |> show_query()
dbGetQuery(con, "SELECT COUNT(*) AS n FROM daily_orders_tmp")15.7.2 Joins and window functions
dbplyr translates joins and window verbs to Snowflake SQL — including qualify-style patterns via filter() on windowed columns:
CREATE TEMPORARY TABLE top5_orders_tmp AS
SELECT
CUSTOMER_ID,
ORDER_ID,
ORDER_TS,
AMOUNT,
running_spend
FROM (
SELECT
CUSTOMER_ID,
ORDER_ID,
ORDER_TS,
AMOUNT,
SUM(AMOUNT) OVER (
PARTITION BY CUSTOMER_ID ORDER BY ORDER_TS
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
) AS running_spend,
ROW_NUMBER() OVER (
PARTITION BY CUSTOMER_ID ORDER BY ORDER_TS DESC
) AS order_rank
FROM SALES.LINE_ITEMS
)
WHERE order_rank <= 5;library(dplyr)
line_items <- tbl(con, in_schema("SALES", "LINE_ITEMS"))
ranked <- line_items |>
group_by(CUSTOMER_ID) |>
mutate(
order_rank = row_number(desc(ORDER_TS)),
running_spend = sum(AMOUNT)
) |>
filter(order_rank <= 5L) # top 5 orders per customer — computed in Snowflake
top5_per_customer <- ranked |>
select(CUSTOMER_ID, ORDER_ID, ORDER_TS, AMOUNT, running_spend) |>
compute(name = "top5_orders_tmp", temporary = TRUE)For production pipelines, set temporary = FALSE and a fully qualified table name when the artifact should outlive the session.
15.7.3 Collect only what you need
Anti-pattern: collect() on a billion-row table to aggregate in R.
Better: aggregate in SQL, collect() the summary (or a stratified sample):
# Summary for a dashboard — tiny result set
by_region <- tbl(con, in_schema("SALES", "ORDERS")) |>
group_by(REGION) |>
summarise(revenue = sum(AMOUNT), .groups = "drop") |>
collect()
# Optional: ggplot2 on the aggregated frame
# library(ggplot2)
# ggplot(by_region, aes(REGION, revenue)) + geom_col()
sample_for_plot <- tbl(con, in_schema("SALES", "ORDERS")) |>
filter(ORDER_DATE >= "2025-01-01") |>
slice_sample(n = 5000) |> # Snowflake TABLESAMPLE / random — see show_query()
collect()15.7.4 Semi-structured columns (brief)
RSnowflake can build OBJECT/ARRAY bundles with object_agg() and array_agg() in dplyr (RSnowflake-only translations). Register bundles as Feature Views in Authoring Feature Views. How those columns round-trip into R (JSON vs Arrow list columns, VECTOR packs, Feature Store save_as) is covered in Training Data in R.
15.7.5 Materialize on the warehouse (compute())
Anti-pattern: collect() after every dplyr verb — data shuttles to R repeatedly.
Better: treat Snowflake as the execution engine. Build lazy pipelines; materialize intermediate results as tables when a step is expensive, reused, or hard to read as one nested query.
| Function | What it does | When to use |
|---|---|---|
collect() |
Run SQL + pull all rows to R | Final modest result (modeling, plot, head()) |
compute() |
CREATE TEMPORARY TABLE … AS + return lazy handle |
Multi-step ETL; break up joins; debug with show_query() per step |
copy_to() |
Upload local data.frame to a Snowflake table |
Seed pipeline from R object |
dbExecute() + sql_render() |
Run explicit DDL you control | Permanent tables, CREATE TABLE AS with custom options |
CREATE TEMPORARY TABLE daily_events_tmp AS
SELECT customer_id, event_date, COUNT(*) AS n_events
FROM RAW.EVENTS
WHERE event_date >= '2024-01-01'
GROUP BY customer_id, event_date;
-- Reuse materialized step; final features as permanent table
CREATE TABLE MY_DB.MY_SCHEMA.CUSTOMER_EVENT_FEATURES AS
SELECT customer_id, AVG(n_events) AS avg_daily
FROM daily_events_tmp
GROUP BY customer_id;library(dplyr)
base <- tbl(con, in_schema("RAW", "EVENTS")) |>
filter(event_date >= "2024-01-01")
# Materialize once on the warehouse (session-scoped temp table)
daily <- base |>
group_by(customer_id, event_date) |>
summarise(n_events = n(), .groups = "drop") |>
compute(name = "daily_events_tmp")
# Continue from the materialized table — still lazy until collect()
features <- daily |>
group_by(customer_id) |>
summarise(avg_daily = mean(n_events), .groups = "drop")
# Permanent table (set session schema or use a fully qualified name)
features |>
compute(name = "CUSTOMER_EVENT_FEATURES", temporary = FALSE)
# Or inspect SQL without executing
features |> show_query()RSnowflake’s sql_query_save generates Snowflake CREATE TEMPORARY TABLE (or permanent when temporary = FALSE). Temp tables live for the session — ideal for notebook pipelines; use qualified permanent names for shared artifacts.
When to materialize vs one big query: materialize when (1) an intermediate is reused in multiple branches, (2) the generated SQL is deeply nested or slow to optimize, or (3) you want to inspect row counts with SELECT COUNT(*) on a named table. Otherwise a single pipeline + one collect() is fine.
See Platform Primer — storage-compute for the mental model.
15.8 Bulk write routing
Small tables use batched literal INSERT over the SQL API. Above RSnowflake.bulk_write_threshold (cells = rows × columns; default 200,000 in Workspace, 50,000 elsewhere), "auto" picks a bulk backend.
15.8.1 How "auto" chooses a path (current RSnowflake behaviour)
| Environment | Order when above threshold |
|---|---|
| Workspace | ADBC write_adbc (internal gateway) → Snowpark write_pandas (fallback) → literal INSERT |
| Local / Posit | ADBC → literal INSERT |
This differs from older docs that listed Snowpark as the Workspace default. After internal-gateway ADBC work (2026), ADBC is preferred in Workspace when adbcsnowflake is installed; Snowpark remains an explicit opt-in or fallback. See WORKSPACE_ADBC.md.
flowchart TD
WT[dbWriteTable or dbAppendTable]
AUTO{upload_method}
BIG{cells >= threshold?}
ADBC{ADBC installed?}
SP{Snowpark plus reticulate?}
ADBC_W[ADBC write_adbc from R]
SP_W[reticulate r_to_py then write_pandas]
LIT[Literal INSERT]
WT --> AUTO
AUTO -->|literal or small| LIT
AUTO -->|auto and large| BIG
BIG -->|no| LIT
BIG -->|yes| ADBC
ADBC -->|yes| ADBC_W
ADBC -->|no| SP
SP -->|yes| SP_W
SP -->|no| LIT
| Method | What actually happens | When to use |
|---|---|---|
"auto" |
Routing table above | Default |
"adbc" |
R data.frame → Snowflake driver PUT + COPY INTO (no SQL API PUT from R) |
Force Arrow path; bake into CRE to skip per-session Go install |
"snowpark" |
R → reticulate::r_to_py() → pandas → active Snowpark session write_pandas() |
Fallback in Workspace; explicit override |
"literal" |
SQL INSERT batches via REST |
Small tables, debugging, no extra packages |
15.8.2 Is the Snowpark path copying the whole data frame through reticulate?
Yes, when Snowpark is used. RSnowflake converts the R data.frame in memory with reticulate::r_to_py(), then calls the Workspace kernel’s Snowpark write_pandas(). In RSnowflake benchmarks (~50k rows), that R→pandas step is on the order of ~0.2 seconds — small next to network upload, but it still duplicates the dataset in process memory (R + Python) for the duration of the write.
ADBC avoids the Python hop: write_adbc() sends the R table through the Arrow/Snowflake driver (PUT + COPY under the hood), staying in the R process with no pandas intermediate.
15.8.3 For very large data, prefer not to round-trip through R
| Approach | Why |
|---|---|
dbplyr / SQL — compute(), CREATE TABLE AS, load from existing Snowflake tables |
Data never leaves the warehouse |
Python Snowpark cell — write from a DataFrame already in Python, or session.write_pandas from a file you read in Python |
No R→Python copy for the bulk of the bytes |
Stage + COPY INTO (SQL) |
Classic warehouse pattern; RSnowflake’s SQL API driver cannot run client PUT — so “write CSV in R, then PUT” is not a built-in RSnowflake path; use a Python cell, Snowpark, or ADBC instead |
dbWriteTable(large_df) from R |
Convenient up to modest sizes; for millions of rows, memory and reticulate/ADBC cost dominate |
Writing CSV/Parquet locally and loading via Python is a valid manual pattern in a mixed notebook, but RSnowflake does not automatically spool to disk first — it passes the in-memory data.frame to ADBC or r_to_py().
15.8.4 Workspace benchmarks (order of magnitude)
From RSnowflake testing (~50k rows, Mar 2026 — your account may differ):
| Path | Typical time |
|---|---|
Snowpark write_pandas (after r_to_py) |
~2–3 s |
ADBC via internal SNOWFLAKE_HOST |
~7 s |
| Literal INSERT via SQL API | ~45 s |
So Snowpark can be faster when both are available, but ADBC is the default in "auto" because it avoids the reticulate dependency and matches bulk read auth. Use options(RSnowflake.upload_method = "snowpark") when you have measured Snowpark winning for your shape of data.
Reads follow a similar "auto" / "adbc" backend via RSnowflake.backend.
15.9 Arrow and ADBC
ADBC (Arrow Database Connectivity) is a columnar database access standard (Apache Arrow ecosystem). RSnowflake uses the Snowflake ADBC driver for bulk I/O; routine queries use the SQL API. Glossary · SQL API chapter.
The Snowflake ADBC driver uses the Go Snowflake driver for columnar bulk transfer:
requireNamespace("adbcsnowflake", quietly = TRUE)
options(RSnowflake.backend = "adbc")
df <- dbGetQuery(con, "SELECT * FROM big_table")
options(RSnowflake.backend = "auto")Workspace considerations:
- Installing
adbcsnowflakemay require Go compiler and EAI for module downloads (~2 min cold bootstrap) - ADBC can use internal
SNOWFLAKE_HOSTgateway with OAuth — no public URL required - Bulk writes may still prefer Snowpark in
"auto"even when ADBC is installed
Bake ADBC into a CRE to avoid repeat installs. Deep dive: WORKSPACE_ADBC.md.
dbGetQueryArrow() provides DBI Arrow compatibility via nanoarrow. SQL API v2 returns JSON today — native server-side Arrow is on the roadmap; for large reads, dbGetQuery() with ADBC backend is the practical path.
15.10 SQL API limitations
RSnowflake uses the SQL REST API. These affect how you I/O, not whether you can query:
| Unsupported | Implication |
|---|---|
GET / PUT |
No client-side stage transfer over REST |
USE DATABASE / SCHEMA |
Set context in connection or qualify names |
ALTER SESSION |
Set params via supported alternatives |
SPCS workers with RSnowflake must use stage volume mounts for file I/O — see Parallel doSnowflake and Troubleshooting.
15.11 Choosing an access pattern
| Goal | Pattern |
|---|---|
| Ad hoc SQL | dbGetQuery() |
| ETL-style transforms | dbplyr pipeline + compute() for heavy steps + collect() at end |
| Large local export | ADBC backend + aggregate in SQL first |
| Workspace notebook write | "auto" or "snowpark" |
| ML platform objects | snowflakeR sfr_* APIs (may share DBI via sfr_dbi_connection()) |
15.12 Next steps
RSnowflake in Workspace — bootstrap order, gateway, ADBC in notebooks.