15 RSnowflake: Query

DBI, dbplyr, bulk I/O, and Arrow

Keywords

snowflake, R, RStudio, Posit, VS Code, workspace notebooks, snowflakeR, RSnowflake, mlops

15.1 Overview

Once connected (Connect chapter), RSnowflake gives you three layers of data access:

DBI — standard R database interface (dbGetQuery, dbWriteTable, …)
dbplyr — dplyr verbs translated to SQL and executed in Snowflake
Bulk I/O — optimized paths for large reads/writes (Snowpark, ADBC, or literal SQL)

The design goal is push computation to the data. Your R session sends SQL; the warehouse aggregates, filters, and joins close to storage. You collect() only what you need for modeling or visualization.

15.2 Learning Objectives

Run queries, DDL, and table operations with DBI
Build lazy dplyr pipelines that execute on the warehouse
Use RSnowflake-only dbplyr translations (semi-structured types, OBJECT_AGG, approximate aggregates)
Materialize intermediate results with compute() (session temp tables) vs pull summaries with collect()
Combine window functions, joins, and multi-step ETL entirely in Snowflake SQL via dbplyr
Understand how RSnowflake chooses bulk write/read backends
Know SQL API limits that affect stage I/O

15.3 Query execution model

flowchart LR
  R[R session RSnowflake]
  API[SQL REST API or ADBC]
  WH[Warehouse]
  STG[(Tables / stages)]
  R --> API --> WH --> STG

Step	Where it runs
dplyr pipeline before `collect()`	SQL built in R; not executed yet
`collect()`, `dbGetQuery()`	Warehouse executes SQL
Result transfer	JSON (SQL API) or Arrow (ADBC) to R

Bulk writes from R (dbWriteTable, dbAppendTable) use separate routing — see Bulk write routing. Prefer not loading huge tables into R only to write them back; materialize with SQL/dbplyr when possible.

15.4 DBI basics

DBI examples below are already the SQL path: you send warehouse SQL as strings (dbGetQuery, dbExecute). Pair them with dbplyr tabs when you want the same logic as lazy dplyr.

15.4.1 Reads and parameterized queries

df <- dbGetQuery(con, "SELECT * FROM MY_DB.MY_SCHEMA.MY_TABLE LIMIT 100")

df <- dbGetQuery(con,
  "SELECT * FROM users WHERE age > ?",
  params = list(21L)
)

Parameterized queries avoid SQL injection and help the optimizer reuse plans.

15.4.2 DDL and DML

dbExecute(con, "CREATE TABLE test_tbl (id INT, name VARCHAR)")
dbExecute(con, "INSERT INTO test_tbl VALUES (1, 'Alice')")

dbExecute() returns rows affected; use for statements without result sets.

15.4.3 Table round-trip

dbWriteTable(con, "iris_copy", iris, overwrite = TRUE)
iris_back <- dbReadTable(con, "iris_copy")
dbAppendTable(con, "iris_copy", iris[1:10, ])
dbRemoveTable(con, "iris_copy")

Note

Use fully qualified names when Workspace session context is unset. RSnowflake uppercases unquoted identifiers by default — see Connect — case.

15.5 dbplyr: lazy SQL from dplyr

dbplyr turns dplyr into a SQL compiler. Tables become lazy; verbs append to a query plan until you collect(). Where the same logic is practical in both forms, examples use SQL and dbplyr tabs (group="query-sql-dplyr").

SQL
dbplyr

SELECT
  REGION,
  COUNT(*)                    AS n,
  SUM(AMOUNT)                 AS revenue
FROM MY_DB.MY_SCHEMA.ORDERS
WHERE STATUS = 'OPEN'
  AND ORDER_DATE >= '2024-01-01'
GROUP BY REGION
ORDER BY revenue DESC;

library(dplyr)

tbl(con, in_schema("MY_SCHEMA", "ORDERS")) |>
  filter(STATUS == "OPEN", ORDER_DATE >= "2024-01-01") |>
  group_by(REGION) |>
  summarise(
    n       = n(),
    revenue = sum(AMOUNT, na.rm = TRUE),
    .groups = "drop"
  ) |>
  arrange(desc(revenue)) |>
  collect()   # single SQL round trip

Why lazy matters: filter() and summarise() do not pull data to R. A 10-billion-row table stays in Snowflake until the final aggregation is computed on the warehouse.

Warning

dbplyr keeps Snowflake’s UPPERCASE column names. snowflakeR’s sfr_query() lowercases — do not mix blindly in joins between APIs.

Generic Snowflake translations (floor_date() → DATE_TRUNC, paste0() → ARRAY_TO_STRING, ifelse() → IFF, and many more) come from dbplyr via dbplyr::simulate_snowflake(). RSnowflake adds the Snowflake-only functions below — they are not available when you use ODBC/dbplyr against Snowflake from other drivers.

Full tables: RSnowflake getting-started vignette · Appendix H — getting-started HTML.

15.6 RSnowflake-only dplyr translations

Use these inside mutate(), filter(), and summarise() on tbl(con, ...). They compile to Snowflake SQL and run on the warehouse.

15.6.1 Semi-structured scalars

R function	Snowflake SQL	Typical use
`parse_json(x)`	`PARSE_JSON(x)`	VARCHAR → VARIANT
`try_parse_json(x)`	`TRY_PARSE_JSON(x)`	Safe parse (NULL on bad JSON)
`typeof(x)`	`TYPEOF(x)`	Inspect VARIANT element type
`is_object(x)`	`IS_OBJECT(x)`	Branch on OBJECT payloads
`is_array(x)`	`IS_ARRAY(x)`	Branch on ARRAY payloads
`is_integer(x)`	`IS_INTEGER(x)`	VARIANT holds integer

15.6.2 Array scalars

R function	Snowflake SQL	Notes
`array_size(x)`	`ARRAY_SIZE(x)`	Length of ARRAY
`array_contains(arr, val)`	`ARRAY_CONTAINS(val, arr)`	RSnowflake swaps arg order to match Snowflake
`array_slice(arr, from, to)`	`ARRAY_SLICE(arr, from, to)`	Sub-array

15.6.3 Strings and hashing

R function	Snowflake SQL	Notes
`regexp_substr(x, pattern, ...)`	`REGEXP_SUBSTR(...)`	Snowflake-specific args (`pos`, `occ`, `params`)
`hash(...)`	`HASH(...)`	Stable hash across columns

15.6.4 Aggregates (inside `summarise()`)

R function	Snowflake SQL	Typical use
`object_agg(key, value)`	`OBJECT_AGG(key, value)`	Bundle many rows into one OBJECT per group
`array_agg(x)`	`ARRAY_AGG(x)`	Positional feature vector as ARRAY
`array_unique_agg(x)`	`ARRAY_UNIQUE_AGG(x)`	Distinct ARRAY elements
`approx_count_distinct(x)`	`APPROX_COUNT_DISTINCT(x)`	Cheap cardinality on large groups
`approx_percentile(x, p)`	`APPROX_PERCENTILE(x, p)`	Approximate quantiles
`mode_val(x)`	`MODE(x)`	Statistical mode (`mode_val` avoids `base::mode()`)

Window variants of the aggregate functions above are also registered for analytic queries.

15.6.5 Example: events with JSON payloads

SQL
dbplyr

SELECT
  EVENT_TYPE,
  ARRAY_UNIQUE_AGG(USER_ID)          AS user_ids,
  APPROX_COUNT_DISTINCT(USER_ID)     AS n_users
FROM (
  SELECT
    EVENT_TYPE,
    USER_ID,
    TRY_PARSE_JSON(RAW_JSON) AS payload
  FROM CLICKSTREAM_DATA.EVENTS
) e
WHERE IS_OBJECT(payload)
GROUP BY EVENT_TYPE;

library(dplyr)

tbl(con, in_schema("CLICKSTREAM_DATA", "EVENTS")) |>
  mutate(payload = parse_json(RAW_JSON)) |>
  filter(is_object(payload)) |>
  group_by(EVENT_TYPE) |>
  summarise(
    user_ids   = array_unique_agg(USER_ID),
    n_users    = approx_count_distinct(USER_ID),
    .groups    = "drop"
  ) |>
  collect()

15.6.6 Example: bundle sparse features into OBJECT (Feature Store pattern)

Snowflake’s Feature Store advanced patterns recommend OBJECT (or ARRAY) columns to avoid thousand-column tables. You can build those bundles in dplyr on the warehouse, then register the result as an external feature view from Feature Store or materialize with compute().

SQL
dbplyr

-- Long/narrow source: one row per (entity, feature_name, value)
SELECT
  ENTITY_ID,
  OBJECT_AGG(FEATURE_NAME, FEATURE_VALUE) AS feature_object
FROM FEATURE_STORE.ENTITY_FEATURES_LONG
GROUP BY ENTITY_ID;

# Long/narrow source: one row per (entity, feature_name, value)
bundled <- tbl(con, in_schema("FEATURE_STORE", "ENTITY_FEATURES_LONG")) |>
  group_by(ENTITY_ID) |>
  summarise(
    FEATURE_OBJECT = object_agg(FEATURE_NAME, FEATURE_VALUE),
    .groups = "drop"
  )

bundled |> show_query()

# Persist for reuse (session temp table) or inspect a sample
feature_sample <- bundled |> head(100) |> collect()

For fixed keys (known feature names), OBJECT_CONSTRUCT in SQL is often clearer — use dbGetQuery() or mutate(feature_object = sql("OBJECT_CONSTRUCT('avg_spend', AVG_SPEND, 'n_orders', N_ORDERS)")) after a summarise() step.

Path access (payload:field::FLOAT, bracket notation) is not wrapped as dplyr helpers today. Use dbGetQuery() or mutate(x = sql("FEATURE_OBJECT:avg_spend::FLOAT")) for extraction expressions.

15.7 Push-down analytics: dplyr on the warehouse

The power of dbplyr is not “R that talks SQL” — it is keeping heavy work in Snowflake and using R for orchestration, inspection, and modest result sets (plots, model matrices, samples).

Verb / action	Runs in	Typical use
`filter()`, `mutate()`, `group_by()`, `summarise()`	Warehouse (lazy)	Feature engineering, aggregations
`show_query()`	R only	Debug generated SQL before execution
`compute()`	Warehouse (`CREATE TEMP TABLE … AS`)	Persist an intermediate; reuse in branches
`collect()`	Warehouse execute + transfer to R	Final table slice, ggplot input, `head()`

Tip

Feature Store training exports (OBJECT/ARRAY/VECTOR pipelines, ADBC decode, wide ASOF joins) live in Training Data in R — not here. Feature View authoring SQL/dbplyr is in Authoring Feature Views. This chapter focuses on everyday dbplyr analytics.

15.7.1 Multi-step ETL with temp tables

Break a pipeline into named steps on the warehouse instead of one nested query or repeated collect():

SQL
dbplyr

-- Step 1 — session temp table (dbplyr compute() generates similar DDL)
CREATE TEMPORARY TABLE daily_orders_tmp AS
SELECT
  CUSTOMER_ID,
  ORDER_DATE,
  COUNT(*)              AS n_orders,
  SUM(AMOUNT)           AS revenue
FROM ANALYTICS.ORDERS
WHERE ORDER_DATE >= '2024-01-01'
GROUP BY CUSTOMER_ID, ORDER_DATE;

-- Step 2 — aggregate from materialized step
SELECT
  CUSTOMER_ID,
  SUM(n_orders)  AS orders_90d,
  SUM(revenue)   AS revenue_90d,
  SUM(revenue) / NULLIF(SUM(n_orders), 0) AS avg_order
FROM daily_orders_tmp
GROUP BY CUSTOMER_ID;

-- Inspect row counts without pulling data to R
SELECT COUNT(*) AS n FROM daily_orders_tmp;

library(dplyr)

orders <- tbl(con, in_schema("ANALYTICS", "ORDERS")) |>
  filter(ORDER_DATE >= "2024-01-01")

# Step 1 — materialize daily rollups (session temp table)
daily <- orders |>
  group_by(CUSTOMER_ID, ORDER_DATE) |>
  summarise(
    n_orders = n(),
    revenue  = sum(AMOUNT, na.rm = TRUE),
    .groups  = "drop"
  ) |>
  compute(name = "daily_orders_tmp")

# Step 2 — continue from temp table; still lazy until collect()
customer_rollup <- daily |>
  group_by(CUSTOMER_ID) |>
  summarise(
    orders_90d   = sum(n_orders),
    revenue_90d  = sum(revenue),
    avg_order    = revenue_90d / orders_90d,
    .groups      = "drop"
  )

customer_rollup |> show_query()

dbGetQuery(con, "SELECT COUNT(*) AS n FROM daily_orders_tmp")

15.7.2 Joins and window functions

dbplyr translates joins and window verbs to Snowflake SQL — including qualify-style patterns via filter() on windowed columns:

SQL
dbplyr

CREATE TEMPORARY TABLE top5_orders_tmp AS
SELECT
  CUSTOMER_ID,
  ORDER_ID,
  ORDER_TS,
  AMOUNT,
  running_spend
FROM (
  SELECT
    CUSTOMER_ID,
    ORDER_ID,
    ORDER_TS,
    AMOUNT,
    SUM(AMOUNT) OVER (
      PARTITION BY CUSTOMER_ID ORDER BY ORDER_TS
      ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
    ) AS running_spend,
    ROW_NUMBER() OVER (
      PARTITION BY CUSTOMER_ID ORDER BY ORDER_TS DESC
    ) AS order_rank
  FROM SALES.LINE_ITEMS
)
WHERE order_rank <= 5;

library(dplyr)

line_items <- tbl(con, in_schema("SALES", "LINE_ITEMS"))

ranked <- line_items |>
  group_by(CUSTOMER_ID) |>
  mutate(
    order_rank = row_number(desc(ORDER_TS)),
    running_spend = sum(AMOUNT)
  ) |>
  filter(order_rank <= 5L)   # top 5 orders per customer — computed in Snowflake

top5_per_customer <- ranked |>
  select(CUSTOMER_ID, ORDER_ID, ORDER_TS, AMOUNT, running_spend) |>
  compute(name = "top5_orders_tmp", temporary = TRUE)

For production pipelines, set temporary = FALSE and a fully qualified table name when the artifact should outlive the session.

15.7.3 Collect only what you need

Anti-pattern: collect() on a billion-row table to aggregate in R.

Better: aggregate in SQL, collect() the summary (or a stratified sample):

SQL
dbplyr

-- Summary for a dashboard — tiny result set
SELECT REGION, SUM(AMOUNT) AS revenue
FROM SALES.ORDERS
GROUP BY REGION;

-- Stratified sample for exploratory plots (see dbplyr show_query() for exact SAMPLE)
SELECT *
FROM SALES.ORDERS
WHERE ORDER_DATE >= '2025-01-01'
SAMPLE (5000 ROWS);

# Summary for a dashboard — tiny result set
by_region <- tbl(con, in_schema("SALES", "ORDERS")) |>
  group_by(REGION) |>
  summarise(revenue = sum(AMOUNT), .groups = "drop") |>
  collect()

# Optional: ggplot2 on the aggregated frame
# library(ggplot2)
# ggplot(by_region, aes(REGION, revenue)) + geom_col()

sample_for_plot <- tbl(con, in_schema("SALES", "ORDERS")) |>
  filter(ORDER_DATE >= "2025-01-01") |>
  slice_sample(n = 5000) |>   # Snowflake TABLESAMPLE / random — see show_query()
  collect()

15.7.4 Semi-structured columns (brief)

RSnowflake can build OBJECT/ARRAY bundles with object_agg() and array_agg() in dplyr (RSnowflake-only translations). Register bundles as Feature Views in Authoring Feature Views. How those columns round-trip into R (JSON vs Arrow list columns, VECTOR packs, Feature Store save_as) is covered in Training Data in R.

15.7.5 Materialize on the warehouse (`compute()`)

Anti-pattern: collect() after every dplyr verb — data shuttles to R repeatedly.

Better: treat Snowflake as the execution engine. Build lazy pipelines; materialize intermediate results as tables when a step is expensive, reused, or hard to read as one nested query.

Function	What it does	When to use
`collect()`	Run SQL + pull all rows to R	Final modest result (modeling, plot, `head()`)
`compute()`	`CREATE TEMPORARY TABLE … AS` + return lazy handle	Multi-step ETL; break up joins; debug with `show_query()` per step
`copy_to()`	Upload local `data.frame` to a Snowflake table	Seed pipeline from R object
`dbExecute()` + `sql_render()`	Run explicit DDL you control	Permanent tables, `CREATE TABLE AS` with custom options

SQL
dbplyr

CREATE TEMPORARY TABLE daily_events_tmp AS
SELECT customer_id, event_date, COUNT(*) AS n_events
FROM RAW.EVENTS
WHERE event_date >= '2024-01-01'
GROUP BY customer_id, event_date;

-- Reuse materialized step; final features as permanent table
CREATE TABLE MY_DB.MY_SCHEMA.CUSTOMER_EVENT_FEATURES AS
SELECT customer_id, AVG(n_events) AS avg_daily
FROM daily_events_tmp
GROUP BY customer_id;

library(dplyr)

base <- tbl(con, in_schema("RAW", "EVENTS")) |>
  filter(event_date >= "2024-01-01")

# Materialize once on the warehouse (session-scoped temp table)
daily <- base |>
  group_by(customer_id, event_date) |>
  summarise(n_events = n(), .groups = "drop") |>
  compute(name = "daily_events_tmp")

# Continue from the materialized table — still lazy until collect()
features <- daily |>
  group_by(customer_id) |>
  summarise(avg_daily = mean(n_events), .groups = "drop")

# Permanent table (set session schema or use a fully qualified name)
features |>
  compute(name = "CUSTOMER_EVENT_FEATURES", temporary = FALSE)

# Or inspect SQL without executing
features |> show_query()

RSnowflake’s sql_query_save generates Snowflake CREATE TEMPORARY TABLE (or permanent when temporary = FALSE). Temp tables live for the session — ideal for notebook pipelines; use qualified permanent names for shared artifacts.

When to materialize vs one big query: materialize when (1) an intermediate is reused in multiple branches, (2) the generated SQL is deeply nested or slow to optimize, or (3) you want to inspect row counts with SELECT COUNT(*) on a named table. Otherwise a single pipeline + one collect() is fine.

See Platform Primer — storage-compute for the mental model.

15.8 Bulk write routing

Small tables use batched literal INSERT over the SQL API. Above RSnowflake.bulk_write_threshold (cells = rows × columns; default 200,000 in Workspace, 50,000 elsewhere), "auto" picks a bulk backend.

15.8.1 How `"auto"` chooses a path (current RSnowflake behaviour)

Environment	Order when above threshold
Workspace	ADBC `write_adbc` (internal gateway) → Snowpark `write_pandas` (fallback) → literal INSERT
Local / Posit	ADBC → literal INSERT

This differs from older docs that listed Snowpark as the Workspace default. After internal-gateway ADBC work (2026), ADBC is preferred in Workspace when adbcsnowflake is installed; Snowpark remains an explicit opt-in or fallback. See WORKSPACE_ADBC.md.

flowchart TD
  WT[dbWriteTable or dbAppendTable]
  AUTO{upload_method}
  BIG{cells >= threshold?}
  ADBC{ADBC installed?}
  SP{Snowpark plus reticulate?}
  ADBC_W[ADBC write_adbc from R]
  SP_W[reticulate r_to_py then write_pandas]
  LIT[Literal INSERT]

  WT --> AUTO
  AUTO -->|literal or small| LIT
  AUTO -->|auto and large| BIG
  BIG -->|no| LIT
  BIG -->|yes| ADBC
  ADBC -->|yes| ADBC_W
  ADBC -->|no| SP
  SP -->|yes| SP_W
  SP -->|no| LIT

Method	What actually happens	When to use
`"auto"`	Routing table above	Default
`"adbc"`	R `data.frame` → Snowflake driver PUT + COPY INTO (no SQL API `PUT` from R)	Force Arrow path; bake into CRE to skip per-session Go install
`"snowpark"`	R → `reticulate::r_to_py()` → pandas → active Snowpark session `write_pandas()`	Fallback in Workspace; explicit override
`"literal"`	SQL `INSERT` batches via REST	Small tables, debugging, no extra packages

options(RSnowflake.verbose = TRUE)   # log conversion and write timings
options(RSnowflake.upload_method = "auto")
dbWriteTable(con, "big_table", large_df)

options(RSnowflake.bulk_write_threshold = 100000L)  # rows × cols

15.8.2 Is the Snowpark path copying the whole data frame through reticulate?

Yes, when Snowpark is used. RSnowflake converts the R data.frame in memory with reticulate::r_to_py(), then calls the Workspace kernel’s Snowpark write_pandas(). In RSnowflake benchmarks (~50k rows), that R→pandas step is on the order of ~0.2 seconds — small next to network upload, but it still duplicates the dataset in process memory (R + Python) for the duration of the write.

ADBC avoids the Python hop: write_adbc() sends the R table through the Arrow/Snowflake driver (PUT + COPY under the hood), staying in the R process with no pandas intermediate.

15.8.3 For very large data, prefer not to round-trip through R

Approach	Why
dbplyr / SQL — `compute()`, `CREATE TABLE AS`, load from existing Snowflake tables	Data never leaves the warehouse
Python Snowpark cell — write from a DataFrame already in Python, or `session.write_pandas` from a file you read in Python	No R→Python copy for the bulk of the bytes
Stage + `COPY INTO` (SQL)	Classic warehouse pattern; RSnowflake’s SQL API driver cannot run client `PUT` — so “write CSV in R, then PUT” is not a built-in RSnowflake path; use a Python cell, Snowpark, or ADBC instead
`dbWriteTable(large_df)` from R	Convenient up to modest sizes; for millions of rows, memory and reticulate/ADBC cost dominate

Writing CSV/Parquet locally and loading via Python is a valid manual pattern in a mixed notebook, but RSnowflake does not automatically spool to disk first — it passes the in-memory data.frame to ADBC or r_to_py().

15.8.4 Workspace benchmarks (order of magnitude)

From RSnowflake testing (~50k rows, Mar 2026 — your account may differ):

Path	Typical time
Snowpark `write_pandas` (after `r_to_py`)	~2–3 s
ADBC via internal `SNOWFLAKE_HOST`	~7 s
Literal INSERT via SQL API	~45 s

So Snowpark can be faster when both are available, but ADBC is the default in "auto" because it avoids the reticulate dependency and matches bulk read auth. Use options(RSnowflake.upload_method = "snowpark") when you have measured Snowpark winning for your shape of data.

Reads follow a similar "auto" / "adbc" backend via RSnowflake.backend.

15.9 Arrow and ADBC

What is ADBC?

ADBC (Arrow Database Connectivity) is a columnar database access standard (Apache Arrow ecosystem). RSnowflake uses the Snowflake ADBC driver for bulk I/O; routine queries use the SQL API. Glossary · SQL API chapter.

The Snowflake ADBC driver uses the Go Snowflake driver for columnar bulk transfer:

requireNamespace("adbcsnowflake", quietly = TRUE)

options(RSnowflake.backend = "adbc")
df <- dbGetQuery(con, "SELECT * FROM big_table")
options(RSnowflake.backend = "auto")

Workspace considerations:

Installing adbcsnowflake may require Go compiler and EAI for module downloads (~2 min cold bootstrap)
ADBC can use internal SNOWFLAKE_HOST gateway with OAuth — no public URL required
Bulk writes may still prefer Snowpark in "auto" even when ADBC is installed

Bake ADBC into a CRE to avoid repeat installs. Deep dive: WORKSPACE_ADBC.md.

dbGetQueryArrow() provides DBI Arrow compatibility via nanoarrow. SQL API v2 returns JSON today — native server-side Arrow is on the roadmap; for large reads, dbGetQuery() with ADBC backend is the practical path.

15.10 SQL API limitations

RSnowflake uses the SQL REST API. These affect how you I/O, not whether you can query:

Unsupported	Implication
`GET` / `PUT`	No client-side stage transfer over REST
`USE DATABASE` / `SCHEMA`	Set context in connection or qualify names
`ALTER SESSION`	Set params via supported alternatives

SPCS workers with RSnowflake must use stage volume mounts for file I/O — see Parallel doSnowflake and Troubleshooting.

15.11 Choosing an access pattern

Goal	Pattern
Ad hoc SQL	`dbGetQuery()`
ETL-style transforms	dbplyr pipeline + `compute()` for heavy steps + `collect()` at end
Large local export	ADBC backend + aggregate in SQL first
Workspace notebook write	`"auto"` or `"snowpark"`
ML platform objects	snowflakeR `sfr_*` APIs (may share DBI via `sfr_dbi_connection()`)

15.12 Next steps

RSnowflake in Workspace — bootstrap order, gateway, ADBC in notebooks.

--- title: "RSnowflake: Query" subtitle: "DBI, dbplyr, bulk I/O, and Arrow" --- ## Overview Once connected ([Connect chapter](../12_rsnowflake_connect/index.qmd)), RSnowflake gives you three layers of data access: 1. **DBI** — standard R database interface (`dbGetQuery`, `dbWriteTable`, …) 2. **dbplyr** — dplyr verbs translated to SQL and executed **in Snowflake** 3. **Bulk I/O** — optimized paths for large reads/writes (Snowpark, ADBC, or literal SQL) The design goal is **push computation to the data**. Your R session sends SQL; the **warehouse** aggregates, filters, and joins close to storage. You `collect()` only what you need for modeling or visualization. ## Learning Objectives - Run queries, DDL, and table operations with DBI - Build lazy dplyr pipelines that execute on the warehouse - Use RSnowflake-only dbplyr translations (semi-structured types, `OBJECT_AGG`, approximate aggregates) - Materialize intermediate results with `compute()` (session temp tables) vs pull summaries with `collect()` - Combine window functions, joins, and multi-step ETL entirely in Snowflake SQL via dbplyr - Understand how RSnowflake chooses bulk write/read backends - Know SQL API limits that affect stage I/O --- ## Query execution model {#sec-model} ```{mermaid} flowchart LR R[R session RSnowflake] API[SQL REST API or ADBC] WH[Warehouse] STG[(Tables / stages)] R --> API --> WH --> STG ``` | Step | Where it runs | |------|---------------| | dplyr pipeline before `collect()` | SQL built in R; **not executed** yet | | `collect()`, `dbGetQuery()` | **Warehouse** executes SQL | | Result transfer | JSON (SQL API) or Arrow (ADBC) to R | Bulk **writes** from R (`dbWriteTable`, `dbAppendTable`) use separate routing — see [Bulk write routing](#sec-bulk). Prefer **not** loading huge tables into R only to write them back; materialize with SQL/dbplyr when possible. --- ## DBI basics {#sec-dbi} DBI examples below are already the **SQL path**: you send warehouse SQL as strings (`dbGetQuery`, `dbExecute`). Pair them with [dbplyr](#sec-dbplyr) tabs when you want the same logic as lazy dplyr. ### Reads and parameterized queries ```r df <- dbGetQuery(con, "SELECT * FROM MY_DB.MY_SCHEMA.MY_TABLE LIMIT 100") df <- dbGetQuery(con, "SELECT * FROM users WHERE age > ?", params = list(21L) ) ``` Parameterized queries avoid SQL injection and help the optimizer reuse plans. ### DDL and DML ```r dbExecute(con, "CREATE TABLE test_tbl (id INT, name VARCHAR)") dbExecute(con, "INSERT INTO test_tbl VALUES (1, 'Alice')") ``` `dbExecute()` returns rows affected; use for statements without result sets. ### Table round-trip ```r dbWriteTable(con, "iris_copy", iris, overwrite = TRUE) iris_back <- dbReadTable(con, "iris_copy") dbAppendTable(con, "iris_copy", iris[1:10, ]) dbRemoveTable(con, "iris_copy") ``` ::: {.callout-note} Use **fully qualified** names when Workspace session context is unset. RSnowflake uppercases unquoted identifiers by default — see [Connect — case](../12_rsnowflake_connect/index.qmd#sec-case). ::: --- ## dbplyr: lazy SQL from dplyr {#sec-dbplyr} **dbplyr** turns dplyr into a SQL compiler. Tables become lazy; verbs append to a query plan until you **`collect()`**. Where the same logic is practical in both forms, examples use **SQL** and **dbplyr** tabs (`group="query-sql-dplyr"`). ::: {.panel-tabset group="query-sql-dplyr"} ##### SQL ```sql SELECT REGION, COUNT(*) AS n, SUM(AMOUNT) AS revenue FROM MY_DB.MY_SCHEMA.ORDERS WHERE STATUS = 'OPEN' AND ORDER_DATE >= '2024-01-01' GROUP BY REGION ORDER BY revenue DESC; ``` ##### dbplyr ```r library(dplyr) tbl(con, in_schema("MY_SCHEMA", "ORDERS")) |> filter(STATUS == "OPEN", ORDER_DATE >= "2024-01-01") |> group_by(REGION) |> summarise( n = n(), revenue = sum(AMOUNT, na.rm = TRUE), .groups = "drop" ) |> arrange(desc(revenue)) |> collect() # single SQL round trip ``` ::: **Why lazy matters:** `filter()` and `summarise()` do not pull data to R. A 10-billion-row table stays in Snowflake until the final aggregation is computed on the warehouse. ::: {.callout-warning} dbplyr keeps Snowflake's **UPPERCASE** column names. snowflakeR's `sfr_query()` lowercases — do not mix blindly in joins between APIs. ::: Generic Snowflake translations (`floor_date()` → `DATE_TRUNC`, `paste0()` → `ARRAY_TO_STRING`, `ifelse()` → `IFF`, and many more) come from **dbplyr** via `dbplyr::simulate_snowflake()`. RSnowflake **adds** the Snowflake-only functions below — they are not available when you use ODBC/dbplyr against Snowflake from other drivers. Full tables: [RSnowflake getting-started vignette](https://github.com/Snowflake-Labs/RSnowflake/blob/main/vignettes/getting-started.Rmd#snowflake-specific-sql-translations) · [Appendix H — `getting-started` HTML](../appendices/H_package_vignettes/index.qmd#sec-snowflaker). --- ## RSnowflake-only dplyr translations {#sec-snowflake-dbplyr} Use these inside `mutate()`, `filter()`, and `summarise()` on `tbl(con, ...)`. They compile to Snowflake SQL and run on the warehouse. ### Semi-structured scalars | R function | Snowflake SQL | Typical use | |------------|---------------|-------------| | `parse_json(x)` | `PARSE_JSON(x)` | VARCHAR → VARIANT | | `try_parse_json(x)` | `TRY_PARSE_JSON(x)` | Safe parse (NULL on bad JSON) | | `typeof(x)` | `TYPEOF(x)` | Inspect VARIANT element type | | `is_object(x)` | `IS_OBJECT(x)` | Branch on OBJECT payloads | | `is_array(x)` | `IS_ARRAY(x)` | Branch on ARRAY payloads | | `is_integer(x)` | `IS_INTEGER(x)` | VARIANT holds integer | ### Array scalars | R function | Snowflake SQL | Notes | |------------|---------------|-------| | `array_size(x)` | `ARRAY_SIZE(x)` | Length of ARRAY | | `array_contains(arr, val)` | `ARRAY_CONTAINS(val, arr)` | RSnowflake swaps arg order to match Snowflake | | `array_slice(arr, from, to)` | `ARRAY_SLICE(arr, from, to)` | Sub-array | ### Strings and hashing | R function | Snowflake SQL | Notes | |------------|---------------|-------| | `regexp_substr(x, pattern, ...)` | `REGEXP_SUBSTR(...)` | Snowflake-specific args (`pos`, `occ`, `params`) | | `hash(...)` | `HASH(...)` | Stable hash across columns | ### Aggregates (inside `summarise()`) | R function | Snowflake SQL | Typical use | |------------|---------------|-------------| | `object_agg(key, value)` | `OBJECT_AGG(key, value)` | Bundle many rows into one OBJECT per group | | `array_agg(x)` | `ARRAY_AGG(x)` | Positional feature vector as ARRAY | | `array_unique_agg(x)` | `ARRAY_UNIQUE_AGG(x)` | Distinct ARRAY elements | | `approx_count_distinct(x)` | `APPROX_COUNT_DISTINCT(x)` | Cheap cardinality on large groups | | `approx_percentile(x, p)` | `APPROX_PERCENTILE(x, p)` | Approximate quantiles | | `mode_val(x)` | `MODE(x)` | Statistical mode (`mode_val` avoids `base::mode()`) | Window variants of the aggregate functions above are also registered for analytic queries. ### Example: events with JSON payloads ::: {.panel-tabset group="query-sql-dplyr"} ##### SQL ```sql SELECT EVENT_TYPE, ARRAY_UNIQUE_AGG(USER_ID) AS user_ids, APPROX_COUNT_DISTINCT(USER_ID) AS n_users FROM ( SELECT EVENT_TYPE, USER_ID, TRY_PARSE_JSON(RAW_JSON) AS payload FROM CLICKSTREAM_DATA.EVENTS ) e WHERE IS_OBJECT(payload) GROUP BY EVENT_TYPE; ``` ##### dbplyr ```r library(dplyr) tbl(con, in_schema("CLICKSTREAM_DATA", "EVENTS")) |> mutate(payload = parse_json(RAW_JSON)) |> filter(is_object(payload)) |> group_by(EVENT_TYPE) |> summarise( user_ids = array_unique_agg(USER_ID), n_users = approx_count_distinct(USER_ID), .groups = "drop" ) |> collect() ``` ::: ### Example: bundle sparse features into OBJECT (Feature Store pattern) Snowflake's [Feature Store advanced patterns](https://snowflake-labs.github.io/snowflake-featurestore-imp-guide/12_advanced_patterns/#sec-wide-sparse) recommend **OBJECT** (or **ARRAY**) columns to avoid thousand-column tables. You can **build** those bundles in dplyr on the warehouse, then register the result as an external feature view from [Feature Store](../17_feature_store/index.qmd) or materialize with `compute()`. ::: {.panel-tabset group="query-sql-dplyr"} ##### SQL ```sql -- Long/narrow source: one row per (entity, feature_name, value) SELECT ENTITY_ID, OBJECT_AGG(FEATURE_NAME, FEATURE_VALUE) AS feature_object FROM FEATURE_STORE.ENTITY_FEATURES_LONG GROUP BY ENTITY_ID; ``` ##### dbplyr ```r # Long/narrow source: one row per (entity, feature_name, value) bundled <- tbl(con, in_schema("FEATURE_STORE", "ENTITY_FEATURES_LONG")) |> group_by(ENTITY_ID) |> summarise( FEATURE_OBJECT = object_agg(FEATURE_NAME, FEATURE_VALUE), .groups = "drop" ) bundled |> show_query() # Persist for reuse (session temp table) or inspect a sample feature_sample <- bundled |> head(100) |> collect() ``` ::: For **fixed keys** (known feature names), `OBJECT_CONSTRUCT` in SQL is often clearer — use `dbGetQuery()` or `mutate(feature_object = sql("OBJECT_CONSTRUCT('avg_spend', AVG_SPEND, 'n_orders', N_ORDERS)"))` after a `summarise()` step. **Path access** (`payload:field::FLOAT`, bracket notation) is not wrapped as dplyr helpers today. Use `dbGetQuery()` or `mutate(x = sql("FEATURE_OBJECT:avg_spend::FLOAT"))` for extraction expressions. --- --- ## Push-down analytics: dplyr on the warehouse {#sec-pushdown} The power of dbplyr is not “R that talks SQL” — it is **keeping heavy work in Snowflake** and using R for orchestration, inspection, and modest result sets (plots, model matrices, samples). | Verb / action | Runs in | Typical use | |---------------|---------|-------------| | `filter()`, `mutate()`, `group_by()`, `summarise()` | **Warehouse** (lazy) | Feature engineering, aggregations | | `show_query()` | R only | Debug generated SQL before execution | | `compute()` | **Warehouse** (`CREATE TEMP TABLE … AS`) | Persist an intermediate; reuse in branches | | `collect()` | Warehouse execute + **transfer to R** | Final table slice, ggplot input, `head()` | ::: {.callout-tip} **Feature Store training exports** (OBJECT/ARRAY/VECTOR pipelines, ADBC decode, wide ASOF joins) live in [Training Data in R](../17a_training_data_in_r/index.qmd) — not here. Feature View **authoring** SQL/dbplyr is in [Authoring Feature Views](../17b_authoring_feature_views/index.qmd). This chapter focuses on everyday dbplyr analytics. ::: ### Multi-step ETL with temp tables {#sec-etl-temp} Break a pipeline into named steps on the warehouse instead of one nested query or repeated `collect()`: ::: {.panel-tabset group="query-sql-dplyr"} ##### SQL ```sql -- Step 1 — session temp table (dbplyr compute() generates similar DDL) CREATE TEMPORARY TABLE daily_orders_tmp AS SELECT CUSTOMER_ID, ORDER_DATE, COUNT(*) AS n_orders, SUM(AMOUNT) AS revenue FROM ANALYTICS.ORDERS WHERE ORDER_DATE >= '2024-01-01' GROUP BY CUSTOMER_ID, ORDER_DATE; -- Step 2 — aggregate from materialized step SELECT CUSTOMER_ID, SUM(n_orders) AS orders_90d, SUM(revenue) AS revenue_90d, SUM(revenue) / NULLIF(SUM(n_orders), 0) AS avg_order FROM daily_orders_tmp GROUP BY CUSTOMER_ID; -- Inspect row counts without pulling data to R SELECT COUNT(*) AS n FROM daily_orders_tmp; ``` ##### dbplyr ```r library(dplyr) orders <- tbl(con, in_schema("ANALYTICS", "ORDERS")) |> filter(ORDER_DATE >= "2024-01-01") # Step 1 — materialize daily rollups (session temp table) daily <- orders |> group_by(CUSTOMER_ID, ORDER_DATE) |> summarise( n_orders = n(), revenue = sum(AMOUNT, na.rm = TRUE), .groups = "drop" ) |> compute(name = "daily_orders_tmp") # Step 2 — continue from temp table; still lazy until collect() customer_rollup <- daily |> group_by(CUSTOMER_ID) |> summarise( orders_90d = sum(n_orders), revenue_90d = sum(revenue), avg_order = revenue_90d / orders_90d, .groups = "drop" ) customer_rollup |> show_query() dbGetQuery(con, "SELECT COUNT(*) AS n FROM daily_orders_tmp") ``` ::: ### Joins and window functions {#sec-windows} dbplyr translates joins and window verbs to Snowflake SQL — including **qualify**-style patterns via `filter()` on windowed columns: ::: {.panel-tabset group="query-sql-dplyr"} ##### SQL ```sql CREATE TEMPORARY TABLE top5_orders_tmp AS SELECT CUSTOMER_ID, ORDER_ID, ORDER_TS, AMOUNT, running_spend FROM ( SELECT CUSTOMER_ID, ORDER_ID, ORDER_TS, AMOUNT, SUM(AMOUNT) OVER ( PARTITION BY CUSTOMER_ID ORDER BY ORDER_TS ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW ) AS running_spend, ROW_NUMBER() OVER ( PARTITION BY CUSTOMER_ID ORDER BY ORDER_TS DESC ) AS order_rank FROM SALES.LINE_ITEMS ) WHERE order_rank <= 5; ``` ##### dbplyr ```r library(dplyr) line_items <- tbl(con, in_schema("SALES", "LINE_ITEMS")) ranked <- line_items |> group_by(CUSTOMER_ID) |> mutate( order_rank = row_number(desc(ORDER_TS)), running_spend = sum(AMOUNT) ) |> filter(order_rank <= 5L) # top 5 orders per customer — computed in Snowflake top5_per_customer <- ranked |> select(CUSTOMER_ID, ORDER_ID, ORDER_TS, AMOUNT, running_spend) |> compute(name = "top5_orders_tmp", temporary = TRUE) ``` ::: For production pipelines, set `temporary = FALSE` and a fully qualified table name when the artifact should outlive the session. ### Collect only what you need {#sec-collect-plot} **Anti-pattern:** `collect()` on a billion-row table to aggregate in R. **Better:** aggregate in SQL, `collect()` the summary (or a stratified sample): ::: {.panel-tabset group="query-sql-dplyr"} ##### SQL ```sql -- Summary for a dashboard — tiny result set SELECT REGION, SUM(AMOUNT) AS revenue FROM SALES.ORDERS GROUP BY REGION; -- Stratified sample for exploratory plots (see dbplyr show_query() for exact SAMPLE) SELECT * FROM SALES.ORDERS WHERE ORDER_DATE >= '2025-01-01' SAMPLE (5000 ROWS); ``` ##### dbplyr ```r # Summary for a dashboard — tiny result set by_region <- tbl(con, in_schema("SALES", "ORDERS")) |> group_by(REGION) |> summarise(revenue = sum(AMOUNT), .groups = "drop") |> collect() # Optional: ggplot2 on the aggregated frame # library(ggplot2) # ggplot(by_region, aes(REGION, revenue)) + geom_col() sample_for_plot <- tbl(con, in_schema("SALES", "ORDERS")) |> filter(ORDER_DATE >= "2025-01-01") |> slice_sample(n = 5000) |> # Snowflake TABLESAMPLE / random — see show_query() collect() ``` ::: ### Semi-structured columns (brief) {#sec-semi-structured-brief} RSnowflake can **build** OBJECT/ARRAY bundles with `object_agg()` and `array_agg()` in dplyr ([RSnowflake-only translations](#sec-snowflake-dbplyr)). Register bundles as Feature Views in [Authoring Feature Views](../17b_authoring_feature_views/index.qmd). How those columns **round-trip into R** (JSON vs Arrow list columns, VECTOR packs, Feature Store `save_as`) is covered in [Training Data in R](../17a_training_data_in_r/index.qmd). --- ### Materialize on the warehouse (`compute()`) {#sec-materialize} **Anti-pattern:** `collect()` after every dplyr verb — data shuttles to R repeatedly. **Better:** treat Snowflake as the execution engine. Build lazy pipelines; **materialize intermediate results as tables** when a step is expensive, reused, or hard to read as one nested query. | Function | What it does | When to use | |----------|--------------|-------------| | **`collect()`** | Run SQL + pull all rows to R | Final modest result (modeling, plot, `head()`) | | **`compute()`** | `CREATE TEMPORARY TABLE … AS` + return lazy handle | Multi-step ETL; break up joins; debug with `show_query()` per step | | **`copy_to()`** | Upload local `data.frame` to a Snowflake table | Seed pipeline from R object | | **`dbExecute()` + `sql_render()`** | Run explicit DDL you control | Permanent tables, `CREATE TABLE AS` with custom options | ::: {.panel-tabset group="query-sql-dplyr"} ##### SQL ```sql CREATE TEMPORARY TABLE daily_events_tmp AS SELECT customer_id, event_date, COUNT(*) AS n_events FROM RAW.EVENTS WHERE event_date >= '2024-01-01' GROUP BY customer_id, event_date; -- Reuse materialized step; final features as permanent table CREATE TABLE MY_DB.MY_SCHEMA.CUSTOMER_EVENT_FEATURES AS SELECT customer_id, AVG(n_events) AS avg_daily FROM daily_events_tmp GROUP BY customer_id; ``` ##### dbplyr ```r library(dplyr) base <- tbl(con, in_schema("RAW", "EVENTS")) |> filter(event_date >= "2024-01-01") # Materialize once on the warehouse (session-scoped temp table) daily <- base |> group_by(customer_id, event_date) |> summarise(n_events = n(), .groups = "drop") |> compute(name = "daily_events_tmp") # Continue from the materialized table — still lazy until collect() features <- daily |> group_by(customer_id) |> summarise(avg_daily = mean(n_events), .groups = "drop") # Permanent table (set session schema or use a fully qualified name) features |> compute(name = "CUSTOMER_EVENT_FEATURES", temporary = FALSE) # Or inspect SQL without executing features |> show_query() ``` ::: RSnowflake's `sql_query_save` generates Snowflake **`CREATE TEMPORARY TABLE`** (or permanent when `temporary = FALSE`). Temp tables live for the **session** — ideal for notebook pipelines; use qualified permanent names for shared artifacts. **When to materialize vs one big query:** materialize when (1) an intermediate is reused in multiple branches, (2) the generated SQL is deeply nested or slow to optimize, or (3) you want to inspect row counts with `SELECT COUNT(*)` on a named table. Otherwise a single pipeline + one `collect()` is fine. See [Platform Primer — storage-compute](../01_snowflake_platform/index.qmd#sec-storage-compute) for the mental model. --- ## Bulk write routing {#sec-bulk} Small tables use batched **literal `INSERT`** over the SQL API. Above **`RSnowflake.bulk_write_threshold`** (cells = rows × columns; default **200,000** in Workspace, **50,000** elsewhere), `"auto"` picks a bulk backend. ### How `"auto"` chooses a path (current RSnowflake behaviour) | Environment | Order when above threshold | |-------------|----------------------------| | **Workspace** | **ADBC** `write_adbc` (internal gateway) → **Snowpark** `write_pandas` (fallback) → literal INSERT | | **Local / Posit** | **ADBC** → literal INSERT | This differs from older docs that listed Snowpark as the Workspace default. After internal-gateway ADBC work (2026), **ADBC is preferred in Workspace** when `adbcsnowflake` is installed; Snowpark remains an explicit opt-in or fallback. See [WORKSPACE_ADBC.md](https://github.com/Snowflake-Labs/RSnowflake/blob/main/WORKSPACE_ADBC.md). ```{mermaid} flowchart TD WT[dbWriteTable or dbAppendTable] AUTO{upload_method} BIG{cells >= threshold?} ADBC{ADBC installed?} SP{Snowpark plus reticulate?} ADBC_W[ADBC write_adbc from R] SP_W[reticulate r_to_py then write_pandas] LIT[Literal INSERT] WT --> AUTO AUTO -->|literal or small| LIT AUTO -->|auto and large| BIG BIG -->|no| LIT BIG -->|yes| ADBC ADBC -->|yes| ADBC_W ADBC -->|no| SP SP -->|yes| SP_W SP -->|no| LIT ``` | Method | What actually happens | When to use | |--------|----------------------|-------------| | `"auto"` | Routing table above | Default | | `"adbc"` | R `data.frame` → Snowflake driver **PUT + COPY INTO** (no SQL API `PUT` from R) | Force Arrow path; bake into [CRE](../09_custom_runtime_and_ml_jobs/index.qmd) to skip per-session Go install | | `"snowpark"` | R → **`reticulate::r_to_py()`** → pandas → active Snowpark session **`write_pandas()`** | Fallback in Workspace; explicit override | | `"literal"` | SQL `INSERT` batches via REST | Small tables, debugging, no extra packages | ```r options(RSnowflake.verbose = TRUE) # log conversion and write timings options(RSnowflake.upload_method = "auto") dbWriteTable(con, "big_table", large_df) options(RSnowflake.bulk_write_threshold = 100000L) # rows × cols ``` ### Is the Snowpark path copying the whole data frame through reticulate? **Yes, when Snowpark is used.** RSnowflake converts the R `data.frame` in memory with `reticulate::r_to_py()`, then calls the Workspace kernel’s Snowpark `write_pandas()`. In RSnowflake benchmarks (~50k rows), that R→pandas step is on the order of **~0.2 seconds** — small next to network upload, but it still **duplicates the dataset in process memory** (R + Python) for the duration of the write. **ADBC avoids the Python hop:** `write_adbc()` sends the R table through the Arrow/Snowflake driver (PUT + COPY under the hood), staying in the R process with no pandas intermediate. ### For very large data, prefer not to round-trip through R | Approach | Why | |----------|-----| | **dbplyr / SQL** — `compute()`, `CREATE TABLE AS`, load from existing Snowflake tables | Data never leaves the warehouse | | **Python Snowpark cell** — write from a DataFrame already in Python, or `session.write_pandas` from a file you read in Python | No R→Python copy for the bulk of the bytes | | **Stage + `COPY INTO` (SQL)** | Classic warehouse pattern; RSnowflake’s SQL API driver **cannot** run client `PUT` — so “write CSV in R, then PUT” is **not** a built-in RSnowflake path; use a **Python** cell, Snowpark, or ADBC instead | | **`dbWriteTable(large_df)` from R** | Convenient up to modest sizes; for millions of rows, memory and reticulate/ADBC cost dominate | Writing CSV/Parquet locally and loading via Python is a valid **manual** pattern in a mixed notebook, but RSnowflake does not automatically spool to disk first — it passes the in-memory `data.frame` to ADBC or `r_to_py()`. ### Workspace benchmarks (order of magnitude) From RSnowflake testing (~50k rows, Mar 2026 — your account may differ): | Path | Typical time | |------|----------------| | Snowpark `write_pandas` (after `r_to_py`) | ~2–3 s | | ADBC via internal `SNOWFLAKE_HOST` | ~7 s | | Literal INSERT via SQL API | ~45 s | So Snowpark can be **faster** when both are available, but **ADBC is the default** in `"auto"` because it avoids the reticulate dependency and matches bulk read auth. Use `options(RSnowflake.upload_method = "snowpark")` when you have measured Snowpark winning for your shape of data. **Reads** follow a similar `"auto"` / `"adbc"` backend via `RSnowflake.backend`. --- ## Arrow and ADBC {#sec-arrow} ::: {.callout-note title="What is ADBC?"} **ADBC** (Arrow Database Connectivity) is a **columnar database access standard** (Apache Arrow ecosystem). RSnowflake uses the Snowflake ADBC driver for bulk I/O; routine queries use the SQL API. [Glossary](../appendices/F_glossary/index.qmd) · [SQL API chapter](../11_rsnowflake_sql_api/index.qmd#sec-arrow). ::: The Snowflake **ADBC** driver uses the Go Snowflake driver for columnar bulk transfer: ```r requireNamespace("adbcsnowflake", quietly = TRUE) options(RSnowflake.backend = "adbc") df <- dbGetQuery(con, "SELECT * FROM big_table") options(RSnowflake.backend = "auto") ``` **Workspace considerations:** - Installing `adbcsnowflake` may require **Go compiler** and **EAI** for module downloads (~2 min cold bootstrap) - ADBC can use internal **`SNOWFLAKE_HOST`** gateway with OAuth — no public URL required - Bulk writes may still prefer Snowpark in `"auto"` even when ADBC is installed Bake ADBC into a [CRE](../09_custom_runtime_and_ml_jobs/index.qmd) to avoid repeat installs. Deep dive: [WORKSPACE_ADBC.md](https://github.com/Snowflake-Labs/RSnowflake/blob/main/WORKSPACE_ADBC.md). **`dbGetQueryArrow()`** provides DBI Arrow compatibility via `nanoarrow`. SQL API v2 returns JSON today — native server-side Arrow is on the roadmap; for large reads, `dbGetQuery()` with ADBC backend is the practical path. --- ## SQL API limitations {#sec-rest} RSnowflake uses the [SQL REST API](../11_rsnowflake_sql_api/index.qmd). These affect **how** you I/O, not **whether** you can query: | Unsupported | Implication | |-------------|-------------| | `GET` / `PUT` | No client-side stage transfer over REST | | `USE DATABASE` / `SCHEMA` | Set context in connection or qualify names | | `ALTER SESSION` | Set params via supported alternatives | **SPCS workers** with RSnowflake must use **stage volume mounts** for file I/O — see [Parallel doSnowflake](../22_parallel_dosnowflake/index.qmd#sec-stage) and [Troubleshooting](../appendices/C_troubleshooting/index.qmd). --- ## Choosing an access pattern {#sec-choose} | Goal | Pattern | |------|---------| | Ad hoc SQL | `dbGetQuery()` | | ETL-style transforms | dbplyr pipeline + `compute()` for heavy steps + `collect()` at end | | Large local export | ADBC backend + aggregate in SQL first | | Workspace notebook write | `"auto"` or `"snowpark"` | | ML platform objects | snowflakeR `sfr_*` APIs (may share DBI via `sfr_dbi_connection()`) | --- ## Next steps [RSnowflake in Workspace](../14_rsnowflake_workspace/index.qmd) — bootstrap order, gateway, ADBC in notebooks.

15.1 Overview

15.2 Learning Objectives

15.3 Query execution model

15.4 DBI basics

15.4.1 Reads and parameterized queries

15.4.2 DDL and DML

15.4.3 Table round-trip

15.5 dbplyr: lazy SQL from dplyr

15.6 RSnowflake-only dplyr translations

15.6.1 Semi-structured scalars

15.6.2 Array scalars

15.6.3 Strings and hashing

15.6.4 Aggregates (inside summarise())

15.6.5 Example: events with JSON payloads

15.6.6 Example: bundle sparse features into OBJECT (Feature Store pattern)

15.7 Push-down analytics: dplyr on the warehouse

15.7.1 Multi-step ETL with temp tables

15.7.2 Joins and window functions

15.7.3 Collect only what you need

15.7.4 Semi-structured columns (brief)

15.7.5 Materialize on the warehouse (compute())

15.8 Bulk write routing

15.8.1 How "auto" chooses a path (current RSnowflake behaviour)

15.8.2 Is the Snowpark path copying the whole data frame through reticulate?

15.8.3 For very large data, prefer not to round-trip through R

15.8.4 Workspace benchmarks (order of magnitude)

15.9 Arrow and ADBC

15.10 SQL API limitations

15.11 Choosing an access pattern

15.12 Next steps

15.6.4 Aggregates (inside `summarise()`)

15.7.5 Materialize on the warehouse (`compute()`)

15.8.1 How `"auto"` chooses a path (current RSnowflake behaviour)