This function writes a large R data frame or tibble (df) to a Delta Lake
table (target_table) on Databricks using Databricks Volumes and the
COPY INTO SQL command.
Usage
write_df_to_delta(
df,
target_table,
db_conn,
column_types_schema = NULL,
volume_dir,
copy_options = "'mergeSchema' = 'true'",
overwrite_table = FALSE,
chunk_size_bytes = 5 * 1024^3
)
Arguments
- df
A data.frame or tibble containing the data to be written to Delta Lake.
- target_table
A character string specifying the name of the Delta table. Can be unqualified ("table"), partially qualified ("schema.table"), or fully qualified ("catalog.schema.table"). If not fully qualified, the function resolves the catalog and/or schema from the active database connection.
- db_conn
A valid DBI connection object to Databricks. This connection is used to interact with the Delta table.
- column_types_schema
An optional Arrow Schema object (created via arrow::schema(...)) used to explicitly enforce precise data types during Parquet conversion. Defaults to NULL. If NULL, data types are inferred from the R data frame.
Type mapping reference (Arrow schema field → Spark SQL type):
Arrow int8/int16 → TINYINT/SMALLINT
Arrow int32/int64 → INT/BIGINT
Arrow float/double → FLOAT/DOUBLE
Arrow string/large_string → STRING
Arrow bool → BOOLEAN
Arrow date32[day] → DATE
Arrow timestamp['s' | 'ms' | 'us', ...] → TIMESTAMP_NTZ (no timezone)
Arrow timestamp['s' | 'ms' | 'us', tz = ...] → TIMESTAMP (with timezone)
Arrow decimal(P,S) → DECIMAL(P,S)
Important notes for schema use:
Factors: If the R data frame contains a factor column, the corresponding Arrow schema field must be utf8() or large_utf8(). The factor labels are automatically converted and mapped to a STRING type in the Delta table.
Timestamp precision: Second (s), millisecond (ms), and microsecond (us) precisions are supported. Nanosecond (ns) precision may be incompatible with the current Databricks runtime environment.
A schema construction sketch follows the argument list below.
- volume_dir
A character string specifying the path to the target Databricks Volume where the Parquet file will be uploaded.
- copy_options
A character string specifying options for the COPY INTO command, e.g., 'mergeSchema' = 'true'. Defaults to "'mergeSchema' = 'true'".
- overwrite_table
Logical; if TRUE, deletes and recreates the Delta table before import. If FALSE and the table does not exist, the function throws an error. Defaults to FALSE.
- chunk_size_bytes
An integer specifying the maximum size of each data chunk in bytes; the data frame is split into chunks of this size for uploading. Defaults to 5 GB (5 * 1024^3 bytes).
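As an illustration of the type mapping above, a schema for a hypothetical data frame (column names here are placeholders, not part of this package) could be built with arrow::schema() and passed via column_types_schema:

library(arrow)
# Hypothetical columns: an ID, a factor label, a monetary amount,
# an event date, and a UTC event timestamp.
my_schema <- schema(
  id         = int64(),                                   # -> BIGINT
  label      = utf8(),                                    # factor column -> STRING
  amount     = decimal(precision = 18, scale = 2),        # -> DECIMAL(18,2)
  event_date = date32(),                                  # -> DATE
  event_ts   = timestamp(unit = "us", timezone = "UTC")   # -> TIMESTAMP
)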
Details
The function performs the following steps:
Optionally overwrites the target table.
Uploads the data as Parquet file(s) to a specified Databricks Volume.
Executes a COPY INTO command to load the file(s) into Delta Lake (sketched below).
Deletes the temporary file(s) from the Volume after loading is complete.
Data is chunked into segments during upload to accommodate the Databricks REST API limit of 5 GB per single file upload.
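As a rough sketch (not the exact internals of write_df_to_delta()), the chunking and load steps could look like the following, where df is the data frame being written, con is the connection from the example below, and the staging path and table name are placeholders:

chunk_size_bytes <- 5 * 1024^3
# Number of Parquet chunks needed to stay under the per-file upload limit.
n_chunks <- ceiling(as.numeric(object.size(df)) / chunk_size_bytes)

staging_dir <- "/Volumes/catalog/schema/tmp_staging"  # hypothetical staging path
# Once the Parquet chunk(s) are uploaded to staging_dir, a COPY INTO
# statement loads them into the Delta table.
copy_sql <- paste0(
  "COPY INTO catalog.schema.my_table ",
  "FROM '", staging_dir, "' ",
  "FILEFORMAT = PARQUET ",
  "COPY_OPTIONS ('mergeSchema' = 'true')"
)
DBI::dbExecute(con, copy_sql)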
Note
To use this function, users must ensure that they have appropriate Databricks permissions:
Catalog/schema: USE CATALOG on the target catalog and USE SCHEMA on the target schema.
Table creation: CREATE TABLE on the target schema (if overwrite_table = TRUE) or MODIFY and SELECT on the existing table.
Volume access: READ VOLUME and WRITE VOLUME on the Databricks Volume used for staging (specified in volume_dir).
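For reference, a catalog administrator could issue the corresponding grants roughly as follows (object and principal names are placeholders):

grants <- c(
  "GRANT USE CATALOG ON CATALOG my_catalog TO `user@example.com`",
  "GRANT USE SCHEMA ON SCHEMA my_catalog.my_schema TO `user@example.com`",
  "GRANT CREATE TABLE ON SCHEMA my_catalog.my_schema TO `user@example.com`",
  "GRANT READ VOLUME, WRITE VOLUME ON VOLUME my_catalog.my_schema.my_volume TO `user@example.com`"
)
# Run each grant statement over an admin connection.
for (g in grants) DBI::dbExecute(con, g)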
Moreover, this function requires valid environment variables (e.g., set in .Renviron) for
authentication, specifically DATABRICKS_TOKEN and DATABRICKS_HOST.
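A minimal check that these variables are set in the current session:

stopifnot(
  nzchar(Sys.getenv("DATABRICKS_HOST")),
  nzchar(Sys.getenv("DATABRICKS_TOKEN"))
)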
See also
Other databricks:
check_databricks_odbc()
Examples
if (FALSE) { # \dontrun{
# Setup connection using environment variables
con <- DBI::dbConnect(odbc::databricks(),
httpPath = Sys.getenv("DATABRICKS_SQL_PATH")
)
write_df_to_delta(
df = my_data,
target_table = "catalog.schema.my_table",
db_conn = con,
volume_dir = "/Volumes/catalog/schema",
overwrite_table = TRUE
)
} # }
