Copy tables to and from a database

James Wondrasek, Kirill Müller

2021-06-20

In this tutorial we introduce {dm} methods and techniques for copying individual tables and entire relational data models into an RDBMS. This is an integral part of the {dm} workflow. Copying tables to an RDBMS is often a step in the process of building a relational data model from locally hosted data. If your data model is complete, copying it to an RDBMS in a single operation allows you to leverage the power of the database and make it accessible to others. For modifying and persisting changes to your data at the row-level see vignette("howto-dm-rows").

Copy models or copy tables?

Using {dm} you can persist an entire relational data model with a single function call. copy_dm_to() will move your entire model into a destination RDBMS. This may be all you need to deploy a new model. You may want to add new tables to an existing model on an RDBMS. These requirements can be handled using the compute() and copy_to() methods.

Calling compute() or copy_to() requires write permission on the RDBMS, otherwise an error is returned. Therefore for the following examples we will instantiate a test dm and move it into a local SQLite database with full permissions. {dm} and {dbplyr} are designed so there is no difference between the code used to manipulate a local SQLite database and a remote RDBMS. The steps for this were already introduced in vignette("howto-dm-db") and will be discussed in more detail in the Copying a relational model section.

library(dm)
library(tidyverse)
#> ── Attaching packages ────────────────────────────────── tidyverse 1.3.1 ──
#>  ggplot2 3.3.3      purrr   0.3.4
#>  tibble  3.1.2      dplyr   1.0.6
#>  tidyr   1.1.3      stringr 1.4.0
#>  readr   1.4.0      forcats 0.5.1
#> ── Conflicts ───────────────────────────────────── tidyverse_conflicts() ──
#> x dplyr::filter() masks dm::filter(), stats::filter()
#> x dplyr::lag()    masks stats::lag()
library(dbplyr)
#> 
#> Attaching package: 'dbplyr'
#> The following objects are masked from 'package:dplyr':
#> 
#>     ident, sql

fin_dm <-
  dm_financial() %>%
  dm_select_tbl(-trans) %>%
  collect()
#> Error in (function () : rlang::is_installed("RMariaDB") is not TRUE

local_db <- DBI::dbConnect(RSQLite::SQLite())
#> Error in (function (cond) : error in evaluating the argument 'drv' in selecting a method for function 'dbConnect': there is no package called 'RSQLite'
deployed_dm <- copy_dm_to(local_db, fin_dm, temporary = FALSE)
#> Error in is.src(dest): object 'local_db' not found

Copying and persisting individual tables

As part of your data analysis you may combine tables from multiple sources and create links to existing tables via foreign keys, or create new tables holding data summaries. The example below, already discussed in vignette("howto-dm-db"), computes the total amount of all loans for each account.

my_dm_total <-
  deployed_dm %>%
  dm_zoom_to(loans) %>%
  group_by(account_id) %>%
  summarize(total_amount = sum(amount, na.rm = TRUE)) %>%
  ungroup() %>%
  dm_insert_zoomed("total_loans")
#> Error in is_dm(dm): object 'deployed_dm' not found

The derived table total_loans is a lazy table powered by the {dbplyr} package: the results are not materialized, instead an SQL query is built and executed each time the data is requested.

my_dm_total$total_loans %>%
  sql_render()
#> Error in sql_render(.): object 'my_dm_total' not found

To avoid recomputing the query every time you use total_loans, call compute() right before inserting the derived table with dm_insert_tbl(). compute() forces the computation of a query and stores the full results in a table on the RDBMS.

my_dm_total_computed <-
  deployed_dm %>%
  dm_zoom_to(loans) %>%
  group_by(account_id) %>%
  summarize(total_amount = sum(amount, na.rm = TRUE)) %>%
  ungroup() %>%
  compute() %>%
  dm_insert_zoomed("total_loans")
#> Error in is_dm(dm): object 'deployed_dm' not found

my_dm_total_computed$total_loans %>%
  sql_render()
#> Error in sql_render(.): object 'my_dm_total_computed' not found
#> Error in sql_render(my_dm_total_computed$total_loans): object 'my_dm_total_computed' not found

Note the differences in queries returned by sql_render(). my_dm_total$total_loans is still being lazily evaluated and the full query constructed from the chain of operations that generated it, and is required to run to access it, is still in place. Contrast that with my_dm_total_computed$total_loans where the query has been realized and accessing its rows requires a simple SELECT * statement. The table name, dbplyr_001, was automatically generated as the name argument was not supplied to compute().

The default is to create a temporary table. If you want results to persist across sessions in permanent tables, compute() must be called with the argument temporary = FALSE and a table name for the name argument. See ?compute for more details.

When called on a whole dm object (without zoom), compute() materializes all tables into new (temporary or persistent) tables by executing the associated SQL query and storing the full results. Depending on the size of your data this may take considerable time or be infeasible. It may be useful occasionally to create snapshots of data that is subject to change.

my_dm_total_snapshot <-
  my_dm_total %>%
  compute()
#> Error in compute(.): object 'my_dm_total' not found

Adding local data frames to an RDBMS

If you need to add local data frames to an existing dm object, use the copy_to() method. It takes the same arguments as copy_dm_to(), except the second argument takes a data frame rather than a dm. The result is a derived dm object that contains the new table.

To demonstrate the use of copy_to() the example below will use {dm} to pull consolidated data from several tables out of an RDBMS, estimate a linear model from the data, then insert the residuals back into the RDBMS and link it to the existing tables. This is all done with a local SQLite database, but the process would work unchanged on any supported RDBMS.

loans_df <-
  deployed_dm %>%
  dm_squash_to_tbl(loans) %>%
  select(id, amount, duration, A3) %>%
  collect()
#> Error in is_dm(dm): object 'deployed_dm' not found

Please note the use of dm_squash_to_tbl(). This method gathers all linked information into a single wide table. It follows foreign key relations starting from the table supplied as its argument and gathers all the columns from related tables, disambiguating column names as it goes.

In the above code, the select() statement isolates the columns we need for our model. collect() works similarly to compute() by forcing the execution of the underlying SQL query, but it returns the results as a local tibble.

Below, the local tibble, loans_df, is used to estimate the linear model and the residuals are stored along with the original associated id in a new tibble, loans_residuals. The id column is necessary to link the new tibble to the tables in the dm it was collected from.

model <- lm(amount ~ duration + A3, data = loans_df)
#> Error in is.data.frame(data): object 'loans_df' not found

loans_residuals <- tibble::tibble(
  id = loans_df$id,
  resid = unname(residuals(model))
)
#> Error in eval_tidy(xs[[j]], mask): object 'loans_df' not found

loans_residuals
#> Error in eval(expr, envir, enclos): object 'loans_residuals' not found

Adding loans_residuals to the dm is done using copy_to(). The call to the method includes the argument temporary = FALSE because we want this table to persist beyond our current session. In the same pipeline we create the necessary primary and foreign keys to integrate the table with the rest of our relational model. For more information on key creation see vignette("howto-dm-db") and vignette("howto-dm-theory").

my_dm_sqlite_resid <-
  copy_to(deployed_dm, loans_residuals, temporary = FALSE) %>%
  dm_add_pk(loans_residuals, id) %>%
  dm_add_fk(loans_residuals, id, loans)
#> Error in copy_to(deployed_dm, loans_residuals, temporary = FALSE): object 'deployed_dm' not found

my_dm_sqlite_resid %>%
  dm_set_colors(violet = loans_residuals) %>%
  dm_draw()
#> Error in dm_get_def(dm, quiet): object 'my_dm_sqlite_resid' not found
my_dm_sqlite_resid %>%
  dm_examine_constraints()
#> Error in is_dm(dm): object 'my_dm_sqlite_resid' not found
my_dm_sqlite_resid$loans_residuals
#> Error in eval(expr, envir, enclos): object 'my_dm_sqlite_resid' not found

Persisting a relational model with copy_dm_to()

Persistence, because it is intended to make permanent changes, requires write access to the source RDBMS. The code below is a repeat of the code that opened the Copying and persisting individual tables section at the beginning of the tutorial. It uses the {dm} convenience function dm_financial() to create a dm object corresponding to a data model from a public dataset repository. The dm object is downloaded locally first, before deploying it to a local SQLite database.

dm_select_tbl() is used to exclude the transaction table trans due to its size, then the collect() method retrieves the remaining tables and returns them as a local dm object.

dm_financial() %>%
  dm_nrow()
#> Error in (function () : rlang::is_installed("RMariaDB") is not TRUE
fin_dm <-
  dm_financial() %>%
  dm_select_tbl(-trans) %>%
  collect()
#> Error in (function () : rlang::is_installed("RMariaDB") is not TRUE

fin_dm
#> Error in eval(expr, envir, enclos): object 'fin_dm' not found

It is just as simple to move a local relational model into an RDBMS.

destination_db <- DBI::dbConnect(RSQLite::SQLite())
#> Error in (function (cond) : error in evaluating the argument 'drv' in selecting a method for function 'dbConnect': there is no package called 'RSQLite'

deployed_dm <-
  copy_dm_to(destination_db, fin_dm, temporary = FALSE)
#> Error in is.src(dest): object 'destination_db' not found

deployed_dm
#> Error in eval(expr, envir, enclos): object 'deployed_dm' not found

Note that in the call to copy_dm_to() the argument temporary = FALSE is supplied. Without this argument the model would still be copied into the database, but the argument would default to temporary = TRUE and the data would be deleted once your session ends.

In the output you can observe that the src for deployed_dm is SQLite, while for fin_dm the source is not indicated because it is a local data model.

Copying a relational model into an empty database is the simplest use case for copy_dm_to(). If you want to copy a model into an RDBMS that is already populated, be aware that copy_dm_to() will not overwrite pre-existing tables. In this case you will need to use the table_names argument to give the tables unique names.

table_names can be a named character vector, with the names matching the table names in the dm object and the values containing the desired names in the RDBMS, or a function or one-sided formula. In the example below, paste0() is used to add a prefix to the table names to provide uniqueness.

dup_dm <-
  copy_dm_to(destination_db, fin_dm, temporary = FALSE, table_names = ~ paste0("dup_", .x))
#> Error in is.src(dest): object 'destination_db' not found

dup_dm
#> Error in eval(expr, envir, enclos): object 'dup_dm' not found
remote_name(dup_dm$accounts)
#> Error in remote_name(dup_dm$accounts): object 'dup_dm' not found
remote_name(deployed_dm$accounts)
#> Error in remote_name(deployed_dm$accounts): object 'deployed_dm' not found

Note the different table names for dup_dm$accounts and deployed_dm$accounts. For both, the table name is accounts in the dm, but they link to different tables on the database. In dup_dm the table is backed by the table dup_accounts in the RDBMS. dm_deployed$accounts shows us that this table is still backed by the accounts table from the copy_dm_to() operation we performed in the preceding example.

Managing tables in the RDBMS is outside the scope of dm. If you find you need to remove tables or perform operations directly on the RDBMS, see the {DBI} package.

When done, do not forget to disconnect:

DBI::dbDisconnect(destination_db)
#> Error in h(simpleError(msg, call)): error in evaluating the argument 'conn' in selecting a method for function 'dbDisconnect': object 'destination_db' not found
DBI::dbDisconnect(local_db)
#> Error in h(simpleError(msg, call)): error in evaluating the argument 'conn' in selecting a method for function 'dbDisconnect': object 'local_db' not found

Conclusion

dm makes it straightforward to deploy your complete relational model to an RDBMS using the copy_dm_to() function. For tables that are created from a relational model during analysis or development, compute() and copy_to() can be used to persist them between sessions or to copy local tables to a database dm. The collect() method downloads an entire dm object that fits into memory from the database.

Next steps

If you need finer-grained control over modifications to your relational model, see vignette("howto-dm-rows") for an introduction to row level operations, including updates, insertions, deletions and patching.

If you feel you need to know more about relational data models in order to get the most out of dm, check out vignette("howto-dm-theory").

If you’re familiar with relational data models but want to know how to work with them in dm, then any of vignette("tech-dm-join"), vignette("tech-dm-filter"), or vignette("tech-dm-zoom") is a good next step.