There are several utilities in the R ecosystem for reproducible research. The package repana (for Reproducible Analysis) help in having a common directory structure where to save the files that should be consider part of the main stream of production, and files that are products of the main stream as well as modified files no longer part of the main stream but need to be kept such as re-formatted reports or presentations.

The aspiration of this package is that you can set up an analysis with the make_structure() function, have access to the database using get_con() function, have your tables in the database documented with the update_table() function, and reproduce a complete analysis by running the master() function

Directory Structure

The function make_structure() reads the config.yml files and ensure all entries on the dirs section exists. The config.yml created by default will produce the following directories for the data, functions, database, logs, reports and handmade entries of the config.yml:

The directories can be used in your programs by the get_dirs() function.

The information in data, functions, handmade as well as the scripts in the root directory should be preserved as they are the core of your analysis. The files in logs, reports and database are created and recreated as result of your analysis’ scripts. Those are the results of your analysis.

The function clean_structure()clean those directories included in the list clean_before_new_analysis so a new analysis could be re-run without worries that new and old results are mixed. If you use git, those directories are candidates to be excluded from the control version by having them in the .gitignore file. (make_structure() take care of create a .gitignore if it does not exist or include those directories if they have not been yet included).

The function make_structure() writes a config.yml if it does not exists. This file is used by the config::get() function. It contains a the following entries under the default: tree

dirs: to define the directories that make_structure will maintain. It have the entry values for data, functions, database, reports, logs, handmade directories. If you prefer other options than the defaults values you may change it. You may access those directories in your programs using get_dirs() Note that the name for data and function does not have a underscore but by default the values are _data and _function respectively. make_structure will create the paths with the value of the entries but you access them in your program with the name of the entry. This will provide the freedom to direct the real path of the directory to any place you need.

clean_before_new_analysis: to define the list of directories that should be cleaned every time you want to repeat the analysis from zero. This directories are included in the .gitignore

defaultdb: is written with the parameters for a RSQLite SQLite driver

You may add other entries that your analysis may requires. The config.yml itself should also be included in your .gitignore file as it is something that change from system to system (i.e. driver parameters) so you should include in the documentation of your analysis what entries should be defined so you can reproduce the analysis in other machine.

The configuration of database

DBI and Pool connections are used as a way to keep data as well as results in a database system. You must provide, at least, values for the package and dbconnection entries corresponding to the package that host the dbConnection and the name of the dbConnection function. Notice entry names are lowercase. The rest of the entries must correspond for parameters for your driver connection or pool connection.

Example to use RSQLite with a results.db file in the database directory

    package: RSQLite
    dbconnect: SQLite
    dbname: dbname/results.db

Example to use RPostgres

    package: RPostgres
    dbconnection: Postgres
    dbname: testdb
    host: localhost
    port: 5432
    user: username
    password: password

You can define several configuration to use different databases in the same analysis, but the defaultdb will be used by default for the update_table() function.

The function update_table will save a data.frame into the database, and will keep a log in the log_table table with the timestamps the file was updated in the database. The log_table keep a record of when was the table included in the database and a comment that will help to trace the origin. You may include the date data was obtained or the source of the data.

update_table(p_con,"iris", "from system)

The master function

The master(pattern, start, stop, logdir, rscript_path) function execute in a plain vanilla R process each one of the files identified by the pattern. By default use the pattern is "^[0-9][0-9].*\\.R$", which include all files like 01_read_data.R, 02_process_data.R, 03_analysis_data.R, 04_make_report.R but not report1.rmd, exploratory.R etc.. Files are run on the order starting from the first but if for any reason you need to omit the first files you may skip them with the start parameter.

logdir is the directory for the logs, by default get_dirs()$logs

rscript_path is the full path to the Rscript program which is at the end the one that process the R file. The current default is for a OS system. But implementation for Linux and Windows will soon be implemented.

The master function use functions from the processx package.

The config.yml file

Here is an example of the config.yml created by make_structure()

    data: _data
    functions: _functions
    handmade: handmade
    database: database
    reports: reports
    logs: logs
    - database
    - reports
    - logs
    package: RSQLite
    dbconnect: SQLite
    dbname: ":memory:"
    driver:  /usr/local/lib/libsqlite3odbc.dylib 
    database: database/results.db


You may download the package from git-hub. Within R you may use devtools::install_github() as:

devtools::install_github("johnaponte/repana", build_manual = T, build_vignettes = T)