#### 2020-09-29

library(readxlsb)

read_xlsb attempts to import a region from a binary format Excel workbook (xlsb)

The parameters are

read_xlsb(path, sheet, range, col_names, col_types, na, trim_ws, skip, ...)

## sheet

Either a name, or the index of the sheet to read. Index of the first sheet is 1. If the sheet name is embedded in the range argument, or implied if range is a named range, then this argument is ignored

## range

range can be specified as

• A named range. Named ranges are not case sensitive
• In Sheet!A1 notation
• In Sheet!R1C1 notation
• As a cellranger::cell_limits object

## col_names

• TRUE: The first row is used for column names. Empty cells result in a column name of the form ‘column.i’
• FALSE: Column names will be ‘column.i’
• Character vector: vector containing column names.

## col_types

Can be implied from the spreadsheet or specified in advanced. When specifying types, options are

• logical” (or “boolean”), “numeric” (or “double”), “integer”, “date” and “string” (or “character”)
• Use “skip” (or “ignore”) to skip a column

When implying types from the underlying spreadsheet data, the resultant type is the regarded as the ‘least fragile’.

Effectively the order is logicaldatetimeintegerdoublestring

• If 99 rows are of type ‘integer’ and 1 row is of type ‘double’, then all cells are regarded as ‘double’ in that column.
• If 99 rows are of type ‘date’ and 1 row is of type ‘string’, then all cells are promoted to ‘string

Currently ‘date’ is implied from cell formatting. It is either one of the built-in Excel datetime formats or a custom format where the format string contains only the characters Y, M, D, H, S, y, m, d, h, s and - (dash), : (colon), (space), . (dot). That should be good enough to identify any dates.

If ‘date’ is specified as the column type, then any strings are converted to dates. The format I’m afraid isn’t flexible at this stage - it’s assumed to be of the type “%Y-%m-%dT%H:%M:%S”.

I think there’s a 1900 leap year bug in Excel. I haven’t corrected for that.

If any of the cells in a ‘date’ column contain time, then a POSIXct object is returned, otherwise a Date object is returned. The timezone for POSIXct is set as UTC. Seems like the sensible thing to do.

## na

A character string that is interpret as NA. This does not effect the implied data type for a column.

## trim_ws

Should leading and trailing whitespaces be trimmed from character strings?

## …

Additional options. At present just debug = TRUE is supported. This returns a list with fields ‘result’ set to the resulting data.frame and ‘env’ set to an internal environment that may be useful for debugging.

res = read_xlsb(path = system.file("extdata", "TestBook.xlsb", package = "readxlsb"), range = "PORTFOLIO", debug = TRUE)

ls(res$env) #> [1] "content" "named_ranges" "sheets" "stream" res$env$named_ranges #> name range sheet_idx first_column first_row #> 1 INFO_RELEASE FirstSheet!$A$11 0 1 11 #> 2 OUTLOOK 'My SecondTab'!$A$1:$C$13 1 1 1 #> 3 PORTFOLIO FirstSheet!$A$3:$C$9 0 1 3 #> 4 SAVED_DATETIME FirstSheet!$C$13 0 3 13 #> 5 TITLE FirstSheet!$A\$1         0            1         1
#>   last_column last_row
#> 1           1       11
#> 2           3       13
#> 3           3        9
#> 4           3       13
#> 5           1        1