Rmarkdown on JLA Data

Parametrized Reports in R

Fri, 21 Jun 2019 00:00:00 +0000

Producing reports in pdf, html and docx formats (that is, files readable in Adobe Acrobat Reader, a web browser and MS Word) is a fairly well known strength of R.

Less frequently used, but just as interesting, is the option of parametrized reporting. This slightly more advanced technique is built on passing a value, a parameter, to R Markdown when the report is generated. A single source markdown template can thus produce several finished documents.

Typical use cases for parametrization are:

reports identical in data source and structure, but prepared as of a different date
a set of reports of the same structure, but slightly different data (for example as of the same date for several regions)

The description makes it clear that parametrization is a good way to get rid of boring, tedious and error-prone manual work.

It pays off especially when an originally one-off report becomes institutionalized. Which, especially when working in a corporation, may happen…

Building a parametrized report is a topic for more than one file; it requires at least two:

an R Markdown template with a defined parameter
an R script that calls the template with a specific value of the parameter

When saving to pdf it is often practical to add a LaTeX template as well.
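The two-file setup can be sketched roughly as follows. This is a minimal illustration, not the code from the GitHub project: the region names, the reporting date, the file name template.Rmd and the helper render_all() are all made up, and the template is assumed to declare region and date in the params section of its YAML header.

```r
# Hypothetical parameter grid: one report per region, all as of the same date.
# Region names and the date are placeholders for the sake of illustration.
regions <- c("Praha", "Jihomoravsky", "Moravskoslezsky")
report_date <- as.Date("2019-06-21")

jobs <- data.frame(region = regions,
                   date = report_date,
                   stringsAsFactors = FALSE)

# one output file per row, named after the region and the reporting date
jobs$output_file <- paste0("report-", jobs$region, "-",
                           format(jobs$date, "%Y-%m-%d"), ".pdf")

# "template.Rmd" is a placeholder; it is assumed to declare the
# parameters `region` and `date` in its YAML header
render_all <- function(jobs, template = "template.Rmd") {
  for (i in seq_len(nrow(jobs))) {
    rmarkdown::render(template,
                      params = list(region = jobs$region[i],
                                    date = jobs$date[i]),
                      output_file = jobs$output_file[i],
                      quiet = TRUE)
  }
  invisible(jobs$output_file)
}
```

Calling render_all(jobs) would then produce one pdf per row of jobs, provided the rmarkdown package and a LaTeX installation are available.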

I offer an illustrative example of a parametrized report, which demonstrates working with parameters in Rmd and passing them via rmarkdown::render(). Because the example by its nature works with multiple files, it was not practical to publish it on these pages. Instead I have stored it on GitHub.

You can quickly and easily clone the project from https://github.com/jlacko/R4RPTG.git following the procedure described in my "cesta erka" (the way of R) guide.

For illustration I use my favorite time series: the price of beer in regions, as published by the Czech Statistical Office (ČSÚ).

As for further development, it is worth considering integrating the report generation with the cronR package, for more transparent scheduling of jobs in a Linux environment (i.e. in the context of the server version of RStudio).
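A minimal sketch of such scheduling, assuming the cronR package is installed on a Linux machine with cron available. The script path, the job id and the 7AM schedule are placeholders, and the helper schedule_reports() is hypothetical; cron_rscript() and cron_add() are the cronR functions doing the actual work.

```r
# Sketch only: nothing is scheduled until schedule_reports() is called.
# The default path is a placeholder for wherever the master script lives.
schedule_reports <- function(script = "/home/analyst/reports/master.R") {
  # build the Rscript command line for the master script
  cmd <- cronR::cron_rscript(script)
  # register a job that runs the master script every day at 7AM
  cronR::cron_add(command = cmd,
                  frequency = "daily",
                  at = "7AM",
                  id = "parametrized-reports",
                  description = "Daily run of the parametrized reports")
}
```

The job id makes it easier to find the entry later, for example when listing or removing scheduled jobs.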

Another logical step is automating the distribution of the reports created this way, but that depends heavily on the specific infrastructure.

Unbearable Lightness of SQL Code Chunks

Tue, 17 Apr 2018 00:00:00 +0000

Using code chunks with R code in RMarkdown documents is a well understood (and much appreciated!) topic. In this post I would like to draw attention to a slightly different aspect of RMarkdown, that is the option of writing code chunks in different programming languages.

I have yet to find the need to mix and match R and Python code in a single document, but I have found it advantageous to use SQL code chunks.

Like it or not, SQL is the de facto language of data, and SQL code is immediately clear to any old BI hand - much more so than a dplyr pipeline. In addition it allows me to use features of the SQL language that do not translate easily to R code.

The first task is creating a database connection; this needs to be done in an R (or Python, but let us stick to R) code chunk.

library(DBI) # generics dbConnect() and dbDisconnect()
library(odbc)
con <- dbConnect(odbc::odbc(), 
                 driver = "PostgreSQL Unicode", 
                 server = "db.jla-data.net", 
                 port = 5432, 
                 uid = "babisobot", # user babisobot has select rights only ...
                 password = "babisobot", # ... so his password need not be too secret :)
                 database = "dbase")

The next chunk is declared as SQL, i.e. its header reads {sql ... } instead of {r ... }, and it is necessary to specify both connection = con and output.var = "frmVystup" in the header (i.e. in the curly braces). The quotation marks around the output variable are important.

select 
  date_trunc('day', saved) date,
  count(1) volume
from 
  babisobot 
group by 
  date_trunc('day', saved)
order by 
  2 desc
limit 5

Now that I have the result of the SQL script safely stored in the variable frmVystup I can use it in my further work in R. For this proof of concept showing the data frame in a simple kable is enough.
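For readers who want to try the idea outside of a knitted document: a SQL chunk with output.var set behaves roughly like a call to DBI::dbGetQuery() on the same connection. The sketch below assumes the DBI and RSQLite packages are installed; an in-memory SQLite database with three made-up rows stands in for the PostgreSQL server, so the query is simplified and the numbers are not those of the post.

```r
library(DBI)

# in-memory SQLite database as a stand-in for the real server (assumption)
con <- dbConnect(RSQLite::SQLite(), ":memory:")

# made-up rows, just enough to have something to count
dbWriteTable(con, "babisobot",
             data.frame(saved = c("2018-03-25", "2018-03-25", "2018-04-06"),
                        stringsAsFactors = FALSE))

# roughly what the SQL chunk does when output.var = "frmVystup" is set
frmVystup <- dbGetQuery(con, "
  select saved as day, count(1) as volume
  from babisobot
  group by saved
  order by 2 desc
  limit 5")

class(frmVystup) # a plain data frame, ready for kable() or dplyr

dbDisconnect(con)
```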

library(knitr) # kable() lives here
library(kableExtra)

kable(frmVystup, # the variable created in previous chunk
      format = 'html',
      booktabs = TRUE,
      align = c('l', 'r')) %>%
  kable_styling(full_width = FALSE) %>%
  column_spec(1, width = "6cm")

date	volume
2018-03-25	3363
2018-04-06	1753
2018-04-10	1741
2018-03-27	1537
2018-04-11	1505

Last, but not least, remember to close the database connection on exit.

dbDisconnect(con) # because it is good manners to shut the door and turn off the light

Parametrized R Markdown Reports

Wed, 10 Jan 2018 00:00:00 +0000

Every business, no matter how big or small, simple or sophisticated, requires regular reports to run. RStudio, especially in its server flavor with the option of cron jobs, is eminently capable of producing these. Parametrized reports are thus able to perform the role of a gateway drug and wean the analytic team off their beloved Excel sheets.

In fact, if I were looking for a single feature to convince a die-hard Excel user to see the light and give up his VLOOKUP, I would stress the ease of regular reporting with parametrized reports. It might not be a fancy ML / AI technique that catches the headlines, but it is one of the small things which take the pain out of everyday chores.

This example will demonstrate creating parametrized reports using the well known and much loved Iris dataset.

It will show:

an R Markdown template, with a single parameter species defined
using the knitr::kable function and the kableExtra package to build a simple table with a calculated summary row and some basic formatting
a master R script, calling rmarkdown::render on the template to build the reports, iterating the value of the parameter species over the unique values of Species from the Iris dataset

In its simplest form the R Markdown template needs just two parts:

a YAML header
a single R chunk

---
title: "Iris *`r params$species`* are rather cute..." # a report looks better with the title set
params:  # this is the parameter declaration
  species: "setosa" # default value, overridden by the render function, but helpful for debugging
output:
  pdf_document:
    latex_engine: pdflatex
header-includes:
- \usepackage{booktabs}
- \usepackage{longtable}
- \usepackage{array}
- \usepackage{multirow}
- \usepackage[table]{xcolor}
- \usepackage{wrapfig}
- \usepackage{float}
- \usepackage{colortbl}
- \usepackage{pdflscape}
- \usepackage{tabu}
- \usepackage{threeparttable}
- \usepackage[normalem]{ulem}
---

The YAML header needs to include declaration of the parameters (indentation is, as is often the case with YAML, crucial). Including a default value is optional, but helpful in debugging.

The header-includes option loads the LaTeX packages necessary for table formatting; this list, helpfully provided by Hao Zhu (the author of the kableExtra package), should keep the dreaded LaTeX error “environment xyz undefined” at bay.

library(tidyverse)
library(knitr)
library(kableExtra)

src <- iris %>% # here you would normally load a file or connect to a database...
  filter(Species == params$species) %>%
  mutate(Species = as.character(Species)) %>% # factor would be a problem for summary row
  select(Species, Sepal.Length) %>% # just two columns for the sake of clarity...
  slice(1:5) # first five rows only, so that page space is not an issue

src <- rbind(src, # add summary row 
             c("Grand total", sum(src$Sepal.Length)))

kable(src,
      format = 'latex',
      booktabs = TRUE,
      align = c('l','r')) %>%
  row_spec(nrow(src), bold = TRUE) # make the last (summary) row bold

The body chunk needs to:

load your libraries (note that knitr, where kable lives, is not a formal part of tidyverse - it is ‘just’ suggested - and needs to be loaded separately)
load your data (I have cheated a little, and used the pre-loaded Iris dataset) and
perform the necessary filtering / aggregating

Note how params$species is applied as filter condition, and how the summary row is created by binding a new row to the filtered dataset.
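Two small refinements of the body chunk may be worth considering; the sketch below uses base R only and is a variation on the chunk above, not a replacement for it. First, it checks the parameter value before filtering, so that a typo fails with a readable message rather than an empty table. Second, it binds the summary row as a one-row data frame; binding a character vector, as the chunk does, turns Sepal.Length into text. The params list is faked here, since outside of a rendered report rmarkdown does not supply it.

```r
# `params` is created by rmarkdown when the template is rendered;
# it is faked here so that the sketch runs on its own
params <- list(species = "setosa")

# stop early, with a readable message, on a value the data does not contain
if (!params$species %in% levels(iris$Species)) {
  stop("Unknown species: ", params$species)
}

# same filtering as in the template, written in base R
src <- iris[iris$Species == params$species, c("Species", "Sepal.Length")]
src <- head(src, 5) # first five rows only
src$Species <- as.character(src$Species)

# binding a one-row data frame keeps Sepal.Length numeric
src <- rbind(src,
             data.frame(Species = "Grand total",
                        Sepal.Length = sum(src$Sepal.Length),
                        stringsAsFactors = FALSE))
```

With a numeric column the table can still be right aligned, and number formatting (for example the digits argument of kable) applies to the summary row as well.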

The master script needs to do two things:

construct a vector of unique Iris species, each of which will be passed as a parameter to the render function to generate a report
call the render function from the rmarkdown package, with a list of parameters as required by the template. In this simple case just a single parameter ‘species’.

library(rmarkdown)

# as.character so that plain strings, not factor levels, are passed around
flowers <- as.character(unique(iris$Species)) # setosa, versicolor, virginica - you know them all, don't you?

for (i in seq_along(flowers)) {
  myIris <- flowers[i]  # my species - to be reused as 1) parameter & 2) file name
  render("report-template.Rmd", # the template
         params = list(species = myIris), # value of myIris passed to the species parameter
         output_file = paste0(myIris, '.pdf'), # name of the output file - species name and pdf extension
         quiet = TRUE,
         encoding = 'UTF-8')
}

When you put it all together and source the master script you should end up with three pdf files, one for each species.

You can download a working example of both the markdown document and master script directly from my pages.

As a next step I recommend learning more about the cronR package - when teamed with the parametric report functionality you get a report that makes itself; a business analyst's dream!