Data Science part V (Reproducible Analysis)

• Reproducible Analysis: Bibliographies

Often you want to cite books or papers in a report. You can, of course, handle citations manually, but a better approach is to keep the citation information in a file and refer to it using markup tags. To add a bibliography, you use a tag in the YAML header called bibliography.

    ---
    ...
    bibliography: bibliography.bib
    ...
    ---

You can use several different formats here; see the R Markdown documentation (http://rmarkdown.rstudio.com/authoring_bibliographies_and_citations.html) for a list. The suffix .bib is used for BibLaTeX. The format for the citation file is the same as BibTeX, and you can get citation information in that format from nearly every site that provides bibliography information.
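As a rough illustration of what such a citation file might contain, here is a minimal sketch of one entry in bibliography.bib. The identifier smith04 is the one used in the text; the author, title, journal, and other field values are placeholders invented for this example and do not refer to a real publication.

```bibtex
% Hypothetical entry: every field value below is a placeholder.
% Only the identifier "smith04" comes from the surrounding text.
@article{smith04,
  author  = {Smith, Jane},
  title   = {A Placeholder Title},
  journal = {Journal of Examples},
  year    = {2004},
  volume  = {1},
  pages   = {1--10}
}
```

The identifier right after the opening brace is what you write after the @ when citing the entry in the document.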
To cite something from the bibliography, you use [@smith04] where smith04 is the identifier used in the bibliography file. You can cite more than one paper inside square brackets, separated by semicolons, [@smith04; @doe99], and you can add text such as chapters or page numbers, [@smith04, chapter 4]. To suppress the author name(s) in the citation, say when you have already mentioned the name in the text, you put - before the @, so you write As Smith showed [-@smith04].... For in-text citations, similar to \citet{} in natbib, you just leave out the brackets: @smith04 showed that... and you can combine that with additional citation information as @smith04 [chapter 4] showed that....

To specify the citation style to use, you use the csl tag in the YAML header.

    ---
    ...
    bibliography: bibliography.bib
    csl: biomed-central.csl
    ...
    ---

Check out the citation styles list at https://github.com/citation-style-language/styles for a large number of different formats. Most, if not all, of what your heart desires should be there.

Controlling the Output (Templates/Stylesheets)

The pandoc tool has a powerful mechanism for formatting the documents it generates. For HTML output, this is done with CSS stylesheets, and for all output formats, with templates that determine how the output is laid out. The template mechanism lets you write an HTML or LaTeX document, say, that determines where the various parts of the text go and where variables from the YAML header are used. This mechanism is far beyond what we can cover in this chapter, but I want to mention it in case you want to start writing papers using R Markdown. You can do this; you just need a template that formats the document in the style a journal wants. Journals often provide LaTeX templates, and you can modify these to work with Markdown. There isn’t much support for this in RStudio, but for HTML documents, you can use the Output Options command (click on the gear icon) to choose different output formatting.
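To tie the citation syntax described above together, here is a minimal sketch of a complete R Markdown document that uses it. It assumes that bibliography.bib contains entries with the identifiers smith04 and doe99 and that biomed-central.csl has been downloaded into the same directory; the title and the sentences are made up for illustration.

````markdown
---
title: "Citation demo"            # hypothetical title
bibliography: bibliography.bib    # assumed to define smith04 and doe99
csl: biomed-central.csl           # assumed to sit next to this file
---

Earlier work established this [@smith04; @doe99].

The details are in [@smith04, chapter 4].

As Smith showed [-@smith04], the effect is robust.

@smith04 [chapter 4] showed that the effect is robust.
````

The first two sentences produce bracketed citations, the third suppresses the author name, and the fourth is an in-text citation with additional location information.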
Running R Code in Markdown Documents

The formatting so far is all Markdown (and YAML). Where it combines with R and becomes R Markdown is through knitr. When you format a document, the first step evaluates the R code to create a Markdown document. knitr does this by running all the R code you want executed and putting the results into the Markdown document. This translates an .rmd document into an .md document, but the intermediate document is deleted afterward unless you explicitly tell RStudio not to do so.

The simplest R code you can evaluate is inline, as part of the text. If you want an R expression evaluated, you use backticks but add r right after the first one. So to evaluate 2 + 2 and put the result in your Markdown document, you write `r 2 + 2` and get the result 4 inserted into the text. You can write any R expression there to get it evaluated. This is useful for inserting short summary statistics like means and standard deviations directly into the text and ensuring that the summaries are always up to date with the actual data you are analyzing.

For longer chunks of code, you use fenced code blocks, delimited by three backticks. Instead of just writing:

```r
2 + 2
```

which will only display the code (highlighted as R code), you put the r in curly brackets, {r}. This will insert the code in your document but will also show the result of evaluating it right after the code block. The boilerplate code you get when creating an R Markdown document in RStudio shows you examples of this (see Figure 2-3).

Figure 2-3. Code chunk in RStudio

You can name code chunks by putting a name right after r. You don’t have to name all chunks, and if you have a lot of chunks, you probably won’t bother naming all of them. But if you give them a name, they are easily located by clicking on the structure button in the bar below the document (see Figure 2-4). You can also use the name to refer to chunks when caching results, which we will cover later.
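To make the two forms concrete, here is a small sketch of a document fragment that combines an inline expression with a named chunk. It uses the cars data set that ships with R; the chunk name speed-summary is an arbitrary name chosen for this example.

````markdown
The mean speed in the `cars` data set is `r mean(cars$speed)`.

```{r speed-summary}
# A named chunk: the code is shown in the output document,
# followed by the result of evaluating it.
summary(cars$speed)
```
````

When the document is compiled, the inline expression is replaced by the computed mean, and the chunk appears as code followed by the printed summary.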
If you modify the chunk options in the dialog box, you will see that the options are included in the top line of the chunk. You can, of course, also control the options manually there, and there are more options than those you can control from the GUI window. You can read the knitr documentation for all the details (http://yihui.name/knitr/). The dialog box will handle most of your needs, though, except for displaying tables or when you want to cache the results of chunks, both of which we return to later.

Using Chunks when Analyzing Data (Without Compiling Documents)

Before continuing, though, I want to stress that doing data analysis in an R Markdown document is useful for more than just creating documents. I personally do all my analysis in these documents because I can combine documentation and code, regardless of whether I want to generate a report at the end. The combination of explanatory text and analysis code is simply convenient to have. The way code chunks are evaluated as separate pieces of analysis is also part of this. You can evaluate chunks individually, or all chunks down to a point, and I find that very convenient when doing an analysis. There are keyboard shortcuts for evaluating all chunks, all previous chunks, or just the current chunk (see Figure 2-7), which makes it very easy to write a bit of code for an exploratory analysis and evaluate just that piece of code. If you are familiar with Jupyter or similar notebooks, you will recognize the workflow.
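As an illustration of options appearing in the top line of a chunk, here is a sketch of a chunk with a few common knitr options set by hand. The chunk name speed-plot and the particular values are arbitrary choices for this example; echo, fig.width, and fig.height are standard knitr options.

````markdown
```{r speed-plot, echo=FALSE, fig.width=5, fig.height=4}
# echo=FALSE hides this code in the compiled document, so only the plot is shown.
# fig.width and fig.height set the size of the figure in inches.
plot(cars)
```
````

Options are written as comma-separated name=value pairs after the chunk name, which is the same form they take when set through the dialog box.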
