Operations
The %>% operator is a very powerful mechanism for specifying data analysis pipelines, but there are some
special cases where slightly different behavior is needed.
One case is when you need to refer directly to the variables in the data frame you get from the left side of the
pipe expression. In many functions, you can get at the variables of a data frame just by naming
them, as you have seen with lm and plot, but there are cases where that is not so simple.
You can do that by indexing . like this:
d <- data.frame(x = rnorm(10), y = 4 + rnorm(10))
d %>% {data.frame(mean_x = mean(.$x), mean_y = mean(.$y))}
## mean_x mean_y
## 1 0.4167151 3.911174
But if you use the %$% operator instead of %>%, you can get at the variables just by naming them.
d %$% data.frame(mean_x = mean(x), mean_y = mean(y))
## mean_x mean_y
## 1 0.4167151 3.911174
Another common case is when you want to print or plot some intermediate result of a pipeline. You
can, of course, write the first part of the pipeline, run the data through it, store the result in a variable, print
or plot what you want, and then continue from the stored data. But you can also use the %T>% (tee) operator.
It works like the %>% operator, except that where %>% passes on the result of the right-hand side of the expression,
%T>% passes on the result of the left-hand side. The right-hand side is evaluated but its result is discarded, which is
perfect if you only want a step for its side effect, like printing some summary.
d <- data.frame(x = rnorm(10), y = rnorm(10))
d %T>% plot(y ~ x, data = .) %>% lm(y ~ x, data = .)
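The same pattern works if the side effect you want is printing rather than plotting; this is just a sketch of the idea, using print(summary(.)) as the side-effect step:
# Print a summary of the data frame as a side effect, then fit the model on the same data.
d %T>% {print(summary(.))} %>% lm(y ~ x, data = .)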
The final operator is %<>%, which does something I warned against earlier—it assigns the result of a
pipeline back to a variable on the left. Sometimes you do want this behavior—for instance if you do some
data cleaning right after loading the data and you never want to use anything between the raw and the
cleaned data, you can use %<>%.
d <- read_my_data("/path/to/data")
d %<>% clean_data
I use it sparingly, though, and usually prefer to handle this case with an ordinary pipeline and an explicit assignment, as follows:
d <- read_my_data("/path/to/data") %>% clean_data
Coding and Naming Conventions
People have been developing R code for a long time, and they haven’t been all that consistent in how they do
it. So as you use R packages, you will see many different conventions on how code is written and especially
how variables and functions are named.
How you choose to write your code is entirely up to you, as long as you are consistent about it. It helps
somewhat if your code matches the conventions of the packages you use, just to make everything easier to read, but it really is
up to you.
A few words on naming are worth going through, though. There are three ways people typically name
their variables, data, or functions, and these are:
underscore_notation(x, y)
camelBackNotation(x, y)
dot.notation(x, y)
You are probably familiar with the first two notations, but if you have used Python, Java, or C/C++
before, the dot notation may look like a method call from object-oriented programming. It is not. The dot in the
name doesn't mean a method call; R simply allows you to use dots in variable and function names.
I will mostly use the underscore notation in this book, but you can do whatever you want. I recommend
that you stay away from the dot notation, though. There are good reasons for this. R actually puts some
interpretation into what dots mean in function names: its S3 object system uses the dot to separate a generic
function's name from a class name, so a function name containing a dot can be mistaken for a method. The built-in
functions in R often use dots in their names, but it is a dangerous approach, so you should probably stay
away from it unless you are absolutely sure that you are avoiding its pitfalls.
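To see the kind of trouble the dot can cause, here is a small illustration; the class name myclass and the function body are made up for this sketch. Because R's S3 system interprets a function named generic.class as the method for that generic on that class, a function whose name just happens to contain a dot can be picked up as a method:
# This looks like an innocently named function, but R will treat it as
# the print method for objects of class "myclass".
print.myclass <- function(x, ...) cat("called print.myclass\n")
x <- structure(list(), class = "myclass")
print(x)
## called print.myclass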
Exercises
Try the following exercises to become more comfortable with the concepts discussed in this chapter.
Mean of Positive Values
You can simulate values from the normal distribution using the rnorm() function. Its first argument is
the number of samples you want, and if you do not specify other values, it will sample from the N(0,1)
distribution.
Write a pipeline that takes samples from this function as input, removes the negative values, and
computes the mean of the rest. Hint: One way to remove values is to replace them with missing values (NA); if
a vector has missing values, the mean() function can ignore them if you give it the option na.rm = TRUE.
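If you want to compare against your own solution, here is one possible sketch of such a pipeline (the sample size of 100 is an arbitrary choice for the illustration):
# Sample values, replace the negative ones with NA, then average the rest.
rnorm(100) %>% {ifelse(. < 0, NA, .)} %>% mean(na.rm = TRUE)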
Root Mean Square Error
If you have “true” values t = (t_1, …, t_n) and “predicted” values y = (y_1, …, y_n), then the root mean square error
is defined as RMSE(t, y) = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (t_i - y_i)^2}.
Write a pipeline that computes this from a data frame containing the t and y values. Remember that you
can do this by first computing the squared differences in one expression, then computing the mean of those in the
next step, and finally taking the square root of that mean. The R function for computing the square root is sqrt().
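One possible sketch, using the %$% operator from earlier in this chapter; the data frame here is just made-up example data:
# Hypothetical t (true) and y (predicted) values.
d <- data.frame(t = rnorm(10), y = rnorm(10))
# Squared differences, then their mean, then the square root of that mean.
d %$% (t - y)^2 %>% mean %>% sqrt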
The typical data analysis workflow looks like this: you collect your data and put it in a file, spreadsheet,
or database. Then you run some analyses, written in various scripts, perhaps saving some intermediate
results along the way, or maybe always working on the raw data. You create some plots or tables of relevant
summaries of the data, and then you write a report about the results in a text editor or word
processor. Most people doing data analysis follow this workflow, or variations of it, but it is also a workflow
with many potential problems.
First, there is a separation between the analysis scripts and the data; second, there is a separation between the
analysis and the documentation of the analysis.
If all analyses are done on the raw data, then issue number one is not a major problem. But it is common
to have scripts for different parts of the analysis, with one script storing intermediate results that are then
read by the next script. The scripts describe a workflow of data analysis, and to reproduce an analysis you
have to run all the scripts in the right order. Often enough, the correct order is described only in a text file or,
even worse, exists only in the head of the data scientist who wrote the workflow. And it won't stay
there for long; it is likely to be forgotten before it is needed again.
Ideally, you always want to have your analysis scripts written in a way in which you can rerun any part
of your workflow, completely automatically, at any time.
For issue number two, the problem is that even if the workflow is automated and easy to rerun,
the documentation quickly drifts away from the actual analysis scripts. If you change the scripts, you
won't necessarily remember to update the documentation. You will probably remember to update figures
and tables, but not necessarily the documentation of the exact analysis that was run: the options passed to
functions, the filtering choices, and so on. If the documentation drifts far enough from the actual analysis, it becomes
completely useless. You can trust automated scripts to represent the real data analysis at any time (that is
the benefit of having automated analysis workflows in the first place), but the documentation can easily
end up being pure fiction.
What you want is a way to have dynamic documentation: reports that describe the analysis workflow
in a form that can be understood by both machines and humans. Machines can use the report as an automated
workflow that redoes the analysis at any time, and we humans can use it as documentation that always accurately
describes the analysis we ran.