Operations
The %>% operator is a very powerful mechanism for specifying data analysis pipelines, but there are some
special cases where slightly different behavior is needed.
One case is when you need to refer directly to the variables in the data frame you get from the left side of the
pipe expression. In many functions, you can get at the variables of a data frame just by naming
them, as you have seen with lm and plot, but there are cases where that is not so simple.
You can do that by indexing . like this:
d <- data.frame(x = rnorm(10), y = 4 + rnorm(10))
d %>% {data.frame(mean_x = mean(.$x), mean_y = mean(.$y))}
## mean_x mean_y
## 1 0.4167151 3.911174
But if you use the %$% operator instead of %>%, you can get at the variables just by naming them.
d %$% data.frame(mean_x = mean(x), mean_y = mean(y))
## mean_x mean_y
## 1 0.4167151 3.911174
Another common case is when you want to print or plot some intermediate result of a pipeline. You
can, of course, write the first part of the pipeline, run the data through it, store the result in a variable, print
or plot what you want, and then continue from the stored data. But you can also use the %T>% (tee) operator.
It works like the %>% operator, except that where %>% passes on the result of the right-hand side of the expression,
%T>% passes on the result of the left-hand side. The right-hand side is evaluated but its result is discarded, which is
perfect if you only want a step for its side effect, like printing some summary.
d <- data.frame(x = rnorm(10), y = rnorm(10))
d %T>% plot(y ~ x, data = .) %>% lm(y ~ x, data = .)
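The same pattern works if the side effect you want is printing rather than plotting; this is just a sketch of the idea, using print(summary(.)) as the side-effect step:
# Print a summary of the data frame as a side effect, then fit the model on the same data.
d %T>% {print(summary(.))} %>% lm(y ~ x, data = .)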
The final operator is %<>%, which does something I warned against earlier—it assigns the result of a
pipeline back to a variable on the left. Sometimes you do want this behavior—for instance if you do some
data cleaning right after loading the data and you never want to use anything between the raw and the
cleaned data, you can use %<>%.
d <- read_my_data("/path/to/data")
d %<>% clean_data
I use it sparingly, though, and usually prefer to handle this case with an ordinary pipeline and an explicit assignment, as follows:
d <- read_my_data("/path/to/data") %>% clean_data
Coding and Naming Conventions
People have been developing R code for a long time, and they haven’t been all that consistent in how they do
it. So as you use R packages, you will see many different conventions on how code is written and especially
how variables and functions are named.
How you choose to write your code is entirely up to you, as long as you are consistent about it. It helps
somewhat if your code matches the conventions of the packages you use, just to make everything easier to read, but it really is
up to you.
A few words on naming are worth going through, though. There are three ways people typically name
their variables, data, or functions, and these are:
underscore_notation(x, y)
camelBackNotation(x, y)
dot.notation(x, y)
You are probably familiar with the first two notations, but if you have used Python, Java, or C/C++
before, the dot notation may look like a method call from object-oriented programming. It is not. The dot in the
name doesn't mean a method call; R simply allows you to use dots in variable and function names.
I will mostly use the underscore notation in this book, but you can do whatever you want. I recommend
that you stay away from the dot notation, though. There are good reasons for this. R actually puts some
interpretation into what dots mean in function names: its S3 object system uses the dot to separate a generic
function's name from a class name, so a function name containing a dot can be mistaken for a method. The built-in
functions in R often use dots in their names, but it is a dangerous approach, so you should probably stay
away from it unless you are absolutely sure that you are avoiding its pitfalls.
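To see the kind of trouble the dot can cause, here is a small illustration; the class name myclass and the function body are made up for this sketch. Because R's S3 system interprets a function named generic.class as the method for that generic on that class, a function whose name just happens to contain a dot can be picked up as a method:
# This looks like an innocently named function, but R will treat it as
# the print method for objects of class "myclass".
print.myclass <- function(x, ...) cat("called print.myclass\n")
x <- structure(list(), class = "myclass")
print(x)
## called print.myclass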
Exercises
Try the following exercises to become more comfortable with the concepts discussed in this chapter.
Mean of Positive Values
You can simulate values from the normal distribution using the rnorm() function. Its first argument is
the number of samples you want, and if you do not specify other values, it will sample from the N(0,1)
distribution.
Write a pipeline that takes samples from this function as input, removes the negative values, and
computes the mean of the rest. Hint: One way to remove values is to replace them with missing values (NA); if
a vector has missing values, the mean() function can ignore them if you give it the option na.rm = TRUE.
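If you want to compare against your own solution, here is one possible sketch of such a pipeline (the sample size of 100 is an arbitrary choice for the illustration):
# Sample values, replace the negative ones with NA, then average the rest.
rnorm(100) %>% {ifelse(. < 0, NA, .)} %>% mean(na.rm = TRUE)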
Root Mean Square Error
If you have “true” values t = (t_1, …, t_n) and “predicted” values y = (y_1, …, y_n), then the root mean square error
is defined as RMSE(t, y) = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (t_i - y_i)^2}.
Write a pipeline that computes this from a data frame containing the t and y values. Remember that you
can do this by first computing the squared differences in one expression, then computing the mean of those in the
next step, and finally taking the square root of that mean. The R function for computing the square root is sqrt().
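One possible sketch, using the %$% operator from earlier in this chapter; the data frame here is just made-up example data:
# Hypothetical t (true) and y (predicted) values.
d <- data.frame(t = rnorm(10), y = rnorm(10))
# Squared differences, then their mean, then the square root of that mean.
d %$% (t - y)^2 %>% mean %>% sqrt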
The typical data analysis workflow looks like this: you collect your data and put it in a file, spreadsheet,
or database. Then you run some analyses, written in various scripts, perhaps saving some intermediate
results along the way, or maybe always working on the raw data. You create some plots or tables of relevant
summaries of the data, and then you write a report about the results in a text editor or word
processor. Most people doing data analysis follow this workflow, or variations of it, but it is also a workflow
with many potential problems.
First, there is a separation between the analysis scripts and the data; second, there is a separation between the
analysis and the documentation of the analysis.
If all analyses are done on the raw data, then issue number one is not a major problem. But it is common
to have scripts for different parts of the analysis, with one script storing intermediate results that are then
read by the next script. The scripts describe a workflow of data analysis, and to reproduce an analysis you
have to run all the scripts in the right order. Often enough, the correct order is described only in a text file or,
even worse, exists only in the head of the data scientist who wrote the workflow. And it won't stay
there for long; it is likely to be forgotten before it is needed again.
Ideally, you always want to have your analysis scripts written in a way in which you can rerun any part
of your workflow, completely automatically, at any time.
For issue number two, the problem is that even if the workflow is automated and easy to rerun,
the documentation quickly drifts away from the actual analysis scripts. If you change the scripts, you
won't necessarily remember to update the documentation. You will probably remember to update figures
and tables, but not necessarily the documentation of the exact analysis that was run: the options passed to
functions, the filtering choices, and so on. If the documentation drifts far enough from the actual analysis, it becomes
completely useless. You can trust automated scripts to represent the real data analysis at any time (that is
the benefit of having automated analysis workflows in the first place), but the documentation can easily
end up being pure fiction.
What you want is a way to have dynamic documentation: reports that describe the analysis workflow
in a form that can be understood by both machines and humans. Machines can use the report as an automated
workflow that redoes the analysis at any time, and we humans can use it as documentation that always accurately
describes the analysis we ran.