What Is Visualizing Data in Data Science?

Visualizing Data

Nothing tells a story about your data as powerfully as good plots. Graphics capture your data much better than summary statistics and often show you features that you would not be able to glean from summaries alone.

R has very powerful tools for visualizing data. Unfortunately, it also has more tools than you’ll really know what to do with. There are several different frameworks for visualizing data, and they are usually not particularly compatible, so you cannot easily combine the various approaches.

In this chapter, we look at graphics in R. We cannot possibly cover all the plotting functionality, so I will focus on a few frameworks. First, the basic graphics framework. It is not something I frequently use or recommend that you use, but it is the default for many packages, so you need to know about it. Second, we discuss the ggplot2 framework, which is my preferred approach to visualizing data. It defines a small domain-specific language for constructing plots and is perfect for exploring data as long as you have it in a data frame (and, with a little more work, for creating publication-ready plots).

Basic Graphics

The basic plotting system is implemented in the graphics package. You usually do not have to load the package yourself:

    library(graphics)

It is already loaded when you start up R. But you can use this to get a list of the functions implemented in the package:

    library(help = "graphics")

This list isn’t exhaustive, though, since the main plotting function, plot(), is generic and many packages write extensions to it to specialize plots.

In any case, you create basic plots using the plot() function. This function is a so-called generic function, which means that what it does depends on the input it gets. So you can give it different first arguments to get plots of various objects.

The simplest plot you can make is a scatterplot, which plots points for x and y values, as shown in Figure 4-1.

    x <- rnorm(50)
    y <- rnorm(50)
    plot(x, y)

The plot() function takes a data argument you can use to plot data from a data frame, but you cannot write code like this to plot the cars data from the datasets package:

    data(cars)
    cars %>% plot(speed, dist, data = .)

Despite giving plot() the data frame, it will not recognize the variables for the x and y parameters, so adding plots to pipelines requires that you use the %$% operator (from the magrittr package) to give plot() access to the variables in a data frame.
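To see why passing bare column names to plot() fails, here is a minimal sketch in base R. It assumes a fresh session in which no variable named speed exists; the name result and the use of tryCatch() and with() are illustrative choices, not something taken from the text above.

```r
data(cars)

# Passing the column names directly fails: plot() looks up `speed`
# as an ordinary variable, and the data argument is not consulted
# for plain x and y values.
result <- tryCatch(
  plot(speed, dist, data = cars),
  error = function(e) conditionMessage(e)
)
print(result)

# with() evaluates the call with the columns of cars in scope,
# so speed and dist are found.
with(cars, plot(speed, dist))
```

The with() call is the base R way of exposing the columns of a data frame to an expression. The %$% operator gives roughly the same effect inside a pipeline.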
So, for instance, we can plot the cars data like this:

    cars %$% plot(speed, dist, main = "Cars data",
                  xlab = "Speed", ylab = "Stopping distance")

Figure 4-2 shows the result; main sets the title, and xlab and ylab specify the axis labels.

The data argument of plot() is used only when the variables of the plot are specified as a formula. If the x and y values are given in a formula, you can pass the function a data frame that holds the variables and plot from that, as follows:

    cars %>% plot(dist ~ speed, data = .)

What is meant by plot() being a generic function (something we cover in much greater detail in Chapter 10) is that it will have different functionality depending on the parameters you give it. Different kinds of objects can have their own plotting functionality, and they often do. This is why you will probably use basic graphics from time to time even if you follow my advice and use ggplot2 for your own plotting.

Linear regression, for example, created with the lm() function, has its own plotting routine. Try evaluating the following expression:

    cars %>% lm(dist ~ speed, data = .) %>% plot

It will give you several summary plots for visualizing the quality of the linear fit. Many model-fitting algorithms return a fitted object that has specialized plotting functionality like this, so when you have fitted a model, you can always try to call plot() on it and see if you get something useful out of that.

Functions like plot(), hist(), and a few more create new plots, but there is also a large number of functions for annotating a plot. Functions such as lines() or points() add lines and points, respectively, to the current plot rather than making a new plot.
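As a small illustration of annotating an existing plot with lines() and points(), consider the following sketch. The lowess() smoother and the idea of marking the mean of the data are my own illustrative choices, and the names smooth, mean_speed, and mean_dist are hypothetical.

```r
data(cars)

# plot() starts a new plot ...
plot(dist ~ speed, data = cars, main = "Cars data")

# ... and lines() draws onto that existing plot. Here the line is a
# lowess smooth of the data, used purely as an example of a curve.
smooth <- lowess(cars$speed, cars$dist)
lines(smooth, col = "blue")

# points() likewise adds to the current plot; this marks the mean
# speed and mean stopping distance with a filled red dot.
mean_speed <- mean(cars$speed)
mean_dist <- mean(cars$dist)
points(mean_speed, mean_dist, col = "red", pch = 19)
```

Calling lines() or points() before any plot has been created gives an error, since there is no current plot to add to.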
Like plot(), the other plotting functions are usually generic. This means you can sometimes give them objects such as fitted models. The abline() function is one such case. It plots lines of the form y = a + bx, but it also accepts a fitted linear model as input and plots the best-fitting line defined by the model. So you can plot the cars data together with the best-fitting line using a combination of the lm() and abline() functions (see Figure 4-7).

    cars %>% plot(dist ~ speed, data = .)
    cars %>% lm(dist ~ speed, data = .) %>% abline(col = "red")
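To connect the two ways of calling abline(), the sketch below draws the fitted line once from the model object and once from its intercept and slope, written without pipes so it runs in base R. The names fit and cf are hypothetical, and the dashed line style is only there to make the second line visible on top of the first.

```r
data(cars)

# Fit the linear model and draw the scatterplot.
fit <- lm(dist ~ speed, data = cars)
plot(dist ~ speed, data = cars)

# Give abline() the fitted model directly.
abline(fit, col = "red")

# Draw the same line from the explicit intercept (a) and slope (b),
# taken from the model coefficients.
cf <- coef(fit)
abline(a = cf[["(Intercept)"]], b = cf[["speed"]], lty = 2)
```

The two lines coincide, which shows that handing abline() a model amounts to reading a and b from its coefficients.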
