Data Science Part II (R Programming)

We will use R for our data analysis, so we need to know the basics of programming in the R language. R is a full programming language with both functional and object-oriented programming features. Learning the language is far beyond the scope of this chapter and is something we return to later. The good news, though, is that to use R for data analysis, you rarely need to do much programming. At least, if you do the right kind of programming, you won’t need much.

For manipulating data (how to do this is the topic of the next chapter) you mainly just have to string together a couple of operations, such as “group the data by this feature” followed by “calculate the mean value of these features within each group” and then “plot these means”. This used to be much more complicated to do in R, but a couple of new ideas on how to structure such data flow, and some clever implementations of these in packages such as magrittr and dplyr, have significantly simplified it. We will see some of this at the end of this chapter and more in the next chapter. First, though, you need to get a taste for R.

Basic Interaction with R

Start by downloading RStudio if you haven’t done so already (https://www.rstudio.com/products/RStudio). If you open it, you should see a window similar to Figure 1-1. Well, except that you will be in an empty project while the figure shows (on the top right) that this RStudio is opened in a project called “Data Science”.

You always want to be working on a project. Projects keep track of the state of your analysis by remembering variables and functions you have written, and they keep track of which files you have opened and such. Choose File ➤ New Project to create a project. You can create a project from an existing directory, but if this is the first time you are working with R, you probably just want to create an empty project in a new directory, so do that.
Once you have opened RStudio, you can type R expressions into the console, which is the frame on the left of the RStudio window. When you write an expression there, R will read it, evaluate it, and print the result. When you assign values to variables, and you will see how to do this shortly, they will appear in the Environment frame on the top right. At the bottom right, you have the directory where the project lives, and files you create will go there.

To create a new file, choose File ➤ New File. You can select several different file types. We are interested in the R Script and R Markdown types. The former is the file type for pure R code, while the latter is used for creating reports where documentation text is mixed with R code. For data analysis projects, I recommend using Markdown files. Writing documentation for what you are doing is really helpful when you need to go back to a project several months down the line.

For most of this chapter, you can just write R code in the console, or you can create an R Script file. If you create an R Script file, it will show up on the top left, as shown in Figure 1-2. You can evaluate single expressions using the Run button on the top right of this frame, or evaluate the entire file using the Source button. For longer expressions, you might want to write them in an R Script file for now. In the next chapter, we talk about R Markdown, which is the better solution for data science projects.

Arithmetic in R works pretty much as you are used to. Except, perhaps, that you might be used to integers behaving as integers in a division. In some programming languages, division between integers is integer division, but in R, you can divide integers and, if there is a remainder, you will get a floating-point number back as the result.

    4 / 3
    ## [1] 1.333333

When you write numbers like 4 and 3, they are interpreted as floating-point numbers. To explicitly get an integer, you must write 4L and 3L.
    class(4)
    ## [1] "numeric"
    class(4L)
    ## [1] "integer"

You will still get a floating-point number if you divide two integers, so there is no need to tell R explicitly that you want floating-point division. If you want integer division, on the other hand, you need a different operator, %/%:

    4 %/% 3
    ## [1] 1

In many languages, % is used to get the remainder of a division, but this doesn’t quite work with R, where % is used to construct infix operators. So in R, the operator for this is %%:

    4 %% 3
    ## [1] 1

In addition to the basic arithmetic operators (addition, subtraction, multiplication, division, and the modulus operator you just saw), you also have an exponentiation operator for taking powers. For this, you can use ^ or ** as infix operators:

    2^2
    ## [1] 4
    2^3
    ## [1] 8
    2**2
    ## [1] 4
    2**3
    ## [1] 8

There are some other data types besides numbers, but we won’t go into an exhaustive list here. There are two types you do need to know about early, though, since they are frequently used and since not knowing how they work can lead to all kinds of grief. Those are strings and “factors”.

Strings work as you would expect. You write them in quotes, either double quotes or single quotes, and that is about it.

    "hello,"
    ## [1] "hello,"
    'world!'
    ## [1] "world!"

Strings are not particularly tricky, but I mention them because they look a lot like factors. Factors are not like strings; they just look sufficiently like them to cause some confusion. I explain factors a little later in this chapter, when you have seen how functions and vectors work.
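As a small recap of the operators above, the following sketch checks how the result types behave and how integer division and the remainder fit together. The variable names (quotient, whole, rest) are made up for illustration, and the assignment arrow it uses is the one covered in the next section.

```r
# Dividing two integer literals still gives a floating-point result.
quotient <- 7L / 2L
class(quotient)

# Integer division and remainder stay integers when both operands are integers.
whole <- 7L %/% 2L
rest <- 7L %% 2L
class(whole)

# Together, the two results reconstruct the original dividend.
whole * 2L + rest == 7L

# ^ and ** are two spellings of the same exponentiation operator.
2^3 == 2**3

# Single and double quotes produce the same string.
identical("hello", 'hello')
```

Typing these lines into the console one at a time is a quick way to see which operations keep the integer type and which fall back to floating-point numbers.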
Assignments

To assign a value to a variable, you use the arrow operators. So to assign the value 2 to the variable x, you would write the following:

    x <- 2

You can test that x now holds the value 2 by evaluating x.

    x
    ## [1] 2

And of course, you can now use x in expressions:

    2 * x
    ## [1] 4

You can assign with arrows in both directions, so you could also write the following:

    2 -> x

An assignment won’t print anything if you write it into the R console, but you can get R to print the assigned value just by putting the assignment in parentheses.

    x <- "invisible"
    (y <- "visible")
    ## [1] "visible"

Actually, All of the Above Are Vectors of Values…

If you were wondering why all the values printed above had a [1] in front of them, I am going to explain that right now. It is because we are usually not working with single values anywhere in R. We are working with vectors of values (and you will hear more about vectors in the next section). The vectors we have seen have length one (they consist of a single value), so there is nothing wrong with thinking about them as individual values. But they really are vectors.

The [1] does not indicate that we are looking at a vector of length one, though. The [1] tells you that the first value after [1] is the first value in the vector. With longer vectors, you get the index each time R moves to the next line of output. This is just done to make it easier to count your way into a particular index.

You will see this if you make a longer vector. For example, you can make one of length 50 using the : operator:

    1:50
    ##  [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
    ## [16] 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30
    ## [31] 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45
    ## [46] 46 47 48 49 50
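To convince yourself that single values really are vectors, you can ask R for their length and index into them. This is only a minimal sketch: the variable name v is chosen for illustration, and the square-bracket indexing it uses is explained properly later.

```r
# A single number is a vector of length one.
x <- 2
length(x)

# Because it is a vector, it can be indexed like any other vector.
x[1]

# A longer vector built with the : operator.
v <- 1:50
length(v)

# The element that the [16] marker points at in the printed output above.
v[16]
```

The index in square brackets matches the markers R prints at the start of each output line, so v[16] picks out the first value on the line labeled [16].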

Comments