We will use R for our data analysis, so we need to know the basics of programming in the R language. R is a
full programming language with both functional programming and object-oriented programming features.
Learning the language is far beyond the scope of this chapter and is something we return to later. The good
news, though, is that to use R for data analysis, you rarely need to do much programming. At least, if you do
the right kind of programming, you won’t need much.
For manipulating data (the topic of the next chapter), you mainly just have to
string together a couple of operations, such as “group the data by this feature” followed by
“calculate the mean value of these features within each group” and then “plot these means”. This used to be
much more complicated to do in R, but a couple of new ideas on how to structure such data flows, and some
clever implementations of these in packages such as magrittr and dplyr, have significantly
simplified it. We will see some of this at the end of this chapter and more in the next chapter. First, though,
you need to get a taste for R.
Basic Interaction with R
Start by downloading RStudio if you haven’t done so already (https://www.rstudio.com/products/RStudio).
When you open it, you should see a window similar to Figure 1-1, except that you will be in an
empty project while the figure shows (on the top right) that this RStudio is opened in a project called “Data
Science”. You always want to be working in a project. Projects keep track of the state of your analysis by
remembering the variables and functions you have written, and they keep track of which files you have opened.
Choose File ➤ New Project to create a project. You can create a project from an existing directory, but
if this is the first time you are working with R, you probably just want to create an empty project in a new
directory, so do that.
Once you have opened RStudio, you can type R expressions into the console, which is the frame on
the left of the RStudio window. When you write an expression there, R will read it, evaluate it, and print the
result. When you assign values to variables, and you will see how to do this shortly, they will appear in the
Environment frame on the top right. At the bottom right, you have the directory where the project lives, and
files you create will go there.
To create a new file, choose File ➤ New File. You can select several different file types. We are interested
in the R Script and R Markdown types. The former is the file type for pure R code, while the latter is used for
creating reports where documentation text is mixed with R code. For data analysis projects, I recommend
using Markdown files. Writing documentation for what you are doing is really helpful when you need to go
back to a project several months down the line.
For most of this chapter, you can just write R code in the console, or you can create an R Script file. If
you create an R Script file, it will show up on the top left, as shown in Figure 1-2. You can evaluate single
expressions using the Run button on the top-right of this frame, or evaluate the entire file using the Source
button. For longer expressions, you might want to write them in an R Script file for now. In the next chapter,
we talk about R Markdown, which is the better solution for data science projects.
Arithmetic in R works pretty much as you are used to. Except, perhaps, that you might be used to integers
behaving as integers in a division. In some programming languages, division between integers is
integer division, but in R, you can divide integers, and if there is a remainder you will get a floating-point
number back as the result.
4 / 3
## [1] 1.333333
When you write numbers like 4 and 3, they are interpreted as floating-point numbers. To explicitly get
an integer, you must write 4L and 3L.
class(4)
## [1] "numeric"
class(4L)
## [1] "integer"
You will still get a floating-point number if you divide two integers, so there is no need to tell R explicitly
that you want floating-point division. If you want integer division, on the other hand, you need a different
operator, %/%:
4 %/% 3
## [1] 1
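To see how the two division operators treat types, here is a small sketch you can paste into the console. The variable names are placeholders chosen for this illustration, and the comments note what class() should report for each result.

```r
# Ordinary division yields a floating-point number, even for integer operands.
plain_division <- 4L / 3L
class(plain_division)   # "numeric"

# Integer division keeps the integer type when both operands are integers...
int_division <- 4L %/% 3L
class(int_division)     # "integer"

# ...but gives a floating-point result when the operands are floating-point.
dbl_division <- 4 %/% 3
class(dbl_division)     # "numeric"
```

So %/% controls how the quotient is rounded, not the type of the result. The type follows from the operands.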
In many languages % is used to get the remainder of a division, but this doesn’t quite work with R, where
% is used to construct infix operators. So in R, the operator for this is %%:
4 %% 3
## [1] 1
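The two operators are designed to fit together, and the %name% syntax is open for you to use as well. The following is a minimal sketch of both points. The values of a and b are arbitrary, and %+% is a made-up operator name used only for this illustration, not something the rest of the chapter relies on.

```r
# Arbitrary example values; a is negative to show how R rounds.
a <- -7
b <- 2

quotient  <- a %/% b   # integer division rounds toward minus infinity: -4
remainder <- a %% b    # the remainder takes the sign of the divisor: 1

# The two results always recombine into the original value.
identity_holds <- quotient * b + remainder == a
identity_holds         # TRUE

# Because %name% is the general syntax for infix operators, you can define
# your own. %+% is a hypothetical name chosen for this example.
`%+%` <- function(lhs, rhs) paste(lhs, rhs)
greeting <- "hello," %+% "world!"
greeting               # "hello, world!"
```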
In addition to the basic arithmetic operators—addition, subtraction, multiplication, division, and the
modulus operator you just saw—you also have an exponentiation operator for taking powers. For this, you
can use ^ or ** as infix operators:
2^2
## [1] 4
2^3
## [1] 8
2**2
## [1] 4
2**3
## [1] 8
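One thing worth checking for yourself is how exponentiation combines with other operators, since it binds tighter than a leading minus sign and groups from right to left. A small sketch, with variable names invented for the example:

```r
# The power is taken before the minus sign is applied.
negated_square <- -2^2       # read as -(2^2), which is -4

# Parentheses are needed to square a negative number.
squared_negative <- (-2)^2   # 4

# Chained powers group from the right.
power_tower <- 2^3^2         # read as 2^(3^2) = 2^9, which is 512

# ^ and ** are interchangeable.
same_operator <- identical(2^3, 2**3)   # TRUE
```

When in doubt, add parentheses; they cost nothing and make the intent explicit.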
There are some other data types besides numbers, but we won’t go into an exhaustive list here. There
are two types you do need to know about early, though, since they are frequently used and since not knowing
about how they work can lead to all kinds of grief. Those are strings and “factors”.
Strings work as you would expect. You write them in quotes, either double quotes or single quotes, and
that is about it.
"hello,"
## [1] "hello,"
'world!'
## [1] "world!"
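If you want to convince yourself that the choice of quote character does not matter, you can try something like the following sketch. The variable names and the example sentence are made up for the illustration.

```r
# The same text written with each kind of quote gives the same string.
double_quoted <- "hello"
single_quoted <- 'hello'
same_string <- identical(double_quoted, single_quoted)
same_string              # TRUE

# The main practical use of having both: one kind can contain the other.
quote_inside <- 'She said "hi"'
class(quote_inside)      # "character"
nchar(quote_inside)      # 13 characters, counting the inner quotes
```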
Strings are not particularly tricky, but I mention them because they look a lot like factors. Factors are
not strings, though; they just look sufficiently like them to cause some confusion. I explain factors a little later in
this chapter, when you have seen how functions and vectors work.
Assignments
To assign a value to a variable, you use the arrow operators. So to assign the value 2 to the variable x, you
would write the following:
x <- 2
You can test that x now holds the value 2 by evaluating x.
x
## [1] 2
And of course, you can now use x in expressions:
2 * x
## [1] 4
You can assign with arrows in both directions, so you could also write the following:
2 -> x
An assignment won’t print anything if you write it into the R console, but you can get R to print the assigned value just
by putting the assignment in parentheses.
x <- "invisible"
(y <- "visible")
## [1] "visible"
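If you are curious about what is going on here, base R has a function, withVisible, that reports whether the result of an expression would be printed. The sketch below uses it on the two assignments; the names hidden and shown are just placeholders for this example.

```r
# withVisible() returns the value of an expression together with a flag
# saying whether R would print it at the console.
hidden <- withVisible(x <- "invisible")
hidden$visible   # FALSE: a bare assignment returns its value invisibly

shown <- withVisible((y <- "visible"))
shown$visible    # TRUE: the parentheses make the value visible again

# Either way, the assignment itself has happened.
x                # "invisible"
y                # "visible"
```

In other words, an assignment does return a value; R just marks it as not worth printing unless you ask for it.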
Actually, All of the Above Are Vectors of Values…
If you were wondering why all the values printed above had a [1] in front of them, I am going to explain
that right now. It is because we are usually not working with single values anywhere in R. We are working
with vectors of values (and you will hear more about vectors in the next section). The vectors we have seen
have length one (they consist of a single value), so there is nothing wrong with thinking of them as
individual values. But they really are vectors.
The [1] does not indicate that we are looking at a vector of length one, though. The [1] tells you that the
first value after [1] is the first value in the vector. With longer vectors, you get the index each time R moves to
the next line of output. This is just done to make it easier to count your way into a particular index.
You will see this if you make a longer vector. For example, you can make one of length 50 using the :
operator:
1:50
##  [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
## [16] 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30
## [31] 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45
## [46] 46 47 48 49 50
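You can check both claims directly: that a single number is a vector of length one, and that the bracketed prefix is an index into the vector. A short sketch, with variable names chosen only for this example:

```r
# A single value is a vector of length one.
single <- 2
length(single)      # 1
is.vector(single)   # TRUE

# The : operator builds the longer vector.
numbers <- 1:50
length(numbers)     # 50

# The [16] at the start of the second output line means that the value
# following it is element number 16.
numbers[16]         # 16
```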