Data Science part III (Data Manipulation)

Chapter 3 ■ Data Manipulation

To download the data, you could go to the URL and save the file. Explicitly downloading data outside of the R code has pros and cons. It is simple, and you can look at the data before you start parsing it, but it adds a step to the analysis workflow that is not automatically reproducible. Even if the URL is described in the documentation and uses a link that doesn’t change over time, it is a manual step in the workflow, and one that people could make mistakes in.
Instead, I am going to read the data directly from the URL. Of course, this is also a risky step in a workflow because I am not in control of the server the data is on, and I cannot guarantee that the data will always be there and that it won’t change over time. It is a bit of a risk either way. I will usually add the code to my workflow for downloading the data, but I will also store the data in a file. If I leave the code for downloading the data and saving it to my local disk in a cached Markdown chunk, it will only be run the one time I need it.

I can read the data and get it as a vector of lines using the readLines() function. I can always use that to scan the first few lines to see what the file looks like.

    lines <- readLines(data_url)
    lines[1:5]
    ## [1] "1000025,5,1,1,1,2,1,3,1,1,2"
    ## [2] "1002945,5,4,4,5,7,10,3,2,1,2"
    ## [3] "1015425,3,1,1,1,2,2,3,1,1,2"
    ## [4] "1016277,6,8,8,1,3,4,3,7,1,2"
    ## [5] "1017023,4,1,1,3,2,1,3,1,1,2"

This data seems to be a comma-separated values file without a header line, so I save it with the .csv suffix. None of the functions for writing or reading data in R care about the suffix, but it makes it easier for me to remember what the file contains.

    writeLines(lines, con = "data/raw-breast-cancer.csv")

For that function to succeed, I first need to make a data/ directory. I suggest you always have a data/ directory in your projects, since you want your directories and files structured when you are working on a project.

The file I just wrote to disk can then be read in using the read.csv() function.

    raw_breast_cancer <- read.csv("data/raw-breast-cancer.csv")
    raw_breast_cancer %>% head(3)
    ##   X1000025 X5 X1 X1.1 X1.2 X2 X1.3 X3 X1.4 X1.5
    ## 1  1002945  5  4    4    5  7   10  3    2    1
    ## 2  1015425  3  1    1    1  2    2  3    1    1
    ## 3  1016277  6  8    8    1  3    4  3    7    1
    ##   X2.1
    ## 1    2
    ## 2    2
    ## 3    2

Of course, I wouldn’t write exactly these steps into a workflow.
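The download-once idea described above can also be sketched in plain R, without relying on chunk caching. This is only a minimal sketch under stated assumptions, not the author's code: read_lines_cached() is a hypothetical helper introduced here for illustration, and a temporary file stands in for data_url so that the example runs without network access.

```r
# Hypothetical helper (not from the text): read the lines at `src` once and
# cache them in `path`. `src` may be a URL or a local file path, since
# readLines() accepts both.
read_lines_cached <- function(src, path) {
  if (!file.exists(path)) {
    # Create the target directory (for example data/) if it is missing.
    dir.create(dirname(path), recursive = TRUE, showWarnings = FALSE)
    writeLines(readLines(src), con = path)
  }
  # Always return the cached copy, so later runs do not touch `src`.
  readLines(path)
}

# Offline demonstration: a temporary file plays the role of the remote
# source and holds the first two lines shown in the text.
demo_source <- tempfile(fileext = ".data")
writeLines(c("1000025,5,1,1,1,2,1,3,1,1,2",
             "1002945,5,4,4,5,7,10,3,2,1,2"),
           con = demo_source)

# A fresh, unique cache location under the session's temporary directory.
cache_path <- file.path(tempfile("project-"), "data", "raw-breast-cancer.csv")
lines <- read_lines_cached(demo_source, cache_path)
lines[1:2]
```

In a real project the call would presumably look like read_lines_cached(data_url, "data/raw-breast-cancer.csv"). The cached file then doubles as a record of exactly what was downloaded, which helps if the data on the server later changes.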
Once I have discovered that the data at the end of the URL is a .csv file, I would just read it directly from the URL.

    raw_breast_cancer <- read.csv(data_url)
    raw_breast_cancer %>% head(3)
    ##   X1000025 X5 X1 X1.1 X1.2 X2 X1.3 X3 X1.4 X1.5
    ## 1  1002945  5  4    4    5  7   10  3    2    1
    ## 2  1015425  3  1    1    1  2    2  3    1    1
    ## 3  1016277  6  8    8    1  3    4  3    7    1
    ##   X2.1
    ## 1    2
    ## 2    2
    ## 3    2

The good news is that this data looks similar to the BreastCancer data. The bad news is that what is the first row in BreastCancer appears to have been turned into column names in raw_breast_cancer. The read.csv() function interpreted the first line as a header. We can fix this using the header parameter.

    raw_breast_cancer <- read.csv(data_url, header = FALSE)
    raw_breast_cancer %>% head(3)
    ##        V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11
    ## 1 1000025  5  1  1  1  2  1  3  1   1   2
    ## 2 1002945  5  4  4  5  7 10  3  2   1   2
    ## 3 1015425  3  1  1  1  2  2  3  1   1   2

Now the first line is no longer interpreted as header names. That is good, but the names you get instead say little about what the columns contain. If you read the description of the data on the web site, you can see what each column is and choose appropriate names. I am going to cheat here and just take the names from the BreastCancer dataset.
I can set the names explicitly like this:

    names(raw_breast_cancer) <- names(BreastCancer)
    raw_breast_cancer %>% head(3)
    ##        Id Cl.thickness Cell.size Cell.shape
    ## 1 1000025            5         1          1
    ## 2 1002945            5         4          4
    ## 3 1015425            3         1          1
    ##   Marg.adhesion Epith.c.size Bare.nuclei
    ## 1             1            2           1
    ## 2             5            7          10
    ## 3             1            2           2
    ##   Bl.cromatin Normal.nucleoli Mitoses Class
    ## 1           3               1       1     2
    ## 2           3               2       1     2
    ## 3           3               1       1     2
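As an alternative to renaming after the fact, the header and the column names can be handled in the same call that reads the data. The following is a minimal sketch, assuming the eleven names shown in the output above. The names are written out by hand so the example does not depend on the package that provides BreastCancer, and the text argument of read.csv() stands in for data_url so that it runs offline on the first three lines of the file.

```r
# Column names copied from the head() output above (the names of BreastCancer).
column_names <- c("Id", "Cl.thickness", "Cell.size", "Cell.shape",
                  "Marg.adhesion", "Epith.c.size", "Bare.nuclei",
                  "Bl.cromatin", "Normal.nucleoli", "Mitoses", "Class")

# The first three lines of the file, as printed by readLines() in the text.
sample_lines <- c("1000025,5,1,1,1,2,1,3,1,1,2",
                  "1002945,5,4,4,5,7,10,3,2,1,2",
                  "1015425,3,1,1,1,2,2,3,1,1,2")

# header = FALSE keeps the first line as data; col.names replaces the
# default V1, V2, ... names in the same step.
raw_breast_cancer <- read.csv(text = sample_lines,
                              header = FALSE,
                              col.names = column_names)
head(raw_breast_cancer, 3)
```

With the real data, the text argument would be replaced by data_url or by the path of a locally saved copy. Setting the names in the read call keeps the data frame from ever existing with uninformative names, although assigning to names() afterward, as above, gives the same result.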
