What Is Data Science?
Oh boy! That is a difficult question. I don’t know if it is easy to find someone who is entirely sure what data
science is, but I am pretty sure that it would be difficult to find two people with fewer than three opinions
about it. It is certainly a popular buzzword, and everyone wants to have data scientists these days, so data
science skills are useful to have on the CV. But what is it?
Since I can’t really give you an agreed-upon definition, I will just give you my own: Data science is the
science of learning from data.
This is a very broad definition—almost too broad to be useful. I realize this. But then, I think data
science is an incredibly general field. I don’t have a problem with that. Of course, you could argue that any
science is all about getting information out of data, and you might be right, although I would say that there
is more to science than just transforming raw data into useful information. The sciences focus on
answering specific questions about the world, while data science focuses on how to manipulate data
efficiently and effectively. The primary focus is not which questions to ask of the data but how we can
answer them, whatever they may be. In this way, it is more like computer science and mathematics than it is
like the natural sciences. It isn’t so much about studying the natural world as it is about how to compute
with data efficiently.
Included in data science is the design of experiments. With the right data, we can address the questions
we are interested in. With a poor design of experiments or a poor choice of which data we gather, this can be
difficult. Study design might be the most important aspect of data science, but it is not the topic of this book. In
this book I focus on the analysis of data once it has been gathered.
Computer science is also mainly the study of computations, as the name hints, but it is a bit
broader in this focus. Datalogy, an earlier name for data science, was also suggested as a name for computer
science (in Denmark, for example, it is the name for computer science). Still, using the name “computer
science” puts the focus on computation, while using the name “data science” puts the focus on data. But of
course, the fields overlap. If you are writing a sorting algorithm, are you then focusing on the computation or
the data? Is that even a meaningful question to ask?
There is a huge overlap between computer science and data science, and naturally the skillsets you need
overlap as well. To manipulate data efficiently you need the tools for doing so, so computer programming
skills are a must, and some knowledge of algorithms and data structures usually is as well. For data
science, though, the focus is always on the data. In a data analysis project, the focus is on how the data flows
from its raw form through various manipulations until it is summarized in some useful form. Although the
difference can be subtle, the focus is not on what operations a program performs during the analysis but
on how the data flows and is transformed. It is also focused on why we do certain transformations of the
data.
Prerequisites for Reading this Book
In the first seven chapters of this book, the focus is on data analysis and not programming. For those
seven chapters, I do not assume a detailed familiarity with topics such as software design, algorithms, or data
structures. I do not expect you to have any experience with the R programming language either.
I do, however, expect that you have had some experience with programming, mathematical modeling, and
statistics.
Programming in R can be quite tricky at times if you are used to scripting languages or object-oriented
languages. R is a functional language in which data is generally not modified in place (changing an object
typically creates a modified copy instead), and while it does have systems for object-oriented programming,
it handles this programming paradigm very differently from languages you are likely to have seen, such as
Java or Python.
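As a small illustration of what it means that R does not modify data behind your back, here is a minimal sketch. It assumes a plain R session with no extra packages, and the names `original`, `alias`, and `set_first` are made up for this example.

```r
# Illustrative names only; nothing here comes from a package.
original <- c(1, 2, 3)

# Assigning to a new name and then changing an element
# leaves the first vector untouched.
alias <- original
alias[1] <- 100

# A function that "modifies" its argument only changes a local copy.
set_first <- function(v, value) {
  v[1] <- value
  v  # return the modified copy
}

updated <- set_first(original, 42)

# `original` still holds 1, 2, 3; `alias` and `updated` hold the changes.
print(original)
print(alias)
print(updated)
```

If you come from Java or Python, where a function can change a list passed to it, the thing to notice is that the caller has to capture the returned value to see any change.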
For the data analysis part of this book, the first seven chapters, we will only use R for very
straightforward programming tasks, so none of this should pose a problem. We will have to write simple
scripts for manipulating and summarizing data so you should be familiar with how to write basic
expressions like function calls, if statements, loops, and so on. These are things you will have to be comfortable
with. I will introduce each such construction in the book when we need it, so you will see how it is
expressed in R, but I will not spend much time explaining it. I will mostly just expect you to be able to
pick it up from examples.
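To give a sense of the level intended, the following sketch shows a function call, an if statement, and a loop in R. The data values and variable names are invented for illustration and are not taken from the book.

```r
# Made-up measurements for the sake of the example.
measurements <- c(2.5, 3.1, 4.8, 1.9)

# A function call: compute the mean of the values.
average <- mean(measurements)

# An if statement: choose a label based on the mean.
if (average > 3) {
  label <- "high"
} else {
  label <- "low"
}

# A for loop: count how many values lie above the mean.
above_average <- 0
for (m in measurements) {
  if (m > average) {
    above_average <- above_average + 1
  }
}

print(label)
print(above_average)
```

In practice the loop would often be written as `sum(measurements > average)`, but the explicit form is closer to what you may know from other languages.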
Similarly, I do not expect you to know already how to fit data and compare models in R. I do expect that
you have had enough introduction to statistics to be comfortable with basic terms like parameter estimation,
model fitting, explanatory and response variables, and model comparison. If not, I expect you to be at least
able to pick up what we are talking about when you need to.
I won’t expect you to know a lot about statistics and programming, but this isn’t Data Science for
Dummies, so I do expect you to be able to figure out examples without me explaining everything in detail.
After the first seven chapters is a short description of a data analysis project that one of my students did
in an earlier class. It shows how such a project could look, but I suggest that you do not wait until you have
finished the first seven chapters to start doing such analysis yourself. To get the most benefit out of reading
this book, you should be applying what you learn continuously. As soon as you begin reading, I suggest
that you find a dataset that you would be interested in finding out more about and then apply what you learn
in each chapter to that data.