What Is Data Science? A Full Explanation

What is data science? Oh boy! That is a difficult question. I don’t know if it is easy to find someone who is entirely sure what data science is, but I am pretty sure that it would be difficult to find two people with fewer than three opinions about it. It is certainly a popular buzzword, and everyone wants to have data scientists these days, so data science skills are useful to have on the CV. But what is it? Since I can’t really give you an agreed-upon definition, I will just give you my own: data science is the science of learning from data.
This is a very broad definition, almost too broad to be useful. I realize this. But then, I think data science is an incredibly general field, and I don’t have a problem with that. Of course, you could argue that any science is all about getting information out of data, and you might be right, although I would say that there is more to science than just transforming raw data into useful information. The sciences focus on answering specific questions about the world, while data science focuses on how to manipulate data efficiently and effectively. The primary focus is not which questions to ask of the data but how we can answer them, whatever they may be. In this way, it is more like computer science and mathematics than it is like the natural sciences. It isn’t so much about studying the natural world as it is about how to compute on data efficiently.

Included in data science is the design of experiments. With the right data, we can address the questions we are interested in. With a poor design of experiments or a poor choice of which data we gather, this can be difficult. Study design might be the most important aspect of data science, but it is not the topic of this book. In this book I focus on the analysis of data, once gathered.

Computer science is also mainly the study of computations, as is hinted at in the name, but is a bit broader in this focus. Datalogy, an earlier name for data science, was also suggested as a name for computer science, and in Denmark, for example, it is the name used for computer science. Still, using the name “computer science” puts the focus on computation, while using the name “data science” puts the focus on data. But of course, the fields overlap. If you are writing a sorting algorithm, are you then focusing on the computation or the data? Is that even a meaningful question to ask? There is a huge overlap between computer science and data science, and naturally the skill sets you need overlap as well.
To efficiently manipulate data you need the tools for doing that, so computer programming skills are a must, and some knowledge about algorithms and data structures usually is as well. For data science, though, the focus is always on the data. In a data analysis project, the focus is on how the data flows from its raw form through various manipulations until it is summarized in some useful form. Although the difference can be subtle, the focus is not on what operations a program does during the analysis, but on how the data flows and is transformed. It is also focused on why we do certain transformations of the data.

Prerequisites for Reading this Book

In the first seven chapters of this book, the focus is on data analysis and not programming. For those seven chapters, I do not assume a detailed familiarity with topics such as software design, algorithms, data structures, and such. I do not expect you to have any experience with the R programming language either. I do, however, expect that you have had some experience with programming, mathematical modeling, and statistics.

Programming R can be quite tricky at times if you are used to a scripting language or an object-oriented language. R is a functional language that, as a rule, does not let you modify data in place, and while it does have systems for object-oriented programming, it handles this paradigm very differently from languages you are likely to have seen, such as Java or Python. For the data analysis part of this book, the first seven chapters, we will only use R for very straightforward programming tasks, so none of this should pose a problem. We will have to write simple scripts for manipulating and summarizing data, so you should be familiar with how to write basic expressions like function calls, if statements, loops, and so on. These things you will have to be comfortable with.
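To give a feel for the level assumed here, the following is a minimal sketch of a function call, an if statement, and a loop in R, plus a small demonstration of how a function works on a copy of its argument. The helper names (`describe_size`, `bump_first`) and the thresholds are hypothetical, chosen only for illustration.

```r
# A function definition using if / else if / else.
# `describe_size` is a hypothetical helper, not something from the book.
describe_size <- function(n) {
  if (n > 100) {
    "large"
  } else if (n > 10) {
    "medium"
  } else {
    "small"
  }
}

sizes <- c(5, 50, 500)

# A for loop that calls the function on each element.
labels <- character(length(sizes))
for (i in seq_along(sizes)) {
  labels[i] <- describe_size(sizes[i])
}
print(labels)

# The same computation with sapply, which is closer to R's functional style.
labels_functional <- sapply(sizes, describe_size)

# Changing `v` inside the function does not change `x` in the caller;
# the function returns a new, modified vector instead.
bump_first <- function(v) {
  v[1] <- v[1] + 1
  v
}
x <- c(1, 2, 3)
y <- bump_first(x)
print(x)
print(y)
```

If you can read this and predict roughly what each `print` call shows, you should be fine for the programming in the first seven chapters.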
I will introduce every such construction in the book when we need it, so you will see how it is expressed in R, but I will not spend much time explaining it. Mostly, I will just expect you to be able to pick it up from examples. Similarly, I do not expect you to know already how to fit data and compare models in R. I do expect that you have had enough introduction to statistics to be comfortable with basic terms like parameter estimation, model fitting, explanatory and response variables, and model comparison. If not, I expect you to be at least able to pick up what we are talking about when you need to.

I won’t expect you to know a lot about statistics and programming, but this isn’t Data Science for Dummies, so I do expect you to be able to figure out examples without me explaining everything in detail.

After the first seven chapters is a short description of a data analysis project that one of my students did in an earlier class. It shows how such a project could look, but I suggest that you do not wait until you have finished the first seven chapters to start doing such analysis yourself. To get the most benefit out of reading this book, you should be applying what you learn continuously. As soon as you begin reading, I suggest that you find a dataset that you would be interested in finding out more about and then apply what you learn in each chapter to that data.
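As a rough sketch of what a first pass over such a dataset might look like, the snippet below follows data from its raw form through a couple of manipulations to a summary. It uses R's built-in `mtcars` data purely as a stand-in. The choice of columns, the unit conversion, and the variable names are illustrative assumptions and not part of the book's material.

```r
# Raw form: the built-in mtcars data frame (one row per car model).
raw <- mtcars

# Step 1: keep only the columns of interest.
selected <- raw[, c("mpg", "cyl")]

# Step 2: transform. Convert miles per (US) gallon to kilometres per litre.
selected$kpl <- selected$mpg * 0.425144

# Step 3: summarize. Mean fuel economy for each cylinder count.
summary_table <- aggregate(kpl ~ cyl, data = selected, FUN = mean)
print(summary_table)
```

The point is less the particular functions than the shape of the script: each step takes data in one form and hands it on in another, and each step is there for a reason you can state.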
