What is Supervised Learning?

Supervised Learning

This chapter and the next concern the mathematical modeling of data, which is the essential core of data science. We can call this statistics, or we can call it machine learning. At its core, it is the same thing. It is all about extracting information out of data.

Machine Learning

Machine learning is the discipline of developing and applying models and algorithms for learning from data.
Traditional algorithms implement fixed rules for solving particular problems, like sorting numbers or finding the shortest route between two cities. To develop algorithms like that, you need a deep understanding of the problem you are trying to solve, a thorough understanding that you can rarely obtain unless the problem is particularly simple or you have abstracted away all the interesting cases. Far more often, you can collect examples of good or bad solutions to the problem you want to solve without being able to explain exactly why a given solution is good or bad. Or you can obtain data that provides examples of relationships between data you are interested in without necessarily understanding the underlying reasons for these relationships. This is where machine learning can help.

Machine learning concerns learning from data; you do not explicitly develop an algorithm for solving a particular problem. Instead, you use a generic learning algorithm that you feed examples of solutions to, and let it learn how to solve the problem from those examples. This might sound very abstract, but most statistical modeling is indeed an example of this. Take, for example, a linear model y = αx + β + ε, where ε is the stochastic noise (usually assumed to be normally distributed). When you want to model a linear relationship between x and y, you don't figure out α and β from first principles. You can write an algorithm for sorting numbers without having studied the numbers beforehand, but you cannot usually figure out what the linear relationship is between y and x without looking at data. When you fit the linear model, you are doing machine learning. (Well, I suppose if you do it by hand it isn't machine learning, but you are not likely to fit linear models by hand that often.) People typically do not call simple models like linear regression machine learning, but that is mostly because the term "machine learning" is much younger than these models. Linear regression is as much machine learning as neural networks are.

Supervised Learning

Supervised learning is used when you have variables you want to predict using other variables; situations like linear regression, where you have some input variables, for example x, and you want a model that predicts output (or response) variables, y = f(x). Unsupervised learning, the topic of Chapter 7, is instead concerned with discovering patterns in data when you don't necessarily know beforehand what questions you are interested in asking; when you don't have x and y values and want to know how they are related, but instead have a collection of data and want to discover what patterns there are in it.

For the simplest case of supervised learning, we have one response variable, y, and one input variable, x, and we want to figure out a function, f, mapping input to output, i.e., so that y = f(x). What we have to work with is example data of matching x and y. We can write that as vectors x = (x1, …, xn) and y = (y1, …, yn), where we want to figure out a function f such that yi = f(xi). We will typically accept that there might be some noise in our observations, so f doesn't map perfectly from x to y. We can therefore change the setup slightly and assume that the data we have is x = (x1, …, xn) and t = (t1, …, tn), where t holds the target values and where ti = yi + εi, yi = f(xi), and εi is the error in the observation ti. How we model the error εi and the function f are choices that are up to us. It is only modeling, after all, and we can do whatever we want.
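To make this setup concrete, here is a minimal Python sketch of how such data could arise. Everything in it (the linear form of f, the parameter values, and the noise level) is an arbitrary illustrative assumption, not something taken from a real data set:

    import numpy as np

    # Simulate data matching the setup above: a "true" linear function f,
    # inputs x, and noisy target observations t_i = f(x_i) + eps_i.
    rng = np.random.default_rng(seed=0)

    alpha, beta = 2.0, 1.0            # the (normally unknown) true parameters
    n = 50
    x = rng.uniform(0, 10, size=n)    # inputs x_1, ..., x_n
    y = alpha * x + beta              # y_i = f(x_i)
    eps = rng.normal(0, 1.0, size=n)  # observation errors epsilon_i
    t = y + eps                       # targets t_i = y_i + epsilon_i

In a real application we would only observe x and t; the point of the learning step is to recover something close to f from them.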
Not all models are equally good, of course, so we need to be a little careful with what we choose and how we evaluate whether the choice is good or bad, but in principle, we can do anything. The way most machine learning works is that an algorithm, implicitly or explicitly, defines a class of parameterized functions f(−; θ), each mapping input to output, f(−; θ): x ↦ f(x; θ) = y(θ) (now the value we get for the output depends on the parameters of the function, θ), and the learning consists of choosing parameters θ such that we minimize the errors, i.e., so that f(xi; θ) is as close to ti as we can get. We want to get close for all our data points, or at least get close on average, so if we let y(θ) denote the vector

(y1(θ), …, yn(θ)) = (f(x1; θ), …, f(xn; θ))

we want to minimize the distance from y(θ) to t, ∥y(θ) − t∥, for some distance measure ∥·∥.

Regression versus Classification

There are two types of supervised learning: regression and classification. Regression is used when the output variable we try to target is a number. Classification is used when we try to target categorical variables.

Take linear regression, y = αx + β (or t = αx + β + ε). It is regression because the variable we are trying to target is a number. The parameterized class of functions, f(−; θ), are all lines. If we let θ = (θ1, θ0) and α = θ1, β = θ0, then

y(θ) = f(x; θ) = θ1x + θ0

Fitting a linear model consists of finding the best θ, where best is defined as the θ that gets y(θ) closest to t. The distance measure used in linear regression is the squared Euclidean distance

∥y(θ) − t∥² = Σi (yi(θ) − ti)²

where the sum runs over the data points i = 1, …, n. The reason it is the squared distance instead of just the distance is mostly mathematical convenience (it is easier to find the optimal θ that way) but also related to us interpreting the error term ε as normally distributed. Whenever you are fitting data in linear regression, you are minimizing this distance; you are finding the parameters θ that best fit the data in the sense of

(θ̂1, θ̂0) = argmin over (θ1, θ0) of Σi (θ1xi + θ0 − ti)²

a fit we sketch in code at the end of this section.

For an example of classification, assume that the targets ti are binary, encoded as 0 and 1, but that the input variables xi are still real numbers. A common way of defining the mapping function f(−; θ) is to let it map x to the unit interval [0, 1] and interpret the resulting y(θ) as the probability that t is 1. In a classification setting, you would then predict 0 if f(x; θ) < 0.5 and predict 1 if f(x; θ) > 0.5 (and have some strategy for dealing with f(x; θ) = 0.5). In linear classification, the function f(−; θ) could look like this:

f(x; θ) = σ(θ1x + θ0)

where σ is a sigmoid function that maps real numbers into the unit interval, such as the logistic function σ(z) = 1/(1 + e^(−z)).
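As promised above, here is a minimal Python sketch of the least-squares fit, continuing from the simulated x and t in the earlier code block. The call np.linalg.lstsq solves exactly this kind of squared-distance minimization; building a design matrix with a column of ones for the intercept is the standard way to pose the problem:

    import numpy as np

    # Find theta = (theta1, theta0) minimizing the squared distance
    # sum over i of (theta1 * x_i + theta0 - t_i)^2.
    X = np.column_stack([x, np.ones_like(x)])  # one column per parameter
    theta, residuals, rank, sv = np.linalg.lstsq(X, t, rcond=None)
    theta1, theta0 = theta
    print(f"estimated slope {theta1:.2f} and intercept {theta0:.2f}")
    # With enough data, the estimates land close to the alpha and beta
    # that generated the targets.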
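For the classification case, here is a corresponding sketch of the sigmoid-based predictor. The parameter values are purely illustrative assumptions; actually fitting θ to binary targets (for example by maximum likelihood, as logistic regression does) is not covered here:

    import numpy as np

    def sigmoid(z):
        # The logistic function maps any real number into (0, 1).
        return 1.0 / (1.0 + np.exp(-z))

    def predict(x, theta1, theta0):
        # f(x; theta) = sigmoid(theta1 * x + theta0), read as P(t = 1);
        # predict 1 when that probability reaches the 0.5 threshold.
        p = sigmoid(theta1 * x + theta0)
        return np.where(p >= 0.5, 1, 0)

    # Illustrative parameter values; in practice theta is learned from data.
    print(predict(np.array([-2.0, 0.0, 3.0]), theta1=1.5, theta0=-0.5))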
