Supervised Learning
This chapter and the next concern the mathematical modeling of data that is the essential core of data
science. We can call this statistics, or we can call it machine learning. At its core, it is the same thing. It is all
about extracting information out of data.
Machine Learning
Machine learning is the discipline of developing and applying models and algorithms for learning from data. Traditional algorithms implement fixed rules for solving particular problems, like sorting numbers or finding the shortest route between two cities. To develop algorithms like that, you need a deep understanding of the problem you are trying to solve, a thorough understanding that you can rarely obtain unless the problem is particularly simple or you have abstracted away all the interesting cases. Far more often, you can collect examples of good or bad solutions to the problem you want to solve without being able to explain exactly why a given solution is good or bad. Or you can obtain data that provides examples of relationships between the variables you are interested in without necessarily understanding the underlying reasons for these relationships.
This is where machine learning can help. Machine learning concerns learning from data; you do
not explicitly develop an algorithm for solving a particular problem. Instead, you use a generic learning
algorithm that you feed examples of solutions to, and let it learn how to solve the problem from those
examples.
This might sound very abstract, but most statistical modeling is an example of this. Take, for example, a linear model y = αx + β + ε, where ε is the stochastic noise (usually assumed to be normally distributed). When you want to model a linear relationship between x and y, you don't figure out α and β from first principles. You can write an algorithm for sorting numbers without having studied the numbers beforehand, but you cannot usually figure out what the linear relationship is between y and x without looking at data. When you fit the linear model, you are doing machine learning. (Well, I suppose if you do it by hand it isn't machine learning, but you are not likely to fit linear models by hand that often.) People typically do not call simple models like linear regression machine learning, but that is mostly because the term "machine learning" is much younger than these models. Linear regression is as much machine learning as neural networks are.
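To make this concrete, here is a minimal sketch in Python of "learning" a linear model from examples; the synthetic data, the true parameters, and the use of NumPy's polyfit are all my own illustration, not something from the text. The point is that the program is never told the slope and intercept; it infers them from the examples.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic data from y = alpha*x + beta + noise, with alpha = 2, beta = 1.
x = rng.uniform(0, 10, size=100)
y = 2 * x + 1 + rng.normal(scale=0.5, size=100)

# "Learning" here is least-squares fitting: the algorithm infers the
# slope and intercept from the examples rather than from first principles.
alpha_hat, beta_hat = np.polyfit(x, y, deg=1)
print(alpha_hat, beta_hat)  # close to 2 and 1
```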
Supervised Learning
Supervised learning is used when you have variables you want to predict using other variables: situations like linear regression, where you have some input variable, x, and you want a model that predicts an output (or response) variable, y = f(x).
Unsupervised learning, the topic of Chapter 7, is instead concerned with discovering patterns in data when you don't necessarily know in advance what questions you want to ask. There, you don't have x and y values whose relationship you want to learn; you just have a collection of data, and you want to discover what patterns there are in it.
For the simplest case of supervised learning, we have one response variable, y, and one input variable, x, and we want to figure out a function, f, mapping input to output, i.e., so that y = f(x). What we have to work with is example data of matching x and y. We can write that as vectors x = (x₁, …, xₙ) and y = (y₁, …, yₙ), where we want to figure out a function f such that yᵢ = f(xᵢ).
We will typically accept that there might be some noise in our observations, so f doesn't map perfectly from x to y. So we can change the setup slightly and assume that the data we have is x = (x₁, …, xₙ) and t = (t₁, …, tₙ), where t holds the target values, tᵢ = yᵢ + εᵢ, yᵢ = f(xᵢ), and εᵢ is the error in the observation tᵢ.
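As a sketch of this setup (with a made-up true function f and a noise level chosen purely for illustration), generating data that follows tᵢ = f(xᵢ) + εᵢ might look like this in Python:

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    # The "true" relationship; in practice this is unknown to us.
    return 2 * x + 1

x = np.linspace(0, 1, 50)             # input values x_1, ..., x_n
eps = rng.normal(scale=0.1, size=50)  # observation errors eps_i
t = f(x) + eps                        # targets t_i = y_i + eps_i
```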
How we model the error εᵢ and the function f are choices that are up to us. It is only modeling, after all, and we can do whatever we want. Not all models are equally good, of course, so we need to be a little careful with what we choose and how we evaluate whether the choice is good or bad, but in principle, we can do anything.
The way most machine learning works is that an algorithm, implicitly or explicitly, defines a class of parameterized functions f(−; θ), each mapping input to output, f(−; θ): x ↦ f(x; θ) = y(θ) (now the value we get for the output depends on the parameters of the function, θ), and the learning consists of choosing parameters θ such that we minimize the errors, i.e., so that f(xᵢ; θ) is as close to tᵢ as we can get. We want to get close for all our data points, or at least get close on average, so if we let y(θ) denote the vector (y(θ)₁, …, y(θ)ₙ) = (f(x₁; θ), …, f(xₙ; θ)), we want to minimize the distance from y(θ) to t, ∥y(θ) − t∥, for some distance measure ∥·∥.
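To illustrate this general recipe (assuming SciPy's general-purpose minimize and an invented dataset; neither is prescribed by the text), we can define a parameterized family f(x; θ) and let an optimizer pick the θ that minimizes the squared distance between y(θ) and t:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
x = np.linspace(0, 1, 50)
t = 3 * x + 0.5 + rng.normal(scale=0.1, size=50)

def f(x, theta):
    # A parameterized class of functions; here, all lines.
    return theta[0] * x + theta[1]

def loss(theta):
    # Squared Euclidean distance between y(theta) and t.
    return np.sum((f(x, theta) - t) ** 2)

result = minimize(loss, x0=np.zeros(2))
print(result.x)  # theta close to (3, 0.5)
```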
Regression versus Classification
There are two types of supervised learning: regression and classification. Regression is used when the output
variable we try to target is a number. Classification is used when we try to target categorical variables.
Take linear regression, y = αx + β (or t = αx + β + ε). It is regression because the variable we are trying to target is a number. The parameterized class of functions, f(−; θ), consists of all lines: if we let θ = (θ₁, θ₀) with α = θ₁ and β = θ₀, then y(θ) = f(x; θ) = θ₁x + θ₀. Fitting a linear model consists of finding the best θ, where best is defined as the θ that gets y(θ) closest to t. The distance measure used in linear regression is the squared Euclidean distance

∥y(θ) − t∥² = ∑ᵢ₌₁ⁿ (y(θ)ᵢ − tᵢ)².
The reason it is the squared distance instead of just the distance is mostly mathematical convenience (it is easier to optimize θ that way), but it is also related to us interpreting the error term ε as normally distributed. Whenever you are fitting data in linear regression, you are minimizing this distance; you are finding the parameters θ that best fit the data in the sense of

θ̂ = arg min_{θ₁, θ₀} ∑ᵢ₌₁ⁿ (θ₁xᵢ + θ₀ − tᵢ)²
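This arg min has a well-known closed-form solution via least squares. A minimal sketch that solves it directly with NumPy (the data here is synthetic, made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
x = np.linspace(0, 10, 100)
t = 1.5 * x - 2 + rng.normal(scale=1.0, size=100)

# Design matrix with one column for theta_1 (slope)
# and one for theta_0 (intercept).
X = np.column_stack([x, np.ones_like(x)])

# np.linalg.lstsq minimizes the sum of squared errors above directly.
theta, *_ = np.linalg.lstsq(X, t, rcond=None)
print(theta)  # close to (1.5, -2)
```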
For an example of classification, assume that the targets tᵢ are binary, encoded as 0 and 1, but that the input variables xᵢ are still real numbers. A common way of defining the mapping function f(−; θ) is to let it map x to the unit interval [0, 1] and interpret the resulting y(θ) as the probability that t is 1. In a classification setting, you would then predict 0 if f(x; θ) < 0.5 and predict 1 if f(x; θ) > 0.5 (and have some strategy for dealing with f(x; θ) = 0.5). In linear classification, the function f(−; θ) could look like this:

f(x; θ) = σ(θ₁x + θ₀)

where σ is a function mapping the real line to the unit interval, such as the logistic (sigmoid) function σ(z) = 1/(1 + e⁻ᶻ).
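A sketch of such a classifier in Python (the parameter values are invented for illustration; only the σ-based form and the 0.5 threshold come from the text):

```python
import numpy as np

def sigmoid(z):
    # Logistic function: maps the real line to the unit interval (0, 1).
    return 1 / (1 + np.exp(-z))

def f(x, theta):
    # Maps x to [0, 1]; interpreted as the probability that t is 1.
    return sigmoid(theta[0] * x + theta[1])

def predict(x, theta):
    # Classify by thresholding the probability at 0.5.
    return (f(x, theta) >= 0.5).astype(int)

theta = np.array([2.0, -1.0])  # made-up parameters for illustration
x = np.array([-1.0, 0.0, 0.5, 2.0])
print(f(x, theta))        # probabilities in (0, 1)
print(predict(x, theta))  # 0/1 class labels
```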