Logistic Regression (Classification, Really)
Using other statistical models works the same way. We specify the class of functions, fθ, using a formula and use a function to fit its parameters. Consider binary classification and logistic regression.
Here we can use the breast cancer data from the mlbench library discussed in Chapter 3 and ask if the
clump thickness has an effect on the risk of a tumor being malignant. That is, we want to see if we can predict
the Class variable from the Cl.thickness variable.
library(mlbench)
data("BreastCancer")
BreastCancer %>% head
## Id Cl.thickness Cell.size Cell.shape
## 1 1000025 5 1 1
## 2 1002945 5 4 4
## 3 1015425 3 1 1
## 4 1016277 6 8 8
## 5 1017023 4 1 1
## 6 1017122 8 10 10
## Marg.adhesion Epith.c.size Bare.nuclei
## 1 1 2 1
## 2 5 7 10
## 3 1 2 2
## 4 1 3 4
## 5 3 2 1
## 6 8 7 10
## Bl.cromatin Normal.nucleoli Mitoses Class
## 1 3 1 1 benign
## 2 3 2 1 benign
## 3 3 1 1 benign
## 4 3 7 1 benign
## 5 3 1 1 benign
## 6 9 7 1 malignant
We can plot the data, as shown in Figure 6-5. Since the malignant status takes only one of two values, the points would overlap, but if we add a little jitter to the plot we can still see them, and if we make them slightly transparent, we can see the density of the points.
BreastCancer %>%
  ggplot(aes(x = Cl.thickness, y = Class)) +
  geom_jitter(height = 0.05, width = 0.3, alpha = 0.4)
For classification we still specify the prediction function y = f(x) using the formula y ~ x; the response variable in y ~ x is just binary now. To fit a logistic regression we need to use the glm() function (generalized linear model) with the family set to "binomial". This specifies that we use the logistic function to map from the linear combination of x and θ to the unit interval. Aside from that, fitting the model and getting results is very similar.
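In outline, and using purely hypothetical names (a data frame d with a numeric predictor x and a 0/1 response y), the call looks like this sketch:

# Hypothetical data frame d with a numeric predictor x and a 0/1 response y.
# family = "binomial" makes glm() fit a logistic regression.
fitted_model <- glm(y ~ x, data = d, family = "binomial")
summary(fitted_model)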
We cannot fit the breast cancer data directly with logistic regression, though. There are two problems. The first is that the breast cancer dataset represents the clump thickness as an ordered factor, but for logistic regression we need the input variable to be numeric. While it is generally not advisable to translate categorical data directly into numeric data, judging from the plot it seems reasonable in this case. The function as.numeric() will do this, but remember that this is a risky approach when working with factors! It would actually work for this dataset, but we will use the safer approach of first translating the factor into strings and then into numbers. The second problem is that the glm() function expects the response variable to be numerical, coding the classes as 0 and 1, while the BreastCancer data encodes the classes as a factor. Generally, it varies a little from algorithm to algorithm whether a factor or a numerical encoding is expected for classification, so you always need to check the documentation, but in any case, it is simple enough to translate between the two representations.
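A minimal sketch of both translations, assuming dplyr is available for mutate() and using helper column names (Thickness, Malignant) of my own choosing:

library(dplyr)

formatted_breast_cancer <- BreastCancer %>%
  mutate(
    # Go through character first so we get the actual thickness values,
    # not the factor's internal level codes.
    Thickness = as.numeric(as.character(Cl.thickness)),
    # Encode the response as 0 (benign) / 1 (malignant).
    Malignant = as.numeric(Class == "malignant")
  )

fitted_model <- glm(Malignant ~ Thickness,
                    data = formatted_breast_cancer,
                    family = "binomial")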
With the input variable translated to numerical values and the response variable to 0 and 1, we can plot the data together with a fitted model, as shown in Figure 6-6. For the geom_smooth() function, we specify that the method is glm and that the family is binomial. To specify the family, we need to pass this argument wrapped in a list to geom_smooth()'s method.args parameter.
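A sketch of how the Figure 6-6 plot might be produced, reusing formatted_breast_cancer from the sketch above and assuming ggplot2 is loaded:

library(ggplot2)

formatted_breast_cancer %>%
  ggplot(aes(x = Thickness, y = Malignant)) +
  geom_jitter(height = 0.05, width = 0.3, alpha = 0.4) +
  # The family has to be passed through method.args rather than directly.
  geom_smooth(method = "glm", method.args = list(family = "binomial"))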
Model Matrices and Formula
Most statistical models and machine learning algorithms actually create a map not from a single value, f(−; θ): x ↦ y, but from a vector of values. When we fit a line for single x and y values, we are actually also working with fitting a vector, because we have both the x values and the intercept to fit. That is why the model has two parameters, θ0 and θ1. For each x value, we are actually using the vector (1, x), where the 1 is used to fit the intercept.
We shouldn’t confuse this with the vector we have as input to the model fitting, though. If we have data (x, t) to fit, then we already have a vector for our input data. But what the linear model actually sees is a matrix for x, so we’ll call that X. This matrix, known as the model matrix, has a row per value in x and two columns, one for the intercept and one for the x values.
X = \begin{pmatrix}
1 & x_1 \\
1 & x_2 \\
1 & x_3 \\
\vdots & \vdots \\
1 & x_n
\end{pmatrix}
We can see what model matrix R generates for a given dataset and formula using the model.matrix()
function. For the cars data, if we want to fit dist versus speed we get this:
cars %>%
  model.matrix(dist ~ speed, data = .) %>%
  head(5)
## (Intercept) speed
## 1 1 4
## 2 1 4
## 3 1 7
## 4 1 7
## 5 1 8
If we remove the intercept, we simply get this:
cars %>%
  model.matrix(dist ~ speed - 1, data = .) %>%
  head(5)
## speed
## 1 4
## 2 4
## 3 7
## 4 7
## 5 8
In this plot, I used the method "lm" for the smoothed statistics to see the fit. By default, the geom_smooth() function would have given us a loess curve, but since we are interested in linear fits, we tell it to use the lm method. By default, geom_smooth() will also plot the uncertainty of the fit. This is the gray area in the plot: the area where the line is likely to be (assuming that the data is generated by a linear model).
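Such a plot can be produced along these lines (a sketch using the cars data, with ggplot2 assumed to be loaded):

library(ggplot2)

cars %>%
  ggplot(aes(x = speed, y = dist)) +
  geom_point() +
  # method = "lm" gives a straight-line fit; the gray band is the
  # default confidence region around the fitted line.
  geom_smooth(method = "lm")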
Do not confuse this with where data points are likely to be, though. If target values are given by t = θ1 x + θ0 + ε, where ε has a very large variance, then even if we knew θ1 and θ0 with high certainty we still wouldn’t be able
to predict with high accuracy where any individual point would fall. There is a difference between prediction
accuracy and inference accuracy. We might know model parameters with very high accuracy without being
able to predict very well. We might also be able to predict very well without knowing all model parameters
well. If a given model parameter has little influence on where target variables fall, then the training data
gives us little information about that parameter. This usually doesn’t happen unless the model is more
complicated than it needs to be, though, since we often want to remove parameters that do not affect the
data.
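One way to see the difference between the two kinds of accuracy is to compare confidence intervals (where the fitted line is) with prediction intervals (where individual points are expected to fall). A sketch using the cars data:

cars_model <- lm(dist ~ speed, data = cars)
new_speeds <- data.frame(speed = c(10, 20))

# Interval for the expected (mean) stopping distance -- the line itself.
predict(cars_model, new_speeds, interval = "confidence")

# Interval for a single new observation -- wider, since it also includes
# the variance of the noise term.
predict(cars_model, new_speeds, interval = "prediction")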
To actually fit the data and get information about the fit, we use the lm() function with the model
specification, dist ~ speed, and we can use the summary() function to see information about the fit:
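In the same piped style used above, the fit and summary might look like this:

cars %>%
  lm(dist ~ speed, data = .) %>%
  summary()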