What is Logistic Regression?

Logistic Regression (Classification, Really)

Using other statistical models works the same way: we specify the class of functions, f_θ, using a formula and use a fitting function to estimate its parameters. Consider binary classification and logistic regression.
Here we can use the breast cancer data from the mlbench library discussed in Chapter 3 and ask whether the clump thickness has an effect on the risk of a tumor being malignant. That is, we want to see if we can predict the Class variable from the Cl.thickness variable.

```r
library(mlbench)
data("BreastCancer")
BreastCancer %>% head
##        Id Cl.thickness Cell.size Cell.shape
## 1 1000025            5         1          1
## 2 1002945            5         4          4
## 3 1015425            3         1          1
## 4 1016277            6         8          8
## 5 1017023            4         1          1
## 6 1017122            8        10         10
##   Marg.adhesion Epith.c.size Bare.nuclei
## 1             1            2           1
## 2             5            7          10
## 3             1            2           2
## 4             1            3           4
## 5             3            2           1
## 6             8            7          10
##   Bl.cromatin Normal.nucleoli Mitoses     Class
## 1           3               1       1    benign
## 2           3               2       1    benign
## 3           3               1       1    benign
## 4           3               7       1    benign
## 5           3               1       1    benign
## 6           9               7       1 malignant
```

We can plot the data against the fit, as shown in Figure 6-5. Since the malignant status is either 0 or 1, the points would overlap, but if we add a little jitter to the plot we can still see them, and if we make them slightly transparent, we can see the density of the points.

```r
BreastCancer %>%
  ggplot(aes(x = Cl.thickness, y = Class)) +
  geom_jitter(height = 0.05, width = 0.3, alpha = 0.4)
```

For classification we still specify the prediction function y = f(x) using the formula y ~ x; the outcome variable in y ~ x is just binary now. To fit a logistic regression, we need to use the glm() function (generalized linear model) with the family set to "binomial". This specifies that we use the logistic function to map from the linear space of x and θ to the unit interval. Aside from that, fitting and getting results is very similar.

We cannot directly fit the breast cancer data with logistic regression, though. There are two problems. The first is that the breast cancer dataset encodes the clump thickness as an ordered factor, but for logistic regression we need the input variable to be numeric. While it is generally not advisable to translate categorical data directly into numeric data, judging from the plot it seems okay in this case.
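To make the role of the logistic function concrete, here is a minimal sketch; the helper name `sigmoid` is my own, not from the text:

```r
# The logistic (sigmoid) function maps any value from the linear
# predictor space, -Inf..Inf, into the unit interval (0, 1),
# which is what lets glm() model a probability of a binary class.
sigmoid <- function(z) 1 / (1 + exp(-z))

sigmoid(0)   # 0.5: a linear predictor of zero means probability one half
sigmoid(-5)  # close to 0
sigmoid(5)   # close to 1
```

The fitted θ values shift and scale z before this mapping, which is why the fit lives in "the linear space of x and θ" while the predictions live in [0, 1].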
Using the function as.numeric() will do this, but remember that this is a risky approach when working with factors! It would actually work for this dataset, but we will use the safer approach of first translating the factor into strings and then into numbers.

The second problem is that the glm() function expects the response variable to be numerical, coding the classes as 0 or 1, while the BreastCancer data encodes the classes as a factor. It varies a little from algorithm to algorithm whether a factor or a numerical encoding is expected for classification, so you always need to check the documentation, but in any case it is simple enough to translate between the two representations.

We can translate the input variable to numerical values and the response variable to 0 and 1 and plot the data together with a fitted model, as shown in Figure 6-6. For the geom_smooth() function, we specify that the method is glm and that the family is binomial. To specify the family, we need to pass this argument on to glm() through geom_smooth()'s method.args parameter.

Model Matrices and Formula

Most statistical models and machine learning algorithms actually create a map not from a single value, f(−; θ): x ↦ y, but from a vector, f(−; θ): **x** ↦ y. When we fit a line for single x and y values, we are actually also working with a vector, because we have both the x values and the intercept to fit. That is why the model has two parameters, θ0 and θ1. For each x value, we are actually using the vector (1, x), where the 1 is used to fit the intercept. We shouldn't confuse this with the vector we have as input to the model fitting, though. If we have data (**x**, **t**) to fit, then we already have a vector for our input data. But what the linear model actually sees is a matrix for x, so we'll call that X. This matrix, known as the model matrix, has a row per value in **x** and two columns, one for the intercept and one for the x values.
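The two translations and the fit can be sketched as follows; this assumes the mlbench package is installed, and the variable names `thickness` and `malignant` are mine, not the book's:

```r
library(mlbench)
data("BreastCancer")

# Translate the ordered factor to numbers via strings (the safe route):
# as.character() recovers the level labels "1".."10" before as.numeric().
thickness <- as.numeric(as.character(BreastCancer$Cl.thickness))

# Translate the benign/malignant factor into 0/1 for glm().
malignant <- as.numeric(BreastCancer$Class == "malignant")

# Fit the logistic regression: family = "binomial" selects the
# logistic mapping onto the unit interval.
fit <- glm(malignant ~ thickness, family = "binomial")
coef(fit)  # intercept and slope on the linear-predictor scale
```

The same family argument reaches geom_smooth() via `method = "glm", method.args = list(family = "binomial")`, which is what produces the S-shaped curve in Figure 6-6.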
$$X = \begin{pmatrix} 1 & x_1 \\ 1 & x_2 \\ 1 & x_3 \\ \vdots & \vdots \\ 1 & x_n \end{pmatrix}$$

We can see what model matrix R generates for a given dataset and formula using the model.matrix() function. For the cars data, if we want to fit dist versus speed, we get this:

```r
cars %>% model.matrix(dist ~ speed, data = .) %>% head(5)
##   (Intercept) speed
## 1           1     4
## 2           1     4
## 3           1     7
## 4           1     7
## 5           1     8
```

If we remove the intercept, we simply get this:

```r
cars %>% model.matrix(dist ~ speed - 1, data = .) %>% head(5)
##   speed
## 1     4
```

In this plot, I used the method "lm" for the smoothed statistics to see the fit. By default, the geom_smooth() function would have given us a loess curve, but since we are interested in linear fits, we tell it to use the lm method. By default, geom_smooth() will also plot the uncertainty of the fit. This is the gray area in the plot: the area where the line is likely to be (assuming that the data is generated by a linear model). Do not confuse this with where data points are likely to be, though. If target values are given by t = θ1 x + θ0 + ε, where ε has a very large variance, then even if we knew θ1 and θ0 with high certainty, we still wouldn't be able to predict with high accuracy where any individual point would fall.

There is a difference between prediction accuracy and inference accuracy. We might know model parameters with very high accuracy without being able to predict very well. We might also be able to predict very well without knowing all model parameters well. If a given model parameter has little influence on where target variables fall, then the training data gives us little information about that parameter. This usually doesn't happen unless the model is more complicated than it needs to be, though, since we often want to remove parameters that do not affect the data.

To actually fit the data and get information about the fit, we use the lm() function with the model specification, dist ~ speed, and we can use the summary() function to see information about the fit:
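A minimal sketch of that fit-and-summarize step, using the built-in cars data:

```r
# Fit the linear model dist ~ speed; lm() builds the model matrix
# (intercept column plus speed column) behind the scenes.
fit <- lm(dist ~ speed, data = cars)

summary(fit)  # coefficient estimates, standard errors, R-squared, etc.
coef(fit)     # the two parameters: (Intercept) and speed
```

The two entries of coef(fit) correspond exactly to the two columns of the model matrix shown above: one parameter per column.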
