So far, we have worked with many machine learning models and training algorithms that can seem like unfathomable black boxes: we optimized a regression system, and we have also worked with image classifiers. But we built these systems without understanding what is inside them or how they work, so now we need to go deeper in order to grasp how they work and understand the details of their implementation.
Gaining a deep understanding of these details will help you choose the right model and the best training algorithm, and it will also help you with debugging and error analysis.
In this chapter, we'll work with polynomial regression, a more complex model that can fit nonlinear data sets. In addition, we'll work with several regularization techniques that reduce the risk of overfitting during training.
Linear Regression
As an example, we'll take l_S = θ0 + θ1 × GDP_per_cap. This is a simple model: a linear function of the input feature "GDP_per_cap", where θ0 and θ1 are the parameters of the model.
In general, a linear model makes a prediction by calculating a weighted sum of the input features, plus a constant "bias" term, as in the following equation:

ŷ = θ0 + θ1 · x1 + θ2 · x2 + … + θn · xn

In this equation:
- ŷ is the predicted value.
- n is the number of features.
- xj is the value of the j-th feature.
- θj is the j-th model parameter (θ0 is the bias term, and θ1 to θn are the feature weights).
We can also write the equation in vectorized form, as ŷ = θ^T · x. To find the value of θ that minimizes the cost function, we can use the Normal Equation: θ̂ = (X^T · X)^(-1) · X^T · y. In this equation:

- θ̂ is the value of θ that minimizes the cost.
- y is the vector of target values, containing y(1) to y(m).
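To make the two ways of writing the prediction concrete, here is a tiny standalone sketch; the parameter and feature values in it are made up purely for illustration, and it simply checks that the weighted sum and the vectorized form give the same number.

import numpy as np

# made-up parameters: a bias term theta_0 plus two feature weights
theta = np.array([[4.0], [3.0], [2.0]])

# one instance, with x_0 = 1 for the bias term and two feature values
x = np.array([[1.0], [1.5], [0.5]])

# weighted sum: theta_0*x_0 + theta_1*x_1 + theta_2*x_2
y_hat_sum = sum(theta[j, 0] * x[j, 0] for j in range(3))

# vectorized form: theta^T . x
y_hat_vec = theta.T.dot(x)[0, 0]

print(y_hat_sum, y_hat_vec)   # both print 9.5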
Let’s write some code to practice.
import numpy as np

V1_x = 2 * np.random.rand(100, 1)              # 100 random feature values between 0 and 2
V2_y = 4 + 3 * V1_x + np.random.randn(100, 1)  # targets: y = 4 + 3x plus Gaussian noise
After that, we'll calculate the value of θ using the Normal Equation. It's time to use the inv() function from NumPy's linear algebra module (np.linalg) to calculate the inverse of a matrix, and the dot() function to multiply matrices.
Value1 = np.c_[np.ones((100, 1)), V1_x]   # add x0 = 1 to each instance for the bias term
myTheta = np.linalg.inv(Value1.T.dot(Value1)).dot(Value1.T).dot(V2_y)
>>> myTheta
array([[num], [num]])
Remember that we generated the data with the equation y = 4 + 3x + Gaussian noise, so we would hope to recover a θ0 close to 4 and a θ1 close to 3 (the noise keeps the estimates from matching exactly).
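As an optional cross-check, and assuming you have scikit-learn installed, the LinearRegression class should find essentially the same parameters; this is just a quick sanity check, not part of the chapter's own method.

from sklearn.linear_model import LinearRegression

lin_reg = LinearRegression()
lin_reg.fit(V1_x, V2_y)                    # it adds the bias term itself, so we pass V1_x, not Value1
print(lin_reg.intercept_, lin_reg.coef_)   # should be close to 4 and 3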
Now let’s make our predictions.
>>> V1_new = np.array([[0], [2]])
>>> V1_new_2 = np.c_[np.ones((2, 1)), V1_new]   # add x0 = 1 to each new instance
>>> V2_predict = V1_new_2.dot(myTheta)
>>> V2_predict
array([[4.219424], [9.74422282]])
Now, it’s time to plot the model.
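Here is one way to do it, assuming matplotlib is installed: we plot the training data as points and the model's predictions on the two new instances as a line.

import matplotlib.pyplot as plt

plt.plot(V1_new, V2_predict, "r-", label="predictions")   # the fitted line
plt.plot(V1_x, V2_y, "b.", label="training data")         # the raw data points
plt.xlabel("x")
plt.ylabel("y")
plt.legend()
plt.show()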
Computational Complexity
With the Normal Equation, we need to compute the inverse of X^T · X, which is an n × n matrix (n being the number of features). The computational complexity of this inversion is roughly O(n^2.4) to O(n^3), depending on the implementation. In other words, if you double the number of features, the computation time is multiplied by roughly 2^2.4 to 2^3.
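If you want to see this scaling for yourself, a rough sketch like the following times just the inversion step for a growing number of features. The sizes chosen and the use of time.perf_counter are only illustrative; the exact timings depend on your machine and the linear algebra library NumPy is linked against.

import time
import numpy as np

for n in (200, 400, 800):                  # doubling the number of features each time
    M = np.random.rand(1000, n)
    start = time.perf_counter()
    np.linalg.inv(M.T.dot(M))              # the inversion step of the Normal Equation
    print(n, round(time.perf_counter() - start, 4), "seconds")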
The good news here is that the equation is linear with respect to the number of training instances, so it can handle large training sets efficiently, provided they fit in memory.
Once your model is trained, predictions are fast: their complexity is linear in both the number of instances you predict on and the number of features. Now it's time to go deeper into other methods of training a linear regression model, which are better suited to cases with a large number of features, or with too many training instances to fit in memory.
Gradient Descent
Gradient descent is a general optimization algorithm that can find optimal solutions to a wide range of problems. The idea of this algorithm is to tweak the parameters iteratively in order to make the cost function as small as possible.
The gradient descent algorithm measures the local gradient of the error with respect to the parameter vector θ, and it moves in the direction of the descending gradient. When the gradient is equal to zero, you have reached a minimum.
Also, keep in mind that the size of the steps is very important for this algorithm: if it is very small (meaning the learning rate is low), the algorithm will need a large number of iterations and will take a long time to converge.
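To get a feel for this, here is a tiny sketch on a made-up one-dimensional cost function (not our regression problem) that runs a few gradient descent steps with a very small learning rate and with a more reasonable one.

# cost function J(t) = (t - 3)^2, whose gradient is 2*(t - 3); the minimum is at t = 3
def grad(t):
    return 2 * (t - 3)

for lr in (0.01, 0.3):            # a very small and a more reasonable learning rate
    t = 0.0                       # start far from the minimum
    for _ in range(20):
        t = t - lr * grad(t)
    print(lr, round(t, 4))        # with lr=0.01 we are still far from 3; with lr=0.3 we are very close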
Batch Gradient Descent
If you'd like to implement this algorithm, you first need to calculate the gradient of the cost function with respect to each parameter θj. In other words, you need to know how much the cost function changes when you change θj a little. This change is called a partial derivative.
We can calculate the partial derivative with the following equation:

∂MSE(θ)/∂θj = (2/m) · Σi (θ^T · x(i) − y(i)) · xj(i), summing over the m training instances.

Rather than computing these partial derivatives one by one, we can use the following equation to calculate the whole gradient vector in one step:

∇θ MSE(θ) = (2/m) · X^T · (X · θ − y)
Let’s implement the algorithm.
Lr = 0.1                          # Lr for learning rate (1 would be too large here and diverge)
Num_it = 1000                     # number of iterations
L = 100                           # number of training instances
myTheta = np.random.randn(2, 1)   # random initialization

for it in range(Num_it):
    gr = 2 / L * Value1.T.dot(Value1.dot(myTheta) - V2_y)   # gradient vector
    myTheta = myTheta - Lr * gr                             # one gradient descent step
Stochastic Gradient Descent
You'll run into a problem when using batch gradient descent: it needs to use the whole training set to calculate the gradients at every step, and that hurts performance (speed) when the training set is large.
Stochastic gradient descent, on the other hand, randomly chooses one instance from your training set at each step and calculates the gradients based only on that instance. This makes the algorithm much faster than batch gradient descent, since it doesn't need the whole set for each update. On the other hand, because of the randomness of this method, it is much less regular than the batch algorithm.
Let’s implement the algorithm.
Nums = 50                         # number of epochs
L1, L2 = 5, 50                    # learning schedule hyperparameters

def lr_sc(s):                     # learning schedule: the rate decreases as training goes on
    return L1 / (s + L2)

m = 100                           # number of training instances
myTheta = np.random.randn(2, 1)   # random initialization

for Num in range(Nums):
    for i in range(m):
        myIndex = np.random.randint(m)                       # pick one instance at random
        V1_xi = Value1[myIndex:myIndex + 1]
        V2_yi = V2_y[myIndex:myIndex + 1]
        gr = 2 * V1_xi.T.dot(V1_xi.dot(myTheta) - V2_yi)     # gradient for this single instance
        Lr = lr_sc(Num * m + i)
        myTheta = myTheta - Lr * gr

>>> myTheta
array([[num], [num]])
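If you'd rather not write this loop by hand, recent versions of scikit-learn provide an SGDRegressor class that trains a linear model with stochastic gradient descent. The following is only a minimal sketch, and the hyperparameter values are reasonable defaults rather than tuned choices.

from sklearn.linear_model import SGDRegressor

sgd_reg = SGDRegressor(max_iter=1000, tol=1e-3, penalty=None, eta0=0.1)
sgd_reg.fit(V1_x, V2_y.ravel())            # ravel() because SGDRegressor expects a 1-D target
print(sgd_reg.intercept_, sgd_reg.coef_)   # again, values close to 4 and 3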
Mini-Batch Gradient Descent
Because you already know the batch and the stochastic algorithms, this kind of algorithm is very easy to understand and work with. As you know, those two algorithms calculate the gradients based either on the whole training set or on a single instance. Mini-batch gradient descent instead computes the gradients on small random subsets of instances, called mini-batches.
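The chapter stops short of an implementation here, so the following is only a sketch of the idea. It reuses Value1, V2_y, and the lr_sc learning schedule defined earlier, and the mini-batch size of 20 is chosen arbitrarily for illustration.

Nums = 50                        # number of epochs
mb_size = 20                     # mini-batch size (arbitrary choice for this sketch)
m = 100                          # number of training instances
myTheta = np.random.randn(2, 1)  # random initialization

s = 0
for Num in range(Nums):
    shuffled = np.random.permutation(m)          # shuffle the instances each epoch
    for start in range(0, m, mb_size):
        idx = shuffled[start:start + mb_size]    # indices of one mini-batch
        V1_xi = Value1[idx]
        V2_yi = V2_y[idx]
        gr = 2 / mb_size * V1_xi.T.dot(V1_xi.dot(myTheta) - V2_yi)   # gradient on the mini-batch
        s += 1
        myTheta = myTheta - lr_sc(s) * gr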