
Figure: California housing prices

If you are reading this in grayscale, grab a red pen and scribble over most of the coastline from the Bay Area down to San Diego (as you might expect). You can add a patch of yellow around Sacramento as well.

It will probably be useful to use a clustering algorithm to detect the main clusters, and add new features that measure the proximity to the cluster centers.

The correlation coefficient ranges from -1 to 1. When it is close to 1, it means that there is a strong positive correlation; for example, the median house value tends to go up when the median income goes up. When the coefficient is close to -1, it means that there is a strong negative correlation; you can see a small negative correlation between the latitude and the median house value i.

Finally, coefficients close to zero mean that there is no linear correlation. It may completely miss out on nonlinear relationships e.
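To make the point about nonlinear relationships concrete, here is a minimal sketch using NumPy only. The data is made up for illustration: `y_quadratic` is fully determined by `x`, yet its correlation coefficient with `x` is essentially zero because the relationship is not linear.

```python
import numpy as np

# Hypothetical data: x is symmetric around 0.
x = np.linspace(-1.0, 1.0, 201)
y_linear = 3.0 * x + 1.0     # perfectly linear, increasing
y_negative = -2.0 * x        # perfectly linear, decreasing
y_quadratic = x ** 2         # depends entirely on x, but not linearly


def pearson(a, b):
    """Pearson correlation coefficient between two 1-D arrays."""
    return float(np.corrcoef(a, b)[0, 1])


r_linear = pearson(x, y_linear)        # close to 1
r_negative = pearson(x, y_negative)    # close to -1
r_quadratic = pearson(x, y_quadratic)  # close to 0
print(r_linear, r_negative, r_quadratic)
```

A coefficient near zero therefore rules out a linear trend only; it says nothing about other kinds of dependence.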

Figure: Scatter matrix

The main diagonal (top left to bottom right) would be full of straight lines if Pandas plotted each variable against itself, which would not be very useful. First, the correlation is indeed very strong; you can clearly see the upward trend and the points are not too dispersed. You may want to try removing the corresponding districts to prevent your algorithms from learning to reproduce these data quirks.

Figure: Median income versus median house value

Experimenting with Attribute Combinations

Hopefully the previous sections gave you an idea of a few ways you can explore the data and gain insights. You identified a few data quirks that you may want to clean up before feeding the data to a Machine Learning algorithm, and you found interesting correlations between attributes, in particular with the target attribute.

Of course, your mileage will vary considerably with each project, but the general ideas are similar. One last thing you may want to do before actually preparing the data for Machine Learning algorithms is to try out various attribute combinations. For example, the total number of rooms in a district is not very useful if you do not know how many households there are. What you really want is the number of rooms per household.

Similarly, the total number of bedrooms by itself is not very useful: you probably want to compare it to the number of rooms. And the population per household also seems like an interesting attribute combination to look at. The number of rooms per household is also more informative than the total number of rooms in a district: obviously the larger the houses, the more expensive they are.
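The three attribute combinations mentioned above are simple element-wise ratios. The sketch below uses made-up district totals and hypothetical variable names; it only illustrates the arithmetic, not the actual housing dataset.

```python
import numpy as np

# Hypothetical per-district totals (three made-up districts).
total_rooms = np.array([880.0, 7099.0, 1467.0])
total_bedrooms = np.array([129.0, 1106.0, 190.0])
population = np.array([322.0, 2401.0, 496.0])
households = np.array([126.0, 1138.0, 177.0])

# Ratios are usually more informative than raw totals.
rooms_per_household = total_rooms / households
bedrooms_per_room = total_bedrooms / total_rooms
population_per_household = population / households

print(rooms_per_household)
print(bedrooms_per_room)
print(population_per_household)
```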

But this is an iterative process: once you get a prototype up and running, you can analyze its output to gain more insights and come back to this exploration step.

You will need it later to replace missing values in the test set when you want to evaluate your system, and also once the system goes live to replace missing values in new data. Scikit-Learn provides a handy class to take care of missing values: SimpleImputer. Here is how to use it.
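A minimal sketch of the idea, assuming a small made-up numeric array rather than the housing data: the imputer learns each column's median on the training set, and the same learned medians are then reused for new data.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical training data with missing values (NaN).
X_train = np.array([[1.0, 10.0],
                    [2.0, np.nan],
                    [3.0, 30.0],
                    [np.nan, 50.0]])

imputer = SimpleImputer(strategy="median")
imputer.fit(X_train)                 # learns one median per column

# The learned medians are stored in the statistics_ attribute.
print(imputer.statistics_)

X_train_filled = imputer.transform(X_train)

# New data is filled with the medians learned on the training set.
X_new = np.array([[np.nan, np.nan]])
X_new_filled = imputer.transform(X_new)
print(X_new_filled)
```

Because the medians live in the fitted imputer, keeping that object around is enough to treat the test set and live data consistently.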

All objects share a consistent and simple interface:

• Estimators. Any object that can estimate some parameters based on a dataset is called an estimator e. The estimation itself is performed by the fit() method, and it takes only a dataset as a parameter (or two for supervised learning algorithms; the second dataset contains the labels). Some estimators (such as an imputer) can also transform a dataset; these are called transformers.

Once again, the API is quite simple: the transformation is performed by the transform method with the dataset to transform as a parameter. It returns the transformed dataset.

Finally, some estimators are capable of making predictions given a dataset; they are called predictors. A predictor has a predict method that takes a dataset of new instances and returns a dataset of corresponding predictions. Datasets are represented as NumPy arrays or SciPy sparse matrices, instead of homemade classes. Hyperparameters are just regular Python strings or numbers. Existing building blocks are reused as much as possible.

For example, it is easy to create a Pipeline estimator from an arbitrary sequence of transformers followed by a final estimator, as we will see. Scikit-Learn provides reasonable default values for most parameters, making it easy to create a baseline working system quickly.

This may be fine in some cases e. This is called one-hot encoding, because only one attribute will be equal to 1 (hot), while the others will be 0 (cold). The new attributes are sometimes called dummy attributes. This is very useful when you have categorical attributes with thousands of categories. After one-hot encoding we get a matrix with thousands of columns, and the matrix is full of zeros except for a single 1 per row. This may slow down training and degrade performance. Alternatively, you could replace each category with a learnable low-dimensional vector called an embedding.
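To show what one-hot encoding produces, here is a small NumPy-only sketch. The category names are made up, and this is a hand-rolled illustration of the encoding itself, not Scikit-Learn's own encoder.

```python
import numpy as np

# Hypothetical categorical column with three distinct categories.
categories_column = np.array(["INLAND", "NEAR BAY", "INLAND", "ISLAND"])

# np.unique returns the sorted categories and an integer code per row.
categories, codes = np.unique(categories_column, return_inverse=True)

# Row k of the identity matrix is the one-hot vector for category k.
one_hot = np.eye(len(categories), dtype=int)[codes]

print(categories)
print(one_hot)
```

Each row contains exactly one 1, which is why such matrices are usually stored in a sparse format when there are many categories.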

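Custom Transformers

Although Scikit-Learn provides many useful transformers, you will need to write your own for tasks such as custom cleanup operations or combining specific attributes. Scikit-Learn relies on duck typing rather than inheritance, so all you need is a class that implements three methods: fit(), transform(), and fit_transform(). You can get the last one for free by simply adding TransformerMixin as a base class.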
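As a rough sketch of such a class, the hypothetical `RatioAdder` below appends the ratio of two columns as a new feature. The class name, its parameters, and the sample values are assumptions made for this illustration; only `BaseEstimator` and `TransformerMixin` come from Scikit-Learn.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class RatioAdder(TransformerMixin, BaseEstimator):
    """Appends the ratio of two columns as a new feature (hypothetical helper)."""

    def __init__(self, numerator_ix=0, denominator_ix=1):
        self.numerator_ix = numerator_ix
        self.denominator_ix = denominator_ix

    def fit(self, X, y=None):
        return self  # nothing to learn from the data

    def transform(self, X):
        X = np.asarray(X, dtype=float)
        ratio = X[:, self.numerator_ix] / X[:, self.denominator_ix]
        return np.c_[X, ratio]  # original columns plus the new ratio


# Made-up values: column 0 = total rooms, column 1 = households.
X = np.array([[880.0, 126.0], [7099.0, 1138.0]])
X_extra = RatioAdder().fit_transform(X)  # fit_transform comes from the mixin
print(X_extra)
```

Storing the column indices as constructor arguments keeps them visible as hyperparameters, so they can be changed without editing the class.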

For example, here is a small transformer class that adds the combined attributes we discussed earlier: from sklearn.

Feature Scaling

One of the most important transformations you need to apply to your data is feature scaling.

Note that scaling the target values is generally not required. There are two common ways to get all attributes to have the same scale: min-max scaling and standardization. Min-max scaling (many people call this normalization) is quite simple: values are shifted and rescaled so that they end up ranging from 0 to 1. Standardization is quite different: first it subtracts the mean value (so standardized values always have a zero mean), and then it divides by the standard deviation so that the resulting distribution has unit variance.
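Both scaling methods reduce to one line of arithmetic each. The sketch below applies them to a made-up array with NumPy; it illustrates the formulas rather than Scikit-Learn's scaler classes.

```python
import numpy as np

# Hypothetical feature values.
values = np.array([1.0, 2.0, 3.0, 4.0, 15.0])

# Min-max scaling: shift and rescale so values range from 0 to 1.
min_max = (values - values.min()) / (values.max() - values.min())

# Standardization: subtract the mean, then divide by the standard deviation.
standardized = (values - values.mean()) / values.std()

print(min_max)
print(standardized)
```

Note that standardized values are not bound to a fixed range, whereas min-max scaled values always lie between 0 and 1.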

For example, suppose a district had a median income equal to by mistake. Min-max scaling would then crush all the other values from 0–15 down to 0–0.

As with all the transformations, it is important to fit the scalers to the training data only, not to the full dataset (including the test set). Only then can you use them to transform the training set and the test set (and new data).

Transformation Pipelines

As you can see, there are many data transformation steps that need to be executed in the right order. Fortunately, Scikit-Learn provides the Pipeline class to help with such sequences of transformations.
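A minimal sketch of such a sequence, assuming a tiny made-up numeric dataset: the pipeline is fitted on the training set only, and the fitted steps are then applied unchanged to the test set. The step names `"imputer"` and `"scaler"` are arbitrary labels chosen for this example.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical numeric data with missing values.
X_train = np.array([[1.0, 100.0], [2.0, np.nan], [3.0, 300.0]])
X_test = np.array([[np.nan, 200.0]])

num_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),  # fill missing values
    ("scaler", StandardScaler()),                   # then standardize
])

# Fit on the training set only, then reuse the fitted steps on the test set.
X_train_prepared = num_pipeline.fit_transform(X_train)
X_test_prepared = num_pipeline.transform(X_test)
print(X_test_prepared)
```

Here the test row is filled with the training medians and scaled with the training mean and standard deviation, so no information from the test set leaks into the preparation steps.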

Here is a small pipeline for the numerical attributes: from sklearn. All but the last estimator must be transformers i. In version 0. The constructor requires a list of tuples, where each tuple contains a name, a transformer, and a list of names (or indices) of columns that the transformer should be applied to. Finally, we apply this ColumnTransformer to the housing data: it applies each transformer to the appropriate columns and concatenates the outputs along the second axis (the transformers must return the same number of rows).

When there is such a mix of sparse and dense matrices, the ColumnTransformer estimates the density of the final matrix i. In this example, it returns a dense matrix. We have a preprocessing pipeline that takes the full housing data and applies the appropriate transformations to each column.

Or you can specify "passthrough" if you want the columns to be left untouched. By default, the remaining columns i. If you are using Scikit-Learn 0.

Alternatively, you can use the FeatureUnion class, which can also apply different transformers and concatenate their outputs. However, you cannot specify different columns for each transformer; they all apply to the whole data.

Select and Train a Model

At last! You framed the problem, you got the data and explored it, you sampled a training set and a test set, and you wrote transformation pipelines to clean up and prepare your data for Machine Learning algorithms automatically. You are now ready to select and train a Machine Learning model.

Training and Evaluating on the Training Set

The good news is that thanks to all these previous steps, things are now going to be much simpler than you might think. You now have a working Linear Regression model. This is an example of a model underfitting the training data.

When this happens it can mean that the features do not provide enough information to make good predictions, or that the model is not powerful enough. As we saw in the previous chapter, the main ways to fix underfitting are to select a more powerful model, to feed the training algorithm with better features, or to reduce the constraints on the model.

This model is not regularized, so this rules out the last option. You could try to add more features e. This is a powerful model, capable of finding complex nonlinear relationships in the data (Decision Trees are presented in more detail in Chapter 6). The code should look familiar by now: from sklearn. No error at all? Could this model really be absolutely perfect?

Of course, it is much more likely that the model has badly overfit the data. How can you be sure? Notice that cross-validation allows you to get not only an estimate of the performance of your model, but also a measure of how precise this estimate is i.

You would not have this information if you just used one validation set. But cross-validation comes at the cost of training the model several times, so it is not always possible. However, note that the score on the training set is still much lower than on the validation sets, meaning that the model is still overfitting the training set. Possible solutions for overfitting are to simplify the model, constrain it i. The goal is to shortlist a few (two to five) promising models.

Make sure you save both the hyperparameters and the trained parameters, as well as the cross-validation scores and perhaps the actual predictions as well. This will allow you to easily compare scores across model types, and compare the types of errors they make. You now need to fine-tune them.

Grid Search

One way to do that would be to fiddle with the hyperparameters manually, until you find a great combination of hyperparameter values. This would be very tedious work, and you may not have time to explore many combinations.

All you need to do is tell it which hyperparameters you want it to experiment with, and what values to try out, and it will evaluate all the possible combinations of hyperparameter values, using cross-validation. The RMSE score for this combination is 49,, which is slightly better than the score you got earlier using the default hyperparameter values which was 50, For example, the grid search will automatically find out whether or not to add a feature you were not sure about e.
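The sketch below shows the general shape of such a search with `GridSearchCV`. It is only an illustration under stated assumptions: the data is synthetic, and a `Ridge` regressor with a made-up parameter grid stands in for whatever model and hyperparameters you are actually tuning.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic regression data (assumption: stands in for the prepared dataset).
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

# 3 x 2 = 6 combinations, each evaluated with 5-fold cross-validation.
param_grid = {"alpha": [0.01, 1.0, 100.0], "fit_intercept": [True, False]}
grid_search = GridSearchCV(Ridge(), param_grid, cv=5,
                           scoring="neg_mean_squared_error")
grid_search.fit(X, y)

# Scores are negated MSE values, so flip the sign before taking the root.
best_rmse = np.sqrt(-grid_search.best_score_)
print(grid_search.best_params_, best_rmse)
```

The scoring function is negated because Scikit-Learn's search utilities expect a score where greater is better.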

Randomized Search

The grid search approach is fine when you are exploring relatively few combinations, like in the previous example, but when the hyperparameter search space is large, it is often preferable to use RandomizedSearchCV instead.

Ensemble Methods

Another way to fine-tune your system is to try to combine the models that perform best. We will cover this topic in more detail in Chapter 7.

Analyze the Best Models and Their Errors

You will often gain good insights on the problem by inspecting the best models.

Evaluate Your System on the Test Set

After tweaking your models for a while, you eventually have a system that performs sufficiently well. Now is the time to evaluate the final model on the test set.

You might want to have an idea of how precise this estimate is. It is not the case in this example, but when this happens you must resist the temptation to tweak the hyperparameters to make the numbers look good on the test set; the improvements would be unlikely to generalize to new data. This is important to catch not only sudden breakage, but also performance degradation.

This will generally require a human analysis. These analysts may be field experts, or workers on a crowdsourcing platform such as Amazon Mechanical Turk or CrowdFlower. Sometimes performance will degrade slightly because of a poor quality signal e. Monitoring the inputs is particularly important for online learning systems. Finally, you will generally want to train your models on a regular basis using fresh data.

You should automate this process as much as possible. If your system is an online learning system, you should make sure you save snapshots of its state at regular intervals so you can easily roll back to a previously working state. Hopefully this chapter gave you a good idea of what a Machine Learning project looks like, and showed you some of the tools you can use to train a great system. As you can see, much of the work is in the data preparation step, building monitoring tools, setting up human evaluation pipelines, and automating regular model training.

So, if you have not already done so, now is a good time to pick up a laptop, select a dataset that you are interested in, and try to go through the whole process from A to Z. Try a Support Vector Machine regressor sklearn. How does the best SVR predictor perform? Try adding a transformer in the preparation pipeline to select only the most important attributes.

Try creating a single pipeline that does the full data preparation plus the final prediction. Automatically explore some preparation options using GridSearchCV.

Try It Out!

In Chapter 2 we explored a regression task, predicting housing values, using various algorithms such as Linear Regression, Decision Trees, and Random Forests (which will be explained in further detail in later chapters).

Now we will turn our attention to classification systems. Each image is labeled with the digit it represents. Scikit-Learn provides many helper functions to download popular datasets. MNIST is one of them. You should always create a test set and set it aside before inspecting the data closely. Moreover, some learning algorithms are sensitive to the order of the training instances, and they perform poorly if they get many similar instances in a row. We will explore this in the next chapters.

This is in part because SGD deals with training instances independently, one at a time (which also makes SGD well suited for online learning), as we will see later.

Looks like it guessed right in this particular case!

Performance Measures

Evaluating a classifier is often significantly trickier than evaluating a regressor, so we will spend a large part of this chapter on this topic. There are many performance measures available, so grab another coffee and get ready to learn many new concepts and acronyms!

Implementing Cross-Validation

Occasionally you will need more control over the cross-validation process than what Scikit-Learn provides off-the-shelf. In these cases, you can implement cross-validation yourself; it is actually fairly straightforward. At each iteration the code creates a clone of the classifier, trains that clone on the training folds, and makes predictions on the test fold. Then it counts the number of correct predictions and outputs the ratio of correct predictions. Beats Nostradamus.
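The loop described above can be sketched as follows. This is an illustration under assumptions: the data is synthetic, and `LogisticRegression` stands in for whichever classifier you are evaluating.

```python
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold

# Synthetic binary classification data (assumption, for illustration only).
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

clf = LogisticRegression()
skfolds = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)

accuracies = []
for train_index, test_index in skfolds.split(X, y):
    clone_clf = clone(clf)                          # fresh, unfitted copy
    clone_clf.fit(X[train_index], y[train_index])   # train on the training folds
    y_pred = clone_clf.predict(X[test_index])       # predict on the test fold
    n_correct = np.sum(y_pred == y[test_index])
    accuracies.append(float(n_correct / len(y_pred)))  # ratio of correct predictions

print(accuracies)
```

Stratified folds keep the class ratio roughly the same in every fold, which matters when one class is much rarer than the other.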

This demonstrates why accuracy is generally not the preferred performance measure for classifiers, especially when you are dealing with skewed datasets i. The general idea is to count the number of times instances of class A are classified as class B. For example, to know the number of times the classifier confused images of 5s with 3s, you would look in the 5th row and 3rd column of the confusion matrix.

To compute the confusion matrix, you first need to have a set of predictions, so they can be compared to the actual targets.
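To make the row and column convention concrete, here is a hand-rolled sketch in NumPy with made-up labels and predictions. Rows are actual classes and columns are predicted classes; the helper function is written for this example and is not a library call.

```python
import numpy as np


def confusion_matrix(y_true, y_pred, n_classes):
    """Count how often class i (row) was predicted as class j (column)."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for actual, predicted in zip(y_true, y_pred):
        cm[actual, predicted] += 1
    return cm


# Hypothetical targets and predictions for a binary task.
y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0])
y_pred = np.array([0, 1, 1, 1, 0, 0, 1, 0])

cm = confusion_matrix(y_true, y_pred, n_classes=2)
print(cm)
```

In the binary case, the first row holds the true negatives and false positives, and the second row holds the false negatives and true positives.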

This would not be very useful since the classifier would ignore all but one positive instance. So precision is typically used along with another metric named recall, also called sensitivity or true positive rate (TPR): this is the ratio of positive instances that are correctly detected by the classifier (Equation). If you are confused about the confusion matrix, Figure may help. When it claims an image represents a 5, it is correct only It is often convenient to combine precision and recall into a single metric called the F1 score, in particular if you need a simple way to compare two classifiers.

The F1 score is the harmonic mean of precision and recall (Equation). Whereas the regular mean treats all values equally, the harmonic mean gives much more weight to low values.
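As a worked example, the sketch below computes precision, recall, and F1 from made-up counts, using the standard definitions precision = TP / (TP + FP) and recall = TP / (TP + FN). The counts are assumptions chosen to keep the arithmetic easy to follow.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall and their harmonic mean from raw counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts: 30 true positives, 10 false positives, 30 false negatives.
precision, recall, f1 = precision_recall_f1(tp=30, fp=10, fn=30)
arithmetic_mean = (precision + recall) / 2
print(precision, recall, f1, arithmetic_mean)
```

With a precision of 0.75 and a recall of 0.5, the harmonic mean is 0.6, below the arithmetic mean of 0.625, because the lower value weighs more heavily.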

As a result, the classifier will only get a high F1 score if both recall and precision are high. For each instance, it computes a score based on a decision function, and if that score is greater than a threshold, it assigns the instance to the positive class, or else it assigns it to the negative class.

The figure shows a few digits positioned from the lowest score on the left to the highest score on the right. Conversely, lowering the threshold increases recall and reduces precision. Now how do you decide which threshold to use?

Figure: Precision and recall versus the decision threshold

You may wonder why the precision curve is bumpier than the recall curve in the figure. The reason is that precision may sometimes go down when you raise the threshold (although in general it will go up).
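The effect of the threshold can be reproduced on a handful of made-up decision scores. In this sketch, an instance is classified as positive when its score exceeds the threshold; the scores and labels are assumptions for illustration.

```python
import numpy as np

# Hypothetical decision scores and true labels (1 = positive class).
scores = np.array([-3.0, -1.0, 0.5, 1.0, 2.0, 4.0])
y_true = np.array([0, 0, 1, 0, 1, 1])


def precision_recall_at(threshold):
    """Precision and recall when predicting positive for score > threshold."""
    y_pred = scores > threshold
    tp = np.sum(y_pred & (y_true == 1))
    fp = np.sum(y_pred & (y_true == 0))
    fn = np.sum(~y_pred & (y_true == 1))
    return tp / (tp + fp), tp / (tp + fn)


low = precision_recall_at(0.0)    # lower threshold
high = precision_recall_at(1.5)   # higher threshold
print(low, high)
```

Raising the threshold from 0.0 to 1.5 removes the only false positive, so precision rises from 0.75 to 1.0, but one true positive is lost and recall drops from 1.0 to about 0.67.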

But of course the choice depends on your project. You look up the first plot and find that you need to use a threshold of about 8, Hmm, not so fast. A high-precision classifier is not very useful if its recall is too low! The FPR is the ratio of negative instances that are incorrectly classified as positive.

It is equal to one minus the true negative rate, which is the ratio of negative instances that are correctly classified as negative. The TNR is also called specificity. Hence the ROC curve plots sensitivity (recall) versus 1 - specificity. The dotted line represents the ROC curve of a purely random classifier; a good classifier stays as far away from that line as possible (toward the top-left corner).
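The points of a ROC curve can be computed by sweeping the threshold over the observed scores. The sketch below does this by hand on made-up scores and labels, and also estimates the area under the curve with the trapezoidal rule; it is an illustration, not Scikit-Learn's implementation.

```python
import numpy as np

# Hypothetical scores and true labels (1 = positive class).
scores = np.array([0.1, 0.4, 0.35, 0.8, 0.7, 0.2])
y_true = np.array([0, 0, 1, 1, 1, 0])


def roc_points(scores, y_true):
    """(FPR, TPR) pairs obtained by lowering the threshold step by step."""
    n_pos = np.sum(y_true == 1)
    n_neg = np.sum(y_true == 0)
    points = [(0.0, 0.0)]  # threshold above every score: nothing is positive
    for threshold in np.sort(np.unique(scores))[::-1]:
        y_pred = scores >= threshold
        tpr = np.sum(y_pred & (y_true == 1)) / n_pos  # recall / sensitivity
        fpr = np.sum(y_pred & (y_true == 0)) / n_neg  # 1 - specificity
        points.append((float(fpr), float(tpr)))
    return points


points = roc_points(scores, y_true)

# Area under the curve via the trapezoidal rule.
auc = sum((x1 - x0) * (y0 + y1) / 2
          for (x0, y0), (x1, y1) in zip(points, points[1:]))
print(points, auc)
```

A purely random classifier would give an area of about 0.5, and a perfect one an area of 1.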

One way to compare classifiers is to measure the area under the curve (AUC). As a rule of thumb, you should prefer the PR curve whenever the positive class is rare or when you care more about the false positives than the false negatives, and the ROC curve otherwise. But this is mostly because there are few positives (5s) compared to the negatives (non-5s). In contrast, the PR curve makes it clear that the classifier has room for improvement (the curve could be closer to the top-right corner).

First, you need to get scores for each instance in the training set. Scikit-Learn classifiers generally have one or the other. It is useful to plot the first ROC curve as well to see how they compare (Figure): plt. Not too bad!

Multiclass Classification

Whereas binary classifiers distinguish between two classes, multiclass classifiers (also called multinomial classifiers) can distinguish between more than two classes.

Some algorithms (such as Random Forest classifiers or naive Bayes classifiers) are capable of handling multiple classes directly. Others (such as Support Vector Machine classifiers or Linear classifiers) are strictly binary classifiers.

For example, one way to create a system that can classify the digit images into 10 classes (from 0 to 9) is to train 10 binary classifiers, one for each digit (a 0-detector, a 1-detector, a 2-detector, and so on).

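Then when you want to classify an image, you get the decision score from each classifier for that image and you select the class whose classifier outputs the highest score. This is called the one-versus-all (OvA) strategy (also called one-versus-the-rest). Another strategy is to train a binary classifier for every pair of digits: one to distinguish 0s and 1s, another to distinguish 0s and 2s, and so on. This is called the one-versus-one (OvO) strategy. If there are N classes, you need to train N × (N - 1) / 2 classifiers, which means 45 binary classifiers for the 10 digits.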

When you want to classify an image, you have to run the image through all 45 classifiers and see which class wins the most duels. Some algorithms (such as Support Vector Machine classifiers) scale poorly with the size of the training set, so for these algorithms OvO is preferred since it is faster to train many classifiers on small training sets than training few classifiers on large training sets.
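The two strategies can be illustrated with a few lines of plain Python. The decision scores below are made up; the sketch only shows how the final class is chosen under OvA and how many pairwise classifiers OvO needs.

```python
from itertools import combinations

n_classes = 10

# One-versus-all: one score per class, pick the class with the highest score.
# These scores are hypothetical outputs of ten binary detectors for one image.
ova_scores = [-3.1, -5.0, 0.2, 4.7, -1.0, 2.3, -8.0, -2.2, 0.9, -0.5]
ova_prediction = max(range(n_classes), key=lambda k: ova_scores[k])

# One-versus-one: one classifier per unordered pair of classes.
digit_pairs = list(combinations(range(n_classes), 2))
n_ovo_classifiers = n_classes * (n_classes - 1) // 2

print(ova_prediction, len(digit_pairs), n_ovo_classifiers)
```

Each OvO classifier only needs the training instances of its two classes, which is why the strategy suits algorithms that scale poorly with training set size.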

For most binary classification algorithms, however, OvA is preferred. Then it makes a prediction (a correct one in this case). Under the hood, Scikit-Learn actually trained 10 binary classifiers, got their decision scores for the image, and selected the class with the highest score.

Simply create an instance and pass a binary classifier to its constructor. Now of course you want to evaluate these classifiers. As usual, you want to use cross-validation. Here, we will assume that you have found a promising model and you want to find ways to improve it. One way to do this is to analyze the types of errors it makes.

The 5s look slightly darker than the other digits, which could mean that there are fewer images of 5s in the dataset or that the classifier does not perform as well on 5s as on other digits. In fact, you can verify that both are the case.

Remember that rows represent actual classes, while columns represent predicted classes. The column for class 8 is quite bright, which tells you that many images get misclassified as 8s. As you can see, the confusion matrix is not necessarily symmetrical.

You can also see that 3s and 5s often get confused (in both directions). Analyzing the confusion matrix can often give you insights on ways to improve your classifier. Looking at this plot, it seems that your efforts should be spent on reducing the false 8s. For example, you could try to gather more training data for digits that look like 8s (but are not) so the classifier can learn to distinguish them from real 8s. Or you could preprocess the images e.

Analyzing individual errors can also be a good way to gain insights on what your classifier is doing and why it is failing, but it is more difficult and time-consuming. Some of the digits that the classifier gets wrong i.

All it does is assign a weight per class to each pixel, and when it sees a new image it just sums up the weighted pixel intensities to get a score for each class.

So since 3s and 5s differ only by a few pixels, this model will easily confuse them. If you draw a 3 with the junction slightly shifted to the left, the classifier might classify it as a 5, and vice versa. In other words, this classifier is quite sensitive to image shifting and rotation. This will probably help reduce other errors as well.

Multilabel Classification

Until now each instance has always been assigned to just one class. In some cases you may want your classifier to output multiple classes for each instance. For example, consider a face-recognition classifier: what should it do if it recognizes several people on the same picture? Of course it should attach one tag per person it recognizes. Such a classification system that outputs multiple binary tags is called a multilabel classification system.

The next lines create a KNeighborsClassifier instance (which supports multilabel classification, but not all classifiers do) and we train it using the multiple targets array.

The digit 5 is indeed not large (False) and odd (True). There are many ways to evaluate a multilabel classifier, and selecting the right metric really depends on your project.
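A multiple targets array of the kind used above, with one "large" label and one "odd" label per digit, can be built in a few lines. This sketch uses a handful of made-up digit labels and assumes that "large" means 7 or more; that cutoff is an assumption, since the text here does not state it.

```python
import numpy as np

# Hypothetical digit labels.
y = np.array([5, 0, 4, 1, 9, 2, 7, 8])

y_large = y >= 7          # assumption: "large" means 7, 8, or 9
y_odd = y % 2 == 1

# One row per instance, one column per binary label.
y_multilabel = np.c_[y_large, y_odd]
print(y_multilabel)
```

The first row corresponds to the digit 5 and reads `[False, True]`: not large, and odd.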

For example, one approach is to measure the F1 score for each individual label (or any other binary classifier metric discussed earlier), then simply compute the average score. One simple option is to give each label a weight equal to its support i. It is thus an example of a multioutput classification system. The line between classification and regression is sometimes blurry, such as in this example.

Arguably, predicting pixel intensity is more akin to regression than to classification. Moreover, multioutput systems are not limited to classification tasks; you could even have a system that outputs multiple labels per instance, including both class labels and value labels.

This concludes our tour of classification.

Exercises

1. Write a function that can shift an MNIST image in any direction (left, right, up, or down) by one pixel. Finally, train your best model on this expanded training set and measure its accuracy on the test set.

You should observe that your model performs even better now! This technique of artificially growing the training set is called data augmentation or training set expansion. Tackle the Titanic dataset. A great place to start is on Kaggle. Your preparation pipeline should transform an email into a sparse vector indicating the presence or absence of each possible word.

However, having a good understanding of how things work can help you quickly home in on the appropriate model, the right training algorithm to use, and a good set of hyperparameters for your task. In this chapter, we will start by looking at the Linear Regression model, one of the simplest models there is.

Finally, we will look at two more models that are commonly used for classification tasks: Logistic Regression and Softmax Regression. There will be quite a few math equations in this chapter, using basic notions of linear algebra and calculus.

For those who are truly allergic to mathematics, you should still go through this chapter and simply skip the equations; hopefully, the text will be sufficient to help you understand most of the concepts. More generally, a linear model makes a prediction by simply computing a weighted sum of the input features, plus a constant called the bias term (also called the intercept term), as shown in the corresponding equation. In this book we will use this notation to avoid switching between dot products and matrix multiplications.

Well, recall that training a model means setting its parameters so that the model best fits the training set. For this purpose, we first need a measure of how well (or poorly) the model fits the training data. In practice, it is simpler to minimize the Mean Squared Error (MSE) than the RMSE, and it leads to the same result (because the value that minimizes a function also minimizes its square root).

This is generally because that function is easier to compute, because it has useful differentiation properties that the performance measure lacks, or because we want to constrain the model during training, as we will see when we discuss regularization. This is called the Normal Equation Equation You can use np.
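As a sketch of the closed-form solution, the code below applies the Normal Equation, theta = (XᵀX)⁻¹ Xᵀ y, to synthetic data and compares the result with NumPy's least-squares solver. The data-generating line y = 4 + 3x plus noise is an assumption made for this illustration.

```python
import numpy as np

# Synthetic linear data (assumption): y = 4 + 3x + Gaussian noise.
rng = np.random.default_rng(42)
m = 100
X = 2 * rng.random((m, 1))
y = 4 + 3 * X[:, 0] + rng.normal(scale=0.5, size=m)

# Add the bias feature x0 = 1 to every instance.
X_b = np.c_[np.ones((m, 1)), X]

# Normal Equation: theta = (X^T X)^-1 X^T y
theta_normal = np.linalg.inv(X_b.T @ X_b) @ X_b.T @ y

# Least-squares solver, which does not require inverting X^T X explicitly.
theta_lstsq, *_ = np.linalg.lstsq(X_b, y, rcond=None)

print(theta_normal, theta_lstsq)
```

Both results should be close to the parameters used to generate the data, with small differences caused by the noise.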

This approach is more efficient than computing the Normal Equation, plus it handles edge cases nicely: indeed, the Normal Equation may not work if the matrix XᵀX is not invertible i. The computational complexity of inverting such a matrix is typically about O(n^2.4) to O(n^3), depending on the implementation. In other words, if you double the number of features, you multiply the computation time by roughly 2^2.4 = 5.3 to 2^3 = 8. The SVD approach is about O(n^2): if you double the number of features, you multiply the computation time by roughly 4.

Both the Normal Equation and the SVD approach get very slow when the number of features grows large e. In other words, making predictions on twice as many instances or twice as many features will just take roughly twice as much time.

Now we will look at very different ways to train a Linear Regression model, better suited for cases where there are a large number of features, or too many training instances to fit in memory.

Gradient Descent

Gradient Descent is a very generic optimization algorithm capable of finding optimal solutions to a wide range of problems. The general idea of Gradient Descent is to tweak parameters iteratively in order to minimize a cost function.

A good strategy to get to the bottom of the valley quickly is to go downhill in the direction of the steepest slope.

Figure: Gradient Descent

An important parameter in Gradient Descent is the size of the steps, determined by the learning rate hyperparameter.

If the learning rate is too small, then the algorithm will have to go through many iterations to converge, which will take a long time (see the figure).

Figure: Learning rate too small

On the other hand, if the learning rate is too high, you might jump across the valley and end up on the other side, possibly even higher up than you were before.

This might make the algorithm diverge, with larger and larger values, failing to find a good solution (see the figure).

Figure: Learning rate too large

Finally, not all cost functions look like nice regular bowls. There may be holes, ridges, plateaus, and all sorts of irregular terrains, making convergence to the minimum very difficult. If it starts on the right, then it will take a very long time to cross the plateau, and if you stop too early you will never reach the global minimum.

Figure: Gradient Descent pitfalls

Fortunately, the MSE cost function for a Linear Regression model happens to be a convex function, which means that if you pick any two points on the curve, the line segment joining them never crosses the curve.

This implies that there are no local minima, just one global minimum. It is also a continuous function with a slope that never changes abruptly.

In fact, the cost function has the shape of a bowl, but it can be an elongated bowl if the features have very different scales.

Figure: Gradient Descent with and without feature scaling

Footnote: Technically speaking, the derivative of the cost function is Lipschitz continuous.

It will eventually reach the minimum, but it will take a long time.

When using Gradient Descent, you should ensure that all features have a similar scale e. This diagram also illustrates the fact that training a model means searching for a combination of model parameters that minimizes a cost function over the training set.

This is called a partial derivative. This is why the algorithm is called Batch Gradient Descent: it uses the whole batch of training data at every step. But what if you had used a different learning rate eta? Figure shows the first 10 steps of Gradient Descent using three different learning rates the dashed line represents the starting point.
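A minimal NumPy sketch of Batch Gradient Descent for Linear Regression with the MSE cost function follows. The synthetic data, the learning rate `eta = 0.1`, and the iteration count are assumptions chosen for illustration; the gradient of the MSE is (2/m) Xᵀ(Xθ - y), computed on the full training set at every step.

```python
import numpy as np

# Synthetic linear data (assumption): y = 4 + 3x + Gaussian noise.
rng = np.random.default_rng(42)
m = 100
X = 2 * rng.random((m, 1))
y = 4 + 3 * X[:, 0] + rng.normal(scale=0.5, size=m)
X_b = np.c_[np.ones((m, 1)), X]  # add the bias feature x0 = 1

eta = 0.1            # learning rate (assumed value)
n_iterations = 1000  # assumed value

theta = rng.normal(size=2)  # random initialization
for _ in range(n_iterations):
    # Gradient of the MSE over the whole batch of training data.
    gradients = 2 / m * X_b.T @ (X_b @ theta - y)
    theta = theta - eta * gradients

# Reference solution from a least-squares solver, for comparison.
theta_best, *_ = np.linalg.lstsq(X_b, y, rcond=None)
print(theta, theta_best)
```

With this learning rate the iterates settle on the same parameters as the closed-form solution; a much larger value would make them diverge, and a much smaller one would need far more iterations.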

In the middle, the learning rate looks pretty good: in just a few iterations, it has already converged to the solution. To find a good learning rate, you can use grid search see Chapter 2. However, you may want to limit the number of iterations so that grid search can eliminate models that take too long to converge. You may wonder how to set the number of iterations. If it is too low, you will still be far away from the optimal solution when the algorithm stops, but if it is too high, you will waste time while the model parameters do not change anymore.

If you divide the tolerance by 10 to have a more precise solution, then the algorithm may have to run about 10 times longer.

Stochastic Gradient Descent

The main problem with Batch Gradient Descent is the fact that it uses the whole training set to compute the gradients at every step, which makes it very slow when the training set is large. At the opposite extreme, Stochastic Gradient Descent just picks a random instance in the training set at every step and computes the gradients based only on that single instance.

Obviously this makes the algorithm much faster since it has very little data to manipulate at every iteration. It also makes it possible to train on huge training sets, since only one instance needs to be in memory at each iteration (SGD can be implemented as an out-of-core algorithm).

Over time it will end up very close to the minimum, but once it gets there it will continue to bounce around, never settling down (see the figure).

Figure: Stochastic Gradient Descent

Footnote: Out-of-core algorithms are discussed in Chapter 1.

Therefore randomness is good to escape from local optima, but bad because it means that the algorithm can never settle at the minimum. One solution to this dilemma is to gradually reduce the learning rate. The steps start out large (which helps make quick progress and escape local minima), then get smaller and smaller, allowing the algorithm to settle at the global minimum.

The function that determines the learning rate at each iteration is called the learning schedule. If the learning rate is reduced too quickly, you may get stuck in a local minimum, or even end up frozen halfway to the minimum. If the learning rate is reduced too slowly, you may jump around the minimum for a long time and end up with a suboptimal solution if you halt training too early.

Figure: Stochastic Gradient Descent first 20 steps

Note that since instances are picked randomly, some instances may be picked several times per epoch while others may not be picked at all.
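The sketch below implements Stochastic Gradient Descent with a simple learning schedule in NumPy. The synthetic data, the schedule `t0 / (t + t1)`, and its hyperparameter values are assumptions chosen for illustration; at each step one instance is picked at random and the gradient is computed on that instance alone.

```python
import numpy as np

# Synthetic linear data (assumption): y = 4 + 3x + Gaussian noise.
rng = np.random.default_rng(42)
m = 100
X = 2 * rng.random((m, 1))
y = 4 + 3 * X[:, 0] + rng.normal(scale=0.5, size=m)
X_b = np.c_[np.ones((m, 1)), X]  # add the bias feature x0 = 1

n_epochs = 50
t0, t1 = 5, 50  # learning schedule hyperparameters (assumed values)


def learning_schedule(t):
    """Learning rate that shrinks gradually as the step count t grows."""
    return t0 / (t + t1)


theta = rng.normal(size=2)  # random initialization
for epoch in range(n_epochs):
    for i in range(m):
        idx = rng.integers(m)            # pick one random instance
        xi, yi = X_b[idx], y[idx]
        gradients = 2 * xi * (xi @ theta - yi)  # gradient on that instance only
        eta = learning_schedule(epoch * m + i)
        theta = theta - eta * gradients

print(theta)
```

Because of the random sampling, the result lands near the optimal parameters rather than exactly on them, and a different seed gives slightly different values.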

If you want to be sure that the algorithm goes through every instance at each epoch, another approach is to shuffle the training set, then go through it instance by instance, then shuffle it again, and so on.

However, this generally converges more slowly. The main advantage of Mini-batch GD over Stochastic GD is that you can get a performance boost from hardware optimization of matrix operations, especially when using GPUs. But, on the other hand, it may be harder for it to escape from local minima (in the case of problems that suffer from local minima, unlike Linear Regression as we saw earlier).

Figure shows the paths taken by the three Gradient Descent algorithms in parameter space during training.

Polynomial Regression

What if your data is actually more complex than a simple straight line?

Surprisingly, you can actually use a linear model to fit nonlinear data. A simple way to do this is to add powers of each feature as new features, then train a linear model on this extended set of features. This technique is called Polynomial Regression.
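As a rough illustration of "adding powers of each feature as new features", the following plain-Python sketch expands one instance into every product of its features up to a given degree. The helper name `polynomial_features` is hypothetical and this is not Scikit-Learn's implementation; a linear model would then be trained on the expanded instances.

```python
from itertools import combinations_with_replacement
from math import prod

def polynomial_features(instance, degree):
    """Return every product of the input features up to `degree` (no bias term)."""
    expanded = []
    for d in range(1, degree + 1):
        # each combo is a multiset of feature indices, e.g. (0, 0) means x0 * x0
        for combo in combinations_with_replacement(range(len(instance)), d):
            expanded.append(prod(instance[i] for i in combo))
    return expanded

print(polynomial_features([3.0], 2))        # [3.0, 9.0]
print(polynomial_features([2.0, 3.0], 2))   # [2.0, 3.0, 4.0, 6.0, 9.0]
```

With two features and degree 2, the expansion contains the cross term (here 6.0) as well as the squares, so the number of features grows quickly with the degree.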

This is made possible by the fact that PolynomialFeatures also adds all combinations of features up to the given degree.

Learning Curves

If you perform high-degree Polynomial Regression, you will likely fit the training data much better than with plain Linear Regression. For example, Figure applies a high-degree polynomial model to the preceding training data, and compares the result with a pure linear model and a quadratic model (2nd-degree polynomial).

Figure: High-degree Polynomial Regression

Of course, this high-degree Polynomial Regression model is severely overfitting the training data, while the linear model is underfitting it. The model that will generalize best in this case is the quadratic model. How can you tell that your model is overfitting or underfitting the data? If a model performs well on the training data but generalizes poorly according to the cross-validation metrics, then your model is overfitting. This is one way to tell when a model is too simple or too complex.

To generate the plots, simply train the model several times on different sized subsets of the training set. The following code defines a function that plots the learning curves of a model given some training data: from sklearn.
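The original listing is cut off here, so the following is only a sketch of the idea in plain Python rather than the author's code: it trains a one-feature linear model on growing subsets of the training set and records the training and validation RMSE at each size. All names (`fit_line`, `rmse`, `learning_curves`) and the quadratic toy data are assumptions; plotting the two returned lists against the subset size gives the learning curves.

```python
import math
import random

def fit_line(xs, ys):
    """Least-squares fit of y ~ a + b * x (closed form for one feature)."""
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    var_x = sum((x - mean_x) ** 2 for x in xs)
    b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / var_x if var_x else 0.0
    return mean_y - b * mean_x, b

def rmse(model, xs, ys):
    a, b = model
    return math.sqrt(sum((a + b * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs))

def learning_curves(x_train, y_train, x_val, y_val):
    """Train on the first m instances for m = 1..len(x_train); record both errors."""
    train_errors, val_errors = [], []
    for m in range(1, len(x_train) + 1):
        model = fit_line(x_train[:m], y_train[:m])
        train_errors.append(rmse(model, x_train[:m], y_train[:m]))
        val_errors.append(rmse(model, x_val, y_val))
    return train_errors, val_errors

# Toy data (an assumption): quadratic signal, fitted with a straight line
rng = random.Random(42)
xs = [6 * rng.random() - 3 for _ in range(100)]
ys = [0.5 * x ** 2 + x + 2 + rng.gauss(0, 1) for x in xs]
train_errors, val_errors = learning_curves(xs[:80], ys[:80], xs[80:], ys[80:])
print(train_errors[0], train_errors[-1], val_errors[-1])
```

With a single training instance the model fits it perfectly, so the training error starts at zero; because a straight line cannot follow quadratic data, the training error then climbs to a plateau instead of staying low.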

When the model is trained on very few training instances, it is incapable of generalizing properly, which is why the validation error is initially quite big. Then as the model is shown more training examples, it learns and thus the validation error slowly goes down. However, once again a straight line cannot do a good job modeling the data, so the error ends up at a plateau, very close to the other curve.

These learning curves are typical of an underfitting model. Both curves have reached a plateau; they are close and fairly high. You need to use a more complex model or come up with better features.

However, if you used a much larger training set, the two curves would continue to get closer.

Figure: Learning curves for the polynomial model

One way to improve an overfitting model is to feed it more training data until the validation error reaches the training error.

A high-bias model is most likely to underfit the training data. The only way to reduce this part of the error is to clean up the data e. This is why it is called a tradeoff.

Regularized Linear Models

As we saw in Chapters 1 and 2, a good way to reduce overfitting is to regularize the model i.

For example, a simple way to regularize a polynomial model is to reduce the number of polynomial degrees. For a linear model, regularization is typically achieved by constraining the weights of the model.

We will now look at Ridge Regression, Lasso Regression, and Elastic Net, which implement three different ways to constrain the weights. This forces the learning algorithm to not only fit the data but also keep the model weights as small as possible. Note that the regularization term should only be added to the cost function during training. It is quite common for the cost function used during training to be different from the performance measure used for testing.

It is important to scale the data e. This is true of most regularized models. On the left, plain Ridge models are used, leading to linear predictions. As with Linear Regression, we can perform Ridge Regression either by computing a closed-form equation or by performing Gradient Descent. The pros and cons are the same. Ridge Regression Equation In other words, Lasso Regression automatically performs feature selection and outputs a sparse model i.

Figure: Lasso versus Ridge regularization

On the Lasso cost function, the BGD path tends to bounce across the gutter toward the end. You need to gradually reduce the learning rate in order to actually converge to the global minimum. It is almost always preferable to have at least a little bit of regularization, so generally you should avoid plain Linear Regression.

In general, Elastic Net is preferred over Lasso since Lasso may behave erratically when the number of features is greater than the number of training instances or when several features are strongly correlated. This is called early stopping. Figure shows a complex model (in this case a high-degree Polynomial Regression model) being trained using Batch Gradient Descent.

As the epochs go by, the algorithm learns and its prediction error (RMSE) on the training set naturally goes down, and so does its prediction error on the validation set. However, after a while the validation error stops decreasing and actually starts to go back up. This indicates that the model has started to overfit the training data.

Figure: Early stopping regularization

With Stochastic and Mini-batch Gradient Descent, the curves are not so smooth, and it may be hard to know whether you have reached the minimum or not.

One solution is to stop only after the validation error has been above the minimum for some time (when you are confident that the model will not do any better), then roll back the model parameters to the point where the validation error was at a minimum.
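The stop-then-roll-back procedure can be sketched as follows in plain Python. The polynomial model, the `patience` parameter (how many epochs to wait after the best validation error), and the toy data are assumptions made for this illustration.

```python
import random

def predict(weights, x):
    return sum(w * x ** k for k, w in enumerate(weights))

def mse(weights, data):
    return sum((predict(weights, x) - y) ** 2 for x, y in data) / len(data)

def gd_step(weights, data, eta):
    """One Batch Gradient Descent step on the MSE of a polynomial model."""
    grads = [0.0] * len(weights)
    for x, y in data:
        err = predict(weights, x) - y
        for k in range(len(weights)):
            grads[k] += 2 * err * x ** k / len(data)
    return [w - eta * g for w, g in zip(weights, grads)]

def train_with_early_stopping(train, val, degree=8, eta=0.05,
                              max_epochs=2000, patience=50):
    weights = [0.0] * (degree + 1)
    best_error, best_weights, best_epoch = float("inf"), None, -1
    history = []
    for epoch in range(max_epochs):
        weights = gd_step(weights, train, eta)
        val_error = mse(weights, val)
        history.append(val_error)
        if val_error < best_error:
            # remember a copy of the best parameters seen so far
            best_error, best_weights, best_epoch = val_error, list(weights), epoch
        elif epoch - best_epoch >= patience:
            break   # validation error has stayed above its minimum long enough
    return best_weights, best_epoch, history

# Toy data (an assumption): noisy quadratic on [-1, 1]
rng = random.Random(1)
def make_data(n):
    data = []
    for _ in range(n):
        x = rng.uniform(-1, 1)
        data.append((x, 0.5 * x * x + x + 2 + rng.gauss(0, 0.3)))
    return data

train, val = make_data(20), make_data(20)
best_weights, best_epoch, history = train_with_early_stopping(train, val)
print(best_epoch, min(history))
```

Returning `best_weights` rather than the final weights is the roll-back step: training may continue past the minimum, but the parameters handed back are those from the epoch with the lowest validation error.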

This makes it a binary classifier.

Estimating Probabilities

So how does it work? It is defined as shown in Equation and Figure. Indeed, if you compute the logit of the estimated probability p, you will find that the result is t. The logit is also called the log-odds, since it is the log of the ratio between the estimated probability for the positive class and the estimated probability for the negative class.

Training and Cost Function

Good, now you know how a Logistic Regression model estimates probabilities and makes predictions.
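As a quick check of the relationship between the estimated probability and the logit, here is a small sketch. It assumes the standard logistic (sigmoid) function, whose equation is not reproduced in this text, and a 0.5 decision threshold; the function names are illustrative.

```python
import math

def sigmoid(t):
    """Logistic function: maps a score t to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-t))

def logit(p):
    """Log-odds: log of the ratio between p and 1 - p."""
    return math.log(p / (1.0 - p))

def classify(t, threshold=0.5):
    """Binary prediction: 1 if the estimated probability reaches the threshold."""
    return 1 if sigmoid(t) >= threshold else 0

t = 1.7
p = sigmoid(t)
print(p, logit(p))   # the logit of p recovers t, up to rounding
```

Since `sigmoid(0)` is exactly 0.5, thresholding the probability at 0.5 is the same as checking whether the score t is positive or negative.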

But how is it trained? This idea is captured by the cost function shown in Equation for a single training instance x. On the other hand, −log(t) is close to 0 when t is close to 1, so the cost will be close to 0 if the estimated probability is close to 0 for a negative instance or close to 1 for a positive instance, which is precisely what we want.
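The behavior of this cost can be checked numerically. The sketch below assumes the usual form of the per-instance cost, −log(p) for a positive instance and −log(1 − p) for a negative one, and averages it over a set of instances; the function names are illustrative.

```python
import math

def instance_cost(p, y):
    """Cost of estimating probability p for an instance whose true label is y (0 or 1)."""
    return -math.log(p) if y == 1 else -math.log(1.0 - p)

def log_loss(probabilities, labels):
    """Average cost over all instances."""
    return sum(instance_cost(p, y) for p, y in zip(probabilities, labels)) / len(labels)

print(instance_cost(0.99, 1))   # confident and right: small cost
print(instance_cost(0.01, 1))   # confident and wrong: large cost
print(log_loss([0.9, 0.2, 0.7], [1, 0, 1]))
```

The cost grows without bound as the estimated probability for the true class approaches 0, which is what pushes the model away from confident mistakes.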

It can be written in a single expression (as you can verify easily), called the log loss, shown in Equation. Once you have the gradient vector containing all the partial derivatives you can use it in the Batch Gradient Descent algorithm. For Stochastic GD you would of course just take one instance at a time, and for Mini-batch GD you would use a mini-batch at a time. This is a famous dataset that contains the sepal and petal length and width of iris flowers of three different species: Iris-Setosa, Iris-Versicolor, and Iris-Virginica (see Figure).

Estimated probabilities and decision boundary The petal width of Iris-Virginica flowers represented by triangles ranges from 1.

In between these extremes, the classifier is unsure. Therefore, there is a decision boundary at around 1. Note that it is a linear boundary. The hyperparameter controlling the regularization strength of a Scikit-Learn LogisticRegression model is not alpha (as in other linear models), but its inverse: C.

The higher the value of C, the less the model is regularized.

Softmax Regression

The Logistic Regression model can be generalized to support multiple classes directly, without having to train and combine multiple binary classifiers (as discussed in Chapter 3). The idea is quite simple: when given an instance x, the Softmax Regression model first computes a score s_k(x) for each class k, then estimates the probability of each class by applying the softmax function (also called the normalized exponential) to the scores.
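A minimal sketch of the softmax step, assuming the per-class scores have already been computed (the score values below are made up for illustration):

```python
import math

def softmax(scores):
    """Turn a list of class scores into probabilities that sum to 1."""
    shift = max(scores)   # subtracting the max avoids overflow without changing the result
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def predict_class(scores):
    """Index of the class with the highest estimated probability."""
    probs = softmax(scores)
    return max(range(len(probs)), key=probs.__getitem__)

scores = [1.0, 3.0, 0.5]   # hypothetical scores for three classes
print(softmax(scores))
print(predict_class(scores))
```

Because the exponential is increasing, the class with the highest probability is always the class with the highest score, so the prediction itself does not require computing the softmax.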

The scores are generally called logits or log-odds (although they are actually unnormalized log-odds). Just like the Logistic Regression classifier, the Softmax Regression classifier predicts the class with the highest estimated probability (which is simply the class with the highest score), as shown in Equation. The Softmax Regression classifier predicts only one class at a time i.

You cannot use it to recognize multiple people in one picture. The objective is to have a model that estimates a high probability for the target class (and consequently a low probability for the other classes). Minimizing the cost function shown in Equation, called the cross entropy, should lead to this objective because it penalizes the model when it estimates a low probability for a target class. In general, it is either equal to 1 or 0, depending on whether the instance belongs to the class or not.

Cross Entropy

Cross entropy originated from information theory. Suppose you want to efficiently transmit information about the weather every day. If there are eight options (sunny, rainy, etc. Cross entropy measures the average number of bits you actually send per option. If your assumption about the weather is perfect, cross entropy will just be equal to the entropy of the weather itself i.
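The eight-option weather example can be worked through numerically. In this sketch the "mostly sunny" distribution is a made-up assumption; `p` stands for the true weather distribution and `q` for the distribution you assumed when designing the code.

```python
import math

def entropy(p):
    """Average number of bits per message with a code tuned to p itself."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """Average number of bits per message when the code is tuned to q but the truth is p."""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

uniform = [1 / 8] * 8                   # eight equally likely options
mostly_sunny = [0.8] + [0.2 / 7] * 7    # hypothetical: sunny 80% of the time

print(entropy(uniform))                      # 3 bits per option
print(entropy(mostly_sunny))                 # fewer bits are needed on average
print(cross_entropy(mostly_sunny, uniform))  # about 3 bits: the uniform code wastes bits
```

When `q` equals `p` the two quantities coincide; otherwise the cross entropy is larger, and the difference is the cost of the wrong assumption.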

For more details, check out this video. Notice that the decision boundaries between any two classes are linear. The figure also shows the probabilities for the Iris-Versicolor class, represented by the curved lines e. What Linear Regression training algorithm can you use if you have a training set with millions of features?

Suppose the features in your training set have very different scales. What can you do about it? Can Gradient Descent get stuck in a local minimum when training a Logistic Regression model? Do all Gradient Descent algorithms lead to the same model provided you let them run long enough?

Suppose you use Batch Gradient Descent and you plot the validation error at every epoch. If you notice that the validation error consistently goes up, what is likely going on? How can you fix this? Which Gradient Descent algorithm among those we discussed will reach the vicinity of the optimal solution the fastest?

Which will actually converge? How can you make the others converge as well? Suppose you are using Polynomial Regression. You plot the learning curves and you notice that there is a large gap between the training error and the validation error. What is happening? What are three ways to solve this?

Suppose you are using Ridge Regression and you notice that the training error and the validation error are almost equal and fairly high. Would you say that the model suffers from high bias or high variance?

This chapter will explain the core concepts of SVMs, how to use them, and how they work. Figure shows part of the iris dataset that was introduced at the end of Chapter 4. The two classes can clearly be separated easily with a straight line (they are linearly separable). The left plot shows the decision boundaries of three possible linear classifiers. The model whose decision boundary is represented by the dashed line is so bad that it does not even separate the classes properly.

The other two models work perfectly on this training set, but their decision boundaries come so close to the instances that these models will probably not perform as well on new instances. You can think of an SVM classifier as fitting the widest possible street (represented by the parallel dashed lines) between the classes. This is called large margin classification. These instances are called the support vectors (they are circled in Figure). SVMs are sensitive to the feature scales, as you can see in Figure: on the left plot, the vertical scale is much larger than the horizontal scale, so the widest possible street is close to horizontal.

After feature scaling e.

Figure: Sensitivity to feature scales

Soft Margin Classification

If we strictly impose that all instances be off the street and on the right side, this is called hard margin classification.

Figure shows the iris dataset with just one additional outlier: on the left, it is impossible to find a hard margin, and on the right the decision boundary ends up very different from the one we saw in Figure without the outlier, and it will probably not generalize as well.

Figure: Hard margin sensitivity to outliers

To avoid these issues it is preferable to use a more flexible model.

The objective is to find a good balance between keeping the street as large as possible and limiting the margin violations i. This is called soft margin classification. Figure shows the decision boundaries and margins of two soft margin SVM classifiers on a nonlinearly separable dataset. On the left, using a low C value, the margin is quite large, but many instances end up on the street.

On the right, using a high C value, the classifier makes fewer margin violations but ends up with a smaller margin.

Figure: Large margin (left) versus fewer margin violations (right)

If your SVM model is overfitting, you can try regularizing it by reducing C.

The resulting model is represented on the left of Figure. The LinearSVC class regularizes the bias term, so you should center the training set first by subtracting its mean. This is automatic if you scale the data using the StandardScaler. Moreover, make sure you set the loss hyperparameter to "hinge", as it is not the default value.

Finally, for better performance you should set the dual hyperparameter to False, unless there are more features than training instances (we will discuss duality later in the chapter). One approach to handling nonlinear datasets is to add more features, such as polynomial features (as you did in Chapter 4); in some cases this can result in a linearly separable dataset.

Consider the left plot in Figure: it represents a simple dataset with just one feature x1. This dataset is not linearly separable, as you can see.

Figure: Linear SVM classifier using polynomial features

Polynomial Kernel

Adding polynomial features is simple to implement and can work great with all sorts of Machine Learning algorithms (not just SVMs), but at a low polynomial degree it cannot deal with very complex datasets, and with a high polynomial degree it creates a huge number of features, making the model too slow.

Fortunately, when using SVMs you can apply an almost miraculous mathematical technique called the kernel trick (it is explained in a moment). It makes it possible to get the same result as if you added many polynomial features, even with very high-degree polynomials, without actually having to add them.

This trick is implemented by the SVC class. On the right is another SVM classifier using a 10th-degree polynomial kernel. Conversely, if it is underfitting, you can try increasing it.

The hyperparameter coef0 controls how much the model is influenced by high-degree polynomials versus low-degree polynomials.

Figure: SVM classifiers with a polynomial kernel

A common approach to find the right hyperparameter values is to use grid search (see Chapter 2). It is often faster to first do a very coarse grid search, then a finer grid search around the best values found.

Adding Similarity Features

Another technique to tackle nonlinear problems is to add features computed using a similarity function that measures how much each instance resembles a particular landmark.

Now we are ready to compute the new features. As you can see, it is now linearly separable. The simplest approach is to create a landmark at the location of each and every instance in the dataset. This creates many dimensions and thus increases the chances that the transformed training set will be linearly separable. The downside is that a training set with m instances and n features gets transformed into a training set with m instances and m features assuming you drop the original features.
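As an illustration of similarity features, the sketch below uses a Gaussian RBF as the similarity function. The choice of function, the landmark positions (−2 and 1), the value γ = 0.3, and the sample instances are all assumptions made for this example. Each instance is replaced by its similarity to every landmark.

```python
import math

def gaussian_rbf(x, landmark, gamma):
    """Similarity in (0, 1]: equal to 1 at the landmark, decaying with distance."""
    return math.exp(-gamma * (x - landmark) ** 2)

def similarity_features(xs, landmarks, gamma):
    """Replace each one-dimensional instance by its similarity to every landmark."""
    return [[gaussian_rbf(x, l, gamma) for l in landmarks] for x in xs]

xs = [-1.0, 0.0, 2.0]      # hypothetical one-feature instances
landmarks = [-2.0, 1.0]    # hypothetical landmarks
features = similarity_features(xs, landmarks, gamma=0.3)
for x, row in zip(xs, features):
    print(x, [round(v, 2) for v in row])
```

Using every training instance as a landmark amounts to calling `similarity_features(xs, xs, gamma)`, which yields one new feature per instance, which is the m instances by m features transformation described above.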

If your training set is very large, you end up with an equally large number of features.

Gaussian RBF Kernel

Just like the polynomial features method, the similarity features method can be useful with any Machine Learning algorithm, but it may be computationally expensive to compute all the additional features, especially on large training sets. However, once again the kernel trick does its SVM magic: it makes it possible to obtain a similar result as if you had added many similarity features, without actually having to add them.

For example, some kernels are specialized for specific data structures. With so many kernels to choose from, how can you decide which one to use? If the training set is not too large, you should try the Gaussian RBF kernel as well; it works well in most cases. The algorithm takes longer if you require a very high precision.

In most classification tasks, the default tolerance is fine. This algorithm is perfect for complex but small or medium training sets. However, it scales well with the number of features, especially with sparse features i. In this case, the algorithm scales roughly with the average number of nonzero features per instance.

There is little regularization on the left plot i. In this chapter, we will use a different convention, which is more convenient and more common when you are dealing with SVMs: the bias term will be called b and the feature weights vector will be called w. No bias feature will be added to the input feature vectors.

Figure: Decision function for the iris dataset

The dashed lines represent the points where the decision function is equal to 1 or −1: they are parallel and at equal distance to the decision boundary, forming a margin around it.

Training a linear SVM classifier means finding the value of w and b that make this margin as wide as possible while avoiding margin violations (hard margin) or limiting them (soft margin). In other words, dividing the slope by 2 will multiply the margin by 2.
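The relationship between the slope and the margin can be checked directly. This sketch assumes the convention used in this chapter (decision function w·x + b, margin edges where it equals −1 and +1), which gives a margin width of 2 / ‖w‖; the weight values are made up.

```python
import math

def decision_function(w, b, x):
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def margin_width(w):
    """Distance between the lines where the decision function equals -1 and +1."""
    return 2.0 / math.sqrt(sum(wi * wi for wi in w))

w = [1.5, 2.0]                    # hypothetical weight vector, norm 2.5
half_w = [wi / 2 for wi in w]
print(margin_width(w))            # 0.8
print(margin_width(half_w))       # 1.6: halving the slope doubles the margin
```

This is why training tries to keep the weight vector small: a smaller norm means a wider street.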

The smaller the weight vector w, the larger the margin. However, if we also want to avoid any margin violation (hard margin), then we need the decision function to be greater than 1 for all positive training instances, and lower than −1 for negative training instances. Optimization algorithms work much better on differentiable functions. We now have two conflicting objectives: making the slack variables as small as possible to reduce the margin violations, and making ½ wᵀw as small as possible to increase the margin.

This gives us the constrained optimization problem in Equation. Many off-the-shelf solvers are available to solve QP problems using a variety of techniques that are outside the scope of this book.

So one way to train a hard margin linear SVM classifier is just to use an off-the-shelf QP solver by passing it the preceding parameters. Similarly, you can use a QP solver to solve the soft margin problem (see the exercises at the end of the chapter).

Luckily, the SVM problem happens to meet these conditions, so you can choose to solve the primal problem or the dual problem; both will have the same solution.

Equation shows the dual form of the linear SVM objective (if you are interested in knowing how to derive the dual problem from the primal problem, see Appendix C).

So what is this kernel trick anyway?

Kernelized SVM

Suppose you want to apply a 2nd-degree polynomial transformation to a two-dimensional training set (such as the moons training set), then train a linear SVM classifier on the transformed training set.

However, in Machine Learning, vectors are frequently represented as column vectors i. To remain consistent with the rest of the book, we will use this notation here, ignoring the fact that this technically results in a single-cell matrix rather than a scalar value.

The result will be strictly the same as if you went through the trouble of actually transforming the training set then fitting a linear SVM algorithm, but this trick makes the whole process much more computationally efficient.
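For the 2nd-degree case this equality is easy to verify. The sketch below assumes the standard 2nd-degree polynomial mapping φ(x) = (x1², √2·x1·x2, x2²) for two-dimensional vectors, which is not spelled out in the surviving text, and compares the dot product of the transformed vectors with the square of the original dot product.

```python
import math

def phi(x):
    """Assumed 2nd-degree polynomial mapping of a two-dimensional vector."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def poly2_kernel(a, b):
    """Same value as dot(phi(a), phi(b)), computed without transforming a or b."""
    return dot(a, b) ** 2

a, b = (2.0, 3.0), (3.0, 5.0)
print(dot(phi(a), phi(b)))   # explicit transformation, then dot product
print(poly2_kernel(a, b))    # kernel on the original vectors: 441.0
```

The kernel needs only one dot product in the original two-dimensional space, whereas the explicit route first builds the three-dimensional vectors; the gap widens rapidly for higher degrees and more features.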

This is the essence of the kernel trick. Equation lists some of the most commonly used kernels. There is still one loose end we must tie. But how can you make predictions without knowing w?

This makes it possible to use the kernel trick, once again (Equation). Of course, you also need to compute the bias term b, using the same trick (Equation). Unfortunately it converges much more slowly than the methods based on QP.

In this chapter, you will go through an example project end to end, pretending to be a recently hired data scientist in a real estate company. When you are learning about Machine Learning it is best to actually experiment with real-world data, not just artificial datasets.

Fortunately, there are thousands of open datasets to choose from, ranging across all sorts of domains. Here are a few places you can look to get data: This dataset was based on data from the California census. It is not exactly recent (you could still afford a nice house in the Bay Area at the time), but it has many qualities

The second section covers machine learning basics and Scikit-learn library. It also explains supervised learning, unsupervised learning, implementation, and classification of regression algorithms, and ensemble learning methods in an easy manner with theoretical and practical lessons.

The third section explains complex neural network architectures with details on internal working and implementation of convolutional neural networks. The final chapter contains a detailed end-to-end solution with neural networks in Pytorch. After completing Hands-on Machine Learning with Python , you will be able to implement machine learning and neural network solutions and extend them to your advantage.


In , Geoffrey Hinton et al. Training a deep neural net was widely considered impossible at the time,2 and most researchers had abandoned the idea in the late s. This paper revived the interest of the scientific community, and before long many new papers demonstrated that Deep Learning was not only possible but capable of mind-blowing achievements that no other Machine Learning ML technique could hope to match with the help of tremendous computing power and great amounts of data.

This enthusiasm soon extended to many other areas of Machine Learning. Before you know it, it will be driving your car. Perhaps you would like to give your homemade robot a brain of its own? Make it recognize faces? Or learn to walk around? Or maybe your company has tons of data (user logs, financial data, production data, machine sensor data, hotline stats, HR reports, etc.). With Machine Learning, you could accomplish the following and more. Whatever the reason, you have decided to learn Machine Learning and implement it in your projects.

Great idea! This book assumes that you know close to nothing about Machine Learning. Its goal is to give you the concepts, tools, and intuition you need to implement programs capable of learning from data. We will cover a large number of techniques, from the simplest and most commonly used (such as Linear Regression) to some of the Deep Learning techniques that regularly win competitions.

Rather than implementing our own toy versions of each algorithm, we will be using production-ready Python frameworks. Scikit-Learn is very easy to use, yet it implements many Machine Learning algorithms efficiently, so it makes for a great entry point to learning Machine Learning.

TensorFlow is a more complex library for distributed numerical computation. It makes it possible to train and run very large neural networks efficiently by distributing the computations across potentially hundreds of multi-GPU (graphics processing unit) servers.

It was open-sourced in November. TensorFlow comes with its own implementation of this API, called tf.keras, which provides support for some advanced TensorFlow features e. The book favors a hands-on approach, growing an intuitive understanding of Machine Learning through concrete working examples and just a little bit of theory. The official tutorial on Python.

If you have never used Jupyter, Chapter 2 will guide you through installation and the basics: it is a powerful tool to have in your toolbox. There is also a quick math tutorial for linear algebra.

