Let's imagine starting an advertising campaign for a product: first of all, we choose data on customers who have purchased that product in the past. At this point we build a profile of people potentially interested in purchasing that product, and once the target has been clarified, it will be easier for us to plan a campaign based only on the available data. To put it in general terms, with the term machine learning we refer to the automatic identification of models for the extraction of information from a large amount of data.

As is well known, we can divide learning techniques into supervised and unsupervised. In the following experiments we will always refer to supervised techniques, meaning that during the training phase we will compare the models against the actual label of each instance. In order to achieve greater accuracy, it is a good idea to take advantage of ensemble learning methods, such as boosting. These meta-algorithms are a good way to obtain a high-performing final classifier, iteratively extracting simple classifiers (also called weak learners) and associating the right weight with their votes. What we expect to achieve by applying this technique successfully is greater accuracy in the classification of instances.

The aim is to evaluate the performance of different learning algorithms in classifying instances. All classifiers were analyzed individually and then boosted with the Adaptive Boosting meta-algorithm by comparing the results obtained. This document is organized as follows: Section 2 describes the predictors used and the operating principles of AdaBoost. Section 3 reports the characteristics of the dataset. Experimentation, method, and related analysis are described in Section 4. Finally, the appendix shows the structure of the files and folders within which the code that was developed for the tests is contained.

Parametric classifiers such as the perceptron were used for the analysis. Chosen because it is simple and cheap, the perceptron builds the classifier incrementally, using greedy techniques to obtain the classifier that minimizes the training error. It provides good performance, especially when the instances are linearly separable. As a non-parametric classifier, the decision tree seemed like a good choice. It incrementally constructs an acyclic graph, characterized by a root and by leaf nodes labeled, in our case, with binary labels. To map the instances from one node to the next, a test function t: x_i → {1, …, k} is used, which maps the value that the instance takes on the i-th attribute to one of the possible child nodes. Intuitively, the decision tree tells what the algorithm has learned on the training set.

Both decision trees and perceptrons, together with decision stumps, can be used as weak learners within ensemble learning techniques such as bagging or boosting. If desired, we can see decision stumps as decision trees with only one level of depth. In summary, the following basic classifiers have been implemented:

- Decision tree
- Decision Stumps
- Perceptron

In order to create a sophisticated classifier, these basic classifiers have been boosted with the Adaptive Boosting algorithm.

Boosting is a way of building algorithms rather than an algorithm itself. We can think of boosting as a technique to incrementally extract members of a committee, get them to vote, and give each vote the right weight. The final prediction will be given by the combination of the votes of the individual members. The basic idea is to build a model that errs only a little by relying solely on a sequence of simple and cheap predictors, each of which underfits a little.

As mentioned, the idea is to combine the votes of several weak learners in order to create a classifier with better accuracy. The output will be given by the sign of the sum of the weighted predictions of the individual models, in accordance with the following formula:

f(x) = sgn( ∑_{t=1}^{T} w_t h_t(x) )

where h_t is the t-th weak learner and w_t the weight of its vote.

During model training, the instances are reweighed and the algorithm gives greater weight to the misclassified instances, in the hope that the subsequent model will be more expert on them. In figure 1 we report the pseudocode of the algorithm. As you can see, at each step the base classifier is invoked, its weighted error ε_t is estimated, and a new weight w_t is calculated. At each iteration, the condition on the error (ε_t ≠ 1/2) and the new probability distribution guarantee an appropriate choice of the next basic classifier. The iteration stops when the value T is reached, or if the algorithm cannot find an ε_t far from 1/2. Compliance with this second condition guarantees a rapid descent of the training error êr(h).
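The loop just described can be sketched in a few lines. This is a minimal Python illustration (the project code is in R), in which `weak_learner(X, y, D)` is a hypothetical helper assumed to return a hypothesis h, i.e. a function mapping instances to labels in {-1, +1}, trained on the current distribution D:

```python
import numpy as np

def adaboost(X, y, weak_learner, T):
    # Sketch of AdaBoost; `weak_learner(X, y, D)` is an assumed helper that
    # returns a hypothesis h (a function from instances to {-1, +1}).
    m = len(y)
    D = np.full(m, 1.0 / m)                 # start from the uniform distribution
    hypotheses, weights = [], []
    for t in range(T):
        h = weak_learner(X, y, D)
        pred = h(X)
        eps = D[pred != y].sum()            # weighted training error of h
        if eps >= 0.5:                      # no hypothesis far from 1/2: stop
            break
        if eps == 0:                        # perfect hypothesis: use it alone
            hypotheses, weights = [h], [1.0]
            break
        w = 0.5 * np.log((1 - eps) / eps)   # weight of h's vote
        hypotheses.append(h)
        weights.append(w)
        D = D * np.exp(-w * y * pred)       # boost weight of misclassified points
        D = D / D.sum()                     # renormalize to a distribution
    # final classifier: sign of the weighted sum of the individual votes
    return lambda Xq: np.sign(sum(w * h(Xq) for w, h in zip(weights, hypotheses)))
```

The stopping test on ε_t ≥ 1/2 mirrors the condition discussed above: a classifier no better than a coin flip would receive a non-positive vote weight, so the extraction stops.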

The perceptron is a linear classifier that incrementally builds models that can be seen as hyperplanes of the type w^T x − c = 0. Training with the perceptron is simple, and the model is updated only on misclassified examples. These classifiers may have a low variance error, but they bet that the Bayes optimal predictor is rigid enough; otherwise they have to settle for a very high bias error. In any case, the best result we can obtain on the convergence of the perceptron (valid for both linearly separable and non-separable training sets) is a mistake bound of the form:

M_T ≤ min_u [ ∑_t h_t(u) + (U X)² + U X √( ∑_t h_t(u) ) ]

It means that we cannot go beyond these performances. Here M_T indicates the number of updates the perceptron must make before converging, and h_t(u) is the hinge loss. U is the radius of the sphere that contains all of our models and X is the radius of the ball that contains our instances. As mentioned, the formula provides an upper bound on the number of updates we need to make before the model converges.
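To make the update rule concrete, here is a minimal Python sketch of the perceptron just described (the report's implementation is in R); the hyperplane is w^T x − c = 0, labels are in {-1, +1}, and the epoch count is illustrative:

```python
import numpy as np

def perceptron_train(X, y, epochs=10):
    # Sketch of the perceptron: update only on misclassified examples.
    w = np.zeros(X.shape[1])
    c = 0.0
    updates = 0                              # number of updates made so far
    for _ in range(epochs):
        for x, label in zip(X, y):
            if label * (w @ x - c) <= 0:     # example misclassified (or on the plane)
                w = w + label * x            # rotate the hyperplane toward x
                c = c - label                # adjust the threshold
                updates += 1
    return w, c, updates
```

On a linearly separable sample the inner loop eventually stops updating; on a non-separable one (as with our dataset, see below) the update counter keeps growing with the number of epochs.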

Based on the results of previous medical tests, we want to find a correlation with the incidence of benign or malignant breast tumors in female patients.

The available dataset is Wisconsin Breast Cancer. This is data provided by the University of Wisconsin (US) and freely accessible from the UCI machine learning repository. The data refer to results acquired by hospitals during medical examinations, and we want to find a correlation with the incidence of malignant tumors in the subjects examined. The dataset to date has 680 instances, as opposed to the 367 that constituted the original dataset. Over the years, other examples have contributed to its extension until it reached its present size. We can see each of these instances as a vector v ∈ ℝ^d, i.e. a point in a d-dimensional space with d = 9 features. Each attribute varies in the range 1–10 and, it should be immediately clear, represents the result of a medical examination. The last column contains the labels. As mentioned, in supervised learning each example has its own label, which for convenience we indicate with y, as opposed to the label ŷ predicted by our algorithms. Each instance belongs to one of 2 possible classes:

- -1 for benign tumor
- +1 for malignant tumor

In summary, we have 680 instances with 9 attributes, each representing a clinical exam and binary labels that take value -1 (if benign tumor) or +1 (if malignant).

Features list:

- Clump thickness
- Uniformity of cell size
- Uniformity of cell shape
- Marginal adhesion
- Single epithelial cell size
- Bare nuclei
- Bland chromatin
- Normal nucleoli
- Mitoses

Each of these attributes takes on a numerical value in a range [1-10]. Class distribution:

- Benign tumors: 458 (65.5%)
- Malignant tumors: 241 (34.5%)

As mentioned, benign tumors have a -1 label; malignant tumors are found in the database labeled +1.

The dataset varies over a range of integer values with binary labels. This suggests using the algorithms explained in the second section to solve binary classification problems. The slightly unbalanced distribution of classes in favor of one label over the other (65.5% vs 34.5%) suggests implementing evaluation metrics such as precision and recall.
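These two metrics can be computed directly from the confusion-matrix counts. As a minimal Python sketch for labels in {-1, +1}, treating +1 (malignant) as the positive class (the function name is ours):

```python
def precision_recall(y_true, y_pred, positive=1):
    # Precision = TP / (TP + FP): fraction of predicted positives that are real.
    # Recall    = TP / (TP + FN): fraction of real positives that are found.
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp > 0 else 0.0
    recall = tp / (tp + fn) if tp + fn > 0 else 0.0
    return precision, recall
```

With an unbalanced class distribution, plain accuracy can look good while missing many malignant cases; precision and recall expose exactly that failure mode.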

All the code is written in the R language. R is a programming language and a development environment specific to statistical data analysis. Distributed in its standard version with a command-line interface, it supports various graphical environments (such as RStudio) that allow you to extend R with new packages. It was chosen for its flexibility and ease of use in handling large datasets, and it offers good performance even when working with vectors and matrices.

For the tests, the Breast Cancer dataset was divided into two parts: training and test set. As the test set, a portion equal to 15% of the entire dataset (102 examples) was chosen, and the remaining part was divided between training and validation sets. Cross-validation was used to tune the parameters; it consists of dividing the training set into k parts of equal size, to be used in turn as training and validation sets. For the experimentation, the parameter k = 10 was chosen. Thus, at each step, one tenth of the dataset forms the validation set and the algorithm is trained on the remainder. This technique ensures that the entire training set (alternating phases) is used both to generate the model and to validate it.
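The split just described can be sketched as follows; this is a Python illustration of k-fold index generation (the actual experiments were run in R, and the seed is illustrative):

```python
import random

def kfold_splits(n, k=10, seed=0):
    # Shuffle the indices 0..n-1 and partition them into k folds of
    # (nearly) equal size; yield one (train, validation) pair per fold.
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        train = [j for h, fold in enumerate(folds) if h != i for j in fold]
        yield train, validation
```

Over the k rounds, every index appears exactly once in a validation fold, so the whole training set is used both to fit the model and to validate it, as stated above.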

Let us recall what we want to demonstrate: that the risk of a boosted classifier is lower than that of the individual weak learners. A parametric classification algorithm such as the perceptron can have a very high bias error, while tree classifiers suffer from overfitting when the number of nodes grows out of control.

First, let's see empirically how overfitting arises in the family of tree predictors. The parameters of the dataset are the same as specified in section 4. We have related the classification error to the number of nodes (or the depth) of the tree. The chart below shows a training error (blue line) that goes down the more we train our model.

Intuitively, this is what we expect: if the algorithm trains a lot on the data, the error it makes in classifying the same instances will be small. As the depth of the tree (x-axis) increases, the error decreases. On the ordinate axis, we used the cross-entropy as an index of the error, calculated as:

−(1/N) ∑_{i=1}^{N} log₂ q(y_i)

where N is the size of the training set and q the probability measure. The test error er(h), which is ultimately what interests us, decreases in parallel with the training error in the first part and then increases towards the end. Its minimum value is therefore not found at arg min_h êr(h), but at the minimum of the cross-validation error er_cv (green line). Beyond this value, we are overfitting. It means that by training the model too much and letting the tree grow out of all proportion, the result we get will be worse. This is also certified by the risk bound formula, of the form

er(h) ≤ êr(h) + √( (N_h ln d + ln(1/δ)) / (2m) )

(with N_h the number of nodes of the tree, d the number of attributes, m the training set size and δ the confidence parameter)

which, as we know, penalizes a lot when the number of nodes grows out of control. So, if we appropriately choose the attribute on which to split, after only 4 levels of depth the model already guarantees that the risk is minimized. The technique chosen in this test for identifying the attribute on which to grow the tree is Information Gain. It represents the expected reduction of entropy resulting from the partitioning of the examples of the training set. To calculate it, it is enough to implement the following formula in R code:

G(S, t) = E(S) − ∑_{j=1}^{k} (|S_j| / |S|) E(S_j)

where t is the test used for the split and E(S) represents the entropy, calculated as −p log₂ p − n log₂ n; here p and n represent the proportions of positive and negative examples. The criterion based on Information Gain chooses the test t that maximizes the gain G(S, t). In summary, if the choice of the test is not the optimal one, the generated tree will not be as short as possible and we will pay something in the risk bound seen before. This means that the curve of Figure 2 will approximate the function e^(−x) less well. Below are the evaluation indices for decision trees with 4 depth levels.
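The entropy and gain computation just described can be sketched as follows (a Python illustration; the report's implementation is in R). Here each partition is represented simply as the list of labels that fall into that child node:

```python
import math

def entropy(pos, neg):
    # Binary entropy -p log2 p - n log2 n, where p and n are the
    # proportions of positive and negative examples in the node.
    total = pos + neg
    e = 0.0
    for count in (pos, neg):
        if count > 0:
            q = count / total
            e -= q * math.log2(q)
    return e

def information_gain(labels, partitions):
    # G(S, t) = E(S) - sum_j |S_j|/|S| * E(S_j), with labels in {-1, +1}
    # and `partitions` the lists of labels produced by the test t.
    pos = sum(1 for y in labels if y == 1)
    gain = entropy(pos, len(labels) - pos)
    for part in partitions:
        p = sum(1 for y in part if y == 1)
        gain -= len(part) / len(labels) * entropy(p, len(part) - p)
    return gain
```

A test that separates the classes perfectly drives every child's entropy to zero, so the gain equals the entropy of the parent node.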

We ask ourselves: how does the error vary when the chosen attribute is not the optimal one? We have implemented the decision stump, whose behavior can be likened to that of a decision tree with a single level of depth. We use a random function for the choice of the test and ignore for a moment the gain G(S, t). From the graph below we can see that the split on the sixth attribute is the only one able to guarantee a rapid descent of the risk er(h_stump).
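For comparison with the random choice of the test, a stump that instead searches all attributes and thresholds exhaustively, keeping the split with the lowest training error, can be sketched in a few lines (a Python illustration; attribute indices are 0-based here):

```python
import numpy as np

def best_stump(X, y):
    # Exhaustive decision stump: try every attribute j and every observed
    # threshold, with both orientations of the labels, and keep the split
    # with the lowest training error. Labels are in {-1, +1}.
    best = (None, None, None, float("inf"))      # (attr, threshold, sign, error)
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            for sign in (+1, -1):
                pred = np.where(X[:, j] > thr, sign, -sign)
                err = float(np.mean(pred != y))
                if err < best[3]:
                    best = (j, thr, sign, err)
    return best
```

With 9 attributes taking values in 1–10, this exhaustive search is cheap, which is exactly what makes stumps attractive as weak learners.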

In fact, the test on this attribute is to be preferred over the others and guarantees that we classify the instances in the best possible way. It is no coincidence that this coincides with the highest gain index (the red bar in the graph). These very simple classifiers do not overfit and therefore do not need further tuning.

Below are the performance evaluation indices with decision stumps.

We now want to show that by combining different decision stumps with ensemble learning techniques, the error decreases and we can achieve better performance. The tests conducted by comparing training and test errors show that as the number T of committee members increases, the training error decreases exponentially as e^(−2Tγ²). Intuitively, this means that a large number of basic classifiers is not required to minimize the error. Figure 4 shows the result, from which it is clear that after only 6 basic classifiers the empirical risk reaches its minimum.

This result pleases us and is in line with the theoretical formula, which indicates that it is sufficient to take T weak learners, with

T ≥ ln(m) / (2γ²),

to minimize the training error. Here m represents the size of the training set S. With our training set (consisting of 578 examples), ln(m) is approximately equal to 6.4, which coincides with the result obtained during the tests.

From the graphs, it is evident that the training error êr(h) is always lower than the test error er(h). This confirms the fact that the empirical risk can be considered a lower bound when estimating the theoretical performance of a learning algorithm.

We also note that in this case overfitting does not penalize the classifiers once they have been boosted.

Let's take decision trees with 4 levels of depth and boost them. As mentioned in section 4.1.1, trees of this depth are sufficient to minimize the training error without falling into the overfitting trap. Can we get better performance than the single weak learners? First, let's look at the confusion matrix for the test examples, i.e. those the algorithm has never seen during training.

Out of a total of 102 examples, the algorithm misclassifies only 3. Below are the precision and recall indices, which confirm an improvement in classifying the test instances.

We have further evidence of this fact by looking at the graph in figure 5, which computes the recall/precision pair for each position of the training set. Simply put, it represents the precision estimate at each standard recall level. The curve closest to the point (1,1) indicates the best performance.

The graph confirms what has been stated so far at the theoretical level. Decision stump predictors are the worst performers: their (monotonically decreasing) curve is the one that deviates most from the point (1,1). On the contrary, the decision trees with AdaBoost have a trend that better approximates the function y = 1. In the middle lie the results obtained with decision trees when they are not used together with boosting algorithms.

Let's now shift our attention to another classification algorithm, the perceptron. We know it is parametric and generates linear classifiers. Let's use it to classify our instances and analyze how it behaves on data generated from the same distribution. If the Bayes optimal predictor had a rigid structure (for example, if it were a hyperplane), we would be sure that, in a finite number of steps, our model would converge to the best model f. It means that a simple hyperplane would be enough to classify all the examples well.

Having divided the dataset as usual into training and test sets, at each epoch we reshuffled the instances. This is not a mandatory procedure, but it can be good practice: at each iteration cycle, each instance has the same probability of being drawn and contributing to the generation of the model. During the training phase our hyperplane never converges, a sign that the training set is not linearly separable; therefore, regardless of the number of epochs, we never reach convergence.

So let's try to understand something more about the distribution of errors and look at how they relate to the instances. In the graph in Figure 6 we have plotted the average error that the hyperplane makes in classifying each single instance over the n epochs. The training instances (85% of the total) are represented in blue, the test ones in red. The curve takes the form of a Gaussian centered at zero, indicating that the distance between the real label y and the predicted label ŷ, averaged over all examples, approaches zero. By increasing the number of epochs, the curve approximates the Dirac delta function ever better, as the error made in a single epoch becomes less and less relevant.

In this specific example, the test errors are 17 (red bar). We are far from the performance of 3 errors obtained by boosting the decision trees. We actually noticed that as the number of epochs increases, the number of errors does not decrease significantly. Table 1 reports the number of errors made on a test set of 102 examples for different numbers of epochs. The model generated by training the algorithm with only 10 epochs is in fact the worst one. The performance with 10,000 epochs, however, is no better than that with 50 epochs. Intuitively, this could be due to the fact that our instances are not well centered around the origin of the axes.

As a final analysis, we boosted the perceptron to see if we could improve the previous results. Using hyperplanes with a non-zero threshold value as classifiers, we were able to see some improvement. The improvements are purely related to this aspect and not to the number of base classifiers used: in fact, from the boosting results we are unable to show a correlation between the increase in the number of classifiers and a steady decrease in the test error. The suspicion is that the training set is not centered around the origin and that combining classifiers cannot improve the situation.

The table above shows an example of the number of test errors with different numbers of base classifiers. As you can see, with respect to the first table the number of errors decreases a lot, but this does not seem to be related in any way to the number of weak learners extracted and used by the algorithm.

The tests make clear what is stated by the "no free lunch" theorem: there is no single algorithm that solves every problem better than the others, and the best algorithm must be evaluated case by case. The perceptron, on linearly separable training sets, is able to give excellent performance; in these tests, however, boosting with decision trees seems to be the best choice for classifying the instances.

In particular, we have shown on this dataset that by boosting simple and cheap predictors such as decision stumps, we can achieve good performance. The comparison with the perceptron has instead highlighted some limitations of linear classifiers when the dataset is not linearly separable. An alternative approach leads us to techniques for transforming parametric predictors into non-parametric ones. For example, one can think of mapping the features into spaces with many more dimensions and learning hyperplanes directly in these spaces (e.g. Gaussian kernels), but this is beyond the scope of this paper. Precisely this last aspect could offer a scenario for future developments of the analysis carried out so far.

[1] Study material of the course *Statistical Methods for Machine Learning*. Prof. Nicolò Cesa-Bianchi. Web page: cesa-bianchi.di.unimi.it/MSA/

[2] UCI Machine Learning Repository. Web page: archive.ics.uci.edu/ml/index.php

[3] RStudio software.