Integration of Data Mining Classification Techniques and Ensemble Learning for Predicting the Type of Breast Cancer Recurrence

Conservative surgery plus radiotherapy is an alternative to radical mastectomy in the early stages of breast cancer, presenting equivalent survival rates. Data mining facilitates the management of such data and supports the medical assessment of the progression and treatment of cancerous conditions, since these methods can help reduce the number of false positive and false negative decisions. Various machine learning techniques can be used to support doctors in effective and accurate decision making. In this paper, various classifiers have been tested for the prediction of the type of breast cancer recurrence, and the results show that neural networks outperform the others.


Introduction
Worldwide, breast cancer is the most common neoplasm among women; during 2016, more than two million new cases were registered and 810,712 deaths were attributed to this disease. In the United States of America, during the same year, 409,995 new cases were identified, and almost 83,000 women died of the disease. In Mexico, the incidence of breast cancer is lower; however, 21,064 cases and 8,310 deaths were reported [1].
Due to its detection in earlier stages, as well as advances in adjuvant chemotherapy, it has been possible to reduce recurrence and mortality [2]. Micrometastatic disease is the cause of recurrence and suggests the use of adjuvant therapy. The risk of recurrence in early breast cancer is calculated by analyzing various characteristics of the patient and the tumor; age at diagnosis, tumor size, state of the axillary lymph nodes, degree of differentiation and the presence or absence of vascular or lymphatic invasion have been some widely validated prognostic factors [3].
The status of hormone receptors (estrogen and progesterone receptors) and overexpression of the protein or amplification of the HER2 oncogene have been shown to be useful in establishing the prognosis and predicting the response to specific treatment modalities [4]. Distant metastasis is diagnosed a minimum of three months after the primary tumor and accounts for 60% to 70% of patients [5]. However, using Machine Learning (ML) tools it is possible to extract key factors that help to predict the recurrence of the disease. Machine learning has been practiced for some years, with good results, in the social sciences, marketing, finance and applied sciences. In medicine it has barely been used, partly for cultural and philosophical reasons that assume a computer will never be as capable as a human doctor, and partly because of the refusal of some doctors to feel questioned, supervised or advised by a machine or by an engineer [6], [7].
Thus, even in the biological sciences and genomic medicine, advanced computational methods are already used, while clinicians have to deal with increasingly large and complex databases using traditional statistical methods [8], [9], [10].
Due to its characteristics of complexity and uncertainty, medicine is one of the fields of knowledge that can benefit most from interaction with disciplines such as computing and machine learning, strengthening processes such as clinical diagnosis and enabling predictive analyses of patients and their prognosis, resulting in a more efficient health system and better use of resources [11].
The objective of this paper is to integrate data mining classification techniques and ensemble learning for predicting the type of breast cancer recurrence. Various data mining algorithms, such as Support Vector Machine (SVM), Decision Tree (DT), Naïve Bayes (NB) and neural networks, including the Generalized Regression Neural Network (GRNN), can be used for this prediction.

Bibliographic Review
Traditional statistics are not enough to handle the large numbers of variables found in many current databases. Machine learning is knowledge gained by computationally processing the training data contained in those databases [12]. Statistical pattern recognition is an approach for exploring a set of data and discovering previously unsuspected relationships, without the need for a hypothesis. The problems that arise, and the strategies to solve them, can be divided into clustering, dimensionality reduction and classification [13].
In recent years, a great deal of research has been done on breast cancer prognosis using machine learning techniques, and these algorithms have also been applied to predicting the key factors in breast cancer recurrence. Table 1 provides the details of the literature survey.

Determination of the data set to intervene
The UCI (University of California, Irvine) Machine Learning Repository is a collection of databases, domain theories, and data generators that are used by the machine learning community for the empirical analysis of machine learning algorithms. The archive was created as an ftp archive in 1987 by David Aha and fellow graduate students at UC Irvine. Since that time, it has been widely used by students, educators, and researchers all over the world as a primary source of machine learning data sets. As an indication of the impact of the archive, it has been cited over 1000 times, making it one of the top 100 most cited "papers" in all of computer science. The current version of the web site was designed in 2007 by Arthur Asuncion and David Newman, and this project is in collaboration with Rexa.info at the University of Massachusetts Amherst. Funding support from the National Science Foundation is gratefully acknowledged [23].
The breast cancer recurrence dataset has been taken from the UCI Machine Learning Repository, available online [23]. It was provided by the University Medical Centre, Institute of Oncology, Ljubljana, Yugoslavia. It consists of 286 instances and 10 attributes (explained in Table 2), including a class attribute that indicates whether the outcome is recurrence or non-recurrence. The italicized terms in Table 2 are the standard terms used in the UCI Machine Learning Repository.
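As a hedged illustration, the raw comma-separated file can be parsed into attribute dictionaries, with "?" treated as a missing value. The attribute names below follow the UCI documentation for this dataset; the sample rows are invented for demonstration, not taken from the file:

```python
import csv, io

# Attribute names as documented in the UCI repository entry for this dataset
# (order matches the raw breast-cancer.data file; the first field is the class).
ATTRIBUTES = ["class", "age", "menopause", "tumor-size", "inv-nodes",
              "node-caps", "deg-malig", "breast", "breast-quad", "irradiat"]

def load_records(text):
    """Parse raw comma-separated lines into attribute dictionaries,
    treating '?' as a missing value (None)."""
    records = []
    for row in csv.reader(io.StringIO(text)):
        if len(row) != len(ATTRIBUTES):
            continue  # skip blank or malformed lines
        records.append({a: (None if v == "?" else v)
                        for a, v in zip(ATTRIBUTES, row)})
    return records

# Two example rows in the repository's format (illustrative values only)
sample = ("no-recurrence-events,30-39,premeno,30-34,0-2,no,3,left,left_low,no\n"
          "recurrence-events,40-49,premeno,20-24,0-2,?,2,right,right_up,no\n")
records = load_records(sample)
```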

Data cleansing and data pre-processing
The large quantities of information contained in the database require an efficient representation, not only to reduce the dimensionality but also to preserve the information relevant for efficient classification; the fields were therefore checked, and those that did not contain information relevant to the forecasting process were eliminated [24].

Reduction of variables
In the process of reduction of the variables it is important to identify the type of information they transmit. Such information can be of three types: (i) Redundant: repetitive or predictable information; (ii) Irrelevant: information that is not relevant for the information discovery process; and (iii) Basic: the relevant information that constitutes an important part of a process of prediction or discovery of information (Caamaño et al., 2015) [25]. The importance of data reduction lies in improving the input data so that the algorithms can efficiently classify the relationships between variables.
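The three-way distinction above can be operationalized in a simple way: constant columns carry irrelevant information, exact copies of another column are redundant, and the rest are kept as basic. A minimal sketch, where the column names and values are illustrative rather than taken from the study data:

```python
def reduce_variables(columns):
    """columns: dict name -> list of values (one per instance).
    Drops constant columns (irrelevant) and exact duplicates of an
    earlier column (redundant); everything else is kept as basic."""
    kept, seen = {}, set()
    for name, values in columns.items():
        if len(set(values)) <= 1:
            continue              # irrelevant: no variation to learn from
        key = tuple(values)
        if key in seen:
            continue              # redundant: predictable from another column
        seen.add(key)
        kept[name] = values
    return kept

data = {"deg_malig": [1, 3, 2, 3],
        "constant":  [0, 0, 0, 0],     # irrelevant: never varies
        "copy":      [1, 3, 2, 3]}     # redundant with deg_malig
reduced = reduce_variables(data)
```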

Attribute Filters
WEKA (Mark et al., 2009) [26] allows manipulations on the data by applying filters. They can be applied at two levels: attributes and instances. It was decided to apply a refinement to the model in order to obtain a slightly higher probability of success. Filtering operations can be applied in "cascade", so that each filter takes as input the data set resulting from the previous filter. In the model, the method of exhaustive search was used, which can be expressed as building a tuple (x1, …, xk) that satisfies some restrictions and optimizes a certain objective function (Anon 2016, p. 3) [27]. At each moment, the algorithm is at a certain level k, with a partial solution (x1, …, xk); each set of possible values of the tuple represents a node of the tree of solutions. The process continues until the partial solution is a complete solution of the problem, or until there are no more possibilities to try.
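The backtracking scheme just described, extending a partial tuple (x1, …, xk) level by level, pruning branches that violate the restrictions and scoring complete solutions, can be sketched as follows. The utility values and the at-most-two-attributes constraint are invented for illustration:

```python
def backtracking_search(items, feasible, score, k=0, partial=()):
    """Enumerate tuples (x1, ..., xk) depth-first; each node extends the
    partial solution with one more decision (include/exclude items[k]).
    Branches failing the restriction are pruned; complete solutions are
    scored and the best (score, solution) pair is returned."""
    if not feasible(partial):
        return None                        # prune: restriction violated
    if k == len(items):
        return (score(partial), partial)   # complete solution: a leaf
    best = None
    for choice in (0, 1):                  # exclude / include items[k]
        cand = backtracking_search(items, feasible, score, k + 1,
                                   partial + (choice,))
        if cand is not None and (best is None or cand > best):
            best = cand
    return best

# Toy objective: pick at most 2 of 4 attributes, maximizing a utility sum
utility = [3, 1, 4, 2]
best = backtracking_search(
    range(4),
    feasible=lambda p: sum(p) <= 2,
    score=lambda p: sum(u for u, c in zip(utility, p) if c))
```

Here the best solution selects attributes 0 and 2 (utilities 3 and 4), since no feasible pair scores higher.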

Evaluation of data mining techniques
For the development of this research, classification and prediction techniques were used to construct models from the data to determine the recurrence of breast cancer. The Bayesian classifier (Naive Bayes) was used as the initial classifier; in the second instance, decision trees C4.5 (J48), followed by Support Vector Machine (SVM) and Generalized Regression Neural Network (GRNN). The data mining tool WEKA was used to classify the data, and the predicted class was compared with the actual class of the instances to measure the effectiveness of the classification algorithm. There are several ways to carry out the assessment; in this case, "use training set" was applied, i.e., the same sample was used to train and test (Hepner 1990) [28]. Among the algorithms provided by WEKA, the following were analyzed:

Naive Bayes
Given n predictor variables X1, …, Xn, the Naive Bayes paradigm consists of finding the value c of the class variable C that maximizes the a posteriori probability of C given the evidence, expressed as an instantiation of the variables X1, …, Xn, that is, x = (x1, …, xn). Therefore, in the Naive Bayes paradigm, the search for the most probable diagnosis, c*, once the symptoms (x1, …, xn) of a particular patient are known, reduces to:

c* = arg max_c p(c | x1, …, xn) = arg max_c p(c) ∏_{i=1..n} p(xi | c)    (4)

Decision trees C4.5 (J48)
The decision trees generated by C4.5 can be used for classification, and for this reason C4.5 is almost always referred to as a statistical classifier (Quinlan, J. R. 1993) [31].
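Returning to the Naive Bayes rule c* = arg max_c p(c) ∏ p(xi | c), a minimal sketch on categorical attributes follows. Add-one smoothing and the toy data are assumptions made here for the example, not part of the paper's setup:

```python
from collections import Counter, defaultdict

def train_nb(X, y):
    """Estimate p(c) and p(x_i | c) counts from categorical training data."""
    prior = Counter(y)
    cond = defaultdict(Counter)            # (attribute index, class) -> value counts
    values = defaultdict(set)              # attribute index -> observed values
    for xs, c in zip(X, y):
        for i, v in enumerate(xs):
            cond[(i, c)][v] += 1
            values[i].add(v)
    return prior, cond, values, len(y)

def predict_nb(model, xs):
    """c* = argmax_c p(c) * prod_i p(x_i | c), with add-one smoothing
    so unseen attribute values do not zero out the product."""
    prior, cond, values, n = model
    best_c, best_p = None, -1.0
    for c, nc in prior.items():
        p = nc / n
        for i, v in enumerate(xs):
            p *= (cond[(i, c)][v] + 1) / (nc + len(values[i]))
        if p > best_p:
            best_c, best_p = c, p
    return best_c

# Toy training data: (age group, irradiation) -> class
X = [("30-39", "yes"), ("40-49", "yes"), ("50-59", "no"), ("60-69", "no")]
y = ["recurrence", "recurrence", "no-recurrence", "no-recurrence"]
model = train_nb(X, y)
```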

Support Vector Machine (SVM)
SVM is a useful technique for data classification. Although neural networks are often considered easier to use, they sometimes give unsatisfactory results. A classification task usually involves training and testing data consisting of data instances [32]. Each instance in the training set contains one target value and several attributes. The goal of SVM is to produce a model which predicts the target value of the instances in the testing set, given only their attributes [33]. Classification in SVM is an example of supervised learning: known labels help indicate whether the system is performing correctly. This information points to a desired response, validating the accuracy of the system, or can be used to help the system learn to act correctly. A step in SVM classification is the identification of the features that are intimately connected to the known classes; this is called feature selection or feature extraction. Feature selection and SVM classification together are useful even when prediction of unknown samples is not necessary: they can be used to identify the key feature sets involved in whatever processes distinguish the classes [33].
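The paper uses WEKA's SVM implementation; as a purely illustrative stand-in, a linear SVM can be trained by sub-gradient descent on the regularized hinge loss. The learning rate, regularization constant and toy data below are assumptions of this sketch:

```python
import random

def train_linear_svm(X, y, lam=0.01, lr=0.05, epochs=200, seed=0):
    """Fit w, b by sub-gradient descent on the regularized hinge loss
    L = lam/2 * ||w||^2 + mean(max(0, 1 - y * (w.x + b))), y in {-1, +1}."""
    rng = random.Random(seed)
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    idx = list(range(n))
    for _ in range(epochs):
        rng.shuffle(idx)
        for i in idx:
            margin = y[i] * (sum(wj * xj for wj, xj in zip(w, X[i])) + b)
            if margin < 1:                       # inside margin: push outward
                w = [wj + lr * (y[i] * xj - lam * wj)
                     for wj, xj in zip(w, X[i])]
                b += lr * y[i]
            else:                                # outside margin: only shrink w
                w = [wj - lr * lam * wj for wj in w]
    return w, b

def predict(w, b, x):
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1

# Toy separable data: class +1 in the upper right, -1 in the lower left
X = [(2.0, 2.0), (3.0, 1.5), (-2.0, -2.0), (-3.0, -1.0)]
y = [1, 1, -1, -1]
w, b = train_linear_svm(X, y)
```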

Generalized Regression Neural Network (GRNN)
Artificial Intelligence (AI) has a significant impact on current research trends due to its numerous applications in different aspects of life. Artificial Neural Networks (ANNs) are one of the major parts of AI. ANNs have different applications, including regression and approximation, forecasting and prediction, classification, pattern recognition and more. ANNs are useful because they can learn from data and have global approximation abilities: a feed-forward neural network with at least one hidden layer and a sufficient number of hidden neurons can approximate any arbitrary continuous function under certain conditions [34]. ANNs have two main types: Feed-Forward ANNs (FFANNs), in which the input flows only forward towards the output layer, and Recurrent ANNs (RANNs), in which data can flow in any direction. Generalized Regression Neural Networks (GRNNs) [35] are single-pass, associative-memory, feed-forward ANNs that use normalized Gaussian kernels in the hidden layer as activation functions.
GRNN advantages include its quick training approach and its accuracy. On the other hand, one of the disadvantages of GRNN is the growth of the hidden layer size. However, this issue can be addressed by implementing a special algorithm that limits the growth of the hidden layer by storing only the most relevant patterns [36].
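Because the GRNN output is simply the kernel-weighted average of the stored training targets, it can be sketched in a few lines. The toy patterns and the 0.5 decision threshold for the recurrence/non-recurrence encoding are assumptions for illustration:

```python
import math

def grnn_predict(X_train, y_train, x, sigma=1.0):
    """Single-pass GRNN: one hidden unit per stored pattern, each applying
    a Gaussian kernel; the output is the normalized kernel-weighted mean
    of the stored targets:
        y(x) = sum_i y_i K_i / sum_i K_i,
        K_i  = exp(-||x - x_i||^2 / (2 sigma^2))."""
    num = den = 0.0
    for xi, yi in zip(X_train, y_train):
        d2 = sum((a - b) ** 2 for a, b in zip(x, xi))
        k = math.exp(-d2 / (2.0 * sigma ** 2))
        num += yi * k
        den += k
    return num / den

# Binary classification: encode recurrence = 1, non-recurrence = 0 and
# threshold the regression output at 0.5
X_train = [(0.0, 0.0), (0.2, 0.1), (3.0, 3.0), (3.2, 2.9)]
y_train = [0, 0, 1, 1]
score = grnn_predict(X_train, y_train, (3.1, 3.1), sigma=1.0)
label = 1 if score >= 0.5 else 0
```

The spread sigma controls how local the estimate is: small values make the network behave like nearest-neighbour lookup, large values smooth towards the global mean.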

Definition of the data mining technique
The evaluation of the data obtained from the application of the methods is carried out using the following variables for comparison: correct instances, absolute error, confusion matrix and ease of interpretation of the data [24].
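These comparison variables can be computed directly from the actual and predicted class labels. A minimal sketch, noting that for nominal 0/1 predictions the mean absolute error reduces to the misclassification rate; the example labels are invented:

```python
def evaluate(actual, predicted, classes):
    """Build a confusion matrix (rows = actual, cols = predicted) and
    derive the comparison variables: correctly classified instances
    and the mean absolute error of the 0/1-coded predictions."""
    m = {a: {p: 0 for p in classes} for a in classes}
    for a, p in zip(actual, predicted):
        m[a][p] += 1
    correct = sum(m[c][c] for c in classes)      # diagonal of the matrix
    mae = sum(a != p for a, p in zip(actual, predicted)) / len(actual)
    return m, correct, mae

actual    = ["rec", "rec", "no", "no", "no"]
predicted = ["rec", "no",  "no", "no", "rec"]
matrix, correct, mae = evaluate(actual, predicted, ["rec", "no"])
```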

Data pre-processing
The number of instances in the data set is presented in Table 3. The set includes a few duplicate rows, which were eliminated; the counts of recurrence and non-recurrence instances are presented in the same table. The dataset obtained after cleaning is nominal in nature [37], so it is converted into numeric form for further processing (see Table 4).
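The two pre-processing steps, duplicate elimination followed by nominal-to-numeric conversion, can be sketched as a simple label encoding. The integer codes assigned here are illustrative and need not match those of Table 4:

```python
def preprocess(rows):
    """Drop exact duplicate rows, then map each nominal value to an
    integer code per column (codes assigned in order of first appearance)."""
    deduped, seen = [], set()
    for r in rows:
        if r not in seen:          # eliminate exact duplicate rows
            seen.add(r)
            deduped.append(r)
    n_cols = len(deduped[0])
    codes = [{} for _ in range(n_cols)]   # per-column value -> code maps
    numeric = []
    for r in deduped:
        numeric.append(tuple(codes[i].setdefault(v, len(codes[i]))
                             for i, v in enumerate(r)))
    return numeric, codes

rows = [("30-39", "premeno", "no"),
        ("40-49", "ge40",    "yes"),
        ("30-39", "premeno", "no")]   # duplicate of the first row
numeric, codes = preprocess(rows)
```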

Evaluation of data mining techniques
A comparison between the results of the Naive Bayes, C4.5 (J48), Support Vector Machine (SVM) and GRNN algorithms is shown in Table 5. With 10% of the data used for training, all the algorithms achieve 89% correct classification; the difference lies in the fact that with more training data, the J48 and GRNN algorithms classify better, with 91% efficiency compared to 89% for the Naive Bayes and SVM algorithms.

Comparison of confusion matrices
One of the benefits of confusion matrices is that they make it possible to see whether the system is confusing two classes (Corso et al., 2009) [38]. The confusion matrices generated by each of the algorithms applied to the same data set are shown below. In Table 6, the values on the diagonal are the correct classifications and the rest are errors. According to the Naive Bayes algorithm, of the 260 users with profile b, 240 were well classified and 20 presented errors; in profile c, 149 were well classified and 131 presented errors; in profile d, 120 were well classified and 40 presented errors; and in profile e, 280 were well classified. For the J48 algorithm, in Table 7, of the 260 users with profile a, 240 were well classified and 20 presented errors, and of the 160 users with profile d, 120 were well classified and 40 presented errors. The numeric form of the data is used for the GRNN, which performs classification in MATLAB with training and testing data ratios of 70% and 30%, respectively. The data are randomly selected for training and testing, with the spread parameter set to 1 [23]. The accuracy measures sensitivity, specificity, precision and recall [39] for all the classifiers are presented in Table 10.
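The accuracy measures reported in Table 10 follow directly from the counts of true/false positives and negatives in a 2x2 confusion matrix. A minimal sketch with the positive class taken to be recurrence; the counts below are illustrative, not the paper's results:

```python
def accuracy_measures(tp, fp, tn, fn):
    """Standard measures from a 2x2 confusion matrix, with 'recurrence'
    taken as the positive class."""
    sensitivity = tp / (tp + fn)          # true positive rate
    specificity = tn / (tn + fp)          # true negative rate
    precision   = tp / (tp + fp)          # positive predictive value
    recall      = sensitivity             # recall is the same quantity
    return sensitivity, specificity, precision, recall

# Illustrative counts only (not the values of Table 10)
sens, spec, prec, rec = accuracy_measures(tp=40, fp=10, tn=180, fn=20)
```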

Conclusions
In this paper, several types of classification algorithms have been used, and it can be seen that neural network classifiers performed better than the other learning classifiers. In the future, the accuracy can be increased by adding more features or by increasing the number of instances in the dataset. Combinations of the existing classification techniques can also be used to enhance efficiency. Besides this, a discussion with medical professionals can be held to verify the features for the type of recurrence.