Machine Learning Approach Applied to the Prevalence Analysis of ADHD Symptoms in Young Adults of Barranquilla, Colombia

Attention Deficit/Hyperactivity Disorder (ADHD) is recognized as one of the most prevalent pathologies in children and adolescents worldwide. Its visible symptoms usually diminish over time in adulthood; however, they remain concealed behind manifestations that damage personal stability and proper human development. This article presents the results of research aimed at determining the prevalence of symptoms of attention deficit hyperactivity disorder in young university adults of Barranquilla and its Metropolitan Area. The sample comprised 1600 young adults between 18 and 25 years of age, estimated at a 95% confidence level and a margin of error of 2.44%. The information was acquired through the application of exploratory instruments for symptoms of attention deficit hyperactivity disorder. Among the machine learning algorithms applied (Bagging, MultiBoostAB, DecisionStump, LogitBoost, FT and J48Graft), Bagging showed high performance, with the following results in the quality metrics: accuracy 91.67%, precision 94.12%, recall 88.89% and F-measure 91.43%.


Introduction
Attention Deficit Hyperactivity Disorder, known by the acronym ADHD, is a topic of interest for professionals in psychology, psychiatry and the neurosciences in general, due to its negative impact on the social, personal, work, academic and family functioning of those who suffer from it. The American Psychiatric Association, in the DSM-5, defines Attention Deficit Hyperactivity Disorder as a persistent pattern of inattention and/or hyperactivity-impulsivity that interferes with functioning or development, with symptoms that are present before the age of 12 and that affect the different contexts in which an individual develops.
Attention Deficit Hyperactivity Disorder has been considered one of the most prevalent disorders in children and adolescents worldwide [1]. Although its externally visible symptoms tend to decrease over the years, in adulthood they tend to remain hidden behind manifestations that affect personal stability and proper human development [2]. Studies show that, of the patients diagnosed with ADHD in childhood, 30% to 70% continue to present symptoms that cause difficulties during adolescence and adulthood. In addition, at the age of 19, 38% still fully meet the diagnostic criteria of the pathology (without remission), and 72% have at least one third of the symptoms required for diagnosis (with partial remission) [3,4].
On the other hand, according to the Diagnostic and Statistical Manual of Mental Disorders (DSM-IV), subjects diagnosed with ADHD can reach lower academic levels than their peers, and their functioning may also be compromised in the work, family and social environments, with low self-esteem, disorganization, poor planning capacity, lack of concentration and inadequate time management, among others. Accordingly, this research aims to determine the prevalence of symptoms of Attention Deficit Hyperactivity Disorder in young adults of the universities of Barranquilla and its Metropolitan Area. The results of the study will also provide useful information for higher education, since the symptoms may be affecting academic life, leading to poor performance or even dropout associated with academic factors [5].

Methods
The present investigation is framed in the empirical-analytical paradigm because it takes an analytical theoretical approach. The research is quantitative in approach and descriptive in scope, and in terms of temporality it is a cross-sectional study. The population under study corresponds to young university adults, men and women, residing in the city of Barranquilla and its metropolitan area, aged between 18 and 25, who voluntarily expressed their interest in participating in this research.
For the sample calculation, the reports of the National Data Archive (ANDA), maintained by the National Administrative Department of Statistics (DANE), were taken into account. These give the population projection to 2014 by sex for the age group between 18 and 25 years in Barranquilla, Atlántico (Table 1). The final sample was obtained from a sample-error calculation for finite populations, taking into account a total population of 164,676. The resulting size is 1600 young adults between the ages of 18 and 25, estimated with 95% confidence and a margin of error of 2.44%. The sampling implemented was non-probabilistic and intentional, and the selection technique was of the expert type [6], based on the following inclusion and exclusion criteria.

Inclusion Criteria
• Schooling: university.
• Age: 18-25 years.
• Place of residence: Barranquilla and its metropolitan area.
• Absence of significant clinical history.

Exclusion Criteria
• Individuals who are not in education or whose level of education is below university.
• Individuals outside the age range of 18 to 25 years.
• Individuals whose fixed place of residence is not Barranquilla or the municipalities of its metropolitan area.
• Individuals with a significant clinical/neuropsychiatric history that could explain the presence of the symptoms to be evaluated.
It should be noted that in the present study the instruments were applied to 1,674 subjects, of whom 74 were excluded from the analyses for not meeting the inclusion criteria or for not having answered the questionnaire in its entirety.
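The reported margin of error can be checked with the usual formula for a proportion with finite population correction. The sketch below is only illustrative: the function name is hypothetical, and the worst-case proportion p = 0.5 and the critical value z = 1.96 (95% confidence) are assumptions, since the text does not state which values were used.

```python
import math

def margin_of_error(n, population, z=1.96, p=0.5):
    """Margin of error for a proportion, with finite population correction.

    z = 1.96 (95% confidence) and p = 0.5 (worst case) are assumed values.
    """
    fpc = (population - n) / (population - 1)  # finite population correction
    return z * math.sqrt(p * (1 - p) / n * fpc)

# Figures taken from the text: n = 1600, N = 164,676
moe = margin_of_error(n=1600, population=164_676)
print(f"{moe:.2%}")
```

Under these assumptions the result rounds to the 2.44% stated above.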

Procedure
To carry out the present investigation, a procedure in six phases was established:
- Phase 1: Theoretical review and state of the art on adult ADHD.

Dataset Description
For this study, the dataset was generated for the experimentation through the application of neuropsychological tests to 1674 patients and comprises 184 features; its description follows (Table 2).
• One instrument uses an individual item scale from 1 to 5, where 1 corresponds to (not at all), 2 (a little), 3 (moderately), 4 (quite a lot) and 5 (a lot). Its heading refers to the symptoms presented in childhood through sentences such as "As a child I was (or had)" [7]. This test can be used to assess mood status and the emotional changes that occur.
• ADHD: This is a self-applied checklist designed for the ongoing research, whose application time ranges from 10 to 15 min. The instrument is made up of 18 items. Its objective is to measure the prevalence of ADHD symptoms in young adults by quantifying the related symptoms, taking into account the different factors that affect the configuration of these symptoms and that compromise the feasibility of the diagnostic and neuropsychological evaluation of this disorder [8]. The Exploratory Inventory of ADHD Symptoms (IES-ADHD) is made up of two subscales, which evaluate inattention and hyperactivity/impulsivity respectively. The first covers items 1 to 9 and evaluates symptoms of the inattentive type, with questions on the presence of errors in work activities, deficient maintenance of attention in certain activities, difficulty in following instructions, etc. The second subscale covers items 10 to 18 and evaluates the characteristic symptoms of the hyperactive/impulsive type, with questions on fidgeting with the hands, feeling restless or constantly anxious, impatience, etc. The individual item scale comprises 0 (never), 1 (rarely), 2 (sometimes), 3 (frequently) and 4 (very frequently). The objective of this checklist is to establish, with both subscales, the prevalence of ADHD symptomatology in young adults who presented the disorder in childhood, and how these symptoms affect the functioning of the individual in the different aspects of life.
• CIE-10 Test: This checklist is self-applied and is designed to be answered in 10 to 15 min. The instrument consists of 18 items, divided into three groups according to evaluation criteria ranging from inattention to hyperactivity to impulsiveness [9]. The first measures inattentive predominance and covers items 1 to 9, the second measures hyperactive predominance and covers items 10 to 14, and the third measures impulsive predominance and covers items 15 to 18. It uses an individual item scale from 0 to 4, where 0 corresponds to (never), 1 (rarely), 2 (sometimes), 3 (frequently) and 4 (very frequently). The objective of this instrument is to assess the presence and prevalence of ADHD symptoms in adults based on the criteria of the WHO ICD-10 evaluation system of 1992.
• Class: 0 denotes a control patient and 1 denotes a patient with ADHD.
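The item groupings of the two 18-item checklists can be written down as a small data structure. The following is a minimal sketch only: the names are hypothetical, and summing the item responses per subscale is an assumed scoring rule, since the text gives the item ranges and the 0 to 4 response scale but not how scores are aggregated.

```python
# Item ranges as described in the text (items are numbered from 1).
IES_ADHD_SUBSCALES = {
    "inattention": range(1, 10),                  # items 1-9
    "hyperactivity_impulsivity": range(10, 19),   # items 10-18
}
CIE10_SUBSCALES = {
    "inattention": range(1, 10),     # items 1-9
    "hyperactivity": range(10, 15),  # items 10-14
    "impulsivity": range(15, 19),    # items 15-18
}

def subscale_totals(responses, subscales):
    """Sum the 0-4 responses per subscale (summation is an assumption)."""
    if len(responses) != 18:
        raise ValueError("expected 18 item responses")
    if any(r not in (0, 1, 2, 3, 4) for r in responses):
        raise ValueError("responses must be integers from 0 to 4")
    return {
        name: sum(responses[item - 1] for item in items)
        for name, items in subscales.items()
    }

# Made-up answers for illustration only, not data from the study.
example = [1] * 9 + [2] * 9
ies_totals = subscale_totals(example, IES_ADHD_SUBSCALES)
cie_totals = subscale_totals(example, CIE10_SUBSCALES)
```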

Bagging
Bootstrap aggregation, also known as Bagging, is a meta-algorithm that combines models from an initial family, each trained on a bootstrap resample of the training set, which reduces variance and helps to avoid overfitting. Although it is most commonly applied with methods based on decision trees, it can be used with any family. Bagging has been shown to tend to produce improvements with unstable individual models (such as neural networks or decision trees), but it can produce mediocre results or even worsen results with other methods, such as k-nearest neighbors [10].
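The idea can be sketched in a few lines. This is an illustration of the general technique, not the Weka implementation used in the study: the function names are hypothetical, the base learner is an assumed one-feature threshold classifier, and the toy data are invented for the example.

```python
import random
from collections import Counter

def fit_stump(xs, ys):
    """Base learner (assumed): best single threshold on one numeric feature."""
    best = None
    for t in sorted(set(xs)):
        for left, right in ((0, 1), (1, 0)):
            errors = sum((left if x <= t else right) != y for x, y in zip(xs, ys))
            if best is None or errors < best[0]:
                best = (errors, t, left, right)
    _, t, left, right = best
    return lambda x: left if x <= t else right

def bagging_fit(xs, ys, n_models=25, seed=0):
    """Train one base model per bootstrap resample of the training set."""
    rng = random.Random(seed)
    n = len(xs)
    models = []
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]  # sample n items with replacement
        models.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return models

def bagging_predict(models, x):
    """Combine the individual predictions by majority vote."""
    votes = Counter(model(x) for model in models)
    return votes.most_common(1)[0][0]

# Toy data, for illustration only.
xs = [1, 2, 3, 4, 6, 7, 8, 9]
ys = [0, 0, 0, 0, 1, 1, 1, 1]
models = bagging_fit(xs, ys)
```

Each model sees a slightly different sample, so their individual errors differ; the vote averages them out, which is the source of the variance reduction.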

Algorithms Based on Boosting
Unlike bagging, boosting does not create resampled versions of the training set; it always works with the complete input set and manipulates the weights of the data to generate different models. In each iteration, the weight of the objects misclassified by the current predictor is increased, so that in the construction of the next predictor these objects are more important and are more likely to be classified correctly [11].
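The reweighting step can be illustrated with the classic AdaBoost update. This is an assumption chosen for concreteness: the text describes boosting in general, and the boosting variants used in the study (LogitBoost, MultiBoostAB) have their own update rules. The function name is hypothetical and the weights are assumed to sum to 1.

```python
import math

def adaboost_reweight(weights, misclassified):
    """One AdaBoost-style round: raise the weight of misclassified examples.

    weights: current example weights, assumed to sum to 1.
    misclassified: one boolean per example for the current predictor.
    """
    err = sum(w for w, miss in zip(weights, misclassified) if miss)
    if not 0 < err < 0.5:
        raise ValueError("weighted error must lie strictly between 0 and 0.5")
    alpha = 0.5 * math.log((1 - err) / err)  # vote weight of this predictor
    updated = [
        w * math.exp(alpha if miss else -alpha)
        for w, miss in zip(weights, misclassified)
    ]
    total = sum(updated)
    return [w / total for w in updated], alpha

# Four equally weighted examples, the last one misclassified.
new_weights, alpha = adaboost_reweight([0.25] * 4, [False, False, False, True])
```

After the update the misclassified example carries half of the total weight, so the next predictor is pushed to classify it correctly.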

DecisionStump
DecisionStump is an algorithm implemented in Weka that builds a decision tree with a single test and three branches. One branch covers the case in which the attribute value is unknown. The other two depend on the attribute type: for symbolic attributes, whether the value of the test example is equal to or different from a specific value of the attribute; for numerical attributes, whether the value of the test example is greater or less than a certain value [12].
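The three-branch prediction rule described above can be sketched as follows. The class and field names are hypothetical and this is not Weka's code; treating a value equal to the threshold as belonging to the "less than" branch is an assumption.

```python
from dataclasses import dataclass
from typing import Optional, Union

@dataclass
class Stump:
    split_value: Union[float, str]  # threshold (numeric) or category (symbolic)
    numeric: bool
    label_match: int    # numeric: value <= split_value; symbolic: value == split_value
    label_other: int    # numeric: value > split_value; symbolic: value != split_value
    label_missing: int  # branch used when the attribute value is unknown

    def predict(self, value: Optional[Union[float, str]]) -> int:
        if value is None:
            return self.label_missing
        if self.numeric:
            return self.label_match if value <= self.split_value else self.label_other
        return self.label_match if value == self.split_value else self.label_other

# Illustrative stump on a made-up numeric score with threshold 20.
stump = Stump(split_value=20.0, numeric=True,
              label_match=0, label_other=1, label_missing=0)
```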

J48
J48 is the Weka implementation of the C4.5 algorithm. C4.5 builds decision trees from a set of training data in the same way as the ID3 algorithm, using the concept of information entropy. The training data are a set $S = \{s_1, s_2, \ldots\}$ of already classified samples. Each sample $s_i = \{x_1, x_2, \ldots\}$ is a vector where $x_1, x_2, \ldots$ represent the attributes or characteristics of the sample. The training data are augmented with a vector $C = \{c_1, c_2, \ldots\}$ where $c_1, c_2, \ldots$ represent the class to which each sample belongs. C4.5 is an extension of the ID3 algorithm previously developed by Quinlan. The decision trees generated by C4.5 can be used for classification, and for this reason C4.5 is often referred to as a statistical classifier [13].
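The entropy criterion can be illustrated with a small sketch. The function names are hypothetical, and the example computes the plain information gain of ID3 for a binary split on a numeric attribute; C4.5 additionally normalizes this quantity into a gain ratio, which is omitted here for brevity.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(values, labels, threshold):
    """Entropy reduction obtained by splitting on value <= threshold."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    n = len(labels)
    remainder = sum(len(part) / n * entropy(part) for part in (left, right) if part)
    return entropy(labels) - remainder

# Toy attribute values and classes, for illustration only.
values = [1, 2, 3, 4]
labels = [0, 0, 1, 1]
gain_at_2 = information_gain(values, labels, threshold=2)
gain_at_1 = information_gain(values, labels, threshold=1)
```

The tree builder picks the split with the largest gain: here the threshold 2 separates the classes perfectly, while the threshold 1 leaves a mixed branch.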

Experimentation
For this experimentation the confusion matrix was used. This is a fundamental tool for evaluating the performance of a classification algorithm, since it gives a better idea of how the algorithm is classifying, based on a count of the successes and errors for each of the classes. In this way it is possible to check whether the algorithm is misclassifying the classes, and to what extent.
The entries of the confusion matrix are explained next:
• a is the number of correct predictions that a case is negative, called True Negative.
• b is the number of incorrect predictions that a case is positive, that is, the prediction is positive when the value should really be negative, called False Positive.
• c is the number of incorrect predictions that a case is negative, that is, the prediction is negative when the value should really be positive, called False Negative.
• d is the number of correct predictions that a case is positive, called True Positive.
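The four counts can be obtained by comparing true and predicted labels pair by pair. A minimal sketch, assuming binary labels with 0 for the negative (control) class and 1 for the positive (ADHD) class; the function name and the example labels are invented for illustration.

```python
def confusion_counts(y_true, y_pred):
    """Return (a, b, c, d) = (TN, FP, FN, TP) for binary 0/1 labels."""
    a = b = c = d = 0
    for actual, predicted in zip(y_true, y_pred):
        if actual == 0 and predicted == 0:
            a += 1  # true negative
        elif actual == 0 and predicted == 1:
            b += 1  # false positive
        elif actual == 1 and predicted == 0:
            c += 1  # false negative
        elif actual == 1 and predicted == 1:
            d += 1  # true positive
        else:
            raise ValueError("labels must be 0 or 1")
    return a, b, c, d

# Made-up labels, for illustration only.
y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 0, 1, 0]
counts = confusion_counts(y_true, y_pred)
```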
From the results obtained in the confusion matrix, other quality metrics can be analyzed, which are explained below:
4.1. Accuracy (AC): the proportion of all predictions that are correct [14]. It is represented by the ratio between the number of correct predictions (both positive and negative) and the total number of predictions, and is calculated using the equation:
$$AC = \frac{a + d}{a + b + c + d}$$
4.2. Precision (P): also known as the positive predictive value. It is the proportion of the cases predicted as positive by the algorithm that are really positive [15], and is calculated using the equation:
$$P = \frac{d}{b + d}$$
4.3. Recall (TP): Sensitivity, also known as the True Positive Rate (TP). It is the proportion of positive cases that were correctly identified by the algorithm [16]. It is calculated according to the equation:
$$TP = \frac{d}{c + d}$$
4.4. Specificity (TN): Also known as the True Negative Rate (TN). It is the proportion of negative cases that the algorithm has correctly classified [17]. It is calculated according to the equation:
$$TN = \frac{a}{a + b}$$
4.5. False Positive Rate (FP): the proportion of negative cases that were mistakenly classified as positive by the algorithm [18]. It is calculated according to the equation:
$$FP = \frac{b}{a + b}$$
4.7. F-measure: With the concepts of precision and recall it is possible to define another metric called the "F-measure" [19]. It depends on the two metrics already seen and can be interpreted as their harmonic mean. In its general form the F-measure takes a weighting parameter; when precision and recall are weighted equally it reduces to:
$$F = \frac{2 \cdot P \cdot TP}{P + TP}$$
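The metrics above follow directly from the four confusion-matrix counts. The sketch below uses hypothetical counts (a = 17, b = 1, c = 2, d = 16) chosen only because they reproduce the percentages reported for Bagging in the conclusions; the confusion matrix actually obtained in the study is not given in this text. The function name is also an assumption, and the F-measure is computed as the equally weighted harmonic mean.

```python
def quality_metrics(a, b, c, d):
    """Metrics from counts a = TN, b = FP, c = FN, d = TP."""
    precision = d / (b + d)
    recall = d / (c + d)
    return {
        "accuracy": (a + d) / (a + b + c + d),
        "precision": precision,
        "recall": recall,
        "specificity": a / (a + b),
        "false_positive_rate": b / (a + b),
        "f_measure": 2 * precision * recall / (precision + recall),
    }

# Hypothetical counts, consistent with the reported percentages (not from the source).
metrics = quality_metrics(a=17, b=1, c=2, d=16)
for name, value in metrics.items():
    print(f"{name}: {value:.2%}")
```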

Results
After cleaning and preprocessing the data, different machine learning algorithms were used to identify patients with ADHD. The results are shown in Table 3 and Figs. 1, 2, 3 and 4.

Conclusions and Discussion
As a result of this experimentation, it can be concluded that machine learning techniques are an effective means of analyzing neuropsychological variables related to attention deficit disorders, which currently affect a large population not only in Colombia but around the world. From the results shown in Fig. 5, it can be determined that for this particular case the Bagging technique showed high performance.
Specifically, the Bagging classifier obtained the following results in the quality metrics: accuracy 91.67%, precision 94.12%, recall 88.89% and F-measure 91.43%.