# Help with my final project –

The final project for the course is a technical blog post related to a data analysis project you will work on piecemeal over the course of the semester.
The project is very open ended. The objective is to demonstrate your skill in asking meaningful questions of your data and answering them with results of the data analysis using R / Rmarkdown, and that your proficiency in interpreting and presenting the results. The goal is not to conduct an exhaustive data analysis. The data analysis part should meet the following criteria:
1. Perform exploratory data analysis summarizing your data using descriptive statistics / summary statistics and visualizations relevant to your questions or ones that highlight some interesting insight.
2. Demonstrate at least two of the following techniques we have learned in class and that helps answer your question: PCA, hypothesis testing / confidence interval, regression analysis (linear /logistic)
Proposal
The first task is toidentify the dataset, understand the data and write questions you are planning to answerusing that dataset. You maypick a data set from one of the resources mentioned on this webpage(Links to an external site.). The proposal should meet the following criteria:
1. Perform checks to determine quality of the data (missing values, outliers, etc.)
2. Proposal on what questions you are interested in answering from the data
3. Initial visualizations and if required transform to get the data ready
A good reference for ideas on questions and EDA in general: https://r4ds.had.co.nz/exploratory-data-analysis.html#questions
It should be about 2+ pages in length, not exceeding 10 with appendix. It should include roughly the following sections:
1. Background or the context of data selected – sources, description of how it was collected, time period it represents, context in it was collected if available, perhaps why you selected it
2. Description of the data – how big is it (number of observations, variables), how many numeric variables, how many categorical variables, description of the variables
3. Goal – What questions you plan to understand from the data.
3. Analysis – Descriptive statistics and visualization of key variables
4. Summary of findings from the analysis and further questions for future analysis
5. References – link to data or analysis sources you have referenced for the report
6. Appendix – all the visualization that does not support your questions directly can go here
Final Write-up
The project should include
1. Introduction: What is your research question? Why do you care? Why should others care? If you know of any other related work done by others, please include a brief description.
2. Data: Include context about the data covering:
a. Data source: Include the citation for your data, and provide link to the source.
b. Data collection: Context on how the data was collected?
c. Cases: What are the cases (units of observation or experiment)? What do the rows represent in your dataset?
d. Variables: What are the variables you will be studying?
e. Type of study: was it an observational study or an experiment?
f. Data clean-up: (Optional) If you had to do any data clean up (missing values, outliers, transformation), include a very brief description of your steps.
3. Exploratory Data Analysis: summarize your data using descriptive statistics / summary statistics and visualizations relevant to your questions or ones that highlight some interesting insight. Additional plots not relevant to your research question can be included in the appendix.
4. Data Analysis: Pick and perform two of the following techniques we have learned in class and that helps answer your question about the dataset: PCA, hypothesis testing / confidence interval, regression analysis (linear /logistic)
5. Conclusion: Summarize your findings and include a discussion of what you have learned about your data through this project. You may also want to include limitations of your approach and include ideas for possible future work.
6. References: Include links that you have referenced for this project.
Rstudio
ATTACHED FILE(S)
Patient Diabetes Data (1)
Pregnancies Glucose BloodPressure SkinThickness Insulin BMI DiabetesPedigreeFunction Age Outcome
6 148 72 35 0 33.6 0.627 50 1
1 85 66 29 0 26.6 0.351 31 0
8 183 64 0 0 23.3 0.672 32 1
1 89 66 23 94 28.1 0.167 21 0
0 137 40 35 168 43.1 2.288 33 1
5 116 74 0 0 25.6 0.201 30 0
3 78 50 32 88 31 0.248 26 1
10 115 0 0 0 35.3 0.134 29 0
2 197 70 45 543 30.5 0.158 53 1
8 125 96 0 0 0 0.232 54 1
4 110 92 0 0 37.6 0.191 30 0
10 168 74 0 0 38 0.537 34 1
10 139 80 0 0 27.1 1.441 57 0
1 189 60 23 846 30.1 0.398 59 1
5 166 72 19 175 25.8 0.587 51 1
7 100 0 0 0 30 0.484 32 1
0 118 84 47 230 45.8 0.551 31 1
7 107 74 0 0 29.6 0.254 31 1
1 103 30 38 83 43.3 0.183 33 0
1 115 70 30 96 34.6 0.529 32 1
3 126 88 41 235 39.3 0.704 27 0
8 99 84 0 0 35.4 0.388 50 0
7 196 90 0 0 39.8 0.451 41 1
9 119 80 35 0 29 0.263 29 1
11 143 94 33 146 36.6 0.254 51 1
10 125 70 26 115 31.1 0.205 41 1
7 147 76 0 0 39.4 0.257 43 1
1 97 66 15 140 23.2 0.487 22 0
13 145 82 19 110 22.2 0.245 57 0
5 117 92 0 0 34.1 0.337 38 0
5 109 75 26 0 36 0.546 60 0
3 158 76 36 245 31.6 0.851 28 1
3 88 58 11 54 24.8 0.267 22 0
6 92 92 0 0 19.9 0.188 28 0
10 122 78 31 0 27.6 0.512 45 0
4 103 60 33 192 24 0.966 33 0
11 138 76 0 0 33.2 0.42 35 0
9 102 76 37 0 32.9 0.665 46 1
2 90 68 42 0 38.2 0.503 27 1
4 111 72 47 207 37.1 1.39 56 1
3 180 64 25 70 34 0.271 26 0
7 133 84 0 0 40.2 0.696 37 0
7 106 92 18 0 22.7 0.235 48 0
9 171 110 24 240 45.4 0.721 54 1
7 159 64 0 0 27.4 0.294 40 0
0 180 66 39 0 42 1.893 25 1
1 146 56 0 0 29.7 0.564 29 0
2 71 70 27 0 28 0.586 22 0
7 103 66 32 0 39.1 0.344 31 1
7 105 0 0 0 0 0.305 24 0
1 103 80 11 82 19.4 0.491 22 0
1 101 50 15 36 24.2 0.526 26 0
5 88 66 21 23 24.4 0.342 30 0
8 176 90 34 300 33.7 0.467 58 1
7 150 66 42 342 34.7 0.718 42 0
1 73 50 10 0 23 0.248 21 0
7 187 68 39 304 37.7 0.254 41 1
0 100 88 60 110 46.8 0.962 31 0
0 146 82 0 0 40.5 1.781 44 0
0 105 64 41 142 41.5 0.173 22 0
2 84 0 0 0 0 0.304 21 0
8 133 72 0 0 32.9 0.27 39 1
5 44 62 0 0 25 0.587 36 0
2 141 58 34 128 25.4 0.699 24 0
7 114 66 0 0 32.8 0.258 42 1
5 99 74 27 0 29 0.203 32 0
0 109 88 30 0 32.5 0.855 38 1
2 109 92 0 0 42.7 0.845 54 0
1 95 66 13 38 19.6 0.334 25 0
4 146 85 27 100 28.9 0.189 27 0
2 100 66 20 90 32.9 0.867 28 1
5 139 64 35 140 28.6 0.411 26 0
13 126 90 0 0 43.4 0.583 42 1
4 129 86 20 270 35.1 0.231 23 0
1 79 75 30 0 32 0.396 22 0
1 0 48 20 0 24.7 0.14 22 0
7 62 78 0 0 32.6 0.391 41 0
5 95 72 33 0 37.7 0.37 27 0
0 131 0 0 0 43.2 0.27 26 1
2 112 66 22 0 25 0.307 24 0
3 113 44 13 0 22.4 0.14 22 0
2 74 0 0 0 0 0.102 22 0
7 83 78 26 71 29.3 0.767 36 0
0 101 65 28 0 24.6 0.237 22 0
5 137 108 0 0 48.8 0.227 37 1
2 110 74 29 125 32.4 0.698 27 0
13 106 72 54 0 36.6 0.178 45 0
2 100 68 25 71 38.5 0.324 26 0
15 136 70 32 110 37.1 0.153 43 1
1 107 68 19 0 26.5 0.165 24 0
1 80 55 0 0 19.1 0.258 21 0
4 123 80 15 176 32 0.443 34 0
7 81 78 40 48 46.7 0.261 42 0
4 134 72 0 0 23.8 0.277 60 1
2 142 82 18 64 24.7 0.761 21 0
6 144 72 27 228 33.9 0.255 40 0
2 92 62 28 0 31.6 0.13 24 0
1 71 48 18 76 20.4 0.323 22 0
6 93 50 30 64 28.7 0.356 23 0
1 122 90 51 220 49.7 0.325 31 1
1 163 72 0 0 39 1.222 33 1
1 151 60 0 0 26.1 0.179 22 0
0 125 96 0 0 22.5 0.262 21 0
1 81 72 18 40 26.6 0.283 24 0
2 85 65 0 0 39.6 0.93 27 0
1 126 56 29 152 28.7 0.801 21 0
1 96 122 0 0 22.4 0.207 27 0
4 144 58 28 140 29.5 0.287 37 0
3 83 58 31 18 34.3 0.336 25 0
0 95 85 25 36 37.4 0.247 24 1
3 171 72 33 135 33.3 0.199 24 1
8 155 62 26 495 34 0.543 46 1
1 89 76 34 37 31.2 0.192 23 0
4 76 62 0 0 34 0.391 25 0
7 160 54 32 175 30.5 0.588 39 1
4 146 92 0 0 31.2 0.539 61 1
5 124 74 0 0 34 0.22 38 1
5 78 48 0 0 33.7 0.654 25 0
4 97 60 23 0 28.2 0.443 22 0
4 99 76 15 51 23.2 0.223 21 0
0 162 76 56 100 53.2 0.759 25 1
6 111 64 39 0 34.2 0.26 24 0
2 107 74 30 100 33.6 0.404 23 0
5 132 80 0 0 26.8 0.186 69 0
0 113 76 0 0 33.3 0.278 23 1
1 88 30 42 99 55 0.496 26 1
3 120 70 30 135 42.9 0.452 30 0
1 118 58 36 94 33.3 0.261 23 0
1 117 88 24 145 34.5 0.403 40 1
0 105 84 0 0 27.9 0.741 62 1
4 173 70 14 168 29.7 0.361 33 1
9 122 56 0 0 33.3 1.114 33 1
3 170 64 37 225 34.5 0.356 30 1
8 84 74 31 0 38.3 0.457 39 0
2 96 68 13 49 21.1 0.647 26 0
2 125 60 20 140 33.8 0.088 31 0
0 100 70 26 50 30.8 0.597 21 0
0 93 60 25 92 28.7 0.532 22 0
0 129 80 0 0 31.2 0.703 29 0
5 105 72 29 325 36.9 0.159 28 0
3 128 78 0 0 21.1 0.268 55 0
5 106 82 30 0 39.5 0.286 38 0
2 108 52 26 63 32.5 0.318 22 0
10 108 66 0 0 32.4 0.272 42 1
4 154 62 31 284 32.8 0.237 23 0
0 102 75 23 0 0 0.572 21 0
9 57 80 37 0 32.8 0.096 41 0
2 106 64 35 119 30.5 1.4 34 0
5 147 78 0 0 33.7 0.218 65 0
2 90 70 17 0 27.3 0.085 22 0
1 136 74 50 204 37.4 0.399 24 0
4 114 65 0 0 21.9 0.432 37 0
9 156 86 28 155 34.3 1.189 42 1
1 153 82 42 485 40.6 0.687 23 0
8 188 78 0 0 47.9 0.137 43 1
7 152 88 44 0 50 0.337 36 1
2 99 52 15 94 24.6 0.637 21 0
1 109 56 21 135 25.2 0.833 23 0
2 88 74 19 53 29 0.229 22 0
17 163 72 41 114 40.9 0.817 47 1
4 151 90 38 0 29.7 0.294 36 0
7 102 74 40 105 37.2 0.204 45 0
0 114 80 34 285 44.2 0.167 27 0
2 100 64 23 0 29.7 0.368 21 0
0 131 88 0 0 31.6 0.743 32 1
6 104 74 18 156 29.9 0.722 41 1
3 148 66 25 0 32.5 0.256 22 0
4 120 68 0 0 29.6 0.709 34 0
4 110 66 0 0 31.9 0.471 29 0
3 111 90 12 78 28.4 0.495 29 0
6 102 82 0 0 30.8 0.18 36 1
6 134 70 23 130 35.4 0.542 29 1
2 87 0 23 0 28.9 0.773 25 0
1 79 60 42 48 43.5 0.678 23 0
2 75 64 24 55 29.7 0.37 33 0
8 179 72 42 130 32.7 0.719 36 1
6 85 78 0 0 31.2 0.382 42 0
0 129 110 46 130 67.1 0.319 26 1
5 143 78 0 0 45 0.19 47 0
5 130 82 0 0 39.1 0.956 37 1
6 87 80 0 0 23.2 0.084 32 0
0 119 64 18 92 34.9 0.725 23 0
1 0 74 20 23 27.7 0.299 21 0
5 73 60 0 0 26.8 0.268 27 0
4 141 74 0 0 27.6 0.244 40 0
7 194 68 28 0 35.9 0.745 41 1
8 181 68 36 495 30.1 0.615 60 1
1 128 98 41 58 32 1.321 33 1
8 109 76 39 114 27.9 0.64 31 1
5 139 80 35 160 31.6 0.361 25 1
3 111 62 0 0 22.6 0.142 21 0
9 123 70 44 94 33.1 0.374 40 0
7 159 66 0 0 30.4 0.383 36 1
11 135 0 0 0 52.3 0.578 40 1
8 85 55 20 0 24.4 0.136 42 0
5 158 84 41 210 39.4 0.395 29 1
1 105 58 0 0 24.3 0.187 21 0
3 107 62 13 48 22.9 0.678 23 1
4 109 64 44 99 34.8 0.905 26 1
4 148 60 27 318 30.9 0.15 29 1
0 113 80 16 0 31 0.874 21 0
1 138 82 0 0 40.1 0.236 28 0
0 108 68 20 0 27.3 0.787 32 0
2 99 70 16 44 20.4 0.235 27 0
6 103 72 32 190 37.7 0.324 55 0
5 111 72 28 0 23.9 0.407 27 0
8 196 76 29 280 37.5 0.605 57 1
5 162 104 0 0 37.7 0.151 52 1
1 96 64 27 87 33.2 0.289 21 0
7 184 84 33 0 35.5 0.355 41 1
2 81 60 22 0 27.7 0.29 25 0
0 147 85 54 0 42.8 0.375 24 0
7 179 95 31 0 34.2 0.164 60 0
0 140 65 26 130 42.6 0.431 24 1
9 112 82 32 175 34.2 0.26 36 1
12 151 70 40 271 41.8 0.742 38 1
5 109 62 41 129 35.8 0.514 25 1
6 125 68 30 120 30 0.464 32 0
5 85 74 22 0 29 1.224 32 1
5 112 66 0 0 37.8 0.261 41 1
0 177 60 29 478 34.6 1.072 21 1
2 158 90 0 0 31.6 0.805 66 1
7 119 0 0 0 25.2 0.209 37 0
7 142 60 33 190 28.8 0.687 61 0
1 100 66 15 56 23.6 0.666 26 0
1 87 78 27 32 34.6 0.101 22 0
0 101 76 0 0 35.7 0.198 26 0
3 162 52 38 0 37.2 0.652 24 1
4 197 70 39 744 36.7 2.329 31 0
0 117 80 31 53 45.2 0.089 24 0
4 142 86 0 0 44 0.645 22 1
6 134 80 37 370 46.2 0.238 46 1
1 79 80 25 37 25.4 0.583 22 0
4 122 68 0 0 35 0.394 29 0
3 74 68 28 45 29.7 0.293 23 0
4 171 72 0 0 43.6 0.479 26 1
7 181 84 21 192 35.9 0.586 51 1
0 179 90 27 0 44.1 0.686 23 1
9 164 84 21 0 30.8 0.831 32 1
0 104 76 0 0 18.4 0.582 27 0
1 91 64 24 0 29.2 0.192 21 0
4 91 70 32 88 33.1 0.446 22 0
3 139 54 0 0 25.6 0.402 22 1
6 119 50 22 176 27.1 1.318 33 1
2 146 76 35 194 38.2 0.329 29 0
9 184 85 15 0 30 1.213 49 1
10 122 68 0 0 31.2 0.258 41 0
0 165 90 33 680 52.3 0.427 23 0
9 124 70 33 402 35.4 0.282 34 0
1 111 86 19 0 30.1 0.143 23 0
9 106 52 0 0 31.2 0.38 42 0
2 129 84 0 0 28 0.284 27 0
2 90 80 14 55 24.4 0.249 24 0
0 86 68 32 0 35.8 0.238 25 0
12 92 62 7 258 27.6 0.926 44 1
1 113 64 35 0 33.6 0.543 21 1
3 111 56 39 0 30.1 0.557 30 0
2 114 68 22 0 28.7 0.092 25 0
1 193 50 16 375 25.9 0.655 24 0
11 155 76 28 150 33.3 1.353 51 1
3 191 68 15 130 30.9 0.299 34 0
3 141 0 0 0 30 0.761 27 1
4 95 70 32 0 32.1 0.612 24 0
3 142 80 15 0 32.4 0.2 63 0
4 123 62 0 0 32 0.226 35 1
5 96 74 18 67 33.6 0.997 43 0
0 138 0 0 0 36.3 0.933 25 1
2 128 64 42 0 40 1.101 24 0
0 102 52 0 0 25.1 0.078 21 0
2 146 0 0 0 27.5 0.24 28 1
10 101 86 37 0 45.6 1.136 38 1
2 108 62 32 56 25.2 0.128 21 0
3 122 78 0 0 23 0.254 40 0
1 71 78 50 45 33.2 0.422 21 0
13 106 70 0 0 34.2 0.251 52 0
2 100 70 52 57 40.5 0.677 25 0
7 106 60 24 0 26.5 0.296 29 1
0 104 64 23 116 27.8 0.454 23 0
5 114 74 0 0 24.9 0.744 57 0
2 108 62 10 278 25.3 0.881 22 0
0 146 70 0 0 37.9 0.334 28 1
10 129 76 28 122 35.9 0.28 39 0
7 133 88 15 155 32.4 0.262 37 0
7 161 86 0 0 30.4 0.165 47 1
2 108 80 0 0 27 0.259 52 1
7 136 74 26 135 26 0.647 51 0
5 155 84 44 545 38.7 0.619 34 0
1 119 86 39 220 45.6 0.808 29 1
4 96 56 17 49 20.8 0.34 26 0
5 108 72 43 75 36.1 0.263 33 0
0 78 88 29 40 36.9 0.434 21 0
0 107 62 30 74 36.6 0.757 25 1
2 128 78 37 182 43.3 1.224 31 1
1 128 48 45 194 40.5 0.613 24 1
0 161 50 0 0 21.9 0.254 65 0
6 151 62 31 120 35.5 0.692 28 0
2 146 70 38 360 28 0.337 29 1
0 126 84 29 215 30.7 0.52 24 0
14 100 78 25 184 36.6 0.412 46 1
8 112 72 0 0 23.6 0.84 58 0
0 167 0 0 0 32.3 0.839 30 1
2 144 58 33 135 31.6 0.422 25 1
5 77 82 41 42 35.8 0.156 35 0
5 115 98 0 0 52.9 0.209 28 1
3 150 76 0 0 21 0.207 37 0
2 120 76 37 105 39.7 0.215 29 0
10 161 68 23 132 25.5 0.326 47 1
0 137 68 14 148 24.8 0.143 21 0
0 128 68 19 180 30.5 1.391 25 1
2 124 68 28 205 32.9 0.875 30 1
6 80 66 30 0 26.2 0.313 41 0
0 106 70 37 148 39.4 0.605 22 0
2 155 74 17 96 26.6 0.433 27 1
3 113 50 10 85 29.5 0.626 25 0
7 109 80 31 0 35.9 1.127 43 1
2 112 68 22 94 34.1 0.315 26 0
3 99 80 11 64 19.3 0.284 30 0
3 182 74 0 0 30.5 0.345 29 1
3 115 66 39 140 38.1 0.15 28 0
6 194 78 0 0 23.5 0.129 59 1
4 129 60 12 231 27.5 0.527 31 0
3 112 74 30 0 31.6 0.197 25 1
0 124 70 20 0 27.4 0.254 36 1
13 152 90 33 29 26.8 0.731 43 1
2 112 75 32 0 35.7 0.148 21 0
1 157 72 21 168 25.6 0.123 24 0
1 122 64 32 156 35.1 0.692 30 1
10 179 70 0 0 35.1 0.2 37 0
2 102 86 36 120 45.5 0.127 23 1
6 105 70 32 68 30.8 0.122 37 0
8 118 72 19 0 23.1 1.476 46 0
2 87 58 16 52 32.7 0.166 25 0
1 180 0 0 0 43.3 0.282 41 1
12 106 80 0 0 23.6 0.137 44 0
1 95 60 18 58 23.9 0.26 22 0
0 165 76 43 255 47.9 0.259 26 0
0 117 0 0 0 33.8 0.932 44 0
5 115 76 0 0 31.2 0.343 44 1
9 152 78 34 171 34.2 0.893 33 1
7 178 84 0 0 39.9 0.331 41 1
1 130 70 13 105 25.9 0.472 22 0
1 95 74 21 73 25.9 0.673 36 0
1 0 68 35 0 32 0.389 22 0
5 122 86 0 0 34.7 0.29 33 0
8 95 72 0 0 36.8 0.485 57 0
8 126 88 36 108 38.5 0.349 49 0
1 139 46 19 83 28.7 0.654 22 0
3 116 0 0 0 23.5 0.187 23 0
3 99 62 19 74 21.8 0.279 26 0
5 0 80 32 0 41 0.346 37 1
4 92 80 0 0 42.2 0.237 29 0
4 137 84 0 0 31.2 0.252 30 0
3 61 82 28 0 34.4 0.243 46 0
1 90 62 12 43 27.2 0.58 24 0
3 90 78 0 0 42.7 0.559 21 0
9 165 88 0 0 30.4 0.302 49 1
1 125 50 40 167 33.3 0.962 28 1
13 129 0 30 0 39.9 0.569 44 1
12 88 74 40 54 35.3 0.378 48 0
1 196 76 36 249 36.5 0.875 29 1
5 189 64 33 325 31.2 0.583 29 1
5 158 70 0 0 29.8 0.207 63 0
5 103 108 37 0 39.2 0.305 65 0
4 146 78 0 0 38.5 0.52 67 1
4 147 74 25 293 34.9 0.385 30 0
5 99 54 28 83 34 0.499 30 0
6 124 72 0 0 27.6 0.368 29 1
0 101 64 17 0 21 0.252 21 0
3 81 86 16 66 27.5 0.306 22 0
1 133 102 28 140 32.8 0.234 45 1
3 173 82 48 465 38.4 2.137 25 1
0 118 64 23 89 0 1.731 21 0
0 84 64 22 66 35.8 0.545 21 0
2 105 58 40 94 34.9 0.225 25 0
2 122 52 43 158 36.2 0.816 28 0
12 140 82 43 325 39.2 0.528 58 1
0 98 82 15 84 25.2 0.299 22 0
1 87 60 37 75 37.2 0.509 22 0
4 156 75 0 0 48.3 0.238 32 1
0 93 100 39 72 43.4 1.021 35 0
1 107 72 30 82 30.8 0.821 24 0
0 105 68 22 0 20 0.236 22 0
1 109 60 8 182 25.4 0.947 21 0
1 90 62 18 59 25.1 1.268 25 0
1 125 70 24 110 24.3 0.221 25 0
1 119 54 13 50 22.3 0.205 24 0
5 116 74 29 0 32.3 0.66 35 1
8 105 100 36 0 43.3 0.239 45 1
5 144 82 26 285 32 0.452 58 1
3 100 68 23 81 31.6 0.949 28 0
1 100 66 29 196 32 0.444 42 0
5 166 76 0 0 45.7 0.34 27 1
1 131 64 14 415 23.7 0.389 21 0
4 116 72 12 87 22.1 0.463 37 0
4 158 78 0 0 32.9 0.803 31 1
2 127 58 24 275 27.7 1.6 25 0
3 96 56 34 115 24.7 0.944 39 0
0 131 66 40 0 34.3 0.196 22 1
3 82 70 0 0 21.1 0.389 25 0
3 193 70 31 0 34.9 0.241 25 1
4 95 64 0 0 32 0.161 31 1
6 137 61 0 0 24.2 0.151 55 0
5 136 84 41 88 35 0.286 35 1
9 72 78 25 0 31.6 0.28 38 0
5 168 64 0 0 32.9 0.135 41 1
2 123 48 32 165 42.1 0.52 26 0
4 115 72 0 0 28.9 0.376 46 1
0 101 62 0 0 21.9 0.336 25 0
8 197 74 0 0 25.9 1.191 39 1
1 172 68 49 579 42.4 0.702 28 1
6 102 90 39 0 35.7 0.674 28 0
1 112 72 30 176 34.4 0.528 25 0
1 143 84 23 310 42.4 1.076 22 0
1 143 74 22 61 26.2 0.256 21 0
0 138 60 35 167 34.6 0.534 21 1
3 173 84 33 474 35.7 0.258 22 1
1 97 68 21 0 27.2 1.095 22 0
4 144 82 32 0 38.5 0.554 37 1
1 83 68 0 0 18.2 0.624 27 0
3 129 64 29 115 26.4 0.219 28 1
1 119 88 41 170 45.3 0.507 26 0
2 94 68 18 76 26 0.561 21 0
0 102 64 46 78 40.6 0.496 21 0
2 115 64 22 0 30.8 0.421 21 0
8 151 78 32 210 42.9 0.516 36 1
4 184 78 39 277 37 0.264 31 1
0 94 0 0 0 0 0.256 25 0
1 181 64 30 180 34.1 0.328 38 1
0 135 94 46 145 40.6 0.284 26 0
1 95 82 25 180 35 0.233 43 1
2 99 0 0 0 22.2 0.108 23 0
3 89 74 16 85 30.4 0.551 38 0
1 80 74 11 60 30 0.527 22 0
2 139 75 0 0 25.6 0.167 29 0
1 90 68 8 0 24.5 1.138 36 0
0 141 0 0 0 42.4 0.205 29 1
12 140 85 33 0 37.4 0.244 41 0
5 147 75 0 0 29.9 0.434 28 0
1 97 70 15 0 18.2 0.147 21 0
6 107 88 0 0 36.8 0.727 31 0
0 189 104 25 0 34.3 0.435 41 1
2 83 66 23 50 32.2 0.497 22 0
4 117 64 27 120 33.2 0.23 24 0
8 108 70 0 0 30.5 0.955 33 1
4 117 62 12 0 29.7 0.38 30 1
0 180 78 63 14 59.4 2.42 25 1
1 100 72 12 70 25.3 0.658 28 0
0 95 80 45 92 36.5 0.33 26 0
0 104 64 37 64 33.6 0.51 22 1
0 120 74 18 63 30.5 0.285 26 0
1 82 64 13 95 21.2 0.415 23 0
2 134 70 0 0 28.9 0.542 23 1
0 91 68 32 210 39.9 0.381 25 0
2 119 0 0 0 19.6 0.832 72 0
2 100 54 28 105 37.8 0.498 24 0
14 175 62 30 0 33.6 0.212 38 1
1 135 54 0 0 26.7 0.687 62 0
5 86 68 28 71 30.2 0.364 24 0
10 148 84 48 237 37.6 1.001 51 1
9 134 74 33 60 25.9 0.46 81 0
9 120 72 22 56 20.8 0.733 48 0
1 71 62 0 0 21.8 0.416 26 0
8 74 70 40 49 35.3 0.705 39 0
5 88 78 30 0 27.6 0.258 37 0
10 115 98 0 0 24 1.022 34 0
0 124 56 13 105 21.8 0.452 21 0
0 74 52 10 36 27.8 0.269 22 0
0 97 64 36 100 36.8 0.6 25 0
8 120 0 0 0 30 0.183 38 1
6 154 78 41 140 46.1 0.571 27 0
1 144 82 40 0 41.3 0.607 28 0
0 137 70 38 0 33.2 0.17 22 0
0 119 66 27 0 38.8 0.259 22 0
7 136 90 0 0 29.9 0.21 50 0
4 114 64 0 0 28.9 0.126 24 0
0 137 84 27 0 27.3 0.231 59 0
2 105 80 45 191 33.7 0.711 29 1
7 114 76 17 110 23.8 0.466 31 0
8 126 74 38 75 25.9 0.162 39 0
4 132 86 31 0 28 0.419 63 0
3 158 70 30 328 35.5 0.344 35 1
0 123 88 37 0 35.2 0.197 29 0
4 85 58 22 49 27.8 0.306 28 0
0 84 82 31 125 38.2 0.233 23 0
0 145 0 0 0 44.2 0.63 31 1
0 135 68 42 250 42.3 0.365 24 1
1 139 62 41 480 40.7 0.536 21 0
0 173 78 32 265 46.5 1.159 58 0
4 99 72 17 0 25.6 0.294 28 0
8 194 80 0 0 26.1 0.551 67 0
2 83 65 28 66 36.8 0.629 24 0
2 89 90 30 0 33.5 0.292 42 0
4 99 68 38 0 32.8 0.145 33 0
4 125 70 18 122 28.9 1.144 45 1
3 80 0 0 0 0 0.174 22 0
6 166 74 0 0 26.6 0.304 66 0
5 110 68 0 0 26 0.292 30 0
2 81 72 15 76 30.1 0.547 25 0
7 195 70 33 145 25.1 0.163 55 1
6 154 74 32 193 29.3 0.839 39 0
2 117 90 19 71 25.2 0.313 21 0
3 84 72 32 0 37.2 0.267 28 0
6 0 68 41 0 39 0.727 41 1
7 94 64 25 79 33.3 0.738 41 0
3 96 78 39 0 37.3 0.238 40 0
10 75 82 0 0 33.3 0.263 38 0
0 180 90 26 90 36.5 0.314 35 1
1 130 60 23 170 28.6 0.692 21 0
2 84 50 23 76 30.4 0.968 21 0
8 120 78 0 0 25 0.409 64 0
12 84 72 31 0 29.7 0.297 46 1
0 139 62 17 210 22.1 0.207 21 0
9 91 68 0 0 24.2 0.2 58 0
2 91 62 0 0 27.3 0.525 22 0
3 99 54 19 86 25.6 0.154 24 0
3 163 70 18 105 31.6 0.268 28 1
9 145 88 34 165 30.3 0.771 53 1
7 125 86 0 0 37.6 0.304 51 0
13 76 60 0 0 32.8 0.18 41 0
6 129 90 7 326 19.6 0.582 60 0
2 68 70 32 66 25 0.187 25 0
3 124 80 33 130 33.2 0.305 26 0
6 114 0 0 0 0 0.189 26 0
9 130 70 0 0 34.2 0.652 45 1
3 125 58 0 0 31.6 0.151 24 0
3 87 60 18 0 21.8 0.444 21 0
1 97 64 19 82 18.2 0.299 21 0
3 116 74 15 105 26.3 0.107 24 0
0 117 66 31 188 30.8 0.493 22 0
0 111 65 0 0 24.6 0.66 31 0
2 122 60 18 106 29.8 0.717 22 0
0 107 76 0 0 45.3 0.686 24 0
1 86 66 52 65 41.3 0.917 29 0
6 91 0 0 0 29.8 0.501 31 0
1 77 56 30 56 33.3 1.251 24 0
4 132 0 0 0 32.9 0.302 23 1
0 105 90 0 0 29.6 0.197 46 0
0 57 60 0 0 21.7 0.735 67 0
0 127 80 37 210 36.3 0.804 23 0
3 129 92 49 155 36.4 0.968 32 1
8 100 74 40 215 39.4 0.661 43 1
3 128 72 25 190 32.4 0.549 27 1
10 90 85 32 0 34.9 0.825 56 1
4 84 90 23 56 39.5 0.159 25 0
1 88 78 29 76 32 0.365 29 0
8 186 90 35 225 34.5 0.423 37 1
5 187 76 27 207 43.6 1.034 53 1
4 131 68 21 166 33.1 0.16 28 0
1 164 82 43 67 32.8 0.341 50 0
4 189 110 31 0 28.5 0.68 37 0
1 116 70 28 0 27.4 0.204 21 0
3 84 68 30 106 31.9 0.591 25 0
6 114 88 0 0 27.8 0.247 66 0
1 88 62 24 44 29.9 0.422 23 0
1 84 64 23 115 36.9 0.471 28 0
7 124 70 33 215 25.5 0.161 37 0
1 97 70 40 0 38.1 0.218 30 0
8 110 76 0 0 27.8 0.237 58 0
11 103 68 40 0 46.2 0.126 42 0
11 85 74 0 0 30.1 0.3 35 0
6 125 76 0 0 33.8 0.121 54 1
0 198 66 32 274 41.3 0.502 28 1
1 87 68 34 77 37.6 0.401 24 0
6 99 60 19 54 26.9 0.497 32 0
0 91 80 0 0 32.4 0.601 27 0
2 95 54 14 88 26.1 0.748 22 0
1 99 72 30 18 38.6 0.412 21 0
6 92 62 32 126 32 0.085 46 0
4 154 72 29 126 31.3 0.338 37 0
0 121 66 30 165 34.3 0.203 33 1
3 78 70 0 0 32.5 0.27 39 0
2 130 96 0 0 22.6 0.268 21 0
3 111 58 31 44 29.5 0.43 22 0
2 98 60 17 120 34.7 0.198 22 0
1 143 86 30 330 30.1 0.892 23 0
1 119 44 47 63 35.5 0.28 25 0
6 108 44 20 130 24 0.813 35 0
2 118 80 0 0 42.9 0.693 21 1
10 133 68 0 0 27 0.245 36 0
2 197 70 99 0 34.7 0.575 62 1
0 151 90 46 0 42.1 0.371 21 1
6 109 60 27 0 25 0.206 27 0
12 121 78 17 0 26.5 0.259 62 0
8 100 76 0 0 38.7 0.19 42 0
8 124 76 24 600 28.7 0.687 52 1
1 93 56 11 0 22.5 0.417 22 0
8 143 66 0 0 34.9 0.129 41 1
6 103 66 0 0 24.3 0.249 29 0
3 176 86 27 156 33.3 1.154 52 1
0 73 0 0 0 21.1 0.342 25 0
11 111 84 40 0 46.8 0.925 45 1
2 112 78 50 140 39.4 0.175 24 0
3 132 80 0 0 34.4 0.402 44 1
2 82 52 22 115 28.5 1.699 25 0
6 123 72 45 230 33.6 0.733 34 0
0 188 82 14 185 32 0.682 22 1
0 67 76 0 0 45.3 0.194 46 0
1 89 24 19 25 27.8 0.559 21 0
1 173 74 0 0 36.8 0.088 38 1
1 109 38 18 120 23.1 0.407 26 0
1 108 88 19 0 27.1 0.4 24 0
6 96 0 0 0 23.7 0.19 28 0
1 124 74 36 0 27.8 0.1 30 0
7 150 78 29 126 35.2 0.692 54 1
4 183 0 0 0 28.4 0.212 36 1
1 124 60 32 0 35.8 0.514 21 0
1 181 78 42 293 40 1.258 22 1
1 92 62 25 41 19.5 0.482 25 0
0 152 82 39 272 41.5 0.27 27 0
1 111 62 13 182 24 0.138 23 0
3 106 54 21 158 30.9 0.292 24 0
3 174 58 22 194 32.9 0.593 36 1
7 168 88 42 321 38.2 0.787 40 1
6 105 80 28 0 32.5 0.878 26 0
11 138 74 26 144 36.1 0.557 50 1
3 106 72 0 0 25.8 0.207 27 0
6 117 96 0 0 28.7 0.157 30 0
2 68 62 13 15 20.1 0.257 23 0
9 112 82 24 0 28.2 1.282 50 1
0 119 0 0 0 32.4 0.141 24 1
2 112 86 42 160 38.4 0.246 28 0
2 92 76 20 0 24.2 1.698 28 0
6 183 94 0 0 40.8 1.461 45 0
0 94 70 27 115 43.5 0.347 21 0
2 108 64 0 0 30.8 0.158 21 0
4 90 88 47 54 37.7 0.362 29 0
0 125 68 0 0 24.7 0.206 21 0
0 132 78 0 0 32.4 0.393 21 0
5 128 80 0 0 34.6 0.144 45 0
4 94 65 22 0 24.7 0.148 21 0
7 114 64 0 0 27.4 0.732 34 1
0 102 78 40 90 34.5 0.238 24 0
2 111 60 0 0 26.2 0.343 23 0
1 128 82 17 183 27.5 0.115 22 0
10 92 62 0 0 25.9 0.167 31 0
13 104 72 0 0 31.2 0.465 38 1
5 104 74 0 0 28.8 0.153 48 0
2 94 76 18 66 31.6 0.649 23 0
7 97 76 32 91 40.9 0.871 32 1
1 100 74 12 46 19.5 0.149 28 0
0 102 86 17 105 29.3 0.695 27 0
4 128 70 0 0 34.3 0.303 24 0
6 147 80 0 0 29.5 0.178 50 1
4 90 0 0 0 28 0.61 31 0
3 103 72 30 152 27.6 0.73 27 0
2 157 74 35 440 39.4 0.134 30 0
1 167 74 17 144 23.4 0.447 33 1
0 179 50 36 159 37.8 0.455 22 1
11 136 84 35 130 28.3 0.26 42 1
0 107 60 25 0 26.4 0.133 23 0
1 91 54 25 100 25.2 0.234 23 0
1 117 60 23 106 33.8 0.466 27 0
5 123 74 40 77 34.1 0.269 28 0
2 120 54 0 0 26.8 0.455 27 0
1 106 70 28 135 34.2 0.142 22 0
2 155 52 27 540 38.7 0.24 25 1
2 101 58 35 90 21.8 0.155 22 0
1 120 80 48 200 38.9 1.162 41 0
11 127 106 0 0 39 0.19 51 0
3 80 82 31 70 34.2 1.292 27 1
10 162 84 0 0 27.7 0.182 54 0
1 199 76 43 0 42.9 1.394 22 1
8 167 106 46 231 37.6 0.165 43 1
9 145 80 46 130 37.9 0.637 40 1
6 115 60 39 0 33.7 0.245 40 1
1 112 80 45 132 34.8 0.217 24 0
4 145 82 18 0 32.5 0.235 70 1
10 111 70 27 0 27.5 0.141 40 1
6 98 58 33 190 34 0.43 43 0
9 154 78 30 100 30.9 0.164 45 0
6 165 68 26 168 33.6 0.631 49 0
1 99 58 10 0 25.4 0.551 21 0
10 68 106 23 49 35.5 0.285 47 0
3 123 100 35 240 57.3 0.88 22 0
8 91 82 0 0 35.6 0.587 68 0
6 195 70 0 0 30.9 0.328 31 1
9 156 86 0 0 24.8 0.23 53 1
0 93 60 0 0 35.3 0.263 25 0
3 121 52 0 0 36 0.127 25 1
2 101 58 17 265 24.2 0.614 23 0
2 56 56 28 45 24.2 0.332 22 0
0 162 76 36 0 49.6 0.364 26 1
0 95 64 39 105 44.6 0.366 22 0
4 125 80 0 0 32.3 0.536 27 1
5 136 82 0 0 0 0.64 69 0
2 129 74 26 205 33.2 0.591 25 0
3 130 64 0 0 23.1 0.314 22 0
1 107 50 19 0 28.3 0.181 29 0
1 140 74 26 180 24.1 0.828 23 0
1 144 82 46 180 46.1 0.335 46 1
8 107 80 0 0 24.6 0.856 34 0
13 158 114 0 0 42.3 0.257 44 1
2 121 70 32 95 39.1 0.886 23 0
7 129 68 49 125 38.5 0.439 43 1
2 90 60 0 0 23.5 0.191 25 0
7 142 90 24 480 30.4 0.128 43 1
3 169 74 19 125 29.9 0.268 31 1
0 99 0 0 0 25 0.253 22 0
4 127 88 11 155 34.5 0.598 28 0
4 118 70 0 0 44.5 0.904 26 0
2 122 76 27 200 35.9 0.483 26 0
6 125 78 31 0 27.6 0.565 49 1
1 168 88 29 0 35 0.905 52 1
2 129 0 0 0 38.5 0.304 41 0
4 110 76 20 100 28.4 0.118 27 0
6 80 80 36 0 39.8 0.177 28 0
10 115 0 0 0 0 0.261 30 1
2 127 46 21 335 34.4 0.176 22 0
9 164 78 0 0 32.8 0.148 45 1
2 93 64 32 160 38 0.674 23 1
3 158 64 13 387 31.2 0.295 24 0
5 126 78 27 22 29.6 0.439 40 0
10 129 62 36 0 41.2 0.441 38 1
0 134 58 20 291 26.4 0.352 21 0
3 102 74 0 0 29.5 0.121 32 0
7 187 50 33 392 33.9 0.826 34 1
3 173 78 39 185 33.8 0.97 31 1
10 94 72 18 0 23.1 0.595 56 0
1 108 60 46 178 35.5 0.415 24 0
5 97 76 27 0 35.6 0.378 52 1
4 83 86 19 0 29.3 0.317 34 0
1 114 66 36 200 38.1 0.289 21 0
1 149 68 29 127 29.3 0.349 42 1
5 117 86 30 105 39.1 0.251 42 0
1 111 94 0 0 32.8 0.265 45 0
4 112 78 40 0 39.4 0.236 38 0
1 116 78 29 180 36.1 0.496 25 0
0 141 84 26 0 32.4 0.433 22 0
2 175 88 0 0 22.9 0.326 22 0
2 92 52 0 0 30.1 0.141 22 0
3 130 78 23 79 28.4 0.323 34 1
8 120 86 0 0 28.4 0.259 22 1
2 174 88 37 120 44.5 0.646 24 1
2 106 56 27 165 29 0.426 22 0
2 105 75 0 0 23.3 0.56 53 0
4 95 60 32 0 35.4 0.284 28 0
0 126 86 27 120 27.4 0.515 21 0
8 65 72 23 0 32 0.6 42 0
2 99 60 17 160 36.6 0.453 21 0
1 102 74 0 0 39.5 0.293 42 1
11 120 80 37 150 42.3 0.785 48 1
3 102 44 20 94 30.8 0.4 26 0
1 109 58 18 116 28.5 0.219 22 0
9 140 94 0 0 32.7 0.734 45 1
13 153 88 37 140 40.6 1.174 39 0
12 100 84 33 105 30 0.488 46 0
1 147 94 41 0 49.3 0.358 27 1
1 81 74 41 57 46.3 1.096 32 0
3 187 70 22 200 36.4 0.408 36 1
6 162 62 0 0 24.3 0.178 50 1
4 136 70 0 0 31.2 1.182 22 1
1 121 78 39 74 39 0.261 28 0
3 108 62 24 0 26 0.223 25 0
0 181 88 44 510 43.3 0.222 26 1
8 154 78 32 0 32.4 0.443 45 1
1 128 88 39 110 36.5 1.057 37 1
7 137 90 41 0 32 0.391 39 0
0 123 72 0 0 36.3 0.258 52 1
1 106 76 0 0 37.5 0.197 26 0
6 190 92 0 0 35.5 0.278 66 1
2 88 58 26 16 28.4 0.766 22 0
9 170 74 31 0 44 0.403 43 1
9 89 62 0 0 22.5 0.142 33 0
10 101 76 48 180 32.9 0.171 63 0
2 122 70 27 0 36.8 0.34 27 0
5 121 72 23 112 26.2 0.245 30 0
1 126 60 0 0 30.1 0.349 47 1
1 93 70 31 0 30.4 0.315 23 0
Patient Diabetes Prediction Analysis¶
This dataset is originally from the National Institute of Diabetes and Digestive and Kidney Diseases. The objective of the dataset is to diagnostically predict whether or not a patient has diabetes, based on certain diagnostic measurements included in the dataset. Several constraints were placed on the selection of these instances from a larger database. In particular, all patients here are females at least 21 years old of Pima Indian heritage.
Objective of the Analysis¶
The objective of the dataset is to diagnostically predict whether or not a patient has diabetes, based on certain diagnostic measurements included in the dataset.
Content¶
The datasets consists of several medical predictor variables and one target variable, Outcome. Predictor variables includes the number of pregnancies the patient has had, their BMI, insulin level, age, and so on.
# Dataset Description:
Pregnancies: Number of times pregnant
Glucose : Plasma glucose concentration a 2 hours in an oral glucose tolerance test.
BloodPressure : Diastolic blood pressure (mm Hg).
SkinThickness:Triceps skin fold thickness (mm).
Insulin :2-Hour serum insulin (mu U/ml).
BMI : Body mass index (weight in kg/(height in m)^2)
DiabetesPedigreeFunction:Diabetes pedigree function
Age : Age (years).
Outcome : Class variable (0 or 1) 268 of 768 are 1, the others are 0.## Process flow of the Analysis
1. Understanding the Requirement
2. Create a objective of the Analysis
3. Data Collection and preparing the Data
4. Understanding the Data and Requirement
5. Performing EDA Analysis (Descriptive Statistics, Correlations, Data Visulizations,Distributions).
6. Selecting the Statistical Techniques.
7. Preparing the Data for applying statistical techniques.
8. Applying the well Suitable model.
9. Checking the Model Accuracy and Performance and Validation.
10.Checking the Auc and Roc Curve.
In[1]:
## Importing the Required Modules for doing Data manipulation in Python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
In[131]:
import xgboost as xgb
from xgboost import XGBClassifier
# !pip install xgboost
# import graphviz
# !pip install graphviz
# !pip install more-itertools
# from sklearn.impute import SimpleImputer
# imputer = SimpleImputer(missing_values=np.nan, strategy=’median’)
In[5]:
from sklearn import model_selection
from sklearn.model_selection import train_test_split, cross_val_score, KFold, learning_curve, StratifiedKFold, train_test_split
from sklearn.metrics import confusion_matrix, make_scorer, accuracy_score
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC, LinearSVC
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, VotingClassifier
from sklearn.neural_network import MLPClassifier as MLPC
from sklearn.neural_network import MLPClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
import warnings
warnings.filterwarnings(“ignore”)
%matplotlib inline
In[6]:
## Importing the Required Datasets in Python
In[7]:
## Get the first five records

Out[7]:
Pregnancies Glucose BloodPressure SkinThickness Insulin BMI DiabetesPedigreeFunction Age Outcome
0 6 148 72 35 0 33.6 0.627 50 1
1 1 85 66 29 0 26.6 0.351 31 0
2 8 183 64 0 0 23.3 0.672 32 1
3 1 89 66 23 94 28.1 0.167 21 0
4 0 137 40 35 168 43.1 2.288 33 1

In[8]:
## Get the Last five records
Data.tail()

Out[8]:
Pregnancies Glucose BloodPressure SkinThickness Insulin BMI DiabetesPedigreeFunction Age Outcome
763 10 101 76 48 180 32.9 0.171 63 0
764 2 122 70 27 0 36.8 0.340 27 0
765 5 121 72 23 112 26.2 0.245 30 0
766 1 126 60 0 0 30.1 0.349 47 1
767 1 93 70 31 0 30.4 0.315 23 0

In[9]:
## Get the Columns names
Data.columns

Out[9]:
Index([‘Pregnancies’, ‘Glucose’, ‘BloodPressure’, ‘SkinThickness’, ‘Insulin’,
‘BMI’, ‘DiabetesPedigreeFunction’, ‘Age’, ‘Outcome’],
dtype=’object’)

In[10]:
## Getting the Shape of the Data
Data.shape

Out[10]:
(768, 9)

Interpretation :
In the dataset we have 768 Rows(records) and 9 columns
In[11]:
## Getting the Data types for all the columns
Data.dtypes

Out[11]:
Pregnancies int64
Glucose int64
BloodPressure int64
SkinThickness int64
Insulin int64
BMI float64
DiabetesPedigreeFunctionfloat64
Age int64
Outcome int64
dtype: object

In[12]:
## get the Information of the Data
Data.info()

RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
# ColumnNon-Null CountDtype
—————————-
0 Pregnancies 768 non-nullint64
1 Glucose 768 non-nullint64
2 BloodPressure 768 non-nullint64
3 SkinThickness 768 non-nullint64
4 Insulin 768 non-nullint64
5 BMI 768 non-nullfloat64
6 DiabetesPedigreeFunction768 non-nullfloat64
7 Age 768 non-nullint64
8 Outcome 768 non-nullint64
dtypes: float64(2), int64(7)
memory usage: 54.1 KB

getting the Mising Values for all columns¶
Data.isnull().sum()
Interpretation : In the Data set all are numeric and there is no any missing values in all the variables.
In[14]:
## Get the Descriptive Statistics for all the Variables
Data.describe()

Out[14]:
Pregnancies Glucose BloodPressure SkinThickness Insulin BMI DiabetesPedigreeFunction Age Outcome
count 768.000000 768.000000 768.000000 768.000000 768.000000 768.000000 768.000000 768.000000 768.000000
mean 3.845052 120.894531 69.105469 20.536458 79.799479 31.992578 0.471876 33.240885 0.348958
std 3.369578 31.972618 19.355807 15.952218 115.244002 7.884160 0.331329 11.760232 0.476951
min 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.078000 21.000000 0.000000
25% 1.000000 99.000000 62.000000 0.000000 0.000000 27.300000 0.243750 24.000000 0.000000
50% 3.000000 117.000000 72.000000 23.000000 30.500000 32.000000 0.372500 29.000000 0.000000
75% 6.000000 140.250000 80.000000 32.000000 127.250000 36.600000 0.626250 41.000000 1.000000
max 17.000000 199.000000 122.000000 99.000000 846.000000 67.100000 2.420000 81.000000 1.000000

Interpretation: From the above Descriptive Statistics table we can seen that variables Skin Thickness and Insulin are
having higher maximum values,which means that the some of the observations away from the Dataset.
and also min value is zero and 25% percentie values were also zero,which is suggesting that those feature not having enough
data. there are some data issues, indicating we are missing some data points from the patients.
In[15]:
## Getting the No.of.Zeros for all the columns
data1 = Data.iloc[:, :-1]
print(“# of Rows, # of Columns: “,Data.shape)
print(“\nColumn Name # of Null Values\n”)
print((Data[:] == 0).sum())

# of Rows, # of Columns:(768, 9)
Column Name # of Null Values
Pregnancies 111
Glucose 5
BloodPressure35
SkinThickness 227
Insulin 374
BMI11
DiabetesPedigreeFunction0
Age 0
Outcome 500
dtype: int64

In[16]:
## Getting the % of No.of. Zeros for all the columns.
print(“# of Rows, # of Columns: “,Data.shape)
print(“\nColumn Name% Null Values\n”)
print(((Data[:] == 0).sum())/768*100)

# of Rows, # of Columns:(768, 9)
Column Name% Null Values
Pregnancies 14.453125
Glucose0.651042
BloodPressure4.557292
SkinThickness 29.557292
Insulin 48.697917
BMI1.432292
DiabetesPedigreeFunction 0.000000
Age0.000000
Outcome 65.104167
dtype: float64

Interpretation: Variables Such as SkinThickness and Insulin are having more percentages of zeros , which is indicating that
we do not have more information about them. there are some data points are missing.doctors only measured insulin levels in unhealthy looking patients — or maybe they only measured insulin levels after having first made a preliminary diagnosis. If that were true then this would be a form of data issues.
In[17]:
## Scatter Plots for all the varibles
sns.pairplot(Data)

Out[17]:

Interpretation: From the above output we can see that
Pregnancies variable : rightly skewed and having some relation with Age.
Glucose Variable : Almost normal distribution and do not having much relation with any other feature.
BloodPressure Variable: Almst Normal Distribution anddo not having much relation with any other feature.
SkinThickness Variable : Seem’s like normal Distribution and Slite rightly skewed and having some linear
relationship with BMI, Age, Insulin.
BMI Variable : Almost Normal Distribution and having some reltionship with Skinthickness.
DiabetesPedigreeFunction : Seem’s like rightly skewed , having some linear relation with Age.
Age Variable : Rightly Skewed having some relationship with Age.
In[18]:
sns.pairplot(Data, hue=”Outcome”)
sns.pairplot(Data, hue=”Outcome”, diag_kind=”hist”)

Out[18]:

Interpretation : From the above plot it can be seen that the how target variable distributed across all the variables.
In[19]:
corr = Data.corr()

Out[19]:
Pregnancies Glucose BloodPressure SkinThickness Insulin BMI DiabetesPedigreeFunction Age Outcome
Pregnancies 1.00 0.13 0.14 -0.08 -0.07 0.02 -0.03 0.54 0.22
Glucose 0.13 1.00 0.15 0.06 0.33 0.22 0.14 0.26 0.47
BloodPressure 0.14 0.15 1.00 0.21 0.09 0.28 0.04 0.24 0.07
SkinThickness -0.08 0.06 0.21 1.00 0.44 0.39 0.18 -0.11 0.07
Insulin -0.07 0.33 0.09 0.44 1.00 0.20 0.19 -0.04 0.13
BMI 0.02 0.22 0.28 0.39 0.20 1.00 0.14 0.04 0.29
DiabetesPedigreeFunction -0.03 0.14 0.04 0.18 0.19 0.14 1.00 0.03 0.17
Age 0.54 0.26 0.24 -0.11 -0.04 0.04 0.03 1.00 0.24
Outcome 0.22 0.47 0.07 0.07 0.13 0.29 0.17 0.24 1.00

Interpretation: 1.Variables such as Pregnancies and Age are having high +vely Correlated together,
if age is increased then pregnancies also be increased.
2. SkinThickness having some -vely correlated with Age and Pregnancies that means ,
if Skinthickness is decreased then age and Pregnancies will also be decreased.
3. Insulin and Skinthickness is having some +vely correlated that means of Insulin levels are increased then skinthickness will also be Increased.
4. similarlly BMI and Skinthickness.
Plotting Histograms¶
In[22]:
def plotHistogram(values,label,feature,title):
sns.set_style(“whitegrid”)
plotOne = sns.FacetGrid(values, hue=label,aspect=2)
plotOne.map(sns.distplot,feature,kde=False)
plotOne.set(xlim=(0, values[feature].max()))
plotOne.set_axis_labels(feature, ‘Proportion’)
plotOne.fig.suptitle(title)
plt.show()
plotHistogram(Data,”Outcome”,’Pregnancies’,’Pregnancies vs Diagnosis (Blue = Healthy; Orange = Diabetes)’)

Interpretation: If the Patient has high prehnancies values they likely to get more diabetes.
In[23]:
plotHistogram(Data,”Outcome”,’Glucose’,’Glucose vs Diagnosis (Blue = Healthy; Orange = Diabetes)’)

Interpretation: If the Patient has high glucose level then they more likely to get Diabetes.
In[24]:
plotHistogram(Data,”Outcome”,’BloodPressure’,’BloodPressure vs Diagnosis (Blue = Healthy; Orange = Diabetes)’)

Interpretation: If the Patint has normal bloodpressure then they are less likely to get Diabetes.
In[25]:
plotHistogram(Data,”Outcome”,’SkinThickness’,’SkinThickness vs Diagnosis (Blue = Healthy; Orange = Diabetes)’)

Interpretation: If the patient has low skinthickness then they more likely to get Diabetes.
In[26]:
plotHistogram(Data,”Outcome”,’Insulin’,’Insulin vs Diagnosis (Blue = Healthy; Orange = Diabetes)’)

Interpretation: If the Patient has less insulin levels then they more likely to get Diabetes.
In[27]:
plotHistogram(Data,”Outcome”,’BMI’,’BMI vs Diagnosis (Blue = Healthy; Orange = Diabetes)’)

Interpretation: If the Patient has normal BMI levels they are less likely to get diabetes.
In[29]:
plotHistogram(Data,”Outcome”,’DiabetesPedigreeFunction’,’DiabetesPedigreeFunction vs Diagnosis (Blue = Healthy; Orange = Diabetes)’)

Interpretation: If the Patient has low Diabetes Pedigree function then they are more likely to get Diabetes.
In[30]:
plotHistogram(Data,”Outcome”,’Age’,’Age vs Diagnosis (Blue = Healthy; Orange = Diabetes)’)

Interpretation: If the patient age is lies between 30 to 50 they are more likely to get Diabetes.
Plotting Box Plots¶
In[31]:
import seaborn as sns
import matplotlib.pyplot as plt
sns.boxplot( x=Data[“Outcome”], y=Data[“Pregnancies”] );
plt.show()

Interpretation:Diabetes is more common in patients with a large number of pregnancies.
In[32]:
sns.boxplot( x=Data[“Outcome”], y=Data[“Glucose”] );
plt.show()

Interpretation: Diabetes is more common in patients with a higher Glucose levels. and patients who are not having Diabetes they having some outliers,some of the patients are having more Glucose levels.
In[33]:
sns.boxplot( x=Data[“Outcome”], y=Data[“BloodPressure”] );
plt.show()

Interpretation: If the patients are having normal bloodpressure then Diabetes is normal in patients.
Both Patients are having some outliers, which means that some ptients are having some what higher bloodpresure than the other patients.
In[34]:
sns.boxplot( x=Data[“Outcome”], y=Data[“SkinThickness”] );
plt.show()

Interpretation: Diabetes is less common in patients with a normal or less skinthickness.
In[35]:
sns.boxplot( x=Data[“Outcome”], y=Data[“Insulin”] );
plt.show()

Interpretation: Diabetes is more common in patients with a higher insulin levels.
And also both patients are having some outliers data points, which says that some of the patients are having higher Insulin levels than the others.
In[36]:
sns.boxplot( x=Data[“Outcome”], y=Data[“BMI”] );
plt.show()

Interpretation: Diabetes is more common in patients with a higher insulin levels.
and some of the patients are having higher BMI levels.
In[37]:
sns.boxplot( x=Data[“Outcome”], y=Data[“DiabetesPedigreeFunction”] );
plt.show()

Interpretation: Diabetes is less common in patients with a lower Diabetes Pediagree function levels.
some of the patients are having some what higher DPfunction levels.
In[38]:
sns.boxplot( x=Data[“Outcome”], y=Data[“Age”] );
plt.show()

Interpretation : Diabetes is more common in patients with a bigger age.
and also patients who are not having diabetes they are having more number of outliers than the others.
Bar Plots¶
In[40]:
# Pregnancies v/s Outcome barplot
sns.barplot(x = ‘Outcome’,
y = ‘Pregnancies’,
data = Data)
# Show the plot
plt.show()

Interpretation : Diabetes is more common in patients with a more number of pregnencies.
In[41]:
# Glucose v/s Outcome barplot
sns.barplot(x = ‘Outcome’,
y = ‘Glucose’,
data = Data)
# Show the plot
plt.show()

Interpretation: Diabetes is more common in patients with a more glucose levels.
In[42]:
# BloodPressure v/s Outcome barplot
sns.barplot(x = ‘Outcome’,
y = ‘BloodPressure’,
data = Data)
# Show the plot
plt.show()

Interpretation : Diabetes is less common in patients with a normal bloodpressure.
In[43]:
# SkinThickness v/s Outcome barplot
sns.barplot(x = ‘Outcome’,
y = ‘SkinThickness’,
data = Data)
# Show the plot
plt.show()

Interpretation : Diabetes is less common in patients with a lower skinthickness.
In[44]:
# Insulin v/s Outcome barplot
sns.barplot(x = ‘Outcome’,
y = ‘Insulin’,
data = Data)
# Show the plot
plt.show()

Interpretation: Diabetes is more common in patients with a Higher Insulin levels.
In[45]:
# BMI v/s Outcome barplot
sns.barplot(x = ‘Outcome’,
y = ‘BMI’,
data = Data)
# Show the plot
plt.show()

Interpretation :Diabetes is less common in patients with a lower BMI levels.
In[46]:
# DiabetesPedigreeFunction v/s Outcome barplot
sns.barplot(x = ‘Outcome’,
y = ‘DiabetesPedigreeFunction’,
data = Data)
# Show the plot
plt.show()

Interpretation :Diabetes is less common in patients with a higher DPfunctin levels.
In[47]:
# Age v/s Outcome barplot
sns.barplot(x = ‘Outcome’,
y = ‘Age’,
data = Data)
# Show the plot
plt.show()

In[48]:
## Group By Summaries for all the Variablewith the target Outcome Variables
Data[‘Outcome’] = Data[‘Outcome’].astype(‘O’)
In[49]:
## Pregnencies
Data.groupby([‘Outcome’],as_index = False).aggregate({‘Pregnancies’: ‘count’}).rename(columns= {‘Pregnancies’:’Count_Pregnancies’})

Out[49]:
Outcome Count_Pregnancies
0 0 500
1 1 268

Interpretation : Almost 35 % of the patients are having Diabetes.
In[50]:
## Pregnencies
Data.groupby([‘Outcome’],as_index = False).aggregate({‘Pregnancies’: ‘mean’}).rename(columns= {‘Pregnancies’:’Mean_Pregnancies’})

Out[50]:
Outcome Mean_Pregnancies
0 0 3.298000
1 1 4.865672

Interpretation : An average there are almost 5 times pregnencies patients get Diabetes out of268 diabetes patients.
In[51]:
## Glucose
Data.groupby([‘Outcome’],as_index = False).aggregate({‘Glucose’: ‘mean’}).rename(columns= {‘Glucose’:’Mean_Glucose’})

Out[51]:
Outcome Mean_Glucose
0 0 109.980000
1 1 141.257463

Interpretation : if the patients are havings an average 141 glucose levels then patients get Diabetes out of 268 diabetes patients.
In[52]:
## BloodPressure
Data.groupby([‘Outcome’],as_index = False).aggregate({‘BloodPressure’: ‘mean’}).rename(columns= {‘BloodPressure’:’Mean_BloodPressure’})

Out[52]:
Outcome Mean_BloodPressure
0 0 68.184000
1 1 70.824627

Interpretation:if the patients are havings an average 70 BloodPressure levels then patients get Diabetes out of 268 diabetes patients.
In[53]:
## SkinThickness
Data.groupby([‘Outcome’],as_index = False).aggregate({‘SkinThickness’: ‘mean’}).rename(columns= {‘SkinThickness’:’Mean_SkinThickness’})

Out[53]:
Outcome Mean_SkinThickness
0 0 19.664000
1 1 22.164179

Interpretation:if the patients are havings an average 22 skinthickness levels then patients get Diabetes out of 268 diabetes patients.
In[54]:
## Insulin
Data.groupby([‘Outcome’],as_index = False).aggregate({‘Insulin’: ‘mean’}).rename(columns= {‘Insulin’:’Mean_Insulin’})

Out[54]:
Outcome Mean_Insulin
0 0 68.792000
1 1 100.335821

Interpretation:if the patients are havings an average 100 Insulin levels then patients get Diabetes out of 268 diabetes patients.
In[55]:
## BMI
Data.groupby([‘Outcome’],as_index = False).aggregate({‘BMI’: ‘mean’}).rename(columns= {‘BMI’:’Mean_BMI’})

Out[55]:
Outcome Mean_BMI
0 0 30.304200
1 1 35.142537

Interpretation:if the patients are havings an average 35 BMI levels then patients get Diabetes out of 268 diabetes patients.
In[56]:
## DiabetesPedigreeFunction
Data.groupby([‘Outcome’],as_index = False).aggregate({‘DiabetesPedigreeFunction’: ‘mean’}).rename(columns= {‘DiabetesPedigreeFunction’:’Mean_DiabetesPedigreeFunction’})

Out[56]:
Outcome Mean_DiabetesPedigreeFunction
0 0 0.429734
1 1 0.550500

Interpretation:if the patients are havings an average 0.55 DPfunction levels then patients get Diabetes out of 268 diabetes patients.
In[57]:
## Age
Data.groupby([‘Outcome’],as_index = False).aggregate({‘Age’: ‘mean’}).rename(columns= {‘Age’:’Mean_Age’})

Out[57]:
Outcome Mean_Age
0 0 31.190000
1 1 37.067164

Interpretation:if the patients are havings an average 37 age then patients get Diabetes out of 268 diabetes patients.
Predctive Modeling or Machine learning Model in Python¶
Predict the patients who are getting diabetes or not¶
In[107]:
def plot_learning_curve(estimator, title, X, y, ylim=None, cv=None,
n_jobs=1, train_sizes=np.linspace(.1, 1.0, 5)):
plt.figure()
plt.title(title)
if ylim is not None:
plt.ylim(*ylim)
plt.xlabel(“Training examples”)
plt.ylabel(“Score”)
train_sizes, train_scores, test_scores = learning_curve(
estimator, X, y, cv=cv, n_jobs=n_jobs, train_sizes=train_sizes)
train_scores_mean = np.mean(train_scores, axis=1)
train_scores_std = np.std(train_scores, axis=1)
test_scores_mean = np.mean(test_scores, axis=1)
test_scores_std = np.std(test_scores, axis=1)
plt.grid()
plt.fill_between(train_sizes, train_scores_mean – train_scores_std,
train_scores_mean + train_scores_std, alpha=0.1,
color=”r”)
plt.fill_between(train_sizes, test_scores_mean – test_scores_std,
test_scores_mean + test_scores_std, alpha=0.1, color=”g”)
plt.plot(train_sizes, train_scores_mean, ‘o-‘, color=”r”,
label=”Training score”)
plt.plot(train_sizes, test_scores_mean, ‘o-‘, color=”g”,
label=”Cross-validation score”)
plt.legend(loc=”best”)
return plt
def plot_confusion_matrix(cm, classes,
normalize=False,
title=’Confusion matrix’,
cmap=plt.cm.Blues):
plt.imshow(cm, interpolation=’nearest’, cmap=cmap)
plt.title(title)
plt.colorbar()
tick_marks = np.arange(len(classes))
plt.xticks(tick_marks, classes, rotation=45)
plt.yticks(tick_marks, classes)
fmt = ‘.2f’ if normalize else ‘d’
thresh = cm.max() / 2.
for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
plt.text(j, i, format(cm[i, j], fmt),
horizontalalignment=”center”,
color=”white” if cm[i, j] > thresh else “black”)
plt.tight_layout()
plt.ylabel(‘True label’)
plt.xlabel(‘Predicted label’)
def compareABunchOfDifferentModelsAccuracy(a, b, c, d):
print(‘\nCompare Multiple Classifiers: \n’)
print(‘K-Fold Cross-Validation Accuracy: \n’)
names = []
models = []
resultsAccuracy = []
models.append((‘LR’, LogisticRegression()))
models.append((‘RF’, RandomForestClassifier()))
models.append((‘KNN’, KNeighborsClassifier()))
models.append((‘SVM’, SVC()))
models.append((‘LSVM’, LinearSVC()))
models.append((‘GNB’, GaussianNB()))
models.append((‘DTC’, DecisionTreeClassifier()))
for name, model in models:
model.fit(a, b)
kfold = model_selection.KFold(n_splits=10, random_state=7)
accuracy_results = model_selection.cross_val_score(model, a,b, cv=kfold, scoring=’accuracy’)
resultsAccuracy.append(accuracy_results)
names.append(name)
accuracyMessage = “%s: %f (%f)” % (name, accuracy_results.mean(), accuracy_results.std())
print(accuracyMessage)
# Boxplot
fig = plt.figure()
fig.suptitle(‘Algorithm Comparison: Accuracy’)
plt.boxplot(resultsAccuracy)
ax.set_xticklabels(names)
ax.set_ylabel(‘Cross-Validation: Accuracy Score’)
plt.show()
def defineModels():
print(‘\nLR = LogisticRegression’)
print(‘RF = RandomForestClassifier’)
print(‘KNN = KNeighborsClassifier’)
print(‘SVM = Support Vector Machine SVC’)
print(‘LSVM = LinearSVC’)
print(‘GNB = GaussianNB’)
print(‘DTC = DecisionTreeClassifier’)
names = [“Nearest Neighbors”, “Linear SVM”, “RBF SVM”, “Gaussian Process”,
“Decision Tree”, “Random Forest”, “MLPClassifier”, “AdaBoost”,
“Naive Bayes”, “QDA”]
classifiers = [
KNeighborsClassifier(),
SVC(kernel=”linear”),
SVC(kernel=”rbf”),
GaussianProcessClassifier(),
DecisionTreeClassifier(),
RandomForestClassifier(),
MLPClassifier(),
GaussianNB(),
]
dict_characters = {0: ‘Healthy’, 1: ‘Diabetes’}
In[108]:
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values=np.nan, strategy=’mean’)
In[109]:
Data[‘Outcome’] = Data[‘Outcome’].astype(‘int64′)
In[110]:
## Seperating the Target outcome variable and Predictors or indipendent variables in Python
## Seperating the All predictors or Independent Variables
X = Data.iloc[:, :-1]
## Seperating the Target or Dependent Variable
y = Data.iloc[:, -1]
In[111]:
## Splitting the Data into Training and Testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)
## Imputing the zeros with median values in all the predictor variables
imputer = SimpleImputer(missing_values=0,strategy=’median’)
In[112]:
## Replacing the zeros with median values in all the predictor variables.
X_train2 = imputer.fit_transform(X_train)
X_test2 = imputer.transform(X_test)
X_train3 = pd.DataFrame(X_train2)
In[113]:
##Ploting the Histograms after replacing zeros with median values.
plotHistogram(X_train3,None,4,’Insulin vs Diagnosis (Blue = Healthy; Orange = Diabetes)’)
plotHistogram(X_train3,None,3,’SkinThickness vs Diagnosis (Blue = Healthy; Orange = Diabetes)’)

In[114]:
## adding the lable names for all the predictors variables
labels = {0:’Pregnancies’,1:’Glucose’,2:’BloodPressure’,3:’SkinThickness’,4:’Insulin’,5:’BMI’,6:’DiabetesPedigreeFunction’,7:’Age’}
print(labels)
print(“\nColumn #, # of Zero Values\n”)
print((X_train3[:] == 0).sum())

{0: ‘Pregnancies’, 1: ‘Glucose’, 2: ‘BloodPressure’, 3: ‘SkinThickness’, 4: ‘Insulin’, 5: ‘BMI’, 6: ‘DiabetesPedigreeFunction’, 7: ‘Age’}
Column #, # of Zero Values
00
10
20
30
40
50
60
70
dtype: int64

In[115]:
## Evaluate Classification Models
compareABunchOfDifferentModelsAccuracy(X_train2, y_train, X_test2, y_test)
defineModels()
results = {}
for name, clf in zip (names, classifiers):
scores = cross_val_score (clf, X_train2, y_train, cv=5)
results[name] = scores
for name, scores in results.items():
print(“%20s | Accuracy: %0.2f%% (+/- %0.2f%%)” % (name, 100*scores.mean(), 100*scores.std() * 2))

Compare Multiple Classifiers:
K-Fold Cross-Validation Accuracy:
LR: 0.757792 (0.065803)
RF: 0.750419 (0.044322)
KNN: 0.705381 (0.083708)
SVM: 0.746646 (0.061050)
LSVM: 0.583298 (0.139781)
GNB: 0.741230 (0.060967)
DTC: 0.692767 (0.067304)
GBC: 0.757722 (0.055943)

LR = LogisticRegression
RF = RandomForestClassifier
KNN = KNeighborsClassifier
SVM = Support Vector Machine SVC
LSVM = LinearSVC
GNB = GaussianNB
DTC = DecisionTreeClassifier
Nearest Neighbors | Accuracy: 69.65% (+/- 11.25%)
Linear SVM | Accuracy: 75.98% (+/- 9.05%)
RBF SVM | Accuracy: 74.49% (+/- 8.55%)
Gaussian Process | Accuracy: 66.29% (+/- 9.55%)
Decision Tree | Accuracy: 68.90% (+/- 3.39%)
Random Forest | Accuracy: 76.16% (+/- 4.53%)
MLPClassifier | Accuracy: 65.93% (+/- 6.59%)
AdaBoost | Accuracy: 73.01% (+/- 9.44%)
Naive Bayes | Accuracy: 73.38% (+/- 8.01%)
QDA | Accuracy: 72.45% (+/- 8.47%)

Interpretation: From the above output we can see that Decision tree model has 68.34% accuracy , which is low
if we camparing the another models but this model is having less error variance and error rate. so we can say that we can go
ahead with Decision tree models for better predictions and we are able to see entire tree output as well.
If we can see that Random Forest model is aslo having good acuracy with variance and error rate.
In[117]:
# !pip install more-itertools
from itertools import product
import itertools
In[118]:
## explore the decision tree model in more detail
def runDecisionTree(a, b, c, d):
model = DecisionTreeClassifier()
accuracy_scorer = make_scorer(accuracy_score)
model.fit(a, b)
kfold = model_selection.KFold(n_splits=10, random_state=7)
accuracy = model_selection.cross_val_score(model, a, b, cv=kfold, scoring=’accuracy’)
mean = accuracy.mean()
stdev = accuracy.std()
prediction = model.predict(c)
cnf_matrix = confusion_matrix(d, prediction)
#plot_confusion_matrix(cnf_matrix, classes=class_names, normalize=True,title=’Normalized confusion matrix’)
plot_learning_curve(model, ‘Learning Curve For DecisionTreeClassifier’, a, b, (0.60,1.1), 10)
#learning_curve(model, ‘Learning Curve For DecisionTreeClassifier’, a, b, (0.60,1.1), 10)
plt.show()
plot_confusion_matrix(cnf_matrix, classes=dict_characters,title=’Confusion matrix’)
plt.show()
print(‘DecisionTreeClassifier – Training set accuracy: %s (%s)’ % (mean, stdev))
return
runDecisionTree(X_train2, y_train, X_test2, y_test)
feature_names1 = X.columns.values
# def plot_decision_tree1(a,b):
# dot_data = tree.export_graphviz(a, out_file=None,
#feature_names=b,
#class_names=[‘Healthy’,’Diabetes’],
#filled=False, rounded=True,
#special_characters=False)
# graph = graphviz.Source(dot_data)
# return graph
# clf1 = tree.DecisionTreeClassifier(max_depth=3,min_samples_leaf=12)
# clf1.fit(X_train2, y_train)
# plot_decision_tree1(clf1,feature_names1)

DecisionTreeClassifier – Training set accuracy: 0.7075122292103424 (0.05459628509197373)

In[119]:
from matplotlib import pyplot as plt
from sklearn import datasets
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree
In[120]:
# Fit the classifier with default hyper-parameters
clf = DecisionTreeClassifier(max_depth=3,min_samples_leaf=12,random_state=1234)
model = clf.fit(X, y)
In[121]:
text_representation = tree.export_text(clf)
print(text_representation)

|— feature_1 <= 127.50 | |--- feature_7 <= 28.50 | | |--- feature_5 <= 30.95 | | | |--- class: 0 | | |--- feature_5 >30.95
| | | |— class: 0
| |— feature_7 >28.50
| | |— feature_5 <= 26.35 | | | |--- class: 0 | | |--- feature_5 >26.35
| | | |— class: 0
|— feature_1 >127.50
| |— feature_5 <= 29.95 | | |--- feature_1 <= 145.50 | | | |--- class: 0 | | |--- feature_1 >145.50
| | | |— class: 1
| |— feature_5 >29.95
| | |— feature_1 <= 157.50 | | | |--- class: 1 | | |--- feature_1 >157.50
| | | |— class: 1

In[122]:
from sklearn import datasets
from sklearn.tree import DecisionTreeRegressor
from sklearn import tree
In[123]:
# Fit the regressor, set max_depth = 3
regr = DecisionTreeRegressor(max_depth=3, random_state=1234)
model = regr.fit(X, y)
In[124]:
text_representation = tree.export_text(regr)
print(text_representation)

|— feature_1 <= 127.50 | |--- feature_7 <= 28.50 | | |--- feature_5 <= 45.40 | | | |--- value: [0.07] | | |--- feature_5 >45.40
| | | |— value: [0.75]
| |— feature_7 >28.50
| | |— feature_5 <= 26.35 | | | |--- value: [0.05] | | |--- feature_5 >26.35
| | | |— value: [0.40]
|— feature_1 >127.50
| |— feature_5 <= 29.95 | | |--- feature_1 <= 145.50 | | | |--- value: [0.15] | | |--- feature_1 >145.50
| | | |— value: [0.51]
| |— feature_5 >29.95
| | |— feature_1 <= 157.50 | | | |--- value: [0.61] | | |--- feature_1 >157.50
| | | |— value: [0.87]

In[125]:
# Prepare the data data
Col = np.array(X.columns)
Col
fig = plt.figure(figsize=(12,7))
_ = tree.plot_tree(regr, feature_names=Col, filled=True)

Interpretation: From the above decision tree output we can observe that most significant variable is Glocose.
It Has been at the value is <=127.5 Then after it has been divided into two sub nodes based on Age and BMI Age has significant and splitted at the values is <=28.5 BMI it has significant and splitted at the value is <=30 BMI again it has been splitted by Glucose. then after Glucose has been divided into 4 sub nodes and BMI it has been splitted into 4Sub nodes. In which each node we will be calculated how many patients has been fallen under in each sub nodes. In[]: fig = plt.figure(figsize=(15,15)) tree.plot_tree(clf) In[128]: pip install xgboost Note: you may need to restart the kernel to use updated packages. 'C:\Users\NANUMALA' is not recognized as an internal or external command, operable program or batch file. In[132]: # Evaluate Feature Importances feature_names = X.columns.values clf1 = tree.DecisionTreeClassifier(max_depth=3,min_samples_leaf=12) clf1.fit(X_train2, y_train) print('Accuracy of DecisionTreeClassifier: {:.2f}'.format(clf1.score(X_test2, y_test))) columns = X.columns coefficients = clf1.feature_importances_.reshape(X.columns.shape[0], 1) absCoefficients = abs(coefficients) fullList = pd.concat((pd.DataFrame(columns, columns = ['Variable']), pd.DataFrame(absCoefficients, columns = ['absCoefficient'])), axis = 1).sort_values(by='absCoefficient', ascending = False) print('DecisionTreeClassifier - Feature Importance:') print('\n',fullList,'\n') feature_names = X.columns.values clf2 = RandomForestClassifier(max_depth=3,min_samples_leaf=12) clf2.fit(X_train2, y_train) print('Accuracy of RandomForestClassifier: {:.2f}'.format(clf2.score(X_test2, y_test))) columns = X.columns coefficients = clf2.feature_importances_.reshape(X.columns.shape[0], 1) absCoefficients = abs(coefficients) fullList = pd.concat((pd.DataFrame(columns, columns = ['Variable']), pd.DataFrame(absCoefficients, columns = ['absCoefficient'])), axis = 1).sort_values(by='absCoefficient', ascending = False) print('RandomForestClassifier - Feature Importance:') print('\n',fullList,'\n') clf3 = XGBClassifier() clf3.fit(X_train2, y_train) print('Accuracy of XGBClassifier: {:.2f}'.format(clf3.score(X_test2, y_test))) columns = X.columns coefficients = clf3.feature_importances_.reshape(X.columns.shape[0], 1) absCoefficients = abs(coefficients) fullList = pd.concat((pd.DataFrame(columns, columns = ['Variable']), pd.DataFrame(absCoefficients, columns = ['absCoefficient'])), axis = 1).sort_values(by='absCoefficient', ascending = False) print('XGBClassifier - Feature Importance:') print('\n',fullList,'\n') Accuracy of DecisionTreeClassifier: 0.75 DecisionTreeClassifier - Feature Importance: VariableabsCoefficient 1 Glucose0.613166 5 BMI0.270807 7 Age0.114143 4 Insulin0.001885 0 Pregnancies0.000000 2 BloodPressure0.000000 3 SkinThickness0.000000 6DiabetesPedigreeFunction0.000000 Accuracy of RandomForestClassifier: 0.77 RandomForestClassifier - Feature Importance: VariableabsCoefficient 1 Glucose0.322880 5 BMI0.216052 7 Age0.146261 4 Insulin0.116453 0 Pregnancies0.065474 6DiabetesPedigreeFunction0.056155 2 BloodPressure0.038978 3 SkinThickness0.037747 [20:00:10] WARNING: C:/Users/Administrator/workspace/xgboost-win64_release_1.4.0/src/learner.cc:1095: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior. Accuracy of XGBClassifier: 0.74 XGBClassifier - Feature Importance: VariableabsCoefficient 1 Glucose0.227671 5 BMI0.146292 7 Age0.129848 4 Insulin0.109483 0 Pregnancies0.099757 6DiabetesPedigreeFunction0.098393 2 BloodPressure0.094846 3 SkinThickness0.093708 Interpretation: From the above models we can see which features are more important and signifiant in among all the models: If we looked at Decision tree classifier model: there are Glucose, BMI and Age features are most significant in the models, which says that patient get diabetes more on these features, these features are more contributed to patient get diabetes. RandomForestClassifier: Glucose, BMI , Age and Insulin are most Significant features in the Random Forest Model. XGBClassifier Model : In This model Glucose, BMI and Age are showing most significant Features, but remaining all other features were showing some significant, which says that these first 3 to 4 features are contributed more to get the petient Diabetes. In[]:

## Calculate the price of your order

550 words
We'll send you the first draft for approval by September 11, 2018 at 10:52 AM
Total price:
\$26
The price is based on these factors:
Number of pages
Urgency
Basic features
• Free title page and bibliography
• Unlimited revisions
• Plagiarism-free guarantee
• Money-back guarantee
On-demand options
• Writer’s samples
• Part-by-part delivery
• Overnight delivery
• Copies of used sources
Paper format
• 275 words per page
• 12 pt Arial/Times New Roman
• Double line spacing
• Any citation style (APA, MLA, Chicago/Turabian, Harvard)

## Our guarantees

Delivering a high-quality product at a reasonable price is not enough anymore.
That’s why we have developed 5 beneficial guarantees that will make your experience with our service enjoyable, easy, and safe.

### Money-back guarantee

You have to be 100% sure of the quality of your product to give a money-back guarantee. This describes us perfectly. Make sure that this guarantee is totally transparent.

### Zero-plagiarism guarantee

Each paper is composed from scratch, according to your instructions. It is then checked by our plagiarism-detection software. There is no gap where plagiarism could squeeze in.

### Free-revision policy

Thanks to our free revisions, there is no way for you to be unsatisfied. We will work on your paper until you are completely happy with the result.