Neural Network Classification Analysis of Heart Disease Data with R, Part 1

Falah Novayanda Adlin
7 min read · Jul 24, 2021

Source: https://www.freepik.com/

Assalamu’alaikum,

Hello statisticians, in this article we will practice classification analysis by applying an Artificial Neural Network (ANN) to Heart Disease data. In Part 2 and later parts, other classification methods will be applied to the same data so that we can compare which method performs best.

An Artificial Neural Network (ANN) is an information processing system whose characteristics resemble those of biological neural networks. It is formulated as a generalization of mathematical models of biological neural networks. An ANN is determined by three things:

  1. The pattern of connections between neurons (called the network architecture).
  2. The method for determining the weights of the connections (the training or learning algorithm).
  3. Activation function.

The Heart Disease data set dates from 1988 and consists of four databases: Cleveland, Hungary, Switzerland, and Long Beach V. It contains 76 attributes, including the predicted attribute, but all published experiments use a subset of 14 of them. The “target” field indicates whether the patient has been diagnosed with heart disease; it is an integer where 0 = negative for heart disease and 1 = positive for heart disease. The data set used here contains 303 observations. The data is available at https://www.kaggle.com/ronitf/heart-disease-uci.

As a first step, we install and load the three packages used in this analysis in RStudio with the following syntax:

library(dplyr)
library(neuralnet)
library(caret)

Then, the author imports the data with the read.csv command, using the syntax object_name = read.csv(file.choose(), header = TRUE, sep = ","), and selects the heart disease data file, as shown below:

# Input data
heart_deseases = read.csv(file.choose(), header = TRUE, sep = ",")

Next, the author displays the data with View(object_name), as shown below:

View(heart_deseases)

To check whether any variable contains missing values, the author looks at the descriptive statistics of all variables with summary(object_name), as shown below:

summary(heart_deseases)
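
The summary output has to be read column by column, so a direct count of missing values per column can be easier to check. A minimal sketch in base R, assuming a small made-up data frame (`toy` and its values are hypothetical, not the heart disease file):

```r
# Toy data frame with one missing value (made-up values, for illustration only)
toy <- data.frame(age  = c(63, 37, NA, 56),
                  chol = c(233, 250, 204, 236))

# Count missing values per column; all zeros would mean nothing is missing
na_per_column <- colSums(is.na(toy))
na_per_column

# TRUE only if the whole data frame is free of missing values
no_missing <- !anyNA(toy)
no_missing
```

On the article's data, the equivalent call would be colSums(is.na(heart_deseases)).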

The descriptive statistics show that no variable has missing values, so the analysis can proceed. Next, the author standardizes the data so that every variable is on the same scale, using a loop that applies min-max standardization with the following syntax:

# Standardization (min-max)
# [,-1] skips the first column (assumed here to be target)
for (i in names(heart_deseases[,-1])) {
  heart_deseases[i] <- (heart_deseases[i] - min(heart_deseases[i]))/(max(heart_deseases[i]) - min(heart_deseases[i]))
}
heart_deseases
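
The loop above applies min-max scaling, x' = (x - min(x)) / (max(x) - min(x)), which maps every column onto the range 0 to 1. A minimal sketch of the same idea on a made-up data frame; `min_max` and `toy` are hypothetical names chosen for this example, not part of the original code:

```r
# Min-max scaling: (x - min) / (max - min), mapping a numeric vector onto [0, 1]
# `min_max` is a helper name chosen for this sketch
min_max <- function(x) (x - min(x)) / (max(x) - min(x))

# Toy data (made-up values)
toy <- data.frame(age = c(30, 40, 50), chol = c(200, 250, 300))

# Apply the helper to every column while keeping the data frame structure
scaled <- toy
scaled[] <- lapply(toy, min_max)
scaled
```

This sketch assumes every column varies; a column with a single constant value would produce 0/0.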

Next, the author splits the data into roughly 65% training data and 35% testing data using the sample command, after first calling set.seed with a value of 123. The syntax is as follows:

# Data partition: split the data into 65% training data and 35% testing data
set.seed(123)
ind <- sample(2, nrow(heart_deseases), replace = TRUE, prob = c(0.65, 0.35))
training <- heart_deseases[ind==1,]
testing <- heart_deseases[ind==2,]
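
Because sample() draws a group label for each row independently, the split is only approximately 65/35. A minimal sketch for checking the realized proportions, assuming only the row count of 303 reported for this data set (`n_rows` and `split_counts` are names chosen for this example):

```r
set.seed(123)
n_rows <- 303  # number of rows reported for the heart disease data

# Draw label 1 (training) or 2 (testing) for each row with probabilities 0.65 / 0.35
ind <- sample(2, n_rows, replace = TRUE, prob = c(0.65, 0.35))

# Realized group sizes and proportions
split_counts <- table(ind)
split_counts
prop.table(split_counts)
```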

Then, the author fits a neural network model to the training data with one hidden layer of 5 neurons, using the neuralnet command with the syntax n <- neuralnet(target~., data = training, hidden = 5, err.fct = "ce", linear.output = FALSE). The network is then displayed with the plot command, as follows:

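# Neural network model fitted on the training data
library(neuralnet)
set.seed(321)
n <- neuralnet(target~.,
               data = training,
               hidden = 5,             # one hidden layer with 5 neurons
               err.fct = "ce",         # cross-entropy error function
               linear.output = FALSE)  # apply the activation function to the output node
plot(n)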

After that, the author computes predictions for the testing data with the following syntax:

# Prediction
# neuralnet:: makes sure neuralnet's compute() is used rather than dplyr's
output <- neuralnet::compute(n, testing[,-1])
head(output$net.result)
head(training[1,])
# Original target values next to the predicted values
results <- data.frame(DataAsli=testing$target, Prediksi=output$net.result)
results
# Rounded predictions
library(caret)
roundedresults <- sapply(results, round, digits=0)
roundedresults

Next, the author evaluates the fitted model by measuring its accuracy and goodness of fit using confusionMatrix, with the following syntax:

# Model evaluation
# Accuracy on the test data
actual <- round(testing$target, digits = 0)
prediction <- round(output$net.result, digits = 0)
mtab <- table(actual,prediction)
mtab
confusionMatrix(mtab)
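
The accuracy reported by confusionMatrix() is the share of cases that fall on the diagonal of the table. A minimal base R sketch with made-up labels (`actual_toy` and `prediction_toy` are hypothetical vectors, not the article's results):

```r
# Made-up actual and predicted classes (0 = negative, 1 = positive)
actual_toy     <- c(1, 1, 1, 0, 0, 0, 1, 0)
prediction_toy <- c(1, 0, 1, 0, 1, 0, 1, 0)

# Cross-tabulate actual against predicted classes
tab <- table(actual = actual_toy, prediction = prediction_toy)
tab

# Accuracy = correctly classified cases / all cases
accuracy <- sum(diag(tab)) / sum(tab)
accuracy
```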

In this section, the author discusses the output of the classification analysis of the Heart Disease data using the ANN method. The dependent variable is “target”, which has two classes: 1 = positive for heart disease and 0 = negative for heart disease. The heart disease data looks as follows:

Heart disease data set

Then, the data is standardized so that every variable is on the same scale, which gives the following:

Standardized data set

For the classification analysis with the ANN method, the data is divided into two parts: the researcher assigns 65% of the data to training and the remaining 35% to testing.

Fitting the neural network to the training data with one hidden layer of 5 neurons gives the following model:

ANN model

The output above is the model produced by the neural network classification method. The key part of the model is the hidden layer. The data has 14 attributes: target, age, sex, cp, trestbps, chol, fbs, restecg, thalach, exang, oldpeak, slope, ca and thal. The target attribute is the response, and the other 13 attributes form the input layer (x) with inputs x1, x2, x3, … x13. The hidden layer (z) has 5 neurons, and the connections between the input layer (x) and the hidden layer (z) carry weights that form the first weight matrix, denoted v11, v12, … v15 and so on for each input. The connections between the hidden layer (z) and the output layer (y) carry the second set of weights, namely w1, w2, w3, … w5. From these, the output value is calculated, which indicates a positive or a negative diagnosis of heart disease.
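
The computation that such a network performs can be written out as a sketch. It assumes the 13 predictors (all columns except target) as inputs, one hidden layer of 5 neurons, and the logistic activation that neuralnet uses by default; the bias terms v_{0j} and w_0 are notation added here for the intercept nodes and do not appear in the text above:

```latex
z_j = \sigma\!\left(v_{0j} + \sum_{i=1}^{13} v_{ij}\, x_i\right), \qquad j = 1, \dots, 5

y = \sigma\!\left(w_0 + \sum_{j=1}^{5} w_j\, z_j\right), \qquad \sigma(a) = \frac{1}{1 + e^{-a}}
```

Because linear.output = FALSE, the logistic function is also applied at the output node, so y lies between 0 and 1 and can be rounded to a class label.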

After that, in the prediction stage, the fitted ANN is applied to the testing data and its predictions are placed next to the original target values of the testing data. The results are shown in the following output:

Predictions

The output shows that some observations whose original category is 1 are predicted as exactly 1, while others receive a value close to 1, such as 0.9984651. There are also observations whose original category is 0 that receive predicted values such as 0.9230630 or even 1.

To see the results more clearly, including the number of observations in each class, the results of the ANN analysis can be displayed as a table, as follows.

Rounded predictions

The output shows that among observations whose original category is 1 (positive for heart disease), some are predicted as 1 and some as 0. Likewise, among observations whose original category is 0 (negative for heart disease), some are predicted as 1 and some as 0. To find out the accuracy and goodness of the ANN classification, we use the confusion matrix. The confusion matrix output is as follows.

Confusion Matrix

The output shows that the ANN classification made 43 correct predictions for class 1 (positive for heart disease) and 37 correct predictions for class 0 (negative for heart disease), along with 13 incorrect predictions for class 0 and 12 incorrect predictions for class 1.

Evaluating the fitted model shows that the accuracy obtained is 0.7619, or 76.19%, which is a fairly good result.
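
The reported accuracy can be checked against the confusion matrix counts, taking 43 and 37 as the correct predictions and 13 and 12 as the incorrect ones:

```latex
\text{accuracy} = \frac{43 + 37}{43 + 37 + 13 + 12} = \frac{80}{105} \approx 0.7619
```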

Reference:

  1. Machmudin, A. (2012). Peramalan Temperatur Udara di Kota Surabaya dengan Menggunakan ARIMA dan Artificial. JURNAL SAINS DAN SENI ITS, 6.
  2. https://www.kaggle.com/ronitf/heart-disease-uci
  3. https://www.freepik.com/free-photo/top-view-world-heart-day-concept-with-stethoscope_9472263.htm#page=1&query=Heart%20Disease&position=9
  4. http://ejournal.nusamandiri.ac.id/index.php/pilar/article/view/601/534

Written by Falah Novayanda Adlin

Statistics — Universitas Islam Indonesia