Insurance Claim Prediction using Logistic Regression

Question

Insurance Claim Prediction using Logistic Regression

The increased cost of health insurance is alarming throughout the world. These costs are done for consumers and employers sponsored health insurance premium which has increased by 131 percent over the last decade. A major cause of this increase is payment errors made by the insurance companies while processing claims. Furthermore, because of the payment errors results in re-processing of the claims which is known to be called as re-work and accounts for significant portion of administrative cost and services issues of health plan which have a direct impact in the term of monetary of the insurance company paying more or less than what it should have. The most successful kind of machine learning algorithms is those that automate a decision making processes by generalizing from known examples.

Loading…

Insurance Companies apply numerous models for analyzing and predicting health insurance cost. Some of the work investigated the predictive modeling of healthcare cost using several statistical techniques. Machine Learning approach is also used for predicting high-cost expenditures in health care.

In this project, we will discuss the use of Logistic Regression to predict the insurance claim. We take a sample of 1338 data which consists of the following features:-

age : age of the policyholder
sex: gender of policy holder (female=0, male=1)
bmi: Body mass index, providing an understanding of body, weights that are relatively high or low relative to height, objective index of body weight (kg / m ^ 2) using the ratio of height to weight, ideally 18.5 to 25
children: number of children/dependents of the policyholder
smoker: smoking state of policyholder (non-smoke=0;smoker=1)
region: the residential area of policyholder in the US (northeast=0, northwest=1, southeast=2, southwest=3)
charges: individual medical costs billed by health insurance
insuranceclaim – The labeled output from the above features, 1 for valid insurance claim / 0 for invalid.

Here is the sample dataset:-

Now we will import pandas to read our data from a CSV file and manipulate it for further use. We will also use numpy to convert out data into a format suitable to feed our classification model. We’ll use seaborn and matplotlib for visualizations. We will then import Logistic Regression algorithm from sklearn. This algorithm will help us build our classification model. Lastly, we will use joblib available in sklearn to save our model for future use.

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression

We have our data saved in a CSV file called insurance.csv. We first read our dataset in a pandas dataframe called insuranceDF, and then use the head() function to show the first five records from our dataset.

   age  sex     bmi  children  smoker  region      charges  insuranceclaim
0   19    0  27.900         0       1       3  16884.92400               1
1   18    1  33.770         1       0       2   1725.55230               1
2   28    1  33.000         3       0       2   4449.46200               0
3   33    1  22.705         0       0       1  21984.47061               0
4   32    1  28.880         0       0       1   3866.85520               1

Let’s also make sure that our data is clean (has no null values, etc).

insuranceDF.info()


RangeIndex: 1338 entries, 0 to 1337
Data columns (total 8 columns):
age               1338 non-null int64
sex               1338 non-null int64
bmi               1338 non-null float64
children          1338 non-null int64
smoker            1338 non-null int64
region            1338 non-null int64
charges           1338 non-null float64
insuranceclaim    1338 non-null int64
dtypes: float64(2), int64(6)
memory usage: 83.7 KB

Let’s start by finding correlation of every pair of features (and the outcome variable), and visualize the correlations using a heatmap.

corr = insuranceDF.corr()
print(corr)

                     age       sex       bmi  children    smoker    region  \
age             1.000000 -0.020856  0.109272  0.042469 -0.025019  0.002127   
sex            -0.020856  1.000000  0.046371  0.017163  0.076185  0.004588   
bmi             0.109272  0.046371  1.000000  0.012759  0.003750  0.157566   
children        0.042469  0.017163  0.012759  1.000000  0.007673  0.016569   
smoker         -0.025019  0.076185  0.003750  0.007673  1.000000 -0.002181   
region          0.002127  0.004588  0.157566  0.016569 -0.002181  1.000000   
charges         0.299008  0.057292  0.198341  0.067998  0.787251 -0.006208   
insuranceclaim  0.113723  0.031565  0.384198 -0.409526  0.333261  0.020891   

                 charges  insuranceclaim  
age             0.299008        0.113723  
sex             0.057292        0.031565  
bmi             0.198341        0.384198  
children        0.067998       -0.409526  
smoker          0.787251        0.333261  
region         -0.006208        0.020891  
charges         1.000000        0.309418  
insuranceclaim  0.309418        1.000000

When using machine learning algorithms we should always split our data into a training set and test set. (If the number of experiments we are running is large, then we can should be dividing our data into 3 parts, namely – training set, development set and test set). In our case, we will also separate out some data for manual cross checking.

The data set consists of record of 1338 patients in total. To train our model we will be using 1000 records. We will be using 300 records for testing, and the last 38 records to cross check our model.

dfTrain = insuranceDF[:1000]
dfTest = insuranceDF[1000:1300]
dfCheck = insuranceDF[1300:]

Next, we separate the label and features (for both training and test dataset). In addition to that, we will also convert them into NumPy arrays as our machine learning algorithm process data in NumPy array format.

trainLabel = np.asarray(dfTrain['insuranceclaim'])
trainData = np.asarray(dfTrain.drop('insuranceclaim',1))
testLabel = np.asarray(dfTest['insuranceclaim'])
testData = np.asarray(dfTest.drop('insuranceclaim',1))

As the final step before using machine learning, we will normalize our inputs. Machine Learning models often benefit substantially from input normalization. It also makes it easier for us to understand the importance of each feature later, when we’ll be looking at the model weights. We’ll normalize the data such that each variable has 0 mean and standard deviation of 1.

means = np.mean(trainData, axis=0)
stds = np.std(trainData, axis=0)
 
trainData = (trainData - means)/stds
testData = (testData - means)/stds

We can now train our classification model. We’ll be using a machine simple learning model called logistic regression. Since the model is readily available in sklearn, the training process is quite easy and we can do it in few lines of code. First, we create an instance called insuranceCheck and then use the fit function to train the model.

insuranceCheck = LogisticRegression()
insuranceCheck.fit(trainData, trainLabel)

Now use our test data to find out accuracy of the model.

accuracy = insuranceCheck.score(testData, testLabel)
print("accuracy = ", accuracy * 100, "%")

OUTPUT:- accuracy = 85.66666666666667 %

Here is the final code of the project for “Insurance Claim Prediction using Logistic Regression”

# Load essential libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
 
from sklearn.linear_model import LogisticRegression
from sklearn.externals import joblib

# Read CSV
insuranceDF = pd.read_csv('insurance.csv')
print(insuranceDF.head())

insuranceDF.info()

corr = insuranceDF.corr()

# Train Test Split
dfTrain = insuranceDF[:1000]
dfTest = insuranceDF[1000:1300]
dfCheck = insuranceDF[1300:]

# Convert to numpy array
trainLabel = np.asarray(dfTrain['insuranceclaim'])
trainData = np.asarray(dfTrain.drop('insuranceclaim',1))
testLabel = np.asarray(dfTest['insuranceclaim'])
testData = np.asarray(dfTest.drop('insuranceclaim',1))

# Normalize Data
means = np.mean(trainData, axis=0)
stds = np.std(trainData, axis=0)
trainData = (trainData - means)/stds
testData = (testData - means)/stds

insuranceCheck = LogisticRegression()
insuranceCheck.fit(trainData, trainLabel)

# Now use our test data to find out accuracy of the model.

accuracy = insuranceCheck.score(testData, testLabel)
print("accuracy = ", accuracy * 100, "%")

Download the dataset here:- here

Loading…