Insurance Claim Prediction using Logistic Regression

The rising cost of health insurance is a concern throughout the world. Premiums for consumer and employer-sponsored health insurance have increased by 131 percent over the last decade. A major cause of this increase is payment errors made by insurance companies while processing claims. These errors force claims to be re-processed, which is known as re-work. Re-work accounts for a significant portion of a health plan's administrative costs and service issues, and it has a direct monetary impact: the insurance company ends up paying more or less than it should have. The most successful kinds of machine learning algorithms are those that automate decision-making processes by generalizing from known examples.


Insurance companies apply numerous models to analyze and predict health insurance costs. Some earlier work has investigated predictive modeling of healthcare costs using several statistical techniques. Machine learning approaches are also used to predict high-cost expenditures in health care.

In this project, we will discuss the use of logistic regression to predict insurance claims. We take a sample of 1,338 records, each of which consists of the following features:-

  1. age: age of the policyholder
  2. sex: gender of the policyholder (female=0, male=1)
  3. bmi: body mass index (kg/m^2), an objective index of body weight computed as weight divided by the square of height; it indicates whether weight is high or low relative to height, ideally 18.5 to 25
  4. children: number of children/dependents of the policyholder
  5. smoker: smoking status of the policyholder (non-smoker=0, smoker=1)
  6. region: the residential area of the policyholder in the US (northeast=0, northwest=1, southeast=2, southwest=3)
  7. charges: individual medical costs billed by health insurance
  8. insuranceclaim: the labeled output for the above features (1 = valid insurance claim, 0 = invalid)
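
The bmi feature follows the usual kg/m^2 definition, which can be sketched in a few lines. The function names and the example person below are illustrative assumptions, not part of the dataset or the original project code.

```python
def body_mass_index(weight_kg: float, height_m: float) -> float:
    """Return BMI in kg/m^2: weight divided by the square of height."""
    if height_m <= 0:
        raise ValueError("height_m must be positive")
    return weight_kg / height_m ** 2


def in_ideal_range(bmi: float, low: float = 18.5, high: float = 25.0) -> bool:
    """True if bmi falls inside the 18.5 to 25 band quoted in the feature list."""
    return low <= bmi <= high


# Made-up example person: 70 kg and 1.75 m tall.
example_bmi = body_mass_index(70.0, 1.75)
print(round(example_bmi, 2), in_ideal_range(example_bmi))
```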

Here is the sample dataset:-

Now we will import pandas to read our data from a CSV file and manipulate it for further use. We will also use numpy to convert our data into a format suitable for feeding our classification model. We'll use seaborn and matplotlib for visualizations. We will then import the LogisticRegression algorithm from sklearn. This algorithm will help us build our classification model. Lastly, we will use joblib to save our model for future use. (Older sklearn releases bundled joblib as sklearn.externals.joblib; current releases expect it to be imported as a standalone package.)
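
The imports described above would look roughly like this. It is a sketch that assumes recent versions of the libraries, where joblib is imported as its own package.

```python
import pandas as pd                                   # read and manipulate the CSV data
import numpy as np                                    # array format fed to the classifier
import seaborn as sns                                 # correlation heatmap
import matplotlib.pyplot as plt                       # plotting layer used by seaborn
from sklearn.linear_model import LogisticRegression   # the classification model
import joblib                                         # save and reload the trained model
```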

We have our data saved in a CSV file called insurance.csv. We first read our dataset in a pandas dataframe called insuranceDF, and then use the head() function to show the first five records from our dataset.
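
A minimal sketch of this step follows. So that it runs without the real file, it reads a small in-memory CSV whose rows are made-up placeholders (not records from insurance.csv). The helper name load_insurance is also an assumption.

```python
import io

import pandas as pd

# Made-up placeholder rows with the same columns as the dataset described above.
SAMPLE_CSV = """age,sex,bmi,children,smoker,region,charges,insuranceclaim
25,1,22.4,0,0,1,3200.50,0
47,0,31.8,2,1,2,24150.75,1
33,1,27.1,1,0,0,5400.00,0
58,0,35.0,3,1,3,41200.10,1
41,1,29.3,2,0,2,7100.25,1
19,0,20.5,0,0,1,1800.00,0
"""


def load_insurance(source):
    """Read the insurance data from a file path or a file-like object."""
    return pd.read_csv(source)


# With the real file this would be: insuranceDF = load_insurance("insurance.csv")
insuranceDF = load_insurance(io.StringIO(SAMPLE_CSV))
print(insuranceDF.head())  # head() returns the first five records by default
```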

Let’s also make sure that our data is clean (has no null values, etc).
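
One way to run this check is sketched below on a tiny made-up frame with a deliberately missing value. Dropping incomplete rows is shown only as one possible remedy. It is an assumption, not necessarily what the original project does.

```python
import numpy as np
import pandas as pd

# Made-up frame; the NaN in "bmi" is planted on purpose to show what the check reports.
insuranceDF = pd.DataFrame({
    "age": [25, 47, 33, 58],
    "bmi": [22.4, np.nan, 27.1, 35.0],
    "smoker": [0, 1, 0, 1],
    "insuranceclaim": [0, 1, 0, 1],
})

nullCounts = insuranceDF.isnull().sum()  # number of missing values per column
print(nullCounts)
insuranceDF.info()                       # dtypes and non-null counts per column

# One possible remedy (an assumption): drop rows that contain any missing value.
cleanDF = insuranceDF.dropna()
```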

Let's start by finding the correlation of every pair of features (and the outcome variable), and then visualize the correlations using a heatmap.
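
A sketch of the correlation heatmap, using synthetic data in place of insurance.csv. The column subset, the rule that generates the label, and the colour settings are all assumptions chosen for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch also runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic stand-in data (not the real dataset).
rng = np.random.default_rng(0)
n = 200
smoker = rng.integers(0, 2, n)
bmi = rng.normal(30, 6, n)
insuranceDF = pd.DataFrame({
    "age": rng.integers(18, 65, n),
    "bmi": bmi,
    "smoker": smoker,
    "insuranceclaim": ((bmi > 30) | (smoker == 1)).astype(int),
})

corr = insuranceDF.corr()  # pairwise Pearson correlations, outcome column included

fig, ax = plt.subplots(figsize=(6, 5))
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm", vmin=-1, vmax=1, ax=ax)
ax.set_title("Feature correlations (synthetic data)")
fig.tight_layout()
# In an interactive session, plt.show() would display the figure.
```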

When using machine learning algorithms we should always split our data into a training set and a test set. (If the number of experiments we are running is large, then we should divide our data into three parts, namely a training set, a development set and a test set.) In our case, we will also separate out some data for manual cross-checking.

The dataset consists of 1,338 records in total. To train our model we will use 1,000 records. We will use 300 records for testing, and the last 38 records to cross-check our model.
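
The 1000/300/38 split can be sketched with positional slicing. The frame below is a placeholder with the right number of rows, and the names dfTrain, dfTest and dfCheck are assumptions. Slicing by position also assumes the rows are not ordered in a way that correlates with the label. If they were, shuffling first would be the safer choice.

```python
import numpy as np
import pandas as pd

# Placeholder frame with the same number of rows as the dataset (values are synthetic).
insuranceDF = pd.DataFrame({"row_id": np.arange(1338)})

dfTrain = insuranceDF[:1000]      # first 1000 records: training
dfTest = insuranceDF[1000:1300]   # next 300 records: testing
dfCheck = insuranceDF[1300:]      # last 38 records: manual cross-check

print(len(dfTrain), len(dfTest), len(dfCheck))
```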

Next, we separate the label and features (for both the training and test datasets). In addition, we will convert them into NumPy arrays, as our machine learning algorithm processes data in NumPy array format.
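
A sketch of the separation step on small synthetic frames. The helper split_features_label and the variable names are assumptions. Only the column names come from the feature list above.

```python
import numpy as np
import pandas as pd

columns = ["age", "sex", "bmi", "children", "smoker", "region", "charges", "insuranceclaim"]
rng = np.random.default_rng(1)


def make_frame(n_rows):
    """Build a synthetic frame with the dataset's columns (values are random)."""
    frame = pd.DataFrame(rng.random((n_rows, len(columns))), columns=columns)
    frame["insuranceclaim"] = rng.integers(0, 2, n_rows)
    return frame


def split_features_label(df, label="insuranceclaim"):
    """Return (features, labels) as NumPy arrays."""
    labels = np.asarray(df[label])
    features = np.asarray(df.drop(columns=label))
    return features, labels


dfTrain, dfTest = make_frame(10), make_frame(4)
trainData, trainLabel = split_features_label(dfTrain)
testData, testLabel = split_features_label(dfTest)
print(trainData.shape, trainLabel.shape, testData.shape, testLabel.shape)
```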

As the final step before using machine learning, we will normalize our inputs. Machine Learning models often benefit substantially from input normalization. It also makes it easier for us to understand the importance of each feature later, when we’ll be looking at the model weights. We’ll normalize the data such that each variable has 0 mean and standard deviation of 1.
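
The normalization can be sketched with plain NumPy. The arrays below are synthetic. The sketch also assumes that the mean and standard deviation are computed on the training set only and then reused for the test set, which is a common convention rather than something the text states.

```python
import numpy as np

# Synthetic stand-ins for three numeric features on very different scales.
rng = np.random.default_rng(2)
loc, scale = [40.0, 30.0, 13000.0], [14.0, 6.0, 12000.0]
trainData = rng.normal(loc=loc, scale=scale, size=(1000, 3))
testData = rng.normal(loc=loc, scale=scale, size=(300, 3))

# Statistics come from the training set only (assumed convention).
means = np.mean(trainData, axis=0)
stds = np.std(trainData, axis=0)

trainData = (trainData - means) / stds
testData = (testData - means) / stds  # same shift and scale as the training data

print(trainData.mean(axis=0).round(3), trainData.std(axis=0).round(3))
```

With this convention the test set ends up only approximately centred, since its own mean and standard deviation differ slightly from the training statistics.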

We can now train our classification model. We'll be using a simple machine learning model called logistic regression. Since the model is readily available in sklearn, the training process is quite easy and we can do it in a few lines of code. First, we create an instance called insuranceCheck and then use the fit function to train the model.
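
A minimal training sketch on synthetic, already-normalized data. The instance name insuranceCheck follows the text. The data, the weights that generate the labels, and the default hyperparameters are assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic training set: 1000 rows and 7 features, roughly zero mean and unit variance.
rng = np.random.default_rng(3)
trainData = rng.normal(size=(1000, 7))
true_w = np.array([0.5, 0.0, 1.5, -1.0, 2.0, 0.0, 0.5])  # arbitrary weights for the demo
noise = rng.normal(scale=0.5, size=1000)
trainLabel = (trainData @ true_w + noise > 0).astype(int)

insuranceCheck = LogisticRegression()      # default settings
insuranceCheck.fit(trainData, trainLabel)  # learn one weight per feature plus an intercept

# On normalized inputs, larger absolute weights suggest more influential features.
print(insuranceCheck.coef_.round(2))
```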

Now we use our test data to find the accuracy of the model.
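
The accuracy check can be sketched as follows. The data is synthetic, so the printed number will not match the figure reported for the real dataset. The labelling rule is an assumption for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic data: 1000 training rows and 300 test rows, 7 features each.
rng = np.random.default_rng(4)
X = rng.normal(size=(1300, 7))
y = (X[:, 2] + X[:, 4] + rng.normal(scale=0.5, size=1300) > 0).astype(int)
trainData, testData = X[:1000], X[1000:]
trainLabel, testLabel = y[:1000], y[1000:]

insuranceCheck = LogisticRegression().fit(trainData, trainLabel)

# score() returns the fraction of test records classified correctly.
accuracy = insuranceCheck.score(testData, testLabel)
print("accuracy =", accuracy * 100, "%")

# The same number computed by hand from the predictions.
manualAccuracy = np.mean(insuranceCheck.predict(testData) == testLabel)
```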

OUTPUT:- accuracy = 85.66666666666667 %

Here is the final code of the project for “Insurance Claim Prediction using Logistic Regression”

Download the dataset here.
