Health Care Provider Fraud Detection Analysis

To design a binary classification model that predicts whether the claims filed by a provider are fraudulent or legitimate, based on beneficiary details and inpatient and outpatient data.

Image Source: Link
  1. What are Frauds?
  2. Health Care Frauds and its types
  3. Who are Healthcare Providers?
  4. Types of Provider Health Care Frauds
  5. Impact of Provider Health Care Frauds on Business
  6. Understanding the Business Problem
  7. ML Formulation
  8. Business Constraints
  9. Data Gathering
  10. Performance metric
  11. Exploratory Data Analysis
  12. Existing Approaches and Improvements in my model
  13. Data Preprocessing
  14. Feature Engineering
  15. Machine Learning Models
  16. Final Data pipeline
  17. Challenges and Technical Obstacles
  18. Future work
  19. LinkedIn, Web App and GitHub Repository
  20. References
  1. Fraud is defined as any deliberate and dishonest act committed with the knowledge that it could result in an unauthorized benefit to the person committing the act, or to someone else who is similarly not entitled to the benefit.
  2. This is done through deception or other unethical means that are believed and relied upon by others.
Different Types of Frauds (Image Source: Link)
  1. Health care fraud is an organized crime that involves providers (hospitals, cashiers, medical labs, nurses, lab assistants, and others), physicians, and beneficiaries acting together to knowingly misrepresent or misstate the type, scope, or nature of a medical treatment or service, in a manner that results in unauthorized payments (fraudulent claims).
  2. Insurance companies are the institutions most vulnerable to these bad practices.
  3. One common scheme uses an ambiguous diagnosis code to bill the costliest procedures and drugs, for example billing for services or medical procedures that were never performed.
  4. Healthcare fraud and abuse take many forms and can be committed by three parties:
    Service providers: Doctors, hospitals, ambulance companies, laboratories.
    Insurance subscribers: Patients and patients' employers.
    Insurance carriers: Those who receive premiums from subscribers and pay healthcare costs on behalf of their subscribers, including governmental health departments and private insurance companies.
Percentages of papers on detecting health care frauds (Image Source: Link)

Among these three parties, the most frauds are committed by service providers, as shown in the figure. Hence, detecting service providers' fraud is the most important problem for cost reduction and the safety of healthcare systems.

  1. A healthcare provider is an authorized person or entity that provides medical care or treatment.
  2. Providers include doctors, nurse practitioners, radiologists, labs, hospitals, urgent care clinics, medical supply companies, and other professionals, facilities and businesses that provide such services.
  1. Phantom Billing: Submitting claims for services not provided.
  2. Duplicate Billing: Submitting similar claims more than once.
  3. Bill Padding: Submitting claims for unneeded ancillary services to Medicaid.
  4. Upcoding: Billing for a service with a higher reimbursement rate than the service provided.
  5. Unbundling: Submitting several claims for various services that should only be billed as one service.
  6. Excessive or Unnecessary Services: Providing medically excessive or unnecessary services to a patient.
  7. Kickbacks: A kickback is a form of negotiated bribery in which a commission is paid to the bribe-taker (provider or patient) as a quid pro quo for services rendered.

Statistics show that 15% of the total Medicare expense is caused by fraudulent claims.

  • Fraud and abuse have led to significant additional expenses in healthcare systems. For this reason, insurance companies have increased their premiums, and as a result healthcare is becoming costlier day by day.
Image Source: Link
  1. To predict potentially fraudulent providers based on the claims they file, and to find the reasons behind such frauds in order to prevent financial loss.
  2. To identify important variables/features that help detect the behavior of potentially fraudulent providers. For example, if the claim amount is very high for a patient whose risk score is low, the claim may be fraudulent.
  3. Depending on whether a provider is flagged as fraudulent or not, the insurance company can accept or deny the claim, or set up an investigation of that provider.

Financial loss is not the only concern; the healthcare systems themselves must be protected so that they can provide quality, safe care to legitimate patients.

Medicare processes more than 4.5 million claims per day, which is nearly impossible for humans to process and detect fraud without the help of AI.

Design a binary classification model that predicts whether a claim filed by a provider is fraudulent or legitimate, based on beneficiary details and inpatient and outpatient data.

Confusion Matrix (Image Source: Link)
  1. Misclassifications: The cost of misclassification is high, as seen from the confusion matrix. High FN means we predict fraudulent claims as legitimate; high FP means we predict legitimate claims as fraudulent.
  2. Interpretability: It is highly important to cite the reasons why the model predicted a claim as fraudulent; these may be required for further investigation with the provider (a minimal sketch of coefficient-based interpretation follows this list).
  3. Cost of Misclassifications: High FN and FP both incur the cost of drilling down, via further investigation, into whether a claim is actually fraudulent. High FN means fraudulent claims get reimbursed; high FP adds the cost of unnecessary investigations.
  4. Latency Requirement: Since the claim reimbursement process takes time, there is no strict latency requirement, but we will still try to reduce the model's training time.
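For a linear model such as Logistic Regression (the model eventually chosen here), one way to satisfy the interpretability constraint is to inspect the learned coefficients. A minimal sketch, assuming clf is a fitted LogisticRegression and feature_names lists the input columns in order (both are assumptions, not objects defined in this post):

import numpy as np

weights = clf.coef_[0]                             # one learned weight per feature
top = np.argsort(np.abs(weights))[::-1][:10]       # indices of the 10 most influential features
for i in top:
    print(feature_names[i], round(weights[i], 3))  # sign shows push toward fraud vs legitimate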

The dataset is available on Kaggle. Please refer to the link below:

The dataset has the following csv files:

  1. Beneficiaries Information
  2. Inpatient Details
  3. Outpatient Details
  4. Provider IDs(Potentially Fraudulent or Non-Fraudulent)

→ Train-1542865627584.csv: Columns of this dataset are explained below:

Image Source: Feature List

→ Test-1542969243754.csv: Columns of this dataset are explained below:

Image Source: Feature List

These csv files contain beneficiaries' KYC details like DOB, DOD, Gender, Race, Health Conditions (chronic diseases, if any), State, County of the beneficiary, etc. Columns of this dataset are explained below:

Image Source: Feature List

This csv file consists of the claim details for the patients who were admitted to the hospital. Hence, it contains 3 extra columns, Admission date, Discharge date, and Diagnosis Group code, apart from those explained in the Outpatient Data (Train and Test).

  1. AdmissionDt: It consists of the date on which the patient was admitted to the hospital in yyyy-mm-dd format.
  2. DischargeDt: It consists of the date on which the patient was discharged from the hospital in yyyy-mm-dd format.
  3. DiagnosisGroupCode: It consists of a group code for the diagnosis done on the patient.

Columns of this dataset are explained below:

Image Source: Feature List

In healthcare, fraud is highly imbalanced (very few fraud cases). Hence, 'accuracy' cannot be used as a performance metric. The performance metrics used in this case are as follows:

i) Confusion Matrix: Refer to the confusion matrix diagram. From this table we can visualize the performance of the model.

Confusion Matrix

ii) F1 Score: We need to keep a check on both FP and FN, so high precision and high recall are desired; these can be checked together using the F1 score.

F1 Score = 2 × (Precision × Recall) / (Precision + Recall)

iii) AUC Score: The Area Under the ROC (Receiver Operating Characteristic) Curve, which plots TPR on the y-axis against FPR on the x-axis for different thresholds. An AUC of 0.75 means that, given a randomly chosen positive point and a randomly chosen negative point, there is a 75% probability that the model ranks the positive point higher.

iv) FPR and FNR: As the cost of misclassification is very high, keeping a check on FPR and FNR is highly desired. We need both of these values to be as low as possible.
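The sketch below shows how these metrics can be computed with scikit-learn; the labels and probabilities here are toy values for illustration only:

from sklearn.metrics import confusion_matrix, f1_score, roc_auc_score

y_true = [0, 0, 0, 0, 1, 1, 0, 1]                   # toy actual labels (1 = fraud)
y_pred = [0, 0, 1, 0, 1, 0, 0, 1]                   # toy hard predictions
y_prob = [0.1, 0.2, 0.7, 0.3, 0.9, 0.4, 0.2, 0.8]   # toy predicted fraud probabilities

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print('F1 Score =', f1_score(y_true, y_pred))
print('AUC Score =', roc_auc_score(y_true, y_prob))
print('FPR =', fp / (fp + tn), 'FNR =', fn / (fn + tp))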

Observations:

  1. There are 506 Fraudulent providers and 4904 Non-Fraudulent providers respectively.

Point of Interest: There is a significant imbalance between the two class labels. From the analysis of just Train-1542865627584.csv, we can conclude that this dataset is Highly Imbalanced.

  1. Datapoints in the Train Class Label Data = 5410 and in the Test Class Label Data = 1353. The Train Class Label Data contains 5410 rows and 2 columns; the Test Class Label Data contains 1353 rows and 1 column.
  2. There are no missing values in the train data. No analysis of missing or duplicate values is done on the test data, as any updates to the test data would cause data leakage.
  3. It is a Highly Imbalanced dataset, with 506 Fraudulent providers (9.35%) and 4904 Non-Fraudulent providers (90.65%).
  4. It is a Binary Classification problem with class labels Yes or No. There are no duplicate Provider IDs in the Train Class Label Data.

Observations:

  1. Most of the beneficiaries were born between 1916 and 1945; the fewest were born between 1963 and 1983.
  2. Why Birth Year?
    – To check whether a false year of birth was entered to avail claims.
    – To learn more about false birth years, we need to check their impact after merging the data with all the other csv files.

Point of Interest: Birth year should be considered as a feature to check its impact after merging the data with class labels.

Observations:

  1. Most of the beneficiaries suffer from 2–5 chronic diseases; the fewest suffer from 9–12 chronic diseases.
  2. Why Risk of Chronic Diseases?
    – To get an insight into the maximum number of chronic diseases a beneficiary suffers from.
    – To learn how many chronic diseases most beneficiaries suffer from.
    – The chronic-disease risk score helps flag false claims filed by providers. For this, we need to check its impact after merging the data with all the other csv files.

Point of Interest: Risk should be considered as a feature and its impact checked after merging the data with the class labels. It helps identify false claims, for example when claims are repeatedly filed for a chronic disease that very few beneficiaries actually suffer from.

Observations:

  1. Most of the beneficiaries belong to Gender2. There is no information to conclude whether the majority are male or female. From the above plot, Gender1 = 42.91% and Gender2 = 57.09%, as seen in plot 1.
  2. Most of the beneficiaries belong to the Old category (65–83 yrs), of which 28.17% belong to Gender1 and 35.69% to Gender2. The fewest beneficiaries belong to the Young category (26–45 yrs), of which only 1.91% belong to Gender1 and 1.84% to Gender2, as seen in plot 2.
  3. This tells us that most of the beneficiaries are in the Old category, but they are distributed roughly uniformly across both genders. Hence, there is no imbalance in the distribution of the dataset with respect to gender.

Point of Interest: This relationship was important to check whether any forging is done with respect to gender and age group. A false claim could conceivably be filed under a specific category, either too young or too old, for a specific gender. From the above plot we conclude that there is no such forging with respect to age group and gender.

Observations:

  1. 8.7% of the beneficiaries belong to State 5. The most beneficiaries come from states 5, 10, 45, 33 and 39 respectively. From the all-states plot, it can be seen that only 0.14% of the beneficiaries belong to State 2.

Point of Interest: Further analysis is required to know which state's beneficiaries are involved in the most fraudulent cases.

  1. 2.85% of the beneficiaries belong to County 200. The most beneficiaries come from counties 200, 10, 20, 60 and 0 respectively.

Point of Interest: Further analysis is required to know which county's beneficiaries are involved in the most fraudulent cases.

Observations:

  1. 9.22% of the beneficiaries do not suffer from Chronic Ischemic Heart Disease but suffer from Heart Failure.

Why did it not work? From the first-level analysis, this comparison is not important, as a person can suffer from Heart Failure without Chronic Ischemic Heart Disease. Further analysis shows that no fraudulent claims were filed for beneficiaries who actually suffer from Heart Failure but not from Chronic Ischemic Heart Disease.

Observations:

  1. 3.97% of the beneficiaries in the adult age group and 1.26% in the young age group suffer from Chronic Alzheimer's.

Why did it not work? From the first-level analysis, this comparison is not important, as very few young/adult beneficiaries suffer from Chronic Alzheimer's. This feature would have been important if more people from the young/adult age groups suffered from this disease.

Observations:

  1. 39.81% of the beneficiaries do not suffer from Chronic Diabetes. 60.19% of the beneficiaries suffer from Chronic Diabetes which implies most of the beneficiaries suffer from diabetes.

Point of Interest: From the above graph, it can be concluded that the average inpatient/outpatient annual reimbursement/deductible amount is greater for those actually suffering from Chronic Diabetes.

  1. Further analysis needs to be done on the following new features after merging this dataset with inpatient and outpatient data:
    – Birth Year and Age
    – Age Category
    – Risk of Chronic Disease
  2. Deeper insights need to be drawn from the following existing features after merging this dataset with inpatient and outpatient data:
    – Human Race
    – Renal Disease Indicator and Chronic Kidney Disease Relationship
    – State
    – Country
  3. No conclusions were drawn from the following features; hence there is no need to merge them with the inpatient and outpatient data:
    – Alzheimer and Age Relationship
    – Heart Failure and Chronic Ischemic Heart Disease Relationship
  4. The DOD feature contains missing values (a missing DOD simply means the beneficiary is alive). This can be handled in either of the two following ways (a minimal sketch follows this list):
    – Remove the feature completely
    – Use a missing-value imputation technique and fill all missing values with 0.
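A minimal sketch of both options, assuming train_ben already holds the beneficiary csv (the variable name is an assumption for illustration):

train_ben['DOD'] = train_ben['DOD'].fillna(0)     # option 2: treat a missing DOD as "alive"
# train_ben.drop(columns=['DOD'], inplace=True)   # option 1: remove the feature completely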

Conclusion of the Analysis for Beneficiary Data: A clear picture can be obtained only after merging the complete dataset. All the questions that remain unanswered need to be checked after merging the entire data.

Observations:

  1. 4019 is the most used Diagnostic Code in the diagnosis of a beneficiary, accounting for 4.32% of the total diagnosis codes in the InPatient Data (plot 1) and 4.648% in the OutPatient Data.
  2. At least 1 Diagnostic Code has been used for most of the beneficiaries in both the InPatient and OutPatient Data.
Fig: Comparison between InPatient and OutPatient Data

Observations:

  1. 4019 is the most used Procedure Code while treating a beneficiary in the InPatient Data, accounting for 6.57% of the total procedure codes, as shown in plot 1. 9904 is the most used Procedure Code in the OutPatient Data, accounting for 7.35% of the total procedure codes, as shown in plot 2.
  2. At least 1 Procedure Code has been used for most of the beneficiaries in both the InPatient and OutPatient Data.
  3. 4019 is the most used Diagnostic and Procedure Code in the Inpatient data. In the Outpatient data, 4019 is again the most used Diagnostic Code, but 9904 is the most used Procedure Code.

Rank Comparison between Diagnostic Codes and Procedure Codes for InPatient/OutPatient Data:

Fig1: Inpatient Data and Fig2: OutPatient Data
Fig: Comparison between InPatient and OutPatient Data

Observations:

  1. 516 claims were filed by PRV52019 alone, accounting for 1.275% of the total claims in the InPatient Data, as shown in plot 1. 8240 claims were filed by PRV51459 alone, accounting for 1.592% of the total claims in the OutPatient Data, as shown in plot 2.
  2. At least 1 claim has been filed by every provider in both the InPatient and OutPatient Data.
  3. This tells us that the providers filed claims in the range 1–516 (InPatient Data) and 1–8240 (OutPatient Data) respectively.

Point of Interest: This relationship was important to check because all false claims could conceivably be filed by specific providers only. Here, I am taking the assumption that a provider files only one type of claim (either all Fraudulent or all Non-Fraudulent).

Fig: Comparison between InPatient and OutPatient Data

Observations:

  1. 5.58 million was granted to PRV52019 alone, accounting for 1.367% of the total reimbursement amount in the InPatient Data, as shown in plot 1. 2.32 million was granted to PRV51459 alone, accounting for 1.566% of the total reimbursement amount in the OutPatient Data, as shown in plot 2.
  2. At least 57K was granted to every provider in the InPatient Data, and at least 2.1K to every provider in the OutPatient Data.
  3. This tells us that the providers were granted reimbursement in the range 57K–5.58 million (InPatient Data) and 2.1K–2.32 million (OutPatient Data) respectively.

Point of Interest: This relationship was important to check because all false claims could conceivably be filed by specific providers only. Here, I am taking the assumption that a provider is granted reimbursement for filing only one type of claim (either all Fraudulent or all Non-Fraudulent).

Fig: Comparison between InPatient and OutPatient Data
  1. Deeper insights need to be drawn from the following existing features after merging this dataset with class labels and outpatient data:
    – Diagnostic Codes and Procedure Codes
    – Providers
    – Reimbursement Amount and Deductible Amount Paid
  2. No conclusions were drawn from the following features; hence there is no need to merge them with the class labels and outpatient data:
    – Claim Start Month and Claim Start Year
    – Claim End Month and Claim End Year

We have 4 different csv files, which are interconnected by foreign keys.

Task: Merge them using the foreign keys to get one overall dataset. A brief overview of the merged dataset is shown in the plot below:

Final Dataset After Merging
  1. Before merging the data, add an IsHospitalized column whose value is 1 for Inpatient data and 0 for Outpatient data.
# Code Reference: https://www.geeksforgeeks.org/numpy-ones-python/

all_ones = np.ones(len(train_inp), dtype=int)     # ones for all the hospitalized/inpatient beneficiaries
train_inp['IsHospitalized'] = all_ones            # new feature: inpatient (1) or outpatient (0)

all_zeros = np.zeros(len(train_out), dtype=int)   # zeros for all the outpatient beneficiaries
train_out['IsHospitalized'] = all_zeros           # new feature: inpatient (1) or outpatient (0)

2. Merge Inpatient and Outpatient data based on common columns:

# Code Reference: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.merge.html

common_lst = []

for col in train_out.columns:            # find columns common to the inpatient and outpatient data
    if col in train_inp.columns:
        common_lst.append(col)           # collect the common columns

# Merge the inpatient and outpatient data on the common columns
in_out_patient = pd.merge(train_inp, train_out, on=common_lst, how='outer')
in_out_patient.head()

3. Merge beneficiary details with inpatient and outpatient data on BeneID.

# Merge the combined inpatient/outpatient data with the beneficiary data on BeneID

ben_inout_patient = pd.merge(in_out_patient, train_ben, on='BeneID', how='inner')
ben_inout_patient.head()

4. Merge provider details with the previously merged data on the Provider column.

# Code Reference: https://www.geeksforgeeks.org/python-pandas-merging-joining-and-concatenating/
# Merge the combined data (inpatient/outpatient and beneficiary) with the train class labels on Provider

train_final = pd.merge(ben_inout_patient, train_labels, how='inner', on='Provider')
train_final.head()

Observations:

  1. There are 212796 Fraudulent claims and 345415 Non-Fraudulent claims. The percentage of Fraudulent claims = 38.12% and of Non-Fraudulent claims = 61.88%, as shown in the above plot.

Point of Interest: There is a slight imbalance between the two class labels as far as claims are concerned. This imbalance is less than what we saw for providers. From the analyzed data, we can conclude that the dataset is Highly Imbalanced for providers and slightly imbalanced for claims.

Observations:

  1. Looking at both graphs together, for all cases where Claim Days > Hospitalization Days, we need to be thorough in our analysis around a claim duration of 20 days.
  2. Most of the claims, fraudulent or non-fraudulent, were filed for 0–5 days. In this data, every claim filed for more than 35 days is potentially fraudulent.
  3. Most of the hospitalizations, fraudulent or non-fraudulent, lasted 0–15 days.
  4. Why Claim Days?
    – To check for how many days the maximum number of claims were filed.
  5. Why Hospitalization Days?
    – To check for how many days the maximum number of hospitalizations lasted.

Point of Interest: From the above plot, it can be seen that there is no difference in graphs for potentially fraudulent/non-fraudulent claims for either Claim Days or Hospital Days feature. This implies that Claim Days or Hospital Days alone cannot be useful in determining a potential fraud.

Observations:

  1. In total, 17 claims had Claim Days > Hospitalization Days, and all of them are fraudulent.
  2. 5 cases with Difference (Claim Days - Hospital Days) = 1 account for 0.0009%, 4 cases with Difference = 2 account for 0.0007%, and 8 cases with Difference = 3 account for 0.0014% of the total filed claims that are fraudulent.
  3. Why the Difference, and specifically only Claim Days > Admit Days?
    – If Claim Days exceed Admit Days, the claim amounts reimbursed in such cases will be higher. This is a sign of FRAUD.
    – We cannot use an (Admit Days - Claim Days) difference feature, because that situation can occur when a beneficiary is enrolled in a low-cost plan, making fraudulent claims difficult to identify.

Point of Interest: This implies that the Difference (Claim Days - Admit Days) can be of some help, but this feature alone cannot determine whether a filed claim is potentially fraudulent.

Observations:

  1. Most of the beneficiaries who filed fraudulent claims belong to the old age category (65–83). Only a few beneficiaries who filed fraudulent claims belong to the young age category (26–45).
  2. Why Age Category?
    – The assumption was that most fraudulent claims are filed under a specific age category, either too young or too old.

Point of Interest: From the above plot, the beneficiaries who filed the major chunk of the fraudulent claims belong to the 65–83 age group, i.e., the old category; they are neither too old nor too young. But the same observation holds for beneficiaries who filed non-fraudulent claims. This implies that the Age Category feature alone cannot determine whether a beneficiary who filed a potentially fraudulent claim is too young or too old.

Observations:

  1. From the physician plots: if an attending physician filed more than 500 claims, an operating physician more than 100 claims, or another physician more than 500 claims, then all of those claims are fraudulent.

Point of Interest: From the above plots, the attending/operating/other physicians who filed more than 500/100/500 claims are all fraudulent. But below those thresholds we cannot predict anything about whether a claim is fraudulent. This implies that the Attending/Operating/Other Physician features alone cannot determine the potentially fraudulent claims.

Observations:

  1. From the above plot, we can conclude that all claims with Claim Days = 36 are fraudulent.
  2. There is no clear pattern in the reimbursement amount granted for 0–35 days for fraudulent versus non-fraudulent claims.

Observations:

  1. There is no Race4 in the dataset. From the above plot, it can be seen that there is an imbalance among the races of the beneficiaries for both fraudulent and non-fraudulent claims.

Point of Interest: From the above plot, the beneficiaries who filed the major chunk of the fraudulent claims belong to Race1. But the same observation holds for beneficiaries who filed non-fraudulent claims. This implies that the Race feature alone cannot determine whether a beneficiary who filed a potentially fraudulent claim belongs to a particular race.

Observations:

  1. This tells us that most of the beneficiaries are from Race1, but they are distributed roughly uniformly across both genders for both fraudulent and non-fraudulent claims. Hence, there is no imbalance in the distribution of the dataset with respect to gender and potential fraud.

Point of Interest: From the above plot, the beneficiaries who filed the major chunk of the fraudulent claims belong to Race1, roughly equally distributed between Gender0 and Gender1. But the same observation holds for beneficiaries who filed non-fraudulent claims.

Observations:

  1. 1.11% of the beneficiaries who do not suffer from Chronic Kidney Disease but do suffer from Renal Disease/Kidney Failure filed fraudulent claims.

Point of Interest: This implies that the Renal Disease Indicator and Chronic Kidney Disease relationship can be of some help, but this feature alone cannot determine whether a filed claim is potentially fraudulent.

  1. The following features are useful, but not on their own, for getting insights about the fraudulent claims:
    – Claim Days and Hospitalization Days
    – Difference(Claim_Days>Hospital_Days only)
    – Birth Year and Age
    – Age Category
  2. After merging the data to draw deeper insights from the following existing features, it was concluded that they are useful, but not on their own, for identifying a fraudulent claim:
    – Attending, Operating and Other Physicians
    – Reimbursement Amount and Deductible Amount Paid
    – Human Race
    – Renal Disease Indicator and Chronic Kidney Disease Relationship
  3. The IsHospitalized feature is added before merging the dataset: for the train inpatient data its value is 1, indicating the beneficiary was hospitalized, and for the train outpatient data its value is 0, indicating the beneficiary was not hospitalized.

PCA is used for dimensionality reduction and t-SNE for visualization.

In the existing approaches, mainly built-in models from scikit-learn were used, and no proper strategy was adopted for the data imbalance. The dimensionality of the dataset was also large, which can cause curse-of-dimensionality issues for machine learning models.

  1. I have tried two approaches:
    – A Class Weight = 'Balanced' scheme along with stratified sampling of the data.
    – Synthetic minority oversampling using SMOTE to handle the data imbalance (a minimal sketch follows this list).
  2. I have used PCA for dimensionality reduction and t-SNE for data visualization to get better model performance.
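A minimal sketch of the SMOTE step, assuming X_train/y_train are the prepared features and labels; imbalanced-learn provides the implementation, and the sampling ratio here is illustrative:

from imblearn.over_sampling import SMOTE

smote = SMOTE(sampling_strategy=0.43, random_state=0)  # minority:majority ≈ 30:70 after resampling
X_res, y_res = smote.fit_resample(X_train, y_train)    # synthesize new minority-class samples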
# Code Reference: https://www.geeksforgeeks.org/ml-t-distributed-stochastic-neighbor-embedding-t-sne-algorithm/
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.manifold import TSNE

def plot_TSNE(x_tr, y_tr):
    '''Function to plot a dataset with 285 features after reducing it to 2 dimensions using TSNE'''
    data_1000 = x_tr[0:1000, :]              # use the first 1000 points to keep TSNE fast
    labels_1000 = y_tr[0:1000]
    model = TSNE(n_components=2, random_state=0)

    tsne_data = model.fit_transform(data_1000)
    tsne_data = np.vstack((tsne_data.T, labels_1000)).T
    tsne_df = pd.DataFrame(data=tsne_data, columns=('Dimension_0', 'Dimension_1', 'label'))  # store the 2-D TSNE data in a dataframe

    with plt.style.context('seaborn-white'):  # set a white background for plotting
        plt.figure(figsize=(8, 5))
        sns.FacetGrid(tsne_df, hue='label', size=6).map(plt.scatter, 'Dimension_0', 'Dimension_1')
        plt.title('Dimensionality Reduction to 2 Features using TSNE')  # add plot title
        plt.grid(linestyle='-')               # add grid to the plot
        plt.legend(loc='upper left', title='Potential Fraud')  # add legend to the plot

    plt.show()

# Code Reference: https://machinelearningmastery.com/principal-components-analysis-for-dimensionality-reduction-in-python/
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def feat_reduct_PCA(x_tr, x_cv, y_cv):
    '''Function that uses PCA for feature reduction and plots number of features versus the AUROC score of a Logistic Regression model'''
    comp = [i for i in range(30, 290, 10)]
    score_list = []

    for i in comp:
        pca = PCA(n_components=i, svd_solver='randomized')
        pca.fit(x_tr)                                    # fit PCA on the train data
        data_cv_pca = pca.transform(x_cv)                # transform the CV data
        model = LogisticRegression(random_state=49)      # Logistic Regression to score the reduced features
        scores = cross_val_score(model, data_cv_pca, y_cv, scoring='roc_auc', cv=5)
        score_list.append(np.mean(scores))               # store the mean AUROC score
    print('AUROC scores for different numbers of features:\n', score_list)

    plt.figure(figsize=(8, 5))
    plt.plot(comp, score_list, 'bx-')
    plt.xlabel('Number of Features')                     # add label to x-axis
    plt.ylabel('AUROC on CV Data')                       # add label to y-axis
    plt.title('Feature Reduction using PCA')             # add plot title
    plt.grid(linestyle='-')                              # add grid to the plot
    plt.show()

# Code Reference: https://stackoverflow.com/questions/50796024/feature-variable-importance-after-a-pca-analysis
from tqdm import tqdm

def feat_after_reduction(n, x_tr, x, x_tr_csv, x_cv_csv):
    '''Function that takes the n components with the best AUROC score and returns the names of the most important original features'''
    np.random.seed(0)
    pca = PCA(n_components=n, svd_solver='randomized').fit(x_tr)
    print('Explained Variance using PCA with randomized svd_solver:\n', pca.explained_variance_ratio_)
    data_pca = pca.transform(x_tr)
    n_pcs = pca.components_.shape[0]

    most_important = [290]                                     # sentinel index (out of range), removed at the end
    for i in tqdm(range(n_pcs)):
        idx = np.argpartition(pca.components_[i], -10)[-10:]   # indices of the top-10 loadings in the i-th component
        idx = list(idx[np.argsort(pca.components_[i][idx])])   # sort indices in ascending order of loading
        idx.reverse()                                          # reverse so the largest loading comes first
        for j in idx:
            if j not in most_important:                        # if an index is already taken, use the next one
                most_important.append(j)
                break

    most_important = most_important[1:]                        # drop the sentinel
    initial_feature_names = list(x.columns)
    most_important_names = [initial_feature_names[most_important[i]] for i in range(n_pcs)]  # map indices to column names
    dic = {'PC{}'.format(i + 1): most_important_names[i] for i in range(n_pcs)}
    df = pd.DataFrame(sorted(dic.items()))
    imp_feat = list(set(df[1]))
    print('\nUnique Feature Names with best Explained Variance Ratio:\n', imp_feat)
    x_tr = np.array(x_tr_csv[imp_feat])
    x_cv = np.array(x_cv_csv[imp_feat])
    print('Final X_train Shape = ', x_tr.shape, '\nFinal X_cv Shape = ', x_cv.shape)

    return imp_feat, x_tr, x_cv                                # return the column names plus the reduced train and cv data

Fig: Number of Features = 50 to enhance model performance
  1. Create features after merging the entire dataset. Some of the features are listed below along with the code snippets:
    – Calculate the patient's age and alive/dead status based on DOD; if DOD is not available, calculate age relative to the maximum date available in the data.
out = pd.DatetimeIndex(train_final.DOD).max()                  # latest date in the DOD column

train_final['Age'] = ((out - pd.DatetimeIndex(train_final['DOB'])).days) / 365  # age from max DOD and the DOB column
train_final['Age'] = train_final['Age'].round().astype(int)    # convert age to int

train_final['IsAlive'] = pd.factorize(train_final['DOD'])[0]   # new feature: is the beneficiary alive (DOD missing)?
train_final['IsAlive'] = np.where(train_final['IsAlive'] == -1, 1, 0)
train_final.head()

  • Calculate claim start year, hospital admit year, Days of Hospitalization, Days of Claim filed and Difference between Hospitalization Days and Claim Filed Days based on Claim and Hospital Admit Dates.
# Code Reference: https://www.geeksforgeeks.org/get-month-and-year-from-date-in-pandas-python/
train_final['claim_strt_year'] = pd.DatetimeIndex(train_final['ClaimStartDt']).year  # new feature: claim start year
train_final['adm_year'] = pd.DatetimeIndex(train_final['AdmissionDt']).year          # new feature: admission year

# New feature: number of days covered by the claim
train_final['Claim_Days'] = (pd.DatetimeIndex(train_final['ClaimEndDt']) - pd.DatetimeIndex(train_final['ClaimStartDt'])).days
# New feature: number of days spent in hospital
train_final['Hospital_Days'] = (pd.DatetimeIndex(train_final['DischargeDt']) - pd.DatetimeIndex(train_final['AdmissionDt'])).days
# New feature: difference between claim days and hospital days, only when Claim_Days > Hospital_Days
train_final['Difference'] = np.where(train_final['Claim_Days'] > train_final['Hospital_Days'], train_final['Claim_Days'] - train_final['Hospital_Days'], 0)
train_final.head()

  • Calculate Risk Score based on the number of diseases a person is currently suffering from.
# Create a new dataframe with all the chronic-disease indicator columns

new_df = train_final[['RenalDiseaseIndicator', 'ChronicCond_KidneyDisease', 'ChronicCond_Heartfailure', 'ChronicCond_IschemicHeart', 'ChronicCond_Alzheimer', 'ChronicCond_Cancer', 'ChronicCond_ObstrPulmonary', 'ChronicCond_Depression', 'ChronicCond_Diabetes', 'ChronicCond_Osteoporasis', 'ChronicCond_rheumatoidarthritis', 'ChronicCond_stroke']].copy()
train_final['Risk'] = new_df.sum(axis=1)  # Risk = number of chronic diseases a beneficiary suffers from
train_final.head()

In my work, I created aggregated features and then evaluated the models for higher performance. Feature aggregation is done using pandas 'groupby', taking the mean (or a count) per group.

  1. Mean Feature Per Provider or Beneficiary:
    Why mean_feature_per_provider?
    a) The provider fills in and submits claims; hence, providers can be involved in frauds.
    b) In case a provider files claims with a high mean feature value (eg: reimbursement amount, claim days), all such claims are suspicious.
    Why mean_feature_per_beneficiary?
    a) A beneficiary can apply for the same treatments under different providers; hence, beneficiaries can be involved in frauds.
    b) In case a beneficiary files claims with a high mean feature value (eg: reimbursement amount, claim days), all such claims are suspicious.
  2. Mean Feature Per Physician or Diagnostic Group Code:
    Why mean_feature_per_physician (attending/operating/other)?
    a) The physician (attending/operating/other) prescribes the treatment for a beneficiary; hence, physicians can be involved in frauds.
    b) Sometimes a prescribed treatment may not be performed on the beneficiary, yet the amount is reimbursed for it. In case a physician files claims with a high mean feature value (eg: reimbursement amount, hospital days), all such claims are suspicious.
    Why mean_feature_per_diagnostic group code/claim admit diagnosis code?
    a) The diagnostic group code/claim admit diagnosis code can be involved in frauds.
    b) Sometimes a code (diagnostic group code/claim admit diagnosis code) may be diagnosed for a beneficiary, but the amount is reimbursed for codes other than the diagnosed ones. In such cases a claim with a high mean feature value (eg: reimbursement amount, hospital days) can be suspicious.
  3. Mean Feature Per Procedure Code or Diagnostic Code:
    – Why mean_feature_per_procedure code/mean_feature_per_diagnostic code?
    a) A procedure code (1–6) can be the same for beneficiaries who were never even diagnosed with the disease. This can result in frauds.
    b) Sometimes a costly procedure/diagnosis may be performed/recorded for a beneficiary just to increase the cost of treatment. In such cases a claim with a high mean feature value (eg: reimbursement amount, hospital days) can be suspicious.
    c) A diagnostic code (1–10) can be the same for beneficiaries whose treatments for the disease differ. This can result in frauds.
def feature_interaction(train_data, group_col, opr_col, task):
    '''Function that groups train data by group_col and performs task (eg: mean, count) on each column in opr_col'''
    for val in opr_col:
        new_feature = task + '_' + val + '_' + 'per' + ''.join(group_col)  # name of the new feature
        train_data[new_feature] = train_data.groupby(group_col)[val].transform(task)  # group-wise aggregate as a new column
    return train_data

columns = ['InscClaimAmtReimbursed', 'DeductibleAmtPaid', 'IPAnnualReimbursementAmt', 'IPAnnualDeductibleAmt', 'OPAnnualReimbursementAmt', 'OPAnnualDeductibleAmt', 'Age', 'Hospital_Days', 'Claim_Days', 'Risk']

# Mean of each column per provider, beneficiary, physician, group code, procedure code and diagnosis code
mean_group_cols = (['Provider', 'BeneID', 'AttendingPhysician', 'OperatingPhysician', 'OtherPhysician',
                    'DiagnosisGroupCode', 'ClmAdmitDiagnosisCode']
                   + ['ClmProcedureCode_{}'.format(i) for i in range(1, 7)]
                   + ['ClmDiagnosisCode_{}'.format(i) for i in range(1, 11)])
for col in mean_group_cols:
    train_final = feature_interaction(train_final, [col], columns, 'mean')  # function call

train_final = feature_interaction(train_final, ['Provider'], ['ClaimID'], 'count')  # claim count per provider

# Claim counts per (Provider, other-entity) pair; note the original list repeated ClmProcedureCode_5, fixed here to include ClmProcedureCode_6
group_col = ['BeneID', 'AttendingPhysician', 'OtherPhysician', 'OperatingPhysician', 'ClmAdmitDiagnosisCode',
             'ClmProcedureCode_1', 'ClmProcedureCode_2', 'ClmProcedureCode_3', 'ClmProcedureCode_4', 'ClmProcedureCode_5', 'ClmProcedureCode_6',
             'ClmDiagnosisCode_1', 'ClmDiagnosisCode_2', 'ClmDiagnosisCode_3', 'ClmDiagnosisCode_4', 'ClmDiagnosisCode_5', 'ClmDiagnosisCode_6',
             'ClmDiagnosisCode_7', 'ClmDiagnosisCode_8', 'ClmDiagnosisCode_9', 'ClmDiagnosisCode_10', 'DiagnosisGroupCode']
for col in group_col:
    lst = ['Provider', col]
    train_final = feature_interaction(train_final, lst, ['ClaimID'], 'count')

train_final.head()

After all the data preprocessing, feature engineering and feature reduction, we get the final dataset. We experiment with different models on the train set and then validate the performance of every model on the validation set. Based on the validation performance, we pick the best model for deployment.

Functions to validate every model are shown in the code snippet:

def pred_proba(clf, data):
    '''Function takes a classifier and train/cv data and returns the predicted probability of class 1'''
    y_pred = clf.predict_proba(data)[:, 1]  # predicted probability of class 1

    return y_pred

def find_best_threshold(thresholds, fpr, tpr):
    '''Calculates the threshold with the max value of tpr*(1-fpr)'''
    t = thresholds[np.argmax(tpr * (1 - fpr))]  # tpr*(1-fpr) is maximal when fpr is very low and tpr is very high

    return t

def predict_with_best_t(proba, threshold):
    '''Converts predicted probabilities into class labels using the best threshold'''
    predictions = []
    for i in proba:
        if i >= threshold:            # predict 1 if at or above the threshold
            predictions.append(1)
        else:                         # predict 0 if below the threshold
            predictions.append(0)
    return predictions

def plot_confusion_matrix(best_t, x_tr, x_cv, y_tr, y_cv, y_tr_pred, y_cv_pred):
    '''Function takes the best threshold, train/cv data and predicted probabilities, plots the confusion matrices and returns the train/cv predictions'''
    fig, ax = plt.subplots(1, 2, figsize=(10, 4))
    cm = confusion_matrix(y_tr, predict_with_best_t(y_tr_pred, best_t))  # confusion-matrix counts for the train data
    with plt.style.context('seaborn'):  # set the background for plotting
        sns.heatmap(cm, annot=True, fmt='d', ax=ax[0], cmap='YlGnBu')  # heatmap of the train confusion matrix
        ax[0].set_title('Train Confusion Matrix')  # add plot title
        ax[0].set_xlabel('Predicted Values')       # add label to x-axis
        ax[0].set_ylabel('Actual Values')          # add label to y-axis

    cm = confusion_matrix(y_cv, predict_with_best_t(y_cv_pred, best_t))  # confusion-matrix counts for the CV data
    with plt.style.context('seaborn'):  # set the background for plotting
        sns.heatmap(cm, annot=True, fmt='d', ax=ax[1], cmap='YlGnBu')  # heatmap of the CV confusion matrix
        ax[1].set_title('CV Confusion Matrix')     # add plot title
        ax[1].set_xlabel('Predicted Values')       # add label to x-axis
        ax[1].set_ylabel('Actual Values')          # add label to y-axis

    plt.show()

    return predict_with_best_t(y_tr_pred, best_t), predict_with_best_t(y_cv_pred, best_t)

def model_validation(clf, x_tr, x_cv, y_tr, y_cv):
    '''Function takes a classifier with train and cv data and returns the best threshold, cv AUC and train/cv F1 scores'''
    y_tr_pred = pred_proba(clf, x_tr)  # function call
    y_cv_pred = pred_proba(clf, x_cv)  # function call
    train_fpr, train_tpr, tr_thresholds = roc_curve(y_tr, y_tr_pred)  # FPR and TPR for the train data
    cv_fpr, cv_tpr, cv_thresholds = roc_curve(y_cv, y_cv_pred)        # FPR and TPR for the CV data

    print('Train AUC = {}'.format(auc(train_fpr, train_tpr)))
    print('CV AUC = {}'.format(auc(cv_fpr, cv_tpr)))

    plt.plot(train_fpr, train_tpr, label='Train AUC =' + str(auc(train_fpr, train_tpr)))  # plot the train ROC
    plt.plot(cv_fpr, cv_tpr, label='CV AUC =' + str(auc(cv_fpr, cv_tpr)))                 # plot the CV ROC
    plt.legend()                             # add legend to the plot
    plt.plot([0, 1], [0, 1], 'g-')           # plot the diagonal of a random model
    plt.xlabel('False Positive Rate (FPR)')  # add label to x-axis
    plt.ylabel('True Positive Rate (TPR)')   # add label to y-axis
    plt.title('ROC Curve')                   # add plot title
    plt.grid()                               # add grid to the plot
    plt.show()

    best_t = find_best_threshold(tr_thresholds, train_fpr, train_tpr)  # function call
    tr_pred, cv_pred = plot_confusion_matrix(best_t, x_tr, x_cv, y_tr, y_cv, y_tr_pred, y_cv_pred)
    train_f1_score = f1_score(y_tr, tr_pred)  # train F1 score
    cv_f1_score = f1_score(y_cv, cv_pred)     # cv F1 score

    return best_t, auc(cv_fpr, cv_tpr), train_f1_score, cv_f1_score

def imp_features(imp_feat, feat_weights, val, head):
    '''Function to display the most/least important features after applying a classification model'''
    feat_imp = pd.DataFrame(list(zip(imp_feat, feat_weights)), columns=['Features', 'Weights'])
    feat_imp = feat_imp[feat_imp['Weights'] != 0]  # keep features with non-zero weights
    feat_imp.reset_index(drop=True, inplace=True)

    best_worst_15_feat = feat_imp.sort_values(by='Weights', ascending=val)['Features'].iloc[0:15]
    best_worst_15_feat_weights = feat_imp.sort_values(by='Weights', ascending=val)['Weights'].iloc[0:15]
    feat_list = pd.DataFrame(list(zip(list(best_worst_15_feat), list(best_worst_15_feat_weights))), columns=['Features', 'Weights'])
    print(feat_list.head(15))

    with plt.style.context('seaborn-white'):  # set a white background for plotting
        plt.figure(figsize=(12, 8))
        sns.barplot(y=best_worst_15_feat, x=best_worst_15_feat_weights)  # bar graph of features and their weights
        plt.grid(linestyle='-')                    # add grid to the plot
        plt.xlabel('Feature Importance Weights')   # add label to x-axis
        plt.ylabel('Features')                     # add label to y-axis
        plt.title(head + 'Features')               # add plot title

    plt.show()

Approach 1:
a. Split the data into Train and Validation (70:30)
b. Use Class Weight = ‘Balanced’ scheme.
c. Use Logistic Regression, Decision Tree and XGBoost Classifier. Pick the best model based on the performance score.
d. Repeat the same process for 80:20 split of the final data into Train and Validation Set

Approach 2:
a. Split the data into Train and Validation (70:30)
b. Oversample the minority class using SMOTE so that the majority:minority ratio becomes 70:30.
c. Use Logistic Regression, Decision Tree and XGBoost Classifier. Pick the best model based on the performance score.
d. Repeat the same process for 80:20 split of the final data into Train and Validation Set
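A minimal sketch of both training setups, assuming X and y are the final features and labels; Logistic Regression is shown, and the Decision Tree and XGBoost classifiers follow the same pattern:

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from imblearn.over_sampling import SMOTE

X_tr, X_cv, y_tr, y_cv = train_test_split(X, y, test_size=0.3, stratify=y, random_state=0)

# Approach 1: class weights compensate for the imbalance during training
clf_weighted = LogisticRegression(class_weight='balanced', max_iter=1000)
clf_weighted.fit(X_tr, y_tr)

# Approach 2: SMOTE oversamples the minority class before training (ratio is illustrative)
X_res, y_res = SMOTE(sampling_strategy=0.43, random_state=0).fit_resample(X_tr, y_tr)
clf_smote = LogisticRegression(max_iter=1000)
clf_smote.fit(X_res, y_res)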

Evaluation Results:

Best Model: Logistic Regression (Class Weight = Balanced scheme)

Observations:

  1. Logistic Regression with Class Weight = Balanced and a 70:30 Train/CV split performed best among the models tried.

Observations:

  1. The following aggregated features are useful in achieving good model performance:
    – Attending/Operating/Other Physician Aggregated Feature
    – Hospitalization Days Aggregated Feature
    – Claim Days Aggregated Feature
    – Provider Aggregated Feature
    – Beneficiary ID Aggregated Feature
    – Diagnostic Codes(2,4,5,8,9) Aggregated Feature
    – Insurance Claim Reimbursement Amount Aggregated Feature
    – Renal Disease Indicator
    – Race_3
  2. The above-mentioned aggregated features (which capture the interactions between the different parties involved in the claim process) have certainly helped in achieving good performance scores.
Approach1: Class Weight = Balanced Scheme

The best-performing model is highlighted in light yellow in the above table

Approach2: Synthetic Minority Oversampling

Observations:

  1. Synthetic Minority Oversampling does not improve the model's performance; hence it is not of much use here.

As Logistic Regression with Class Weight = Balanced and a 70:30 Train/CV split worked best for this healthcare provider fraud detection problem, we use it in the final model pipeline. Please find the code snippet for the complete pipeline below.

def data_preprocessing(inp, out, ben, test_labels):
    '''Function takes test/unknown data and returns the preprocessed data'''
    all_ones = np.ones(len(inp), dtype=int)    # ones for all the hospitalized/inpatient beneficiaries
    inp['IsHospitalized'] = all_ones           # new feature: inpatient (1) or outpatient (0)
    all_zeros = np.zeros(len(out), dtype=int)  # zeros for all the outpatient beneficiaries
    out['IsHospitalized'] = all_zeros          # new feature: inpatient (1) or outpatient (0)

    common_lst = []

    for col in out.columns:          # find columns common to the inpatient and outpatient data
        if col in inp.columns:
            common_lst.append(col)   # collect the common columns

    in_out_patient = pd.merge(inp, out, on=common_lst, how='outer')
    ben_inout_patient = pd.merge(in_out_patient, ben, on='BeneID', how='inner')
    test_final = ben_inout_patient
    # pd.merge(ben_inout_patient, test_labels, how='inner', on='Provider')

    # Recode the chronic-condition flags and Gender (value 2 -> 0)
    test_final = test_final.replace({'ChronicCond_Alzheimer': 2, 'ChronicCond_Heartfailure': 2, 'ChronicCond_KidneyDisease': 2,
                                     'ChronicCond_Cancer': 2, 'ChronicCond_ObstrPulmonary': 2, 'ChronicCond_Depression': 2,
                                     'ChronicCond_Diabetes': 2, 'ChronicCond_IschemicHeart': 2, 'ChronicCond_Osteoporasis': 2,
                                     'ChronicCond_rheumatoidarthritis': 2, 'ChronicCond_stroke': 2, 'Gender': 2}, 0)
    test_final = test_final.replace({'RenalDiseaseIndicator': 'Y'}, 1)
    test_final['RenalDiseaseIndicator'] = test_final['RenalDiseaseIndicator'].apply(pd.to_numeric)  # convert object dtype to numeric

    test_final['claim_strt_year'] = pd.DatetimeIndex(test_final['ClaimStartDt']).year  # new feature: claim start year
    test_final['adm_year'] = pd.DatetimeIndex(test_final['AdmissionDt']).year          # new feature: admission year

    # New feature: number of days covered by the claim
    test_final['Claim_Days'] = (pd.DatetimeIndex(test_final['ClaimEndDt']) - pd.DatetimeIndex(test_final['ClaimStartDt'])).days
    # New feature: number of days spent in hospital
    test_final['Hospital_Days'] = (pd.DatetimeIndex(test_final['DischargeDt']) - pd.DatetimeIndex(test_final['AdmissionDt'])).days

    test_final['Difference'] = np.where(test_final['Claim_Days'] > test_final['Hospital_Days'], test_final['Claim_Days'] - test_final['Hospital_Days'], 0)
    test_final['birth_year'] = pd.DatetimeIndex(test_final['DOB']).year  # new feature: birth year

    test_final['IsAlive'] = pd.factorize(test_final['DOD'])[0]  # new feature: is the beneficiary alive (DOD missing)?
    test_final['IsAlive'] = np.where(test_final['IsAlive'] == -1, 1, 0)

    max_dod = pd.DatetimeIndex(test_final.DOD).max()  # latest date in the DOD column (renamed so the `out` parameter is not shadowed)
    test_final['Age'] = ((max_dod - pd.DatetimeIndex(test_final['DOB'])).days) / 365  # age from max DOD and the DOB column
    test_final['Age'] = test_final['Age'].round().astype(int)  # convert age to int

    lst_age = list(test_final['Age'].values)  # list of ages, to be bucketed into categories
    lst_category = []
    for i in lst_age:
        if i <= 45:
            lst_category.append('Young')      # age <= 45 -> 'Young'
        elif i <= 64:
            lst_category.append('Adult')      # 46-64 -> 'Adult'
        elif i <= 83:
            lst_category.append('Old')        # 65-83 -> 'Old'
        else:
            lst_category.append('Very Old')   # age > 83 -> 'Very Old'

    test_final['Age_Category'] = lst_category  # add the age category to the dataframe

    new_df = test_final[['RenalDiseaseIndicator', 'ChronicCond_KidneyDisease', 'ChronicCond_Heartfailure', 'ChronicCond_IschemicHeart', 'ChronicCond_Alzheimer', 'ChronicCond_Cancer', 'ChronicCond_ObstrPulmonary', 'ChronicCond_Depression', 'ChronicCond_Diabetes', 'ChronicCond_Osteoporasis', 'ChronicCond_rheumatoidarthritis', 'ChronicCond_stroke']].copy()
    test_final['Risk'] = new_df.sum(axis=1)  # Risk = number of chronic diseases a beneficiary suffers from

    test_final.fillna(value=0, inplace=True)  # fill the remaining NULL values with 0

    columns = ['InscClaimAmtReimbursed', 'DeductibleAmtPaid', 'IPAnnualReimbursementAmt', 'IPAnnualDeductibleAmt', 'OPAnnualReimbursementAmt', 'OPAnnualDeductibleAmt', 'Age', 'Hospital_Days', 'Claim_Days', 'Risk']

    # Mean of each column per provider, beneficiary, physician, group code, procedure code and diagnosis code
    mean_group_cols = (['Provider', 'BeneID', 'AttendingPhysician', 'OperatingPhysician', 'OtherPhysician',
                        'DiagnosisGroupCode', 'ClmAdmitDiagnosisCode']
                       + ['ClmProcedureCode_{}'.format(i) for i in range(1, 7)]
                       + ['ClmDiagnosisCode_{}'.format(i) for i in range(1, 11)])
    for col in mean_group_cols:
        test_final = feature_interaction(test_final, [col], columns, 'mean')  # function call

    test_final = feature_interaction(test_final, ['Provider'], ['ClaimID'], 'count')  # claim count per provider

    # Claim counts per (Provider, other-entity) pair; the duplicated ClmProcedureCode_5 is fixed to ClmProcedureCode_6, matching the train side
    group_col = ['BeneID', 'AttendingPhysician', 'OtherPhysician', 'OperatingPhysician', 'ClmAdmitDiagnosisCode',
                 'ClmProcedureCode_1', 'ClmProcedureCode_2', 'ClmProcedureCode_3', 'ClmProcedureCode_4', 'ClmProcedureCode_5', 'ClmProcedureCode_6',
                 'ClmDiagnosisCode_1', 'ClmDiagnosisCode_2', 'ClmDiagnosisCode_3', 'ClmDiagnosisCode_4', 'ClmDiagnosisCode_5', 'ClmDiagnosisCode_6',
                 'ClmDiagnosisCode_7', 'ClmDiagnosisCode_8', 'ClmDiagnosisCode_9', 'ClmDiagnosisCode_10', 'DiagnosisGroupCode']

    for col in group_col:
        lst = ['Provider', col]
        test_final = feature_interaction(test_final, lst, ['ClaimID'], 'count')  # function call

    remove_col = ['BeneID', 'ClaimID', 'ClaimStartDt', 'ClaimEndDt', 'AttendingPhysician', 'OperatingPhysician', 'OtherPhysician',
                  'ClmDiagnosisCode_1', 'ClmDiagnosisCode_2', 'ClmDiagnosisCode_3', 'ClmDiagnosisCode_4', 'ClmDiagnosisCode_5',
                  'ClmDiagnosisCode_6', 'ClmDiagnosisCode_7', 'ClmDiagnosisCode_8', 'ClmDiagnosisCode_9', 'ClmDiagnosisCode_10',
                  'ClmProcedureCode_1', 'ClmProcedureCode_2', 'ClmProcedureCode_3', 'ClmProcedureCode_4', 'ClmProcedureCode_5',
                  'ClmProcedureCode_6', 'ClmAdmitDiagnosisCode', 'AdmissionDt', 'claim_strt_year', 'adm_year', 'DischargeDt', 'DiagnosisGroupCode',
                  'DOB', 'DOD', 'birth_year', 'State', 'County']

    test_final_features = test_final.drop(columns=remove_col, axis=1)  # remove columns no longer needed
    test_final_features = pd.get_dummies(test_final_features, columns=['Gender', 'Race'])  # one-hot encode the categorical features
    test_final_features = test_final_features.groupby(['Provider'], as_index=False).agg('sum')  # aggregate the remaining features per provider

    return test_final_features

def feature_interaction(train_data, group_col, opr_col, task):
    '''Function that groups test data by group_col and performs task (eg: mean, count) on each column in opr_col'''
    for val in opr_col:
        new_feature = task + '_' + val + '_' + 'per' + ''.join(group_col)  # name of the new feature
        train_data[new_feature] = train_data.groupby(group_col)[val].transform(task)  # group-wise aggregate as a new column
    return train_data

Data Predictions: Refer to the code snippet below to predict the output.

def function_1(inp, out, ben, test_labels):
    '''Function takes test/unknown data and returns the predicted probabilities'''
    test_final_features = data_preprocessing(inp, out, ben, test_labels)  # function call
    X_test = test_final_features.drop(axis=1, columns=['Provider'])       # drop the Provider column from X
    model = joblib.load('best_model.pkl')                                 # load the stored best model

    cols_when_model_builds = model.feature_names  # get all feature names from the best model
    X_test = X_test[cols_when_model_builds]       # restrict X_test to only the features used by the best model
    y_test = test_final_features['Provider']      # keep the Provider column for the output

    # Load the scaler fitted on the training data; fitting on the test data here would leak test statistics
    standard_scaler = joblib.load('standardized_data.pkl')
    X_test = standard_scaler.transform(X_test)    # standardize the test data

    predictions = model.predict(X_test)      # predict the class labels for the test data
    pred_prob = model.predict_proba(X_test)  # predict probability estimates for each class label
    pred_prob_df = pd.DataFrame(pred_prob)
    pred_prob_df.columns = ['Potential Fraud = No', 'Potential Fraud = Yes']  # store the probability estimates as a dataframe

    final_results = pd.concat([pd.DataFrame(y_test), pd.DataFrame(predictions), pred_prob_df], axis=1)
    final_results.reset_index(drop=True, inplace=True)  # providers, class labels and probability estimates in one dataframe
    final_results.columns = ['Provider', 'PotentialFraud', 'Non-Fraud Probability', 'Fraud Probability']

    return final_results

Prediction for a provider Id:

Prediction of a Fraudulent Provider with probability scores
  1. Business Challenges:
    – The main challenge in this project was understanding the dataset: no information was provided for the features, and many of them had coded values. It was therefore tricky to interpret the business meaning of such features.
  2. Technical Challenges:
    – A high model AUC and a high F1 score are required, as this is a critical task where we cannot afford errors. Hence, model training should be carried out efficiently, which requires hyperparameter tuning.
    – The AUC score was not improving with GridSearchCV, so RandomizedSearchCV (with a wide range of values) was used to overcome this issue. Model performance improved by approximately 1%, which matters in terms of fewer errors for a critical task (a minimal tuning sketch follows this list).
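A minimal sketch of the tuning step, assuming X_tr/y_tr are the training data; the parameter ranges here are illustrative, not the exact ones used:

from scipy.stats import loguniform
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

param_dist = {'C': loguniform(1e-4, 1e2),    # wide range of regularization strengths
              'penalty': ['l1', 'l2'],
              'solver': ['liblinear']}       # solver that supports both penalties

search = RandomizedSearchCV(LogisticRegression(class_weight='balanced'),
                            param_distributions=param_dist,
                            n_iter=50, scoring='roc_auc', cv=5, random_state=0)
search.fit(X_tr, y_tr)                       # sample 50 random settings, keep the best by CV AUC
print('Best CV AUC =', search.best_score_)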
  1. Most of the features had coded values, so embeddings could be generated to compute similarity scores between them (eg: Diagnostic Group Codes, Procedure Codes). These embeddings might help improve model performance.
  2. This problem can also be solved using deep learning techniques. With a deep multilayer network we can use different activation functions (ReLU, Leaky ReLU) and dropout to prevent overfitting. As this is a binary classification problem, we can use either a sigmoid or a softmax output in the final layer. This might help improve model performance (a minimal sketch follows this list).
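A minimal sketch of the deep-learning idea, assuming X_tr/y_tr are the final standardized features and labels; the layer sizes are illustrative:

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation='relu', input_shape=(X_tr.shape[1],)),
    tf.keras.layers.Dropout(0.3),                   # dropout to reduce overfitting
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')  # sigmoid output for binary labels
])
model.compile(optimizer='adam', loss='binary_crossentropy',
              metrics=[tf.keras.metrics.AUC(name='auc')])
model.fit(X_tr, y_tr, epochs=10, batch_size=256, validation_split=0.2)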

Feature Descriptions: This spreadsheet lists all the questions raised during the extensive Exploratory Data Analysis, along with the detailed definitions and conclusions drawn for every feature.

You can find me at: https://www.linkedin.com/in/priyanka-babar-352a70181/

You can access the best-trained model via the web application:

You can find the codes related to this project at the following repository:

  1. Predicting Healthcare Fraud in Medicaid: A Multidimensional Data Model and Analysis Techniques for Fraud Detection
    https://www.sciencedirect.com/science/article/pii/S2212017313002946
  2. A survey on statistical methods for health care fraud detection
    https://cpb-us-w2.wpmucdn.com/sites.gatech.edu/dist/4/216/files/2015/09/p70-Statistical-Methods-for-Health-Care-Fraud-Detection.pdf
  3. Previous Kaggle Submissions https://www.kaggle.com/code/rohitrox/medical-provider-fraud-detection/notebook


All content and information on this blog is provided for educational purposes only.