LOGISTIC REGRESSION
What is logistic regression?
The logistic regression model, or logit model, is used to model the probability of a certain class or event, such as win or lose, dead or alive, sick or not sick. It is a regression analysis in which the dependent variable is dichotomous. It can, however, be extended to several classes of events, for example determining whether the data contain a human, an animal, a bird, a fish, etc. Each object being detected in the data would then be assigned a probability between 0 and 1, with the probabilities summing to one. Logistic regression is used to describe data and to explain the relationship between one binary dependent variable and one or more nominal, ordinal, interval or ratio-level independent variables.
Comparison with linear regression model
The logistic model is often described as a linear model, but it passes the linear combination of the inputs through a more complex function. This function is called the sigmoid function, or logistic function, and it appears as a curved, S-shaped line when plotted. The hypothesis of logistic regression thereby limits the output to values between 0 and 1. A plain linear model fails to represent this, because its predicted values can go above 1 (or below 0).
What is Sigmoid function?
A sigmoid function is a mathematical function having a characteristic “S”-shaped curve, or sigmoid curve. In order to map predicted values to probabilities, we use the sigmoid function. It is represented by the following mathematical formula:

σ(z) = 1 / (1 + e^(−z))

For any real input z, the output of σ(z) lies strictly between 0 and 1, which is why it can be read as a probability.
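A minimal sketch of this mapping (the function name sigmoid and the sample inputs below are illustrative only, not part of the heart-disease code later in this post):

import numpy as np

def sigmoid(z):
    # Map any real-valued input to the open interval (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

# Large negative inputs approach 0, large positive inputs approach 1,
# and an input of 0 maps exactly to 0.5
print(sigmoid(np.array([-10.0, -1.0, 0.0, 1.0, 10.0])))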
Logistic model hypothesis representation
Consider a model with one predictor x1 and one binary response variable Y, and let p denote P(Y=1). We assume a linear relationship between the predictor variable and the log-odds of the event that Y=1 (odds means the chance or likelihood of something happening or being the case). This linear relationship can be written as the following mathematical equation:

log(p / (1 − p)) = β0 + β1·x1
We can recover the odds by exponentiating the log-odds:

p / (1 − p) = e^(β0 + β1·x1)
The above formulas show that once β0 and β1 are fixed, we can easily compute either the log-odds that Y=1 for a given observation, or the probability that Y=1 for a given observation.
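As a small worked illustration (the coefficient values beta0 and beta1 and the predictor value below are made up for demonstration, not taken from the heart-disease model built later), we can go from a predictor value to the log-odds, the odds, and finally the probability that Y=1:

import numpy as np

beta0, beta1 = -3.0, 0.5   # hypothetical intercept and slope
x1 = 4.0                   # hypothetical predictor value

log_odds = beta0 + beta1 * x1   # linear in x1
odds = np.exp(log_odds)         # exponentiate to recover the odds
prob = odds / (1.0 + odds)      # same as applying the sigmoid to log_odds

print(log_odds, odds, prob)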
Logistic Model building using Scikit-learn
Now we are going to build a logistic regression model on a dataset and check its various metrics. Let us first load the required dataset:
import pandas as pd
df=pd.read_csv("US_Heart_Patients.csv")
df.head()
The dataset is publicly available and comes from an ongoing cardiovascular study on residents of the town of Framingham, Massachusetts. The classification goal is to predict whether a patient has a 10-year risk of future coronary heart disease (CHD). The dataset provides the patients’ information and includes over 4,000 records and 15 attributes.
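A quick way to confirm the size and column types after loading (this uses only standard pandas calls on the df created above):

df.shape       # (rows, columns)
df.info()      # column names, dtypes and non-null counts
df.describe()  # summary statistics for the numeric columns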
Renaming the male column to Sex_male so that it is easier to identify:
df.rename(columns={'male':'Sex_male'},inplace=True)
Variables
Each attribute is a potential risk factor. There are demographic, behavioural and medical risk factors.
Demographic:
- sex: male or female (Nominal)
- age: age of the patient (Continuous — although the recorded ages have been truncated to whole numbers, the concept of age is continuous)
Behavioural:
- currentSmoker: whether or not the patient is a current smoker (Nominal)
- cigsPerDay: the number of cigarettes that the person smoked on average in one day (can be considered continuous, as one can have any number of cigarettes, even half a cigarette)
Medical (history):
- BPMeds: whether or not the patient was on blood pressure medication (Nominal)
- prevalentStroke: whether or not the patient had previously had a stroke (Nominal)
- prevalentHyp: whether or not the patient was hypertensive (Nominal)
- diabetes: whether or not the patient had diabetes (Nominal)
Medical (current):
- totChol: total cholesterol level (Continuous)
- sysBP: systolic blood pressure (Continuous)
- diaBP: diastolic blood pressure (Continuous)
- BMI: Body Mass Index (Continuous)
- heartRate: heart rate (Continuous — in medical research, variables such as heart rate, though in fact discrete, are considered continuous because of the large number of possible values)
- glucose: glucose level (Continuous)
Predict variable (desired target):
- TenYearCHD: 10-year risk of coronary heart disease CHD (binary: “1” means “Yes”, “0” means “No”)
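Before modelling, it is worth checking how imbalanced the target listed above is (a quick look, assuming the target column is named TenYearCHD as used in the modelling code below):

# Count of patients with and without 10-year CHD risk
df['TenYearCHD'].value_counts()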
Now let us perform the required exploratory data analysis to prepare our data for the model.
Checking for missing values
df.isnull().sum()
count = 0
# Count how many rows contain at least one missing value
for i in df.isnull().sum(axis=1):
    if i > 0:
        count = count + 1
print('Total number of rows with missing values is', count)
print('Since it is only', round((count / len(df.index)) * 100),
      'percent of the entire dataset, the rows with missing values are excluded.')
Because only about 12 percent of the rows have missing values, we can safely exclude those rows from the dataset.
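Before dropping, an optional cross-check of where the gaps are (standard pandas only; the column-level percentages are not used further in this post):

# Percentage of missing values per column, highest first
(df.isnull().mean() * 100).sort_values(ascending=False)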
df.dropna(axis=0,inplace=True)
Now let us select the features that are most important for this data, and decide which variables we want as the dependent and independent variables.
import scipy.stats as st
import statsmodels.api as sm
from statsmodels.tools import add_constant as add_constant

df_constant = add_constant(df)
df_constant.head()

# scipy removed stats.chisqprob; some statsmodels versions still expect it,
# so restore it here via the chi-square survival function
st.chisqprob = lambda chisq, df: st.chi2.sf(chisq, df)

cols = df_constant.columns[:-1]
model = sm.Logit(df.TenYearCHD, df_constant[cols])
result = model.fit()
result.summary()
The results above show that some of the attributes have a P-value higher than the preferred alpha (5%), i.e. a weak statistically significant relationship with the probability of heart disease. A backward elimination approach is used here: the attribute with the highest P-value is removed, the regression is run again, and this is repeated until all remaining attributes have P-values below 0.05.
Feature Selection : Backward elimination technique
def back_feature_elem(data_frame, dep_var, col_list):
    """Takes in the dataframe, the dependent variable and a list of column names,
    runs the regression repeatedly, eliminating the feature with the highest
    P-value above alpha one at a time, and returns the regression result
    with all P-values below alpha."""
    while len(col_list) > 0:
        model = sm.Logit(dep_var, data_frame[col_list])
        result = model.fit(disp=0)
        largest_pvalue = round(result.pvalues, 3).nlargest(1)
        if largest_pvalue[0] < 0.05:
            return result
        else:
            col_list = col_list.drop(largest_pvalue.index)

result = back_feature_elem(df_constant, df.TenYearCHD, cols)
result.summary()
Interpreting the results: odds ratios, confidence intervals and P-values
import numpy as np

# Odds ratios, their 95% confidence intervals and the P-values
params = np.exp(result.params)
conf = np.exp(result.conf_int())
conf['OR'] = params
pvalue = round(result.pvalues, 3)
conf['pvalue'] = pvalue
conf.columns = ['CI 95%(2.5%)', 'CI 95%(97.5%)', 'Odds Ratio', 'pvalue']
print(conf)
This fitted model shows that, holding all other features constant, the odds of getting diagnosed with heart disease for males (Sex_male = 1) over the odds for females (Sex_male = 0) are exp(0.5815) = 1.788687. In terms of percent change, we can say that the odds for males are about 78.9% higher than the odds for females.
The coefficient for age says that, holding all else constant, we will see a 7% increase in the odds of getting diagnosed with CHD for a one-year increase in age, since exp(0.0655) = 1.067644.
Similarly, with every extra cigarette one smokes per day there is a 2% increase in the odds of CHD.
For total cholesterol level and glucose level there is no significant change in the odds.
There is a 1.7% increase in the odds for every unit increase in systolic blood pressure.
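To make the percent-change reading explicit (the coefficient values below are simply the ones quoted above, reused for illustration), the conversion from a log-odds coefficient to a percent change in the odds is exp(coef) − 1:

import numpy as np

coefs = {'Sex_male': 0.5815, 'age': 0.0655}   # coefficients quoted above
for name, b in coefs.items():
    pct_change = (np.exp(b) - 1) * 100        # percent change in the odds per unit increase
    print(f'{name}: odds ratio = {np.exp(b):.4f}, ~{pct_change:.1f}% change in odds')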
Splitting the data to train test data
import sklearn
new_features=df[['age','Sex_male','cigsPerDay','totChol','sysBP','glucose','TenYearCHD']]
x=new_features.iloc[:,:-1]
y=new_features.iloc[:,-1]
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=.20,random_state=5)
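One note on the split: because only a minority of patients have TenYearCHD = 1, it can help to preserve the class ratio in both sets. train_test_split supports this through its stratify argument (optional here, shown only as a variant of the call above):

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=.20, random_state=5, stratify=y)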
Applying Logistic regression model
from sklearn.linear_model import LogisticRegression
logreg=LogisticRegression()
logreg.fit(x_train,y_train)
y_pred=logreg.predict(x_test)
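By default, predict applies a 0.5 probability threshold to decide the class. The underlying probabilities can be inspected with predict_proba; the second column (the probability of class 1) is what the ROC curve section below uses:

y_pred_prob_yes = logreg.predict_proba(x_test)   # column 0: P(no CHD), column 1: P(CHD)
y_pred_prob_yes[:5]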
Now comes the important part, i.e. model evaluation. First we will check the model accuracy:
from sklearn.metrics import accuracy_score
accuracy_score(y_test, y_pred)
Accuracy of the model is 0.88
Let us check the confusion matrix
import matplotlib.pyplot as plt
import seaborn as sn
from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_test, y_pred)
conf_matrix = pd.DataFrame(data=cm, columns=['Predicted:0', 'Predicted:1'],
                           index=['Actual:0', 'Actual:1'])
plt.figure(figsize=(8, 5))
sn.heatmap(conf_matrix, annot=True, fmt='d', cmap="YlGnBu")
plt.show()
The confusion matrix shows 658 + 4 = 662 correct predictions and 88 + 1 = 89 incorrect ones.
True Positives: 4
True Negatives: 658
False Positives: 1 (Type I error)
False Negatives: 88 ( Type II error)
Now let us check the model evaluation statistics
TN = cm[0, 0]
TP = cm[1, 1]
FN = cm[1, 0]
FP = cm[0, 1]
sensitivity = TP / float(TP + FN)
specificity = TN / float(TN + FP)
print('The accuracy of the model = TP+TN / (TP+TN+FP+FN) =', (TP + TN) / float(TP + TN + FP + FN))
print('The misclassification = 1-Accuracy =', 1 - (TP + TN) / float(TP + TN + FP + FN))
print('Sensitivity or True Positive Rate = TP / (TP+FN) =', sensitivity)
print('Specificity or True Negative Rate = TN / (TN+FP) =', specificity)
print('Positive Predictive Value = TP / (TP+FP) =', TP / float(TP + FP))
print('Negative Predictive Value = TN / (TN+FN) =', TN / float(TN + FN))
print('Positive Likelihood Ratio = Sensitivity / (1-Specificity) =', sensitivity / (1 - specificity))
print('Negative Likelihood Ratio = (1-Sensitivity) / Specificity =', (1 - sensitivity) / specificity)
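As a quick cross-check of these hand-computed quantities, scikit-learn's classification_report prints precision, recall and F1-score for each class in one call:

from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))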
ROC Curve
from sklearn.metrics import roc_curve

# Probabilities for the positive class come from predict_proba (column 1)
y_pred_prob_yes = logreg.predict_proba(x_test)
fpr, tpr, thresholds = roc_curve(y_test, y_pred_prob_yes[:, 1])
plt.plot(fpr, tpr)
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.0])
plt.title('ROC curve for Heart disease classifier')
plt.xlabel('False positive rate (1-Specificity)')
plt.ylabel('True positive rate (Sensitivity)')
plt.grid(True)
plt.show()
A common way to visualize the trade-offs of different thresholds is by using an ROC curve, a plot of the true positive rate (true positives / total positives) versus the false positive rate (false positives / total negatives) for all possible choices of threshold.
A model with good classification accuracy should have significantly more true positives than false positives at all thresholds.
The optimum position on the ROC curve is towards the top-left corner, where both sensitivity and specificity are at their best.
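To summarise the ROC curve in a single number, the area under it (AUC) ranges from 0.5 for a random classifier to 1.0 for a perfect one. It can be computed with scikit-learn's roc_auc_score on the same predicted probabilities used above:

from sklearn.metrics import roc_auc_score
print('AUC:', roc_auc_score(y_test, y_pred_prob_yes[:, 1]))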
Advantages of Logistic Regression
Logistic regression does not require high computational power, is easy to implement and easy to interpret, and does not require feature scaling.
Disadvantages of Logistic Regression
Logistic regression cannot handle a large number of categorical features, is prone to overfitting, and does not work well with features that are uncorrelated with the target variable or that are very similar or highly correlated with each other.