Logistic Regression
Author name- Deepak Chhabra
Roll no- 021330424013
Batch- M2
From- ITM BUSINESS SCHOOL, KHARGHAR, NAVI MUMBAI
Introduction-
In this analysis, we aim to evaluate the predictive performance of a logistic regression model used to forecast the likelihood of default among individuals based on several independent variables, including age, income, and debt-to-income ratio.
Data Collection-
The data utilized in this analysis consists of a sample of individuals, with each observation containing information on their age, income, debt-to-income ratio, and whether they defaulted. The dataset was collected from financial records, surveys, or credit reports, ensuring that it reflects a diverse range of individuals to enhance the generalizability of the findings. The observations were coded as “Yes” for default and “No” for non-default for classification.
Data Analysis-
The analysis was conducted using logistic regression, which estimates the relationship between the independent variables (age, income, and debt-to-income ratio) and the binary outcome variable (default). The model’s performance was assessed using various metrics, including the classification table, Omnibus tests of model coefficients, and pseudo R² values (Cox & Snell R² and Nagelkerke R²). The classification table revealed the model’s predictive accuracy, with an overall percentage of correct predictions at 52.7%. The Nagelkerke R² value indicated that the model explained 25.5% of the variance in the likelihood of default.
Objective-
The primary objective of this analysis is to determine the effectiveness of the logistic regression model in predicting defaults based on the selected independent variables. Specifically, we aim to identify which variables significantly contribute to the prediction of default and assess the model’s overall accuracy and explanatory power.
Interpretation-
Classification Table-
- The classification table provides insight into the model’s predictive accuracy. In Step 1, the model correctly predicted 59% of individuals who did not default (No) and only 46.5% of those who did default (Yes). This indicates that while the model is somewhat effective at identifying non-defaulting individuals, it struggles more with accurately predicting defaults.
- The overall accuracy of the model is 52.7%, meaning that just over half of the predictions made by the model were correct. This suggests that the model has limitations in its predictive capability and may not be reliable for decision-making.
- Omnibus Tests of Model Coefficients-
- The Chi-square statistic of 56.018 (df = 3), with a significance level (p-value) reported as .000, that is p < .001, indicates that the model as a whole is statistically significant. This means that at least one of the independent variables (age, income, or debt-to-income ratio) contributes to predicting the likelihood of default.
- Model Summary-
- The Cox & Snell R² value of .125 and the Nagelkerke R² value of .255 suggest that the model explains between 12.5% and 25.5% of the variance in the dependent variable (default). These pseudo R² values indicate that while the model provides some explanatory power, a significant portion of the variance remains unexplained, highlighting the complexity of predicting defaults.
- Variables in the Equation-
- Age: The coefficient for age is -0.005, with a significance level of .003. This indicates that as age increases, the likelihood of default decreases slightly, suggesting that older individuals may be more financially stable.
- Income: The coefficient for income is 0.001, but it is not statistically significant (p = .187). This suggests that income may not be a strong predictor of default in this model, indicating that other factors may play a more critical role.
- Debt-to-Income Ratio: The coefficient for the debt-to-income ratio is 0.005, with a significance level reported as .000 (p < .001). This positive relationship indicates that as the debt-to-income ratio increases, the likelihood of default also increases. This aligns with the common financial understanding that higher debt relative to income can lead to financial strain and an increased risk of default.
- Constant: The constant term is -0.139 with a significance level of .078, so it is not statistically significant at the 5% level. The constant represents the log-odds of default when all predictors are held at zero, a combination (for example, an age of zero) that lies outside the observed data, so it carries little practical meaning on its own.
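The coefficients reported above combine into a predicted probability of default through the logistic function. The following is a minimal sketch of that calculation, assuming the predictors are entered in the same units as in the dataset (the report does not state the units). The function names and the two example individuals are hypothetical and are not taken from the data.

```python
import math

# Coefficients (B) as reported in the "Variables in the Equation" table.
COEFFICIENTS = {"age": -0.005, "income": 0.001, "debtinc_ratio": 0.005}
CONSTANT = -0.139


def predicted_probability(age, income, debtinc_ratio):
    """Return P(default = Yes) from the logit: 1 / (1 + exp(-z))."""
    z = (CONSTANT
         + COEFFICIENTS["age"] * age
         + COEFFICIENTS["income"] * income
         + COEFFICIENTS["debtinc_ratio"] * debtinc_ratio)
    return 1.0 / (1.0 + math.exp(-z))


def classify(probability, cut_value=0.5):
    """Label a case "Yes" when its probability exceeds the cut value."""
    return "Yes" if probability > cut_value else "No"


# Hypothetical individuals (illustrative inputs, not rows from the dataset).
younger_high_debt = predicted_probability(age=25, income=30, debtinc_ratio=40)
older_low_debt = predicted_probability(age=55, income=30, debtinc_ratio=5)

print(round(younger_high_debt, 3), classify(younger_high_debt))
print(round(older_low_debt, 3), classify(older_low_debt))
```

Because the age coefficient is negative and the debt-to-income coefficient is positive, the younger, more indebted example receives the higher predicted probability of the two.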
Conclusion-
The findings from the logistic regression analysis indicate that the model shows only limited predictive ability: its overall accuracy of 52.7% is just above the 50.2% achieved by the constant-only model at Step 0, so there is considerable room for improvement. The significant variables identified are age and debt-to-income ratio, with the latter showing a positive association with the likelihood of default. The model's pseudo R² values suggest that only a modest portion of the variance in defaults is explained by the independent variables. Future research could explore additional predictors or use alternative modeling techniques to improve predictive performance and better understand the factors contributing to defaults.
Classification Table (a, b)

| Step | Observed | Predicted default: No | Predicted default: Yes | Percentage Correct |
|---|---|---|---|---|
| Step 0 | default: No | 256 | 0 | 100.0 |
| | default: Yes | 254 | 0 | .0 |
| | Overall Percentage | | | 50.2 |

a. Constant is included in the model.
b. The cut value is .500
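The Step 0 table reflects a model that contains only the constant. A minimal sketch of why such a model places every case in the "No" group, using the observed counts from the table (the variable names are illustrative):

```python
import math

# Observed counts from the Step 0 classification table.
n_no, n_yes = 256, 254
n_total = n_no + n_yes

# With no predictors, the fitted probability of default is the sample proportion.
p_default = n_yes / n_total
# The matching constant is the log-odds of that proportion.
constant_only = math.log(p_default / (1 - p_default))

# Every case gets the same probability, so every case gets the same label.
predicted_label = "Yes" if p_default > 0.5 else "No"
n_correct = n_no if predicted_label == "No" else n_yes
baseline_accuracy = 100 * n_correct / n_total

print(predicted_label, round(p_default, 3), round(baseline_accuracy, 1))
# No 0.498 50.2
```

Since the shared probability of 0.498 is below the .500 cut value, all 510 cases are predicted as "No", which gives the 50.2% baseline against which Step 1 is compared.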
Omnibus Tests of Model Coefficients

| | | Chi-square | df | Sig. |
|---|---|---|---|---|
| Step 1 | Step | 56.018 | 3 | .000 |
| | Block | 56.018 | 3 | .000 |
| | Model | 56.018 | 3 | .000 |
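A Sig. value printed as .000 means the p-value is smaller than .0005, not that it is exactly zero. A short sketch of the upper-tail chi-square probability for df = 3, using only the Python standard library, shows this for the Model row. The closed form used here holds for 3 degrees of freedom only, and the function name is illustrative.

```python
import math


def chi_square_sf_df3(x):
    """Upper-tail probability of a chi-square variable with 3 degrees of freedom.

    Closed form for df = 3:
    erfc(sqrt(x / 2)) + sqrt(2 * x / pi) * exp(-x / 2)
    """
    return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)


# Model chi-square from the table above, with df = 3.
p_value = chi_square_sf_df3(56.018)
print(p_value < 0.001)  # True
```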
Model Summary

| Step | -2 Log likelihood | Cox & Snell R Square | Nagelkerke R Square |
|---|---|---|---|
| 1 | 495.602 (a) | .125 | .255 |

a. Estimation terminated at iteration number 4 because parameter estimates changed by less than .001.
This table contains the Cox & Snell R Square and Nagelkerke R Square values, which are both methods of calculating the explained variation. These values are sometimes referred to as pseudo R² values. The explained variation in the dependent variable based on our model ranges from 12.5% to 25.5%, depending on whether you reference the Cox & Snell R² or the Nagelkerke R² method. It is preferable to report the Nagelkerke R² value.
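For reference, the two measures are commonly defined from the likelihood of the constant-only model and the likelihood of the fitted model. The symbols L_0 (constant-only likelihood), L_1 (fitted-model likelihood), and n (number of cases) are introduced here for illustration and do not appear in the SPSS output.

```latex
R^2_{\mathrm{CS}} = 1 - \left(\frac{L_0}{L_1}\right)^{2/n},
\qquad
R^2_{\mathrm{N}} = \frac{R^2_{\mathrm{CS}}}{1 - L_0^{\,2/n}}
```

The Cox & Snell measure cannot reach 1, because its maximum is 1 - L_0^{2/n}. The Nagelkerke measure divides by that maximum so that the scale runs from 0 to 1, which is why it is never smaller than the Cox & Snell value and is usually the one reported.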
Classification Table (a)

| Step | Observed | Predicted default: No | Predicted default: Yes | Percentage Correct |
|---|---|---|---|---|
| Step 1 | default: No | 151 | 105 | 59.0 |
| | default: Yes | 136 | 118 | 46.5 |
| | Overall Percentage | | | 52.7 |

a. The cut value is .500
For the non-default cases, 59.0% were predicted correctly, but for the default cases only 46.5% were predicted correctly, so the overall percentage correct is 52.7% with a cut-off of 0.5.
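The percentages in the Step 1 classification table follow directly from its four cell counts. A minimal sketch of the arithmetic (variable and function names are illustrative):

```python
# Counts from the Step 1 classification table:
# one dictionary per observed group, keyed by the predicted group.
observed_no = {"No": 151, "Yes": 105}
observed_yes = {"No": 136, "Yes": 118}


def percent(part, whole):
    """Return part / whole as a percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)


total_no = sum(observed_no.values())    # 256 observed non-defaults
total_yes = sum(observed_yes.values())  # 254 observed defaults

pct_correct_no = percent(observed_no["No"], total_no)
pct_correct_yes = percent(observed_yes["Yes"], total_yes)
overall = percent(observed_no["No"] + observed_yes["Yes"], total_no + total_yes)

print(pct_correct_no, pct_correct_yes, overall)
# 59.0 46.5 52.7
```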
Variables in the Equation

| | Variable | B | S.E. | Wald | df | Sig. | Exp(B) |
|---|---|---|---|---|---|---|---|
| Step 1 (a) | age | -.005 | .011 | 9.995 | 1 | .003 | .995 |
| | income | .001 | .003 | 1.905 | 1 | .187 | 1.001 |
| | debtinc ratio | .005 | .019 | 60.309 | 1 | .000 | 1.142 |
| | Constant | -.139 | .462 | 4.693 | 1 | .078 | .463 |

a. Variable(s) entered on step 1: age, income, debtinc ratio.