Tuesday, October 11, 2016

Bankruptcy prediction using Logistic Regression with MATLAB

In this practice session, you will use the bankruptcy data set from a paper (see the reference at the bottom of the assignment). It contains 30 failure examples and 30 non-failure examples. You are going to implement a logistic regression algorithm and train a model on the training data. The model is then used to classify businesses as failure or non-failure.

The attributes, taken from the income statements and balance sheets, are the following:

● Size

• Sales

● Profit

• ROCE: profit before tax / capital employed (%)

• FFTL: funds flow (earnings before interest, tax & depreciation) / total liabilities

● Gearing

• GEAR: (current liabilities + long-term debt) / total assets

• CLTA: current liabilities / total assets

● Liquidity

• CACL: current assets / current liabilities

• QACL: (current assets – stock) / current liabilities

• WCTA: (current assets – current liabilities) / total assets

• LAG: number of days between the account year end and the date the annual report and accounts were filed at the company registry.

• AGE: number of years the company has been operating since its incorporation date.

• CHAUD: coded 1 if changed auditor in previous three years, 0 otherwise

• BIG6: coded 1 if company auditor is a Big6 auditor, 0 otherwise

The target variable is FAIL, which is 1 for failure and 0 for non-failure. You will program and train the model using logistic regression.
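Recall the logistic regression hypothesis, stated here for reference since the formulas below use $h_\theta$:

$$h_\theta(x)=\frac{1}{1+e^{-\theta^T x}}$$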

First the data set is read in from the sheet Sheet1 of the xls file. X is normalized because the ranges of the variables differ widely.


[data,txt,raw] = xlsread('bankruptcy.xls','Sheet1'); % numeric data, text, and raw cells

X = data(:,1:12);  % the 12 predictor variables
X = normalize(X);  % scale the features, since their ranges differ widely

y = data(:,13);    % target variable FAIL
[m,n] = size(X);   % m examples, n features
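Note that normalize here may be a user-supplied helper rather than a built-in, since base MATLAB of this era does not provide one for feature scaling. A minimal sketch of such a helper, assuming z-score scaling per column, could be:

function Xn = normalize(X)
% Z-score each column: subtract the column mean, divide by the column std.
% (Assumed helper; repmat keeps it compatible with pre-R2016b MATLAB.)
mu = mean(X);
sigma = std(X);
Xn = (X - repmat(mu, size(X,1), 1)) ./ repmat(sigma, size(X,1), 1);
end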

Then a column of ones is added to X, corresponding to the intercept parameter θ0:

X = [ones(m,1) X];    % prepend the intercept column

theta = zeros(n+1,1); % initial parameter vector
The function fminunc is then called to search, via unconstrained optimization, for the parameters that minimize the cost, where the cost and its gradient are provided by the function computeCost():


lambda = 1;                                          % regularization strength (example value)
options = optimset('GradObj', 'on', 'MaxIter', 100); % use the supplied gradient, cap iterations
[theta,cost] = fminunc(@(t)(computeCost(t,X,y,lambda)),theta,options);
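Once fminunc returns, the learned theta can already be used to classify businesses by thresholding the predicted probability at 0.5. A minimal sketch (the threshold and the accuracy printout are illustrative additions, not part of the original script):

p = sigmoid(X * theta) >= 0.5;                            % 1 = predicted failure
fprintf('Training accuracy: %.1f%%\n', mean(p == y)*100); % fraction of correct labels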
The function computeCost(), with the signature

function [cost,grad] = computeCost(theta,X,y,lambda)
implements the cost function

$$J(\theta)=\frac{1}{m}\sum\limits_{i=1}^m \left(-y^{(i)}\log\left(h_\theta(x^{(i)})\right)-(1-y^{(i)})\log\left(1-h_\theta(x^{(i)})\right)\right)+\frac{\lambda}{2m}\sum\limits_{j=1}^n\theta_j^2$$

and the gradient

$$\frac{\partial J(\theta)}{\partial \theta_j}=\frac{1}{m}\sum\limits_{i=1}^m \left(h_\theta(x^{(i)})-y^{(i)}\right)x_j^{(i)}+\frac{\lambda}{m}\theta_j \qquad (j\ge 1)$$

both stored in the variables cost and grad and returned to the calling function.
The two expressions above are implemented by the lines below. Note that theta(1) is excluded from the regularization terms, as it is the parameter of x0 = 1:


m = length(y);           % number of training examples
z = X*theta;             % linear scores
h = sigmoid(z);          % predicted probabilities
reg = [0; theta(2:end)]; % exclude theta(1) from regularization
grad = (1/m) * X' * (h - y) + (lambda/m) * reg;
cost = (1/m) * sum(-y .* log(h) - (1-y) .* log(1-h)) + (lambda/(2*m)) * sum(theta(2:end).^2);
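The code above relies on a sigmoid helper, which is not a built-in in base MATLAB; a minimal implementation would be:

function g = sigmoid(z)
% Element-wise logistic function 1 / (1 + exp(-z)).
g = 1 ./ (1 + exp(-z));
end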
Performance is much better if the data is projected into a higher-dimensional feature space (an explanation will come in another post).
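As a hedged illustration of such a projection, a polynomial feature map over two of the ratios could look like the sketch below (mapFeature is a hypothetical helper, and the choice of input columns is arbitrary):

function out = mapFeature(x1, x2, degree)
% Map two feature columns to all polynomial terms up to the given degree:
% [1, x1, x2, x1.^2, x1.*x2, x2.^2, x1.^3, ...]
out = ones(size(x1,1), 1);                      % bias column included
for i = 1:degree
    for j = 0:i
        out(:, end+1) = (x1.^(i-j)) .* (x2.^j); % each monomial term
    end
end
end

Since the mapped matrix already contains the column of ones, the intercept column would not be prepended again before training.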