Student Projects: October 2016

Thursday, October 20, 2016

Stock price clustering analysis

The given dataset contains the closing stock prices for S&P500 stocks for a period of time. A few stocks and prices are shown below:

Their symbols show on the column headers. The companies operate in 10 sectors as follows:

Health Care

Financials

Information Technology

Industrials

Utilities

Materials

Consumer Staples

Consumer Discretionary

Energy

Telecommunications Services

In the pre-processing step, a new data set is created to indicate if the stock prices increase compared with the previous day (1 or 0 corresponding to UP or DOWN). The matrix is then transposed such that the up/down movement of a stock is in in a row. The model will cluster rows/points in a number of clusters. Here the number of clusters is chosen to be 10 to see if the stocks (or most of) of companies operating in the same sectors happen to be grouped together.
The km function implements the K-means algorithm. The outer loop loops for a number of max iterations. The first inner loop assigns each example/point to a cluster. The 2nd loop re-computes the centroids of the clusters.


function [idx,centroids] = km(X,K,max_iters)
[m n] = size(X);
centroids = rand(K,n);
idx = zeros(m,1);
old_idx = zeros(m,1);

for j=1:max_iters
    change = false;
    for i=1:m
        [minval,idx(i)] = min(sum((repmat(X(i,:),K,1)-centroids).^2,2));
        if ((idx(i)~=old_idx(i)) && (change == false))
            change = true;
        end
    end
    for i=1:K
       centroids(i,:) = mean(X(idx==i,:),1); 
    end
    
    if (change == false)  
       break;
    end

    old_idx = idx;
 end

And the main script:


clear all; close all; clc;
[data,symbols,raw] = xlsread('sp500_short_period.xlsx','sp500');
movement = double((data(2:end,:)-data(1:end-1,:))>0)';
[m,n] = size(movement);

K = 10; % 10 sectors
max_iters = 100;

[idx,centroids] = km(movement,K,max_iters);

for i=1:K
    fprintf('\nStocks in group %d moving up together\n',i);
    char(symbols(idx == i))
end

An example for cluster 6: All the stock symbols are in Financial industry:
AIV
AVB
BXP
EQR
HCP
HCN
KIM
PCL
PLD
PSA
SPG
VTR
VNO

Loan recommendation system using Collaborative Filtering

You are modeling a recommender system for a lending club platform using model-based Collaborative Filtering (vs memory-based approach computing similarities between users and items to give recommendations using rating data).

The model-based CF is a latent factor model, more robust than the memory-based approach, and handles sparsity better. Consider a sparse rating matrix of which the elements are ratings given by lender j to loan i. The rating matrix is modeled by a matrix product of X (loan-feature matrix) and Ө (user-feature matrix) (see the figure). Each rating given by lender j to loan i is an inner product of row i in X and column j in Ө as illustrated below:

The recommender system model helps estimate the missing ratings and recommends good loans to the lenders. The function costFunction() computes the regularized cost function as follows:

In which the last 2 terms are the regularization terms. And the gradient of the cost function:

In the code of costFunction function, X and q are extracted from the parameter vector namely params. X has features/variables on its columns, and Ө has the lenders’ preference on it columns.


function [J,grad] = costFunction(param,Y,r,n_lenders,n_loans,n_features,lambda)
% Extract X and Theta from param vector
X = reshape(param(1:n_loans*n_features),n_loans,n_features);
Theta = reshape(param(n_loans*n_features+1:end),n_lenders,n_features);
            
% Cost
predictions = X*Theta'; % prediction,nm x nu
errors = (predictions-Y).*r; % also nm x nu
J = (1/2)*sum(sum(errors.^2));

% Gradients
X_grad = errors*Theta; % error is  nm x nu,and Theta is nu x n,X_grad is nm x n
Theta_grad = errors'*X; % error' is  nu x nm,X is nm x n,so Theta_grad is nu x n

% Regularized cost function to penalize overfitting
reg_X = (lambda/2)*sum(sum(X.^2));
reg_Theta = (lambda/2)*sum(sum(Theta.^2));
J = J+reg_Theta+reg_X;

% Add regularization terms to gradients
X_grad = X_grad+lambda*X;
Theta_grad = Theta_grad+lambda*Theta;

grad = [X_grad(:); Theta_grad(:)];
end

The optimizeCost function searches for the optimal parameter vector using gradient descent:


function [param,cost_range] = optimizeCost(param,Y,r,n_lenders,n_loans,n_features,lambda,step,maxrun)
    cost_range = zeros(maxrun,1);
    for iter = 1:maxrun
        [J,grad] = costFunction(param,Y,r,n_lenders,n_loans,n_features,lambda);
        param = param - step * grad; % Gradient descent for both X and Theta
        cost_range(iter) = J;
    end
end

In the data there are 10 lenders and a large number of loans, in which many were rated by the lenders with ratings from 1-10 (in practice there are many lenders, who invest). The ones that are not rated yet have ratings of 0.

The main script:


clear all; close all; clc;
[Y,txt,raw1] = xlsread('loandata.xlsx','loanrating','A1:J101');
[num,text,raw2] = xlsread('loandata.xlsx','loan','A1:J101');
R = (Y~=0);

n_lenders = size(Y,2);
n_loans = size(Y,1);
n_features = 10;

% Initialization
X = randn(n_loans,n_features);
Theta = randn(n_lenders,n_features);
init_param = [X(:); Theta(:)];

% Optimization
lambda = 10;
maxrun = 1000; % maximum number of iterations
step = 0.001; 
[param,cost_range] = optimizeCost(init_param,Y,R,n_lenders,n_loans,n_features,lambda,step,maxrun);

% Extract X and Theta from the variable theta
X = reshape(param(1:n_loans*n_features),n_loans,n_features);
Theta = reshape(param(n_loans*n_features+1:end),n_lenders,n_features);
pred = X * Theta';

Wednesday, October 12, 2016

How to display code and equations in a nice format

Dear Everyone,

In order to format your code as you see in the previous blog posts, in your post in HTML mode, put your MATLAB, R, Python code in between the following tags:

For MATLAB
<pre><code class="language-matlab">

</code></pre>
For R
<pre><code class="language-r">

</code></pre>

and Python
<pre class="brush: python">

</pre>

If you have equation, use Latex.

Tuesday, October 11, 2016

Stock Beta Computation using Linear Regression with MATLAB

In this article I show you how to compute stock beta using linear regression. The dataset is CAPMuniverse available in MATLAB.

The CAPM model:

$$R(k,i) = a(i) + C(k) + b(i) * (M(k) - C(k)) + V(k,i)$$

for samples k = 1, ... , m and assets i = 1, ... , n, where a(i) is a parameter that specifies the non-systematic return of an asset, b(i) is the asset beta, and V(k,i) is the residual error for each asset with associated random variable V(i). Asset alphas a(1), ... , a(n) are zeros in strict form of CAPM but non-zeros in practice.

The MATLAB dataset CAPMuniverse contains the daily total return data from 03-Jan-2000 to 07-Nov-2005 for 12 stocks as follows: 'AAPL', 'AMZN', 'CSCO', 'DELL', 'EBAY', 'GOOG', 'HPQ', 'IBM', 'INTC', 'MSFT', 'ORCL', 'YHOO'. Columns 13 and 14 are daily return data for the market, and the risk-free rate. For computing beta of each stock, You will subtract risk-free rate from the stock and the market returns to get x and y. Note that you need to add a column of 1 to x to make X so that X is of size m x 2

$$h_\theta(x)=\theta^{T}X=\theta_0+\theta_1x$$

More information regarding the dataset can be seen on Mathworks website. You will use regression to find the betas of these securities.

To compute the cost

$$J(\theta_0,\theta_1)=\frac{1}{2m}\sum\limits_{i=1}^m (h_\theta(x^{(i)})-y^{(i)}))^2$$

The function computeCost() as as follows:


function cost = computeCost(X,y,theta)
    m = length(y);
    h = X*theta;  
    cost = 1/2/m*sum((h-y).^2);
end

The parameters θ are updated with the pseudo-code:

Repeat until the maximum number of iterations {

Do the following for all thetas:

$$θ_j≔ θ_j-α/m \sum\limits_{i=1}^m (h_θ (x^{(i)}-y^{(i)})(x_j)^{(i)}$$

where j=0,…,n

}

The cost can be computed within the loop as well.

You need to update all the parameters $θ_j$ simultaneously. The function optimizeCost() output variables, function name, and input parameters


function [theta,cost_range] = optimizeCost(X,y,theta,step,maxrun)
    m = length(y);
    cost_range = zeros(maxrun,1);
    for iter = 1:maxrun
        h = X*theta;
        grad = 1/m * (h-y)' * X; % grad is 1 x d
        theta = theta - step * grad';
        cost_range(iter) = 1/2/m*sum((h-y).^2);
    end
end

Plotting the regression line:

For example the column 12 is the return data for Yahoo (YHOO), θ_1and θ_2 are

theta =

0.0001

1.6543

Where θ_1and θ_2 are alpha and beta values of YHOO computed by regression

Plotting the cost vs the number of iterations shows that the cost function does not increase as the number of iterations increases

Bankruptcy prediction using Logistic Regression with MATLAB

In this practice session, you use the bankruptcy data set from a paper (see the bottom of the assignment). There are 30 failure examples and 30 non-failure examples. You are going to implement a logistic regression algorithm, train a model to learn from the training data. The model is then going to be used to classify businesses as failure or non-failure.

The following is the information of the attributes, from the income statements and balance sheets.

● Size

• Sales

● Profit

• ROCE: profit before tax=capital employed (%)

• FFTL: funds flow (earnings before interest, tax & depreciation)=total liabilities

● Gearing

• GEAR: (current liabilities + long-term debt)=total assets

• CLTA: current liabilities=total assets

● Liquidity

• CACL: current assets=current liabilities

• QACL: (current assets – stock)=current liabilities

• WCTA: (current assets – current liabilities)=total assets

• LAG: number of days between account year end and the date the annual report and accounts were filed at company registry.

• AGE: number of years company has been operating since incorporation date.

• CHAUD: coded 1 if changed auditor in previous three years, 0 otherwise

• BIG6: coded 1 if company auditor is a Big6 auditor, 0 otherwise

The target variable is FAIL, either = 1 or 0. You program and model using logistic regression.

First the data set is read in from an Excel sheet Sheet1 in the xls file. X is normalized as the range of variables are much different.


[data,txt,raw] = xlsread('bankruptcy.xls','Sheet1');

X = data(:,1:12);
X = normalize(X);
X = data(:,1:12); 

y = data(:,13);
[m,n] = size(X);

Then a column of 1 is added to X corresponding to θ0


X = [ones(m,1) X];

theta = zeros(n+1,1);

The function fmincon is then called to search for the minimum cost, unconstrained optimized, in which the cost is provided by the function computeCost()


options = optimset('GradObj', 'on', 'MaxIter', 100);
[theta,cost] = fminunc(@(t)(computeCost(t,X,y)),theta,options);

The function computeCost()


function [cost,grad] = computeCost(theta,X,y,lambda)

implements the cost function

$$J(\theta)=\frac{1}{2m}\sum\limits_{i=1}^m (-y^{(i)}log(h_\theta(x^{(i)})-(1-y^{(i)})log(1-h_\theta(x^{(i)}))+\frac{\lambda}{2m}\sum\limits_{j=1}^n\theta_j^2$$

and the gradient

$$\frac{\partial J(\theta)}{\partial t}≔\frac{1}{m} \sum\limits_{i=1}^m (h_θ (x^{(i)}-y^{(i)})(x_j)^{(i)}+\frac{\lambda}{m}\theta_j^2$$

then stored in the variables cost and grad to return to the calling function.
The 2 above are implemented by these lines. Note that theta(1) should not be included as it is the parameter of x0 = 1


z = X*theta;
h = sigmoid(z);
grad = (1/m * (h-y)' * X) + lambda * [0;theta(2:end)]'/m; 
cost =  1/(m) * sum(-y .* log(h) - (1-y) .* log(1-h)) + lambda/m/2*sum(theta(2:end).^2);

Performance is much better if data is projected to a high dimensional space (explanation is going to be in another post)

Monday, October 10, 2016

Minimum Variance Portfolio with R and MATLAB

Minimum variance portfolio of risky assets that bears the lowest risk level of expected rate of return. This article shows how to program using R and MATLAB

First you create a matrix of random returns. Using real data is encouraged but not required. In R you need quadprog package to solve the QP optimization problem, so you will have

require(quadprog)

library(quadprog)

at the top of the program. Then generate returns with mean and standard deviation parameters. The array has ns rows/observations and na columns/variables

na <- 20
ns <- 100

returns  <- array(rnorm(ns*na, mean = 0.001, sd = 0.005), dim = c(ns,na))

Note that R solves

$$min\, c^{'}w+1/2w^{'}Qw$$

subject to the constraints, while you need just the quadratic term. So you need to make c a zero vector, which is done by dvec = rep(0,20) in the function call. And Q is twice the covariance matrix. A and a contain the equality constraints, here meaning sum of all the weights = a = 1

Q <- 2 * cov(returns)

A <- rbind(rep(1,20))

a <- 1

B <- rbind(diag(na))

b <- rbind(array(0, dim = c(na,1)))

Note that B and b contain the inequality constraints that say individual weights are greater than or equal to 0.

Then comes the call to solve.QP. The parameter meq = 1 to tell it that there is 1 equality constraint on the top of the constraint matrix Amat and bvec (and there is just one in this case, no inequality constraint).

result <- solve.QP(Dmat = Q, dvec = rep(0,20), Amat = t(rbind(A,B)), bvec = rbind(a,b), meq  = 1)

Then get the result

w <- result$solution

Then the weights are:

 
w
[1] 0.042753543 0.043274802 0.032720692 0.052568580 0.002538901 0.084979023 0.041237919 0.071080501 0.020027778
[10] 0.019911924 0.062425080 0.037147835 0.115155891 0.038342951 0.028615360 0.054265149 0.040568791 0.052440092
[19] 0.086009247 0.073935940

Coding in MATLAB is similar. Again here c is a vector of all zeros. B and b represent the constraints that individual weights are positive (no short sales). You need the negative sign as the constraints in MATAB use $&\leq$ but you need $&\geq$, so multiplying both sides with -1 changes the sign.

returns = 0.001 + randn(ns, na) * 0.005;
Q = 2 * cov(returns);
c = zeros(1,na);
A = ones(1,na);
a = 1; 
B = -eye(na);
b = zeros(na,1);
w = quadprog(Q,c,B,b,A,a);