Saturday, December 10, 2016

Predicting Telecom Sector Market Return using Machine Learning Algorithm
Abhishek Bishoyi, Lida Xu

Abstract: Financial markets are driven by data, and information may be reflected in stock prices. We can therefore use machine learning techniques to analyze financial data. In this report, we use the Decision Tree and Random Forest algorithms to fit five years of stock data from the Telecom sector. Based on the model, we forecast one month of stock prices and simulate trading based on the forecast.

Stock prices follow a stochastic process, which means it is impossible to predict a stock price based on its historical prices alone. However, stock prices can be influenced by the company's fundamental performance, such as dividends, debt, liquid assets and so on. Stock prices may also be related to their volatility. Therefore, we researched several features from academic papers that may influence stock prices: Daily Trading Volume, Volatility, Earnings Yield, Dividend Yield, Book to Market, Debt to Equity, Return on Assets, Profit Margin, and Asset Turnover. The twelve features can be categorized as shown in Table 1.
Table 1
As we can see, some features, such as Return on Assets, have a seasonal pattern, which is especially obvious in the retail industry. After discussion, we decided to focus on the Telecom industry. One reason is that this industry shows less seasonality, which meets our model assumption. The companies in this industry also behave very similarly. As Table 2 shows, there are seven companies in the telecom sector. Since we are fitting different stocks in the same sector, the daily close price is not a good target variable, so we transformed the daily close price into a daily return. We then converted the return into an indicator: 0 if the daily return is negative or zero, and 1 if the daily return is positive.
Name | Ticker
American Tower Corp | AMT
AT&T Inc | T
CenturyLink Inc | CTL
Crown Castle International Corp | CCI
Frontier Communications Corp Class B | FTR
Level 3 Communications Inc | LVLT
Verizon Communications Inc | VZ
* Sector and stock information is based on Morningstar
Table 2
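The transformation from daily close prices to a 0/1 indicator described above can be sketched as follows. This is a minimal illustration, assuming a simple close-to-close return (the report does not state whether simple or log returns were used); the function names and the price values are hypothetical.

```python
def daily_returns(closes):
    """Simple close-to-close returns: r_t = P_t / P_(t-1) - 1 (assumed definition)."""
    return [curr / prev - 1.0 for prev, curr in zip(closes, closes[1:])]


def to_indicator(returns):
    """Label each return: 1 if positive, 0 if negative or zero."""
    return [1 if r > 0 else 0 for r in returns]


# Hypothetical closing prices for one stock (illustrative values only)
closes = [100.0, 102.0, 102.0, 99.0]
labels = to_indicator(daily_returns(closes))
print(labels)  # [1, 0, 0]
```

Working with returns rather than prices puts stocks with very different price levels on a comparable scale, which is why one model can be fit across all seven tickers.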
At first, we tried to test the data on our local PC using Python. However, the dataset is too large, so we chose to use PySpark to run our algorithms. Our original data is in CSV format, so to run PySpark on the HPC we need to transform the data into LIBSVM format.
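The CSV-to-LIBSVM conversion mentioned above might look like the sketch below. The LIBSVM text format stores one record per line as `label index:value ...` with 1-based feature indices. The CSV layout (label in the first column, features after it), the column names, and the helper name `to_libsvm_line` are assumptions for illustration, not the layout of the original data.

```python
import csv
import io


def to_libsvm_line(label, features):
    """Format one record as 'label index:value ...' with 1-based indices.

    Zero-valued features are omitted, as the format is sparse.
    """
    pairs = ["%d:%s" % (i, repr(float(v)))
             for i, v in enumerate(features, start=1) if v != 0]
    return " ".join([str(label)] + pairs)


# Hypothetical CSV content: label first, then the feature columns
csv_text = "label,volume,volatility\n1,0.5,0.0\n0,1.25,2.0\n"
reader = csv.reader(io.StringIO(csv_text))
next(reader)  # skip the header row
lines = [to_libsvm_line(int(row[0]), [float(x) for x in row[1:]])
         for row in reader]
print("\n".join(lines))
```

Writing `lines` to a text file yields input of the kind that `MLUtils.loadLibSVMFile` reads.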

II. Decision Tree Algorithm and Random Forest Algorithm
Decision tree learning builds a mapping from observations about an item to its target value, and then predicts based on the learned mapping. In the tree structure, leaves represent class labels and branches represent conjunctions of features that lead to those class labels; for example, see the graph below.
A random forest is a collection, or ensemble, of decision trees. A single decision tree is built using the whole dataset and considering all features, whereas in a random forest each tree is trained on a randomly selected fraction of the rows and a randomly selected subset of the features.
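The two ingredients of a random forest described above, random subsets of rows and features for each tree and a combined vote over the trees, can be sketched in plain Python. This follows the description in the text only; library implementations may differ in detail (for example, by choosing the feature subset at each split rather than once per tree). All names and parameter values here are hypothetical.

```python
import random
from collections import Counter


def sample_subset(n_rows, n_features, row_fraction, n_sub_features, rng):
    """Pick the row indices and feature indices used to train one tree."""
    n_sub_rows = max(1, int(n_rows * row_fraction))
    rows = rng.sample(range(n_rows), n_sub_rows)
    features = rng.sample(range(n_features), n_sub_features)
    return sorted(rows), sorted(features)


def majority_vote(tree_predictions):
    """Combine the class predicted by each tree into one forest prediction."""
    return Counter(tree_predictions).most_common(1)[0][0]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
rows, features = sample_subset(n_rows=10, n_features=12,
                               row_fraction=0.5, n_sub_features=3, rng=rng)
print(len(rows), len(features))  # 5 3
print(majority_vote([1, 0, 1]))  # 1
```

Because each tree sees a different slice of the data, the errors of individual trees tend to differ, and the vote averages some of them out.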

<pre>
############################################ Decision Tree Pyspark Code ############################################
# Import packages
from pyspark import SparkContext
from pyspark.mllib.tree import DecisionTree
from pyspark.mllib.util import MLUtils

# Reuse the existing SparkContext (e.g. in the pyspark shell) or create one
sc = SparkContext.getOrCreate()

# Import training data
data = MLUtils.loadLibSVMFile(sc, 'training_data.txt')
# Test data is also based on training data
testData = data
# Fit a decision tree model
model = DecisionTree.trainClassifier(data, numClasses=2, categoricalFeaturesInfo={},
                                     impurity='gini', maxDepth=5, maxBins=32)

predictions = model.predict(testData.map(lambda x: x.features))
# collect() keeps the predictions in input order; top() would sort them
pred_list = predictions.collect()

# Write predicted values to this file
with open('predictions_decisiontree.txt', 'w') as thefile:
    for item in pred_list:
        thefile.write("%s\n" % item)

# Compute prediction error
labelsAndPredictions = testData.map(lambda lp: lp.label).zip(predictions)
# Index into the (label, prediction) pair; tuple-unpacking lambdas are Python 2 only
testErr = labelsAndPredictions.filter(lambda vp: vp[0] != vp[1]).count() / float(testData.count())
print('Test Error = ' + str(testErr))

# Print the learned classification tree
print('Learned classification tree model:')
print(model.toDebugString())
</pre>



<pre> 
############################################ Random Forest Pyspark Code ############################################
# Import packages
from pyspark import SparkContext
from pyspark.mllib.tree import RandomForest
from pyspark.mllib.util import MLUtils

# Reuse the existing SparkContext (e.g. in the pyspark shell) or create one
sc = SparkContext.getOrCreate()

# Import training data (same file as in the Decision Tree code)
trainingData = MLUtils.loadLibSVMFile(sc, 'training_data.txt')
# Test data is also based on the training data
testData = trainingData

# Train a RandomForest model.
# Empty categoricalFeaturesInfo indicates all features are continuous.
# Note: Use larger numTrees in practice.
# Setting featureSubsetStrategy="auto" lets the algorithm choose.
model = RandomForest.trainClassifier(trainingData, numClasses=2, categoricalFeaturesInfo={},
                                     numTrees=3, featureSubsetStrategy="auto",
                                     impurity='gini', maxDepth=4, maxBins=32)

# Evaluate model on test instances and compute test error
predictions = model.predict(testData.map(lambda x: x.features))

# Capture the list of predicted values.
# collect() keeps the predictions in input order; top() would sort them
pred_list = predictions.collect()

# Write predicted values to this file
with open('predictions_randomforest.txt', 'w') as thefile:
    for item in pred_list:
        thefile.write("%s\n" % item)

labelsAndPredictions = testData.map(lambda lp: lp.label).zip(predictions)
# Index into the (label, prediction) pair; tuple-unpacking lambdas are Python 2 only
testErr = labelsAndPredictions.filter(lambda vp: vp[0] != vp[1]).count() / float(testData.count())

print('Test Error = ' + str(testErr))
print('Learned classification forest model:')
print(model.toDebugString())
</pre>


II. Trading
Based on the forecast, we simulate a trading process. After observing the prediction results, we found that the Decision Tree and Random Forest algorithms give the same result, so we can choose either one to test our trading process. That both algorithms agree makes us more confident in the prediction. The trading strategy is simple: when the prediction is 1, we take a long position in the stock (buy one share); when the prediction is 0, we take a short position in the stock (sell one share). This strategy therefore trades every day. The return is calculated from the daily close-to-close price change and does not include transaction fees.
Table 3
We can see from Table 3 that Crown Castle International Corp (CCI) has the highest return, while CenturyLink Inc (CTL) performs poorly. Overall, the prediction performs well. The model could still be improved considerably, since stock prices are influenced by more than the features we selected. Other factors include the performance of related industries, analysts' comments on the stock, and so on.
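The long/short rule described in this section can be illustrated on a toy price series. This is a minimal sketch with hypothetical prices and signals, not the data behind Table 3: each day's profit is the next close minus the current close for a long position, and the opposite sign for a short position.

```python
def strategy_profits(closes, signals):
    """Per-day profit of one share held long (signal 1) or short (signal 0)."""
    profits = []
    for today, tomorrow, signal in zip(closes, closes[1:], signals):
        change = tomorrow - today
        profits.append(change if signal == 1 else -change)
    return profits


# Hypothetical closes and predictions (illustrative values only)
closes = [10.0, 11.0, 10.5, 10.0]
signals = [1, 0, 1]
profits = strategy_profits(closes, signals)
print(profits)       # [1.0, 0.5, -0.5]
print(sum(profits))  # 1.0
```

The first two predictions are right (a long before a rise, a short before a fall) and earn money; the third is wrong and loses the amount of the price move.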

<pre>
# Import packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

df_final = pd.read_csv("Telecom Trading.csv")

# One list of daily profits per stock, keyed by the value in the 'Tickers' column
tickers = ['AMT US EQUITY', 'T US EQUITY', 'CTL US EQUITY', 'CCI US EQUITY',
           'FTR US EQUITY', 'LVLT US EQUITY', 'VZ US EQUITY']
profits = {ticker: [] for ticker in tickers}

# Trading strategy: long when the indicator is 1 and short when the indicator is 0
# Profit is the close-to-close price change of one share
for i in range(len(df_final) - 1):
    ticker = df_final.loc[i, 'Tickers']
    if ticker not in profits:
        continue
    tree_position = df_final.loc[i, 'tree']
    price_change = df_final.loc[i + 1, 'PX_LAST'] - df_final.loc[i, 'PX_LAST']
    if tree_position == 1:
        # Long: gain when the next close is higher
        profits[ticker].append(price_change)
    elif tree_position == 0:
        # Short: gain when the next close is lower
        profits[ticker].append(-price_change)

# Cumulative money return and plot
# The last profit of each ticker is dropped, as in the original script: if rows are
# grouped by ticker, that entry may compare against the next ticker's first row
for ticker in tickers:
    name = ticker.split()[0]
    cumulative_return = np.cumsum([0.0] + profits[ticker][:-1]).tolist()
    plt.plot(cumulative_return)
    plt.ylabel(name + ' Return')
    plt.show()
</pre>

III. Conclusion
Using the Decision Tree and Random Forest algorithms, we fit five years of data from the Telecom sector. The training data include seven stocks and 8,555 records from Nov 1, 2011 to Nov 30, 2016. After training, we predict on the training data and then test the predictions against real trading data, using a simple long/short strategy to simulate the prediction and trading process. Overall, our model performs well: excluding transaction fees, our best model has a return of more than 20. The prediction could also be improved by introducing more features.

IV. Reference
Kumar, M. and Thenmozhi, M., 2006, January. Forecasting stock index movement: A comparison of support vector machines and random forest. In Indian Institute of Capital Markets 9th Capital Markets Conference Paper.
Kara, Y., Boyacioglu, M.A. and Baykan, Ö.K., 2011. Predicting direction of stock price index movement using artificial neural networks and support vector machines: The sample of the Istanbul Stock Exchange. Expert Systems with Applications, 38(5), pp.5311-5319.