Saturday, December 10, 2016

Job Recommendation System

Emily Rayfield, Shuai Hao

MATH 5800-031 Fall 2016

Link to Presentation: https://www.youtube.com/watch?v=zSgpmDkTJ3M&feature=youtu.be

Background

Our system recommends jobs to job seekers based on their click history when job ads are sent to them. We use a dataset from the Kaggle website that includes basic information about users and jobs. The objective of our system is to recommend jobs to users based on job descriptions. To do this, we use Python and the Spark framework. The core methods are TF-IDF (with a separate stopword-removal function) and Naive Bayes classification. We use TF-IDF to mine the job text and measure how important each term is, and we train Naive Bayes classifiers on the TF-IDF features to predict which jobs a user will click.

Data

We use datasets from the Kaggle website that include information on jobs and users as well as users' click history. The 'users' dataset contains user IDs along with city, state, country, zip code, degree type, major, graduation date, work history, total years of experience, and whether the user is currently employed. The 'clicks' dataset contains the job IDs that each user clicked on. Finally, the 'jobs' dataset includes job title, description, requirements, city, state, country, zip code, start date, and end date. There are 389,708 users and 1,092,096 jobs in total. The system is built to predict whether a user will click on a job. One challenge in analyzing the data is that several variables are non-numeric, so we need to assess how important those variables are.

Methodology

TF-IDF
Since several aspects influence a user's choices, we use TF-IDF (term frequency - inverse document frequency) to mine the job text and measure how important each term is within a document. The IDF (inverse document frequency) is a numeric measure of how much information a term provides. Denote a term by t, a document by d, and the corpus by D. The term frequency TF(t, d) is the number of times that term t appears in document d, and the document frequency DF(t, D) is the number of documents that contain term t.
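Combining these, TF-IDF is the product of the term frequency and the inverse document frequency; Spark's IDF estimator, which we use below, applies a smoothed logarithm:
$$IDF(t, D) = \log\frac{|D| + 1}{DF(t, D) + 1}, \qquad TFIDF(t, d, D) = TF(t, d)\cdot IDF(t, D)$$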
We use the Tokenizer from pyspark.ml.feature to create a column of terms, then the HashingTF transformer to convert each set of terms into a fixed-length feature vector. We fit an IDF model on these feature vectors and use it to compute the TF-IDF features, which are used in the second part of the system.
Naive Bayes Classifier
The Naive Bayes classifier is a probabilistic framework for solving classification problems. It is based on Bayes' theorem, which describes the probability of an event based on prior knowledge of conditions that might be related to the event, and it assumes that the features are conditionally independent of one another given the class. We use the Naive Bayes algorithm to predict whether a user will click on a job link, with label 1 meaning the user clicks the job and label 0 meaning they do not. We then use the trained model to predict the outcome for each job, collect the list of recommended jobs for each user, and save the output to a file.
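Concretely, for a feature vector (x_1, ..., x_n) of TF-IDF weights and a class label y (1 = click, 0 = no click), the naive independence assumption gives
$$P(y \mid x_1, \dots, x_n) \;\propto\; P(y)\prod_{i=1}^{n} P(x_i \mid y)$$
and the classifier predicts the label with the larger posterior probability.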

Results

The code for all users is below, but we will take one user as an example. We ran the model on User ID 47 and it returned 2453 recommended jobs total. The first part of the output, showing the Job IDs of the first 10 predicted jobs, was [Row(_2=u'245'), Row(_2=u'252'), Row(_2=u'964'), Row(_2=u'1095'), Row(_2=u'1732'), Row(_2=u'1832'), Row(_2=u'1838'), Row(_2=u'2095'), Row(_2=u'2100'), Row(_2=u'2106')].
Finally, our system can be improved. It currently uses only job titles and descriptions to predict jobs, but factoring in additional user information, such as degree type, major, or years of experience, could help narrow down the job recommendations.

Code

We first load the datasets and create a list of common stopwords to remove:
 
from pyspark.ml.feature import HashingTF, IDF, Tokenizer 
from pyspark.ml.classification import NaiveBayes 
from pyspark.sql import SQLContext, Row 
import re 
 
sqlContext = SQLContext(sc) 
 
# Pattern for splitting comma-separated values
PATTERN = re.compile(r'''((?:[^,"']|"[^"]*"|'[^']*')+)''') 
 
# List of stop words to remove 
STOP_WORDS_LIST = ["a", "about", "above", "after", "again", "against", "all", "am", "an", "and", "any", "are", "aren't", 
                   "as", "at", "be", "because", "been", "before", "being", "below", "between", "both", "but", "by", 
                   "can't", "cannot", "could", "couldn't", "did", "didn't", "do", "does", "doesn't", "doing", "don't", 
                   "down", "during", "each", "few", "for", "from", "further", "had", "hadn't", "has", "hasn't", "have", 
                   "haven't", "having", "he", "he'd", "he'll", "he's", "her", "here", "here's", "hers", "herself", 
                   "him", "himself", "his", "how", "how's", "i", "i'd", "i'll", "i'm", "i've", "if", "in", "into", "is", 
                   "isn't", "it", "it's", "its", "itself", "let's", "me", "more", "most", "mustn't", "my", "myself", 
                   "no", "nor", "not", "of", "off", "on", "once", "only", "or", "other", "ought", "our", "ours", 
                   "ourselves", "out", "over", "own", "same", "shan't", "she", "she'd", "she'll", "she's", "should", 
                   "shouldn't", "so", "some", "such", "than", "that", "that's", "the", "their", "theirs", "them", 
                   "themselves", "then", "there", "there's", "these", "they", "they'd", "they'll", "they're", "they've", 
                   "this", "those", "through", "to", "too", "under", "until", "up", "very", "was", "wasn't", "we", 
                   "we'd", "we'll", "we're", "we've", "were", "weren't", "what", "what's", "when", "when's", "where", 
                   "where's", "which", "while", "who", "who's", "whom", "why", "why's", "with", "won't", "would", 
                   "wouldn't", "you", "you'd", "you'll", "you're", "you've", "your", "yours", "yourself", "yourselves"] 
 
# Load users dataset 
users = sc.textFile("file:///home//emr15007//users.tsv").map(lambda line: line.split("\t")) 
header_users = users.first()  # Extract header 
usersRDD = users.filter(lambda row: row != header_users)  # Filter out header 
 
# Load clicks dataset 
clicks = sc.textFile("file:///home//emr15007//clicks.tsv").map(lambda line: line.split("\t")) 
header_clicks = clicks.first()  # Extract header 
clicksRDD = clicks.filter(lambda row: row != header_clicks)  # Filter out header 
 
# Load jobs dataset 
jobs = sc.textFile("file:///home//emr15007//jobs.tsv").map(lambda line: line.split("\t")) 
header_jobs = jobs.first()  # Extract header 
jobsRDD = jobs.filter(lambda row: row != header_jobs)  # Filter out header 
 
# Convert to DataFrame 
jobsDF = jobsRDD.toDF() 
 
Next we create two functions. The remover function removes stopwords from a string and the clean_html function removes unnecessary characters:
 
 
def remover(cstr): 
    keywords_list = cstr.lower().split()  # Split a string into individual words 
    resarr = list(set(keywords_list).difference(set(STOP_WORDS_LIST)))  # Get list of non-stopwords 
    return " ".join(resarr)  # Combine non-stopwords back into one string 
 
 
# Remove HTML tags and other unwanted characters 
def clean_html(raw_html): 
    clean_r = re.compile('<.*?>')  # matches HTML tags 
    clean_text = re.sub(clean_r, '', raw_html) 
    clean_text = clean_text.replace('\\r', '').replace('\\n', '').replace('&nbsp;', '').replace('ojp’s', '') 
    return clean_text 
 
In the next part of the code we clean the job descriptions using the above functions and perform the TF-IDF:
 
# Create a DataFrame with concatenated job title and description, as well as job ID 
jobs_features = jobsDF.rdd.map(lambda x: (remover(clean_html(x[2] + ' ' + x[3])), x[0])) 
jobs_featuresDF = sqlContext.createDataFrame(jobs_features) 
 
# Tokenizer to create a column of individual terms 
tokenizer = Tokenizer(inputCol="_1", outputCol="terms") 
termsData = tokenizer.transform(jobs_featuresDF) 
 
# Generate the term frequency vectors using HashingTF 
tf = HashingTF(inputCol="terms", outputCol="rawFeatures").transform(termsData) 
 
# IDF (down-weights terms which appear frequently across the corpus) 
idf = IDF(inputCol="rawFeatures", outputCol="features").fit(tf) 
 
# TF-IDF 
tfidf = idf.transform(tf) 
tfidf.cache() 
Finally we generate a Naive Bayes Classifier for each user and use it to get a list of predicted jobs:
 
# RDDs to be used for labels 
one = sc.parallelize([1.0]) 
zero = sc.parallelize([0.0]) 
 
# Get a list of user IDs 
user_IDs = usersRDD.map(lambda x: x[0]) 
user_list = user_IDs.collect() 
 
# Convert to a set 
user_set = set(user_list) 
 
recommended_jobs = [] 
 
# Loop through all users to generate a Naive Bayes Classifier and predict jobs for each one 
for user in user_set: 
    # Get job IDs of jobs that the user clicked 
    clicks_sample = clicksRDD.filter(lambda row: row[0] == user) 
    jobs_clicked = [item[2] for item in clicks_sample.collect()] 
    # Get jobs that the user did and did not click on; set labels 1.0 = clicked on; 0.0 = did not click on 
    yes = jobsRDD.filter(lambda row: row[0] in jobs_clicked) 
    jobs_yes = yes.cartesian(one) 
    no = jobsRDD.filter(lambda row: row[0] not in jobs_clicked) 
    jobs_no = no.cartesian(zero) 
    jobs_labeled = jobs_yes.union(jobs_no) 
    # Put labeled jobs back in original order 
    jobs_sorted = jobs_labeled.sortBy(lambda x: x[0][0]) 
    labels = jobs_sorted.map(lambda x: (x[1], x[0][0]))  # Put labels into another RDD along with Job ID 
    labelsDF = labels.toDF()  # Convert to DataFrame 
    labelsDF_new = labelsDF.selectExpr('_1 as label', '_2 as _2')  # Rename label column 
    # Train a Naive Bayes model 
    trainingDF = labelsDF_new.join(tfidf, '_2', 'inner') 
    nb = NaiveBayes(featuresCol="features", labelCol="label") 
    model = nb.fit(trainingDF) 
    # Use model to predict outcome of each job 
    predictions = model.transform(tfidf) 
    # Get job IDs for only the jobs that the model predicts the user will click on 
    jobs_predicted = predictions.filter(predictions.prediction == 1.0).select('_2').collect() 
    user_and_jobs = (user, jobs_predicted)  # Combine user ID and suggested job IDs into a tuple 
    recommended_jobs.append(user_and_jobs) 
 
# Save predicted jobs for all users to file 
results = sc.parallelize(recommended_jobs) 
results.saveAsTextFile('file:///home//emr15007//jobs_output.txt') 

Predicting Telecom Sector Market Return Using Machine Learning Algorithms
Abhishek Bishoyi, Lida Xu

Abstract: Financial markets are driven by data, and information is reflected in stock prices, so we can use machine learning techniques to analyze financial data. In this report, we use the Decision Tree and Random Forest algorithms to fit five years of stock data from the telecom sector. Based on the fitted models, we forecast one month of stock price movements and simulate trading based on the forecasts.

Stock prices follow a stochastic process, which means it is impossible to predict a stock's price from its historical prices alone. However, stock prices are influenced by fundamental performance, such as dividends, debt, and liquid assets, and they may also be related to volatility. We therefore drew on academic papers to select features that may influence stock prices: daily trading volume, volatility, earnings yield, dividend yield, book to market, debt to equity, return on assets, profit margin, and asset turnover. These features can be categorized as shown in Table 1.
Table 1
As we can see, some features, such as return on assets, have a seasonal pattern; for the retail industry in particular, the seasonal pattern is obvious. After discussion, we decided to focus on the telecom industry, partly because it shows less seasonality, which fits our model assumptions, and partly because the companies in this industry behave very similarly. As Table 2 shows, there are seven companies in the telecom sector. Since we are fitting different stocks in the sector, the daily close price is not a good target variable, so we transformed daily close prices into daily returns and then into a binary indicator: 1 if the daily return is positive, and 0 if it is negative or zero.
Name                                     Ticker
American Tower Corp                      AMT
AT&T Inc                                 T
CenturyLink Inc                          CTL
Crown Castle International Corp          CCI
Frontier Communications Corp Class B     FTR
Level 3 Communications Inc               LVLT
Verizon Communications Inc               VZ
* Sector and stock information is based on Morningstar
Table 2
At first we tried to run the analysis on a local PC using Python, but the dataset was too large, so we chose PySpark instead. Our original data is in CSV format; to run PySpark on the HPC cluster, we needed to convert the data into LIBSVM format.
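As a rough illustration of that conversion step (the file names and column layout below are assumptions for the sketch, not necessarily the actual ones we used), a CSV with the label in the first column can be rewritten in LIBSVM format like this:

import csv

def csv_to_libsvm(csv_path, libsvm_path):
    """Convert a CSV file (label in the first column, features after it)
    into the LIBSVM text format: "<label> 1:<f1> 2:<f2> ...".
    """
    with open(csv_path) as fin, open(libsvm_path, 'w') as fout:
        reader = csv.reader(fin)
        next(reader)  # skip the header row
        for row in reader:
            label = row[0]
            features = ' '.join('%d:%s' % (i, v)
                                for i, v in enumerate(row[1:], start=1))
            fout.write('%s %s\n' % (label, features))

# Example (hypothetical file names):
# csv_to_libsvm('telecom_features.csv', 'training_data.txt')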

II. Decision Tree Algorithm and Random Forest Algorithm
Decision tree learning maps observations about an item to conclusions about its target value and makes predictions based on the learned mapping. In the tree structure, leaves represent class labels and branches represent conjunctions of features that lead to those class labels; for an example, see the graph below.
A random forest is a collection, or ensemble, of decision trees. A single decision tree is built from the whole dataset using all features, whereas in a random forest each tree is trained on a randomly selected fraction of the rows and a randomly selected subset of the features.

############################################ Decision Tree Pyspark Code ############################################
# Import packages
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.tree import DecisionTree
from pyspark.mllib.util import MLUtils

# Import training data
data = MLUtils.loadLibSVMFile(sc, 'training_data.txt')
# Test data is also based on training data
testData = data
# Fit a decision tree model
model = DecisionTree.trainClassifier(data, numClasses=2, categoricalFeaturesInfo={}, impurity='gini', maxDepth=5, maxBins=32)

# Predict on the test data
predictions = model.predict(testData.map(lambda x: x.features))
# Capture the list of predicted values (top() returns them sorted in descending order)
pred_list = predictions.top(predictions.count())

#write predicted values in this file
thefile = open('predictions_decisiontree.txt','w') 

for item in pred_list:
    thefile.write("%s\n" % item)
thefile.close()

#Compute prediction error:
labelsAndPredictions = testData.map(lambda lp: lp.label).zip(predictions)
testErr = labelsAndPredictions.filter(lambda (v, p): v != p).count() / float(testData.count())
print('Test Error = ' + str(testErr))

#This creates the tree for classification
print('Learned classification tree model:')
print(model.toDebugString())



############################################ Random Forest Pyspark Code ############################################
#Import packages
from pyspark.mllib.tree import RandomForest
from pyspark.mllib.util import MLUtils

# Import training data (test data is also based on the training data)
data = MLUtils.loadLibSVMFile(sc, 'training_data.txt')
trainingData = data
testData = data

# Train a RandomForest model.
# Empty categoricalFeaturesInfo indicates all features are continuous.
# Note: Use larger numTrees in practice.
# Setting featureSubsetStrategy="auto" lets the algorithm choose.
model = RandomForest.trainClassifier(trainingData, numClasses=2, categoricalFeaturesInfo={}, numTrees=3, featureSubsetStrategy="auto", impurity='gini', maxDepth=4, maxBins=32)

# Evaluate model on test instances and compute test error
predictions = model.predict(testData.map(lambda x: x.features))

#capture the list of predicted values.
pred_list = predictions.top(predictions.count()) 

#write predicted values in this file
thefile = open('predictions_randomforest.txt','w') 

for item in pred_list:
    thefile.write("%s\n" % item)
thefile.close()

labelsAndPredictions = testData.map(lambda lp: lp.label).zip(predictions)

testErr = labelsAndPredictions.filter(lambda (v, p): v != p).count() / float(testData.count())

print('Test Error = ' + str(testErr))
print('Learned classification forest model:')
print(model.toDebugString())


III. Trading
Based on the forecasts, we simulate a trading process. After inspecting the prediction results, the Decision Tree and Random Forest algorithms give us the same predictions, so we can use either one to test our trading strategy; the fact that both algorithms agree makes us more confident in the predictions. The trading strategy is simple: when the prediction is 1 we take a long position in the stock (buy one share), and when the prediction is 0 we take a short position (sell one share). This strategy therefore trades every day. Returns are calculated from daily close-to-close prices and do not include transaction fees.
Table 3
We can see from Table 3 that Crown Castle International Corp (CCI) has the highest return, while CenturyLink Inc (CTL) performs poorly. Overall, the predictions perform well. The model could still be improved considerably, since stock prices are not influenced only by the features we selected; there are additional factors, such as the performance of related industries and analysts' comments on the stock.

# import packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

df_final = pd.read_csv("Telecom Trading.csv")

AMT = []
T =[]
CTL = []
CCI = []
FTR =[]
LVLT = []
VZ = []

# Trading strategy: long when the indicator is 1, short when the indicator is 0
# Close-to-close return
for i in range(len(df_final) - 1):
    tree_position = df_final.loc[i]['tree']

    if df_final.loc[i]['Tickers'] == 'AMT US EQUITY':
        if tree_position == 1:
            profit = df_final.loc[i+1]['PX_LAST'] - df_final.loc[i]['PX_LAST']
            AMT.append(profit)
        if tree_position == 0:
            profit = df_final.loc[i]['PX_LAST'] - df_final.loc[i+1]['PX_LAST']
            AMT.append(profit)

    if df_final.loc[i]['Tickers'] == 'T US EQUITY':
        if tree_position == 1:
            profit = df_final.loc[i+1]['PX_LAST'] - df_final.loc[i]['PX_LAST']
            T.append(profit)
        if tree_position == 0:
            profit = df_final.loc[i]['PX_LAST'] - df_final.loc[i+1]['PX_LAST']
            T.append(profit)

    if df_final.loc[i]['Tickers'] == 'CTL US EQUITY':
        if tree_position == 1:
            profit = df_final.loc[i+1]['PX_LAST'] - df_final.loc[i]['PX_LAST']
            CTL.append(profit)
        if tree_position == 0:
            profit = df_final.loc[i]['PX_LAST'] - df_final.loc[i+1]['PX_LAST']
            CTL.append(profit)

    if df_final.loc[i]['Tickers'] == 'CCI US EQUITY':
        if tree_position == 1:
            profit = df_final.loc[i+1]['PX_LAST'] - df_final.loc[i]['PX_LAST']
            CCI.append(profit)
        if tree_position == 0:
            profit = df_final.loc[i]['PX_LAST'] - df_final.loc[i+1]['PX_LAST']
            CCI.append(profit)

    if df_final.loc[i]['Tickers'] == 'FTR US EQUITY':
        if tree_position == 1:
            profit = df_final.loc[i+1]['PX_LAST'] - df_final.loc[i]['PX_LAST']
            FTR.append(profit)
        if tree_position == 0:
            profit = df_final.loc[i]['PX_LAST'] - df_final.loc[i+1]['PX_LAST']
            FTR.append(profit)

    if df_final.loc[i]['Tickers'] == 'LVLT US EQUITY':
        if tree_position == 1:
            profit = df_final.loc[i+1]['PX_LAST'] - df_final.loc[i]['PX_LAST']
            LVLT.append(profit)
        if tree_position == 0:
            profit = df_final.loc[i]['PX_LAST'] - df_final.loc[i+1]['PX_LAST']
            LVLT.append(profit)

    if df_final.loc[i]['Tickers'] == 'VZ US EQUITY':
        if tree_position == 1:
            profit = df_final.loc[i+1]['PX_LAST'] - df_final.loc[i]['PX_LAST']
            VZ.append(profit)
        if tree_position == 0:
            profit = df_final.loc[i]['PX_LAST'] - df_final.loc[i+1]['PX_LAST']
            VZ.append(profit)

# Cumulative money return and plot
AMT_return = np.cumsum([0.0]+AMT[:-1]).tolist()
plt.plot(AMT_return)
plt.ylabel('AMT Return')
plt.show()

T_return = np.cumsum([0.0]+T[:-1]).tolist()
plt.plot(T_return)
plt.ylabel('T Return')
plt.show()

CTL_return = np.cumsum([0.0]+CTL[:-1]).tolist()
plt.plot(CTL_return)
plt.ylabel('CTL Return')
plt.show() 

CCI_return = np.cumsum([0.0]+CCI[:-1]).tolist()
plt.plot(CCI_return)
plt.ylabel('CCI Return')
plt.show()

FTR_return = np.cumsum([0.0]+FTR[:-1]).tolist()
plt.plot(FTR_return)
plt.ylabel('FTR Return')
plt.show()

LVLT_return = np.cumsum([0.0]+LVLT[:-1]).tolist()
plt.plot(LVLT_return)
plt.ylabel('LVLT Return')
plt.show()

VZ_return = np.cumsum([0.0]+VZ[:-1]).tolist()
plt.plot(VZ_return)
plt.ylabel('VZ Return')
plt.show()

IV. Conclusion
Using the Decision Tree and Random Forest algorithms, we fit five years of data from the telecom sector. The training data covers seven stocks and 8,555 records from Nov 1, 2011 to Nov 30, 2016. After training, we generate predictions on the training data and test them with a simulated trading process using a simple long/short strategy. Overall the model performs well: excluding transaction fees, our best model earns a cumulative profit of more than 20 (in price units, for a single share traded daily). The predictions could be improved further by introducing more features.

V. References
Kumar, M. and Thenmozhi, M., 2006, January. Forecasting stock index movement: A comparison of support vector machines and random forest. In Indian Institute of Capital Markets 9th Capital Markets Conference Paper.
Kara, Y., Boyacioglu, M.A. and Baykan, Ö.K., 2011. Predicting direction of stock price index movement using artificial neural networks and support vector machines: The sample of the Istanbul Stock Exchange. Expert Systems with Applications, 38(5), pp.5311-5319.

Friday, December 9, 2016

Minimum-Variance Portfolio and Monte Carlo Simulation on Selected Stocks’ Returns

Fall 2016 MATH 5800-030 — Group 6 Final Project
By Zhenqian Li and Pei Wang
09 December 2016

Background and Objectives

In this course we have learned many useful programming skills for financial analysis with Matlab, Python, and R. We want to apply that knowledge to a comprehensive project using Python and Quantopian with the following objectives:

  1. Retrieve selected stocks' data from Yahoo Finance for a certain period.
  2. Establish a minimum-variance portfolio based on the stocks' returns.
  3. Run back-testing for our portfolio on Quantopian and compare it with the stock market.
  4. Develop Monte Carlo Simulation on selected stocks’ returns and compare it with their real performances.

Methodology

Minimum Variance Portfolio

Portfolio variance measures how the aggregate actual returns of the securities making up a portfolio fluctuate over time. The minimum variance portfolio is therefore the portfolio of individually risky assets that, taken together, has the lowest possible risk for its expected rate of return. The name comes from Markowitz portfolio theory, in which volatility stands in for risk, so that lower variance corresponds to lower investment risk.

Generally, in the minimum variance portfolio, we aim to minimize:
$$w^{T}Cw$$

Here, C is the covariance matrix of the returns and w is the vector of portfolio weights, whose expected portfolio return is $R^{T}w$; we minimize over w whilst keeping the sum of all the weights equal to 1:
$$\sum_{i}w_{i}=1$$
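For the global minimum-variance portfolio, where the only constraint is that the weights sum to one, this quadratic program has the well-known closed-form solution
$$w^{*} = \frac{C^{-1}\mathbf{1}}{\mathbf{1}^{T}C^{-1}\mathbf{1}}$$
where $\mathbf{1}$ is a vector of ones. In practice we solve the problem numerically, as shown in the Results section below.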

Efficient Frontier

The efficient frontier is the set of optimal portfolios offering the highest expected return for a given level of risk, or the lowest risk for a given level of expected return. All other portfolios below this curve are called sub-optimal portfolios, because at the same risk level they give a lower return than the portfolios on the frontier. Notice that the minimum variance portfolio (or global minimum-variance portfolio) is the endpoint of the efficient frontier, as shown below:
Figure 1: Sample Efficient Frontier

Monte Carlo Simulation

Monte Carlo simulations are used to model the probability of different outcomes in a process that cannot easily be predicted because of random variables. In our project, we use a random walk to simulate the stock price trend and compare it to the actual stock price.

Results

Retrieve Data

We selected eight different companies' stocks as our data: TWTR, AAPL, FB, GOOG, MSFT, BA, WMT, and GM, and used Python to retrieve the stock data. Luckily, there is already a Python library called yahoo-finance, which we first installed using the following command line:

Command prompt install:
pip install yahoo_finance

Once the yahoo_finance library was installed, we used the following code to get the data for the time interval from 01/01/2016 to 06/30/2016.


Figure 2: Retrieve Data Code
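Since the retrieval code appears here only as an image, the sketch below shows roughly how the yahoo_finance package's Share class can be used for this step; the tickers and dates come from the text, and the actual code in Figure 2 may differ:

from yahoo_finance import Share

tickers = ['TWTR', 'AAPL', 'FB', 'GOOG', 'MSFT', 'BA', 'WMT', 'GM']
start, end = '2016-01-01', '2016-06-30'

history = {}
for ticker in tickers:
    share = Share(ticker)
    # get_historical() returns a list of dicts with keys such as 'Date' and 'Adj_Close'
    history[ticker] = share.get_historical(start, end)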
After that, we combined the adjusted close prices of the eight stocks into one Excel sheet, saved as 'stockprices.xlsx', for further reference.
Figure 3: Data Import in Python
In this code, we use the 'xlrd' package to read the data from our Excel file into a matrix, and then the 'pandas' package to compute the daily returns (the 'pct_change' function in the code).
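As a rough equivalent of the code in Figure 3 (reading the Excel file directly with pandas rather than xlrd; the file name comes from the text):

import pandas as pd

# Adjusted close prices, one column per stock, dates as the index
prices = pd.read_excel('stockprices.xlsx', index_col=0)

# Daily returns for each stock
daily_returns = prices.pct_change().dropna()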

Minimum Variance Portfolio
Before finding the minimum variance portfolio, we first obtain the efficient frontier of our portfolio. The first step is to create random weights with the following function:
Figure 4: Function to Create Random Weights
Then we will assign these random weights to our portfolio and compute their corresponding returns and risks.
Figure 5: Random Portfolios
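Since Figures 4 and 5 are shown as images, a minimal sketch of these two steps might look like the following (the function names are illustrative and may not match the original code):

import numpy as np

def rand_weights(n):
    """Random portfolio weights that sum to 1."""
    w = np.random.rand(n)
    return w / w.sum()

def random_portfolio(daily_returns):
    """Mean return and risk (standard deviation) of a randomly weighted portfolio.

    `daily_returns` is a pandas DataFrame of daily returns, one column per stock.
    """
    mean_returns = daily_returns.mean().values          # per-stock mean daily return
    cov = daily_returns.cov().values                    # covariance matrix of daily returns
    w = rand_weights(len(mean_returns))
    mu = float(np.dot(w, mean_returns))                 # expected portfolio return
    sigma = float(np.sqrt(np.dot(w, np.dot(cov, w))))   # portfolio risk
    return mu, sigma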
After we obtained the random portfolios from the function above, we scatter it in a plot and see what happens.
Figure 6: Mean and Standard Deviation of Returns of Randomly Generated Portfolios
We can see that there are many possible portfolios with different returns and risks. To make this clearer, we draw the efficient frontier as a red dotted line.
Figure 7: The Efficient Frontier
As we can see from the plot, each dot represents a portfolio with its corresponding return and risk level. The red dotted line is the efficient frontier; every portfolio on it is optimal for its risk level. What we need to do next is find the specific portfolio we are after: the minimum variance portfolio.
The following Python function is defined to solve the quadratic programming problem we mentioned before. We aim to find out the portfolio with minimum risk level from all the random portfolios and obtain the corresponding weights. Notice that the input parameter is a returns vector and the only constraint we need to set is that all weights must sum up to one.
Figure 8: Minimum Variance Solver Function
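Because the solver itself is shown only as a figure, here is a minimal sketch of such a function using scipy.optimize.minimize (the original may use a different optimization library, and the names here are illustrative):

import numpy as np
from scipy.optimize import minimize

def min_variance_weights(daily_returns):
    """Weights of the minimum-variance portfolio for an (n_days x n_assets) return matrix."""
    cov = np.cov(daily_returns, rowvar=False)        # covariance matrix C of the returns
    n = cov.shape[0]
    x0 = np.repeat(1.0 / n, n)                       # start from equal weights

    def portfolio_variance(w):
        return np.dot(w, np.dot(cov, w))             # w^T C w

    # Only constraint: all weights must sum to one
    constraints = ({'type': 'eq', 'fun': lambda w: np.sum(w) - 1.0},)
    result = minimize(portfolio_variance, x0, constraints=constraints)
    return result.x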
Finally, we obtain the minimum variance portfolio weights from this function:
Figure 9: The Optimal Weights

Back-testing

After we calculated the weights of the eight stocks in the portfolio, we used the following code to run the back-testing on Quantopian. We assigned the weights and set a function to rebalance the portfolio one hour after the market opens. Finally, we called another function to record and plot the portfolio's profit at the end of each day.
Figure 10: Back-testing Code
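Since the back-testing code is shown as a figure, the following sketch illustrates the general shape of such a Quantopian algorithm; the weights below are equal-weight placeholders, not the optimized weights from Figure 9:

def initialize(context):
    tickers = ['TWTR', 'AAPL', 'FB', 'GOOG', 'MSFT', 'BA', 'WMT', 'GM']
    context.assets = [symbol(t) for t in tickers]
    context.weights = [1.0 / len(tickers)] * len(tickers)   # placeholder weights
    # Rebalance one hour after the market opens each day
    schedule_function(rebalance, date_rules.every_day(),
                      time_rules.market_open(hours=1))
    # Record the portfolio value at the end of each day
    schedule_function(record_vars, date_rules.every_day(),
                      time_rules.market_close())

def rebalance(context, data):
    for asset, w in zip(context.assets, context.weights):
        if data.can_trade(asset):
            order_target_percent(asset, w)

def record_vars(context, data):
    record(portfolio_value=context.portfolio.portfolio_value)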
And the resulting graph we got from the back-testing is shown in the below:
Figure 11: Back-testing Result
From the graph we can tell that our portfolio performed well during the selected period: a total return of 21.8% with a volatility of 0.23, which means the portfolio is fairly stable, and a Sharpe ratio of 1.82. The Sharpe ratio is a risk-adjusted measure of return often used to evaluate portfolio performance: the excess return over the risk-free rate is divided by the standard deviation of the risky asset. The idea is to see how much additional return you receive for the additional volatility of holding the risky asset instead of a risk-free asset, and higher is better. In addition, the total returns curve stays above the benchmark returns curve throughout.
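In symbols, with R_p the portfolio return, R_f the risk-free rate, and sigma_p the standard deviation of the portfolio's excess return:
$$\text{Sharpe ratio} = \frac{E[R_p - R_f]}{\sigma_p}$$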

Develop Monte Carlo Simulation


Monte Carlo simulation was only briefly introduced in this class, but it is widely used in finance, so we wanted to try it and see what we could get. From the previous part, establishing the minimum-variance portfolio, we noticed that WMT received an especially large weight compared with the other stocks. That is because the main idea of the minimum variance portfolio is to reduce the variance of the portfolio, and it may assign large weights to certain stocks for this purpose. Since the portfolio is mostly built from WMT, we decided to run the Monte Carlo simulation for this stock.
The Python code we used for Monte Carlo Simulation is attached below:
Figure 12: Monte Carlo Simulation Python Code
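Because the simulation code is shown as a figure, here is a minimal sketch of a random-walk simulation of this kind; the normal-return assumption and the function name are ours for illustration, not necessarily what the original code does:

import numpy as np

def simulate_random_walk(hist_prices, n_steps):
    """Simulate one random-walk price path calibrated to historical daily returns."""
    hist_prices = np.asarray(hist_prices, dtype=float)
    returns = np.diff(hist_prices) / hist_prices[:-1]   # historical daily returns
    mu, sigma = returns.mean(), returns.std()           # fit a normal distribution
    path = [hist_prices[0]]                             # start from the first observed price
    for _ in range(n_steps):
        path.append(path[-1] * (1.0 + np.random.normal(mu, sigma)))
    return np.array(path)

# Example: simulate a path as long as the observed WMT series and plot both for comparison
# sim = simulate_random_walk(wmt_prices, len(wmt_prices) - 1)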
We plotted the real stock prices and the simulated prices together for comparison.
Figure 13: Comparison Between Real Price and Monte Carlo Simulation Price
From the graph, we can see that the trends of the two price series are very similar. Although the Monte Carlo simulation fluctuates less than the real price in this scenario, that does not mean the simulation always produces a relatively stable result. What the graph shows is just one simulation run; as mentioned before, our Monte Carlo simulation is based on a random walk, so its results vary from run to run. Here we present one case for comparison.

Summary

In summary, we used eight stocks to create a minimum variance portfolio and compared its performance with the real stock market by back-testing in Quantopian. In addition, we used the Monte Carlo method to simulate prices for one of the stocks. As expected, the portfolio's performance is much better than the benchmark, and that is why we build portfolios: by constructing a portfolio, we maximize the expected return and minimize the risk according to our risk tolerance.

Acknowledgment

We give our greatest gratitude to Professor Do for providing us with such great lecture materials and genuine help throughout the semester. The skills we learned from this course will be beneficial in our further studies and careers. We also have to thank every member of this fantastic group, even though we only have two people; working with each other was a pleasant adventure, and every single effort matters in teamwork. Finally, we want to send our best wishes to all the classmates in this class. We would not have had such an interesting class without you. Thanks, everyone.