Friday, December 9, 2016

Job Matching by Nearest Neighbors using PySpark

2016 Fall MATH5800-031 Group5 Final Project
by Qintian Sun, Yuyang Dai and Shiqi Wang




Background and Objectives

Suppose we have a long list of jobs, each with a title and a description, and a few resumes. We want an algorithm that suggests matching jobs for each resume. We then want to see whether jobs can be matched using titles alone. This project was done using PySpark.

Methodology

1.     Combine job titles and descriptions into one dataframe. Import the resumes into another dataframe.
2.     Break all the sentences into separate terms, remove stop words, use TF-IDF to characterize the importance of each term, and normalize the TF-IDF data.
3.     For each job, compute the squared distance between its normalized TF-IDF vector and the given resume’s normalized TF-IDF vector. Output the jobs with the shortest distances.
4.     Repeat the steps above using only job titles.
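
The quantities in steps 2 and 3 can be written out as follows. This is a sketch under two assumptions that the post does not state: that IDF uses the smoothed form documented for Spark MLlib, and that normalization uses the L2 norm (the default of Spark's Normalizer). Here m is the number of documents, df(t) is the number of documents containing term t, x_d is the TF-IDF vector of document d, and j and r denote a job and a resume.

```latex
\begin{aligned}
\mathrm{idf}(t) &= \log\frac{m + 1}{\mathrm{df}(t) + 1}, \\
\mathrm{tfidf}(t, d) &= \mathrm{tf}(t, d)\,\mathrm{idf}(t), \\
\hat{x}_d &= \frac{x_d}{\lVert x_d \rVert_2}, \\
\lVert \hat{x}_j - \hat{x}_r \rVert_2^2 &= 2 - 2\,\hat{x}_j \cdot \hat{x}_r .
\end{aligned}
```

Under these assumptions, and for nonzero vectors, ranking jobs by smallest squared distance is the same as ranking them by largest cosine similarity to the resume.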

Realization and Results

1.     Process Job Data
Job descriptions may contain commas inside quoted fields, so we define a splitting pattern that ignores quoted commas and use it to import the job data from the CSV file.
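
The exact pattern used in the project is not preserved here. A minimal plain-Python sketch of one common approach is a regular expression that splits only on commas followed by an even number of double quotes. The names `CSV_SPLIT` and `split_csv_line` and the sample row are hypothetical.

```python
import re

# Split on a comma only if an even number of double quotes follows it,
# which means the comma is not inside a quoted field.
CSV_SPLIT = re.compile(r',(?=(?:[^"]*"[^"]*")*[^"]*$)')


def split_csv_line(line):
    """Return the fields of one CSV line, keeping quoted commas intact."""
    return [field.strip().strip('"') for field in CSV_SPLIT.split(line)]


# Hypothetical row: job ID, title, description
row = '17,Data Analyst,"Clean, model, and report on sales data"'
fields = split_csv_line(row)
print(fields)
```

The same function could be applied to every line of the file, for example inside a `map` over an RDD of text lines.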







Filter out the header, extract and concatenate the job title and description, and convert the result to a dataframe.
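
As a rough illustration of this step in plain Python (not the PySpark code from the project), the header row is dropped and each remaining row is reduced to a job ID plus one combined text field. The three-column layout and the sample rows are assumptions, since the real column layout of the job file is not shown.

```python
# Hypothetical parsed rows: (jobID, title, description).
rows = [
    ("jobID", "title", "description"),  # header row
    ("1", "Data Analyst", "Build reports in SQL"),
    ("2", "Math Tutor", "Teach calculus to undergraduates"),
]

header = rows[0]

# Drop the header, then join title and description into one text field.
jobs = [
    (job_id, title + " " + desc)
    for (job_id, title, desc) in rows
    if (job_id, title, desc) != header
]
print(jobs)
```

Keeping only `title` in the second tuple position would give the titles-only variant.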












Use Tokenizer to break the text into terms, written to the “terms” column.
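
Spark's Tokenizer lowercases the input and splits it on whitespace. A plain-Python approximation of that behavior, with a hypothetical function name and sample text, looks like this:

```python
def tokenize(text):
    """Lowercase the text and split it on runs of whitespace."""
    return text.lower().split()


terms = tokenize("Senior Data Analyst  Build reports in SQL")
print(terms)
```

Punctuation stays attached to the neighboring word under this scheme, so "SQL," and "SQL" would count as different terms.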










Use StopWordsRemover to remove stop words from “terms”, written to the “filtered” column.
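
The idea can be sketched in plain Python as a simple filter. The stop-word list below is a tiny illustrative one chosen for this example; Spark's StopWordsRemover ships a much longer default English list.

```python
# Tiny illustrative stop-word list (an assumption, not Spark's default list).
STOP_WORDS = {"a", "an", "and", "in", "of", "the", "to", "with"}


def remove_stop_words(terms, stop_words=STOP_WORDS):
    """Keep only the terms that are not in the stop-word set."""
    return [t for t in terms if t.lower() not in stop_words]


filtered = remove_stop_words(["build", "reports", "in", "sql", "and", "python"])
print(filtered)
```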







Use HashingTF to get the term frequency of each term, written to the “rawFeatures” column.
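
HashingTF maps each term to a bucket index with a hash function and counts how many terms land in each bucket. The sketch below shows the idea in plain Python. It uses CRC32 as a stand-in hash, whereas Spark uses MurmurHash3, so the bucket indices will not match Spark's. The bucket count of 2^18 mirrors the default of `pyspark.ml.feature.HashingTF`.

```python
import zlib
from collections import Counter


def hashing_tf(terms, num_features=1 << 18):
    """Map each term to a bucket index and count occurrences per bucket."""
    counts = Counter(
        zlib.crc32(t.encode("utf-8")) % num_features for t in terms
    )
    return dict(counts)  # sparse vector: {bucket index: term count}


raw = hashing_tf(["data", "analyst", "data", "sql"])
print(raw)
```

Two different terms can collide in the same bucket. A larger `num_features` makes that less likely at the cost of a longer (still sparse) vector.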







Use IDF to get the inverse document frequency, which downweights terms that appear in many documents of the corpus. The TF-IDF value, the product of tf and idf, measures term importance and is written to the “features” column.
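
A small worked example may help here. The sketch assumes the smoothed formula documented for Spark MLlib, idf = log((m + 1) / (df + 1)), where m is the number of documents and df is the number of documents containing the term. The function names and the toy corpus are hypothetical.

```python
import math
from collections import Counter


def fit_idf(docs):
    """docs: list of {term index: count}. Returns {term index: idf weight}."""
    m = len(docs)
    # Document frequency: in how many documents each term index appears.
    df = Counter(idx for doc in docs for idx in doc)
    return {idx: math.log((m + 1) / (n + 1)) for idx, n in df.items()}


def tf_idf(doc, idf):
    """Scale each term count by the idf weight of its term index."""
    return {idx: tf * idf[idx] for idx, tf in doc.items()}


# Toy corpus: term 0 appears in every document, terms 1 and 2 in one each.
docs = [{0: 2, 1: 1}, {0: 1, 2: 3}, {0: 1}]
idf = fit_idf(docs)
features = [tf_idf(d, idf) for d in docs]
print(idf)
print(features)
```

Term 0 appears in all three documents, so its weight is log(4/4) = 0 and it contributes nothing to any vector, while the rarer terms 1 and 2 each get weight log(4/2).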









Because the lengths of job titles and descriptions vary, we normalize each TF-IDF vector so that the vectors can be compared with one another. Use Normalizer to write the result to the “normFeatures” column.
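
The sketch below shows L2 normalization of a sparse vector in plain Python. The choice of the L2 norm is an assumption: it is the default of Spark's Normalizer, but the post does not say which norm was used. The function name is hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a sparse vector {index: value} to unit Euclidean length."""
    norm = math.sqrt(sum(v * v for v in vec.values()))
    if norm == 0.0:
        return dict(vec)  # leave an all-zero vector unchanged
    return {idx: v / norm for idx, v in vec.items()}


unit = l2_normalize({1: 3.0, 4: 4.0})
print(unit)
```

After this step a long description and a short one both have length 1, so the distance between them reflects which terms they emphasize rather than how many words they contain.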





Finally, the dataframe for the job data looks like this:

[Screenshot: job dataframe]

The first row is:

[Screenshot: first row of the job dataframe]



2.     Process Resume Data
Before using the data, we first copy all the text of our three resumes into a single cv.txt file, writing each resume on one line (no line breaks within a resume). The cv.txt file therefore has three lines in total.
Then we import the data into a dataframe with a single column “_1” and three rows, and process it with exactly the same methods we used for the job data.
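
The flattening step was done by hand in the project. As a sketch of how it could be scripted, the snippet below collapses each resume's whitespace, including newlines, into single spaces so that each resume occupies exactly one line. The resume snippets and names are made up for illustration.

```python
def flatten_resume(text):
    """Collapse all whitespace, including newlines, into single spaces."""
    return " ".join(text.split())


# Hypothetical multi-line resume snippets
resumes = [
    "Jane Doe\nM.S. Mathematics\nPython, SQL",
    "John Roe\nFinance intern\n",
]

cv_lines = [flatten_resume(r) for r in resumes]
cv_txt = "\n".join(cv_lines)  # text to write to cv.txt, one resume per line
print(cv_txt)
```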














The resume dataframe looks like this. Each row contains all the information of one resume:

[Screenshot: resume dataframe]









3.     Find Jobs with Shortest Distance of normFeatures

We first extract the “normFeatures” column, which we use to compute distances. Use collect to bring the values out of the dataframe as a local list.




To match jobs for the first resume, we set the vector v1 to the first row of cv_norm.


To match jobs for the second or third resume, we set v1 to the corresponding row of cv_norm instead.







Initialize an empty tuple Dis, then loop through all jobs to collect the squared_distance between each job's normFeatures and those of the given resume.
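
A plain-Python sketch of this loop is shown below, with sparse vectors stored as dicts. The names `cv_norm`, `v1`, and `Dis` follow the text. The helper `squared_distance`, the list `job_norm`, and the toy vectors are assumptions made for this example and are not the project's data.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two sparse vectors (dicts)."""
    return sum((a.get(i, 0.0) - b.get(i, 0.0)) ** 2 for i in set(a) | set(b))


# Hypothetical normalized vectors for two resumes and three jobs
cv_norm = [{0: 0.6, 1: 0.8}, {2: 1.0}]
job_norm = [{0: 1.0}, {0: 0.6, 1: 0.8}, {2: 1.0}]

v1 = cv_norm[0]  # first resume; cv_norm[1] would select the second

Dis = ()
for job_vec in job_norm:
    # Append this job's squared distance to the resume vector.
    Dis += (squared_distance(job_vec, v1),)
print(Dis)
```

The second job is identical to the resume vector and gets distance 0, while the third shares no terms with it and gets distance 2, the largest value two unit vectors with non-negative entries can have.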













Find the jobIDs of the 10 jobs with the shortest distances to the given resume.
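
One way to do this without sorting the whole list is `heapq.nsmallest` from the Python standard library. The function name `top_k_jobs` and the sample IDs and distances are hypothetical.

```python
import heapq


def top_k_jobs(job_ids, distances, k=10):
    """Return the k (jobID, distance) pairs with the smallest distances."""
    return heapq.nsmallest(k, zip(job_ids, distances), key=lambda p: p[1])


job_ids = ["j1", "j2", "j3", "j4"]
Dis = (0.8, 0.0, 2.0, 0.5)

best = top_k_jobs(job_ids, Dis, k=2)
print(best)
```

Swapping in `heapq.nlargest` returns the worst matches instead, which is a handy sanity check on the ranking.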











4.     Use only job titles


We now use only job titles. The only change needed is in the line where we extracted both title and description.









5.     Result and thoughts


1.     The top 10 jobs for the three resumes are:

[Table: top 10 jobs for Qintian]

[Table: top 10 jobs for Yuyang]

[Table: top 10 jobs for Shiqi]










All three of us major in math or financial math and have related experience in technician, IT, finance, and management roles. Some of us also have research experience, so some university jobs appear as well. Judging by the job titles and descriptions, these top picks are good matches for us.


2.     To test the algorithm, we also found the bottom 10 jobs (the worst matches) for the first resume:

[Table: bottom 10 jobs for the first resume]










As the table shows, these jobs, such as babysitting, nursing, and children’s entertainer, are indeed unrelated to our resumes.

3.     Using only job titles, the top 10 jobs for the first resume are:

[Table: top 10 jobs for the first resume, titles only]













Because so few words are involved, the algorithm is now much less effective. The top 10 choices all have something to do with “manager” (management), a word that is also very frequent in the resume. A babysitter job also appears near the top (maybe because of the word “job”). In fact, all ten have the same distance.
Thus, we think matching on job titles alone does not give good enough top results.