by Qintian Sun, Yuyang Dai and Shiqi Wang
Ⅰ. Background and Objectives
Suppose we have a long list of jobs with their titles and descriptions, together with some resumes. We want to derive an algorithm that suggests possible matching jobs for each resume. We then want to see whether jobs can be matched using only the titles. This project was done using PySpark.
Ⅱ. Methodology
1. Combine job titles and descriptions into one dataframe; import the resumes into another dataframe.
2. Break all sentences into separate terms, remove stop words, use TF-IDF to characterize the importance of each term, and normalize the TF-IDF data.
3. For each job, compute the squared distance from its normalized TF-IDF vector to the given resume's normalized TF-IDF vector, and output the first several jobs with the shortest distances.
4. Repeat the above using only job titles.
Ⅲ. Realization and Results
1. Process Job Data
Because the job descriptions may contain commas inside quotes, we define a new pattern and import the job data from the CSV file using this pattern:
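The exact pattern from our notebook is not reproduced here, but the standard trick is a regex that splits on a comma only when it is followed by an even number of quote characters, i.e. when the comma is outside any quoted field. A plain-Python sketch of the idea (in Spark the same split would be applied line by line after sc.textFile):

```python
import re

# A comma is a delimiter only if the rest of the line contains an even
# number of double quotes, i.e. the comma is not inside a quoted field.
COMMA_OUTSIDE_QUOTES = r',(?=(?:[^"]*"[^"]*")*[^"]*$)'

def parse_csv_line(line):
    """Split one CSV line, keeping quoted fields (with commas) intact."""
    return re.split(COMMA_OUTSIDE_QUOTES, line)

# In Spark this would be applied to each line, e.g.:
#   jobs = sc.textFile("jobs.csv").map(parse_csv_line)
print(parse_csv_line('1,"Data Analyst","Clean, analyze, and report data"'))
# → ['1', '"Data Analyst"', '"Clean, analyze, and report data"']
```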

Filter out the header, extract and concatenate the job titles and descriptions, and convert to a dataframe:
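In plain-Python terms the step amounts to the following (the toy rows and column positions here are assumptions; the real CSV has more fields):

```python
# Toy parsed rows: [jobID, title, description].
rows = [
    ["jobID", "title", "description"],            # header row
    ["1", "Data Analyst", "Analyze market data"],
    ["2", "Babysitter", "Watch the children"],
]

header = rows[0]
# Drop the header, then keep (jobID, title + " " + description).
jobs = [(r[0], r[1] + " " + r[2]) for r in rows if r != header]
# In Spark: rdd.filter(lambda r: r != header).map(...).toDF(["jobID", "text"])
print(jobs)
```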
Use Tokenizer to break the text into terms, output to the “terms” column:
Use StopWordsRemover to remove stop words from “terms”, output to the “filtered” column:
Use HashingTF to get the term frequency of each term, output to the “rawFeatures” column:
Use IDF to get the inverse document frequency (down-weighting terms that appear frequently across the corpus), then compute TF-IDF, the product of TF and IDF, to evaluate term importance, output to the “features” column:
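What HashingTF and IDF compute can be illustrated without Spark. HashingTF maps each term to one of numFeatures buckets via a hash function and counts occurrences; Spark's IDF then weights each term as log((N + 1) / (df + 1)), where N is the number of documents and df the number of documents containing the term. A plain-Python sketch (CRC32 stands in for Spark's hash function, and numFeatures is tiny here for readability):

```python
import math
import zlib

NUM_FEATURES = 16  # Spark's HashingTF default is 2**18

def hashing_tf(terms, num_features=NUM_FEATURES):
    """Count term occurrences in hashed buckets (like HashingTF,
    but using CRC32 instead of Spark's hash function)."""
    vec = [0.0] * num_features
    for t in terms:
        vec[zlib.crc32(t.encode()) % num_features] += 1.0
    return vec

def idf(docs_terms):
    """Per-term IDF using Spark's formula: log((N + 1) / (df + 1))."""
    n = len(docs_terms)
    vocab = {t for doc in docs_terms for t in doc}
    df = {t: sum(t in doc for doc in docs_terms) for t in vocab}
    return {t: math.log((n + 1) / (df[t] + 1)) for t in vocab}

docs = [["data", "analyst", "data"], ["manager", "data"]]
weights = idf(docs)
# "data" appears in both documents, so its IDF is log(3/3) = 0,
# i.e. a term present everywhere carries no information.
print(round(weights["data"], 6))  # → 0.0
```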
Because the lengths of job titles and descriptions vary, we have to normalize each TF-IDF vector so they can be compared with each other. Use Normalizer to output to the “normFeatures” column:
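Normalizer with p = 2 scales each vector to unit Euclidean length. A useful consequence, and the reason squared distance is a sensible matching score later, is that for unit vectors ||a − b||² = 2 − 2·cos(a, b), so ranking by squared distance is the same as ranking by cosine similarity. A small sketch:

```python
import math

def l2_normalize(v):
    """Scale v to unit Euclidean length (what Normalizer does with p=2)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else v

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

a = l2_normalize([3.0, 4.0])   # -> [0.6, 0.8]
b = l2_normalize([4.0, 3.0])   # -> [0.8, 0.6]
dot = sum(x * y for x, y in zip(a, b))
# For unit vectors, ||a - b||^2 equals 2 - 2 * dot(a, b):
print(round(squared_distance(a, b), 6), round(2 - 2 * dot, 6))  # → 0.08 0.08
```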
Finally, the dataframe for the job data looks like this:
The first row is:
2. Process Resume Data
Before using the data, we first copy all the text of our three resumes into a single cv.txt file, writing each resume on one line (no newlines within the same resume), so the cv.txt file contains three lines in total.
Then we import the data into one dataframe with a single column “_1” and three rows, and use the exact same methods we used to process the job data:
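The import itself is just reading one resume per line. A plain-Python sketch with placeholder text standing in for cv.txt (in Spark, `sc.textFile("cv.txt").map(lambda x: (x,)).toDF()` yields the one-column “_1” dataframe):

```python
from io import StringIO

# Stand-in for cv.txt: one resume per line, placeholder content.
cv_txt = StringIO(
    "math student with finance internship\n"
    "financial math, IT technician experience\n"
    "research assistant, management experience\n"
)
resumes = [line.strip() for line in cv_txt if line.strip()]
print(len(resumes))  # → 3
```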
The dataframe of resumes looks like this; each row contains all the information of one resume:
3. Find Jobs with Shortest Distance of normFeatures

To match jobs for the first resume, we take the vector v1 to be the first row of cv_norm:
If we want to match jobs for the second or third resume, we use v1 as:
Initialize an empty tuple Dis, then loop through all the jobs to build a tuple of the squared_distance values from each job's normFeatures to the given resume's:
Find the jobIDs of the first 10 jobs with the shortest distances to the given resume:
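The loop and top-10 selection can be sketched in plain Python over toy normalized vectors (in our notebook the vectors come from the “normFeatures” column and the distance is computed with squared_distance):

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

# Toy stand-ins: (jobID, normalized feature vector) and a resume vector v1.
jobs = [
    ("job1", [1.0, 0.0]),
    ("job2", [0.6, 0.8]),
    ("job3", [0.0, 1.0]),
]
v1 = [1.0, 0.0]

# Collect (distance, jobID) pairs, then take the k closest jobs.
dis = [(squared_distance(v, v1), job_id) for job_id, v in jobs]
top = [job_id for _, job_id in sorted(dis)[:2]]
print(top)  # → ['job1', 'job2']
```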
4. Use only job titles
We now use only the job titles; we just need to change the line where we extracted both the title and the description:
5. Results and Thoughts
1. The top 10 jobs for the three resumes are:
Jobs for Qintian:

Jobs for Yuyang:
Jobs for Shiqi:
All three of us are majoring in math or financial math and have related experience in technician, IT, finance, and management roles. Some of us also have research experience, so some university jobs appear as well. As shown in the job titles and descriptions, these top-picked jobs are really good matches for us.
2. To test the algorithm, we also found the bottom 10 (worst-matching) jobs for the first resume:
As shown in the table, these choices, like babysitting, nursing, and children's entertainer, are indeed quite unrelated to our resumes.
3. If we use only job titles, the top 10 jobs for the first resume are:
Because too few words are involved, the algorithm is now not very effective. The top 10 choices all have something to do with “manager” (management), a word that is also very frequent in the resume. There is also a babysitter job near the top (perhaps because of the word “job”). In fact, they all have the same distance.
Thus, we think that matching using only job titles will not produce good enough top results.