Distilling Large Language Models using Skill-Occupation Graph Context for HR-Related Tasks

Pouya Pezeshkpour^1, Hayate Iso^1, Thom Lake^2, Nikita Bhutani^1, Estevam Hruschka^1
^1 Megagon Labs, ^2 Indeed
{pouya,hayate,nikita,estevam}@megagon.ai, [email protected]
Abstract
Numerous HR applications are centered around
resumes and job descriptions. While they can
benefit from advancements in NLP, particularly
large language models, their real-world adop-
tion faces challenges due to the absence of comprehensive benchmarks for various HR tasks and the lack of smaller models with competitive capabilities. In this paper, we aim to bridge
this gap by introducing the Resume-Job De-
scription Benchmark (RJDB). We meticulously
craft this benchmark to cater to a wide array of
HR tasks, including matching and explaining
resumes to job descriptions, extracting skills
and experiences from resumes, and editing re-
sumes. To create this benchmark, we propose
to distill domain-specific knowledge from a
large language model (LLM). We rely on a cu-
rated skill-occupation graph to ensure diversity
and provide context for LLM generation. Our
benchmark includes over 50 thousand triples
of job descriptions, matched resumes and un-
matched resumes. Using RJDB, we train multi-
ple smaller student models. Our experiments re-
veal that the student models achieve near or better performance than the teacher model (GPT-4),
affirming the effectiveness of the benchmark.
Additionally, we explore the utility of RJDB on out-of-distribution data for skill extraction and resume-job description matching, in zero-shot and weak supervision settings. We release our datasets and code (https://github.com/megagonlabs/rjdb) to foster further research and industry applications.
1 Introduction
In organizational recruitment, resumes and job de-
scriptions play a pivotal role, facilitating identifica-
tion of potential candidates and informing hiring
decisions (Zimmermann et al., 2016; Guo et al.,
2021; Ali et al., 2022). Natural Language Process-
ing (NLP) algorithms play a crucial role in enhanc-
ing this process, unraveling valuable information and insights embedded within resumes and job descriptions. This includes information such as job titles, skill sets, work history, and educational background.

Figure 1: Creating a benchmark for HR-related tasks via large language model distillation. We start by meticulously sampling subgraphs from our curated skill-occupation graph. Then, we utilize the skills and occupations contained within these subgraphs as contextual guidance to steer GPT-4 in the generation of resumes and job descriptions encompassing a variety of tasks.

Leveraging such information can facilitate
a broad variety of HR tasks such as aligning candi-
dates with job openings, streamlining the resume
screening process, and aiding in salary negotiation.
Deploying NLP models for such use cases requires
a diverse and representative dataset covering a wide
range of resumes and job descriptions. However,
there is a noticeable absence of publicly available
large-scale benchmarks tailored for this purpose.
Additionally, many companies heavily rely on pro-
prietary, in-house data to train NLP models. How-
ever, such datasets often carry inherent biases and
noise, and only provide the necessary annotations for limited downstream tasks.
In recent years, several attempts have been made
to create/augment datasets by distilling knowledge
from existing large language models (LLMs) (Kim et al., 2022; Gu et al., 2023; Li et al., 2023). The objective is to extract the knowledge stored in a large teacher model such as GPT-4
edge stored in a large teacher model such as GPT-4
(OpenAI, 2023), into a benchmark for a specific
task. Subsequently, this benchmark becomes a valu-
able resource for training smaller student models,
allowing them to emulate the performance of the
teacher model in the targeted task. These student
models present a practical solution for real-world
applications, owing to their reduced size.
Building on the success of prior work in knowl-
edge distillation from LLMs, we propose to dis-
till knowledge from LLMs to generate a multi-
task resume-job description benchmark. As large language models struggle to generate high-quality and accurate output in low-resource domains such as HR (Bang et al., 2023), they need to be provided with the necessary guidance as context (Agrawal et al., 2022; Singhal et al., 2022; Jin et al., 2023). Skills and past experiences (occu-
pations) are recognized as the foundational infor-
mation required to solve various HR-related tasks
(Qin et al., 2020; Fu et al., 2021; Sun et al., 2021).
Consequently, to guide the knowledge distillation effort from LLMs, using a skill-occupation graph—a bipartite graph connecting occupations with their required skills—as the context emerges as a natural candidate. Since, to the best of our knowledge, there is no diverse publicly available skill-occupation graph, we curate our own graph in two steps. First, we initiate the graph by harvesting data pertaining to technology-related occupations and their skills from Dice (https://www.dice.com; we extracted the data from https://www.kaggle.com/datasets/PromptCloudHQ/us-technology-jobs-on-dicecom). Subsequently, we extend this graph by first extracting general occupations from the US Bureau of Labor Statistics (https://www.bls.gov/oes/current/oes_stru.htm), and then leveraging GPT-4 to generate the required skills for these occupations.
Leveraging the structure of the graph, we em-
ploy a sampling process to assemble a diverse set
of interconnected skills and occupations. Then, we
iteratively provide sampled subgraphs as context to
GPT-4 to generate more than 50,000 triples, each
comprising a job description, a matched resume,
and an unmatched resume. We craft the genera-
tion pipeline such that it yields triples that serve
as resources for training and evaluating models
across several pivotal tasks: (1) Job-Candidate
Matching: Assessing the compatibility between
job openings and potential candidates. (2) Counterfactual Explanation for Matching: Exploring
explanations and reasoning behind job-candidate
matches. (3) Information Extraction: Extracting
pertinent details, such as skills and occupations,
from both resumes and job descriptions. And, (4)
Resume Editing: Facilitating the enhancement
and refinement of resumes for candidates. We pro-
vide an overview of our pipeline to generate RJDB
in Figure 1.
After creating the resume-job description bench-
mark (RJDB), we train individual student mod-
els based on Flan-T5 base (Chung et al., 2022)
for each specific task within the dataset. Our stu-
dent models demonstrate comparable or superior performance across different tasks when compared
against the GPT-4 teacher model over held-out data
from the benchmark. To explore the broader utility
of RJDB, we extend our investigation to include
out-of-distribution data, specifically focusing on
two tasks: resume-job matching and skill extrac-
tion. Remarkably, our student models, fine-tuned
on RJDB, excel in these tasks. We also find that models fine-tuned on out-of-distribution data, when further fine-tuned using RJDB, yield enhanced performance and improved generalization.
Contributions The main contributions of this pa-
per are as follows: (1) We create a multi-task resume-job description benchmark (RJDB) to overcome the limitations of existing datasets in the HR domain. (2) We curate a diverse skill-occupation graph and utilize it as the context guiding our benchmark generation process. (3) We introduce a novel distillation pipeline for HR; our approach comprises various innovative components that leverage domain-specific expertise and harness the world knowledge of GPT-4 to effectively address numerous existing challenges in the HR domain. (4) Training multiple student models on the proposed tasks using RJDB, we release strong baselines that achieve performance on par with or surpassing the teacher model (GPT-4). Finally, (5) we explore potential use cases of RJDB on out-of-distribution data. We have made our data, models, and code publicly available, aiming to facilitate and advance progress within the HR domain.
2 Distilling LLMs For HR-Related Tasks
To generate high-quality documents in the HR domain, it is crucial to provide knowledge-intensive context to guide large language models. Furthermore, since our goal is to generate documents satisfying multiple HR-related downstream tasks, this context should contain the necessary annotations for these tasks. As a result, subgraphs from a skill-
occupation graph naturally emerge as a suitable
candidate to serve as context for guiding LLM gen-
eration. Leveraging a skill-occupation graph to dis-
till knowledge from LLMs presents us with three
distinct challenges: (1) The scarcity of publicly
available skill-occupation graphs that cover a wide
range of general occupations. (2) The need to sam-
ple subgraphs that not only provide the necessary
guidance for LLMs but also yield documents that
align with real-world distributions. And, (3) the
requirement to generate documents that are both
diverse and faithful to the provided context while
offering essential annotations for downstream tasks.
In this section, we take on these challenges by first
curating our skill-occupation graph and then in-
troducing our generation pipeline to construct our
multi-task benchmark. Finally, we conduct a comprehensive quality assessment of the generated documents.
2.1 Skill-Occupation Graph
The skill-occupation graph (Dave et al., 2018;
de Groot et al., 2021; Boškoski et al., 2022) is
a powerful tool in the realm of workforce develop-
ment and career matching, defined as a bipartite
graph that links occupations with their required
skills. Our objective is to extract subsets from this
graph to guide the generation of resumes and job
descriptions. To the best of our knowledge, the
only publicly accessible skill-occupation graph is sourced from DICE, which covers only technology-related occupations. As a result, to generate diverse
resumes and job descriptions, we need to construct
our own graph encompassing a broader variety of
occupational categories. Beyond data availability,
the most significant challenge in developing such
a graph is ensuring its representation aligns with
real-world distribution. Thus, in this section, as
a proactive step in overcoming these challenges,
we start constructing our graph with a foundation
in the technology-related graph from DICE. Building
upon this foundation, we extend the graph by ex-
tracting general occupations from the US Bureau of
Labor Statistics. We then generate required skills
for each one of those occupations by prompting
GPT-4.
#Occ    #Skill   #Edges   Avg #Skills per Occ   Avg #Occs per Skill
8275    14807    70661    8.5                   4.8

Table 1: Data statistics of the curated skill-occupation graph.
Generating Skills for General Occupations To
diversify our graph beyond technology-related oc-
cupations, we incorporate 1,112 diverse occupa-
tions sourced from the US Bureau of Labor Statis-
tics. The process of generating the required skills
for these occupations involves several steps. Ini-
tially, we match each occupation with the clos-
est counterpart in our in-house proprietary skill-
occupation graph, leveraging Phrase-BERT (Wang
et al., 2021). Subsequently, we extract the number
of provided skills associated with the matched oc-
cupation from the graph. This process is vital for ensuring that the distribution of skills in the generated graph aligns with the real-world distribution. For each occupation and its respective extracted number of skills, denoted as n, we employ GPT-4 to generate the required skills for the occupation using the prompt:
Prompt 2.1: Prompt for skill generation
Generate {n} number of required skills necessary
for the occupation {OCCUPATION}.
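As a concrete illustration, the following is a minimal sketch of this matching-and-counting step; it assumes Phrase-BERT is loadable through the sentence-transformers library, and the checkpoint name, occupation lists, and skill counts are illustrative placeholders rather than our actual data.

```python
from sentence_transformers import SentenceTransformer, util

# Hypothetical checkpoint name; any phrase-level encoder would do here.
model = SentenceTransformer("whaleloops/phrase-bert")

bls_occupations = ["Registered Nurse", "Electrician"]  # from the BLS list
graph_occupations = ["Nurse Practitioner", "Electrical Technician", "Data Engineer"]
skill_counts = {"Nurse Practitioner": 9, "Electrical Technician": 7, "Data Engineer": 12}

bls_emb = model.encode(bls_occupations, convert_to_tensor=True)
graph_emb = model.encode(graph_occupations, convert_to_tensor=True)

# For each BLS occupation, copy the skill count n of its closest in-graph
# match; n then parameterizes the GPT-4 skill-generation prompt (Prompt 2.1).
for occ, scores in zip(bls_occupations, util.cos_sim(bls_emb, graph_emb)):
    match = graph_occupations[int(scores.argmax())]
    n = skill_counts[match]
    print(f"Generate {n} number of required skills necessary "
          f"for the occupation {occ}.")
```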
Filtering the Curated Graph In order to sample
diverse subgraphs from our curated graph while
effectively filtering out rare skills and occupations,
we employ a clustering strategy. We adopt density-
based clustering and initiate the process by select-
ing a random occupation as a seed, then gradually
expand the cluster to encompass any node within
a two-hop radius. This expansion continues itera-
tively until all occupations are included within a
cluster. Subsequently, we eliminate clusters that
contain fewer than 10 occupational nodes. As a
result of this process, we successfully partition and
filter the graph into 404 distinct clusters, with an
average of approximately 84.2 occupational nodes
within each cluster. A statistical overview of the
resulting graph is presented in Table 1.
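The clustering procedure above admits a compact implementation. The sketch below is one reading of it, assuming the graph is held in networkx and treating the two-hop expansion as a single shortest-path query from each seed.

```python
import random
import networkx as nx

def cluster_occupations(G: nx.Graph, occupations: set, min_size: int = 10):
    # Seed a random uncovered occupation, grow a cluster from every node
    # within a two-hop radius (occupation -> skill -> occupation), and
    # repeat until all occupations are covered.
    unassigned, clusters = set(occupations), []
    while unassigned:
        seed = random.choice(tuple(unassigned))
        reachable = nx.single_source_shortest_path_length(G, seed, cutoff=2)
        cluster = set(reachable) & unassigned
        clusters.append(cluster)
        unassigned -= cluster
    # Filter out rare regions: drop clusters with fewer than 10 occupations.
    return [c for c in clusters if len(c) >= min_size]
```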
2.2 Generation Pipeline
In the process of generating resumes and job de-
scriptions, our approach involves sampling sub-
graphs from the curated skill-occupation graph to
serve as context for LLMs. These subgraphs com-
prise sets of skills and past experiences (occupa-
tions) that are to be integrated into a candidate’s
resume and a job description. To ensure the quality
and authenticity of generated documents, several
necessary requirements must be met. Firstly, the
number of sampled skills and experiences should
align with the real-world distribution of these com-
ponents within a resume or job description. Fur-
thermore, the provided experiences should main-
tain a logical chronological order and incorporate
temporal information. Beyond these prerequisites,
we aim to create unbiased and diverse documents
while simultaneously providing essential annota-
tions for downstream tasks. In this section, we
shape our document generation pipeline to adhere
to these requirements. Given the absence of read-
ily available data to help address these require-
ments, we mostly rely on the world knowledge
embedded in GPT-4 to fulfill them.
Subgraph Sampling To initiate the subgraph
sampling process, we begin by randomly select-
ing a cluster and an occupational node within that
cluster as our starting node. To sample a subgraph,
we implement random walks from the starting node
and configure the random walk parameters to ensure that the expected numbers of sampled skills and experiences (occupations) align with predefined values. To specify the predefined number of
experiences, in each sampling step, we randomly
select a value ranging from 1 to 5. For the prede-
fined number of skills, we turn to GPT-4 and pose
the question:
Prompt 2.2: Prompt for subgraph sampling
On average how many skills does a person with a
job title of ‘{Starting Node}’ may have listed
in his or her resume?
This approach enables us to align the number of
skills in generated documents with the real-world
distribution, accomplished by using GPT-4’s exten-
sive world knowledge, tailored to a specific occu-
pation.
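A minimal sketch of the walk itself is given below, under the assumption that each node carries a "type" attribute marking it as a skill or an occupation; the step budget and attribute name are illustrative.

```python
import random
import networkx as nx

def sample_subgraph(G: nx.Graph, start: str, n_skills: int, n_exp: int,
                    max_steps: int = 1000):
    # Random walk over the bipartite graph, collecting skill and occupation
    # nodes until the targets are met: n_exp is drawn uniformly from 1 to 5,
    # and n_skills is the count suggested by GPT-4 via Prompt 2.2.
    skills, experiences, node = set(), {start}, start
    for _ in range(max_steps):
        if len(skills) >= n_skills and len(experiences) >= n_exp:
            break
        node = random.choice(list(G.neighbors(node)))
        if G.nodes[node]["type"] == "skill" and len(skills) < n_skills:
            skills.add(node)
        elif G.nodes[node]["type"] == "occupation" and len(experiences) < n_exp:
            experiences.add(node)
    return skills, experiences
```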
Chronological Order of Experiences in a Sampled Subgraph After sampling a subgraph, the
past experiences are presented in a random se-
quence. However, in actual real-world resumes,
past experiences typically follow a logical chrono-
logical order, reflecting an applicant’s career pro-
gression over time. For instance, an applicant com-
monly starts as a Software Engineer before advanc-
ing to the role of a Senior Software Engineer. To
rectify this, we turn to GPT-4's world knowledge by prompting it with the following:
Prompt 2.3: Prompt for ordering experiences
Given the previous experiences of individuals
with {list of Past Experiences}, please arrange
them in a chronological order based on the
likelihood of encountering these experiences
from earlier to later over time.
This aids us in establishing the correct chrono-
logical order for the sampled past experiences.
Incorporating Temporal Information into Ex-
periences We exclusively attribute temporal data
to the past experiences within resumes and job de-
scriptions. To accomplish this, we follow a process
in which we randomly select a value between 1 and
5, signifying the number of years a particular expe-
rience spans. However, when generating resumes,
rather than specifying years of involvement, we
introduce a different approach. By selecting a ran-
dom year between 2015 and 2023 (representing the
last year of being active), we gradually reduce the
number of years attributed to each experience. This
method allows us to establish a specific time frame
for each past experience, for instance, spanning
from 2017 to 2021.
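This backward assignment of year spans can be written directly; the sketch below is a minimal illustration of the scheme just described, with the span and year ranges taken from the text.

```python
import random

def assign_years(ordered_experiences):
    # Walk backwards from a randomly chosen last-active year (2015-2023),
    # giving each experience a random 1-5 year span, e.g. 2017 to 2021.
    end = random.randint(2015, 2023)
    spans = []
    for exp in reversed(ordered_experiences):
        start = end - random.randint(1, 5)
        spans.append((exp, start, end))
        end = start
    return spans[::-1]

# e.g. [('Software Engineer', 2014, 2017), ('Senior Software Engineer', 2017, 2021)]
print(assign_years(["Software Engineer", "Senior Software Engineer"]))
```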
Diversifying Resumes based on Gender and
Race In our pursuit of generating diverse re-
sumes with respect to race and gender, we concen-
trate on the applicant’s first name. Our approach
involves extracting a list of the most popular given
names that are uniformly distributed across various
racial categories, sourced from Wikipedia (https://en.wikipedia.org/wiki/List_of_most_popular_given_names). This
results in a collection of approximately 700 first
names, separately for both males and females. In
each generation step, we prioritize maintaining a
balanced distribution of male and female names
across various occupation types, ensuring that ev-
ery cluster comprises an equal number of generated
resumes for both genders. Additionally, as these
first names are uniformly distributed across differ-
ent racial groups, uniformly sampling them further
ensures the race diversity of generated resumes.
Task Specific Annotation As our objective is to
construct a benchmark capable of encompassing
multiple tasks, we must ensure that our genera-
tion process aligns with the distinct requirements
of each task. Given that our approach generates
both resumes and job descriptions from the same
initial set of skills and experiences, the prerequi-
sites for both the matching and extracting tasks
are inherently fulfilled. To facilitate the necessary
annotations for resume editing and matching ex-
planation, alongside generating a job description
and a matched resume, we introduce an unmatched
resume by deliberately altering certain elements
within the matched resume. Specifically, we randomly remove between 1 and 5 skills and decrease
the duration of engagement in the last experience
by randomly selecting a number equal to or less
than the original years of involvement.
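The perturbation that derives the unmatched resume's annotations from the matched one is simple enough to sketch; the function and variable names here are illustrative.

```python
import random

def perturb_matched_resume(skills, last_exp_years):
    # Remove between 1 and 5 skills, and shorten the last experience to a
    # duration no greater than the original number of years.
    n_removed = random.randint(1, min(5, max(1, len(skills) - 1)))
    removed = set(random.sample(list(skills), n_removed))
    kept = [s for s in skills if s not in removed]
    new_years = random.randint(1, last_exp_years)  # <= original duration
    return kept, removed, new_years
```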
2.3 Quality Assessment
With our generation pipeline outlined, our next step
involves determining whether we can leverage In-
structGPT (text-davinci-003) (Ouyang et al., 2022)
or if a more advanced model like GPT-4 (OpenAI,
2023) is required for resume-job description distil-
lation. Additionally, we aim to ascertain whether
it’s more advantageous to generate documents sec-
tion by section or produce the entire document in
one go. To address these inquiries, we delve into
the evaluation of consistency and factuality in the
generated documents, employing G-eval (Liu et al.,
2023), an evaluation method based on large lan-
guage models. Also, we conduct a human study
to gain a deeper understanding of the quality of
generated documents.
LLM-based Evaluation We employ G-eval to
conduct automated assessments of the consistency
and factuality of the generated documents, to ex-
plore the impact of different LLMs and generation
methods (section by section or in its entirety). G-
eval evaluation only requires the input document,
along with criteria and chain-of-thought style (Wei
et al., 2022) instructions for scoring the document
on a scale of 1 to 5 (detailed prompts are avail-
able in the Appendix). The outcomes of our evaluation
are presented in Figure 2. The results indicate that, in general, GPT-4's approach of generating the entire document at once demonstrates superior performance across both criteria. As a result, we consider this approach as the primary component of our generation pipeline in the remainder of this paper.

Figure 2: Evaluating the quality of generated documents using various approaches with G-eval.
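As an illustration, a G-eval style consistency check reduces to a single scoring call; the sketch below assumes the OpenAI chat completions API, and the abbreviated prompt template merely stands in for the full instruction shown in Figure 5.

```python
from openai import OpenAI

# Abbreviated stand-in for the full consistency instruction (Figure 5).
GEVAL_CONSISTENCY_PROMPT = (
    "As a hiring manager, evaluate the job description on a scale of 1-5 "
    "for consistency.\n# Required Skills:\n{skills}\n"
    "# Required Experience:\n{experience}\n"
    "# Job Description:\n{job_description}\n"
    "# Evaluation Form (scores ONLY):"
)

client = OpenAI()

def geval_consistency(skills: str, experience: str, job_description: str) -> float:
    prompt = GEVAL_CONSISTENCY_PROMPT.format(
        skills=skills, experience=experience, job_description=job_description)
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return float(response.choices[0].message.content.strip())  # score in 1-5
```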
Human-based Evaluation We further assess the
quality of generated documents through a user
study. Since resumes can contain sensitive data,
we only focus on the quality of generated job de-
scriptions and compare it to real job descriptions
gathered from Dice
5
. We evaluate the quality of
generated versus real world job description through
two criteria: (1) Realisticness, measuring how
probable is the provided job description to be a
human written job description for an actual job. (2)
Consistency, measuring how well the various el-
ements within the job description, including job
title, skills, and requirements, align with one an-
other. We consider 100 generated job descriptions and extract the 100 most similar job descriptions from Dice using Phrase-BERT (Wang et al., 2021) on job titles. We provide each sample to 3 annotators, asking them to score the document on both criteria on a scale from 1 to 100. Then, we calculate the average score for each sample.
The outcomes of the user study are presented in
Table 2. The results indicate that the generated job
descriptions exhibit a higher degree of realisticness
and consistency compared to real job descriptions.
One possible explanation for the higher scores of the generated job descriptions is that we consistently generate fixed sections for all job descriptions, whereas many of the Dice job descriptions lack well-defined sections.

             Factuality   Consistency
Dice         63.4         62.7
Generated    67.9         65.4

Table 2: User study on the quality of generated job descriptions.
       #Doc   Avg #W   Min #W   Max #W
JD     52K    181.1    70       427
R-M    52K    101.9    30       604
R-U    52K    87.1     23       395

Table 3: Data statistics of RJDB. We report the number of generated documents (#Doc), and the average, minimum, and maximum number of words (#W) for job descriptions (JD), matched resumes (R-M), and unmatched resumes (R-U).

          Avg #Skills   Avg #Exp
Sampled   6.43          2.44
Removed   1.9           -

Table 4: The average number of skills and past experiences in each triple, as well as the average number of skills removed to generate the unmatched resume from the matched one.
3 Resume-Job Description Benchmark
(RJDB)
Incorporating our generation pipeline, we create the Resume-Job Description Benchmark
(RJDB), which comprises 52,000 triples of job de-
scriptions, matched resumes, and unmatched re-
sumes. Details regarding the prompt used for gen-
erating these triples can be found in the Appendix. For insights into the data statistics, including the number of words in the generated documents, please refer to Table 3. Additionally, Table 4 presents the average number of sampled skills and experiences, as well as the average number of skills removed to create the unmatched resume from the matched one.
To delve deeper into the distribution of the gen-
erated documents, we categorized them into five
distinct groups: (1) tech, (2) social, product, and
finance, (3) manual labor, (4) healthcare, and (5)
administrative. The distribution of triples across
these categories is depicted in Figure 3. Further-
more, a more detailed breakdown of the average
number of skills and words in each document can
be found in Figure 4. Remarkably, documents re-
lated to the tech industry exhibit a lower average
number of skills and words. This phenomenon may
be attributed to the distinctive graph structure sur-
rounding tech-related occupations, as well as the
diversity of these occupations in the graph, covering different levels of expertise.
4 Student Models
Now that we have successfully generated RJDB, the next crucial step involves training FLAN-T5 base student models (which we refer to as T5 in the remainder of the paper). This includes training T5 on the matching, explaining, extracting, and editing tasks. Upon fine-tuning T5 on these tasks, we compare it against our teacher model, GPT-4. This comparative analysis is essential to measure the capabilities and potential of student models in capturing the knowledge of the teacher model. To achieve
this, we divide RJDB into train, test, and dev sets of 50,000, 1,000, and 1,000 samples, respectively.

Figure 3: Distribution of different job categories over generated documents.
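For reference, fine-tuning one student reduces to standard sequence-to-sequence training; the sketch below uses Hugging Face transformers, and the data file name, field names, and hyperparameters are illustrative assumptions rather than our exact configuration.

```python
from datasets import load_dataset
from transformers import (AutoModelForSeq2SeqLM, AutoTokenizer,
                          DataCollatorForSeq2Seq, Seq2SeqTrainer,
                          Seq2SeqTrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-base")
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-base")
# Hypothetical dump of (input, target) pairs for one RJDB task.
data = load_dataset("json", data_files="rjdb_matching.json")["train"]

def tokenize(batch):
    enc = tokenizer(batch["input"], truncation=True, max_length=1024)
    enc["labels"] = tokenizer(batch["target"], truncation=True).input_ids
    return enc

trainer = Seq2SeqTrainer(
    model=model,
    args=Seq2SeqTrainingArguments(output_dir="t5-student",
                                  num_train_epochs=3,
                                  per_device_train_batch_size=8),
    train_dataset=data.map(tokenize, batched=True,
                           remove_columns=data.column_names),
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```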
Matcher and Explainer We fine-tune the T5
student model for both matching and explaining
tasks simultaneously. When presented with pairs
of resumes and job descriptions, our goal here is
not only to predict their compatibility but also to
explain the reasons behind this determination by
providing a set of matching or mismatching skills
and experiences. For negative pairs, we consider the modifications applied to the matched resume to create the unmatched version as the explanation, while for positive examples we consider the set of seed skills and experiences as the explanation. Considering the complicated nature of evaluating explanations, we only report the percentage of modifications that appear in the explanations for negative pairs to evaluate the explaining capability of the models. The results for the matching and explaining capabilities of the models are presented in Table 5. Remarkably, the student model exhibits an outstanding level of performance, outperforming GPT-4 (the teacher) in both tasks. Based on T5's performance, it seems that, given a large enough benchmark, a smaller model can achieve comparable performance to a much larger teacher model on the matching and explaining tasks when tested on data from the same distribution as the training data.

         Matching        Explaining
Models   Acc     F1      Acc
GPT-4    83.2    82.8    37.8
T5       93.1    93.1    53.7

Table 5: Performance of the teacher and student models on matching and explaining tasks.

Figure 4: Distribution of the number of skills and words in generated documents: (a) average number of skills, (b) average number of words per resume, and (c) average number of words per job description.
Extractor We also fine-tune a student model
to extract skills and past experiences from re-
sumes and job descriptions. The results of our student models' performance are presented in Table
6. Both models generally exhibit superior perfor-
mance when extracting information from resumes
compared to job descriptions. This observation
may be attributed to the typically shorter length of
resumes, and possibly more explicit appearance of
information in resumes compared to job descrip-
tions. Furthermore, in the extraction of past expe-
riences, student models consistently either outper-
form or perform at a comparable level to the teacher
model in both resume and job descriptions. Con-
versely, in the task of skill extraction, while student
models exhibit similar average F1 performance, a
noticeable disparity in accuracy becomes apparent
between the teacher and the student model, suggesting that this task may demand a higher level of reasoning and emphasizing the potential need for a larger student model or a more complex training procedure.

             Skill           Experience
Models       Acc     F1      Acc     F1
Res  GPT-4   79.2    81.9    56.6    57.7
     T5      60.8    73.9    55.5    64.5
JD   GPT-4   83.9    77.1    41.5    40.1
     T5      60.5    73.6    53.3    63.1

Table 6: Performance of the teacher and student models on the information extraction task.
Editor Our final student model is tailored for the
resume editing task, focusing on the reconstruc-
tion of the skills and experiences section within the
unmatched resume to create the matched version.
We use the entirety of the unmatched resume as input and incorporate, as an extra signal, the modifications made to the matched resume to create the unmatched counterpart, which include the skills that were removed and the changes to the last experience. In evaluating the model's performance, in addition to the ROUGE-2 score (Lin, 2004), we employ F_add (Xu et al., 2016), which measures an F1 score by rewarding the n-grams (bi-grams in our setting) that appear in the model output and the reference document (the extra signals) but did not appear in the input document (the unmatched resume). The outcomes of the models' performance are provided in Table 7. Both models show a higher level of performance when editing the experience section. Additionally, T5 excels on the F_add metric while underperforming on the ROUGE metric compared to GPT-4, highlighting the observation that student models tend to produce more concise and precise content.
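For concreteness, the following is a minimal sketch of one common reading of F_add, adapted to this setting; the whitespace tokenization is a simplifying assumption.

```python
def bigrams(text: str) -> set:
    toks = text.lower().split()
    return set(zip(toks, toks[1:]))

def f_add(output: str, reference: str, source: str) -> float:
    # Reward bi-grams the model *added*: present in the output and in the
    # reference (the extra signals) but absent from the input (the
    # unmatched resume).
    out_added = bigrams(output) - bigrams(source)
    ref_added = bigrams(reference) - bigrams(source)
    overlap = len(out_added & ref_added)
    if not overlap:
        return 0.0
    precision = overlap / len(out_added)
    recall = overlap / len(ref_added)
    return 2 * precision * recall / (precision + recall)
```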
5 Out-of-Distribution Use Cases
To explore the broader applications of RJDB, we
delve into scenarios focusing on out-of-distribution
cases. This encompasses two key aspects: zero-shot adaptation, where we apply our models fine-tuned on RJDB to outside data, and weak supervi-
sion, which involves using our dataset to further
fine-tune models trained on out-of-distribution data.
Since the resume editing and explanation tasks remain relatively under-explored and finding available resources for them proved to be a challenge, here we focus only on the matching and extraction tasks. For matching, we utilize the Machop dataset created by Wang et al. (2022) from Indeed data, containing 2,000 resume and job description pairs divided into 1,200, 400, and 400 samples as the train, test, and dev sets, respectively. Additionally, we use 688 resumes and their annotated skills provided in Machop for the skill extraction task.

              ROUGE    F_add
Skill  GPT-4  0.347    0.262
       T5     0.278    0.271
Exp    GPT-4  0.391    0.374
       T5     0.328    0.513

Table 7: Performance of the teacher and student models on the resume editing task.
Zero-Shot Adaptation in Matching To adapt our student model for zero-shot prediction on out-of-distribution data in the matching task, we need to incorporate noise into the data it is trained on. To do so, we consider unmatched generated resumes with maximum changes as positive samples. We also sample random resume-job description pairs as negative samples, creating a training set of around 12,000 samples. As matching baselines, we consider Machop-
Sequence (Wang et al., 2022) and Ditto (Li et al.,
2020), which are both language-model-based solu-
tions. The zero-shot performance of our student
model is provided in Table 8 (T5-R (Zero-shot)).
Despite the fact that our T5-R (Zero-shot) was not trained on Machop data, surprisingly, it not only performs similarly to the previously reported state-of-the-art model (Machop-S) but also achieves higher precision.
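A minimal sketch of this noisy training-set construction, assuming each RJDB triple is a dict with illustrative field names:

```python
import random

def build_noisy_matching_set(triples, n_negatives):
    # Positives: a job description paired with its maximally perturbed
    # unmatched resume; negatives: random cross-triple JD-resume pairings.
    examples = [(t["jd"], t["unmatched_resume"], 1) for t in triples]
    for _ in range(n_negatives):
        a, b = random.sample(triples, 2)
        examples.append((a["jd"], b["matched_resume"], 0))
    return examples
```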
Weak Supervision in Matching HR-related
tasks are highly dependent on social and economic
conditions. For example, in an economic scenario
where job applicants outnumber available positions,
it may be necessary to implement a matching model
that, while potentially sacrificing some true posi-
tive applicants for a job description, significantly
reduces false positive cases. As a result, a contin-
uous model adaptation becomes essential in order
to address these ever-changing dynamics. RJDB
can provide us with a controllable resource for tai-
loring existing models in a weak supervise manner
to meet the requirements dictated by social and
economic conditions. Fine-tuning a base Flan-T5
model on Machop dataset (Wang et al., 2022) -
F1 Rec Prec
In
Ditto 66.4 81.7 56.0
Machop-S 83.5 90.8 77.3
T5-M 46.7 65.8 36.2
Out
T5-R (Zero-shot) 80.5 77.5 83.8
T5-M+R (Weak-Rec) 45.4 83.3 31.2
T5-M+R (Weak-Pre) 50.0 56.7 44.4
Table 8: Matching task performances on out-of-
distribution data.
Models
Test-Machop Human
Acc F1 Rec Prec Acc Avg #
T5-M 84.5 90.0 89.6 90.9 43.3 0.18
T5-R 65.5 52.0 48.1 61.9 78.6 1.95
T5-M+R 83.1 88.1 87.7 89.3 55.0 0.35
Table 9: Skill extraction task on out-of-distribution.
dataset (T5-M), the goal here is to explore the po-
tential of utilizing RJDB data to further enhance
this model. Considering the small training size
of Machop dataset (1200 samples), fine-tuned T5
performs very poorly (Table 8). We explore two
scenarios: (1) one aimed at enhancing recall and
(2) the other focused on improving precision. In the
first scenario, we perform minimal additional fine-tuning on T5-M using 1,000 random samples from the training set created in the previous part (which consists of 12,000 samples), referred to as T5-M+R (Weak-Rec).
In the second scenario, we randomly select 500
pairs of positive and negative resume-job descrip-
tion combinations from RJDB to further fine-tune
T5-M (referred to as T5-M+R (Weak-Pre)). The
results are presented in Table 8. As the table illustrates, by leveraging RJDB, we can enhance the recall and precision of T5-M by approximately 17.8% and 8.2%, respectively, with only marginal positive/negative effects on the other metrics.

                           F1      Rec     Prec
In   Ditto                 66.4    81.7    56.0
     Machop-S              83.5    90.8    77.3
     T5-M                  46.7    65.8    36.2
Out  T5-R (Zero-shot)      80.5    77.5    83.8
     T5-M+R (Weak-Rec)     45.4    83.3    31.2
     T5-M+R (Weak-Pre)     50.0    56.7    44.4

Table 8: Matching task performance on out-of-distribution data.
Skill Extraction in Out-of-Distribution Data
For skill extraction, we explore the effects of fine-
tuning student models using RJDB on enhancing
their generalization capabilities. To achieve this, we adopt the resumes provided in Machop, which are accompanied by their annotated skills from Indeed.
We randomly select 100 resumes as the test set, 88
samples as the dev set, and designate the remaining
500 resumes as the training set. We fine-tune a T5
model using three different datasets: (1) solely the
Machop training set (T5-M), (2) solely a random sample of 1,000 resumes from RJDB (T5-R; we choose a small set to be comparable in size to the Machop training set), and (3) a combination of the Machop data and the sampled resumes from RJDB (T5-M+R).
The results of these models’ performance on the
Machop test set are presented in Table 9. Further-
more, to assess the generalization capabilities of
these models, we manually verify the accuracy of
additional skills extracted, i.e., predicted skills be-
yond the labeled ones, by each model from the test
set resumes. The accuracy and average number of
extra skills for each model are presented in Table 9
(Human assessment). As demonstrated, T5-M+R
can extract a significantly larger number of new
skills with only a minor sacrifice in performance
compared to T5-M. Similarly, while T5-R may un-
derperform when compared to other models on the
Machop test set, it successfully extracts a larger
number of new skills with higher accuracy.

          Test-Machop                  Human
Models    Acc    F1     Rec    Prec    Acc    Avg #
T5-M      84.5   90.0   89.6   90.9    43.3   0.18
T5-R      65.5   52.0   48.1   61.9    78.6   1.95
T5-M+R    83.1   88.1   87.7   89.3    55.0   0.35

Table 9: Skill extraction task on out-of-distribution data.
6 Related Work
Language models can be an invaluable asset for
HR-related tasks, but their widespread adoption in
real-world applications is hindered by the absence
of publicly available multi-task datasets.
NLP for HR In recent years, the integration
of language models into human resources-related
tasks has witnessed a notable surge in interest
and innovation. These advanced natural language
processing models have shown great promise in
addressing the intricacies of HR tasks, ranging
from resume parsing and job description match-
ing to skill extraction and beyond. Bian et al.
(2020) proposed a multi-view co-teaching network
for job-resume matching, designed to effectively
leverage sparse and noisy interactions available be-
tween candidates and jobs. Mahdi et al. (2021) and Li et al. (2021) introduced BERT-based (Devlin et al., 2019) information extraction for job descriptions and resumes, respectively. Fang et al. (2023) proposed a skill-aware prompt
learning module to improve the pre-trained model’s
adaptability to downstream HR-related tasks.
Knowledge Distillation from LLMs Knowl-
edge Distillation is a technique used for reduc-
ing the high computational demand of LLMs by
transferring knowledge from a large teacher model
to a smaller student one (Gu et al., 2023). Kim
et al. (2022) curated a million-scale social dia-
logue dataset by distilling knowledge from Instruct-
GPT. West et al. (2021) extracted commonsense
symbolic knowledge from large language models,
while Li et al. (2023) introduced the distillation of
chain-of-thought-style reasoning from LLMs to en-
hance the reasoning capabilities of smaller models.
7 Conclusion
We introduce the Resume-Job Description Bench-
mark (RJDB) by distilling knowledge from GPT-
4. Starting from subgraphs sampled from our cu-
rated skill-occupation graph, we generate resumes and job descriptions catering to multiple HR-related
tasks, from matching and explanation to skill and
experience extraction. We generate 52,000 triples
of job descriptions, matched and unmatched resumes, and successfully train student models that
rival or surpass the teacher model (GPT-4) on in-
distribution data. Moreover, we extend our inves-
tigation to demonstrate the adaptability of RJDB
in handling out-of-distribution data for skill extrac-
tion and resume-job description matching, using
zero-shot and weak supervision techniques. We
believe RJDB lays a strong foundation for HR-
related tasks, fostering the development of models
and techniques that can bridge the gap between
research and real-world applications in the field of
HR.
References
Monica Agrawal, Stefan Hegselmann, Hunter Lang,
Yoon Kim, and David Sontag. 2022. Large language
models are few-shot clinical information extractors.
In Proceedings of the 2022 Conference on Empiri-
cal Methods in Natural Language Processing, pages
1998–2022.
Irfan Ali, Nimra Mughal, Zahid Hussain Khand, Javed
Ahmed, and Ghulam Mujtaba. 2022. Resume clas-
sification system using natural language processing
and machine learning techniques. Mehran Univer-
sity Research Journal Of Engineering & Technology,
41(1):65–79.
Yejin Bang, Samuel Cahyawijaya, Nayeon Lee, Wen-
liang Dai, Dan Su, Bryan Wilie, Holy Lovenia, Ziwei
Ji, Tiezheng Yu, Willy Chung, et al. 2023. A multi-
task, multilingual, multimodal evaluation of chatgpt
on reasoning, hallucination, and interactivity. arXiv
preprint arXiv:2302.04023.
Shuqing Bian, Xu Chen, Wayne Xin Zhao, Kun Zhou,
Yupeng Hou, Yang Song, Tao Zhang, and Ji-Rong
Wen. 2020. Learning to match jobs with resumes
from sparse interaction data using multi-view co-
teaching network. In Proceedings of the 29th ACM
International Conference on Information & Knowl-
edge Management, pages 65–74.
Pavle Boškoski, Matija Perne, Tjaša Redek, and Bil-
jana Mileva Boshkoska. 2022. Occupation sim-
ilarity through bipartite graphs. arXiv preprint
arXiv:2202.11064.
Hyung Won Chung, Le Hou, Shayne Longpre, Bar-
ret Zoph, Yi Tay, William Fedus, Eric Li, Xuezhi
Wang, Mostafa Dehghani, Siddhartha Brahma, et al.
2022. Scaling instruction-finetuned language models.
arXiv preprint arXiv:2210.11416.
Vachik S Dave, Baichuan Zhang, Mohammad Al Hasan,
Khalifeh AlJadda, and Mohammed Korayem. 2018.
A combined representation learning approach for bet-
ter job and skill recommendation. In Proceedings of
the 27th ACM International Conference on Informa-
tion and Knowledge Management, pages 1997–2005.
Maurits de Groot, Jelle Schutte, and David Graus. 2021.
Job posting-enriched knowledge graph for skills-
based matching. arXiv preprint arXiv:2109.02554.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and
Kristina Toutanova. 2019. Bert: Pre-training of deep
bidirectional transformers for language understand-
ing. In Proceedings of the 2019 Conference of the
North American Chapter of the Association for Com-
putational Linguistics: Human Language Technolo-
gies, Volume 1 (Long and Short Papers), pages 4171–
4186.
Chuyu Fang, Chuan Qin, Qi Zhang, Kaichun Yao, Jing-
shuai Zhang, Hengshu Zhu, Fuzhen Zhuang, and
Hui Xiong. 2023. Recruitpro: A pretrained language
model with skill-aware prompt learning for intelli-
gent recruitment. In Proceedings of the 29th ACM
SIGKDD Conference on Knowledge Discovery and
Data Mining, pages 3991–4002.
Yan Fu, Nan Li, Juan Feng, and Qiang Ye. 2021. In-
congruent skills and experiences in online labor mar-
ket. Electronic Commerce Research and Applica-
tions, 45:101025.
Yuxian Gu, Li Dong, Furu Wei, and Minlie Huang.
2023. Knowledge distillation of large language mod-
els. arXiv preprint arXiv:2306.08543.
Feng Guo, Christopher M Gallagher, Tianjun Sun, Saba
Tavoosi, and Hanyi Min. 2021. Smarter people ana-
lytics with organizational text data: Demonstrations
using classic and advanced nlp models. Human Re-
source Management Journal.
Qiao Jin, Yifan Yang, Qingyu Chen, and Zhiyong Lu.
2023. Genegpt: Augmenting large language models
with domain tools for improved access to biomedical
information. ArXiv.
Hyunwoo Kim, Jack Hessel, Liwei Jiang, Ximing Lu,
Youngjae Yu, Pei Zhou, Ronan Le Bras, Malihe
Alikhani, Gunhee Kim, Maarten Sap, et al. 2022.
Soda: Million-scale dialogue distillation with so-
cial commonsense contextualization. arXiv preprint
arXiv:2212.10465.
Liunian Harold Li, Jack Hessel, Youngjae Yu, Xi-
ang Ren, Kai-Wei Chang, and Yejin Choi. 2023.
Symbolic chain-of-thought distillation: Small models can also "think" step-by-step. arXiv preprint
arXiv:2306.14050.
XiaoWei Li, Hui Shu, Yi Zhai, and ZhiQiang Lin. 2021.
A method for resume information extraction using
bert-bilstm-crf. In 2021 IEEE 21st International
Conference on Communication Technology (ICCT),
pages 1437–1442. IEEE.
Yuliang Li, Jinfeng Li, Yoshihiko Suhara, AnHai Doan,
and Wang-Chiew Tan. 2020. Deep entity matching
with pre-trained language models. arXiv preprint
arXiv:2004.00584.
Chin-Yew Lin. 2004. Rouge: A package for automatic
evaluation of summaries. In Text summarization
branches out, pages 74–81.
Yang Liu, Dan Iter, Yichong Xu, Shuohang Wang,
Ruochen Xu, and Chenguang Zhu. 2023. Gpteval:
Nlg evaluation using gpt-4 with better human align-
ment. arXiv preprint arXiv:2303.16634.
Hussain Falih Mahdi, Rishit Dagli, Ali Mustufa, and
Sameer Nanivadekar. 2021. Job descriptions key-
word extraction using attention based deep learning
models with bert. In 2021 3rd International Congress
on Human-Computer Interaction, Optimization and
Robotic Applications (HORA), pages 1–6. IEEE.
OpenAI. 2023. Gpt-4 technical report. arXiv preprint
arXiv:2303.08774.
Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Car-
roll L Wainwright, Pamela Mishkin, Chong Zhang,
Sandhini Agarwal, Katarina Slama, Alex Ray, et al.
2022. Training language models to follow in-
structions with human feedback. arXiv preprint
arXiv:2203.02155.
Chuan Qin, Hengshu Zhu, Tong Xu, Chen Zhu, Chao
Ma, Enhong Chen, and Hui Xiong. 2020. An en-
hanced neural network approach to person-job fit in
talent recruitment. ACM Transactions on Informa-
tion Systems (TOIS), 38(2):1–33.
Karan Singhal, Shekoofeh Azizi, Tao Tu, S Sara Mah-
davi, Jason Wei, Hyung Won Chung, Nathan Scales,
Ajay Tanwani, Heather Cole-Lewis, Stephen Pfohl,
et al. 2022. Large language models encode clinical
knowledge. arXiv preprint arXiv:2212.13138.
Ying Sun, Fuzhen Zhuang, Hengshu Zhu, Qi Zhang,
Qing He, and Hui Xiong. 2021. Market-oriented job
skill valuation with cooperative composition neural
network. Nature communications, 12(1):1992.
Jin Wang, Yuliang Li, Wataru Hirota, and Eser Kan-
dogan. 2022. Machop: An end-to-end generalized
entity matching framework. In Proceedings of the
Fifth International Workshop on Exploiting Artificial
Intelligence Techniques for Data Management, pages
1–10.
Shufan Wang, Laure Thompson, and Mohit Iyyer. 2021.
Phrase-bert: Improved phrase embeddings from bert
with an application to corpus exploration. arXiv
preprint arXiv:2109.06304.
Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten
Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou,
et al. 2022. Chain-of-thought prompting elicits rea-
soning in large language models. Advances in Neural
Information Processing Systems, 35:24824–24837.
Peter West, Chandra Bhagavatula, Jack Hessel, Jena D
Hwang, Liwei Jiang, Ronan Le Bras, Ximing
Lu, Sean Welleck, and Yejin Choi. 2021. Sym-
bolic knowledge distillation: from general language
models to commonsense models. arXiv preprint
arXiv:2110.07178.
Wei Xu, Courtney Napoles, Ellie Pavlick, Quanze Chen,
and Chris Callison-Burch. 2016. Optimizing sta-
tistical machine translation for text simplification.
Transactions of the Association for Computational
Linguistics, 4:401–415.
Tim Zimmermann, Leo Kotschenreuther, and Karsten
Schmidt. 2016. Data-driven hr-résumé analysis
based on natural language processing and machine
learning. arXiv preprint arXiv:1606.05611.
A Details of Prompts
We present the prompts utilized in G-eval for as-
sessing consistency and factuality in Figures 5 and
6, respectively. Additionally, the prompt used in
our generation pipeline for generating triples of
job descriptions, matched resumes, and unmatched
resumes is provided in Figure 7.
Consistency
# Instruction:
As a hiring manager, your task is to evaluate job descriptions on a scale of 1-5. This scale represents the consistency of
the job description, with 1 being completely inconsistent and 5 being fully consistent. Your evaluation should consider
the alignment of job responsibilities, required skills, qualifications, and the overall tone of the job description,
as well as the consistency between different sections of the job description. Please ensure you fully understand these
instructions before proceeding.
# Evaluation Criteria:
1. Completely Inconsistent: The job responsibilities, required skills, qualifications, and overall tone of the job
description are not aligned. The description is confusing and does not provide a clear understanding of the job.
Additionally, there are significant inconsistencies between different sections of the job description, making it
confusing and unclear.
2. Mostly Inconsistent: There are some elements of the job description that align, but there are significant
inconsistencies between different sections that make the description unclear.
3. Somewhat Consistent: The job description has a fair amount of alignment between the responsibilities, skills, and
qualifications, as well as between different sections, but there are areas that could be improved for clarity.
4. Mostly Consistent: The job description is mostly aligned, both within sections and between different sections, with
only minor inconsistencies. The description provides a clear understanding of the job.
5. Fully Consistent: The job responsibilities, required skills, qualifications, and overall tone of the job description
are perfectly aligned. Additionally, there is a high level of consistency between different sections, resulting in
a clear and comprehensive understanding of the job.
# Evaluation Steps:
1. Carefully read the entire job description, focusing on the alignment between the job responsibilities, required
skills, qualifications, and the overall tone of the description.
2. Evaluate the overall consistency of the job description based on the provided criteria.
3. Assign a consistency score ranging from 1 to 5, using the Evaluation Criteria as a guide.
# Required Skills:
{skills}
# Required Experience:
{experience}
# Job Description:
{job description}
# Evaluation Form (scores ONLY):
Figure 5: The prompt used for assessing the consistency in generated job descriptions. We use the same prompt,
changing job descriptions to resumes, for resumes as well.
Factuality
# Instruction:
As a hiring manager, your task is to evaluate job descriptions on a scale of 1-5. This scale represents the factuality
of the job description, with 1 being completely false and 5 being completely true. Your evaluation should consider the
accuracy of the job responsibilities, required skills, qualifications, and the overall representation of the job role.
Please ensure you fully understand these instructions before proceeding.
# Evaluation Criteria:
1. Completely False: The job description does not match the job title at all. The responsibilities, required skills,
and qualifications are misleading or incorrect.
2. Mostly False: The job description has some elements of truth but contains significant inaccuracies or exaggerations
in the responsibilities, required skills, or qualifications.
3. Somewhat True: The job description is partially accurate. Some responsibilities, required skills, or qualifications
may be overstated or understated.
4. Mostly True: The job description is largely accurate, with minor discrepancies in the responsibilities, required
skills, or qualifications.
5. Completely True: The job description accurately represents the job title, responsibilities, required skills, and
qualifications without any exaggeration or understatement.
# Evaluation Steps:
1. Carefully read the entire job description, focusing on the job title, responsibilities, required skills, and
qualifications.
2. Evaluate the overall factuality of the job description based on the provided criteria.
3. Assign a factuality score ranging from 1 to 5, using the Evaluation Criteria as a guide.
# Required Skills:
{skills}
# Required Experience:
{experience}
# Job Description:
{job description}
# Evaluation Form (scores ONLY):
Figure 6: The prompt used for assessing the factuality in generated job descriptions. We use the same prompt,
changing job descriptions to resumes, for resumes as well.
Write a job description for a “{job title}” job which require only skill set of “{list of skills}” and only previous job
experience of “{list of experiences with augmented years for job description}” and a matching resume for a candidate with
the name of “{sampled first name}” and having only skill set of “{list of skills}” and only previous job experience of
“{list of experiences with augmented years for resume}”. Then generate exactly the same resume (keeping everything the same)
but excluding skill set of “{list of skills to be removed}” and “{the modification to last experience}”. Don’t include any
extra skills and experience. But generate extra details about provided skills and job experience. The job description should
only contain Job Title, Job Summary, Required Skills, and Responsibilities sections (only include few responsibilities).
Resumes should only contain Personal Information (containing the provided first name and a matching generated last name and
email), Education, Skills, and Experience sections. The generated output should exactly be according the following structure:
###### Job-description
## Job title
.....
## Job Summary
.....
## Required Skills
.....
## Required Experience
.....
## Responsibilities
.....
###### Resume 1
## Personal Information
.....
## Education
.....
## Skills
.....
## Experience
.....
###### Resume 2
## Personal Information
.....
## Education
.....
## Skills
.....
## Experience
.....
Figure 7: The prompt used for generating documents in RJDB generation pipeline.