Resume Example 1
Borrowed from University of La Verne Career Center - Link
This playground showcases the ResumeReady resume parser and its ability to parse information from a resume PDF. Click around the PDF examples below to observe different parsing results.
Created with ResumeReady resume builder - Link
You can also add your resume below to assess how well it would be parsed by the Applicant Tracking Systems (ATS) used in job applications. The more information the parser can extract, the more likely it is that the resume is well formatted and easy to read. At the very least, the name and email should be parsed accurately.
File data is used locally and never leaves your browser
Profile | |
---|---|
Name | |
Phone | |
Location | |
Link | |
Summary | |
Education | |
School | |
Degree | |
GPA | |
Date | |
Descriptions | |
Work Experience | |
Company | |
Job Title | |
Date | |
Descriptions | |
Skills | |
Descriptions |
This section is for the technically inclined and will provide an in-depth explanation of the ResumeReady parser algorithm, going through the four steps of how it operates. (Note: the algorithm is designed for parsing single-column resumes in English.)
A PDF file is a standardized file format defined by the ISO 32000 specification. When you open a PDF file with a text editor, the raw content appears encoded and is hard to read. To view it in a readable format, a PDF reader is necessary to decode and display the file. Similarly, the resume parser first decodes the PDF file to extract its text content.
While it’s possible to develop a custom PDF reader according to the ISO 32000 specification, it's easier to utilize an existing library. In this case, the resume parser leverages Mozilla's open-source pdf.js library to initially extract all text items from the file.
The table below lists the text items that have been extracted from the added resume PDF. Each text item includes the text content as well as some metadata about it, such as its x, y coordinates on the page, whether the font is bolded, or whether it starts a new line. (Note: x, y coordinates are relative to the bottom left corner of the page, which is the origin 0,0)
# | Text Content | Metadata |
---|---|---|
The extracted text items aren’t ready to be used just yet and have two main issues:
Issue 1: They contain unwanted noise. Some single text items can be split into multiple ones, as you might notice in the table above. For example, a phone number "(123) 456-7890" might be broken into three text items: "(123) 456", "-", and "7890".
Solution: To address this issue, the resume parser merges adjacent text items into one if the distance between them is smaller than the average typical character width. The average typical character width is calculated by dividing the sum of all text items' widths by the total number of characters in those text items (bolded text and new line elements are excluded to avoid skewing the result).
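The merge rule can be sketched in TypeScript as follows. This is a minimal illustration rather than the parser's actual code: the `MergeTextItem` shape, its field names, and both function names are assumptions, and the items passed to `mergeAdjacent` are assumed to sit on the same line, ordered left to right.

```typescript
// Hypothetical text item shape; field names are assumptions for this sketch.
interface MergeTextItem {
  text: string;
  x: number; // left edge of the item
  width: number; // rendered width of the item
  bold: boolean;
  startsNewLine: boolean; // assumed stand-in for the "new line elements"
}

// Sum of widths divided by number of characters,
// skipping bolded items and new line elements.
function averageCharWidth(items: MergeTextItem[]): number {
  let totalWidth = 0;
  let totalChars = 0;
  for (const item of items) {
    if (item.bold || item.startsNewLine) continue;
    totalWidth += item.width;
    totalChars += item.text.length;
  }
  return totalChars === 0 ? 0 : totalWidth / totalChars;
}

// Merge neighbours whose horizontal gap is smaller than one average character.
function mergeAdjacent(line: MergeTextItem[], charWidth: number): MergeTextItem[] {
  const merged: MergeTextItem[] = [];
  for (const item of line) {
    const prev = merged[merged.length - 1];
    if (prev && item.x - (prev.x + prev.width) < charWidth) {
      merged[merged.length - 1] = {
        ...prev,
        text: prev.text + item.text,
        width: item.x + item.width - prev.x, // span from prev's left edge to item's right edge
      };
    } else {
      merged.push({ ...item });
    }
  }
  return merged;
}
```

With the phone number example above, the three fragments sit almost flush against each other, so they collapse into a single "(123) 456-7890" item, while an item separated by a wider gap stays on its own.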
Issue 2: They lack context and associations. When reading a resume, we scan it line by line. Our brains can process each section through visual cues such as text boldness and proximity, allowing us to quickly associate texts that are close together as related. However, the extracted text items currently don’t have those contexts or associations and are just disjointed elements.
Solution: To solve this problem, the resume parser rebuilds those contexts and associations similarly to how our brain processes a resume. It first groups text items into lines since we read text line by line, then it groups lines into sections, which will be discussed in the next step.
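Grouping text items into lines might look like the sketch below. It assumes each item carries a flag saying whether it starts a new line, as mentioned in the metadata description; the `LineTextItem` type, the `startsNewLine` field, and the `groupIntoLines` function are hypothetical names chosen for this example.

```typescript
// Hypothetical text item; `startsNewLine` mirrors the "starts a new line" metadata.
interface LineTextItem {
  text: string;
  startsNewLine: boolean;
}

// Walk the items in reading order and open a new line at every new-line flag.
function groupIntoLines(items: LineTextItem[]): LineTextItem[][] {
  const lines: LineTextItem[][] = [];
  let current: LineTextItem[] | undefined;
  for (const item of items) {
    if (!current || item.startsNewLine) {
      current = [item];
      lines.push(current);
    } else {
      current.push(item);
    }
  }
  return lines;
}
```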
By the end of step 2, the resume parser has grouped the text items from the added resume PDF into lines, as shown in the table below. The result is more readable when displayed in lines. (Some lines may contain multiple text items, separated by a blue vertical divider | )
Lines | Line Content |
---|---|
In step 2, the resume parser begins constructing context and associations by grouping text items into lines. Step 3 continues this process by grouping lines into sections.
Note that every section (except the profile section) begins with a section title that occupies the entire line. This pattern is common not just in resumes but also in books and blogs. The resume parser uses this pattern to group lines with the closest section title above them.
The resume parser applies certain heuristics to detect a section title. The primary heuristic treats a text item as a section title if it meets all three of the following conditions:
1. It is the only text item in the line
2. It is bolded
3. Its letters are all UPPERCASE
In simple terms, if a text item is both bolded and uppercase, it is most likely a section title in a resume. This is generally true for well-formatted resumes. There can be exceptions, but in those cases, the use of bolded and uppercase text might not be appropriate.
The resume parser also has a fallback heuristic if the main heuristic doesn’t apply. The fallback heuristic mainly performs a keyword match against a list of common resume section title keywords.
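The two heuristics and the grouping of lines under their closest section title can be sketched as below. This is an illustration under stated assumptions: the `SectionTextItem` type, the function names, the "profile" key for lines before the first title, and the short `SECTION_KEYWORDS` list are all hypothetical (the text does not give the parser's real keyword list), and the fallback is assumed to apply only to single-item lines.

```typescript
// Hypothetical text item and line shapes for this sketch.
interface SectionTextItem {
  text: string;
  bold: boolean;
}
type SectionLine = SectionTextItem[];

// Illustrative keywords only; the parser's actual list is not given in the text.
const SECTION_KEYWORDS = ["education", "experience", "skills", "projects"];

const isAllUppercase = (text: string): boolean =>
  /[a-zA-Z]/.test(text) && text === text.toUpperCase();

function isSectionTitle(line: SectionLine): boolean {
  const [item] = line;
  if (!item || line.length !== 1) return false; // must be the only item in the line
  if (item.bold && isAllUppercase(item.text)) return true; // main heuristic
  const lower = item.text.toLowerCase(); // fallback: keyword match
  return SECTION_KEYWORDS.some((keyword) => lower.includes(keyword));
}

// Assign every line to the closest section title above it.
function groupIntoSections(lines: SectionLine[]): Record<string, SectionLine[]> {
  let currentLines: SectionLine[] = [];
  const sections: Record<string, SectionLine[]> = { profile: currentLines };
  for (const line of lines) {
    const [first] = line;
    if (first && isSectionTitle(line)) {
      currentLines = [];
      sections[first.text] = currentLines;
    } else {
      currentLines.push(line);
    }
  }
  return sections;
}
```

A plain substring match like this would also fire on a description line that happens to contain "experience", so a practical fallback would likely need additional guards.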
By the end of step 3, the resume parser identifies the sections in the resume and groups the lines under the associated section title, as shown in the table below. Note that the section titles are bolded and the lines associated with the section are highlighted with matching colors.
Lines | Line Content |
---|---|
Step 4 is the final step of the resume parsing process and represents the core of the resume parser, where it extracts resume information from the sections.
The core of the extraction engine is a feature scoring system. Each resume attribute to be extracted has custom feature sets, where each feature set consists of a feature matching function and a feature matching score if matched (the feature matching score can be either positive or negative). To compute the final feature score of a text item for a specific resume attribute, the text item is run through all its feature sets, and the matching feature scores are summed up. This process is performed for all text items within the section, and the text item with the highest computed feature score is identified as the extracted resume attribute.
As an example, the table below shows three resume attributes in the profile section of the added resume PDF.
Resume Attribute | Text (Highest Feature Score) | Feature Scores of Other Texts |
---|---|---|
Name | | |
Email | | |
Phone | | |
Having discussed the feature scoring system, let's dive deeper into how feature sets are constructed for a resume attribute. There are two guiding principles:
1. A resume attribute's feature sets are designed in relation to all other resume attributes within the same section.
2. A resume attribute's feature sets are manually crafted based on its characteristics and the likelihood of each characteristic.
The table below lists some of the feature sets for the name attribute. It includes feature functions that match the name attribute with a positive feature score, as well as feature functions that only match other resume attributes in the section with a negative feature score.
Name Feature Sets
Feature Function | Feature Matching Score |
---|---|
Contains only letters, spaces, or periods | +3 |
Is bolded | +2 |
Contains all uppercase letters | +2 |
Contains @ | -4 (match email) |
Contains a number | -4 (match phone) |
Contains , | -4 (match address) |
Contains / | -4 (match URL) |
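To make the scoring concrete, the sketch below encodes the name feature sets from the table and picks the text item with the highest summed score. The `ProfileTextItem` and `FeatureSet` types and the `featureScore` and `extractBest` functions are hypothetical names for this example, and "Contains all uppercase letters" is interpreted here as "every letter is uppercase", which is an assumption.

```typescript
// Hypothetical types for this sketch.
interface ProfileTextItem {
  text: string;
  bold: boolean;
}
interface FeatureSet {
  matches: (item: ProfileTextItem) => boolean;
  score: number; // added to the total when `matches` returns true
}

// Name feature sets, transcribed from the table above.
const NAME_FEATURE_SETS: FeatureSet[] = [
  { matches: (i) => /^[a-zA-Z\s\.]+$/.test(i.text), score: 3 },
  { matches: (i) => i.bold, score: 2 },
  { matches: (i) => /[a-zA-Z]/.test(i.text) && i.text === i.text.toUpperCase(), score: 2 },
  { matches: (i) => i.text.includes("@"), score: -4 }, // looks like an email
  { matches: (i) => /\d/.test(i.text), score: -4 }, // looks like a phone
  { matches: (i) => i.text.includes(","), score: -4 }, // looks like an address
  { matches: (i) => i.text.includes("/"), score: -4 }, // looks like a URL
];

// Sum the scores of every feature set that matches the item.
function featureScore(item: ProfileTextItem, featureSets: FeatureSet[]): number {
  return featureSets.reduce((sum, f) => (f.matches(item) ? sum + f.score : sum), 0);
}

// Return the item with the highest score (the first one wins on ties).
function extractBest(
  items: ProfileTextItem[],
  featureSets: FeatureSet[],
): ProfileTextItem | undefined {
  let best: ProfileTextItem | undefined;
  let bestScore = -Infinity;
  for (const item of items) {
    const score = featureScore(item, featureSets);
    if (score > bestScore) {
      best = item;
      bestScore = score;
    }
  }
  return best;
}
```

For a profile containing a bolded "John Doe", an email, a phone number, and a city, the name line scores +5 while each of the others lands at -4, so the name is selected.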
Each resume attribute has multiple feature sets. These can be found in the source code under the extract-resume-from-sections folder, and we won't list them all here. Typically, each resume attribute has a core feature function that strongly identifies it, so we'll list out the core feature function below.
Resume Attribute | Core Feature Function | Regex |
---|---|---|
Name | Contains only letters, spaces, or periods | /^[a-zA-Z\s\.]+$/ |
Email | Matches email format xxx@xxx.xxx, where xxx can be anything but spaces | /\S+@\S+\.\S+/ |
Phone | Matches phone format (xxx)-xxx-xxxx, where the parentheses and hyphens are optional | /\(?\d{3}\)?[\s-]?\d{3}[\s-]?\d{4}/ |
Location | Matches city and state format City, ST | /[A-Z][a-zA-Z\s]+, [A-Z]{2}/ |
Url | Matches URL format xxx.xxx/xxx | /\S+\.[a-z]+\/\S+/ |
School | Contains a school keyword, e.g., College, University, School | |
Degree | Contains a degree keyword, e.g., Associate, Bachelor, Master | |
GPA | Matches GPA format x.xx | /[0-4]\.\d{1,2}/ |
Date | Contains a date keyword related to year, month, seasons, or the word Present | Year: /(?:19|20)\d{2}/ |
Job Title | Contains a job title keyword, e.g., Analyst, Engineer, Intern | |
Company | Is bolded or doesn’t match job title & date | |
Project | Is bolded or doesn’t match date | |
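The regular expressions in the table can be tried out directly. The sketch below copies them verbatim into a small lookup and reports which core patterns a piece of text matches; the `CORE_PATTERNS` constant and the `matchingAttributes` helper are hypothetical names introduced for this example.

```typescript
// Core regexes copied from the table above, keyed by resume attribute.
const CORE_PATTERNS: Array<[attribute: string, pattern: RegExp]> = [
  ["name", /^[a-zA-Z\s\.]+$/],
  ["email", /\S+@\S+\.\S+/],
  ["phone", /\(?\d{3}\)?[\s-]?\d{3}[\s-]?\d{4}/],
  ["location", /[A-Z][a-zA-Z\s]+, [A-Z]{2}/],
  ["url", /\S+\.[a-z]+\/\S+/],
  ["gpa", /[0-4]\.\d{1,2}/],
  ["year", /(?:19|20)\d{2}/],
];

// List every attribute whose core pattern matches the given text.
function matchingAttributes(text: string): string[] {
  return CORE_PATTERNS.filter(([, pattern]) => pattern.test(text)).map(
    ([attribute]) => attribute,
  );
}

console.log(matchingAttributes("(123) 456-7890")); // → ["phone"]
console.log(matchingAttributes("San Francisco, CA")); // → ["location"]
console.log(matchingAttributes("linkedin.com/in/johndoe")); // → ["url"]
```

Most of these patterns are unanchored, so a longer string can match several of them at once. That overlap is one reason a single regex is not enough and the negative feature scores matter.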
The last noteworthy aspect is subsections. For the profile section, all text items can be directly passed to the feature scoring system. However, for other sections, such as education and work experience, the section must first be divided into subsections since there may be multiple schools or work experiences listed. The feature scoring system then processes each subsection to retrieve each resume attribute and appends the results.
The resume parser applies certain heuristics to detect a subsection. The main heuristic starts a new subsection when the vertical gap between two lines is larger than the typical line gap * 1.4, since a well-formatted resume usually leaves an empty line before the next subsection. If the main heuristic does not apply, a fallback heuristic checks whether the text item is bolded.
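The main subsection heuristic can be sketched as follows. The text does not say how the typical line gap is computed, so this example assumes it is the most frequent rounded gap between consecutive lines; the `SubsectionLine` type and the function names are likewise hypothetical, and the bold fallback is left out. Since y coordinates are measured from the bottom of the page, the gap is the previous line's y minus the current line's y.

```typescript
// Hypothetical line shape: `y` is the line's vertical position from the page bottom.
interface SubsectionLine {
  text: string;
  y: number;
}

// Assumption: the "typical" gap is the most frequent rounded gap between lines.
function typicalLineGap(lines: SubsectionLine[]): number {
  const counts: Record<number, number> = {};
  let typical = 0;
  let maxCount = 0;
  let prev: SubsectionLine | undefined;
  for (const line of lines) {
    if (prev) {
      const gap = Math.round(prev.y - line.y);
      const count = (counts[gap] ?? 0) + 1;
      counts[gap] = count;
      if (count > maxCount) {
        maxCount = count;
        typical = gap;
      }
    }
    prev = line;
  }
  return typical;
}

// Start a new subsection whenever the gap exceeds 1.4x the typical line gap.
function splitIntoSubsections(lines: SubsectionLine[]): SubsectionLine[][] {
  const threshold = typicalLineGap(lines) * 1.4;
  const subsections: SubsectionLine[][] = [];
  let current: SubsectionLine[] = [];
  let prev: SubsectionLine | undefined;
  for (const line of lines) {
    if (prev && prev.y - line.y > threshold) {
      subsections.push(current);
      current = [];
    }
    current.push(line);
    prev = line;
  }
  if (current.length > 0) subsections.push(current);
  return subsections;
}
```

For lines spaced 12 units apart with one 24-unit gap in the middle, the threshold is 16.8, so the section splits into two subsections at the larger gap.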
And that's a wrap on the ResumeReady parser algorithm :)
Written by Andrew McMorrow in August 2024