Text Mining Our Way to More Jobs for Veterans

06/15/2021 by Chris St. Jeor Machine Learning

Following their years of service, veterans face a difficult challenge when it comes time to integrate into the civilian workplace. In fact, 1 in 3 veterans is underemployed in the civilian workforce, 15% more than the non-veteran workforce. They have spent years developing a highly advanced skillset but have limited means of aligning those skills with the day-to-day tasks required in civilian jobs.

To assist veterans with this important transition, Zencos is using text mining and advanced analytics to help them find better opportunities. Zencos has developed an application that allows veterans to explore the wide variety of available jobs and identify the opportunities that best align with their skillsets and interests. This post walks you through the approach we used and the different methods we considered while developing the application.

Data Sources: You are What You Eat

Any actionable text mining application requires quality data. The better the data you feed your model, the more insightful your results will be. To build the solution, Zencos pulled the data from O*NET, which is a widely used clearinghouse for job definitions and statistics.

O*NET provides a variety of details for more than 900 jobs. From the skills required to the tools used, it offers a wealth of text defining the nitty-gritty of each job listed. O*NET also provides a crosswalk between each job’s Standard Occupational Classification (SOC) code and the Military Occupational Classification (MOC) codes. The text mining and analytics were performed using a combination of Python and SAS Viya 4.0.

(Text) Mining for Gems

While O*NET provides a proverbial mine full of gems and is an excellent resource for any job seeker, the particular problem that veterans face when leaving the armed services is the sheer volume of jobs to filter through. While O*NET does have a crosswalk that will align a veteran’s specific MOC to one specific SOC code, translating their experience this way is akin to viewing the entirety of the civilian labor market through the eye of a needle. Too often, a veteran’s skills become underutilized, and they get left behind. To create a broader view of the labor market, we needed to develop a method to measure the likeness between jobs.

Cleaning the Corpus

Text mining, or any text analytics project for that matter, works best when the model is fed a slimmed-down corpus of the words that are most meaningful to the job itself. (For those who don’t know, in text mining the word corpus is just a fancy term for a large collection of unstructured text documents.) Cleaning the corpus involves two fundamental steps. The first step is to trim words down to their root form. For example, treating the word “read” in one document differently than the word “reading” in another would lose the relationship between the two documents. Reducing both words to the same root preserves that relationship.

The second step is to remove stop words. Stop words are words within a document that do not add fundamental meaning to the text being analyzed, such as “the,” “and,” and “it.” Removing uninformative words from the text allows us to put more focus on the important information.

Getting Down to the Root

Text mining includes two main methods for trimming words down to their root form: stemming and lemmatization (not a made-up word). Both methods attempt to solve the same task (trimming a word down to its root form), but each takes a different approach.

You can think of stemming algorithms as rule-based algorithms. Stemming is essentially a heuristic process that chops the ends of words off to get them down to their most rudimentary meaning.

Lemmatization, however, is a much more sophisticated approach. Lemmatization actually pays attention to the word’s part of speech and attempts to reduce the word to its dictionary form, known as its lemma. While this approach is far more computationally expensive, it tends to produce better and more interpretable results in the final product. I will give you a hint as to which approach we used: it rhymes with slemmatization.
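To make the contrast concrete, here is a minimal sketch in plain Python. It is a toy and not the code behind the application: the suffix list and the tiny lemma dictionary are made up for illustration, and a real project would rely on a library stemmer and lemmatizer instead.

```python
# Toy illustration only. The rules and the dictionary below are hypothetical
# and far smaller than anything a real stemmer or lemmatizer would use.

SUFFIXES = ("ing", "ed", "es", "s")  # checked in this order


def toy_stem(word: str) -> str:
    """Rule-based: chop the first matching suffix, keeping at least 3 letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


# Dictionary-based: the lookup key includes the part of speech.
LEMMAS = {
    ("reading", "VERB"): "read",
    ("studies", "NOUN"): "study",
    ("was", "VERB"): "be",
    ("better", "ADJ"): "good",
    ("meeting", "NOUN"): "meeting",
    ("meeting", "VERB"): "meet",
}


def toy_lemmatize(word: str, pos: str) -> str:
    """Return the dictionary form if known, otherwise the word unchanged."""
    return LEMMAS.get((word, pos), word)


examples = [
    ("reading", "VERB"),
    ("studies", "NOUN"),
    ("was", "VERB"),
    ("better", "ADJ"),
    ("meeting", "NOUN"),
]
for word, pos in examples:
    print(f"{word:8} stem={toy_stem(word):8} lemma={toy_lemmatize(word, pos)}")
```

Notice that the stemmer turns “studies” into “studi” and “meeting” into “meet” regardless of how the word is used, while the lemmatizer returns “study” and keeps the noun “meeting” intact. That difference is what makes lemmatized output easier to interpret.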

Stop That Stop Word!

As previously mentioned, stop words represent the low-level text that does not add much context or meaning to the corpus. We remove these words to allow the clustering models to focus on the more meaningful terms. While there are a host of libraries readily available to use as a starting point, you need to be careful – the list of stop words you want to use is highly dependent on the task at hand. As such, adequate time should be dedicated to exploring your text so that you can tailor-make the list of stop words you remove from your corpus.

One quick way to decide which words belong in your custom list of stop words is to create a frequency count of each lemmatized word in the corpus. Once you order the list by frequency, you can determine whether a word has a high frequency because of the topic, or because it is a common filler word that adds minimal value to the context of the documents.
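The frequency count described above can be sketched with Python’s standard library. The job names and tokens below are invented for illustration, and the “appears in every document” rule is only an assumed starting point for review, not the rule used in the application.

```python
from collections import Counter

# Hypothetical, already-lemmatized job descriptions (one token list per job).
docs = {
    "welder": ["use", "tool", "to", "weld", "metal", "part"],
    "nurse": ["use", "tool", "to", "monitor", "patient", "health"],
    "analyst": ["use", "data", "to", "analyze", "market", "trend"],
}

# How often each term occurs overall, and in how many documents it occurs.
term_freq = Counter(tok for tokens in docs.values() for tok in tokens)
doc_freq = Counter(tok for tokens in docs.values() for tok in set(tokens))

# Review the most frequent terms first.
for term, count in term_freq.most_common(5):
    print(f"{term:8} count={count} docs={doc_freq[term]}/{len(docs)}")

# Candidate stop words: terms that show up in every document. A human should
# still review this list, since a frequent word can be frequent because of
# the topic rather than because it is filler.
candidates = sorted(t for t, df in doc_freq.items() if df == len(docs))
custom_stop_words = set(candidates)

cleaned = {
    job: [t for t in tokens if t not in custom_stop_words]
    for job, tokens in docs.items()
}
print(candidates)
print(cleaned["welder"])
```

Comparing the raw count with the document count is what helps separate the two cases: “tool” is frequent here but only in some jobs, so it stays, while “use” and “to” appear everywhere and are flagged.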

Discovering Relations

The fundamental purpose of creating topic clusters is to find the hidden relationships that previously unrelated texts have in common or, in our case, to identify similarities between previously unrelated jobs.

Once we had our cleaned corpus of job descriptions, we were ready to begin our topic clustering. We tested two methods for creating the job clusters: Gensim’s Latent Dirichlet Allocation (LDA) model and the topic model from Mallet (the MAchine Learning for LanguagE Toolkit).

Both tools fit an LDA topic model, so the main difference is how each one estimates it: Gensim’s LDA uses variational Bayes, while Mallet uses Gibbs sampling. Gensim’s LDA is faster computationally, but Mallet is often more precise and typically provides better coherence scores. Coherence scores measure the semantic similarity between the high-scoring words, or most important words, that define a topic or job cluster. Mallet was the clear winner for our job clustering.
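To give a feel for what a coherence score rewards, here is a small sketch of one common variant, UMass coherence, written in plain Python. The post does not say which coherence measure was used, so treat the choice of UMass, the function name, and the sample documents as assumptions made for illustration.

```python
import math
from itertools import combinations


def umass_coherence(top_words, documents):
    """UMass coherence for one topic; a higher score means more coherent.

    top_words: the topic's top words, ordered from most to least important.
    documents: list of token lists. Assumes every top word occurs in at
    least one document, which holds when the topic was fit on this corpus.
    """
    doc_sets = [set(doc) for doc in documents]

    def doc_count(*words):
        # Number of documents containing all of the given words.
        return sum(all(w in d for w in words) for d in doc_sets)

    score = 0.0
    # combinations() keeps ranking order: (higher-ranked, lower-ranked).
    for earlier, later in combinations(top_words, 2):
        score += math.log((doc_count(earlier, later) + 1) / doc_count(earlier))
    return score


# Hypothetical mini-corpus of cleaned job descriptions.
documents = [
    ["repair", "engine", "vehicle", "tool"],
    ["repair", "engine", "aircraft", "tool"],
    ["patient", "care", "medication", "health"],
    ["patient", "care", "health", "record"],
]

coherent_topic = ["repair", "engine", "tool"]
mixed_topic = ["repair", "patient", "tool"]

print(umass_coherence(coherent_topic, documents))
print(umass_coherence(mixed_topic, documents))
```

The first topic scores higher because its top words tend to appear together in the same job descriptions, while the second mixes words that never co-occur. That is the intuition behind preferring the model with the better coherence score.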

Connecting the Dots

Creating job clusters can only take you so far. We needed a way for veterans to make sense of the information we uncovered. Once each job was assigned to a job cluster, we created a network analysis visualization that shows the strength of relationship between each of the jobs within a cluster. Salary, years of required training, and shared key terms for the cluster were used to create the strength of match between jobs – the thicker the line connecting two jobs, the more closely related they are.
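As a rough sketch of how such a strength of match could be computed, the snippet below blends the three signals into a single edge weight per pair of jobs. Everything here is an assumption for illustration: the job attributes are made up, and the equal-weight formula, the scaling constants, and the use of Jaccard overlap for shared terms are not taken from the application.

```python
from itertools import combinations

# Made-up job attributes; none of these values come from O*NET.
jobs = {
    "Job A": {"salary": 50_000, "training_years": 2,
              "terms": {"repair", "engine", "tool"}},
    "Job B": {"salary": 55_000, "training_years": 2,
              "terms": {"repair", "engine", "aircraft"}},
    "Job C": {"salary": 90_000, "training_years": 6,
              "terms": {"repair", "design", "system"}},
}


def closeness(a, b, scale):
    """1.0 when two values are equal, falling toward 0 as they diverge."""
    return max(0.0, 1.0 - abs(a - b) / scale)


def match_strength(x, y):
    """Hypothetical equal-weight blend of salary, training, and shared terms."""
    salary_sim = closeness(x["salary"], y["salary"], scale=100_000)
    training_sim = closeness(x["training_years"], y["training_years"], scale=10)
    # Jaccard overlap: shared key terms divided by all key terms.
    term_sim = len(x["terms"] & y["terms"]) / len(x["terms"] | y["terms"])
    return (salary_sim + training_sim + term_sim) / 3


# One weighted edge per pair of jobs; a larger weight would draw a thicker line.
edges = {
    (a, b): round(match_strength(jobs[a], jobs[b]), 3)
    for a, b in combinations(jobs, 2)
}

for (a, b), weight in sorted(edges.items(), key=lambda kv: -kv[1]):
    print(f"{a} -- {b}: {weight}")
```

The resulting dictionary is an edge list that a graph or visualization library could consume, with the weight mapped to line thickness.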

With this information, veterans can now use our dashboard to crosswalk their military occupation code to the O*NET codes, identify which job cluster their previous experience most closely aligns with, and explore available jobs within their related job cluster.
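In code, that lookup flow amounts to chaining two mappings. The sketch below uses placeholder codes and a hypothetical explore function; the real crosswalk comes from O*NET and the real cluster assignments come from the topic model.

```python
# Placeholder codes only; real values would come from the O*NET crosswalk
# and from the cluster assignments produced by the topic model.
moc_to_soc = {"MOC-001": ["SOC-A"], "MOC-002": ["SOC-B", "SOC-C"]}
soc_to_cluster = {"SOC-A": 0, "SOC-B": 1, "SOC-C": 1, "SOC-D": 0, "SOC-E": 1}


def explore(moc):
    """Map a military code to its clusters and every job in those clusters."""
    clusters = {soc_to_cluster[soc] for soc in moc_to_soc.get(moc, [])}
    return {
        cluster: sorted(s for s, c in soc_to_cluster.items() if c == cluster)
        for cluster in sorted(clusters)
    }


print(explore("MOC-001"))  # {0: ['SOC-A', 'SOC-D']}
```

The point of the extra hop through the cluster is that the veteran sees every job in the cluster, such as SOC-D here, and not only the single job the crosswalk maps to directly.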

And That’s a Wrap

Our military service members develop diverse, important, and interesting skillsets during their years of service. After sacrificing so much on behalf of our country, veterans should have an easier time connecting their skillsets to the civilian workforce.

Zencos is grateful to have had the opportunity to use text mining and advanced analytics to help veterans explore the wide variety of available jobs and identify the opportunities that best align with their skillsets and interests. If you have any questions regarding the application or the approaches used, we’d love to hear from you.