Where does the HR data for machine learning come from?
HR Tech
November 22, 2022

The use of artificial intelligence in recruitment resembles transplantology in its early days: we all sense that it is a breakthrough, and yet it still stirs up a lot of controversy.

The controversy starts with the fear of handing a significant decision about our lives (whether we are a good fit for a specific position) over to a machine, and extends to widely reported cases in which errors in training algorithms led to discrimination against some candidates (a typical flaw of early, experimental technology).

Considering the unprecedented number of job vacancies (10.7 million in the US alone) waiting to be filled, the limited number of recruiters, and increasing pressure to hire at a record pace, algorithm-supported recruiting seems inevitable. Moreover, when efficiently designed, it can significantly ease the burden of everyday tasks.

You can’t pour from an empty cup

Managing the recruitment process is a complex, iterative, multi-step problem that addresses a critical business need: building a committed team that allows the company to achieve its goals. Using AI to solve that problem brings many benefits, such as greater efficiency, a faster process, analysis of the labor market, and better matching between candidates and the skills needed for projects.

AI tools can be very helpful. One example is creating candidate profiles. Instead of a person reading each CV, the right set of mathematical models and tools can extract the data and systematize it, which makes the information much easier to search and catalog. However, this problem was solvable even before the era of AI. If we think about how to use this data, one of the next steps is to compare the cataloged information about candidates against criteria, for example a required set of skills or level of experience. It is at this point that the real AI begins.
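To make the profile step concrete, here is a minimal sketch of extracting a structured profile from CV text and checking it against criteria. It is deliberately rule-based (the "solvable before AI" part), and everything in it is an assumption made for illustration: the `KNOWN_SKILLS` vocabulary, the `CandidateProfile` fields, and the "N years" pattern are hypothetical, not taken from any real system.

```python
import re
from dataclasses import dataclass

# Hypothetical skill vocabulary; a real system would use a much larger catalog.
KNOWN_SKILLS = {"python", "sql", "excel", "recruiting"}


@dataclass(frozen=True)
class CandidateProfile:
    candidate_id: str
    skills: frozenset
    years_experience: int


def extract_profile(candidate_id, cv_text):
    """Build a structured profile from raw CV text with simple pattern matching."""
    text = cv_text.lower()
    words = set(re.findall(r"[a-z+#]+", text))
    skills = frozenset(words & KNOWN_SKILLS)
    # Assumes experience is written as "<number> years"; defaults to 0 otherwise.
    match = re.search(r"(\d+)\s+years", text)
    years = int(match.group(1)) if match else 0
    return CandidateProfile(candidate_id, skills, years)


def meets_criteria(profile, required_skills, min_years):
    """Return True if the profile has every required skill and enough experience."""
    has_skills = set(required_skills) <= profile.skills
    return has_skills and profile.years_experience >= min_years


cv = "Data analyst with 5 years of experience. Tools: Python, SQL, Excel."
profile = extract_profile("c-001", cv)
print(profile)
print(meets_criteria(profile, {"python", "sql"}, min_years=3))
```

Once profiles are stored in a structure like this, searching and cataloging become ordinary set and comparison operations.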

Moving forward, we may want to assign relative ratings to the candidates and build a ranking from them (interestingly, we will be using 18th-century mathematical theorems to do this). Another example is gathering information about the labor market, analyzing it, and predicting trends for given industries. Access to such data helps in writing better job descriptions and raises awareness of the market in which you operate. Good data is the key element of any AI tool.
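The ranking idea can be sketched in a few lines. The post does not say which theorems or scoring method it has in mind, so the example below swaps in a deliberately simple stand-in: each candidate is scored by the fraction of required skills they cover, and candidates are sorted by that score. The function names and sample data are hypothetical.

```python
def coverage_score(candidate_skills, required_skills):
    """Fraction of the required skills that the candidate has (0.0 to 1.0)."""
    required = set(required_skills)
    if not required:
        return 0.0
    return len(required & set(candidate_skills)) / len(required)


def rank_candidates(candidates, required_skills):
    """Return (candidate, score) pairs, best score first; ties broken by name."""
    scored = [
        (name, coverage_score(skills, required_skills))
        for name, skills in candidates.items()
    ]
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))


# Hypothetical candidates and requirements.
candidates = {
    "A": {"python", "sql"},
    "B": {"python"},
    "C": {"excel"},
}
ranking = rank_candidates(candidates, {"python", "sql"})
for name, score in ranking:
    print(name, score)
```

A production system would likely use a probabilistic or learned score rather than plain coverage, but the shape of the step is the same: rate each candidate relative to the criteria, then sort.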

Difficulty in obtaining high-quality data is the fundamental obstacle to building any trustworthy AI-powered system. A model can only represent reality as well as the information collected about the area in which it operates. Using such tools to manage recruitment processes can bring significant benefits at many stages, but what we can build is limited more by the information with which we "teach" AI models than by the teaching (or training) itself; more on that in the next section. Machine learning programs improve when they are trained on well-crafted data tailored to the problem.

Where does the HR data for machine learning come from?

As we already know, the suggestions and forecasts produced by an AI-powered system are based on data. The more closely the data provided to the system corresponds to reality (in terms of diversity, proportions, etc.), the more reliable the results will be.
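One rough way to check how well a dataset "corresponds to reality" in terms of proportions is to compare its category shares with a reference distribution. The sketch below uses total variation distance (half the sum of absolute differences between shares) as the measure. The sector labels and reference shares are invented for illustration, and the choice of metric is an assumption, not something the post prescribes.

```python
from collections import Counter


def proportions(labels):
    """Share of each category in a list of labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


def total_variation(p, q):
    """Total variation distance between two distributions (0 = identical, 1 = disjoint)."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


# Hypothetical sample of resumes by sector, and hypothetical market shares.
sample = ["IT"] * 6 + ["finance"] * 3 + ["logistics"] * 1
reference = {"IT": 0.4, "finance": 0.3, "logistics": 0.3}

gap = total_variation(proportions(sample), reference)
print(f"distance from reference: {gap:.2f}")
```

Here the sample over-represents IT and under-represents logistics, so the distance is clearly above zero; a value close to zero would suggest the sample's proportions match the reference.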

There are three sources from which we (as the creators of this type of system) can obtain data:

  • Anonymized resumes

This is the best kind of data to work with. In sufficient quantity, anonymized resumes form a good model of reality from which the AI system can learn. On the other hand, this type of data is difficult to acquire in sufficient quantities: a single company processes a limited number of candidates, and that may not be enough for an AI system to work properly.

  • Job ads

Job ads are indirect data: they do not represent the candidate well and must be properly prepared before they can be used in managing the recruitment process. This type of information has one great advantage: it is publicly accessible and easily available in large quantities.

  • Generated resumes

The way a candidate is represented in a resume can be easily normalized and systematized without a lot of real data. Most resumes are divided into a finite number of sections, in which information can easily be categorized according to the job market sector. This allows for the creation of many types of resume generators whose output can supplement real data.
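A simple, rule-based generator of this kind can be sketched as follows. It fills a fixed set of resume sections from per-sector vocabularies. The sectors, job titles, skills, and field names below are all made up for the example; a real generator would draw on far richer vocabularies and realistic distributions.

```python
import random

# Hypothetical per-sector vocabularies used to fill resume sections.
SECTOR_VOCAB = {
    "IT": {
        "titles": ["Backend Developer", "Data Analyst", "QA Engineer"],
        "skills": ["Python", "SQL", "Docker", "Git", "Linux"],
    },
    "finance": {
        "titles": ["Accountant", "Financial Analyst", "Auditor"],
        "skills": ["Excel", "Reporting", "Budgeting", "Forecasting", "SQL"],
    },
}


def generate_resume(sector, rng, n_skills=3):
    """Create one synthetic resume as a dict with a fixed set of sections."""
    vocab = SECTOR_VOCAB[sector]
    return {
        "sector": sector,
        "title": rng.choice(vocab["titles"]),
        "skills": sorted(rng.sample(vocab["skills"], k=n_skills)),
        "years_experience": rng.randint(0, 20),
    }


# A seeded generator keeps the synthetic dataset reproducible.
rng = random.Random(42)
resumes = [generate_resume(sector, rng) for sector in ("IT", "finance", "IT")]
for resume in resumes:
    print(resume)
```

Because every record has the same sections, the output can be mixed with real, anonymized resumes in the same format.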

One of the newer methods is the generative adversarial network (GAN), which can learn to generate data that people struggle to tell apart from real data (often only another model can detect the difference). GANs were introduced in 2014 by Ian Goodfellow and his colleagues, a milestone in modern deep learning (for the purposes of this post, you can treat deep learning as a stand-in for AI), and they are still developing rapidly. It's not hard to find generators on the Internet that create images of dogs that never existed but can fool humans.



Machine Learning (ML) – a field of artificial intelligence devoted to algorithms that improve automatically through experience or exposure to data. Machine learning algorithms build a mathematical model from sample data, called a training set (or learning set), to make predictions or decisions without being explicitly programmed to do so.

Deep Learning (DL) – a subcategory of machine learning based on artificial neural networks with representation learning. It uses neural networks with many layers and is widely applied in areas such as speech recognition and natural language processing.

Neural Network (NN) – a system designed to process information whose structure and operating principle are loosely modeled on the human nervous system. Its most prominent features are the ability to learn from examples and to generalize the acquired knowledge automatically.


Interested? This is just the beginning of our series of posts called Look to the future. More is coming soon!

Grzegorz Reinelt
Machine Learning Engineer