During the sepia-tinged days of psychometric testing, back when we still thought completing an occupational personality questionnaire on a Casio calculator was a pretty neat idea, I was required to sit some tests as part of a graduate recruitment process for a car manufacturer.
Thirty jejune and fresh-faced young undergraduates, who had travelled from all over the country, were ushered into a stale, beige testing room in Essex to take a series of ability tests and a personality questionnaire. The air was thick with test anxiety, the rustle of paper answer sheets, and the gentle hum of cognitive reasoning. Pencils furiously scribbled in answers while we complied with the strictly prescribed admin instructions.
Test administrators gathered at the back of the room, flicking acetate scoring keys over completed answer sheets and manually totalling up raw scores with only a fairly small margin of error in the addition.
Ten years later, the picture started to change as tests began to go online. Twenty years later, despite what the BPS test user standards say, the days of group, face-to-face, supervised test administration, at least for occupational selection testing in the UK, are long gone. The vast majority of people completing tests when applying for jobs do so online and unsupervised.
But this is not a piece about the relative merits of internet-based testing (as it was once called; it’s now just called ‘testing’, much like digital cameras are now just ‘cameras’. Or phones, to be accurate). This article is about something that snuck in along the way, and which looks set to become increasingly opaque and widespread with AI: automated testing.
What just happened?
Automated testing did not happen overnight; there was no big marketing push or flurry of conference papers. Instead, it was the (inevitable?) outcome of moving tests online. Early online tests largely replicated the use of traditional, paper-and-pencil tests; while administration and scoring were automated, interpretation and decision-making remained the responsibility of the test user. This early, limited automation removed the human test administrator from the testing and scoring process but retained human decision-making in the interpretation and use of results.
While we can lament the passing of administrators reading aloud from test instruction cards, there is a certain amount to celebrate. The end of manual-scoring errors; no more anxious exam conditions for test takers; test publishers can now easily create norm groups and monitor the psychometric qualities of their tests using the data gathered online (rather than begging test users to send back copies of completed answer sheets); randomised item-banks become possible; candidates no longer need to travel to a test centre (seriously – the first ability test we put online was for an investment bank who up until then would fly candidates from all over the EMEA region into the UK to take a pre-screening numerical test); we all got to bask in the white heat of modernity.
Sure, there were relative disadvantages and reservations, but test users voted with their feet, and by the 2010s, it was very unusual to receive an order for a paper and pencil test.
The move to full automation
The automated testing process of today was driven by a key advantage of using online tests for occupational testing in recruitment scenarios – test users can remotely test candidates very early in the recruitment process and perform a pre-selection sift before embarking on the more costly and time-consuming face-to-face stages like interview or assessment centre. The most beneficial application of this approach was for high-volume assessment, for example, a graduate recruitment process attracting 5,000 applications for 10 vacancies. Not only could testing happen remotely, but it could also be used to sift out the majority of applications automatically.
At this point, the human test user is finally removed from the entire testing process. Interpretation and decision-making (reject or invite to the next stage) are no longer manual, human interventions. The test user should, at the very least, be involved in the design of the calculation used to determine the success rates that power the automated sifting process.
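To make the idea of a sifting calculation concrete, here is a minimal sketch of the simplest possible version: a fixed cut-off applied to a single test score. The field name and cut-off value are invented for illustration and are not taken from any real platform.

```python
# A minimal, hypothetical sketch of an automated sift.
# The field name ("numerical_score") and the cut-off are invented for
# illustration, not taken from any real assessment platform.

CUT_OFF = 55  # hypothetical standardised cut score agreed with the test user

def sift(candidates):
    """Split candidates into those invited to the next stage and those rejected."""
    invited, rejected = [], []
    for candidate in candidates:
        if candidate["numerical_score"] >= CUT_OFF:
            invited.append(candidate)
        else:
            rejected.append(candidate)
    return invited, rejected

applicants = [
    {"id": "A001", "numerical_score": 62},
    {"id": "A002", "numerical_score": 48},
    {"id": "A003", "numerical_score": 55},
]

invited, rejected = sift(applicants)
print("Invited:", [c["id"] for c in invited])
print("Rejected:", [c["id"] for c in rejected])
```

A rule like this is at least trivially explainable: the test user can see, and defend, exactly why each candidate was rejected. The trouble starts when the calculation stops being this simple.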
What happened next?
The growth of online recruitment made applying for jobs easier – simply fire off CVs to multiple employers. This put selection ratios under pressure. The traditional 3:1 ratio became a rarity for many roles; employers often attracted far too many candidates, many of whom were clearly unsuitable for the role.
Manual sifting was no longer a cost- or time-effective option, and as a result, automated testing became much more widespread. Sifting calculations soon became algorithms of increasing complexity because of the pressure to mitigate the effects of high volumes of applications. Complexity made it increasingly difficult for test users to remain in the loop of the testing process as ‘black-box assessments’ became commonplace, relying on algorithms hidden from view and hard to explain even to qualified test users. In more recent years, the automated testing methodology has been extended beyond the domain of occupational testing into educational assessment. For example, universities under pressure from high application volumes to popular courses may include an automated sifting test to reduce numbers to a level that can be managed using the more traditional interview approach.
Implications for test users
Huge volumes of candidates take automated tests with little or no involvement from a human test user. The algorithms being used to arrive at sifting decisions are hidden inside the assessment platform, sometimes cloaked in mystery and IP sensitivity by the provider. For example, a games-based assessment uses highly complex maths to calculate scores based on candidate behaviours as they play. The assessment provider is unlikely to be in a position to explain where the scores come from other than to cite ‘multiple data points’.
The implication of this for test users is that they are removed entirely from the loop of assessment choice and responsibility. The test user can choose to use the automated, black-box assessment but only with partial information; they cannot know how it works or where its scores come from when choosing the test. This in turn makes it much harder for the test user to fulfil their responsibilities in terms of ensuring fairness, preventing adverse impact, and measuring the validity of their testing process.
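One responsibility that survives automation is monitoring outcomes for adverse impact, which can be done even when the scoring itself is opaque, provided the test user has access to outcome data. The sketch below uses the widely cited four-fifths (80%) rule of thumb to compare selection rates between two groups; the group labels and counts are hypothetical, and a real analysis would also consider sample sizes and statistical significance.

```python
# A minimal sketch of an adverse impact check using the four-fifths (80%) rule.
# The counts below are hypothetical.

def selection_rate(selected, applied):
    return selected / applied

def adverse_impact_ratio(group_a, group_b):
    """Ratio of the lower selection rate to the higher one; each group is (selected, applied)."""
    rate_a = selection_rate(*group_a)
    rate_b = selection_rate(*group_b)
    return min(rate_a, rate_b) / max(rate_a, rate_b)

group_1 = (120, 400)  # 30% of this group passed the automated sift
group_2 = (45, 250)   # 18% of this group passed the automated sift

ratio = adverse_impact_ratio(group_1, group_2)
print(f"Adverse impact ratio: {ratio:.2f}")
if ratio < 0.8:
    print("Below the four-fifths threshold - investigate the sift for adverse impact.")
```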
And what about AI?
The contemporary difficulties concerning the scoring transparency of black-box, automated assessment tools, and their implications for monitoring adverse impact, are often (but not exclusively) associated with providers with a technology rather than an assessment background. New-tech assessments are more likely to lean heavily into hidden decision-making algorithms due to their complexity or IP sensitivity.
And what can be more new-tech than machine learning and AI? The most common form of this technology to be used with assessments is machine learning. A large data set, such as test scores combined with success results, is given to a machine learning model, which explores the data and tries to identify patterns and infer relationships. The success factors in the data set could be something as crude as the hiring decision (who was hired, who was rejected). Data such as biographical details can also be included alongside the test scores. The purpose of the machine learning process in the case of workplace assessments is to determine links in the data that could be used to predict outcomes and therefore drive decision-making processes based on test scores.
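As a rough illustration of what this looks like in practice, the sketch below fits a simple model (logistic regression via scikit-learn, standing in for whatever a vendor actually uses) to invented test scores and past hiring decisions, then uses it to score a new candidate. The point is that the mapping from scores to decisions is learned from historical outcomes rather than written down by a person.

```python
# A toy illustration of machine learning applied to assessment data.
# All numbers are invented; real systems use far larger data sets and more
# complex models, but the principle is the same.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Historical data: each row is [numerical_score, verbal_score] for a past applicant
X = np.array([
    [62, 58], [48, 51], [70, 65], [55, 40],
    [66, 60], [45, 47], [59, 63], [50, 44],
])
# The 'success factor' the model learns from: 1 = hired, 0 = rejected
y = np.array([1, 0, 1, 0, 1, 0, 1, 0])

model = LogisticRegression().fit(X, y)

# The fitted coefficients are the 'algorithm' - no human wrote this rule down
print("Learned coefficients:", model.coef_, "intercept:", model.intercept_)

# Scoring a new candidate: the model outputs a probability of being 'a hire'
new_candidate = np.array([[60, 55]])
print("Predicted probability of hire:", model.predict_proba(new_candidate)[0, 1])
```

If the historical hiring decisions embed a bias, the fitted coefficients will faithfully reproduce it; that is exactly the problem explored below.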
Machine learning creates algorithms that sit inside a black box, generating decisions based on the assessment scores provided for each candidate. There is a further level of mystery attached to these algorithms – they were not programmed by a human; they were created from advanced, correlational calculations within the machine learning process. Nobody knows why the machine uses a particular test score to predict whether a candidate will be a good hire. It has simply found a correlational line of best fit and used that to create the hiring algorithm. (You may know this already, in which case please excuse the mansplaining. This was for those people at the back who haven’t been paying attention.)
The trouble with correlations, no matter how advanced or whether conducted by humans or machines, is that they do not imply causation – regardless of how strong the line of best fit. Yet machine learning infers meaning from the data and uses this to make decisions within automated testing processes, with potentially no input from any human at any stage. The most infamous example of this going wrong was the Amazon recruitment tool, which used algorithms generated by machine learning from previous hiring data. These algorithms were based on meaning inferred from the data and led to a significant gender bias in the automated decisions. It was only when a human revisited the algorithms that the gender bias lurking in the original hiring data was identified. The recruitment tool was removed from service.
This is not to argue that machine learning has no place in the development of assessments; it is a powerful statistical tool that could power test design. The danger is the absence of any scrutiny of the algorithms before they are baked into a black-box, decision-making tool. The danger for test users, who by this stage really have no way of understanding how a test process is making decisions about actual humans, is that the seductive lure of adopting an ‘AI-powered’ assessment leads them to make automated decisions about people that are wrong. And unfair and discriminatory.
On a note of good news, the EU has published the final text of the Artificial Intelligence Act, which will work alongside existing data regulations (GDPR) to regulate tools including “high-risk applications, such as a CV-scanning tool that ranks job applicants”. These will now be subject to specific legal requirements. The law will come into effect over the next two years, and a core principle is explainability – the extent to which the algorithms being used can be made transparent and the way in which automated decisions are explained. A key distinction will be between ‘programmatic’ algorithms, which have been created by a human and are therefore more easily explained, and machine-learned algorithms.
The EU AI Act will make test users responsible for explaining their automated hiring processes. This might take some of the sheen off AI-based assessment tools.
Final words
As ChatGPT always says, ‘to summarize’, while the practical benefits of automated psychometric testing are evident, it is essential to approach its implementation ethically and transparently.
Candidates and employees should be informed about the nature and purpose of the assessments, and their privacy should be safeguarded. Organisations must also recognise the complementary nature of psychometric testing to human judgment, using it as a tool to support decision-making rather than replace the human touch in the hiring process.