5  Evaluation

Since we use the large language model for tasks with no strictly right or wrong answers, we cannot employ the evals framework developed and maintained by OpenAI. Instead, we opted for our own method of evaluation. We selected an existing Meta job posting for a Frontend Developer position and had ChatGPT (GPT-3.5) generate approximately 40 applications, which were then evaluated manually. Subsequently, we had GPT-4 evaluate the same 40 applications. To enable a direct comparison, we used the same evaluation categories as in the manual assessments. These categories are detailed below, together with guidelines and examples of what merits a score of 0 or 10 points in each category, to make the evaluation process as impartial as possible.
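To give an idea of how a single application is scored against such a rubric, the sketch below uses the OpenAI Python client with placeholder rubric wording; the function name, prompt text, and temperature setting are illustrative assumptions rather than the exact prompts used in our system.

```python
# Minimal sketch of one scoring call. The rubric text and prompt wording are
# placeholders, not the exact prompts used in our system.
from openai import OpenAI

client = OpenAI()

RUBRIC = (
    "Score the application in each category from 0 to 10 points. "
    "A 0 means the CV shows none of the qualities described for the category; "
    "a 10 means it fully matches the example given in the guidelines. "
    "For every score, provide a justification and a verbatim quote from the CV."
)

def score_application(job_posting: str, cv_text: str,
                      model: str = "gpt-4-1106-preview") -> str:
    """Ask the model to score one generated application against the job posting."""
    response = client.chat.completions.create(
        model=model,
        temperature=0,  # keep the scoring as repeatable as the API allows
        messages=[
            {"role": "system", "content": RUBRIC},
            {"role": "user", "content": f"Job posting:\n{job_posting}\n\nCV:\n{cv_text}"},
        ],
    )
    return response.choices[0].message.content
```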

Our system ran the evaluations with both GPT-3.5-turbo and GPT-4-1106-preview and compared the resulting average scores with the manual evaluations: a gap of 2.0 for GPT-3.5-turbo and only 0.7 for GPT-4-1106-preview. Since average scores can be misleading, we also examined the distribution of scores for each category. This analysis revealed that the model's evaluations were sometimes too high, indicating that it was overestimating the applicants' qualifications; to address this, we implemented an additional evaluation step, described below. During the manual evaluations we also discovered that the human evaluators sometimes missed relevant information in the applications, which could bias their scores, and that the manual evaluations were not always consistent, a common problem in human evaluation. All in all, the evaluation yielded very promising results, indicating that the AI evaluated the applications in the manner we intended and without hallucinations.
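As an illustration of this comparison, the following sketch assumes a hypothetical data layout in which each evaluation is a dict mapping category names to 0-10 scores; the function names are ours and not part of the system described above.

```python
# Hypothetical layout: one dict per application, mapping each evaluation
# category to its 0-10 score.
from collections import Counter
from statistics import mean

def average_score(evaluations: list[dict[str, int]]) -> float:
    """Mean score over all applications and categories."""
    return mean(score for ev in evaluations for score in ev.values())

def average_gap(model_scores: list[dict[str, int]],
                manual_scores: list[dict[str, int]]) -> float:
    """Positive values mean the model scores higher than the manual evaluators."""
    return average_score(model_scores) - average_score(manual_scores)

def score_distribution(evaluations: list[dict[str, int]], category: str) -> Counter:
    """How often each score from 0 to 10 was awarded in a single category."""
    return Counter(ev[category] for ev in evaluations)
```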

A critical step in our evaluation process is requiring the model to provide, for each score, a justification and a supporting quote from the CV. A function within our application then verifies that each quote is actually present in the CV, effectively preventing model hallucinations.
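A simple version of such a check might look like the sketch below; the whitespace and case normalization is an assumption on our part, as the exact matching rules of the verification function are not reproduced here.

```python
import re

def quote_in_cv(quote: str, cv_text: str) -> bool:
    """Return True if the quote returned by the model actually appears in the CV.

    Whitespace and case are normalized so that line breaks or spacing
    introduced by the model do not cause false negatives.
    """
    def normalize(text: str) -> str:
        return re.sub(r"\s+", " ", text).strip().lower()

    return normalize(quote) in normalize(cv_text)
```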

Furthermore, we implemented an additional evaluation step in which the original prompt, together with the model's response including scores, quotes, and justifications, is sent back to the model as a complete package. The model's task is then to self-assess how well its response matches the criteria. This self-assessment is currently only stored in our database and serves as an indicator for developers of the accuracy of the evaluation, both on a general level and for individual CVs.
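Reusing the client from the scoring sketch above, such a self-assessment round trip could look roughly as follows; the wording of the review prompt and the 0-10 rating scale are assumptions made for illustration.

```python
def self_assess(original_prompt: str, model_response: str,
                model: str = "gpt-4-1106-preview") -> str:
    """Ask the model to rate how well its own response matches the criteria."""
    review_prompt = (
        "Below is an evaluation task and the response that was produced for it.\n"
        "Rate from 0 to 10 how closely the response follows the scoring criteria "
        "and briefly justify your rating.\n\n"
        f"Task:\n{original_prompt}\n\nResponse:\n{model_response}"
    )
    review = client.chat.completions.create(
        model=model,
        temperature=0,
        messages=[{"role": "user", "content": review_prompt}],
    )
    # In our system this result is only stored in the database as an
    # indicator for developers; it does not alter the evaluation itself.
    return review.choices[0].message.content
```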

A potential next step would be to introduce a threshold and prompt the model to revise its own response whenever the self-assessment falls below it. This would not only further improve the quality of the AI evaluations but also increase the system's reliability by adding another layer of verification.
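Building on the score_application and self_assess sketches above, such a revision loop could be as simple as the following; the threshold value, the retry limit, and the way a numeric rating is parsed out of the self-assessment text are all hypothetical choices.

```python
import re

SELF_ASSESSMENT_THRESHOLD = 7   # hypothetical cut-off; would need tuning
MAX_RETRIES = 2                 # bound the number of additional API calls

def extract_rating(assessment: str) -> int:
    """Pull the first integer out of the self-assessment text (assumed format)."""
    match = re.search(r"\d+", assessment)
    return int(match.group()) if match else 0

def revise(prompt: str, previous_response: str, assessment: str,
           model: str = "gpt-4-1106-preview") -> str:
    """Ask the model to revise its previous evaluation based on the self-assessment."""
    revision_prompt = (
        f"Task:\n{prompt}\n\nYour previous response:\n{previous_response}\n\n"
        f"Self-assessment of that response:\n{assessment}\n\n"
        "Revise your response so that it better matches the scoring criteria."
    )
    result = client.chat.completions.create(
        model=model,
        temperature=0,
        messages=[{"role": "user", "content": revision_prompt}],
    )
    return result.choices[0].message.content

def evaluate_with_revision(job_posting: str, cv_text: str) -> str:
    """Redo the evaluation as long as the self-assessment stays below the threshold."""
    prompt = f"Job posting:\n{job_posting}\n\nCV:\n{cv_text}"
    response = score_application(job_posting, cv_text)
    for _ in range(MAX_RETRIES):
        assessment = self_assess(prompt, response)
        if extract_rating(assessment) >= SELF_ASSESSMENT_THRESHOLD:
            break
        response = revise(prompt, response, assessment)
    return response
```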