Beyond the Metrics: Lessons from NLP Model Evaluation (2024) • Ryan

Link to the whitepaper

Back in 2024, I had the pleasure of being selected for an AI mentorship program run by my employer at the time: aspirants propose a project they wish to work on to get more hands-on experience with AI tools, and if selected, are paired with a mentor to help guide them through implementing it over the next four months (in addition to one’s day job, of course). Mentees then present their work to peers, leadership, and whoever else is interested. It was a great program not only for getting experience with tools that I probably otherwise wouldn’t have touched, but also for seeing others’ forays. (One gal trained a categorization model to predict not just whether a particular piece of music was, say, jazz or not, but what kind. This may sound trivial for 2026-era AI, but in 2024 it was honestly incredible.)

My particular project was focused on Natural Language Processing (NLP), a subset of computer science devoted to processing natural language using a computer. It is not necessarily synonymous with or related to AI, although the two often overlap. NLP tasks include speech recognition, text classification, and text generation. I was particularly interested in how NLP could be used to help facilitate knowledge management; at the time, AI tools and LLMs weren’t yet widely available in many workplaces, so I specifically wanted to focus on smaller, open-source models that could be run locally on a regular laptop.

I decided to test six models on two types of tasks (three models per task): summarization and Named Entity Recognition (NER). NER identifies, well, Named Entities (like people’s names, place names, and so on) and categorizes them as people, locations, etc. Summarization is…well, it’s just summarization. (Spoiler: except when it isn’t.) One could theoretically combine the two into a pipeline that, say, takes unstructured text like an email or meeting notes, automatically extracts all the named people, tools, organizations, and so on, and provides a summary of what was discussed. These outputs could then be appended as document metadata to make search easier, or reused to generate other kinds of documents based on templates, and so on.
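To make the pipeline idea a bit more concrete, here is a minimal sketch in Python. It is not the code from the whitepaper: the names (`build_metadata`, `toy_ner`, `toy_summarizer`, `GAZETTEER`) are hypothetical, and the two "models" are deliberately trivial stand-ins (a lookup table and a first-sentence extractor) so the example runs without downloading anything. In practice, each stand-in would presumably be replaced by a call to a locally run NER or summarization model.

```python
import re
from dataclasses import dataclass, field
from typing import Callable

# Each stage is just a callable, so a real model can be swapped in later.
EntityExtractor = Callable[[str], list[tuple[str, str]]]  # -> [(text, label), ...]
Summarizer = Callable[[str], str]

# Hypothetical lookup table standing in for a trained NER model.
GAZETTEER = {"Alice Smith": "PERSON", "Acme": "ORG", "Berlin": "LOC"}


@dataclass
class DocumentMetadata:
    summary: str
    entities: dict[str, list[str]] = field(default_factory=dict)


def toy_ner(text: str) -> list[tuple[str, str]]:
    """Placeholder NER: report every gazetteer entry found, in text order."""
    hits = []
    for name, label in GAZETTEER.items():
        for match in re.finditer(re.escape(name), text):
            hits.append((match.start(), name, label))
    return [(name, label) for _, name, label in sorted(hits)]


def toy_summarizer(text: str) -> str:
    """Placeholder summarizer: return the first sentence only."""
    return re.split(r"(?<=[.!?])\s+", text.strip(), maxsplit=1)[0]


def build_metadata(
    text: str, extract_entities: EntityExtractor, summarize: Summarizer
) -> DocumentMetadata:
    """Run both stages and group the entities by label, without duplicates."""
    grouped: dict[str, list[str]] = {}
    for entity, label in extract_entities(text):
        bucket = grouped.setdefault(label, [])
        if entity not in bucket:
            bucket.append(entity)
    return DocumentMetadata(summary=summarize(text), entities=grouped)


notes = (
    "Alice Smith met the Acme team in Berlin. "
    "They discussed the migration plan. Alice Smith will follow up."
)
meta = build_metadata(notes, toy_ner, toy_summarizer)
print(meta.summary)
print(meta.entities)
```

Passing the two stages in as plain functions keeps the pipeline itself indifferent to which model sits behind each one, which is convenient when comparing several candidates on the same task.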

Anyway, I spent the next four months using nights, weekends, and holidays to research, test, and select models; design and generate test data; design, build, and test pipelines in Python; collect, analyze, and contextualize results; and write everything up in this whitepaper (all, of course, with the guidance of my excellent mentor, Jimmy Shue).