Beyond the Metrics: Lessons from NLP Model Evaluation (2024)
Back in 2024, I had the pleasure of being selected for an AI mentorship program run by my employer at the time. Aspirants propose a project that would give them more hands-on experience with AI tools and, if selected, are paired with a mentor who guides them through implementing it over the next four months (in addition to one’s day job, of course). Mentees then present their work to peers, leadership, and whoever else is interested. It was a great program, not only for getting experience with tools I probably wouldn’t otherwise have touched, but also for seeing others’ forays into .
My project focused on natural language processing (NLP). I was particularly interested in how NLP could help facilitate knowledge management. At the time, AI tools and LLMs weren’t yet widely available in many workplaces, so I specifically wanted to focus on smaller, open-source models that could run locally on a regular laptop.
I decided to test six models on two types of tasks (three models per task): summarization and Named Entity Recognition (NER). NER identifies, well, Named Entities (like people’s names, place names, and so on) and categorizes them as people, locations, etc. Summarization is…well, it’s just summarization. (Spoiler: except when it isn’t.) Together, the two could theoretically be combined into a pipeline that, say, takes unstructured text like an email or meeting notes, automatically extracts all the named people, tools, organizations, and so on, and provides a summary of what was discussed. The output could then be attached as document metadata to make search easier, or reused to generate other kinds of documents from templates, and so on.
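To make the pipeline idea a bit more concrete, here is a minimal sketch in Python. It is not the code from my project: the dictionary lookup and the word-frequency summarizer are deliberately toy stand-ins for real NER and summarization models, and every name in it (GAZETTEER, extract_entities, summarize, build_metadata) is hypothetical. It only shows the shape of the thing: text goes in, entities and a summary come out as metadata.

```python
import re
from collections import Counter

# Hypothetical toy gazetteer; a real pipeline would call an NER model here.
GAZETTEER = {
    "Priya": "PERSON",
    "Marcus": "PERSON",
    "Acme Corp": "ORG",
    "Denver": "LOCATION",
}

# Small, illustrative stopword list (an assumption, not a standard one).
STOPWORDS = {"a", "an", "and", "the", "to", "in", "of", "if", "is", "was", "will"}


def extract_entities(text):
    """Return sorted (entity, label) pairs found by whole-word dictionary lookup."""
    found = []
    for name, label in GAZETTEER.items():
        if re.search(r"\b" + re.escape(name) + r"\b", text):
            found.append((name, label))
    return sorted(found)


def _content_words(text):
    """Lowercased words with stopwords removed."""
    return [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS]


def summarize(text, max_sentences=1):
    """Naive extractive summary: keep the sentences with the highest
    summed word frequency. A stand-in for a real summarization model."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(_content_words(text))

    def score(sentence):
        return sum(freq[w] for w in _content_words(sentence))

    top = sorted(sentences, key=score, reverse=True)[:max_sentences]
    # Emit the chosen sentences in their original order.
    return " ".join(s for s in sentences if s in top)


def build_metadata(text):
    """Combine both steps into one metadata record for a document."""
    grouped = {}
    for name, label in extract_entities(text):
        grouped.setdefault(label, []).append(name)
    return {"entities": grouped, "summary": summarize(text)}


if __name__ == "__main__":
    notes = (
        "Priya met Marcus in Denver to plan the migration. "
        "Acme Corp will fund the migration if the migration plan is approved. "
        "Lunch was fine."
    )
    print(build_metadata(notes))
```

Swapping in real models would mean replacing the bodies of extract_entities and summarize while leaving build_metadata alone, which is roughly why the two tasks fit together so neatly on paper.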
Anyway, I spent the next four months using nights, weekends, and holidays to research, test, and select models; design and generate test data; design, build, and test pipelines in Python; collect, analyze, and contextualize results; and write everything up in this whitepaper (all, of course, with the guidance of my excellent mentor, Jimmy Shue).