Abstract

Unstructured text is challenging for knowledge management (KM) systems. Large language models (LLMs) can identify entities and summarize content, but proprietary models such as Claude and ChatGPT can raise concerns about data privacy, transparency, and resource consumption. Open-source natural language processing (NLP) models may provide a viable alternative that prioritizes data privacy and control. This research project compared the out-of-the-box performance of open-source NLP models that can be run locally on a laptop for named entity recognition (NER) and summarization tasks.

Introduction

In recent years, the growing use of NLP models (especially LLMs) has transformed various industries, offering powerful capabilities for tasks like NER and text summarization. These models hold promise for automated knowledge extraction and content summarization, which are essential for improving the efficiency of KM systems. However, as organizations increasingly rely on proprietary LLMs, concerns about data privacy, transparency, and resource consumption are becoming more pronounced. Open-source NLP models have emerged as a potential solution, offering greater control over data privacy and computational costs.

This research investigates the performance of open-source NLP models on two key text processing tasks – NER and summarization – while focusing on their suitability for deployment in resource-constrained environments, such as local laptop deployments. By comparing a diverse set of models (SpaCy, BERT, DistilBERT, BART, T5, and Pegasus), this study provides a detailed evaluation of their performance across three levels of task complexity, assessing not only accuracy but also resource utilization. The goal of this project is to inform organizations about the trade-offs involved in selecting the best open-source models for these tasks, considering both computational demands and accuracy.

Methodology

The following sections provide details about the models selected and the technologies and processes used to generate ground-truth data, evaluate the models, and create the pipelines to test their performance.

Model Selection

Three models were selected for both the NER and summarization tasks, to provide a range of outputs and experiences to compare: for NER, we chose SpaCy, BERT, and DistilBERT; for summarization, we chose T5, BART, and Pegasus. The following sections provide descriptions of each model as well as initial predictions regarding performance.

NER Models

SpaCy

SpaCy is an open-source Python library designed for production NLP tasks. First established in 2015, SpaCy has a robust ecosystem offering NER as well as text classification, rule-based matching, lemmatization, and other common NLP tasks. SpaCy is fast, lightweight, and designed specifically for production environments, and because it is written in Python, it is easy to integrate with existing workflows. Its statistical NER system identifies entities such as companies, locations, works of art, and organizations. SpaCy’s en_core_web_sm model was chosen for this task due to its speed, low resource consumption, and ease of integration; however, we expected trade-offs due to its lower out-of-the-box accuracy compared to transformer models.

BERT

Bidirectional Encoder Representations from Transformers (BERT) is a transformer model for NLP, first developed in 2018 by researchers at Google AI Language and widely distributed through the Hugging Face model hub. BERT was built on a transformer architecture, which uses an attention mechanism to signal which inputs are the most important to process further. This “attention” enables transformer models to process millions of data points efficiently by focusing only on the most relevant inputs. Transformer models can be encoder-only, decoder-only, or both; BERT is encoder-only, which means that the architecture is optimized to receive input and build a representation of (or encode) its features. In BERT’s case, this architecture enabled it to be pre-trained extremely efficiently on a massive data set (specifically, English Wikipedia and the Toronto BookCorpus) using masked language modeling (MLM) and next-sentence prediction (NSP). The MLM approach trains a model by masking, or hiding, a word in a sentence and instructing the model to use the words on either side as context clues to predict the masked word. NSP is similarly used to train BERT to predict whether a given sentence “follows” the previous sentence or not.
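As a rough illustration of the masking step in MLM, the following sketch hides a random subset of tokens so that a model could be trained to predict them from the surrounding context. The `mask_tokens` helper, the whitespace tokenization, and the fixed seed are assumptions made for illustration only; BERT's actual pre-training works on WordPiece tokens and sometimes substitutes a random or unchanged token instead of `[MASK]`. The 15% masking rate is the one reported for BERT.

```python
import random

MASK = "[MASK]"


def mask_tokens(tokens, mask_rate=0.15, seed=0):
    """Return (masked_tokens, targets) for a simplified MLM example.

    targets maps each masked position to the original token that a
    model would be asked to predict.
    """
    rng = random.Random(seed)
    # Mask at least one token so short sentences still yield a target.
    n_to_mask = max(1, round(len(tokens) * mask_rate))
    positions = sorted(rng.sample(range(len(tokens)), n_to_mask))
    masked = list(tokens)
    targets = {}
    for pos in positions:
        targets[pos] = tokens[pos]
        masked[pos] = MASK
    return masked, targets


# Whitespace tokenization is a simplification of real subword tokenization.
tokens = "John Smith met with leaders in Berlin to discuss a trade agreement".split()
masked, targets = mask_tokens(tokens)
print(" ".join(masked))
print(targets)
```

The `targets` dictionary plays the role of the training labels: the model sees the masked sentence and is scored on how well it recovers the hidden words.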

BERT represented a breakthrough in the application of transformer architecture to NLP tasks; as such, we predicted that it would be more accurate out of the box than SpaCy, but likely would come with a higher computational/resource cost.

DistilBERT

DistilBERT is a “distilled” version of BERT which was first proposed in 2019. DistilBERT was created to address cost and privacy concerns around large models like BERT, and to provide a lighter-weight option that preserves much of the accuracy of the original BERT model. Researchers used a compression technique called “knowledge distillation” to train the smaller DistilBERT model (the “student”) to mimic the output distribution of BERT (the “teacher”); the result is a 40% reduction in the number of parameters while retaining more than 95% of the performance. As such, we expected that DistilBERT would perform slightly less well than BERT on some of the NER tasks, while providing better performance in terms of resource consumption and runtime.
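The idea of a student mimicking a teacher's output distribution can be sketched with a small numerical example. The code below computes temperature-softened probabilities for hypothetical teacher and student logits and the KL divergence between them, which is the kind of quantity a distillation loss pushes toward zero. The logits, the temperature value, and the function names are illustrative assumptions; DistilBERT's actual training objective combines several loss terms and is not reproduced here.

```python
import math


def softmax(logits, temperature=1.0):
    """Convert logits to probabilities; a higher temperature flattens them."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same classes."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """Divergence between the softened teacher and student distributions."""
    teacher_probs = softmax(teacher_logits, temperature)
    student_probs = softmax(student_logits, temperature)
    return kl_divergence(teacher_probs, student_probs)


# Hypothetical logits over three labels (e.g., PER, ORG, LOC) for one token.
teacher = [4.0, 1.0, 0.5]
close_student = [3.8, 1.1, 0.4]
far_student = [0.5, 1.0, 4.0]

print(distillation_loss(teacher, close_student))
print(distillation_loss(teacher, far_student))
```

A student whose outputs track the teacher's yields a much smaller loss than one that disagrees, which is the signal used to train the smaller model.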

Summarization Models

BART

Bidirectional and Auto-Regressive Transformers (BART), originally developed by Facebook AI researchers in 2019, is another transformer model. As opposed to BERT, which is an encoder-only model, BART is an encoder-decoder (or sequence-to-sequence) model. The addition of the decoder enables the model to first encode the input and build an internal representation, then apply a decoder to generate an output. Sequence-to-sequence models are well suited to generative tasks, such as summarization, but are also more computationally expensive than their encoder-only counterparts. Summarization is also notoriously difficult for NLP models, so we expected significant trade-offs in resource consumption versus performance.

T5

Text-to-Text Transfer Transformer (T5) is a powerful NLP model developed by Google Research in 2019. As the name suggests, this model converts NLP tasks into a unified text-to-text format (as opposed to other models like BERT, which use various outputs such as class labels, depending on the type of task). This reformulation resulted in a model that is both flexible and performant: its developers reported state-of-the-art results on benchmark tests, and due to its text-to-text format, it can be modified and applied to a wide variety of NLP tasks. However, this level of performance comes at a high computing cost. We decided to test the t5-small model to minimize resource consumption.

Pegasus

Pegasus is a sequence-to-sequence model with the same encoder-decoder architecture as BART. It is pre-trained with the same MLM objective as BERT, as well as a summarization-specific objective called gap-sentence generation (GSG). Pegasus is optimized specifically for abstractive summarization; as opposed to extractive summarization models, which collate relevant chunks from the input, abstractive models generate their own summary based on what they judge to be the most salient information from the input. Pegasus was included to provide an option specifically optimized for abstractive summarization out of the box.

Ground-Truth Data Creation

Approach

To evaluate the performance of the models on diverse and challenging scenarios, we designed a tiered data set creation approach. We initially considered using real-world data (e.g., book passages), but decided to generate sentences programmatically to give more granular control over the input. For both NER and summarization tasks, a tiered approach was used to introduce three levels of increasingly challenging tasks to the models, using sets of 40 sentences per level.

NER sentences had the following characteristics according to level:

Level 1: simple, declarative sentences in simple past or present tense with clear subject-verb-object (SVO) structure. The goal was to identify whether models accurately recognize common entities without any ambiguity.
Level 2: more complex sentences with one or more nested clauses; descriptive sentences that create hierarchy (e.g., titles); prepositional phrases and embedded relationships. The goal was to test models’ performance on correctly identifying entities that are part of more complex or nested structures.
Level 3: sentences with ambiguous or overlapping entities (e.g., “New York Times”, which is an organization but contains a location name); simple, declarative sentences in past or simple present tense. The goal was to determine how accurately models recognize and assign entity types which are overlapping or ambiguous.

Summarization texts had the following characteristics according to level:

Level 1: Simple and direct sentences with clear structure; low complexity; factual descriptions. Short texts of 3 to 5 sentences. The goal was to assess models’ summarization capabilities using straightforward content with clear language.
Level 2: academic and precise, with specialized vocabulary; sentences with medium complexity, including jargon, technical phrases, and pros/cons. Medium-length texts of 5 to 10 sentences. The goal was to assess models’ summarization capabilities using technical jargon and acronyms to ensure they captured the pros/cons accurately.
Level 3: cohesive and detailed multipart narratives with transitional phrases and logical flow. Medium length texts of 5 to 8 sentences. The goal was to test models’ ability to summarize logical narratives involving multiple perspectives and cause/effect relationships.

The sentences in each level were constructed from templates and predefined lists of entities (people, locations, and organizations); each level of sentence complexity was then split into common and fantasy “flavors” of entities to mimic domain-specific language that might further complicate a model’s predictions in production.

The following tables provide details and examples of the tiered approaches used for the NER and summarization tasks.

NER Model Evaluation Tiers

Tier	Sentence Features	Entity Types	Example	Goal
1	Simple, declarative sentences in simple past or present tense with clear subject-verb-object (SVO) structure	Standard names for people, well-known geographical locations, and plausible real-world organizations (“common” entity types)	“John Smith of Innovatech Solutions met with leaders in Berlin to discuss a trade agreement.”	Identify whether models accurately recognize common entities without any ambiguity.
2	Simple, declarative sentences in simple past or present tense with clear subject-verb-object (SVO) structure	Fantasy names for people, geographical locations, and organizations (“fantasy” entity types)	“Guillam Swiftfoot of the Crimson Shield met with leaders in Galeharbor to discuss a trade agreement.”	Identify whether models accurately recognize fantastical entities without any ambiguity.
3	More complex sentences with one or two nested clauses; descriptive sentences that create hierarchy (e.g., titles); prepositional phrases and embedded relationships	Common	“Christina Romero, a senior partner at LuminaGen Systems, relocated to Lagos.”	Test models’ performance on correctly identifying common entities that are part of more complex, nested structures.
4	More complex sentences with one or two nested clauses; descriptive sentences that create hierarchy (e.g., titles); prepositional phrases and embedded relationships	Fantasy	“The Frostborn Clan, led by Caitrona Connorus, met in Hargion City to discuss the upcoming negotiations.”	Test models’ performance on correctly identifying fantastical entities that are part of more complex, nested structures.
5	Sentences with ambiguous/overlapping entities (e.g., “The New York Times” is an organization but contains a location name); simple declarative sentences in past or simple present tense	Common	“Megan White writes for the Sao Paolo Record.”	Determine how accurately models recognize and assign common entity types which are overlapping/ambiguous.
6	Sentences with ambiguous/overlapping entities (e.g., “The New York Times” is an organization but contains a location name); simple declarative sentences in past or simple present tense	Fantasy	“Haderai Delane volunteers for the Ironforge Historical Society.”	Determine how accurately models recognize and assign fantasy entity types which are overlapping/ambiguous.

Summarization Model Evaluation Tiers

Tier	Sentence Features	Entity Types	Example Input	Example Summary	Goal
1	Simple and direct sentences with clear structure; low complexity, factual descriptions. Short (3-5 sentences)	Common	The Eiffel Tower is a wrought-iron lattice tower located in Paris, France. Completed in 1889, it stands as a global symbol of France. It attracts millions of visitors each year.	The Eiffel Tower in Paris, completed in 1889, is a global symbol and popular tourist attraction.	Assess models’ basic summarization capabilities using straightforward, real-world content with clear language.
2	Simple and direct sentences with clear structure; low complexity, factual descriptions. Short (3-5 sentences)	Fantasy	The Umbrite cavern in Isenfell glows with an otherworldly light as the arcane ores shimmer in hues of violet and blue. It was discovered in 267. The Umbrite Cavern remains a mysterious attraction for adventurers drawn by its supposed power to grant mystical visions.	The Umbrite Cavern in Isenfell, discovered in 267, is a mysterious site said to grant visions to adventurers.	Assess models’ basic summarization capabilities using straightforward content with clear language.
3	Academic and precise, with specialized vocabulary. Medium complexity, including jargon, technical phrases, and pros/cons. Medium (5-10 sentences).	Common	The CRISPR-Cas9 gene-editing tool allows researchers to make precise modifications to DNA. This technology has revolutionized genetic research, enabling scientists to edit out harmful genes and potentially cure genetic disorders. Despite its promise, ethical concerns regarding use in human embryos remain a topic of intense debate.	CRISPR-Cas9 enables precise DNA editing, offering potential cures for genetic disorders but raising ethical concerns.	Assess models’ summarization capabilities using real-world technical jargon and pros/cons.
4	Academic and precise, with specialized vocabulary. Medium complexity, including jargon, technical phrases, and pros/cons. Medium (5-10 sentences).	Fantasy	The Arcane Conductor allows celestiologists to channel raw aether energy to power the city’s floating towers. This breakthrough has enabled unprecedented advancements in transportation and defense. However, prolonged exposure to aether waves is known to cause delirium, leading to a condition which scholars have termed “arcane decay”.	The Arcane Conductor channels aether energy to power city infrastructure, but poses risks of ‘arcane decay’ from long exposure.	Assess models’ summarization capabilities using technical jargon and pros/cons.
5	Cohesive and detailed multi-part narratives with transitional phrases, multiple perspectives, and logical flow. 5-8 sentences	Common	The city council debated the budget proposal for weeks. After much deliberation, they decided to allocate more funds to public transportation to reduce traffic congestion. This decision was met with approval from environmental groups. However, local businesses voiced concerns about potential tax increases to cover the new expenses. As a result, the council adjusted the plan to balance both interests.	The city council approved more funding for public transport to cut traffic, after getting support from environmental groups and addressing business concerns with budget adjustments.	Test models’ ability to summarize logical narratives involving multiple perspectives and cause/effect relationships.
6	Cohesive and detailed multi-part narratives with transitional phrases, multiple perspectives, and logical flow. (5-8 sentences)	Fantasy	The Royal Forest Guardians decided to implement a new enchantment strategy to protect endangered magical creatures. This strategy was developed after extensive research and collaboration with local sorcerers and druids. Elven councils praised the initiative for its potential to preserve magical biodiversity. However, villagers were concerned about restrictions on land use and the potential impact on their livelihoods. To address these concerns, the Guardians included provisions for sustainable land use practices that benefit both conservation and local livelihoods.	The Royal Forest Guardians implemented a new enchantment strategy to protect endangered creatures, after collaborating with sorcerers and addressing villager concerns with sustainable practices.	Test models’ ability to summarize logical narratives involving multiple perspectives and cause/effect relationships.

Tools and Techniques

We used a mix of generative AI and Python scripts (built and run in Jupyter notebooks) to ingest entities and generate ground-truth annotation files. The following sections detail the specific tools and processes used to generate NER and summarization ground-truth data sets.

NER Ground-truth Generation Procedure

For the NER sentence generation procedure, lists of 50 entities per entity type were first generated using Copilot and collected in a single JSON file for ease of integration into generation scripts. These entity types included:

Common people names (e.g., “Scott Wilson”, “Claire Brown”)
Common locations (e.g., “Tokyo”, “Moscow”, “Toronto”)
“Real-world” (i.e., fictional but plausible) organization names (e.g., “BioCore Dynamics”, “NanoSphere Innovations”)
Fantasy people names (e.g., “Lyra Marpessa”, “Ganymede Croft”)
Fantasy location names (e.g., “Galeharbor”, “Heagate”)
Overlaps (i.e., terms which can be added to a location to form an overlapping organization, like “Times” in “New York Times”) (e.g., “Society”, “College of Art”)

We then created sentence templates according to each complexity level and stored the templates in separate JSON files according to tier. Finally, we created a Jupyter notebook (generate_sentences.ipynb) which used Python to load entities from the JSON file and then plugged random entities from the appropriate lists into the templates to generate sentences and corresponding ground-truth annotations. Each tier’s sentences and annotations were saved in separate JSON files for easy inference and comparison.
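The template-filling step can be sketched roughly as follows. This is a minimal illustration, not the contents of generate_sentences.ipynb: the entity lists, the template, the `generate_example` helper, and the output field names are assumptions, and the real notebook loads its entities and templates from JSON files.

```python
import random

# Hypothetical entity lists and template; the project's JSON files are not
# reproduced here.
ENTITIES = {
    "PER": ["John Smith", "Claire Brown"],
    "ORG": ["BioCore Dynamics", "NanoSphere Innovations"],
    "LOC": ["Tokyo", "Berlin"],
}
TEMPLATE = "{PER} of {ORG} met with leaders in {LOC} to discuss a trade agreement."


def generate_example(template, entities, rng):
    """Fill one template and return the sentence with character-offset annotations."""
    chosen = {label: rng.choice(names) for label, names in entities.items()}
    sentence = template.format(**chosen)
    annotations = []
    for label, name in chosen.items():
        # Assumes each entity string occurs exactly once in the sentence.
        start = sentence.index(name)
        annotations.append(
            {"entity": name, "start": start, "end": start + len(name), "label": label}
        )
    annotations.sort(key=lambda ann: ann["start"])
    return {"sentence": sentence, "annotations": annotations}


rng = random.Random(42)  # fixed seed so the generated set is reproducible
example = generate_example(TEMPLATE, ENTITIES, rng)
print(example["sentence"])
for ann in example["annotations"]:
    print(ann)
```

Because the annotations are derived from the same values that fill the template, the ground truth is correct by construction and needs no manual labeling.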

Summarization Ground-Truth Generation Procedure

We provided tier descriptions, contextual details, and sample texts to Copilot and used it to generate additional JSON objects containing input text and ground-truth summary pairs for each complexity tier. The following is an example prompt.


Tier 1: Simple, descriptive summaries (common)
*Goal*: assess models' basic summarization capabilities, using straightforward, real-world content with clear language.
*Extract features*:
*Language*: simple and direct with clear structure.
*Length*: Short (3-5 sentences)
*Complexity*: Low; single-topic, factual descriptions.
*Sample template*: "{Landmark} is {description} located in {place}. {Built/discovered/founded} in {year}, it is {known for unique feature}."
**Example**:
*Input*: "The Eiffel Tower is a wrought-iron lattice tower located in Paris, France. Completed in 1889, it stands as a global symbol of France. The Eiffel Tower attracts millions of visitors each year."
**Summary**: "The Eiffel Tower in Paris, completed in 1889, is a global symbol and popular tourist attraction."
Please give me 40 more inputs/summary pairs like this, as a JSON object with fields for tier, input text, and ground-truth summary, like this: {"tier": "tier1", "input_text": <input_text>, "ground_truth_summary": <ground_truth_summary> }

The JSON objects containing input text and summaries were all manually evaluated for suitability and stored in JSON files per tier for later inference and evaluation.

Evaluation Metrics

To comprehensively evaluate model performance, we focused on both task-specific metrics and resource utilization metrics:

Resource metrics (all): Runtime, CPU usage, memory usage
NER performance metrics: Precision, recall, and F1 score
Summarization metrics: ROUGE, BLEU, METEOR, and BERTScore

The following sections provide more information about the metrics, why they were chosen, and how they were captured.

Resource Metrics

Runtime measures the total time taken by a model to process the input data set. Efficient runtime is crucial for applications requiring real-time or near-real-time performance. We used the time Python library to capture runtime in inference pipelines.
CPU usage captures the percentage of processor resources utilized during model inference. Lower CPU usage indicates a lighter-weight model and, conversely, higher CPU usage indicates a more resource-intensive model. The CPU usage of a particular model may be a limiting factor, depending on the resource constraints of the environment or the size of the data set. We used the psutil Python library to capture CPU usage.
Memory usage measures the amount of random access memory (RAM) consumed during inference. Models with a lower memory footprint are generally better suited for deployment on edge devices or systems with limited resources. We again used the psutil Python library to capture memory usage.
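A wrapper along the following lines could capture all three resource metrics around a single inference call. It is a sketch under stated assumptions rather than the project's actual code: the `measure` helper and its output keys are hypothetical, memory is reported as the change in resident set size (one of several reasonable choices), and the psutil import is guarded so the runtime measurement still works where psutil is not installed.

```python
import os
import time

try:
    import psutil  # third-party package, installed separately
except ImportError:  # keep the sketch usable without psutil
    psutil = None


def measure(fn, *args, **kwargs):
    """Run fn once and report wall-clock runtime plus CPU and memory readings."""
    process = psutil.Process(os.getpid()) if psutil else None
    if process:
        mem_before = process.memory_info().rss
        process.cpu_percent(interval=None)  # prime the CPU counter
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    metrics = {"runtime_s": time.perf_counter() - start}
    if process:
        # CPU utilization since the priming call above.
        metrics["cpu_percent"] = process.cpu_percent(interval=None)
        metrics["memory_delta_mb"] = (process.memory_info().rss - mem_before) / 1024**2
    return result, metrics


# Stand-in workload; in a pipeline this would be the model's inference call.
result, metrics = measure(lambda: sum(i * i for i in range(200_000)))
print(metrics)
```

Measuring immediately before and after the call keeps model loading and file I/O out of the reported figures, provided those steps happen outside the wrapped function.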

NER Performance Metrics

We decided to measure the NER models’ performance using a mix of evaluation metrics to give a well-rounded picture. Depending on the use case (and the implications of different errors), teams may wish to favor either precision or recall, or may wish to use the F1 score for a more holistic metric.

Precision measures the accuracy of positive predictions made by the model (“Out of all the positive predictions this model made, how many are actually true?”). High precision should therefore be prioritized when trying to avoid false positives (for example, in spam detection).
- Equation: Precision = True Positives (TP) / (TP + False Positives [FP])
Recall, also known as the true positive rate or sensitivity, measures the models’ overall completeness in identifying all relevant instances (i.e., minimizing the number of false negatives). High recall is critical in use cases such as medical diagnoses, where overlooking positives is unacceptable.
- Equation: Recall = TP / (TP + False Negatives [FN])
Precision and recall are often in tension: tuning a model toward higher precision tends to lower recall, and vice versa. It can therefore be tricky for teams to know how to weigh them against each other. The F1 score, the harmonic mean of precision and recall, balances the two; thus, it is a good metric in situations where there is not a specific need to prioritize precision or recall.
- Equation: F1 = 2 * (Precision * Recall) / (Precision + Recall)

Precision, recall, and F1 scores were all calculated using the scikit-learn metrics library in Python.
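As a worked example of the three equations above, the following sketch computes the metrics directly from match counts. The counts are hypothetical and the `precision_recall_f1` helper is an illustrative stand-in, not the scikit-learn call used in the pipelines.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from raw match counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts: 90 correct entities, 10 spurious, 5 missed.
precision, recall, f1 = precision_recall_f1(tp=90, fp=10, fn=5)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
# prints: precision=0.900 recall=0.947 f1=0.923
```

With these counts, the F1 score lands between precision and recall and sits closer to the lower of the two, which is the behavior of a harmonic mean.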

Summarization Performance Metrics

Summarization is a notoriously difficult task for NLP models to perform and is also quite challenging to benchmark. Unlike NER results, which can be easily assessed as positive or negative (i.e., the model either does or does not correctly identify entities and/or their types), results from text generation tasks are stochastic and involve subjective evaluation. Nevertheless, there are several metrics available for researchers to assess text generation tasks like summarization, including Bilingual Evaluation Understudy (BLEU), Recall-Oriented Understudy for Gisting Evaluation (ROUGE), Metric for Evaluation of Translation with Explicit Ordering (METEOR), and BERTScore:

ROUGE measures the overlap of n-grams between the model’s predicted summary and the ground-truth summary, with emphasis on recall; in other words, how many of the words/n-grams in the ground-truth summaries appear in the model-generated summaries.
BLEU provides a score from zero to one comparing how closely a predicted translation matches a reference sentence, with emphasis on precision; in other words, how many of the model-generated words/n-grams appear in the ground-truth summaries. In addition, BLEU imposes a brevity penalty, which penalizes model results which are shorter than the length of a ground-truth reference. (Note: BLEU was originally designed for translation, but is commonly used for other NLP tasks as well.)
METEOR measures the quality of model-predicted text based on the harmonic mean of unigram precision and recall, with recall weighted higher than precision. It is like the F1 score in that it provides a more balanced view than either precision or recall alone.
BERTScore uses contextual embeddings to determine the similarity between ground-truth and predicted text, so it provides a semantic comparison (meaning to meaning) as opposed to word matching (word to word). While it is good at capturing semantic similarities, it may not provide an indication of the quality of the summary. It is also dependent on the embedding model; fine-tuned models may be required for certain domains.
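To make the word-overlap idea behind ROUGE and BLEU concrete, the sketch below computes clipped unigram recall and precision between a reference summary and a prediction. It is a deliberately simplified stand-in: the `unigram_overlap` helper is hypothetical, tokens are split on whitespace with punctuation left attached, and the full metrics (higher-order n-grams, BLEU's brevity penalty, stemming) are not implemented. The paraphrased summary is invented for illustration.

```python
from collections import Counter


def unigram_overlap(reference, prediction):
    """Return (recall, precision) of clipped unigram overlap.

    Recall mirrors the emphasis of ROUGE-1 and precision mirrors the
    unigram component of BLEU. Assumes both texts are non-empty.
    """
    ref_counts = Counter(reference.lower().split())
    pred_counts = Counter(prediction.lower().split())
    # Counter intersection keeps the minimum count per token (clipping).
    overlap = sum((ref_counts & pred_counts).values())
    recall = overlap / sum(ref_counts.values())
    precision = overlap / sum(pred_counts.values())
    return recall, precision


reference = (
    "The Eiffel Tower in Paris, completed in 1889, is a global symbol "
    "and popular tourist attraction."
)
# Hypothetical paraphrase that keeps the meaning but changes the wording.
paraphrase = (
    "Finished in 1889, the Parisian landmark draws many visitors "
    "and represents France worldwide."
)

print(unigram_overlap(reference, reference))
print(unigram_overlap(reference, paraphrase))
```

The paraphrase shares only a handful of function words with the reference, so both scores are low even though a human reader would accept it as a reasonable summary.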

ROUGE and BLEU are perhaps the best known evaluation metrics for text generation tasks, but for several reasons, they were not deemed a good fit for this summarization task:

Measuring the number of exact word matches between ground-truth and predicted text may be a viable strategy for tasks like translation, but not necessarily for summarization. There are many ways of summarizing a given text, and the fact that two summaries use different words does not indicate that one is more accurate than the other.
Similarly, BLEU’s brevity penalty could unfairly penalize models which do a better job of preserving the gist of a text in a concise way – in this case, longer does not necessarily mean better, and should therefore not be preferred.

BERTScore was deemed to be the best fit for summarization evaluation due to its ability to compare results on a semantic basis, as opposed to just word by word. BERTScore can be calculated using the bert_score library in Python.

Pipeline Setup

All of the model pipelines were written in Python and run and tested in Jupyter notebooks. The following sections break down the steps included across model pipelines for NER and summarization tasks.

NER Pipelines

The NER model evaluation pipelines for all three models included the following steps:

Load relevant libraries and define functions to read input files and write results to CSV and JSON
Extract ground-truth annotations and input sentences from the JSON file
Create pre- and post-processing functions to:
- Format model predictions to match the ground-truth format (i.e., creating lists of formatted entity predictions including “entity”, “start”, “end”, and “label” values)
- Normalize both ground-truth annotations and model predictions to ensure “apples to apples” comparison, including normalizing capitalization (i.e., making all entries lower-case), stripping leading/trailing white space, and stripping “the” from entities where applicable
- SpaCy only: map model output labels to ground-truth annotation labels (e.g., SpaCy’s “geopolitical entity” [GPE] maps to ground-truth “LOC”). See Appendix for full mapping of SpaCy labels to ground-truth annotation labels.
Normalize ground-truth annotations
Record initial memory and CPU usage and start the timer
Run the NER model over the input sentences, then store formatted and normalized predictions
Record final resource and runtime metrics and calculate total and average usage
Match predicted entities with ground-truth annotations, comparing both entity span and type, and classify each result:
- Exact matches: Both the span (start and end character offsets) and type (label) match
- Partial matches: Spans overlap but types are mismatched, and/or spans are incomplete
- False negatives: Ground-truth entities not predicted by the model
- False positives: Predicted entities with no corresponding ground-truth match
Calculate evaluation metrics (precision, recall, F1 score) based on matches
Store the results to CSV and JSON files for further analysis and visualization
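The normalization and matching steps above can be sketched as follows. This is an illustrative reconstruction, not the project's pipeline code: the function names are hypothetical, shifting the start offset when a leading “the” is stripped is an assumption about how text normalization and span comparison fit together, and the two-pass matching order is one reasonable design choice among several.

```python
def normalize(entity):
    """Trim whitespace, drop a leading 'the', lower-case, and keep offsets in sync."""
    raw = entity["entity"]
    left_trimmed = raw.lstrip()
    start = entity["start"] + (len(raw) - len(left_trimmed))
    text = left_trimmed.rstrip()
    if text.lower().startswith("the "):
        text = text[4:]
        start += 4
    return {"entity": text.lower(), "start": start,
            "end": start + len(text), "label": entity["label"]}


def spans_overlap(a, b):
    """True when two character spans share at least one character."""
    return a["start"] < b["end"] and b["start"] < a["end"]


def classify_matches(gold, predicted):
    """Sort predictions into exact, partial, false-positive, and false-negative buckets."""
    remaining = [normalize(g) for g in gold]
    exact, partial, false_positives, unmatched = [], [], [], []
    # Pass 1: exact matches need an identical span and label.
    for pred in map(normalize, predicted):
        hit = next((g for g in remaining
                    if (g["start"], g["end"], g["label"])
                    == (pred["start"], pred["end"], pred["label"])), None)
        if hit is None:
            unmatched.append(pred)
        else:
            exact.append(pred)
            remaining.remove(hit)
    # Pass 2: leftover predictions that overlap a leftover gold span are partial.
    for pred in unmatched:
        hit = next((g for g in remaining if spans_overlap(g, pred)), None)
        if hit is None:
            false_positives.append(pred)
        else:
            partial.append(pred)
            remaining.remove(hit)
    return {"exact": exact, "partial": partial,
            "false_positives": false_positives, "false_negatives": remaining}


# Offsets refer to: "Megan White writes for the Sao Paolo Record."
gold = [
    {"entity": "Megan White", "start": 0, "end": 11, "label": "PER"},
    {"entity": "Sao Paolo Record", "start": 27, "end": 43, "label": "ORG"},
]
# Hypothetical model output: one clean hit and one span that includes "the".
predicted = [
    {"entity": "Megan White", "start": 0, "end": 11, "label": "PER"},
    {"entity": "the Sao Paolo Record", "start": 23, "end": 43, "label": "ORG"},
]
buckets = classify_matches(gold, predicted)
print({name: len(items) for name, items in buckets.items()})
```

Running the exact pass before the partial pass prevents a loosely overlapping prediction from consuming a gold entity that a later prediction matches exactly.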

Summarization Pipelines

Note: During the testing phase, we observed that the chosen models predominantly produced extractive summaries. As the primary goal of the evaluation was to assess abstractive summarization capabilities, these models were deemed unsuitable for the desired task.

The input pre-processing and model inference steps were implemented as planned, focusing on ensuring compatibility with each model’s tokenization and decoding requirements. However, the outputs consistently aligned more closely with extractive rather than abstractive summarization (i.e., copying text word-for-word). These results are detailed in the Results and Discussion: Summarization Models section.

Given the divergence between model outputs and the project’s intended focus, we elected not to proceed with detailed metrics evaluation for the summarization task. This decision enabled us to redirect efforts towards refining the methodology for future work. This overall finding underscores the importance of aligning model capabilities with task requirements in early project stages.

Results and Discussion

The following sections outline the performance of the chosen NLP models across the six input tiers, followed by an evaluation of their relative strengths and limitations.

NER Models

Results

Performance Metrics

As predicted in the Methodology: NER Models section, SpaCy proved to be the lightest-weight, least resource-intensive, and fastest model, followed by DistilBERT and BERT. SpaCy easily outpaced both transformer-based models in terms of runtime and memory usage:

SpaCy was consistently faster, with an average runtime of 0.18s per tier and lower memory usage at 0.76MB.
BERT and DistilBERT both showed significantly higher runtime (with averages of 1.41s and 0.74s, respectively) and greater memory consumption (averaging 326MB and 164MB respectively).

These results track with expectations, as SpaCy is a lighter-weight model with a simpler architecture compared to the transformer-based BERT and DistilBERT, which are computationally more intensive. These resource differences are particularly important when selecting a model for environments with limited computational power or where real-time performance is crucial, such as edge devices or systems with strict latency requirements. The size and runtime of the models may be a particular concern for users who want to deploy a model locally on a laptop, as in this experiment.

Average Runtime (seconds)

Input File	bert	distilbert	spacy
tier_1_sentences	1.44	0.71	0.16
tier_2_sentences	1.40	0.75	0.20
tier_3_sentences	1.52	0.76	0.18
tier_4_sentences	1.43	0.80	0.18
tier_5_sentences	1.28	0.70	0.16
tier_6_sentences	1.38	0.69	0.20

Average Memory Usage (MB)

Input File	bert	distilbert	spacy
tier_1_sentences	326.76	164.54	0.28
tier_2_sentences	327.19	165.18	1.68
tier_3_sentences	324.56	164.61	0.52
tier_4_sentences	326.74	164.81	0.67
tier_5_sentences	325.44	164.06	0.55
tier_6_sentences	326.33	166.40	0.88

Average CPU Usage (%)

Input File	bert	distilbert	spacy
tier_1_sentences	31.51	34.23	7.44
tier_2_sentences	33.63	31.96	3.85
tier_3_sentences	31.70	33.86	6.93
tier_4_sentences	34.16	32.83	3.78
tier_5_sentences	37.71	30.13	3.78
tier_6_sentences	32.29	32.37	2.57

Averages Per Model

Model	Precision	Recall	F1 Score	Runtime (s)	Memory Usage (MB)	CPU Usage (%)
BERT	0.877	0.945	0.908	1.408	326.170	33.500
DistilBERT	0.887	0.963	0.922	0.735	164.933	32.563
SpaCy	0.773	0.877	0.818	0.180	0.763	4.725
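The per-model averages can be reproduced from the per-tier tables. The short sketch below does this for runtime, using the values from the Average Runtime table; the variable names are illustrative, and the same approach applies to the memory and CPU tables.

```python
from statistics import mean

# Per-tier runtimes in seconds, copied from the Average Runtime table.
runtime_by_model = {
    "bert": [1.44, 1.40, 1.52, 1.43, 1.28, 1.38],
    "distilbert": [0.71, 0.75, 0.76, 0.80, 0.70, 0.69],
    "spacy": [0.16, 0.20, 0.18, 0.18, 0.16, 0.20],
}

averages = {model: round(mean(values), 3) for model, values in runtime_by_model.items()}
print(averages)

# How many times longer each transformer model ran compared to SpaCy.
slowdown = {
    model: round(averages[model] / averages["spacy"], 1)
    for model in ("bert", "distilbert")
}
print(slowdown)
```

On these figures, BERT took roughly 7.8 times as long as SpaCy and DistilBERT roughly 4.1 times as long.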

Accuracy metrics

As anticipated, there was a distinct trade-off between resource consumption and accuracy among the selected NER models. Although SpaCy outperformed the transformers in terms of speed and resource efficiency, its F1 score was lower than those of both BERT and DistilBERT. As noted in the NER Performance Metrics section, the F1 score is a balanced metric between precision and recall, and thus provides a good indicator of overall model performance:

SpaCy had an average F1 score of 0.818, which is notably lower than either BERT (0.908) or DistilBERT (0.922).
In addition to outperforming SpaCy, DistilBERT actually also outperformed its brawnier older brother in both precision and recall (and therefore F1 score), with the following averages:
- BERT: Precision = 0.877; Recall = 0.945; F1 Score = 0.908
- DistilBERT: Precision = 0.887; Recall = 0.963; F1 Score = 0.922

This makes DistilBERT the most accurate model across the board for this use case, while also being considerably more resource-efficient than BERT.
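As a quick check of how the F1 score relates to precision and recall, the following minimal sketch computes the harmonic mean from the averaged precision and recall values reported above. The function name `f1_score` is illustrative and is not taken from the study's pipeline. The results can differ slightly from the reported average F1 scores, presumably because those were averaged across per-tier F1 scores rather than derived from averaged precision and recall.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Averaged precision and recall per model, as reported in this section.
averages = {
    "BERT": (0.877, 0.945),
    "DistilBERT": (0.887, 0.963),
    "SpaCy": (0.773, 0.877),
}

for model, (p, r) in averages.items():
    print(f"{model}: F1 from averaged P/R = {f1_score(p, r):.3f}")
```

Because the harmonic mean is pulled toward the lower of the two inputs, a model cannot reach a high F1 score by excelling at only one of precision or recall.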

Error Analysis

In addition to evaluation metrics, we reviewed the actual errors resulting from comparing the models’ outputs to ground-truth annotation. Error analysis is crucial for achieving a more holistic understanding of the models’ performance: a high F1 score may initially indicate that an NER model is performing well numerically, but without examining the errors themselves, teams risk missing subtle but critical insights into the patterns of model output that can lead to false negatives or false positives. In other words, certain types of model output may be valid for a team’s requirements even if technically considered an “error” strictly in terms of precision/recall calculation; likewise, the calculations for accuracy metrics themselves may be distorted.

In analyzing the chosen models’ outputs against ground-truth annotations, we observed several types of recurring error which highlight the complexity of automated entity extraction. The following sections provide more detail about each type of error and potential impacts.

Span mismatches

In the majority of cases (63%), errors resulted from span mismatches (as opposed to semantic type mismatches, e.g., PER, ORG, and LOC). These span mismatches included several sub-types:

Title omission: the models frequently omitted honorifics, titles, and/or prefixes, such as Dr., Captain, Lady, and Sir (e.g., Captain Rainald Quarriman vs Rainald Quarriman). In some applications (e.g., historical records, address fields), missing a title may alter the meaning. Removing sections of a name (especially if inconsistently) may also introduce duplicate entities in post-inference pipelines. In casual applications, however, honorifics and titles may be an acceptable omission.
Multi-word splitting: the models sometimes incorrectly segmented named entities into separate words, treating multi-word locations and proper names as two entities instead of one. As anticipated, the “overlap” nested entity types (e.g., Lanius College of Medicine) comprised a significant percentage of the errors, and fantasy names also proved difficult for the models to parse. For location-based services, missing the full entity span might lead to incorrect mappings (e.g., distinguishing New York from York).
Article omission: DistilBERT in particular sometimes classified the definite article (the) in fantasy sentences as an entity separate from the actual entity identified in the ground-truth annotation (e.g., the Frostwolf clan). Post-processing could check for predicted entities that consist solely of articles.
Punctuation and spacing errors (e.g., Araven Grey-Gull vs Araven Grey - Gull): the models occasionally inserted unwanted spaces around punctuation, resulting in false positives and negatives.
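The post-processing ideas suggested by the list above (dropping article-only predictions, repairing spacing around hyphens, and optionally stripping titles before comparison) might be sketched as follows. This was not part of the study's pipeline; the helper names, the article list, and the title list are assumptions chosen for illustration.

```python
import re

ARTICLES = {"the", "a", "an"}                # assumed article list
TITLES = {"dr.", "captain", "lady", "sir"}   # assumed title list


def fix_hyphen_spacing(text: str) -> str:
    """Collapse spaces around hyphens: 'Grey - Gull' -> 'Grey-Gull'."""
    return re.sub(r"\s*-\s*", "-", text)


def strip_title(text: str) -> str:
    """Remove one leading title or honorific, if present."""
    first, _, rest = text.partition(" ")
    return rest if first.lower() in TITLES and rest else text


def clean_entities(entities):
    """Drop article-only predictions and normalize hyphen spacing."""
    cleaned = []
    for text, label in entities:
        text = fix_hyphen_spacing(text.strip())
        if text.lower() in ARTICLES:
            continue  # a bare article is not a meaningful entity
        cleaned.append((text, label))
    return cleaned


predictions = [
    ("the", "ORG"),
    ("Frostwolf clan", "ORG"),
    ("Araven Grey - Gull", "PER"),
]
print(clean_entities(predictions))
print(strip_title("Captain Rainald Quarriman"))
```

Whether to apply `strip_title` depends on the use case: it helps deduplicate entities, but it discards information that may matter in historical records or address fields.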

Beyond being a qualitative concern, span mismatches have a quantifiable effect on the calculated precision, recall, and F1 score. The way that errors propagate through the scoring system can obscure the actual type-level performance of the model, making it appear either worse or better than it really is. For example, given the following ground-truth entity and model prediction:

Ground-truth entity: Captain Rainald Quarriman (PER)
Model prediction: Rainald Quarriman (PER)

As designed, our model pipeline designated an output as a false positive if the model predicted an entity which was not present in the ground-truth annotation, and as a false negative if the model failed to predict an entity present in the ground truth. The omission of Captain therefore resulted in both a false negative and a false positive error:

False negative (FN): Captain Rainald Quarriman was not fully predicted.
False positive (FP): Rainald Quarriman was predicted, but does not exactly match the ground-truth.

Thus, precision and recall calculations both suffer:

Precision (correct predictions / all predictions) decreases because the model produced a spurious entry.
Recall (correct predictions / ground-truth entities) decreases because the model failed to identify the whole entity.

If these kinds of span mismatches are frequent, the overall evaluation metrics are skewed downward, not because the model was truly bad at recognizing entities, but because of misalignment with annotation granularity.
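To make this double penalty concrete, the following minimal sketch scores the example above using exact-match comparison of (text, label) pairs. It assumes entities are compared as whole strings, which approximates but may not exactly reproduce the study's pipeline; the function name is hypothetical.

```python
def exact_match_counts(gold, predicted):
    """Count true positives, false positives, and false negatives
    by exact (text, label) match."""
    gold_set, pred_set = set(gold), set(predicted)
    tp = len(gold_set & pred_set)   # predicted and in ground truth
    fp = len(pred_set - gold_set)   # predicted but not in ground truth
    fn = len(gold_set - pred_set)   # in ground truth but not predicted
    return tp, fp, fn


gold = [("Captain Rainald Quarriman", "PER")]
predicted = [("Rainald Quarriman", "PER")]

tp, fp, fn = exact_match_counts(gold, predicted)
precision = tp / (tp + fp) if tp + fp else 0.0
recall = tp / (tp + fn) if tp + fn else 0.0
print(tp, fp, fn, precision, recall)  # 0 1 1 0.0 0.0
```

A single near-miss therefore costs both precision and recall, even though the model found the right person and assigned the right entity type.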

A secondary consequence of span mismatches dominating the error field is that “actual” false positives and false negatives (e.g., identifying several years or failing to identify the organization Dragon’s Claw), as well as type mismatches (e.g., Paris [LOC] labeled as a PER), may be drowned out. If 80% of a model’s errors are span mismatches, the dominant signal in the confusion matrix will be false-positive and false-negative noise from span issues, making true type confusion harder to detect. Thus, span errors not only artificially lower F1 scores, but also obscure meaningful insights into model behavior, making debugging harder.

F1 Scores

Input file	BERT	DistilBERT	SpaCy
tier_1_sentences	0.97	0.96	0.95
tier_2_sentences	0.90	0.88	0.78
tier_3_sentences	0.98	1.00	0.93
tier_4_sentences	0.92	0.92	0.67
tier_5_sentences	0.87	0.89	0.84
tier_6_sentences	0.81	0.88	0.74

Precision

Input file	Bert	DistilBERT	SpaCy
tier_1_sentences	0.97	0.97	0.93
tier_2_sentences	0.86	0.83	0.71
tier_3_sentences	0.98	1.00	0.92
tier_4_sentences	0.89	0.87	0.61
tier_5_sentences	0.82	0.83	0.78
tier_6_sentences	0.74	0.82	0.69

Recall

Input file	Bert	DistilBERT	SpaCy
tier_1_sentences	0.97	0.96	0.97
tier_2_sentences	0.95	0.94	0.87
tier_3_sentences	0.98	1.00	0.95
tier_4_sentences	0.96	0.98	0.75
tier_5_sentences	0.92	0.96	0.91
tier_6_sentences	0.89	0.94	0.81
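The per-model average F1 scores can be reproduced from the per-tier F1 table above. As a minimal sketch, the snippet below takes the unweighted mean of the six tier scores for each model; the dictionary layout is an assumption for illustration and does not reflect how the study stored its results.

```python
# Per-tier F1 scores (tiers 1 through 6), copied from the F1 table.
f1_by_model = {
    "BERT": [0.97, 0.90, 0.98, 0.92, 0.87, 0.81],
    "DistilBERT": [0.96, 0.88, 1.00, 0.92, 0.89, 0.88],
    "SpaCy": [0.95, 0.78, 0.93, 0.67, 0.84, 0.74],
}


def mean(values):
    """Unweighted arithmetic mean."""
    return sum(values) / len(values)


average_f1 = {model: round(mean(scores), 3) for model, scores in f1_by_model.items()}
print(average_f1)
```

An unweighted mean treats every tier equally; if tiers contained different numbers of entities, a weighted average could give a different picture.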

Interpretation and Discussion

Importance of error analysis vs. numerical metrics

One of the biggest takeaways from the experiment was the importance of error analysis beyond numerical metrics. Despite relatively high F1 scores at the outset, the span errors discovered on further review highlight the need for human validation and model evaluation. Instead of blindly trusting precision and recall, a deeper qualitative assessment reveals:

Which errors are trivial versus critical based on the use case (does your use case require perfect entity spans? Are omissions like titles or honorifics acceptable?)
Whether post-processing rules could be used to enhance real-world usability

Thus, teams should be aware that accepting metrics such as F1 score at face value without error inspection may lead to misleading conclusions about their model’s performance, and/or mask underlying issues (in this case, “actual” type mismatches versus errors caused by span mismatches). To combat these issues, teams should consider augmenting metrics with error categorization, adopting case-specific metric adjustments, and leveraging visualizations such as heat maps to highlight systemic patterns.
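One way to augment metrics with error categorization is to compare predictions and ground truth by character offsets, so that span mismatches can be separated from type mismatches and from wholly spurious or missed entities. The following is a minimal sketch under that assumption; the `(start, end, label)` tuple format and the category names are hypothetical and were not part of the study's pipeline.

```python
from collections import Counter


def overlaps(a, b):
    """True if two (start, end, label) spans share at least one character."""
    return a[0] < b[1] and b[0] < a[1]


def categorize_errors(gold, predicted):
    """Assign each prediction to a category and count unmatched gold entities."""
    counts = Counter()
    matched_gold = set()
    for pred in predicted:
        if pred in gold:
            counts["correct"] += 1
            matched_gold.add(pred)
            continue
        partner = next((g for g in gold if overlaps(g, pred)), None)
        if partner is None:
            counts["spurious"] += 1        # no ground-truth entity nearby
        elif partner[:2] == pred[:2]:
            counts["type_mismatch"] += 1   # same span, different label
            matched_gold.add(partner)
        else:
            counts["span_mismatch"] += 1   # boundaries differ (label may too)
            matched_gold.add(partner)
    counts["missed"] = sum(1 for g in gold if g not in matched_gold)
    return counts


text = "Captain Rainald Quarriman left Paris."
gold = [(0, 25, "PER"), (31, 36, "LOC")]       # full name with title; Paris
predicted = [(8, 25, "PER"), (31, 36, "PER")]  # title omitted; wrong type

print(categorize_errors(gold, predicted))
```

Tallying categories this way keeps a large volume of span mismatches from hiding the rarer type confusions, and the counts could feed a heat map per model and tier.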

Model performance comparison

Any organization wishing to select one of the given models for an NER task will have unique requirements concerning efficiency and accuracy. Given these results, we identified the following key takeaways which teams should consider when choosing one of these models:

SpaCy is the best choice of these models when resource efficiency is a priority, such as in environments with constrained or limited hardware or on edge devices. While its accuracy is lower than that of the transformer-based models, it still provides reasonable precision and recall, making it suitable for tasks where speed and memory usage are the main factors being considered. (Its easy integration with Python workflows is also likely to be an asset in these cases.)
DistilBERT is the most balanced option, providing near-state-of-the-art F1 performance (indeed, outcompeting BERT) at a significantly reduced resource cost compared to BERT. It strikes a balance between performance and computational cost, making it a strong candidate for general-purpose NER tasks, especially in scenarios where both speed and accuracy are important but where classic BERT’s resource consumption may be a drawback.
BERT provides similar accuracy to DistilBERT, but at the cost of much greater resource consumption. While it may be usable in situations where computational resources are abundant, BERT is likely to be overkill in resource-constrained environments, without offering a corresponding increase in precision or recall.
In all cases, proper task identification, scoping, pre-processing, and normalization of the data are crucial to getting decent results. It is also critical for teams to spend time familiarizing themselves with the particular nuances and architecture that their selected model is built on, as this has implications for the output and results (e.g., accounting for differences in how models “chunk” entities, or differences in available tags, etc.).

Summarization Models

Performance

As mentioned in the Summarization Pipelines section, during preliminary testing we observed that the chosen models consistently failed to produce summaries that aligned with the ground-truth summaries: in some cases, the models simply repeated the input text word-for-word or even added text, while in other instances, they repeated incomplete chunks of the input text or simply hallucinated. The following table provides details of models’ predictions per tier and how they compare to the ground-truth summaries.

Model	Sentence Tier	Test Input	Ground-Truth Summary	Model Output	Result
BART	1	The Eiffel Tower is a wrought-iron lattice tower located in Paris, France. Completed in 1889, it stands as a global symbol of France. It attracts millions of visitors each year.	The Eiffel Tower in Paris, completed in 1889, is a global symbol and popular tourist attraction.	The Eiffel Tower is a wrought-iron lattice tower located in Paris, France. Completed in 1889. It stands as a global symbol of France. It attracts millions of visitors each year and is a UNESCO World Heritage Site. The tower is made of wrought iron and was completed in 1889.	Verbatim repetition of input, with additional remarks
BART	2	The Umbrite cavern in Isenfell glows with an otherworldly light as the arcane ores shimmer in hues of violet and blue. It was discovered in 267. The Umbrite Cavern remains a mysterious attraction for adventurers drawn by its supposed power to grant mystical visions.	The Umbrite Cavern in Isenfell, discovered in 267, is a mysterious site said to grant visions to adventurers.	The Umbrite Cavern in Isenfell glows with an otherworldly light as the arcane ores shimmer in hues of violet and blue. It was discovered in 267 and has long been a mysterious attraction for adventurers drawn by its supposed power to grant mystical visions.	Near-verbatim repetition of input
BART	3	The CRISPR-Cas9 gene-editing tool allows researchers to make precise modifications to DNA. This technology has revolutionized genetic research, enabling scientists to edit out harmful genes and potentially cure genetic disorders. Despite its promise, ethical concerns regarding use in human embryos remain a topic of intense debate.	CRISPR-Cas9 enables precise DNA editing, offering potential cures for genetic disorders but raising ethical concerns.	The CRISPR-Cas9 gene-editing tool allows researchers to make precise modifications to DNA. This technology has revolutionized genetic research, enabling scientists to edit out harmful genes and potentially cure genetic disorders. Despite its promise, ethical concerns regarding use in human embryos remain a topic of intense debate.	Verbatim repetition of input
BART	4	The Arcane Conductor allows celestiologists to channel raw aether energy to power the city’s floating towers. This breakthrough has enabled unprecedented advancements in transportation and defense. However, prolonged exposure to aether waves is known to cause delirium, leading to a condition which scholars have termed “arcane decay”.	The Arcane Conductor channels aether energy to power city infrastructure, but poses risks of ‘arcane decay’ from long exposure.	The Arcane Conductor allows celestiologists to channel raw aether energy to power the city’s floating towers. This breakthrough has enabled unprecedented advancements in transportation and defense. But prolonged exposure to aether waves is known to cause delirium, leading to a condition which scholars have termed “arcane decay”.	Near-verbatim repetition of input
BART	5	The city council debated the budget proposal for weeks. After much deliberation, they decided to allocate more funds to public transportation to reduce traffic congestion. This decision was met with approval from environmental groups. However, local businesses voiced concerns about potential tax increases to cover the new expenses. As a result, the council adjusted the plan to balance both interests.	The city council approved more funding for public transport to cut traffic, after getting support from environmental groups and addressing business concerns with budget adjustments.	The city council debated the budget proposal for weeks. They decided to allocate more funds to public transportation to reduce traffic congestion. This decision was met with approval from environmental groups. Local businesses voiced concerns about potential tax increases to cover the new expenses. As a result, the council adjusted the plan to balance both interests.	Near-verbatim repetition of input
BART	6	The Royal Forest Guardians decided to implement a new enchantment strategy to protect endangered magical creatures. This strategy was developed after extensive research and collaboration with local sorcerers and druids. Elven councils praised the initiative for its potential to preserve magical biodiversity. However, villagers were concerned about restrictions on land use and the potential impact on their livelihoods. To address these concerns, the Guardians included provisions for sustainable land use practices that benefit both conservation and local livelihoods.	The Royal Forest Guardians implemented a new enchantment strategy to protect endangered creatures, after collaborating with sorcerers and addressing villager concerns with sustainable practices.	The Royal Forest Guardians decided to implement a new enchantment strategy to protect endangered magical creatures. This strategy was developed after extensive research and collaboration with local sorcerers and druids. Elven councils praised the initiative for its potential to preserve magical biodiversity. However, villagers were concerned about restrictions on land use and the potential impact on their livelihoods.	Verbatim repetition of input, with omission of last sentence/outcome
T5	1	The Eiffel Tower is a wrought-iron lattice tower located in Paris, France. Completed in 1889, it stands as a global symbol of France. It attracts millions of visitors each year.	The Eiffel Tower in Paris, completed in 1889, is a global symbol and popular tourist attraction.	, the Eiffel Tower was one of France’s most famous cathedral-inspired buildings, including the Eiffel Tower.	Fragmented/disjointed output; lack of logical flow
T5	2	The Umbrite cavern in Isenfell glows with an otherworldly light as the arcane ores shimmer in hues of violet and blue. It was discovered in 267. The Umbrite Cavern remains a mysterious attraction for adventurers drawn by its supposed power to grant mystical visions.	The Umbrite Cavern in Isenfell, discovered in 267, is a mysterious site said to grant visions to adventurers.	in 267 and was discovered in 267. The Umbrite Cavern - an Isenfell, is a mysterious attraction for adventurers drawn by its supposed power to grant mystical visions.	Fragmented/disjointed output; lack of logical flow
T5	3	The CRISPR-Cas9 gene-editing tool allows researchers to make precise modifications to DNA. This technology has revolutionized genetic research, enabling scientists to edit out harmful genes and potentially cure genetic disorders. Despite its promise, ethical concerns regarding use in human embryos remain a topic of intense debate.	CRISPR-Cas9 enables precise DNA editing, offering potential cures for genetic disorders but raising ethical concerns.	-editing tool lets scientists make precise modifications to DNA. The tool allows scientists to modify the DNA sequence automatically with precise gene modifications. This technology has revolutionized genetic research, enabling scientists to edit out harmful genes and potentially cure genetic disorders. ethical concerns relative to use in embryos remains a topic of intense debate.	Fragmented/disjointed output; lack of logical flow; hallucination (insertion of automatically)
T5	4	The Arcane Conductor allows celestiologists to channel raw aether energy to power the city’s floating towers. This breakthrough has enabled unprecedented advancements in transportation and defense. However, prolonged exposure to aether waves is known to cause delirium, leading to a condition which scholars have termed “arcane decay”.	The Arcane Conductor channels aether energy to power city infrastructure, but poses risks of ‘arcane decay’ from long exposure.	Conductor allows celestiologists to channel raw aether energy to power the city’s floating towers. This breakthrough has enabled even more advancements in transportation and defense. long exposure to aether waves is known to cause delerium, leading to a condition known to cause delerium.	Fragmented/disjointed output; lack of logical flow
T5	5	The city council debated the budget proposal for weeks. After much deliberation, they decided to allocate more funds to public transportation to reduce traffic congestion. This decision was met with approval from environmental groups. However, local businesses voiced concerns about potential tax increases to cover the new expenses. As a result, the council adjusted the plan to balance both interests.	The city council approved more funding for public transport to cut traffic, after getting support from environmental groups and addressing business concerns with budget adjustments.	on budget eradications. for weeks by city council. The council decided to allocate more federal funds to rail to ease congestion. It was met with approval from environmental groups.	Fragmented/disjointed output; lack of logical flow; hallucination (budget eradications, federal funds to rail); omission of latter half/conclusion
T5	6	The Royal Forest Guardians decided to implement a new enchantment strategy to protect endangered magical creatures. This strategy was developed after extensive research and collaboration with local sorcerers and druids. Elven councils praised the initiative for its potential to preserve magical biodiversity. However, villagers were concerned about restrictions on land use and the potential impact on their livelihoods. To address these concerns, the Guardians included provisions for sustainable land use practices that benefit both conservation and local livelihoods.	The Royal Forest Guardians implemented a new enchantment strategy to protect endangered creatures, after collaborating with sorcerers and addressing villager concerns with sustainable practices.	for forests announced last month that an organization had planned about 20 hectares of wilderness. A groves of forest, some villagers feared restrictions on land use before they could exploit a wild land conservation strategy.	Fragmented/disjointed output; lack of logical flow; hallucination (20 hectares of wilderness, villagers exploit wild land…strategy); omission of latter half/conclusion
Pegasus	1	The Eiffel Tower is a wrought-iron lattice tower located in Paris, France. Completed in 1889, it stands as a global symbol of France. It attracts millions of visitors each year.	The Eiffel Tower in Paris, completed in 1889, is a global symbol and popular tourist attraction.	All images are copyrighted.	Complete hallucination; no relevance to input text
Pegasus	2	The Umbrite cavern in Isenfell glows with an otherworldly light as the arcane ores shimmer in hues of violet and blue. It was discovered in 267. The Umbrite Cavern remains a mysterious attraction for adventurers drawn by its supposed power to grant mystical visions.	The Umbrite Cavern in Isenfell, discovered in 267, is a mysterious site said to grant visions to adventurers.	One of the UK’s most spectacular caves has been revealed to the public for the first time.	Complete hallucination; UK not mentioned at all in input text
Pegasus	3	The CRISPR-Cas9 gene-editing tool allows researchers to make precise modifications to DNA. This technology has revolutionized genetic research, enabling scientists to edit out harmful genes and potentially cure genetic disorders. Despite its promise, ethical concerns regarding use in human embryos remain a topic of intense debate.	CRISPR-Cas9 enables precise DNA editing, offering potential cures for genetic disorders but raising ethical concerns.	The world’s first embryos have been created using a new gene-editing technology.	Hallucination (world’s first human embryos have been created); omits tool details and pros/cons/conclusion
Pegasus	4	The Arcane Conductor allows celestiologists to channel raw aether energy to power the city’s floating towers. This breakthrough has enabled unprecedented advancements in transportation and defense. However, prolonged exposure to aether waves is known to cause delirium, leading to a condition which scholars have termed “arcane decay”.	The Arcane Conductor channels aether energy to power city infrastructure, but poses risks of ‘arcane decay’ from long exposure.	Researchers at the Massachusetts Institute of Technology (MIT) have developed a new type of Conductor that harnesses the power of aether waves.	Hallucination (Massachusetts Institute of Technology); omits tool details and pros/cons/conclusion
Pegasus	5	The city council debated the budget proposal for weeks. After much deliberation, they decided to allocate more funds to public transportation to reduce traffic congestion. This decision was met with approval from environmental groups. However, local businesses voiced concerns about potential tax increases to cover the new expenses. As a result, the council adjusted the plan to balance both interests.	The city council approved more funding for public transport to cut traffic, after getting support from environmental groups and addressing business concerns with budget adjustments.	A new budget has been approved for the city of Portland, Oregon.	Hallucination (city of Portland, Oregon); omits tool details and pros/cons/conclusion
Pegasus	6	The Royal Forest Guardians decided to implement a new enchantment strategy to protect endangered magical creatures. This strategy was developed after extensive research and collaboration with local sorcerers and druids. Elven councils praised the initiative for its potential to preserve magical biodiversity. However, villagers were concerned about restrictions on land use and the potential impact on their livelihoods. To address these concerns, the Guardians included provisions for sustainable land use practices that benefit both conservation and local livelihoods.	The Royal Forest Guardians implemented a new enchantment strategy to protect endangered creatures, after collaborating with sorcerers and addressing villager concerns with sustainable practices.	The Royal Forest is one of the world’s most important habitats for rare and endangered species.	Hallucination (one of the world’s most important habitats); omits tool details and pros/cons/conclusion

Interpretation and Discussion

All of the chosen models struggled with the summarization task, each in different ways:

BART mostly repeated the input text verbatim, and occasionally added details (e.g., the Eiffel Tower being a UNESCO World Heritage Site and being made of wrought iron). The exception was Tier 6, which was intended to test models’ ability to summarize texts with cause-and-effect relationships and logical flow. In this case, BART repeated most of the input verbatim with the exception of the final sentence, which contained the logical outcome from the input (arguably the most important detail!).
T5 produced disjointed, fragmented output; these fragments appeared to be slightly altered versions of the original text, pieced back together in ways which did not make sense either semantically or syntactically (e.g., “, the Eiffel Tower was one of France’s most famous cathedral-inspired buildings, including the Eiffel Tower.”). T5 also hallucinated some details, particularly as the input texts grew more complex (i.e., Tiers 3, 5, and 6, which featured more technical jargon and logical flow, all had hallucinated details). Similarly to BART, T5 also omitted important nuance from its summaries of Tiers 5 and 6, which resulted in its failing to capture the logical conclusions of the texts.
Pegasus produced the starkest deviations from the input and ground-truth summaries: it hallucinated details in every tier, as well as entirely omitting pros, cons, and conclusions from the more complex input tiers. Most surprisingly, in Tiers 1 and 2 (designed to be straightforward input texts), the model produced completely fabricated output text (e.g., “All images are copyrighted”).
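The failure modes described above (verbatim repetition versus output with little relation to the input) can be flagged automatically with simple heuristics before manual review. The sketch below uses `difflib.SequenceMatcher` from the Python standard library to measure character-level similarity between input and output; the thresholds and labels are arbitrary assumptions for illustration, not values used in the study.

```python
from difflib import SequenceMatcher


def similarity(source: str, output: str) -> float:
    """Character-level similarity ratio between 0 and 1."""
    return SequenceMatcher(None, source, output, autojunk=False).ratio()


def flag_output(source: str, output: str,
                copy_at: float = 0.9, unrelated_at: float = 0.3) -> str:
    """Roughly classify a model output relative to its input text."""
    ratio = similarity(source, output)
    if ratio >= copy_at:
        return "near-verbatim copy"
    if ratio <= unrelated_at:
        return "little overlap with input"
    return "needs manual review"


source = (
    "The Eiffel Tower is a wrought-iron lattice tower located in Paris, France. "
    "Completed in 1889, it stands as a global symbol of France. "
    "It attracts millions of visitors each year."
)

print(flag_output(source, source))
print(flag_output(source, "All images are copyrighted."))
```

A low ratio does not by itself prove hallucination, since any very short output scores low against a longer input; the flags are intended only to prioritize outputs for human inspection.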

Given the complexity of these models, it is difficult to pinpoint an exact cause for the differences between the expected outcomes and the actual predicted results. These misfires do, however, indicate a general underlying mismatch between this experiment’s design and the selected models’ design/intended use cases. Several factors may have contributed to this gap, individually or in combination:

The input texts may have been too short for the models to effectively parse. This may be because the models used for this experiment were pre-trained on datasets which featured longer input texts (e.g., BART was pre-trained on the “CNN/Daily Mail” dataset, which is comprised of news articles from CNN and the Daily Mail; Pegasus was pre-trained on the HugeNews and C4 “Common Crawl” datasets, which feature text from news articles and webpages, respectively).
The models’ pre-training may have primed them for a different kind of domain specificity (i.e., journalism/news reporting) than that contained in the input text. Pegasus’ hallucination of “all images are copyrighted”, for example, seems to support this theory.
The task itself (and input/ground-truth summaries) was poorly designed/enunciated for NLP models. Specifically, the task of “summarization” was overly general, and the ground-truth summaries were more like “paraphrasing” as opposed to “true summarization” from the perspective of a machine learning/NLP task.

On review, there were several assumptions made at the start of this experiment which would need to be thoroughly questioned in any future experimentation:

Is the proposed task actually aligned with NLP/summarization? “Summarization” in NLP parlance refers to a quite specific (and difficult!) task; namely, either extracting key phrases from a longer text (i.e., extractive summarization) or generating a new text containing the key ideas from an input text (i.e., abstractive summarization). It is crucial for teams to very carefully research and consider the exact task they are expecting the models to perform: “summarization” can mean different things to different humans, which may not neatly align with one of the two specific flavors of NLP summarization.
Is the selected model a good match for the defined task? What kind of input does the model “expect”? This experiment attempted to evaluate summarization models based on criteria like runtime, resource consumption, and prediction accuracy; however, future studies should take a step back and consider more foundational aspects of model suitability before proceeding. In particular, teams should first consider a model’s pre-training (e.g., the dataset it has been trained on), examining elements like input length and domain specificity to ensure a good match for the intended use case. Bigger and more complex models like ChatGPT and Claude may be better able to generalize and “figure it out”, but the smaller a model is, the more precisely it needs to be aligned with the proposed task to ensure a desirable result.
Is the tradeoff worth it? A key takeaway from this experiment is that, even assuming that a model works perfectly as intended out of the box, the amount of effort and time required to research, design tests, and evaluate performance may (and likely will) exceed the potential resource and cost savings (in this case, none!). This experiment was initially conceived with the assumption that a given team would have limited resources available to deploy and run the models; in that case, it is safe to say that, even assuming it was possible to retrain the models for better performance, the “juice isn’t worth the squeeze”.

Overall, the poor performance of the selected summarization models on the given tasks does not indicate that the models are “bad at summarization”, but rather illustrates the critical importance of research and level setting at the start of a project to ensure that the right tool is being selected for the job. Likewise, it is non-negotiable for teams to conduct robust needs analysis to validate that the task is a good fit for summarization and NLP in the first place.

Conclusion

This study underscores the importance of model selection in NLP tasks, particularly when balancing accuracy with resource consumption. In the case of NER, SpaCy emerges as the optimal choice when computational efficiency is paramount, particularly in environments with limited resources or when speed is crucial. However, for more demanding applications requiring higher accuracy, DistilBERT strikes the best balance, offering near-state-of-the-art performance with reduced computational cost compared to BERT. On the other hand, BERT remains a strong contender where computational resources are not a concern, though it may be an over-engineered solution for smaller-scale tasks.

For summarization, however, the results were less favorable. The models tested (BART, T5, and Pegasus) struggled with the tasks in various ways, pointing to a misalignment between the design of the models and the nature of the summarization task. These misfires highlight the critical importance of aligning model capabilities with task requirements, especially when using pre-trained models. Moreover, the findings suggest that, while summarization tasks can be complex and nuanced, the models selected in this study may not be the most appropriate for certain types of content or use cases.

Overall, this research demonstrates the necessity of thorough task analysis and model selection; it is critical above all for teams to carefully assess whether NLP summarization is a suitable approach for their needs, and to consider more tailored models for their specific applications.

Further, this experiment had an extremely limited scope in terms of training/fine-tuning (i.e., none); in reality, it is vanishingly unlikely that a model would perform to specification out of the box. Any real-world application would certainly require significant time and resources (including domain expertise, as well as expertise in data engineering, analysis, and operations) to ensure that it is production-ready, not to mention ongoing monitoring and maintenance. Future teams considering NLP models for production workflows will need to carefully weigh not just the costs of needs analysis, exploratory data analysis, model research, model selection, training, testing, and deployment, but also fine-tuning, retraining, and ongoing support. “Set it and forget it” does not apply to an NLP solution, and the projected level of effort must take this into account.

References

“Linguistic Features: SpaCy Usage Documentation.” n.d. Accessed December 31, 2024. https://spacy.io/usage/linguistic-features#named-entities
“spaCy 101: Everything You Need to Know - spaCy Usage Documentation.” n.d. Accessed October 30, 2024. https://spacy.io/usage/spacy-101
“BERT 101 - State of the Art NLP Model Explained.” n.d. Accessed December 31, 2024. https://huggingface.co/blog/bert-101
“BERT.” n.d. Accessed December 31, 2024. https://huggingface.co/docs/transformers/model_doc/bert
Sanh, Victor. “Smaller, Faster, Cheaper, Lighter: Introducing DistilBERT, a Distilled Version of BERT.” HuggingFace (blog), August 31, 2020. https://medium.com/huggingface/distilbert-8cf3380435b5
“BART.” n.d. Accessed November 13, 2024. https://huggingface.co/docs/transformers/model_doc/bart
“How Do Transformers Work? - HuggingFace NLP Course.” n.d. Accessed December 31, 2024. https://huggingface.co/learn/nlp-course/chapter1/4
“Pegasus.” n.d. Accessed December 31, 2024. https://huggingface.co/docs/transformers/model_doc/pegasus
Zhang, Jingqing, Yao Zhao, Mohammad Saleh, and Peter J. Liu. 2020. “PEGASUS: Pre-Training with Extracted Gap-Sentences for Abstractive Summarization.” arXiv. https://doi.org/10.48550/arXiv.1912.08777.
“Monitoring Text-Based Generative AI Models Using Metrics like BLEU Score.” n.d. Arize AI. Accessed January 2, 2025. https://arize.com/blog-course/generative-ai-metrics-bleu-score
LaVi, iiTAN. 2016. “Answer to ‘Text Summarization Evaluation - BLEU vs ROUGE.’” Stack Overflow. https://stackoverflow.com/a/39190391
“Abisee/Cnn_dailymail - Datasets at HuggingFace.” 2024. February 13, 2024. https://huggingface.co/datasets/abisee/cnn_dailymail

Appendix A: SpaCy Label Mapping

SpaCy model output labels do not correspond one-to-one with the ground-truth annotations used in this study (i.e., ORG/“organization”, LOC/“location”, and PER/“person”). SpaCy labels were mapped to the ground-truth labels as follows:

SpaCy labels corresponding to ground-truth labels:

PERSON (maps to ground-truth PER)
GPE (“geo-political entity”, maps to ground-truth LOC)
FAC (“facility”, e.g., buildings - maps to LOC)
LOC (equivalent to ground-truth LOC)
ORG (equivalent to ground-truth ORG)
NORP (“nationalities or religious or political groups”, maps to ORG)

SpaCy labels not corresponding to ground-truth labels:

WORK_OF_ART
CARDINAL (number)
PRODUCT
DATE
LANGUAGE
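
The mapping above can be expressed as a small lookup table. The following is a minimal sketch, not the code used in this study: the names SPACY_TO_GROUND_TRUTH, map_label, and normalize_entities are hypothetical, and discarding entities whose labels have no ground-truth counterpart is an assumption about how unmapped labels might be handled.

```python
from typing import List, Optional, Tuple

# SpaCy label -> ground-truth label, transcribed from the list above.
# The dictionary name is hypothetical (not taken from the study's code).
SPACY_TO_GROUND_TRUTH = {
    "PERSON": "PER",
    "GPE": "LOC",
    "FAC": "LOC",
    "LOC": "LOC",
    "ORG": "ORG",
    "NORP": "ORG",
}


def map_label(spacy_label: str) -> Optional[str]:
    """Return the ground-truth label, or None when there is no counterpart."""
    return SPACY_TO_GROUND_TRUTH.get(spacy_label)


def normalize_entities(entities: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Relabel (text, spacy_label) pairs with ground-truth labels.

    Assumption: entities with unmapped labels (e.g., DATE, CARDINAL)
    are dropped rather than kept under their original label.
    """
    normalized = []
    for text, label in entities:
        mapped = map_label(label)
        if mapped is not None:
            normalized.append((text, mapped))
    return normalized
```

With a SpaCy document object `doc`, the input pairs could be built as `[(ent.text, ent.label_) for ent in doc.ents]` before being passed to normalize_entities.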

Beyond the Metrics: Lessons from NLP Model Evaluation (2024)