Designing and Evaluating a PubMed Metadata Text Mining Software Program

Presentation Date

5-21-2026

Description

Presentation at the 2026 Medical Library Association Annual Meeting

Abstract

Objectives – Author keywords are an underutilized metadata field, yet they represent important specialized natural language that concisely describes research. Now that MEDLINE indexing is largely automated, fewer people are suggesting new MeSH terms, which limits the vocabulary's usefulness and relevance. Highly used author keywords could be a new source of MeSH terms. Therefore, a text mining Python program that analyzes the author keywords in PubMed search results and suggests new Medical Subject Headings (MeSH) terms was designed and evaluated. This program is of interest to health science librarians engaged in maintaining MeSH and to those who frequently use MeSH.

Hypothesis – A text mining Python program can identify author keywords that are not yet MeSH terms.

Methods – Visual Studio Code, Microsoft Copilot, and Jupyter Python notebooks were used to create a text mining Python program. The program was then evaluated with a use case. A PubMed search for osteosarcoma AND dog was conducted, and a PubMed metadata file of all results was saved and loaded into the program. The program finds all of the Author Keywords and creates a list ranked by how many times each Author Keyword appears in the search results. Any existing MeSH terms and MeSH entry terms are deleted from this list. Any Author Keywords that are also MeSH Supplemental Records are put into their own ranked list. The program then deletes MeSH Supplemental Records to create a ranked list without them. The program returns all of these ranked lists to the user and writes them to a txt file. The MeSH database was used to confirm that the Author Keywords found were not MeSH terms, MeSH entry terms, MeSH Supplemental Records, or Supplemental Records entry terms.
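The workflow described in the Methods can be illustrated with a minimal sketch. It is not the presenters' program. It assumes the saved metadata file is in the tagged PubMed text format, where author keywords appear in OT (Other Term) fields, and it assumes the MeSH terms, entry terms, and Supplemental Records are supplied as plain Python sets. All function names, the sample records, and the placeholder keyword "example compound x" are hypothetical.

```python
from collections import Counter


def extract_author_keywords(pubmed_text):
    """Collect 'OT' (Other Term) values from PubMed-format text, lowercased."""
    keywords = []
    current_tag = None
    for line in pubmed_text.splitlines():
        if len(line) > 5 and line[4] == "-":
            # New field: a tag padded to four characters, then "- ", then the value.
            current_tag = line[:4].strip()
            if current_tag == "OT":
                keywords.append(line[6:].strip())
        elif line.startswith("      ") and current_tag == "OT" and keywords:
            # Continuation line of a wrapped keyword.
            keywords[-1] += " " + line.strip()
        elif not line.strip():
            # A blank line separates records.
            current_tag = None
    return [k.lower() for k in keywords if k]


def rank_keywords(keywords, exclude=frozenset()):
    """Return (keyword, count) pairs, most frequent first, skipping excluded terms."""
    return Counter(k for k in keywords if k not in exclude).most_common()


def analyze(pubmed_text, mesh_terms, entry_terms, supplemental):
    """Build the ranked lists described in the Methods (assumed interpretation)."""
    keywords = extract_author_keywords(pubmed_text)
    mesh = mesh_terms | entry_terms
    not_mesh = rank_keywords(keywords, mesh)
    return {
        "all": rank_keywords(keywords),
        "not_mesh": not_mesh,
        "supplemental_only": [(k, n) for k, n in not_mesh if k in supplemental],
        "not_mesh_or_supplemental": rank_keywords(keywords, mesh | supplemental),
    }


def format_report(lists):
    """Render the ranked lists as plain text, suitable for saving to a txt file."""
    lines = []
    for name, ranked in lists.items():
        lines.append(name)
        lines.extend(f"  {count}\t{keyword}" for keyword, count in ranked)
    return "\n".join(lines)


# Made-up sample records, for illustration only.
SAMPLE = """PMID- 1
OT  - Osteosarcoma
OT  - canine
OT  - example compound x

PMID- 2
OT  - osteosarcoma
OT  - Canine
OT  - limb amputation
"""

# Hypothetical lookup sets; a real run would load these from MeSH data.
result = analyze(
    SAMPLE,
    mesh_terms={"osteosarcoma"},
    entry_terms=set(),
    supplemental={"example compound x"},
)
print(format_report(result))
```

Keywords are lowercased before counting so that "Canine" and "canine" are tallied together; whether the original program normalizes case this way is not stated in the abstract.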

Results – The program was able to take a PubMed metadata results file and provide ranked lists of the most used Author Keywords, Author Keywords that are not MeSH terms, Author Keywords that are not MeSH entry terms, and Author Keywords that are not MeSH Supplemental Records. For the PubMed search osteosarcoma AND dog, the most common Author Keyword was osteosarcoma. After removing MeSH terms, the most common Author Keyword was canine. After removing both MeSH terms and MeSH entry terms, the most common Author Keyword was still canine. Canine remained the most common after removing MeSH Supplemental Records and Supplemental Records entry terms.

Conclusions – The program could identify Author Keywords that were neither MeSH terms nor MeSH entry terms. These keywords represent potential new MeSH terms. Unexpectedly, the term canine, which is commonly used by both researchers and laypeople, was neither a MeSH term nor an entry term. Two MeSH terms, Canidae and Dogs, also represent the concept of canine, but canine is not an entry term for either of them. The program can therefore not only identify potential new MeSH terms but also suggest terms that could be added as MeSH entry terms. The next step of this project is to create a user-friendly web interface for the program so that it can be used by interested health science librarians and researchers.

Document Type

Presentation

Conference/Presentation Location

MLA 2026

Language

English

Keywords

MeSH, Medical Subject Headings, metadata, text mining, software, PubMed

Disciplines

Health Information Technology

Creative Commons License

This work is licensed under a Creative Commons Attribution 4.0 International License.
