Tip#65: Search String Theory - Applying pairwise combinatorics to PubMed searches
Many thanks to Marijane White for this week's post!
Introducing the Pairwise PubMed Search Generator!
The Pairwise PubMed Search Generator is a tool designed to streamline the creation of complex PubMed search strings. By using two lists of input terms, you can quickly generate search queries that can either be copied to your clipboard or launched directly in PubMed with a single click. This tool can save you thousands of keystrokes and help prevent errors in intricate search constructions.
Continue reading to learn more about PubMed’s proximity search functionality, a brief overview of pairwise combinatorics, and of course, the tool itself.
PubMed’s proximity search functionality
The National Library of Medicine announced the addition of long-awaited proximity search capability to PubMed in the fall of 2022. Initially only available in the Title and Title/Abstract fields, it was later made available for the Affiliation field as well, where it also limits the search to individual affiliations rather than across all the affiliations on a given record.
The syntax uses the format “search terms”[field:~N] where:
- search terms: two or more words, contained in double quotes
- field: Title, Title/Abstract, or Affiliation
- N: the proximity distance between the search terms
More details about the specifics of PubMed’s Proximity Search functionality can be found in the PubMed User Guide.
PubMed’s syntax is unique compared to other search platforms like Ovid, EBSCOhost, or Scopus, which all use an approach to proximity search syntax that could be described as a bounded Boolean AND within a specified proximity distance. (For an in-depth exploration of database proximity syntax, see Tip #59: Getting Up Close and Personal With Database Proximity Syntax from Hilary Kraus and Zahra Premji.)
It is also limited compared to other platforms, because PubMed’s proximity search does not allow the use of truncation. This limitation makes constructing a thorough proximity search difficult, because every ending or spelling variation of interest must be included in the search. The process can be tedious and error-prone, and the number of search strings required grows multiplicatively: for example, if your search has one term with three variations and another with five, you must create a total of 15 proximity search strings. The total can grow quickly, making many proximity searches too challenging to construct by hand.
Why is PubMed’s syntax like that?
I was curious about the syntax and decided to investigate. I knew that the new PubMed announced in 2019 and launched in 2020 is built upon the Solr search platform. Having some familiarity with Solr myself, I browsed the documentation and discovered some clues in the documentation for Solr’s Standard Query Parser.
First, we find the following in the section on Specifying Terms:
A phrase is a group of words surrounded by double quotes such as “hello dolly”
This is pretty normal search stuff, no surprises here.
Next, we find in the Wildcard Searches section:
Wildcard characters can be applied to single terms, but not to search phrases.
Interesting! You may now be thinking to yourself, “but PubMed allows wildcards in search phrases!” and you’d be absolutely correct. The StandardQueryParser is just one of many query parsers available in Solr, and it seems likely that PubMed is handling phrases with truncation via something other than the Standard Query Parser, possibly the Complex Phrase Query Parser. But this does sound like the limitation that proximity searches have.
Finally, in the Proximity Searches section, we see:
To perform a proximity search, add the tilde character ~ and a numeric value to the end of a search phrase. For example, to search for “apache” and “jakarta” within 10 words of each other, use the search: “jakarta apache”~10
That tilde and number look familiar!
It appears that PubMed’s proximity search syntax is likely based on Solr’s proximity search syntax. If I’m correct about this, I might also guess that PubMed’s proximity search might not ever allow truncation, but I hope I’m wrong about that.
This is why I am pleased to tell you about a tool I’ve built to streamline the creation of these searches: The Pairwise PubMed Search Generator.
Wait, what does pairwise mean?
The word pairwise is typically discussed in the context of the technique called pairwise combination, and another term you may encounter is all pairs. Pairwise combination is a technique where, given two lists, each item from one list is paired with every item from the other. For example, if we have one list that contains the items 1, 2, 3 and another list that contains the items A, B, C, a pairwise combination of the two lists gives you the pairs 1A, 1B, 1C, 2A, 2B, 2C, 3A, 3B, 3C. Here’s an illustration:
I became familiar with pairwise techniques in my pre-library career as a software test engineer, where pairwise testing is used to generate test cases when a software application has many features that need to be tested in combination and full test coverage is needed. Pairwise combinatorics are also used in experimental research design, statistical analysis, and decision-making tools. I couldn’t find much evidence that they have been used in library and information science.
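For readers who think in code, the 1, 2, 3 and A, B, C example above can be sketched in a few lines of Python. This is only an illustration of the technique using the standard library's itertools.product; the function name pairwise_combine is my own invention for this sketch.

```python
from itertools import product


def pairwise_combine(first, second):
    """Pair every item in `first` with every item in `second`, in list order."""
    return [f"{a}{b}" for a, b in product(first, second)]


pairs = pairwise_combine(["1", "2", "3"], ["A", "B", "C"])
print(pairs)
# ['1A', '1B', '1C', '2A', '2B', '2C', '3A', '3B', '3C']
```

Three items paired with three items gives nine pairs, which is the same multiplicative growth that makes hand-built proximity searches so tedious.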
A brief history of the Pairwise PubMed Search Generator
I have been occasionally using pairwise techniques to generate search strings for PubMed since 2018, when a research team at OHSU came to me with a search challenge: They had long lists of gene/protein names and long lists of keywords, and they were looking for a way to search for pairings from each list in PubMed. I wrote a 48-line Python script that generated search strings and queried PubMed with them via the Entrez API. The script dumped the search results to CSV and I would then load them into a local instance of Solr running on my laptop, so they could be inspected in Solr’s faceted browser interface, sometimes known as Solritas. We were excited about turning it into a tool, but at the time we all lacked the ability to produce a web application and place it online, so it stalled out. It was also somewhat crude, because each pairwise search ran separately, many search results overlapped, and I made no attempt to deduplicate or pivot them. It needed a lot of work.
I found myself reaching back to this code after PubMed introduced its proximity search. My use was almost always experimental: I would start a search in PubMed, generate a proximity search when I encountered the need for one, and then translate my work to Ovid MEDLINE syntax when search strings got too long. Eventually, though, I decided to turn it into a web application. Throughout my career, I have tried and failed to build web apps many times, in many different programming languages, but the Streamlit web framework, aimed at an audience of data scientists looking to build dashboards in Python, has changed that. Streamlit is very easy to use if you already know some Python, and apps can be hosted in the Streamlit Community Cloud for free. The Pairwise PubMed Search Generator is just the first of several web apps I hope to produce.
About the Pairwise PubMed Search Generator
What can it do?
Search Functionality
The Pairwise PubMed Search Generator can generate three types of search strategies:
MeSH Main Headings and Subheadings
Combine a list of MeSH Main Headings with a list of Subheadings to generate all possible pairs between them, combined with OR. You can specify that the generated search string use a MeSH Major Topic search as well as the NoExp modifier.
Note that it is up to you to make sure that all the subheadings are valid for use with all of the main headings. Perhaps someday the tool will be able to do it for you.
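A rough sketch of how heading/subheading pairs could be assembled is shown here. It is not the tool's actual source code: the function name mesh_subheading_search and its parameters are hypothetical, and the exact formatting the tool emits may differ. It relies only on PubMed's documented [Mesh] and [Majr] field tags and the :NoExp modifier.

```python
from itertools import product


def mesh_subheading_search(headings, subheadings, major_topic=False, no_explode=False):
    """Join every heading/subheading pair with OR (hypothetical helper)."""
    # [Majr] restricts to MeSH Major Topic; :NoExp turns off automatic explosion.
    tag = "Majr" if major_topic else "Mesh"
    if no_explode:
        tag += ":NoExp"
    clauses = [f'"{h}/{s}"[{tag}]' for h, s in product(headings, subheadings)]
    return " OR ".join(clauses)


mesh_search = mesh_subheading_search(
    ["Diabetes Mellitus", "Hypertension"],
    ["diagnosis", "drug therapy"],
)
print(mesh_search)
# "Diabetes Mellitus/diagnosis"[Mesh] OR "Diabetes Mellitus/drug therapy"[Mesh] OR "Hypertension/diagnosis"[Mesh] OR "Hypertension/drug therapy"[Mesh]
```

As with the tool itself, nothing in this sketch checks that a given subheading is allowed with a given main heading.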
Proximity Searches
Generate a PubMed proximity search with the correct syntax from two lists of terms, including the quotes around phrases, with all pairs combined with OR. Users can specify which field to use and the desired proximity distance. Term lists can include phrases.
For example, the screenshot at the end of this post shows a set of synonyms related to "cognitive dysfunction" combined with a [tiab] field tag and a proximity distance of 2.
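The same idea can be sketched in Python. This is a minimal illustration rather than the tool's real implementation; the function name proximity_search and its defaults are assumptions made for the example. The two term lists are read off the "cognitive dysfunction" search string in this post's screenshot.

```python
from itertools import product


def proximity_search(first, second, field="tiab", distance=2):
    """Build "a b"[field:~N] for every pair and join the clauses with OR."""
    clauses = [f'"{a} {b}"[{field}:~{distance}]' for a, b in product(first, second)]
    return " OR ".join(clauses)


first_terms = ["cognitive", "cognitively", "cognition", "mental"]
second_terms = ["impairment", "dysfunction", "disorder", "disorders", "impaired", "decline"]

search = proximity_search(first_terms, second_terms)
print(search)
```

Four terms paired with six terms yields 24 quoted clauses, beginning with "cognitive impairment"[tiab:~2] and ending with "mental decline"[tiab:~2].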
Intersection Searches
Combine two lists of search terms with the Boolean AND operator, with all pairs combined with OR. These searches allow the use of truncation. You do not have to worry about using quotes because the tool will take care of those for you. Note that PubMed has a limit of 256 wildcard characters in a single search – the tool will attempt to warn you if your generated search exceeds this limit.
Additional Features
Hybrid Searches
As a bonus feature, whenever you generate a MeSH search and a proximity or intersection search at the same time, the tool creates an additional search string that combines the MeSH search with either type of keyword search using Boolean OR. The hybrid search strings have their own buttons to launch the search in PubMed, so you can follow best practices by searching PubMed with a combination of MeSH and keyword searches.
Search String Metrics
The Streamlit web framework includes a metrics widget that the tool uses to display helpful statistics about the generated search strings:
- The number of terms in each list
- The total number of pairs generated
- The number of characters you typed to enter your term lists
- The number of characters the tool generated in the final search string, or in other words, the amount of typing it saved you
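The statistics listed above are simple to compute. The sketch below shows one way to do it, assuming that "characters typed" means the combined length of the terms themselves; how the tool actually counts (for example, whether line breaks are included) is not something I am asserting here, and the function name search_metrics is hypothetical.

```python
def search_metrics(first, second, search_string):
    """Summarize two term lists and the search string generated from them."""
    return {
        "terms_in_first_list": len(first),
        "terms_in_second_list": len(second),
        "pairs_generated": len(first) * len(second),
        # Assumption: count only the characters in the terms, not separators.
        "characters_typed": sum(len(term) for term in first + second),
        "characters_generated": len(search_string),
    }


example_search = '"cognitive decline"[tiab:~2] OR "mental decline"[tiab:~2]'
metrics = search_metrics(["cognitive", "mental"], ["decline"], example_search)
print(metrics)
```

In a Streamlit app, each value could then be passed to the metrics widget mentioned above.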
Is there anything it doesn’t do?
It doesn’t run the searches or download results for you. Searches are run by either clicking a button that appears below the generated search string, or by copying it to your clipboard and pasting it into PubMed’s search interface yourself. Note that it is possible to generate a search string that is too long to launch via the search button, in which case the copy-to-clipboard option must be used instead. This is another feature that the tool might handle in the future.
Also, it only handles two lists of terms. Proximity searches with three or more words will need to combine at least two of the words into a phrase in one of the columns. For example, variations on "sexual development disorders" or "disorders of sex development" would need to be created by adding sexual development and sex development to the first column and disorders to the second column. Readers with compelling examples for a version of the tool that can handle three or more lists of terms are invited to submit an issue at the tool’s GitHub repository to be considered for future inclusion as additional examples for users to try. Other feature requests are also welcome!
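To make the search-button limitation described above more concrete, here is a hedged sketch of how a generated search string could be turned into a PubMed link, with a fallback when the link gets too long. The function name pubmed_link and the 2,000-character threshold are assumptions for illustration only; I am not claiming this is how the tool decides when a search is too long to launch.

```python
from urllib.parse import urlencode

PUBMED_URL = "https://pubmed.ncbi.nlm.nih.gov/"
MAX_URL_LENGTH = 2000  # hypothetical cutoff, not a documented PubMed limit


def pubmed_link(search_string, max_length=MAX_URL_LENGTH):
    """Return a PubMed search URL, or None if it would exceed max_length."""
    url = f"{PUBMED_URL}?{urlencode({'term': search_string})}"
    return url if len(url) <= max_length else None


link = pubmed_link('"cognitive decline"[tiab:~2]')
print(link)

# A very long search string yields None, signalling that the
# copy-to-clipboard route should be used instead.
too_long = pubmed_link(" OR ".join(['"cognitive decline"[tiab:~2]'] * 200))
print(too_long)
```

Percent-encoding inflates the quotes and brackets in a proximity search, so the link grows noticeably faster than the search string itself.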
Finally, it doesn’t produce search documentation. Users are left to share and document the search strings they generate by whichever method they prefer. I have some ideas for various documentation outputs I might add in the future, such as an Excel output that creates the search strings line-by-line using Excel’s TEXTJOIN function to combine them with the OR operator.
Who is this tool for? Why use it?
The tool is targeted primarily at searchers who only have access to PubMed, or who prefer PubMed over other platforms’ MEDLINE interfaces. Anyone with access to MEDLINE search via Ovid or EBSCOhost will likely want to continue using those tools instead, because their proximity features allow for the creation of much more concise search strategies.
If you are working in PubMed, by either choice or necessity, the tool increases the practicality of creating search strategies such as:
- Proximity searches with many word/term variations
- Systematically combining terms that lack MeSH equivalents
- Pairing intervention terms with outcome terms
- Combining drug names with condition names
- Creating exhaustive MeSH/subheading combinations
The tool is not just for librarians and expert searchers! I have found graduate students love the tool and take to it quickly when I demonstrate it to them in literature search consultations. Don’t be afraid to show it to anyone trying to create a thorough search strategy, especially those who are already comfortable with PubMed and who may not want to learn a different platform’s search syntax.
Using the Pairwise PubMed Search Generator
You can try it out right now!
You don’t have to wait until you have a proximity search of your own to try the tool yourself because it comes with a built-in example. The tool has a set of placeholder terms for a search on the topic of frailty measures, which is part of a search strategy I was working on shortly before I built it.
To try out the built-in example search:
- Open the tool in your browser (Pairwise PubMed is also accessible via Bond University’s TERA Tools!)
- Use the checkboxes at the top of the form to select which types of searches you want to generate
- Use the Load placeholder terms button to populate the form with the example terms
- Click the Generate search strings button to generate the example searches
Would you like to watch a demo?
I gave a presentation and demonstration of the tool titled Two lists enter, one search leaves! at the Pacific Northwest Chapter of the Medical Library Association (PNCMLA) virtual conference in November 2026. It covers much of the background information in this post, with a 5-minute demo using the built-in example terms at the end.
You can view it at:
Using the tool in a published search strategy? Please cite it!
The Pairwise PubMed Search Generator’s source code has been published at Zenodo with a DOI: https://doi.org/10.5281/zenodo.14768347
Citations are deeply appreciated.
Want to run the tool locally?
Many medical librarians work for organizations that may block apps running in the Streamlit Community Cloud, and they may also block access to GitHub. In the latter case, the code may be downloaded from Zenodo instead.
The tool has been released under the Apache 2.0 license, which permits a variety of reuse. You are welcome to download the code and run it yourself. All that’s required is a local Python installation and the Streamlit package. Running it is as simple as typing streamlit run pairwise-pubmed.py in the root of the project directory.
Contributors welcome!
Have an idea for a modification or feature? Did you find a bug and you’d like to submit a fix? Please feel free to fork the GitHub repository and submit a pull request.
Enjoy!
Have fun generating pairwise search strategies!
I am interested in feedback! Let me know what you think of the Pairwise PubMed Search Generator on Bluesky at https://bsky.app/profile/marijanewhite.bsky.social.


!["cognitive impairment"[tiab:~2] OR "cognitive dysfunction"[tiab:~2] OR "cognitive disorder"[tiab:~2] OR "cognitive disorders"[tiab:~2] OR "cognitive impaired"[tiab:~2] OR "cognitive decline"[tiab:~2] OR "cognitively impairment"[tiab:~2] OR "cognitively dysfunction"[tiab:~2] OR "cognitively disorder"[tiab:~2] OR "cognitively disorders"[tiab:~2] OR "cognitively impaired"[tiab:~2] OR "cognitively decline"[tiab:~2] OR "cognition impairment"[tiab:~2] OR "cognition dysfunction"[tiab:~2] OR "cognition disorder"[tiab:~2] OR "cognition disorders"[tiab:~2] OR "cognition impaired"[tiab:~2] OR "cognition decline"[tiab:~2] OR "mental impairment"[tiab:~2] OR "mental dysfunction"[tiab:~2] OR "mental disorder"[tiab:~2] OR "mental disorders"[tiab:~2] OR "mental impaired"[tiab:~2] OR "mental decline"[tiab:~2]](https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhNk9pvZT-nu1j6fgJsUtLtG4jNy64Y_Aowqqv6cNtzH_jsNn0fIfCF7zCkhdGsATPBQzfph1YJaSnu4qPB5DWoODXx-LsA_3cKurERSihA0PnIVfHx77uINNfseiNUSJ18IW8iy9seDscDx2AlVfKomkbMepVPOEh2NqUzMkdFVXVZIWEioA9tUMGn9mQ/w640-h380/Screenshot%202026-03-02%20at%2011.09.56%E2%80%AFAM.png)