Alexandria3k documentation
The alexandria3k package supplies a library and a command-line tool that provide fast and space-efficient relational query access to large open data sets of scientific publications. Data are decompressed on the fly, which allows the package to be used even on storage-restricted laptops. The package supports the following large data sets.
Crossref (184 GiB compressed, 1.9 TiB uncompressed — as of March 2025). This contains publication metadata from all major international publishers. The Crossref data set is split into about 33 thousand files. Each file contains JSON data for 5000 publications (works). In total, Crossref contains data for 167 million works, 35 million abstracts, 465 million associated work authors, and 2.5 billion references.
PubMed (47 GiB compressed, 707 GiB uncompressed — as of April 2025). This comprises more than 36 million citations for biomedical literature from MEDLINE, life science journals, and online books, with rich domain-specific metadata, such as MeSH indexing, funding, genetic, and chemical details.
ORCID summary data set (37 GiB compressed, 651 GiB uncompressed — as of October 2024). This contains about 22 million author details records.
DataCite (24 GiB compressed, 347 GiB uncompressed — as of 2024). This comprises research outputs and resources, such as data, pre-prints, images, and samples, containing about 50 million work entries.
United States Patent Office issued patents (12 GiB compressed, 128 GiB uncompressed — as of January 2025). This contains about 5.4 million records.
Further supported data sets include funder bodies, journal names, open access journals, and research organizations.
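The on-the-fly decompression mentioned above can be illustrated with a minimal sketch that uses only the Python standard library. It is not alexandria3k's implementation. The file names, the `{"items": [...]}` container layout, and the placeholder DOIs are assumptions made for illustration. The sketch reads each compressed file directly, so decompressed data never lands on disk and only one file's contents are held in memory at a time. As a sanity check on the Crossref figures above, about 33 thousand files of 5000 works each gives roughly 165 million works, in line with the stated total of 167 million.

```python
import gzip
import json
import os
import tempfile


def write_sample(path, works):
    """Write a small gzip-compressed JSON container (hypothetical layout)."""
    with gzip.open(path, "wt", encoding="utf-8") as f:
        json.dump({"items": works}, f)


def iter_works(directory):
    """Yield works file by file, decompressing each file while reading it.

    Nothing is written to disk in decompressed form; only the contents
    of the file currently being read are held in memory.
    """
    for name in sorted(os.listdir(directory)):
        if not name.endswith(".json.gz"):
            continue
        path = os.path.join(directory, name)
        with gzip.open(path, "rt", encoding="utf-8") as f:
            for work in json.load(f)["items"]:
                yield work


# Build two tiny stand-in files with made-up placeholder records.
with tempfile.TemporaryDirectory() as data_dir:
    write_sample(
        os.path.join(data_dir, "0.json.gz"),
        [
            {"DOI": "10.0000/example.1", "title": ["First"]},
            {"DOI": "10.0000/example.2", "title": ["Second"]},
        ],
    )
    write_sample(
        os.path.join(data_dir, "1.json.gz"),
        [{"DOI": "10.0000/example.3", "title": ["Third"]}],
    )
    dois = [work["DOI"] for work in iter_works(data_dir)]

print(dois)
```

Processing one file at a time keeps peak memory proportional to a single file rather than to the whole data set, which is what makes this approach workable on machines with limited storage.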
The alexandria3k package installation contains all elements required to run it. It does not require the installation, configuration, and maintenance of a third-party relational or graph database. It can therefore be used out of the box for performing reproducible publication research on the desktop.
Databases populated with alexandria3k can be used by generative AI applications through the Model Context Protocol and its SQLite reference server. Application examples include topic modeling, snowballing, research trend analysis, author disambiguation, citation graph generation, patent similarity detection, grant and funding prediction, co-authorship network mapping, institutional collaboration analysis, knowledge graph augmentation, research impact prediction, academic fraud detection, technology transfer mapping, interdisciplinary research discovery, and research paper recommendations.
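Because a populated database is served through a SQLite server, it can presumably also be queried directly with any SQLite client. The sketch below uses Python's standard `sqlite3` module on a tiny in-memory stand-in. The `works` table, its `doi`, `title`, and `published_year` columns, and the sample rows are assumptions made for illustration; the schema diagrams listed in the contents are the place to check the actual table layout.

```python
import sqlite3

# An in-memory database stands in for a file populated by alexandria3k.
conn = sqlite3.connect(":memory:")

# Hypothetical, simplified table; the real schema may differ.
conn.execute(
    "CREATE TABLE works (doi TEXT PRIMARY KEY, title TEXT, published_year INTEGER)"
)
conn.executemany(
    "INSERT INTO works VALUES (?, ?, ?)",
    [
        ("10.0000/example.1", "First", 2020),
        ("10.0000/example.2", "Second", 2021),
        ("10.0000/example.3", "Third", 2021),
    ],
)

# Count works per publication year, a typical trend-analysis starting point.
rows = conn.execute(
    "SELECT published_year, COUNT(*) FROM works "
    "GROUP BY published_year ORDER BY published_year"
).fetchall()
conn.close()

print(rows)
```

With a real database, replacing `":memory:"` with the path of the populated file and dropping the `CREATE TABLE` and `INSERT` statements should be the only changes needed, assuming the queried tables and columns exist.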
Publication
Details about the rationale, design, implementation, and use of this software can be found in the following paper.
Diomidis Spinellis. Open reproducible scientometric research with Alexandria3k. PLoS ONE 18(11): e0294946. November 2023. doi: 10.1371/journal.pone.0294946
Package name derivation
The alexandria3k package is named after the Library of Alexandria, indicating how publication data can be processed in the third millennium AD.
Contents
- Installation
- Data downloading
- Use overview
- Command line examples
  - Obtain list of available commands
  - Show DOI and title of all publications
  - Save DOI and title of 2021 publications in a CSV file suitable for Excel
  - Show Crossref publications with more than 50 authors
  - Count Crossref publications by year and type
  - Fill a table with a query’s results
  - Obtain Patent Office granted patents by type
  - Sampling
  - Database of COVID research
  - Publications graph
  - Record selection from external database
  - Populate the database with author records from ORCID
  - Populate the database with journal names
  - Populate the database with funder names
  - Work with Scopus All Science Journal Classification Codes (ASJC)
  - Populate the database with data regarding open access journals
  - Populate the database with the names of research organizations
  - Link author affiliations with research organization names
- Python API examples
  - Create a Crossref object
  - Iterate through the DOI and title of all publications
  - Iterate through Crossref publications with more than 50 authors
  - Create a dictionary of which 2021 publications were funded by each body
  - Database of COVID research
  - Reference graph
  - Record selection from external database
  - Populate the database from ORCID
  - Populate the database with journal names
  - Populate the database with funder names
  - Populate the database with data regarding open access journals
  - Work with Scopus All Science Journal Classification Codes (ASJC)
  - Populate the database with the names of research organizations
  - Link author affiliations with research organization names
- Application examples
- Schema diagrams
- Command line interface
- Python user API
- Python plugin API
- Python utility API
- Development processes
- Plugin development
- FAQ: Frequently asked questions