Python API examples
After downloading the required data, the functionality of alexandria3k can be accessed through its Python API, either interactively (for exploratory data analytics) or through Python scripts (for long-running jobs and for documenting research methods as repeatable processes).
Data source query and database population tasks are performed by creating an instance of the corresponding data source class (e.g. Crossref or Orcid) and calling its methods. Although many of the examples are based on the Crossref data source, the same principles apply to the other supported data sources.
Create a Crossref object
Crossref functionality is accessed by means of a corresponding object created by specifying the data directory.
from alexandria3k.data_sources.crossref import Crossref
crossref_instance = Crossref('April 2022 Public Data File from Crossref')
You can also pass a second argument: a function that is called with each container's name and returns whether that container should be included in the sample.
from random import random, seed
from alexandria3k.data_sources.crossref import Crossref
# Randomly (but deterministically) sample 1% of the containers
seed("alexandria3k")
crossref_instance = Crossref('April 2022 Public Data File from Crossref',
                             lambda _name: random() < 0.01)
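The sketch below illustrates why seeding the generator makes the 1% sample repeatable. It uses only the Python standard library and hypothetical container file names (real names may differ), and it assumes that the containers are visited in the same order on every run.

```python
from random import random, seed


def sampled(names, rate=0.01):
    """Return the names kept by a seeded random filter."""
    seed("alexandria3k")  # Same seed as in the example above
    keep = lambda _name: random() < rate
    return [name for name in names if keep(name)]


# Hypothetical container names, used only for illustration
containers = [f"{i}.json.gz" for i in range(10_000)]

first = sampled(containers)
second = sampled(containers)
# The same seed and visiting order yield the same selection
print(len(first), first == second)
```

Changing the seed string or the order in which containers are visited would select a different subset.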
Iterate through the DOI and title of all publications
for (doi, title) in crossref_instance.query('SELECT DOI, title FROM works'):
    print(doi, title)
Iterate through Crossref publications with more than 50 authors
This query works by joining the works table with the work_authors table. The partition=True argument specifies that this join can be performed separately on each container file, allowing the query's execution in a single pass. Without this option, the query would take millennia to complete.
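The query text itself is not shown above. As a minimal sketch of the kind of join described, the following runs a comparable query against a tiny in-memory SQLite stand-in rather than Crossref data. The join key (work_authors.work_id = works.id) is assumed by analogy with the work_funders join in the funder example; the family column and the DOIs are hypothetical.

```python
import sqlite3

# In-memory stand-in for the Crossref tables (toy schema)
db = sqlite3.connect(":memory:")
db.executescript(
    """
    CREATE TABLE works (id INTEGER PRIMARY KEY, doi TEXT);
    CREATE TABLE work_authors (work_id INTEGER, family TEXT);
    INSERT INTO works VALUES
      (1, '10.1000/large-team'), (2, '10.1000/small-team');
    """
)
# Work 1 gets 51 authors, work 2 gets 3
db.executemany(
    "INSERT INTO work_authors VALUES (?, ?)",
    [(1, f"Author {n}") for n in range(51)]
    + [(2, f"Author {n}") for n in range(3)],
)

QUERY = """
    SELECT works.doi, COUNT(*) AS author_number FROM works
      INNER JOIN work_authors ON work_authors.work_id = works.id
      GROUP BY works.id, works.doi
      HAVING COUNT(*) > 50
"""

large_teams = db.execute(QUERY).fetchall()
for doi, author_number in large_teams:
    print(doi, author_number)
```

With alexandria3k, the same SQL text would presumably be passed as crossref_instance.query(QUERY, partition=True), in the same way as the funder example that follows.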
Create a dictionary of which 2021 publications were funded by each body
Here partition=True is passed to the query method in order to have the query run separately (and therefore efficiently) on each Crossref data container.
from collections import defaultdict
works_by_funder = defaultdict(list)
for (funder_doi, work_doi) in crossref_instance.query(
    """
    SELECT work_funders.doi, works.doi FROM works
      INNER JOIN work_funders ON work_funders.work_id = works.id
      WHERE published_year = 2021
    """,
    partition=True,
):
    works_by_funder[funder_doi].append(work_doi)
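Once filled, works_by_funder is an ordinary dictionary, so it can be summarized with standard Python. A minimal sketch, with hypothetical (funder DOI, work DOI) rows standing in for the query results, ranks funders by the number of works attributed to them.

```python
from collections import defaultdict

# Hypothetical rows standing in for the query results
rows = [
    ("10.13039/example-a", "10.1000/w1"),
    ("10.13039/example-b", "10.1000/w2"),
    ("10.13039/example-a", "10.1000/w3"),
]

works_by_funder = defaultdict(list)
for funder_doi, work_doi in rows:
    works_by_funder[funder_doi].append(work_doi)

# Rank funders by the number of works attributed to them
ranking = sorted(
    works_by_funder.items(), key=lambda item: len(item[1]), reverse=True
)
for funder_doi, work_dois in ranking:
    print(funder_doi, len(work_dois))
```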
Database of COVID research
The following command creates an SQLite database with all Crossref data regarding publications that contain “COVID” in their title or abstract.
crossref_instance.populate(
    "covid.db", condition="title like '%COVID%' OR abstract like '%COVID%'"
)
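The condition reads as an SQL expression over the columns of the works table. Assuming it is evaluated with SQLite semantics, LIKE is case-insensitive for ASCII characters by default, so the pattern would also match "Covid" or "covid". The sketch below checks this behavior with Python's sqlite3 module on a toy works table with hypothetical rows; it does not call alexandria3k.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE works (doi TEXT, title TEXT, abstract TEXT)")
db.executemany(
    "INSERT INTO works VALUES (?, ?, ?)",
    [
        ("10.1000/a", "COVID-19 vaccine trials", None),
        ("10.1000/b", "Remote teaching", "Effects of the covid pandemic"),
        ("10.1000/c", "Graph algorithms", "Shortest paths"),
    ],
)

# Same expression as the populate condition above
CONDITION = "title like '%COVID%' OR abstract like '%COVID%'"
matched = [
    doi
    for (doi,) in db.execute(
        f"SELECT doi FROM works WHERE {CONDITION} ORDER BY doi"
    )
]
print(matched)
```

The second row matches through its lower-case abstract, and the first matches even though its abstract is missing.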
Reference graph
The following command populates an SQLite database by selecting only a subset of columns of the complete Crossref data set to create a navigable graph between publications and their references.
crossref_instance.populate(
    "references.db",
    columns=[
        "works.id",
        "works.doi",
        "work_references.work_id",
        "work_references.doi",
    ],
    condition="work_references.doi is not null",
)
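As a sketch of how such a graph could be navigated afterwards, the following uses Python's sqlite3 module on an in-memory stand-in holding only the four columns selected above. The rows are hypothetical, and the join on DOI assumes that works.doi and work_references.doi store DOIs in the same form.

```python
import sqlite3

db = sqlite3.connect(":memory:")
# Toy graph: a cites b and c, and b cites c
db.executescript(
    """
    CREATE TABLE works (id INTEGER PRIMARY KEY, doi TEXT);
    CREATE TABLE work_references (work_id INTEGER, doi TEXT);
    INSERT INTO works VALUES
      (1, '10.1000/a'), (2, '10.1000/b'), (3, '10.1000/c');
    INSERT INTO work_references VALUES
      (1, '10.1000/b'), (1, '10.1000/c'), (2, '10.1000/c');
    """
)

# Each row is an edge from a citing work to a cited work
edges = db.execute(
    """
    SELECT citing.doi, cited.doi FROM work_references
      INNER JOIN works AS citing ON citing.id = work_references.work_id
      INNER JOIN works AS cited ON cited.doi = work_references.doi
      ORDER BY citing.doi, cited.doi
    """
).fetchall()

# Number of incoming citations per cited DOI
citations = db.execute(
    """
    SELECT doi, COUNT(*) FROM work_references
      GROUP BY doi ORDER BY COUNT(*) DESC, doi
    """
).fetchall()
print(edges)
print(citations)
```

On the full data set, indexes on works.doi and work_references.doi would likely be needed for such joins to run in reasonable time.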
Record selection from external database
The following commands create an SQLite database with all Crossref data of works whose DOI appears in the attached database named selected.db.
from alexandria3k.data_sources.crossref import Crossref
crossref_instance = Crossref(
    'April 2022 Public Data File from Crossref',
    attach_databases=["attached:selected.db"]
)
crossref_instance.populate(
    "selected-works.db",
    condition="EXISTS (SELECT 1 FROM attached.selected_dois WHERE works.doi = selected_dois.doi)"
)
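The example presupposes that selected.db already contains a selected_dois table with a doi column. A minimal sketch of creating such a file with Python's sqlite3 module follows; it then tries the same EXISTS condition on a toy works table through a plain SQLite ATTACH. The DOIs and the temporary directory are hypothetical, and the sketch does not call alexandria3k.

```python
import os
import sqlite3
import tempfile

workdir = tempfile.mkdtemp()
selected_path = os.path.join(workdir, "selected.db")

# Create the database with the table and column named in the condition
selected = sqlite3.connect(selected_path)
selected.execute("CREATE TABLE selected_dois (doi TEXT PRIMARY KEY)")
selected.executemany(
    "INSERT INTO selected_dois VALUES (?)",
    [("10.1000/a",), ("10.1000/c",)],  # Hypothetical DOIs
)
selected.commit()
selected.close()

# Toy works table for trying out the condition
db = sqlite3.connect(":memory:")
# Attach before any inserts: SQLite refuses ATTACH inside a transaction
db.execute("ATTACH DATABASE ? AS attached", (selected_path,))
db.execute("CREATE TABLE works (doi TEXT)")
db.executemany(
    "INSERT INTO works VALUES (?)", [("10.1000/a",), ("10.1000/b",)]
)

kept = db.execute(
    """
    SELECT doi FROM works WHERE EXISTS (
      SELECT 1 FROM attached.selected_dois
        WHERE works.doi = selected_dois.doi)
    """
).fetchall()
print(kept)
```

The string "attached:selected.db" in the example above appears to pair the schema name used in the condition with the path of the database file.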
Populate the database from ORCID
Add tables containing author country and education organization. Only records of authors identified in the Crossref publications through an ORCID will be added.
from alexandria3k.data_sources.orcid import Orcid
orcid_instance = Orcid("ORCID_2022_10_summaries.tar.gz")
orcid_instance.populate(
    "database.db",
    columns=[
        "person_countries.*",
        "person_educations.orcid",
        "person_educations.organization_name",
    ],
    condition="""EXISTS (SELECT 1 FROM populated.work_authors
      WHERE work_authors.orcid = persons.orcid)"""
)
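Once both data sets share a database, author records can be related to the ORCID tables through the orcid column. The sketch below shows such a join on an in-memory stand-in that contains only columns named in the examples above (work_authors.work_id, work_authors.orcid, person_educations.orcid, person_educations.organization_name). The identifiers and organization names are hypothetical.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript(
    """
    CREATE TABLE work_authors (work_id INTEGER, orcid TEXT);
    CREATE TABLE person_educations (orcid TEXT, organization_name TEXT);
    INSERT INTO work_authors VALUES
      (1, '0000-0000-0000-0001'), (1, NULL), (2, '0000-0000-0000-0002');
    INSERT INTO person_educations VALUES
      ('0000-0000-0000-0001', 'Example University'),
      ('0000-0000-0000-0002', 'Sample Institute');
    """
)

# Authors without an ORCID (NULL) never match and drop out of the join
rows = db.execute(
    """
    SELECT work_authors.work_id, person_educations.organization_name
      FROM work_authors
      INNER JOIN person_educations
        ON person_educations.orcid = work_authors.orcid
      ORDER BY work_authors.work_id
    """
).fetchall()
print(rows)
```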
Populate the database with journal names
from alexandria3k.data_sources.journal_names import JournalNames
instance = JournalNames(
    "http://ftp.crossref.org/titlelist/titleFile.csv"
)
instance.populate("database.db")
Populate the database with funder names
from alexandria3k.data_sources.funder_names import FunderNames
instance = FunderNames(
    "https://doi.crossref.org/funderNames?mode=list"
)
instance.populate("database.db")
Populate the database with data regarding open access journals
from alexandria3k.data_sources.doaj import Doaj
instance = Doaj("https://doaj.org/csv")
instance.populate("database.db")
Work with Scopus All Science Journal Classification Codes (ASJC)
from alexandria3k.data_sources.asjcs import Asjcs
from alexandria3k.processes import link_works_asjcs
# Populate database with ASJCs
instance = Asjcs("resource:data/asjc.csv")
instance.populate("database.db")
# Link the (previously populated) works table with ASJCs
link_works_asjcs.process("database.db")
Populate the database with the names of research organizations
from alexandria3k.data_sources.ror import Ror
instance = Ror("v1.17.1-2022-12-16-ror-data.zip")
instance.populate("database.db")
Link author affiliations with research organization names
from alexandria3k import ror
from alexandria3k.processes import link_aa_base_ror, link_aa_top_ror
# Link affiliations with best match
link_aa_base_ror.process("database.db")
# Link affiliations with top parent of best match
link_aa_top_ror.process("database.db")