PubMed publication data

from alexandria3k.data_sources import pubmed
class data_sources.pubmed.Pubmed(directory, sample=lambda n: True, attach_databases=None)

Create an object containing PubMed metadata that supports queries over its (virtual) tables and the population of an SQLite database with its data.

Parameters
  • directory (str) – The directory path where the PubMed data files are located

  • sample (callable, optional) – A callable that controls container sampling; defaults to lambda n: True. The population or query method calls it once for each PubMed container file, passing the container’s file name as its argument. If the callable returns True, the container file is processed; if it returns False, the container is skipped.

  • attach_databases (list, optional) – A list of colon-joined tuples, each specifying a database name and its path; defaults to None. The specified databases are attached and made available to the query and to the population condition under the specified database name.
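A minimal sketch of a sample callable follows, assuming the argument it receives can be converted to a string with str(). The helper names keep_fraction and open_sampled_pubmed are hypothetical and not part of the library. The decision is derived from a checksum of the file name rather than from a random number, so a given container is kept or skipped consistently across runs.

```python
import zlib


def keep_fraction(percent):
    """Return a sample callable that keeps roughly `percent`% of containers.

    Hypothetical helper: the decision depends only on the container's
    file name, so repeated runs select the same containers.
    """
    def sample(container_name):
        # crc32 yields a stable non-negative integer for a given name
        bucket = zlib.crc32(str(container_name).encode("utf-8")) % 100
        return bucket < percent

    return sample


def open_sampled_pubmed(directory, percent=10):
    """Create a Pubmed object over a sample of the container files."""
    # Imported lazily so the sampler above can be tried without the library
    from alexandria3k.data_sources import pubmed

    return pubmed.Pubmed(directory, sample=keep_fraction(percent))
```

Passing keep_fraction(100) behaves like the default lambda n: True, while small percentages are handy for trying out a query before running it over all the data.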

populate(database_path, columns=None, condition=None)

Populate the specified SQLite database using the data specified in the object constructor’s call. The database is created if it does not exist. If it exists, the tables to be populated are dropped (if they exist) and recreated as specified.

Parameters
  • database_path (str) – The path specifying the SQLite database to populate.

  • columns (list, optional) – A list of strings specifying the columns to populate, defaults to None. The strings are of the form table_name.column_name or table_name.*.

  • condition (str, optional) – An SQL expression specifying the rows to include in the database’s population; defaults to None. The expression can contain references to the table’s columns, to tables in attached databases prefixed by their name, and to tables in the database being populated prefixed by the name populated. Implicitly, if a main table is populated, its detail tables are populated only with the records associated with the corresponding main table’s records.
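The parameters above could be combined as in the following sketch. The column names come from the generated schema; the condition, the regular-expression check, and the helper names malformed_columns and populate_subset are illustrative assumptions, not part of the library.

```python
import re

# Form stated above: table_name.column_name or table_name.*
COLUMN_SPEC = re.compile(r"^[A-Za-z_]\w*\.(\*|[A-Za-z_]\w*)$")

COLUMNS = [
    "pubmed_articles.pubmed_id",
    "pubmed_articles.doi",
    "pubmed_articles.title",
    "pubmed_articles.journal_year",
    "pubmed_authors.*",
]

# Illustrative row filter (an assumption chosen for this example)
CONDITION = "pubmed_articles.journal_year = 2020"


def malformed_columns(columns):
    """Return the column specifications that do not have the stated form."""
    return [c for c in columns if not COLUMN_SPEC.match(c)]


def populate_subset(directory, database_path):
    """Populate database_path with the selected columns and rows."""
    # Imported lazily so the checks above can run without the library
    from alexandria3k.data_sources import pubmed

    bad = malformed_columns(COLUMNS)
    if bad:
        raise ValueError(f"Malformed column specifications: {bad}")
    pubmed.Pubmed(directory).populate(
        database_path, columns=COLUMNS, condition=CONDITION
    )
```

Because the condition refers only to pubmed_articles, the pubmed_authors rows that end up in the database are those belonging to the selected articles, as described for main and detail tables above.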

query(query, partition=False)

Run the specified query on the virtual database using the data specified in the object constructor’s call.

Parameters
  • query (str) – An SQL SELECT query specifying the required data.

  • partition (bool, optional) – When true, the query runs separately in each container; defaults to False. Queries involving table joins will run substantially faster if access to each table’s records is restricted with an expression table_name.container_id = CONTAINER_ID and the partition argument is set to true. In that case the query is run repeatedly, once over each database partition (compressed XML file), with CONTAINER_ID iterating sequentially to cover all partitions. The query’s result is the concatenation of the individual partition results. Running queries with joins without partitioning will often result in quadratic (or worse) algorithmic complexity.

Returns

An iterable over the query’s results.

Return type

iterable
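A sketch of a partitioned join follows. It assumes that pubmed_authors.article_id refers to pubmed_articles.id and that each result row can be indexed like a tuple; the helper names are hypothetical. Because the result of a partitioned query is the concatenation of the per-partition results, an aggregate such as Count(*) yields one row per container, and the rows are summed in Python.

```python
# Both tables are restricted to the current partition, as described above.
AUTHOR_COUNT_QUERY = """
    SELECT Count(*)
    FROM pubmed_articles
    INNER JOIN pubmed_authors
        ON pubmed_authors.article_id = pubmed_articles.id
    WHERE pubmed_articles.container_id = CONTAINER_ID
        AND pubmed_authors.container_id = CONTAINER_ID
"""


def total_from_partitions(rows):
    """Sum the first column over the per-partition result rows."""
    return sum(row[0] for row in rows)


def count_author_records(directory):
    """Count author records over all containers in directory."""
    # Imported lazily so the helper above can be tried without the library
    from alexandria3k.data_sources import pubmed

    pubmed_instance = pubmed.Pubmed(directory)
    rows = pubmed_instance.query(AUTHOR_COUNT_QUERY, partition=True)
    return total_from_partitions(rows)
```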

Generated schema

CREATE TABLE pubmed_articles(
  id,
  container_id,
  pubmed_id,
  doi,
  publisher_item_identifier_article_id,
  pmc_article_id,
  journal_title,
  journal_issn,
  journal_issn_type,
  journal_cited_medium,
  journal_volume INTEGER,
  journal_issue INTEGER,
  journal_year INTEGER,
  journal_month INTEGER,
  journal_day INTEGER,
  journal_medline_date,
  journal_ISO_abbreviation,
  article_date_year INTEGER,
  article_date_month INTEGER,
  article_date_day INTEGER,
  article_date_type,
  pagination,
  elocation_id,
  elocation_id_type,
  elocation_id_valid,
  language,
  title,
  vernacular_title,
  journal_country,
  medline_ta,
  nlm_unique_id,
  issn_linking,
  article_pubmodel,
  citation_subset,
  completed_year INTEGER,
  completed_month INTEGER,
  completed_day INTEGER,
  revised_year INTEGER,
  revised_month INTEGER,
  revised_day INTEGER,
  coi_statement,
  medline_citation_status,
  medline_citation_owner,
  medline_citation_version,
  medline_citation_indexing_method,
  medline_citation_version_date,
  keyword_list_owner,
  publication_status,
  abstract_copyright_information,
  other_abstract_copyright_information
);

CREATE TABLE pubmed_authors(
  id,
  container_id,
  article_id,
  given,
  family,
  suffix,
  initials,
  valid,
  identifier,
  identifier_source,
  collective_name
);

CREATE TABLE pubmed_author_affiliations(
  id,
  container_id,
  author_id,
  affiliation,
  identifier
);

CREATE TABLE pubmed_investigators(
  id,
  container_id,
  article_id,
  given,
  family,
  suffix,
  initials,
  valid,
  identifier,
  identifier_source
);

CREATE TABLE pubmed_investigator_affiliations(
  id,
  container_id,
  investigator_id,
  affiliation,
  identifier
);

CREATE TABLE pubmed_abstracts(
  id,
  container_id,
  article_id,
  label,
  text,
  nlm_category,
  copyright_information
);

CREATE TABLE pubmed_other_abstracts(
  id,
  container_id,
  article_id,
  abstract_type,
  language
);

CREATE TABLE pubmed_other_abstract_texts(
  id,
  container_id,
  abstract_id,
  text,
  label,
  nlm_category,
  copyright_information
);

CREATE TABLE pubmed_history(
  id,
  container_id,
  article_id,
  publication_status,
  year INTEGER,
  month INTEGER,
  day INTEGER,
  hour INTEGER,
  minute INTEGER
);

CREATE TABLE pubmed_chemicals(
  id,
  container_id,
  article_id,
  registry_number,
  name_of_substance,
  unique_identifier
);

CREATE TABLE pubmed_meshs(
  id,
  container_id,
  article_id,
  descriptor_name,
  descriptor_unique_identifier,
  descriptor_major_topic,
  descriptor_type,
  qualifier_name,
  qualifier_major_topic,
  qualifier_unique_identifier
);

CREATE TABLE pubmed_supplement_meshs(
  id,
  container_id,
  article_id,
  supplement_mesh_name,
  unique_identifier,
  mesh_type
);

CREATE TABLE pubmed_comments_corrections(
  id,
  container_id,
  article_id,
  ref_type,
  ref_source,
  pmid,
  pmid_version,
  note
);

CREATE TABLE pubmed_keywords(
  id,
  container_id,
  article_id,
  keyword,
  major_topic
);

CREATE TABLE pubmed_grants(
  id,
  container_id,
  article_id,
  grant_id,
  acronym,
  agency,
  country
);

CREATE TABLE pubmed_data_banks(
  id,
  container_id,
  article_id,
  data_bank_name
);

CREATE TABLE pubmed_data_bank_accessions(
  id,
  container_id,
  data_bank_id,
  accession_number
);

CREATE TABLE pubmed_references(
  id,
  container_id,
  article_id,
  citation
);

CREATE TABLE pubmed_reference_articles(
  id,
  container_id,
  reference_id,
  article_id,
  id_type
);

CREATE TABLE pubmed_publication_types(
  id,
  container_id,
  article_id,
  publication_type,
  unique_identifier
);
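
Once a database has been populated, it can be read with Python's standard sqlite3 module. The sketch below assumes that a detail table's article_id column refers to pubmed_articles.id. To stay self-contained it runs against an in-memory database holding a cut-down copy of two tables and made-up rows; the function names are hypothetical.

```python
import sqlite3


def authors_by_article(connection):
    """Return (pubmed_id, title, family, given) rows, one per author."""
    return connection.execute(
        """
        SELECT pubmed_articles.pubmed_id, pubmed_articles.title,
               pubmed_authors.family, pubmed_authors.given
        FROM pubmed_articles
        INNER JOIN pubmed_authors
            ON pubmed_authors.article_id = pubmed_articles.id
        ORDER BY pubmed_articles.pubmed_id, pubmed_authors.id
        """
    ).fetchall()


def make_demo_database():
    """Build an in-memory database with a subset of the schema's columns."""
    connection = sqlite3.connect(":memory:")
    connection.executescript(
        """
        CREATE TABLE pubmed_articles(id, container_id, pubmed_id, title);
        CREATE TABLE pubmed_authors(id, container_id, article_id,
                                    given, family);
        -- Made-up rows for illustration only
        INSERT INTO pubmed_articles VALUES (1, 0, 1001, 'Example article');
        INSERT INTO pubmed_authors VALUES (1, 0, 1, 'Ada', 'Example');
        INSERT INTO pubmed_authors VALUES (2, 0, 1, 'Ben', 'Sample');
        """
    )
    return connection
```

Against a real populated database, the same function can be called with sqlite3.connect(database_path) in place of make_demo_database().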