Plugin development
The alexandria3k system is structured around plugins that provide access to data sources and data processing functionality. New plugins can be developed and contributed by adding a Python file in the data_sources or processes source code directory. Plugins are made automagically available to the a3k Command line interface and can also be used through the alexandria3k Python user API.
All plugins shall start with a docstring describing the plugin’s functionality.
Data source plugins
To create a new data source plugin you need to perform a number of steps. The employed modules and classes are documented in the Python plugin API.
- Create a file in src/alexandria3k/data_sources/ that implements the data source access. There you need to define the data source's class, its schema, cursors for fetching the data items, and, optionally, a method for downloading the data. The data source's name is written in all lowercase for the CLI (e.g. datacite), with an initial capital for the class name (e.g. Datacite), and as formally spelled in the documentation and schemas (e.g. DataCite).
  The simplest case involves adding plugins for data available in CSV format (see e.g. the implementations for asjcs, journal_names, doaj, and funder_names). In this case only the following steps are needed.
  - Create the plugin in the src/alexandria3k/data_sources directory.
  - Define the data schema in a tables global variable through the TableMeta and ColumnMeta classes. Use the CsvCursor as the table's cursor. Unless column values are obtained in order from a CSV file, each column must define a function that will provide its value. Example:
    tables = [
        TableMeta(
            "funder_names",
            cursor_class=CsvCursor,
            columns=[
                ColumnMeta("id"),
                ColumnMeta("url"),
                ColumnMeta("name"),
                ColumnMeta("replaced", lambda row: row[2] if len(row[2]) else None),
            ],
        )
    ]
  - Define the default URL for fetching the data source, used when none is provided, as the global constant DEFAULT_SOURCE. If no such URL exists, set the constant to None.
  - Define the data source's class by subclassing the DataSource class. Example (for a single table):
    class FunderNames(DataSource):
        def __init__(
            self,
            data_source,
            sample=lambda n: True,
            attach_databases=None,
        ):
            super().__init__(
                VTSource(table, data_source, sample), [table], attach_databases
            )
  Data sources with multiple tables need the definition of a more complex schema. In these cases you must also define:
  - an SQLite virtual table data source VTSource,
  - a cursor to iterate over the records of each table,
  - and possibly a partitioning scheme to handle interrelated tables that are streamed concurrently (e.g. a work and its authors).
  The virtual table and cursor classes are documented in the APSW Virtual Tables chapter. It is easiest to base this on an existing data source:
  - The data sources asjcs, doaj, funder_names, and journal_names map a single CSV dataset into a single relational table.
  - crossref maps a set of compressed JSON files, all residing in a single directory, and each containing multiple works.
  - datacite maps a compressed tar file, containing files residing in a non-flat structure. Each file contains 1000 JSON records, separated by newlines.
  - orcid maps a compressed tar file that has, in nested directories, one XML file for each person.
  - pubmed maps a set of compressed XML files, all residing in a single directory, and each containing multiple works.
  - ror maps a single compressed JSON file into several tables.
  The crossref, ror, and orcid data sources implement partitioning. In addition, the crossref, orcid, and datacite data sources implement indexing over partitions, which allows efficient sampling by skipping unneeded containers. All table rows have an id field, holding an identifier that is unique across all rows of that table. As detail table indices are reset for each record, the identifier typically also incorporates the identifiers of the parent tables.
- Add a small subset of the data in the tests/data/ directory.
- Create a file with unit and integration tests in tests/data_sources/.
- Create a Graphviz file with the data source's schema in docs/schema/. Use a color associated with the data source's logo.
- Add a legend for the schema's tables in docs/schema/other.dot.
- Add the schema's SVG in docs/schema.rst.
- Add the schema to bin/update-schema, run the script to regenerate the schema diagrams and, once they are correct, add the generated SVG file and commit the new and updated files.
- Update the plugin documentation and the schema diagrams as documented elsewhere in this guide.
- Add a motivating example in the examples directory.
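The rule about column accessor functions in the CSV case can be illustrated with a small, self-contained sketch. It deliberately does not import alexandria3k: Column, row_to_record, COLUMNS, and the sample CSV text are hypothetical stand-ins written only for this illustration, and the assumption that a column without a function takes the CSV field at its own position is a simplification rather than a description of how CsvCursor is implemented.

```python
import csv
import io
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass
class Column:
    """Hypothetical stand-in for a column definition (not ColumnMeta)."""

    name: str
    value_fn: Optional[Callable[[Sequence[str]], object]] = None


def row_to_record(columns, row):
    """Map one CSV row to a dictionary of column values.

    Assumption for this sketch: a column without an accessor takes the
    CSV field at its own position, while a column with an accessor
    computes its value from the whole row.
    """
    record = {}
    for position, column in enumerate(columns):
        if column.value_fn is None:
            record[column.name] = row[position]
        else:
            record[column.name] = column.value_fn(row)
    return record


COLUMNS = [
    Column("url"),
    Column("name"),
    # Store an empty third field as None rather than as an empty string
    Column("replaced", lambda row: row[2] if len(row[2]) else None),
]

# Made-up sample data: two rows, the first with an empty third field
CSV_TEXT = (
    "http://example.org/1,First funder,\n"
    "http://example.org/2,Second funder,http://example.org/1\n"
)

records = [
    row_to_record(COLUMNS, row) for row in csv.reader(io.StringIO(CSV_TEXT))
]
```

In this sketch only the last column needs a function, because its value is not a plain copy of the CSV field: an empty string is turned into None so that it would be stored as a NULL value.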
Data processing plugins
To create a data processing plugin follow these steps.
- Create the plugin in the src/alexandria3k/processes directory.
- Define the schema of the generated tables in a tables global variable through the TableMeta and ColumnMeta classes. There is no need to define cursors or accessor functions.
- Define a function named process, which takes as an argument a path to a populated SQLite database, and performs the processing associated with the plugin.
- Add unit tests in the tests/processes directory.
- Add a motivating example in the examples directory.
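The shape of a process function can be sketched as follows, using only Python's standard sqlite3 module. This is a minimal illustration under stated assumptions, not an existing alexandria3k plugin: the works input table, its publisher column, and the work_counts output table are hypothetical names, and the tables schema definition that a real plugin also needs is left out.

```python
import sqlite3


def process(database_path):
    """Create a derived table in the populated database at database_path.

    Hypothetical example: counts the rows of an assumed works table per
    publisher and stores the result in an assumed work_counts table.
    """
    connection = sqlite3.connect(database_path)
    try:
        with connection:  # Commit on success, roll back on error
            # Drop any earlier result so the function can be run again
            connection.execute("DROP TABLE IF EXISTS work_counts")
            connection.execute(
                """
                CREATE TABLE work_counts AS
                SELECT publisher, COUNT(*) AS works_number
                FROM works
                GROUP BY publisher
                """
            )
    finally:
        connection.close()
```

Dropping the output table first keeps the function repeatable, which is convenient when it is exercised from unit tests against a small test database.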