Python plugin API¶

The alexandria3k plugin API allows the addition of data sources and processing routines. These can be contributed to alexandria3k or used in parallel with it.

db_schema¶

A module supporting virtual database table schema definition.

class db_schema.ColumnMeta(name, value_extractor=None, **kwargs)¶

A container for column meta-data

get_definition()¶: Return column’s DDL definition

get_description()¶: Return column’s description, if any

get_name()¶: Return column’s name

get_value_extractor()¶: Return the column’s value defined extraction function

class db_schema.TableMeta(name, **kwargs)¶

A container for table meta-data

get_column_definition_by_name(name)¶: Return defined column definition DDL for column name

get_columns()¶: Return the table’s columns

get_cursor_class()¶: Return the table’s specified cursor class

get_extract_multiple()¶: Return the function for obtaining multiple records

get_foreign_key()¶: Return our column that refers to the parent table’s primary key

get_name()¶: Return the table’s name

get_parent_extract_multiple()¶: Return the function for obtaining multiple records from the parent table

get_parent_name()¶: Return the name of the main table of which this has details

get_post_population_script()¶: Return the SQL command to run after the table is populated

get_primary_key()¶: Return the parent table’s column name that refers to our foreign key

get_value_extractor_by_name(name)¶: Return defined value extraction function for column name

get_value_extractor_by_ordinal(i)¶: Return defined value extraction function for column at ordinal i

insert_statement()¶: Return an SQL command to insert data into the table

table_schema(prefix='', columns=None)¶: Return the SQL command to create a table’s schema with the optional specified prefix. A columns array can be used to specify which columns to include.

data_source¶

A module supporting queries and database population through (possibly partitioned) virtual database tables.

data_source.CONTAINER_ID_COLUMN = 1¶: int: by convention column 1 of each table hold the container (file) id. This can e.g. be the index of the file in an array of container files or an increasing counter over a streamed collection.

data_source.CONTAINER_INDEX = 1¶: int: database table index using the container id.

class data_source.DataFiles(directory, sample_container, file_extension=None, file_name_regex=None)¶

The source of compressed data files

get_container_iterator()¶: Return an iterator over the int identifiers of all data files

get_container_name(fid)¶: Return the name of the file corresponding to the specified fid

get_file_array()¶: Return the array of data files

class data_source.DataSource(data_source, tables, attach_databases=None)¶

Create a meta-data object that supports queries over its (virtual) tables and the population of an SQLite database with its data. This is an abstract parent class of concrete data source classes that implement each data source, such as Crossref, PubMed, ORCID, and USPTO.

Parameters:

data_source (object) – An object that shall supply the database elements. Its type depends on the requirements of the subclass inheriting DataSource.
tables (TableMeta) – A list of the table metadata associated with the data source The first table in the list shall be the root table of the hierarchy.
attach_databases (list, optional) – A list of colon-joined tuples specifying a database name and its path, defaults to None. The specified databases are attached and made available to the query and the population condition through the specified database name.

static add_column(dictionary, table, column)¶: Add a column required for executing a query to the specified dictionary

close()¶: Terminate use of the data source, freeing associated resources.

download(data_location, database=None, sql_query=None)¶

Download the data source to the specified data location.

Parameters:

data_location (object) – The file or directory path to store for the downloaded data.
database (str, optional) – An optional path, specifying an SQLite database to use for retrieving the records to populate.
sql_query (str, optional) – An SQL SELECT query specifying the required data, defaults to None.

get_query_column_names()¶: Return the column names associated with an executing query

get_table_meta_by_name(name)¶: Return the metadata of the specified table

get_virtual_db()¶: Return the virtual table database as an apsw object

populate(database_path, columns=None, condition=None)¶

Populate the specified SQLite database using the data specified in the object constructor’s call. The database is created if it does not exist. If it exists, the tables to be populated are dropped (if they exist) and recreated anew as specified.

Parameters:

database_path (str) – The path specifying the SQLite database to populate.
columns (list, optional) – A list of strings specifying the columns to populate, defaults to None. The strings are of the form table_name.column_name or table_name.*.
condition (str, optional) – SQL expression specifying the rows to include in the database’s population, defaults to None. The expression can contain references to the table’s columns, to tables in attached databases prefixed by their name, and tables in the database being populated prefixed by populated. Implicitly, if a main table is populated, its details tables will only get populated with the records associated with the corresponding main table’s record.

query(query, partition=False)¶

Run the specified query on the virtual database using the data specified in the object constructor’s call.

Parameters:

query (str) – An SQL SELECT query specifying the required data.
partition (bool, optional) – When true the query will run separately in each container, defaults to False. Queries involving table joins will run substantially faster if access to each table’s records is restricted with an expression table_name.container_id = CONTAINER_ID, and the partition argument is set to true. In such a case the query is repeatedly run over each database partition (compressed JSON file) with CONTAINER_ID iterating sequentially to cover all partitions. The query’s result is the concatenation of the individual partition results. Running queries with joins without partitioning will often result in quadratic (or worse) algorithmic complexity.

Returns:

An iterable over the query’s results.

Return type:

iterable

set_query_columns(query)¶: Set the columns a query requires to run as a set in self.query_columns by invoking the tracer

tables_transitive_closure(table_list, top)¶: Return the transitive closure of all named tables with all the ones required to reach the specified top

class data_source.ElementsCursor(table, parent_cursor)¶

An (abstract) cursor over a collection of data embedded within another cursor.

Close()¶: Cursor’s destructor, used for cleanup

Column(col)¶: Return the value of the column with ordinal col

Eof()¶: Return True when the end of the table’s records has been reached.

Filter(*args)¶: Always called first to initialize an iteration to the first row of the table

abstractmethod Next()¶: Advance reading to the next available element.

abstractmethod Rowid()¶: Return a unique id of the row along all records

container_id()¶: Return the id of the container containing the data being fetched. Not part of the apsw API.

current_row_value()¶: Return the current row. Not part of the apsw API.

record_id()¶: Return the record’s identifier. Not part of the apsw API.

class data_source.FilesCursor(table, get_file_cache)¶

A cursor that iterates over the files in a directory Not used directly by an SQLite table

Filter(index_number, _index_name, constraint_args)¶: Always called first to initialize an iteration to the first (possibly constrained) row of the table

Next()¶: Advance reading to the next available file. Files are assumed to be non-empty.

debug_progress_bar(current_progress=None, total_length=None)¶: Print a progress bar

class data_source.ItemsCursor(table)¶

A cursor over the items of data files. Internal use only. Not used directly by an SQLite table.

Close()¶: Cursor’s destructor, used for cleanup

Eof()¶: Return True when the end of the table’s records has been reached.

Rowid()¶: Return a unique id of the row along all records

current_row_value()¶: Return the current row. Not part of the apsw API.

class data_source.NestedElementsCursor(table, parent_cursor)¶

An abstract cursor designed to facilitate iterating over elements nested within a parent element. It depends on the implementation of the abstract method element_name.

Next()¶: Advance reading to the next available nested element. If the current list of elements is exhausted, it fetches the next list from the parent cursor.

abstractmethod element_name()¶: The record key from which to retrieve the nested elements from the current row of the parent cursor. Not part of the apsw API.

data_source.PROGRESS_BAR_LENGTH = 50¶: int: length of the progress bar printed during database population.

data_source.ROWID_COLUMN = -1¶: int: column that denotes the internally used SQLite rowid integer row identifier.

data_source.ROWID_INDEX = 2¶: int: database table index using the row id.

class data_source.RecordsCursor(table, files_cursor)¶

A cursor for iterating over records within nested elements.

A record could correspond to e.g. the metadata of a publication. Multiple records can appear in a file.

Parameters:

table (StreamingTable) – A StreamingTable object representing the table structure.
file_cursor (object) – A cursor that iterates over the elements in a file. The behavior of this cursor may depend on whether the file is compressed and the type of file being processed.

Close()¶: Cursor’s destructor, used for cleanup

Column(col)¶: Return the value of the column with ordinal col

Eof()¶: Return True when the end of the table’s records has been reached.

Filter(index_number, index_name, constraint_args)¶: Always called first to initialize an iteration to the first row of the table according to the index

Next()¶: Advance to the next item.

Rowid()¶: Return a unique id of the row along all records

container_id()¶: Return the id of the container containing the data being fetched. Not part of the apsw API.

abstractmethod current_row_value()¶: Return the current row. Not part of the apsw API.

element_name()¶: The work key from which to retrieve the elements. Not part of the apsw API.

data_source.SINGLE_PARTITION_INDEX = 'SINGLE_PARTITION'¶: str: denote a table with a single partition by setting or comparing for equality an index value to this reference.

class data_source.StreamingCachedContainerTable(table_meta, table_dict, data_source, sample=<function StreamingTable.<lambda>>)¶

An apsw table streaming over data of the supplied table metadata. This works over a cached data container (e.g. a parsed JSON or XML file) that allows indexing entries within it.

BestIndex(constraints, _orderbys)¶: Called by the Engine to determine the best available index for the operation at hand

class data_source.StreamingTable(table_meta, table_dict, data_source, sample=<function StreamingTable.<lambda>>)¶

An apsw table streaming over data of the supplied table metadata

Parameters:

table_meta (TableMeta) – The table’s metadata.
table_dict (dict[str, TableMeta]) – A map from table names to their metadata.
data_source (object) – An object that is used to supply the database elements. Its type depends on the requirements of the subclass inheriting DataSource.
sample (callable, optional) – A callable to control container sampling, defaults to lambda n: True. The population or query method will call this argument for each Crossref container file with each container’s file name as its argument. When the callable returns True the container file will get processed, when it returns False the container will get skipped.

BestIndex(_constraints, _orderbys)¶: Called by the Engine to determine the best available index for the operation at hand

Destroy()¶: Called when a reference to a virtual table is no longer used

Disconnect()¶: Called when a reference to a virtual table is no longer used

Open()¶: Return the table’s cursor object

cursor(table_meta)¶: Return the cursor associated with this table. The constructor for cursors embedded in others takes a parent cursor argument. To handle this requirement, this method recursively calls itself until it reaches the top-level table.

get_data_source()¶: Return the data source of this table

get_table_meta()¶: Return the metadata of this table

get_table_meta_by_name(name)¶: Return the metadata of the specified table

get_value_extractor_by_ordinal(column_ordinal)¶: Return the value extraction function for column at specified ordinal. Not part of the apsw interface.

sample(data_id)¶: Return true if the identified data element should be included in the sample given in the table’s constructor.

csv_source¶

Functions providing virtual table access to CSV data sources.

class csv_source.CsvCursor(table)¶

A virtual table cursor over CSV data.

Close()¶: Cursor’s destructor, used for cleanup

Column(col)¶: Return the value of the column with ordinal col

Eof()¶: Return True when the end of the table’s records has been reached.

Filter(_index_number, _index_name, _constraint_args)¶: Always called first to initialize an iteration to the first row of the table according to the index

Next()¶: Advance to the next item.

Rowid()¶: Return a unique id of the row along all records

class csv_source.VTSource(table, data_source, sample)¶

Virtual table data source for a single file. This gets registered with the apsw Connection through createmodule in order to instantiate the virtual table.

Connect(_db, _module_name, _db_name, table_name)¶: Create the specified virtual table by creating its schema and an apsw table (a StreamingTable instance) streaming over its data

Create(_db, _module_name, _db_name, table_name)¶: Create the specified virtual table by creating its schema and an apsw table (a StreamingTable instance) streaming over its data

get_container_iterator()¶: Return an iterator over the int identifiers of all data files

get_container_name(_fid)¶: Return the name of the file corresponding to the specified fid

Core XML support for Python.

This package contains four sub-packages:

dom – The W3C Document Object Model. This supports DOM Level 1 +: Namespaces.

parsers – Python wrappers for XML parsers (currently only supports Expat).

sax – The Simple API for XML, developed by XML-Dev, led by David: Megginson and ported to Python by Lars Marius Garshol. This supports the SAX 2 API.
etree – The ElementTree XML library. This is a subset of the full: ElementTree XML release.

Python plugin API¶

db_schema¶

data_source¶

csv_source¶

alexandria3k

Navigation

Related Topics