Python plugin API¶
The alexandria3k plugin API allows the addition of data sources and processing routines. These can be contributed to alexandria3k or used in parallel with it.
db_schema¶
A module supporting virtual database table schema definition.
- class db_schema.ColumnMeta(name, value_extractor=None, **kwargs)¶
A container for column meta-data
- get_definition()¶
Return column’s DDL definition
- get_description()¶
Return column’s description, if any
- get_name()¶
Return column’s name
- get_value_extractor()¶
Return the column’s defined value extraction function
- class db_schema.TableMeta(name, **kwargs)¶
A container for table meta-data
- get_column_definition_by_name(name)¶
Return the DDL definition of the column with the specified name
- get_columns()¶
Return the table’s columns
- get_cursor_class()¶
Return the table’s specified cursor class
- get_extract_multiple()¶
Return the function for obtaining multiple records
- get_foreign_key()¶
Return our column that refers to the parent table’s primary key
- get_name()¶
Return the table’s name
- get_parent_extract_multiple()¶
Return the function for obtaining multiple records from the parent table
- get_parent_name()¶
Return the name of the main table of which this has details
- get_post_population_script()¶
Return the SQL command to run after the table is populated
- get_primary_key()¶
Return the parent table’s column name that refers to our foreign key
- get_value_extractor_by_name(name)¶
Return defined value extraction function for column name
- get_value_extractor_by_ordinal(i)¶
Return defined value extraction function for column at ordinal i
- insert_statement()¶
Return an SQL command to insert data into the table
- table_schema(prefix='', columns=None)¶
Return the SQL command to create a table’s schema with the optional specified prefix. A columns array can be used to specify which columns to include.
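As a rough illustration of how table metadata can drive schema creation and row insertion, the sketch below uses simplified stand-in classes. The names Column and Table, the definition argument, and the single-argument value extractors are assumptions made for this example; they are not the db_schema classes themselves and may differ from the library's actual signatures.

```python
import sqlite3


class Column:
    """Hypothetical stand-in for column metadata (not db_schema.ColumnMeta)."""

    def __init__(self, name, value_extractor=None, definition=None):
        self.name = name
        self.value_extractor = value_extractor
        self.definition = definition

    def ddl(self):
        # Append the optional type or constraint text to the column name
        return f"{self.name} {self.definition}" if self.definition else self.name


class Table:
    """Hypothetical stand-in for table metadata (not db_schema.TableMeta)."""

    def __init__(self, name, columns):
        self.name = name
        self.columns = columns

    def table_schema(self, prefix="", columns=None):
        # Keep all columns, or only those whose names were requested
        selected = [c for c in self.columns if columns is None or c.name in columns]
        body = ",\n  ".join(c.ddl() for c in selected)
        return f"CREATE TABLE {prefix}{self.name}(\n  {body}\n)"

    def insert_statement(self):
        # One positional placeholder per column
        placeholders = ", ".join("?" for _ in self.columns)
        return f"INSERT INTO {self.name} VALUES({placeholders})"


works = Table("works", [
    Column("id", definition="INTEGER PRIMARY KEY"),
    Column("doi", lambda row: row.get("DOI")),
    Column("title", lambda row: row.get("title")),
])

# Schema restricted to two columns, with a database-name prefix
schema = works.table_schema(prefix="populated.", columns=["id", "doi"])

# The unprefixed, full schema can be executed directly by SQLite
connection = sqlite3.connect(":memory:")
connection.execute(works.table_schema())
connection.execute(works.insert_statement(), (1, "10.1000/example", "A title"))
stored = connection.execute("SELECT doi FROM works").fetchall()
```

The columns argument filters by name, which mirrors the documented behaviour of including only the specified columns in the generated schema.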
data_source¶
A module supporting queries and database population through (possibly partitioned) virtual database tables.
- data_source.CONTAINER_ID_COLUMN = 1¶
int: by convention column 1 of each table holds the container (file) id. This can e.g. be the index of the file in an array of container files or an increasing counter over a streamed collection.
- data_source.CONTAINER_INDEX = 1¶
int: database table index using the container id.
- class data_source.DataFiles(directory, sample_container, file_extension=None, file_name_regex=None)¶
The source of compressed data files
- get_container_iterator()¶
Return an iterator over the int identifiers of all data files
- get_container_name(fid)¶
Return the name of the file corresponding to the specified fid
- get_file_array()¶
Return the array of data files
- class data_source.DataSource(data_source, tables, attach_databases=None)¶
Create a meta-data object that supports queries over its (virtual) tables and the population of an SQLite database with its data. This is an abstract parent class of concrete data source classes that implement each data source, such as Crossref, PubMed, ORCID, and USPTO.
- Parameters
data_source (object) – An object that shall supply the database elements. Its type depends on the requirements of the subclass inheriting DataSource.
tables (TableMeta) – A list of the table metadata associated with the data source. The first table in the list shall be the root table of the hierarchy.
attach_databases (list, optional) – A list of colon-joined tuples specifying a database name and its path, defaults to None. The specified databases are attached and made available to the query and the population condition through the specified database name.
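The colon-joined form of attach_databases can be pictured with plain sqlite3. This is a minimal sketch of the idea, not the library's implementation: the helper attach_all, the database name side, and the journals table are all hypothetical.

```python
import os
import sqlite3
import tempfile


def attach_all(connection, attach_databases):
    """Attach each "name:path" entry under the given database name."""
    for spec in attach_databases or []:
        # Split on the first colon only, so paths containing colons survive
        name, path = spec.split(":", 1)
        # The path is bound as a parameter; the name is an SQL identifier
        connection.execute(f"ATTACH DATABASE ? AS {name}", (path,))


# Create a small side database to attach (illustrative data only)
tmp_dir = tempfile.mkdtemp()
side_path = os.path.join(tmp_dir, "side.db")
side = sqlite3.connect(side_path)
side.execute("CREATE TABLE journals(issn TEXT)")
side.execute("INSERT INTO journals VALUES ('1234-5678')")
side.commit()
side.close()

main = sqlite3.connect(":memory:")
attach_all(main, [f"side:{side_path}"])

# Tables in the attached database are reachable through the name prefix
attached_rows = main.execute("SELECT issn FROM side.journals").fetchall()
```

Once attached, a query or population condition can refer to side.journals in the same way the parameter description states for attached databases.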
- static add_column(dictionary, table, column)¶
Add a column required for executing a query to the specified dictionary
- download(data_location, database=None, sql_query=None)¶
Download the data source to the specified data location.
- Parameters
data_location (object) – The file or directory path in which to store the downloaded data.
database (str, optional) – An optional path, specifying an SQLite database to use for retrieving the records to populate.
sql_query (str, optional) – An SQL SELECT query specifying the required data, defaults to None.
- get_query_column_names()¶
Return the column names associated with an executing query
- get_table_meta_by_name(name)¶
Return the metadata of the specified table
- get_virtual_db()¶
Return the virtual table database as an apsw object
- populate(database_path, columns=None, condition=None)¶
Populate the specified SQLite database using the data specified in the object constructor’s call. The database is created if it does not exist. If it exists, the tables to be populated are dropped (if they exist) and recreated anew as specified.
- Parameters
database_path (str) – The path specifying the SQLite database to populate.
columns (list, optional) – A list of strings specifying the columns to populate, defaults to None. The strings are of the form table_name.column_name or table_name.*.
condition (str, optional) – SQL expression specifying the rows to include in the database’s population, defaults to None. The expression can contain references to the table’s columns, to tables in attached databases prefixed by their name, and tables in the database being populated prefixed by populated. Implicitly, if a main table is populated, its details tables will only get populated with the records associated with the corresponding main table’s record.
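The columns argument uses strings of the form table_name.column_name or table_name.*. The sketch below shows one way such a list could be grouped by table; the helper group_columns is hypothetical, and the table and column names are only illustrative.

```python
from collections import defaultdict


def group_columns(specs):
    """Group "table.column" strings into a dict of table -> set of columns."""
    by_table = defaultdict(set)
    for spec in specs:
        # A "*" column stands for every column of the table
        table, column = spec.split(".", 1)
        by_table[table].add(column)
    return dict(by_table)


grouped = group_columns(["works.doi", "works.title", "work_authors.*"])
```

A condition such as works.published_year = 2020 (again an illustrative column name) would then restrict which rows of the selected columns are stored.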
- query(query, partition=False)¶
Run the specified query on the virtual database using the data specified in the object constructor’s call.
- Parameters
query (str) – An SQL SELECT query specifying the required data.
partition (bool, optional) – When true the query will run separately in each container, defaults to False. Queries involving table joins will run substantially faster if access to each table’s records is restricted with an expression table_name.container_id = CONTAINER_ID, and the partition argument is set to true. In such a case the query is repeatedly run over each database partition (compressed JSON file) with CONTAINER_ID iterating sequentially to cover all partitions. The query’s result is the concatenation of the individual partition results. Running queries with joins without partitioning will often result in quadratic (or worse) algorithmic complexity.
- Returns
An iterable over the query’s results.
- Return type
iterable
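The effect of partitioning can be imitated with an ordinary SQLite database. In this sketch, which is an assumption-laden illustration rather than the library's mechanism, a named parameter :cid plays the role of CONTAINER_ID, the join is run once per container, and the results are concatenated. The tables and rows are made up for the example.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE works(id INTEGER, container_id INTEGER, doi TEXT);
CREATE TABLE work_authors(work_id INTEGER, container_id INTEGER, family TEXT);
INSERT INTO works VALUES (1, 0, '10.1/a'), (2, 1, '10.1/b');
INSERT INTO work_authors VALUES
  (1, 0, 'Knuth'), (2, 1, 'Hopper'), (2, 1, 'Liskov');
""")

# Each table's rows are restricted to the container currently processed
QUERY = """
SELECT works.doi, work_authors.family
FROM works JOIN work_authors ON work_authors.work_id = works.id
WHERE works.container_id = :cid AND work_authors.container_id = :cid
ORDER BY work_authors.family
"""


def partitioned_query(connection, query, container_ids):
    """Run the query once per container and concatenate the results."""
    for cid in container_ids:
        yield from connection.execute(query, {"cid": cid})


results = list(partitioned_query(db, QUERY, [0, 1]))
```

Because each run only joins rows from a single container, the join's cost grows with the size of a container rather than with the size of the whole collection.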
- set_query_columns(query)¶
Invoke the tracer to determine the columns a query requires to run, and store them as a set in self.query_columns
- tables_transitive_closure(table_list, top)¶
Return the transitive closure of the named tables, including all the tables required to reach the specified top table
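The closure can be understood as walking from each named table up through its parents until the top table is reached. The sketch below assumes a hypothetical PARENT mapping with illustrative table names; the library derives the parent relationship from its table metadata, and its return type may differ.

```python
# Hypothetical parent relationships: detail table -> main table
PARENT = {
    "works": None,
    "work_authors": "works",
    "author_affiliations": "work_authors",
    "work_references": "works",
}


def tables_closure(table_names, top, parent=PARENT):
    """Return the named tables plus every ancestor needed to reach top."""
    result = {top}
    for name in table_names:
        # Climb towards the root, stopping at tables already collected
        while name is not None and name not in result:
            result.add(name)
            name = parent[name]
    return result


closure = tables_closure(["author_affiliations"], "works")
```

Here work_authors is added even though it was not named, because author_affiliations can only be reached from works through it.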
- class data_source.ElementsCursor(table, parent_cursor)¶
An (abstract) cursor over a collection of data embedded within another cursor.
- Close()¶
Cursor’s destructor, used for cleanup
- Column(col)¶
Return the value of the column with ordinal col
- Eof()¶
Return True when the end of the table’s records has been reached.
- Filter(*args)¶
Always called first to initialize an iteration to the first row of the table
- abstract Next()¶
Advance reading to the next available element.
- abstract Rowid()¶
Return a unique id of the row along all records
- container_id()¶
Return the id of the container containing the data being fetched. Not part of the apsw API.
- current_row_value()¶
Return the current row. Not part of the apsw API.
- record_id()¶
Return the record’s identifier. Not part of the apsw API.
- class data_source.FilesCursor(table, get_file_cache)¶
A cursor that iterates over the files in a directory. Not used directly by an SQLite table.
- Filter(index_number, _index_name, constraint_args)¶
Always called first to initialize an iteration to the first (possibly constrained) row of the table
- Next()¶
Advance reading to the next available file. Files are assumed to be non-empty.
- debug_progress_bar(current_progress=None, total_length=None)¶
Print a progress bar
- class data_source.ItemsCursor(table)¶
A cursor over the items of data files. Internal use only. Not used directly by an SQLite table.
- Close()¶
Cursor’s destructor, used for cleanup
- Eof()¶
Return True when the end of the table’s records has been reached.
- Rowid()¶
Return a unique id of the row along all records
- current_row_value()¶
Return the current row. Not part of the apsw API.
- class data_source.NestedElementsCursor(table, parent_cursor)¶
An abstract cursor designed to facilitate iterating over elements nested within a parent element. It depends on the implementation of the abstract method element_name.
- Next()¶
Advance reading to the next available nested element. If the current list of elements is exhausted, it fetches the next list from the parent cursor.
- abstract element_name()¶
The record key from which to retrieve the nested elements from the current row of the parent cursor. Not part of the apsw API.
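The interplay between a nested cursor and its parent can be sketched in plain Python, without apsw. The classes below are simplified stand-ins that follow the Filter, Eof, Next calling order described above; they are not the data_source classes, and passing the element key to the constructor (rather than implementing an abstract element_name method) is a simplification made for this example.

```python
class ParentCursor:
    """Stand-in cursor over a list of records (dicts)."""

    def __init__(self, records):
        self.records = records
        self.index = 0

    def Filter(self, *args):
        # Always called first: rewind to the first record
        self.index = 0

    def Eof(self):
        return self.index >= len(self.records)

    def Next(self):
        self.index += 1

    def current_row_value(self):
        return self.records[self.index]


class NestedCursor:
    """Stand-in cursor over the elements nested under a key of each record."""

    def __init__(self, parent_cursor, element_name):
        self.parent = parent_cursor
        self.key = element_name
        self.elements = []
        self.position = 0

    def _load(self):
        # Fetch the next non-empty element list from the parent cursor
        while not self.parent.Eof():
            self.elements = self.parent.current_row_value().get(self.key, [])
            self.position = 0
            if self.elements:
                return
            self.parent.Next()
        self.elements = []
        self.position = 0

    def Filter(self, *args):
        self.parent.Filter(*args)
        self._load()

    def Eof(self):
        return self.parent.Eof()

    def Next(self):
        self.position += 1
        if self.position >= len(self.elements):
            # Current list exhausted: move on to the next parent record
            self.parent.Next()
            self._load()

    def current_row_value(self):
        return self.elements[self.position]


records = [{"author": ["A", "B"]}, {}, {"author": ["C"]}]
cursor = NestedCursor(ParentCursor(records), "author")
cursor.Filter()
names = []
while not cursor.Eof():
    names.append(cursor.current_row_value())
    cursor.Next()
```

The second record has no author key, so the sketch skips it and continues with the third, which matches the description of fetching the next list from the parent cursor once the current one is exhausted.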
- data_source.PROGRESS_BAR_LENGTH = 50¶
int: length of the progress bar printed during database population.
- data_source.ROWID_COLUMN = -1¶
int: column that denotes the internally used SQLite rowid integer row identifier.
- data_source.ROWID_INDEX = 2¶
int: database table index using the row id.
- class data_source.RecordsCursor(table, files_cursor)¶
A cursor for iterating over records within nested elements.
A record could correspond to e.g. the metadata of a publication. Multiple records can appear in a file.
- Parameters
table (StreamingTable) – A StreamingTable object representing the table structure.
files_cursor (object) – A cursor that iterates over the elements in a file. The behavior of this cursor may depend on whether the file is compressed and the type of file being processed.
- Close()¶
Cursor’s destructor, used for cleanup
- Column(col)¶
Return the value of the column with ordinal col
- Eof()¶
Return True when the end of the table’s records has been reached.
- Filter(index_number, index_name, constraint_args)¶
Always called first to initialize an iteration to the first row of the table according to the index
- Next()¶
Advance to the next item.
- Rowid()¶
Return a unique id of the row along all records
- container_id()¶
Return the id of the container containing the data being fetched. Not part of the apsw API.
- abstract current_row_value()¶
Return the current row. Not part of the apsw API.
- element_name()¶
The work key from which to retrieve the elements. Not part of the apsw API.
- data_source.SINGLE_PARTITION_INDEX = 'SINGLE_PARTITION'¶
str: denote a table with a single partition by setting or comparing for equality an index value to this reference.
- class data_source.StreamingCachedContainerTable(table_meta, table_dict, data_source, sample=<function StreamingTable.<lambda>>)¶
An apsw table streaming over data of the supplied table metadata. This works over a cached data container (e.g. a parsed JSON or XML file) that allows indexing entries within it.
- BestIndex(constraints, _orderbys)¶
Called by the Engine to determine the best available index for the operation at hand
- class data_source.StreamingTable(table_meta, table_dict, data_source, sample=<function StreamingTable.<lambda>>)¶
An apsw table streaming over data of the supplied table metadata
- Parameters
table_meta (TableMeta) – The table’s metadata.
table_dict (dict[str, TableMeta]) – A map from table names to their metadata.
data_source (object) – An object that is used to supply the database elements. Its type depends on the requirements of the subclass inheriting DataSource.
sample (callable, optional) – A callable to control container sampling, defaults to lambda n: True. The population or query method will call this argument for each Crossref container file with each container’s file name as its argument. When the callable returns True the container file will get processed; when it returns False the container will get skipped.
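A sample callable receives a container's file name and returns True or False. The sketch below shows one possible way to build a reproducible sampler by hashing the name; the helper make_sampler and the file names are hypothetical, and any callable with the same one-argument shape would serve.

```python
import zlib


def make_sampler(percent):
    """Return a callable that accepts roughly percent% of container names."""

    def sample(container_name):
        # Hash the name so the decision is stable across runs and call order
        bucket = zlib.crc32(container_name.encode("utf-8")) % 100
        return bucket < percent

    return sample


# The documented default behaviour: process every container
sample_all = lambda name: True

names = [f"{i}.json.gz" for i in range(1000)]
ten_percent = make_sampler(10)
selected = [n for n in names if ten_percent(n)]
```

Hashing the name, rather than drawing random numbers, keeps the selected subset identical between a query run and a later population run over the same files.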
- BestIndex(_constraints, _orderbys)¶
Called by the Engine to determine the best available index for the operation at hand
- Destroy()¶
Called when a reference to a virtual table is no longer used
- Disconnect()¶
Called when a reference to a virtual table is no longer used
- Open()¶
Return the table’s cursor object
- cursor(table_meta)¶
Return the cursor associated with this table. The constructor for cursors embedded in others takes a parent cursor argument. To handle this requirement, this method recursively calls itself until it reaches the top-level table.
- get_data_source()¶
Return the data source of this table
- get_table_meta()¶
Return the metadata of this table
- get_table_meta_by_name(name)¶
Return the metadata of the specified table
- get_value_extractor_by_ordinal(column_ordinal)¶
Return the value extraction function for column at specified ordinal. Not part of the apsw interface.
- sample(data_id)¶
Return true if the identified data element should be included in the sample given in the table’s constructor.
csv_source¶
Functions providing virtual table access to CSV data sources.
- class csv_source.CsvCursor(table)¶
A virtual table cursor over CSV data.
- Close()¶
Cursor’s destructor, used for cleanup
- Column(col)¶
Return the value of the column with ordinal col
- Eof()¶
Return True when the end of the table’s records has been reached.
- Filter(_index_number, _index_name, _constraint_args)¶
Always called first to initialize an iteration to the first row of the table according to the index
- Next()¶
Advance to the next item.
- Rowid()¶
Return a unique id of the row along all records
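The cursor methods listed above are called in a fixed order: Filter first, then Eof, Column or Rowid, and Next in a loop. The following sketch imitates that protocol over in-memory CSV text using only the standard library. SimpleCsvCursor is a hypothetical stand-in, not csv_source.CsvCursor, and treating column -1 as the row id follows the ROWID_COLUMN convention documented earlier.

```python
import csv
import io

ROWID_COLUMN = -1  # Same value as the documented data_source.ROWID_COLUMN


class SimpleCsvCursor:
    """Stand-in cursor over CSV text, following the Filter/Eof/Next order."""

    def __init__(self, text):
        self.text = text
        self.reader = None
        self.row = None
        self.row_id = -1

    def Filter(self, *_args):
        # Always called first: restart reading and load the first row
        self.reader = csv.reader(io.StringIO(self.text))
        self.row_id = -1
        self.Next()

    def Next(self):
        # A row of None marks the end of the data
        self.row = next(self.reader, None)
        self.row_id += 1

    def Eof(self):
        return self.row is None

    def Rowid(self):
        return self.row_id

    def Column(self, col):
        if col == ROWID_COLUMN:
            return self.row_id
        return self.row[col]

    def Close(self):
        self.reader = None


csv_cursor = SimpleCsvCursor("a,1\nb,2\n")
csv_cursor.Filter()
rows = []
while not csv_cursor.Eof():
    rows.append((csv_cursor.Rowid(), csv_cursor.Column(0), csv_cursor.Column(1)))
    csv_cursor.Next()
csv_cursor.Close()
```

Values are returned as strings here; any type conversion would be the job of a column's value extraction function.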
- class csv_source.VTSource(table, data_source, sample)¶
Virtual table data source for a single file. This gets registered with the apsw Connection through createmodule in order to instantiate the virtual table.
- Connect(_db, _module_name, _db_name, table_name)¶
Create the specified virtual table by creating its schema and an apsw table (a StreamingTable instance) streaming over its data
- Create(_db, _module_name, _db_name, table_name)¶
Create the specified virtual table by creating its schema and an apsw table (a StreamingTable instance) streaming over its data
- get_container_iterator()¶
Return an iterator over the int identifiers of all data files
- get_container_name(_fid)¶
Return the name of the file corresponding to the specified fid
Core XML support for Python.
This package contains four sub-packages:
- dom – The W3C Document Object Model. This supports DOM Level 1 + Namespaces.
- parsers – Python wrappers for XML parsers (currently only supports Expat).
- sax – The Simple API for XML, developed by XML-Dev, led by David Megginson and ported to Python by Lars Marius Garshol. This supports the SAX 2 API.
- etree – The ElementTree XML library. This is a subset of the full ElementTree XML release.