The Local CAZyme Database Structure
To facilitate the thorough interrogation of data retrieved from CAZy and minimise storing duplicate and redundant data, data retrieved from CAZy is stored in a local SQL database. Every CAZyme scraped from CAZy has the following data:
Protein name
CAZy (sub)family
GenBank accession(s)
Each CAZyme may or may not have the following data, depending on the entry:
EC number(s)
UniProt acession(s)
PDB accession(s)
Note
EC numbers, UniProt accessions and PDB accessions can be retrieved from UniProt for CAZymes
in the local CAZyme database using cazy_webscraper
.
Database Schema
Below is the database ORM. It plots the relationships between elements in each table.
The database ORM (schema) can also be viewed Here
Retrieve the schema of a local CAZyme database
The schema of a local CAZyme database can be retrieved using cazy_webscraper
:
Alternatively, sqlite3 can be used to retrieve the schema:
The Logs table
The database built by cazy_webscraper
contains a table called ‘Logs’. This table logs every
scrape of CAZy, UniProt and GenBank which added data to the database.
The table contains the following columns and data:
log_id: Autoincrement ID number
date: Date scrape was initated (in ISO format)
time: Time scrape was initated (in ISO format)
database: Name of the external database from which data was retrieved (i.e. ‘CAZy’, ‘UniProt’ or ‘GenBank’)
retrieved_annotations: List of annotation types retrieved (e.g. ‘EC number, PDB accession, Sequence’)
classes: CAZy classes for which data was retrieved
families: CAZy families for which data was retrievedclasses
kingdoms: taxonomy Kingdoms filteres applied
genera_filter: taxonomy Kingdoms filteres applied
species_filter: taxonomy Kingdoms filteres applied
strains_filter: taxonomy Kingdoms filteres applied
ec_filter: EC fliters applied to retrieve data for CAZymes annotated with the specified EC numbers (only applies to retrieval of data from UniProt, GenBank and PDB)
cmd_line: Reproduction of the command line arguments passed to
cazy_webscraper
.
The ‘Logs’ table allows any one who uses the database to see how the dataset was compiled.
Genbanks
The Genbanks table contains data retrieved from CAZy and NCBI GenBank. From CAZy the GenBank protein accession.
The protein sequence for the protein can be retrieved from NCBI GenBank using cazy_webscraper
, and stored
in the Genbanks table, as well as the date the sequence was updated in NCBI. The sequence update date is used
to check if a newly retrieved protein sequence from GenBank is newer than the sequence stored in the database, and the
sequence in the database should be updated.
Genbanks_CazyFamilies
The Genbanks_CazyFamilies table is a relationship. The table defines which protein is assigned to which CAZy family.
CazyFamilies
The CazyFamilies lists of CAZy families retrieved from CAZy. If CAZy subfamilies are retrieved each CAZy subfamily is associated with its parent CAZy family.
Taxs
The Taxs table stores taxonomy database, storing the genus and species of the source organisms of CAZymes retrieved from CAZy. Each source organism is associated with a taxonomic class.
Kingdoms
The Kingdoms table lists all taxonomic kingdoms of the source organisms downloaded from CAZy.
UniProts
The UniProts table contains protein data retrieved from UniProt using cazy_webscraper
. This includes:
UniProt ID
Protein name
Protein sequence
Date the protein sequence was last upated in UniProt
Genbanks_Ecs
The Genbanks_Ecs table is a relationship table, and defines which proteins are annotated with which EC numbers.
Ecs
The Ecs table lists all EC numbers retrieved from UniProt using cazy_webscraper
. The EC numbers stored in the
local CAZome database are do not have the ‘EC’ prefix.
Genbanks_Pdbs
The Genbanks_Pdbs is a relationship table, and defines which PDB accessions belong to which protein.
Pdbs
The Pdbs table contains all PDB accessions retrieved from UniProt using cazy_webscraper
.
Note
Not all PDB accessions represented in a CAZyme record at CAZy are necessarily present in PDB. For example, some accessions are placeholders while structures are under embargo.
Note
PDB/RCSB protein structures are not recorded in the local SQLite3 database. They are written to disk in a user-specified directory.
The Schema
As of cazy_webscraper version >= 2.3.0, the schema of a local CAZyme database will be:
CREATE TABLE IF NOT EXISTS "Kingdoms" (
kingdom_id INTEGER NOT NULL,
kingdom VARCHAR,
PRIMARY KEY (kingdom_id),
UNIQUE (kingdom)
);
CREATE TABLE IF NOT EXISTS "GtdbTaxs" (
gtdb_tax_id INTEGER NOT NULL,
kingdom VARCHAR,
phylum VARCHAR,
tax_class VARCHAR,
tax_order VARCHAR,
family VARCHAR,
genus VARCHAR,
species VARCHAR,
release VARCHAR,
PRIMARY KEY (gtdb_tax_id),
UNIQUE (kingdom, phylum, tax_class, tax_order, family, genus, species, release)
);
CREATE TABLE IF NOT EXISTS "CazyFamilies" (
family_id INTEGER NOT NULL,
family VARCHAR NOT NULL,
subfamily VARCHAR,
PRIMARY KEY (family_id),
UNIQUE (family, subfamily)
);
CREATE INDEX fam_index ON "CazyFamilies" (family, subfamily);
CREATE TABLE IF NOT EXISTS "NcbiTaxs" (
ncbi_tax_id INTEGER NOT NULL,
kingdom VARCHAR,
phylum VARCHAR,
tax_class VARCHAR,
tax_order VARCHAR,
family VARCHAR,
genus VARCHAR,
species VARCHAR,
strain VARCHAR,
PRIMARY KEY (ncbi_tax_id),
UNIQUE (ncbi_tax_id)
);
CREATE INDEX ncbi_index ON "NcbiTaxs" (ncbi_tax_id, genus, species);
CREATE TABLE IF NOT EXISTS "Uniprots" (
uniprot_id INTEGER NOT NULL,
uniprot_accession VARCHAR,
uniprot_name VARCHAR,
sequence VARCHAR,
seq_update_date VARCHAR,
PRIMARY KEY (uniprot_id),
UNIQUE (uniprot_accession)
);
CREATE INDEX uniprot_option ON "Uniprots" (uniprot_id, uniprot_accession);
CREATE TABLE IF NOT EXISTS "Ecs" (
ec_id INTEGER NOT NULL,
ec_number VARCHAR,
PRIMARY KEY (ec_id),
UNIQUE (ec_number)
);
CREATE INDEX "ix_Ecs_ec_number" ON "Ecs" (ec_number);
CREATE TABLE IF NOT EXISTS "Pdbs" (
pdb_id INTEGER NOT NULL,
pdb_accession VARCHAR,
PRIMARY KEY (pdb_id),
UNIQUE (pdb_accession)
);
CREATE INDEX pdb_idx ON "Pdbs" (pdb_accession);
CREATE TABLE IF NOT EXISTS "Logs" (
log_id INTEGER NOT NULL,
date VARCHAR,
time VARCHAR,
"database" VARCHAR,
retrieved_annotations VARCHAR,
classes VARCHAR,
families VARCHAR,
kingdoms VARCHAR,
genera_filter VARCHAR,
species_filter VARCHAR,
strains_filter VARCHAR,
ec_filter VARCHAR,
cmd_line VARCHAR,
PRIMARY KEY (log_id)
);
CREATE TABLE IF NOT EXISTS "Taxs" (
taxonomy_id INTEGER NOT NULL,
genus VARCHAR,
species VARCHAR,
kingdom_id INTEGER,
PRIMARY KEY (taxonomy_id),
UNIQUE (genus, species),
FOREIGN KEY(kingdom_id) REFERENCES "Kingdoms" (kingdom_id)
);
CREATE INDEX organism_option ON "Taxs" (taxonomy_id, genus, species);
CREATE TABLE IF NOT EXISTS "Genomes" (
genome_id INTEGER NOT NULL,
assembly_name VARCHAR,
gbk_version_accession VARCHAR,
gbk_ncbi_id INTEGER,
refseq_version_accession VARCHAR,
refseq_ncbi_id INTEGER,
gtdb_tax_id INTEGER,
PRIMARY KEY (genome_id),
UNIQUE (assembly_name, gbk_version_accession, refseq_version_accession),
FOREIGN KEY(gtdb_tax_id) REFERENCES "GtdbTaxs" (gtdb_tax_id)
);
CREATE INDEX genome_options ON "Genomes" (assembly_name, gbk_version_accession, refseq_version_accession);
CREATE TABLE IF NOT EXISTS "Genbanks" (
genbank_id INTEGER NOT NULL,
genbank_accession VARCHAR,
sequence VARCHAR,
seq_update_date VARCHAR,
taxonomy_id INTEGER,
ncbi_tax_id INTEGER,
uniprot_id INTEGER,
PRIMARY KEY (genbank_id),
UNIQUE (genbank_accession),
FOREIGN KEY(taxonomy_id) REFERENCES "Taxs" (taxonomy_id),
FOREIGN KEY(ncbi_tax_id) REFERENCES "NcbiTaxs" (ncbi_tax_id),
FOREIGN KEY(uniprot_id) REFERENCES "Uniprots" (uniprot_id)
);
CREATE INDEX "ix_Genbanks_genbank_accession" ON "Genbanks" (genbank_accession);
CREATE TABLE IF NOT EXISTS "Genbanks_Genomes" (
genbank_id INTEGER NOT NULL,
genome_id INTEGER NOT NULL,
PRIMARY KEY (genbank_id, genome_id),
FOREIGN KEY(genbank_id) REFERENCES "Genbanks" (genbank_id),
FOREIGN KEY(genome_id) REFERENCES "Genomes" (genome_id)
);
CREATE TABLE IF NOT EXISTS "Genbanks_CazyFamilies" (
genbank_id INTEGER NOT NULL,
family_id INTEGER NOT NULL,
PRIMARY KEY (genbank_id, family_id),
FOREIGN KEY(genbank_id) REFERENCES "Genbanks" (genbank_id),
FOREIGN KEY(family_id) REFERENCES "CazyFamilies" (family_id)
);
CREATE TABLE IF NOT EXISTS "Genbanks_Ecs" (
genbank_id INTEGER NOT NULL,
ec_id INTEGER NOT NULL,
PRIMARY KEY (genbank_id, ec_id),
FOREIGN KEY(genbank_id) REFERENCES "Genbanks" (genbank_id),
FOREIGN KEY(ec_id) REFERENCES "Ecs" (ec_id)
);
CREATE TABLE IF NOT EXISTS "Genbanks_Pdbs" (
genbank_id INTEGER NOT NULL,
pdb_id INTEGER NOT NULL,
PRIMARY KEY (genbank_id, pdb_id),
FOREIGN KEY(genbank_id) REFERENCES "Genbanks" (genbank_id),
FOREIGN KEY(pdb_id) REFERENCES "Pdbs" (pdb_id)
);