Tutorials on configuring cazy_webscraper to retrieve GTDB taxonomic classifications
cazy_webscraper can be configured to retrieve the latest taxonomic classifications from the Genome Taxonomy Database (GTDB) for user-specified sets of CAZymes in a local CAZyme database. Many of the same configuration options apply to the retrieval of protein data from CAZy, UniProt, GenBank and PDB.
The retrieved taxonomic classifications are stored in the GtdbTaxs table of the local CAZyme database. The child proteins for each taxonomy record in the GtdbTaxs table are identified by including a ncbi_tax_id from the GtdbTaxs table in the respective Genomes table records.
Note
As in the GTDB database, GTDB taxonomic classifications are retrieved and associated with genomes stored in the local CAZyme database. To retrieve GTDB taxonomic classifications the genomic data for the proteins of interest must be listed in the local CAZyme database.
cazy_webscraper can be configured via the command line and/or via a YAML configuration file.
This page runs through examples of how to combine the various ‘filters’ that can be applied, to fully customise the retrieval of taxonomic classifications from GTDB. These tutorials are designed for those with less experience using command-line tools.
Note
If you installed cazy_webscraper using bioconda or pip, then to retrieve GTDB taxonomic data invoke it using cw_get_gtdb_taxs; this is the method used in this tutorial.
If you installed cazy_webscraper from source then you will need to invoke cazy_webscraper from the root of the repo using the command python3 cazy_webscraper/expand/genbank/taxonomy/get_ncbi_taxs.py.
From this point on, we will discuss cw_get_gtdb_taxs, which is the entry point for retrieving data from GTDB Taxonomy. We also presume you are comfortable configuring cazy_webscraper for the scraping of data from CAZy.
Configuration via the command line
cw_get_gtdb_taxs has two required arguments:
* The path to the local CAZyme database created using cazy_webscraper
* Source kingdoms. Accepts ‘archaea’ and/or ‘bacteria’
cw_get_gtdb_taxs cazy/cazyme_db.db archaea bacteria
When no optional arguments are provided, the default behaviour is invoked: the latest taxonomic classification is retrieved from GTDB for all proteins in the local CAZyme database that do not currently have GTDB taxonomy data listed in the local database. Taxonomic information already in the local database (i.e. the higher lineage classifications) is not updated.
Selecting source kingdoms
GTDB catalogues the taxonomic lineages of archaeal and bacterial species. Defining the source taxonomic kingdom(s) (the second positional argument for cw_get_gtdb_taxs) determines which datafiles are retrieved from GTDB, and thus which taxonomic lineages are added to the local CAZyme database. This is separate from the --kingdoms filter, which is used to define CAZymes of interest by the taxonomic classification retrieved from CAZy.
To add only archaeal lineages retrieved from GTDB, use only archaea:
cw_get_gtdb_taxs cazy/cazyme_db.db archaea
To add only bacterial lineages to the local CAZyme database, use only bacteria:
cw_get_gtdb_taxs cazy/cazyme_db.db bacteria
To add both archaeal and bacterial lineages from GTDB to the local CAZyme database, use both archaea and bacteria (in any order), separated with a single space:
cw_get_gtdb_taxs cazy/cazyme_db.db bacteria archaea
Options configurable at the command line
CAZymes of interest can be defined by providing:
* A set of GenBank accessions
* A set of UniProt accessions
* CAZy classes
* CAZy families
* Taxonomic kingdoms
* Genera
* Species
* Strains
* EC numbers (if previously retrieved from UniProt)
Here you can find a full list of the command-line flags and options.
Retrieving taxonomy classifications for specific CAZy classes and families
The --classes and --families flags used when scraping data from CAZy are applied in the exact same way for retrieving taxonomy data from GTDB.
For instance, if instead of retrieving taxonomic data for all CAZymes in your local CAZyme database, you want to retrieve taxonomic data only for CAZymes in specific CAZy classes, then add the --classes flag followed by the classes you want to retrieve taxonomic data for.
Tip
To list multiple classes, separate the classes with a single comma.
For example, if you want to retrieve taxonomic data for all CAZymes from the Glycoside Hydrolases and Carbohydrate Esterases classes, then use the command:
cw_get_gtdb_taxs cazy/cazyme.db archaea bacteria --classes GH,CE
OR
cw_get_gtdb_taxs cazy/cazyme.db archaea bacteria --classes "Glycoside Hydrolases,Carbohydrate Esterases"
Retrieving taxonomic data for proteins from specific CAZy families is achieved using the --families flag. For example, to retrieve taxonomic data for all proteins in PL1, PL2 and PL3 in the local CAZyme database, use the following command:
cw_get_gtdb_taxs cazy/cazyme.db archaea bacteria --families PL1,PL2,PL3
Warning
cw_get_gtdb_taxs only accepts families written in the proper CAZy family syntax.
GH1 is accepted.
gh1 and GlycosideHydrolases1 are not accepted.
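A quick pre-flight check can catch malformed family names before a command is run. The sketch below is only an illustration: the helper name is_cazy_family is hypothetical, and the pattern (an upper-case CAZy class abbreviation followed by a number, with an optional subfamily suffix) is an assumption about what counts as proper syntax, not the validation that cw_get_gtdb_taxs itself performs.

```shell
# Hypothetical helper: succeeds if the name looks like a CAZy family (e.g. GH1, PL3, GH5_1).
# The accepted pattern is an assumption made for this sketch.
is_cazy_family() {
    printf '%s\n' "$1" | grep -Eq '^(GH|GT|PL|CE|AA|CBM)[0-9]+(_[0-9]+)?$'
}

# Check a few candidate names and report which ones would pass.
for fam in GH1 gh1 GlycosideHydrolases1 PL3; do
    if is_cazy_family "$fam"; then
        echo "$fam: looks valid"
    else
        echo "$fam: rejected"
    fi
done
```

Running the check over a list of families before building the --families value avoids discovering a typo only after the command has started.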
As with scraping data from CAZy, the --classes and --families flags can be combined. To retrieve taxonomic data for all CAZymes in PL1, PL2, PL3 and all of GH and CE, both:
cw_get_gtdb_taxs cazy/cazyme.db archaea bacteria --families PL1,PL2,PL3 --classes GH,CE
AND
cw_get_gtdb_taxs cazy/cazyme.db archaea bacteria --classes GH,CE --families PL1,PL2,PL3
are accepted.
Applying taxonomic filters
The --kingdoms, --genera, --species and --strains flags can be used to refine, by taxonomy, the set of proteins for which data is retrieved. These flags are applied in the exact same way as they are used for the scraping of data from CAZy. Only proteins in the local CAZyme database that match at least one of the provided taxonomy criteria will have data retrieved from GTDB.
For example, if you want to retrieve data for all CAZymes in a local CAZyme database from bacterial and eukaryotic species, then use the command:
cw_get_gtdb_taxs cazy/cazyme.db archaea bacteria --kingdoms bacteria,eukaryota
Warning
The kingdoms must be spelt the same way CAZy spells them, for example use ‘eukaryota’ (ending in ‘a’) instead of ‘eukaryote’ (ending in ‘e’).
Note
The kingdoms are not case sensitive; therefore, both bacteria and Bacteria are accepted.
Note
You can list the kingdoms in any order. Thus, both bacteria,eukaryota and eukaryota,bacteria are accepted.
You can use any combination of the optional flags, including combining the taxonomic filters. For example, you may wish to retrieve taxonomic data for all CAZymes in a local CAZyme database that are derived from all viral species, Aspergillus species, Layia carnosa, Layia chrysanthemoides, Trichoderma reesei QM6a and Trichoderma reesei QM9414. To do this we would combine the respective flags in a single cw_get_gtdb_taxs command. The command we would use would be:
cw_get_gtdb_taxs cazy/cazyme.db archaea bacteria --kingdoms viruses --genera Aspergillus --species "Layia carnosa,Layia chrysanthemoides" --strains "Trichoderma reesei QM6a,Trichoderma reesei QM9414"
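Species and strain names contain spaces, which the shell normally treats as argument separators. The sketch below is a dry run that does not call cw_get_gtdb_taxs at all; the helper count_args is hypothetical and simply reports how many arguments the shell would hand to a program. It suggests that wrapping a comma-separated list of multi-word names in quotation marks is a safe way to deliver it to a flag as a single value.

```shell
# Hypothetical helper: print how many arguments the shell passes on.
count_args() { echo "$#"; }

# Unquoted: the space inside each name splits the list into separate arguments.
count_args --species Layia carnosa,Layia chrysanthemoides      # prints 4

# Quoted: the whole comma-separated list stays one argument after the flag.
count_args --species "Layia carnosa,Layia chrysanthemoides"    # prints 2
```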
Note
The order in which the flags are used and the order in which taxa are listed do not matter. Separate multiple taxa names with a single comma and no spaces.
Warning
Use the standard scientific name formatting. Capitalise the first letter of the genus and write the first letter of the species in lower case.
Aspergillus niger is correct
aspergillus niger is incorrect
ASPERGILLUS NIGER is incorrect
Warning
When you specify a species, cw_get_gtdb_taxs will retrieve taxonomic data from all strains of the species.
Applying EC number filter
The retrieval of taxonomic data from GTDB can also be limited to proteins in a local CAZyme database that are annotated with specific EC numbers.
Having previously retrieved EC number annotations from UniProt and added them to the local CAZyme database, you may wish to retrieve taxonomic data for CAZymes annotated with specific EC numbers. To do this add the --ec_filter flag to the command, followed by a list of EC numbers.
cw_get_gtdb_taxs cazy/cazyme.db archaea bacteria --ec_filter "EC1.2.3.4,EC2.3.4.5"
Note
Provide complete EC numbers. Both dashes (‘-’) and asterisks (‘*’) are accepted for missing digits in EC numbers.
EC1.2.3.- and EC1.2.3.* are accepted. EC1.2.3. and EC 1.2.3 are not accepted.
Note
The ‘EC’ prefix is not necessary. EC1.2.3.4 and 1.2.3.4 are accepted.
Warning
If using dashes to represent missing digits in EC numbers, it is recommended to enclose the entire EC number list in single or double quotation marks. Some terminals may misinterpret EC1.2.-.- as trying to invoke the options ‘.’
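Quoting the list also protects asterisks, which the shell may treat as a filename wildcard. The following sketch is a self-contained demonstration in a temporary directory; the file name EC1.2.3.notes is invented purely to trigger the effect, and no cazy_webscraper command is executed.

```shell
# Work in a scratch directory containing a file that matches the pattern EC1.2.3.*
workdir=$(mktemp -d)
touch "$workdir/EC1.2.3.notes"

# Unquoted: the shell replaces the pattern with the matching filename.
unquoted=$(cd "$workdir" && echo EC1.2.3.*)

# Quoted: the EC number list is passed through unchanged.
quoted=$(cd "$workdir" && echo "EC1.2.3.*,EC2.3.4.-")

echo "unquoted: $unquoted"
echo "quoted:   $quoted"
```

Because the unquoted form changes depending on which files happen to be in the current directory, quoting keeps the value given to --ec_filter predictable.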
Note
cw_get_gtdb_taxs will retrieve the GTDB taxonomic classification for all proteins in the local CAZyme database that are annotated with at least one of the given EC numbers. Therefore, if multiple EC numbers are given, this does not mean taxonomic data will only be retrieved for CAZymes annotated with all provided EC numbers.
--ec_filter is based upon EC number annotations stored within the local CAZyme database. For example, if protein A is annotated with EC1.2.3.4, but this annotation is not stored in the local CAZyme database, using --ec_filter EC1.2.3.4 will not cause cw_get_gtdb_taxs to retrieve data for protein A. This is because cw_get_gtdb_taxs does not know protein A is annotated with EC1.2.3.4, because this annotation is not within its database.
Warning
If --ec_filter is used alongside --ec, cw_get_gtdb_taxs will retrieve all EC number annotations from UniProt for all proteins in the local CAZyme database that are associated with at least one of the EC numbers provided via --ec_filter within the CAZyme database.
Combining all filters
The --classes, --families, --ec_filter, --kingdoms, --genera, --species and --strains flags can be used in any combination to define a specific subset of proteins in the local CAZyme database for which taxonomic data will be retrieved from GTDB.
Below we run through four example commands that combine these flags, and the resulting behaviour.
Example 1: To add taxonomic data for all CAZymes in GH, GT, CE1, CE5 and CE8 that are derived from bacterial species, we use the command:
cw_get_gtdb_taxs cazy/cazyme.db archaea bacteria --classes GH,GT --families CE1,CE5,CE8 --kingdoms bacteria
Example 2: To add taxonomic data for all CAZymes in GH that are derived from Aspergillus and Trichoderma species, we use the command:
cw_get_gtdb_taxs cazy/cazyme.db archaea bacteria --classes GH --genera Aspergillus,Trichoderma
Example 3: To add taxonomic classifications for all CAZymes in GH, CE and CBM that are derived from bacterial species and are annotated with at least one of EC3.2.1.23, EC3.2.1.37 and EC3.2.1.85, we use the command:
cw_get_gtdb_taxs cazy/cazyme.db archaea bacteria --classes GH,CE,CBM --kingdoms bacteria --ec_filter "3.2.1.23,3.2.1.37,3.2.1.85"
Example 4: To add only bacterial taxonomic classifications for all CAZymes in GH, CE and CBM that are derived from bacterial species and are annotated with at least one of EC3.2.1.23, EC3.2.1.37 and EC3.2.1.85, we use the command:
cw_get_gtdb_taxs cazy/cazyme.db bacteria --classes GH,CE,CBM --kingdoms bacteria --ec_filter "3.2.1.23,3.2.1.37,3.2.1.85"
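When many filters are combined, it can help to assemble the command from named variables and inspect it before running it. The sketch below is one possible way to do this, not part of cazy_webscraper: the variable names are arbitrary, the values are copied from Example 4, and the command is only printed rather than executed.

```shell
# Filter values copied from Example 4; leave a variable empty to skip that flag.
db="cazy/cazyme.db"
classes="GH,CE,CBM"
kingdoms="bacteria"
ec_numbers="3.2.1.23,3.2.1.37,3.2.1.85"

# Build the argument list step by step; quoting keeps each value as one argument.
set -- cw_get_gtdb_taxs "$db" bacteria
if [ -n "$classes" ]; then set -- "$@" --classes "$classes"; fi
if [ -n "$kingdoms" ]; then set -- "$@" --kingdoms "$kingdoms"; fi
if [ -n "$ec_numbers" ]; then set -- "$@" --ec_filter "$ec_numbers"; fi

# Dry run: print the assembled command instead of executing it.
cmd="$*"
echo "$cmd"
# To run it for real, replace the echo line with: "$@"
```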
Providing a list of accessions
Instead of retrieving taxonomic data for all CAZymes matching a defined set of criteria, cw_get_gtdb_taxs can retrieve taxonomic data for a set of CAZymes defined by their GenBank and/or UniProt accession.
The flag --genbank_accessions can be used to provide cw_get_gtdb_taxs a list of GenBank accessions to identify the specific set of CAZymes to retrieve taxonomic data for.
The flag --uniprot_accessions can be used to provide cw_get_gtdb_taxs a list of UniProt accessions to identify the specific set of CAZymes to retrieve taxonomic data for.
In both instances (for --genbank_accessions and --uniprot_accessions) the list of respective accessions is provided via a plain text file, with a unique protein accession on each line. The path to this file is then passed to cw_get_gtdb_taxs via the respective --genbank_accessions or --uniprot_accessions flag.
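As a rough illustration of preparing such a file, the sketch below writes a few accessions to a plain text file, removes blank lines and duplicates so that each line holds one unique accession, and then prints (without running) a command that would use it. The accessions and file names are placeholders invented for this example; substitute your own.

```shell
workdir=$(mktemp -d)
acc_file="$workdir/genbank_accessions.txt"

# Placeholder accessions for illustration only, including a blank line and a duplicate.
printf '%s\n' "ABC12345.1" "" "DEF67890.1" "ABC12345.1" > "$acc_file.raw"

# Keep one accession per line, dropping blank lines and duplicates.
grep -v '^[[:space:]]*$' "$acc_file.raw" | sort -u > "$acc_file"

n_acc=$(wc -l < "$acc_file" | tr -d ' ')
echo "$n_acc unique accessions written to $acc_file"

# Dry run: print the command that would use the file.
echo cw_get_gtdb_taxs cazy/cazyme.db archaea bacteria --genbank_accessions "$acc_file"
```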
--genbank_accessions and --uniprot_accessions can be used at the same time to define all CAZymes of interest.
Warning
--genbank_accessions and --uniprot_accessions take precedence over the filter flags.
When either --genbank_accessions or --uniprot_accessions is used, cw_get_gtdb_taxs will not retrieve any CAZymes from the local database by matching them against a set of criteria. Therefore, if --genbank_accessions and --classes are used, cw_get_gtdb_taxs will ignore the --classes flag and only retrieve taxonomic classifications for the proteins listed in the file provided via --genbank_accessions.
Providing GTDB datafiles
By default cazy_webscraper retrieves the latest GTDB datafiles from the GTDB website. However, you can provide your own GTDB datafiles.
Specifically, these are the datafiles available from the GTDB release page.
The filenames must use the same filename format as GTDB to enable the correct extraction of the GTDB release number from the filename.
To provide a previously downloaded archaea datafile, use the --archaea_file flag followed by the path pointing to the target datafile.
Similarly, to provide a previously downloaded bacteria datafile, use the --bacteria_file flag followed by the path pointing to the target datafile.
Note
cazy_webscraper expects GTDB datafiles in the compressed (.gz) tab-separated file format (.tsv).
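Before passing a downloaded datafile to the command, it may be worth checking that it has the expected extension and is a readable gzip archive. The sketch below makes several assumptions that should be checked against the GTDB release page: the filename bac120_taxonomy_r207.tsv.gz is only an example of the GTDB naming style, the release number is assumed to follow "_r" in the filename, and the file content here is a stand-in created for the demonstration.

```shell
workdir=$(mktemp -d)
# Example filename in the GTDB style (an assumption for this sketch).
datafile="$workdir/bac120_taxonomy_r207.tsv.gz"
printf 'placeholder\n' | gzip > "$datafile"   # stand-in content for this demo

# Check the extension and that the archive is readable.
case "$datafile" in
    *.tsv.gz) ext_ok=yes ;;
    *)        ext_ok=no ;;
esac
if gzip -t "$datafile" 2>/dev/null; then gz_ok=yes; else gz_ok=no; fi

# Assumed convention: the release number follows "_r" in the filename.
release=$(basename "$datafile" | sed -n 's/.*_r\([0-9][0-9]*\).*/\1/p')
echo "extension ok: $ext_ok, gzip ok: $gz_ok, release: $release"

# Dry run of the command that would consume the file.
echo cw_get_gtdb_taxs cazy/cazyme.db bacteria --bacteria_file "$datafile"
```

Renaming a downloaded file is best avoided, since the release number is read from the original GTDB filename.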
Updating genomic classifications
By default cw_get_gtdb_taxs only adds links to GTDB lineages to genomes in the local CAZyme database that are not already linked to a GTDB lineage. To update which GTDB lineage the genomes in the local CAZyme database are linked to, add the --update_genome_lineage flag.
cw_get_gtdb_taxs cazy/cazyme.db bacteria \
--classes GH,CE,CBM \
--kingdoms bacteria \
--update_genome_lineage