Usage: Scraping CAZy

cazy_webscraper can be used to retrieve user-specified data sets from the CAZy database. The cazy_webscraper application is invoked via the command line.

Quick Start

To download the entire CAZy dataset and save it to the current working directory with the file name cazy_webscraper_<date>_<time>.db, use the following command structure:

cazy_webscraper <user_email>

Note

The user email address is a requirement of NCBI. NCBI is queried to identify the correct source organism for a given protein when multiple source organisms are retrieved from CAZy for a single protein. For more information please see the NCBI Entrez documentation.

Note

Typically, downloading the entire CAZy dataset takes 10-15 minutes, although this depends on the amount of available memory.

To print citation information (including the citations of third party tools used by cazy_webscraper):

cazy_webscraper --citation

Or

cazy_webscraper -C

To print version information (including the versions of third party tools used by cazy_webscraper):

cazy_webscraper --version

Or

cazy_webscraper -V

Command line options

Listed below are the required and optional command-line options when using cazy_webscraper to download data from CAZy.

email - REQUIRED. User email address. This is required by NCBI Entrez for querying the Entrez server. Not needed when printing citation or version number information.

--cache_dir - Path to cache dir to be used instead of default cache dir path.

--cazy_data - Path to a text file downloaded from CAZy containing a CAZy database dump.

--cazy_synonyms - Path to a JSON file containing accepted CAZy class synonyms, if the defaults are not sufficient.

--config, -c - Path to a configuration YAML file. Default: None.

--citation, -C - Print the cazy_webscraper citation. When called, the program terminates after printing the citation and CAZy is not scraped.

--classes - List of classes from which all families are to be scraped.

--database, -d - Path to an existing local CAZyme database to add newly scraped data to. Default: None.

--db_output, -o - Path to write out a new local CAZyme database to. Include the name of the new database, including the .db extension. Default: None.

Warning

Do not use --db_output and --database at the same time.

Note

If neither --db_output nor --database is called, cazy_webscraper will write a local CAZyme database to the current working directory with the standardised name cazy_webscraper_<date>_<time>.db.

--delete_old_relationships - Delete old CAZy family annotations of GenBank accessions. These are CAZy family annotations that are recorded in the local database for a given GenBank accession, although the accession is no longer associated with those CAZy families. When called, these old accession-family relationships are deleted.

--families - List of CAZy (sub)families to scrape.

--force, -f - Force overwriting of an existing output file. Default: False.

Warning

If the specified output directory already exists and --force is not called, cazy_webscraper will not overwrite the output and will terminate.

--genera - List of genera to restrict the scrape to. Default: None, filter not applied to scrape.

--kingdoms - List of taxonomic kingdoms to restrict the scrape to. Default: None, filter is not applied.

--log, -l - Target path to write out a log file. If not called, no log file is written. Default: None (no log file is written out).

--nodelete, -n - When called, content in the existing output dir will not be deleted. Default: False (existing content is deleted).

Note

When the --db_output flag is used, cazy_webscraper will create any necessary parent directories. If the direct/immediate parent directory of the database exists, by default cazy_webscraper will delete the content in this parent directory.

--nodelete_cache - When called, content in the existing cache dir will not be deleted. Default: False (existing content is deleted).

--nodelete_log - When called, content in the existing log dir will not be deleted. Default: False (existing content is deleted).

--ncbi_batch_size - The number of protein IDs submitted per batch to NCBI when retrieving taxonomic classifications. Default: 200.

--retries, -r - Define the number of times to retry making a connection to CAZy if the connection should fail. Default: 10.

--skip_ncbi_tax - Skip retrieving the latest taxonomic information from NCBI where multiple taxonomic classifications are retrieved from CAZy for a protein. When called, taxon data is not retrieved from NCBI and the first taxonomy retrieved from the CAZy database dump is added to the local CAZyme database. Default: False (the latest taxonomic information is retrieved from NCBI).

--sql_echo - Set SQLite engine echo parameter to True, causing SQLite to print log messages. Default: False.

--subfamilies, -s - Enable retrieval of CAZy subfamilies, otherwise only CAZy family annotations will be retrieved. Default: False.

--species - List of species (written as Genus species) to restrict the scraping of CAZymes to. CAZymes will be retrieved for all strains of each given species.

--strains - List of specific species strains to restrict the scraping of CAZymes to.

--timeout, -t - Connection timeout limit (seconds). Default: 45.

--validate - Retrieve CAZy family population sizes from the CAZy website and check them against the number of family members added to the local CAZyme database, as a method for validating the complete retrieval of CAZy data.

--verbose, -v - Enable verbose logging. This does not set the SQLite engine echo parameter to True. Default: False.

--version, -V - Print cazy_webscraper version number. When called and the version number is printed, cazy_webscraper is immediately terminated.

Basic Usage

The command-line options listed above can be used in combination to customise the scraping of CAZy. Some options (e.g. --families and --classes) define the broad group of data that will be scraped, others (e.g. --species) are used to filter and fine-tune the data that is scraped.

Defining CAZy families and classes to scrape

The ‘definition’ arguments (e.g. --classes and --families) indicate which groups of data will be selected for scraping from CAZy, e.g.

cazy_webscraper --families GH169 -o GH169.db
cazy_webscraper --classes AA -o AA.db

will download all CAZymes from the GH169 family, and the AA class, respectively. More than one class or family can be specified, e.g.

cazy_webscraper --families GH169,GH1,GH2,GH3 -o GH_families.db
cazy_webscraper --classes AA,CBM -o other_classes.db

and members of distinct families and classes can be selected simultaneously, e.g.

cazy_webscraper --families GH169,GH1,GH2,GH3 --classes AA,CBM -o complex_query.db

Note

CAZy families should be named using the standard CAZy syntax. GH1 is accepted. “gh1” and “Glycoside hydrolase 1” are not accepted.
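
The naming rule in the note above can be checked before cazy_webscraper is called. The following is a minimal sketch and not part of cazy_webscraper: the helper name is hypothetical, and the pattern (an upper-case class abbreviation, a family number and an optional _<number> subfamily suffix) is an assumption based on the names used in this documentation.

```python
import re

# Class abbreviations as they appear in this documentation (GH, GT, PL, CE, AA, CBM).
CAZY_CLASSES = ("GH", "GT", "PL", "CE", "AA", "CBM")

# Assumed pattern: class abbreviation + family number + optional "_<number>"
# subfamily suffix, e.g. GH1 or GH3_1.
FAMILY_PATTERN = re.compile(r"(?:%s)\d+(?:_\d+)?" % "|".join(CAZY_CLASSES))


def is_standard_family_name(name: str) -> bool:
    """Return True if name looks like a standard CAZy (sub)family name."""
    return FAMILY_PATTERN.fullmatch(name) is not None


for candidate in ("GH1", "gh1", "Glycoside hydrolase 1"):
    print(candidate, is_standard_family_name(candidate))
```

Only the first of the three candidate names passes the check, which matches the accepted and rejected examples given in the note.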

Specifying output data location

By default, cazy_webscraper writes an SQL database file to the current working directory, with a file name of the form cazy_webscraper_<date>_<time>.db, where the date and time mark when cazy_webscraper was called.

To specify the location of the output database the --db_output / -o option can be used:

cazy_webscraper --families GH169 -o GH169_output.db

will write an SQL database file to GH169_output.db.

If the target output file already exists, cazy_webscraper by default will not overwrite the existing file and will terminate. To overwrite an existing file use the --force / -f option:

cazy_webscraper --families GH169 -o GH169_output.db -f

A multi-layered path can be provided to cazy_webscraper. If any of the parent directories for the target output path do not exist, cazy_webscraper will build the necessary output directories. In the following command, if the cazy and families directories do not exist, cazy_webscraper will build these directories:

cazy_webscraper --families GH169 -o cazy/families/GH169_output.db

If any of the output directories exist, by default, cazy_webscraper will terminate. To write to an existing output directory use the --force / -f option:

cazy_webscraper --families GH169 -o GH169_output.db -f

By default, cazy_webscraper will delete the existing content in the existing output directories. To keep the content in the existing output directories use the --nodelete / -n option:

cazy_webscraper --families GH169 -o GH169_output.db -f -n

If you already have an existing CAZy database, then specifying this database with the -d / --database option will cause the scraper to use the existing database rather than creating a new one:

cazy_webscraper --families GH169 -d GH169/GH169_output.db
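
Because the output is an SQL database file, it can be inspected with Python's standard sqlite3 module. The sketch below only lists the table names, so it makes no assumption about the database schema. It does assume that the file is an SQLite database (as the --sql_echo option suggests), and the helper name is hypothetical.

```python
import sqlite3
from pathlib import Path


def list_tables(db_path):
    """Return the names of all tables in the SQLite database at db_path."""
    connection = sqlite3.connect(db_path)
    try:
        rows = connection.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    finally:
        connection.close()
    return [row[0] for row in rows]


# Path taken from the example command above
db_path = Path("GH169/GH169_output.db")
# sqlite3.connect would create a missing file, so check that it exists first
if db_path.exists():
    print(list_tables(db_path))
```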

Filtering CAZy families and classes

Options that apply a filter to restrict which CAZymes from a class or family are scraped from CAZy (e.g. --species and --kingdoms) may be applied in combination with the options that define the classes and families to scrape. For example:

cazy_webscraper --families GH169 \
    --species "Escherichia coli" \
    -o GH169_speciesEscherichia_coli.db

will download only the CAZymes in the GH169 family that are from the species Escherichia coli. The command:

cazy_webscraper --families PL14,PL15,PL16 \
    --kingdoms Bacteria \
    -o PL14_ec1.2.3.4_kingdomBacteria.db

will download only CAZymes in the PL14, PL15 and PL16 families that are from the kingdom Bacteria.
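
Commands like the ones above can also be assembled in a script, which helps when the same filters are reused across several runs. This is a minimal sketch, not part of cazy_webscraper. The helper name, the placeholder email address and the output file name are hypothetical. Joining multiple species or kingdoms with commas is an assumption that mirrors how --families and --classes are written in this documentation.

```python
import shlex


def build_command(email, families=(), classes=(), species=(), kingdoms=(), db_output=None):
    """Return a cazy_webscraper argument list for the given selections."""
    command = ["cazy_webscraper", email]
    # Multiple values are joined with commas, as in the --families examples;
    # doing the same for --species and --kingdoms is an assumption.
    for flag, values in (
        ("--families", families),
        ("--classes", classes),
        ("--species", species),
        ("--kingdoms", kingdoms),
    ):
        if values:
            command += [flag, ",".join(values)]
    if db_output is not None:
        command += ["--db_output", str(db_output)]
    return command


command = build_command(
    "user@example.com",  # placeholder address
    families=["PL14", "PL15", "PL16"],
    kingdoms=["Bacteria"],
    db_output="PL_bacteria.db",  # hypothetical output file name
)
# Print a shell-quoted version; pass `command` to subprocess.run to execute it
print(shlex.join(command))
```

Keeping the arguments in a list rather than one string means a species name containing a space, such as "Escherichia coli", is passed as a single argument without manual quoting.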

Note

cazy_webscraper input options can also be specified in a YAML configuration file, to enable transparency and reproducibility.

Configuration using a YAML file

All command-line options to control CAZy scraping can be provided instead via a YAML configuration file. This supports reproducible documentation of cazy_webscraper usage.

A template YAML file is provided in the cazy_webscraper repository (scraper/scraper_config.yaml):

# Under 'classes' list the classes from which all proteins will be retrieved
# Under each class's respective name, list the specific families/subfamilies to be scraped
# Write the FULL family name, e.g. 'GH1', NOT only its number, e.g. '1'
# To list multiple families, each family must be on a new line starting indented once
# relative to the parent class name, and the name written within quotation marks.
# For more information on writing lists in Yaml please see:
# https://docs.ansible.com/ansible/latest/reference_appendices/YAMLSyntax.html
classes:  # classes from which all proteins will be retrieved
Glycoside Hydrolases (GHs):
GlycosylTransferases (GTs):
Polysaccharide Lyases (PLs):
  - "PL28"
Carbohydrate Esterases (CEs):
Auxiliary Activities (AAs):
Carbohydrate-Binding Modules (CBMs):
genera:  # list genera to be scraped
 - "Trichoderma"
species:  # list species, this will scrape all strains under the species
strains:  # list specific strains to be scraped
kingdoms:  # Archaea, Bacteria, Eukaryota, Viruses, Unclassified
 - "Bacteria"

Attention

The YAML configuration file must contain all tags/headings indicated in the example configuration file found in the repository:

  • classes

  • Glycoside Hydrolases (GHs)

  • GlycosylTransferases (GTs)

  • Polysaccharide Lyases (PLs)

  • Carbohydrate Esterases (CEs)

  • Auxiliary Activities (AAs)

  • Carbohydrate-Binding Modules (CBMs)

  • genera

  • species

  • strains

  • kingdoms

Each value in the YAML mappings for these arguments must be listed on a separate line, indented by 4 spaces, with the name enclosed in single or double quotation marks. For example:

classes:
    - "GT"
    - "pl"
Glycoside Hydrolases (GHs):
    - "GH1"
    - "GH2"
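
The requirement that every tag is present can be checked before a run. The sketch below works on an already-parsed configuration (a dictionary, as a YAML loader such as PyYAML's yaml.safe_load would return it). The helper name is hypothetical and the check is not part of cazy_webscraper.

```python
# Tags that the configuration file must contain, as listed above
REQUIRED_KEYS = (
    "classes",
    "Glycoside Hydrolases (GHs)",
    "GlycosylTransferases (GTs)",
    "Polysaccharide Lyases (PLs)",
    "Carbohydrate Esterases (CEs)",
    "Auxiliary Activities (AAs)",
    "Carbohydrate-Binding Modules (CBMs)",
    "genera",
    "species",
    "strains",
    "kingdoms",
)


def missing_config_keys(config):
    """Return the required top-level keys that are absent from config."""
    return [key for key in REQUIRED_KEYS if key not in config]


# The two-tag example above, as a YAML loader would parse it
config = {
    "classes": ["GT", "pl"],
    "Glycoside Hydrolases (GHs)": ["GH1", "GH2"],
}
print(missing_config_keys(config))
```

For the two-tag example, the function reports the nine remaining tags, which shows that the fragment needs the other headings (left empty if unused) before it is a complete configuration file.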

Synonyms for CAZy classes

A number of synonyms may be provided for CAZy classes, e.g. both “GH” and “Glycoside-Hydrolases” are accepted as synonyms for “Glycoside Hydrolases (GHs)” (the name recorded at CAZy). These alternatives are defined in the cazy_webscraper repository, in the file scraper/utilities/parse_configuration/cazy_dictionary.json.
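
The idea behind the synonym lookup can be sketched as follows. The table is hypothetical and partial: it contains only the two synonyms named above, and the real cazy_dictionary.json may be organised differently. A custom table could be loaded from a JSON file with json.load and supplied to cazy_webscraper via --cazy_synonyms.

```python
# Hypothetical, partial synonym table: only the synonyms named in the text
# above are included, and the layout of the real JSON file is an assumption.
CLASS_SYNONYMS = {
    "Glycoside Hydrolases (GHs)": ["GH", "Glycoside-Hydrolases"],
}


def standardise_class_name(name, synonyms=CLASS_SYNONYMS):
    """Map a class name or accepted synonym to the name recorded at CAZy.

    Returns None when the name is not recognised.
    """
    for cazy_name, accepted in synonyms.items():
        if name == cazy_name or name in accepted:
            return cazy_name
    return None


for name in ("GH", "Glycoside-Hydrolases", "Glycoside Hydrolases (GHs)"):
    print(name, "->", standardise_class_name(name))
```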

Scraping CAZy subfamilies

cazy_webscraper can scrape CAZy subfamilies, using the standard CAZy notation for subfamilies (e.g. GH3_1).

Note

If any subfamilies are specified for download/scraping in the YAML file, the command line argument --subfamilies must be used.

If a parent CAZy family is listed in the configuration file and --subfamilies is enabled at the command-line, all proteins catalogued under the named family and its subfamilies will be retrieved.
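
A quick way to tell whether --subfamilies is needed is to look for the subfamily notation among the requested names. The helpers below are a hypothetical sketch that assumes a subfamily is always written as <family>_<number>, as in GH3_1.

```python
def is_subfamily(name):
    """Return True if name uses the subfamily notation, e.g. GH3_1."""
    family, separator, suffix = name.partition("_")
    return bool(family) and separator == "_" and suffix.isdigit()


def parent_family(name):
    """Return the parent family of a (sub)family name, e.g. GH3_1 -> GH3."""
    return name.partition("_")[0]


def needs_subfamilies_flag(requested):
    """Return True if any requested name is a subfamily."""
    return any(is_subfamily(name) for name in requested)


requested = ["GH1", "GH3_1", "PL28"]
if needs_subfamilies_flag(requested):
    print("Add --subfamilies to the command")
print(sorted({parent_family(name) for name in requested}))
```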