Welcome to cazy_webscraper’s documentation!

cazy_webscraper logo, host organisations and funding
For latest updates and development progress, please check the GitHub repository

Build Information

https://img.shields.io/badge/Version-v1.0.2-yellowgreen https://zenodo.org/badge/DOI/10.5281/zenodo.4300858.svg https://img.shields.io/badge/Licence-MIT-brightgreen https://circleci.com/gh/HobnobMancer/cazy_webscraper.svg?style=shield https://codecov.io/gh/HobnobMancer/cazy_webscraper/branch/master/graph/badge.svg

PyPI

https://img.shields.io/pypi/v/cazy_webscraper.svg?style=flat-square https://img.shields.io/pypi/dm/cazy_webscraper?label=Pypi%20downloads

bioconda

https://anaconda.org/bioconda/cazy_webscraper/badges/version.svg?style=flat-square https://anaconda.org/bioconda/cazy_webscraper/badges/latest_release_date.svg?style=flat-square https://img.shields.io/conda/dn/bioconda/cazy_webscraper?label=Bioconda%20downloads

cazy_webscraper

cazy_webscraper is a Python3 package for the automated retrieval of Carbohydrate-Active enZyme (CAZyme) data from the CAZy database. This program is free to use under the MIT license, and we kindly request that, if you use this program or Python package, you cite it as indicated below.

Hobbs, Emma E. M.; Pritchard, Leighton; Chapman, Sean; Gloster, Tracey M. (2021): cazy_webscraper Microbiology Society Annual Conference 2021 poster. figshare. Poster. https://doi.org/10.6084/m9.figshare.14370860.v7

cazy_webscraper retrieves data from CAZy, writing it to a local SQLite3 file (typically taking 10-15 minutes to scrape the entirety of CAZy).

Additionally, cazy_webscraper can:

  • Retrieve the protein data from UniProt for CAZymes in the local database. This data includes:

    • UniProt accession

    • Protein name

    • Protein amino acid sequence

    • EC numbers

    • PDB accessions

  • Retrieve protein sequences from NCBI GenBank for CAZymes in the local database.

  • Write out protein sequences retrieved from UniProt and NCBI in FASTA format, and build a local BLAST database.

  • Retrieve protein structures from the Research Collaboratory for Structural Bioinformatics (RCSB) Protein Data Bank, PDB, for CAZymes in the local database.

  • Be configured to scrape the entire CAZy database, or recover only CAZymes filtered by user-supplied criteria, such as CAZy classes, CAZy (sub)family, or taxonomy.

  • Retrieve the latest taxonomic classifications (including the complete lineage) from the NCBI Taxonomy database.

Quickstart

We have produced a “Getting Started With cazy_webscraper” poster.

To download the entire CAZy dataset and save it to the current working directory under the file name cazy_webscraper_<date>_<time>.db, use the following command structure:

cazy_webscraper <user_email>

Note

The user email address is a requirement of NCBI. When CAZy lists multiple source organisms for a single protein, NCBI is queried to identify the correct source organism. For more information, please see the NCBI Entrez documentation.
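As a rough illustration of scripting the quickstart command, the sketch below builds the argument list, runs it, and then looks for the newest cazy_webscraper_*.db file in the working directory. The helper names (build_command, find_latest_database, run_full_scrape) are hypothetical and not part of cazy_webscraper. The only assumptions taken from the text above are the command structure and the output file name pattern.

```python
import shutil
import subprocess
from pathlib import Path
from typing import List, Optional


def build_command(user_email: str) -> List[str]:
    """Return the argument list for a full CAZy scrape (hypothetical helper)."""
    if "@" not in user_email:
        raise ValueError(f"not an email address: {user_email!r}")
    return ["cazy_webscraper", user_email]


def find_latest_database(directory: Path) -> Optional[Path]:
    """Return the most recently modified cazy_webscraper_*.db file, or None."""
    candidates = sorted(
        directory.glob("cazy_webscraper_*.db"),
        key=lambda path: path.stat().st_mtime,
    )
    return candidates[-1] if candidates else None


def run_full_scrape(user_email: str, directory: Path) -> Optional[Path]:
    """Run the scraper in `directory` and return the database it wrote."""
    if shutil.which("cazy_webscraper") is None:
        raise RuntimeError("cazy_webscraper is not installed or not on PATH")
    # check=True raises CalledProcessError if the scraper exits with an error.
    subprocess.run(build_command(user_email), cwd=directory, check=True)
    return find_latest_database(directory)
```

Calling run_full_scrape("you@example.org", Path.cwd()) would start a full scrape, so it is best reserved for off-peak hours.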

Command summary

Below is the list of commands (excluding required and optional arguments) included in cazy_webscraper.

CAZy

To retrieve data from CAZy and compile an SQLite database, use the cazy_webscraper command.

UniProt

To retrieve protein data from UniProt, use the cw_get_uniprot_data command.

The following data can be retrieved:

  • UniProt accession

  • Protein name

  • EC numbers

  • PDB accession

  • Protein sequences

GenBank

  • To retrieve protein sequences from GenBank use the cw_get_genbank_seqs command.

  • To retrieve the latest taxonomic classifications from NCBI Taxonomy, use the cw_get_ncbi_taxs command.

Extract sequences

To extract GenBank and/or UniProt protein sequences from a local CAZyme database, use the cw_extract_db_seqs command.
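Once sequences have been written out in FASTA format, they can be read back with a few lines of standard-library Python. The parser below is a minimal sketch that assumes ordinary FASTA layout (a ">" header line followed by one or more sequence lines). parse_fasta is a hypothetical helper, not a function provided by cazy_webscraper.

```python
from typing import Dict, Iterable, Optional


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Map each FASTA header (without the leading '>') to its full sequence."""
    records: Dict[str, str] = {}
    header: Optional[str] = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines between records
        if line.startswith(">"):
            header = line[1:]
            records[header] = ""
        elif header is not None:
            # Sequence lines may be wrapped, so join them onto the record.
            records[header] += line
    return records
```

Because the function accepts any iterable of lines, an open file handle can be passed directly, for example parse_fasta(open(path)) with path pointing at an extracted FASTA file.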

PDB

To retrieve protein structure files from PDB, use the cw_get_pdb_structures command.

NCBI taxonomies

Retrieve the latest taxonomic classifications (including the complete lineage from kingdom to strain) using the cw_get_ncbi_taxs command.

GTDB taxonomies

Retrieve the latest taxonomic classifications (including the complete lineage from kingdom to strain) from the GTDB database using the cw_get_gtdb_taxs command.

Interrogate the database

To interrogate the database, use the cw_query_database command.
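Because the local database is an ordinary SQLite3 file, it can also be inspected directly with Python's built-in sqlite3 module. The sketch below makes no assumption about the schema: it discovers the tables at run time and counts their rows. summarise_database is a hypothetical helper, and the file is opened read-only so that inspecting it cannot modify it.

```python
import sqlite3
from pathlib import Path
from typing import Dict


def summarise_database(db_path: str) -> Dict[str, int]:
    """Return {table name: row count} for every user table in an SQLite file."""
    # mode=ro opens the file read-only and fails if it does not exist.
    uri = Path(db_path).resolve().as_uri() + "?mode=ro"
    connection = sqlite3.connect(uri, uri=True)
    try:
        tables = [
            row[0]
            for row in connection.execute(
                "SELECT name FROM sqlite_master "
                "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' "
                "ORDER BY name"
            )
        ]
        counts: Dict[str, int] = {}
        for table in tables:
            quoted = '"' + table.replace('"', '""') + '"'  # quote identifier
            counts[table] = connection.execute(
                f"SELECT COUNT(*) FROM {quoted}"
            ).fetchone()[0]
        return counts
    finally:
        connection.close()
```

Printing the returned dictionary gives a quick check that a scrape populated the database before running more specific queries.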

Best practice

When performing a series of many automated, repeated calls to a server it is polite to do this when internet traffic is lowest at the server. This is typically at the weekend and overnight.
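A scheduled job can apply this guideline with a small check before it launches a retrieval. The sketch below treats weekends and a night-time window as off-peak. The 22:00 to 06:00 window is an assumption chosen for illustration, not a value recommended by any of the databases, and the time passed in should be expressed in the server's local time.

```python
from datetime import datetime, time


def is_off_peak(
    moment: datetime,
    night_start: time = time(22, 0),  # assumed start of the quiet window
    night_end: time = time(6, 0),     # assumed end of the quiet window
) -> bool:
    """Return True at the weekend or inside the overnight window."""
    if moment.weekday() >= 5:  # 5 = Saturday, 6 = Sunday
        return True
    current = moment.time()
    # The window wraps past midnight, so either condition qualifies.
    return current >= night_start or current < night_end
```

A wrapper script could call is_off_peak(datetime.now()) and postpone the retrieval when it returns False.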

When using cazy_webscraper to retrieve data from UniProt, NCBI or PDB, the webscraper can appear to run slowly but this may be due to bandwidth at the database server, or server speed. cazy_webscraper provides a progress bar to reassure the user that the webscraper is working.

Warning

Please do not perform a retrieval of UniProt, NCBI and/or PDB data for the entire CAZy dataset, unless absolutely unavoidable. Retrieving the data from any of these external databases for the entire CAZy dataset will take several hours and may unintentionally deny the service to others.

Documentation

For details and updates on development, please consult the GitHub repository.

Citing cazy_webscraper

If you use cazy_webscraper in your work please do cite our work (including the provided DOI), as well as the specific version of the tool you use. This is not only helpful for us as developers to get our work out into the world, but it is also essential for the reproducibility and integrity of scientific research.

Citation:

Hobbs, E. E. M., Gloster, T. M., and Pritchard, L. (2022) ‘cazy_webscraper: local compilation and interrogation of comprehensive CAZyme datasets’, bioRxiv, https://doi.org/10.1101/2022.12.02.518825

This paper includes a full description of the operation and examples of use.

cazy_webscraper depends on a number of tools. To recognise the contributions that the authors and developers have made, please also cite the following:

When making an SQLite database:

Hipp, R. D. (2020) SQLite, available: https://www.sqlite.org/index.html.

Retrieving taxonomic, genomic or sequence data from NCBI:

Cock, P.J.A., Antao, T., Chang, J.T., Chapman, B.A., Cox, C.J., Dalke, A., et al (2009) Biopython: freely available Python tools for computational molecular biology and bioinformatics, Bioinformatics, 25(11), 1422-1423.

Wheeler, D.L., Benson, D.A., Bryant, S., Canese, K., Church, D.M., Edgar, R., Federhen, S., Helmberg, W., Kenton, D., Khovayko, O., et al (2005) Database resources of the National Center for Biotechnology Information: Update, Nucleic Acids Research, 33, D39-D45.

Retrieving data from UniProt:

Cokelaer, T., Pultz, D., Harder, L. M., Serra-Musach, J., Saez-Rodriguez, J. (2013) BioServices: a common Python package to access biological Web Services programmatically, Bioinformatics, 29(24), 3241-3242.

Downloading protein structure files from RCSB PDB:

Berman, H.M., Westbrook, J., Feng, Z., Gilliland, G., Bhat, T.N., Weissig, H., et al (2000) The Protein Data Bank, Nucleic Acids Research, 28(1), 235-242.

Hamelryck, T., Manderick, B. (2003), PDB parser and structure class implemented in Python. Bioinformatics, 19 (17), 2308–2310

Retrieving and using taxonomic data from GTDB:

Parks, D.H., Chuvochina, M., Rinke, C., Mussig, A.J., Chaumeil, P., Hugenholtz, P. (2022) GTDB: an ongoing census of bacterial and archaeal diversity through a phylogenetically consistent, rank normalized and complete genome-based taxonomy, Nucleic Acids Research, 50(D1), D785-D794.

Development and issues

If there are additional features you would like added, you have problems with the scraper, or you would like to contribute, please raise an issue at the GitHub repository.