Welcome to cazy_webscraper’s documentation!
cazy_webscraper
cazy_webscraper is a Python3 package for the automated retrieval of Carbohydrate-Active enZyme (CAZyme) data from the CAZy database. This program is free to use under the MIT license, and we kindly request that, if you use this program or Python package, you cite it as indicated below.
Hobbs, Emma E. M.; Pritchard, Leighton; Chapman, Sean; Gloster, Tracey M. (2021): cazy_webscraper Microbiology Society Annual Conference 2021 poster. figshare. Poster. https://doi.org/10.6084/m9.figshare.14370860.v7
cazy_webscraper retrieves data from CAZy, writing it to a local SQLite3 file (typically taking 10-15 minutes to scrape the entirety of CAZy).
Additionally, ``cazy_webscraper`` can:
Retrieve the protein data from UniProt for CAZymes in the local database. This data includes:
UniProt accession
Protein name
Protein amino acid sequence
EC numbers
PDB accessions
Retrieve protein sequences from NCBI GenBank for CAZymes in the local database.
Write out protein sequences retrieved from UniProt and NCBI in FASTA format, and build a local BLAST database.
Retrieve protein structures from the Research Collaboratory for Structural Bioinformatics (RCSB) Protein Data Bank, PDB, for CAZymes in the local database.
Be configured to scrape the entire CAZy database, or recover only CAZymes filtered by user-supplied criteria, such as CAZy classes, CAZy (sub)family, or taxonomy.
Retrieve the latest taxonomic classifications (including the complete lineage) from the NCBI Taxonomy database.
Quickstart
We have produced a “Getting Started With cazy_webscraper” poster.
To download the entire CAZy dataset and save it to the current working directory with the file name cazy_webscraper_<date>_<time>.db, use the following command structure:
cazy_webscraper <user_email>
Note
The user email address is a requirement of NCBI. NCBI is queried to identify the correct source organism for a given protein when multiple source organisms are retrieved from CAZy for a single protein. For more information, please see the NCBI Entrez documentation.
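For scripted runs, the same call can be made from Python. The sketch below is a minimal illustration and not part of cazy_webscraper itself: `build_scrape_command` and `run_scrape` are hypothetical helper names, and the only assumptions are that the `cazy_webscraper` executable is on the PATH and takes the email address as its positional argument, as in the command structure above.

```python
import shutil
import subprocess


def build_scrape_command(user_email):
    """Return the argument list for a full CAZy scrape (hypothetical helper)."""
    if "@" not in user_email:
        raise ValueError(f"not an email address: {user_email!r}")
    # Mirrors the documented structure: cazy_webscraper <user_email>
    return ["cazy_webscraper", user_email]


def run_scrape(user_email):
    """Run the scrape if cazy_webscraper is installed; return its exit code."""
    command = build_scrape_command(user_email)
    if shutil.which(command[0]) is None:
        raise FileNotFoundError("cazy_webscraper was not found on the PATH")
    # check=False: report the exit code instead of raising on failure.
    return subprocess.run(command, check=False).returncode
```

Passing the arguments as a list, rather than as a single shell string, avoids quoting problems if the email address contains unusual characters.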
Command summary
Below is the list of commands (excluding required and optional arguments) included in cazy_webscraper.
CAZy
To retrieve data from CAZy and compile an SQLite database, use the cazy_webscraper command.
UniProt
To retrieve protein data from UniProt, use the cw_get_uniprot_data command.
The following data can be retrieved:
- UniProt accession
- Protein name
- EC numbers
- PDB accession
- Protein sequences
GenBank
To retrieve protein sequences from GenBank, use the cw_get_genbank_seqs command.
To retrieve the latest taxonomic classifications from NCBI Taxonomy, use the cw_get_ncbi_taxs command.
Extract sequences
To extract GenBank and/or UniProt protein sequences from a local CAZyme database, use the cw_extract_db_seqs command.
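FASTA is a plain-text format: a header line beginning with `>` followed by the sequence, usually wrapped at a fixed width. The sketch below illustrates the format only. It is not the code that cw_extract_db_seqs uses, and the function name, the 60-character line width, and the example accession and sequence are all illustrative assumptions.

```python
def format_fasta(accession, sequence, width=60):
    """Format one protein sequence as a FASTA record (illustrative only)."""
    lines = [f">{accession}"]
    # Wrap the sequence into fixed-width lines, as is conventional for FASTA.
    for start in range(0, len(sequence), width):
        lines.append(sequence[start:start + width])
    return "\n".join(lines) + "\n"


# Made-up accession and sequence, purely for illustration.
print(format_fasta("EXAMPLE_1", "MKTAYIAKQR" * 7), end="")
```

With the 70-residue example above, the record consists of the header line, one full 60-character sequence line, and a final 10-character line.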
PDB
To retrieve protein structure files from PDB, use the cw_get_pdb_structures command.
NCBI taxonomies
Retrieve the latest taxonomic classifications (including the complete lineage from kingdom to strain) using the cw_get_ncbi_taxs command.
GTDB taxonomies
Retrieve the latest taxonomic classifications (including the complete lineage from kingdom to strain) from the GTDB database using the cw_get_gtdb_taxs command.
Interrogate the database
To interrogate the database, use the cw_query_database command.
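Because the local CAZyme database is a standard SQLite3 file, it can also be opened directly with Python's built-in `sqlite3` module. The sketch below lists the tables in a database file without assuming anything about the schema. `list_tables` is a hypothetical helper, not part of cazy_webscraper, and the path you pass in should point at your own database file.

```python
import sqlite3
from contextlib import closing


def list_tables(db_path):
    """Return the names of all tables in an SQLite database file."""
    # closing() ensures the connection is released even if the query fails.
    with closing(sqlite3.connect(db_path)) as connection:
        rows = connection.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    return [name for (name,) in rows]
```

Note that `sqlite3.connect` creates an empty file if the path does not exist, so check the path first when an empty list comes back unexpectedly.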
Best practice
When performing a series of many automated, repeated calls to a server it is polite to do this when internet traffic is lowest at the server. This is typically at the weekend and overnight.
When using cazy_webscraper to retrieve data from UniProt, NCBI or PDB, the webscraper can appear to run slowly, but this may be due to bandwidth at the database server, or server speed. cazy_webscraper provides a progress bar to reassure the user that the webscraper is working.
Warning
Please do not perform a retrieval of UniProt, NCBI and/or PDB data for the entire CAZy dataset unless absolutely unavoidable. Retrieving the data from any of these external databases for the entire CAZy dataset will take several hours and may unintentionally deny the service to others.
Documentation
For details and updates on development, please consult the GitHub repository.
- Installation
- Quickstart
- Usage: Scraping CAZy
- Tutorials on configuring cazy_webscraper to scrape CAZy
- Configuration via the command line
- Options configurable at the command line
- Configuring where the output is saved
- Specifying CAZy classes and families to scrape
- Applying taxonomic
- Enabling retrieving subfamily annotations
- Combining CAZy class, CAZy family and taxonomy filters
- Configuration file
- Using a configuration and the command-line
- Additional operations to fine-tune how cazy_webscraper operates
- The Local CAZyme Database Structure
- Retrieving structure files from PDB
- Retrieving data from UniProt
- Tutorials on configuring cazy_webscraper to retrieve data from UniProt
- Retrieving Protein Sequences from GenBank
- Retrieving Sequences from GenBank Tutorial
- Extract protein sequences from the local database
- Tutorials on configuring the extraction of protein sequences
- Retrieving structure files from PDB
- Tutorials on configuring cazy_webscraper to retrieve data from PDB
- Retrieving NCBI Taxonomic Classifications
- Tutorials on configuring cazy_webscraper to retrieve NCBI taxonomic classifications
- Retrieving genomic assembly data from NCBI Assembly
- Tutorials on configuring cazy_webscraper to retrieve NCBI genomic assembly data
- Retrieving GTDB Taxonomic Classifications
- Tutorials on configuring cazy_webscraper to retrieve GTDB taxonomic classifications
- Configuration via the command line
- Selecting source kingdoms
- Options configurable at the command line
- Retrieving taxonomy classifications for specific CAZy classes and families
- Applying taxonomic
- Applying EC number filter
- Combining all filters
- Providing a list of accessions
- Providing GTDB datafiles
- Updating genomic classifications
- Interrogating the data using the API
- Tutorial on interrogating the data using the API
- Configuration via the command line
- Accepted file formats
- Options configurable at the command line
- Choosing an output directory
- Overwrite existing files
- Retrieving protein data for CAZy classes and families to scrape
- Applying taxonomic
- Applying EC number filter
- Combining all filters
- Customising the output
- Caching and Using the Cache
- The Logs table
- Cache files when retrieving data from CAZy
- Cache files when retrieving data from UniProt
- Cache files when retrieving protein sequences from GenBank
- Cache files when retrieving taxonomic information from NCBI
- Cache files when retrieving data from PDB
- Cache files when retrieving genomic information from NCBI
- Cache files when retrieving taxonomic information from GTDB
- Cache directory
- Integrate a local CAZyme database into downstream analyses
- Contributing
- Citations: cite cazy_webscraper and dependencies
- License
- Questions?
Citing cazy_webscraper
If you use cazy_webscraper in your work, please cite it (including the provided DOI), as well as the specific version of the tool you use. This not only helps us as developers to get our work out into the world, but is also essential for the reproducibility and integrity of scientific research.
Citation:
Hobbs, E. E. M., Gloster, T. M., and Pritchard, L. (2022) ‘cazy_webscraper: local compilation and interrogation of comprehensive CAZyme datasets’, bioRxiv, https://doi.org/10.1101/2022.12.02.518825
This paper includes a full description of the operation and examples of use.
cazy_webscraper depends on a number of tools. To recognise the contributions that the authors and developers have made, please also cite the following:
- When making an SQLite database:
Hipp, R. D. (2020) SQLite, available: https://www.sqlite.org/index.html.
- Retrieving taxonomic, genomic or sequence data from NCBI:
Cock, P.J.A., Antao, T., Chang, J.T., Chapman, B.A., Cox, C.J., Dalke, A., et al (2009) Biopython: freely available Python tools for computational molecular biology and bioinformatics, Bioinformatics, 25(11), 1422-1423.
Wheeler, D.L., Benson, D.A., Bryant, S., Canese, K., Church, D.M., Edgar, R., Federhen, S., Helmberg, W., Kenton, D., Khovayko, O., et al (2005) Database resources of the National Center for Biotechnology Information: Update, Nucleic Acids Research, 33, D39-D45.
- Retrieving data from UniProt:
Cokelaer, T., Pultz, D., Harder, L. M., Serra-Musach, J., Saez-Rodriguez, J. (2013) BioServices: a common Python package to access biological Web Services programmatically, Bioinformatics, 29(24), 3241-3242.
- Downloading protein structure files from RCSB PDB:
Berman, H.M., Westbrook, J., Feng, Z., Gilliland, G., Bhat, T.N., Weissig, H., et al (2000) The Protein Data Bank, Nucleic Acids Research, 28(1), 235-242.
Hamelryck, T., Manderick, B. (2003) PDB parser and structure class implemented in Python, Bioinformatics, 19(17), 2308-2310.
- Retrieving and using taxonomic data from GTDB:
Parks, D.H., Chuvochina, M., Rinke, C., Mussig, A.J., Chaumeil, P., Hugenholtz, P. (2022) GTDB: an ongoing census of bacterial and archaeal diversity through a phylogenetically consistent, rank normalized and complete genome-based taxonomy, Nucleic Acids Research, 50(D1), D785-D794.
Development and issues
If there are additional features you would like added, you have problems with the scraper, or you would like to contribute, please raise an issue at the GitHub repository.