Quickstart

Installation

The most recent version of cazy_webscraper can be installed on your local machine using conda or pip. Both methods will install the cazy_webscraper command-line tool, and the Python package cazy_webscraper.

conda

conda install cazy_webscraper

pip

pip should distribute the latest version of cazy_webscraper, although there may be some minor lag between GitHub releases and pip.

pip3 install cazy_webscraper

Tip

cazy_webscraper can also be installed directly from source. More detailed, and alternative installation instructions can be found in the Installation section.

Getting Started Poster

For a quick summary of how to get started, check out our poster:

Hobbs, Emma E. M.; Pritchard, Leighton; Gloster, Tracey M.; Chapman, Sean (2021): cazy_webscraper - getting started. FigShare. Poster. https://doi.org/10.6084/m9.figshare.14370869.v3

Getting Help From cazy_webscraper

To invoke the webscraper and get basic help, call the webscraper at the command line:

cazy_webscraper -h

The default behaviour of the scraper is:

  • Scrape all entries in the CAZy database

  • Write the resulting data to STDOUT

  • Do not retrieve subfamilies (subfamily members will be retrieved but only their parent family be listed)

  • Do not retrieve FASTA files from GenBank

  • Do not retrieve protein sequences from PDB

Downloading a CAZy Family

To download the single CAZy family GH169, use the command:

cazy_webscraper <user email> --families GH169 -o GH169

This will create a new directory GH169 in the current working directory, and will download all CAZy entries in the GH169 family to a new SQLite3 database in that directory.

Note

To retrieve data from NCBI cazy_webscraper utilises Entrez, and therefore, a user email address must be provided as this a requirement of Entrez. When downloading CAZyme records from CAZy, cazy_webscraper retrieves the latest taxonomic classification for proteins assigned to multiple taxonomies in NCBI, and thus cazy_webscraper requires an email address.