Quickstart
Installation
The most recent version of cazy_webscraper can be installed on your local machine using conda or pip. Both methods install the cazy_webscraper command-line tool and the Python package cazy_webscraper.
conda
conda install cazy_webscraper
pip
pip should distribute the latest version of cazy_webscraper, although there may be some minor lag between GitHub releases and pip.
pip3 install cazy_webscraper
Tip
cazy_webscraper can also be installed directly from source. More detailed and alternative installation instructions can be found in the Installation section.
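Because both installation methods are described as providing a command-line tool and a Python package, one way to confirm that an installation succeeded is to check for each from Python. The following is a minimal sketch using only the standard library; the helper name check_install is hypothetical and is not part of cazy_webscraper.

```python
import importlib.util
import shutil


def check_install(name="cazy_webscraper"):
    """Report whether `name` is available as a command and as a Python package.

    Hypothetical helper for illustration; not part of cazy_webscraper.
    """
    return {
        # True if an executable called `name` is found on the PATH
        "command_on_path": shutil.which(name) is not None,
        # True if a top-level package called `name` can be located for import
        "package_importable": importlib.util.find_spec(name) is not None,
    }


if __name__ == "__main__":
    for check, passed in check_install().items():
        print(f"{check}: {'yes' if passed else 'no'}")
```

If either check reports "no", the environment in which Python is running is probably not the one into which cazy_webscraper was installed.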
Getting Started Poster
For a quick summary of how to get started, check out our poster:
Hobbs, Emma E. M.; Pritchard, Leighton; Gloster, Tracey M.; Chapman, Sean (2021): cazy_webscraper - getting started. FigShare. Poster. https://doi.org/10.6084/m9.figshare.14370869.v3
Getting Help From cazy_webscraper
To get basic help, call cazy_webscraper at the command line with the -h flag:
cazy_webscraper -h
The default behaviour of the scraper is:
Scrape all entries in the CAZy database
Write the resulting data to STDOUT
Do not retrieve subfamilies (subfamily members will be retrieved, but only their parent family will be listed)
Do not retrieve FASTA files from GenBank
Do not retrieve protein sequences from PDB
Downloading a CAZy Family
To download the single CAZy family GH169, use the command:
cazy_webscraper <user email> --families GH169 -o GH169
This will create a new directory GH169 in the current working directory, and will download all CAZy entries in the GH169 family to a new SQLite3 database in that directory.
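Once the download has finished, the resulting SQLite3 database can be inspected with Python's built-in sqlite3 module. The sketch below makes no assumption about the database file name or its table names, neither of which is stated here: it locates files in the output directory by their SQLite 3 file header and then lists whatever tables they contain. The helper names find_sqlite_files and table_row_counts are hypothetical.

```python
import sqlite3
from pathlib import Path

# Every SQLite 3 database file begins with this 16-byte header
SQLITE_MAGIC = b"SQLite format 3\x00"


def find_sqlite_files(directory):
    """Return files under `directory` that start with the SQLite 3 header."""
    root = Path(directory)
    if not root.is_dir():
        return []
    found = []
    for path in sorted(root.rglob("*")):
        if path.is_file():
            with path.open("rb") as handle:
                if handle.read(16) == SQLITE_MAGIC:
                    found.append(path)
    return found


def table_row_counts(db_path):
    """Map each table name in the database to its number of rows."""
    conn = sqlite3.connect(db_path)
    try:
        names = [
            row[0]
            for row in conn.execute(
                "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
            )
        ]
        return {
            name: conn.execute(f'SELECT COUNT(*) FROM "{name}"').fetchone()[0]
            for name in names
        }
    finally:
        conn.close()


if __name__ == "__main__":
    # "GH169" is the output directory used in the command above
    for db_file in find_sqlite_files("GH169"):
        print(db_file)
        for table, count in table_row_counts(db_file).items():
            print(f"  {table}: {count} rows")
```

Listing the tables first is a cautious starting point, since the names it prints can then be used in further queries.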
Note
To retrieve data from NCBI, cazy_webscraper uses Entrez, which requires a user email address. When downloading CAZyme records from CAZy, cazy_webscraper retrieves the latest taxonomic classification for proteins assigned to multiple taxonomies in NCBI, and thus cazy_webscraper requires an email address.
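When the download is driven from a script, the same command can be assembled as an argument list, with the email address passed as the first positional argument. This is a minimal sketch that mirrors only the command shown above (email, --families, -o); the function names and the placeholder address user@example.org are assumptions for illustration.

```python
import subprocess


def build_command(email, family, output_dir):
    """Build the argument list for the single-family command shown above."""
    return ["cazy_webscraper", email, "--families", family, "-o", output_dir]


def run_scrape(email, family, output_dir):
    """Run the command; raises CalledProcessError on a non-zero exit code."""
    return subprocess.run(build_command(email, family, output_dir), check=True)


if __name__ == "__main__":
    # Placeholder address (an assumption): replace it with your own email
    print(" ".join(build_command("user@example.org", "GH169", "GH169")))
```

Passing the arguments as a list avoids shell quoting issues, and run_scrape is only called when the download should actually start.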