Retrieving Protein Sequences from GenBank

cazy_webscraper can be used to retrieve protein amino acid sequences from NCBI GenBank for user-specified data sets of CAZymes in the local CAZymes database.

The retrieval of data from NCBI is performed by using the BioPython Bio.entrez <https://biopython.org/docs/1.75/api/Bio.Entrez.html>_ module [Cock et al., 2009].

Cock, P. J. A, Antao, T., Chang, J. T., Chapman, B. A., Cox, C. J., Dalke, A. _et al._ (2009) ‘Biopython: freely available Python tools for computaitonal molecular biology and bioinformatics’, _Bioinformatics_, 25(11), pp. 1422-3.

Note

For specific information of the Bio.entrez module please see the entrez documentation.

Quick Start

To download protein sequences for all CAZymes in the local CAZyme database, and write them to the local CAZyme database, use the following command structure:

cw_get_genbank_seqs 'path to local CAZyme db' 'user email address'

For example:

cw_get_genbank_seqs cazy/cazyme.db myemail@domain.com

Note

The cw prefix is an abbreviation of cazy_webscraper.

Command line options

database - REQUIRED Path to a local CAZyme database to add UniProt data to.

email - REQUIRED User email address, required by NCBI Entrez.

--batch_size - Size of batch query posted to NCBI Entrez. Default 150.

--cache_dir - Path to cache dir to be used instead of default cache dir path.

--cazy_data - Path to a txt file downloaded from CAZy containing a CAZy database dump

--cazy_synonyms - Path to a JSON file containing accepted CAZy class synonsyms if the default are not sufficient.

--config, -c - Path to a configuration YAML file. Default: None.

--classes - list of classes to retrieve UniProt data for.

--ec_filter - List of EC numbers to limit the retrieval of protein data for proteins annotated with at least one of the given EC numbers in the local CAZyme database.

--force, -f - Force writing cachce to exiting cache directory.

--file_only, -F - Only add seqs provided via JSON and/or FASTA file. Do not retrieved data from NCBI.

--families - List of CAZy (sub)families to retrieve UniProt protein data for.

--genbank_accessions - Path to text file containing a list of GenBank accessions to retrieve protein data for. A unique accession per line.

--genera - List of genera to restrict the retrieval of protein to data from UniProt to proteins belonging to one of the given genera.

--kingdoms - List of taxonomy kingdoms to retrieve UniProt data for.

--log, -l - Target path to write out a log file. If not called, no log file is written. Default: None (no log file is written out).

--nodelete_cache - When called, content in the existing cache dir will not be deleted. Default: False (existing content is deleted).

--nodelete_log - When called, content in the existing log dir will not be deleted. Default: False (existing content is deleted).

--retries, -r - Define the number of times to retry making a connection to CAZy if the connection should fail. Default: 10.

--seq_dict, - Path to a JSON file, keyed by GenBank accessions and valued by protein sequence. This file is created as part of the cache, after all protein sequences are retrieved from GenBank. This skips the retrieval of the protein sequences from GenBank only for those seqs included in the file.

--seq_file, - Path to a JSON file, keyed by GenBank accessions and valued by protein sequence. This file is created as part of the cache, after all protein sequences are retrieved from GenBank. This skips the retrieval of the protein sequences from GenBank only for those seqs included in the file.

--seq_update - If a newer version of the protein sequence is available, overwrite the existing sequence for the protein in the database. Default is false, the protein sequence is not overwritten and updated.

--sql_echo - Set SQLite engine echo parameter to True, causing SQLite to print log messages. Default: False.

--species - List of species (organsim scientific names) to restrict the retrieval of protein to data from UniProt to proteins belonging to one of the given species.

--strains - List of species strains to restrict the retrieval of protein to data from UniProt to proteins belonging to one of the given strains.

--verbose, -v - Enable verbose logging. This does not set the SQLite engine echo parameter to True. Default: False.

Basic Usage

The command-line options listed above can be used in combination to customise the scraping of CAZy. Some options (e.g. --families and --classes) define the broad group of data that will be scraped, others (e.g. --species) are used to filter and fine-tune the data that is scraped.

The --classes, --families, --kingdoms, --genera, --species, and --strains filteres are applied in the exactly same for retrieving data from CAZy as retrieving protein sequences from GenBank and protein data from UniProt. Examples of using these flags can be found in the tutorial.

The --seq_update flag is used in the same way for retrieving protein sequences from UniProt and GenBank.

Note

To retrieve data for members of specific CAZy subfamilies, list the subfamilies after the --families flag.

Updating local sequences

When using --sequence flag, cazy_webscraper will only add new protein sequences to the database, i.e. it will only add protein sequences to records that do not have a sequence. Therefore, if a protein already has a sequence in the local database, this sequence is not overwritten.

You may wish to update the protein sequences in your local CAZyme database. To do this use the --sequence/-s flag to tell cazy_webscraper to retrieve protein sequences, and use the --seq_update flag.

cw_get_genbank_seqs cazy_db.db -s --seq_update

This instructs cazy_webscraper to overwriting existing protein sequences in the local database if a newer version of the sequence is retrieved from UniProt. This is checked by comparing the ‘last modified date’ of the protein sequence in the local database against the sequence retrieved from UniProt.