Extract protein squences from the local database
cazy_webscraper
can be used to extract protein sequences stored in the local CAZyme database.
The extracted protein sequences can be written to any combination of: * A single FASTA file containing all extracted sequences * One FASTA file per extracted sequence * A BLASTp database
GenBank and/or UniProt protein sequences retrieved can be extracted.
Quick Start
To extract all protein sequences previously retrieved from GenBank and UniProt, use the following command structure:
cw_extract_db_seqs <path to local CAZyme db> genbank uniprot
Note
The cw
prefix on command is an abbreviation of cazy_webscraper
.
Note
‘genbank’ and ‘uniprot’ are not case sensitive, therefore, both GenBank and UniProt are also accepted.
Warning
At least one database (either GenBank or UniProt) must be provided.
Command line options
database
- REQUIRED Path to a local CAZyme database to add UniProt data to.
source
- REQUIRED Define source databases of protein sequences. Accepts ‘genbank’ and ‘uniprot’. To list both, separate with a single space, e.g.
`cw_extract_sequence cazy_database.db genbank uniprot
The database names are not case sensitive, therefore, both GenBank and genbank are accepted.
--blastdb
, -b
- Create BLAST database of extracted protein sequences. Provide the path to the directory to store the BLAST database in.
--fasta_dir
- Write out each extracted sequence to a separate FASTA file in the provided dir. Provide a path to a directory to write out the FASTA files.
--fasta_file
- Write out all extracted sequences to a single FASTA file. Provide a path to write out the FASTA file.
Warning
At least one of --blastdb
, --fasta_dir
, and --fasta_file
must be called to inform cazy_webscraper where to write the output to. If none are called sequences will be extracted._
--cache_dir
- Path to cache dir to be used instead of default cache dir path.
--cazy_synonyms
- Path to a JSON file containing accepted CAZy class synonsyms if the default are not sufficient.
--config
, -c
- Path to a configuration YAML file. Default: None.
--classes
- List of classes from which all families are to be scrape.
--ec_filter
- Limist retrieval of protein data to proteins annotated with a provided list of EC numbers. Separate the EC numbers bu single commas without spaces. Recommend to wrap the entire str in quotation marks, for example:
.. code-block:: bash
cw_get_uniprot_data my_cazyme_db/cazyme_db.db –ec_filter ‘EC1.2.3.4,EC2.3.1.-’
--force
, -f
- Force overwriting exsiting files and writing to existing output directory.
--families`
- List of CAZy (sub)families to scrape.#
--kingdoms
- List of taxonomy kingdoms to retrieve UniProt data for.
--genbank_accession
- Path to text file containing a list of GenBanks accessions to extract protein sequences for. A unique accession per line.
--genera
- List of genera to restrict the scrape to. Default: None, filter not applied to scrape.
--log
, -l
- Target path to write out a log file. If not called, no log file is written. Default: None (no log file is written out).
--nodelete
- When called, content in the existing output dir will not be deleted. Default: False (existing content is deleted).
--nodelete_cache
- When called, content in the existing cache dir will not be deleted. Default: False (existing content is deleted).
--sql_echo
- Set SQLite engine echo parameter to True, causing SQLite to print log messages. Default: False.
--species
- List of species written as Genus Species) to restrict the scraping of CAZymes to. CAZymes will be retrieved for all strains of each given species.
--strains
- List of specific species strains to restrict the scraping of CAZymes to.
--uniprot_accessions
- Path to text file containing a list of UniProt accessions to extract protein sequences for. A unique accession per line.
--verbose
, -v
- Enable verbose logging. This does not set the SQLite engine echo parameter to True. Default: False.
Basic Usage
The command-line options listed above can be used in combination to customise the retrieval the extraction of protein sequences
to proteins of interest. Some options (e.g. --families
and --classes
) define the broad group of proteins,
others (e.g. --species
) are used to filter and fine-tune the protein dataset.
The --classes
, --families
, --kingdoms
, --genera
, --species
, and --strains
filteres are applied
in the exactly same for retrieving data from CAZy and UniProt. Examples of using these flags
can be found in the cazy_webscraper
and cw_get_uniprot_data
tutorial in this documentation.
Note
To extract protein sequences for members of specific CAZy subfamilies, list the subfamilies after the --families
flag.