Tutorials on configuring `cazy_webscraper` to scrape CAZy

cazy_webscraper can be configured to retrieve user specified data sets from CAZy. The same configuration applies to the retrieval of protein data from UniProt, GenBank and PDB.

cazy_webscraper can be configured via the command line and/or via a YAML configuration file.

This page runs through examples of how to combine the various ‘filters’ that can be applied, to fully customise the scraping of CAZy. These tutorials are designed for those with less experience using command-line tools.

Note

If you installed cazy_webscraper using Bioconda or pip, invoke it using the command cazy_webscraper. This is the method used in this tutorial.

If you installed cazy_webscraper from source then you will need to invoke cazy_webscraper using the command python3 <path to cazy_scraper.py>. For example, if you were located in the root of the repository, you would use: python3 cazy_webscraper/cazy_scraper.py.

Configuration via the command line

cazy_webscraper has only one required argument, the user email address. Therefore, the scraper can be set to scrape all of CAZy using the following command:

cazy_webscraper myemail@domain.com

When no optional arguments are provided, cazy_webscraper uses its default behaviour, which is to:

Scrape the entire CAZy database
Write the resulting database to the current working directory
Not retrieve subfamilies (members of subfamilies will be retrieved but only their parent family will be listed)

Note

For those new to using command-line tools: arguments are additional pieces of information we add onto the end of a command. They configure the specific behaviour the computer performs when we tell it to run that command. In the example above the command is cazy_webscraper myemail@domain.com, where we have told the computer to run the Python program cazy_webscraper and submit the user email address to NCBI for the retrieval of source organism data. Because no additional information was provided, the computer invokes cazy_webscraper using its default behaviour. If you do not want the default behaviour of cazy_webscraper, you need to pass additional arguments when telling the computer to run cazy_webscraper, which we cover in the section below.

Options configurable at the command line

The following behaviours of cazy_webscraper can be configured at the command line in the terminal:

Limit the scraping of CAZy to specific CAZy classes, CAZy families, kingdoms, genera, species, and/or strains
Force writing out the database to a new or existing directory
Write out a log file of the package's operation
Not delete content already present in the output directory
Enable retrieving subfamilies
Enable verbose logging during the operation of the webscraper

Here you can find a full list of the command-line flags and options.

How to use the command-line options

The command-line options listed above can be used in any combination to customise the scraping of CAZy. The options that apply a ‘filter’ to restrict which CAZymes are scraped from CAZy are applied in combination. For example, if the --families option and --genera option are called then only CAZymes from the specified families and belonging to source organisms within the defined genera will be retrieved.

We will now walk through some examples of how to use cazy_webscraper. All example code presumes cazy_webscraper was installed using Bioconda or pip and can therefore be called simply using the command cazy_webscraper.

Note

For those new to using command-line tools: command-line flags are used to tell the computer specifically which option(s) to change. Flags always come after the command.

The abbreviated version of a flag is prefixed with a single dash, followed by a single letter. For example, -s, -o and -l are all shorthand flags.

The long version of a flag is prefixed with two dashes, followed by complete words. For example, --output is the long version of -o.

The flags used by a program are defined within the program. This means the flag -o may represent different options for different programs. Always make sure to check the documentation to see what flags are provided with the program, and what they do!

You can use the command-line to list all flags for a program/tool by typing in the command to invoke that tool, followed by the flag --help or -h. For example: cazy_webscraper --help.

Configuring where the output is saved

Creating a new database

Instead of writing the database to the current working directory using the default database name (cazy_webscraper_<date>_<time>.db), we can choose the name of the database, and the directory it is written to, by using the --output flag.

We add the flag to the command that invokes cazy_webscraper. For example, to write the output to the directory ‘cazyme_database’ with the file name ‘cazyme_database.db’ we can use:

cazy_webscraper --output cazyme_database/cazyme_database.db

OR we can use the short hand version of the --output flag, -o:

cazy_webscraper -o cazyme_database/cazyme_database.db

Note

The final element of the path provided after the --output / -o flag is the name of the database compiled by cazy_webscraper.

The output directory does not have to exist when cazy_webscraper is invoked. cazy_webscraper can make the output directory, including all necessary parent directories.

The --output flag can take a path of any depth. For example, we could use:

cazy_webscraper -o data/cazyme_research/cazyme_database/cazyme_database.db

If the directories cazyme_research and cazyme_database do not exist, cazy_webscraper will create them for you.
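The directory-creation behaviour described in this note can be pictured with Python's standard pathlib module. This is a minimal sketch, not cazy_webscraper's own code: the helper name prepare_output_path is hypothetical, and the temporary directory is only there so the example does not touch real files.

```python
import tempfile
from pathlib import Path


def prepare_output_path(db_path: Path) -> Path:
    """Create any missing parent directories for db_path, then return db_path."""
    # parents=True builds every missing directory in the chain;
    # exist_ok=True means an already existing directory is not an error.
    db_path.parent.mkdir(parents=True, exist_ok=True)
    return db_path


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "data" / "cazyme_research" / "cazyme_database" / "cazyme_database.db"
    prepare_output_path(target)
    parents_created = target.parent.is_dir()  # the nested directories now exist
    file_created = target.exists()            # the database file itself is not created here
```

Only the directories are created in this sketch; writing the database file is left to the program that uses the path.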

Overwriting an existing database or directory

If you want to write the output CAZyme database to a directory and/or file that already exists, you will need to add the force flag (--force or -f) anywhere in the cazy_webscraper command. For example:

cazy_webscraper -o data/cazyme_research/cazyme_database/cazyme_database.db -f

By default cazy_webscraper will delete all content in an already existing output directory. Therefore, in the above example, if the directory cazyme_database already existed, cazy_webscraper would delete all content in the directory then proceed.

You may wish to retain the data already in that directory. To do this add the ‘no delete’ (--nodelete or -n) flag anywhere to the cazy_webscraper command. For example:

cazy_webscraper -o data/cazyme_research/cazyme_database/cazyme_database.db -f -n

The order in which you invoke the optional flags does not matter. For example, the following three commands perform exactly the same operation as the command given above:

cazy_webscraper --force -o data/cazyme_research/cazyme_database/cazyme_database.db -n

cazy_webscraper -n -o data/cazyme_research/cazyme_database/cazyme_database.db -f

cazy_webscraper --nodelete --force --output data/cazyme_research/cazyme_database/cazyme_database.db

The above examples also highlight that it does not matter if you use the long or short versions of each of the flags.
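To see why flag order and flag length do not matter, here is a small sketch using Python's standard argparse module. The parser below is a hypothetical miniature with only three flags (it is not cazy_webscraper's real parser, and it omits the email argument); it simply shows that every ordering parses to the same settings.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a tiny, hypothetical parser with three of the flags used above."""
    parser = argparse.ArgumentParser(prog="toy_scraper")
    parser.add_argument("-o", "--output", default=None)
    parser.add_argument("-f", "--force", action="store_true")
    parser.add_argument("-n", "--nodelete", action="store_true")
    return parser


DB = "data/cazyme_research/cazyme_database/cazyme_database.db"

# Four ways of writing the same request: different order, short or long flags.
orderings = [
    ["-o", DB, "-f", "-n"],
    ["--force", "-o", DB, "-n"],
    ["-n", "-o", DB, "-f"],
    ["--nodelete", "--force", "--output", DB],
]

parsed = [vars(build_parser().parse_args(args)) for args in orderings]
all_same = all(settings == parsed[0] for settings in parsed)
print(all_same)
```

Each ordering produces the same dictionary of settings, which is why the commands above are interchangeable.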

Note

If you elect to write the database to a file in the current working directory, you do not need to worry about cazy_webscraper deleting content in the current working directory. Deletion only applies if you choose to write the database to a directory other than the current working directory.

Add the scraped data to an existing CAZyme database

You may wish to scrape CAZy in multiple stages; maybe your internet dropped out while scraping CAZy and you don’t want to start again, or maybe you scraped CAZy but missed out a species of interest. No matter the reason, cazy_webscraper allows you to add more CAZyme data to an existing database previously created by cazy_webscraper.

To do this add the database (--database or -d) flag to the cazy_webscraper command, followed by the path to the CAZyme database you want to add your scraped CAZy data to. For example, to add data to an existing database in cazy/cazyme_db.db use the command:

cazy_webscraper --database cazy/cazyme_db.db

Note

Don’t forget the .db file extension at the end of the path!

All the paths we pass to cazy_webscraper in this tutorial are relative paths. This means cazy_webscraper will start in the directory the terminal is currently pointed at, and follow the path from there. For example, if we used the command:

cazy_webscraper -d my_cazyme_databases/my_cazyme_database.db

Then the computer will look for a directory called my_cazyme_databases in the directory the terminal is looking at, then within the my_cazyme_databases directory the computer will look for the file my_cazyme_database.db.
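The way a relative path is followed from the current directory can be sketched with Python's standard pathlib module. The working directory /home/user/project and the helper name resolve_from are assumptions made purely for illustration.

```python
from pathlib import PurePosixPath


def resolve_from(working_dir: PurePosixPath, user_path: str) -> PurePosixPath:
    """Join a user-supplied path onto the working directory.

    If user_path is already absolute, pathlib returns it unchanged.
    """
    return working_dir / user_path


cwd = PurePosixPath("/home/user/project")  # hypothetical terminal location
resolved = resolve_from(cwd, "my_cazyme_databases/my_cazyme_database.db")
absolute = resolve_from(cwd, "/srv/dbs/my_cazyme_database.db")
print(resolved)
```

PurePosixPath only manipulates path text, so the sketch runs without any of these directories existing.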

Specifying CAZy classes and families to scrape

Scraping specific classes

If instead of scraping all of CAZy, you want to only scrape CAZymes from specific CAZy classes then add the --classes flag followed by the classes you want to scrape. If you want to list multiple classes, separate the classes with a single comma. When you specify a CAZy class to scrape, all CAZy families within that class will be scraped.

For example, if you want to scrape all CAZymes from Glycoside Hydrolases and Carbohydrate Esterases then use the command:

cazy_webscraper --classes "Glycoside Hydrolases,Carbohydrate Esterases"

cazy_webscraper accepts multiple synonyms for each CAZy class:

Glycoside Hydrolases (GHs): Glycoside-Hydrolases, Glycoside-Hydrolases, Glycoside_Hydrolases, GlycosideHydrolases, GLYCOSIDE-HYDROLASES, GLYCOSIDE-HYDROLASES, GLYCOSIDE_HYDROLASES, GLYCOSIDEHYDROLASES, glycoside-hydrolases, glycoside-hydrolases, glycoside_hydrolases, glycosidehydrolases, GH, gh
GlycosylTransferases (GTs): Glycosyl-Transferases, GlycosylTransferases, Glycosyl_Transferases, Glycosyl Transferases, GLYCOSYL-TRANSFERASES, GLYCOSYLTRANSFERASES, GLYCOSYL_TRANSFERASES, GLYCOSYL TRANSFERASES, glycosyl-transferases, glycosyltransferases, glycosyl_transferases, glycosyl transferases, GT, gt
Polysaccharide Lyases (PLs): Polysaccharide Lyases, Polysaccharide-Lyases, Polysaccharide_Lyases, PolysaccharideLyases, POLYSACCHARIDE LYASES, POLYSACCHARIDE-LYASES, POLYSACCHARIDE_LYASES, POLYSACCHARIDELYASES, polysaccharide lyases, polysaccharide-lyases, polysaccharide_lyases, polysaccharidelyases, PL, pl
Carbohydrate Esterases (CEs): Carbohydrate Esterases, Carbohydrate-Esterases, Carbohydrate_Esterases, CarbohydrateEsterases, CARBOHYDRATE ESTERASES, CARBOHYDRATE-ESTERASES, CARBOHYDRATE_ESTERASES, CARBOHYDRATEESTERASES, carbohydrate esterases, carbohydrate-esterases, carbohydrate_esterases, carbohydrateesterases, CE, ce
Auxiliary Activities (AAs): Auxiliary Activities, Auxiliary-Activities, Auxiliary_Activities, AuxiliaryActivities, AUXILIARY ACTIVITIES, AUXILIARY-ACTIVITIES, AUXILIARY_ACTIVITIES, AUXILIARYACTIVITIES, auxiliary activities, auxiliary-activities, auxiliary_activities, auxiliaryactivities, AA, aa
Carbohydrate-Binding Modules (CBMs): Carbohydrate-Binding-Modules, Carbohydrate_Binding_Modules, Carbohydrate_Binding Modules, CarbohydrateBindingModules, CARBOHYDRATE-BINDING-MODULES, CARBOHYDRATE_BINDING_MODULES, CARBOHYDRATE_BINDING MODULES, CARBOHYDRATEBINDINGMODULES, carbohydrate-binding-modules, carbohydrate_binding_modules, carbohydrate_binding modules, carbohydratebindingmodules, CBMs, CBM, cbms, cbm

Tip

These synonyms are stored in a JSON file found at scraper/utilities/parse_configuration/cazy_dictionary.json. You can modify this file if you wish to add your own synonyms for each CAZy class.

If you have your own synonyms, you can use them by adding the --cazy_synonyms flag, followed by the path to your JSON file. This JSON file must have the same architecture as the JSON file used by cazy_webscraper.
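As a rough illustration of how a synonym file can be used, the sketch below builds a lookup table from a small JSON document. The layout shown here (standard class name mapped to a list of synonyms) is an assumption for the example; check cazy_dictionary.json itself for the real architecture before writing your own file. The function name build_lookup is hypothetical.

```python
import json

# Hypothetical excerpt: the real file's layout and contents may differ.
SYNONYMS_JSON = """
{
    "Glycoside Hydrolases (GHs)": ["Glycoside-Hydrolases", "GH", "gh"],
    "Polysaccharide Lyases (PLs)": ["Polysaccharide Lyases", "PL", "pl"]
}
"""


def build_lookup(raw_json: str) -> dict:
    """Map every synonym (and each standard name) to its standard class name."""
    table = json.loads(raw_json)
    lookup = {}
    for standard_name, synonyms in table.items():
        lookup[standard_name] = standard_name
        for synonym in synonyms:
            lookup[synonym] = standard_name
    return lookup


lookup = build_lookup(SYNONYMS_JSON)
print(lookup["gh"])
```

Any name that is missing from the table is simply absent from the lookup, so an unrecognised class can be reported to the user.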

Scraping specific families

To specify specific CAZy families to scrape, add the --families flag followed by the families you want to scrape. If you want to scrape multiple families, list all the families you wish to scrape, with each family separated with a single comma.

For example, if you want to scrape all CAZymes from GH2, PL5, CE1, CE2 and AA10 use:

cazy_webscraper --families GH2,PL5,CE1,CE2,AA10

Warning

Make sure to use the accepted CAZy nomenclature; ‘GH2’ is accepted but ‘gh2’ is not.
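If you build family lists programmatically, a quick format check can catch names such as ‘gh2’ before you run the scraper. The pattern below is an assumption derived from the family and subfamily names used in this tutorial (an upper-case class abbreviation, a number, and an optional _number suffix); it is not taken from cazy_webscraper, and the function name is hypothetical.

```python
import re

# Assumed pattern: class abbreviation, family number, optional subfamily suffix.
FAMILY_PATTERN = re.compile(r"(GH|GT|PL|CE|AA|CBM)\d+(_\d+)?")


def looks_like_family(name: str) -> bool:
    """Return True when name follows the CAZy family notation used in this tutorial."""
    return FAMILY_PATTERN.fullmatch(name) is not None


checks = {name: looks_like_family(name) for name in ["GH2", "gh2", "GH3_1", "2"]}
print(checks)
```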

Scraping specific classes AND families

If you want to specify specific CAZy classes and families to scrape, then add both the --classes and --families flags. You can mix and match any combination of optional flags when invoking cazy_webscraper.

For example, if we wanted to scrape all CAZymes from GH1, PL9 and all of CE we would use the command:

cazy_webscraper --families GH1,PL9 --classes CE

It does not matter what order you add the optional flags to your command. Therefore, if we wanted to scrape all CAZymes from PL1, PL2, PL3 and all of GH and CE, both:

cazy_webscraper --families PL1,PL2,PL3 --classes GH,CE

AND

cazy_webscraper --classes GH,CE --families PL1,PL2,PL3

are accepted.

Note

In the example cazy_webscraper --classes GH,CE --families PL1,PL2,PL3, all CAZymes from PL1, PL2 and PL3 would be retrieved, but no CAZymes from the other PL families. In addition, all CAZymes from all GH and CE families would be retrieved, but no CAZymes from AA, GT or CBM families.

Applying taxonomic filters

Specifying kingdoms

You may only be interested in CAZymes that are derived from species from a specific taxonomic kingdom. CAZy classifies source organisms under one of five kingdoms:

Archaea
Bacteria
Eukaryota
Viruses
Unclassified

To restrict the scraping of CAZy to CAZymes derived from species in specific taxonomic kingdoms, add the --kingdoms flag to the cazy_webscraper command, followed by the kingdoms of interest. To list multiple kingdoms you need only add the --kingdoms flag once, then list all the kingdoms you are interested in, separated by a single comma.

For example, if you want to retrieve CAZymes only from bacterial and eukaryotic species then use the command:

cazy_webscraper --kingdoms bacteria,eukaryota

Warning

The kingdoms must be spelt the same way CAZy spells them, for example use ‘eukaryota’ instead of ‘eukaryote’.

Note

The kingdoms are not case sensitive, therefore, both bacteria and Bacteria are accepted.

Note

You can list the kingdoms in any order. Thus, both bacteria,eukaryota and eukaryota,bacteria are accepted.
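The kingdom rules above (comma-separated, any order, not case sensitive, CAZy spelling) can be sketched as a small parsing function. This is an illustrative sketch rather than cazy_webscraper's implementation, and the function name parse_kingdoms is hypothetical.

```python
# The five kingdoms CAZy uses, as listed above.
KINGDOMS = {"Archaea", "Bacteria", "Eukaryota", "Viruses", "Unclassified"}


def parse_kingdoms(argument: str) -> list:
    """Split a comma-separated kingdom list and normalise the capitalisation."""
    parsed = []
    for name in argument.split(","):
        standard_name = name.strip().capitalize()
        if standard_name not in KINGDOMS:
            raise ValueError(f"Unknown kingdom: {name!r}")
        parsed.append(standard_name)
    return parsed


kingdoms = parse_kingdoms("bacteria,eukaryota")
print(kingdoms)

try:
    parse_kingdoms("eukaryote")  # misspelt: CAZy uses 'Eukaryota'
    misspelling_rejected = False
except ValueError:
    misspelling_rejected = True
```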

Specifying genera to scrape

You can customise the scraping of CAZy to retrieve only CAZymes from all species from specific genera. To do this add the --genera flag to the cazy_webscraper command followed by your genera of interest.

To list multiple genera, you need only add the --genera flag once, followed by a list of all your genera, with each genus separated by a single comma and no spaces.

For example, if we wanted to retrieve all CAZymes from all Aspergillus, Trichoderma and Streptomyces species we would use the command:

cazy_webscraper --genera Aspergillus,Trichoderma,Streptomyces

Note

The order that the genera are listed does not matter.

Warning

Make sure to use the expected practice for writing genus names: each genus starts with a capital letter and all other letters are lower case.

Aspergillus is correct

aspergillus is incorrect

ASPERGILLUS is incorrect

Specifying species of organisms to scrape

You can choose to retrieve only CAZymes derived from specific species. To do this add the --species flag to the cazy_webscraper command, followed by a list of all the species you wish to restrict the retrieval of CAZymes to. Separate each species with a single comma, and use the full scientific name of each species.

For example, if we wanted to retrieve all CAZymes from Aspergillus niger and Aspergillus fumigatus we would use the command:

cazy_webscraper --species "Aspergillus niger,Aspergillus fumigatus"

Note

The order in which the species are listed does not matter. Separate multiple species names with a single comma and no spaces.

Warning

Use the standard scientific name formatting. Capitalise the first letter of the genus and write the first letter of the species in lower case.

Aspergillus niger is correct

aspergillus niger is incorrect

ASPERGILLUS NIGER is incorrect

Warning

When providing any parameter that contains a space, enclose the entire parameter in single or double quotation marks.

"Aspergillus niger,Trichoderma reesei" is correct

'Aspergillus niger,Trichoderma reesei' is correct

Aspergillus niger,Trichoderma reesei (without quotation marks) is incorrect

Therefore, when using the --species and --strains flags, always enclose the provided argument (or parameter) in single or double quotation marks.
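The reason for the quotation marks is that the shell splits a command on spaces before the program ever sees it. Python's standard shlex module follows similar POSIX-style splitting rules, so it can be used as an approximate illustration of what the terminal does:

```python
import shlex

# Without quotation marks, each space starts a new argument.
unquoted = shlex.split("cazy_webscraper --species Aspergillus niger,Aspergillus fumigatus")

# With quotation marks, the whole species list arrives as one argument.
quoted = shlex.split('cazy_webscraper --species "Aspergillus niger,Aspergillus fumigatus"')

# The single quoted argument can then be split on commas into species names.
species = quoted[2].split(",")

print(unquoted)
print(quoted)
print(species)
```

In the unquoted case the species list is broken into three separate arguments, so the program cannot recover the intended species names.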

Warning

When you specify a species, cazy_webscraper will retrieve CAZymes from all strains of the species.

Specify specific strains of species to scrape

You may only be interested in specific strains of a species. Instead of scraping CAZymes for all strains of a given species, add the --strains flag followed by the specific strains you wish to restrict the retrieval of CAZymes to.

List the full scientific name followed by the strain name. To specify multiple strains, list all strains of interest and separate with a single comma with no space.

For example, if we wanted to retrieve all CAZymes from Aspergillus niger ATCC 1015 and Aspergillus uvarum CBS 121591 we would use the command:

cazy_webscraper --strains "Aspergillus niger ATCC 1015,Aspergillus uvarum CBS 121591"

The order in which the strains are listed does not matter.

Note

If you use the --species, --genera and --strains flags in any combination and a source organism matches more than one of the taxonomy criteria, the CAZymes derived from that organism will only be retrieved once.

For example, using the command cazy_webscraper --genera Aspergillus --species "Aspergillus niger" --strains "Aspergillus niger ATCC 1015" will retrieve all CAZymes from all Aspergillus species once.

The higher taxonomy levels take precedence. The command will not retrieve all CAZymes from all Aspergillus species, then a second copy of all CAZymes from Aspergillus niger strains, and then a third copy of all CAZymes from Aspergillus niger ATCC 1015.
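The ‘retrieved only once’ behaviour can be pictured as a single yes/no test per source organism: an organism is selected if it matches any of the criteria, and matching several criteria does not select it several times. The sketch below is an illustration under that assumption, not cazy_webscraper's code; the function name and the list of organisms are made up for the example.

```python
def matches_taxonomy(organism: str, genera=(), species=(), strains=()) -> bool:
    """Return True if the organism matches at least one taxonomy criterion.

    organism is a full name such as 'Aspergillus niger ATCC 1015'.
    """
    parts = organism.split()
    genus = parts[0]
    species_name = " ".join(parts[:2])
    return genus in genera or species_name in species or organism in strains


organisms = [
    "Aspergillus niger ATCC 1015",   # matches the genus, species and strain criteria
    "Aspergillus niger CBS 513.88",  # matches the genus and species criteria
    "Aspergillus fumigatus Af293",   # matches the genus criterion only
    "Trichoderma reesei QM6a",       # matches nothing
]

selected = [
    organism
    for organism in organisms
    if matches_taxonomy(
        organism,
        genera={"Aspergillus"},
        species={"Aspergillus niger"},
        strains={"Aspergillus niger ATCC 1015"},
    )
]
print(selected)
```

Every Aspergillus organism appears exactly once in the result, however many of the three criteria it satisfies.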

Warning

When providing any parameter that contains a space, enclose the entire parameter in single or double quotation marks.

"Aspergillus niger,Trichoderma reesei" is correct

'Aspergillus niger,Trichoderma reesei' is correct

Aspergillus niger,Trichoderma reesei (without quotation marks) is incorrect

Therefore, when using the --species and --strains flags, always enclose the provided argument (or parameter) in single or double quotation marks.

Combining taxonomic filters

You can combine any of the cazy_webscraper optional flags, including the taxonomic filters. For example, you may wish to retrieve all CAZymes derived from all viral species, Aspergillus species, Layia carnosa, Layia chrysanthemoides, Trichoderma reesei QM6a and Trichoderma reesei QM9414. To do this we would combine the respective flags in a single cazy_webscraper command. The command we would use would be:

cazy_webscraper --kingdoms viruses --genera Aspergillus --species "Layia carnosa,Layia chrysanthemoides" --strains "Trichoderma reesei QM6a,Trichoderma reesei QM9414"

Note

This is a single command written on a single line. When typing the command into the terminal do not hit Enter until you have finished the command.

Warning

If you use the --species, --genera and --strains flags in any combination and a source organism matches more than one of the taxonomy criteria, the CAZymes derived from that organism will only be retrieved once. For example, using the command cazy_webscraper --genera Aspergillus --species "Aspergillus niger" --strains "Aspergillus niger ATCC 1015" will retrieve all CAZymes from all Aspergillus species once.

When combining taxonomy filters, the higher taxonomy levels take precedence. For example, the command:

cazy_webscraper --genera Aspergillus --species "Aspergillus niger" --strains "Aspergillus niger ATCC 1015"

will not retrieve all CAZymes from all Aspergillus species AND a second copy of all CAZymes from Aspergillus niger strains. cazy_webscraper will retrieve each CAZyme from the Aspergillus species, including all strains of Aspergillus niger, only once.

Enabling retrieving subfamily annotations

By default cazy_webscraper only retrieves the CAZy family annotation for each protein; it does not retrieve the CAZy subfamily annotation. For example, a CAZyme within the CAZy subfamily GH3_1 will be stored in the local CAZyme database as only a GH3 CAZyme.

To retrieve the CAZy family and CAZy subfamily annotations, add the --subfamilies / -s flag anywhere in the cazy_webscraper command. For example:

cazy_webscraper --families GH3 --subfamilies

This command will retrieve all CAZymes from GH3, and will retrieve the CAZy family and CAZy subfamily annotations. For example, a CAZyme in CAZy subfamily GH3_1 will be stored in the local database under the CAZy family GH3 and the CAZy subfamily GH3_1.
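The relationship between a subfamily annotation and its parent family can be sketched in a few lines of Python. This is an illustration of the naming convention only (the text before the underscore is the family); the function name split_annotation is hypothetical and the code is not taken from cazy_webscraper.

```python
def split_annotation(annotation: str, keep_subfamily: bool):
    """Return (family, subfamily) for an annotation such as 'GH3_1'.

    subfamily is None when subfamily retrieval is disabled or when the
    annotation has no subfamily part.
    """
    family = annotation.split("_", 1)[0]
    has_subfamily = "_" in annotation
    subfamily = annotation if (keep_subfamily and has_subfamily) else None
    return family, subfamily


default_result = split_annotation("GH3_1", keep_subfamily=False)
subfamily_result = split_annotation("GH3_1", keep_subfamily=True)
print(default_result, subfamily_result)
```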

Combining CAZy class, CAZy family and taxonomy filters

You can use any combination of the CAZy class, CAZy family and taxonomy filters to fully customise the scrape of CAZy.

Below are some examples:

Example 1: To retrieve all CAZymes from all CBM families, GH1, GH2 and PL9, and that are derived from any Aspergillus species:

cazy_webscraper --classes CBM --families GH1,GH2,PL9 --genera Aspergillus

Example 2: To retrieve all CAZymes from GH1 and GH2 that are derived from any bacterial species:

cazy_webscraper --families GH1,GH2 --kingdoms bacteria

Example 3: To retrieve CAZymes from all viral species, and all Aspergillus niger strains, which are catalogued within GH3_1 and GH3_2:

cazy_webscraper --families GH3_1,GH3_2 --subfamilies --species "Aspergillus niger" --kingdoms viruses
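One way to picture how these filters combine is as two questions asked of every CAZyme: is it in one of the requested classes or families, and is its source organism in one of the requested taxonomic groups? A CAZyme is kept only when both answers are yes. The sketch below follows that reading of the tutorial; it is not cazy_webscraper's implementation, and the record layout and function names are assumptions.

```python
def class_of(family: str) -> str:
    """Return the class abbreviation of a family name, e.g. 'CBM20' -> 'CBM'."""
    return family.split("_", 1)[0].rstrip("0123456789")


def keep(record: dict, classes=(), families=(), genera=(), kingdoms=()) -> bool:
    """Apply class/family filters AND taxonomy filters to one record."""
    family_filter_on = bool(classes or families)
    taxonomy_filter_on = bool(genera or kingdoms)
    # Within a filter group, matching any one entry is enough.
    family_ok = (
        not family_filter_on
        or record["family"] in families
        or class_of(record["family"]) in classes
    )
    taxonomy_ok = (
        not taxonomy_filter_on
        or record["genus"] in genera
        or record["kingdom"] in kingdoms
    )
    return family_ok and taxonomy_ok


# Made-up records that mirror Example 1 above.
records = [
    {"family": "CBM20", "genus": "Aspergillus", "kingdom": "Eukaryota"},
    {"family": "GH1", "genus": "Aspergillus", "kingdom": "Eukaryota"},
    {"family": "GH3", "genus": "Aspergillus", "kingdom": "Eukaryota"},
    {"family": "GH1", "genus": "Escherichia", "kingdom": "Bacteria"},
]

kept = [
    f"{r['family']} / {r['genus']}"
    for r in records
    if keep(r, classes={"CBM"}, families={"GH1", "GH2", "PL9"}, genera={"Aspergillus"})
]
print(kept)
```

The GH3 record fails the family question and the Escherichia record fails the taxonomy question, so only the first two records are kept.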

Configuration file

Whenever cazy_webscraper is invoked and adds data to a database, the configuration of cazy_webscraper (that is, the kingdoms, genera, species, strains, CAZy class and CAZy family filters that were applied) and the date and time the scrape was initiated are logged in the database. However, for optimal reproduction of how cazy_webscraper was used in your research, you can create shareable documentation that others can use to reproduce your CAZy dataset. This is achieved by creating a configuration file rather than configuring the behaviour of cazy_webscraper at the command line.

Creating a configuration file

cazy_webscraper uses the YAML file type for its configuration file; if you are new to YAML files please find more detailed information on YAML files [here](https://docs.ansible.com/ansible/latest/reference_appendices/YAMLSyntax.html).

A template and example configuration file for scraping CAZy using cazy_webscraper can be found in the repo, in the configuration_files directory.

The configuration YAML must contain the following tags/headings (identical to how they are presented below):

classes
Glycoside Hydrolases (GHs)
GlycosylTransferases (GTs)
Polysaccharide Lyases (PLs)
Carbohydrate Esterases (CEs)
Auxiliary Activities (AAs)
Carbohydrate-Binding Modules (CBMs)
genera
species
strains
kingdoms

Note

The order of the tags/headings does not matter.

Scraping specific CAZy classes

Under the classes heading list any classes to be scraped. For each CAZy class listed under ‘classes’, CAZymes will be retrieved for every CAZy family within the CAZy class.

Each class must be listed on a separate line, indented by 4 spaces, and the class name encapsulated with single or double quotation marks. For example:

classes:
    - "GH"
    - "PL"

The same CAZy class name synonyms used for the command line are accepted for the configuration file.

Scraping specific CAZy families

Under each of the class names listed in the configuration file, list the names of specific families to be scraped from that class. The respective classes of the specified families do not need to be added to the ‘classes’ list.

Write the full name of the family, not only its number. For example, GH1 is accepted but 1 is not.

Name families using the standard CAZy nomenclature, such as “GT2” and NOT “GlycosylTransferases_2”. Additionally, use the standard CAZy notation for subfamilies (GH3_1).

Warning

If any subfamilies are listed within the configuration file, the retrieval of subfamilies must be enabled at the command line using --subfamilies.

Each family must be listed on a separate line and the name surrounded by double or single quotation marks. For example:

Glycoside Hydrolases (GHs):
    - "GH1"
    - "GH2"
    - "GH3_1"

Example configuration file

Below is an example of the content you may wish to put in a configuration file. Using this file will retrieve all CAZymes in CAZy class AA and CAZy families GH1, GH3 and PL9 that are derived from either a bacterial or a Trichoderma species.

classes:
    - "AA"
Glycoside Hydrolases (GHs):
    - "GH1"
    - "GH3"
GlycosylTransferases (GTs):
Polysaccharide Lyases (PLs):
    - "PL9"
Carbohydrate Esterases (CEs):
Auxiliary Activities (AAs):
Carbohydrate-Binding Modules (CBMs):
genera:
    - "Trichoderma"
species:
strains:
kingdoms:
    - "Bacteria"

Note

Indentations consist of 4 spaces.
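If you prefer to generate a configuration file from a script, the layout above can be produced with plain string formatting. The sketch below is one possible approach using only the Python standard library; the function name render_config and the selections dictionary are assumptions for illustration, and the output should still be compared against the template file shipped with cazy_webscraper.

```python
# Tags in the order they are presented in this tutorial.
TAGS = [
    "classes",
    "Glycoside Hydrolases (GHs)",
    "GlycosylTransferases (GTs)",
    "Polysaccharide Lyases (PLs)",
    "Carbohydrate Esterases (CEs)",
    "Auxiliary Activities (AAs)",
    "Carbohydrate-Binding Modules (CBMs)",
    "genera",
    "species",
    "strains",
    "kingdoms",
]


def render_config(selections: dict) -> str:
    """Render every tag, listing its items on 4-space indented, quoted lines."""
    lines = []
    for tag in TAGS:
        lines.append(f"{tag}:")
        for item in selections.get(tag, []):
            lines.append(f'    - "{item}"')
    return "\n".join(lines) + "\n"


config_text = render_config(
    {
        "classes": ["AA"],
        "Glycoside Hydrolases (GHs)": ["GH1", "GH3"],
        "Polysaccharide Lyases (PLs)": ["PL9"],
        "genera": ["Trichoderma"],
        "kingdoms": ["Bacteria"],
    }
)
print(config_text)
```

Tags with no selections are still written out, with nothing listed beneath them, which matches the example file above.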

You can add ‘comments’ to the configuration file. Comments are sections of text that are not read by cazy_webscraper and allow you to add notes to your configuration file. For example:

# This is a comment, text following a hashtag '#' on the same line is not read by cazy_webscraper
# https://docs.ansible.com/ansible/latest/reference_appendices/YAMLSyntax.html
classes:  # classes from which all proteins will be retrieved
Glycoside Hydrolases (GHs):  # include two spaces between the end of the code and the hashtag
GlycosylTransferases (GTs):
Polysaccharide Lyases (PLs):
    - "PL28"
Carbohydrate Esterases (CEs):
Auxiliary Activities (AAs):
Carbohydrate-Binding Modules (CBMs):
genera:  # list genera to be scraped
    - "Trichoderma"
species:  # list species, this will scrape all strains under the species
strains:  # list specific strains to be scraped
kingdoms:  # Archaea, Bacteria, Eukaryota, Viruses, Unclassified
    - "Bacteria"
ECs:  # only CAZymes with at least one of these EC numbers will be scraped

Example configuration files and template files can be found here.

Using a configuration file

Once you have created a configuration file (we recommend modifying the template one provided with cazy_webscraper), you then need to invoke cazy_webscraper and tell it you are using a configuration file. To do this we add the --config / -c flag to the cazy_webscraper command, followed by the path to the configuration file.

The path we pass to cazy_webscraper is a relative path. This means cazy_webscraper will start in the directory the terminal is currently pointed at, and follow the path from there. For example, if we used the command:

cazy_webscraper -c scraper/scraper_config.yaml

Then the computer will look for a directory called scraper in the directory the terminal is looking at, then look within the scraper directory for a yaml file called scraper_config.yaml.

Note

To check which directory the terminal is pointed at, type pwd into the terminal and hit Enter. This is the ‘print working directory’ command, which prints the path to the directory the terminal is presently looking at.

Warning

Your path must point directly to the YAML file. Don’t forget the ‘.yaml’ file extension!

Using a configuration and the command-line

You can configure cazy_webscraper using a combination of command line arguments and a configuration file.

If a CAZyme matches at least one of the configuration criteria (whether it be from the terminal or the configuration file), one copy of the CAZyme record will be added to the SQL database, and only one copy, no matter how many of the configuration criteria the CAZyme meets.

To use a configuration file and the command line to configure cazy_webscraper, use the --config flag followed by the path to the configuration file, along with any of the additional optional flags you wish to use.
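Conceptually, the two sources of configuration are pooled: anything requested in either place counts, and duplicates collapse to a single entry. The sketch below illustrates that idea with plain dictionaries. It is an assumption-laden illustration (the function name merge_filters and the dictionary layout are hypothetical), not the way cazy_webscraper stores its configuration.

```python
def merge_filters(from_file: dict, from_cli: dict) -> dict:
    """Combine two filter dictionaries, keeping each value once per key."""
    merged = {}
    for key in sorted(set(from_file) | set(from_cli)):
        # 'or []' treats an empty tag (None) the same as an empty list.
        file_values = from_file.get(key) or []
        cli_values = from_cli.get(key) or []
        merged[key] = sorted(set(file_values) | set(cli_values))
    return merged


file_filters = {"genera": ["Trichoderma"], "species": None, "kingdoms": ["Bacteria"]}
cli_filters = {"genera": ["Aspergillus", "Trichoderma"], "families": ["GH1"]}

merged = merge_filters(file_filters, cli_filters)
print(merged)
```

Trichoderma is requested in both places but appears only once in the merged result.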

Note

The order you invoke the optional flags does not matter.

Additional operations to fine tune how `cazy_webscraper` operates

Scraping data from a previously downloaded CAZy txt file

CAZy provides access to data within its database via text files. cazy_webscraper downloads the CAZy text file containing all data within the CAZy database, providing a database dump. This file is then written to the cache directory (by default, called .cazy_webscraper_<date>_<time>).

For consistency in the dataset, you may wish to perform multiple scrapes of CAZyme data from the same CAZy text file. This could be a CAZy text file you have downloaded from CAZy or a text file downloaded by cazy_webscraper.

To direct cazy_webscraper to retrieve CAZyme data from a previously downloaded CAZy text file, use the --cazy_data flag, followed by the path to the text file. For example:

cazy_webscraper --cazy_data cazy_db/cazy_data.txt

Warning

--cazy_data must be pointed directly at the text file, not a zipped file containing the CAZy data text file.

Writing out a log file

If you want a log file of all terminal output produced by cazy_webscraper, add the log flag (--log or -l) anywhere in the cazy_webscraper command, followed by the path to write the log file to. This path is a relative path and must point specifically to a log file. For example:

cazy_webscraper --subfamilies --genera Aspergillus --log log_dir/cazy_webscraper_log.log

Warning

The log file does not already have to exist for cazy_webscraper to write to it; however, all directories included in the path must already exist.

Verbose logging

By default, only error and warning messages are logged. For more detailed logging, which also includes the configuration setup, the number of proteins retrieved and so on, add the verbose logging flag (--verbose or -v) anywhere in the cazy_webscraper command. For example:

cazy_webscraper --subfamilies --genera Aspergillus -v

The verbose flag can be used in combination with the log flag to write all terminal output to a log file.

Changing the connection timeout limit

Sometimes the connection to the CAZy server times out. By default, if a connection to CAZy is attempted and no response is received within 45 seconds, cazy_webscraper interprets this as the connection timing out. cazy_webscraper then waits 10 seconds and retries the connection.

You can change how long the computer waits for a response from the CAZy server before classifying the connection as timed out by adding the --timeout flag to the cazy_webscraper command, followed by the number of seconds to wait.

For example, to set the connection timeout limit to 30 seconds use the command:

cazy_webscraper --timeout 30

The timeout flag can be used in combination with other flags, for example:

cazy_webscraper --subfamilies --genera Aspergillus -v --timeout 30
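The timeout-then-retry pattern described in this section can be sketched as a small loop. This is a generic illustration, not cazy_webscraper's code: the function name fetch_with_retries, the max_tries parameter (the tutorial does not state how many times the connection is retried) and the fake fetch function are all assumptions made for the example.

```python
import time


def fetch_with_retries(fetch, timeout=45, wait=10, max_tries=3, sleep=time.sleep):
    """Call fetch(timeout) until it succeeds or max_tries attempts have failed."""
    for attempt in range(1, max_tries + 1):
        try:
            return fetch(timeout)
        except TimeoutError:
            if attempt == max_tries:
                raise  # give up after the final attempt
            sleep(wait)  # pause before trying again


# A stand-in for a network call: it times out twice, then succeeds.
calls = []


def flaky_fetch(timeout):
    calls.append(timeout)
    if len(calls) < 3:
        raise TimeoutError("no response")
    return "page"


waits = []  # record the pauses instead of really sleeping
result = fetch_with_retries(flaky_fetch, timeout=30, sleep=waits.append)
print(result, calls, waits)
```

Passing sleep as a parameter lets the example run instantly; a real caller would leave the default time.sleep in place.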
