Standard Operating Procedures

Genomic Sequence Annotation Pipeline (GSAP)

This SOP describes PATRIC’s Genomic Sequence Annotation Pipeline (GSAP), an automated procedure for DNA-level annotation of microbial genomes. GSAP runs standard programs for gene prediction, sequence alignment, and motif identification. Glimmer and GeneMark identify protein-coding genes. RNA genes are identified using tRNAscan-SE and BLASTN searches against a database of ribosomal and other functional RNAs, followed by a post-processing step. BLASTX searches against the non-redundant protein database provide evidence for gene identification that is independent of the ab initio methods. Finally, RBSfinder and TICO are used to identify ribosome binding sites, which lie upstream of translational initiation sites. GSAP does not attempt to interpret these data; rather, it populates the database in preparation for processing by the Automated DNA-Level Curation (ADC) system.

Automated DNA-Level Curation (ADC)

This SOP describes PATRIC’s Automated DNA-Level Curation (ADC) system, an automated procedure that combines and reconciles multiple lines of evidence before curator intervention. ADC takes as input a sequence, its legacy annotation (such as that from GenBank or RefSeq), and the annotation produced by GSAP. These data are likely to contain multiple, potentially conflicting, predictions for the same underlying feature. ADC interprets the data by applying rules, selecting the features most likely to be correct and flagging ambiguous regions for manual follow-up. Using this procedure, 70 to 80% of predicted protein-coding genes can be finalized automatically based on agreement between the prediction methods. This leaves only 20 to 30% of the ORFs for manual curation, a significant time saving.
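The kind of agreement-based rule described above can be illustrated with a minimal sketch. It is not PATRIC's actual rule set: the `Prediction` record, the `reconcile` function, and the specific rule (finalize a locus when at least two sources agree on both start and stop, otherwise flag it) are assumptions made for illustration only.

```python
from collections import defaultdict
from typing import NamedTuple


class Prediction(NamedTuple):
    """One gene call from one evidence source (hypothetical record layout)."""
    source: str   # e.g. "Glimmer", "GeneMark", "RefSeq"
    start: int    # coordinate of the start codon
    stop: int     # coordinate of the stop codon
    strand: str   # "+" or "-"


def reconcile(predictions, min_agreeing=2):
    """Split predictions into finalized calls and loci flagged for review.

    Assumed rule, for illustration: predictions that share a strand and a
    stop coordinate describe the same locus. The locus is finalized when at
    least `min_agreeing` distinct sources also agree on the start.
    """
    loci = defaultdict(list)
    for p in predictions:
        loci[(p.strand, p.stop)].append(p)

    finalized, flagged = [], []
    for (strand, stop), group in sorted(loci.items()):
        # Collect the set of sources supporting each candidate start.
        sources_by_start = defaultdict(set)
        for p in group:
            sources_by_start[p.start].add(p.source)
        best_start, best_sources = max(
            sources_by_start.items(), key=lambda kv: len(kv[1])
        )
        if len(best_sources) >= min_agreeing:
            finalized.append((strand, best_start, stop))
        else:
            flagged.append((strand, stop))
    return finalized, flagged


example_predictions = [
    Prediction("Glimmer", 100, 400, "+"),
    Prediction("GeneMark", 100, 400, "+"),
    Prediction("RefSeq", 130, 400, "+"),
    Prediction("Glimmer", 1500, 900, "-"),
    Prediction("GeneMark", 1440, 900, "-"),
]
finalized, flagged = reconcile(example_predictions)
```

In this example the first locus is finalized because two sources agree on its start, while the second is flagged because each source proposes a different start. Grouping by stop coordinate is a design choice of the sketch: gene finders tend to disagree about start sites far more often than about stops.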

Manual DNA-Level Feature Curation (MDC)

This SOP provides guidance on how to interpret the data presented to the curator by the Genomic Sequence Annotation Pipeline (GSAP). Manual genome curation is the process whereby members of PATRIC’s curation team analyze any preexisting genome annotations in the context of data generated by GSAP. Automated evaluation of these data prioritizes curation for those genes where conflicting evidence is present and human judgment is required.

Protein Annotation Pipeline (PAP)

This SOP describes PATRIC’s Protein Annotation Pipeline (PAP), an automated procedure for the structural and functional annotation of protein sequences. PAP executes a number of similarity-based methods, including the InterProScan suite of programs, TIGRfam, and BLASTP searches against the non-redundant protein database. These methods support identification of related sequences at the motif, domain, and whole-protein levels. Databases such as Pfam and TIGRfam, which contain Hidden Markov Models (HMMs) derived from protein families, also contain annotation such as GO terms, which can be used to infer functional roles for proteins that match these models. In addition, specialized programs such as Memsat2 and LipoP can help identify and characterize membrane-bound and lipid-conjugated proteins, respectively. Finally, PAP calculates hydrophobicity profiles, which can yield insights into membrane topology and protein folding. PAP provides raw data for the Automated Protein Curation Pipeline (APCP).
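A hydrophobicity profile of the kind mentioned above can be sketched as a sliding-window average over per-residue hydropathy values. The SOP summary does not say which scale or window PAP uses; the sketch below assumes the widely used Kyte-Doolittle scale and a window of 9 residues, and the function name `hydropathy_profile` is hypothetical.

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
# Assumption: the scale PAP actually uses is not stated in this document.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(sequence, window=9):
    """Return the mean hydropathy of every full window along the sequence.

    Raises KeyError if the sequence contains a non-standard residue code.
    Returns an empty list when the window does not fit in the sequence.
    """
    sequence = sequence.upper()
    if window < 1 or window > len(sequence):
        return []
    scores = [KYTE_DOOLITTLE[aa] for aa in sequence]
    return [
        sum(scores[i:i + window]) / window
        for i in range(len(sequence) - window + 1)
    ]


# Illustrative sequence: a hydrophobic stretch flanked by charged residues.
profile = hydropathy_profile("KKDELLIVAVLLIAGVLLKKRD", window=9)
```

Sustained stretches of high positive values in such a profile are the usual hint of a membrane-spanning segment; longer windows are typically chosen when looking for transmembrane helices.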

Automated Protein Curation Pipeline (APCP)

This SOP describes PATRIC’s Automated Protein Curation Pipeline (APCP). The pipeline applies a set of rules to transfer annotation from curated proteins in public databases to proteins predicted by PATRIC’s curation process. This automation ensures high fidelity while reducing manual effort.

Pathway Database Creation using Pathway Tools (PDC)

This SOP describes the automated and manual procedures for creating an initial version of a Pathway/Genome Database (PGDB) for a PATRIC bacterial genome. The procedure requires the genomic sequence for each replicon in a separate FASTA file and the PATRIC annotations for each replicon in PathoLogic (.pf) file format.
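Because the procedure expects one FASTA file per replicon, a genome delivered as a single multi-record FASTA has to be split first. The following is a minimal sketch of that preparation step, not part of the SOP itself; the function name `split_fasta` and the choice to key each record on the first word of its header are assumptions.

```python
def split_fasta(text):
    """Split multi-record FASTA text into one FASTA string per record.

    Returns a dict mapping each record ID (the first whitespace-delimited
    word after '>') to a single-record FASTA string, ready to be written
    to its own file.
    """
    records = {}
    current = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        if line.startswith(">"):
            parts = line[1:].split()
            if not parts:
                raise ValueError("FASTA header without an identifier")
            current = parts[0]
            if current in records:
                raise ValueError(f"duplicate record ID: {current}")
            records[current] = []
        elif current is None:
            raise ValueError("sequence data found before the first header")
        else:
            records[current].append(line)
    return {
        record_id: f">{record_id}\n" + "\n".join(lines) + "\n"
        for record_id, lines in records.items()
    }


# Hypothetical two-replicon genome: one chromosome and one plasmid.
genome = ">chr1 main chromosome\nACGT\nGG\n>plasmidA\nTTAA\n"
per_replicon = split_fasta(genome)
```

Each value in `per_replicon` can then be written to a file named after its key, giving one sequence file per replicon to pair with the matching .pf annotation file.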

Orthologous Gene Prediction (OGP)

This SOP describes the methodology that PATRIC uses to construct sets of orthologous genes within individual pathosystems. Orthologous genes descend from a single gene in a common ancestor and diverged through speciation, so they are expected to have the same functional roles in their respective organisms. By constructing groups of orthologous genes and providing a common annotation for each member, PATRIC provides consistent annotation across all species in that pathosystem.
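This summary does not say which algorithm PATRIC uses to predict orthologs. As a point of reference only, one common heuristic for pairing orthologs between two genomes is reciprocal best hits: two genes are paired when each is the other's highest-scoring match. The sketch below assumes similarity scores are already available as dictionaries; the function names and gene identifiers are hypothetical.

```python
def best_hits(scores):
    """Map each query gene to its highest-scoring subject gene.

    `scores` maps (query, subject) pairs to a similarity score, for
    example a BLASTP bit score.
    """
    best = {}
    for (query, subject), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Return sorted (gene_a, gene_b) pairs that are each other's best hit."""
    forward = best_hits(a_vs_b)
    reverse = best_hits(b_vs_a)
    return sorted(
        (a, b) for a, b in forward.items() if reverse.get(b) == a
    )


# Hypothetical scores between genome A (a1..a3) and genome B (b1, b2).
a_vs_b = {("a1", "b1"): 500, ("a1", "b2"): 120,
          ("a2", "b2"): 300, ("a3", "b2"): 310}
b_vs_a = {("b1", "a1"): 480, ("b2", "a3"): 305, ("b2", "a2"): 290}
pairs = reciprocal_best_hits(a_vs_b, b_vs_a)
```

Here `a2` is left unpaired because `b2` matches `a3` more strongly. Extending pairwise results to groups spanning many genomes requires an additional clustering step that this sketch does not attempt.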