ContexTF uses the manually curated transcription factor database from model organisms to perform a homology-based search for putative transcription factors in unannotated prokaryotic genomes. It combines the results of three different homology searches: sequence, structure and genomic context. It is widely assumed that proteins with high sequence and, especially, structural similarity have high probability of performing the same function (Krissinel, 2007). For this reason, most bioinformatic tools take advantage of these characteristic to annotate homologous proteins throughout species. In prokaryotic genomes, where most TFs are found in context with the proteins that they regulate, the conservation of the genes near the regulatory protein could also be used to annotate proteins in prokaryotic organisms. ContexTF integrates a classical search of homologous candidates and provides a snapshot of the genomic contexts of the involved proteins to evaluate the conservation and similarity of the flanking genes of the candidates.
ContexTF is a standalone python pipeline that consists of three executable modules (Fig. 1). The first module relies on HMMER for performing a PFAM domain search of the model TFs present in the database into an unannotated genome. In this step, the PFAM architecture of each reference TF is established, removing the overlapping domains with lowest E-value. Next, these domain architectures are searched in the provided organism predicted proteome, ensuring that the TF candidates found share the same domain composition.
The second module checks the structural similarity between the reference TF and its candidates, using TM-align. This module execution is optional, as the Uniprot identifiers need to be mapped from the protein identifiers in the provided genome to download its AlphaFold predicted structure. If indicated, the HMMER results are used to superimpose the reference protein with each one of its candidates and the TM-score (Zhang & Skolnick, 2004), normalised by the reference protein’s length, is assessed. The TM-score threshold can be adjusted by the user; the default is set to consider proteins structurally similar when the TM-score is higher than 0.5, as it generally indicates that both proteins share the same fold in SCOP/CATH (Andreeva et al., 2020). As the structures are predicted, there might be unordered regions that interfere with the protein superimpositions. To avoid the loss of potential candidates with predicted regions with low confidence (pLDDT < 70) (Mariani et al., 2013), unordered regions that represent more than 10% of the protein are removed before superimposing.
The final module of ContexTF yields the genomic context comparisons between the reference TF and the candidates obtained by HMMER or TM-align. All the reference TFs have been mapped to a genome or assembly from which the target protein will be plotted, along with the designated number of upstream and downstream coding sequences. This allows the user to compare a snapshot of the regions and assess the similarity between the reference and candidates. To facilitate the visual comparison, the flanking proteins are coloured by protein families found by an all-against-all search by MMSeqs2 or PSIBLAST, depending on the user’s preferences. The final output of the pipeline is a collection of figures, one for each reference TF and its corresponding candidates.
ContexTF is a standalone pipeline developed in Python 3.10+. To install ContexTF in your local computer execute the following commands:
git clone https://github.com/mariaartlle/ContexTF.git
cd ContexTF
pip3 install .
cd TF_databases/
unzip genomes4gc.zip
To ensure the portability of ContexTF, a Conda environment is provided with all the python packages needed to execute it and their correspondant versions. To install, execute:
conda env create -n ContexTF_env --file ContexTF_env.yml
conda activate ContexTF_env
ContexTF relies on the local installation of three bioinformatic softwares:
- HMMER (version 3.3.2)
- TMAlign (version 20190822)
- BLAST (version 13-45111+ds-2) or MMseqs2 (version 2.12.0+)
The only required arguments to execute the pipeline are a genomic FASTA file and a GFF3 annotation file:
ContexTF -fna example/input/GCA_000008345.1_ASM834v1_genomic.fna -gff3 example/input/GCA_000008345.1_ASM834v1_genomic.gff
With this we will obtain a tabular file summary of the HMMER candidates found in the provided genome and a genomic context figure for each reference TF found.
ContexTF has more arguments that allow the user to refine the output. In this example, we are using the same genomic files but this time, the structural homology module will be executed and only the candidates with a TM-score higher than 0.7 will be considered as candidates. Regarding the genomic context figures, the protein families will be found with MMseqs2 and for each TF 8 genes will be plotted in the 5' site, and 4 in the 3'.
ContexTF -fna example/input/GCA_000008345.1_ASM834v1_genomic.fna -gff3 example/input/GCA_000008345.1_ASM834v1_genomic.gff --tmalign --tmalign_min_threshold 0.7 --GC_method mmseqs --n_flanking5 8 --n_flanking3 4
ContexTF can be executed with your own set of proteins to search in the input genome, not only the transcription factor database. For this, fill the templates provided to build your protein set with the required fields. Regarding the assemblies, you can provide the link for the NCBI ftp site (example) or provide the local paths to your GFF3 and FNA files with the following format: /your/absolute/local/path/to/GFF3|/your/absolute/local/path/to/FNA
Flag the arguments --custom_db_tsv
and --custom_db_fasta
to provide your custom databases. If only flagged the argument --custom_db_tsv
, the protein fasta file with the sequences will be obtained by mapping through the Uniprot API. Ensure that the proteins in the custom database have AlphaFold structures predicted to be able to run the structural homology module.
# Only needs the .tsv file to execute
ContexTF -fna example/input/GCA_000008345.1_ASM834v1_genomic.fna -gff3 example/input/GCA_000008345.1_ASM834v1_genomic.gff --custom_db_tsv custom_database_template.tsv --tmalign --n_flanking 8
# Alternatively
ContexTF -fna example/input/GCA_000008345.1_ASM834v1_genomic.fna -gff3 example/input/GCA_000008345.1_ASM834v1_genomic.gff --custom_db_tsv custom_database_template.tsv --custom_db_fasta custom_database_template.fasta --tmalign --n_flanking 8
Andreeva, A., Kulesha, E., Gough, J., & Murzin, A. G. (2020). The SCOP database in 2020: expanded classification of representative family and superfamily domains of known protein structures. Nucleic Acids Research, 48(D1), D376–D382. https://doi.org/10.1093/nar/gkz1064
Krissinel, E. (2007). On the relationship between sequence and structure similarities in proteomics. Bioinformatics, 23(6), 717–723. https://doi.org/10.1093/bioinformatics/btm006
Mariani, V., Biasini, M., Barbato, A., & Schwede, T. (2013). IDDT: A local superposition-free score for comparing protein structures and models using distance difference tests. Bioinformatics, 29(21), 2722–2728. https://doi.org/10.1093/bioinformatics/btt473
Zhang, Y., & Skolnick, J. (2005). TM-align: A protein structure alignment algorithm based on the TM-score. Nucleic Acids Research, 33(7), 2302–2309. https://doi.org/10.1093/nar/gki524