Search for Apo-Holo pairs

Introduction

AHoJ is a structural bioinformatics tool that allows automated search and alignment of APO structures for a given HOLO structure (and vice versa) in the PDB.
It can be used to create customized Apo-Holo datasets.

AHoJ is an application for finding structures that belong to the same protein as a user-specified query structure, and annotating them as apo or holo. It can be used for a single search, with visualization and download of the results, or run in series on multiple queries to generate a dataset of structures. The user starts the search by providing their ligand of interest (or its binding site) and setting the preferred parameters.

How it works

Quick description

AHoJ features multiple search modes, but its main functionality is ligand-centric: the user specifies a particular binding site (by entering a binding residue) or a ligand directly, and AHoJ finds the same binding site in the other structures (by superimposing each of them onto the query, one pair at a time) and interrogates it for ligands.

Specifying the ligand or the binding site

This is done by providing i) the structure, ii) the chain, iii) the name of the ligand or residue (using the PDB 3-character code) and iv) the position of the ligand or residue in the sequence (the PDB residue index is used). Specifying all four arguments is recommended, as it avoids ambiguity when more than one ligand molecule of the same type exists within the same chain. When specifying a residue, the position is obligatory. The minimum number of arguments, however, is one: the structure. In that case, AHoJ will try to detect chains, ligands and their positions automatically. AHoJ works with ligands that are designated as heteroatoms in the PDB, which covers small and medium-sized ligands but not protein subunits.

The main and default search therefore starts with a holo structure in which the user knows the ligand that will be used as a starting point. This user-specified ligand then defines the search and the annotation of the results. Any other ligands in the query structure play no role in characterising results as apo or holo. If the user does not know the ligand or the binding site, AHoJ can automatically detect the available ligands in the query structure if told to do so. If the query structure does not bind any ligands (apo), the user can still use the "reverse search" mode, in which AHoJ looks for structures that belong to the same protein as the query but does not focus on a particular binding site. Instead, it lists any ligand that it detects in the resulting chains.

Query format & examples

Query Format

<pdb_id> <chains> <ligand> <position>  # comment

pdb_id: The 4-character code of a PDB protein structure. This argument is obligatory, and only one PDB ID can be entered per line (e.g. “1a73” or “3fav” or “3FAV”). If it is the only argument (because the user does not know the ligand that binds to the structure, or is using "reverse search"), it triggers automatic detection of ligands in the structure.
chains: A single chain, or multiple chains separated by commas (without whitespace), or “ALL” or “” for all chains (e.g. “A” or “A,C,D” or “ALL” or “”). This argument is obligatory if the user intends to provide any later argument (ligand or position).
ligand: A single ligand, multiple ligands separated by commas (without whitespace), or no ligand can be entered per line (e.g. “HEM” or “hem” or “ATP” or “ZN” or “HEM,ATP,ZN”). This argument is optional. If it is omitted, the user should activate automatic detection of ligands from the available options, unless the query is an apo structure, in which case the reverse mode (search for holo from apo) must be activated. Note: if the position argument is specified, only one ligand can be used per query.
position: An integer (e.g. “260” or “1”). It refers to the PDB index of the previously specified ligand or binding residue. This argument can only be given when exactly one ligand or residue is specified.

All elements except pdb_id are optional.
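
The format above can be read as a whitespace-separated line with an optional trailing comment. The following Python sketch shows one way such a line could be parsed. It is an illustration only: the function name parse_query, the returned dictionary keys, and the normalisation of case are assumptions, not part of AHoJ itself.

```python
def parse_query(line):
    """Split one query line into its fields (illustrative, not AHoJ code)."""
    # Drop the trailing comment, if any
    body = line.split("#", 1)[0].strip()
    if not body:
        raise ValueError("empty query: pdb_id is obligatory")
    fields = body.split()
    if len(fields) > 4:
        raise ValueError("at most four arguments are expected")

    # Assumption: PDB IDs are case-insensitive and stored in lower case
    pdb_id = fields[0].lower()
    if len(pdb_id) != 4:
        raise ValueError("pdb_id must be a 4-character code")

    # Assumption: a missing chains field is treated like "ALL"
    chains = fields[1].split(",") if len(fields) > 1 else ["ALL"]
    ligands = [x.upper() for x in fields[2].split(",")] if len(fields) > 2 else []

    position = None
    if len(fields) > 3:
        # The position is only meaningful for a single ligand or residue
        if len(ligands) != 1:
            raise ValueError("position requires exactly one ligand or residue")
        position = int(fields[3])

    return {"pdb_id": pdb_id, "chains": chains,
            "ligands": ligands, "position": position}
```

For instance, parse_query("1a73 A ZN 201  # comment") yields the PDB ID "1a73", chain "A", ligand "ZN" and position 201, while a line that combines several ligands with a position is rejected.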

Example Query

1a73 A ZN 201  # consider ZN ligand in position 201 in chain A of 1a73

The application will fetch the structure 1a73, get chain A, and look for a Zn2+ (ZN) ligand at position 201 to verify the input arguments. If ZN is found in chain A at position 201 of 1a73 (1a73A), it will retrieve all other known chains that belong to the same protein as 1a73A, align them with 1a73A, and look for ZN (and other ligands) at the superimposed binding site of ZN in 1a73A. Chains with ZN at that site are listed as HOLO; chains where the superimposed site is empty are listed as APO. If a ligand other than ZN is detected at that site, the chain is listed as APO or HOLO depending on the value of the --lig_free_sites parameter: if the user requires APO sites to be free of any ligand, the chain is listed as HOLO; if the user does not mind other ligands in this binding site, it is listed as APO.
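
The classification rule described above can be summarised as a small decision function. The sketch below only restates the logic of this section in Python; classify_chain and its arguments are hypothetical names, and the real implementation may differ.

```python
def classify_chain(site_ligands, query_ligands, lig_free_sites=True):
    """Return "HOLO" or "APO" for one candidate chain (illustrative).

    site_ligands:   names of ligands found in the superimposed binding site
    query_ligands:  names of the user-specified ligand(s)
    lig_free_sites: True mirrors the ON setting (apo sites must be empty)
    """
    found = set(site_ligands)
    if found & set(query_ligands):
        return "HOLO"  # the query ligand itself occupies the site
    if not found:
        return "APO"   # the site is empty
    # Only ligands other than the query ligand occupy the site
    return "HOLO" if lig_free_sites else "APO"
```

With the query ligand ZN, a site containing only MG is reported as HOLO when lig_free_sites is ON and as APO when it is OFF.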

Example of an alternative query that leads to the same result as the previous example:

1a73 A HIS 134  # consider ligands near residue HIS134 in chain A of 1a73 (the detected ligand will be ZN 201 in chain A)

More examples

1a73 A,B ZN   # consider ZN ligands in chains A and B of 1a73
1a73 ALL ZN   # consider ZN ligands in all chains of 1a73
1a73          # find and consider all ligands in all chains of 1a73
1a73 A        # find and consider all ligands in chain A of 1a73
1a73 A ZN,MG  # consider ZN and MG ligands in chain A of 1a73
1aax          # protein tyrosine phosphatase - long search
4est          # porcine pancreatic elastase

Options

X-ray structures only

When set to ON, only X-ray structures are considered as candidates during the search. This overrides the NMR setting [default = OFF].

Exclude NMR structures

When set to ON, NMR structures are not considered as candidates. If a structure contains multiple states, only the first one is considered [default = ON].
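
The interaction between the two options above can be sketched as follows. This is an assumption-laden illustration: the function name method_allowed is hypothetical, and the method strings follow the experimental method names used in PDB entries (for example "X-RAY DIFFRACTION", "SOLUTION NMR", "ELECTRON MICROSCOPY").

```python
def method_allowed(method, xray_only=False, exclude_nmr=True):
    """Decide whether a candidate's experimental method is accepted (illustrative)."""
    method = method.upper()
    if xray_only:
        # X-ray only overrides the NMR setting entirely
        return method == "X-RAY DIFFRACTION"
    if exclude_nmr and "NMR" in method:
        return False
    return True
```

Under the defaults (X-ray only OFF, exclude NMR ON), an electron microscopy structure is accepted and a solution NMR structure is not.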

Ligand-free sites

This pertains to the apo/holo classification of candidate structures. When set to ON, candidate apo structures must have no ligands at all in the superimposed binding site(s), that is, neither the user-specified ligand(s) nor any other ligand. When set to OFF, ligands other than the user-specified one(s) are tolerated in the superimposed binding site(s) of apo structures. If the user wants to find apo structures that do not bind any ligands in the superimposed binding site(s) of the query ligand(s), this value should be set to ON [default = ON].

Consider water as ligand

When set to ON, allows the detection of water molecules (i.e., 'HOH') as ligands in the superposition of the query ligand(s) in the candidate chains. If this setting is enabled and at least one water molecule is detected within the scanning radius, that would warrant a holo classification for the candidate chain. When a water molecule is defined in the user query, this setting is automatically enabled [default = OFF].

Consider non-standard residues as ligands

When set to ON, allows the detection of non-standard or modified residues (e.g., 'TPO', 'SEP') as ligands in the superposition of the query ligand(s) in the candidate chains. If this setting is enabled and at least one modified residue is detected within the scanning radius, that would warrant a holo classification for the candidate chain. When a modified residue is defined in the user query, this setting is automatically enabled [default = OFF]. Note: The current list of non-standard residues includes the following residue names: 'SEP TPO PSU MSE MSO 1MA 2MG 5MC 5MU 7MG H2U M2G OMC OMG YG PYG PYL SEC PHA'.

Consider D-amino acids as ligands

When set to ON, allows the detection of D-residues (e.g., 'DAL', 'DSN') as ligands in the superposition of the query ligand(s) in the candidate chains. If this setting is enabled and at least one D-residue is detected within the scanning radius, that would warrant a holo classification for the candidate chain. When a D-residue is defined in the user query, this setting is automatically enabled [default = OFF]. Note: The current list of D-residues includes the following residue names: 'DAL DAR DSG DAS DCY DGN DGL DHI DIL DLE DLY MED DPN DPR DSN DTH DTR DTY DVA'.
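
The three options above (water, non-standard residues, D-amino acids) share one behaviour: each is switched on automatically when the query itself names a residue of that kind. A minimal Python sketch of that rule is shown below. The residue lists are copied from this section; the function name effective_settings and the dictionary keys are hypothetical.

```python
# Residue names taken from the lists given in this section
NONSTANDARD = set(
    "SEP TPO PSU MSE MSO 1MA 2MG 5MC 5MU 7MG H2U M2G OMC OMG "
    "YG PYG PYL SEC PHA".split()
)
D_RESIDUES = set(
    "DAL DAR DSG DAS DCY DGN DGL DHI DIL DLE DLY MED DPN DPR "
    "DSN DTH DTR DTY DVA".split()
)

def effective_settings(query_ligands, water=False, nonstandard=False,
                       d_residues=False):
    """Apply the auto-enable rule to the three detection options (illustrative)."""
    names = {name.upper() for name in query_ligands}
    return {
        # Each option stays ON if the user set it, or turns ON if the
        # query names a residue of the corresponding kind
        "water": water or "HOH" in names,
        "nonstandard": nonstandard or bool(names & NONSTANDARD),
        "d_residues": d_residues or bool(names & D_RESIDUES),
    }
```

A query for HOH therefore enables water detection even though its default is OFF, while a query for ZN leaves all three options at their user-chosen values.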

Save aligned Apo chains

Save the structure files of the aligned APO chains (mmCIF). Disabling this is only recommended in multiple queries if visualizations are not needed (reduces download size). This setting does not affect the search for apo or holo chains or the final result reports [default = ON].

Save aligned Holo chains

Save the structure files of the aligned HOLO chains (mmCIF). Disabling this is only recommended in multiple queries if visualizations are not needed (reduces download size). This setting does not affect the search for apo or holo chains or the final result reports [default = ON].

Binding residues threshold

A floating-point number that represents a percentage (%). It is applied as a minimum cut-off on the share of binding residues in the query chain that are successfully mapped to the candidate chain. The binding residues are mapped between query and candidate by converting PDB numbering to UniProt numbering. A value of 1 means that at least 1% of the query binding residues must be present in the candidate structure for that structure to be considered as apo or holo [default = 1.0, min/max = 1/100].
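
The threshold above can be illustrated with a short Python sketch. It assumes that the binding residues of the query and the residues of the candidate are already expressed as UniProt positions; the function names are hypothetical and the comparison operator (>=) is an assumption based on the phrase "at least".

```python
def mapped_binding_residue_pct(query_site, candidate_residues):
    """Percentage of query binding residues that are present in the candidate.

    Both arguments are collections of UniProt residue positions.
    """
    query_site = set(query_site)
    if not query_site:
        return 0.0
    mapped = query_site & set(candidate_residues)
    return 100.0 * len(mapped) / len(query_site)

def passes_binding_threshold(pct, threshold=1.0):
    """Candidate is kept when the mapped percentage reaches the threshold."""
    return pct >= threshold
```

If the query site consists of positions 10 to 13 and the candidate chain covers positions 12 to 49, two of the four binding residues are mapped, which gives 50%.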

Sequence overlap threshold

A floating-point number that represents a percentage (%). It is applied as a cutoff on the sequence overlap between the query and the candidate chain, which is calculated from the query's perspective according to the UniProt residue numbering. A value of 100 (%) means that the candidate chain must completely cover the query chain: it can be longer than the query, but not shorter [default = 0, min/max = 0/100]. Note: "100" guarantees complete coverage, but it is the strictest setting. For a more lenient filter, the user can lower the value, or even set it to 0 and rely on the template-modeling score (TM-score), either with its default value (0.5) or with a custom cut-off set through the "Minimum TM-score" parameter.
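
To make "from the query's perspective" concrete, here is a simplified Python sketch. It assumes that each chain maps to a single contiguous UniProt range given as (start, end), both inclusive, which is a simplification; the function name sequence_overlap_pct is hypothetical.

```python
def sequence_overlap_pct(query_range, candidate_range):
    """Percentage of the query's UniProt range covered by the candidate."""
    q_start, q_end = query_range
    c_start, c_end = candidate_range
    # Length of the shared segment, or 0 if the ranges do not intersect
    shared = max(min(q_end, c_end) - max(q_start, c_start) + 1, 0)
    # Normalise by the query length, not by the candidate length
    return 100.0 * shared / (q_end - q_start + 1)
```

A query spanning positions 10 to 109 is fully covered (100%) by a candidate spanning 1 to 200, but only half covered (50%) by a candidate spanning 60 to 300, regardless of how long the candidate is.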

Resolution threshold

A floating-point number, in angstroms, that is applied as a cutoff when assessing candidate structures resolved by scattering methods (X-ray crystallography, electron microscopy, neutron diffraction). It is applied to the highest-resolution value when this is available in the mmCIF structure file, and a candidate passes if its resolution is less than or equal to the threshold. It can take any value [default = 4.0, suggested min/max = 1.5/8, condition: <=].

Minimum TM-score

A floating-point number that is applied as the minimum accepted template-modeling score (TM-score) between the query and the candidate chain. A value of 1 indicates a perfect match; values higher than 0.5 generally indicate the same fold in SCOP/CATH [default = 0.5, min/max = 0/1].
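
The resolution and TM-score thresholds can be pictured as two simple checks on each candidate. The Python sketch below is illustrative: the function name and argument names are hypothetical, the TM-score comparison (>=) is an assumption, and a missing resolution (None) is assumed to skip the resolution check, since the threshold only applies when a value is available.

```python
def passes_quality_filters(resolution, tm_score,
                           max_resolution=4.0, min_tm_score=0.5):
    """Apply the resolution and TM-score cut-offs to one candidate (illustrative).

    resolution: value in angstroms, or None when the file reports none
    tm_score:   TM-score between query and candidate chain (0 to 1)
    """
    if resolution is not None and resolution > max_resolution:
        return False  # condition is resolution <= threshold
    return tm_score >= min_tm_score
```

With the defaults, a 2.0 angstrom structure with a TM-score of 0.9 passes, while a 4.5 angstrom structure or a TM-score of 0.3 leads to rejection.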

Ligand scanning radius

Floating point number that represents angstroms and is applied as a distance threshold (radius) when detecting binding residues or ligands. This scanning radius is applied on the positions of the atoms in a selection. The resulting scanned space is a "carved" surface that has the shape of the query ligand, extended outward by the given radius. If the candidate structure binds ligands outside of this superimposed area, they will be ignored, and the candidate will be characterised as an apo-protein [default = 4.5].
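
The "carved" scanning space can be approximated with a plain distance test: a candidate ligand counts as being in the site if any of its atoms lies within the radius of any atom of the superimposed query ligand. The following Python sketch shows this idea with coordinates as (x, y, z) tuples. The function name ligands_in_site, the input layout, and the inclusive comparison (<=) are assumptions made for illustration.

```python
import math

def ligands_in_site(query_ligand_atoms, candidate_ligands, radius=4.5):
    """Return names of candidate ligands inside the scanned space (illustrative).

    query_ligand_atoms: list of (x, y, z) atom positions of the query ligand
    candidate_ligands:  dict mapping ligand name -> list of (x, y, z) positions,
                        already superimposed onto the query frame
    """
    hits = []
    for name, atoms in candidate_ligands.items():
        # One atom-atom contact within the radius is enough
        if any(math.dist(a, q) <= radius
               for a in atoms for q in query_ligand_atoms):
            hits.append(name)
    return hits
```

Because the test is made per atom of the query ligand rather than from its centre, the scanned space follows the shape of the ligand. A candidate ligand that sits farther away than the radius from every query atom is ignored, and a chain with no hits would be characterised as apo.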