deepfold.data.search package¶

Submodules¶

deepfold.data.search.crfalign module¶

deepfold.data.search.crfalign.parse_crf(crf_string: str, query_id: str, alignment_dir: Path) → List[TemplateHit]¶

deepfold.data.search.crfalign.parse_pir(pir_string: str, index: int = 0) → TemplateHit¶

deepfold.data.search.input_features module¶

deepfold.data.search.input_features.create_mmcif_features(mmcif_dict: dict, author_chain_id: str, zero_center: bool = False) → dict¶

deepfold.data.search.input_features.create_msa_features(a3m_strings: List[str], sequence: str | None = None, use_identifiers: bool = False) → dict¶

deepfold.data.search.input_features.create_pdb_features(protein_object: Protein, description: str, is_distillation: bool = True, confidence_threshold: float = 50.0) → dict¶

deepfold.data.search.input_features.create_protein_features(protein_object: Protein, description: str, is_distillation: bool = False) → dict¶

deepfold.data.search.input_features.create_sequence_features(sequence: str, domain_name: str) → dict¶

deepfold.data.search.input_features.create_template_features(sequence: str, template_hits: Sequence[TemplateHit], template_hit_featurizer: TemplateHitFeaturizer, max_release_date: str, pdb_id: str | None = None, sort_by_sum_probs: bool = True, shuffling_seed: int | None = None) → dict¶

deepfold.data.search.input_features.create_template_features_from_hhr_string(sequence: str, hhr_string: str, template_hit_featurizer: TemplateHitFeaturizer, release_date: str, pdb_id: str | None = None, shuffling_seed: int | None = None) → dict¶

deepfold.data.search.input_features.create_template_features_from_hmmsearch_sto_string(sequence: str, sto_string: str, template_hit_featurizer: TemplateHitFeaturizer, release_date: str, pdb_id: str | None = None, shuffling_seed: int | None = None) → dict¶

deepfold.data.search.mmcif module¶

Parse mmCIF format files.

deepfold.data.search.mmcif.load_mmcif_file(mmcif_filepath: Path) → str¶: Load .cif file into mmcif string.

deepfold.data.search.mmcif.parse_mmcif_string(mmcif_string: str) → dict¶: Parse mmcif string into mmcif dict.

deepfold.data.search.mmcif.zero_center_atom_positions(all_atom_positions: ndarray, all_atom_mask: ndarray) → ndarray¶

deepfold.data.search.msa_identifiers module¶

Utilities for extracting identifiers from MSA sequence descriptions.

deepfold.data.search.msa_identifiers.get_identifiers(description: str) → str¶: Compute extra MSA features from the description.

deepfold.data.search.parsers module¶

Parsing various file formats.

class deepfold.data.search.parsers.HitMetadata(pdb_id: str, chain: str, start: int, end: int, length: int, text: str)¶

Bases: object

chain: str¶

end: int¶

length: int¶

pdb_id: str¶

start: int¶

text: str¶

class deepfold.data.search.parsers.TemplateHit(index: int, name: str, aligned_cols: int, sum_probs: float, query: str, hit_sequence: str, indices_query: List[int], indices_hit: List[int])¶

Bases: object

Class representing a template hit.

aligned_cols: int¶

hit_sequence: str¶

index: int¶

indices_hit: List[int]¶

indices_query: List[int]¶

name: str¶

query: str¶

sum_probs: float¶

deepfold.data.search.parsers.convert_stockholm_to_a3m(stockholm_format: str, max_sequences: int | None = None, remove_first_row_gaps: bool = True) → str¶: Converts MSA in Stockholm format to the A3M format.

deepfold.data.search.parsers.deduplicate_stockholm_msa(stockholm_msa: str) → str¶: Remove duplicate sequences (ignoring insertions with respect to the query).

deepfold.data.search.parsers.parse_a3m(a3m_string: str) → Tuple[Sequence[str], Sequence[Sequence[int]], Sequence[str]]¶

Parses sequences and deletion matrix from a3m format alignment.

Parameters:

a3m_string – The string contents of an a3m file. The first sequence in the file should be the query sequence.

Returns:

A tuple of:
A list of sequences that have been aligned to the query. These might contain duplicates.
The deletion matrix for the alignment as a list of lists. The element at deletion_matrix[i][j] is the number of residues deleted from the aligned sequence i at residue position j.
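The deletion-matrix convention described above can be illustrated with a small stand-alone sketch. It is not deepfold's `parse_a3m`; the name `sketch_parse_a3m` is hypothetical, and the sketch assumes the usual A3M convention that lowercase letters mark insertions relative to the query, which are dropped from the aligned sequence and counted against the next aligned column.

```python
from typing import List, Tuple


def sketch_parse_a3m(a3m_string: str) -> Tuple[List[str], List[List[int]]]:
    """Illustrative stand-in (hypothetical), not deepfold's parse_a3m."""
    # Collect raw records; a line starting with ">" opens a new sequence.
    raw_sequences: List[str] = []
    for line in a3m_string.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            raw_sequences.append("")
        elif raw_sequences:
            raw_sequences[-1] += line

    aligned_sequences: List[str] = []
    deletion_matrix: List[List[int]] = []
    for raw in raw_sequences:
        aligned_chars: List[str] = []
        deletions: List[int] = []
        pending = 0  # lowercase residues seen since the last aligned column
        for char in raw:
            if char.islower():
                # Assumed convention: lowercase = insertion, removed and counted.
                pending += 1
            else:
                aligned_chars.append(char)
                deletions.append(pending)
                pending = 0
        aligned_sequences.append("".join(aligned_chars))
        deletion_matrix.append(deletions)
    return aligned_sequences, deletion_matrix


# Made-up two-sequence alignment; the hit has two inserted residues before "K".
example = ">query\nMKV\n>hit\nMaaK-\n"
sequences, deletions = sketch_parse_a3m(example)
```

In this example, `deletions[1][1]` counts the two lowercase residues removed from the hit immediately before its second aligned column.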

deepfold.data.search.parsers.parse_e_values_from_tblout(tblout: str) → Dict[str, float]¶: Parse the target-to-e-value mapping from a Jackhmmer tblout string.
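As a rough illustration of what such a mapping looks like, the sketch below reads a tblout-style string. It is not deepfold's implementation; `sketch_parse_e_values` is a hypothetical name, the sample rows are made up, and the sketch assumes the HMMER tblout layout in which comment lines start with `#`, the target name is the first whitespace-separated field, and the full-sequence E-value is the fifth.

```python
from typing import Dict


def sketch_parse_e_values(tblout: str) -> Dict[str, float]:
    """Illustrative stand-in (hypothetical), not deepfold's parser."""
    e_values: Dict[str, float] = {}
    for line in tblout.splitlines():
        # Skip blank lines and the "#" header/footer comments.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split()
        target_name = fields[0]
        # Assumed layout: the fifth column holds the full-sequence E-value.
        e_values[target_name] = float(fields[4])
    return e_values


# Made-up sample rows, truncated to the first seven columns.
tblout = (
    "# target name  accession  query name  accession  E-value  score  bias\n"
    "tr|A0A000|A0A000_9ACTN  -  query  -  1.5e-30  110.2  0.3\n"
    "tr|B1B111|B1B111_ECOLI  -  query  -  0.002  15.7  0.0\n"
)
e_values = sketch_parse_e_values(tblout)
```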

deepfold.data.search.parsers.parse_fasta(fasta_string: str) → Tuple[Sequence[str], Sequence[str]]¶

Parses FASTA string and returns list of strings with amino-acid sequences.

Parameters:

fasta_string – The string contents of a FASTA file.

Returns:

A tuple of two lists:
A list of sequences.
A list of sequence descriptions taken from the comment lines, in the same order as the sequences.
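The return shape described above (sequences and their descriptions, in matching order) can be sketched as follows. This is an illustrative stand-in with the hypothetical name `sketch_parse_fasta`, not deepfold's `parse_fasta`; it assumes that a description is the text after `>` and that a sequence may wrap over several lines.

```python
from typing import List, Tuple


def sketch_parse_fasta(fasta_string: str) -> Tuple[List[str], List[str]]:
    """Illustrative stand-in (hypothetical), not deepfold's parse_fasta."""
    sequences: List[str] = []
    descriptions: List[str] = []
    for line in fasta_string.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record: keep the description without the leading ">".
            descriptions.append(line[1:])
            sequences.append("")
        elif sequences:
            # Continuation of the current record's (possibly wrapped) sequence.
            sequences[-1] += line
    return sequences, descriptions


# Made-up input with one wrapped sequence.
fasta = ">first example\nMKVL\nAAG\n>second\nGGS\n"
seqs, descs = sketch_parse_fasta(fasta)
```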

deepfold.data.search.parsers.parse_hhr(hhr_string: str) → Sequence[TemplateHit]¶: Parses the content of an entire HHR file.

deepfold.data.search.parsers.parse_hmmsearch_a3m(query_sequence: str, a3m_string: str, skip_first: bool = True) → Sequence[TemplateHit]¶

Parses an a3m string produced by hmmsearch.

Parameters:

query_sequence – The query sequence.
a3m_string – The a3m string produced by hmmsearch.
skip_first – Whether to skip the first sequence in the a3m string.

Returns:

A sequence of TemplateHit results.

deepfold.data.search.parsers.parse_hmmsearch_sto(query_sequence: str, sto_string: str) → Sequence[TemplateHit]¶: Gets parsed template hits from the raw string output by the tool.

deepfold.data.search.parsers.parse_stockholm(stockholm_string: str) → Tuple[Sequence[str], Sequence[Sequence[int]], Sequence[str]]¶

Parses sequences and deletion matrix from stockholm format alignment.

Parameters:

stockholm_string – The string contents of a stockholm file. The first sequence in the file should be the query sequence.

Returns:

A tuple of:
A list of sequences that have been aligned to the query. These might contain duplicates.
The deletion matrix for the alignment as a list of lists. The element at deletion_matrix[i][j] is the number of residues deleted from the aligned sequence i at residue position j.
The names of the targets matched, including the jackhmmer subsequence suffix.

deepfold.data.search.parsers.remove_empty_columns_from_stockholm_msa(stockholm_msa: str) → str¶: Removes empty columns (dashes-only) from a Stockholm MSA.

deepfold.data.search.parsers.truncate_stockholm_msa(stockholm_msa_path: str, max_sequences: int) → str¶: Reads and truncates a Stockholm file while preventing excessive RAM usage.

deepfold.data.search.templates module¶

Building the template features for the DeepFold model.

exception deepfold.data.search.templates.AlignRatioError¶

Bases: PrefilterError

An error indicating that the hit align ratio to the query was too small.

exception deepfold.data.search.templates.CaDistanceError¶

Bases: Exception

An error indicating that a CA atom distance exceeds a threshold.

exception deepfold.data.search.templates.DateError¶

Bases: PrefilterError

An error indicating that the hit date was after the max allowed date.

exception deepfold.data.search.templates.DuplicateError¶

Bases: PrefilterError

An error indicating that the hit was an exact subsequence of the query.

exception deepfold.data.search.templates.LengthError¶

Bases: PrefilterError

An error indicating that the hit was too short.

exception deepfold.data.search.templates.NoAtomDataInTemplateError¶

Bases: Exception

An error indicating that template mmCIF didn’t contain atom positions.

exception deepfold.data.search.templates.NoChainsError¶

Bases: Exception

An error indicating that template mmCIF didn’t have any chains.

exception deepfold.data.search.templates.PrefilterError¶

Bases: Exception

A base class for template prefilter exceptions.

exception deepfold.data.search.templates.QueryToTemplateAlignError¶

Bases: Exception

An error indicating that the query can’t be aligned to the template.

exception deepfold.data.search.templates.SequenceNotInTemplateError¶

Bases: Exception

An error indicating that template mmCIF didn’t contain the sequence.

exception deepfold.data.search.templates.TemplateAtomMaskAllZerosError¶

Bases: Exception

An error indicating that template mmCIF had all atom positions masked.
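The base classes listed above split these exceptions into two groups: prefilter failures, which derive from PrefilterError, and the rest, which derive directly from Exception. The sketch below shows how a caller might use that split. The classes are local stand-ins that mirror the documented bases so the example runs without deepfold installed, and `describe_failure` is a hypothetical helper, not part of the package.

```python
# Local stand-ins mirroring the documented base classes (assumption: only the
# inheritance shown in the reference above matters for this illustration).
class PrefilterError(Exception):
    """Stand-in for deepfold.data.search.templates.PrefilterError."""


class DateError(PrefilterError):
    """Stand-in: hit date was after the max allowed date."""


class LengthError(PrefilterError):
    """Stand-in: hit was too short."""


class NoChainsError(Exception):
    """Stand-in: template mmCIF did not have any chains."""


def describe_failure(error: Exception) -> str:
    """Hypothetical helper: label an error by the group it belongs to."""
    try:
        raise error
    except PrefilterError as caught:
        # One handler covers every prefilter subclass.
        return f"prefilter: {type(caught).__name__}"
    except Exception as caught:
        # Errors outside the prefilter hierarchy fall through to here.
        return f"featurization: {type(caught).__name__}"


labels = [
    describe_failure(DateError("released too late")),
    describe_failure(LengthError("too short")),
    describe_failure(NoChainsError("no chains")),
]
```

Catching PrefilterError first keeps the handling of rejected hits separate from errors raised while reading the template structure itself.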

class deepfold.data.search.templates.TemplateFeaturesResult(features: dict | None, error: str | None, warning: str | None)¶

Bases: object

error: str | None¶

features: dict | None¶

warning: str | None¶

class deepfold.data.search.templates.TemplateHitFeaturizer(max_template_hits: int, pdb_mmcif_dirpath: Path, pdb_release_dates: Dict[str, datetime] = {}, pdb_obsolete_filepath: Path | None = None, template_pdb_chain_ids: Set[str] | None = None, shuffle_top_k_prefiltered: int | None = None, kalign_executable_path: str = 'kalign', verbose: bool = False)¶

Bases: object

A class for computing template features from template hits.

get_template_features(query_sequence: str, template_hits: List[TemplateHit], max_template_date: datetime, query_pdb_id: str | None = None, sort_by_sum_probs: bool = True, shuffling_seed: int | None = None) → dict¶

deepfold.data.search.templates.build_query_to_hit_index_mapping(original_query_sequence: str, hit_query_sequence: str, hit_sequence: str, indices_hit: Sequence[int], indices_query: Sequence[int]) → Dict[int, int]¶

Gets mapping from indices in original query sequence to indices in the hit.

hit_query_sequence and hit_sequence are two aligned sequences containing gap characters. hit_query_sequence contains only the part of the original_query_sequence that matched the hit. When interpreting the indices from the .hhr, we need to correct for this to recover a mapping from original_query_sequence to the hit_sequence.

Parameters:

original_query_sequence – String describing the original query sequence.
hit_query_sequence – The portion of the original query sequence that is in the .hhr file.
hit_sequence – The portion of the matched hit sequence that is in the .hhr file.
indices_hit – The indices for each amino acid relative to the hit_sequence.
indices_query – The indices for each amino acid relative to the original query sequence.

Returns:

A dictionary with indices in the original_query_sequence as keys and indices in the hit_sequence as values.
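The correction described above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not deepfold's implementation: the name `sketch_query_to_hit_mapping` is hypothetical, `-1` is assumed to mark a gap in either index list, and both index lists are assumed to be rebased so that the first aligned residue of each fragment is 0 before the query offset is added.

```python
from typing import Dict, Sequence


def sketch_query_to_hit_mapping(
    original_query_sequence: str,
    hit_query_sequence: str,
    hit_sequence: str,
    indices_hit: Sequence[int],
    indices_query: Sequence[int],
) -> Dict[int, int]:
    """Illustrative stand-in (hypothetical), not deepfold's function."""
    if not hit_query_sequence:
        return {}
    # Strip gap characters to recover the ungapped fragments.
    query_fragment = hit_query_sequence.replace("-", "")
    hit_fragment = hit_sequence.replace("-", "")
    # Where the matched fragment starts inside the full query.
    offset = original_query_sequence.find(query_fragment)
    # Assumption: -1 marks a gap; rebase the remaining indices to start at 0.
    min_hit = min(i for i in indices_hit if i > -1)
    min_query = min(i for i in indices_query if i > -1)

    mapping: Dict[int, int] = {}
    for query_index, hit_index in zip(indices_query, indices_hit):
        if query_index == -1 or hit_index == -1:
            continue  # a gap on either side contributes no pair
        query_pos = query_index - min_query + offset
        hit_pos = hit_index - min_hit
        if hit_pos >= len(hit_fragment) or query_pos >= len(original_query_sequence):
            continue  # ignore pairs that fall outside either sequence
        mapping[query_pos] = hit_pos
    return mapping


# Made-up alignment: the hit matches query residues 2..5 ("MKVL").
mapping = sketch_query_to_hit_mapping(
    original_query_sequence="AAMKVLGG",
    hit_query_sequence="MK-VL",
    hit_sequence="MKAV-",
    indices_hit=[10, 11, 12, 13, -1],
    indices_query=[2, 3, -1, 4, 5],
)
```

Here the gapped column in the query and the gapped column in the hit are both skipped, so only three residue pairs end up in the mapping.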

deepfold.data.search.templates.create_empty_template_feats(seqlen: int, empty: bool = False) → dict¶

deepfold.data.search.templates.extract_template_features(mmcif_dict: dict, index_mapping: Dict[int, int], query_sequence: str, template_sequence: str, template_pdb_id: str, template_chain_id: str, kalign_executable_path: str, verbose: bool) → Tuple[dict, str | None]¶

Extracts template features from a single HHSearch hit.

Parameters:

mmcif_dict – mmcif dict representing the template (see load_mmcif_dict).
index_mapping – Dictionary mapping indices in the query sequence to indices in the template sequence.
query_sequence – String describing the amino acid sequence for the query protein.
template_sequence – String describing the amino acid sequence for the template protein.
template_pdb_id – PDB code for the template.
template_chain_id – String ID describing which chain of the structure should be used.
kalign_executable_path – The path to a kalign executable used for template realignment.
verbose – Whether to print relevant details.

Returns:

A tuple with:
A dictionary containing the features derived from the template protein structure.
A warning message if the hit was realigned to the actual mmCIF sequence, otherwise None.

Raises:

NoChainsError – If the mmcif_dict doesn’t contain any chains.
SequenceNotInTemplateError – If the given chain id / sequence can’t be found in the mmcif_dict.
QueryToTemplateAlignError – If the actual template in the mmCIF file can’t be aligned to the query.
NoAtomDataInTemplateError – If the mmcif_dict doesn’t contain atom positions.
TemplateAtomMaskAllZerosError – If the mmcif_dict doesn’t have any unmasked residues.

deepfold.data.search.templates.get_atom_positions(mmcif_dict: dict, auth_chain_id: str, max_ca_ca_distance: float, zero_center: bool) → Tuple[ndarray, ndarray]¶: Gets atom positions and mask from a list of Biopython Residues.

deepfold.data.search.templates.load_mmcif_dict(mmcif_dirpath: Path, pdb_id: str) → dict¶: Load mmCIF dict for pdb_id.

deepfold.data.search.templates.load_pdb_obsolete_mapping(pdb_obsolete_filepath: Path) → Dict[str, str]¶: Parses the data file from PDB that lists which PDB ids are obsolete.
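To give a feel for the Dict[str, str] result, the sketch below builds an obsolete-to-replacement mapping from text. It is a hedged illustration rather than deepfold's loader: `sketch_obsolete_mapping` is a hypothetical name, it takes the file contents instead of a path so it can run stand-alone, and it assumes the PDB obsolete listing uses `OBSLTE` records of the form "OBSLTE date old_id new_id", with IDs lowercased and entries without a replacement skipped.

```python
from typing import Dict


def sketch_obsolete_mapping(obsolete_text: str) -> Dict[str, str]:
    """Illustrative stand-in (hypothetical), not deepfold's loader."""
    mapping: Dict[str, str] = {}
    for line in obsolete_text.splitlines():
        fields = line.split()
        # Assumed record layout: OBSLTE <date> <old id> [<new id> ...]
        if len(fields) < 4 or fields[0] != "OBSLTE":
            continue  # header lines and entries with no successor are skipped
        old_id, new_id = fields[2].lower(), fields[3].lower()
        mapping[old_id] = new_id
    return mapping


# Sample text in the assumed layout; the second entry has no replacement.
sample = (
    " LIST OF OBSOLETE COORDINATE ENTRIES AND SUCCESSORS\n"
    "OBSLTE    31-JUL-94 116L     216L\n"
    "OBSLTE    29-JAN-96 125D\n"
)
obsolete = sketch_obsolete_mapping(sample)
```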

Module contents¶