deepfold.data.search package

Submodules

deepfold.data.search.crfalign module

deepfold.data.search.crfalign.parse_crf(crf_string: str, query_id: str, alignment_dir: Path) List[TemplateHit]
deepfold.data.search.crfalign.parse_pir(pir_string: str, index: int = 0) TemplateHit

deepfold.data.search.input_features module

deepfold.data.search.input_features.create_mmcif_features(mmcif_dict: dict, author_chain_id: str, zero_center: bool = False) dict
deepfold.data.search.input_features.create_msa_features(a3m_strings: List[str], sequence: str | None = None, use_identifiers: bool = False) dict
deepfold.data.search.input_features.create_pdb_features(protein_object: Protein, description: str, is_distillation: bool = True, confidence_threshold: float = 50.0) dict
deepfold.data.search.input_features.create_protein_features(protein_object: Protein, description: str, is_distillation: bool = False) dict
deepfold.data.search.input_features.create_sequence_features(sequence: str, domain_name: str) dict
deepfold.data.search.input_features.create_template_features(sequence: str, template_hits: Sequence[TemplateHit], template_hit_featurizer: TemplateHitFeaturizer, max_release_date: str, pdb_id: str | None = None, sort_by_sum_probs: bool = True, shuffling_seed: int | None = None) dict
deepfold.data.search.input_features.create_template_features_from_hhr_string(sequence: str, hhr_string: str, template_hit_featurizer: TemplateHitFeaturizer, release_date: str, pdb_id: str | None = None, shuffling_seed: int | None = None) dict
deepfold.data.search.input_features.create_template_features_from_hmmsearch_sto_string(sequence: str, sto_string: str, template_hit_featurizer: TemplateHitFeaturizer, release_date: str, pdb_id: str | None = None, shuffling_seed: int | None = None) dict

deepfold.data.search.mmcif module

Parse mmCIF format files.

deepfold.data.search.mmcif.load_mmcif_file(mmcif_filepath: Path) str

Load .cif file into mmcif string.

deepfold.data.search.mmcif.parse_mmcif_string(mmcif_string: str) dict

Parse mmcif string into mmcif dict.

deepfold.data.search.mmcif.zero_center_atom_positions(all_atom_positions: ndarray, all_atom_mask: ndarray) ndarray

deepfold.data.search.msa_identifiers module

Utilities for extracting identifiers from MSA sequence descriptions.

deepfold.data.search.msa_identifiers.get_identifiers(description: str) str

Compute extra MSA features from the description.

deepfold.data.search.parsers module

Parsing various file formats.

class deepfold.data.search.parsers.HitMetadata(pdb_id: str, chain: str, start: int, end: int, length: int, text: str)

Bases: object

chain: str
end: int
length: int
pdb_id: str
start: int
text: str
class deepfold.data.search.parsers.TemplateHit(index: int, name: str, aligned_cols: int, sum_probs: float, query: str, hit_sequence: str, indices_query: List[int], indices_hit: List[int])

Bases: object

Class representing a template hit.

aligned_cols: int
hit_sequence: str
index: int
indices_hit: List[int]
indices_query: List[int]
name: str
query: str
sum_probs: float
deepfold.data.search.parsers.convert_stockholm_to_a3m(stockholm_format: str, max_sequences: int | None = None, remove_first_row_gaps: bool = True) str

Converts MSA in Stockholm format to the A3M format.
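
The conversion can be illustrated with a minimal standalone sketch. It is not the deepfold implementation: the name sto_to_a3m_sketch is hypothetical, only "-" is treated as a gap character, and "#=GS" description lines are ignored. Columns where the first (query) row has a gap are dropped, and residues that other rows carry in those columns are written in lowercase, which is how A3M marks insertions.

```python
from typing import Dict, List, Optional


def sto_to_a3m_sketch(stockholm_format: str,
                      max_sequences: Optional[int] = None,
                      remove_first_row_gaps: bool = True) -> str:
    """Hypothetical stand-in that mirrors the documented signature."""
    # Join the (possibly wrapped) alignment rows per sequence name.
    sequences: Dict[str, str] = {}
    for line in stockholm_format.splitlines():
        if not line.strip() or line.startswith(("#", "//")):
            continue
        name, chunk = line.split(maxsplit=1)
        if name not in sequences:
            if max_sequences is not None and len(sequences) >= max_sequences:
                continue
            sequences[name] = ""
        sequences[name] += chunk.strip()
    if not sequences:
        return ""

    query = next(iter(sequences.values()))
    records: List[str] = []
    for name, seq in sequences.items():
        if remove_first_row_gaps:
            out = []
            for query_res, res in zip(query, seq):
                if query_res != "-":
                    out.append(res)          # aligned column: keep as is
                elif res != "-":
                    out.append(res.lower())  # insertion relative to the query
            seq = "".join(out)
        records.append(f">{name}\n{seq}")
    return "\n".join(records) + "\n"


sto_example = "# STOCKHOLM 1.0\nquery  AC-DE\nhit1   ACGD-\n//\n"
a3m_example = sto_to_a3m_sketch(sto_example)
```

Here a3m_example holds two records: the query with its gap column removed, and hit1 with the extra residue lowercased.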

deepfold.data.search.parsers.deduplicate_stockholm_msa(stockholm_msa: str) str

Removes duplicate sequences (ignoring insertions with respect to the query).
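
As a rough illustration of what "ignoring insertions" means, the sketch below compares sequences only at columns where the first (query) row has a residue. It is an assumption-laden stand-in, not the library code: dedup_stockholm_sketch is a hypothetical name, "-" is the only gap character considered, and annotation lines belonging to dropped sequences are left untouched.

```python
from itertools import compress
from typing import Dict


def dedup_stockholm_sketch(stockholm_msa: str) -> str:
    """Hypothetical stand-in: keep the first row of each distinct sequence."""
    def is_row(line: str) -> bool:
        return bool(line.strip()) and not line.startswith(("#", "//"))

    # Join wrapped alignment rows per sequence name, preserving order.
    sequences: Dict[str, str] = {}
    for line in stockholm_msa.splitlines():
        if is_row(line):
            name, chunk = line.split(maxsplit=1)
            sequences[name] = sequences.get(name, "") + chunk.strip()
    if not sequences:
        return stockholm_msa

    # Only columns where the query has a residue take part in the comparison.
    query = next(iter(sequences.values()))
    query_columns = [res != "-" for res in query]

    seen = set()
    keep = set()
    for name, seq in sequences.items():
        key = "".join(compress(seq, query_columns))
        if key not in seen:
            seen.add(key)
            keep.add(name)

    kept_lines = [
        line for line in stockholm_msa.splitlines()
        if not is_row(line) or line.split(maxsplit=1)[0] in keep
    ]
    return "\n".join(kept_lines) + "\n"


dup_example = (
    "# STOCKHOLM 1.0\n"
    "query AC-DE\n"
    "hit1  ACGDE\n"
    "hit2  AC-DE\n"
    "hit3  AC-DF\n"
    "//\n"
)
deduplicated = dedup_stockholm_sketch(dup_example)
```

In this example hit1 differs from the query only by an insertion and hit2 is identical to it, so both are dropped, while hit3 is kept.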

deepfold.data.search.parsers.parse_a3m(a3m_string: str) Tuple[Sequence[str], Sequence[Sequence[int]], Sequence[str]]

Parses sequences and deletion matrix from a3m format alignment.

Parameters:

a3m_string – The string contents of an a3m file. The first sequence in the file should be the query sequence.

Returns:

A tuple of:

  • A list of sequences that have been aligned to the query. These might contain duplicates.

  • The deletion matrix for the alignment as a list of lists. The element at deletion_matrix[i][j] is the number of residues deleted from the aligned sequence i at residue position j.
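
The deletion matrix is easier to see on a small input. The following is a minimal sketch under the usual A3M convention that lowercase letters mark insertions relative to the query. The function name a3m_deletions_sketch is hypothetical, and only the two documented return values are produced.

```python
from typing import List, Tuple


def a3m_deletions_sketch(a3m_string: str) -> Tuple[List[str], List[List[int]]]:
    """Hypothetical stand-in: aligned sequences plus deletion counts."""
    # Collect the sequence records, joining wrapped lines.
    records: List[str] = []
    for line in a3m_string.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            records.append("")
        elif records:
            records[-1] += line

    aligned: List[str] = []
    deletion_matrix: List[List[int]] = []
    for seq in records:
        row: List[int] = []
        pending = 0
        for ch in seq:
            if ch.islower():
                pending += 1         # residue removed from the aligned sequence
            else:
                row.append(pending)  # aligned column: store the count, reset
                pending = 0
        deletion_matrix.append(row)
        aligned.append("".join(ch for ch in seq if not ch.islower()))
    return aligned, deletion_matrix


a3m_example = ">query\nACDE\n>hit1\nAC-E\n>hit2\nAkkCDE\n"
aligned_seqs, deletions = a3m_deletions_sketch(a3m_example)
```

For hit2, the two lowercase residues before "C" are removed from the aligned sequence and counted at position 1, so deletions[2] is [0, 2, 0, 0].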

deepfold.data.search.parsers.parse_e_values_from_tblout(tblout: str) Dict[str, float]

Parses the target-to-e-value mapping from a Jackhmmer tblout string.
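
A minimal sketch of such a parser is shown below. It relies on the HMMER tblout column order (target name, accession, query name, accession, full-sequence E-value, ...), skips comment lines, and uses a hypothetical function name. The sample rows are made up for illustration, and whether the deepfold function also adds an entry for the query itself is not stated here.

```python
from typing import Dict


def e_values_from_tblout_sketch(tblout: str) -> Dict[str, float]:
    """Hypothetical stand-in: map each target name to its E-value."""
    e_values: Dict[str, float] = {}
    for line in tblout.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and the commented header/footer
        fields = line.split()
        # Column 0 is the target name, column 4 the full-sequence E-value.
        e_values[fields[0]] = float(fields[4])
    return e_values


# Made-up sample rows, truncated to the first seven columns.
tblout_example = (
    "# target name  accession  query name  accession  E-value  score  bias\n"
    "tr|A0A000|A0A000_9ACTN  -  query  -  1.2e-30  105.3  0.1\n"
    "sp|P99999|CYC_HUMAN     -  query  -  4.5e-07   30.2  0.0\n"
)
e_value_map = e_values_from_tblout_sketch(tblout_example)
```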

deepfold.data.search.parsers.parse_fasta(fasta_string: str) Tuple[Sequence[str], Sequence[str]]

Parses FASTA string and returns list of strings with amino-acid sequences.

Parameters:

fasta_string – The string contents of a FASTA file.

Returns:

A tuple of two lists:

  • A list of sequences.

  • A list of sequence descriptions taken from the comment lines, in the same order as the sequences.
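
The two parallel lists can be sketched as follows. This is a simplified stand-in with a hypothetical name, assuming that each record starts with a ">" line and that sequence lines may be wrapped.

```python
from typing import List, Tuple


def read_fasta_sketch(fasta_string: str) -> Tuple[List[str], List[str]]:
    """Hypothetical stand-in: return (sequences, descriptions)."""
    sequences: List[str] = []
    descriptions: List[str] = []
    for line in fasta_string.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            descriptions.append(line[1:])  # drop the leading ">"
            sequences.append("")
        elif sequences:
            sequences[-1] += line          # join wrapped sequence lines
    return sequences, descriptions


fasta_example = ">first record\nMKT\nAYI\n>second\nGGS\n"
fasta_seqs, fasta_descs = read_fasta_sketch(fasta_example)
```

The description at index i belongs to the sequence at index i, so the two lists can be zipped together.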

deepfold.data.search.parsers.parse_hhr(hhr_string: str) Sequence[TemplateHit]

Parses the content of an entire HHR file.

deepfold.data.search.parsers.parse_hmmsearch_a3m(query_sequence: str, a3m_string: str, skip_first: bool = True) Sequence[TemplateHit]

Parses an a3m string produced by hmmsearch.

Parameters:
  • query_sequence – The query sequence.

  • a3m_string – The a3m string produced by hmmsearch.

  • skip_first – Whether to skip the first sequence in the a3m string.

Returns:

A sequence of TemplateHit results.

deepfold.data.search.parsers.parse_hmmsearch_sto(query_sequence: str, sto_string: str) Sequence[TemplateHit]

Gets parsed template hits from the raw string output by the tool.

deepfold.data.search.parsers.parse_stockholm(stockholm_string: str) Tuple[Sequence[str], Sequence[Sequence[int]], Sequence[str]]

Parses sequences and deletion matrix from stockholm format alignment.

Parameters:

stockholm_string – The string contents of a stockholm file. The first sequence in the file should be the query sequence.

Returns:

A tuple of:

  • A list of sequences that have been aligned to the query. These might contain duplicates.

  • The deletion matrix for the alignment as a list of lists. The element at deletion_matrix[i][j] is the number of residues deleted from the aligned sequence i at residue position j.

  • The names of the targets matched, including the jackhmmer subsequence suffix.

deepfold.data.search.parsers.remove_empty_columns_from_stockholm_msa(stockholm_msa: str) str

Removes empty columns (dashes-only) from a Stockholm MSA.

deepfold.data.search.parsers.truncate_stockholm_msa(stockholm_msa_path: str, max_sequences: int) str

Reads and truncates a Stockholm file while preventing excessive RAM usage.

deepfold.data.search.templates module

Building the template features for the DeepFold model.

exception deepfold.data.search.templates.AlignRatioError

Bases: PrefilterError

An error indicating that the hit align ratio to the query was too small.

exception deepfold.data.search.templates.CaDistanceError

Bases: Exception

An error indicating that a CA atom distance exceeds a threshold.

exception deepfold.data.search.templates.DateError

Bases: PrefilterError

An error indicating that the hit date was after the max allowed date.

exception deepfold.data.search.templates.DuplicateError

Bases: PrefilterError

An error indicating that the hit was an exact subsequence of the query.

exception deepfold.data.search.templates.LengthError

Bases: PrefilterError

An error indicating that the hit was too short.

exception deepfold.data.search.templates.NoAtomDataInTemplateError

Bases: Exception

An error indicating that template mmCIF didn’t contain atom positions.

exception deepfold.data.search.templates.NoChainsError

Bases: Exception

An error indicating that template mmCIF didn’t have any chains.

exception deepfold.data.search.templates.PrefilterError

Bases: Exception

A base class for template prefilter exceptions.

exception deepfold.data.search.templates.QueryToTemplateAlignError

Bases: Exception

An error indicating that the query can’t be aligned to the template.

exception deepfold.data.search.templates.SequenceNotInTemplateError

Bases: Exception

An error indicating that template mmCIF didn’t contain the sequence.

exception deepfold.data.search.templates.TemplateAtomMaskAllZerosError

Bases: Exception

An error indicating that template mmCIF had all atom positions masked.
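
Because AlignRatioError, DateError, DuplicateError and LengthError all derive from PrefilterError, a single except clause can handle every prefilter rejection, while the other exceptions derive directly from Exception and need their own handling. The sketch below uses local stand-in classes that only mirror the documented base classes; how TemplateHitFeaturizer itself reacts to these errors is not shown here, and describe_failure is a hypothetical helper.

```python
class PrefilterError(Exception):
    """Stand-in for deepfold.data.search.templates.PrefilterError."""


class DateError(PrefilterError):
    """Stand-in: hit date was after the max allowed date."""


class LengthError(PrefilterError):
    """Stand-in: hit was too short."""


class NoChainsError(Exception):
    """Stand-in: template mmCIF did not have any chains."""


def describe_failure(exc: Exception) -> str:
    """Hypothetical helper showing how the hierarchy groups failures."""
    try:
        raise exc
    except PrefilterError as err:
        # Catches DateError, LengthError and any other prefilter subclass.
        return f"prefilter rejected hit: {type(err).__name__}"
    except NoChainsError as err:
        return f"template file problem: {type(err).__name__}"


outcomes = [
    describe_failure(DateError("too new")),
    describe_failure(LengthError("too short")),
    describe_failure(NoChainsError("empty")),
]
```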

class deepfold.data.search.templates.TemplateFeaturesResult(features: dict | None, error: str | None, warning: str | None)

Bases: object

error: str | None
features: dict | None
warning: str | None
class deepfold.data.search.templates.TemplateHitFeaturizer(max_template_hits: int, pdb_mmcif_dirpath: Path, pdb_release_dates: Dict[str, datetime] = {}, pdb_obsolete_filepath: Path | None = None, template_pdb_chain_ids: Set[str] | None = None, shuffle_top_k_prefiltered: int | None = None, kalign_executable_path: str = 'kalign', verbose: bool = False)

Bases: object

A class for computing template features from template hits.

get_template_features(query_sequence: str, template_hits: List[TemplateHit], max_template_date: datetime, query_pdb_id: str | None = None, sort_by_sum_probs: bool = True, shuffling_seed: int | None = None) dict
deepfold.data.search.templates.build_query_to_hit_index_mapping(original_query_sequence: str, hit_query_sequence: str, hit_sequence: str, indices_hit: Sequence[int], indices_query: Sequence[int]) Dict[int, int]

Gets mapping from indices in original query sequence to indices in the hit.

hit_query_sequence and hit_sequence are two aligned sequences containing gap characters. hit_query_sequence contains only the part of the original_query_sequence that matched the hit. When interpreting the indices from the .hhr, we need to correct for this to recover a mapping from original_query_sequence to the hit_sequence.

Parameters:
  • original_query_sequence – String describing the original query sequence.

  • hit_query_sequence – The portion of the original query sequence that is in the .hhr file.

  • hit_sequence – The portion of the matched hit sequence that is in the .hhr file.

  • indices_hit – The indices for each amino acid relative to the hit_sequence.

  • indices_query – The indices for each amino acid relative to the original query sequence.

Returns:

Dictionary with indices in the original_query_sequence as keys and indices in the hit_sequence as values.
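
The offset correction described above can be sketched as follows. This is one plausible way to build such a mapping, not necessarily what deepfold does. It assumes that -1 marks gap columns in both index lists, that both lists are rebased to start at zero, and that the matched part of the query can be located in the original query with str.find. The function name is hypothetical.

```python
from typing import Dict, Sequence


def query_to_hit_mapping_sketch(original_query_sequence: str,
                                hit_query_sequence: str,
                                hit_sequence: str,
                                indices_hit: Sequence[int],
                                indices_query: Sequence[int]) -> Dict[int, int]:
    """Hypothetical stand-in mirroring the documented signature."""
    if not hit_query_sequence:
        return {}

    # Strip gaps, then find where the matched part sits in the full query.
    matched_query = hit_query_sequence.replace("-", "")
    ungapped_hit = hit_sequence.replace("-", "")
    offset = original_query_sequence.find(matched_query)

    # Rebase both index lists so the first aligned residue has index 0.
    min_hit = min(i for i in indices_hit if i > -1)
    min_query = min(i for i in indices_query if i > -1)

    mapping: Dict[int, int] = {}
    for q, h in zip(indices_query, indices_hit):
        if q == -1 or h == -1:
            continue  # gap on one side: no residue pair to record
        q_rel, h_rel = q - min_query, h - min_hit
        if (h_rel >= len(ungapped_hit)
                or q_rel + offset >= len(original_query_sequence)):
            continue  # index falls outside the known sequences
        mapping[q_rel + offset] = h_rel
    return mapping


index_mapping_example = query_to_hit_mapping_sketch(
    original_query_sequence="MKACDEFG",
    hit_query_sequence="AC-DE",
    hit_sequence="ACGD-",
    indices_hit=[10, 11, 12, 13, -1],
    indices_query=[2, 3, -1, 4, 5],
)
```

In the example the alignment covers "ACDE", which starts at position 2 of the original query, so query positions 2, 3 and 4 map to hit positions 0, 1 and 3.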

deepfold.data.search.templates.create_empty_template_feats(seqlen: int, empty: bool = False) dict
deepfold.data.search.templates.extract_template_features(mmcif_dict: dict, index_mapping: Dict[int, int], query_sequence: str, template_sequence: str, template_pdb_id: str, template_chain_id: str, kalign_executable_path: str, verbose: bool) Tuple[dict, str | None]

Extracts template features from a single HHSearch hit.

Parameters:
  • mmcif_dict – mmcif dict representing the template (see load_mmcif_dict).

  • index_mapping – Dictionary mapping indices in the query sequence to indices in the template sequence.

  • query_sequence – String describing the amino acid sequence for the query protein.

  • template_sequence – String describing the amino acid sequence for the template protein.

  • template_pdb_id – PDB code for the template.

  • template_chain_id – String ID describing which chain of the structure should be used.

  • kalign_executable_path – The path to a kalign executable used for template realignment.

  • verbose – Whether to print relevant details.

Returns:

A tuple with:

  • A dictionary containing the features derived from the template protein structure.

  • A warning message if the hit was realigned to the actual mmCIF sequence, otherwise None.

Raises:
deepfold.data.search.templates.get_atom_positions(mmcif_dict: dict, auth_chain_id: str, max_ca_ca_distance: float, zero_center: bool) Tuple[ndarray, ndarray]

Gets atom positions and mask from a list of Biopython Residues.

deepfold.data.search.templates.load_mmcif_dict(mmcif_dirpath: Path, pdb_id: str) dict

Load mmCIF dict for pdb_id.

deepfold.data.search.templates.load_pdb_obsolete_mapping(pdb_obsolete_filepath: Path) Dict[str, str]

Parses the data file from PDB that lists which PDB ids are obsolete.

Module contents