deepfold.data.search package¶
Submodules¶
deepfold.data.search.crfalign module¶
- deepfold.data.search.crfalign.parse_crf(crf_string: str, query_id: str, alignment_dir: Path) List[TemplateHit] ¶
- deepfold.data.search.crfalign.parse_pir(pir_string: str, index: int = 0) TemplateHit ¶
deepfold.data.search.input_features module¶
- deepfold.data.search.input_features.create_mmcif_features(mmcif_dict: dict, author_chain_id: str, zero_center: bool = False) dict ¶
- deepfold.data.search.input_features.create_msa_features(a3m_strings: List[str], sequence: str | None = None, use_identifiers: bool = False) dict ¶
- deepfold.data.search.input_features.create_pdb_features(protein_object: Protein, description: str, is_distillation: bool = True, confidence_threshold: float = 50.0) dict ¶
- deepfold.data.search.input_features.create_protein_features(protein_object: Protein, description: str, is_distillation: bool = False) dict ¶
- deepfold.data.search.input_features.create_sequence_features(sequence: str, domain_name: str) dict ¶
- deepfold.data.search.input_features.create_template_features(sequence: str, template_hits: Sequence[TemplateHit], template_hit_featurizer: TemplateHitFeaturizer, max_release_date: str, pdb_id: str | None = None, sort_by_sum_probs: bool = True, shuffling_seed: int | None = None) dict ¶
- deepfold.data.search.input_features.create_template_features_from_hhr_string(sequence: str, hhr_string: str, template_hit_featurizer: TemplateHitFeaturizer, release_date: str, pdb_id: str | None = None, shuffling_seed: int | None = None) dict ¶
- deepfold.data.search.input_features.create_template_features_from_hmmsearch_sto_string(sequence: str, sto_string: str, template_hit_featurizer: TemplateHitFeaturizer, release_date: str, pdb_id: str | None = None, shuffling_seed: int | None = None) dict ¶
deepfold.data.search.mmcif module¶
Parse mmCIF format files.
- deepfold.data.search.mmcif.load_mmcif_file(mmcif_filepath: Path) str ¶
Load .cif file into mmcif string.
- deepfold.data.search.mmcif.parse_mmcif_string(mmcif_string: str) dict ¶
Parse mmcif string into mmcif dict.
- deepfold.data.search.mmcif.zero_center_atom_positions(all_atom_positions: ndarray, all_atom_mask: ndarray) ndarray ¶
deepfold.data.search.msa_identifiers module¶
Utilities for extracting identifiers from MSA sequence descriptions.
- deepfold.data.search.msa_identifiers.get_identifiers(description: str) str ¶
Compute extra MSA features from the description.
deepfold.data.search.parsers module¶
Parsing various file formats.
- class deepfold.data.search.parsers.HitMetadata(pdb_id: str, chain: str, start: int, end: int, length: int, text: str)¶
Bases:
object
- chain: str¶
- end: int¶
- length: int¶
- pdb_id: str¶
- start: int¶
- text: str¶
- class deepfold.data.search.parsers.TemplateHit(index: int, name: str, aligned_cols: int, sum_probs: float, query: str, hit_sequence: str, indices_query: List[int], indices_hit: List[int])¶
Bases:
object
Class representing a template hit.
- aligned_cols: int¶
- hit_sequence: str¶
- index: int¶
- indices_hit: List[int]¶
- indices_query: List[int]¶
- name: str¶
- query: str¶
- sum_probs: float¶
- deepfold.data.search.parsers.convert_stockholm_to_a3m(stockholm_format: str, max_sequences: int | None = None, remove_first_row_gaps: bool = True) str ¶
Converts MSA in Stockholm format to the A3M format.
- deepfold.data.search.parsers.deduplicate_stockholm_msa(stockholm_msa: str) str ¶
Remove duplicate sequences (ignoring insertions wrt query).
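One way to read "ignoring insertions wrt query" is that two rows count as duplicates when they agree on every column where the query has a residue. The sketch below illustrates that idea on an in-memory mapping of names to aligned rows. It is not the DeepFold implementation; the function name `sketch_deduplicate_rows` and the dict-based input are assumptions made for illustration.

```python
import itertools
from typing import Dict, List


def sketch_deduplicate_rows(rows: Dict[str, str]) -> List[str]:
    """Return the names of rows that stay after deduplication.

    Hypothetical helper: `rows` maps sequence names to aligned rows,
    and the first entry is assumed to be the query.
    """
    query_row = next(iter(rows.values()))
    # Only columns where the query has a residue take part in the comparison.
    mask = [residue != "-" for residue in query_row]

    seen = set()
    kept = []
    for name, row in rows.items():
        masked = "".join(itertools.compress(row, mask))
        if masked not in seen:
            seen.add(masked)
            kept.append(name)
    return kept


rows = {
    "query": "AC-DE",
    "hit1": "ACGDE",  # differs from the query only by an insertion
    "hit2": "AC-DE",  # identical to the query
    "hit3": "AC-DF",  # differs in a query column
}
kept_names = sketch_deduplicate_rows(rows)
```

In this example `hit1` and `hit2` collapse onto the query, because their only differences fall in the column where the query has a gap.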
- deepfold.data.search.parsers.parse_a3m(a3m_string: str) Tuple[Sequence[str], Sequence[Sequence[int]], Sequence[str]] ¶
Parses sequences and deletion matrix from a3m format alignment.
- Parameters:
a3m_string – The string contents of an a3m file. The first sequence in the file should be the query sequence.
- Returns:
A tuple of:
- A list of sequences that have been aligned to the query. These might contain duplicates.
- The deletion matrix for the alignment as a list of lists. The element at deletion_matrix[i][j] is the number of residues deleted from the aligned sequence i at residue position j.
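The deletion matrix is easier to follow with a small example. In a3m, lowercase letters mark insertions relative to the query; they are removed from the aligned sequence and counted against the next aligned position. The sketch below assumes that convention. The name `sketch_parse_a3m` is hypothetical, and returning the descriptions as the third tuple element is a guess based on the signature, not something the documentation states.

```python
import string
from typing import List, Tuple


def sketch_parse_a3m(a3m_string: str) -> Tuple[List[str], List[List[int]], List[str]]:
    """Illustrative a3m parser; not the DeepFold implementation."""
    sequences: List[str] = []
    descriptions: List[str] = []
    for line in a3m_string.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            descriptions.append(line[1:])
            sequences.append("")
        elif sequences:
            sequences[-1] += line

    deletion_matrix: List[List[int]] = []
    for sequence in sequences:
        row = []
        pending = 0  # lowercase residues seen since the last aligned column
        for char in sequence:
            if char.islower():
                pending += 1
            else:
                row.append(pending)
                pending = 0
        deletion_matrix.append(row)

    # Drop the lowercase insertions so every row has the query's length.
    drop_lowercase = str.maketrans("", "", string.ascii_lowercase)
    aligned = [sequence.translate(drop_lowercase) for sequence in sequences]
    return aligned, deletion_matrix, descriptions


aligned, deletions, names = sketch_parse_a3m(">query\nACDE\n>hit1\nA-cdDE\n")
```

Here `hit1` carries two inserted residues (`cd`) before its third aligned column, so the third entry of its deletion row is 2.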
- deepfold.data.search.parsers.parse_e_values_from_tblout(tblout: str) Dict[str, float] ¶
Parses the target-to-e-value mapping from a Jackhmmer tblout string.
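As a rough illustration of what such a mapping looks like: HMMER's tblout format is whitespace-delimited, with comment lines starting with `#`, the target name in the first column and the full-sequence E-value in the fifth. The sketch below relies on that layout. The function name `sketch_parse_e_values` is hypothetical, and the real function may handle details (such as an entry for the query itself) differently.

```python
from typing import Dict


def sketch_parse_e_values(tblout: str) -> Dict[str, float]:
    """Illustrative tblout reader; not the DeepFold implementation."""
    e_values: Dict[str, float] = {}
    for line in tblout.splitlines():
        # Skip blank lines and the '#'-prefixed header and footer lines.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split()
        target_name = fields[0]
        full_sequence_e_value = fields[4]
        e_values[target_name] = float(full_sequence_e_value)
    return e_values


example_tblout = (
    "# target name  accession  query name  accession  E-value  score  bias\n"
    "sp|P1|X        -          query       -          1.5e-30  100.0  0.1\n"
    "sp|P2|Y        -          query       -          0.002    20.5   0.0\n"
    "#\n"
)
e_values = sketch_parse_e_values(example_tblout)
```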
- deepfold.data.search.parsers.parse_fasta(fasta_string: str) Tuple[Sequence[str], Sequence[str]] ¶
Parses FASTA string and returns list of strings with amino-acid sequences.
- Parameters:
fasta_string – The string contents of a FASTA file.
- Returns:
A tuple of two lists:
- A list of sequences.
- A list of sequence descriptions taken from the comment lines, in the same order as the sequences.
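The two parallel lists can be illustrated with a minimal FASTA reader. This is a sketch of the documented behaviour, assuming descriptions are the `>` lines with the marker stripped; `sketch_parse_fasta` is a hypothetical name, not the library function.

```python
from typing import List, Tuple


def sketch_parse_fasta(fasta_string: str) -> Tuple[List[str], List[str]]:
    """Illustrative FASTA reader; not the DeepFold implementation."""
    sequences: List[str] = []
    descriptions: List[str] = []
    for line in fasta_string.splitlines():
        line = line.strip()
        if line.startswith(">"):
            # A new record starts: remember its description, open a new sequence.
            descriptions.append(line[1:])
            sequences.append("")
        elif line and sequences:
            # Sequence data may span several lines; join them.
            sequences[-1] += line
    return sequences, descriptions


fasta = ">sp|P1 first protein\nACD\nEFG\n\n>seq2\nKLM\n"
seqs, descs = sketch_parse_fasta(fasta)
```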
- deepfold.data.search.parsers.parse_hhr(hhr_string: str) Sequence[TemplateHit] ¶
Parses the content of an entire HHR file.
- deepfold.data.search.parsers.parse_hmmsearch_a3m(query_sequence: str, a3m_string: str, skip_first: bool = True) Sequence[TemplateHit] ¶
Parses an a3m string produced by hmmsearch.
- Parameters:
query_sequence – The query sequence.
a3m_string – The a3m string produced by hmmsearch.
skip_first – Whether to skip the first sequence in the a3m string.
- Returns:
A sequence of TemplateHit results.
- deepfold.data.search.parsers.parse_hmmsearch_sto(query_sequence: str, sto_string: str) Sequence[TemplateHit] ¶
Gets parsed template hits from the raw string output by the tool.
- deepfold.data.search.parsers.parse_stockholm(stockholm_string: str) Tuple[Sequence[str], Sequence[Sequence[int]], Sequence[str]] ¶
Parses sequences and deletion matrix from stockholm format alignment.
- Parameters:
stockholm_string – The string contents of a stockholm file. The first sequence in the file should be the query sequence.
- Returns:
A tuple of:
- A list of sequences that have been aligned to the query. These might contain duplicates.
- The deletion matrix for the alignment as a list of lists. The element at deletion_matrix[i][j] is the number of residues deleted from the aligned sequence i at residue position j.
- The names of the targets matched, including the jackhmmer subsequence suffix.
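The three return values can be sketched for a tiny alignment. In Stockholm format the query row may contain gaps; this sketch assumes that columns where the query has a gap are dropped from every row, and that residues found in those columns are counted as deletions before the next kept column. `sketch_parse_stockholm` is a hypothetical name and the code is an illustration, not the DeepFold implementation.

```python
from typing import Dict, List, Tuple


def sketch_parse_stockholm(sto: str) -> Tuple[List[str], List[List[int]], List[str]]:
    name_to_row: Dict[str, str] = {}
    for line in sto.splitlines():
        line = line.strip()
        if not line or line.startswith(("#", "//")):
            continue
        name, chunk = line.split()
        # Rows may be split over several blocks; concatenate them per name.
        name_to_row[name] = name_to_row.get(name, "") + chunk

    rows = list(name_to_row.values())
    query = rows[0] if rows else ""
    keep_columns = [i for i, res in enumerate(query) if res != "-"]

    msa: List[str] = []
    deletion_matrix: List[List[int]] = []
    for row in rows:
        msa.append("".join(row[i] for i in keep_columns))
        deletions, pending = [], 0
        for seq_res, query_res in zip(row, query):
            if query_res == "-":
                # Residue in a column the query lacks: counted, then dropped.
                if seq_res != "-":
                    pending += 1
            else:
                deletions.append(pending)
                pending = 0
        deletion_matrix.append(deletions)
    return msa, deletion_matrix, list(name_to_row.keys())


sto_text = "# STOCKHOLM 1.0\nquery   AC-DE\nhit/1-5 ACGD-\n//\n"
msa, deletion_matrix, target_names = sketch_parse_stockholm(sto_text)
```

The `G` in `hit/1-5` sits in a column where the query has a gap, so it is removed from the aligned row and recorded as one deletion at the following position.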
- deepfold.data.search.parsers.remove_empty_columns_from_stockholm_msa(stockholm_msa: str) str ¶
Removes empty columns (dashes-only) from a Stockholm MSA.
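The column filter itself is simple to illustrate on rows that are already in memory. The sketch below works on a plain list of aligned rows rather than on a full Stockholm string, which is a simplification; `sketch_remove_empty_columns` is a hypothetical name.

```python
import itertools
from typing import List


def sketch_remove_empty_columns(rows: List[str]) -> List[str]:
    """Drop every column that holds a dash in all rows (illustrative only)."""
    if not rows:
        return []
    width = len(rows[0])
    # A column is kept if at least one row has something other than '-'.
    keep = [any(row[i] != "-" for row in rows) for i in range(width)]
    return ["".join(itertools.compress(row, keep)) for row in rows]


trimmed = sketch_remove_empty_columns(["A--C", "A-GC", "--GC"])
```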
- deepfold.data.search.parsers.truncate_stockholm_msa(stockholm_msa_path: str, max_sequences: int) str ¶
Reads and truncates a Stockholm file while preventing excessive RAM usage.
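One plausible way to keep memory bounded is to stream the file twice: a first pass collects the first max_sequences names, and a second pass keeps only the lines that belong to them. The sketch below follows that idea. It is an assumption about the approach, not the DeepFold implementation, and `sketch_truncate_stockholm` and `_keep_line` are hypothetical names.

```python
from typing import Set


def _keep_line(line: str, names: Set[str]) -> bool:
    stripped = line.strip()
    if not stripped or stripped == "//":
        return True
    if line.startswith("# STOCKHOLM") or line.startswith("#=GC RF"):
        return True
    if line.startswith("#=GS"):
        # Per-sequence annotation: keep it only for sequences that survive.
        parts = line.split()
        return len(parts) > 1 and parts[1] in names
    if line.startswith("#"):
        return False
    return line.split(maxsplit=1)[0] in names


def sketch_truncate_stockholm(path: str, max_sequences: int) -> str:
    names: Set[str] = set()
    kept_lines = []
    with open(path) as handle:
        # Pass 1: read only until enough distinct sequence names were seen.
        for line in handle:
            if line.strip() and not line.startswith(("#", "//")):
                names.add(line.split(maxsplit=1)[0])
                if len(names) >= max_sequences:
                    break
        # Pass 2: stream again, holding on to the surviving lines only.
        handle.seek(0)
        for line in handle:
            if _keep_line(line, names):
                kept_lines.append(line)
    return "".join(kept_lines)
```

Only the retained lines are ever held in memory, so peak usage scales with max_sequences rather than with the size of the file.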
deepfold.data.search.templates module¶
Building the template features for the DeepFold model.
- exception deepfold.data.search.templates.AlignRatioError¶
Bases:
PrefilterError
An error indicating that the hit align ratio to the query was too small.
- exception deepfold.data.search.templates.CaDistanceError¶
Bases:
Exception
An error indicating that a CA atom distance exceeds a threshold.
- exception deepfold.data.search.templates.DateError¶
Bases:
PrefilterError
An error indicating that the hit date was after the max allowed date.
- exception deepfold.data.search.templates.DuplicateError¶
Bases:
PrefilterError
An error indicating that the hit was an exact subsequence of the query.
- exception deepfold.data.search.templates.LengthError¶
Bases:
PrefilterError
An error indicating that the hit was too short.
- exception deepfold.data.search.templates.NoAtomDataInTemplateError¶
Bases:
Exception
An error indicating that template mmCIF didn’t contain atom positions.
- exception deepfold.data.search.templates.NoChainsError¶
Bases:
Exception
An error indicating that template mmCIF didn’t have any chains.
- exception deepfold.data.search.templates.PrefilterError¶
Bases:
Exception
A base class for template prefilter exceptions.
- exception deepfold.data.search.templates.QueryToTemplateAlignError¶
Bases:
Exception
An error indicating that the query can’t be aligned to the template.
- exception deepfold.data.search.templates.SequenceNotInTemplateError¶
Bases:
Exception
An error indicating that template mmCIF didn’t contain the sequence.
- exception deepfold.data.search.templates.TemplateAtomMaskAllZerosError¶
Bases:
Exception
An error indicating that template mmCIF had all atom positions masked.
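Because AlignRatioError, DateError, DuplicateError and LengthError all derive from PrefilterError, a caller can skip a filtered hit with a single except clause while letting unrelated errors propagate. The sketch below shows the pattern with local stand-in classes so that it runs without DeepFold installed. The check `hypothetical_prefilter` and its 10-residue threshold are invented for illustration.

```python
from typing import List, Tuple


# Local stand-ins mirroring the documented hierarchy.
class PrefilterError(Exception):
    """Base class for template prefilter exceptions."""


class LengthError(PrefilterError):
    """The hit was too short."""


class DateError(PrefilterError):
    """The hit date was after the max allowed date."""


def hypothetical_prefilter(hit_length: int, min_length: int = 10) -> None:
    # Invented rule, used only to trigger one of the exceptions.
    if hit_length < min_length:
        raise LengthError(f"hit of length {hit_length} is too short")


def keep_usable_hits(hit_lengths: List[int]) -> Tuple[List[int], List[str]]:
    kept, skipped = [], []
    for length in hit_lengths:
        try:
            hypothetical_prefilter(length)
        except PrefilterError as error:
            # Any prefilter failure means "skip this hit", not "abort".
            skipped.append(str(error))
        else:
            kept.append(length)
    return kept, skipped


kept, skipped = keep_usable_hits([5, 20, 9, 10])
```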
- class deepfold.data.search.templates.TemplateFeaturesResult(features: dict | None, error: str | None, warning: str | None)¶
Bases:
object
- error: str | None¶
- features: dict | None¶
- warning: str | None¶
- class deepfold.data.search.templates.TemplateHitFeaturizer(max_template_hits: int, pdb_mmcif_dirpath: Path, pdb_release_dates: Dict[str, datetime] = {}, pdb_obsolete_filepath: Path | None = None, template_pdb_chain_ids: Set[str] | None = None, shuffle_top_k_prefiltered: int | None = None, kalign_executable_path: str = 'kalign', verbose: bool = False)¶
Bases:
object
A class for computing template features from template hits.
- get_template_features(query_sequence: str, template_hits: List[TemplateHit], max_template_date: datetime, query_pdb_id: str | None = None, sort_by_sum_probs: bool = True, shuffling_seed: int | None = None) dict ¶
- deepfold.data.search.templates.build_query_to_hit_index_mapping(original_query_sequence: str, hit_query_sequence: str, hit_sequence: str, indices_hit: Sequence[int], indices_query: Sequence[int]) Dict[int, int] ¶
Gets mapping from indices in original query sequence to indices in the hit.
hit_query_sequence and hit_sequence are two aligned sequences containing gap characters. hit_query_sequence contains only the part of the original_query_sequence that matched the hit. When interpreting the indices from the .hhr, we need to correct for this to recover a mapping from original_query_sequence to the hit_sequence.
- Parameters:
original_query_sequence – String describing the original query sequence.
hit_query_sequence – The portion of the original query sequence that is in the .hhr file.
hit_sequence – The portion of the matched hit sequence that is in the .hhr file.
indices_hit – The indices for each amino acid relative to the hit_sequence.
indices_query – The indices for each amino acid relative to the original query sequence.
- Returns:
index_mapping, a dictionary with indices in the original_query_sequence as keys and indices in the hit_sequence as values.
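The offset correction described above can be sketched as follows: find where the ungapped hit_query_sequence sits inside the original query, rebase both index lists to start at zero, and shift the query indices by that offset. This is an illustrative reading of the description with -1 assumed to mark gap positions; `sketch_query_to_hit_mapping` is a hypothetical name and the real function may differ in details.

```python
from typing import Dict, Sequence


def sketch_query_to_hit_mapping(
    original_query_sequence: str,
    hit_query_sequence: str,
    hit_sequence: str,
    indices_hit: Sequence[int],
    indices_query: Sequence[int],
) -> Dict[int, int]:
    if not hit_query_sequence:
        return {}
    # Position of the matched query fragment inside the full query.
    fragment = hit_query_sequence.replace("-", "")
    offset = original_query_sequence.find(fragment)
    ungapped_hit = hit_sequence.replace("-", "")

    # Rebase both index lists so the first aligned residue has index 0.
    min_hit = min(i for i in indices_hit if i > -1)
    min_query = min(i for i in indices_query if i > -1)
    fixed_hit = [i - min_hit if i > -1 else -1 for i in indices_hit]
    fixed_query = [i - min_query if i > -1 else -1 for i in indices_query]

    mapping: Dict[int, int] = {}
    for q_index, h_index in zip(fixed_query, fixed_hit):
        if q_index == -1 or h_index == -1:
            continue  # gap on either side: nothing to map
        if h_index >= len(ungapped_hit):
            continue
        if q_index + offset >= len(original_query_sequence):
            continue
        mapping[q_index + offset] = h_index
    return mapping


mapping = sketch_query_to_hit_mapping(
    original_query_sequence="XXACDEXX",
    hit_query_sequence="AC-DE",
    hit_sequence="ACGD-",
    indices_hit=[10, 11, 12, 13, -1],
    indices_query=[2, 3, -1, 4, 5],
)
```

In the example the matched fragment `ACDE` starts at position 2 of the original query, so the first aligned pair maps query index 2 to hit index 0.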
- deepfold.data.search.templates.create_empty_template_feats(seqlen: int, empty: bool = False) dict ¶
- deepfold.data.search.templates.extract_template_features(mmcif_dict: dict, index_mapping: Dict[int, int], query_sequence: str, template_sequence: str, template_pdb_id: str, template_chain_id: str, kalign_executable_path: str, verbose: bool) Tuple[dict, str | None] ¶
Extracts template features from a single HHSearch hit.
- Parameters:
mmcif_dict – mmcif dict representing the template (see load_mmcif_dict).
index_mapping – Dictionary mapping indices in the query sequence to indices in the template sequence.
query_sequence – String describing the amino acid sequence for the query protein.
template_sequence – String describing the amino acid sequence for the template protein.
template_pdb_id – PDB code for the template.
template_chain_id – String ID describing which chain of the structure should be used.
kalign_executable_path – The path to a kalign executable used for template realignment.
verbose – Whether to print relevant details.
- Returns:
A tuple with:
- A dictionary containing the features derived from the template protein structure.
- A warning message if the hit was realigned to the actual mmCIF sequence. Otherwise None.
- Raises:
NoChainsError – If the mmcif_dict doesn’t contain any chains.
SequenceNotInTemplateError – If the given chain id / sequence can’t be found in the mmcif_dict.
QueryToTemplateAlignError – If the actual template in the mmCIF file can’t be aligned to the query.
NoAtomDataInTemplateError – If the mmcif_dict doesn’t contain atom positions.
TemplateAtomMaskAllZerosError – If the mmcif_dict doesn’t have any unmasked residues.
- deepfold.data.search.templates.get_atom_positions(mmcif_dict: dict, auth_chain_id: str, max_ca_ca_distance: float, zero_center: bool) Tuple[ndarray, ndarray] ¶
Gets atom positions and mask from a list of Biopython Residues.
- deepfold.data.search.templates.load_mmcif_dict(mmcif_dirpath: Path, pdb_id: str) dict ¶
Load mmCIF dict for pdb_id.
- deepfold.data.search.templates.load_pdb_obsolete_mapping(pdb_obsolete_filepath: Path) Dict[str, str] ¶
Parses the data file from PDB that lists which PDB ids are obsolete.
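For orientation, the PDB's obsolete listing contains lines that begin with `OBSLTE`, followed by a date, the obsolete ID and, when one exists, the replacement ID. The sketch below parses text in that shape. It is an assumption about the format and behaviour rather than the DeepFold implementation: it reads from a string instead of a file path, lowercases the IDs, and `sketch_parse_obsolete` is a hypothetical name.

```python
from typing import Dict


def sketch_parse_obsolete(obsolete_text: str) -> Dict[str, str]:
    """Map obsolete PDB IDs to their replacements (illustrative only)."""
    mapping: Dict[str, str] = {}
    for line in obsolete_text.splitlines():
        fields = line.split()
        # Expected shape: OBSLTE <date> <old id> <new id>; entries without
        # a replacement ID are skipped.
        if len(fields) >= 4 and fields[0] == "OBSLTE":
            mapping[fields[2].lower()] = fields[3].lower()
    return mapping


example = (
    " LIST OF OBSOLETE COORDINATE ENTRIES AND SUCCESSORS\n"
    "OBSLTE    31-JUL-94 116L     216L\n"
    "OBSLTE    26-SEP-06 2H33\n"
)
obsolete = sketch_parse_obsolete(example)
```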