setriq
Provides:
- fast computation of pairwise (immunoglobulin) sequence distances
- parsed BLOSUM substitution matrices
About
This package is a one-trick pony that does its trick well: efficient computation of pairwise sequence distances on the CPU. Several distance functions are available [1] [2] [3] [4], implemented in C++ and exposed in Python.
Examples
>>> import setriq
>>> metric = setriq.CdrDist()
>>>
>>> sequences = [
... 'CASSLKPNTEAFF',
... 'CASSAHIANYGYTF',
... 'CASRGATETQYF'
... ]
>>> distances = metric(sequences)
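A metric call of this kind yields one distance per unordered pair of sequences. The pure-Python sketch below illustrates that idea without using setriq: it computes a flat list of pairwise distances for any distance function and optionally expands it into a square matrix. The helper names `pairwise`, `to_squareform` and `length_gap` are hypothetical, and the row-major upper-triangle ordering is an assumption, not something this page confirms.

```python
from itertools import combinations
from typing import Callable, List, Sequence


def pairwise(seqs: Sequence[str], dist: Callable[[str, str], float]) -> List[float]:
    """Distances for every unordered pair (i < j), in row-major order."""
    return [dist(a, b) for a, b in combinations(seqs, 2)]


def to_squareform(flat: Sequence[float], n: int) -> List[List[float]]:
    """Expand n * (n - 1) / 2 pairwise values into a symmetric n x n matrix."""
    square = [[0.0] * n for _ in range(n)]
    for (i, j), d in zip(combinations(range(n), 2), flat):
        square[i][j] = square[j][i] = d
    return square


def length_gap(a: str, b: str) -> float:
    """Toy distance used only for this illustration."""
    return float(abs(len(a) - len(b)))


sequences = ['CASSLKPNTEAFF', 'CASSAHIANYGYTF', 'CASRGATETQYF']
flat = pairwise(sequences, length_gap)        # 3 sequences -> 3 pairs
square = to_squareform(flat, len(sequences))  # 3 x 3, zero diagonal
```

For N sequences the flat form holds N(N-1)/2 values, which is why it is the cheaper default when the full matrix is not needed.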
References
- [1] Dash, P., Fiore-Gartland, A.J., Hertz, T., Wang, G.C., Sharma, S., Souquette, A., Crawford, J.C., Clemens, E.B., Nguyen, T.H., Kedzierska, K. and La Gruta, N.L., 2017. Quantifiable predictive features define epitope-specific T cell receptor repertoires. Nature, 547(7661), pp.89-93. (https://doi.org/10.1038/nature22383)
- [2] Levenshtein, V.I., 1966, February. Binary codes capable of correcting deletions, insertions, and reversals. In Soviet physics doklady (Vol. 10, No. 8, pp. 707-710).
- [3] python-Levenshtein (https://github.com/ztane/python-Levenshtein)
- [4] Thakkar, N. and Bailey-Kellogg, C., 2019. Balancing sensitivity and specificity in distinguishing TCR groups by CDR sequence similarity. BMC bioinformatics, 20(1), pp.1-14. (https://doi.org/10.1186/s12859-019-2864-8)
Package Contents
Classes
CdrDist | The CdrDist [1] class. Inherits from Metric.
Hamming | Hamming distance class. Inherits from Metric. Sequences must be of equal length.
Jaro | Jaro distance class. Inherits from Metric. Adapted from [2].
JaroWinkler | Jaro-Winkler distance class. Inherits from Jaro.
Levenshtein | The Levenshtein [1] class. Inherits from Metric. It uses a refactor of the python-Levenshtein [2] implementation in the backend.
LongestCommonSubstring | Longest common substring distance class. Inherits from Metric.
OptimalStringAlignment | Optimal string alignment distance class. Inherits from Metric.
SubstitutionMatrix | The SubstitutionMatrix abstract base class. It holds convenience methods for loading and using the classic biological sequence substitution matrices (e.g. BLOSUM).
TcrDist | TcrDist [1] class. Inherits from Metric. It is a container class for individual TcrDistComponent instances.
Attributes
- setriq.BLOSUM45
- setriq.BLOSUM62
- setriq.BLOSUM90
- class setriq.CdrDist(substitution_matrix: setriq.modules.substitution.SubstitutionMatrix = BLOSUM45, gap_opening_penalty: float = 10.0, gap_extension_penalty: float = 1.0, return_squareform: bool = False)
Bases: Metric[str]
The CdrDist [1] class. Inherits from Metric.
Examples
>>> from setriq import CdrDist
>>> sequences = ['CASSLKPNTEAFF', 'CASSAHIANYGYTF', 'CASRGATETQYF']
>>>
>>> metric = CdrDist()
>>> distances = metric(sequences)
References
- [1] Thakkar, N. and Bailey-Kellogg, C., 2019. Balancing sensitivity and specificity in distinguishing TCR groups by CDR sequence similarity. BMC bioinformatics, 20(1), pp.1-14. (https://doi.org/10.1186/s12859-019-2864-8)
- forward(sequences: Sequence[str]) -> List[float]
- class setriq.Hamming(mismatch_score: float = 1.0, return_squareform: bool = False)
Bases: Metric[str]
Hamming distance class. Inherits from Metric. Sequences must be of equal length.
Examples
>>> from setriq import Hamming
>>> metric = Hamming(mismatch_score=2.0)
>>> sequences = ['CASSLKPNTEAF', 'CASSAHIANYGY', 'CASRGATETQYF']  # equal lengths, as Hamming requires
>>> distances = metric(sequences)
- forward(sequences: Sequence[str]) -> List[float]
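For reference, the Hamming distance itself is simple enough to sketch in a few lines of pure Python. The function below is an illustration, not setriq's C++ implementation; it assumes that `mismatch_score` is the cost charged for each mismatching position, which matches the parameter name but is not spelled out on this page.

```python
def hamming(a: str, b: str, mismatch_score: float = 1.0) -> float:
    """Count mismatching positions, scaled by a per-mismatch cost."""
    if len(a) != len(b):
        # The Hamming distance is only defined for equal-length sequences.
        raise ValueError("Hamming distance requires sequences of equal length")
    return mismatch_score * sum(x != y for x, y in zip(a, b))


distance = hamming("karolin", "kathrin")  # positions 3, 4 and 5 differ
```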
- class setriq.Jaro(jaro_weights: Optional[List[float]] = None, return_squareform: bool = False)
Bases: Metric[str]
Jaro distance class. Inherits from Metric. Adapted from [2].
Examples
>>> from setriq import Jaro
>>> metric = Jaro()
>>> sequences = ['CASSLKPNTEAFF', 'CASSAHIANYGYTF', 'CASRGATETQYF']
>>> distances = metric(sequences)
References
- [1] Jaro, M.A., 1989. Advances in record-linkage methodology as applied to matching the 1985 census of Tampa, Florida. Journal of the American Statistical Association, 84(406), pp.414-420.
- [2] Van der Loo, M.P., 2014. The stringdist package for approximate string matching. R J., 6(1), p.111.
- forward(sequences: Sequence[str]) -> List[float]
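A minimal pure-Python sketch of the Jaro distance may help when interpreting the results. It is not setriq's implementation. It assumes that `jaro_weights` holds three weights, one per term of the Jaro similarity, and that the default is equal weights of 1.0; both points are assumptions based on the stringdist formulation cited above.

```python
from typing import Sequence


def jaro_distance(a: str, b: str, weights: Sequence[float] = (1.0, 1.0, 1.0)) -> float:
    """Jaro distance: 1 minus the (weighted) Jaro similarity."""
    if not a and not b:
        return 0.0
    if not a or not b:
        return 1.0
    # Characters match only if they lie within this window of each other.
    window = max(max(len(a), len(b)) // 2 - 1, 0)
    a_hit = [False] * len(a)
    b_hit = [False] * len(b)
    m = 0
    for i, ch in enumerate(a):
        for j in range(max(0, i - window), min(len(b), i + window + 1)):
            if not b_hit[j] and b[j] == ch:
                a_hit[i] = b_hit[j] = True
                m += 1
                break
    if m == 0:
        return 1.0
    # Transpositions: matched characters that appear in a different order.
    a_seq = [c for c, hit in zip(a, a_hit) if hit]
    b_seq = [c for c, hit in zip(b, b_hit) if hit]
    t = sum(x != y for x, y in zip(a_seq, b_seq)) / 2
    w1, w2, w3 = weights
    similarity = (w1 * m / len(a) + w2 * m / len(b) + w3 * (m - t) / m) / 3
    return 1.0 - similarity


d = jaro_distance("MARTHA", "MARHTA")  # 6 matches, 1 transposition
```

With equal weights, "MARTHA" and "MARHTA" have a similarity of 17/18, so the distance is 1/18.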
- class setriq.JaroWinkler(p: float = 0.1, max_l: int = 4, jaro_weights: Optional[List[float]] = None, return_squareform: bool = False)
Bases: Jaro
Jaro-Winkler distance class. Inherits from Jaro.
Examples
>>> from setriq import JaroWinkler
>>> metric = JaroWinkler(p=0.10)
>>> sequences = ['CASSLKPNTEAFF', 'CASSAHIANYGYTF', 'CASRGATETQYF']
>>> distances = metric(sequences)
References
- [1] Winkler, W.E., 1990. String comparator metrics and enhanced decision rules in the Fellegi-Sunter model of record linkage.
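The Winkler modification rewards sequences that share a common prefix. The sketch below applies the standard adjustment to an already computed Jaro distance. It assumes that `p` is the prefix scaling factor and `max_l` the maximum prefix length considered, which is the usual reading of these parameter names but is not stated on this page; `winkler_adjust` is a hypothetical helper name.

```python
def winkler_adjust(jaro_dist: float, a: str, b: str, p: float = 0.1, max_l: int = 4) -> float:
    """Shrink a Jaro distance in proportion to the shared prefix length."""
    prefix = 0
    for x, y in zip(a[:max_l], b[:max_l]):
        if x != y:
            break
        prefix += 1
    # Equivalent to similarity + prefix * p * (1 - similarity), as a distance.
    return jaro_dist * (1.0 - prefix * p)


# "MARTHA" / "MARHTA" have a Jaro distance of 1/18 and share the prefix "MAR".
d = winkler_adjust(1.0 / 18.0, "MARTHA", "MARHTA")
```

Because the factor is `1 - prefix * p`, `p` should not exceed `1 / max_l`, otherwise the adjusted distance could become negative.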
- class setriq.Levenshtein(extra_cost: float = 0.0, return_squareform: bool = False)
Bases: Metric[str]
The Levenshtein [1] class. Inherits from Metric. It uses a refactor of the python-Levenshtein [2] implementation in the backend.
Examples
>>> from setriq import Levenshtein
>>> sequences = ['CASSLKPNTEAFF', 'CASSAHIANYGYTF', 'CASRGATETQYF']
>>>
>>> metric = Levenshtein()
>>> distances = metric(sequences)
References
- [1] Levenshtein, V.I., 1966, February. Binary codes capable of correcting deletions, insertions, and reversals. In Soviet physics doklady (Vol. 10, No. 8, pp. 707-710).
- [2] python-Levenshtein (https://github.com/ztane/python-Levenshtein)
- forward(sequences: Sequence[str]) -> List[float]
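As a point of comparison, the textbook dynamic-programming form of the Levenshtein distance is sketched below in pure Python. This is not the python-Levenshtein refactor used by setriq. The `substitution_cost` argument is a hypothetical knob added for illustration and is not claimed to correspond to `extra_cost`.

```python
def levenshtein(a: str, b: str, substitution_cost: float = 1.0) -> float:
    """Minimum total cost of insertions, deletions and substitutions."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = [float(j) for j in range(len(b) + 1)]
    for i, x in enumerate(a, start=1):
        curr = [float(i)]
        for j, y in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1.0,      # delete x
                curr[j - 1] + 1.0,  # insert y
                prev[j - 1] + (0.0 if x == y else substitution_cost),
            ))
        prev = curr
    return prev[-1]


d = levenshtein("kitten", "sitting")  # two substitutions and one insertion
```

Keeping only two rows of the table brings memory use down to O(len(b)) while the running time stays O(len(a) * len(b)).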
- class setriq.LongestCommonSubstring(return_squareform: bool = False)
Bases: Metric[str]
Longest common substring distance class. Inherits from Metric.
Examples
>>> from setriq import LongestCommonSubstring
>>> metric = LongestCommonSubstring()
>>> sequences = ['CASSLKPNTEAFF', 'CASSAHIANYGYTF', 'CASRGATETQYF']
>>> distances = metric(sequences)
- forward(sequences: Sequence[str]) -> List[float]
- class setriq.OptimalStringAlignment(return_squareform: bool = False)
Bases: Metric[str]
Optimal string alignment distance class. Inherits from Metric.
Examples
>>> from setriq import OptimalStringAlignment
>>> metric = OptimalStringAlignment()
>>> sequences = ['CASSLKPNTEAFF', 'CASSAHIANYGYTF', 'CASRGATETQYF']
>>> distances = metric(sequences)
- forward(sequences: Sequence[str]) -> List[float]
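The optimal string alignment (OSA) distance extends Levenshtein with one extra operation: swapping two adjacent characters, with the restriction that no substring is edited more than once. A pure-Python sketch of the standard recurrence, assuming unit costs for all operations (setriq's actual cost scheme is not documented here), could look like this:

```python
def osa(a: str, b: str) -> int:
    """Optimal string alignment distance with unit edit costs."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # match or substitution
            )
            # Adjacent transposition, e.g. "ab" -> "ba".
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


swapped = osa("ab", "ba")     # one transposition
restricted = osa("ca", "abc")
```

The pair "ca" / "abc" shows the restriction: OSA gives 3, whereas the unrestricted Damerau-Levenshtein distance would give 2.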
- class setriq.SubstitutionMatrix(index: Dict[str, int], substitution_matrix: List[List[float]], *args, **kwargs)
Bases: abc.ABC
The SubstitutionMatrix abstract base class. It holds convenience methods for loading and using the classic biological sequence substitution matrices (e.g. BLOSUM).
- index
a mapping of strings (amino acids) to integers (matrix index)
- Type
Dict[str, int]
- substitution_matrix
the substitution scoring (\(N \times N\)) matrix
- Type
List[List[float]]
- from_json(self, file_path: Union[str, Path]) -> SubstitutionMatrix
load a substitution matrix from a json file
Examples
Suppose we have a token index idx and a substitution matrix scores:
>>> from setriq import BLOSUM45, BLOSUM62, BLOSUM90, SubstitutionMatrix
>>> idx = {'hello': 0, 'world': 1}
>>> scores = [[1., -1.],
...           [-1., 1.]]
>>> matrix = SubstitutionMatrix(index=idx, substitution_matrix=scores)
Any arbitrary substitution matrix can be provided in this way, but in general it is advised to use one of the pre-loaded BLOSUM matrices:
>>> [BLOSUM45, BLOSUM62, BLOSUM90]  # choose one of these
These are just instances of SubstitutionMatrix, initialised through from_json.
- index: Dict[str, int]
- substitution_matrix: List[List[float]]
- classmethod from_json(file_path: Union[str, pathlib.Path]) -> SubstitutionMatrix
Load a SubstitutionMatrix from a json file.
- Parameters
file_path (Union[str, Path]) – a path to a json file holding at least the token index and the substitution scoring matrix
- Returns
substitution_matrix – returns an instance of the SubstitutionMatrix class, holding the values found at file_path
- Return type
SubstitutionMatrix
Examples
>>> sub_mat = SubstitutionMatrix.from_json('/path/to/file.json')
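To make the expected file contents more concrete, here is a standalone sketch of loading and validating such a file with the standard library. It does not use setriq. The JSON key names `"index"` and `"substitution_matrix"` are assumptions made for illustration, since this page only says that the file holds a token index and a scoring matrix, and `load_matrix` is a hypothetical helper.

```python
import json
import tempfile
from pathlib import Path
from typing import Dict, List, Tuple, Union


def load_matrix(file_path: Union[str, Path]) -> Tuple[Dict[str, int], List[List[float]]]:
    """Read a token index and an N x N scoring matrix from a JSON file."""
    with open(file_path, encoding="utf-8") as handle:
        payload = json.load(handle)
    index = payload["index"]                 # assumed key name
    scores = payload["substitution_matrix"]  # assumed key name
    n = len(index)
    if len(scores) != n or any(len(row) != n for row in scores):
        raise ValueError("substitution matrix must be N x N for N tokens")
    return index, [[float(v) for v in row] for row in scores]


# Write a toy two-token matrix to a temporary file and read it back.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "toy_matrix.json"
    path.write_text(json.dumps({
        "index": {"A": 0, "C": 1},
        "substitution_matrix": [[4, 0], [0, 9]],
    }), encoding="utf-8")
    index, scores = load_matrix(path)
```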
- keys() -> Tuple[str, str]
- add_token(token: str, values: Union[float, List[float]], inplace: bool = False) -> Union[SubstitutionMatrix, None]
Add a special token to the substitution matrix with a given value or list of values.
- Parameters
token (str) – a special token to be added.
values (Union[float, List[float]]) – a value or list of values to which the token will correspond. If a list of floats is provided, it must have a length of len(substitution_matrix) + 1, i.e. one more element than the number of rows.
inplace (bool) – whether to add the token in place.
- Returns
this is an inplace operation
- Return type
None
Examples
Single value example. The value is repeated to fit the required shape:
>>> sm = BLOSUM62
>>> sm.add_token('-', 4.)
List of floats example:
>>> sm = BLOSUM62
>>> len(sm)
24
>>> sm.add_token('setriq', [*range(26)])  # ints implicitly converted to floats
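The shape rule for `values` can be illustrated on plain Python data. The sketch below grows an N x N matrix to (N + 1) x (N + 1) by appending one column and one row for the new token. It is a standalone illustration rather than setriq's code: the function name mirrors the method only for readability, and the symmetric layout (the same values used for the new row and the new column) is an assumption.

```python
from typing import Dict, List, Sequence, Tuple, Union


def add_token(
    index: Dict[str, int],
    matrix: List[List[float]],
    token: str,
    values: Union[float, Sequence[float]],
) -> Tuple[Dict[str, int], List[List[float]]]:
    """Return copies of index and matrix extended by one token."""
    n = len(matrix)
    if isinstance(values, (int, float)):
        # A single value is repeated to fill the new row and column.
        values = [float(values)] * (n + 1)
    else:
        values = [float(v) for v in values]
    if len(values) != n + 1:
        raise ValueError(f"expected {n + 1} values, got {len(values)}")
    if token in index:
        raise ValueError(f"token {token!r} is already present")
    new_index = {**index, token: n}
    # Append the new column to every existing row, then add the new row.
    new_matrix = [row + [values[i]] for i, row in enumerate(matrix)]
    new_matrix.append(list(values))
    return new_index, new_matrix


idx, scores = add_token({'A': 0, 'C': 1}, [[4., 0.], [0., 9.]], '-', -1.0)
```

The last element of `values` is the score of the new token against itself, which is why the list needs one more entry than the matrix has rows.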
- class setriq.TcrDist(return_squareform: bool = False, **components: TcrDistComponent)
Bases: Metric[Dict[str, str]]
TcrDist [1] class. Inherits from Metric. It is a container class for individual TcrDistComponent instances. Components are executed sequentially and their results aggregated at the end (summation).
- components
holds the names of the components to be executed
- Type
List[str]
Examples
>>> from setriq import TcrDist
>>> sequences = [
...     {'cdr_1': 'TSG------FNG', 'cdr_2': 'VVL----DGL', 'cdr_2_5': 'SRSN-GY', 'cdr_3': 'CAVR-----'},
...     {'cdr_1': 'TSG------FYG', 'cdr_2': 'NGL----DGL', 'cdr_2_5': 'SRSD-SY', 'cdr_3': 'CA-------'},
...     {'cdr_1': 'NSA------FQY', 'cdr_2': 'TYS----SGN', 'cdr_2_5': 'DKSSKY-', 'cdr_3': 'CAMS-----'}
... ]
>>> metric = TcrDist()  # will produce a warning stating default configuration (Dash et al.)
>>> distances = metric(sequences)
References
- [1] Dash, P., Fiore-Gartland, A.J., Hertz, T., Wang, G.C., Sharma, S., Souquette, A., Crawford, J.C., Clemens, E.B., Nguyen, T.H., Kedzierska, K. and La Gruta, N.L., 2017. Quantifiable predictive features define epitope-specific T cell receptor repertoires. Nature, 547(7661), pp.89-93. (https://doi.org/10.1038/nature22383)
- property required_input_keys: List[str]
Get the keys (i.e. fields) required in the input to a TcrDist instance.
- Returns
required_input_keys – returns a list of strings signifying the keys required in the input
- Return type
List[str]
- property default_definition: setriq.modules.utils.TcrDistDef
Get the default TcrDistComponent schema as defined by Dash et al.
- Returns
default_schema – returns the schema for the TcrDistComponent instances held in the default configuration
- Return type
List[tuple]
- forward(sequences: Sequence[Dict[str, str]]) -> List[float]
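The aggregation described for TcrDist (each component is run on its own input field, and the per-pair results are then summed) can be sketched in pure Python as follows. This is only an illustration of the summation step: `combined_distance` and `mismatch_count` are hypothetical names, and the toy component below stands in for the real TcrDistComponent scoring, which is not reproduced here.

```python
from itertools import combinations
from typing import Callable, Dict, List, Sequence

Component = Callable[[str, str], float]


def mismatch_count(a: str, b: str) -> float:
    """Toy component: number of differing positions in aligned strings."""
    return float(sum(x != y for x, y in zip(a, b)))


def combined_distance(
    records: Sequence[Dict[str, str]],
    components: Dict[str, Component],
) -> List[float]:
    """Run each component on its field and sum the results per pair."""
    n_pairs = len(records) * (len(records) - 1) // 2
    totals = [0.0] * n_pairs
    for key, component in components.items():  # components run sequentially
        for k, (left, right) in enumerate(combinations(records, 2)):
            totals[k] += component(left[key], right[key])
    return totals


records = [
    {'cdr_1': 'TSG------FNG', 'cdr_3': 'CAVR-----'},
    {'cdr_1': 'TSG------FYG', 'cdr_3': 'CA-------'},
    {'cdr_1': 'NSA------FQY', 'cdr_3': 'CAMS-----'},
]
totals = combined_distance(records, {'cdr_1': mismatch_count, 'cdr_3': mismatch_count})
```

The keys of `components` play the role of the required input keys: every record must provide a string for each of them.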