setriq#

Provides:
  1. fast computation of pairwise (immunoglobulin) sequence distances

  2. parsed BLOSUM substitution matrices

About#

This package is a one-trick pony that does its trick well: efficient computation of pairwise sequence distances on the CPU. Several distance functions are available [1]_ [2]_ [3]_ [4]_, implemented in C++ and exposed in Python.

Examples

>>> import setriq
>>> metric = setriq.CdrDist()
>>>
>>> sequences = [
...     'CASSLKPNTEAFF',
...     'CASSAHIANYGYTF',
...     'CASRGATETQYF'
... ]
>>> distances = metric(sequences)
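
The layout of the returned distances is not described above. The sketch below assumes that, with return_squareform=False, a metric returns one value per unordered pair of sequences in row-major order (the same convention as SciPy's pdist), and that return_squareform=True expands this into a symmetric matrix. The helper names pairwise_condensed, to_squareform and length_gap are hypothetical and are not part of setriq.

```python
from itertools import combinations
from typing import Callable, List, Sequence


def pairwise_condensed(seqs: Sequence[str], dist: Callable[[str, str], float]) -> List[float]:
    """Return one distance per unordered pair (i < j), flattened row by row."""
    return [float(dist(a, b)) for a, b in combinations(seqs, 2)]


def to_squareform(condensed: Sequence[float], n: int) -> List[List[float]]:
    """Expand a condensed list into a symmetric n x n matrix with a zero diagonal."""
    square = [[0.0] * n for _ in range(n)]
    for (i, j), d in zip(combinations(range(n), 2), condensed):
        square[i][j] = square[j][i] = d
    return square


def length_gap(a: str, b: str) -> float:
    """Toy distance (absolute length difference), used only to make the layout visible."""
    return abs(len(a) - len(b))


sequences = ['CASSLKPNTEAFF', 'CASSAHIANYGYTF', 'CASRGATETQYF']
condensed = pairwise_condensed(sequences, length_gap)   # pairs (0,1), (0,2), (1,2)
square = to_squareform(condensed, len(sequences))
```

For N sequences the condensed form holds N * (N - 1) / 2 values, which is why it is the cheaper default for large repertoires.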

References

1

Dash, P., Fiore-Gartland, A.J., Hertz, T., Wang, G.C., Sharma, S., Souquette, A., Crawford, J.C., Clemens, E.B., Nguyen, T.H., Kedzierska, K. and La Gruta, N.L., 2017. Quantifiable predictive features define epitope-specific T cell receptor repertoires. Nature, 547(7661), pp.89-93. (https://doi.org/10.1038/nature22383)

2

Levenshtein, V.I., 1966, February. Binary codes capable of correcting deletions, insertions, and reversals. In Soviet physics doklady (Vol. 10, No. 8, pp. 707-710).

3

python-Levenshtein (https://github.com/ztane/python-Levenshtein)

4

Thakkar, N. and Bailey-Kellogg, C., 2019. Balancing sensitivity and specificity in distinguishing TCR groups by CDR sequence similarity. BMC bioinformatics, 20(1), pp.1-14. (https://doi.org/10.1186/s12859-019-2864-8)

Package Contents#

Classes#

CdrDist

The CdrDist [1]_ class. Inherits from Metric.

Hamming

Hamming distance class. Inherits from Metric. Sequences must be of equal length.

Jaro

Jaro distance class. Inherits from Metric. Adapted from [2].

JaroWinkler

Jaro-Winkler distance class. Inherits from Jaro.

Levenshtein

The Levenshtein [1]_ class. Inherits from Metric. It uses a refactor of the python-Levenshtein [2]_ implementation in the backend.

LongestCommonSubstring

Longest common substring distance class. Inherits from Metric.

OptimalStringAlignment

Optimal string alignment distance class. Inherits from Metric.

SubstitutionMatrix

The SubstitutionMatrix abstract base class. It holds convenience methods for loading and using the classic biological sequence substitution matrices (e.g. BLOSUM).

TcrDist

TcrDist [1]_ class. Inherits from Metric. It is a container class for individual TcrDistComponent instances.

Attributes#

BLOSUM45

BLOSUM62

BLOSUM90

setriq.BLOSUM45#
setriq.BLOSUM62#
setriq.BLOSUM90#
class setriq.CdrDist(substitution_matrix: setriq.modules.substitution.SubstitutionMatrix = BLOSUM45, gap_opening_penalty: float = 10.0, gap_extension_penalty: float = 1.0, return_squareform: bool = False)#

Bases: Metric[str]

The CdrDist [1]_ class. Inherits from Metric.

Examples

>>> sequences = ['CASSLKPNTEAFF', 'CASSAHIANYGYTF', 'CASRGATETQYF']
>>>
>>> metric = CdrDist()
>>> distances = metric(sequences)

References

1

Thakkar, N. and Bailey-Kellogg, C., 2019. Balancing sensitivity and specificity in distinguishing TCR groups by CDR sequence similarity. BMC bioinformatics, 20(1), pp.1-14. (https://doi.org/10.1186/s12859-019-2864-8)

forward(sequences: Sequence[str]) -> List[float]#
class setriq.Hamming(mismatch_score: float = 1.0, return_squareform: bool = False)#

Bases: Metric[str]

Hamming distance class. Inherits from Metric. Sequences must be of equal length.

Examples

>>> metric = Hamming(mismatch_score=2.0)
>>> sequences = ['CASSLKPNTEAFF', 'CASSAHIANYGYTF', 'CASRGATETQYF']
>>> distances = metric(sequences)

References

1

https://en.wikipedia.org/wiki/Hamming_distance
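
As an illustration of what the Hamming class computes for a single pair, here is a pure-Python sketch. It is not setriq's C++ implementation; the function name hamming_distance is hypothetical, and raising ValueError on unequal lengths is an assumption about how the equal-length requirement could be enforced.

```python
def hamming_distance(a: str, b: str, mismatch_score: float = 1.0) -> float:
    """Count mismatching positions and scale the count by mismatch_score."""
    if len(a) != len(b):
        # Assumed behaviour: the text above only states that lengths must match.
        raise ValueError('Hamming distance requires sequences of equal length')
    return mismatch_score * sum(x != y for x, y in zip(a, b))


# Two equal-length sequences that differ at exactly one position (K vs R).
example = hamming_distance('CASSLKPNTEAFF', 'CASSLRPNTEAFF', mismatch_score=2.0)
```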

forward(sequences: Sequence[str]) -> List[float]#
class setriq.Jaro(jaro_weights: Optional[List[float]] = None, return_squareform: bool = False)#

Bases: Metric[str]

Jaro distance class. Inherits from Metric. Adapted from [2].

Examples

>>> metric = Jaro()
>>> sequences = ['CASSLKPNTEAFF', 'CASSAHIANYGYTF', 'CASRGATETQYF']
>>> distances = metric(sequences)

References

[1] Jaro, M.A., 1989. Advances in record-linkage methodology as applied to matching the 1985 census of Tampa, Florida. Journal of the American Statistical Association, 84(406), pp.414-420.

[2] Van der Loo, M.P., 2014. The stringdist package for approximate string matching. R J., 6(1), p.111.
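
The following pure-Python sketch shows the textbook Jaro distance for one pair of strings. It assumes the three terms are weighted equally (1/3 each), which is how jaro_weights=None is interpreted here; that default is an assumption, and the function name jaro_distance is hypothetical rather than part of setriq.

```python
def jaro_distance(a: str, b: str) -> float:
    """Return 1 - Jaro similarity, with equal weights on the three terms."""
    if a == b:
        return 0.0
    if not a or not b:
        return 1.0
    # Characters count as matching only within this window of each other.
    window = max(max(len(a), len(b)) // 2 - 1, 0)
    a_hit = [False] * len(a)
    b_hit = [False] * len(b)
    matches = 0
    for i, ch in enumerate(a):
        lo, hi = max(0, i - window), min(len(b), i + window + 1)
        for j in range(lo, hi):
            if not b_hit[j] and b[j] == ch:
                a_hit[i] = b_hit[j] = True
                matches += 1
                break
    if matches == 0:
        return 1.0
    # Transpositions: matched characters that appear in a different order.
    a_matched = [ch for ch, hit in zip(a, a_hit) if hit]
    b_matched = [ch for ch, hit in zip(b, b_hit) if hit]
    transpositions = sum(x != y for x, y in zip(a_matched, b_matched)) / 2
    similarity = (
        matches / len(a) + matches / len(b) + (matches - transpositions) / matches
    ) / 3
    return 1.0 - similarity
```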

forward(sequences: Sequence[str]) -> List[float]#
class setriq.JaroWinkler(p: float = 0.1, max_l: int = 4, jaro_weights: Optional[List[float]] = None, return_squareform: bool = False)#

Bases: Jaro

Jaro-Winkler distance class. Inherits from Jaro.

Examples

>>> metric = JaroWinkler(p=0.10)
>>> sequences = ['CASSLKPNTEAFF', 'CASSAHIANYGYTF', 'CASRGATETQYF']
>>> distances = metric(sequences)

References

1

Winkler, W.E., 1990. String comparator metrics and enhanced decision rules in the Fellegi-Sunter model of record linkage.
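
The Winkler adjustment can be sketched as a rescaling of an already computed Jaro distance: a shared prefix of length l (capped at max_l) shrinks the distance by the factor (1 - l * p). The function below is a hypothetical illustration of that formula and assumes p and max_l have this standard meaning; it is not taken from setriq's source.

```python
def winkler_adjust(a: str, b: str, jaro_dist: float, p: float = 0.1, max_l: int = 4) -> float:
    """Scale a Jaro distance down according to the length of the common prefix."""
    prefix = 0
    for x, y in zip(a[:max_l], b[:max_l]):
        if x != y:
            break
        prefix += 1
    # Equivalent to 1 - (sim + prefix * p * (1 - sim)) with sim = 1 - jaro_dist.
    return jaro_dist * (1.0 - prefix * p)


# 'MARTHA' and 'MARHTA' have a Jaro distance of 1/18 and share the prefix 'MAR'.
adjusted = winkler_adjust('MARTHA', 'MARHTA', jaro_dist=1 / 18)
```

With p = 0.1 and max_l = 4 the factor never drops below 0.6, so the adjusted distance stays non-negative.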

class setriq.Levenshtein(extra_cost: float = 0.0, return_squareform: bool = False)#

Bases: Metric[str]

The Levenshtein [1]_ class. Inherits from Metric. It uses a refactor of the python-Levenshtein [2]_ implementation in the backend.

Examples

>>> sequences = ['CASSLKPNTEAFF', 'CASSAHIANYGYTF', 'CASRGATETQYF']
>>>
>>> metric = Levenshtein()
>>> distances = metric(sequences)

References

1

Levenshtein, V.I., 1966, February. Binary codes capable of correcting deletions, insertions, and reversals. In Soviet physics doklady (Vol. 10, No. 8, pp. 707-710).

2

python-Levenshtein (https://github.com/ztane/python-Levenshtein)
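
For reference, the classic dynamic-programming recurrence behind the Levenshtein distance can be sketched in pure Python as below. This is not the refactored python-Levenshtein backend; the function name is hypothetical, and the extra_cost parameter is deliberately left out because its exact semantics are not described here.

```python
def levenshtein_distance(a: str, b: str) -> float:
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, x in enumerate(a, start=1):
        current = [i]
        for j, y in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,             # delete x
                current[j - 1] + 1,          # insert y
                previous[j - 1] + (x != y),  # substitute, free if equal
            ))
        previous = current
    return float(previous[-1])
```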

forward(sequences: Sequence[str]) -> List[float]#
class setriq.LongestCommonSubstring(return_squareform: bool = False)#

Bases: Metric[str]

Longest common substring distance class. Inherits from Metric.

Examples

>>> metric = LongestCommonSubstring()
>>> sequences = ['CASSLKPNTEAFF', 'CASSAHIANYGYTF', 'CASRGATETQYF']
>>> distances = metric(sequences)

References

1

https://en.wikipedia.org/wiki/Longest_common_substring_problem

forward(sequences: Sequence[str]) -> List[float]#
class setriq.OptimalStringAlignment(return_squareform: bool = False)#

Bases: Metric[str]

Optimal string alignment distance class. Inherits from Metric.

Examples

>>> metric = OptimalStringAlignment()
>>> sequences = ['CASSLKPNTEAFF', 'CASSAHIANYGYTF', 'CASRGATETQYF']
>>> distances = metric(sequences)

References

1

https://en.wikipedia.org/wiki/Damerau%E2%80%93Levenshtein_distance
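
Optimal string alignment extends the Levenshtein recurrence with one extra case: swapping two adjacent characters costs a single edit, but no substring may be edited more than once. A minimal pure-Python sketch, with a hypothetical function name and unit costs assumed for every operation:

```python
def osa_distance(a: str, b: str) -> float:
    """Restricted Damerau-Levenshtein (optimal string alignment) distance."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution
            )
            # Adjacent transposition, e.g. 'AB' -> 'BA'.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return float(d[-1][-1])
```

The pair 'CA' and 'ABC' separates the two variants: the unrestricted Damerau-Levenshtein distance is 2, while optimal string alignment gives 3.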

forward(sequences: Sequence[str]) -> List[float]#
class setriq.SubstitutionMatrix(index: Dict[str, int], substitution_matrix: List[List[float]], *args, **kwargs)#

Bases: abc.ABC

The SubstitutionMatrix abstract base class. It holds convenience methods for loading and using the classic biological sequence substitution matrices (e.g. BLOSUM).

index#

a mapping of strings (amino acids) to integers (matrix index)

Type

Dict[str, int]

substitution_matrix#

the substitution scoring (\(N \times N\)) matrix

Type

List[List[float]]

from_json(self, file_path: Union[str, Path]) -> SubstitutionMatrix#

load a substitution matrix from a json file

Examples

suppose we have a token index idx and a substitution matrix scores

>>> idx = {'hello': 0, 'world': 1}
>>> scores = [[1., -1.],
...           [-1., 1.]]
>>> matrix = SubstitutionMatrix(index=idx, substitution_matrix=scores)

here we can see that we can provide any arbitrary substitution matrix, but in general it is advised to use the pre-loaded BLOSUM matrices

>>> [BLOSUM45, BLOSUM62, BLOSUM90]  # choose one of the following

these are just instances of SubstitutionMatrix, initialised through from_json
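
To make the relationship between index and substitution_matrix concrete, the sketch below looks up the score for a pair of tokens using plain Python containers. The helper pair_score is hypothetical; the real class may expose lookups differently.

```python
from typing import Dict, List


def pair_score(index: Dict[str, int], matrix: List[List[float]], a: str, b: str) -> float:
    """Map both tokens to their row/column positions and read the score."""
    return matrix[index[a]][index[b]]


idx = {'hello': 0, 'world': 1}
scores = [[1., -1.],
          [-1., 1.]]

same = pair_score(idx, scores, 'hello', 'hello')
different = pair_score(idx, scores, 'hello', 'world')
```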

index :Dict[str, int]#
substitution_matrix :List[List[float]]#
classmethod from_json(file_path: Union[str, pathlib.Path]) -> SubstitutionMatrix#

Load a SubstitutionMatrix from a json file.

Parameters

file_path (Union[str, Path]) – a path to a json file holding at least the token index and the substitution scoring matrix

Returns

substitution_matrix – returns an instance of the SubstitutionMatrix class, holding the values found at file_path

Return type

SubstitutionMatrix

Examples

>>> sub_mat = SubstitutionMatrix.from_json('/path/to/file.json')
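
The JSON schema is not spelled out here. The sketch below assumes the file is a single object whose keys mirror the constructor arguments, 'index' and 'substitution_matrix'; both key names, the helper read_matrix_json, and the two-token values are assumptions for illustration only.

```python
import json
import tempfile
from pathlib import Path
from typing import Dict, List, Tuple, Union


def read_matrix_json(file_path: Union[str, Path]) -> Tuple[Dict[str, int], List[List[float]]]:
    """Read the assumed layout and check that the matrix is square and fully indexed."""
    with open(file_path, encoding='utf-8') as handle:
        data = json.load(handle)
    index, matrix = data['index'], data['substitution_matrix']
    if any(len(row) != len(matrix) for row in matrix):
        raise ValueError('substitution matrix must be square')
    if sorted(index.values()) != list(range(len(matrix))):
        raise ValueError('index must cover every row of the matrix exactly once')
    return index, matrix


# Toy two-token matrix written to a temporary file, then read back.
payload = {'index': {'A': 0, 'C': 1}, 'substitution_matrix': [[4.0, 0.0], [0.0, 9.0]]}
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / 'toy_matrix.json'
    path.write_text(json.dumps(payload), encoding='utf-8')
    loaded_index, loaded_matrix = read_matrix_json(path)
```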
keys() -> Tuple[str, str]#
add_token(token: str, values: Union[float, List[float]], inplace: bool = False) -> Union[SubstitutionMatrix, None]#

Add a special token to the substitution matrix with a given value or list of values.

Parameters
  • token (str) – a special token to be added.

  • values (Union[float, List[float]]) – a value or list of values to which the token will correspond. If a list of floats is provided, it must have a length of len(substitution_matrix) + 1, i.e. one more element than the matrix has rows.

  • inplace (bool) – whether to add the token in place.

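Returns

None if inplace is True (the matrix is modified in place); otherwise a new SubstitutionMatrix that includes the added token

Return type

Union[SubstitutionMatrix, None]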

Examples

Single value example. The value is repeated to fit the required shape

>>> sm = BLOSUM62
>>> sm.add_token('-', 4.)

List of floats example

>>> sm = BLOSUM62
>>> len(sm)
24
>>> sm.add_token('setriq', [*range(len(sm) + 1)])  # ints implicitly converted to floats
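
The shape rule for values can be illustrated on plain lists. The sketch below assumes the new token is appended as the last row and column, that a single float is repeated len(matrix) + 1 times, and that the final element of values is the token's score against itself. The function with_token is hypothetical and always returns new objects; it does not reproduce the inplace option.

```python
from typing import Dict, List, Tuple, Union


def with_token(
    index: Dict[str, int],
    matrix: List[List[float]],
    token: str,
    values: Union[float, List[float]],
) -> Tuple[Dict[str, int], List[List[float]]]:
    """Return a copy of (index, matrix) extended by one row and one column."""
    n = len(matrix)
    if isinstance(values, (int, float)):
        values = [float(values)] * (n + 1)  # repeat the single value to fit the shape
    values = [float(v) for v in values]
    if len(values) != n + 1:
        raise ValueError(f'expected {n + 1} values, got {len(values)}')
    if token in index:
        raise ValueError(f'token {token!r} is already in the index')
    new_index = {**index, token: n}
    # Extend every existing row by one column, then append the new row.
    new_matrix = [row + [values[i]] for i, row in enumerate(matrix)]
    new_matrix.append(list(values))
    return new_index, new_matrix


idx = {'hello': 0, 'world': 1}
scores = [[1., -1.],
          [-1., 1.]]
gap_index, gap_scores = with_token(idx, scores, '-', 4.)
```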
class setriq.TcrDist(return_squareform: bool = False, **components: TcrDistComponent)#

Bases: Metric[Dict[str, str]]

TcrDist [1]_ class. Inherits from Metric. It is a container class for individual TcrDistComponent instances. Components are executed sequentially and their results aggregated at the end (summation).

components#

holds the names of the components to be executed

Type

List[str]

Examples

>>> sequences = [
...     {'cdr_1': 'TSG------FNG', 'cdr_2': 'VVL----DGL', 'cdr_2_5': 'SRSN-GY', 'cdr_3': 'CAVR-----'},
...     {'cdr_1': 'TSG------FYG', 'cdr_2': 'NGL----DGL', 'cdr_2_5': 'SRSD-SY', 'cdr_3': 'CA-------'},
...     {'cdr_1': 'NSA------FQY', 'cdr_2': 'TYS----SGN', 'cdr_2_5': 'DKSSKY-', 'cdr_3': 'CAMS-----'}
... ]
>>> metric = TcrDist()  # will produce a warning stating default configuration (Dash et al)
>>> distances = metric(sequences)
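
The aggregation step (each component scores its own field, and the per-pair results are summed) can be sketched without the real TcrDistComponent class. Everything below is hypothetical: the weighted mismatch count stands in for a real component, and the weights and toy records are not the Dash et al. configuration.

```python
from itertools import combinations
from typing import Callable, Dict, List, Sequence


def weighted_mismatch(weight: float) -> Callable[[str, str], float]:
    """Build a stand-in component: weighted count of mismatching aligned positions."""
    def component(a: str, b: str) -> float:
        return weight * sum(x != y for x, y in zip(a, b))
    return component


def combine_components(
    records: Sequence[Dict[str, str]],
    components: Dict[str, Callable[[str, str], float]],
) -> List[float]:
    """Run each component over all pairs of its field, then sum the results per pair."""
    pairs = list(combinations(records, 2))
    totals = [0.0] * len(pairs)
    for key, component in components.items():  # components run one after another
        for k, (left, right) in enumerate(pairs):
            totals[k] += component(left[key], right[key])
    return totals


records = [
    {'cdr_1': 'AAA', 'cdr_3': 'CC'},
    {'cdr_1': 'AAB', 'cdr_3': 'CD'},
    {'cdr_1': 'ABB', 'cdr_3': 'CC'},
]
components = {'cdr_1': weighted_mismatch(1.0), 'cdr_3': weighted_mismatch(3.0)}
totals = combine_components(records, components)
```

In this sketch the keys of components play the role that required_input_keys plays for TcrDist: every record must provide each of them.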

References

1

Dash, P., Fiore-Gartland, A.J., Hertz, T., Wang, G.C., Sharma, S., Souquette, A., Crawford, J.C., Clemens, E.B., Nguyen, T.H., Kedzierska, K. and La Gruta, N.L., 2017. Quantifiable predictive features define epitope-specific T cell receptor repertoires. Nature, 547(7661), pp.89-93. (https://doi.org/10.1038/nature22383)

property required_input_keys: List[str]#

Get the keys (=fields) required in the input to TcrDist instance.

Returns

required_input_keys – returns a list of strings signifying the keys required in the input

Return type

List[str]

property default_definition: setriq.modules.utils.TcrDistDef#

Get the default TcrDistComponent schema as defined by Dash et al.

Returns

default_schema – returns the schema for the TcrDistComponent instances held in the default configuration

Return type

List[tuple]

forward(sequences: Sequence[Dict[str, str]]) -> List[float]#