Preprocessing¶
integer_encode¶
protlearn.preprocessing.integer_encode(X, *, padding=False)
Encode amino acids as integers.
This function maps each amino acid to an integer, starting at 1, according to its position in the returned amino acid order. Zero is reserved for optional padding. This is particularly useful for preparing input for sequence-based models such as long short-term memory (LSTM) networks or gated recurrent units (GRUs).
Parameters¶
- X: string, fasta, or a list thereof
- Dataset of amino acid sequences.
- padding: bool, default=False
- False : sequences are returned in their original lengths
True : sequences will be padded with zeros to the length of the longest sequence in the dataset
Returns¶
- enc: ndarray of shape (n_samples,) if padding=False or (n_samples, max_len) if padding=True
- Contains the integer-encoded amino acid sequences.
- amino_acids: amino acid order of enc array
- This serves as a lookup for the encoded sequences.
Examples¶
>>> from protlearn.preprocessing import integer_encode
>>> seq = 'ARKLYPGPGEERNK'
>>> enc, aa = integer_encode(seq)
>>> enc
array([ 1, 15,  9, 10, 20, 13,  6, 13,  6,  4,  4, 15, 12,  9])
>>> aa
'ACDEFGHIKLMNPQRSTVWY'
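The returned lookup string can be used to translate an encoded sequence back into residues. The following is a minimal sketch in plain Python, assuming the alphabet string shown above and that 0 is used only for padding; `decode` is a hypothetical helper written for illustration and is not part of protlearn.

```python
# Alphabet string as returned in the example above.
# The residue at string position i corresponds to the integer i + 1.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def decode(encoded, alphabet=AMINO_ACIDS):
    """Hypothetical helper: map integers back to residues, skipping 0 (padding)."""
    return "".join(alphabet[i - 1] for i in encoded if i != 0)


encoded = [1, 15, 9, 10, 20, 13, 6, 13, 6, 4, 4, 15, 12, 9]
decoded = decode(encoded)
print(decoded)  # ARKLYPGPGEERNK
```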
Below is an example using multiple sequences and padding. If padding=True, sequences of unequal length are padded with trailing zeros to the length of the longest sequence in the dataset.
>>> from protlearn.preprocessing import integer_encode
>>> seqs = ['ARKLY', 'EERNPAA', 'QEPGPGLLLK']
>>> enc, aa = integer_encode(seqs, padding=True)
>>> enc
array([[ 1, 15,  9, 10, 20,  0,  0,  0,  0,  0],
       [ 4,  4, 15, 12, 13,  1,  1,  0,  0,  0],
       [14,  4, 13,  6, 13,  6, 10, 10, 10,  9]])
>>> aa
'ACDEFGHIKLMNPQRSTVWY'
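To make the mapping and the padding rule concrete, here is a rough pure-Python sketch of the same idea. It is not protlearn's implementation: `integer_encode_sketch` is a hypothetical name, it returns nested lists rather than an ndarray, and it assumes input sequences contain only the 20 uppercase letters of the lookup string.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# Integers start at 1 so that 0 stays free for padding.
AA_TO_INT = {aa: i for i, aa in enumerate(AMINO_ACIDS, start=1)}


def integer_encode_sketch(seqs, padding=False):
    """Illustrative only: encode each residue, optionally pad with trailing zeros."""
    encoded = [[AA_TO_INT[aa] for aa in seq] for seq in seqs]
    if padding:
        max_len = max((len(row) for row in encoded), default=0)
        encoded = [row + [0] * (max_len - len(row)) for row in encoded]
    return encoded


padded = integer_encode_sketch(["ARKLY", "EERNPAA", "QEPGPGLLLK"], padding=True)
for row in padded:
    print(row)
# [1, 15, 9, 10, 20, 0, 0, 0, 0, 0]
# [4, 4, 15, 12, 13, 1, 1, 0, 0, 0]
# [14, 4, 13, 6, 13, 6, 10, 10, 10, 9]
```

Appending the zeros after the sequence keeps each residue at its original position index.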
onehot_encode¶
protlearn.preprocessing.onehot_encode(X)
One-hot encoding.
This function converts amino acid sequences into their corresponding one-hot encoded representations. Sequences are padded with zeros to the maximum sequence length, so the final output has shape (n_samples, max_len, 20), where 20 is the number of natural amino acids.
Parameters¶
- X: string, fasta, or a list thereof
- Dataset of amino acid sequences.
Returns¶
- enc: ndarray of shape (n_samples, max_len, 20)
- Contains the one-hot-encoded amino acid sequences.
Examples¶
>>> from protlearn.preprocessing import onehot_encode
>>> seqs = ['ARKLY', 'EERNPAA', 'QEPGPGLLLK']
>>> enc = onehot_encode(seqs, padding=True)
>>> enc.shape
(3, 10, 20)
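The output shape can be reproduced with a short pure-Python sketch. This is an illustration under assumptions rather than the library's code: `onehot_encode_sketch` is a hypothetical name, the column order is assumed to follow the alphabetical one-letter order 'ACDEFGHIKLMNPQRSTVWY', and padded positions are assumed to be all-zero rows.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed column order


def onehot_encode_sketch(seqs):
    """Illustrative only: nested lists of shape (n_samples, max_len, 20)."""
    max_len = max(len(seq) for seq in seqs)
    encoded = []
    for seq in seqs:
        rows = []
        for pos in range(max_len):
            row = [0] * len(AMINO_ACIDS)
            if pos < len(seq):
                # Set the single column that matches this residue.
                row[AMINO_ACIDS.index(seq[pos])] = 1
            # Positions past the end of the sequence stay all zeros (padding).
            rows.append(row)
        encoded.append(rows)
    return encoded


enc = onehot_encode_sketch(["ARKLY", "EERNPAA", "QEPGPGLLLK"])
shape = (len(enc), len(enc[0]), len(enc[0][0]))
print(shape)  # (3, 10, 20)
```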
remove_duplicates¶
protlearn.preprocessing.remove_duplicates(X, *, verbose=1)
Remove duplicate sequences.
This function detects and removes duplicate sequences from the dataset.
Parameters¶
- X: string, fasta, or a list thereof
- Dataset of amino acid sequences.
- verbose: int, default=1
- 0 : no information on duplicates is printed
1 : prints number of duplicates removed
2 : prints duplicate sequences and number of times present
Returns¶
- Y: list of length n_samples minus the number of duplicates
- Dataset containing only unique sequences.
Examples¶
>>> from protlearn.preprocessing import remove_duplicates
>>> seqs = ['ARKLY', 'EERNPAA', 'ARKLY', 'QEPGPGLLLK']
>>> seqs = remove_duplicates(seqs)
>>> seqs
['EERNPAA', 'QEPGPGLLLK', 'ARKLY']
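For comparison, duplicate detection can be sketched with the standard library alone. This is not protlearn's implementation; `remove_duplicates_sketch` is a hypothetical helper. Unlike the output above, it keeps sequences in first-occurrence order, so the ordering of its result differs from what the library returns in this example.

```python
from collections import Counter


def remove_duplicates_sketch(seqs):
    """Illustrative only: return unique sequences and a count of the repeated ones."""
    counts = Counter(seqs)
    # Sequences seen more than once, similar in spirit to the verbose=2 report.
    duplicates = {seq: n for seq, n in counts.items() if n > 1}
    # dict.fromkeys drops repeats while preserving first-occurrence order.
    unique = list(dict.fromkeys(seqs))
    return unique, duplicates


unique, duplicates = remove_duplicates_sketch(
    ["ARKLY", "EERNPAA", "ARKLY", "QEPGPGLLLK"]
)
print(unique)      # ['ARKLY', 'EERNPAA', 'QEPGPGLLLK']
print(duplicates)  # {'ARKLY': 2}
```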
remove_unnatural¶
protlearn.preprocessing.remove_unnatural(X)
Remove sequences containing unnatural amino acids.
This function removes sequences containing amino acids other than the 20 natural ones.
Parameters¶
- X: string, fasta, or a list thereof
- Dataset of amino acid sequences.
Returns¶
- Y: list of length n_samples minus the number of sequences containing unnatural amino acids
- Dataset containing only sequences composed entirely of natural amino acids.
Examples¶
>>> from protlearn.preprocessing import remove_unnatural
>>> seqs = ['ARKLY', 'EERNPJAB', 'QEPGPGLLLK']
>>> seqs = remove_unnatural(seqs)
>>> seqs
['ARKLY', 'QEPGPGLLLK']
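The filtering rule amounts to a set-membership test on each sequence. A minimal sketch, assuming the 20 natural amino acids are given as uppercase one-letter codes; `remove_unnatural_sketch` is a hypothetical name and not the library function.

```python
NATURAL = set("ACDEFGHIKLMNPQRSTVWY")


def remove_unnatural_sketch(seqs):
    """Illustrative only: keep sequences whose residues are all in NATURAL."""
    return [seq for seq in seqs if set(seq) <= NATURAL]


kept = remove_unnatural_sketch(["ARKLY", "EERNPJAB", "QEPGPGLLLK"])
print(kept)  # ['ARKLY', 'QEPGPGLLLK']
```

Note that this sketch is case-sensitive, so a lowercase sequence would be dropped as well.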