EventStream.data.vocabulary module

A vocabulary class for easy management of categorical data element options.

class EventStream.data.vocabulary.Vocabulary(vocabulary: list[str | T] | None = None, obs_frequencies: ndarray | list[float] | None = None)

Bases: Generic[T]

Stores a vocabulary of observed elements of type VOCAB_ELEMENT ordered by frequency.

This class represents a vocabulary of observed elements of specifiable type VOCAB_ELEMENT. All vocabularies include an “unknown” option, codified as the string 'UNK'. Upon construction, the vocabulary is sorted in order of decreasing frequency. The vocabulary can also be described for a text-based visual representation of the contained elements and their relative frequency distribution. Vocabulary elements can be arbitrary types _except_ for integers.
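The ordering described above can be sketched in plain Python. This is a minimal illustration, not the library's implementation: the helper name `normalize_and_sort` is hypothetical, and adding a zero-frequency 'UNK' when none is supplied is an assumption made for this sketch.

```python
UNK = "UNK"


def normalize_and_sort(vocabulary, obs_frequencies):
    """Hypothetical helper: put UNK first, then sort by decreasing frequency."""
    total = sum(obs_frequencies)
    # Convert raw counts (or unnormalized weights) into proportions.
    freq = {el: f / total for el, f in zip(vocabulary, obs_frequencies)}
    # Assumption: a missing UNK entry is added with frequency 0.
    freq.setdefault(UNK, 0.0)
    rest = sorted((el for el in freq if el != UNK), key=lambda el: -freq[el])
    ordered = [UNK] + rest
    return ordered, [freq[el] for el in ordered]


vocab, freqs = normalize_and_sort(["apple", "banana", "UNK"], [3, 5, 2])
print(vocab)  # ['UNK', 'banana', 'apple']
print(freqs)  # [0.2, 0.5, 0.3]
```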

vocabulary

The vocabulary, stored as a plain list, beginning with ‘UNK’ and subsequently proceeding in order of most frequently observed to least frequently observed.

Type:

list[str | EventStream.data.vocabulary.T] | None

obs_frequencies

The observed frequencies of elements of the vocabulary, stored as a plain list.

Type:

numpy.ndarray | list[float] | None

element_types

A set of the types of elements that are allowed in this vocabulary.

Raises:

ValueError – If the vocabulary is empty, contains duplicates, contains integer elements, or differs in length from the passed observation frequencies.
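The four failure conditions can be illustrated with a standalone check. The function name `validate_vocabulary` and the order of the checks are assumptions made for this sketch; the messages are copied from the doctest examples in this section.

```python
def validate_vocabulary(vocabulary, obs_frequencies):
    """Hypothetical validator mirroring the documented ValueError conditions."""
    if len(vocabulary) == 0:
        raise ValueError("Empty vocabularies are not supported.")
    if len(vocabulary) != len(obs_frequencies):
        raise ValueError(
            "self.vocabulary and self.obs_frequencies must have the same length. "
            f"Got {len(vocabulary)} and {len(obs_frequencies)}."
        )
    if len(set(vocabulary)) != len(vocabulary):
        raise ValueError("Vocabulary has duplicates.")
    # Note: in Python, bool is a subclass of int, so True/False are rejected too.
    if any(isinstance(el, int) for el in vocabulary):
        raise ValueError("Integer elements in the vocabulary are not supported.")


bad_inputs = [
    ([], []),
    (["apple"], [1, 2]),
    (["apple", "apple"], [1, 2]),
    (["apple", 1], [1, 2]),
]
for vocabulary, obs_frequencies in bad_inputs:
    try:
        validate_vocabulary(vocabulary, obs_frequencies)
    except ValueError as err:
        print(f"rejected: {err}")
```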

Examples

>>> vocab = Vocabulary(vocabulary=['apple', 'banana', 'UNK'], obs_frequencies=[3, 5, 2])
>>> vocab.vocabulary
['UNK', 'banana', 'apple']
>>> vocab.obs_frequencies
[0.2, 0.5, 0.3]
>>> len(vocab)
3
>>> vocab = Vocabulary(vocabulary=[], obs_frequencies=[])
Traceback (most recent call last):
    ...
ValueError: Empty vocabularies are not supported.
>>> vocab = Vocabulary(vocabulary=['apple'], obs_frequencies=[1, 2])
Traceback (most recent call last):
    ...
ValueError: self.vocabulary and self.obs_frequencies must have the same length. Got 1 and 2.
>>> vocab = Vocabulary(vocabulary=['apple', 'apple'], obs_frequencies=[1, 2])
Traceback (most recent call last):
    ...
ValueError: Vocabulary has duplicates. len(self.vocabulary) = 2, but len(set(self.vocabulary)) = 1.
>>> vocab = Vocabulary(vocabulary=['apple', 1], obs_frequencies=[1, 2])
Traceback (most recent call last):
    ...
ValueError: Integer elements in the vocabulary are not supported.
describe(line_width: int = 60, wrap_lines: bool = True, n_head: int = 3, n_tail: int = 2, stream: TextIOBase | None = None) → int | None

Prints a text-based visual representation of the vocabulary, or writes it to a stream.

This lists the head and tail of the vocabulary and also produces a sparkline representation of the relative frequency distribution of the observed vocabulary elements. UNK is skipped in the printed head and tail elements. If the vocabulary contains more elements than are printed, an ellipsis denotes the skipped elements.
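A sparkline of this kind can be sketched by mapping each frequency onto one of eight Unicode block characters. The `sparkline` function and its min-max scaling are assumptions made for illustration; the library may scale or bin frequencies differently.

```python
BLOCKS = "▁▂▃▄▅▆▇█"


def sparkline(freqs):
    """Hypothetical helper: one block character per frequency, min-max scaled."""
    if not freqs:
        return ""
    lo, hi = min(freqs), max(freqs)
    if hi == lo:
        # All frequencies are equal, so draw every bar at full height.
        return BLOCKS[-1] * len(freqs)
    scale = (len(BLOCKS) - 1) / (hi - lo)
    # Clamp the index to guard against floating point overshoot.
    return "".join(
        BLOCKS[min(len(BLOCKS) - 1, round((f - lo) * scale))] for f in freqs
    )


# Frequencies of the non-UNK elements, in decreasing order.
print(sparkline([0.4, 0.3, 0.1]))  # █▆▁
```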

Parameters:
line_width: int = 60

The maximum width of each line in the description.

wrap_lines: bool = True

Whether to wrap lines that exceed the line_width.

n_head: int = 3

The number of high-frequency elements to include in the description.

n_tail: int = 2

The number of low-frequency elements to include in the description.

stream: TextIOBase | None = None

The stream to write the description to. If None, the description is printed to stdout.

Returns:

The number of characters written to the stream if a stream was provided, otherwise None.
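The difference between wrapping and truncating a line, and the return-value convention for `stream`, can be sketched with the standard library. The helpers `render_line` and `emit` are hypothetical, and the use of `textwrap` here is an assumption rather than a statement about the library's internals.

```python
import io
import textwrap


def render_line(line, line_width=60, wrap_lines=True):
    """Hypothetical helper: wrap a long line, or truncate it with ' [...]'."""
    if wrap_lines:
        return textwrap.wrap(line, width=line_width)
    return [textwrap.shorten(line, width=line_width)]


def emit(text, stream=None):
    """Hypothetical helper: print to stdout, or write to a stream and return the count."""
    if stream is None:
        print(text)
        return None
    return stream.write(text)


header = "4 elements, 20.0% UNKs"
print(render_line(header, line_width=10, wrap_lines=True))
# ['4', 'elements,', '20.0% UNKs']
print(render_line(header, line_width=10, wrap_lines=False))
# ['4 [...]']

buffer = io.StringIO()
n_chars = emit(header, stream=buffer)
print(n_chars)  # 22
```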

Example

>>> vocab = Vocabulary(
...     vocabulary=['apple', 'banana', 'pear', 'UNK'],
...     obs_frequencies=[3, 4, 1, 2],
... )
>>> vocab.describe(n_head=2, n_tail=1, wrap_lines=False)
4 elements, 20.0% UNKs
Frequencies: █▆▁
Elements:
  (40.0%) banana
  (30.0%) apple
  (10.0%) pear
>>> vocab.describe(n_head=1, n_tail=0, wrap_lines=False)
4 elements, 20.0% UNKs
Frequencies: █▆▁
Examples:
  (40.0%) banana
  ...
>>> vocab.describe(n_head=1, n_tail=0, wrap_lines=False, line_width=10)
4 [...]
[...]
Examples:
  [...]
  ...
>>> vocab.describe(n_head=1, n_tail=0, wrap_lines=True, line_width=10)
4
elements,
20.0% UNKs
Frequencie
s:
Examples:
  (40.0%)
  banana
  ...
filter(total_observations: int | None, min_valid_element_freq: int | float | None) → Vocabulary

Filters the vocabulary elements to only those occurring sufficiently often.

Filters out infrequent elements from the vocabulary, folding the frequency of each dropped element into the UNK element. The cutoff frequency can be specified either as an integer count or as a floating point proportion. A count is converted to a proportion via total_observations, because the internal observed frequency list stores proportions, not counts. UNK is retained even if its frequency in the original vocabulary is below the cutoff, as it is the destination element for filtered elements, and its output frequency is updated accordingly.

Parameters:
total_observations: int | None

The total number of observations of vocabulary elements. Used to convert an integer min_valid_element_freq into a proportion.

min_valid_element_freq: int | float | None

The minimum frequency an element must have been observed with to be retained, given either as a count or as a proportion. If None, no elements are filtered.

Raises:

ValueError – If min_valid_element_freq is neither a positive integer nor a floating point number between 0 and 1.
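The filtering rule can be sketched as a standalone function. This is an illustration under stated assumptions, not the library's code: `filter_vocab` is a hypothetical name, it returns new lists rather than modifying a vocabulary in place, elements exactly at the cutoff are kept, and integer cutoffs must exceed 1 (following the wording of the error message shown in the doctests).

```python
UNK = "UNK"


def filter_vocab(vocabulary, obs_frequencies, total_observations, min_valid_element_freq):
    """Hypothetical helper: fold elements below the cutoff into UNK."""
    cutoff = min_valid_element_freq
    if cutoff is None:
        return list(vocabulary), list(obs_frequencies)
    is_int = isinstance(cutoff, int) and not isinstance(cutoff, bool)
    if is_int and cutoff > 1:
        # Counts are converted to proportions, since frequencies are proportions.
        threshold = cutoff / total_observations
    elif isinstance(cutoff, float) and 0 < cutoff < 1:
        threshold = cutoff
    else:
        raise ValueError(
            "Can only filter vocabularies by floats in (0, 1) or ints > 1; "
            f"got {type(cutoff)} {cutoff}"
        )
    kept, kept_freqs = [UNK], [0.0]
    for element, freq in zip(vocabulary, obs_frequencies):
        if element == UNK or freq < threshold:
            kept_freqs[0] += freq  # UNK absorbs its own and all dropped mass
        else:
            kept.append(element)
            kept_freqs.append(freq)
    return kept, kept_freqs


# Input is already ordered as a constructed vocabulary would be.
print(filter_vocab(["UNK", "apple", "banana"], [0.2, 0.5, 0.3], 10, 0.4))
# (['UNK', 'apple'], [0.5, 0.5])
print(filter_vocab(["UNK", "apple", "banana"], [0.2, 0.5, 0.3], 10, 4))
# (['UNK', 'apple'], [0.5, 0.5])
```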

Example

>>> vocab = Vocabulary(vocabulary=['apple', 'banana', 'UNK'], obs_frequencies=[5, 3, 2])
>>> vocab.filter(total_observations=10, min_valid_element_freq=0.4)
>>> vocab.vocabulary
['UNK', 'apple']
>>> vocab.obs_frequencies
[0.5, 0.5]
>>> vocab = Vocabulary(vocabulary=['apple', 'banana', 'UNK'], obs_frequencies=[5, 3, 2])
>>> vocab.filter(total_observations=10, min_valid_element_freq=4)
>>> vocab.vocabulary
['UNK', 'apple']
>>> vocab.obs_frequencies
[0.5, 0.5]
>>> vocab = Vocabulary(vocabulary=['apple', 'banana', 'UNK'], obs_frequencies=[5, 3, 2])
>>> vocab.filter(total_observations=10, min_valid_element_freq=None)
>>> vocab.vocabulary
['UNK', 'apple', 'banana']
>>> vocab.filter(total_observations=10, min_valid_element_freq=1.02)
Traceback (most recent call last):
    ...
ValueError: Can only filter vocabularies by floats in (0, 1) or ints > 1; got <class 'float'> 1.02
>>> vocab.filter(total_observations=10, min_valid_element_freq="0.02")
Traceback (most recent call last):
    ...
ValueError: Can only filter vocabularies by floats in (0, 1) or ints > 1; got <class 'str'> 0.02
>>> vocab.filter(total_observations=10, min_valid_element_freq=0)
Traceback (most recent call last):
    ...
ValueError: Can only filter vocabularies by floats in (0, 1) or ints > 1; got <class 'int'> 0
property idxmap : dict[T, int]

Returns a mapping from vocab element to vocabulary integer index.

Returns:

Dictionary mapping vocabulary elements to their index.
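Such a mapping follows directly from the ordered vocabulary list. The sketch below builds it with `enumerate`; the `encode` helper, which falls back to the UNK index for unseen elements, is a hypothetical usage pattern and not part of the documented API.

```python
vocabulary = ["UNK", "banana", "apple"]

# Position in the frequency-ordered list doubles as the integer index.
idxmap = {element: index for index, element in enumerate(vocabulary)}


def encode(elements):
    """Hypothetical helper: map elements to indices, sending unseen ones to UNK."""
    return [idxmap.get(element, idxmap["UNK"]) for element in elements]


print(idxmap)                     # {'UNK': 0, 'banana': 1, 'apple': 2}
print(encode(["apple", "kiwi"]))  # [2, 0]
```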

Example

>>> vocab = Vocabulary(vocabulary=['apple', 'banana', 'UNK'], obs_frequencies=[3, 5, 2])
>>> vocab.idxmap
{'UNK': 0, 'banana': 1, 'apple': 2}
obs_frequencies : ndarray | list[float] | None = None
vocabulary : list[str | T] | None = None