deepcarskit.data.dataset¶
- class deepcarskit.data.dataset.dataset.Dataset(config)¶
Bases: object
Dataset stores the original dataset in memory. It provides many useful functions for data preprocessing, such as k-core data filtering and missing value imputation. Features are stored as pandas.DataFrame inside Dataset. General and context-aware models can use this class.
Calling the build() method processes the dataset into DataLoaders, according to EvalSetting.
- Args:
config (Config): Global configuration object.
- Attributes:
dataset_name (str): Name of this dataset.
dataset_path (str): Local file path of this dataset.
field2type (dict): Dict mapping feature name (str) to its type (FeatureType).
field2source (dict): Dict mapping feature name (str) to its source (FeatureSource). As a special case, if a feature is loaded from the argument additional_feat_suffix, its source is a str: the suffix of its local file (the same suffix written in additional_feat_suffix).
field2id_token (dict): Dict mapping feature name (str) to a np.ndarray that stores the original tokens of this feature. For example, if test is a token-like feature, token_a is remapped to 1 and token_b is remapped to 2, then field2id_token['test'] = ['[PAD]', 'token_a', 'token_b']. (Note that 0 is always PADDING for token-like features.)
field2token_id (dict): Dict mapping feature name (str) to a dict that stores the token remap table of this feature. For example, if test is a token-like feature, token_a is remapped to 1 and token_b is remapped to 2, then field2token_id['test'] = {'[PAD]': 0, 'token_a': 1, 'token_b': 2}. (Note that 0 is always PADDING for token-like features.)
field2seqlen (dict): Dict mapping feature name (str) to its sequence length (int). For sequence features, the length is either set in config or set to the max sequence length of this feature. For token and float features, the length is 1.
uid_field (str or None): The same as config['USER_ID_FIELD'].
iid_field (str or None): The same as config['ITEM_ID_FIELD'].
label_field (str or None): The same as config['LABEL_FIELD'].
time_field (str or None): The same as config['TIME_FIELD'].
inter_feat (Interaction): Internal data structure that stores the interaction features. It is loaded from the .inter file.
user_feat (Interaction or None): Internal data structure that stores the user features. It is loaded from the .user file if it exists.
item_feat (Interaction or None): Internal data structure that stores the item features. It is loaded from the .item file if it exists.
feat_name_list (list): A list containing the names (str) of all features, including additional features.
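The relationship between field2id_token and field2token_id described above can be sketched in a few lines. This is a minimal illustration, not the library's implementation: build_remap_tables is a hypothetical helper, plain lists stand in for np.ndarray, and assigning ids in order of first occurrence is an assumption.

```python
def build_remap_tables(tokens):
    """Build id->token and token->id tables, reserving id 0 for '[PAD]'."""
    id_token = ['[PAD]']
    token_id = {'[PAD]': 0}
    for tok in tokens:
        if tok not in token_id:
            # Assumption: the first occurrence of a token gets the next free id.
            token_id[tok] = len(id_token)
            id_token.append(tok)
    return id_token, token_id


# Hypothetical raw values of a token-like field named 'test'.
raw_tokens = ['token_a', 'token_b', 'token_a']

field2id_token = {}
field2token_id = {}
field2id_token['test'], field2token_id['test'] = build_remap_tables(raw_tokens)
```

The two tables are inverses of each other: field2id_token['test'][i] gives the token for id i, and field2token_id['test'][token] gives the id back.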
- property avg_actions_of_items¶
Get the average number of items’ interaction records.
- Returns:
numpy.float64: Average number of items’ interaction records.
- property avg_actions_of_user_context¶
- property avg_actions_of_users¶
Get the average number of users’ interaction records.
- Returns:
numpy.float64: Average number of users’ interaction records.
- build()¶
Process the dataset according to the evaluation setting, including Group, Order and Split. See EvalSetting for details.
- Returns:
list: List of built Dataset objects.
- copy(new_inter_feat)¶
Given a new interaction feature, return a new Dataset object whose interaction feature is updated with new_inter_feat, and all the other attributes the same.
- copy_field_property(dest_field, source_field)¶
Copy the properties of source_field to dest_field.
- Args:
dest_field (str): Destination field.
source_field (str): Source field.
- counter(field)¶
Given field, if it is a token field in inter_feat, return a counter of how many times each token occurs in inter_feat; otherwise, raise ValueError.
- Args:
field (str): Field name to get the token counter for.
- Returns:
Counter: The counter of different tokens.
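As a rough illustration of what such a counter holds, the sketch below counts a hypothetical column of item ids with collections.Counter and derives an average number of records per item from it. The column values are made up, and computing the average as the mean of the counts is an assumption about how avg_actions_of_items could be obtained.

```python
from collections import Counter

# Hypothetical item-id column taken from the interaction records.
item_ids = [1, 2, 2, 3, 2, 1]

# Number of occurrences of each distinct item id.
item_counter = Counter(item_ids)

# Mean number of interaction records per distinct item.
avg_actions = sum(item_counter.values()) / len(item_counter)
```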
- field2feats(field)¶
- fields(ftype=None, source=None)¶
Given the type and source of features, return all field names of this type and source. If ftype == None, the type of the returned fields is not restricted. If source == None, the source of the returned fields is not restricted.
- Args:
ftype (FeatureType, optional): Type of features. Defaults to None.
source (FeatureSource, optional): Source of features. Defaults to None.
- Returns:
list: List of field names.
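The filtering rule can be sketched as follows. The enums below are stand-ins defined only for this example: the FeatureType member names come from the property descriptions on this page, while the FeatureSource member names and the field names are assumptions.

```python
from enum import Enum


class FeatureType(Enum):
    """Stand-in for the library's FeatureType."""
    TOKEN = 'token'
    FLOAT = 'float'
    TOKEN_SEQ = 'token_seq'
    FLOAT_SEQ = 'float_seq'


class FeatureSource(Enum):
    """Stand-in for the library's FeatureSource; member names are assumed."""
    INTERACTION = 'inter'
    USER = 'user'
    ITEM = 'item'


# Hypothetical field tables, mirroring field2type and field2source.
field2type = {
    'user_id': FeatureType.TOKEN,
    'item_id': FeatureType.TOKEN,
    'rating': FeatureType.FLOAT,
    'age': FeatureType.FLOAT,
}
field2source = {
    'user_id': FeatureSource.INTERACTION,
    'item_id': FeatureSource.INTERACTION,
    'rating': FeatureSource.INTERACTION,
    'age': FeatureSource.USER,
}


def fields(ftype=None, source=None):
    """Return field names matching ftype and source; None means unrestricted."""
    result = []
    for name in field2type:
        if ftype is not None and field2type[name] != ftype:
            continue
        if source is not None and field2source[name] != source:
            continue
        result.append(name)
    return result
```

With these tables, fields(ftype=FeatureType.FLOAT) keeps both float fields, while adding source=FeatureSource.USER narrows the result to the user-side one.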
- property float_like_fields¶
Get fields of type FLOAT and FLOAT_SEQ.
- Returns:
list: List of field names.
- get_item_feature()¶
- Returns:
Interaction: item features
- get_preload_weight(field)¶
Get the preloaded weight matrix, whose rows are sorted by token ids. 0 is used as padding.
- Args:
field (str): Preloaded feature field name.
- Returns:
numpy.ndarray: Preloaded weight matrix. See ../user_guide/config/data_settings for details.
- get_user_feature()¶
- Returns:
Interaction: user features
- history_item_matrix(value_field=None)¶
Get dense matrices describing users' history interaction records.
history_matrix[i] represents the item_id values that user i has interacted with.
history_value[i] represents the values of user i's history interaction records, 0 if value_field = None.
history_len[i] represents the number of user i's history interaction records.
0 is used as padding.
- Args:
value_field (str, optional): Data of matrix, which should exist in self.inter_feat. Defaults to None.
- Returns:
tuple:
History matrix (torch.Tensor): history_matrix described above.
History values matrix (torch.Tensor): history_value described above.
History length matrix (torch.Tensor): history_len described above.
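The three return values can be illustrated with a small pure-Python sketch. It follows the description above rather than the library's code: nested lists stand in for torch.Tensor, build_history is a hypothetical helper, and the matrix width is assumed to be the longest history.

```python
def build_history(user_ids, item_ids, user_num, values=None):
    """Build padded per-user history lists from parallel id columns."""
    # Number of records per user; id 0 is the padding user and stays at 0.
    history_len = [0] * user_num
    for u in user_ids:
        history_len[u] += 1

    # Assumption: the matrix is as wide as the longest history.
    width = max(history_len, default=0)
    history_matrix = [[0] * width for _ in range(user_num)]
    history_value = [[0] * width for _ in range(user_num)]

    filled = [0] * user_num  # next free column for each user
    for idx, (u, i) in enumerate(zip(user_ids, item_ids)):
        col = filled[u]
        history_matrix[u][col] = i
        if values is not None:
            history_value[u][col] = values[idx]
        filled[u] += 1
    return history_matrix, history_value, history_len


# Hypothetical interactions: user 1 saw items 3 and 4, user 2 saw item 3.
matrix, value, length = build_history([1, 1, 2], [3, 4, 3], user_num=3,
                                      values=[5.0, 3.0, 4.0])
```

history_user_matrix is the mirror image: swap the roles of the user and item columns and size the result by the number of items.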
- history_user_matrix(value_field=None)¶
Get dense matrices describing items' history interaction records.
history_matrix[i] represents the user_id values that have interacted with item i.
history_value[i] represents the values of item i's history interaction records, 0 if value_field = None.
history_len[i] represents the number of item i's history interaction records.
0 is used as padding.
- Args:
value_field (str, optional): Data of matrix, which should exist in self.inter_feat. Defaults to None.
- Returns:
tuple:
History matrix (torch.Tensor): history_matrix described above.
History values matrix (torch.Tensor): history_value described above.
History length matrix (torch.Tensor): history_len described above.
- id2token(field, ids)¶
Map internal ids to external tokens.
- Args:
field (str): Field of internal ids.
ids (int, list, numpy.ndarray or torch.Tensor): Internal ids.
- Returns:
str or numpy.ndarray: The external tokens of internal ids.
- inter_matrix(form='coo', value_field=None)¶
Get a sparse matrix that describes interactions between user_id and item_id.
Sparse matrix has shape (user_num, item_num).
For a row of <src, tgt>, matrix[src, tgt] = 1 if value_field is None, else matrix[src, tgt] = self.inter_feat[src, tgt].
- Args:
form (str, optional): Sparse matrix format. Defaults to coo.
value_field (str, optional): Data of sparse matrix, which should exist in df_feat. Defaults to None.
- Returns:
scipy.sparse: Sparse matrix in form coo or csr.
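The coo form stores one (row, column, value) triplet per interaction. The sketch below builds those triplets in plain Python and expands them to a dense nested list for inspection. Both helpers are hypothetical; scipy.sparse.coo_matrix accepts the same (data, (row, col)) triplets together with a shape.

```python
def inter_triplets(user_ids, item_ids, values=None):
    """Return COO-style (rows, cols, data); data is all ones without values."""
    if values is None:
        values = [1] * len(user_ids)
    return list(user_ids), list(item_ids), list(values)


def to_dense(rows, cols, data, shape):
    """Expand COO triplets into a dense nested list for inspection."""
    dense = [[0] * shape[1] for _ in range(shape[0])]
    for r, c, v in zip(rows, cols, data):
        dense[r][c] += v  # duplicate coordinates are summed, as in COO
    return dense


# Hypothetical interactions; row and column 0 are the padding ids.
users = [1, 1, 2]
items = [1, 2, 1]

binary = to_dense(*inter_triplets(users, items), shape=(3, 3))
rated = to_dense(*inter_triplets(users, items, [5, 3, 4]), shape=(3, 3))
```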
- property inter_num¶
Get the number of interaction records.
- Returns:
int: Number of interaction records.
- property item_counter¶
Get a counter of how many times each item occurs in inter_feat.
- Returns:
Counter: The counter of different items.
- property item_num¶
Get the number of different tokens of self.iid_field.
- Returns:
int: Number of different tokens of self.iid_field.
- join(df)¶
Given an interaction feature, join the user/item features into it.
- Args:
df (Interaction): Interaction feature to be joined.
- Returns:
Interaction: Interaction feature after joining operation.
- leave_one_out(group_by, leave_one_mode)¶
Split interaction records by the leave-one-out strategy.
- Args:
group_by (str): Field name that interaction records should be grouped by before splitting.
leave_one_mode (str): The way to leave one out. It can only take three values: ‘valid_and_test’, ‘valid_only’ and ‘test_only’.
- Returns:
list: List of Dataset objects whose interaction features have been split.
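A leave-one-out split can be sketched on record indices as below. This is an illustration under stated assumptions, not the library's code: leave_one_out_indices is a hypothetical helper, the last record of each group is assumed to be the one held out (with the second to last used for validation in ‘valid_and_test’), and groups with too few records are assumed to stay entirely in training.

```python
from collections import defaultdict


def leave_one_out_indices(records, group_by, leave_one_mode):
    """Split record indices into (train, valid, test) lists per group."""
    if leave_one_mode not in ('valid_and_test', 'valid_only', 'test_only'):
        raise ValueError(f'unknown leave_one_mode: {leave_one_mode!r}')

    # Group record indices by the chosen field, keeping the original order.
    groups = defaultdict(list)
    for idx, rec in enumerate(records):
        groups[rec[group_by]].append(idx)

    train, valid, test = [], [], []
    for idxs in groups.values():
        if leave_one_mode == 'valid_and_test' and len(idxs) >= 2:
            train += idxs[:-2]
            valid.append(idxs[-2])
            test.append(idxs[-1])
        elif leave_one_mode == 'valid_only':
            train += idxs[:-1]
            valid.append(idxs[-1])
        elif leave_one_mode == 'test_only':
            train += idxs[:-1]
            test.append(idxs[-1])
        else:
            # Assumption: a group too small to hold out two records stays in train.
            train += idxs
    return train, valid, test


# Hypothetical interaction records.
records = [
    {'user': 1, 'item': 10}, {'user': 1, 'item': 11}, {'user': 1, 'item': 12},
    {'user': 2, 'item': 10}, {'user': 2, 'item': 13},
]
both = leave_one_out_indices(records, 'user', 'valid_and_test')
test_only = leave_one_out_indices(records, 'user', 'test_only')
```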
- property non_seq_fields¶
Get fields of type TOKEN and FLOAT.
- Returns:
list: List of field names.
- num(field)¶
Given field, return the number of different tokens after remapping for token-like fields, or 1 for float-like fields.
- Args:
field (str): Field name to get the token number for.
- Returns:
int: The number of different tokens (1 if field is a float-like field).
- property seq_fields¶
Get fields of type TOKEN_SEQ and FLOAT_SEQ.
- Returns:
list: List of field names.
- set_field_property(field, field_type, field_source, field_seqlen)¶
Set a new field’s properties.
- Args:
field (str): Name of the new field.
field_type (FeatureType): Type of the new field.
field_source (FeatureSource): Source of the new field.
field_seqlen (int): Max length of the sequence in field. 1 if field’s type is not sequence-like.
- shuffle()¶
Shuffle the interaction records in place.
- sort(by, ascending=True)¶
Sort the interaction records in place.
- Args:
by (str or list of str): Field or fields used as the sort key.
ascending (bool or list of bool, optional): Results are ascending if True, otherwise descending. Defaults to True.
- property sparsity¶
Get the sparsity of this dataset.
- Returns:
float: Sparsity of this dataset.
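The page does not state the formula. A common definition, assumed here, is one minus the fraction of user-item pairs that have an interaction record. Whether the padding id is counted in user_num and item_num is also not specified, so the sketch simply takes the three counts as given.

```python
def sparsity(inter_num, user_num, item_num):
    """Assumed definition: share of the user-item matrix with no record."""
    return 1 - inter_num / (user_num * item_num)


# Hypothetical counts: 5 records over a 3 x 4 user-item grid.
example = sparsity(inter_num=5, user_num=3, item_num=4)
```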
- split_by_folds(folds, group_by=None)¶
Split interaction records into folds.
- Args:
folds (int): Number of folds.
group_by (str, optional): Field name that interaction records should be grouped by before splitting. Defaults to None.
- Returns:
list: List of Dataset objects whose interaction features have been split.
- Note:
Other than the first one, each part is rounded down.
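One way to read the note is that every fold except the first gets total // folds records and the first fold absorbs the remainder. The sketch below applies that reading to a list of record indices and forms one (train, test) pair per fold. Both helpers are hypothetical, and pairing each fold with the union of the others is an assumption about how the folds are consumed.

```python
def fold_sizes(total, folds):
    """All folds but the first are rounded down; the first takes the rest."""
    base = total // folds
    return [total - base * (folds - 1)] + [base] * (folds - 1)


def fold_pairs(indices, folds):
    """Partition indices into folds and return one (train, test) pair per fold."""
    parts, start = [], 0
    for size in fold_sizes(len(indices), folds):
        parts.append(indices[start:start + size])
        start += size

    pairs = []
    for k in range(folds):
        test = parts[k]
        # Training data for fold k is every record outside fold k.
        train = [i for j, part in enumerate(parts) if j != k for i in part]
        pairs.append((train, test))
    return pairs


# Seven hypothetical record indices split into three folds.
pairs = fold_pairs(list(range(7)), folds=3)
```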
- split_by_ratio(ratios, group_by=None)¶
Split interaction records by ratios.
- Args:
ratios (list): List of split ratios. No need to be normalized.
group_by (str, optional): Field name that interaction records should be grouped by before splitting. Defaults to None.
- Returns:
list: List of Dataset objects whose interaction features have been split.
- Note:
Other than the first one, each part is rounded down.
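The rounding rule in the note can be made concrete with a short sketch: normalize the ratios, round every part except the first down, and give the first part whatever remains. split_counts and split_indices are hypothetical helpers, and any further adjustment the library may apply (for example to avoid empty parts) is not modeled.

```python
def split_counts(total, ratios):
    """Part sizes for unnormalized ratios; all but the first are rounded down."""
    norm = sum(ratios)
    counts = [int(total * r / norm) for r in ratios]
    counts[0] = total - sum(counts[1:])  # the first part takes the remainder
    return counts


def split_indices(indices, ratios):
    """Cut a list of record indices into consecutive parts of those sizes."""
    parts, start = [], 0
    for size in split_counts(len(indices), ratios):
        parts.append(indices[start:start + size])
        start += size
    return parts


# Eleven hypothetical records split 8:1:1.
train, valid, test = split_indices(list(range(11)), [8, 1, 1])
```

Because only the later parts are rounded down, the part sizes always sum to the number of records and no record is dropped.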
- token2id(field, tokens)¶
Map external tokens to internal ids.
- Args:
field (str): Field of external tokens.
tokens (str, list or numpy.ndarray): External tokens.
- Returns:
int or numpy.ndarray: The internal ids of external tokens.
- property token_like_fields¶
Get fields of type TOKEN and TOKEN_SEQ.
- Returns:
list: List of field names.
- unique(field)¶
- property user_context_num¶
- property user_counter¶
Get a counter of how many times each user occurs in inter_feat.
- Returns:
Counter: The counter of different users.
- property user_num¶
Get the number of different tokens of self.uid_field.
- Returns:
int: Number of different tokens of self.uid_field.