deepcarskit.data.dataset¶
- class deepcarskit.data.dataset.dataset.Dataset(config)¶
- Bases: object
- Dataset stores the original dataset in memory. It provides many useful functions for data preprocessing, such as k-core data filtering and missing value imputation. Features are stored as pandas.DataFrame inside Dataset. General and context-aware models can use this class.
- Calling the build() method processes the dataset into DataLoaders, according to EvalSetting.
- Args:
- config (Config): Global configuration object. 
- Attributes:
- dataset_name (str): Name of this dataset.
- dataset_path (str): Local file path of this dataset.
- field2type (dict): Dict mapping feature name (str) to its type (FeatureType).
- field2source (dict): Dict mapping feature name (str) to its source (FeatureSource). As a special case, if a feature is loaded from Arg additional_feat_suffix, its source is a str: the suffix of its local file (the same suffix written in Arg additional_feat_suffix).
- field2id_token (dict): Dict mapping feature name (str) to a np.ndarray, which stores the original tokens of this feature. For example, if test is a token-like feature, token_a is remapped to 1 and token_b is remapped to 2, then field2id_token['test'] = ['[PAD]', 'token_a', 'token_b']. (Note that 0 is always PADDING for token-like features.)
- field2token_id (dict): Dict mapping feature name (str) to a dict, which stores the token remap table of this feature. For example, if test is a token-like feature, token_a is remapped to 1 and token_b is remapped to 2, then field2token_id['test'] = {'[PAD]': 0, 'token_a': 1, 'token_b': 2}. (Note that 0 is always PADDING for token-like features.)
- field2seqlen (dict): Dict mapping feature name (str) to its sequence length (int). For sequence features, the length is either set in config or set to the max sequence length of the feature. For token and float features, the length is 1.
- uid_field (str or None): The same as config['USER_ID_FIELD'].
- iid_field (str or None): The same as config['ITEM_ID_FIELD'].
- label_field (str or None): The same as config['LABEL_FIELD'].
- time_field (str or None): The same as config['TIME_FIELD'].
- inter_feat (Interaction): Internal data structure that stores the interaction features. It is loaded from the .inter file.
- user_feat (Interaction or None): Internal data structure that stores the user features. It is loaded from the .user file if that file exists.
- item_feat (Interaction or None): Internal data structure that stores the item features. It is loaded from the .item file if that file exists.
- feat_name_list (list): A list of all feature names (str), including additional features.
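The relationship between field2id_token and field2token_id can be sketched in plain Python. This is an illustration only, not the library's implementation: it uses lists where the attribute description says np.ndarray, the raw_tokens column is made up, and assigning ids in order of first appearance is an assumption.

```python
# Hypothetical raw values of a token-like field named "test".
raw_tokens = ["token_a", "token_b", "token_a"]

# Id 0 is reserved for padding; each new token gets the next free id
# (order of first appearance is an assumption of this sketch).
id_token = ["[PAD]"]
token_id = {"[PAD]": 0}
for tok in raw_tokens:
    if tok not in token_id:
        token_id[tok] = len(id_token)
        id_token.append(tok)

field2id_token = {"test": id_token}   # internal id -> original token
field2token_id = {"test": token_id}   # original token -> internal id

# Remap the raw column to ids, then map the ids back to tokens.
ids = [field2token_id["test"][t] for t in raw_tokens]
tokens = [field2id_token["test"][i] for i in ids]
```

The two tables are inverses of each other, which is what lets token2id() and id2token() round-trip between external tokens and internal ids.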
 - property avg_actions_of_items¶
- Get the average number of interaction records per item.
- Returns:
- numpy.float64: Average number of items’ interaction records. 
 
 - property avg_actions_of_user_context¶
 - property avg_actions_of_users¶
- Get the average number of interaction records per user.
- Returns:
- numpy.float64: Average number of users’ interaction records. 
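As a rough illustration of what the two averages measure, the sketch below counts records per user and per item and takes the mean of those counts. The interactions list is made up, and the library returns numpy.float64 rather than the plain numbers produced here.

```python
from collections import Counter
from statistics import mean

# Hypothetical interaction records as (user_id, item_id) pairs.
interactions = [(1, 10), (1, 11), (2, 10), (3, 12), (3, 10), (3, 11)]

# Number of records each user / each item takes part in.
user_counts = Counter(u for u, _ in interactions)
item_counts = Counter(i for _, i in interactions)

# Mean number of records per distinct user and per distinct item.
avg_actions_of_users = mean(user_counts.values())
avg_actions_of_items = mean(item_counts.values())
```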
 
 - build()¶
- Process the dataset according to the evaluation setting, including Group, Order and Split. See EvalSetting for details.
- Returns:
- list: List of built Dataset objects.
 
 - copy(new_inter_feat)¶
- Given a new interaction feature, return a new Dataset object whose interaction feature is replaced by new_inter_feat and whose other attributes are unchanged.
 - copy_field_property(dest_field, source_field)¶
- Copy properties from source_field to dest_field.
- Args:
- dest_field (str): Destination field.
- source_field (str): Source field.
 
 - counter(field)¶
- Given field, if it is a token field in inter_feat, return a counter of how many times each token occurs in inter_feat. Otherwise, raise ValueError.
- Args:
- field (str): Field name to get the token counter for.
- Returns:
- Counter: The counter of different tokens. 
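The behaviour described for counter() can be sketched with collections.Counter. The inter_feat dict and the token_fields set below are hypothetical stand-ins for the dataset's internal state, and the error message is invented for the example.

```python
from collections import Counter

# Hypothetical stand-ins: a column store and the set of token-typed fields.
inter_feat = {
    "item_id": [10, 11, 10, 12, 10],
    "rating": [5.0, 3.0, 4.0, 2.0, 1.0],
}
token_fields = {"item_id"}


def counter(field):
    """Count occurrences of each token in a token field of inter_feat."""
    if field not in inter_feat or field not in token_fields:
        # Anything that is not a token field of inter_feat is rejected.
        raise ValueError(f"field [{field}] is not a token field of inter_feat")
    return Counter(inter_feat[field])


item_counter = counter("item_id")
```

Calling counter("rating") in this sketch raises ValueError, because rating is not listed as a token field.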
 
 - field2feats(field)¶
 - fields(ftype=None, source=None)¶
- Given the type and source of features, return all field names of that type and source. If ftype == None, the type of the returned fields is not restricted. If source == None, the source of the returned fields is not restricted.
- Args:
- ftype (FeatureType, optional): Type of features. Defaults to None.
- source (FeatureSource, optional): Source of features. Defaults to None.
- Returns:
- list: List of field names. 
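The filtering rule can be sketched as a lookup over field2type and field2source. The two enums below are simplified stand-ins for the library's FeatureType and FeatureSource (the FeatureSource members in particular are assumptions), and the field tables are made up.

```python
from enum import Enum


class FeatureType(Enum):      # stand-in for the library's FeatureType
    TOKEN = "token"
    FLOAT = "float"
    TOKEN_SEQ = "token_seq"
    FLOAT_SEQ = "float_seq"


class FeatureSource(Enum):    # stand-in with three assumed members
    INTERACTION = "inter"
    USER = "user"
    ITEM = "item"


# Hypothetical field tables, in the shape of field2type / field2source.
field2type = {
    "user_id": FeatureType.TOKEN,
    "item_id": FeatureType.TOKEN,
    "rating": FeatureType.FLOAT,
    "age": FeatureType.FLOAT,
    "genre": FeatureType.TOKEN_SEQ,
}
field2source = {
    "user_id": FeatureSource.INTERACTION,
    "item_id": FeatureSource.INTERACTION,
    "rating": FeatureSource.INTERACTION,
    "age": FeatureSource.USER,
    "genre": FeatureSource.ITEM,
}


def fields(ftype=None, source=None):
    """Return field names matching ftype and source; None means unrestricted."""
    return [
        name for name in field2type
        if (ftype is None or field2type[name] == ftype)
        and (source is None or field2source[name] == source)
    ]
```

With these tables, fields(ftype=FeatureType.FLOAT) selects rating and age, and fields() with no arguments returns every field.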
 
 - property float_like_fields¶
- Get fields of type FLOAT and FLOAT_SEQ.
- Returns:
- list: List of field names. 
 
 - get_item_feature()¶
- Returns:
- Interaction: item features 
 
 - get_preload_weight(field)¶
- Get the preloaded weight matrix, whose rows are sorted by token ids. 0 is used as padding.
- Args:
- field (str): preloaded feature field name. 
- Returns:
- numpy.ndarray: preloaded weight matrix. See ../user_guide/config/data_settings for details. 
 
 - get_user_feature()¶
- Returns:
- Interaction: user features 
 
 - history_item_matrix(value_field=None)¶
- Get a dense matrix describing users’ history interaction records.
- history_matrix[i] represents the item_ids that user i has interacted with.
- history_value[i] represents the values of user i’s history interaction records, 0 if value_field = None.
- history_len[i] represents the number of user i’s history interaction records.
- 0 is used as padding.
- Args:
- value_field (str, optional): Data of matrix, which should exist in self.inter_feat. Defaults to None.
- Returns:
- tuple:
- History matrix (torch.Tensor): history_matrix described above.
- History values matrix (torch.Tensor): history_value described above.
- History length matrix (torch.Tensor): history_len described above.
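The three returned structures can be illustrated with nested lists. This is a sketch of the layout only: the library returns torch.Tensor objects, the records below are made up, and the third element of each record plays the role of a value_field.

```python
# Hypothetical interaction records (user_id, item_id, rating); id 0 is padding.
records = [(1, 2, 5.0), (1, 3, 4.0), (2, 1, 3.0)]
user_num = 3  # ids 0..2, with 0 reserved for padding

# history_len[u]: number of records of user u.
history_len = [0] * user_num
for u, _, _ in records:
    history_len[u] += 1
max_len = max(history_len)

# Rows are padded with 0 up to the longest history.
history_matrix = [[0] * max_len for _ in range(user_num)]
history_value = [[0.0] * max_len for _ in range(user_num)]

filled = [0] * user_num  # next free column per user
for u, i, v in records:
    col = filled[u]
    history_matrix[u][col] = i   # interacted item id
    history_value[u][col] = v    # value of that interaction
    filled[u] += 1
```

Row 0 stays all zeros because id 0 is the padding user, and user 2's row is padded with a trailing 0 because that user has only one record.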
 
 
 
 - history_user_matrix(value_field=None)¶
- Get a dense matrix describing items’ history interaction records.
- history_matrix[i] represents the user_ids that have interacted with item i.
- history_value[i] represents the values of item i’s history interaction records, 0 if value_field = None.
- history_len[i] represents the number of item i’s history interaction records.
- 0 is used as padding.
- Args:
- value_field (str, optional): Data of matrix, which should exist in self.inter_feat. Defaults to None.
- Returns:
- tuple:
- History matrix (torch.Tensor): history_matrix described above.
- History values matrix (torch.Tensor): history_value described above.
- History length matrix (torch.Tensor): history_len described above.
 
 
 
 - id2token(field, ids)¶
- Map internal ids to external tokens.
- Args:
- field (str): Field of internal ids.
- ids (int, list, numpy.ndarray or torch.Tensor): Internal ids.
- Returns:
- str or numpy.ndarray: The external tokens of internal ids. 
 
 - inter_matrix(form='coo', value_field=None)¶
- Get a sparse matrix that describes interactions between user_id and item_id.
- The sparse matrix has shape (user_num, item_num).
- For a row of <src, tgt>, matrix[src, tgt] = 1 if value_field is None, else matrix[src, tgt] = self.inter_feat[src, tgt].
- Args:
- form (str, optional): Sparse matrix format. Defaults to coo.
- value_field (str, optional): Data of sparse matrix, which should exist in df_feat. Defaults to None.
- Returns:
- scipy.sparse: Sparse matrix in form coo or csr.
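The COO layout behind this matrix can be sketched without SciPy as three parallel lists (row, col, data). The records and the two helper functions below are hypothetical; the same triplets are what scipy.sparse.coo_matrix((data, (row, col)), shape=...) accepts.

```python
# Hypothetical interaction records (user_id, item_id, rating).
records = [(1, 2, 5.0), (1, 3, 4.0), (2, 1, 3.0)]
user_num, item_num = 3, 4  # matrix shape is (user_num, item_num)


def coo_triplets(records, value_field=None):
    """Return parallel (row, col, data) lists, i.e. a COO representation."""
    row = [u for u, _, _ in records]
    col = [i for _, i, _ in records]
    # Without a value field every stored entry is 1; otherwise use the rating.
    data = [1 if value_field is None else r for _, _, r in records]
    return row, col, data


def to_dense(row, col, data, shape):
    """Expand COO triplets into a dense nested list, for inspection only."""
    dense = [[0] * shape[1] for _ in range(shape[0])]
    for r, c, v in zip(row, col, data):
        dense[r][c] = v
    return dense


row, col, data = coo_triplets(records)
dense = to_dense(row, col, data, (user_num, item_num))
rated = to_dense(*coo_triplets(records, value_field="rating"), (user_num, item_num))
```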
 
 - property inter_num¶
- Get the number of interaction records.
- Returns:
- int: Number of interaction records. 
 
 - property item_counter¶
- Get a counter of how many times each item occurs in inter_feat.
- Returns:
- Counter: The counter of different items. 
 
 - property item_num¶
- Get the number of different tokens of self.iid_field.
- Returns:
- int: Number of different tokens of self.iid_field.
 
 - join(df)¶
- Given an interaction feature, join the user/item features into it.
- Args:
- df (Interaction): Interaction feature to be joined.
- Returns:
- Interaction: Interaction feature after joining operation. 
 
 - leave_one_out(group_by, leave_one_mode)¶
- Split interaction records by the leave-one-out strategy.
- Args:
- group_by (str): Field name that interaction records should be grouped by before splitting.
- leave_one_mode (str): The way to leave one out. It can only take three values: ‘valid_and_test’, ‘valid_only’ and ‘test_only’.
- Returns:
- list: List of Dataset, whose interaction features have been split.
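A minimal sketch of a leave-one-out split follows, assuming that the held-out records are the last ones of each group and that records are already in the desired order. It works on plain dicts rather than Dataset objects, and the handling of very small groups is an assumption of the sketch, not documented behaviour.

```python
from collections import defaultdict


def leave_one_out(records, group_by, leave_one_mode):
    """Split records per group, holding out the last one or two of each group."""
    holdouts = {"valid_and_test": 2, "valid_only": 1, "test_only": 1}
    if leave_one_mode not in holdouts:
        raise ValueError(f"unknown leave_one_mode: {leave_one_mode}")
    n_out = holdouts[leave_one_mode]

    # Group records by the given field, keeping their original order.
    groups = defaultdict(list)
    for rec in records:
        groups[rec[group_by]].append(rec)

    parts = [[] for _ in range(n_out + 1)]  # training part first, then held-out parts
    for recs in groups.values():
        # Groups with at most n_out records contribute no training records.
        cut = max(len(recs) - n_out, 0)
        parts[0].extend(recs[:cut])
        for k, rec in enumerate(recs[cut:], start=1):
            parts[k].append(rec)
    return parts


records = (
    [{"user_id": 1, "item_id": i} for i in (10, 11, 12)]
    + [{"user_id": 2, "item_id": i} for i in (20, 21, 22, 23)]
)
train, valid, test = leave_one_out(records, "user_id", "valid_and_test")
```

With 'valid_only' or 'test_only' the sketch returns two parts instead of three, since only one record per group is held out.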
 
 - property non_seq_fields¶
- Get fields of type TOKEN and FLOAT.
- Returns:
- list: List of field names. 
 
 - num(field)¶
- Given field, return the number of different tokens after remapping for token-like fields, and 1 for float-like fields.
- Args:
- field (str): Field name to get the token number for.
- Returns:
- int: The number of different tokens (1 if field is a float-like field).
 
 - property seq_fields¶
- Get fields of type TOKEN_SEQ and FLOAT_SEQ.
- Returns:
- list: List of field names. 
 
 - set_field_property(field, field_type, field_source, field_seqlen)¶
- Set a new field’s properties.
- Args:
- field (str): Name of the new field.
- field_type (FeatureType): Type of the new field.
- field_source (FeatureSource): Source of the new field.
- field_seqlen (int): Max length of the sequence in field. 1 if field’s type is not sequence-like.
 
 - shuffle()¶
- Shuffle the interaction records in place.
 - sort(by, ascending=True)¶
- Sort the interaction records in place.
- Args:
- by (str or list of str): Field or fields used as the sort key.
- ascending (bool or list of bool, optional): Results are ascending if True, otherwise descending. Defaults to True.
 
 - property sparsity¶
- Get the sparsity of this dataset.
- Returns:
- float: Sparsity of this dataset. 
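The exact formula is not given here. A common definition of sparsity, assumed in the sketch below, is the fraction of user-item pairs without an interaction record. Whether the library counts the padding id in user_num and item_num, or uses user-context pairs instead of users, is not stated, so treat the function as an illustration only.

```python
def sparsity(inter_num, user_num, item_num):
    """Fraction of user-item pairs that have no interaction record (assumed definition)."""
    return 1 - inter_num / (user_num * item_num)


# 6 records over 3 users and 4 items fill half of the 12 possible pairs.
example = sparsity(inter_num=6, user_num=3, item_num=4)
```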
 
 - split_by_folds(folds, group_by=None)¶
- Split interaction records into folds.
- Args:
- folds (int): Number of folds.
- group_by (str, optional): Field name that interaction records should be grouped by before splitting. Defaults to None.
- Returns:
- list: List of Dataset, whose interaction features have been split.
- Note:
- Other than the first one, each part is rounded down. 
 
 - split_by_ratio(ratios, group_by=None)¶
- Split interaction records by ratios.
- Args:
- ratios (list): List of split ratios. They do not need to be normalized.
- group_by (str, optional): Field name that interaction records should be grouped by before splitting. Defaults to None.
- Returns:
- list: List of Dataset, whose interaction features have been split.
- Note:
- Other than the first one, each part is rounded down. 
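The rounding rule in the note can be sketched as follows: every part except the first is rounded down, and the first part takes whatever remains. The function works on a plain list, ignores group_by, and is a simplified illustration rather than the library's code.

```python
def split_by_ratio(records, ratios):
    """Split records into consecutive parts whose sizes follow the given ratios."""
    total = sum(ratios)                 # ratios need not be normalized
    n = len(records)
    # Round every part down, then let the first part absorb the remainder.
    sizes = [int(n * r / total) for r in ratios]
    sizes[0] = n - sum(sizes[1:])

    parts, start = [], 0
    for size in sizes:
        parts.append(records[start:start + size])
        start += size
    return parts


train, valid, test = split_by_ratio(list(range(11)), [8, 1, 1])
```

With 11 records and ratios 8:1:1, the second and third parts each get int(1.1) = 1 record and the first part gets the remaining 9.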
 
 - token2id(field, tokens)¶
- Map external tokens to internal ids.
- Args:
- field (str): Field of external tokens.
- tokens (str, list or numpy.ndarray): External tokens.
- Returns:
- int or numpy.ndarray: The internal ids of external tokens. 
 
 - property token_like_fields¶
- Get fields of type TOKEN and TOKEN_SEQ.
- Returns:
- list: List of field names. 
 
 - unique(field)¶
 - property user_context_num¶
 - property user_counter¶
- Get a counter of how many times each user occurs in inter_feat.
- Returns:
- Counter: The counter of different users. 
 
 - property user_num¶
- Get the number of different tokens of self.uid_field.
- Returns:
- int: Number of different tokens of self.uid_field.