The dataset contains sales records from the Google Merchandise Store (GStore) between 2016-08-01 and 2017-08-01,
provided by a Kaggle competition (Link here).
The dataset has 55 features, excluding the target feature, of which only 3 are numerical. Some categorical
features have more than 1000 unique values. A label encoder was used to handle these categorical features.
One-hot and hashed encoders were also tried.
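As a rough illustration of why label or hashed encoding suits high-cardinality features, the sketch below encodes a small made-up column both ways using only the Python standard library. The helper names (`fit_label_encoder`, `label_encode`, `hash_encode`), the sample city values, and the bucket count are assumptions for this example and are not taken from the project code.

```python
import hashlib

def fit_label_encoder(values):
    """Map each distinct category to an integer code, assigned in sorted order."""
    return {cat: code for code, cat in enumerate(sorted(set(values)))}

def label_encode(values, mapping, unknown=-1):
    """Replace each category with its code; unseen categories get `unknown`."""
    return [mapping.get(v, unknown) for v in values]

def hash_encode(value, n_buckets=8):
    """Hash a category into one of `n_buckets` buckets (collisions are possible)."""
    digest = hashlib.md5(value.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_buckets

# Hypothetical sample of a categorical column.
cities = ["Mountain View", "London", "New York", "London", "Tokyo"]

mapping = fit_label_encoder(cities)
codes = label_encode(cities, mapping)       # one integer column
buckets = [hash_encode(c) for c in cities]  # bounded range, no fitted mapping

print(codes)
print(buckets)
```

Both encodings keep a feature as a single column regardless of how many unique values it has, whereas a one-hot encoding would add one column per unique value.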
The dataset was processed from two different perspectives:
First, the dataset was processed per visit. There are 903,653 visit records. A correlation test and ANOVA were applied to reduce the
number of features from 55 to 20. The target feature, transaction revenue, is highly skewed: only
1.3% of the visit records contain a non-zero value. A log transformation was therefore applied to the target feature.
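The log transformation of a mostly-zero target can be sketched as follows. The text does not say how zero revenues were handled. This example assumes the common log(1 + x) form, which maps zero to zero, and the revenue values are made up for illustration.

```python
import math

# Hypothetical per-visit revenues; most visits generate no revenue.
revenues = [0, 0, 0, 25_000_000, 0, 310_000_000]

# Assumed transform: log(1 + x), so zero revenue stays at 0.0.
log_revenues = [math.log1p(r) for r in revenues]

# Share of visits with non-zero revenue in this toy sample.
nonzero_share = sum(r > 0 for r in revenues) / len(revenues)

# expm1 inverts log1p, which is useful for mapping predictions back.
recovered = [math.expm1(v) for v in log_revenues]

print(log_revenues)
print(nonzero_share)
```

The transform compresses the few very large revenues into a narrow range while leaving the zero-revenue visits at zero.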
Second, the dataset was processed per customer. Some customers visit more than once, so the sequential
information should be taken into account. There are 714,167 customers in total. Each data point is the
collection of a customer's visit records, and each element of the collection (the visit features) has the same form as a
data point in the visit-based data. Accordingly, the target feature is the log of the total revenue generated by each customer.
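A minimal sketch of the customer-based view, under stated assumptions: the field names `visitor_id`, `visit_time`, and `revenue` are hypothetical, visits are ordered by time to preserve the sequence, and log(1 + x) is assumed for the target so that customers with no revenue map to zero.

```python
import math
from collections import defaultdict

# Hypothetical visit records; the real data has many more features per visit.
visits = [
    {"visitor_id": "a", "visit_time": 2, "revenue": 0.0},
    {"visitor_id": "b", "visit_time": 1, "revenue": 0.0},
    {"visitor_id": "a", "visit_time": 1, "revenue": 30.0},
    {"visitor_id": "a", "visit_time": 3, "revenue": 20.0},
]

def group_by_customer(visits):
    """Collect each customer's visits in time order and compute a log target."""
    grouped = defaultdict(list)
    for visit in visits:
        grouped[visit["visitor_id"]].append(visit)

    dataset = {}
    for customer_id, records in grouped.items():
        records.sort(key=lambda r: r["visit_time"])  # keep sequential order
        total = sum(r["revenue"] for r in records)
        dataset[customer_id] = {
            "sequence": records,          # one element per visit
            "target": math.log1p(total),  # assumed log(1 + total revenue)
        }
    return dataset

customers = group_by_customer(visits)
print({cid: (len(d["sequence"]), d["target"]) for cid, d in customers.items()})
```

Each customer becomes one data point whose sequence length varies with the number of visits, so a model consuming this view has to handle variable-length input.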