iQiyi user retention forecast

User behavior sequence modeling

iQiyi user behavior sequence modeling


The user retention prediction competition organized by iQiyi asks participants to predict on how many of the next seven days each user will log into the app. It can be approached either as multi-class classification or as regression followed by threshold post-processing.
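To make the "regression + threshold post-processing" route concrete, here is a minimal sketch. It assumes the raw predictions are floats and the valid label range is 0 to 7; the function name `postprocess_days` and the `snap` tolerance are illustrative choices, not taken from the competition code.

```python
import numpy as np

def postprocess_days(pred, snap=0.0):
    """Clip raw regression outputs to the valid range [0, 7].

    If snap > 0, predictions within `snap` of an integer are rounded to it.
    The tolerance is a hypothetical knob that would be tuned on validation data.
    """
    pred = np.clip(np.asarray(pred, dtype=float), 0.0, 7.0)
    if snap > 0:
        nearest = np.round(pred)
        close = np.abs(pred - nearest) <= snap
        pred = np.where(close, nearest, pred)
    return pred

raw = np.array([-0.3, 0.12, 3.48, 6.95, 8.2])
processed = postprocess_days(raw, snap=0.15)
```

Out-of-range values are clipped, and only predictions that already sit close to an integer are snapped; everything else is left untouched.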

1. Competition background

The training set has 600,000 samples and comes as multiple tables covering user attributes, login times, playback duration, and other features. The test set for leaderboard A has 15,000 samples, for which only the user ID and the time point to predict from are given. Participants must construct the label y themselves.
Official website:
Competition code:
Data set download address: Link: Extraction code: pwk8
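Since the label has to be built by hand, a sketch of one way to do it may help. It assumes dates are integer day indices, that `launch` has columns `user_id` and `date`, and that each sample is a `(user_id, end_date)` pair; the function name `build_label` and the window `(end_date, end_date + 7]` are assumptions for illustration.

```python
import pandas as pd

def build_label(launch, samples):
    """Count distinct login days in (end_date, end_date + 7] for each sample."""
    merged = samples.merge(launch[['user_id', 'date']], on='user_id', how='left')
    in_window = (merged['date'] > merged['end_date']) & \
                (merged['date'] <= merged['end_date'] + 7)
    counts = (merged[in_window]
              .groupby(['user_id', 'end_date'])['date']
              .nunique()
              .rename('label')
              .reset_index())
    out = samples.merge(counts, on=['user_id', 'end_date'], how='left')
    # Users with no login in the window get label 0
    out['label'] = out['label'].fillna(0).astype(int)
    return out

launch = pd.DataFrame({'user_id': [1, 1, 1, 1, 2, 2],
                       'date':    [10, 11, 11, 13, 5, 20]})
samples = pd.DataFrame({'user_id': [1, 2, 3], 'end_date': [10, 12, 10]})
labeled = build_label(launch, samples)
```

Counting distinct days rather than raw login events keeps the label in the range 0 to 7.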

2. Feature extraction

1. User behavior sequence feature extraction

The login records must first be sorted by time so that, after grouping and extracting the login time and type, each user's sequence is in chronological order, which helps the model extract information later. The code below extracts each user's login time sequence and login type sequence. In the end, the login type sequence was fed directly into a GRU for training. I tried using w2v to learn embeddings of the login time sequence, but training was too slow and the attempt was abandoned.
The code is as follows (example):

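# Build sequences
import pandas as pd

# Sort by time first so each user's sequence is in chronological order
launch = launch.sort_values('date')

launch_grp = pd.DataFrame()

user_id = []
launch_date_str = []
launch_type_str = []
for i in launch.groupby('user_id'):
    launch_date = []
    launch_type = []
    # The loop bodies were lost in the source; the appends below are reconstructed from context
    for j in i[1]['date']:
        launch_date.append(j)
    for j in i[1]['launch_type']:
        launch_type.append(j)
    user_id.append(i[0])
    launch_date_str.append(launch_date)
    launch_type_str.append(launch_type)
launch_grp['user_id'] = list(user_id)
launch_grp['launch_date'] = list(launch_date_str)
launch_grp['launch_type'] = list(launch_type_str)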
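The same per-user sequences can also be built without an explicit loop. This is a sketch of an equivalent approach using pandas named aggregation, run on a small made-up `launch` table; it is an alternative formulation, not the author's code.

```python
import pandas as pd

# Toy login table (made-up values) with the column names used in this post
launch = pd.DataFrame({'user_id':     [2, 1, 1, 2],
                       'date':        [5, 9, 3, 1],
                       'launch_type': [0, 1, 0, 1]})

# Sort once, then collect each user's dates and types into lists
launch_grp = (launch.sort_values(['user_id', 'date'])
                    .groupby('user_id', as_index=False)
                    .agg(launch_date=('date', list),
                         launch_type=('launch_type', list)))
```

Because the rows are sorted before grouping, both lists for a user stay aligned and in chronological order.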

The two extracted sequences are as follows:

2. User attribute feature extraction

This part is conventional feature derivation: group aggregations, target encoding, logical feature crosses, and length statistics (for example the number of login types per user, the sequence length, and playback time over the past 30, 15, and 7 days). When building statistical features, be careful to avoid time leakage, that is, using information from after the prediction point. First truncate each sequence to the entries on or before end_date and use only that part as training data. The code is as follows (example):
def get_train_launch_date(row):
    # Keep only logins on or before end_date.
    # Relies on launch_date being sorted in ascending order.
    count = 0
    launch_date_list = row.launch_date
    for i in launch_date_list:
        if row.end_date>=i:
            count += 1
    return launch_date_list[:count]

Then compute statistical features on the training set, so that no information after end_date is used. The code is as follows:

# Build login statistics. Only the sequence up to end_date is used; otherwise future information would leak in. The truncation above already handles this.
launch_grp['launch_times'] = [len(v) for v in launch_grp.launch_date.values]
launch_grp['launch_type_0'] = [len(v)-sum(v) for v in launch_grp.launch_type.values]
launch_grp['launch_type_1'] = [sum(v) for v in launch_grp.launch_type.values]
launch_grp['launch_type_01rate'] = [sum(v)/len(v) if len(v)>0 else 0 for v in launch_grp.launch_type.values]
launch_grp['start_end_launch'] = [max(v)-min(v) if len(v)>0 else 0 for v in launch_grp.launch_date.values]

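# Calculate the sequence length of launch_date
launch_date_len = []
for i in launch_grp.launch_date:
    # Loop body was lost in the source; reconstructed from the list name
    launch_date_len.append(len(i))
launch_grp['launch_date_len'] = launch_date_len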
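Windowed counts such as `launch_times_31`, `launch_times_15` and `launch_times_7` (names that appear in the model's feature list) could be derived along the following lines. This is a sketch assuming integer day indices and a window of `days` days ending at `end_date` inclusive; the helper `count_in_window` and the toy data are hypothetical.

```python
import pandas as pd

def count_in_window(dates, end_date, days):
    """Number of logins in the `days` days ending at end_date (inclusive)."""
    return sum(1 for d in dates if end_date - days < d <= end_date)

# Toy data (made-up values): one row per user with an already truncated login list
launch_grp = pd.DataFrame({'user_id': [1, 2],
                           'end_date': [100, 100],
                           'launch_date': [[60, 90, 95, 100], [99]]})

for days in (31, 15, 7):
    launch_grp['launch_times_{}'.format(days)] = [
        count_in_window(d, e, days)
        for d, e in zip(launch_grp.launch_date, launch_grp.end_date)
    ]
```

The upper bound `d <= end_date` repeats the leakage guard, so the helper stays safe even if it is called on an untruncated list.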


3. Modeling

The model inputs fall into two groups: behavioral sequence features, and statistical features such as user attributes. The behavioral sequences cover only the login sequence of the past month (we also tried adding sequences for the past 15, 7, and 3 days). Each of the user's sequences is fed into its own GRU, while the attribute and statistical features go through a basic DNN. The outputs are concatenated and passed through a ReLU output layer (because the task is treated as regression, there is no softmax).
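A GRU input needs a fixed length, so each user's login history has to be turned into a fixed-size array. The sketch below builds a daily 0/1 login indicator over the last `seq_len` days. Whether the author's `launch_seq_31` is encoded exactly this way is an assumption, and the function name `daily_login_sequence` is hypothetical.

```python
import numpy as np

def daily_login_sequence(launch_dates, end_date, seq_len=32):
    """Return a 0/1 vector with one slot per day for the seq_len days ending at end_date."""
    seq = np.zeros(seq_len, dtype=np.float32)
    start = end_date - seq_len + 1  # first day covered by the window
    for d in launch_dates:
        if start <= d <= end_date:
            seq[d - start] = 1.0
    return seq

# Days 95 and 100 fall inside the 8-day window ending at day 100; day 50 does not
example = daily_login_sequence([95, 100, 50], end_date=100, seq_len=8)
```

The last position always corresponds to end_date, so sequences from users with different prediction points line up.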
The data is read as follows:

import math
import numpy as np
import tensorflow as tf
from tensorflow.keras.utils import Sequence

# Batch generator: each element is one batch of batch_size samples
class DataGenerator(Sequence):
    def __init__(self, df, batch_size):
        self.data = df
        self.num = df.shape[0]
        self.batch_size = batch_size
        self.fea = ['father_id_score', 'cast_id_score', 'tag_score',
       'device_type', 'device_ram', 'device_rom', 'sex', 'age', 'education',
       'occupation_status', 'territory_score','launch_times', 
       'launch_times_31', 'launch_times_15', 'launch_times_7', 'playtime_31',
       'playtime_15', 'playtime_7']
        # Best result so far uses only these 18 features; left out:
        # 'launch_date_len_target_enc', 'start_end_launch', 'launch_date_len', 'launch_type_0', 'launch_type_1'

    def __len__(self):
        return math.ceil(self.num / self.batch_size)

    def __getitem__(self,idx):
        batch_data = self.data.iloc[idx*self.batch_size:(idx+1)*self.batch_size]

        input_1 = np.array([i for i in batch_data.launch_seq_31])
        input_2 = np.array([i for i in batch_data.playtime_seq])
        input_3 = np.array([i for i in batch_data.duration_prefer])
        input_4 = np.array([i for i in batch_data.interact_prefer])
        input_5 = np.array(batch_data[self.fea])
        #The above features should be read in the form [[][][]]
        output = np.array(batch_data.label)

        return (input_1, input_2, input_3, input_4, input_5), output

The final model structure is as follows:

def build_model(seq_len,dur_seq_len,inter_seq_len, feature_num):
    input_1 = tf.keras.Input(shape=(seq_len,1))
    output_1 = tf.keras.layers.GRU(32)(input_1)

    input_2 = tf.keras.Input(shape=(seq_len,1))
    output_2 = tf.keras.layers.GRU(32)(input_2)
    input_3 = tf.keras.Input(shape=(inter_seq_len,1))
    output_3 = tf.keras.layers.GRU(11)(input_3)  #11
    input_4 = tf.keras.Input(shape=(dur_seq_len,1))
    output_4 = tf.keras.layers.GRU(16)(input_4)  #16
    input_5 = tf.keras.Input(shape=(feature_num, ))
    output_5 = tf.keras.layers.Dense(64, activation="relu")(input_5)

    output = tf.keras.layers.Concatenate(axis=-1)([output_1, output_2, output_3, output_4, output_5])
#     output = tf.keras.layers.Dense(128, activation="relu")(output)
#     dp = tf.keras.layers.Dropout(0.15)(output)  # removing this dropout improved the score by 0.002
    output = tf.keras.layers.Dense(64, activation="relu")(output)
    output = tf.keras.layers.Dense(1, activation="relu")(output)

    model = tf.keras.Model(inputs=[input_1, input_2,input_3, input_4,input_5], outputs=output)

    return model

Model training:

# Fold index used in the checkpoint filename; placeholder, the original value was not preserved
kf = 0
model_dir = './data/model/model_fold{}.h5'.format(kf)

new_test = DataGenerator(test,100)

new_train = DataGenerator(train[:594000],100)
new_val = DataGenerator(train.iloc[594000:],100)
model = build_model(seq_len=32,dur_seq_len=16,inter_seq_len=11,feature_num=18)
# The compile call was not preserved in the source. MSE loss with Adam is an assumption;
# an 'mse' metric is needed because the callbacks monitor val_mse.
model.compile(optimizer='adam', loss='mse', metrics=['mse'])

early_stopping = tf.keras.callbacks.EarlyStopping(monitor="val_mse", patience=3, restore_best_weights=True)
lr_reduce = tf.keras.callbacks.ReduceLROnPlateau(patience=2,monitor='val_mse', factor=0.1)
best_checkpoint = tf.keras.callbacks.ModelCheckpoint(model_dir,save_best_only=True, save_weights_only=False,verbose=1)
# The garbled original referred to train_bt / val_bt; they are assumed to be the generators built above
model.fit(new_train,
          validation_data=new_val,
          epochs=20,
          callbacks=[best_checkpoint,early_stopping,lr_reduce])
#                     use_multiprocessing=False,
#                     workers=1,
# Reload the best model for the current fold
best_model = tf.keras.models.load_model(model_dir)
# Test set inference
test_pred =  best_model.predict(new_test, steps=len(new_test))[:,0]
# Validation set inference
val_pred =  best_model.predict(new_val, steps=len(new_val))[:,0]

# Calculate the overall validation set score (aiyiqi_metric is defined in the next snippet)
y_true = train.iloc[594000:]['label']
score = aiyiqi_metric(y_true,val_pred)
print('Score: {}'.format(score))

Online evaluation metric:

def aiyiqi_metric(y_true,y_pred):
    y_true = list(y_true)
    y_pred = list(y_pred)
    score = 0
    for i in range(len(y_true)):
        score += abs(y_true[i]-y_pred[i])/7
    return 100*(1-score/len(y_true))
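The metric above is 100 times one minus the mean absolute error divided by 7. A vectorized NumPy version computing the same quantity is sketched here; the name `aiyiqi_metric_np` is made up to avoid clashing with the original function.

```python
import numpy as np

def aiyiqi_metric_np(y_true, y_pred):
    """100 * (1 - mean(|y_true - y_pred| / 7)), same formula as the loop version."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return 100 * (1 - np.mean(np.abs(y_true - y_pred) / 7))

perfect = aiyiqi_metric_np([0, 7, 3], [0, 7, 3])
one_miss = aiyiqi_metric_np([0, 7, 3], [0, 0, 3])
```

Since the score is linear in the absolute error, clipping predictions to the range 0 to 7 can never lower it.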

In addition, multi-fold cross-validation, semi-supervised learning, and tree models were tried. For details, please refer to the github link:
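For the multi-fold variant, the fold split itself can be sketched in a few lines of NumPy. This is an illustrative helper (`kfold_indices` is a hypothetical name), not the code from the repository; in practice each fold would train its own model and the test predictions would be averaged.

```python
import numpy as np

def kfold_indices(n_samples, n_splits=5, seed=0):
    """Yield (train_idx, val_idx) pairs covering a shuffled range(n_samples)."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n_samples)
    folds = np.array_split(perm, n_splits)
    for k in range(n_splits):
        val_idx = folds[k]
        train_idx = np.concatenate([folds[j] for j in range(n_splits) if j != k])
        yield train_idx, val_idx

# Each sample lands in exactly one validation fold
fold_sizes = [(len(tr), len(va)) for tr, va in kfold_indices(10, n_splits=3, seed=1)]
```

The fold number `k` would be a natural value for the index in a checkpoint name such as `model_fold{}.h5`.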

4. Summary

Judging from this and earlier competitions, statistical features such as user attributes contribute little in user behavior sequence modeling tasks. What matters is how well the click sequence is extracted, how it is fed into the model, and which model is used. The task can essentially be treated as an NLP text classification problem; in similar past competitions, top participants trained BERT directly on the behavioral sequences.
