We used part of the dataset found in Kaggle, by filtering those Hip-Hop songs.

The structure of the original dataset is:

index: the unique integer to tag each song, will be reassigned after filtering;

song: string, the song’s name, words are connected by “-”;

year: integer, the year when the song was released;

artist: string, the artist’s name, words are connected by “-”;

genre: string, be assigned to equal “Hip-Hop” in filtering;

lyrics: string, the lyrics of each song, separated by lines, empty value will be deleted from the filtered dataset.

An example screenshot of the raw dataset:

number of lefted documents: 21866