CyberBullying Detection in Social Media

Deep Learning

Posted by Pritish Roy on February 07, 2021

Cyberbullying is a type of harassment intended to inflict harm on someone through an electronic medium. In a more globalized world with a growing number of social media platforms, cyberbullying is a major problem: 34% of students reported having been bullied online at school at least once. The perception of anonymity and the lack of consequences have made it a serious problem. Twitter's large user base has made it one of the most pervasive social media platforms for cyberbullying. The most effective solution is to stop harmful messages at the source. However, social media platforms such as Twitter handle over 500 million tweets per day (Stricker, 2014), so it is not feasible to screen every post manually. One solution is to run the text through an efficient machine learning classifier that warns the user about potential bullying before the text is broadcast. This technology could increase social welfare and help reduce cyberbullying worldwide.

Word Embedding

A word embedding represents each word as a dense vector, which then serves as input to a machine learning pipeline for training or inference. Vectors obtained from one-hot encoding are binary, sparse, and high-dimensional, whereas word embeddings are low-dimensional floating-point vectors that pack in more information. For example, instead of feeding the characters of the word ‘the’ into the system, we represent ‘the’ as a numeric vector of a fixed dimension, say 300, and every word in the vocabulary has its own unique vector. The geometric relationships between word vectors reflect the semantic relationships between the words, so the embedding maps our language into a geometric space. Semantic relationships such as gender, verb tense, country-capital, and plural are common examples of geometric transformations in that space, and with hundreds of such interpretable and potentially useful features, embeddings could be used for any natural language processing application. They are faster to train than hand-built models are to construct, and most, if not all, deep learning Natural Language Processing applications use word embeddings in the embedding layer.
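The contrast between the two representations can be sketched in a few lines of plain Python. The vocabulary, the dimension of 4, and the random values below are illustrative assumptions only; a real embedding would hold trained values, typically with a few hundred dimensions.

```python
import random

# Hypothetical toy vocabulary, chosen only for illustration.
vocab = ["the", "cat", "dog", "veterinarian", "teakettle"]
word_index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    """Sparse binary vector with one slot per vocabulary word."""
    vec = [0] * len(vocab)
    vec[word_index[word]] = 1
    return vec

# Dense table: random floats stand in for trained embedding values.
EMBEDDING_DIM = 4  # assumption; the text mentions 300 as an example
rng = random.Random(0)
embeddings = {w: [rng.uniform(-1, 1) for _ in range(EMBEDDING_DIM)] for w in vocab}

print(one_hot("cat"))     # length grows with the vocabulary
print(embeddings["cat"])  # length stays fixed at EMBEDDING_DIM
```

The one-hot vector gets longer with every word added to the vocabulary, while the dense vector keeps the same length no matter how large the vocabulary becomes.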

Word embeddings approximate meaning based on the distributional hypothesis: words that are similar in meaning tend to be used together more often. For instance, ‘cat’ and ‘dog’ are used more often with the word ‘veterinarian’ than with the word ‘teakettle’. We can use this information to deduce that ‘cat’, ‘dog’, and ‘veterinarian’ are closer to each other in meaning than ‘cat’, ‘dog’, and ‘teakettle’. To make a word2vec model, we take the corpus and see how often two words occur together; one way to quantify co-occurrence is to look across different sentences and compare. The resulting table, which quantifies how often words are used together, serves as the target for a machine learning approach. The vectors are initialized with random weights. Training then takes pairs of vectors and compares how close they are in space with how often the two words are used together. If they are far apart in space but used together a lot, they are moved closer together; if they are close together but never used together, they are moved further apart. After many iterations, we end up with a vector space representation that approximates the information in the co-occurrence matrix. The trained vectors can also reveal underlying patterns, for example by reducing the dimension and visualizing the result, or by finding the top ‘n’ most similar words in the training corpus.

One drawback of word2vec is that it cannot distinguish homonyms: the ‘bark’ in ‘dog bark’ and in ‘tree bark’ is represented by a single token because the spelling is the same, even though the meanings are distinct. It is also very memory intensive, because there is one row and one column for every individual word in the corpus, so the space needed to train embeddings grows as the corpus grows.
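As a rough illustration of the co-occurrence idea described above, the following sketch counts how often words share a context window and compares the resulting count rows with cosine similarity. The three sentences and the window of 5 are made-up assumptions, and this is only the counting step, not word2vec training itself.

```python
from collections import Counter
import math

# Hypothetical mini-corpus, for illustration only.
sentences = [
    "the cat went to the veterinarian",
    "the dog went to the veterinarian",
    "the teakettle is on the stove",
]

def cooccurrence_counts(sentences, window=5):
    """Count how often each ordered word pair appears within `window` tokens."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[(word, tokens[j])] += 1
    return counts

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

counts = cooccurrence_counts(sentences)
vocab = sorted({w for s in sentences for w in s.split()})

def row(word):
    """One row of the co-occurrence matrix: counts against every vocabulary word."""
    return [counts[(word, other)] for other in vocab]

cat_dog = cosine_similarity(row("cat"), row("dog"))
cat_teakettle = cosine_similarity(row("cat"), row("teakettle"))
print(cat_dog > cat_teakettle)  # True: 'cat' shares more context with 'dog'
```

Note that `row` needs one entry per vocabulary word, which is the memory cost mentioned above: the full matrix grows with the square of the vocabulary size.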

The Word2vec implementation in the gensim library offers two models, Skip-Gram and Continuous Bag of Words (CBOW), with the latter being used to create the embeddings here. Given a set of corpus sentences, the model loops over the words of each sentence. In the Skip-Gram approach, it uses the current word to predict its neighboring words; in the CBOW approach, it uses the neighboring context words to predict the current word. Three hyperparameters of the model, Window Size, SG, and Dimension, set the number of words in each context, the algorithm, and the dimension of the vectors, respectively.
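To make the difference between the two models concrete, here is a minimal sketch of the training pairs each one would draw from a sentence. It does not call gensim; the function names, the sentence, and the window size of 1 are assumptions for illustration.

```python
def context_of(tokens, i, window_size):
    """Neighbors of tokens[i] within window_size positions on either side."""
    lo, hi = max(0, i - window_size), min(len(tokens), i + window_size + 1)
    return [tokens[j] for j in range(lo, hi) if j != i]

def skipgram_pairs(tokens, window_size):
    """Skip-Gram: the current word is the input, each neighbor is a target."""
    return [(center, neighbor)
            for i, center in enumerate(tokens)
            for neighbor in context_of(tokens, i, window_size)]

def cbow_pairs(tokens, window_size):
    """CBOW: the whole context is the input, the current word is the target."""
    return [(context_of(tokens, i, window_size), center)
            for i, center in enumerate(tokens)]

tokens = "the dog saw the veterinarian".split()
print(skipgram_pairs(tokens, 1)[:3])  # [('the', 'dog'), ('dog', 'the'), ('dog', 'saw')]
print(cbow_pairs(tokens, 1)[1])       # (['the', 'saw'], 'dog')
```

In gensim's `Word2Vec` class, the three hyperparameters correspond to the `window`, `sg`, and `vector_size` arguments (`vector_size` was named `size` before gensim 4), with `sg=0` selecting CBOW and `sg=1` selecting Skip-Gram.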

Convolutional Neural Network (CNN)

Using a CNN for cyberbullying detection, instead of classical machine learning, advances the field by adopting the principles of deep learning. Although CNNs were originally designed for image recognition, their performance has been confirmed in many tasks, including NLP and sentence classification (Ptaszynski, Eronen, & Masui, 2017). The most remarkable aspect of a CNN is that it removes three phases that other detection algorithms require: feature determination, extraction, and selection. A convolutional neural network is an artificial neural network specialized for detecting patterns and making sense of them, and this pattern detection is what makes CNNs so useful in text classification. A CNN can, and usually does, have non-convolutional layers as well, but its basis is the convolutional layers. Like any other layer, a convolutional layer receives input, transforms it in some way, and passes the transformed result to the next layer. For each convolutional layer we need to specify the number of filters it should have, and these filters are what actually detect the patterns. To illustrate, a sentence usually has a lot going on: context, correctness, unity, clarity, coherence, emphasis, and so on. One type of ‘pattern’ a filter could detect is ‘coherence’, and such a filter would be called a coherence detector. This kind of filter is what we see at the beginning of the network; the deeper the network goes, the more sophisticated the filters become, so the deeper layers are able to detect specific criteria like hate, toxicity, and racism. For this project, the architecture consists of four types of layers in a Keras Sequential model: Embedding, Convolutional, Max Pooling, and Dense.
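One way to picture this stack is to trace the per-sample tensor shape through it. The sketch below assumes one-dimensional convolution and pooling (the variants Keras calls Conv1D and MaxPooling1D), 'valid' padding, a pooling stride equal to the pool size, and a flatten step before the dense layer; every number passed in is a hypothetical hyperparameter, not the project's actual setting.

```python
def trace_shapes(seq_len, embed_dim, filters, kernel_size, pool_size, dense_units):
    """Per-sample output shape after each layer (batch dimension omitted)."""
    shapes = [("Embedding", (seq_len, embed_dim))]
    conv_len = seq_len - kernel_size + 1                # 'valid' convolution, stride 1
    shapes.append(("Conv1D", (conv_len, filters)))
    pool_len = (conv_len - pool_size) // pool_size + 1  # stride equals pool size
    shapes.append(("MaxPooling1D", (pool_len, filters)))
    shapes.append(("Flatten", (pool_len * filters,)))
    shapes.append(("Dense", (dense_units,)))
    return shapes

# Hypothetical hyperparameters, chosen only to make the arithmetic visible.
for name, shape in trace_shapes(seq_len=100, embed_dim=300, filters=64,
                                kernel_size=3, pool_size=2, dense_units=6):
    print(f"{name:<13}{shape}")
```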

Data Pre-Processing

We create tokens for every word of the corpus using TensorFlow’s Tokenizer and its word index. All tokens are lowercase, because we use them to look up vectors in the word embeddings, which are available only in lowercase. Once the text is tokenized, we can turn it into lists of integer indices and pad them to the same length. The result has shape (number of samples in the dataset, padded sequence length) and is what the CNN consumes. For example, given four sentences, the tokenizer’s fit_on_texts function builds the tokens and the index for the words in the corpus, its texts_to_sequences function turns the sentences into sequences, and TensorFlow’s pad_sequences pads them. This is done for both the training and the test data. The padded training sequences are fed into the deep learning model, whereas the padded test sequences are used in the predict function.
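The three steps can be imitated in plain Python to show what each one produces. This is a simplified stand-in, not the TensorFlow implementation: it splits on whitespace only (the real Tokenizer also filters punctuation), and the example sentences and `maxlen` of 4 are assumptions.

```python
from collections import Counter

def fit_word_index(texts):
    """Lowercase words mapped to integers, most frequent first; 0 is kept for padding."""
    counts = Counter(word for text in texts for word in text.lower().split())
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

def texts_to_sequences(texts, word_index):
    """Replace each known word with its integer index; unknown words are dropped."""
    return [[word_index[w] for w in text.lower().split() if w in word_index]
            for text in texts]

def pad_sequences(sequences, maxlen):
    """Left-pad with zeros and keep the last maxlen items.

    This mirrors the Keras defaults (padding='pre', truncating='pre').
    """
    return [[0] * (maxlen - len(seq[-maxlen:])) + seq[-maxlen:] for seq in sequences]

train_texts = ["You are great", "you are so so mean", "Stop it"]
word_index = fit_word_index(train_texts)
sequences = texts_to_sequences(train_texts, word_index)
padded = pad_sequences(sequences, maxlen=4)

print(sequences)  # [[1, 2, 4], [1, 2, 3, 3, 5], [6, 7]]
print(padded)     # [[0, 1, 2, 4], [2, 3, 3, 5], [0, 0, 6, 7]]
```

The word index is fitted on the training texts only; the test texts are converted with that same index so that both sets share one vocabulary.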

The embedding matrix used as the embedding layer’s weights forms a matrix similar to Figure 6. For each tokenized word from the corpus, we pick the corresponding vector from the word embedding, either the one created from the training corpus or the Google pre-trained one, to construct a matrix of shape (total words in the corpus, dimension).
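A minimal sketch of how such a matrix might be assembled is shown below. The function name, the tiny word index, and the three-dimensional vectors are hypothetical. The sketch also adds one extra all-zero row at index 0 for the padding value, which is a common convention but an assumption here, as is leaving words without a vector as zero rows.

```python
def build_embedding_matrix(word_index, word_vectors, dim):
    """One row per word index; row 0 is reserved for padding and stays zero."""
    matrix = [[0.0] * dim for _ in range(len(word_index) + 1)]
    for word, idx in word_index.items():
        vector = word_vectors.get(word)
        if vector is not None:  # words missing from the embedding keep a zero row
            matrix[idx] = list(vector)
    return matrix

# Hypothetical inputs: a word index from the tokenizer and a small embedding.
word_index = {"cat": 1, "dog": 2, "teakettle": 3}
word_vectors = {"cat": [0.1, 0.2, 0.3], "dog": [0.2, 0.1, 0.4]}

embedding_matrix = build_embedding_matrix(word_index, word_vectors, dim=3)
print(len(embedding_matrix), len(embedding_matrix[0]))  # 4 3
```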

Model Specification and Concept

Embedding Layer

Loading a pre-trained word embedding is justified in natural language processing when we do not have enough data to learn features from scratch but expect the features to be semantically generic, so reusing features learned on a different problem makes sense. The embedding layer works like a lookup dictionary that maps words, which have been converted to sequences of integers, to dense vectors: it finds each integer in its internal weights and returns the associated vector. The weights of the embedding layer are a matrix of shape (max words, embedding dimension), with each row holding the embedding vector for the corresponding word in the reference word index. The embedding layer takes a two-dimensional tensor of shape (batch size, maximum sequence length), where each entry has been converted from text to a sequence of numbers and padded so that all are the same length: a sequence shorter than the maximum sequence length is padded with zeros, and a sequence longer than the maximum sequence length is truncated. The layer returns a three-dimensional tensor of shape (batch size, maximum sequence length, dimension), which is passed to the next layer. During training, the trainable parameter of the embedding layer is set to False due to computational limitations, so the embeddings stay fixed; otherwise, the word vectors would gradually be adjusted by backpropagation, and large gradient updates would disrupt the already learned features.
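The lookup behavior and the change in shape can be sketched with nested lists standing in for tensors. The weights and the batch below are made-up values, and `shape` is a small helper written for this example.

```python
def embedding_lookup(batch, weights):
    """Replace every integer in the batch with the matching row of the weight matrix."""
    return [[weights[token] for token in sequence] for sequence in batch]

def shape(nested):
    """Dimensions of a regular nested list, e.g. (batch, length, dimension)."""
    dims = []
    while isinstance(nested, list):
        dims.append(len(nested))
        nested = nested[0]
    return tuple(dims)

# Hypothetical weights of shape (max words, embedding dimension) = (3, 2).
weights = [[0.0, 0.0],   # index 0: padding
           [0.1, 0.2],   # index 1
           [0.3, 0.4]]   # index 2

batch = [[0, 1, 2],      # shape (batch size, maximum sequence length) = (2, 3)
         [0, 0, 2]]

output = embedding_lookup(batch, weights)
print(shape(batch), "->", shape(output))  # (2, 3) -> (2, 3, 2)
```

Because the lookup only reads rows, freezing the layer simply means these rows are never updated during training.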

Convolutional Layer

The convolutional layer is the heart of the CNN. It comes right after the embedding layer and takes an input of shape (batch size, maximum sequence length, dimension). Its task is to convolve over the input to detect features: it compresses the input while keeping valuable features by creating a set of matrices called filters. The filters are initialized with random weights and slide across the input one F*F subarray at a time, computing at each position the weighted sum (dot product) between the subarray and the filter, which produces several activation maps. The filter weights are learned from the data over many iterations to detect the features most relevant to the problem. After learning a certain pattern, a filter can recognize it anywhere. A dense layer would have to learn the pattern anew if it appeared in a new location, so a convolutional layer needs fewer samples to learn a representation and generalizes better. Convolutional layers also learn spatial hierarchies of patterns: the first layer might learn small local patterns, the subsequent layer larger patterns made of the features from the first layer, and so on.
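The sliding dot product can be sketched for a single filter as follows. The sketch assumes a one-dimensional convolution over text, where the filter spans a few consecutive word vectors across the full embedding dimension, with 'valid' padding and a stride of 1; the input and filter values are invented for the example.

```python
def conv1d(sequence, kernel, bias=0.0):
    """Slide one filter over a sequence of word vectors and return its activation map."""
    k = len(kernel)
    outputs = []
    for start in range(len(sequence) - k + 1):
        window = sequence[start:start + k]
        total = bias
        for vec, kernel_row in zip(window, kernel):
            total += sum(x * w for x, w in zip(vec, kernel_row))
        outputs.append(total)
    return outputs

# Four word vectors of dimension 2 (hypothetical values).
sequence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
# One filter covering two consecutive words.
kernel = [[1.0, 0.0], [0.0, 1.0]]

print(conv1d(sequence, kernel))  # [2.0, 1.0, 1.0]
```

The same `kernel` weights are applied at every position, which is why a pattern learned in one place is recognized wherever it appears in the sequence.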


Max Pooling Layer

Convolution and max pooling can handle large data by reducing its size, which is why they are used one after the other. The max pooling layer is responsible for downsampling the large input matrix to a smaller one, and this is where the convolutional neural network gets its robustness. Max pooling slides in strides across the output of the convolutional layer and takes the maximum of each portion defined by the pool size, keeping clear and meaningful features. It is similar to convolution, except that instead of transforming local patches via a learned transformation, it transforms them via a fixed max operation. Usually, max pooling is done with 2x2 windows and a stride of 2, which downsamples the feature maps by a factor of 2.
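A one-dimensional version of this operation fits in a single function. The sketch assumes a pool size and stride of 2 over one activation map, and the input values are made up.

```python
def max_pool1d(values, pool_size=2, stride=2):
    """Keep the maximum of each window; a trailing partial window is dropped."""
    return [max(values[i:i + pool_size])
            for i in range(0, len(values) - pool_size + 1, stride)]

activation_map = [0.2, 0.9, 0.1, 0.4, 0.7, 0.3]
print(max_pool1d(activation_map))  # [0.9, 0.4, 0.7]
```

Six values become three, and nothing is learned in the process: the operation has no weights.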

Dense Layer

The main application of the dense layer is classification. Dense layers are fully connected: every neuron in one layer is connected to every neuron in the following layer. A dense layer implements the operation activation(dot(input, kernel) + bias), where activation is the activation function, kernel is a weights matrix created by the layer, and bias is a bias vector created by the layer.
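Written out for a single sample, the operation looks like this. The weights, bias, and input are arbitrary illustrative values, and ReLU is used as the example activation.

```python
def relu(x):
    return max(x, 0.0)

def dense(inputs, kernel, bias, activation):
    """activation(dot(input, kernel) + bias) for one sample.

    kernel has shape (number of inputs, number of units).
    """
    outputs = []
    for j in range(len(bias)):
        z = sum(inputs[i] * kernel[i][j] for i in range(len(inputs))) + bias[j]
        outputs.append(activation(z))
    return outputs

inputs = [1.0, 2.0]
kernel = [[1.0, -1.0],
          [0.5, -0.5]]
bias = [0.0, 1.0]

print(dense(inputs, kernel, bias, relu))  # [2.0, 0.0]
```

The second unit computes -1.0 before the activation, and ReLU clips it to 0.0.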

The output of max pooling is passed through a flatten layer, which reshapes the tensor into one whose length equals the number of elements it contains, not including the batch dimension. The number of dense layers varies; here we stack dense layers on top of each other, and the last layer has six units representing the six classes in the dataset. The first dense layer uses the ReLU activation function, max(x, 0), so it computes relu(dot(input, kernel) + bias). Initially, the weights and biases are meaningless, merely a starting point; during training they are gradually adjusted based on the feedback signal. The last dense layer uses the sigmoid activation function, which means it returns a probability score for each of the classes.
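The flatten step and the sigmoid output can be sketched as follows. The pooled values and the six pre-activation values are invented, and the 0.5 cut-off used to flag a class is an assumption for illustration, not a setting taken from the project.

```python
import math

def flatten(feature_maps):
    """Reshape a (steps, filters) nested list into one flat list for a single sample."""
    return [value for row in feature_maps for value in row]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

pooled = [[1, 2], [3, 4], [5, 6]]  # hypothetical (3 steps, 2 filters)
print(flatten(pooled))             # [1, 2, 3, 4, 5, 6]

# Hypothetical pre-activation values of the six output units.
logits = [2.0, -1.0, 0.0, -3.0, 0.5, -0.5]
scores = [sigmoid(z) for z in logits]

# Each class is scored independently, so the scores need not sum to 1.
flagged = [i for i, score in enumerate(scores) if score >= 0.5]
print(flagged)  # [0, 2, 4]
```

Because each unit gets its own sigmoid, one text can be flagged for several of the six classes at once.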