# Automated Bug Triaging

## Abstract

For a given software bug report, identifying an appropriate developer who could potentially fix the bug is the primary task of a bug triaging process. Most bug tracking systems contain a bug title (summary) and a detailed description. Automatic bug triaging can be formulated as a classification problem that takes the bug title and description as input and maps them to one of the available developers (class labels). The major challenge is that the bug description usually contains a mix of free-form unstructured text, code snippets, and stack traces, making the input data highly noisy. Over the past decade, there has been considerable research on representing a bug report with a tf-idf based bag-of-words (BOW) feature model. However, the BOW model does not consider the syntactic and sequential word information available in the descriptive sentences.

In this research, we propose a novel bug report representation algorithm using an attention-based deep bidirectional recurrent neural network (DBRNN-A) model that learns syntactic and semantic features from long word sequences in an unsupervised manner. Instead of BOW features, this robust DBRNN-A based bug representation is then used for training the classification model. Further, the attention mechanism enables the model to learn the context representation over a long word sequence, as in a bug report. As an important contribution of this research, the unfixed bug reports, which constitute about 70% of the bugs in an open source bug tracking system and were completely ignored in previous studies, are leveraged to provide a large amount of data for learning the feature representation model.

Another major contribution is making this research reproducible by releasing the source code and creating a public benchmark dataset of bug reports from three open source bug tracking systems: Google Chromium, Mozilla Core, and Mozilla Firefox. For our experiments, we use 383,104 bug reports from Google Chromium, 314,388 from Mozilla Core, and 162,307 from Mozilla Firefox. Experimentally, we compare our approach against the BOW model paired with a softmax classifier, a support vector machine, naive Bayes, and cosine distance, and observe that DBRNN-A provides a higher rank-10 average accuracy.

## Google Chromium

BOW = bag-of-words (term frequency based); MNB = multinomial naive Bayes; SVM = support vector machine; DBRNN-A = attention based deep bidirectional recurrent neural network

| Threshold | Classifier | CV#1 | CV#2 | CV#3 | CV#4 | CV#5 | CV#6 | CV#7 | CV#8 | CV#9 | CV#10 | Average |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Min. train samples per class = 0 | BOW + MNB | 21.9 | 25.0 | 26.0 | 23.0 | 23.7 | 25.9 | 26.3 | 26.1 | 28.7 | 33.3 | 26.0 ± 3.0 |
| | BOW + Cosine | 18.4 | 20.1 | 21.3 | 17.8 | 20.0 | 20.6 | 20.4 | 20.9 | 21.1 | 21.5 | 20.2 ± 1.2 |
| | BOW + SVM | 11.2 | 09.3 | 09.5 | 09.5 | 09.4 | 10.1 | 10.4 | 09.9 | 10.5 | 10.8 | 10.1 ± 0.6 |
| | BOW + Softmax | 12.5 | 08.5 | 08.6 | 08.7 | 08.6 | 08.5 | 09.1 | 08.9 | 08.7 | 08.7 | 09.1 ± 1.1 |
| | DBRNN-A + Softmax | 34.9 | 36.0 | 39.6 | 35.1 | 36.2 | 39.5 | 39.2 | 39.1 | 39.4 | 39.7 | 37.9 ± 1.9 |
| Min. train samples per class = 5 | BOW + MNB | 22.2 | 25.2 | 26.1 | 23.1 | 23.8 | 26.0 | 26.5 | 26.3 | 29.2 | 33.6 | 26.2 ± 3.1 |
| | BOW + Cosine | 18.6 | 20.2 | 21.4 | 18.2 | 19.1 | 20.7 | 21.1 | 21.0 | 21.6 | 22.0 | 20.4 ± 1.3 |
| | BOW + SVM | 11.3 | 11.1 | 08.1 | 08.3 | 09.2 | 09.0 | 08.9 | 08.7 | 08.5 | 09.0 | 09.2 ± 1.0 |
| | BOW + Softmax | 12.8 | 11.1 | 11.1 | 09.3 | 11.1 | 09.8 | 10.4 | 10.5 | 10.9 | 11.4 | 10.8 ± 0.9 |
| | DBRNN-A + Softmax | 32.2 | 33.2 | 37.0 | 36.4 | 37.1 | 37.2 | 38.3 | 39.0 | 39.1 | 38.2 | 36.8 ± 2.2 |
| Min. train samples per class = 10 | BOW + MNB | 22.4 | 25.5 | 26.4 | 23.3 | 24.1 | 26.5 | 26.8 | 27.0 | 30.1 | 34.3 | 26.6 ± 3.3 |
| | BOW + Cosine | 18.8 | 20.5 | 21.7 | 18.5 | 19.6 | 21.2 | 21.4 | 21.1 | 21.8 | 21.0 | 20.6 ± 1.3 |
| | BOW + SVM | 12.2 | 11.4 | 11.8 | 11.6 | 11.5 | 11.3 | 11.8 | 11.0 | 12.1 | 11.9 | 11.7 ± 0.4 |
| | BOW + Softmax | 11.9 | 11.3 | 11.2 | 11.2 | 11.3 | 11.1 | 11.4 | 11.3 | 11.2 | 11.5 | 11.3 ± 0.2 |
| | DBRNN-A + Softmax | 36.2 | 37.1 | 40.45 | 42.2 | 41.2 | 41.3 | 44.0 | 44.3 | 45.3 | 46.0 | 41.8 ± 3.1 |
| Min. train samples per class = 20 | BOW + MNB | 22.9 | 26.2 | 27.2 | 24.2 | 24.6 | 27.6 | 28.2 | 28.9 | 31.8 | 36.0 | 27.8 ± 3.7 |
| | BOW + Cosine | 19.3 | 20.9 | 22.2 | 19.4 | 20.0 | 22.3 | 22.3 | 22.9 | 23.1 | 23.0 | 21.5 ± 1.4 |
| | BOW + SVM | 12.2 | 12.0 | 11.9 | 11.9 | 11.6 | 11.5 | 11.3 | 11.6 | 11.6 | 11.9 | 11.7 ± 0.3 |
| | BOW + Softmax | 11.9 | 11.8 | 11.4 | 11.3 | 11.2 | 11.1 | 11.0 | 11.8 | 11.3 | 11.7 | 11.5 ± 0.3 |
| | DBRNN-A + Softmax | 36.7 | 37.4 | 41.1 | 42.5 | 41.8 | 42.6 | 44.7 | 46.8 | 46.5 | 47.0 | 42.7 ± 3.5 |

Coming soon!

## Mozilla Core

BOW = bag-of-words (term frequency based); MNB = multinomial naive Bayes; SVM = support vector machine; DBRNN-A = attention based deep bidirectional recurrent neural network

| Threshold | Classifier | CV#1 | CV#2 | CV#3 | CV#4 | CV#5 | CV#6 | CV#7 | CV#8 | CV#9 | CV#10 | Average |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Min. train samples per class = 0 | BOW + MNB | 21.6 | 23.6 | 29.7 | 30.3 | 31.0 | 31.2 | 31.9 | 31.7 | 32.3 | 32.1 | 29.5 ± 3.6 |
| | BOW + Cosine | 16.3 | 17.4 | 19.5 | 21.3 | 22.5 | 23.2 | 24.0 | 25.5 | 27.5 | 29.1 | 22.6 ± 3.9 |
| | BOW + SVM | 13.6 | 14.6 | 14.9 | 14.0 | 12.1 | 12.9 | 11.7 | 13.7 | 14.4 | 14.1 | 13.6 ± 1.0 |
| | BOW + Softmax | 14.3 | 11.8 | 09.5 | 10.0 | 09.2 | 10.4 | 10.5 | 10.6 | 11.0 | 10.8 | 10.8 ± 1.4 |
| | DBRNN-A + Softmax | 30.1 | 31.7 | 35.2 | 33.0 | 34.1 | 35.9 | 34.8 | 34.2 | 34.6 | 35.1 | 33.9 ± 1.7 |
| Min. train samples per class = 5 | BOW + MNB | 20.7 | 23.8 | 29.7 | 31.4 | 31.7 | 33.8 | 35.6 | 36.7 | 35.8 | 36.2 | 31.5 ± 5.2 |
| | BOW + Cosine | 15.7 | 17.7 | 19.9 | 21.4 | 22.8 | 24.7 | 26.4 | 27.5 | 29.4 | 29.9 | 23.5 ± 4.6 |
| | BOW + SVM | 16.4 | 12.9 | 11.5 | 10.4 | 13.4 | 13.8 | 12.7 | 12.0 | 12.8 | 13.1 | 12.9 ± 1.5 |
| | BOW + Softmax | 14.9 | 13.5 | 12.5 | 10.6 | 11.4 | 12.8 | 12.1 | 13.3 | 12.4 | 14.0 | 12.7 ± 1.2 |
| | DBRNN-A + Softmax | 33.8 | 31.5 | 35.8 | 35.3 | 34.7 | 36.8 | 37.1 | 38.4 | 37.7 | 38.0 | 35.9 ± 2.1 |
| Min. train samples per class = 10 | BOW + MNB | 18.4 | 23.9 | 29.8 | 33.4 | 36.7 | 39.4 | 38.5 | 40.8 | 41.3 | 42.5 | 34.5 ± 7.7 |
| | BOW + Cosine | 16.0 | 18.0 | 20.0 | 21.4 | 22.7 | 25.7 | 27.8 | 30.4 | 33.1 | 35.5 | 25.1 ± 6.2 |
| | BOW + SVM | 17.5 | 15.6 | 16.5 | 16.4 | 16.4 | 17.0 | 17.2 | 17.4 | 16.9 | 16.2 | 16.7 ± 0.6 |
| | BOW + Softmax | 15.6 | 14.2 | 14.4 | 13.9 | 14.0 | 13.4 | 13.8 | 14.5 | 14.9 | 14.1 | 14.3 ± 0.6 |
| | DBRNN-A + Softmax | 32.5 | 33.7 | 35.5 | 36.5 | 36.4 | 34.4 | 36.1 | 37.3 | 38.9 | 39.6 | 36.1 ± 2.1 |
| Min. train samples per class = 20 | BOW + MNB | 21.3 | 24.3 | 30.2 | 34.8 | 38.5 | 39.4 | 37.5 | 40.7 | 42.1 | 41.8 | 35.1 ± 7.0 |
| | BOW + Cosine | 16.8 | 18.4 | 20.4 | 23.3 | 28.6 | 31.3 | 35.7 | 38.6 | 37.3 | 38.9 | 28.9 ± 8.2 |
| | BOW + SVM | 14.6 | 15.2 | 16.4 | 14.5 | 13.9 | 15.7 | 16.8 | 15.6 | 16.1 | 16.4 | 15.5 ± 0.9 |
| | BOW + Softmax | 18.8 | 16.4 | 11.4 | 10.5 | 11.8 | 13.1 | 13.6 | 14.3 | 14.8 | 15.3 | 14.0 ± 2.4 |
| | DBRNN-A + Softmax | 33.3 | 34.9 | 36.5 | 36.8 | 37.7 | 39.0 | 41.3 | 42.6 | 41.1 | 43.3 | 38.8 ± 3.2 |

Coming soon!

## Mozilla Firefox

BOW = bag-of-words (term frequency based); MNB = multinomial naive Bayes; SVM = support vector machine; DBRNN-A = attention based deep bidirectional recurrent neural network

| Threshold | Classifier | CV#1 | CV#2 | CV#3 | CV#4 | CV#5 | CV#6 | CV#7 | CV#8 | CV#9 | CV#10 | Average |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Min. train samples per class = 0 | BOW + MNB | 19.1 | 21.3 | 24.5 | 22.9 | 25.8 | 28.1 | 30.3 | 31.9 | 33.94 | 35.55 | 27.4 ± 5.2 |
| | BOW + Cosine | 17.3 | 20.3 | 22.9 | 25.4 | 26.9 | 28.3 | 29.8 | 27.5 | 28.9 | 30.1 | 25.7 ± 4.1 |
| | BOW + SVM | 13.4 | 11.4 | 13.8 | 15.5 | 14.5 | 14.5 | 14.3 | 14.4 | 14.6 | 14.6 | 14.1 ± 1.0 |
| | BOW + Softmax | 11.9 | 17.8 | 17.8 | 15.7 | 13.6 | 15.5 | 13.7 | 13.1 | 13.1 | 13.6 | 14.6 ± 1.9 |
| | DBRNN-A + Softmax | 33.6 | 34.2 | 34.7 | 36.1 | 38.0 | 37.3 | 38.9 | 36.3 | 37.4 | 38.1 | 36.5 ± 1.7 |
| Min. train samples per class = 5 | BOW + MNB | 21.1 | 26.8 | 31.1 | 33.4 | 36.5 | 36.0 | 37.6 | 36.9 | 34.9 | 36.5 | 33.1 ± 5.1 |
| | BOW + Cosine | 20.8 | 23.0 | 23.7 | 26.2 | 27.4 | 29.2 | 32.3 | 32.7 | 34.1 | 35.2 | 28.5 ± 4.8 |
| | BOW + SVM | 14.4 | 16.0 | 17.8 | 17.8 | 17.8 | 16.3 | 16.7 | 16.7 | 16.7 | 15.2 | 16.5 ± 1.1 |
| | BOW + Softmax | 18.2 | 14.8 | 16.7 | 16.7 | 15.4 | 14.5 | 12.5 | 12.9 | 12.9 | 13.7 | 14.8 ± 1.8 |
| | DBRNN-A + Softmax | 27.6 | 34.9 | 37.9 | 38.7 | 40.1 | 42.3 | 45.2 | 44.9 | 45.0 | 44.5 | 40.1 ± 5.3 |
| Min. train samples per class = 10 | BOW + MNB | 21.7 | 27.6 | 32.1 | 34.8 | 37.7 | 34.6 | 32.6 | 34.7 | 36.7 | 38.5 | 33.1 ± 4.8 |
| | BOW + Cosine | 18.1 | 21.2 | 24.4 | 27.0 | 28.3 | 30.1 | 32.3 | 34.0 | 35.4 | 36.6 | 28.7 ± 5.8 |
| | BOW + SVM | 09.9 | 09.9 | 11.8 | 11.8 | 11.8 | 12.8 | 12.9 | 12.9 | 12.9 | 12.8 | 11.9 ± 1.1 |
| | BOW + Softmax | 14.3 | 15.6 | 12.1 | 09.5 | 09.5 | 11.2 | 12.0 | 12.6 | 12.1 | 12.7 | 12.1 ± 1.8 |
| | DBRNN-A + Softmax | 35.1 | 36.4 | 40.5 | 42.5 | 45.4 | 47.4 | 48.9 | 49.1 | 51.1 | 51.4 | 44.8 ± 5.6 |
| Min. train samples per class = 20 | BOW + MNB | 22.0 | 22.8 | 23.6 | 26.3 | 29.2 | 32.3 | 34.4 | 36.4 | 38.6 | 38.4 | 30.4 ± 6.2 |
| | BOW + Cosine | 18.4 | 21.9 | 25.1 | 27.5 | 29.1 | 31.4 | 33.8 | 35.9 | 36.7 | 38.3 | 29.8 ± 6.3 |
| | BOW + SVM | 18.7 | 16.9 | 15.4 | 18.2 | 20.6 | 19.1 | 20.3 | 21.8 | 22.7 | 21.9 | 19.6 ± 2.2 |
| | BOW + Softmax | 16.5 | 13.3 | 13.2 | 13.8 | 11.6 | 12.1 | 12.3 | 12.3 | 12.5 | 12.9 | 13.1 ± 1.3 |
| | DBRNN-A + Softmax | 38.9 | 37.4 | 39.5 | 43.9 | 45.0 | 47.1 | 50.5 | 53.3 | 54.3 | 55.8 | 46.6 ± 6.4 |

Coming soon!

## Analysis

Across all three datasets, the rank-10 average accuracy of the deep learning algorithm improves as the minimum number of training samples per class increases.

We also compared the rank-10 average accuracy of the deep learning algorithm on all three datasets when using only the title versus the title along with the description of the bug report. Discarding the description reduces the performance significantly.
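All reported numbers are rank-10 average accuracies: a prediction counts as correct if the actual developer appears among the classifier's top ten ranked candidates. A minimal sketch of this rank-k metric (illustrative function and toy data, not taken from the paper's code):

```python
import numpy as np

def rank_k_accuracy(probs, true_labels, k):
    """Percentage of samples whose true class appears in the top-k predictions."""
    hits = 0
    for row, label in zip(probs, true_labels):
        top_k = np.argsort(row)[::-1][:k]  # indices of the k highest scores
        if label in top_k:
            hits += 1
    return 100.0 * hits / len(true_labels)

# Toy example: 3 bug reports, 4 candidate developers
probs = np.array([[0.10, 0.60, 0.20, 0.10],   # true class 1 ranked first
                  [0.40, 0.30, 0.20, 0.10],   # true class 2 ranked third
                  [0.25, 0.25, 0.30, 0.20]])  # true class 3 ranked last
true_labels = [1, 2, 3]
```

With k = 1 only the first sample counts as correct; with k = 4 every sample does, since the true class is always somewhere in the ranking.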

## Source Code

The entire implementation is done in Python. The complete script to reproduce the results of the paper can be downloaded from here.

Let us walk through the implementation of our approach. The required packages are:

1. NLTK (Natural Language Toolkit)
2. Gensim for word2vec
3. Keras with TensorFlow backend
4. Scikit-learn

The required packages can be imported into Python as follows (the later sections also use `BatchNormalization`, `TimeDistributed`, `Wrapper`, `InputSpec`, `Adam`, `EarlyStopping`, and the Keras backend, so those imports are included here):

```python
import numpy as np
np.random.seed(1337)
import json, re, nltk, string
from nltk.corpus import wordnet
from gensim.models import Word2Vec
from keras.preprocessing import sequence
from keras.models import Model
from keras.layers import Dense, Dropout, Embedding, LSTM, Input, merge
from keras.layers.normalization import BatchNormalization
from keras.layers.wrappers import Wrapper, TimeDistributed
from keras.engine import InputSpec
from keras.optimizers import RMSprop, Adam
from keras.callbacks import EarlyStopping
from keras.utils import np_utils
from keras import backend as K
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn import svm
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.metrics.pairwise import cosine_similarity
```

The locations of the JSON files containing the data for deep learning model training and for classifier training and testing are provided as follows:

```python
open_bugs_json = '/home/data/chrome/deep_data.json'
closed_bugs_json = '/home/data/chrome/classifier_data_0.json'
```

The hyperparameters required for the entire code can be initialized upfront as follows:

```python
#1. Word2vec parameters
min_word_frequency_word2vec = 5
embed_size_word2vec = 200
context_window_word2vec = 5

#2. Classifier hyperparameters
numCV = 10
max_sentence_len = 50
min_sentence_length = 15
rankK = 10
batch_size = 32
```

The bugs are loaded from the JSON file and the preprocessing is performed as follows:

```python
with open(open_bugs_json) as data_file:
    data = json.load(data_file, strict=False)

all_data = []
for item in data:
    #1. Remove \r
    current_title = item['issue_title'].replace('\r', ' ')
    current_desc = item['description'].replace('\r', ' ')
    #2. Remove URLs
    current_desc = re.sub(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*,]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', '', current_desc)
    #3. Remove Stack Trace
    start_loc = current_desc.find("Stack trace:")
    current_desc = current_desc[:start_loc]
    #4. Remove hex code
    current_desc = re.sub(r'(\w+)0x\w+', '', current_desc)
    current_title = re.sub(r'(\w+)0x\w+', '', current_title)
    #5. Change to lower case
    current_desc = current_desc.lower()
    current_title = current_title.lower()
    #6. Tokenize
    current_desc_tokens = nltk.word_tokenize(current_desc)
    current_title_tokens = nltk.word_tokenize(current_title)
    #7. Strip trailing punctuation marks
    current_desc_filter = [word.strip(string.punctuation) for word in current_desc_tokens]
    current_title_filter = [word.strip(string.punctuation) for word in current_title_tokens]
    #8. Join the lists
    current_data = current_title_filter + current_desc_filter
    current_data = list(filter(None, current_data))
    all_data.append(current_data)
```

A vocabulary is constructed and the word2vec model is learnt using the preprocessed data. The word2vec model provides a semantic word representation for every word in the vocabulary.

```python
wordvec_model = Word2Vec(all_data, min_count=min_word_frequency_word2vec,
                         size=embed_size_word2vec, window=context_window_word2vec)
vocabulary = wordvec_model.vocab
vocab_size = len(vocabulary)
```

The data used for training and testing the classifier is loaded and the same preprocessing is performed, this time also recording the owner (the developer who fixed the bug) of every report:

```python
with open(closed_bugs_json) as data_file:
    data = json.load(data_file, strict=False)

all_data = []
all_owner = []
for item in data:
    #1. Remove \r
    current_title = item['issue_title'].replace('\r', ' ')
    current_desc = item['description'].replace('\r', ' ')
    #2. Remove URLs
    current_desc = re.sub(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*,]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', '', current_desc)
    #3. Remove Stack Trace
    start_loc = current_desc.find("Stack trace:")
    current_desc = current_desc[:start_loc]
    #4. Remove hex code
    current_desc = re.sub(r'(\w+)0x\w+', '', current_desc)
    current_title = re.sub(r'(\w+)0x\w+', '', current_title)
    #5. Change to lower case
    current_desc = current_desc.lower()
    current_title = current_title.lower()
    #6. Tokenize
    current_desc_tokens = nltk.word_tokenize(current_desc)
    current_title_tokens = nltk.word_tokenize(current_title)
    #7. Strip punctuation marks
    current_desc_filter = [word.strip(string.punctuation) for word in current_desc_tokens]
    current_title_filter = [word.strip(string.punctuation) for word in current_title_tokens]
    #8. Join the lists
    current_data = current_title_filter + current_desc_filter
    current_data = list(filter(None, current_data))
    all_data.append(current_data)
    all_owner.append(item['owner'])
```

The ten-fold chronological cross-validation split is performed as follows:

```python
totalLength = len(all_data)
splitLength = totalLength // (numCV + 1)

for i in range(1, numCV + 1):
    train_data = all_data[:i*splitLength-1]
    test_data = all_data[i*splitLength:(i+1)*splitLength-1]
    train_owner = all_owner[:i*splitLength-1]
    test_owner = all_owner[i*splitLength:(i+1)*splitLength-1]
```
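Unlike a random shuffle, this split always trains on a chronologically earlier prefix of the reports and tests on the slab that immediately follows, mimicking how a triager only sees past bugs at prediction time. A toy sketch of the scheme with 12 items and three folds (hypothetical values; the slicing here is simplified and drops the `-1` offsets used above):

```python
all_items = list(range(12))   # stand-in for chronologically ordered bug reports
numCV = 3
splitLength = len(all_items) // (numCV + 1)   # 3 items per slab

folds = []
for i in range(1, numCV + 1):
    train = all_items[:i * splitLength]                       # all earlier slabs
    test = all_items[i * splitLength:(i + 1) * splitLength]   # the next slab
    folds.append((train, test))
```

The training set grows with every fold while the test set always lies strictly after it in time.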

For the i-th cross validation set, remove all the words that are not present in the vocabulary:

```python
i = 1  # Denotes the cross validation set number
updated_train_data = []
updated_train_data_length = []
updated_train_owner = []
final_test_data = []
final_test_owner = []
for j, item in enumerate(train_data):
    current_train_filter = [word for word in item if word in vocabulary]
    if len(current_train_filter) >= min_sentence_length:
        updated_train_data.append(current_train_filter)
        updated_train_owner.append(train_owner[j])

for j, item in enumerate(test_data):
    current_test_filter = [word for word in item if word in vocabulary]
    if len(current_test_filter) >= min_sentence_length:
        final_test_data.append(current_test_filter)
        final_test_owner.append(test_owner[j])
```

For the i-th cross validation set, remove from the test set those classes for which no training data is available:

```python
i = 1  # Denotes the cross validation set number
# Remove data from test set that is not there in train set
train_owner_unique = set(updated_train_owner)
test_owner_unique = set(final_test_owner)
unwanted_owner = list(test_owner_unique - train_owner_unique)
updated_test_data = []
updated_test_owner = []
updated_test_data_length = []
for j in range(len(final_test_owner)):
    if final_test_owner[j] not in unwanted_owner:
        updated_test_data.append(final_test_data[j])
        updated_test_owner.append(final_test_owner[j])

unique_train_label = list(set(updated_train_owner))
classes = np.array(unique_train_label)
```

Create the data matrices and labels required for training the deep learning model and the softmax classifier as follows:

```python
X_train = np.empty(shape=[len(updated_train_data), max_sentence_len, embed_size_word2vec], dtype='float32')
Y_train = np.empty(shape=[len(updated_train_owner), 1], dtype='int32')
for j, curr_row in enumerate(updated_train_data):
    sequence_cnt = 0
    for item in curr_row:
        if item in vocabulary:
            X_train[j, sequence_cnt, :] = wordvec_model[item]
            sequence_cnt = sequence_cnt + 1
            if sequence_cnt == max_sentence_len - 1:
                break
    # Zero-pad the remaining time steps
    for k in range(sequence_cnt, max_sentence_len):
        X_train[j, k, :] = np.zeros((1, embed_size_word2vec))
    Y_train[j, 0] = unique_train_label.index(updated_train_owner[j])

X_test = np.empty(shape=[len(updated_test_data), max_sentence_len, embed_size_word2vec], dtype='float32')
Y_test = np.empty(shape=[len(updated_test_owner), 1], dtype='int32')
for j, curr_row in enumerate(updated_test_data):
    sequence_cnt = 0
    for item in curr_row:
        if item in vocabulary:
            X_test[j, sequence_cnt, :] = wordvec_model[item]
            sequence_cnt = sequence_cnt + 1
            if sequence_cnt == max_sentence_len - 1:
                break
    # Zero-pad the remaining time steps
    for k in range(sequence_cnt, max_sentence_len):
        X_test[j, k, :] = np.zeros((1, embed_size_word2vec))
    Y_test[j, 0] = unique_train_label.index(updated_test_owner[j])

y_train = np_utils.to_categorical(Y_train, len(unique_train_label))
y_test = np_utils.to_categorical(Y_test, len(unique_train_label))
```

Construct the architecture of the deep bidirectional RNN model using the Keras library as follows:

```python
input = Input(shape=(max_sentence_len,), dtype='int32')
sequence_embed = Embedding(vocab_size, embed_size_word2vec, input_length=max_sentence_len)(input)

forwards_1 = LSTM(1024, return_sequences=True, dropout_U=0.2)(sequence_embed)
attention_1 = SoftAttentionConcat()(forwards_1)
after_dp_forward_5 = BatchNormalization()(attention_1)

backwards_1 = LSTM(1024, return_sequences=True, dropout_U=0.2, go_backwards=True)(sequence_embed)
attention_2 = SoftAttentionConcat()(backwards_1)
after_dp_backward_5 = BatchNormalization()(attention_2)

merged = merge([after_dp_forward_5, after_dp_backward_5], mode='concat', concat_axis=-1)
after_merge = Dense(1000, activation='relu')(merged)
after_dp = Dropout(0.4)(after_merge)
output = Dense(len(unique_train_label), activation='softmax')(after_dp)
model = Model(input=input, output=output)
model.compile(loss='categorical_crossentropy', optimizer=Adam(lr=1e-4), metrics=['accuracy'])
```

The soft attention layer is implemented as follows (adapted from code written by braingineer):

```python
def make_safe(x):
    return K.clip(x, K.common._EPSILON, 1.0 - K.common._EPSILON)

class ProbabilityTensor(Wrapper):
    """ function for turning 3d tensor to 2d probability matrix, which is the set of a_i's """
    def __init__(self, dense_function=None, *args, **kwargs):
        self.supports_masking = True
        self.input_spec = [InputSpec(ndim=3)]
        #layer = TimeDistributed(dense_function) or TimeDistributed(Dense(1, name='ptensor_func'))
        layer = TimeDistributed(Dense(1, name='ptensor_func'))
        super(ProbabilityTensor, self).__init__(layer, *args, **kwargs)

    def build(self, input_shape):
        assert len(input_shape) == 3
        self.input_spec = [InputSpec(shape=input_shape)]
        if K._BACKEND == 'tensorflow':
            if not input_shape[1]:
                raise Exception('When using TensorFlow, you should define '
                                'explicitly the number of timesteps of '
                                'your input sequences. '
                                'If your first layer is an Embedding, '
                                'make sure to pass it an "input_length" '
                                'argument. Otherwise, make sure '
                                'the first layer has '
                                'an "input_shape" or "batch_input_shape" '
                                'argument, including the time axis.')

        if not self.layer.built:
            self.layer.build(input_shape)
            self.layer.built = True
        super(ProbabilityTensor, self).build()

    def get_output_shape_for(self, input_shape):
        # b,n,f -> b,n
        #       s.t. \sum_n n = 1
        if isinstance(input_shape, (list, tuple)) and not isinstance(input_shape[0], int):
            input_shape = input_shape[0]
        return (input_shape[0], input_shape[1])

    def squash_mask(self, mask):
        if K.ndim(mask) == 2:
            return mask
        elif K.ndim(mask) == 3:
            return K.any(mask, axis=-1)

    def compute_mask(self, x, mask=None):
        if mask is None:
            return None
        return self.squash_mask(mask)

    def call(self, x, mask=None):
        energy = K.squeeze(self.layer(x), 2)
        p_matrix = K.softmax(energy)
        if mask is not None:
            mask = self.squash_mask(mask)
            p_matrix = make_safe(p_matrix * mask)
            p_matrix = (p_matrix / K.sum(p_matrix, axis=-1, keepdims=True)) * mask
        return p_matrix

    def get_config(self):
        config = {}
        base_config = super(ProbabilityTensor, self).get_config()
        return dict(list(base_config.items()) + list(config.items()))

class SoftAttentionConcat(ProbabilityTensor):
    '''This will create the context vector and then concatenate it with the last output of the LSTM'''
    def get_output_shape_for(self, input_shape):
        # b,n,f -> b,f where f is weighted features summed across n
        return (input_shape[0], 2 * input_shape[2])

    def compute_mask(self, x, mask=None):
        if mask is None or K.ndim(mask) == 2:
            return None
        else:
            raise Exception("Unexpected situation")

    def call(self, x, mask=None):
        # b,n,f -> b,f via b,n broadcasted
        p_vectors = K.expand_dims(super(SoftAttentionConcat, self).call(x, mask), 2)
        expanded_p = K.repeat_elements(p_vectors, K.int_shape(x)[2], axis=2)
        context = K.sum(expanded_p * x, axis=1)
        last_out = x[:, -1, :]
        return K.concatenate([context, last_out])
```
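The effect of the layer is easy to see in plain NumPy: the per-timestep energies are softmax-normalized into attention weights, a context vector is formed as the weighted sum of the timestep outputs, and that context is concatenated with the last timestep's output. A small sketch with made-up numbers (independent of the Keras implementation above):

```python
import numpy as np

x = np.array([[1.0, 0.0],     # timestep outputs, shape (n, f) = (3, 2)
              [0.0, 1.0],
              [2.0, 2.0]])
energy = np.array([0.1, 0.2, 2.0])           # one learned score per timestep

a = np.exp(energy) / np.exp(energy).sum()    # softmax attention weights, sum to 1
context = (a[:, None] * x).sum(axis=0)       # weighted sum over timesteps -> shape (f,)
attended = np.concatenate([context, x[-1]])  # concat context with last timestep output
```

The timestep with the highest energy dominates the context vector, while the last output is always carried through unchanged.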


Train the deep learning model and compute the rank-k test accuracy as follows:

```python
early_stopping = EarlyStopping(monitor='val_loss', patience=2)
hist = model.fit(X_train, y_train, batch_size=batch_size, nb_epoch=200)

predict = model.predict(X_test)
accuracy = []
sortedIndices = []
pred_classes = []
for ll in predict:
    sortedIndices.append(sorted(range(len(ll)), key=lambda ii: ll[ii], reverse=True))
for k in range(1, rankK + 1):
    id = 0
    trueNum = 0
    for sortedInd in sortedIndices:
        pred_classes.append(classes[sortedInd[:k]])
        # Y_test holds class indices aligned with the softmax output order
        if Y_test[id][0] in sortedInd[:k]:
            trueNum += 1
        id += 1
    accuracy.append((float(trueNum) / len(predict)) * 100)
print('Test accuracy:', accuracy)

train_result = hist.history
print(train_result)
```

To compare with the deep learning based features, term frequency based bag-of-words features are constructed as follows:

```python
train_data = []
for item in updated_train_data:
    train_data.append(' '.join(item))

test_data = []
for item in updated_test_data:
    test_data.append(' '.join(item))

vocab_data = []
for item in vocabulary:
    vocab_data.append(item)

# Extract tf based bag of words representation
tfidf_transformer = TfidfTransformer(use_idf=False)
count_vect = CountVectorizer(min_df=1, vocabulary=vocab_data, dtype=np.int32)

train_counts = count_vect.fit_transform(train_data)
train_feats = tfidf_transformer.fit_transform(train_counts)
print(train_feats.shape)

test_counts = count_vect.transform(test_data)
test_feats = tfidf_transformer.transform(test_counts)
print(test_feats.shape)
```

Four baseline classifiers are built over the bag-of-words features:

1. Naive Bayes
2. Support Vector Machine
3. Cosine similarity
4. Softmax classifier

All the classifiers are implemented using the scikit-learn Python package. The naive Bayes classifier is implemented as follows:

```python
classifierModel = MultinomialNB(alpha=0.01)
classifierModel = OneVsRestClassifier(classifierModel).fit(train_feats, updated_train_owner)
predict = classifierModel.predict_proba(test_feats)
classes = classifierModel.classes_

accuracy = []
sortedIndices = []
pred_classes = []
for ll in predict:
    sortedIndices.append(sorted(range(len(ll)), key=lambda ii: ll[ii], reverse=True))
for k in range(1, rankK + 1):
    id = 0
    trueNum = 0
    for sortedInd in sortedIndices:
        # Compare the true owner label against the top-k predicted owners
        if updated_test_owner[id] in classes[sortedInd[:k]]:
            trueNum += 1
            pred_classes.append(classes[sortedInd[:k]])
        id += 1
    accuracy.append((float(trueNum) / len(predict)) * 100)
print(accuracy)
```

The implementation of the Support Vector Machine is as follows (class probabilities, rather than hard predictions, are needed to rank the top-k candidates):

```python
classifierModel = svm.SVC(probability=True, verbose=False, decision_function_shape='ovr', random_state=42)
classifierModel.fit(train_feats, updated_train_owner)
predict = classifierModel.predict_proba(test_feats)
classes = classifierModel.classes_

accuracy = []
sortedIndices = []
pred_classes = []
for ll in predict:
    sortedIndices.append(sorted(range(len(ll)), key=lambda ii: ll[ii], reverse=True))
for k in range(1, rankK + 1):
    id = 0
    trueNum = 0
    for sortedInd in sortedIndices:
        if updated_test_owner[id] in classes[sortedInd[:k]]:
            trueNum += 1
            pred_classes.append(classes[sortedInd[:k]])
        id += 1
    accuracy.append((float(trueNum) / len(predict)) * 100)
print(accuracy)
```

The implementation of the cosine similarity based classification is as follows (each test bug is scored against every training bug, so the class array holds the owner of each training sample):

```python
predict = cosine_similarity(test_feats, train_feats)
classes = np.array(updated_train_owner)
classifierModel = []

accuracy = []
sortedIndices = []
pred_classes = []
for ll in predict:
    sortedIndices.append(sorted(range(len(ll)), key=lambda ii: ll[ii], reverse=True))
for k in range(1, rankK + 1):
    id = 0
    trueNum = 0
    for sortedInd in sortedIndices:
        if updated_test_owner[id] in classes[sortedInd[:k]]:
            trueNum += 1
            pred_classes.append(classes[sortedInd[:k]])
        id += 1
    accuracy.append((float(trueNum) / len(predict)) * 100)
print(accuracy)
```

The softmax (logistic regression) based classification is performed as follows:

```python
classifierModel = LogisticRegression(solver='lbfgs', penalty='l2', tol=0.01)
classifierModel = OneVsRestClassifier(classifierModel).fit(train_feats, updated_train_owner)
predict = classifierModel.predict_proba(test_feats)
classes = classifierModel.classes_

accuracy = []
sortedIndices = []
pred_classes = []
for ll in predict:
    sortedIndices.append(sorted(range(len(ll)), key=lambda ii: ll[ii], reverse=True))
for k in range(1, rankK + 1):
    id = 0
    trueNum = 0
    for sortedInd in sortedIndices:
        if updated_test_owner[id] in classes[sortedInd[:k]]:
            trueNum += 1
            pred_classes.append(classes[sortedInd[:k]])
        id += 1
    accuracy.append((float(trueNum) / len(predict)) * 100)
print(accuracy)
```