Comparison Study of Logistic Regression Model for Albanian Texts.



INTRODUCTION
Logistic regression is a classifier that assigns an observation to one of two classes. Among the statistical models found in classification theory, logistic regression is considered a linear method for classification (III), and it is implemented in many software packages, among them the popular statistical software R (IV). Treating authorship attribution as a classification problem, we estimate the probability that each text under study belongs to a given author. Logistic regression fits its model by the maximum likelihood method, which maximizes the probability of classifying the observed data into the appropriate category. In our previous paper (I) we defined a logistic regression model for six Albanian texts, treating attribution as the question of whether a given text was written by a single candidate author. The parameters obtained there were not the best, owing to the small number of texts studied. In this paper we use R to improve the logistic regression model by increasing both the number of texts and the number of independent variables. The model is applied to data from one hundred texts by ten different authors, with the following independent variables: number of letters, number of words, number of vowels, number of consonants, number of punctuation marks and number of sentences. Analysis of these Albanian texts shows that about 40% of their letters are vowels, with a 95% confidence interval of ]0.3985; 0.4032[, which explains the high correlation between the number of letters and the number of vowels. By reviewing different versions of the model we identified the most significant independent variables; over all the authors under study, the average correct predicted probability was 0.918.
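The paper does not state how the confidence interval for the share of vowels was computed. As a minimal sketch, assuming per-text letter and vowel counts and a normal-approximation (Wald) interval on the pooled proportion, the calculation could look as follows. The counts are simulated and the names `n_letters` and `n_vowels` are hypothetical; none of this reproduces the paper's data.

```r
# Hypothetical per-text counts; the paper's data set is not reproduced here.
set.seed(1)
n_letters <- sample(3000:6000, 100, replace = TRUE)      # letters per text
n_vowels  <- rbinom(100, size = n_letters, prob = 0.40)  # vowels per text

# Pooled proportion of vowels among all letters.
p_hat <- sum(n_vowels) / sum(n_letters)

# 95% Wald interval: p_hat +/- z * sqrt(p_hat * (1 - p_hat) / N),
# where N is the total number of letters (an assumed choice of method).
z  <- qnorm(0.975)
se <- sqrt(p_hat * (1 - p_hat) / sum(n_letters))
ci <- c(lower = p_hat - z * se, upper = p_hat + z * se)
print(round(c(estimate = p_hat, ci), 4))

# Correlation between the two counts across texts.
print(cor(n_letters, n_vowels))
```

Because the vowel share is nearly constant from text to text, the vowel count is close to a fixed multiple of the letter count, which is why the two counts are so strongly correlated in this simulation.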

METHOD AND RESULTS
Logistic regression analysis belongs to the class of generalized linear models. These models are characterized by their response distribution and a link function, which transfers the mean value to a scale on which the relation to background variables is described as linear and additive. Logistic regression uses the logistic function to describe the relationship between independent random variables and a qualitative dependent variable with two values, 0/1 (dichotomous, dummy). In a logistic regression analysis the link function is the logit function, $\operatorname{logit}(p) = \ln[p/(1 - p)]$. A binary classifier based on a logistic regression model learns the mapping of a feature vector $X$ to a category label assignment $Y$ for the $i$-th category label by modeling the conditional probability $P(Y \mid X)$ directly, considering $\operatorname{logit}(p)$ as a linear combination of the components of $X$.

For each text we must make a statistical decision whether or not it belongs to a particular author. We applied the logistic regression model to one hundred texts by ten different authors. All formatting was removed from the texts. For each text, stored as a .doc file, we calculated the number of letters, number of words, number of vowels, number of consonants, number of punctuation marks and number of sentences. The model was defined in R for the different cases studied. When all of the above variables are taken as independent variables, not all the parameters are statistically significant. Reviewing the different cases shows that the number of letters is correlated with both the number of vowels and the number of consonants. From the data described above we obtained a simple logistic regression model for author 1 using the glm() function in R:

> model <- glm(Author1 ~ N_words + N_letters + N_vowels + N_consonant + N_sentences + N_punctuations, data = paper, family = binomial())
> summary(model, corr = TRUE)
Call: glm(formula = Author1 ~ N_words + N_letters + N_vowels + N_consonant + N_sentences + N_punctuations, family = binomial(), data = paper)

Of the different models analyzed, one of the best fitted is obtained by taking as independent variables the number of words, number of letters, number of sentences and number of punctuation marks.
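The correlation among the letter, vowel and consonant counts noted above can be inspected directly. The sketch below assumes that every letter is counted as either a vowel or a consonant, so that N_letters = N_vowels + N_consonant holds exactly; the paper does not confirm this. The data are simulated, and only the column names are taken from the paper's formula.

```r
# Simulated stand-in for the paper's data frame; all values are hypothetical.
set.seed(2)
n <- 100
N_letters   <- sample(3000:6000, n, replace = TRUE)
N_vowels    <- rbinom(n, size = N_letters, prob = 0.40)
N_consonant <- N_letters - N_vowels          # assumed exact identity

paper <- data.frame(
  Author1        = rep(c(1, 0), times = c(10, 90)),  # 10 texts by author 1
  N_words        = rpois(n, N_letters / 5),
  N_letters, N_vowels, N_consonant,
  N_sentences    = rpois(n, 60),
  N_punctuations = rpois(n, 150)
)

# Pairwise correlations among the three letter-based counts.
print(round(cor(paper[, c("N_letters", "N_vowels", "N_consonant")]), 3))

# Full model: under the exact identity one coefficient cannot be estimated,
# and glm() reports it as NA.
full <- glm(Author1 ~ N_words + N_letters + N_vowels + N_consonant +
              N_sentences + N_punctuations,
            data = paper, family = binomial())
print(coef(full))
```

Under this assumption the three counts carry only two independent pieces of information, which is one reason to drop the vowel and consonant counts from the model.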
The reduced model with these four variables is fitted with:

> Model1 <- glm(Author1 ~ N_words + N_letters + N_sentences + N_punctuations, data = paper, family = binomial())
> summary(Model1, corr = TRUE)
Call: glm(formula = Author1 ~ N_words + N_letters + N_sentences + N_punctuations, family = binomial(), data = paper)

Testing whether the fifth text belongs to the first author gives $P(Y = 1 \mid X_5)$ = 9.660309e-01 and $P(Y = 0 \mid X_5)$ = 1 − 9.660309e-01 = 0.339698e-01; we note that $P(Y = 1 \mid X_5) > P(Y = 0 \mid X_5)$. This shows that the text was written by the first author. These probabilities are calculated using the function predict(). The correct predicted probability for this model is 0.96, meaning that 96% of the texts are classified correctly by the fitted model. Over all the authors under study, the average correct predicted probability is 0.918. In our previous work (II) we defined a multinomial logistic regression model, which identified as the most significant independent variables the number of words, number of vowels, number of consonants, number of punctuation marks and number of sentences, with a highest overall correct predicted probability of 0.738. The logistic regression model identifies as the most significant independent variables the number of letters, number of sentences and number of punctuation marks. Recognizing as many linguistic features of an author as possible should give good mathematical models for authorship attribution of texts. In the logistic regression models for Albanian texts not all the parameters are statistically significant: of the six independent variables only three were found to be most significant, while in the multinomial logistic regression model five variables were significant. The logistic regression model gives a higher predicted probability than the multinomial logistic regression model, but we had to define as many logistic regression models as there are authors.
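The one-model-per-author procedure and the use of predict() can be sketched as follows. This is an illustration on simulated data, not the paper's analysis: the `author` column, the helper `fit_author()`, the 0.5 cut-off and the reading of "correct predicted probability" as the share of texts classified correctly are all assumptions.

```r
# Simulated stand-in for the paper's data frame; all values are hypothetical.
set.seed(3)
n_authors <- 10
author    <- rep(seq_len(n_authors), each = 10)       # 10 texts per author
mu_punct  <- seq(130, 175, length.out = n_authors)    # assumed author habits

paper <- data.frame(
  author         = author,
  N_words        = rpois(100, 900),
  N_letters      = rpois(100, 4500),
  N_sentences    = rpois(100, 60),
  N_punctuations = rpois(100, mu_punct[author])
)

# Fit one binary model (author a versus all others) and return the fitted
# probabilities P(Y = 1 | X) for every text.
fit_author <- function(a) {
  d <- paper
  d$y <- as.integer(d$author == a)
  fit <- glm(y ~ N_words + N_letters + N_sentences + N_punctuations,
             data = d, family = binomial())
  predict(fit, type = "response")
}

# Author 1, text 5: the two probabilities that are compared.
p1 <- fit_author(1)
print(c(P_Y1 = unname(p1[5]), P_Y0 = unname(1 - p1[5])))

# Share of texts classified correctly at a 0.5 cut-off, one value per model.
rates <- sapply(seq_len(n_authors), function(a) {
  mean((fit_author(a) > 0.5) == (paper$author == a))
})
print(round(rates, 2))
print(mean(rates))   # average over the ten one-vs-rest models
```

The loop makes the cost visible: ten authors require ten separate fits, each of which may call for its own variable selection.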
The problem is that finding the right model takes a lot of time because of the number of authors, whereas with multinomial logistic regression the authorship of a text is determined with only one model. In conclusion, the multinomial logistic regression model for Albanian texts has more advantages than the logistic regression model.

ISSN 2347-1921, Volume 12, Number 09, Journal of Advances in Mathematics, page 6575, Council for Innovative Research, September 2016, www.cirworld.com

CONCLUSIONS
In our logistic model we used six independent variables drawn from 100 texts by ten different Albanian authors. Analysis of these data shows that about 40% of the letters are vowels, with a 95% confidence interval of ]0.3985; 0.4032[, which explains the high correlation between the number of letters and the number of vowels. By reviewing different versions of the model we identified as most significant the independent variables number of letters, number of sentences and number of punctuation marks. Over all the authors under study, the average correct predicted probability is 0.918. Comparing the results of this paper with those obtained with the multinomial logistic regression model (II), we conclude that the multinomial logistic regression model for Albanian texts has more advantages than the logistic regression model.