Imene Bensalem, Paolo Rosso, Salim Chikhi. On the use of character n-grams as the only intrinsic evidence of plagiarism. Language Resources and Evaluation, 2019. Vol. 53, p. 363.

When a shift in writing style is noticed in a document, doubts arise about its originality. Based on this clue to plagiarism, the intrinsic approach to plagiarism detection identifies the stolen passages by analysing the writing style of the suspicious document, without comparing it to textual resources that may have served as sources for the plagiarist. Character n-grams are recognised as a successful way of modelling text for writing style analysis. Although prior studies have investigated best practices for using character n-grams in authorship attribution and other problems, such investigations are still needed in the context of intrinsic plagiarism detection. Moreover, previous works have assumed that the ways of using character n-grams in authorship attribution carry over unchanged to intrinsic plagiarism detection. In this paper, we study the effect of character n-gram frequency and length on the performance of intrinsic plagiarism detection. Our experiments use two state-of-the-art methods and five large document collections from the PAN labs, written in English and Arabic. We demonstrate empirically that low- and high-frequency n-grams are not equally relevant for intrinsic plagiarism detection, but that their performance depends on the way they are exploited.
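
To make the two variables studied here concrete (n-gram length and n-gram frequency), the following is a minimal Python sketch of how character n-grams can be extracted from a document and separated into low- and high-frequency sets. It is an illustration only and does not reproduce either of the two methods evaluated in the paper. The function names, the lowercasing step, the default length `n=3`, and the frequency `threshold` are all assumptions introduced for this example.

```python
from collections import Counter


def char_ngrams(text, n):
    """Return all overlapping character n-grams of length n, in order."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def split_by_frequency(document, n=3, threshold=2):
    """Count the document's character n-grams and split them by frequency.

    Assumption for this sketch: an n-gram is "low-frequency" if it occurs
    fewer than `threshold` times in the document, "high-frequency" otherwise.
    """
    counts = Counter(char_ngrams(document.lower(), n))
    low = {gram for gram, c in counts.items() if c < threshold}
    high = {gram for gram, c in counts.items() if c >= threshold}
    return counts, low, high


def fragment_share(fragment, selected, n=3):
    """Fraction of a fragment's n-grams that belong to the `selected` set.

    A simple hypothetical style feature: it could be computed per fragment
    and compared across the fragments of one document.
    """
    grams = char_ngrams(fragment.lower(), n)
    if not grams:
        return 0.0
    return sum(gram in selected for gram in grams) / len(grams)
```

In such a setup, one could compute `fragment_share` for each fragment of a suspicious document, once against the low-frequency set and once against the high-frequency set, and look for fragments whose values deviate from the rest. How these features are actually weighted and thresholded is method-specific and is not specified in this abstract.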