Title: Detecting textual derivatives

Keywords: detection of derivation, revisions, plagiarism, signature approach, similarity metrics, information retrieval

Abstract:

Thanks to the Internet, content can be produced and published easily and quickly. This raises the issue of controlling the origins of that content. This work focuses on detecting derivation links between texts. A derivation link associates a derivative text with the pre-existing texts from which it was written. We focused on the task of identifying derivative texts given a source text, for various forms of derivation. Our first contribution is a theoretical framework that defines the concept of derivation, together with a model framing the different forms of derivation. We then set up an experimental framework consisting of free software tools, evaluation corpora and evaluation metrics based on information retrieval (IR). The Piithie and Wikinews corpora we have developed are, to our knowledge, the only ones in French for the detection of derivation links. Finally, we explored different detection methods based on the signature-based approach. In particular, we introduced the notions of specificity and invariance to guide the choice of descriptors used to model the texts for comparison. Our results show that choosing well-motivated descriptors, including linguistically motivated ones, can reduce the size of the text models, and therefore the cost of the method, while offering performance comparable to the much more voluminous state-of-the-art approach.
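As an illustration of the signature-based approach mentioned above, here is a minimal Python sketch. It is built on assumptions and is not the method evaluated in the thesis: each text is reduced to a set of descriptors (here, word trigrams), and candidates are ranked by the share of their descriptors that also occur in the source (a containment measure). The function names `descriptors`, `containment` and `rank_candidates`, the choice of trigrams and the containment score are all hypothetical choices made for this example.

```python
import re


def descriptors(text, n=3):
    """Return the signature of `text` as a set of word n-grams.

    Word n-grams are an illustrative choice of descriptor (an assumption),
    not the specific or invariant descriptors studied in the thesis.
    """
    words = re.findall(r"\w+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def containment(source_sig, candidate_sig):
    """Share of the candidate's descriptors that also occur in the source."""
    if not candidate_sig:
        return 0.0
    return len(source_sig & candidate_sig) / len(candidate_sig)


def rank_candidates(source, candidates, n=3):
    """Rank candidate texts by decreasing containment in the source."""
    source_sig = descriptors(source, n)
    scored = [(containment(source_sig, descriptors(c, n)), c) for c in candidates]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


if __name__ == "__main__":
    source = "the quick brown fox jumps over the lazy dog"
    candidates = [
        "completely unrelated sentence about weather today",
        "the quick brown fox jumps over a sleeping cat",
    ]
    for score, text in rank_candidates(source, candidates):
        print(f"{score:.2f}  {text}")
```

A smaller or more selective descriptor set shrinks each signature and so lowers the cost of comparison, which is the trade-off the abstract describes.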

Thesis jury:

Publications:

  • Poulard, F., N. Hernandez and B. Daille. 2011, Detecting derivatives using specific and invariant descriptors, Proceedings of the 12th International Conference on Computational Linguistics and Intelligent Text Processing (CICLing 2011), Tokyo, Japan.
  • Hernandez, N., F. Poulard, M. Vernier and J. Rocheteau. 2010, Building a French-speaking community around UIMA, gathering research, education and industrial partners, mainly in Natural Language Processing and Speech Recognizing domains, Workshop Abstracts LREC 2010 Workshop 'New Challenges for NLP Frameworks', Valletta, Malta, p.64. http://hal.archives-ouvertes.fr/hal-00481459/en/
  • Poulard, F., N. Hernandez, S. D. Afantenos and B. Daille. 2010, Evaluation de descripteurs statistiques et linguistiques pour la détection de dérivation de texte, Document numérique, 13, 3/2010, p.69-93. http://hal.archives-ouvertes.fr/hal-00554351/en/.
  • Dejean, C., M. Fortun, C. Massot, V. Pottier, F. Poulard and M. Vernier. 2010, Un étiqueteur de rôles grammaticaux libre pour le français intégré à Apache UIMA, Actes de la 17e Conférence sur le Traitement Automatique des Langues Naturelles, Montréal, Canada. http://hal.archives-ouvertes.fr/hal-00493847/en/.
  • Poulard, F., S. D. Afantenos and N. Hernandez. 2009, Nouvelles considérations pour la détection de réutilisation de texte, Actes de la 16ème conférence sur le Traitement Automatique des Langues Naturelles, Senlis, France, p.67. http://hal.archives-ouvertes.fr/hal-00401072/en/.
  • Hernandez, N., F. Poulard, S. Afantenos, M. Vernier and J. Rocheteau. 2009, Apache UIMA pour le Traitement Automatique des Langues, 16ème conférence sur le Traitement Automatique des Langues Naturelles (TALN'09) - Session Démonstration. http://hal.archives-ouvertes.fr/hal-00423728/en/.
  • Poulard, F., T. Waszak, N. Hernandez and P. Bellot. 2008, Repérage de citations, classification des styles de discours rapporté et identification des constituants citationnels en écrits journalistiques, Actes de la 15e Conférence sur le Traitement Automatique des Langues Naturelles, Avignon, France, p.450-459. http://hal.archives-ouvertes.fr/hal-00401011/en/.
  • Poulard, F. 2008, Analyse quantitative et qualitative de citations extraites d'un corpus journalistique, Actes de la 12e édition de la Rencontre des Etudiants-Chercheurs en Informatique et en Traitement Automatique des Langues (RÉCITAL), Avignon, France, p.101-110. http://hal.archives-ouvertes.fr/hal-00401001/en/.