Abstract:
The leakage of sensitive information is a pressing problem in digital information processing because of the economic, political, and social repercussions it can have for the owner. Despite these risks and threats, information must remain available to users, so mechanisms are needed to protect against, detect, and prevent leaks of sensitive information. A particular case of this problem is the leakage of sensitive textual documents. Identifying unstructured sensitive information remains an open problem: existing methods and applications show promising results, but the solution is still not fully satisfactory. It is therefore necessary to keep developing methods that contribute to an effective solution, grounded in a critical analysis of existing techniques and their future prospects. This work starts from a taxonomy of the approaches that have been applied to the problem. Based on this taxonomy, a critical analysis of the techniques, and above all practical needs, we propose a method for determining the sensitivity of textual documents from the perspective of Logical Combinatorial Pattern Recognition. The problem is treated as a supervised classification problem with two classes: sensitive and non-sensitive textual documents. The proposed method, STClass, consists of two phases: a training phase, in which the classification parameters are defined, and a classification phase. On the datasets used, 96% of the documents were classified correctly.