Polylingual Text Classification in the Legal Domain
TERESA GONÇALVES, PAULO QUARESMA ∗
SUMMARY: 1. Introduction – 2. Concepts and Tools – 2.1. Automatic Text
Classification – 2.2. Support Vector Machines – 3. Polylingual Approach to Text
Classification – 3.1. Combining Monolingual Classifier – 3.2. Using Polylingual
Classifiers – 4. Experiments – 4.1. Dataset Description – 4.2. Experimental
Setup – 4.3. Monolingual Experiments – 4.4. Monolingual Combiner Experiments –
4.5. Polylingual Experiments – 5. Conclusions and Future Work
1. INTRODUCTION
Current Information Technologies and Web-based services need to manage,
select and filter increasing amounts of textual information. Text
classification allows users, by navigating class hierarchies, to browse the
texts that interest them more easily. This paradigm is very effective both
for filtering information and for developing online end-user services.
Since the number of documents involved in these applications is large,
efficient and automatic approaches to classification are necessary. A
Machine Learning approach can be used to build the classifiers automatically.
The construction process can be seen as a supervised learning problem:
the algorithm receives a relatively small set of labelled documents and
generates the classifier. Several algorithms have been applied, such as
decision trees, linear discriminant analysis and logistic regression, the
naïve Bayes algorithm and Support Vector Machines (SVM). Besides being
grounded in a learning theory that justifies their mechanics, SVMs are known
to be computationally efficient, robust and accurate for text classification.
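
As an illustration of the supervised setting just described, the sketch below
trains a small linear classifier on a handful of labelled documents and then
labels unseen ones. It is not the setup used in this paper: the toy documents,
the two class labels, the stop-word list and all function names are
hypothetical, and the learner is a simplified SVM-style model (regularised
hinge loss, no bias term, plain subgradient descent) chosen only to keep the
example self-contained.

```python
from collections import Counter

# Hypothetical stop-word list: function words carry little class information.
STOP_WORDS = {"the", "a", "and", "of", "to", "was", "for", "by"}

def tokenize(text):
    """Lower-case, split on whitespace and drop stop words."""
    return [t for t in text.lower().split() if t not in STOP_WORDS]

def vectorize(text, vocab):
    """Bag-of-words term counts over a fixed vocabulary."""
    counts = Counter(tokenize(text))
    return [float(counts[w]) for w in vocab]

def dot(w, x):
    return sum(wj * xj for wj, xj in zip(w, x))

def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=50):
    """Subgradient descent on lam/2 * ||w||^2 + hinge loss; labels in {-1, +1}."""
    w = [0.0] * len(X[0])
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            violated = yi * dot(w, xi) < 1.0           # margin constraint broken?
            w = [wj * (1.0 - lr * lam) for wj in w]    # step on the regulariser
            if violated:
                w = [wj + lr * yi * xj for wj, xj in zip(w, xi)]  # hinge step
    return w

def predict(w, x):
    return 1 if dot(w, x) >= 0.0 else -1

# Toy labelled set (invented for illustration): +1 = contract law, -1 = criminal law.
TRAIN = [
    ("the tenant shall pay rent under the lease contract", 1),
    ("the buyer and the seller signed the sale contract", 1),
    ("the defendant was convicted of theft and sentenced to prison", -1),
    ("the court sentenced the accused for robbery", -1),
]

vocab = sorted({t for text, _ in TRAIN for t in tokenize(text)})
X = [vectorize(text, vocab) for text, _ in TRAIN]
y = [label for _, label in TRAIN]
w = train_linear_svm(X, y)

def classify(text):
    """Label a new document with the trained weights."""
    return predict(w, vectorize(text, vocab))

print(classify("the tenant signed the lease contract"))
print(classify("the accused was convicted of robbery"))
```

In practice one would rely on an established SVM implementation and a richer
document representation; the point here is only the workflow of labelled
documents in, classifier out.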
Because of the globalization trend, an organization or individual often
generates, acquires and archives the same document written in different
languages (i.e., polylingual documents); moreover, many countries adopt
several official languages. If these polylingual documents are already
organized into existing categories, one would like to use this set of
pre-classified documents as training data to build models that classify
newly arrived polylingual documents.
For multilingual text classification, some prior studies address the
challenge of cross-lingual text classification. However, prior research has not
∗ T. Gonçalves is Auxiliar Professor at the Department of Computer Science of
the University of Évora; P. Quaresma is Associate Professor at the same
Department.