Using a regular expression to specify stop words in Weka Machine Learning from Java

In this article I share some Java code that shows how to define stop words with a regular expression in Weka. Weka is a collection of machine learning algorithms for data mining tasks, written in Java. The algorithms can either be applied directly to a dataset or called from your own Java code [1]. This article covers calling the algorithms from Java code, not from the Weka Explorer.

Problem: Extremely common words that are of little value in helping select documents matching a user need are sometimes excluded from the vocabulary entirely. These words are called stop words [2]. Weka offers several ways to specify stop words, but none of the default StopwordsHandler implementations accepts a single regular expression.

Implementations of StopwordsHandler
Solution: The following simple implementation of the StopwordsHandler solves the problem:

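import weka.core.stopwords.StopwordsHandler;

import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Treats every token in which the regular expression finds a match as a stop word. */
public class RegExStopwords implements StopwordsHandler {

    private final Pattern pattern;

    public RegExStopwords(String regexString) {
        // Compile once; the pattern is reused for every token.
        pattern = Pattern.compile(regexString);
    }

    @Override
    public boolean isStopword(String s) {
        // find() succeeds if the pattern occurs anywhere in the token,
        // so a partial match is enough to mark it as a stop word.
        Matcher matcher = pattern.matcher(s);
        return matcher.find();
    }
}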
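
One detail of RegExStopwords worth spelling out is that it calls find() rather than matches(), so a token counts as a stop word as soon as the pattern occurs anywhere inside it. The following minimal sketch uses only java.util.regex to contrast the two modes; the class MatchModes, its method names, and the sample pattern are hypothetical and not part of Weka.

```java
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of Weka.
class MatchModes {

    // True if the pattern occurs anywhere in the token (the mode RegExStopwords uses).
    static boolean containsMatch(Pattern pattern, String token) {
        return pattern.matcher(token).find();
    }

    // True only if the pattern covers the entire token.
    static boolean isWholeMatch(Pattern pattern, String token) {
        return pattern.matcher(token).matches();
    }

    public static void main(String[] args) {
        Pattern digits = Pattern.compile("[0-9]+");
        for (String token : new String[] {"2016", "mp3", "weka"}) {
            System.out.println(token
                    + " find=" + containsMatch(digits, token)
                    + " matches=" + isWholeMatch(digits, token));
        }
    }
}
```

With find(), "mp3" is flagged because it contains a digit; with matches(), only "2016" would be. If you want to drop only tokens that consist entirely of the pattern, switching the handler to matches() (or anchoring the expression with ^ and $) is one option.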

You can then pass the regular-expression-based stop word handler to different filters, in this case a StringToWordVector:

       StringToWordVector filter = new StringToWordVector();
       filter.setStopwordsHandler(new RegExStopwords("([0-9]|@|n\\/a|[\\%\\€\\$\\£])"));
       ...
       filter.setIDFTransform(true);
       filter.setTFTransform(true);
       ...
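
To get a feeling for which tokens this expression removes, it can help to try it on a few sample tokens outside of Weka first. The sketch below is an assumption-laden preview, not part of the tested setup: the class StopwordPreview, the method keptTokens, and the sample tokens are made up for illustration, and the expression is written in an equivalent form with the redundant backslashes dropped and the currency symbols given as Unicode escapes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical preview helper for illustration only; it does not use Weka.
class StopwordPreview {

    // Equivalent to the expression passed to RegExStopwords:
    // a digit, an @ sign, the text "n/a", or one of % € $ £.
    static final String REGEX = "([0-9]|@|n/a|[%\u20AC$\u00A3])";

    // Returns the tokens that are NOT stop words, using the same find() test as the handler.
    static List<String> keptTokens(String regex, List<String> tokens) {
        Pattern pattern = Pattern.compile(regex);
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            if (!pattern.matcher(token).find()) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("price", "100", "n/a", "user@example", "50%", "weka");
        System.out.println(keptTokens(REGEX, tokens)); // prints [price, weka]
    }
}
```

Every token that contains a digit, an @ sign, the text "n/a", or one of the listed symbols is dropped, even if the rest of the token is an ordinary word.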

Version: This code has been tested with the following development version of Weka, which is available through this Maven dependency:

<dependency>
    <groupId>nz.ac.waikato.cms.weka</groupId>
    <artifactId>weka-dev</artifactId>
    <version>3.7.13</version>
</dependency>

References
[1] Weka 3: Data Mining with Open Source Machine Learning Software in Java. http://www.cs.waikato.ac.nz/ml/weka/
[2] Stop Words. http://nlp.stanford.edu/IR-book/html/htmledition/dropping-common-terms-stop-words-1.html
