Using a regular expression to specify stop words in Weka Machine Learning from Java

In this article, I want to share some Java code that shows how to define stop words with a regular expression in Weka. Weka is a collection of machine learning algorithms for data mining tasks, written in Java. The algorithms can either be applied directly to a dataset or called from your own Java code [1]. This article covers calling the algorithms directly from Java, not from the Weka Explorer.

Problem: Some extremely common words appear to be of little value in helping select documents matching a user need, so they are sometimes excluded from the vocabulary entirely. These words are called stop words [2]. Weka offers several options for specifying stop words, but a handler based on a single regular expression is not among the default implementations of the StopwordsHandler interface.

Implementations of StopwordsHandler
Solution: The following simple implementation of the StopwordsHandler solves the problem:

import weka.core.stopwords.StopwordsHandler;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegExStopwords implements StopwordsHandler {
    private final Pattern pattern;

    // Compile the expression once so it can be reused for every token
    public RegExStopwords(String regexString) {
        pattern = Pattern.compile(regexString);
    }

    // A token counts as a stop word if the pattern matches anywhere inside it
    @Override
    public boolean isStopword(String s) {
        Matcher matcher = pattern.matcher(s);
        return matcher.find();
    }
}

You can then pass the regular-expression-based stop words handler to different filters, in this case a StringToWordVector:

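// Requires: import weka.filters.unsupervised.attribute.StringToWordVector;
StringToWordVector filter = new StringToWordVector();
// Tokens containing a digit, "@", "n/a" or one of %, €, $, £ are treated as stop words
filter.setStopwordsHandler(new RegExStopwords("([0-9]|@|n\\/a|[\\%\\€\\$\\£])"));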
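
Because isStopword relies on Matcher.find(), a token is dropped as soon as the expression matches any part of it, not only when the whole token matches. The following sketch illustrates the difference using plain java.util.regex, without Weka on the classpath. The class name RegexStopwordDemo and the helpers containsMatch and wholeMatch are hypothetical names chosen for this illustration. The expression is an ASCII-only spelling that is intended to behave like the one above (\u20AC is €, \u00A3 is £).

```java
import java.util.regex.Pattern;

// Hypothetical demo class, not part of Weka or of the original article
class RegexStopwordDemo {
    // ASCII-only spelling of the article's expression (assumed equivalent)
    static final Pattern PATTERN =
            Pattern.compile("[0-9]|@|n/a|[%\\u20AC$\\u00A3]");

    // Same check as RegExStopwords.isStopword: the pattern may match anywhere in the token
    static boolean containsMatch(String token) {
        return PATTERN.matcher(token).find();
    }

    // Stricter alternative: the pattern must cover the entire token
    static boolean wholeMatch(String token) {
        return PATTERN.matcher(token).matches();
    }

    public static void main(String[] args) {
        String[] tokens = {"weka", "abc123", "2016", "n/a", "50%", "user@example.org"};
        for (String token : tokens) {
            System.out.println(token
                    + " find=" + containsMatch(token)
                    + " matches=" + wholeMatch(token));
        }
    }
}
```

With find(), a token such as "abc123" is treated as a stop word because it contains a digit. If only exact tokens should be removed, switching to matches() or anchoring the expression with ^ and $ would be one way to tighten the rule.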

Version: This code has been tested with the following development version of Weka. (Use the following Maven dependency)


[1] Weka 3 – Data Mining with Open Source Machine Learning Software in Java
[2] Stop Words
