In this article, I want to share some Java code that shows how to define stop words with a regular expression in Weka. Weka is a collection of machine learning algorithms for data mining tasks, written in Java. The algorithms can either be applied directly to a dataset or called from your own Java code [1]. This article covers the second case: calling Weka from Java code, not from the Weka Explorer.
Problem: Some extremely common words appear to be of little value in helping select documents matching a user need, so they are sometimes excluded from the vocabulary entirely. These words are called stop words [2]. Weka offers several options for specifying stop words, but a handler based on a single regular expression is not among the default implementations of StopwordsHandler.

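The following class implements StopwordsHandler with a single regular expression:

import weka.core.stopwords.StopwordsHandler;

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegExStopwords implements StopwordsHandler {

    // Compiled once in the constructor and reused for every token.
    private final Pattern pattern;

    public RegExStopwords(String regexString) {
        pattern = Pattern.compile(regexString);
    }

    @Override
    public boolean isStopword(String s) {
        Matcher matcher = pattern.matcher(s);
        // find() succeeds if the pattern occurs anywhere in the token.
        return matcher.find();
    }
}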
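Because isStopword relies on Matcher.find(), a token counts as a stop word as soon as the pattern occurs anywhere inside it. The sketch below uses only java.util.regex (no Weka dependency) to compare that behaviour with Matcher.matches(), which requires the whole token to match. The class name FindVersusMatches, the digit pattern, and the sample tokens are made up for illustration.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; not part of Weka or of the handler above.
class FindVersusMatches {

    // Same kind of check as RegExStopwords.isStopword: true if the pattern occurs anywhere in the token.
    static boolean containsMatch(Pattern pattern, String token) {
        return pattern.matcher(token).find();
    }

    // Stricter alternative: true only if the entire token matches the pattern.
    static boolean wholeTokenMatches(Pattern pattern, String token) {
        return pattern.matcher(token).matches();
    }

    public static void main(String[] args) {
        Pattern digit = Pattern.compile("[0-9]");
        for (String token : new String[] {"7", "mp3", "music"}) {
            System.out.println(token
                    + " find=" + containsMatch(digit, token)
                    + " matches=" + wholeTokenMatches(digit, token));
        }
    }
}
```

With this digit pattern, mp3 is flagged by find() but not by matches(). If only whole-token matches should count as stop words, one option is to anchor the expression (for example ^[0-9]+$) or to use matches() in the handler instead.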
You can then pass the regular-expression-based stop words handler to different filters, in this case a StringToWordVector:
StringToWordVector filter = new StringToWordVector();
filter.setStopwordsHandler(new RegExStopwords("([0-9]|@|n\\/a|[\\%\\€\\$\\£])"));
...
filter.setIDFTransform(true);
filter.setTFTransform(true);
...
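Before wiring an expression into a filter, it can help to preview which tokens it would flag. The following is a minimal sketch that uses only java.util.regex and assumes the tokens are already split. The class StopwordRegexPreview, the method keepNonStopwords, and the sample tokens are hypothetical. The pattern is meant to be equivalent to the one above, with the optional backslash escapes dropped and the euro and pound signs written as the regex escapes \u20AC and \u00A3 so the source stays ASCII-only.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper for previewing a stop-word regex; not part of Weka.
class StopwordRegexPreview {

    // Equivalent to the pattern passed to RegExStopwords in the filter setup:
    // any digit, an at sign, the string n/a, or one of the symbols % euro $ pound.
    static final String ARTICLE_REGEX = "([0-9]|@|n/a|[%\\u20AC$\\u00A3])";

    // Returns the tokens that a find()-based stop-word check would NOT flag.
    static List<String> keepNonStopwords(String regex, List<String> tokens) {
        Pattern pattern = Pattern.compile(regex);
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            if (!pattern.matcher(token).find()) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Made-up sample tokens for illustration.
        List<String> tokens = Arrays.asList(
                "price", "20%", "n/a", "user@example", "$5", "weka", "mp3");
        System.out.println(keepNonStopwords(ARTICLE_REGEX, tokens));
    }
}
```

Running main prints [price, weka]; every other sample token contains a digit, an @, the string n/a, or one of the listed symbols.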
Version: This code has been tested with the following development version of Weka, available through this Maven dependency:
<dependency>
    <groupId>nz.ac.waikato.cms.weka</groupId>
    <artifactId>weka-dev</artifactId>
    <version>3.7.13</version>
</dependency>
References
[1] Weka 3 – Data Mining with Open Source Machine Learning Software in Java – http://www.cs.waikato.ac.nz/ml/weka/
[2] Stop Words – http://nlp.stanford.edu/IR-book/html/htmledition/dropping-common-terms-stop-words-1.html