Archive for the 'machine learning' Category

Using a regular expression to specify stop words in Weka Machine Learning from Java

In this article, I want to share some Java code that shows how to define stop words with a regular expression in Weka. Weka is a collection of machine learning algorithms for data mining tasks, written in Java. The algorithms can either be applied directly to a dataset or called from your own Java code [1]. This article covers calling the algorithms from your own Java code, not from the Weka Explorer.

Problem: Some extremely common words are of little value in helping to select documents that match a user need, so they are often excluded from the vocabulary entirely. These words are called stop words [2]. Weka offers several ways to specify stop words, but matching them against a single regular expression is not among the default implementations of the StopwordsHandler.

[Figure: Implementations of StopwordsHandler]
Solution: The following simple implementation of the StopwordsHandler solves the problem:

import weka.core.stopwords.StopwordsHandler;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegExStopwords implements StopwordsHandler {
    // Compiled once in the constructor and reused for every token
    private final Pattern pattern;

    public RegExStopwords(String regexString) {
        pattern = Pattern.compile(regexString);
    }

    // A token counts as a stop word if the pattern matches anywhere inside it
    @Override
    public boolean isStopword(String s) {
        Matcher matcher = pattern.matcher(s);
        return matcher.find();
    }
}

You can then pass the regular-expression-based stop words handler to different filters, in this case a StringToWordVector:

       StringToWordVector filter = new StringToWordVector();
       filter.setStopwordsHandler(new RegExStopwords("([0-9]|@|n\\/a|[\\%\\€\\$\\£])"));
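To see which tokens the expression above actually flags, here is a minimal stand-alone sketch that uses only java.util.regex, so it runs without Weka on the classpath. The class name RegexStopwordDemo and the helper keepNonStopwords are hypothetical names chosen for illustration; they are not part of Weka. The pattern is an equivalent spelling of the one above: the redundant backslashes are dropped and the euro and pound signs are written as Unicode escapes so the file compiles under any source encoding.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Stand-alone sketch: no Weka classes are needed to try out the pattern.
final class RegexStopwordDemo {
    // Equivalent to the expression passed to RegExStopwords above;
    // \u20AC is the euro sign and \u00A3 is the pound sign.
    static final Pattern STOPWORD_PATTERN =
            Pattern.compile("([0-9]|@|n/a|[%\u20AC$\u00A3])");

    // Mirrors RegExStopwords.isStopword: true if the pattern occurs
    // anywhere in the token, not only if the whole token matches.
    static boolean isStopword(String token) {
        return STOPWORD_PATTERN.matcher(token).find();
    }

    // Hypothetical helper: keeps only the tokens that are not flagged.
    static List<String> keepNonStopwords(List<String> tokens) {
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            if (!isStopword(token)) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList(
                "price", "42", "abc123", "n/a", "user@example.com", "10%", "weka");
        System.out.println(keepNonStopwords(tokens)); // prints [price, weka]
    }
}
```

Note that because the handler calls find() rather than matches(), a token such as "abc123" is dropped as soon as it contains a single digit. If only tokens that match the expression as a whole should be treated as stop words, swapping find() for matches() would be one way to do that.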

Version: This code has been tested with the following development version of Weka. (Use the following Maven dependency)


[1] Weka 3 – Data Mining with Open Source Machine Learning Software in Java
[2] Stop Words

Free eBook: Bayesian Reasoning and Machine Learning

While studying for the Coursera Machine Learning course I took last year, my learning partner Dimitris L. recommended that we use the book Bayesian Reasoning and Machine Learning by Prof. David Barber as complementary literature. David Barber is currently a professor of Information Processing in the Department of Computer Science at UCL, where he develops novel information processing schemes, mainly based on the application of probabilistic reasoning. As the title suggests, the book is all about the concepts and techniques behind Bayesian reasoning and machine learning:


Machine learning methods extract value from vast data sets quickly and with modest resources. They are established tools in a wide range of industrial applications, including search engines, DNA sequencing, stock market analysis, and robot locomotion, and their use is spreading rapidly.

What is the Machine Learning class by Prof Ng on Coursera like? My experiences

Sometime in October last year, I decided to learn more about big data, machine learning and predictive analytics. I gave Coursera a try and enrolled in the 10-week Machine Learning class by Prof. Andrew Ng from Stanford University [1-4]. Prof. Ng is one of the world-renowned experts in the field of machine learning, the director of the Stanford AI Lab, a truly amazing teacher and one of the co-founders of Coursera.

For those who do not know Coursera: it is an educational technology company that offers free massive open online courses. It partners with universities all around the globe and offers courses in computer science, engineering, physics, humanities, medicine, biology, social sciences, mathematics and business.
