Archive for the 'java' Category

Using a regular expression to specify stop words in Weka Machine Learning from Java

In this article, I share some Java code that shows how to define stop words with a regular expression in Weka. Weka is a collection of machine learning algorithms for data mining tasks, written in Java. The algorithms can either be applied directly to a dataset or called from your own Java code [1]. This article covers the second case: calling the algorithms directly from Java, not from the Weka Explorer.

Problem: Extremely common words that appear to be of little value in helping select documents matching a user need are sometimes excluded from the vocabulary entirely. These words are called stop words [2]. Weka offers several ways to specify stop words, but matching them against a single regular expression is not among the default implementations of StopwordsHandler.

Solution: The following simple implementation of the StopwordsHandler solves the problem:

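import weka.core.stopwords.StopwordsHandler;

import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Treats every token that contains a match of the given regular expression as a stop word. */
public class RegExStopwords implements StopwordsHandler {

    // Compiled once in the constructor and reused for every token.
    private final Pattern pattern;

    public RegExStopwords(String regexString) {
        pattern = Pattern.compile(regexString);
    }

    @Override
    public boolean isStopword(String s) {
        // find() matches anywhere in the token; matches() would require the whole token to match.
        Matcher matcher = pattern.matcher(s);
        return matcher.find();
    }
}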

You can then add the regular-expression-based stop words to different filters, in this case a StringToWordVector:

       StringToWordVector filter = new StringToWordVector();
       filter.setStopwordsHandler(new RegExStopwords("([0-9]|@|n\\/a|[\\%\\€\\$\\£])"));
       ...
       filter.setIDFTransform(true);
       filter.setTFTransform(true);
       ...
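To get a feeling for which tokens the pattern above flags, here is a minimal standalone sketch that uses only java.util.regex, so it runs without Weka on the classpath. The class name RegexStopwordDemo and the sample tokens are my own assumptions for illustration; the check itself mirrors the find()-based isStopword method shown earlier.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical demo class: reproduces the stop word check without any Weka dependency.
final class RegexStopwordDemo {

    // Same regular expression as in the StringToWordVector example.
    static final Pattern STOPWORDS = Pattern.compile("([0-9]|@|n\\/a|[\\%\\€\\$\\£])");

    // A token counts as a stop word if the pattern matches anywhere inside it.
    static boolean isStopword(String token) {
        return STOPWORDS.matcher(token).find();
    }

    // Returns the tokens that are not stop words, in their original order.
    static List<String> keep(List<String> tokens) {
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            if (!isStopword(token)) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Sample tokens are made up for this illustration.
        List<String> tokens = Arrays.asList("weka", "2015", "user@example.com", "n/a", "50%", "filter");
        System.out.println(keep(tokens)); // prints [weka, filter]
    }
}
```

Because find() is used, a token such as "mp3" is also dropped: a single digit anywhere in the token is enough. If only whole-token matches should count, matches() would be the call to use instead.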

Version: This code has been tested with the following development version of Weka (use this Maven dependency):

<dependency>
    <groupId>nz.ac.waikato.cms.weka</groupId>
    <artifactId>weka-dev</artifactId>
    <version>3.7.13</version>
</dependency>

References
[1] Weka 3 – Data Mining with Open Source Machine Learning Software in Java – http://www.cs.waikato.ac.nz/ml/weka/
[2] Stop Words – http://nlp.stanford.edu/IR-book/html/htmledition/dropping-common-terms-stop-words-1.html

SQL Connection Strings

In the last couple of weeks I have been working a lot with different databases that I had to connect to from Java. It is sometimes stressful to look up the right format of the connection string for a database. Even though these strings are meant to be standardized, they are not.

I found the very helpful website ConnectionStrings.com, which lists connection strings for open-source as well as commercial databases. The list includes, among others, connection strings for Microsoft SQL Server 2008, MySQL, Oracle, IBM DB2, Informix, PostgreSQL, Caché, SQLite, …

As an example, to connect to a server in a replicated server configuration without caring which server is used, use the following connection string:

Server=serverAddress1, serverAddress2, serverAddress3;Database=myDataBase;
Uid=myUsername;Pwd=myPassword;
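The string above is written in the Server/Uid/Pwd style, not as a JDBC URL. For Java, a rough equivalent could look like the following sketch. It assumes a MySQL database and the multi-host URL form of MySQL Connector/J (hosts separated by commas); the class and method names are made up for illustration, and the sketch only builds the URL and the credentials, it does not open a connection.

```java
import java.util.Properties;

// Hypothetical helper: builds a JDBC URL and credentials, but does not connect.
final class FailoverUrlSketch {

    // Joins the hosts with commas, assuming MySQL Connector/J's multi-host URL form.
    static String buildUrl(String database, String... hosts) {
        return "jdbc:mysql://" + String.join(",", hosts) + "/" + database;
    }

    // Credentials are kept out of the URL and passed as properties instead.
    static Properties credentials(String user, String password) {
        Properties props = new Properties();
        props.setProperty("user", user);
        props.setProperty("password", password);
        return props;
    }

    public static void main(String[] args) {
        String url = buildUrl("myDataBase", "serverAddress1", "serverAddress2", "serverAddress3");
        Properties props = credentials("myUsername", "myPassword");
        System.out.println(url);
        // With the MySQL driver on the classpath, the connection would be opened with:
        // java.sql.Connection con = java.sql.DriverManager.getConnection(url, props);
    }
}
```

Other databases use different URL schemes, so the exact format should always be checked against the documentation of the respective JDBC driver.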

I am sharing this because it can save you a lot of time otherwise spent looking up these strings in tutorials or in the documentation of the individual databases.

SMSing – SMS client and server

In this post, I present a tiny program for sending SMS messages over the Internet:

SMSing is a simple client/server application that sends SMS text messages over the Internet using Clickatell as the service provider. The application supports multiple users, credits, logging, and so on. The current version should not be used in a production environment; it is meant to demonstrate how to use and combine different technologies to send SMS messages. Nevertheless, SMSing is interesting for people who send lots of SMS messages, are tired of typing them on the keypad of their mobile phone, and do not want to pay the high prices of their mobile provider. SMSing even lets you send anonymous messages, or messages with a fake sender number.


WebsiteWatcher – An easy to use website monitor with SMS notification

In this post, I present a very simple but effective program that monitors static web pages and sends you an SMS message when a page changes. I called the tool WebsiteWatcher:

WebsiteWatcher allows you to monitor a specific (static) web page for changes. If the page is updated, you immediately receive an SMS message as notification. The tool was initially used to monitor university websites where exam results were published. Please note that this simple tool only works for static pages. Clickatell is used as the SMS provider; it provides a reliable, fast and cheap SMS gateway for 578 networks in 192 countries.
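One common way to detect such a change is to store a hash of the page content and compare it with the hash of a later download. The sketch below illustrates that idea only; it is an assumption on my part and not necessarily how WebsiteWatcher is implemented. The class and method names are hypothetical, and both downloading the page and sending the SMS are left out.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical sketch: detects a change by comparing SHA-256 fingerprints of the page content.
final class PageChangeSketch {

    // Returns the SHA-256 digest of the content as a lowercase hex string.
    static String fingerprint(String pageContent) {
        try {
            MessageDigest sha = MessageDigest.getInstance("SHA-256");
            byte[] digest = sha.digest(pageContent.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is required on every Java platform, so this should not happen.
            throw new IllegalStateException(e);
        }
    }

    // True if the current content no longer matches the stored fingerprint.
    static boolean hasChanged(String previousFingerprint, String currentContent) {
        return !fingerprint(currentContent).equals(previousFingerprint);
    }

    public static void main(String[] args) {
        // Made-up page contents standing in for two downloads of the same URL.
        String before = "<html><body>Results: pending</body></html>";
        String after = "<html><body>Results: published</body></html>";
        String stored = fingerprint(before);
        System.out.println(hasChanged(stored, before)); // prints false
        System.out.println(hasChanged(stored, after));  // prints true
    }
}
```

This also shows why the approach only suits static pages: anything that differs on every request, such as a timestamp or a session token in the HTML, changes the fingerprint even though the visible content is the same.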
