Archive for the 'software' Category

Using a regular expression to specify stop words in Weka Machine Learning from Java

In this article, I want to share some Java code that shows how to define stop words with a regular expression in Weka. Weka is a collection of machine learning algorithms for data mining tasks, written in Java. The algorithms can either be applied directly to a dataset or called from your own Java code [1]. This article covers calling the algorithms directly from Java, not from the Weka Explorer.

Problem: Sometimes, some extremely common words which would appear to be of little value in helping select documents matching a user need are excluded from the vocabulary. These words are called stop words [2]. Weka offers several options for specifying stop words, but a single regular expression is not among the default implementations of the StopwordsHandler.

Implementations of StopwordsHandler
Solution: The following simple implementation of the StopwordsHandler solves the problem:

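import weka.core.stopwords.StopwordsHandler;

import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Treats every token in which the regular expression finds a match as a stop word. */
public class RegExStopwords implements StopwordsHandler {

    // Compiled once in the constructor and reused for every token.
    private final Pattern pattern;

    public RegExStopwords(String regexString) {
        pattern = Pattern.compile(regexString);
    }

    @Override
    public boolean isStopword(String s) {
        // find() succeeds on a partial match, so the pattern may occur anywhere in the token.
        Matcher matcher = pattern.matcher(s);
        return matcher.find();
    }
}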

You can then add the regular-expression-based stop words to different filters, in this case a StringToWordVector:

       StringToWordVector filter = new StringToWordVector();
       filter.setStopwordsHandler(new RegExStopwords("([0-9]|@|n\\/a|[\\%\\€\\$\\£])"));
       ...
       filter.setIDFTransform(true);
       filter.setTFTransform(true);
       ...
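
To see which tokens this expression actually removes, here is a minimal sketch that does not need Weka at all. It applies the same pattern with find(), just as RegExStopwords does. The class name RegExStopwordsDemo and the sample tokens are assumptions made up for illustration.

```java
import java.util.regex.Pattern;

public class RegExStopwordsDemo {

    // Same expression as the one passed to RegExStopwords in the filter setup.
    static final Pattern STOPWORDS = Pattern.compile("([0-9]|@|n\\/a|[\\%\\€\\$\\£])");

    // Mirrors RegExStopwords.isStopword: a partial match is enough.
    static boolean isStopword(String token) {
        return STOPWORDS.matcher(token).find();
    }

    public static void main(String[] args) {
        // Hypothetical sample tokens, not taken from a real dataset.
        String[] tokens = {"weka", "2015", "mp3", "n/a", "user@example.com", "50%", "data"};
        for (String token : tokens) {
            System.out.println(token + " -> " + (isStopword(token) ? "dropped" : "kept"));
        }
    }
}
```

Because find() is used, every token that merely contains a digit (such as "mp3") is dropped as well. If only whole-token matches are wanted, the expression could be anchored with ^ and $, or matches() could be used instead of find().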

Version: This code has been tested with the following development version of Weka (use this Maven dependency):

<dependency>
    <groupId>nz.ac.waikato.cms.weka</groupId>
    <artifactId>weka-dev</artifactId>
    <version>3.7.13</version>
</dependency>

References
[1] Weka 3 – Data Mining with Open Source Machine Learning Software in Java – http://www.cs.waikato.ac.nz/ml/weka/
[2] Stop Words – http://nlp.stanford.edu/IR-book/html/htmledition/dropping-common-terms-stop-words-1.html

What is the Machine Learning class by Prof Ng on Coursera like? My experiences

Sometime in October last year, I decided to learn more about big data, machine learning and predictive analytics. I gave Coursera a try and enrolled in the 10-week Machine Learning class by Prof. Andrew Ng from Stanford University [1-4]. Prof. Ng is one of the world-renowned experts in the field of machine learning, the director of the Stanford AI Lab, a truly amazing teacher and one of the co-founders of Coursera.

For those who do not know Coursera: Coursera is an educational technology company that offers free massive open online courses. It partners with universities all around the globe and offers courses in computer science, engineering, physics, humanities, medicine, biology, social sciences, mathematics and business.


Using OpenRefine to gain insights into, cluster, clean and enrich messy data

Imagine the following scenario: You get a file (Excel, CSV, text, XML, …) containing a list with lots of customer, vendor or project data, and you want to structure and clean the data before you can use it for analytics, reporting or other processing steps. There are a lot of duplicate entries, names are spelled in different ways, everything is a big mess and a manual cleanup will cost you a few hours of your precious time…

Solution

OpenRefine (formerly Google Refine) is a free and open source application which allows you to explore data (generate insights), clean and transform it using powerful scripting possibilities, and reconcile or match it with data from any kind of web service or database such as Freebase. The possibilities are endless, since you can extend your dataset with all kinds of data available through web services. In addition to the core OpenRefine product, a growing list of extensions and plugins is available. [2]


Using SQL WITH clause to create temporary static tables at query time

A few days ago, I came across the following problem: I currently work on a project where I am responsible for an application which writes an entry to a log table every time a job is executed. This table contains a lot of information on job statuses, possible problems, exceptions, durations, and so on. I was working on some analytics on this data and needed to enrich it with the version of the software which generated each log entry (since we were not capturing this in the log table). From our configuration management tool, I was able to extract the dates on which each version of the software was deployed to production.

Problem

My intention was to create a temporary table to join onto the logged entries, but I didn't want to create the tables on the Oracle server (mainly because they would have been just temporary tables and because the schema user I was using didn't have the rights to create tables).
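
One way such a query could look: the static rows are written as SELECT ... FROM dual statements combined with UNION ALL inside a WITH clause, so no table has to be created on the server. Since the other code in this archive is Java, the sketch below generates the query text in Java. The log table job_log, its column created_at, the version numbers and the dates are all hypothetical and only serve as an illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.StringJoiner;

public class VersionCteSketch {

    // Builds a WITH clause whose rows are static literals selected from Oracle's dual table.
    // Each row is {version, deployedFrom, deployedUntil} with dates in yyyy-MM-dd format.
    // Values are concatenated directly, so they must be trusted constants, not user input.
    static String buildVersionsCte(List<String[]> deployments) {
        StringJoiner rows = new StringJoiner("\n  UNION ALL\n");
        for (String[] d : deployments) {
            rows.add("  SELECT '" + d[0] + "' AS version, "
                    + "DATE '" + d[1] + "' AS deployed_from, "
                    + "DATE '" + d[2] + "' AS deployed_until FROM dual");
        }
        return "WITH versions AS (\n" + rows + "\n)";
    }

    public static void main(String[] args) {
        // Hypothetical deployment history.
        List<String[]> deployments = Arrays.asList(
                new String[] {"1.0", "2014-01-15", "2014-03-02"},
                new String[] {"1.1", "2014-03-02", "2014-06-20"});

        // Join every log entry to the version that was in production when it was written.
        String sql = buildVersionsCte(deployments) + "\n"
                + "SELECT l.*, v.version\n"
                + "FROM job_log l\n"
                + "JOIN versions v\n"
                + "  ON l.created_at >= v.deployed_from\n"
                + " AND l.created_at < v.deployed_until";
        System.out.println(sql);
    }
}
```

The generated statement is a plain SELECT, so it only needs read access to the log table.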


MySQL: group_concat allows you to easily concatenate the grouped values of a row

Last week I stumbled over a really useful function in MySQL: group_concat allows you to concatenate the values of one column across multiple rows by grouping them by one field. You can choose the separator to use for the concatenation. The full syntax is as follows:

GROUP_CONCAT([DISTINCT] expr [,expr ...]
             [ORDER BY {unsigned_integer | col_name | expr}
                 [ASC | DESC] [,col_name ...]]
             [SEPARATOR str_val])

According to the MySQL documentation, the function returns a string result with the concatenated non-NULL values from a group. It returns NULL if there are no non-NULL values. To eliminate duplicate values, use the DISTINCT clause. To sort values in the result, use the ORDER BY clause. To sort in reverse order, add the DESC (descending) keyword to the name of the column you are sorting by in the ORDER BY clause.
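
To make these semantics concrete, here is a rough Java analogue of GROUP_CONCAT(DISTINCT value ORDER BY value SEPARATOR ...) grouped by a key. It is only a sketch of the behaviour described above, not how MySQL implements it: the class name, the sample rows and the use of Java's natural string ordering (MySQL sorts according to the column's collation) are assumptions.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

public class GroupConcatSketch {

    // Each row is {key, value}. Values are grouped by key, NULLs are skipped,
    // duplicates are removed (DISTINCT) and values are sorted ascending (ORDER BY).
    static Map<String, String> groupConcat(List<String[]> rows, String separator) {
        Map<String, TreeSet<String>> groups = new LinkedHashMap<>();
        for (String[] row : rows) {
            TreeSet<String> values = groups.computeIfAbsent(row[0], k -> new TreeSet<>());
            if (row[1] != null) {
                values.add(row[1]);
            }
        }
        Map<String, String> result = new LinkedHashMap<>();
        for (Map.Entry<String, TreeSet<String>> group : groups.entrySet()) {
            // Like GROUP_CONCAT, yield null when a group has no non-NULL values.
            TreeSet<String> values = group.getValue();
            result.put(group.getKey(), values.isEmpty() ? null : String.join(separator, values));
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical rows: (customer, product).
        List<String[]> rows = Arrays.asList(
                new String[] {"alice", "sql"},
                new String[] {"alice", "java"},
                new String[] {"alice", "sql"},
                new String[] {"bob", null});
        groupConcat(rows, "; ").forEach((key, joined) -> System.out.println(key + ": " + joined));
    }
}
```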


The influence of software quality requirements on the suitability of software cost estimation methods

Today I gave a talk at the 24th International Forum on COCOMO and Systems/Software Cost Modeling, held at MIT in Cambridge, MA. I presented the intermediate results of the research for my master's thesis, which is supervised by Dr. Stefan Wagner from TUM and Dr. Ricardo Valerdi from MIT. Here are the slides, as well as the abstract of the work.

Download

Abstract

Cost/Benefit-Aspects of Software Quality Assurance

As software becomes more and more pervasive, high software quality and the ability to produce good software cost estimates become more and more important. Business owners want the software to run smoothly and deliver value, and they want to know upfront what building or adapting a software system will cost.

This is why, in summer 2008, I took part in a seminar on software quality at the chair of Univ.-Prof. Dr. Dr. h.c. Manfred Broy, Technische Universität München. I did extensive research on software quality in general and wrote a paper on the cost/benefit aspects of software quality assurance, which I want to present to you. The paper points out several interesting aspects of how to optimize investments in various software quality assurance techniques and thus in software quality.

Because of the high quality of the papers written by the seminar participants, the seminar supervisors decided to officially publish the results as a working paper of the Technische Universität München. You can find the link to the publication in the links section at the end of this article.

Please feel free to share your thoughts on this paper.

Cost/Benefit-Aspects of Software Quality Assurance – Abstract:

Along with the ever more apparent importance and criticality of software systems for modern societies arises the urgent need to deal efficiently with the quality assurance of these systems. Even though the necessity of investments in software quality should not be underestimated, it seems economically unwise to invest seemingly random amounts of money in quality assurance. The precise prediction of the costs and benefits of various software quality assurance techniques within a particular project allows for economically sound decision-making.

This paper presents the cost estimation models COCOMO, its successor COCOMO II, and COQUALMO, which is a quality estimation model derived from COCOMO II. Furthermore, an analytical idealized model of defect detection techniques is presented. It provides a range of metrics, for example the return on investment (ROI) of software quality assurance. The method of ROI calculation is exemplified in this paper.

In conclusion, an overview of the debate concerning the ascertainment of quality and cost in general is given. Although there are a number of techniques today to verify the cost-effectiveness of quality assurance, the results are thus far often unsatisfactory. Since all known models make heavy use of empirically gained data, it is very important to question their results judiciously and avoid misreadings.

Download the software cost estimation and quality assurance paper:

Cost/Benefit-Aspects of Software Quality Assurance
