Usefulness of Confusion Matrices

A Confusion Matrix is a visual performance assessment of a classification algorithm in the form of a table layout or matrix. Each column of the matrix represents predicted classifications and each row represents actual defined classifications. This representation is a useful way to help evaluate a classifier model. A well behaved model should produce a balanced matrix and have consistent percent-correct numbers for accuracy, recall, precision and the F measure. If it does not, there is cause to further evaluate the data used to build the model and the data used to test it. If you are building your own classification models, this is a helpful way to evaluate them. If you are buying a product for something like Sentiment Analysis that uses a classification model, ask for the data associated with the Confusion Matrix to help evaluate the tool.

If a classification algorithm distinguishes between positive, negative and neutral text statements, for example, a confusion matrix summarizing its results might look like the following.

             Predicted:
Actual       Pos    Neg   Neutral
Pos           15     10       100
Neg           10     15        10
Neutral       10    100      1000
  • In this matrix, the total actual number of statements in each category is the sum of its row: 125 Positive, 35 Negative and 1110 Neutral.
  • Columns represent predictions made by the algorithm. In the first column, 15 Positive statements were correctly classified as Positive, while 10 Negative statements and 10 Neutral statements were incorrectly classified as Positive. Those 20 incorrectly classified statements are false positives for the Positive class.
  • Reading left to right across the top row, of the total Positive statements, 15 were classified as Positive, 10 were classified as Negative and 100 were classified as Neutral. The 110 Positive statements that were missed are false negatives for the Positive class.
  • Values on the diagonal (15, 15 and 1000) are correctly classified. All other classifications are incorrect.


The simplest and most intuitive assessment is Accuracy: the number of correct classifications divided by all classifications. In this example, divide the sum of the diagonal values by the sum of the values in all cells. 1030/1270 = 0.811

How to Cheat:

If the categories are highly imbalanced, high accuracy can be obtained by always guessing the category with the largest number of elements. As a sanity check, compare the algorithm's accuracy with the percentage you would get right by randomly (or always) classifying elements into the dominant category. Accuracy should also be validated with Precision and Recall to detect whether the data is set up in a way that inflates the score.
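
For the matrix above, a classifier that always guesses Neutral would be correct on 1110 of the 1270 statements, an accuracy of 0.87, which beats the 0.811 computed for the model without classifying a single Positive or Negative statement correctly.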

Precision is the number of correct classifications penalized by the number of incorrect classifications: true positives / (true positives + false positives). In this example, the precision is:

(reading down each predicted column)

Positive statements: 15 / (15 + 20) = 0.43

Negative statements: 15 / (15 + 110) = 0.12

Neutral statements: 1000 / (1000 + 110) = 0.90

How to Cheat:

You can get high precision by rarely classifying items into a given category, but this ruins Recall.


Recall is the number of correct classifications penalized by the number of missed items: true positives / (true positives + false negatives). In this example, the recall is:

(reading across each Actual row)

Positive statements: 15 / (15 + 110) = 0.12

Negative statements: 15 / (15 + 20) = 0.43

Neutral statements: 1000 / (1000 + 110) = 0.90

F Measure (F1)

The F1 measure is a derived effectiveness measurement. The resulting value is interpreted as the harmonic mean of precision and recall. The best value is 1 and the worst is 0.

F1 = 2 * ((precision * recall) / (precision + recall))

Positive statements: 2 * ((0.43 * 0.12) / (0.43 + 0.12)) = 0.19

Negative statements: 2 * ((0.12 * 0.43) / (0.12 + 0.43)) = 0.19

Neutral statements: 2 * ((0.9 * 0.9) / (0.9 + 0.9)) = 0.90
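
These numbers are straightforward to reproduce programmatically. Below is a minimal Java sketch (my own illustration, not code from the original post) that computes accuracy and per-class precision, recall and F1 from the matrix above; the class and method names are only illustrative:

public class ConfusionMatrixMetrics {

    static final String[] LABELS = {"Pos", "Neg", "Neutral"};

    public static void main(String[] args) {
        // Rows are actual classes, columns are predicted classes: Pos, Neg, Neutral
        int[][] m = {
            { 15,  10,  100},   // actual Pos
            { 10,  15,   10},   // actual Neg
            { 10, 100, 1000}    // actual Neutral
        };
        report(m);
    }

    static void report(int[][] m) {
        int n = m.length;
        double total = 0, correct = 0;
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < n; j++) total += m[i][j];
            correct += m[i][i];              // diagonal = correct classifications
        }
        System.out.printf("Accuracy = %.2f%n", correct / total);

        for (int c = 0; c < n; c++) {
            double rowSum = 0, colSum = 0;
            for (int k = 0; k < n; k++) {
                rowSum += m[c][k];           // all actual instances of class c
                colSum += m[k][c];           // all predictions of class c
            }
            double precision = m[c][c] / colSum;   // TP / (TP + FP)
            double recall    = m[c][c] / rowSum;   // TP / (TP + FN)
            double f1        = 2 * precision * recall / (precision + recall);
            System.out.printf("%-8s precision=%.2f recall=%.2f F1=%.2f%n",
                              LABELS[c], precision, recall, f1);
        }
    }
}

Run against the matrix above, it reports an accuracy of 0.81 and, since it keeps full precision in the intermediate values, an F1 of 0.19 for both the Positive and Negative classes.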


In this example, although the Accuracy of 81% seems good, the precision and recall indicate that the numbers are skewed by an imbalance in the number of values tested across the categories. This is clearly visible in the F measure numbers. It could be that the model in this example was built with many more neutral statements than positive and negative ones, and that the test data contains mostly neutral statements. Accuracy alone can be misleading.

Ideally you want to see a matrix more like the following:

             Predicted:
Actual       Pos    Neg   Neutral
Pos          105     10        10
Neg            6     90         8
Neutral       12     15       150

Accuracy = 0.85


Precision

Positive = 0.85

Negative = 0.78

Neutral = 0.89


Recall

Positive = 0.84

Negative = 0.87

Neutral = 0.85

F Measure

Positive = 0.84

Negative = 0.82

Neutral = 0.87

All the numbers are in the same ballpark and are reasonably balanced across the matrix, with the bulk of the values on the diagonal showing a majority of correct classifications. This also suggests a greater likelihood that the model was constructed with a balanced data set and that the test data evaluated by the model is also balanced.
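
Reusing the report helper from the earlier sketch (again, an illustrative helper, not code from the original post) on this balanced matrix confirms the picture:

int[][] balanced = {
    {105, 10,  10},   // actual Pos
    {  6, 90,   8},   // actual Neg
    { 12, 15, 150}    // actual Neutral
};
ConfusionMatrixMetrics.report(balanced);
// prints an accuracy of 0.85, with every per-class precision, recall and F1
// value falling roughly between 0.78 and 0.89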


Generating SerialVersionUID in Netbeans 7

There are many reasons why you should generate a unique id for Serializable java objects.

  • making sure you serialize and deserialize objects with the same structure
  • serializable objects without a declared id have one generated; different JVMs use different algorithms to generate ids, so if you marshal objects across different platforms the ids may not match
  • changes to classes that break backward compatibility should declare a changed UID to ensure errors are thrown by old code using those objects.
To name just a few. If you’re interested, there is a Java Object Serialization Specification.
If you use Netbeans 7, there is a plugin that will auto-generate SerialVersionUIDs. On Kenai, the SerialVersionUID Generator project provides an nbm plugin, eu-easyedu-netbeans-svuid-1.9.7.nbm. The link points to the Netbeans 7.0.1 version. Earlier versions of the plugin are also available to download from the project at the time of this posting.
Once the plugin is installed, when creating a class that implements Serializable, right-click in the edit window, select Insert Code… , then select:
  • add generated serialVersionUID or
  • add default serialVersionUID
The correct syntax is added to the class file code.
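
For reference, what the plugin inserts is a single static final long field on the Serializable class. A minimal sketch of the result (the class below and the field value are only illustrative; the default option typically inserts 1L, while the generated option computes a value from the class structure):

import java.io.Serializable;

// Illustrative class, not from the original post
public class Customer implements Serializable {

    // Inserted by the plugin: the "default" option typically uses 1L, while the
    // "generated" option computes a value from the class structure
    private static final long serialVersionUID = 1L;

    private String name;
    private String email;
}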

Clearing cache in JPA 2.0

Managing a data cache effectively is important for application or service performance, although it is not necessarily easy to do. A simple and common design involving a data cache that seems to cause problems is the following:

  • a process that continually commits new data to a database
  • another process, executing in a separate virtual machine and typically a web service, marshaling data from the database
The web service needs to pull the most current data from the database each time there is a client request to access data. With different product implementations of caching, not all the JPA calls seem to work the same way. There can also be multiple levels of caching. The clear() and flush() methods of EntityManager don’t clear multiple levels of caching in all cases, even with caches turned off via persistence properties.
Every time there is a request to pull data from the service, the current set of data in the database needs to be represented, not what is in the cache, since the cache may not be in sync with the database. It seems like this should be simple, but it took some experimenting on my part to get it working as needed. There also doesn’t appear to be much information about handling this particular scenario. There are probably solutions posted somewhere, but I add one solution here to make it easier to find.
Before making any gets for data, clear both the persistence context and the shared (second-level) cache.
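
A minimal sketch of one way to do this with the standard JPA 2.0 API, assuming an open EntityManager instance named em:

// Detach all managed entities, clearing the persistence context (first-level cache)
em.clear();

// Evict everything from the shared, second-level cache via the JPA 2.0 Cache API
em.getEntityManagerFactory().getCache().evictAll();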


This seems to work for all cache settings and forces all caches to be cleared. Subsequent calls to get data return the latest committed data in the database. This may not be the most efficient way, but it always gets the latest data. For systems requiring high performance, this won’t work very well. It would be better to refresh the cache periodically and have all clients get the latest cached values. But that still leaves the issue of refreshing the entire cache.

Any suggestions on doing this better are appreciated. But for now, this works consistently across platforms and reasonably quickly for small to moderate amounts of data.

User Feedback in Design

The difference between an easy-to-use, useful application and one that you’re forced to struggle with usually correlates directly with whether or not the application was designed and built considering user input during the course of development. This applies both to user interface design and application functions. Original ideas come from project stakeholders, and they drive the project. Users evaluate features and ease of use in the context of what they wish to accomplish, and their feedback drives the implementation. With as much as 75% of Facebook users unhappy with the changes introduced 21 Sept 2011, it’s safe to say that their user base did not have much input during the development of the latest release. Facebook now has an earful of feedback. Let this be a prominent example showing the importance of:

  • knowing who your users are,
  • knowing what your users wish to accomplish,
  • understanding what makes your users’ tasks simple and efficient to accomplish,
  • listening to your users for product success!

The Importance of Distributed Systems Development

In an age of ever-increasing information collection and the need to evaluate it, building systems which utilize the yet untapped and available compute resources in everyone’s home and hands should be driving the development of more sophisticated distributed computing systems. Today, large data processing facilities provide significant computing capabilities. Utilizing the worldwide plethora of distributed resources in a coherent way is much more powerful.

Distributed programming and processing tools and techniques are currently a reality but are in their infancy. The potential rapid growth of distributed systems is already supported by:

  • Storage, bandwidth, and CPU cycles stay on course to becoming nearly free. (Free: The Future of a Radical Price)
  • The number of people and devices connected to the internet continues to grow.
  • Data storage requirements increase as data accumulates from a growing number of sources.

It is becoming more common to see terabyte storage devices in homes. Desktop and laptop appliances have become somewhat of a commodity, affordable to the average consumer. You can stake out a claim to a table for a period of time at your local coffee shop and access the internet for free. Becoming a social citizen on the internet with portable computing resources was once cost prohibitive; the price has now plummeted to a level affordable to a significant portion of the population.

Distribution of affordable, cheap and free compute devices to the general public continues to grow. Most of the resources sit idle much of the time. Game consoles, cell phones, tablets, laptops, desktops, etc. can now all participate in the storage and processing of data.

Like it or not, the ability to capture and share data is becoming increasingly easy. You can watch your favorite gorilla in the jungle or use collective intelligence to extract social and individual patterns with service APIs provided by large corporations like Google and Amazon. Today’s transient data coming from sources in real time will eventually be stored. Much of the data is and will be captured and stored in perpetuity within corporations and in data centers. Some of what should be available may be accessible through the gates of these data centers.

The approach to managing and controlling processing remains focused on huge data centers. In this sense, social and engineering thought is still akin to the 19th-century practice of building monolithic systems with centralized control. As data generation increases and the cost of storage decreases, huge data centers are being built to house and process data. Google, Apple, Codera and NTT America to name just a few. What will they do with all this data and how much will be shared?

IBM announced its plans to build a petaflop machine for the SKA telescope program. It is a laudable and beneficial effort, and research and lessons learned from it will undoubtedly be valuable. But efforts should also be made to build distributed systems of equal or greater benefit. Efforts such as BOINC provide a rudimentary but effective start. File-sharing peers using DHTs have already demonstrated their power and influence. Both illustrate the cost-effective use of existing distributed compute resources where most data is accessible to everyone.

Distributed Computing is in its infancy (I’m not referring to Cloud Computing). A number of technologies supporting distributed computing have been developed; some have survived and some have waned. A sophisticated distributed system is on par in importance with nanotechnologies and artificial intelligence, and it will support those other technologies as well. It has the potential to distribute the energy needed for processing rather than requiring a power plant dedicated to running a data center. It has the potential to distribute data storage so it is never lost and to provide a means for individuals to control their own personal information. It has the potential to provide mechanisms which capture data in real time and process it as needed, where needed, with the most efficient use of resources, in so doing mirroring the real world (a la Gelernter’s Mirror Worlds).

So although building data center citadels and powerful HPC computers is valuable, so is developing and building sophisticated distributed computing systems. In fact, it’s likely much more important.


TED 2009 and Distributed Computing

The 2009 TED conference was this week. This is its 25th year, although it is the first time I have heard of it, thanks mostly to Twitter and those twittering the experience. I didn’t attend, but I monitored some of the activities and sessions. It is a gathering and sharing of great minds and their visions, aspirations and creations in both science and art. I hope to be able to attend in person at some point.

With all the talk and demos about technological advances and the need to capture, mine and process the vast amounts of electronic data produced, I’m surprised there was no mention of harnessing the compute power available in phones, desktops, clouds, supercomputers, and all devices everywhere, or of how that might be done. Maybe I missed it (since I wasn’t there), or maybe it wasn’t the proper forum for that kind of discussion, but it seems to me that it was glaringly missing.

If anyone knows about such discussions taking place at the conference in sessions or even breakout groups, I am interested in finding out about them.

Installing Puppet on Solaris

There are a number of sites with information about installing Puppet on Solaris. They each contain slightly different instructions which get you most of the way there. With a little finesse it’s not hard to follow the instructions and get things working. This post includes yet another set of instructions for installing Puppet and getting things running. Hopefully, with these instructions and others as a reference, your installation will go smoothly.

For those who are unfamiliar with Puppet, it is a tool for automating system administration, built and supported by Reductive Labs. They describe Puppet as a declarative language for expressing system configuration, a client and server for distributing it, and a library for realizing the configuration. Rather than a system administrator having to follow procedures, run scripts and configure things by hand, Puppet lets you define a configuration, automatically applies it to the specified servers and then maintains it. Puppet can be downloaded for many of the most popular operating systems. There is a download page with links to some installation instructions.

Installation on Solaris

1) To make installation more automated, install the Solaris package pkg-get. This tool simplifies getting the latest version of packages from a known site. A copy can be found at Blastwave.

Download the package to /tmp. Make sure the installation is done with root privilege (su to root), then run the following command from the /tmp directory:

# pkgadd -d pkg_get.pkg

The package can also be added using the following command

# pkgadd -d

2) Verify that the pkg-get configuration file is configured for your region, in this case the U.S. Change the default download site in the configuration file /opt/csw/etc/pkg-get.conf to:




3) Add some new directories to your path.  pkg-get, wget and gpg are installed in /opt/csw/bin.

# export PATH=/opt/csw/bin:/opt/csw/sbin:/usr/local/bin:$PATH

4) Install the complete wget package. wget is a GNU tool used to download packages from the web. It is very useful for automating installs and software updates, and it will be used by pkg-get.

# pkg-get -i wget


If you haven’t installed the entire Solaris OS, pkg-get may fail to install wget with the error:

“no working version of wget found, in PATH”

This is probably due to missing SUNWwgetr and SUNWwgetu packages. Install them by inserting an installation DVD into the drive and mounting it to /media/xxxx.

Install the Solaris packages

# pkgadd -d . SUNWwgetr
# pkgadd -d . SUNWwgetu

5) Configure pkg-get to support automation.

# cp -p /var/pkg-get/admin-fullauto /var/pkg-get/admin

6) Install gnupg and an md5 utility so security validation of Blastwave packages can be done.

# pkg-get -i gnupg textutils

You may also need to add /usr/sfw/lib to LD_LIBRARY_PATH so the needed libraries can be found.

7) Copy the Blastwave PGP public key to the local host.

# wget --output-document=pgp.key

8) Import pgp key

# gpg --import pgp.key

9) Verify that the following two lines in /opt/csw/etc/pkg-get.conf are COMMENTED OUT.


10) Puppet is built with Ruby. Install the Ruby software (CSWruby) from Blastwave.

# pkg-get -i ruby

11) Install the Ruby Gems software (CSWrubygems) from Blastwave.

# pkg-get -i rubygems

12) Update RubyGems to the latest version and install the gems used by Puppet:

# gem update --system

# gem install facter

# gem install puppet --version '0.24.7'

or the current version. The gem update command can also be used to update the software later:

# gem update puppet

13) Create the puppet user and group:

Info to add in /etc/passwd: puppet:x:35001:35001:puppet user:/home/puppet:/bin/sh
Info to add in /etc/shadow: puppet:LK:::::::
Info to add in /etc/group: puppet::35001:

14) Create the following core directories and set the permissions:

# mkdir -p /sysprov/dist/apps /sysprov/runtime/puppet/prod/puppet/master
# chown -R puppet:puppet /sysprov/dist /sysprov/runtime

15) Add Puppet configuration definitions in /etc/puppet/puppet.conf. The initial content, using your own puppetmaster hostname, should be:

server = <your puppetmaster hostname>
report = true

16) Repeat this process for the servers which will run Puppet. At least two should be set up: one will be the Puppet master server, the other a Puppet client server that will be managed.

Validating the Installation and Configuring Secure Connections

To verify that the Puppet installation is working as expected, pick a single client to use as a testbed. With Puppet installed on that machine, run a single client against the central server to verify that everything is working correctly.

Start the master puppet daemon on the server defined in puppet.conf files.

# puppetmasterd --debug

Start the first client in verbose mode, with the --waitforcert flag enabled. The default server name for puppetd is puppet, so use the --server flag to specify the name of the server running puppetmasterd. Later, the server hostname can be added to the configuration file.

# puppetd --server <puppetmaster hostname> --waitforcert 60 --test

Adding the --test flag causes puppetd to stay in the foreground, print extra output, run only once and then exit, and to exit immediately if the remote configuration fails to compile (by default, puppetd will use a cached configuration if there is a problem with the remote manifests).
Running the client should produce a message like:

info: Requesting certificate
warning: peer certificate won’t be verified in this SSL session
notice: Did not receive certificate

This message will repeat every 60 seconds with the above command. This is normal, since your server is not initially set up to auto-sign certificates as a security precaution. On your server running puppetmasterd, list the waiting certificates:

# puppetca --list

You should see the name of the test client. Now go ahead and sign the certificate:

# puppetca --sign <test client hostname>

The test client should receive its certificate from the server, receive its configuration, apply it locally, and exit normally.
By default, puppetd runs with a waitforcert of five minutes; set the value to the desired number of seconds or to 0 to disable it entirely.

Getting this far, you now have Puppet installed with a base initial configuration and secure connections defined between a Puppet master server and one Puppet client server. At this point you can start defining manifests for the desired server configurations.

There are various sample recipes and manifests to start working with. Viewing and editing some of these is a good place to start learning how to create configuration definitions. If there is interest, I can share a sample as well if I have one that may be useful for your needs.