If you've looked at my recent posts, you know I'm working on a plugin for VisualVM, a very useful tool supplied with the JDK. In one example, I showed how to attach to a waiting Java application using a socket-based AttachingConnector. At that time I said that there were two primary ways of attaching to a process with JDI -- via shared memory, and with a socket.

It turns out there is a "third way". Following is an example of why this way is useful, and why it was provided.

When I last wrote JDI programs (in Java 5), I would notice that my target application would start up and print (to stdout) the port on which it was listening, as in the following:
Listening for transport dt_socket at address: 55779
In Java 5, if you detached your debugger from this process, you would get another line to stdout in the target's console, like this:
Listening for transport dt_socket at address: 55779
and this would go on for as long as you chose to attach and detach, etc.
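
For context, the target prints that "Listening ..." line because it was started with the JDWP agent in server mode and without an explicit address, so the JVM picks a port itself. A launch command along these lines produces it (the class name and the suspend setting here are just illustrative, not the exact command I used):

java -agentlib:jdwp=transport=dt_socket,server=y,suspend=n com.example.TargetApp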

At some point (and I don't know when this started happening), the port on which the target is listening started changing on each detach of an external debugger. If, in Java 6 (I'm using u20), you repeatedly attach and detach from the target process, you'll see the following output in the target's console:
Listening for transport dt_socket at address: 55837
ERROR: transport error 202: recv error: Connection reset by peer
Listening for transport dt_socket at address: 55844
ERROR: transport error 202: recv error: Connection reset by peer
Listening for transport dt_socket at address: 55846
ERROR: transport error 202: recv error: Connection reset by peer
Listening for transport dt_socket at address: 55911
If you're writing an application that attaches using the debug port, each time you attach you need to find out what port the target is using. This information is not available from the process itself; in other words, you have to play the usual unpleasant game of capturing console output to know what the port is. Even if you specify a port at target start, you still need to get your hands on the value.
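
If you do end up scraping the console, the line is at least easy to pick apart. Here is a minimal sketch of the kind of capture-and-parse I mean (the class name is made up, and in real life the matched line would come from the target's redirected stdout rather than a string literal):

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PortScrape
{
    private static final Pattern LISTEN_LINE =
            Pattern.compile("Listening for transport dt_socket at address: (\\d+)");

    public static void main(String[] args)
    {
        // In real use, this line comes from the target's captured console output.
        Matcher m = LISTEN_LINE.matcher("Listening for transport dt_socket at address: 55779");
        if (m.find())
        {
            int port = Integer.parseInt(m.group(1));
            System.out.println("Target is listening on port " + port);
        }
    }
}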

You can still find the original request for a feature to attach to a process by its process ID if you search around the old Java bug reports. The long and short of it: a new AttachingConnector was created, one which attaches by PID. As you know, sometimes it isn't much fun finding a process's PID either. In my case, however, I am writing a plugin for VisualVM, and one thing you get for free when you do that is Visual VM's API, which as you might expect includes calls to get the PID. My goal, then, is to use this new connector in my VisualVM plugin, and I thought it might be appreciated if I shared the details.

I've adapted my test program from an earlier post so that it now outputs the details of each AttachingConnector; the changed code fragment is shown here:
List<AttachingConnector> attachingConnectors = vmMgr.attachingConnectors();
for (AttachingConnector ac : attachingConnectors)
{
    Map<String, Connector.Argument> paramsMap = ac.defaultArguments();
    Iterator<String> keyIter = paramsMap.keySet().iterator();
    System.out.println("AttachingConnector: '" + ac.getClass().getName() + "'");
    System.out.println("    name: '" + ac.name() + "'");
    System.out.println("    description: '" + ac.description() + "'");
    System.out.println("    transport name: '" + ac.transport().name() + "'");
    System.out.println("    default arguments:");
    while (keyIter.hasNext())
    {
        String nextKey = keyIter.next();
        System.out.println("        key: '" + nextKey + "'; value: '" + paramsMap.get(nextKey) + "'");
    }
}
The output from this code is shown below:
AttachingConnector:  'com.sun.tools.jdi.SocketAttachingConnector'
name: 'com.sun.jdi.SocketAttach'
description: 'Attaches by socket to other VMs'
transport name: 'dt_socket'
default arguments:
key: 'timeout'; value: 'timeout='
key: 'hostname'; value: 'hostname=AdamsResearch'
key: 'port'; value: 'port='
AttachingConnector: 'com.sun.tools.jdi.SharedMemoryAttachingConnector'
name: 'com.sun.jdi.SharedMemoryAttach'
description: 'Attaches by shared memory to other VMs'
transport name: 'dt_shmem'
default arguments:
key: 'timeout'; value: 'timeout='
key: 'name'; value: 'name='
AttachingConnector: 'com.sun.tools.jdi.ProcessAttachingConnector'
name: 'com.sun.jdi.ProcessAttach'
description: 'Attaches to debuggee by process-id (pid)'
transport name: 'local'
default arguments:
key: 'pid'; value: 'pid='
key: 'timeout'; value: 'timeout='
A couple of things I hadn't noticed before are that the socket-based connector comes with the hostname argument pre-set to my machine's hostname, and that all three connectors have a timeout default argument. The first observation brings up an interesting point: if you use the local, PID-based connector, remember that you can only attach to processes on your debugger's host.
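
Before getting to the wrinkle below, here is a minimal sketch of the attach itself, just to show how little is involved: select the connector named com.sun.jdi.ProcessAttach from the list above and fill in its pid argument. The class name, method name, and the idea of passing the PID in as a string are my own illustration, not code from my plugin:

import java.util.Map;

import com.sun.jdi.Bootstrap;
import com.sun.jdi.VirtualMachine;
import com.sun.jdi.connect.AttachingConnector;
import com.sun.jdi.connect.Connector;

public class PidAttachSketch
{
    public static VirtualMachine attachByPid(String pid) throws Exception
    {
        for (AttachingConnector ac : Bootstrap.virtualMachineManager().attachingConnectors())
        {
            if ("com.sun.jdi.ProcessAttach".equals(ac.name()))
            {
                Map<String, Connector.Argument> args = ac.defaultArguments();
                args.get("pid").setValue(pid);   // the PID you obtained, e.g. from VisualVM
                return ac.attach(args);
            }
        }
        throw new IllegalStateException("ProcessAttachingConnector not available");
    }
}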

I changed my test program to use the local connector and it works as before! Well, no, actually, it does not. Here's what I now get:
java.lang.UnsatisfiedLinkError: no attach in java.library.path
Exception in thread "main" java.io.IOException: no providers installed
at com.sun.tools.jdi.ProcessAttachingConnector.attach(ProcessAttachingConnector.java:86)
at com.adamsresearch.jdiDemo.JDIDemo.main(JDIDemo.java:70)
Does this mean the local connector isn't ready for use? No, but I was burned by the same issue that has plagued a number of others (scroll down on that page -- a reader of the post found the issue, and another reader partially solved it). I'm working on a Windows platform, and when you do that you have to be a little careful ;-> . In this case, the problem is caused by 1) running the java interpreter found on the system path, when 2) that path does not point directly into your JDK or JRE directory. The executable looks for the needed libraries in a path relative to itself, and when Windows copies the java executable to C:\Windows\system32 (or similar) -- and you use that copy -- the relative path is broken. The comments on the post above frame the problem as a JRE-java versus JDK-java distinction, but I don't believe that's the real issue. For example, below are the results of my attach test in 3 different scenarios:
  1. Using java from my path, the first hit of which comes from C:\Windows\system32:

    java -cp c:\jdk1.6.0_20\lib\tools.jar;. com.adamsresearch.jdiDemo.JDIDemo 10816 863 fileName
    ...
    java.lang.UnsatisfiedLinkError: no attach in java.library.path
    Exception in thread "main" java.io.IOException: no providers installed
    at com.sun.tools.jdi.ProcessAttachingConnector.attach(ProcessAttachingConnector.java:86)
    at com.adamsresearch.jdiDemo.JDIDemo.main(JDIDemo.java:70)

  2. Using the full path to the JRE bin java:

    c:\jdk1.6.0_20\jre\bin\java -cp c:\jdk1.6.0_20\lib\tools.jar;. com.adamsresearch.jdiDemo.JDIDemo 10816 863 fileName
    ...
    Attached to process 'Java HotSpot(TM) 64-Bit Server VM'

  3. Using the full path to the JDK bin java:
    c:\jdk1.6.0_20\bin\java -cp c:\jdk1.6.0_20\lib\tools.jar;. com.adamsresearch.jdiDemo.JDIDemo 10816 863 fileName
    ...
    Attached to process 'Java HotSpot(TM) 64-Bit Server VM'
As you can see, the results support my theory: it isn't the JRE vs. the JDK that causes the problem, but rather the context-poor placement of the java executable in the "usual" Windows binaries directory. That posting is several years old, so it is possible that at the time the needed JDI libraries really were missing from the JRE, but it is clear that today you will see the same exception if you use the java executable found in Windows' default binaries directory.

Now, if I run my JDI application against my JarView utility, searching for AttachingConnector in the JDK installation directory, I get the following output:
Breakpoint at line 863:
fileName = 'AttachingConnector.class'
Breakpoint at line 863:
fileName = 'GenericAttachingConnector$1.class'
Breakpoint at line 863:
fileName = 'GenericAttachingConnector.class'
Breakpoint at line 863:
fileName = 'ProcessAttachingConnector$1.class'
Breakpoint at line 863:
fileName = 'ProcessAttachingConnector$2.class'
Breakpoint at line 863:
fileName = 'ProcessAttachingConnector.class'
Breakpoint at line 863:
fileName = 'SharedMemoryAttachingConnector$1.class'
Breakpoint at line 863:
fileName = 'SharedMemoryAttachingConnector.class'
Breakpoint at line 863:
fileName = 'SocketAttachingConnector$1.class'
Breakpoint at line 863:
fileName = 'SocketAttachingConnector.class'
and so I have done what I set out to do, which was 1) debug-attach by process ID, and 2) thrash through the inevitable hiccups and share the solutions. Hopefully this will be useful to you, too.

Note: actually, there are even more ways to attach to a Java process. JPDA Connection and Invocation is the definitive guide, from Oracle. If you're going to be writing debuggers, you can't go wrong reading this page first.
  1. For nearly two years, I've been trying to branch out and add another programming language to my brain.  I read and blogged about Seven Languages in Seven Weeks, by Bruce Tate, an excellent book that I blasted through in seven days to save a little time.  If you read my blog, you'll know that I finally settled on Haskell, started posting about my experience as an object-oriented programmer writing in a functional language, and then things kind of fizzled out.

    I really like Haskell.  However, I think I'm one of those people who tend to learn better when under pressure.  Since I didn't have a job requirement to learn Haskell or an otherwise motivating situation, I never really quite got into it.  I still plan to, some day.

    But, I have finally picked the "new" language I want to learn, and that is R (I say "new" because of course R is not a new language).  I had a number of reasons to do so:
    • Big Data is all the buzzword-rage right now, and R figures prominently in many big-data scenarios.
    • I'm taking MOOCs at coursera, and the ones I'm taking use R as the programming platform, ensuring that I must have more than a superficial understanding of the language.  I had actually looked at R once before and never stuck with it for the same reasons I did not stick with Haskell -- no looming deadlines!
    • As I learn more about R, I become more impressed by how handily it performs tasks that require a lot of boilerplate code in any other language I've used, so that experience provides me more motivation to keep learning.
    • I am currently working at a bank, and I'm already starting to use R not only to greatly speed up some tasks that I need to perform, but also to perform analyses that would have required so much Java code that they would have gone on the "back burner."
    I'm also happy to report there has been some convergence, for me, among big data, R, Haskell and my recent exposure to functional programming.  R is an interesting language.  I don't have an especially formal computer-science background (instead, I'm from physics, math, and electrical engineering), so I probably would not be the best person to articulate how R checks (and does not check) boxes for functional and object-oriented languages.  But all that Haskell investigation helped a lot when I started learning MapReduce, and seeing functional features in R that also fit well into the MapReduce paradigm makes me feel - as all curious types should - that all that investigation was worthwhile.

    I'll still blog about Java occasionally, but my posts for the near future will be focused on my self-training to fill in gaps in my skill set related to big data.  I have started a new blog on this topic, called Data Scientist in Training.  If you read me on DZone, you don't have to do much to find me, as my posts from both blogs will continue to find their way to DZone (the big-data posts go to a microzone called Big Data/BI Zone).  If you read me directly on Blogger, then please bookmark the link above if you're interested in what I'm doing.  At the least, please check out my Welcome! post, where I explain my path and reference some resources that you may also want to look at if you want to learn more about big data.

    My posts about R on Data Scientist in Training will not explicitly say anything in the title like "Java developer struggles with R data frames", but it will still be obvious that my approach to R is that of a developer who has used Java for about 90% of his coding for the last 15 years.  If you're a Java developer and are learning R, I hope there will be some content there of special use to you.  As I've searched online while learning R, I've noticed helpful responders trying to explain how to move from the "use a for-loop to iterate and then build your model in rows" approach to "use a mapping function to create your new column of data, then add it to your data frame".  (In fact, this reminds me of another feature I like about R -- R data frames remind me of tables in the column-oriented databases used extensively in big data).  I'm going to blog in near-real-time so I don't forget those dead ends I encountered as I was trying to map Java onto R, and that perspective is the one I think will be most helpful to fellow Java/OO developers.

    There are a few posts on Data Scientist in Training already.  The next one will be specifically about R -- I hope you check it out when it arrives!












  2. I've been experimenting with using Pig on some Fannie-Mae MBS data lately.  While I don't mind writing MapReduce programs to process data (especially the fairly simple tasks I'm doing now), I really do appreciate the "magic" Pig does under the blanket, you might say.  If you don't know, Pig, a member of the Hadoop ecosystem (and now a first-class Apache project at pig.apache.org), is a framework for analyzing large data sets.  In this mini-tutorial we'll see how Pig works with Hadoop and HDFS, and just how much you can accomplish with only a few lines of script.  I am using Pig version 0.10.0 on Hadoop 1.1.0 (on Ubuntu 12.04, on VirtualBox 4.2.4, on Windows 7SP1, on the third floor of a tri-level at 1728 m above sea level, but that could change -- see this story about another "PIG").

    I'll assume in this tutorial that you have Hadoop and Pig installed and that you are running Hadoop at least in pseudo-distributed mode.  If you're really fresh to both topics, I would recommend first looking at their respective Apache websites, and for getting Hadoop deployed and running, it really doesn't get any better than Michael Noll's posts on the subject.  For a single-node cluster (which is sufficient for following this tutorial), see his post Running Hadoop On Ubuntu Linux (Single-Node Cluster).  I am reading Tom White's Hadoop:  The Definitive Guide, which contains a very useful chapter on Pig.

    First, the dataset

    Before we start, let's look at the data we'll be parsing.  On the Fannie Mae website, you can find a page of the most-recent mortgage-pool issues (click on "New Issues Statistics").  Pipe-delimited files are available for each day for which issue data is available.  On this page, I'm most interested in the New Issue Pool Statistics, which I'll abbreviate NIPS.  These files are interesting in that they contain records in several different formats (a link on the above page refers to a document that describes the various record formats found in a NIPS file).  So, as you parse a NIPS file, you need to look at the 2nd column of data first, then refer to the file-format description file, to interpret the data.

    As an example, I'm looking at the last few lines of the 9 November NIPS file.  I've included only the lines for one CUSIP, AQ7340:

    AQ7340|01|3138MPEN1|11/01/2012|FNMS 02.5000 CI-AQ7340|$2,218,111.00|2.5||12/25/2012|U.S. BANK N.A.|U.S. BANK N.A.|9|247169.44|11/01/2027|||||||3.092|||1|180|179|92|779|0.0||0.0|CI  ||76.17|97
    AQ7340|02|MAX|375000.0|3.5|94.0|813|180|2|180
    AQ7340|02|75%|317250.0|3.25|93.0|796|180|1|180
    AQ7340|02|MED|241800.0|3.0|92.0|786|180|0|180
    AQ7340|02|25%|209725.0|3.0|91.0|773|180|0|179
    AQ7340|02|MIN|179500.0|2.875|90.0|697|180|0|178
    AQ7340|03|REFINANCE|9|100.0|$2,218,111.30
    AQ7340|04|1|9|100.0|$2,218,111.30
    AQ7340|05|PRINCIPAL RESIDENCE|9|100.0|$2,218,111.30
    AQ7340|08|2012|9|100.0|$2,218,111.30
    AQ7340|09|GEORGIA|1|8.98|$199,118.84
    AQ7340|09|ILLINOIS|1|9.52|$211,250.00
    AQ7340|09|MICHIGAN|2|19.13|$424,312.98
    AQ7340|09|MINNESOTA|2|24.93|$552,916.34
    AQ7340|09|MISSOURI|1|10.82|$239,984.67
    AQ7340|09|WASHINGTON|2|26.62|$590,528.47
    AQ7340|10|U.S. BANK N.A.|9|100.0|$2,218,111.30
    AQ7340|17|BROKER|1|16.91|$375,000.00
    AQ7340|17|CORRESPONDENT|6|59.27|$1,314,611.30
    AQ7340|17|RETAIL|2|23.83|$528,500.00

    In this tutorial, we will be processing NIPS files to add up the total unpaid balances (UPBs) totaled on a per-state basis. Referring to the NIPS file layout description, I see I need to look at records where field #2 is "09". What we will want to do with this data is to accumulate the dollar amount of each UPB into a "state" key, over an entire NIPS file or set of NIPS files, and output the totals by state when we're done.
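
    Just to make the task concrete, here is that accumulation expressed as plain Java (the class and method names are made up for illustration); the point of the rest of this post is that Pig collapses this, along with the file handling and the MapReduce plumbing, into a few lines of script:

    import java.util.HashMap;
    import java.util.Map;

    public class UpbByState
    {
        private final Map<String, Double> totalUpbByState = new HashMap<String, Double>();

        // called once per "09" record, with the cleaned-up dollar amount
        public void addRecord(String state, double aggregateUpb)
        {
            Double runningTotal = totalUpbByState.get(state);
            totalUpbByState.put(state, (runningTotal == null ? 0.0 : runningTotal) + aggregateUpb);
        }

        public Map<String, Double> totals()
        {
            return totalUpbByState;
        }
    }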

    These are not huge datasets, of course.  But, for the purpose of creating an interesting tutorial, we'll process a small amount of NIPS data and see which states are seeing the most mortgage activity (at least as far as Fannie-Mae new issues are concerned).  The main point here is learning how to process the data and leverage the capabilities of Pig.

    Loading the data into HDFS

    I will start by downloading all of the available data on the Fannie Mae NIPS page to my local filesystem.  At the time of this tutorial, this included data from the 23rd of August (2012) through the 21st of November.  This set provides just a little under 400K lines of output.  The next step is to copy from my local storage to HDFS:

    $ bin/hadoop dfs -copyFromLocal /home/hduser/dev/pigExamples/nipsData /user/hduser/pigExample

    We can verify the transfer to HDFS with:

    $ bin/hadoop dfs -ls /user/hduser/pigExample

    Examining a single file with Pig

    We're going to start by loading a single file and attempting to filter out lines whose record type is not "09".  I'm assuming you have installed Pig and it is configured to access HDFS.  Start the Pig interpreter:

    hduser@ubuntu:~$ pig
    2012-11-24 16:56:19,305 [main] INFO  org.apache.pig.Main - Apache Pig version 0.10.0 (r1328203) compiled Apr 19 2012, 22:54:12
    2012-11-24 16:56:19,306 [main] INFO  org.apache.pig.Main - Logging error messages to: /home/hduser/pig_1353801379300.log
    2012-11-24 16:56:19,518 [main] INFO  org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: hdfs://localhost:9000
    2012-11-24 16:56:19,858 [main] INFO  org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to map-reduce job tracker at: localhost:9001
    grunt> 

    You can see from the output that Pig knows I'm running Hadoop in (pseudo-)distributed mode.  If you don't see these connection messages, verify that your PIG_CLASSPATH is set.  Next I'm going to load a single file from HDFS into Pig.  Pig assumes the field delimiter is a tab; since our file is pipe ("|") delimited, we will use PigStorage to override the default:

    grunt> NIPS_9Nov = load 'pigExample/nips_11092012.txt' using PigStorage('|');

    (Note that the path is relative to '/user/hduser').  This load will not occur until the data is required; for example, right now a "dump" would cause the file to be loaded.  In fact, if you type

    grunt> dump NIPS_9Nov;

    you will see a flurry of activity, related to the Hadoop MapReduce task(s) being created on your behalf, culminating with the actual output of the parsed-on-pipe-symbol text, of which the last few lines look like the following:

    (AQ7340,09,GEORGIA,1,8.98,$199,118.84)
    (AQ7340,09,ILLINOIS,1,9.52,$211,250.00)
    (AQ7340,09,MICHIGAN,2,19.13,$424,312.98)
    (AQ7340,09,MINNESOTA,2,24.93,$552,916.34)
    (AQ7340,09,MISSOURI,1,10.82,$239,984.67)
    (AQ7340,09,WASHINGTON,2,26.62,$590,528.47)
    (AQ7340,10,U.S. BANK N.A.,9,100.0,$2,218,111.30)
    (AQ7340,17,BROKER,1,16.91,$375,000.00)
    (AQ7340,17,CORRESPONDENT,6,59.27,$1,314,611.30)
    (AQ7340,17,RETAIL,2,23.83,$528,500.00)

    This is good; this is what we want.  Next we'll want to look only at the record-type=09 records, then accumulate balances on a per-state basis.  In a fresh Pig shell, enter the following:

    grunt> nips_9nov = load '/user/hduser/pigExample/nips_11092012.txt' using PigStorage('|') as (poolNumber:bytearray, recordType:int, state:bytearray, numberOfLoans:int, percentageUpb:float, aggregateUpb:bytearray);
    grunt> fr_9nov = filter nips_9nov by (recordType == 9);
    grunt> dump fr_9nov;

    This will produce the same output as before, only restricting it to the "record type = 09" records.  Again, here is the tail of this output:

    (AQ7337,9,WYOMING,1,2.7,$170,619.65)
    (AQ7340,9,GEORGIA,1,8.98,$199,118.84)
    (AQ7340,9,ILLINOIS,1,9.52,$211,250.00)
    (AQ7340,9,MICHIGAN,2,19.13,$424,312.98)
    (AQ7340,9,MINNESOTA,2,24.93,$552,916.34)
    (AQ7340,9,MISSOURI,1,10.82,$239,984.67)
    (AQ7340,9,WASHINGTON,2,26.62,$590,528.47)

    At this point I'm going to change direction a little and put the Pig statements into a script, so it is a little easier to catch the output.  Create a new file called "pigTest.pig" and add the following lines to it:

    nips_9nov = load '/user/hduser/pigExample/nips_11092012.txt' using PigStorage('|') as (poolNumber:bytearray, recordType:int, state:bytearray, numberOfLoans:int, percentageUpb:float, aggregateUpb:bytearray);
    fr_9nov = filter nips_9nov by (recordType == 9);
    dump fr_9nov;

    Save the file and invoke it with:

    pig -f pigTest.pig &> pigTest.log

    Some of Pig's output goes to stderr, so you'll want to capture both stdout and stderr to your log file.  Open the log file and scroll down to:

     Job Stats (time in seconds):

    and look at the next two lines, the first of which is a header.  Note that Pig only generated a Map job and no Reduce jobs (Maps = 1, Reduces = 0, Feature = "MAP_ONLY").  Since we are only loading records and filtering them based on a field characteristic, no Reduce job was necessary.

    Next, we'll want to parse the aggregate unpaid balances for each mortgage, sum them by state, and output the totals.  The aggregate UPB is in the form of a human-readable, not-much-fun-to-parse bytearray (e.g. $3,759,464.16).  To treat these as floats we'll have to do a little cleanup.  This may not be terribly efficient, but I used a nested "REPLACE" function call:

    fr_clean = foreach fr_9nov generate poolNumber, state, numberOfLoans, percentageUpb, (float)REPLACE(REPLACE(aggregateUpb, '\$', ''), ',', '') as upbFloat;

    Note that if you enter this expression in the Pig shell, you'll need two additional escape ("\") characters in front of the dollar sign (which, as in the java.lang.String.replaceAll() method, is interpreted as a regex). In a script, you'll need to escape both the dollar sign and the backslash.  Trust me.  fr_clean will now contain cleaned-up unpaid balances that look like real floats.  In the Pig shell, you can verify the schema of the relation (but not that the data will parse, as this has not happened yet) with the following:

    grunt> describe fr_clean;
    2012-11-26 23:21:45,570 [main] WARN  org.apache.pig.PigServer - Encountered Warning IMPLICIT_CAST_TO_CHARARRAY 1 time(s).
    2012-11-26 23:21:45,570 [main] WARN  org.apache.pig.PigServer - Encountered Warning USING_OVERLOADED_FUNCTION 1 time(s).
    fr_clean: {poolNumber: bytearray,state: bytearray,numberOfLoans: int,percentageUpb: float,upbFloat: float}

    The final steps to output the states (and the District of Columbia) with the total unpaid balances of all new issues (for this file, in millions of dollars) are:

    grunt> byState = group fr_clean by state;
    grunt> totalUpb = foreach byState generate group, SUM(fr_clean.upbFloat)/1000000.0;
    grunt> dump totalUpb;

    I've glossed over these steps, but basically you are grouping by state and summing the unpaid balances on a per-state basis, then dividing the totals by one million.  After the dump call is completed, we get 51 lines of output, the last few of which are here:

    ...
    (CALIFORNIA,1021.2734624101563)
    (NEW JERSEY,103.9833925234375)
    (NEW MEXICO,18.8126310078125)
    (WASHINGTON,153.9220293671875)
    (CONNECTICUT,33.7019688515625)
    (MISSISSIPPI,24.3124981796875)
    (NORTH DAKOTA,7.280279875)
    (PENNSYLVANIA,147.6614224453125)
    (RHODE ISLAND,15.24327924609375)
    (SOUTH DAKOTA,10.51592517578125)
    (MASSACHUSETTS,156.1397877109375)
    (NEW HAMPSHIRE,16.52243540234375)
    (WEST VIRGINIA,8.3394678828125)
    (NORTH CAROLINA,129.42278906640624)
    (SOUTH CAROLINA,70.23617646875)
    (DISTRICT OF COLUMBIA,18.288814109375)

    In other words, California totaled slightly more than one billion dollars for the pools issued on the 9th of November in 2012.  

    To wrap things up a little, I'll next run from a Pig script file.  I mentioned earlier we need to be a little careful about the escape character in the "REPLACE" call.  Here's the script to process a single file:

    nips_9nov = load '/user/hduser/pigExample/nips_11092012.txt' using PigStorage('|') as (poolNumber:bytearray, recordType:int, state:bytearray, numberOfLoans:int, percentageUpb:float, aggregateUpb:bytearray);
    fr_9nov = filter nips_9nov by (recordType == 9);
    fr_clean = foreach fr_9nov generate poolNumber, state, numberOfLoans, percentageUpb, (float)REPLACE(REPLACE(aggregateUpb, '\\\$', ''), ',', '') as upbFloat;
    byState = group fr_clean by state;
    totalUpb = foreach byState generate group, SUM(fr_clean.upbFloat)/1000000.0;
    dump totalUpb;

    Processing the entire dataset

    There's not much left to do here but run against the entire dataset, which in our case is about three months' worth of new-issues files.  A slight modification to the script:

    nips = load '/user/hduser/pigExample' using PigStorage('|') as (poolNumber:bytearray, recordType:int, state:bytearray, numberOfLoans:int, percentageUpb:float, aggregateUpb:bytearray);
    fr = filter nips by (recordType == 9);
    fr_clean = foreach fr generate poolNumber, state, numberOfLoans, percentageUpb, (float)REPLACE(REPLACE(aggregateUpb, '\\\$', ''), ',', '') as upbFloat;
    byState = group fr_clean by state;
    totalUpb = foreach byState generate group, SUM(fr_clean.upbFloat)/1000000.0 as total;
    sortedUpb = order totalUpb by total;
    dump sortedUpb;

    results in similar data (sorted in ascending order of total aggregate UPB), of course with larger numbers.  For example, we see that during a three-month period starting in late August 2012, new Fannie-Mae pools representing $629M were issued for properties in Alaska.  You can also see from the output file that one Map and one Reduce job were created, and, I have to admit, quite a number of records were dropped (due to failure to parse):

    2012-11-26 23:51:38,269 [main] WARN  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Encountered Warning ACCESSING_NON_EXISTENT_FIELD 2396 time(s).
    2012-11-26 23:51:38,269 [main] WARN  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Encountered Warning FIELD_DISCARDED_TYPE_CONVERSION_FAILED 30869 time(s).

    On first inspection, it appears that 2396 "record type = 9" records actually didn't have enough fields to provide an aggregate unpaid balance column, and that I failed to successfully convert quite a few balances.  I did not investigate these records; however, such records generally tell you that you need to modify your parse logic.  In other words -- a good topic for another post!









  3. I've recently been writing JMS clients for an application I'm building and keep finding myself having to re-learn some basic configuration. While some standalone JMS servers are still quite simple to configure, the JMS resources of some application servers have become somewhat complex to configure over the years. I'm quite sure I remember, in an earlier version of WebLogic Server, going to a JMS page and configuring a topic or queue on-the-spot. Those days are gone. Below are notes which, if nothing else, will help me if I forget. It's not an exhaustive list, of course; just the products I happen to be working with now.

    The Simple Cases

    These are Apache's ActiveMQ and Allure Technology's JetStreamQ. There's not much to say here; once you get the server running (which itself is a very simple step), you create JMS resources in your Java code and they're, well, just there for you to use. What a concept!
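
    To make the point, here is roughly what that looks like with ActiveMQ; the broker URL and queue name below are placeholders for whatever your installation uses, and the queue simply comes into existence when the code first refers to it:

    import javax.jms.Connection;
    import javax.jms.MessageProducer;
    import javax.jms.Queue;
    import javax.jms.Session;
    import javax.jms.TextMessage;
    import org.apache.activemq.ActiveMQConnectionFactory;

    public class ActiveMQQuickSend
    {
        public static void main(String[] args) throws Exception
        {
            // No console or admin work needed first -- just point at the broker.
            ActiveMQConnectionFactory factory = new ActiveMQConnectionFactory("tcp://localhost:61616");
            Connection connection = factory.createConnection();
            connection.start();
            Session session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
            Queue queue = session.createQueue("exampleQueue");
            MessageProducer producer = session.createProducer(queue);
            TextMessage message = session.createTextMessage("hello from a simple JMS client");
            producer.send(message);
            connection.close();
        }
    }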

    A Little More Complexity: Glassfish

    For Glassfish, you need to set up your JMS connection factories, queues, topics, etc. in the application server. I use NetBeans (v 7.1) to manage the server. This procedure is still fairly simple, although the concept of referring to JMS resources as File resources seems a little odd to me. The following steps assume you have started Glassfish and that you have at least one application deployed to Glassfish.

    1. Click "File", "New File..."
    2. In the "New File" dialog, ensure that your application is selected in the "Project" combo. One thing I've noticed here is that you need to pop the combo and select an actual JavaEE component. If you add the JMS resource to the overall NetBeans project, it appears to go to /dev/null. You'll notice this later when you can't find your resource.
    3. Under "Categories:", select "Glassfish"
    4. Under "File Types:" (yes, that really is odd), select "JMS Resource"
    5. Click "Next" to create a ConnectionFactory
    6. In "JNDI Name:", enter "jms/connectionFactory" (or, your choice, of course)
    7. Ensure the new item is enabled
    8. Under "Choose Resource Type:", select "javax.jms.ConnectionFactory"
    9. Click "Finish" (unless you want to configure properties on the ConnectionFactory, in which case click "Next")
    10. Starting back at Step #1, create a JMS queue, following the above steps but this time selecting a resource type of "javax.jms.Queue"
    11. Note that a Queue (or Topic) requires at least one property -- its name -- which is why you can't just click "Finish". Go to the properties screen and supply a name.
    12. If you're using NetBeans 7.1, you'll notice this still isn't enough to complete the task, because of a bug in the dialog (a missing document listener). After entering a value for the name, click in the "Name" field and the "Finish" button should be enabled. Click "Finish".
    13. Repeat Steps 10-12 and create a Topic, remembering to choose resource type "javax.jms.Topic"
    14. Under the "Projects" tab, right-click on your Java EE project and select "Deploy".
    15. Under the "Services" tab, expand your running Glassfish server instance, right-click on "Resources" and choose "Refresh". You should see your new ConnectionFactory, Queue and Topic in their respective folders, and you should now be able to look them up by their JNDI names.

    Much More Flexibility -- WebLogic Server

    In WebLogic, before you create JMS resources, you have to decide how you want messages to be persisted, among other things. There's a fair amount of setup, and it makes sense to just back up all the way to the beginning, rather than try to figure out why your new resource is useless. I'm using WebLogic 12c in this example.

    Set up a persistent JMS store
    1. In the WebLogic web console, expand "Services", then select "Persistent Stores".
    2. Above the Persistent Stores table, click "New", then (for this example) select "Create FileStore".
    3. Choose a target server; in my case I have an admin server and two managed nodes, and I chose the admin server as the target.
    4. Enter a directory, paying attention to the accompanying text (in other words, ensure this is a real directory).
    5. Click OK.
    Set up a JMS Server
    1. Under "Services", expand the "Messaging" node, then select "JMS Servers".
    2. Click the "New" button above the JMS Servers table.
    3. Enter a Name.
    4. Choose a persistent store. Note that the WLS console allows you to create one now, if you haven't yet done so.
    5. Click "Next".
    6. Choose a target for the JMS Server. Again, in my case, I chose the admin server.
    7. Click "Finish".
    Set up a JMS Module
    1. Under "Services", "Messaging", select "JMS Modules".
    2. Click the "New" button above the JMS Modules table.
    3. Enter a Name, then click "Next".
    4. Again, choose a target, then click "Next" again.
    5. We could start adding resources, but for now just click "Finish".
    Create a JMS Module Subdeployment
    1. Under "Services", "Messaging", "JMS Modules", select your new JMS Module.
    2. Click the "Subdeployments" tab.
    3. Above the table, click the "New" button.
    4. Enter a name for the subdeployment, then click "Next".
    5. Select a target. This time, instead of targeting to the admin server or managed nodes, target the subdeployment to the JMS Server you created earlier.
    6. Click "Finish".
    When you view this new Subdeployment under the "Subdeployments" tab for your new JMS Module, you should see your JMS Server under the "Targets" column, but nothing under the "Resources" column.

    Finally, create your JavaEE JMS resources
    1. Under "Services", "Messaging", "JMS Modules", select your new JMS Module.
    2. Under the "Configuration" tab, click the "New" button.
    3. Select "Connection Factory", then click "Next".
    4. Enter a Name or accept the auto-populated default.
    5. Enter a JNDI name for the ConnectionFactory.
    6. Click "Next".
    7. Click "Advanced Targeting".
    8. Choose the Subdeployment you created above. Note that the form now populates with possible targets, but also note that the Subdeployment is already shown to be targeted, as you configured this property earlier.
    9. Click "Finish". Note that all the property columns are populated for the new resource.
    10. In the JMS Module "Configuration" tab, click the "New" button again.
    11. Select "Queue", then click "Next".
    12. Enter a Name and a JNDI Name, then click "Next".
    13. Choose a Subdeployment, then click "Finish".
    14. In the JMS Module "Configuration" tab, click the "New" button again.
    15. Select "Topic", then click "Next".
    16. Enter a Name and a JNDI Name, then click "Next".
    17. Choose a Subdeployment, then click "Finish".
    At this point, you'll see that the JMS Module "Configuration" table contains your three new resources, as well as their subdeployment and targets. If you click on the "Subdeployments" tab, you will see that your new Subdeployment lists the three resources in that view as well. You should now be able to do a JNDI lookup of the resources, of course.
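
    The lookup from a standalone client is the same idea, with WebLogic's initial-context properties supplied explicitly. A minimal sketch, assuming a server on the default t3 port and JNDI names like the ones in the Glassfish example (all three names, and the URL, are placeholders for whatever you entered in the console):

    import java.util.Hashtable;
    import javax.jms.ConnectionFactory;
    import javax.jms.Queue;
    import javax.jms.Topic;
    import javax.naming.Context;
    import javax.naming.InitialContext;

    public class WlsJmsLookup
    {
        public static void main(String[] args) throws Exception
        {
            Hashtable<String, String> env = new Hashtable<String, String>();
            env.put(Context.INITIAL_CONTEXT_FACTORY, "weblogic.jndi.WLInitialContextFactory");
            env.put(Context.PROVIDER_URL, "t3://localhost:7001");
            Context ctx = new InitialContext(env);

            ConnectionFactory cf = (ConnectionFactory) ctx.lookup("jms/connectionFactory");
            Queue queue = (Queue) ctx.lookup("jms/myQueue");
            Topic topic = (Topic) ctx.lookup("jms/myTopic");
            System.out.println("Looked up " + cf + ", " + queue + " and " + topic);
        }
    }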

    I hope the above information is helpful. I think the Glassfish documentation is quite simple and to-the-point, as well as easy to find from the NetBeans help menu. The WebLogic documentation is a lot more spread out and it took me quite a while to figure everything out, which involved a lot of strange deployment error messages along the way. Please let me know if you find any mistakes. Good luck!
  4. A few years ago, I posted a how-to on Java-SE-based Web Services. More recently, I've become interested in asynchronous web-service invocation, and, as it turns out, Java SE supports that, too. This post, then, is the asynchronous version of that older post. How I got to the structure of this post is a story in itself. To make things simpler, I will first go through all the steps to deploy
    an asynchronous Java SE web service. Then, I will explain the route I chose, and what I see as the positives and negatives of the results.

    Here's the outline:
    • Create a WSDL definition file
    • Using the WSDL file, generate server artifacts; implement the web service operations
    • Create an external JAX-WS binding definitions file
    • Generate client-side artifacts using the WSDL and binding definitions files
    • Demonstrate synchronous and asynchronous client-side invocations of the web-service operations
    Let's cut to the chase.

    Create a WSDL definition file

    Here I'll create a minimal WSDL file, describing a web service which returns the exchange
    rate of two currencies. Here's the file, called exchange-rate.wsdl:
    <?xml version="1.0" encoding="UTF-8"?>
    <definitions xmlns:wsu="http://docs.oasis-open.org/wss/2004/01/oasis-200401-wss-wssecurity-utility-1.0.xsd"
    xmlns:wsp="http://www.w3.org/ns/ws-policy"
    xmlns:wsp1_2="http://schemas.xmlsoap.org/ws/2004/09/policy"
    xmlns:wsam="http://www.w3.org/2007/05/addressing/metadata"
    xmlns:soap="http://schemas.xmlsoap.org/wsdl/soap/"
    xmlns:tns="http://async.ws.adamsresearch.com/"
    xmlns:xsd="http://www.w3.org/2001/XMLSchema"
    xmlns="http://schemas.xmlsoap.org/wsdl/"
    targetNamespace="http://async.ws.adamsresearch.com/"
    name="ExchangeRateService">

    <types>
    <xsd:schema xmlns:tns="http://async.ws.adamsresearch.com/"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    version="1.0"
    targetNamespace="http://async.ws.adamsresearch.com/">
    <xsd:element name="getExchangeRate" type="tns:getExchangeRate"></xsd:element>
    <xsd:element name="getExchangeRateResponse" type="tns:getExchangeRateResponse"></xsd:element>
    <xsd:complexType name="getExchangeRate">
    <xsd:sequence>
    <xsd:element name="arg0" type="xsd:string" minOccurs="0"></xsd:element>
    <xsd:element name="arg1" type="xsd:string" minOccurs="0"></xsd:element>
    </xsd:sequence>
    </xsd:complexType>
    <xsd:complexType name="getExchangeRateResponse">
    <xsd:sequence>
    <xsd:element name="return" type="xsd:double"></xsd:element>
    </xsd:sequence>
    </xsd:complexType>
    </xsd:schema>
    </types>
    <message name="getExchangeRate">
    <part name="parameters" element="tns:getExchangeRate"></part>
    </message>
    <message name="getExchangeRateResponse">
    <part name="parameters" element="tns:getExchangeRateResponse"></part>
    </message>
    <portType name="ExchangeRate">
    <operation name="getExchangeRate">
    <input wsam:Action="http://async.ws.adamsresearch.com/ExchangeRate/getExchangeRateRequest"
    message="tns:getExchangeRate"></input>
    <output wsam:Action="http://async.ws.adamsresearch.com/ExchangeRate/getExchangeRateResponse"
    message="tns:getExchangeRateResponse"></output>
    </operation>
    </portType>

    <binding name="ExchangeRatePortBinding" type="tns:ExchangeRate">
    <soap:binding transport="http://schemas.xmlsoap.org/soap/http" style="document"></soap:binding>
    <operation name="getExchangeRate">
    <soap:operation soapAction=""></soap:operation>
    <input>
    <soap:body use="literal"></soap:body>
    </input>
    <output>
    <soap:body use="literal"></soap:body>
    </output>
    </operation>
    </binding>

    <service name="ExchangeRateService">
    <port name="ExchangeRatePort" binding="tns:ExchangeRatePortBinding">
    <soap:address location="http://localhost:8282/exchangeRate"></soap:address>
    </port>
    </service>

    </definitions>

    The web service has a single operation -- getExchangeRate -- which takes String representations of two currencies and returns a double. There's no need to flesh this service out with a lot of operations -- I just want to demonstrate the dev cycle required to deploy a web service with asynchronous operations.

    Generate the service-side artifacts

    Next, we'll create a Java web-service project using Maven. In a convenient directory, enter

    mvn archetype:generate -DgroupId=com.adamsresearch.ws.async -DartifactId=AsyncService

    (substituting your values, of course) and accept all the default values. This creates the skeleton of what will be our web service. Next, we'll modify the POM file to read the WSDL file and generate the service-side artifacts. For this, we will use the JAX-WS utility wsimport (found in the Java SE SDK; we will be using the Maven wsimport plugin, however). First, we have to decide where to put the WSDL file. We can put it anywhere we want, of course, but note from the jaxws:import docs page that the default location is ${basedir}/src/wsdl. I prefer to put mine in a directory src/main/resources/wsdl, so I'll need to specify that directory below, in the arguments to wsimport.

    Open the POM file in the top-level directory of the project and add the following, after the dependencies element:

    <build>
    <finalName>ExchangeRateWebService</finalName>
    <plugins>
    <plugin>
    <artifactId>maven-compiler-plugin</artifactId>
    <configuration>
    <source>1.6</source>
    <target>1.6</target>
    </configuration>
    </plugin>
    <plugin>
    <groupId>org.codehaus.mojo</groupId>
    <artifactId>jaxws-maven-plugin</artifactId>
    <executions>
    <execution>
    <goals>
    <goal>wsimport</goal>
    </goals>
    <configuration>
    <wsdlDirectory>${basedir}/src/main/resources/wsdl</wsdlDirectory>
    <keep>true</keep>
    <packageName>com.adamsresearch.ws.async.generated</packageName>
    <sourceDestDir>${basedir}/src/main/java</sourceDestDir>
    </configuration>
    </execution>
    </executions>
    </plugin>
    </plugins>
    </build>
    The first plugin is an acknowledgment that the Maven compiler defaults to 1.3. For those of you without white hair, that's a really old version of Java.

    I've decided to override a number of wsimport defaults, to create the service in the desired package and to drop the files in the desired directory. Note that packageName does not have a default setting. Now enter mvn install from the project top-level directory.

    I now have an interface -- ExchangeRate -- as well as some additional supporting classes. It is time to write our service implementation. Here is my first cut:

    package com.adamsresearch.ws.async;

    import javax.jws.WebMethod;
    import javax.jws.WebService;
    import javax.xml.ws.Endpoint;
    import com.adamsresearch.ws.async.generated.ExchangeRate;

    @WebService(serviceName="ExchangeRateService", portName="ExchangeRatePort", endpointInterface="com.adamsresearch.ws.async.generated.ExchangeRate")
    public class ExchangeRateEndpoint implements ExchangeRate
    {
    public static void main(String[] args)
    {
    if (args.length != 1)
    {
    System.out.println("Usage: java -cp <jarFile> com.adamsresearch.ws.async.ExchangeRateEndpoint publishURL");
    System.exit(-1);
    }
    ExchangeRateEndpoint wsInstance = new ExchangeRateEndpoint();
    Endpoint.publish(args[0], wsInstance);
    System.out.println("Published endpoint at URL " + args[0]);
    }

    @WebMethod
    public double getExchangeRate(String fromCurrency, String toCurrency)
    {
    if (fromCurrency.equals("AS1") && toCurrency.equals("GMD"))
    {
    return 2.78;
    }
    else
    {
    return 0.0;
    }
    }
    }
    I then run mvn install once more, then launch the web service with the following:

    C:\dev\AsyncWSDev\AsyncService>java -cp target\ExchangeRateWebService.jar com.adamsresearch.ws.async.ExchangeRateEndpoint http://localhost:8282/exchangeRateService
    Published endpoint at URL http://localhost:8282/exchangeRateService

    If we open a browser at the specified URL with "?wsdl" appended, we'll see the web-service WSDL file, verifying that we successfully deployed the service. Note that the Java runtime cleverly extracts the XML Schema from the WSDL file and references it via import. Both the WSDL file and the XML Schema file (as retrieved by the HTTP request in the browser) are dynamically generated by the Java runtime.

    Create an external JAX-WS binding definitions file

    Why are we performing this step? To produce a more-fully-functional client-side API for our to-be-created web service client. You can provide two types of binding definitions files to wsimport -- JAXB-related files, and a file that specifies some customizations of the web service (which is why it's called a binding customization).

    Here is our binding customization file:

    <?xml version="1.0" encoding="UTF-8"?>

    <bindings
    xmlns:wsdl="http://schemas.xmlsoap.org/wsdl/"
    wsdlLocation="http://localhost:8282/exchangeRateService?wsdl"
    xmlns="http://java.sun.com/xml/ns/jaxws">

    <!-- applies to wsdl:definitions node, that would mean the entire wsdl -->
    <enableAsyncMapping>false</enableAsyncMapping>

    <!-- wsdl:portType operation customization -->
    <bindings node="wsdl:definitions/wsdl:portType [@name='ExchangeRate']/wsdl:operation[@name='getExchangeRate']">
    <enableAsyncMapping>true</enableAsyncMapping>
    </bindings>
    </bindings>

    Note that the outer envelope of the file is called bindings, and that there is an inner element also called bindings. Not to go into too much detail, but elements can be applied at the global level, and at a portType or even an operation level. In this file, I've disabled asynchronous mapping at the global level, but turned it on just for the getExchangeRate operation. This precaution prevents potential new operations from being inadvertently exposed as asynchronous operations. More on the wsdlLocation later.

    Generate the client-side artifacts

    We're going to put the web-service client in a separate Maven project (more on this later), so next we'll create another project:

    mvn archetype:generate -DgroupId=com.adamsresearch.ws.async -DartifactId=AsyncClient

    As with the service-side artifacts, I will create a resources directory, but I'll do this a little differently this time.

    Instead of referencing the WSDL file in the project filesystem, I'll point wsimport to it via URL. Note this is what I did above in the binding customization file, since it also references the WSDL location. In the client project directory structure, I'll create a src/main/resources/jaxws directory and put the binding customization file (which I called async-bindings.xml) in the
    directory.

    Here's the relevant section of the client-project POM file, modified to point to the binding customization file and to access the WSDL file from the service itself:

    <build>
    <finalName>ExchangeRateWebService</finalName>
    <plugins>
    <plugin>
    <artifactId>maven-compiler-plugin</artifactId>
    <configuration>
    <source>1.6</source>
    <target>1.6</target>
    </configuration>
    </plugin>
    <plugin>
    <groupId>org.codehaus.mojo</groupId>
    <artifactId>jaxws-maven-plugin</artifactId>
    <executions>
    <execution>
    <goals>
    <goal>wsimport</goal>
    </goals>
    <configuration>
    <wsdlUrls>
    <wsdlUrl>http://localhost:8282/exchangeRateService?wsdl</wsdlUrl>
    </wsdlUrls>
    <bindingDirectory>${basedir}/src/main/resources/jaxws</bindingDirectory>
    <keep>true</keep>
    <packageName>com.adamsresearch.ws.async.generated</packageName>
    <sourceDestDir>${basedir}/src/main/java</sourceDestDir>
    </configuration>
    </execution>
    </executions>
    </plugin>
    </plugins>
    </build>

    Let's do a first build, before we actually write a client, and see what we get:

    mvn install
    Now take a look at the generated ExchangeRate Java interface. When we generated the artifacts for the server, we had a single method declaration for our one operation in this file:

    @WebMethod
    @WebResult(targetNamespace = "")
    @RequestWrapper(localName = "getExchangeRate", targetNamespace = "http://async.ws.adamsresearch.com/", className = "com.adamsresearch.ws.async.generated.GetExchangeRate")
    @ResponseWrapper(localName = "getExchangeRateResponse", targetNamespace = "http://async.ws.adamsresearch.com/", className = "com.adamsresearch.ws.async.generated.GetExchangeRateResponse")
    public double getExchangeRate(
    @WebParam(name = "arg0", targetNamespace = "")
    String arg0,
    @WebParam(name = "arg1", targetNamespace = "")
    String arg1);

    When we implemented this interface in our endpoint, it was a simple matter to process the input parameters and return the double value. If you open the artifact that we just generated, you'll see there are two additional declarations:

    @WebMethod(operationName = "getExchangeRate")
    @RequestWrapper(localName = "getExchangeRate", targetNamespace = "http://async.ws.adamsresearch.com/", className = "com.adamsresearch.ws.async.generated.GetExchangeRate")
    @ResponseWrapper(localName = "getExchangeRateResponse", targetNamespace = "http://async.ws.adamsresearch.com/", className = "com.adamsresearch.ws.async.generated.GetExchangeRateResponse")
    public Response<GetExchangeRateResponse> getExchangeRateAsync(
    @WebParam(name = "arg0", targetNamespace = "")
    String arg0,
    @WebParam(name = "arg1", targetNamespace = "")
    String arg1);

    @WebMethod(operationName = "getExchangeRate")
    @RequestWrapper(localName = "getExchangeRate", targetNamespace = "http://async.ws.adamsresearch.com/", className = "com.adamsresearch.ws.async.generated.GetExchangeRate")
    @ResponseWrapper(localName = "getExchangeRateResponse", targetNamespace = "http://async.ws.adamsresearch.com/", className = "com.adamsresearch.ws.async.generated.GetExchangeRateResponse")
    public Future<?> getExchangeRateAsync(
    @WebParam(name = "arg0", targetNamespace = "")
    String arg0,
    @WebParam(name = "arg1", targetNamespace = "")
    String arg1,
    @WebParam(name = "asyncHandler", targetNamespace = "")
    AsyncHandler<GetExchangeRateResponse> asyncHandler);

    What happened here is that we got two additional options to retrieve the data -- one which returns a pollable object, and one which allows you to specify an asynchronous handler (note that Response is a subinterface of Future).

    Implement the client-side logic

    At this point, you're probably wondering why we will be invoking an asynchronous operation when we haven't yet implemented it on the server. Bear with me for a moment, while we write our client.

    Here's my version of a client which exercises the three different available method signatures.

    package com.adamsresearch.ws.async;

    import java.net.MalformedURLException;
    import java.net.URL;
    import javax.xml.namespace.QName;
    import javax.xml.ws.AsyncHandler;
    import javax.xml.ws.Response;
    import com.adamsresearch.ws.async.generated.ExchangeRate;
    import com.adamsresearch.ws.async.generated.ExchangeRateService;
    import com.adamsresearch.ws.async.generated.GetExchangeRateResponse;

    public class ExchangeRateClient
    {
    protected ExchangeRateClient theClient = null;
    protected String wsdlUrl = null;
    protected double rate = 0.0d;
    ExchangeRate excRate = null;

    public static void main(String args[]) throws MalformedURLException, InterruptedException
    {
    if (args.length != 1)
    {
    System.out.println("Usage java -cp <jarFile> com.adamsresearch.ws.async.ExchangeRateClient serviceWsdlUrl");
    System.exit(-1);
    }
    ExchangeRateClient client = new ExchangeRateClient(args[0]);
    Thread.sleep(5000L);
    }

    public ExchangeRateClient(String urlStr) throws MalformedURLException
    {
    theClient = this;
    wsdlUrl = urlStr;
    URL url = new URL(wsdlUrl);
    QName qname = new QName("http://async.ws.adamsresearch.com/", "ExchangeRateService");
    ExchangeRateService exchangeRateService = new ExchangeRateService(url, qname);
    excRate = exchangeRateService.getExchangeRatePort();

    // synchronous:
    System.out.println("Airstrip One / Ganymede exchange rate, retrieved synchronously, is: " + excRate.getExchangeRate("AS1", "GMD"));

    // asynchronous with polling:
    try
    {
    Response<GetExchangeRateResponse> response = excRate.getExchangeRateAsync("AS1", "GMD");
    Thread.sleep (2000L);
    GetExchangeRateResponse output = response.get();
    System.out.println("--> retrieved via polling: " + output.getReturn());
    }
    catch (Exception exc)
    {
    System.out.println(exc.getClass().getName() + " polling for response: " + exc.getMessage());
    }

    // asynchronous with callback:
    excRate.getExchangeRateAsync("AS1", "GMD", new AsyncHandler()
    {
    public void handleResponse(Response<GetExchangeRateResponse> response)
    {
    System.out.println("In AsyncHandler");
    try
    {
    theClient.setCurrencyExchangeRate(response.get().getReturn());
    }
    catch (Exception exc)
    {
    System.out.println(exc.getClass().getName() + " using callback for response:" + exc.getMessage());
    }
    }
    });
    }

    protected void setCurrencyExchangeRate(double newRate)
    {
    rate = newRate;
    System.out.println("--> via callback, updated exchange rate to " + rate);
    }
    }

    The Thread.sleep() in main is just to make sure we're still around when the web service responds. Finally, invoking the client:

    c:\dev\AsyncWSDev\AsyncClient>java -cp target\ExchangeRateWebService.jar
    com.adamsresearch.ws.async.ExchangeRateClient http://localhost:8282/exchangeRateService?wsdl
    Airstrip One / Ganymede exchange rate, retrieved synchronously, is: 2.78
    --> retrieved via polling: 2.78
    In AsyncHandler
    --> via callback, updated exchange rate to 2.78

    So there we have it -- a Java SE client which hits a Java SE web service three ways: synchronously, asynchronously with polling, and asynchronously with a callback handler.

    There's a lot to discuss about these results, some of which, frankly, I did not expect (for example, why did we not have to explicitly implement asynchronous service-side logic?). That is the topic of another post, which I hope you'll find interesting, too.

  5. In an earlier post, we stepped through the building of an asynchronous web service, deployed in Java SE. I saved my comments for this post to keep things a little cleaner. But there are some loose ends to discuss, especially when you hear my motivation for building an asynchronous web service in the first place.

    Normally, you want to invoke a service asynchronously because you expect potentially long server response times. This, I would think, is the most likely reason for wanting asynchronous invocation. For example, in a user-facing application, you don't want to block on a call, especially if your front end has other things it could be doing.

    Lately, I've been spending a lot of time on "server push" applications, using frameworks like DWR (Direct Web Remoting) and CometD; these frameworks provide a clean façade for logic which, for example, either maintains long-lived HTTP connections or polls for data. I'd rather not have to rely on a servlet-based framework to do this, however; if possible, I'd prefer to simply leverage Java SE. So my ulterior motive was to see if I could use Java SE's asynchronous web-service support to implement server push. Unfortunately, server push and asynchronous response are not equivalent, of course, although there's some obvious overlap. The overlap lies in that first (and, for async, only) response.

    What I found most interesting about this exercise is that the asynchronous logic (at least, the logic visible to us, the developers) was manifested only on the client side. In other words, the asynchronous versions of the operation were entirely a client-side artifact -- we did nothing on the web-service side to implement asynchronicity! I did not expect that result. We just implemented a synchronous Java method on the service side, and the Java SE runtime took care of the asynchronous-invocation details for us, simply by virtue of our using the asynchronous operations of the client-side API.

    I really expected, when I started this exercise, that I would generate asynchronous service-side stubs and actually implement the asynchronous logic myself. In that case, it would have been natural to assume that I could play some neat trick and continue sending data "down the pipe", rather than providing just one response and seeing the connection get closed afterwards. I'm not sure how I thought I was going to do this (groundless optimism?), but I planned on figuring that out once I got the basics working.

    What we ended up with is exactly what JAX-WS advertised, and that is actually quite handy; all you have to do is to make the asynchronous call and you can go on with whatever else you need to do; the runtime takes the response, when it returns, and updates your pollable object (or calls your callback).

    One thing that is appealing about this result is that you really don't need to build your web service by starting with a WSDL file, since the asynchronicity is purely a client-side artifact. In other words, if you prefer to start with a Java bean and use wsgen to generate the web service, that is fine -- you don't need to do anything else on the service side to support asynchronous invocation. It's only when you create the client that you need to worry about asynchronous invocation, but by then you already have a deployed web service, so you can generate the client-side, asynchronous API by running wsimport against the deployed web service's published WSDL file. Overall, a rather convenient result!

    So, to recap:
    1. JAX-WS in the Java SE provides relatively straightforward support for asynchronous web-service development. The service can be developed Java-first or WSDL-first, as the asynchronicity doesn't come into play until the client-side artifacts are generated.
    2. JAX-WS does not (as far as I know at this point) support the concept of server push, by which I mean a long-lived output stream over which data can be pushed to the client.




  6. As I have mentioned in earlier posts, I am using the Java Debug Interface (JDI) to create a Java process-monitoring tool. I'm retracing my steps from an earlier such effort a few years ago, and as I wade through the details, I'm posting notes in the hope they'll be helpful to someone else out there. I sometimes wonder who would really be interested in this material, but my posts on JDI, DTrace, etc. continue to get hits every week, even the old ones, so I know you're out there. I hope you find this article helpful, too.

    Current Status:

    At this stage in the development of my application, I have set up breakpoint requests and am stepping through events as they arrive. You'd probably do this in a loop similar to the following:

    EventQueue evtQueue = virtualMachine.eventQueue();
    EventSet evtSet = null;
    while (!stopRequested)
    {
        try
        {
            evtSet = evtQueue.remove();
            EventIterator evtIter = evtSet.eventIterator();
            while (evtIter.hasNext())
            {
                ...
            }
        }
        catch (Exception exc)
        {
            ...
        }
        finally
        {
            // remove() may have thrown before an EventSet was assigned
            if (evtSet != null)
            {
                evtSet.resume();
            }
        }
    }

    In my application, I handle debugger events in a thread, so I need a way for the thread to know if it should stop (that's the purpose of the boolean stopRequested). Note that an EventSet can contain several events, which is why we walk it with an EventIterator. Also note that it is quite important to ensure that resume() is called on the EventSet, regardless of any exceptions thrown while processing it. Otherwise, you could end up with some permanently hung threads and have to restart your target application.

    Inspecting Variables:

    At a breakpoint, you'll probably want to extract some of the current state of your target application. There are two places to find variables in the JDI API -- instance/class variables, and the stack. Static and instance variables can always be listed by inspecting the class's ReferenceType -- they're always there, as long as the class is loaded, even if the application isn't suspended at a breakpoint (you may still have to be at a breakpoint to look at their values, of course, but the point is you can see what a class's static and instance variables are simply by looking at its definition).

    On the other hand, the stack variables aren't defined unless you are, well, on the stack where they're defined (e.g. local variables in a method). In JDI, you find these two types of variables in different parts of the API, in locations which will make sense to you as we continue our discussion. You see the same division in a debugger. For example in Eclipse, when you are at a breakpoint and look at the "Variables" window, you will see class/instance variables in a tree structure under "this", and all local (stack) variables listed peer-level to "this" under their local names.

    The following sections assume that you have extracted an Event from an EventIterator, verified that it is a BreakpointEvent, extracted the EventRequest from the Event, and verified that it is a BreakpointRequest (these are all JDI classes).
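
    Inside the event loop shown earlier, that narrowing might look something like the following sketch (handleBreakpoint is a hypothetical placeholder for your own processing logic):

    while (evtIter.hasNext())
    {
        Event evt = evtIter.nextEvent();
        if (evt instanceof BreakpointEvent)
        {
            BreakpointEvent bpEvt = (BreakpointEvent) evt;
            EventRequest req = bpEvt.request();
            if (req instanceof BreakpointRequest)
            {
                BreakpointRequest bpReq = (BreakpointRequest) req;
                handleBreakpoint(bpEvt, bpReq);   // hypothetical handler
            }
        }
    }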

    Inspecting Static and Instance Variables:

    Given a BreakpointEvent and a variable name (defined in your breakpoint request specification), how would you search for the variable in the static and instance variables list? Every BreakpointEvent has a Location (obtained via the location() method). In addition to having the usual information you would expect a Location object to have (the name of the source, the line number, method name, etc.), a Location has a reference to the class's ReferenceType. The ReferenceType has a method called allFields() which, as the API documentation says, "Returns a list containing each Field declared in this type, and its superclasses, implemented interfaces, and/or superinterfaces. All declared and inherited fields are included, regardless of whether they are hidden or multiply inherited. " Exactly what you want, right?

    allFields() returns a List. The JDI Field type has a name field (method: name()) which can be used, while iterating over the List, to find the target variable. Once you have retrieved the correct Field, you get its Value by appealing to the ReferenceType again, as in:

    Value val = theReferenceType.getValue(theField);
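
    Putting those steps together, a minimal sketch might look like the following, where varName is assumed to come from your breakpoint specification. One caveat: as far as I can tell, ReferenceType.getValue() applies to static fields; for an instance field, you can read the value from the "this" object of the current stack frame instead.

    ReferenceType refType = breakpointEvt.location().declaringType();
    Field targetField = null;
    for (Field f : refType.allFields())
    {
        if (f.name().equals(varName))
        {
            targetField = f;
            break;
        }
    }
    if (targetField != null)
    {
        Value val;
        if (targetField.isStatic())
        {
            val = refType.getValue(targetField);
        }
        else
        {
            // read the instance field from "this" in the current (top) frame;
            // thisObject() is null in a static or native method
            ObjectReference thisObj = breakpointEvt.thread().frame(0).thisObject();
            val = thisObj.getValue(targetField);
        }
    }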

    The only other detail at this point is actually extracting the value of the Value. This is a slightly messy topic, which we'll discuss after we look at the other location of variables -- the stack.

    Inspecting Stack Variables:

    Stack variables, and specifically their values at your breakpoint, are associated with the JDI StackFrame class. They are referred to as LocalVariables in JDI; you reach them through the BreakpointEvent, by accessing its ThreadReference and then the StackFrame at the top (that is, element #0) of the stack. You can then query the StackFrame for its visible variables by name to find the target variable. In other words, you could do something like this:

    StackFrame stackFrame = breakpointEvt.thread().frame(0);
    LocalVariable localVar = stackFrame.visibleVariableByName(varName);
    Value val = stackFrame.getValue(localVar);

    Note that you must get the LocalVariable from the StackFrame, but having retrieved it, you must again access the StackFrame to get the Value of the LocalVariable. Note also that we are now at the same stage we were with class/instance variables -- we have a Value, but we need to find out what its "value" is. As I said, this is a little messy and is the subject of the next section.

    Extracting Values:

    In JDI, a Value isn't just something you can meaningfully display with toString() (unless you find hash codes especially meaningful). The JDI API documentation contains a very useful table under com.sun.jdi.Value which describes the family of subinterfaces of Value. The Value interface itself doesn't declare a value() method; that method appears only on the sub-interfaces that represent something directly usable (BooleanValue, IntegerValue, StringReference, and so on). This means that if you want to inspect a Value that is, at heart, a boolean or a String, you must first determine whether it is an instance of the type you need, then cast it to that type and call value() on the cast instance. I don't know of a more practical way to do this that doesn't involve writing extremely clever (in other words, unmaintainable) code, so here's how I do it:

    if (value instanceof BooleanValue)
    {
        return ((BooleanValue)value).value();
    }
    else if (value instanceof ByteValue)
    {
        return ((ByteValue)value).value();
    }
    ...


    and so on.

    There are two main subinterfaces of Value -- PrimitiveValue and ObjectReference -- along with a little more hierarchy which will be important to you. For example, if you are using code such as the example above to extract values, you will want to check for StringReference before ObjectReference. Why? StringReference has a value() method, as you would expect and hope, but StringReference is a subinterface of ObjectReference, which does not have a value() method. If you check for ObjectReference first, and you're looking at a String Value, you will miss the available cast to StringReference and lose the chance to inspect the value (yes, I did exactly this again; only afterwards did I remember making the same mistake a few years ago).
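
    In code, the ordering looks like this fragment of the same instanceof chain shown above:

    // check StringReference *before* ObjectReference; in the reverse order, a
    // String Value would match the ObjectReference branch and never reach value()
    else if (value instanceof StringReference)
    {
        return ((StringReference) value).value();
    }
    else if (value instanceof ObjectReference)
    {
        // some other object type -- no value() method available here
        return value.toString();
    }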

    Drill-down:

    If you, too, are writing what I would call a "real-time" JDI-based tool, you'll probably want to provide some drill-down at a breakpoint. For example, you might want to look at a specific field of a variable on the stack, rather than just the Value of the variable. The way I handle this is to allow a "dot" notation, where "a.b.c" allows my users to specify that they want to look at field c of field b of variable/field a. Things get a little interesting here, too, as the following example illustrates.
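
    As a sketch of how that might work for a stack variable, here is a hypothetical helper that walks a dotted path; the method name and the error handling are mine, not part of JDI:

    // resolve a dotted path such as "incomingMessage.count", starting from a
    // stack frame; each intermediate value must be an ObjectReference for the
    // drill-down to continue
    Value resolvePath(StackFrame frame, String dottedPath) throws Exception
    {
        String[] parts = dottedPath.split("\\.");
        LocalVariable rootVar = frame.visibleVariableByName(parts[0]);
        if (rootVar == null)
        {
            return null;   // no visible variable by that name
        }
        Value current = frame.getValue(rootVar);
        for (int i = 1; i < parts.length; i++)
        {
            if (!(current instanceof ObjectReference))
            {
                return null;   // can't drill into a primitive (or a null reference)
            }
            ObjectReference objRef = (ObjectReference) current;
            Field field = objRef.referenceType().fieldByName(parts[i]);
            if (field == null)
            {
                return null;   // no such field at this level
            }
            current = objRef.getValue(field);
        }
        return current;
    }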

    Suppose you are monitoring the traffic of an application and you just want to know the length of Strings being processed (e.g. returned from a service) at a breakpoint. You don't really want to display the String itself; this could be enormous (and remember, your target application is halted while this breakpoint is being processed, so I/O on a large String could cripple the application). All you want to know is the length of the String. But you don't want to write code to actually invoke general methods on your stack/instance variables. And in this particular case, there is a field available to you from JDI on the String class -- count. So your users should be able to specify, for example, incomingMessage.count, and get back an integer with the length of the String.

    The only problem is, not everyone knows that there is a field in java.lang.String called count -- they're more accustomed to calling the method length(). In an interactive debugger, this issue isn't a problem, as you don't specify what you want in advance -- you just inspect the state when the breakpoint occurs, and the debugger will show all the fields for you. In a debugger of the type I'm developing, users would need to know what is available. For a stack variable, this is a problem because you can't inspect it until the application is running. For class/instance variables, once you have the ReferenceType of the class, you can at least inspect those variables and drill down at will.

    The way I'm handling this issue (for stack variables) is to provide a diagnostic version of a breakpoint request, where the user can request a dump of the variables on the stack, to a specified depth. Note such a feature should be used with some care; an application with a very large number of stack variables, all traversed to (say) 10 levels deep, can end up being suspended for seconds, minutes, or tens of minutes just dumping the state of the stack (yes, I've done that before too, but will not repeat that this time!). Suffice to say that if you want to inspect an application for nested variables, you should do so on a non-production system.

    How do you drill down into fields? Unsurprisingly, there's a different procedure for class/instance variables than for stack variables. Unfortunately, some things in JDI are not immediately obvious (in fact, every time I write code to do this, I have to "re-discover" how to do it), which is why I illustrate below.

    For class/instance variables, you got the fields by getting the ReferenceType and looking at its Fields. Having retrieved a reference to a Field, you can simply repeat the process -- there is no method on the Field itself to drill down, but for a non-primitive Field, the Field's type is itself a ReferenceType, and you can ask that ReferenceType for its own Fields. Here's an example:

    List<Field> childFields = bpEvt.location().declaringType().allFields();
    for (Field childField : childFields)
    {
        // Field.type() throws ClassNotLoadedException if the field's type has
        // not yet been loaded; primitive fields do not yield a ReferenceType
        Type childType = childField.type();
        if (childType instanceof ReferenceType)
        {
            List<Field> grandChildFields = ((ReferenceType) childType).allFields();
            for (Field grandChildField : grandChildFields)
            {
                System.out.println(grandChildField.name() + " is a grandchild field");
            }
        }
    }

    The situation for stack variables is as follows. If you are trying to drill down into a variable, by definition it is an ObjectReference, right? In other words, there's no drilling down into a primitive. If the Value on the StackFrame is an instance of an ObjectReference, then you can get the ReferenceType of the Value and retrieve its Fields just as we did above when we retrieved the ReferenceType of the class/instance Field. Note that since StringReference is a subinterface of ObjectReference, we could do something like the following:

    else if (value instanceof StringReference)
    {
        // look at the fields:
        List<Field> childFields = ((StringReference) value).referenceType().allFields();
        for (Field childField : childFields)
        {
            System.out.println(">> child field: '" + childField.name() + "'");
        }
        return ((StringReference)value).value();
    }


    If we run this code, we would see something like the following output for a String:

    >> child field: 'value'
    >> child field: 'offset'
    >> child field: 'count'
    >> child field: 'hash'
    >> child field: 'serialVersionUID'
    >> child field: 'serialPersistentFields'
    >> child field: 'CASE_INSENSITIVE_ORDER'

    (Note our count Field, which happens to be an IntegerValue). Of course, you can continue drilling down, using the same procedure, on each of these Fields.

    That's all I have for now -- look for more in the coming weeks as the project comes along.
  7. In my Part-1 post on this topic, we actually did all the I/O I'm going to do here. We lazily read in the entire sample data file, a file containing data describing events generated by a process monitor. My next goal was to re-hydrate my Events from the Strings serialized to the file. These Strings were generated by calling the function show on my List of Events.

    Time to back up a little. We got show for free, but we had to ask for it. Leaving aside the Event data type for a moment and looking at the simpler Property data type, remember how we defined it:
     data Property = Property {
         key :: Key,
         value :: Value }
         deriving (Show)

    At that time we discussed "deriving (Show)" by simply noting it allowed ghci to know how to output Property values. Of course, there's more to it than that. The "more" is Haskell typeclasses. A Haskell typeclass defines a set of functions that can be defined to operate on a data type. The show function is a member of the Show typeclass and has the following type:
         *EventProcessor> :type show
    show :: Show a => a -> String
    For a moment, let's see what would have happened if we had not specified that Property derives Show:
     -- file:  c:/HaskellDev/eventProcessor/simpleProperty.hs
     type Key = String
     type Value = String
     data Property = Property {
         key :: Key,
         value :: Value }

    If we load this file, then create a Property, then try to show it, we see the following:
         Prelude> :load simpleProperty.hs
    [1 of 1] Compiling Main ( simpleProperty.hs, interpreted )
    Ok, modules loaded: Main.
    *Main> let prop1 = Property "key1" "value1"
    *Main> show prop1
     <interactive>:1:1:
         No instance for (Show Property)
           arising from a use of `show'
         Possible fix: add an instance declaration for (Show Property)
         In the expression: show prop1
         In an equation for `it': it = show prop1

    That's considerably more helpful than a typical compiler error message. The key is the hint to add an instance declaration. In order for the functions of a typeclass to be applicable to your data type, you need to declare an instance of that typeclass that handles your data type. In our case, we can get Show simply by stating that we derive it. So we go back to our old definition:
     -- file:  c:/HaskellDev/eventProcessor/simpleProperty.hs
     type Key = String
     type Value = String
     data Property = Property {
         key :: Key,
         value :: Value }
         deriving (Show)

    Repeating our earlier test, now we get the following:
         *Main> :load simpleProperty.hs
    [1 of 1] Compiling Main ( simpleProperty.hs, interpreted )
    Ok, modules loaded: Main.
    *Main> let prop1 = Property "key1" "value1"
    *Main> show prop1
    "Property {key = \"key1\", value = \"value1\"}"

    That's definitely an improvement. But, back to the subject -- what I want is to be able to read a String and turn it in to a Property. First, let's try to read a String into an Integer:
         *Main> read "5"
     <interactive>:1:1:
         Ambiguous type variable `a0' in the constraint:
           (Read a0) arising from a use of `read'
         Probable fix: add a type signature that fixes these type variable(s)
         In the expression: read "5"
         In an equation for `it': it = read "5"

    Again, a helpful hint from ghci. It basically doesn't know what we are trying to create and suggests adding a type signature. Here's how you would do it:
         *Main> let anInt = (read "5")::Integer
    *Main> :type anInt
    anInt :: Integer
    *Main> show anInt
    "5"

    Haskell now knows we're trying to read an Integer, so it recognizes that anInt is an Integer and knows how to show it. All of this should tell you that Integer has defined instances for both Show and Read.

    Can we get Read "for free" simply by stating that our data type derives it? It can't hurt to try:
     -- file:  c:/HaskellDev/eventProcessor/simpleProperty.hs
     type Key = String
     type Value = String
     data Property = Property {
         key :: Key,
         value :: Value }
         deriving (Read, Show)
    Next, I'll create a Property and output it with show, then see if that format (which, incidentally, is how I created my sample Event file -- using show on a List of Events) is "read"-able:

    *Main> let prop1 = Property "key1" "value1"
    *Main> show prop1
    "Property {key = \"key1\", value = \"value1\"}"
    *Main> let prop2 = (read "Property {key = \"key1\", value = \"value1\"}")::Property
    *Main> show prop2
    "Property {key = \"key1\", value = \"value1\"}"

    This is great news -- I didn't have to write a parser, and I'm perfectly happy to use, as my format, the same format that Haskell uses to show a Haskell data structure.

    What I would really like to do is to be able to read a full Event, as defined in my earlier posts, including a List of Events. Here's an example Event:
         "Event {timestamp = 1320512200548, className = \"java.lang.String\", lineNumber = 1293, message = \"NPE in substring()\", properties = [Property {key = \"userId\", value = \"smith\"},Property {key = \"sessionId\", value = \"ABCD1234\"}]}"

    Let's try the same trick with Event that we used with Property:
         -- file:  c:/HaskellDev/eventProcessor/notSoSimpleProperty.hs

    type Timestamp = Integer
    type ClassName = String
    type LineNumber = Integer
    type Message = String
    type Key = String
    type Value = String
    type Properties = [Property]

     data Property = Property {
         key :: Key,
         value :: Value }
         deriving (Read, Show)

     data Event = Event {
         timestamp :: Timestamp,
         className :: ClassName,
         lineNumber :: LineNumber,
         message :: Message,
         properties :: Properties }
         deriving (Read, Show)
    Was it enough simply to declare that Event derives Read? Let's see:
         *EventProcessor> :load notSoSimpleProperty.hs
    [1 of 1] Compiling Main ( notSoSimpleProperty.hs, interpreted )
    Ok, modules loaded: Main.
    *Main> let event1 = (read "Event {timestamp = 1320512200548, className = \"java.lang.String\", lineNumber = 1293, message = \"NPE in substring()\", properties = [Property {key = \"userId\", value = \"smith\"},Property {key = \"sessionId\", value = \"ABCD1234\"}]}")::Event
    *Main> show event1
    "Event {timestamp = 1320512200548, className = \"java.lang.String\", lineNumber = 1293, message = \"NPE in substring()\", properties = [Property {key = \"userId\", value = \"smith\"},Property {key = \"sessionId\", value = \"ABCD1234\"}]}"
    *Main> show (properties event1)
    "[Property {key = \"userId\", value = \"smith\"},Property {key = \"sessionId\", value = \"ABCD1234\"}]"

    This is great. We can take a String representation of an Event, as output by show, and use it directly with read to instantiate an Event variable.

    This discussion in no way described how to provide an instance of a typeclass; maybe in another post. Right now I still want to create Events from my sample Event file.

    My first cut at this is the following:

    import System.IO
    import EventProcessor

    main :: IO ()
    main = do
        inh <- openFile "sampleEvents" ReadMode
        inContents <- hGetContents inh
        let eventList = processData (lines inContents)
        hClose inh

    processData :: [String] -> [Event]
    processData = map createEvent

    createEvent :: String -> Event
    createEvent event = (read event)::Event


    While this appears to be correct, it doesn't do anything that proves I've been able to parse my example-file lines into a List of Events. Unfortunately, at this point I'm a little stuck, as I haven't yet figured out how to correctly extract elements from my Events (I can do so interactively in ghci, but I am plagued by compiler errors if I try to do so in the do block).

    I'm going to leave this for a while and go back to some online resources/tutorials, then revisit. The problem for me is that I still don't understand the I/O monad, as it's called. It's becoming clear to me that the concept of a monad -- apparently so integral to Haskell -- isn't easily grasped, as evidenced by the overwhelming number of "Here's my take on monads" tutorials on-line (one author, I see, jokes that everyone who learns about monads appears to post a tutorial on the subject shortly thereafter). So I'm off to the Haskell Wiki to learn about monads, then learn some more about Haskell I/O, and then pick up where I left off.
  8. I'm about halfway through Real World Haskell, and I've spent a week trying to decide when to write this post. As the authors point out, Haskell I/O is easy to work with. However, understanding the significance of what is going on in Haskell I/O requires a little more than simply outputting "Hello, World". The reason: once you include I/O, you start mixing pure and impure code, and a good understanding of what's going on additionally requires some discussion of monads. While my understanding of these topics is starting to gel, what I'm about to say should be taken (like everything else in this series) more as a snapshot of my transition from object-oriented, imperative-style development to functional programming than as an authoritative resource.

    I'm going to continue with the example I've been using to date, which involves working with data that represents events coming from a process-monitoring system. Instead of including the events themselves in my source code, this time I'm going to read them from a file. Here's a rehash of what the data structures look like, as output by ghci:
     Event {timestamp = 1320512200548, className = \"java.lang.String\", lineNumber = 1293, message = \"NPE in substring()\", properties = [Property {key = \"userId\", value = \"smith\"},Property {key = \"sessionId\", value = \"ABCD1234\"}]}

    My first goal is simply to read a file of these items, one per line, and echo them to the console. I'm using GHC Haskell and will be using the ghci interpreter as well. I'm going to start with my original Haskell file (which I won't reproduce here); suffice to say I would like to turn it into a module and export my data type and type synonym definitions, so I can use them from my new file processor.

    I'm starting slowly because I have to backtrack a little to fit Haskell's requirements. My original file was called eventProcessor.hs, and I see now that a Haskell module name must start with a capital letter, and that the name of the module must match the base name of the source file. My first change, then, is to rename eventProcessor.hs to EventProcessor.hs (this will be enjoyable for those of you working on a Windows system). Then, inside my file, I add the following module declaration at the top of my file (note: this also comes before any import statements):

    module EventProcessor
        (
          Timestamp,
          ClassName,
          LineNumber,
          Message,
          Key,
          Value,
          Properties,
          Property,
          Event,
          isUserWithId,
          eventGeneratedByUser,
          getEventsGeneratedByUser
        ) where

    I've decided to export basically everything I defined in this file. Don't forget, of course, that the above indentation of the parentheses is significant, rather than cosmetic. My original file also contained a number of variable definitions. I'll take those out once I'm convinced I can actually retrieve them from a file!

    Leaving the EventProcessor module for a moment, let's read the contents of my events file and output them to the screen. This example won't do much, but it will provide some notation we need to examine. Here is the source code:

    import System.IO

    main :: IO ()
    main = do
        inh <- openFile "sampleEvents" ReadMode
        inContents <- hGetContents inh
        putStrLn inContents
        hClose inh

    This is the first Haskell program I have run from a console. Note the System.IO import, and the familiar main function. Not as familiar -- the type of main. It is an "IO ()". The IO in the type not only lets you know I/O is involved, but lets you know you are dealing with a function (main) that may have side effects (is not pure). The "()" part just means that main does not return a value ("()" is an empty tuple, called "unit"). So, "main :: IO ()" is a function which has side effects and does not return a value. Additionally, of course, main is the function which is called when you execute the compiled version of the source code for this file.

    The function gets an input handle on my events file, uses a special function to get the entire contents of the file (remember that Haskell's lazy evaluation allows such a function to be practical), and puts the output to the console, finally closing the handle. Before we inspect the logic any further, here is how we compile and run the program, and what we see for output:

    c:\HaskellDev\eventProcessor>ghc --make processEventsFile.hs
    [1 of 1] Compiling Main ( processEventsFile.hs, processEventsFile.o )
    Linking processEventsFile.exe ...

    c:\HaskellDev\eventProcessor>processEventsFile.exe
    Event {timestamp = 1320512200548, className = \"java.lang.String\", lineNumber = 1293, message = \"NPE in substring()\", properties= [Property {key = \"userId\", value = \"smith\"},Property {key = \"sessionId\", value = \"ABCD1234\"}]}
    ...
    Event {timestamp = 1320512265147, className = \"javax.swing.JPanel\", lineNumber = 388, message = \"initialized\", properties = [Property {key = \"userId\", value = \"smith\"},Property {key = \"sessionId\", value = \"ABCD1234\"}]}]"


    The next interesting thing you see in the sample code is the do. Before we discuss do, let's discuss main again, with its type of IO (). In Haskell, anything with a type starting with IO is called an I/O action. I/O actions are lazily evaluated, too, and are executed when their parent I/O actions are executed. In this program, main is an I/O action, so putStrLn, also an I/O action, will be executed when main is executed. In other words, their side effects do not occur when they are evaluated, but when they are actually performed. A do block is used to chain (join) actions together; the do is necessary only if you have more than one action. The value of a do block is the last action in the block to be performed.

    Note also the <- notation. Assignment from an I/O action looks a little different than assignment from a pure function. The <- operator takes the result from performing an I/O action and stores it in a variable. As an example, let's look at openFile's type in ghci:

    *EventProcessor> :type System.IO.openFile
    System.IO.openFile
    :: FilePath
    -> GHC.IO.IOMode.IOMode
    -> IO GHC.IO.Handle.Types.Handle


    So openFile, applied to a file path and a mode, yields an I/O action of type "IO GHC.IO.Handle.Types.Handle". In our sample program, that I/O action is performed when main is performed, and at that point its result is assigned to the inh handle using the <- operator.

    If the <- operator takes the result out of an I/O action, then you might naturally ask what takes values in the opposite direction -- from pure code back into IO. The return function provides this functionality. Don't attempt to correlate Haskell's return with the return keyword in any imperative programming language you have used. As an example, suppose we have a function which returns an I/O action:

    import System.IO

    main :: IO ()
    main = do
        inh <- openFile "sampleEvents" ReadMode
        inContents <- getContentsHereInstead inh
        putStrLn inContents
        hClose inh

    getContentsHereInstead :: Handle -> IO String
    getContentsHereInstead handle = do
        inContents <- hGetContents handle
        return inContents


    I have modified my earlier program and now delegate the actual "getting" of the contents to a function. Notice that this function's return type is "IO String" -- it has side effects, as does main, only this time it returns an I/O action which, when performed, yields a String containing the file contents.

    Interestingly, note the assignment statements. In getContentsHereInstead, I use return to wrap inContents in an IO String, as that is the return value type of the function. Because hGetContents handle defines an I/O action, I could just as easily have defined the function as follows:
     getContentsHereInstead :: Handle -> IO String
     getContentsHereInstead handle = do
         hGetContents handle

    and saved myself the trouble of (1) extracting the I/O action result from the IO system, and (2) re-wrapping it in IO for the purposes of returning the correct type. An interesting situation exists in main, where I use the <- operator to assign the results of getContentsHereInstead to a pure value, then pass the pure value to putStrLn. Note, however, that I really do need the pure value to pass to putStrLn. From ghci:
         Prelude> :type putStrLn
    putStrLn :: String -> IO ()

    putStrLn returns IO (), but it actually does expect a pure String as input. So, it would be incorrect to try to "putStrLn" the actual output of getContentsHereInstead; if I try something like
         putStrLn (getContentsHereInstead inh)


    I will get the following complaint from ghci:
         processEventsFile2.hs:7:13:     
    Couldn't match expected type `[Char]' with actual type `IO String'
    Expected type: String
    Actual type: IO String
    In the return type of a call of `getContentsHereInstead'
    In the first argument of `putStrLn', namely
    `(getContentsHereInstead inh)'


    The problem is that putStrLn expected a pure String and instead got an IO String. So we see that <- takes the result of performing an I/O action and binds it to a pure value, that return wraps a pure value back up in an I/O action, and that we need to ensure we're passing around values of the expected types.

    I still haven't unmarshalled my lines in my data file to Event objects, which was the motivation of this post. To accomplish this goal, I first want to get the contents of the file, as before, and this time, pass a pure value to a non-I/O function which will re-hydrate the Event objects. To keep the post to a manageable length, I'm going to call it a day, and call this Part 1. In Part 2, coming later this week, I'll finish up.
  9. For the last few weeks, I have been building a Java process monitoring tool based on the Java Debug Interface. Although I've done much of this work before, it has been a few years, and so now I'm retracing my steps. As I remember the details and pitfalls, I've been posting my notes in the hope that you'll find them useful.

    Today I'm going to talk about ClassPrepareEvents, after a little background. As you probably already know, you can attach a debugger to an already-running Java process, or launch the target process itself from your debugger (using various command-line switches). In my project, I'm always going to be attaching to a running process, as the point is to collect process data on an as-needed basis. The reason JDI's ClassPrepareEvent is interesting is that, when you launch a debug target process, or even when you attach to an already-running process, it's likely that some of your desired breakpoints lie in classes which have not yet been loaded.

    In my usual scenario, I call the com.sun.jdi.VirtualMachine's allClasses() method to get a list of all loaded ReferenceTypes. One way to think of a ReferenceType is as a chunk of a Java class definition. If your Java class has inner classes, then they will be broken out by JDI into separate ReferenceTypes. Each ReferenceType contains a collection of line locations; these correspond to lines of code on which breakpoints can be set and are identified by (among other things) the source-code line number. If a line of source code cannot be the target of a breakpoint, then there will not be a line location for it in the ReferenceType. In my debugger-based applications, I step through the line locations of all the ReferenceTypes, matching up line locations with breakpoint specifications, and then register my breakpoint requests.
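
    For reference, a minimal sketch of that matching step might look like the following, where className and lineNumber stand in for one of my breakpoint specifications and the suspend policy is just one reasonable choice:

    EventRequestManager erm = virtualMachine.eventRequestManager();
    for (ReferenceType refType : virtualMachine.allClasses())
    {
        if (!refType.name().equals(className))
        {
            continue;
        }
        try
        {
            // locationsOfLine() returns the line locations matching the requested
            // source line (an empty list if that line isn't breakpoint-able)
            for (Location loc : refType.locationsOfLine(lineNumber))
            {
                BreakpointRequest bpReq = erm.createBreakpointRequest(loc);
                bpReq.setSuspendPolicy(EventRequest.SUSPEND_EVENT_THREAD);
                bpReq.enable();
            }
        }
        catch (AbsentInformationException aie)
        {
            // the class was compiled without line-number debug information
        }
    }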

    As you can guess, I have a potential problem: what should I do if a class I need has not yet been loaded at the time I'm constructing my breakpoint requests? The answer is: JDI's ClassPrepareEvent. The entry point for using this part of the API is the EventRequestManager's createClassPrepareRequest() method. Once we've made our request, the same event-listener loop we use to wait for breakpoint events can also be used to wait for class prepare events (see the JVM specification for a definition of class preparation).
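
    A minimal sketch of that registration, and of handling the resulting events in the same loop, might look like this (the class filter is just an illustration of narrowing the events you receive):

    EventRequestManager erm = virtualMachine.eventRequestManager();
    ClassPrepareRequest cpReq = erm.createClassPrepareRequest();
    cpReq.addClassFilter("com.mycompany.*");   // hypothetical filter pattern
    cpReq.enable();

    // later, inside the same event loop used for breakpoint events:
    if (evt instanceof ClassPrepareEvent)
    {
        ReferenceType refType = ((ClassPrepareEvent) evt).referenceType();
        // match refType against the breakpoint specifications and register any
        // breakpoint requests that apply to this newly-prepared class
    }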

    One thing I remember from my previous development on this API is that there is a timing risk here. You probably want to create the class prepare request before you iterate over the list of currently-loaded classes. The reason is that you don't want to fall into this trap:
    1. Iterate over a set of the currently-loaded classes, processing and making breakpoint requests.
    2. Suddenly, a class you need is loaded!
    3. You register your class-prepare request and start getting events as classes are loaded, but you miss the class that loaded between step #1 and step #3, while you weren't yet listening.
    Here's another possible trap:
    1. Register for class-prepare events so you don't get caught by the above issue.
    2. Iterate over the currently-loaded classes, requesting breakpoints as necessary.
    3. Process newly-loaded classes, requesting breakpoints as necessary.
    The problem with this second approach is that you may process the same breakpoint twice. Why? By the time you iterate over the currently-loaded classes, some of the classes in that list are very likely going to be classes which have shown up in your class-prepare listener. Neither of these problems can be fixed by slapping a synchronized keyword somewhere.

    Whether you launch your target application from your debugger or attach to it after the fact, you will have to deal with some variation of this issue. The way I deal with it is to add some state to the class I use to define each breakpoint specification. As each corresponding loaded class is found and the breakpoint request is made, I set a flag on the specification so that I know the request was registered. Further, I follow the second approach outlined above (better to have duplicates than to miss one). If I see a class-prepare event for a class I've already processed from the VM's ReferenceType list, then I simply skip over it. I do the same for the reverse situation, in which my list of ReferenceTypes contains ReferenceTypes which I have just processed in my ClassPrepareEvent listener.
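
    In code, that bookkeeping amounts to something like the following sketch; BreakpointSpec, matches(), and registerBreakpointRequest() are hypothetical names for my own specification class and helpers, not JDI API:

    // called both while iterating over allClasses() and from the
    // ClassPrepareEvent handler; the 'registered' flag makes it harmless to see
    // the same class from both paths
    for (BreakpointSpec spec : breakpointSpecs)
    {
        if (spec.isRegistered() || !spec.matches(refType))
        {
            continue;   // already requested, or not a class this spec cares about
        }
        registerBreakpointRequest(refType, spec);   // hypothetical helper
        spec.setRegistered(true);
    }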

    Finally, one issue I have not looked at before (either for this development effort, or in my previous development in this area) -- what happens when a class is unloaded, especially a class on which you have registered breakpoint requests. For example, will a registered breakpoint request prevent a class from being unloaded? Do you care about a stranded breakpoint request if the class isn't even loaded? (Answer: yes, I suppose, if it gets reloaded and you no longer have a valid breakpoint request for it). JDI does have a ClassUnloadEvent, for which you can also register a listener. As I said, I have not dealt with this (possible) issue, having never seen a target class get unloaded before, but it's good to know "there's an API for that".


  10. After my last post scrolled off the bottom of the page, I realized I missed a couple of opportunities: one related to some additional code optimization, and one related to the topic of lazy (or nonstrict) evaluation.

    First, let me review what I was doing. I was processing a data structure I was using to represent events generated from a process monitor. The structure includes a timestamp, a class name and source line number, a text message, and a set of properties. Displaying one of these in ghci yields the following:

    Event {timestamp = 1320513130333, className = "com.adamsresearch.jarview.JarView", lineNumber = 388, message = "fileNotFound", properties = [Property {key = "userId", value = "adams"},Property {key = "sessionId", value = "EFGH5678"}]}

    My goal was to write -- in authentic FP style -- some code to extract, from a List of these events, events whose userId property value matched a particular user ID. Here is the code (minus type definitions, etc.) I ended up using:

    eventGeneratedByUser :: String -> Event -> Bool
    eventGeneratedByUser id event =
        if (containsUserPropertyWithValue (properties event) id)
            then True
            else False
        where
            containsUserPropertyWithValue (x:xs) id =
                if (isUserWithId x id)
                    then True
                    else containsUserPropertyWithValue xs id
                where
                    isUserWithId prop id =
                        if (key prop == "userId" && value prop == id)
                            then True
                            else False
            containsUserPropertyWithValue [] _ = False

    Finally, in ghci, I retrieved the list of events I wanted with the following:

    *Main> let filterPred = eventGeneratedByUser "adams"
    *Main> let eventsForAdams = filter filterPred eventList

    One theme object-oriented developers hear about repeatedly in FP is to stop thinking in loops, and start thinking in terms of recursion. An additional theme is (as best as I can state it) to avoid recursion when you can use a mapping-style function, that is, a function which structurally is defined to operate on all the elements of a list. When you set up a recursive function to process a list, a large part of what you are building is boilerplate. Boilerplate is the developer's curse: some live with it (care to pound out the boilerplate for Java's GridBagLayout and GridBagConstraints, anyone?), while others replace it with "frameworks" that the rest of us usually find even less appealing than writing the boilerplate code ourselves.

    As you can tell from the above source code, I used recursion to create the boolean function which determines if an event is associated with a user. However, when I actually filtered the list, instead of using a recursive function I used Haskell's filter, passing it my partial function. For that last step, I feel I finally was writing in the spirit of Haskell and FP in general, but now let's take a look at that source code again and see if there's room for improvement.

    Note first my innermost nested function, isUserWithId:

    isUserWithId prop id =
        if (key prop == "userId" && value prop == id)
            then True
            else False

    Note from my old post that I used Haskell record syntax to define the Property data type and that Haskell has used type inference to recognize that I am passing a Property structure to the isUserWithId function. All this function does is look at a key-value pair and return True if the key is userId and the value matches the parameter I pass in. I think this function is fine "as is".

    The enclosing function, found in the where clause of the function eventGeneratedByUser, is a recursive function. Let's repeat the whole function:

    eventGeneratedByUser :: String -> Event -> Bool
    eventGeneratedByUser id event =
        if (containsUserPropertyWithValue (properties event) id)
            then True
            else False
        where
            containsUserPropertyWithValue (x:xs) id =
                if (isUserWithId x id)
                    then True
                    else containsUserPropertyWithValue xs id
                where
                    isUserWithId prop id =
                        if (key prop == "userId" && value prop == id)
                            then True
                            else False
            containsUserPropertyWithValue [] _ = False

    What I am doing with this function is essentially iterating over the Property elements of an Event, returning True as soon as I find a Property that matches the desired condition.

    Can this be converted from a recursive function to something a little more compact? As it turns out, yes, easily. Haskell contains a function called any, which takes a predicate and a List and returns True if the predicate returns True for any element of the List. To use this function in my example, I would rearrange my code so that the isUserWithId test can be applied to a single Property, and then let any apply it across the Event's List of Properties. If you are wondering whether I will be wasting cycles evaluating tests I may not need, bear with me for a few minutes.

    The first question is how to generate a List of boolean values, the results of applying isUserWithId to a List of properties. To start, let me promote isUserWithId to a standalone function, but before we do that, let's do the same thing we did in my last post -- plan ahead to be able to use it as a partial function. By this I mean we'll want to apply this function to a list of Property values, where the "user ID" will be built-in, so to speak, to the function definition. I'll redefine the function to take the user ID first, so I can create a partial function later which I can apply directly to a Property (note this involves re-ordering the arguments in the isUserWithId equation, also):

    isUserWithId :: String -> Property -> Bool
    isUserWithId id prop =
        if (key prop == "userId" && value prop == id)
            then True
            else False

    We can verify the function, both in its "full form", where we pass in both the user ID and the Property, and in its partial form, where we first create a partial function with the user ID specified and then pass in only a Property:

    *Main> prop4
    Property {key = "userId", value = "smith"}
    *Main> isUserWithId "smith" prop4
    True
    *Main> let containsSmith = isUserWithId "smith"
    *Main> containsSmith prop4
    True

    Next, I'll take the existing three, short Property Lists from my example and concatenate them into a larger list:

    *Main> let bigPropList = concat [propList1,propList2,propList3]
    *Main> bigPropList
    [Property {key = "userId", value = "smith"},Property {key = "sessionId", value = "ABCD1234"},Property {key = "userId", value = "adams"},Property {key = "sessionId", value = "EFGH5678"},Property {key = "userId", value = "jobim"},Property {key = "sessionId", value = "WXYZ9876"}]

    Now let's use any to return True if any item in the list matches our partial function, which is testing to see if a Property has user ID "smith":

    *Main> any containsSmith bigPropList
    True

    and that is literally all we had to do to get our result. Note the interesting invocation of any: our partial function takes a Property, not a List of Property values. The any function steps through the elements of the list and returns True as soon as the invocation of containsSmith against a single Property returns True.

    To integrate this code into our new version of eventGeneratedByUser, we could do the following:

    eventGeneratedByUser :: String -> Event -> Bool
    eventGeneratedByUser id event =
        any (isUserWithId id) (properties event)

    and then test this in ghci:

    *Main> head eventList
    Event {timestamp = 1320512200548, className = "java.lang.String", lineNumber = 1293, message = "NPE in substring()", properties = [Property {key = "userId", value = "smith"},Property {key = "sessionId", value = "ABCD1234"}]}
    *Main> eventGeneratedByUser "smith" (head eventList)
    True

    Note the code is considerably more compact, and (unlike in some languages where "terse" = "dense") actually more readable. In fact, things are so clean we can go a step farther than we did before and provide a function to return a subset of the List, something we just did interactively in ghci the last time:

    getEventsGeneratedByUser :: String -> [Event] -> [Event]
    getEventsGeneratedByUser id events =
        filter (eventGeneratedByUser id) events

    and, testing in ghci:

    *Main> getEventsGeneratedByUser "jobim" eventList
    [Event {timestamp = 1320512200699, className = "javax.swing.JPanel", lineNumber = 388, message = "initialized", properties = [Property {key = "userId", value = "jobim"},Property {key = "sessionId", value = "WXYZ9876"}]},Event {timestamp = 1320512203699, className = "javax.swing.JList", lineNumber = 1255, message = "model replaced", properties = [Property {key = "userId", value = "jobim"},Property {key = "sessionId", value = "WXYZ9876"}]},Event {timestamp = 1320512259333, className = "javax.lang.Integer", lineNumber = 133, message = "number format exception", properties = [Property {key = "userId", value = "jobim"},Property {key = "sessionId", value = "WXYZ9876"}]}]

    Note how compact and readable our resulting code is. We have one function which determines if a Property contains a user ID of a specified value -- 5 lines of code, including the type declaration. We have two additional functions, one of which returns True if an Event was generated by a specified user, and one which returns the subset of a List of Events satisfying this condition -- each 3 lines of code, including the type declaration. And as I mentioned earlier, in this case, compact actually means readable.

    Lazy Evaluation

    In the above code, you might have been wondering "What is the performance impact of looping through a List of, say, 5000 properties and returning True if only one of them meets the test? Especially if the first property meets the test?" The answer lies in the fact that Haskell evaluates expressions only when they are needed. Technically, in Haskell this is called nonstrict evaluation, and it allows you to do some things that would truly look odd in Java. For example, in Haskell you can define a list as a range of elements using ..; the following defines a list of integer values from 1 through 9:

    *Main> [1..9]
    [1,2,3,4,5,6,7,8,9]

    While this may look normal, what do you think of this?

    Prelude> [1..]
    [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20
    ,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40
    ,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60
    ,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80
    ,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99,100
    ,101,102,103,104,105,106,107,108,109,110,111,112,113,114,115,116,117,118,119,120
    ,121,122,123,124,125,126,127,128,129,130,131,132,133,134,135
    ...

    where I have added the line breaks to make it easier to read. Big surprise: this is an infinite list starting with the integer 1, and ghci will happily continue printing this output to the screen forever, or until Something Really Bad Happens, whichever comes first. I had to hit Control-C to stop the output. But, you can assign this infinite list to a variable:

    Prelude> let infiniteList = [1..]
    Prelude>

    with no problems. The reason is that infiniteList has not yet been evaluated (populated). If you were to try to show this in ghci, you'd be back to the infinite output again. Nonstrict evaluation allows us to do some interesting things. For example, you can take the first 11 items of an infinite list:

    Prelude> take 11 infiniteList
    [1,2,3,4,5,6,7,8,9,10,11]
    Prelude>

    This feature allows us to write code that (for example) returns True if "any" element in a List satisfies some predicate. Haskell isn't going to evaluate every element of the List up front and then step through the List applying the predicate. So it doesn't matter how many properties my events have; any returns True as soon as it finds a match. Note that there are cases where all the elements will indeed be evaluated; for example, if you are summing the elements of a List (of course, you didn't expect Haskell to give you the sum of the elements of an infinite List, right?).

    Nonstrict evaluation is already one of my favorite features of Haskell, right up there with standard library functions that operate on all the elements of a List. Next post, I expect to discuss I/O in Haskell, continuing along with the process-event-monitoring example I've been using to date.