
Batch processing for Java Platform – partitions and error handling

13th March 2016 · Batch, Software development

In a previous post I wrote about the basic capabilities of Batch Processing for the Java Platform. Now I have tried out partitions and error handling. Let's look at error handling first. With chunk-style steps we have the following options:

  • Skipping the current item when a skippable exception is thrown.
  • Retrying a step when a retryable exception is thrown. The transaction for the current chunk will be rolled back, unless the exception is also configured as a no-rollback exception.
  • No rollback for a given exception class.

Processing applies to exceptions thrown from all phases (read, process, write) and from the checkpoint commit. We can specify which exception classes are included in each option – that is, whether a given exception class is skippable. We can also exclude some exception classes, and the batch processing engine will use a nearest-class algorithm to choose the appropriate strategy. If we include exception class A as skippable, A has two subclasses B and C, and we configure an exclude rule for class C, then exceptions A and B will cause the engine to skip processing, while exception C will not. If no other strategy is configured, the job execution will be marked as FAILED for C.

Let's see an example.
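A minimal sketch of such a skip configuration, assuming the chunk step and bean ids used throughout this demo:

```xml
<jsl:step id="resource-processor" next="path-decision">
    <jsl:chunk item-count="3">
        <jsl:reader ref="knowledgeResourceReader"/>
        <jsl:processor ref="knowledgeResourceProcessor"/>
        <jsl:writer ref="knowledgeResourceWriter"/>
        <jsl:skippable-exception-classes>
            <jsl:include class="pl.spiralarchitect.kplan.batch.InvalidResourceFormatException"/>
        </jsl:skippable-exception-classes>
    </jsl:chunk>
</jsl:step>
```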

Here we skip processing for the pl.spiralarchitect.kplan.batch.InvalidResourceFormatException exception. We add code that throws this exception to the factory method for KnowledgeResource:
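A sketch of the factory method; the "title;year" line format and the method name are illustrative, and InvalidResourceFormatException is assumed to be unchecked:

```java
public static KnowledgeResource fromLine(String line) {
    // Assumed format: "title;year" – anything else is invalid.
    String[] parts = line.split(";");
    if (parts.length != 2) {
        throw new InvalidResourceFormatException("Invalid resource line: " + line);
    }
    return new KnowledgeResource(parts[0].trim(), Integer.parseInt(parts[1].trim()));
}
```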

In the test file we make one line invalid:
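For illustration, assuming the same "title;year" format, the file could look like this, with a malformed second line:

```
Hadoop 2 and Hive;2015
Functional Programming in Scala
Murach's Java Servlets and JSP;2014
```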

Processing for this line will be skipped.

For batchlet-style steps we need to handle exceptions ourselves. If an exception escapes the step, the job will be marked as FAILED.

Another, more advanced feature is partitioning of steps. Partitioning is available for both chunk- and batchlet-style steps. Consider the example XML below:
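A minimal sketch of such a configuration (the bean ids are illustrative):

```xml
<jsl:step id="resource-processor">
    <jsl:chunk item-count="3">
        <jsl:reader ref="knowledgeResourceReader"/>
        <jsl:processor ref="knowledgeResourceProcessor"/>
        <jsl:writer ref="knowledgeResourceWriter"/>
    </jsl:chunk>
    <jsl:partition>
        <!-- A static plan; the same information could instead be computed
             at runtime by a partition mapper. -->
        <jsl:plan partitions="2" threads="2"/>
        <jsl:collector ref="knowledgeResourcePartitionCollector"/>
        <jsl:analyzer ref="knowledgeResourcePartitionAnalyzer"/>
    </jsl:partition>
</jsl:step>
```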

In this configuration we specify that there are two partitions and two threads to process them, so one thread per partition. This configuration can also be specified using a partition mapper, as the comment in the XML configuration snippet describes.

The partition collector's role is to gather data from each partition and pass it to the analyzer. There is a separate collector instance for each partition thread.

The partition analyzer collects data from all partitions and runs on the main thread. It can also decide on the batch status value to be returned.

In order to understand how this works it may be helpful to look at the algorithm descriptions in chapter 11 of the JSR spec. An important detail described there is that for each partition the step's intermediary data is stored in the StepContext. With this knowledge we can create a super simple collector – keeping in mind that we could process the intermediary result here, we just don't need to:
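A minimal sketch of such a collector, assuming the writer puts each chunk's results into the step's transient user data:

```java
import java.io.Serializable;
import javax.batch.api.partition.PartitionCollector;
import javax.batch.runtime.context.StepContext;
import javax.inject.Inject;
import javax.inject.Named;

@Named("knowledgeResourcePartitionCollector")
public class KnowledgeResourcePartitionCollector implements PartitionCollector {

    @Inject
    private StepContext stepContext;

    @Override
    public Serializable collectPartitionData() throws Exception {
        // Hand the intermediary result over to the analyzer; we could
        // post-process it here, we just don't need to.
        return (Serializable) stepContext.getTransientUserData();
    }
}
```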

This collector will be executed for each chunk with the result stored by the writer step – you can find the modified KnowledgeResourceWriter below:
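A sketch of the modified writer; instead of writing to the shared bean directly, it exposes each chunk's items through the step's transient user data (the exact body is illustrative):

```java
import java.util.ArrayList;
import java.util.List;
import javax.batch.api.chunk.AbstractItemWriter;
import javax.batch.runtime.context.StepContext;
import javax.inject.Inject;
import javax.inject.Named;

@Named("knowledgeResourceWriter")
public class KnowledgeResourceWriter extends AbstractItemWriter {

    @Inject
    private StepContext stepContext;

    @Override
    public void writeItems(List<Object> items) throws Exception {
        ArrayList<KnowledgeResource> resources = new ArrayList<>();
        for (Object item : items) {
            KnowledgeResource resource = (KnowledgeResource) item;
            System.out.println("storing resource: " + resource);
            resources.add(resource);
        }
        // Make this chunk's results visible to the partition collector.
        stepContext.setTransientUserData(resources);
    }
}
```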

Then we can write the code for a super simple analyzer:
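A sketch of the analyzer; the shared KnowledgeResources bean and the JobContext hand-off are assumptions matching the description below:

```java
import java.io.Serializable;
import java.util.List;
import javax.batch.api.partition.AbstractPartitionAnalyzer;
import javax.batch.runtime.context.JobContext;
import javax.inject.Inject;
import javax.inject.Named;

@Named("knowledgeResourcePartitionAnalyzer")
public class KnowledgeResourcePartitionAnalyzer extends AbstractPartitionAnalyzer {

    @Inject
    private JobContext jobContext;

    @Inject
    private KnowledgeResources knowledgeResources;

    @Override
    public void analyzeCollectorData(Serializable data) throws Exception {
        // Runs on the main thread, once per payload sent by a collector.
        @SuppressWarnings("unchecked")
        List<KnowledgeResource> resources = (List<KnowledgeResource>) data;
        resources.forEach(knowledgeResources::add);
        // Pass the parameters for the next step, as the writer used to do.
        jobContext.setTransientUserData(resources);
    }
}
```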

KnowledgeResourcePartitionAnalyzer adds all resources to the same list that the writer step used to fill. It also sets the parameters used by the next step, which were previously set in the writer.

Now when we execute the modified example we see some strange output.

Yikes!

1. We did not tell each partition which data it should analyze, nor did we put this logic (deciding which data should be processed) in the step definition itself.

2. "storing resource: KnowledgeResource [title=Hadoop 2 and Hive, published= 2015]" is repeated four times! This is because:

“The collector is invoked at the conclusion of each checkpoint for chunking type steps and again at the end of partition; it is invoked once at the end of partition for batchlet type steps.”

Back to the lab, then.

To fix the first problem we simply split the file from the first step into two files – one for each partition. We will create the two files statically (articles1.txt and articles2.txt). Each partition will read one file, and the file name will be configured (a simplification for the demo – in a real-life application this would probably be a bit more complicated). So we need to change the reading implementation and the configuration.
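A sketch of the updated configuration; only the partition-relevant parts are shown, and the bean ids are illustrative:

```xml
<jsl:step id="resource-processor">
    <jsl:chunk item-count="3">
        <jsl:reader ref="knowledgeResourceReader">
            <jsl:properties>
                <jsl:property name="fileName" value="#{partitionPlan['fileName']}"/>
            </jsl:properties>
        </jsl:reader>
        <jsl:processor ref="knowledgeResourceProcessor"/>
        <jsl:writer ref="knowledgeResourceWriter"/>
    </jsl:chunk>
    <jsl:partition>
        <jsl:plan partitions="2" threads="2">
            <jsl:properties partition="0">
                <jsl:property name="fileName" value="articles1.txt"/>
            </jsl:properties>
            <jsl:properties partition="1">
                <jsl:property name="fileName" value="articles2.txt"/>
            </jsl:properties>
        </jsl:plan>
        <jsl:collector ref="knowledgeResourcePartitionCollector"/>
        <jsl:analyzer ref="knowledgeResourcePartitionAnalyzer"/>
    </jsl:partition>
</jsl:step>
```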

The important lines in the configuration above are:

* <jsl:property name="fileName" value="articles1.txt"/> and <jsl:property name="fileName" value="articles2.txt"/> – here we configure the file name for each partition.

* <jsl:property name="fileName" value="#{partitionPlan['fileName']}"/> – here we configure the file name for the reader component; this needs to be done to transfer the configured file name from the partition plan to the step properties (an inconvenience ;) – JSR-352 is young).

The reader component will get the fileName injected (a cool feature of the young JSR) and will use it to construct the resource path:
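A sketch of the relevant fragment of the reader; the work-directory resolution is an assumption:

```java
import java.io.BufferedReader;
import java.io.Serializable;
import java.nio.file.Files;
import java.nio.file.Paths;
import javax.batch.api.BatchProperty;
import javax.batch.api.chunk.AbstractItemReader;
import javax.inject.Inject;
import javax.inject.Named;

@Named("knowledgeResourceReader")
public class KnowledgeResourceReader extends AbstractItemReader {

    // Injected from the partition plan via the step-level property.
    @Inject
    @BatchProperty(name = "fileName")
    private String fileName;

    private BufferedReader reader;

    @Override
    public void open(Serializable checkpoint) throws Exception {
        // Resolve the partition's file against the job's work directory.
        reader = Files.newBufferedReader(Paths.get("work").resolve(fileName));
    }

    // readItem(), checkpointInfo() and close() stay as in the previous post.
}
```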

This fixes the first problem.

To fix the second problem we need to know whether we are at the end of a partition. We can check this in a few ways:

* check if we have processed the given number of elements – this would require some service that monitors all partition processing

* check the thread name – partitions are executed on separate threads; a poor solution, as threads may be pooled, so this won't work

* check if we have already processed a given element (using some hash)

I haven't found any API that would tell us whether partition execution is done and this is the last call to the collector. Maybe this last call is a bug in JBeret (JBoss' implementation of Batch Processing for the Java Platform).

We will try the last solution. We could check whether we already processed a chunk in the analyzer, but since this is more of a technical detail (the important thing was to find out why we got a duplicated element) we will just handle it in the data structure for KnowledgeResources – we simply replace the List with a Set:
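A sketch of the changed bean, assuming KnowledgeResource implements equals and hashCode:

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import javax.enterprise.context.ApplicationScoped;

@ApplicationScoped
public class KnowledgeResources {

    // A Set instead of a List: the duplicate delivered by the extra
    // collector invocation is now absorbed. A concurrent set is used
    // as a precaution, since batch steps may run on multiple threads.
    private final Set<KnowledgeResource> resources = ConcurrentHashMap.newKeySet();

    public void add(KnowledgeResource resource) {
        resources.add(resource);
    }

    public Set<KnowledgeResource> getResources() {
        return resources;
    }
}
```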

After these changes we get the correct number of processed entries 🙂

Take a look at this post on Roberto Cortez's blog regarding Java EE 7 batch processing.

Have a nice day! The code is on GitHub.

 

Trying out Java EE Batch

7th March 2016 · Batch, Java EE, Software development

I decided to try Java EE Batch first, then Spring Batch, and finally Spring for Hadoop with Pig. Java EE Batch is a relatively new specification, but for a number of use cases it will be sufficient. That sentence says it all – I think that Java EE Batch is ok. Nothing more yet. As you will see, you have to do a lot of things yourself, and doing other things is just cumbersome.

So what is Java EE Batch, a.k.a. JSR 352 (Batch Processing for the Java Platform)? It is a Java EE family specification describing a framework for batch job processing, heavily inspired by Spring Batch. It allows jobs to be described in the Job Specification Language (JSL) – only an XML format is supported. I guess that for batch apps this may be ok. I can live with it 😉

Java EE Batch specifies two types of processing:

  • Chunked – here we have three phases: reading, processing and writing. Chunked steps can be restarted from a checkpoint.
  • Batchlet – this is a free-form step: do whatever is required.

Java EE Batch has many features like:

  • Routing processing using decision steps
  • Grouping steps into flows
  • Partitioned steps – step instances executed in parallel. A partitioned step allows splitting a job across multiple threads in a more technical way. We can specify a partition mapper, a partition reducer and other partition-specific elements. I won't go into detail on this now.
  • Splits – splits specify flows to be executed in parallel; there can be a few flow definitions, so this is a more process- or business-oriented parallelization of work.
  • Exception handling:
    • Skipping items
    • Retrying steps
  • JobContext to pass batch job data to steps

CDI is used and supported. But with all of this, specifying job context parameters is inconvenient (in 2005 I would have said it's ok ;)), and we must do a lot ourselves – like moving files, reading them and passing intermediate results. It would be nice to have some utilities to help with standard batch-related chores.

Ok, let's see some example – below you can find a JSL definition for a demo batch job. You can build this XML with ease using the Eclipse XML editor (or some plugins for NetBeans) – just choose the "XML from schema" option and then select XML Catalog and the jobXML_1_0.xsd entry:

(screenshot: xml_schema_jpl – selecting the jobXML_1_0.xsd entry in the Eclipse XML Catalog)

First we move the file from the import location to a temporary location, so it can be processed without any problems:
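A sketch of the first step's definition (the step id is illustrative):

```xml
<jsl:step id="initialize-import" next="resource-router">
    <jsl:batchlet ref="initializeResourceImport"/>
</jsl:step>
```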

next="resource-router" specifies the next step to be invoked – there is a possibility of a typo here. So, an inconvenience.

<jsl:batchlet ref="initializeResourceImport"/> – specifies a batchlet (free-form job) step that will move the file. The instance of the class for this step will be instantiated as a CDI bean named "initializeResourceImport":
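A sketch of the batchlet; the concrete paths are illustrative:

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;
import javax.batch.api.AbstractBatchlet;
import javax.inject.Named;

@Named("initializeResourceImport")
public class InitializeResourceImport extends AbstractBatchlet {

    @Override
    public String process() throws Exception {
        // Move the input file from the import location to the work
        // directory, so it can be processed without interference.
        Path source = Paths.get("import/articles.txt");
        Path workDir = Paths.get("work");
        Files.createDirectories(workDir);
        Files.move(source, workDir.resolve(source.getFileName()),
                StandardCopyOption.REPLACE_EXISTING);
        return "COMPLETED";
    }
}
```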

Here we move the file to a workDir location.

Next we start a flow and execute another step type – a chunked step. The id of the next step is given in the next attribute.
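A sketch of the flow with the chunked step (ids and item-count are illustrative):

```xml
<jsl:flow id="resource-router" next="path-decision">
    <jsl:step id="resource-processor">
        <jsl:chunk item-count="3">
            <jsl:reader ref="knowledgeResourceReader"/>
            <jsl:processor ref="knowledgeResourceProcessor"/>
            <jsl:writer ref="knowledgeResourceWriter"/>
        </jsl:chunk>
    </jsl:step>
</jsl:flow>
```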

Each of the phases is implemented by a CDI bean. So reading looks like this:
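A sketch of the reader; the file location is an assumption:

```java
import java.io.BufferedReader;
import java.io.Serializable;
import java.nio.file.Files;
import java.nio.file.Paths;
import javax.batch.api.chunk.AbstractItemReader;
import javax.inject.Named;

@Named("knowledgeResourceReader")
public class KnowledgeResourceReader extends AbstractItemReader {

    private BufferedReader reader;
    private long linesRead; // doubles as the checkpoint

    @Override
    public void open(Serializable checkpoint) throws Exception {
        reader = Files.newBufferedReader(Paths.get("work/articles.txt"));
        if (checkpoint != null) {
            // On restart, skip the lines processed before the last checkpoint.
            long alreadyRead = (Long) checkpoint;
            for (long i = 0; i < alreadyRead; i++) {
                reader.readLine();
            }
            linesRead = alreadyRead;
        }
    }

    @Override
    public Object readItem() throws Exception {
        // A single invocation reads a single line; null ends the input.
        String line = reader.readLine();
        if (line != null) {
            linesRead++;
        }
        return line;
    }

    @Override
    public Serializable checkpointInfo() throws Exception {
        return linesRead;
    }

    @Override
    public void close() throws Exception {
        reader.close();
    }
}
```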

We open and close the stream in this step, and we provide a checkpoint definition in order to be able to restart from some point. A single invocation of this reader reads only a single line, which is passed down to the process part of the step. It would be easier if the Batch framework provided some reading utilities and let us worry about doing important stuff with the data rather than about opening and properly closing files. But the foundation functionality provided now is also required, and features that make Batch Processing for the Java Platform easier to use will hopefully be added in future versions of the spec.

The next phase is processing the line of data read from the file – the only thing done here is conversion from a String to a KnowledgeResource instance:
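A sketch of the processor; the factory method name is illustrative:

```java
import javax.batch.api.chunk.ItemProcessor;
import javax.inject.Named;

@Named("knowledgeResourceProcessor")
public class KnowledgeResourceProcessor implements ItemProcessor {

    @Override
    public Object processItem(Object item) throws Exception {
        // Convert the raw line into a domain object.
        return KnowledgeResource.fromLine((String) item);
    }
}
```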

And finally we can write the data:
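A sketch of the writer; the bean names follow the demo, the exact body is illustrative:

```java
import java.util.List;
import javax.batch.api.chunk.AbstractItemWriter;
import javax.inject.Inject;
import javax.inject.Named;

@Named("knowledgeResourceWriter")
public class KnowledgeResourceWriter extends AbstractItemWriter {

    @Inject
    private KnowledgeResources knowledgeResources;

    @Override
    public void writeItems(List<Object> items) throws Exception {
        // Receives the whole chunk produced by the processing phase.
        for (Object item : items) {
            KnowledgeResource resource = (KnowledgeResource) item;
            System.out.println("storing resource: " + resource);
            knowledgeResources.add(resource);
        }
    }
}
```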

This is also simple processing, in order to have some fun with the Java EE Batch framework without worrying about more job-related details. The important part here is that this phase receives a list of objects that are the results of the processing phase.

Note the use of an application-scoped CDI bean of class KnowledgeResources – this is done to ease testing. Of course, in a real-life batch job I would not think about using an application-scoped bean without any special handling – it's like having a state field in a Servlet where requests store data 😉 A very bad idea. The job passes a result of processing that will be used by the Decider instance to route processing. This simplish demo uses a hard-coded value.

After processing the data in the resource-processor step we proceed to the decision step – path-decision. This step is also implemented by a CDI bean, named pathDecider.
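A sketch of the decision wiring in the JSL; apart from STORE, the on values and target ids are illustrative:

```xml
<jsl:decision id="path-decision" ref="pathDecider">
    <jsl:next on="STORE" to="resource-store"/>
    <jsl:next on="REPORT" to="resource-report"/>
</jsl:decision>
```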

Depending on the value returned from the Decider instance, a route will be chosen. The steps to be invoked next are designated by id in the to attribute. Java-based config would allow us to get rid of some typos here.

Again, the implementation is simplish:
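A sketch of the decider:

```java
import javax.batch.api.Decider;
import javax.batch.runtime.StepExecution;
import javax.inject.Named;

@Named("pathDecider")
public class PathDecider implements Decider {

    @Override
    public String decide(StepExecution[] executions) throws Exception {
        // A hard-coded routing value for the demo.
        return "STORE";
    }
}
```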

The Decider chooses one of two steps (ok, it chooses the STORE step every time…).

The CDI beans for both steps are similar:
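A sketch of the STORE-side bean; the bean name and body are illustrative:

```java
import javax.batch.api.AbstractBatchlet;
import javax.inject.Inject;
import javax.inject.Named;

@Named("storeResources")
public class StoreResources extends AbstractBatchlet {

    @Inject
    private KnowledgeResources knowledgeResources;

    @Override
    public String process() throws Exception {
        // The demo just logs what was collected; a real job would persist it.
        knowledgeResources.getResources().forEach(System.out::println);
        return "COMPLETED";
    }
}
```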

An important note about running the test – we have to use a Stateless EJB to start the batch because of class loader ordering – Arquillian does not see the Java EE Batch resources:
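A sketch of such a starter EJB; the job XML name is an assumption (META-INF/batch-jobs/resource-import.xml):

```java
import java.util.Properties;
import javax.batch.runtime.BatchRuntime;
import javax.ejb.Stateless;

@Stateless
public class BatchImportService {

    public long startImport() {
        // Starts the job described by the assumed job XML and
        // returns the execution id.
        return BatchRuntime.getJobOperator().start("resource-import", new Properties());
    }
}
```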

So we use BatchImportService to start the batch. In a real app we would probably use Claim Check-like processing in order to pass the data to an SFTP server and trigger processing with a JMS message or a web service call. We could also monitor the SFTP server using some service (Oracle Service Bus, for example, can monitor SFTP and start processing).

And this is it – a simple evaluation of the basic features of Java EE Batch (or Batch Processing for the Java Platform). The next thing would be to add the partition or split features. In the meantime – the code is on GitHub.

Apart from batch applications we could use stream or event processing. The chosen architecture depends on the requirements of the system under construction. If order is important but processing delay is not, then batch processing is worth checking out. If we need to react more quickly, or we want to continue processing data even though processing of some part of it failed, and we want high reliability, then messaging may be the way to go. But all this is a subject for another article 🙂

Have a nice day and a lot of fun with Java EE!