Saturday, February 27, 2010

Parsing Huge Text Files Using Java and JSapar

Last week a friend and I set out to parse a huge text file containing reports from legacy devices. After a few attempts we found that opening and parsing huge text files in Java consumes a lot of time and resources. We started with a 35 MB log file. Neither of us had worked with text files of that size before, so we went looking for a suitable solution. Java is arguably not the best fit for this kind of problem, and I believe Python or Perl could do the job with better performance. However, because of later development plans and project requirements, we decided to stay with Java. After some searching on the web we found a brilliant tool. Tigris.org hosts some valuable open source projects, and JSapar is one of them. JSapar is a Java library that provides a schema-based parser/producer of CSV (Comma Separated Values) and flat files.
The file is imported into an object-oriented model, which we called telegrams. The parser produces a Document object representing the content of the file, or you can choose to receive an event for each line that has been parsed successfully. According to the project page on Tigris.org, JSapar can handle huge files without loading everything into memory.
The library is simple to use and can be extended. Our log file consists of thousands of lines like the sample below:

948853 : 47 E6 18 FF 04 CD 0B 1D B1 C1 D1 1E ;

This is a telegram. The first part is the row number (948853) and the bytes that follow contain a message. The two parts are separated by a ":". At first sight it looks straightforward, but it is not as easy as it seems. Millions of these lines make for a slow and unstable application if you use the standard Java Scanner or parsers. First we defined a schema for CSV files:

<?xml version="1.0" encoding="UTF-8"?>
<schema xmlns="http://jsapar.tigris.org/JSaParSchema/1.0">
  <csvschema lineseparator="\n">
    <line occurs="*" linetype="Telegram" cellseparator=":">
      <cell name="Row No" />
      <cell name="Body"/>
    </line>
  </csvschema>
</schema>

Then we used some simple Java code to read a 40 MB text file into memory in less than 10 seconds.

public final void loadTelegrams() throws SchemaException, IOException, JSaParException {
    Document telegrams;
    // try-with-resources closes both readers, even if parsing fails
    try (Reader schemaReader = new FileReader("schema/schema.xml");
         Reader fileReader = new FileReader("repo/dat.txt")) {
        // Build the parser from the XML schema shown above
        Xml2SchemaBuilder xmlBuilder = new Xml2SchemaBuilder();
        Parser parser = new Parser(xmlBuilder.build(schemaReader));
        // Parse the whole data file into an in-memory Document
        telegrams = parser.build(fileReader);
    }
}

With the call below we can walk through the whole file cell by cell very quickly.

telegrams.getLine(i).getCell(j).getStringValue();
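
To make the telegram layout concrete, here is a minimal sketch in plain Java (no JSapar involved) that splits one line into its row number and message bytes. The class name TelegramLine and its fields are hypothetical, and the handling of the trailing ";" is an assumption based only on the sample line above.

```java
final class TelegramLine {
    final long rowNumber;
    final int[] body; // message bytes as values 0..255

    TelegramLine(long rowNumber, int[] body) {
        this.rowNumber = rowNumber;
        this.body = body;
    }

    /** Parses a line such as "948853 : 47 E6 18 ... 1E ;". */
    static TelegramLine parse(String line) {
        int colon = line.indexOf(':');
        if (colon < 0) {
            throw new IllegalArgumentException("No ':' separator in: " + line);
        }
        long row = Long.parseLong(line.substring(0, colon).trim());

        // Everything after the colon is the message; drop the trailing ';' if present.
        String rest = line.substring(colon + 1).trim();
        if (rest.endsWith(";")) {
            rest = rest.substring(0, rest.length() - 1).trim();
        }

        // Each whitespace-separated token is one byte written in hexadecimal.
        String[] tokens = rest.isEmpty() ? new String[0] : rest.split("\\s+");
        int[] bytes = new int[tokens.length];
        for (int i = 0; i < tokens.length; i++) {
            bytes[i] = Integer.parseInt(tokens[i], 16);
        }
        return new TelegramLine(row, bytes);
    }

    public static void main(String[] args) {
        TelegramLine t = parse("948853 : 47 E6 18 FF 04 CD 0B 1D B1 C1 D1 1E ;");
        System.out.println("row " + t.rowNumber + ", " + t.body.length + " bytes");
    }
}
```

The same decoding could be applied to the string returned by getStringValue() for the "Body" cell.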

Tuesday, February 23, 2010

OFBiz, The Apache Open for Business

I am writing this to introduce the next generation of web-based business applications that I am developing. As you may know, I developed a small J2EE-based application called Easy Tracking. Easy Tracking consists of inventory, orders and sales applications, along with some smaller features and a rich Ajax-based multi-language client. For more than four years my customers have used it as rental software with data hosting on my servers. Over the years I have gained a lot of valuable experience in providing this kind of business service.
Now, the varied requests of individual customers have made me interested in using a professional open source business solution. Over the last few months I reviewed a wide range of ERP solutions, and I have finally decided to use OFBiz. OFBiz is an Apache project; moreover, I reviewed its architecture and technologies and found them all mature and practical. OFBiz is in fact an open source framework that helps developers create high-quality business solutions. Both development and customization follow straightforward procedures. However, OFBiz needs a great deal of modification and development to fit the requirements of most Iranian businesses. In addition, it does not have a Farsi interface yet.
In time I will need a patient, well-funded customer who lets me implement his business requirements with this solution, and a software holding company that needs a localized, reliable and ready-to-use product.

Wednesday, February 17, 2010

Distributed Application Debugging Tips

Every distributed application runs in a collaborative environment made up of services and method calls. Moreover, in a distributed environment most services are hosted on different platforms and remote machines that talk over the network. Distribution makes this kind of application harder to debug.

During the development of a J2EE-based core banking system, I, as a member of the team, faced many debugging challenges. The solution uses remote, distributed components that work together over the network through message queuing, following SOA principles.

A DB2 database, IBM MQ, IBM WebSphere and a Swing rich client talk to each other by sending and receiving a large number of messages. The architecture also uses Mule ESB, Spring, JMS and a large number of configuration files. We found and fixed most of the bugs using the simple techniques below.

Use a Map. Draw a symbolic map of the environment in which the problem occurs. Highlight the servers, clients, firewalls and routers, each with the specific IP address and ports used in the scenario. Maybe an IP address, a port or a firewall policy has been changed and that caused the problem. Use Ping and Telnet to make sure that the service listeners are at least reachable.
If the service was working before and the problem appeared later, then a configuration change is the most likely cause.
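
The Telnet-style reachability check can also be scripted. The sketch below tries a plain TCP connection to each endpoint on the map; the class name PortCheck and the hosts and ports in main are hypothetical placeholders, not values from our project.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

final class PortCheck {
    /** Returns true if a TCP connection to host:port succeeds within timeoutMillis. */
    static boolean isListening(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            // Refused, timed out, unreachable or unknown host: treat as not available.
            return false;
        }
    }

    public static void main(String[] args) {
        // Hypothetical endpoints copied from the map of the environment.
        String[][] endpoints = { {"localhost", "1414"}, {"localhost", "8080"} };
        for (String[] endpoint : endpoints) {
            int port = Integer.parseInt(endpoint[1]);
            boolean up = isListening(endpoint[0], port, 2000);
            System.out.println(endpoint[0] + ":" + port + " -> " + (up ? "open" : "unreachable"));
        }
    }
}
```

A successful connection only shows that something is listening on the port; it does not prove that the service behind it is healthy.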

Interpret Log Files. Review log files and exception messages carefully. Almost every exception message points directly to the problem. On the other hand, logging is not centralized in a layered distributed application; each layer or component may log its exceptions separately, so check them all. Read the log files like a detective and use your imagination to guess the problem.

Go Deeper. Adjust the logging level so that the logger captures more detailed messages.
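
As an illustration, here is a minimal sketch using java.util.logging from the JDK. That choice is an assumption made for the example (a project may well use a different logging framework with its own configuration file), and the logger name com.example.messaging is hypothetical.

```java
import java.util.logging.ConsoleHandler;
import java.util.logging.Level;
import java.util.logging.Logger;

final class VerboseLogging {
    /** Raises the level of one logger and attaches a console handler that lets the detail through. */
    static Logger enableDetail(String loggerName, Level level) {
        Logger logger = Logger.getLogger(loggerName);
        logger.setLevel(level);

        // Handlers filter as well; a ConsoleHandler defaults to INFO.
        ConsoleHandler handler = new ConsoleHandler();
        handler.setLevel(level);
        logger.addHandler(handler);

        // Avoid printing each record twice through the root logger's handler.
        logger.setUseParentHandlers(false);
        return logger;
    }

    public static void main(String[] args) {
        Logger log = enableDetail("com.example.messaging", Level.FINE);
        log.fine("message put on queue");      // now visible
        log.finest("raw payload bytes ...");   // still filtered out
    }
}
```

Raising the level for one package at a time keeps the extra output focused on the component under suspicion.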

Set Breakpoints to Watch. Set some breakpoints to watch what is happening at run time. Make sure the endpoints send and receive messages correctly. Then step toward the inner layers to find out what happened.

Use Replacements. Replace the component you suspect with a known-good one to find out whether it is working properly. For example, if you have another application server, message queue or database available, use it as the situation requires.

Check Configuration Files. Make sure the build routines have done their tasks completely. Maven and Ant write the values of variables into the built versions of configuration files such as context.xml, web.xml and other XML files. Missing privileges or a wrong configuration may prevent the build process from finishing its task. So check the built versions of the configuration files.
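
One quick way to check a built configuration file is to look for placeholders the build failed to replace. The sketch below assumes the build uses the ${name} placeholder style; the class name PlaceholderCheck is hypothetical, and the file paths are passed on the command line.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class PlaceholderCheck {
    // Matches ${anything} that was left in the text.
    private static final Pattern PLACEHOLDER = Pattern.compile("\\$\\{[^}]+\\}");

    /** Returns one entry per leftover placeholder, prefixed with its 1-based line number. */
    static List<String> findUnresolved(List<String> lines) {
        List<String> hits = new ArrayList<>();
        for (int i = 0; i < lines.size(); i++) {
            Matcher m = PLACEHOLDER.matcher(lines.get(i));
            while (m.find()) {
                hits.add("line " + (i + 1) + ": " + m.group());
            }
        }
        return hits;
    }

    public static void main(String[] args) throws IOException {
        // Usage: java PlaceholderCheck path/to/built/context.xml path/to/built/web.xml
        for (String file : args) {
            List<String> lines = Files.readAllLines(Paths.get(file), StandardCharsets.UTF_8);
            for (String hit : findUnresolved(lines)) {
                System.out.println(file + " " + hit);
            }
        }
    }
}
```

Some placeholders are meant to be resolved at run time (Spring property placeholders, for instance), so a hit is a hint to look closer and not proof of a broken build.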

Be Patient. In a distributed system a method call is not as fast as in a standalone application. So check whether the timeout values are large enough. Sometimes increasing a timeout value is the key to solving the problem.
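
The effect of a timeout that is too tight can be reproduced in isolation. In this sketch the remote call is simulated by a task that sleeps for 300 ms; the class name TimeoutDemo, the simulated service and the timeout values are all made up for illustration.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

final class TimeoutDemo {
    /** Runs the call on a worker thread and waits at most timeoutMillis for its result. */
    static String callWithTimeout(Callable<String> call, long timeoutMillis) throws Exception {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        try {
            Future<String> future = executor.submit(call);
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } finally {
            executor.shutdownNow(); // interrupt the worker if it is still running
        }
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for a remote service that needs about 300 ms to answer.
        Callable<String> slowService = () -> {
            Thread.sleep(300);
            return "reply";
        };
        for (long timeout : new long[] {100, 1000}) {
            try {
                System.out.println(timeout + " ms -> " + callWithTimeout(slowService, timeout));
            } catch (TimeoutException e) {
                System.out.println(timeout + " ms -> timed out");
            }
        }
    }
}
```

The same call fails with the short timeout and succeeds with the longer one, which is the pattern to look for when a distributed call works on one machine and not on another.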

Do it Faster. Faster compiling and running on lighter machines is an effective way to test more situations rapidly. So use lighter application servers and databases during development; for example, use Tomcat instead of WebSphere and MySQL rather than DB2.

Wednesday, February 3, 2010

Routing multiple domains to an application using URL rewrite filter in JSP containers and JEE application servers

My client had assigned two separate domain names to one IP address and wanted both mapped to a single application hosted on JBoss 4.2.2.
Each domain should route requests to certain pages of the application. I had to map firstdomain.com and seconddomain.com to that application, which I will call "app" here. The task was clear; I had to implement the mapping below:
  • http://firstdomain.com .... http://1.2.3.4/app/first.html
  • http://seconddomain.com .... http://1.2.3.4/app/repo/second.html
As you can see, the IP address and application name are common and only the inner paths differ. I used UrlRewriteFilter, which brings Apache-style URL rewriting to JSP containers and JEE application servers. Installation is just a matter of following its instructions. UrlRewriteFilter comes with many samples, but I did not find a rule that maps different domains on one IP address. After some trial and error, the rule below finally worked for me:
<rule>
  <name>Changing Domain</name>
  <condition name="host" operator="equal">seconddomain.com</condition>
  <from>^/(.*)</from>
  <to type="permanent-redirect">http://1.2.3.4/app/repo/second.html$1</to>
</rule>

It is enough to add the rule above to urlrewrite.xml.
first.html is the default page, so you only need to handle requests that come to seconddomain.com. I reached my routing goals on Tomcat 6.0 and JBoss 4.2.2 with this simple URL rewriting.
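
To spell out the decision such a rule makes, here is a plain-Java illustration of the mapping listed above, where requests for seconddomain.com are sent to second.html. It does not use UrlRewriteFilter's API; the class and method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class DomainRouting {
    // Same pattern as the <from> element: capture everything after the leading slash.
    private static final Pattern FROM = Pattern.compile("^/(.*)");

    /** Returns the redirect target for the request, or null if no redirect applies. */
    static String redirectFor(String host, String path) {
        Matcher m = FROM.matcher(path);
        if ("seconddomain.com".equals(host) && m.find()) {
            // The captured group plays the role of $1 in the <to> element.
            return "http://1.2.3.4/app/repo/second.html" + m.group(1);
        }
        return null; // other hosts fall through to the default page
    }

    public static void main(String[] args) {
        System.out.println(redirectFor("seconddomain.com", "/"));
        System.out.println(redirectFor("firstdomain.com", "/"));
    }
}
```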

Tuesday, February 2, 2010

Why I use Google Docs

Some years ago, while extending an office automation system, I worked with the Apache Jackrabbit content repository. After that, I looked for a free personal content repository to keep my documents, sample code and commands accessible over HTTP: something like a personal wiki, personal blog or repository.
I preferred a free one backed by one of the web giants such as Google or Yahoo. Finally, I have found it. Two weeks ago my friends on Facebook suggested that I try Google Documents. Google Docs is exactly the tool I was looking for. Being a Google product means it is improving faster and more wisely than the others. Moreover, Google Documents is the most interesting tool I have found on the Google shelf. It works very fast, and it looks good too. Google Docs lets me keep my valuable documents in a safe deposit box on the web. I have shared some documents with trusted friends. A friend and I are also using Google Docs in an amazing way to stay in sync on an educational program. In addition, Google Docs keeps all revisions of a document, so you can track how the document grows over its life cycle. In my opinion everybody should have a Google Documents account, for the reasons below:
  • Your knowledge is always accessible.
  • You don't need to worry about hard disk failures and virus attacks any more.
  • You can edit your documents comfortably using just a browser.
  • Google Docs is a repository, so all changes are traceable.
  • The Google Docs sharing mechanism is quite secure.
  • I don't worry about the current 1024 MB capacity limit. Google is always growing.
  • The good feeling of extending my memory with Google Docs is another special experience. You will never lose something if you keep it in Google Documents.

Monday, February 1, 2010

Good points for putting breakpoints across Mule ESB classes

This is my second year as a J2EE developer on a modern J2EE-based core banking project. The architecture uses Spring and Mule ESB widely as its major components. Mule ESB (Enterprise Service Bus) lets us connect applications together as simply as possible, and the architecture uses it as both integration platform and service container. Mule ESB provides easy-to-use methods for talking over JMS, HTTP, FTP and many other transport protocols. It also implements a set of mechanisms for message routing, message transformation and transaction management. However, debugging running services that use Mule ESB functionality is not easy, because there are many layered listeners and complex network devices and services that magnify problems. The framework we use is a well-designed piece of work by my friend Ara Abrahamian, and it runs like a Mercedes-Benz in most areas. The modules added by developers, however, often have problems. I have noticed that most of the problems we run into originate in wrong configuration files such as muleesb-config.xml and Spring's context.xml. Beyond that, the classes below are a good place to start debugging:
  • JmsMessageDispatcher.java
  • JmsMessageConsumerImpl.java
JmsMessageDispatcher is the outermost point just before a message is sent, and JmsMessageConsumerImpl is the first receiver of a JMS message.

In my experience, debugging these two classes is an easy way to find out what happened in the code or which underlying service is malfunctioning.