Saturday, February 27, 2010

Parsing Huge Text Files Using Java and JSapar

Last week a friend and I set out to parse a huge text file containing reports from legacy devices. After a few attempts we found that opening and parsing huge text files in Java is very time- and resource-consuming. We started with a 35MB log file. We had never worked with text files that large, so we went looking for a suitable solution. Java is arguably not the best fit for this kind of problem; I believe Python or Perl could do the job with better performance. However, given later development plans and project requirements, we decided to use Java. After some searching on the web we found a brilliant tool. Tigris.org hosts some valuable open source projects, and JSapar is one of them. JSapar is a Java library providing a schema-based parser/producer of CSV (Comma Separated Values) and flat files. The goal of the project is to provide a Java library that parses flat files and CSV files.
The file is imported into an object-oriented model, which we called telegrams. The parser produces a Document class representing the content of the file, or you can choose to receive an event for each line that has been successfully parsed. According to the project page on Tigris, JSapar can handle huge files without loading everything into memory.
The library is simple to use and can be extended. Our log file consists of thousands of lines like the sample line below:

948853 : 47 E6 18 FF 04 CD 0B 1D B1 C1 D1 1E ;

This is a telegram. The first part is the row number (948853), and the bytes that follow contain a message. The two parts are separated by a ":". At first sight this looks like a straightforward procedure, but it is not as easy as it looks. Millions of these lines make for a really slow and unstable application if you use the standard Java Scanner or similar parsers. First we defined a schema for CSV files:

<?xml version="1.0" encoding="UTF-8"?>
<schema xmlns="http://jsapar.tigris.org/JSaParSchema/1.0">
  <csvschema lineseparator="\n">
    <line occurs="*" linetype="Telegram" cellseparator=":">
      <cell name="Row No" />
      <cell name="Body"/>
    </line>
  </csvschema>
</schema>
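
To make the split that the schema describes more concrete, here is a minimal plain-Java sketch of what happens to a single telegram line: everything before the first ":" becomes the row number and everything after it becomes the body. It does not use JSapar at all; the class name TelegramLine and its fields are hypothetical, and treating the body as hex bytes with a trailing ";" is an assumption based on the sample line above.

```java
public class TelegramLine {
    public final long rowNumber; // the part before the ':'
    public final int[] body;     // message bytes, decoded from hex (assumption)

    public TelegramLine(long rowNumber, int[] body) {
        this.rowNumber = rowNumber;
        this.body = body;
    }

    /** Splits a line such as "948853 : 47 E6 18 ;" at the first ':' and decodes the bytes. */
    public static TelegramLine parse(String line) {
        int colon = line.indexOf(':');
        if (colon < 0) {
            throw new IllegalArgumentException("Missing ':' separator in: " + line);
        }
        long row = Long.parseLong(line.substring(0, colon).trim());

        // Drop the trailing ';' that ends each telegram in the sample data.
        String rest = line.substring(colon + 1).trim();
        if (rest.endsWith(";")) {
            rest = rest.substring(0, rest.length() - 1).trim();
        }
        if (rest.isEmpty()) {
            return new TelegramLine(row, new int[0]);
        }

        // Each whitespace-separated token is read as one hexadecimal byte.
        String[] tokens = rest.split("\\s+");
        int[] bytes = new int[tokens.length];
        for (int i = 0; i < tokens.length; i++) {
            bytes[i] = Integer.parseInt(tokens[i], 16);
        }
        return new TelegramLine(row, bytes);
    }

    public static void main(String[] args) {
        TelegramLine t = parse("948853 : 47 E6 18 FF 04 CD 0B 1D B1 C1 D1 1E ;");
        System.out.println(t.rowNumber + " -> " + t.body.length + " bytes");
    }
}
```

In the JSapar setup the same two parts simply arrive as the "Row No" and "Body" cells, both as text; decoding the body into bytes would still be up to the application.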

Then we used some simple Java code to read a 40MB text file into memory in less than 10 seconds.

public final void loadTelegrams() throws SchemaException, IOException, JSaParException {
    Document telegrams;
    // Build the parser from the XML schema that describes a telegram line.
    Reader schemaReader = new FileReader("schema/schema.xml");
    Xml2SchemaBuilder xmlBuilder = new Xml2SchemaBuilder();
    Parser parser = new Parser(xmlBuilder.build(schemaReader));
    schemaReader.close();
    // Parse the whole log file into an in-memory Document.
    Reader fileReader = new FileReader("repo/dat.txt");
    telegrams = parser.build(fileReader);
    fileReader.close();
}

With the call below we can walk through the whole file, cell by cell, very quickly.

telegrams.getLine(i).getCell(j).getStringValue();
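
For files that are too large to hold in memory as a Document, the per-line event mode mentioned earlier is the more natural fit. We did not use it here, so the sketch below is not JSapar's event API; it only illustrates the same idea with the plain JDK, reading one line at a time and handing the two parts to a callback. The class name TelegramStream, the method forEachTelegram and the choice to skip lines without a ":" are all assumptions made for illustration.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.function.BiConsumer;

public class TelegramStream {
    /**
     * Reads the source line by line and calls the handler with
     * (row number text, body text) for every line that contains a ':'.
     * Only one line is held in memory at a time.
     * Returns the number of lines passed to the handler.
     */
    public static long forEachTelegram(Reader source, BiConsumer<String, String> handler) {
        long handled = 0;
        try (BufferedReader in = new BufferedReader(source)) {
            String line;
            while ((line = in.readLine()) != null) {
                int colon = line.indexOf(':');
                if (colon < 0) {
                    continue; // blank or malformed line: skipped in this sketch
                }
                String rowNo = line.substring(0, colon).trim();
                String body = line.substring(colon + 1).trim();
                handler.accept(rowNo, body);
                handled++;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return handled;
    }

    public static void main(String[] args) {
        String sample = "948853 : 47 E6 18 FF ;\n948854 : 04 CD 0B ;\n";
        long n = forEachTelegram(new StringReader(sample),
                (rowNo, body) -> System.out.println(rowNo + " => " + body));
        System.out.println(n + " telegrams handled");
    }
}
```

To run this against the real log, pass a FileReader for the log file instead of the StringReader. The trade-off is that random access by line index, as in getLine(i), is no longer available.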
