NEJUG April Meeting: JavaServer Faces

If you live in the Boston area, check out the NEJUG meeting at the Sun office in Burlington this Thursday. David Geary will be giving a talk on JavaServer Faces. Registration and further information are here; you can read more about his book (Core JavaServer Faces) here.

I’ll be commuting from the city out to Burlington; ping me if you need a ride.

Wanted: Extracting summary from HTML text

As part of a project I’m working on, I need to extract content from an HTML page, in some sense creating a short 200-character summary of the document. Google does a fantastic job of extracting text and presenting a summary of the document in their search listings; I’m wondering how they do that. Here’s the process I’m using right now:

a) Remove all of the HTML comments from the page (i.e. <!-- -->), because JavaScript is sometimes wrapped in comments and can contain > and/or < characters, which cause step (d) to fail.

b) Remove everything above the <body> tag, because there isn’t anything valuable there anyway.

c) Remove all the <a href…> tags, because text links are usually navigation and are repeated across a site… they’re noise and I don’t want them. However, sometimes links are part of the summary of a document… removing a link in the first paragraph of a document can render the paragraph unreadable, or at least incomplete.

d) Remove all the HTML tags, the line breaks, the tabs, etc., using a regular expression.
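The four steps above can be sketched in Java with java.util.regex. This is only a rough illustration, not a robust HTML parser: the class name HtmlSummarizer and the summarize() method are made up for this example, step (c) is interpreted as dropping each link together with its text, and entities such as &amp; are left undecoded.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper illustrating steps (a) through (d); not a real HTML parser.
class HtmlSummarizer {
    private static final Pattern COMMENTS = Pattern.compile("(?s)<!--.*?-->");
    private static final Pattern BODY_OPEN = Pattern.compile("(?i)<body\\b[^>]*>");
    private static final Pattern LINKS = Pattern.compile("(?is)<a\\b[^>]*>.*?</a>");
    private static final Pattern TAGS = Pattern.compile("<[^>]+>");
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    static String summarize(String html, int maxLength) {
        // (a) strip comments first, so stray < or > inside them cannot confuse step (d)
        String text = COMMENTS.matcher(html).replaceAll("");
        // (b) drop everything up to and including the opening <body> tag, if there is one
        Matcher body = BODY_OPEN.matcher(text);
        if (body.find()) {
            text = text.substring(body.end());
        }
        // (c) drop links together with their text (assumption: links are navigation noise)
        text = LINKS.matcher(text).replaceAll(" ");
        // (d) drop the remaining tags, then collapse line breaks, tabs and runs of spaces
        text = TAGS.matcher(text).replaceAll(" ");
        text = WHITESPACE.matcher(text).replaceAll(" ").trim();
        return text.length() > maxLength ? text.substring(0, maxLength) : text;
    }

    public static void main(String[] args) {
        String html = "<html><head><title>T</title></head><body><!-- if (a > b) -->"
                + "<a href=\"/home\">Home</a><p>Fenway Park opened\n in 1912.</p></body></html>";
        System.out.println(summarize(html, 200));
    }
}
```

Replacing tags with a space rather than an empty string keeps words from adjacent elements from running together; the whitespace pass then tidies up the extra spaces.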

For the most part, the four steps above do the job, but in some cases not. I’ll go out on a limb and say that most HTML documents contain text that is repeated throughout the site again and again (header text like Login Now! or footer text like copyright 2004, etc…). My problem is that I want to somehow locate the snippets that are repeated and not include them in the summaries I create… For example, on Google do this search and then check out the second result:

Fenway Park. … Fenway Park opened on April 20, 1912, the same day as Detroit’s Tiger Stadium and before any of the other existing big league parks. …

That text is about 1/4 of the way down in the document. How do they extract that?

Parameters: a) I don’t know anything about the documents that I’m analyzing; they could be valid XHTML or garbled HTML from 1996, b) it doesn’t have to be extremely fast, c) I’m using Java (if that matters), d) I’ve tried using the org.apache.lucene.demo.html.HTMLParser class, which has a method getSummary(), but it doesn’t work for me (nothing is ever returned).
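One possible angle on the repeated-text problem, sketched under some loud assumptions: that several pages from the same site are available, and that each page has already been split into text blocks (one per paragraph or table cell, say). Any block that also shows up on another page is treated as header or footer noise. The RepeatedTextFilter class and its method names are invented for this sketch, and this is not a claim about how Google builds its snippets.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical filter: keeps only the blocks of one page that do not occur on other pages.
class RepeatedTextFilter {
    // Normalize so that case and whitespace differences do not hide a repeat.
    static String normalize(String block) {
        return block.trim().replaceAll("\\s+", " ").toLowerCase(Locale.ROOT);
    }

    static List<String> dropRepeated(List<String> targetBlocks, List<List<String>> otherPages) {
        Set<String> seenElsewhere = new HashSet<String>();
        for (List<String> page : otherPages) {
            for (String block : page) {
                seenElsewhere.add(normalize(block));
            }
        }
        List<String> kept = new ArrayList<String>();
        for (String block : targetBlocks) {
            if (!seenElsewhere.contains(normalize(block))) {
                kept.add(block); // keep the original, un-normalized text
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> target = Arrays.asList("Login Now!", "Fenway Park opened in 1912.", "Copyright 2004");
        List<List<String>> others = new ArrayList<List<String>>();
        others.add(Arrays.asList("Login Now!", "Wrigley Field opened in 1914.", "Copyright 2004"));
        System.out.println(dropRepeated(target, others));
    }
}
```

Exact matching after normalization is the simplest thing that could work; a footer that changes slightly from page to page would slip through and would need some kind of fuzzy comparison instead.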

Any and all ideas would be appreciated!

PGP Encryption using Bouncy Castle

It can’t be that hard. So given a couple hours of hacking with the library, here’s a fully illustrated example that shows how to encrypt a file using the Bouncy Castle Cryptography API and PGP. First, giving credit where credit is due, the example comes mostly from the KeyBasedFileProcessor example that ships with the Bouncy Castle PGP libraries. You can find it in the /src/org/bouncycastle/openpgp/examples directory if you download the source. I’ve simply unpacked the example a little, providing some pretty pictures and explanation of what the various pieces are.

As with any example, you need to download a couple of libraries first; in this case, visit http://www.bouncycastle.org/latest_releases.html and download the bcprov-jdk14-122 and bcpg-jdk14-122 jar files. Add those to your project or, as in this example, simply make sure they’re on the classpath when you run the example from the command line.

Next, while you don’t need to have PGP installed, you do need to have at least one public keyring file available on your system. I’m using PGP 6.5.8 on Windows, which automatically saves my public keyring for me. You can find the location of the keyring file by choosing Edit -> Options -> Files from within the PGP Keys window. You should see something like this:
[Screenshot: PGP Options]
Note the location of the Public Keyring File.

Second, you’ll need to generate a keypair (if you don’t already have one). I won’t go into the how or why (I assume you know both), but you do need to make sure that you create what the Bouncy Castle folks call an ‘RSA key’ or ‘El Gamal key’ (source) rather than a DSA key. If you try to use a DSA keypair (which I’m assuming is synonymous with Diffie-Hellman/DSS?), you’ll get the exception that I ran into:
org.bouncycastle.openpgp.PGPException: Can't use DSA for encryption
which, again, is explained by the link above.

Now that you’ve downloaded the appropriate libraries, created an RSA keypair and located your public keyring file, we’re ready to start. Open up your favorite Java IDE (I’m using Eclipse) and start by importing the appropriate libraries:

import java.io.*;
import java.security.*;
import org.bouncycastle.bcpg.*;
import org.bouncycastle.jce.provider.*;
import org.bouncycastle.openpgp.*;

I took a shortcut above and, for clarity, didn’t specify exactly which classes I wanted to import; if you’re using Eclipse you can easily clean that up by selecting Source -> Organize Imports (or by downloading the source code at the end of this example). Next come the class declaration and the standard public static void main, etc. The KeyBasedFileProcessor example on the Bouncy Castle website lets you pass in the location of the public keyring and the file you want to encrypt; I’m hardcoding them in my code so that it’s crystal clear what everything is:

// the keyring that holds the public key we're encrypting with
String publicKeyFilePath = "C:\\pgp6.5.8\\pubring.pkr";

Then register the Bouncy Castle provider using the static addProvider() method of the java.security.Security class:

Security.addProvider(new BouncyCastleProvider());
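If you want to double-check that the provider really got registered, something along these lines should do it, using only the standard java.security classes. The ProviderCheck class and its method names are just made up for this sketch; "BC" is the name the Bouncy Castle provider registers under, the same string that is passed to the encryption code further down.

```java
import java.security.Provider;
import java.security.Security;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for inspecting the JCE providers known to the running JVM.
class ProviderCheck {
    // Security.getProvider() returns null when no provider with that name is installed.
    static boolean isInstalled(String name) {
        return Security.getProvider(name) != null;
    }

    static List<String> installedNames() {
        List<String> names = new ArrayList<String>();
        for (Provider provider : Security.getProviders()) {
            names.add(provider.getName());
        }
        return names;
    }

    public static void main(String[] args) {
        // Prints false unless Security.addProvider(new BouncyCastleProvider()) ran first
        // (or the provider is configured statically in the JVM's java.security file).
        System.out.println("BC installed: " + isInstalled("BC"));
        System.out.println("Providers: " + installedNames());
    }
}
```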

Next I chose to create a temporary file to hold the message that I want to encrypt:

File outputfile = File.createTempFile("pgp", null);
FileWriter writer = new FileWriter(outputfile);
writer.write("the message I want to encrypt".toCharArray());
writer.close();

Read the public keyring file into a FileInputStream and then call the readPublicKey() method that was provided for us by the KeyBasedFileProcessor:

FileInputStream in = new FileInputStream(publicKeyFilePath);
PGPPublicKey key = readPublicKey(in);

At this point it’s important to note that the PGPPublicKeyRing class (at least in the version I was using) appears to have a bug where it only recognizes the first key in the keyring. If you call the getUserIDs() method on the key that is returned, you’ll only see the user IDs for that one key:

for (java.util.Iterator iterator = key.getUserIDs(); iterator.hasNext();) {
    System.out.println((String) iterator.next());
}

This could cause you problems if you have multiple keys in your keyring and if the first key is not an RSA or El Gamal key.

Finally, create an armored ASCII text file and call the encryptFile() method (again provided for us by the KeyBasedFileProcessor example):

FileOutputStream out = new FileOutputStream(outputfile.getAbsolutePath() + ".asc");
// (file we want to encrypt, file to write encrypted text to, public key)
encryptFile(outputfile.getAbsolutePath(), out, key);

The rest of the example (the body of the encryptFile() method) is almost verbatim from the KeyBasedFileProcessor example. I’ll paste the code here, but I didn’t do much to it:

out = new ArmoredOutputStream(out);
ByteArrayOutputStream bOut = new ByteArrayOutputStream();
PGPCompressedDataGenerator comData = new PGPCompressedDataGenerator(PGPCompressedDataGenerator.ZIP);
PGPUtil.writeFileToLiteralData(comData.open(bOut), PGPLiteralData.BINARY, new File(fileName));
comData.close();
PGPEncryptedDataGenerator cPk = new PGPEncryptedDataGenerator(PGPEncryptedDataGenerator.CAST5, new SecureRandom(), "BC");
cPk.addMethod(encKey);
byte[] bytes = bOut.toByteArray();
OutputStream cOut = cPk.open(out, bytes.length);
cOut.write(bytes);
cPk.close();
out.close();

One last thing that I gleaned from their web-based forum is that one of the exceptions thrown by the above code is a PGPException, which itself doesn’t tell you much (in my case it was simply saying exception encrypting session key). PGPException can be a wrapper for an underlying exception, though, and you should use the getUnderlyingException() method to determine what the real cause of the problem is (which led me to the Can't use DSA for encryption message that I mentioned above).

You can download the source code and batch file for the example above here:

bouncy_castle_pgp_example.zip

Updated 04/07/2004: David Hook wrote to let me know that there is a bug in the examples; I updated both the sample code above and the zip file that contains the full source code. Look at the beta versions for the updated examples.