Extracting Text From MS Word

Someone on the Lucene User list wanted to know if it was possible to search MS Word documents using Lucene. The normal response is to go and take a look at the Jakarta POI project (new blog by the way). Ryan Ackley submitted his website (textmining.org) along with a plug for his TextMining.org Word Text Extractor v0.4 and some sample code:

FileInputStream in = new FileInputStream("test.doc");
WordExtractor extractor = new WordExtractor();
// pass the stream in, otherwise the extractor has nothing to read
String str = extractor.extractText(in);


Someone else noted that the Python version of Lucene (called Lupy) has an indexer for MS Word and PDF as well, although it appears to only work on Windows.

ASP.NET: The View State is invalid for this page and might be corrupted

I fixed a tricky bug yesterday on one of our sites that runs ASP.NET. Like all good webmasters, I have the site email me anytime a 500 error occurs, with all the juicy details: the URL the user was visiting, the date/time, what browser they were using, which server in the cluster it occurred on, the stack trace, any form variables, the querystring, session variables, and cookies. This particular error would occur on any page with a form and the stack trace would look like this:

Exception of type System.Web.HttpUnhandledException was thrown.
source = System.Web
target site = Boolean HandleError(System.Exception)
stack trace =
  at System.Web.UI.Page.HandleError(Exception e)
  at System.Web.UI.Page.ProcessRequestMain()
  at System.Web.UI.Page.ProcessRequest()
  at System.Web.UI.Page.ProcessRequest(HttpContext context)
  at System.Web.CallHandlerExecutionStep.System.Web.HttpApplication+IExecutionStep.Execute()
  at System.Web.HttpApplication.ExecuteStep(IExecutionStep step, Boolean& completedSynchronously)

which isn’t very helpful at all. So my first problem was that I wasn’t getting enough information; I couldn’t tell what the problem really was. The HttpModule that I wrote that listens for Application_Error events was retrieving the exception using this code:

HttpApplication application = (HttpApplication)source;
Exception error = application.Server.GetLastError();

which technically does retrieve the error that caused the event, but I was missing one key step: the last line should have read:

Exception error = application.Server.GetLastError().InnerException;

and once I fixed that, I magically started receiving a much richer error description:

System.Web.HttpException: The viewstate is invalid for this page and might be corrupted. at System.Web.UI.Page.LoadPageStateFromPersistenceMedium() at System.Web.UI.Page.LoadPageViewState() at System.Web.UI.Page.ProcessRequestMain()
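The same wrap-and-unwrap pattern shows up outside .NET. As a rough Java analogy (not the ASP.NET module itself), Throwable.getCause() plays the role of InnerException. The class and method names below are invented for illustration, and walking the whole chain is a generalization; the fix above only needed to go one level down.

```java
// Sketch only: RootCauseSketch and rootCause are hypothetical names.
class RootCauseSketch {
    // Follows getCause() until there is no deeper exception to report.
    static Throwable rootCause(Throwable t) {
        Throwable current = t;
        while (current.getCause() != null) {
            current = current.getCause();
        }
        return current;
    }

    public static void main(String[] args) {
        // Stand-ins for the real error and the generic wrapper around it.
        Exception real = new IllegalStateException("The viewstate is invalid for this page");
        Exception wrapper = new RuntimeException("Unhandled exception was thrown", real);

        System.out.println("wrapper:    " + wrapper.getMessage());
        System.out.println("root cause: " + rootCause(wrapper).getMessage());
    }
}
```

Logging the wrapper alone gives you the unhelpful message; logging the root cause gives you the one you can actually search for.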

That description leads to a couple different knowledge base articles on support.microsoft.com; Q323744 and Q312906 were the ones that fixed the problem I was experiencing. Turns out that when you a) run an ASP.NET application in a cluster (in my case it’s an ASP.NET application load balanced behind an F5), b) don’t use sticky sessions, and c) utilize view state, you must set up your web servers to use the exact same machine key, which is a setting in machine.config. The view state is validated using that key, so a form rendered by one server and posted back to another fails the check when the keys differ. Microsoft supplies a C# script in Q312906 that will generate the appropriate machineKey element for your machine.config.
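If you would rather not compile the C# script, the idea is simple enough to sketch in Java: generate random bytes, hex-encode them, and drop them into a machineKey element. This is not Microsoft's script. The class and method names are made up, and the key sizes (64 bytes for validationKey, 24 bytes for decryptionKey) and the SHA1 validation setting are assumptions you should check against Q312906 before using the output.

```java
import java.security.SecureRandom;

// Sketch only: MachineKeySketch and its methods are hypothetical names.
class MachineKeySketch {
    private static final char[] HEX = "0123456789ABCDEF".toCharArray();

    // Returns numBytes cryptographically random bytes as 2 * numBytes hex characters.
    static String randomHex(int numBytes) {
        byte[] bytes = new byte[numBytes];
        new SecureRandom().nextBytes(bytes);
        StringBuilder sb = new StringBuilder(numBytes * 2);
        for (byte b : bytes) {
            sb.append(HEX[(b >> 4) & 0x0F]).append(HEX[b & 0x0F]);
        }
        return sb.toString();
    }

    // Builds the element to paste into machine.config on every server.
    static String machineKeyElement(String validationKey, String decryptionKey) {
        return "<machineKey validationKey=\"" + validationKey
                + "\" decryptionKey=\"" + decryptionKey
                + "\" validation=\"SHA1\"/>";
    }

    public static void main(String[] args) {
        // Assumed sizes: 64-byte validation key, 24-byte decryption key.
        System.out.println(machineKeyElement(randomHex(64), randomHex(24)));
    }
}
```

Run it once and copy the same element to every server in the cluster; generating a fresh key per server would recreate the original problem.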

Enable the 30 Second Skip on your Remote

From weaknees.com, a site that offers a bunch of TiVo-related upgrades and accessories, a tip that will save your eyes and your thumb:

Every TiVo can do a 30 second skip – you just have to enable it.
Here’s how you do it:

  1. Start playing any recording.
  2. During playback press:
    Select – Play – Select – 3 – 0 – Select
  3. You should hear three bongs (if you don’t have the TiVo sounds disabled), and you’re done.

Your “skip to beginning/end” button (the arrow pointing to a line)
is now a 30 second skip button. During fast-forwarding or rewinding,
the button will still “skip to tick.”

For the most part you can press the skip button about four times (two minutes) and you’ll be reasonably close to the next segment. I’m not sure what the usual commercial break is, but in the short amount of testing I did, it was anywhere from 2 minutes to 2.5 minutes.

Bad Hardware

I was out of commission all last week because of what may turn out to be a herniated disk in my lower back. Last Sunday I stepped out of the car and felt something go wrong. I took Monday off, tried to go to work Tuesday and only made it to 3pm. Wednesday I worked from the floor, and Thursday I finally made it to the doctor. The doctor told me to stay in bed for 48 hours and gave me Flexeril, which put me to sleep for 36 of the 48 hours. Lots of TiVo, a couple books and a couple DVDs later, I feel a little better. I had an MRI taken this morning, and I’ll go back to work tomorrow.

By the way, if you ever have a bad back, check out the Nada Chair, it’s a life saver.

Hibernate Hibtags

From the Hibernate Developer list, Serge Knystautas announced the availability of Hibtags, a JSTL tag library that wraps Hibernate 2.1 “… including find, filter, load, refresh, save, update, and delete”. It certainly brings to mind a discussion on java.net not so long ago about the JSTL SQL tags and how they really shouldn’t be used, but these tags could certainly be helpful for doing a proof of concept or for writing reports.

If you’re still curious, take a look at the examples. Serge has posted an example of each of the aforementioned functions (find, filter, load, refresh, save, update, delete).