Extracting Text From MS Word
Someone on the Lucene User list wanted to know if it was possible to search MS Word documents using Lucene. The usual answer is to take a look at the Jakarta POI project (new blog by the way). Ryan Ackley pointed to his website (textmining.org), along with a plug for his TextMining.org Word Text Extractor v0.4 and some sample code:
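import java.io.FileInputStream;
import org.textmining.text.extraction.WordExtractor;

FileInputStream in = new FileInputStream("test.doc");
WordExtractor extractor = new WordExtractor();
String str = extractor.extractText(in); // the stream has to be handed to the extractor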
Nice.
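To get from extracted text to something searchable, the string still has to be fed to an indexer. Here is a rough sketch of that extract-then-index flow using only the JDK, so it runs without Lucene or the TextMining.org jar on the classpath. Everything in it is made up for illustration: the TextExtractor interface, the PLAIN_TEXT stand-in (which just reads the stream as UTF-8 where a real Word extractor would go), and the toy inverted index that plays the part of Lucene. It assumes Java 9 or later for readAllBytes.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class ExtractThenIndex {

    // Hypothetical interface: anything that turns a document stream into plain text.
    interface TextExtractor {
        String extractText(InputStream in) throws IOException;
    }

    // Stand-in extractor (an assumption for this sketch): treats the stream as UTF-8 text.
    // A real Word extractor would be plugged in here instead.
    static final TextExtractor PLAIN_TEXT =
        in -> new String(in.readAllBytes(), StandardCharsets.UTF_8);

    // Toy inverted index standing in for Lucene: lowercased term -> documents containing it.
    private final Map<String, Set<String>> postings = new HashMap<>();

    // Extract the text of one document and record every term it contains.
    void add(String docName, InputStream in, TextExtractor extractor) throws IOException {
        String text = extractor.extractText(in);
        for (String term : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!term.isEmpty()) {
                postings.computeIfAbsent(term, k -> new TreeSet<>()).add(docName);
            }
        }
    }

    // Look up a single term, case-insensitively.
    Set<String> search(String term) {
        return postings.getOrDefault(term.toLowerCase(Locale.ROOT), Collections.emptySet());
    }

    public static void main(String[] args) throws IOException {
        ExtractThenIndex index = new ExtractThenIndex();
        byte[] fakeDoc = "Searching Word documents".getBytes(StandardCharsets.UTF_8);
        index.add("test.doc", new ByteArrayInputStream(fakeDoc), PLAIN_TEXT);
        System.out.println(index.search("word"));
    }
}
```

The point of the interface is that the indexing side never needs to know which file format it is dealing with, so a Word extractor and a PDF extractor could sit behind the same call.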
Someone else noted that the Python version of Lucene (called Lupy) has an indexer for MS Word and PDF as well, although it appears to only work on Windows.