Aaron Johnson Now with 50% less caffeine!

Posted
24 August 2004 @ 8am

Tagged
J2EE, Lucene, Python, Software Development

Extracting Text From MS Word

Someone on the Lucene User list wanted to know if it was possible to search MS Word documents using Lucene. The normal response is to go and take a look at the Jakarta POI project (new blog by the way). Ryan Ackley submitted his website (textmining.org) along with a plug for his TextMining.org Word Text Extractor v0.4 and some sample code:

// Open the Word file and pass the stream to the extractor.
FileInputStream in = new FileInputStream("test.doc");
WordExtractor extractor = new WordExtractor();
String str = extractor.extractText(in);
in.close();

Nice.

Someone else noted that the Python version of Lucene (called Lupy) has an indexer for MS Word and PDF as well, although it appears to only work on Windows.


11 Comments

Posted by
Steve G.
8 September 2004 @ 9am

I’m running a CFMX application that integrates Lucene, antiword and xPDF to index and search Word and PDF files. It’s fast and has been flawless so far!
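
For anyone wondering what calling antiword or xPDF from Java might look like, here is a minimal sketch, not the setup described above. It assumes both tools are installed and on the PATH, that `pdftotext` is the xPDF command in use, and that the tools emit UTF-8. The class and method names are made up for illustration.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.util.List;
import java.util.Locale;

class ExternalTextExtractor {

    // Hypothetical helper: picks a command line based on the file extension.
    // Assumes antiword and xPDF's pdftotext are installed and on the PATH.
    static List<String> commandFor(Path file) {
        String name = file.getFileName().toString().toLowerCase(Locale.ROOT);
        if (name.endsWith(".doc")) {
            // antiword writes the extracted text to standard output.
            return List.of("antiword", file.toString());
        }
        if (name.endsWith(".pdf")) {
            // "-" as the output file makes pdftotext write to standard output.
            return List.of("pdftotext", file.toString(), "-");
        }
        throw new IllegalArgumentException("Unsupported file type: " + file);
    }

    // Runs the external tool and returns whatever it printed.
    // Assumption: the tool emits UTF-8; adjust the charset if yours does not.
    static String extractText(Path file) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(commandFor(file))
                .redirectError(ProcessBuilder.Redirect.DISCARD)
                .start();
        byte[] output = process.getInputStream().readAllBytes();
        int exitCode = process.waitFor();
        if (exitCode != 0) {
            throw new IOException("Extractor exited with code " + exitCode + " for " + file);
        }
        return new String(output, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws Exception {
        if (args.length == 0) {
            System.out.println("usage: java ExternalTextExtractor <file.doc|file.pdf>");
            return;
        }
        System.out.println(extractText(Path.of(args[0])));
    }
}
```

The string returned by extractText is what you would then add to a Lucene document as the searchable body field.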


Posted by
Mark
5 November 2004 @ 9pm

Hi, I am interested in using Lucene to index and search Word documents using CFMX. I was wondering if you had any sample code or pointers. I have no idea where to begin.


Posted by
AJ
6 November 2004 @ 8am

Hi Mark,

I wrote 2 articles on using Lucene and CFMX:

http://www.sys-con.com/coldfusion/article.cfm?id=629

http://www.sys-con.com/coldfusion/article.cfm?id=639

Those should at least get you started with Lucene and ColdFusion. After you get that working and you understand what’s going on, you should be able to extract text from MS Word with no problem.

AJ


Posted by
Patrick Simon
11 May 2005 @ 9pm

Due to the conflict between the version of log4j (1.1.3 I think) that CFMX uses and the latest version that PDFBox uses, you can’t use the latest version of PDFBox to rip text out of a PDF document.

However, I have successfully used PDFBox-0.6.2 to extract text (and it only uses native Java, no need for COM).

Don’t know how code will look in your comments, but here is my code…


Posted by
Aluysio
16 June 2005 @ 7am

Hi,

Does anybody know if there is a tool to scan and extract every word of a document (from MS Word, etc.) to build an automatic index archive?

Thanks

Aluysio


Posted by
e-lopez
17 June 2005 @ 5pm

Hi Patrick Simon,
I want to ask how you used PDFBox-0.6.2 to extract text. I am not able to see the code in the comments.


Posted by
Tathagata Roy
21 November 2006 @ 8pm

Use WordExtractor class of POI


Posted by
Tathagata Roy
21 November 2006 @ 8pm

I want to use OpenOffice for extracting text from Word, but I don’t know how. Any ideas?


Posted by
Travis
25 April 2007 @ 10pm

I am interested in extracting text from Word documents in order to create my own index… I want to use this on a site I wrote in ASP… Does anybody know if this is possible?


Posted by
Jenny
3 August 2007 @ 2am

Hi,
I have the same task of extracting text from a Word file. I am trying to use the textmining extractor of Randy.
I have not yet managed to run the word extractor.
Does anybody know the steps needed to run the textmining WordExtractor successfully, and the list of jar files it needs?
Thanks in advance,
Jenny


Posted by
Hon
24 October 2007 @ 11pm

We have about 7 people on our team providing review comments on certain vendor-prepared documents. I’m looking for a way of extracting these comments (preferably with what was commented on and the section in the doc, i.e. table, heading, etc.) to Excel or another doc.

