Extracting Text From MS Word

Someone on the Lucene User list asked whether Lucene can be used to search MS Word documents. The usual answer is to take a look at the Jakarta POI project (new blog, by the way). Ryan Ackley pointed to his website (textmining.org), along with a plug for his TextMining.org Word Text Extractor v0.4 and some sample code:

FileInputStream in = new FileInputStream("test.doc");
WordExtractor extractor = new WordExtractor();
// Pass the open stream to the extractor; it returns the document's text.
String str = extractor.extractText(in);

Nice.
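
One thing the three-liner leaves out is closing the stream. Here is a minimal sketch of how that could be wrapped up, assuming the extractor can be hidden behind a small interface. The names WordTextSketch, TextExtractor and extractFromFile are made up for illustration and are not part of TextMining.org or POI. The main method uses a stand-in extractor that just reads bytes as UTF-8, so the sketch runs without any extra jars.

```java
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class WordTextSketch {

    /** Hypothetical adapter: anything that turns a document stream into text. */
    @FunctionalInterface
    public interface TextExtractor {
        String extractText(InputStream in) throws Exception;
    }

    /** Opens the file, runs the extractor, and closes the stream even on failure. */
    public static String extractFromFile(Path file, TextExtractor extractor) throws Exception {
        try (InputStream in = Files.newInputStream(file)) {
            return extractor.extractText(in);
        }
    }

    public static void main(String[] args) throws Exception {
        // Stand-in extractor (an assumption, not a real Word parser):
        // it simply decodes the bytes as UTF-8.
        TextExtractor plainText =
                in -> new String(in.readAllBytes(), StandardCharsets.UTF_8);

        Path tmp = Files.createTempFile("sample", ".txt");
        try {
            Files.writeString(tmp, "hello from a fake document");
            System.out.println(extractFromFile(tmp, plainText));
        } finally {
            Files.delete(tmp);
        }
    }
}
```

With the extractor jar on the classpath, the real thing would presumably be plugged in as a lambda along the lines of in -> new WordExtractor().extractText(in), and the returned String is what gets handed to Lucene for indexing.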

Someone else noted that the Python version of Lucene (called Lupy) has an indexer for MS Word and PDF as well, although it appears to only work on Windows.

This entry was posted in J2EE, Lucene, Python, Software Development.

11 Responses to Extracting Text From MS Word

  1. Steve G. says:

    I’m running a CFMX application that integrates Lucene, antiword and Xpdf to index and search Word and PDF files. It’s fast and has been flawless so far!

  2. Mark says:

    Hi, I am interested in using Lucene to index and search Word documents from CFMX. I was wondering if you had any sample code or pointers. I have no idea where to begin.

  3. AJ says:

    hi Mark,

    I wrote 2 articles on using Lucene and CFMX:

    http://www.sys-con.com/coldfusion/article.cfm?id=629

    http://www.sys-con.com/coldfusion/article.cfm?id=639

    Those should at least get you started with Lucene and ColdFusion. After you get that working and you understand what’s going on, you should be able to extract text from MS Word with no problem.

    AJ

  4. Patrick Simon says:

    Due to the conflict between the version of log4j (1.1.3, I think) that CFMX uses and the latest version that PDFBox uses, you can’t use the latest version of PDFBox to rip text out of a PDF document.

    However, I have successfully used PDFBox-0.6.2 to extract text (and it only uses native Java, no need for COM).

    Don’t know how code will look in your comments, but here is my code…

  5. Aluysio says:

    Hi,

    Does anybody know if there is a tool to scan and extract every word of a doc (from MS Word, etc.) to build an automatic index archive?

    Thanks

    Aluysio

  6. e-lopez says:

    Hi Patrick Simon,
    I want to ask how you used PDFBox-0.6.2 to extract text. I am not able to see the code in the comments.

  7. Tathagata Roy says:

    Use the WordExtractor class from POI.

  8. Tathagata Roy says:

    I want to use OpenOffice for extracting text from Word, but I don’t know how. Any ideas?

  9. Travis says:

    I am interested in extracting text from Word documents in order to create my own index… I want to use this on a site I wrote in ASP… Does anybody know if this is possible?

  10. Jenny says:

    Hi,
    I have the same task of extracting text from a Word file. I am trying to use Ryan’s TextMining.org extractor.
    I have not yet managed to get the WordExtractor to run.
    Does anybody know what steps are needed to run the TextMining.org WordExtractor successfully? And also the list of jar files needed…
    Thanks in advance,
    Jenny

  11. Hon says:

    We have about 7 people in our team providing review comments on certain vendor-prepared documents. I’m looking for a way of extracting these comments (preferably with what was commented on and the section in the doc, i.e. table or heading etc.) to Excel or another doc.
