{"id":625,"date":"2004-08-24T08:07:09","date_gmt":"2004-08-24T12:07:09","guid":{"rendered":"http:\/\/wordpress.cephas.net\/?p=625"},"modified":"2004-08-24T08:07:09","modified_gmt":"2004-08-24T12:07:09","slug":"extracting-text-from-ms-word","status":"publish","type":"post","link":"https:\/\/cephas.net\/blog\/2004\/08\/24\/extracting-text-from-ms-word\/","title":{"rendered":"Extracting Text From MS Word"},"content":{"rendered":"<p>Someone on the <a href=\"http:\/\/jakarta.apache.org\/site\/mail.html\">Lucene User list<\/a> wanted to know if it was possible to search MS Word documents using <a href=\"http:\/\/jakarta.apache.org\/lucene\/\">Lucene<\/a>. The normal response is to go and take a look at the <a href=\"http:\/\/jakarta.apache.org\/poi\/\">Jakarta POI project<\/a> (new <a href=\"http:\/\/nagoya.apache.org\/poi\/news\/\">blog<\/a> by the way). Ryan Ackley submitted his website (<a href=\"http:\/\/textmining.org\/\">textmining.org<\/a>) along with a plug for his <a href=\"http:\/\/textmining.org\/modules.php?op=modload&amp;name=Downloads&amp;file=index&amp;req=viewdownload&amp;cid=2\">TextMining.org Word Text Extractor v0.4<\/a> and some sample code:<br \/>\n<code><br \/>\nFileInputStream in = new FileInputStream(\"test.doc\");<br \/>\nWordExtractor extractor = new WordExtractor();<br \/>\nString str = extractor.extractText(in);<br \/>\n<\/code><br \/>\nNice.<\/p>\n<p>Someone else noted that the <a href=\"http:\/\/www.divmod.org\/Home\/Projects\/Lupy\/\">Python version of Lucene<\/a> (called Lupy) has an <a href=\"http:\/\/www.methods.co.nz\/docindexer\/\">indexer for MS Word and PDF<\/a> as well, although it appears to only work on Windows.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Someone on the Lucene User list wanted to know if it was possible to search MS Word documents using Lucene. The normal response is to go and take a look at the Jakarta POI project (new blog by the way). 
Ryan Ackley submitted his website (textmining.org) along with a plug for his TextMining.org Word Text &hellip; <a href=\"https:\/\/cephas.net\/blog\/2004\/08\/24\/extracting-text-from-ms-word\/\" class=\"more-link\">Continue reading <span class=\"screen-reader-text\">Extracting Text From MS Word<\/span> <span class=\"meta-nav\">&rarr;<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":[],"categories":[3,19,25,2],"tags":[],"_links":{"self":[{"href":"https:\/\/cephas.net\/blog\/wp-json\/wp\/v2\/posts\/625"}],"collection":[{"href":"https:\/\/cephas.net\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/cephas.net\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/cephas.net\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/cephas.net\/blog\/wp-json\/wp\/v2\/comments?post=625"}],"version-history":[{"count":0,"href":"https:\/\/cephas.net\/blog\/wp-json\/wp\/v2\/posts\/625\/revisions"}],"wp:attachment":[{"href":"https:\/\/cephas.net\/blog\/wp-json\/wp\/v2\/media?parent=625"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/cephas.net\/blog\/wp-json\/wp\/v2\/categories?post=625"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/cephas.net\/blog\/wp-json\/wp\/v2\/tags?post=625"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}