DTP


 
Lively discussions on the graphic arts and publishing — in print or on the web


Old 04-23-2011, 09:01 AM   #1
RJ Emery
Member
 
Join Date: Mar 2005
Posts: 248
Index to a Corpus

A corpus is a "body of writing." I have a great number of news articles captured as PDFs for which I wish to build an index based on the words therein. Ideally, I would like to open the Index, enter a few keywords, and have the Index return what articles (and in what folders) match those keywords.

Does Adobe or any third party have such an Index facility?

   
__________________
RJ Emery, Eastern USA
WordPerfect 8 User on XP Pro SP3 System
OCR ScanSoft PaperPort SE v9 on
Brother MFC-8840DN Printer/Scanner/Fax
Old 04-23-2011, 02:30 PM   #2
terrie
Staff
 
Join Date: Oct 2004
Posts: 8,918

Quote:
rj: I would like to open the Index, enter a few keywords, and have the Index return what articles (and in what folders) match those keywords.
That would be pretty cool...it seems to me that what you want is a meta-index--an index of indices?

Terrie
Old 04-23-2011, 04:24 PM   #3
RJ Emery
Member
 
Join Date: Mar 2005
Posts: 248

Not an index of an index. I need a tool to build an index of PDF files, and then the capability of searching within that index for files containing the keywords I seek.

   
Old 04-23-2011, 05:13 PM   #4
annc
Sysop
 
Join Date: Oct 2004
Location: Subtropical Queensland, Australia, between the mountains and the Coral Sea
Posts: 4,434

Quote:
Originally Posted by RJ Emery
Not an index of an index. I need a tool to build an index of PDF files, and then the capability of searching within that index for files containing the keywords I seek.
If you put them all in one folder and then use Advanced search in one of the PDFs, you can tell it to search the terms you select in all the files in the folder. At least, you can in Acrobat standard or pro. Not sure about non-Adobe PDF apps or Reader.

   
Old 04-23-2011, 05:26 PM   #5
RJ Emery
Member
 
Join Date: Mar 2005
Posts: 248

Yes, I am aware of that search method. However, it doesn't work too well when one has thousands of PDF files across many folders. It is also very slow.

   
Old 04-23-2011, 08:18 PM   #6
Steve Rindsberg
Staff
 
Join Date: Nov 2004
Posts: 6,712

Adobe used to, but I'm not sure whether they still do or, if they do, what they call it nowadays.

Agent Ransack is free; doesn't build an index but seems to do a pretty good job of searching PDFs (and many other file types). I doubt it's anywhere near as fast as a product that maintained an index would be, though.

http://www.mythicsoft.com/default.aspx

dtSearch does create an index and is probably much faster. Costs more too.
http://www.dtsearch.com/

   
__________________
Steve Rindsberg
====================
www.pptfaq.com
www.pptools.com
and stuff
Old 04-23-2011, 08:27 PM   #7
Howard Allen
Member
 
Join Date: Oct 2007
Location: Calgary, Alberta, Canada
Posts: 824

Acrobat (the Pro version for sure, and possibly the "Standard" version) has an indexing feature where you point it toward a bunch of documents (say, a folder full, or a bunch of folders, or a volume) and it will build a keyword index. Depending on how many documents you're indexing, it can take quite a while to grind away building the index, but once it's done, keyword searches are very fast--much faster than the standard "search" routine. You can choose to omit minor words ("the", "it", "and") to keep your index files smaller and quicker. And if you add more documents to your collection at a later date, you can update the index.
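For anyone curious what a keyword index of this kind looks like in principle, here is a rough Python sketch. It is not Acrobat's actual implementation or file format. It assumes the text of each PDF has already been extracted (for example by OCR), and the stop-word list, function names, and file paths are all made up for illustration.

```python
import re
from collections import defaultdict

# Hypothetical stop-word list; the list Acrobat uses is not given in this thread.
STOP_WORDS = {"the", "it", "and"}

def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(documents, stop_words=STOP_WORDS):
    """Map each non-stop word to the set of document paths containing it."""
    index = defaultdict(set)
    for path, text in documents.items():
        for word in tokenize(text):
            if word not in stop_words:
                index[word].add(path)
    return index

def search(index, query):
    """Return the paths of documents that contain every keyword in the query."""
    words = [w for w in tokenize(query) if w not in STOP_WORDS]
    if not words:
        return set()
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results

# Made-up sample data standing in for text extracted from two PDFs.
docs = {
    "news/2010/flood.pdf": "The river flooded and the town was evacuated.",
    "news/2011/budget.pdf": "The town council passed the budget.",
}
index = build_index(docs)
print(sorted(search(index, "town budget")))
```

Building the index is the slow part because every document is read once. A search afterwards only looks up a few words in the table, which is why it is so much faster than scanning every file each time.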

In Acrobat Pro 8 it's under the Advanced menu, Document Processing/Full Text Index with Catalog...

I see the "OCR ScanSoft Paperport" in your signature, so I guess I don't have to tell you that the PDF'd news items have to be OCR'd...yeah, I didn't think so.

   
__________________
Howard

OSX 10.10.5
Old 04-23-2011, 09:09 PM   #8
Michael Beloved
Member
 
Join Date: Sep 2008
Location: Brooklyn NY
Posts: 141

I am not sure if this would help but under Tools on the Acrobat Menu Bar, there is Advanced Editing and then the Link tool.

And for bookmarks there is the Add a Bookmark under the Document tab of the Menu Bar.

Recently I had to put some print-edition books into Kindle format and also EPUB format. I had to create a hyperlinked index, which was totally different.

Here is what I did using Expression Web 2. Dreamweaver can be used just as well.

First I went through the book and marked out sentences for the index. Then I bookmarked each of those marked sentences or phrases in an html file.

Then I created a hyperlinked index using that list of bookmarked entries. It takes time, but once you get the hang of it, it goes fast.

There is one serious headache which comes up, and that is how to alphabetize those hyperlinked index entries. Right now it cannot be done with an HTML editor (program). That means the list has to be moved into an editor like Word and then moved back, with the problem that Word, or whatever editor you use, adds coding marks which are not part of the W3C convention.
So you have to clean up all those coding marks if you require your file to be W3C approved.
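One possible way around the round trip through Word is a small script that sorts the entries directly. The Python sketch below is only an illustration under some assumptions: each index entry is a simple link of the form `<a href="...">text</a>` with no nested tags, and the sample entries and the function name are invented.

```python
import re

# Matches one index entry of the form <a href="...">visible text</a>.
ENTRY = re.compile(r'<a\s+href="([^"]*)">(.*?)</a>', re.IGNORECASE | re.DOTALL)

def sort_index_entries(html):
    """Return the link entries found in html, sorted by their visible text."""
    entries = ENTRY.findall(html)
    # casefold() makes the ordering ignore upper and lower case.
    entries.sort(key=lambda entry: entry[1].casefold())
    return [f'<a href="{href}">{text}</a>' for href, text in entries]

# Invented sample entries for illustration only.
sample = '''
<a href="#p12">Yoga</a>
<a href="#p3">breath control</a>
<a href="#p7">Meditation</a>
'''
for line in sort_index_entries(sample):
    print(line)
```

Because the links are only reordered and written back out unchanged, no extra coding marks are introduced along the way.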

   
__________________
michael beloved
Old 04-24-2011, 07:31 AM   #9
Steve Rindsberg
Staff
 
Join Date: Nov 2004
Posts: 6,712

Thanks Howard. I thought this might still be a part of Acrobat.

As I recall, there was a certain amount of rigamarole associated with opening indexes when you wanted to search, but you could also set up a single small PDF such that when you opened it, one or more indices would automatically be opened as well. Much more convenient that way. Open the pdf, search.

IIRC, you could also open multiple indices if you wanted to choose which batches of indexed documents you wanted to search.
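To illustrate the idea of choosing which indices to search, here is a rough Python sketch. It says nothing about how Acrobat stores or opens its indices. Each index is modelled as a plain dictionary from keyword to a set of file paths, and the index names, paths, and the `search_indices` function are hypothetical.

```python
def search_indices(indices, selected, keywords):
    """Search only the selected named indices; return matching paths per index."""
    results = {}
    for name in selected:
        index = indices[name]
        matches = None
        for word in keywords:
            paths = index.get(word.lower(), set())
            # Keep only documents that contain every keyword so far.
            matches = set(paths) if matches is None else matches & paths
        results[name] = matches or set()
    return results

# Two made-up indices, one per batch of documents.
indices = {
    "politics": {
        "budget": {"politics/a.pdf", "politics/b.pdf"},
        "vote": {"politics/b.pdf"},
    },
    "weather": {
        "flood": {"weather/c.pdf"},
        "budget": {"weather/d.pdf"},
    },
}
print(search_indices(indices, ["politics"], ["budget", "vote"]))
```

Passing both names in `selected` would search both batches and report the matches for each one separately.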

Have they improved the search capability past the fairly simple boolean/quoted strings stuff they had years ago?

   
Old 04-24-2011, 08:31 AM   #10
Howard Allen
Member
 
Join Date: Oct 2007
Location: Calgary, Alberta, Canada
Posts: 824

Steve--

Yes, there is a bit of rigamarole, but it's pretty painless. You open the Search panel (Edit menu), then at the bottom of the panel there's a link, "Use Advanced Search Options", which lets you select an index to use for searching. In my experience, Acro remembers the path to the last index(ices) I had open and keeps it active, so I don't have to activate it each time. Even if you wanted to use a different index, it's just a case of navigating to the index file with a standard Open dialog.

Acro 8 is pretty long in the tooth by now, so I don't know how or if the search options have changed in more recent (9, 10?) versions. V. 8's search options are pretty basic, as you say, but I've never had to use anything more rigorous than simple keyword inputs to find what I'm looking for in PDFs. The only times I need industrial strength options like GREP are when I'm in a search-and-replace scenario, which of course isn't a PDF thing. Having said that, Acro's "advanced" search options do include three levels of metadata, creation/modification dates, etc. and you can choose to include text in comments, bookmarks and attachments.

I've attached a screen shot of the search panel.
[Attachment: Acro8 search panel.png, 55.5 KB]

   
All times are GMT -8.


Contents copyright 2004–2014 Desktop Publishing Forum and its members.