Monday, October 01, 2007

How to index PDFs in MOSS 2007 document libraries

SharePoint document libraries are phenomenal tools for collaborative environments where files are shared. And SharePoint's ability to search files in document libraries makes finding files easy. Well, unless the document is a non-Microsoft file type, such as the ever-present PDF file.


The sad fact of the matter is that Windows SharePoint Services (WSS) 3.0 and Microsoft Office SharePoint Server (MOSS) 2007 can't index PDFs by default. That's not news to many veteran SharePoint professionals. Nor is the fact that you can add an icon for PDFs, reindex existing documents, and so forth. However, many administrators are new to SharePoint and will hit their heads hard against this problem. I was disappointed that, despite extensive searching on Google, I could find no single, authoritative, and (most importantly) complete guide to solving it.

The "bottom line" is that you must install an iFilter for PDFs on your SharePoint servers--specifically, any server that performs search, which would be all WSS servers and your MOSS search server. iFilters are plug-ins that enable indexing of file types. Although iFilter is a Microsoft specification, it is generally through vendors or third parties that you'll get iFilters--not through Microsoft itself.

After you add the iFilter, you must configure SharePoint to index the file type (.PDF). But then you still have two problems. The bigger one is that SharePoint indexes only newly added files and existing files whose properties change. So SharePoint will not index existing PDFs when you add the PDF iFilter; you must rebuild your index. The second challenge, a purely cosmetic one, is getting SharePoint to display an appropriate icon for PDFs.

This installment will focus on 32-bit WSS servers. Picture a document library that holds a Word document and a PDF, both containing the word "iFilter": a search returns only the Word document. Now, let's fix the problem!

1. You will need two downloads:
- The Adobe PDF iFilter version 6.0, available from Adobe.
- An icon for PDFs, also available from Adobe. Check their licensing page, then download the gif.


2. Install the iFilter. Note: Many guides on the Internet suggest shutting down Microsoft IIS or the Shared Service Provider (SSP) or the WSS application(s). I found this was not necessary, and Microsoft's own KB article 927675 did not specify it was necessary.


3. Add a registry entry for the .pdf extension in the key HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Shared Tools\Web Server Extensions\12.0\Search\Applications\\Gather\Search\Extensions\ExtensionList.
- Open the registry editor.
- Navigate to HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Shared Tools\Web Server Extensions\12.0\Search\Applications\\Gather\Search\Extensions\ExtensionList\.
- Identify the highest "number" value in the key. On a default installation of WSS, the highest entry is 37. Note that they are not sorted in numeric order because registry value names are strings.
- Create a registry value for the next number, e.g. 38, by choosing Edit > New > String Value, then naming the value the next highest number (e.g. 38).
- Double-click the value you just created and, in the Value Data box, type: pdf. Note there is no dot preceding the extension.
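
The "highest number" check is easy to get wrong by eye, precisely because the value names sort as strings. Here is a minimal Python sketch of the arithmetic, assuming the value names have already been read into a list. The function name next_extension_slot and the sample names are made up for illustration and are not part of SharePoint.

```python
def next_extension_slot(value_names):
    """Return the next free numeric value name, as a string.

    Non-numeric names such as "(Default)" are ignored. Names are compared
    as integers, because a string sort would place "4" after "37".
    """
    numbers = [int(name) for name in value_names if name.isdigit()]
    return str(max(numbers, default=0) + 1)


# Sample names in the order a string sort would show them.
sample = ["1", "10", "2", "37", "4", "(Default)"]
print(next_extension_slot(sample))  # prints 38
```

On the server itself, the names could be collected with the standard winreg module (for example with winreg.EnumValue), but that part is left out here because it depends on the exact key path on your machine.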


4. There are two registry keys with specific values that must exist. Verify that these exist and, if not, create them:
HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Shared Tools\Web Server Extensions\12.0\Search\Setup\ContentIndexCommon\Filters\Extension\.pdf
- Value Name: Default; Type: REG_MULTI_SZ; Data: {4C904448-74A9-11D0-AF6E-00C04FD8DC02}
HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Shared Tools\Web Server Extensions\12.0\Search\Setup\Filters\.pdf
- Value Name: Default; Type: REG_SZ; Data: (value not set)
- Value Name: Extension; Type: REG_SZ; Data: pdf
- Value Name: FileTypeBucket; Type: REG_DWORD; Data: 0x00000001 (1)
- Value Name: MimeTypes; Type: REG_SZ; Data: application/pdf
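
If you would rather check these values with a script than by eye, the comparison could look something like the Python sketch below. It assumes you have already copied the values from the registry into a dictionary; the names EXPECTED, find_problems, and snapshot are hypothetical, and the keys are shortened to the last part of each path. The Default value of the second key is "(value not set)", so it is not checked.

```python
# Expected values from step 4, keyed by the final part of each registry path.
# The empty string "" stands for a key's Default value.
EXPECTED = {
    r"ContentIndexCommon\Filters\Extension\.pdf": {
        "": ["{4C904448-74A9-11D0-AF6E-00C04FD8DC02}"],  # REG_MULTI_SZ
    },
    r"Filters\.pdf": {
        "Extension": "pdf",              # REG_SZ
        "FileTypeBucket": 1,             # REG_DWORD
        "MimeTypes": "application/pdf",  # REG_SZ
    },
}


def find_problems(actual):
    """Compare a snapshot of the registry against EXPECTED.

    `actual` has the same shape as EXPECTED. Returns a list of
    (key, value_name, expected, found) tuples, one per mismatch.
    """
    problems = []
    for key, values in EXPECTED.items():
        found_values = actual.get(key, {})
        for name, expected in values.items():
            found = found_values.get(name)
            if found != expected:
                problems.append((key, name, expected, found))
    return problems


# A snapshot in which the MimeTypes value was never created.
snapshot = {
    r"ContentIndexCommon\Filters\Extension\.pdf": {
        "": ["{4C904448-74A9-11D0-AF6E-00C04FD8DC02}"],
    },
    r"Filters\.pdf": {"Extension": "pdf", "FileTypeBucket": 1},
}
for problem in find_problems(snapshot):
    print(problem)
```

An empty result means every value listed in step 4 is present with the expected data.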


5. Restart the Windows SharePoint Services Search service. Open a command prompt and type net stop spsearch, then net start spsearch. If you perform a search now, existing PDFs will still not be returned, but newly added PDFs will appear in search results once SharePoint has indexed them. If you modify any property of an existing PDF, it will be indexed too. But who wants to modify every existing PDF in a document library? This is where I found a lot of misinformation online. Even Microsoft's KB 927675 didn't suggest the right solution. It's easy: STSADM, SharePoint's ubercommand, to the rescue!


6. Rebuild the WSS search index.
- Open a command prompt.
- Navigate to Program Files\Common Files\Microsoft Shared\web server extensions\12\BIN and type the following commands:
stsadm.exe -o spsearch -action fullcrawlstop
stsadm.exe -o spsearch -action fullcrawlstart
The existing PDFs will, after being indexed, appear in search results. But they will still not have correct icons. So, while your site is being indexed, keep going with these steps to configure the icon.
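
If you have several WSS servers to fix, the commands from steps 5 and 6 could be wrapped in a small script. The following Python sketch only builds and prints the commands unless you turn off the dry run. The drive letter in BIN_DIR is an assumption, and the names reindex_commands and run_all are made up for this example.

```python
import subprocess
from pathlib import PureWindowsPath

# Assumed install location; the drive letter is a guess.
BIN_DIR = PureWindowsPath(
    r"C:\Program Files\Common Files\Microsoft Shared\web server extensions\12\BIN"
)


def reindex_commands(bin_dir=BIN_DIR):
    """Return the commands from steps 5 and 6 as argument lists, in order."""
    stsadm = str(bin_dir / "stsadm.exe")
    return [
        ["net", "stop", "spsearch"],
        ["net", "start", "spsearch"],
        [stsadm, "-o", "spsearch", "-action", "fullcrawlstop"],
        [stsadm, "-o", "spsearch", "-action", "fullcrawlstart"],
    ]


def run_all(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for command in commands:
        print(" ".join(command))  # for display only, paths are not quoted
        if not dry_run:
            subprocess.run(command, check=True)  # stop at the first failure


run_all(reindex_commands())  # dry run: only prints the commands
```

Calling run_all(reindex_commands(), dry_run=False) from an administrator prompt on the server would actually execute them, stopping at the first command that fails.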


7. Open the folder Program Files\Common Files\Microsoft Shared\Web Server Extensions\12\Template\Images.


8. Copy the gif you downloaded in Step 1 into the folder.


9. Open the folder Program Files\Common Files\Microsoft Shared\Web server extensions\12\Template\Xml.


10. Right-click the file docicon.xml and choose Open With and select Notepad.


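11. Inside the ByExtension element, you'll see a number of Mapping elements. You will add one for pdf. It does not have to be in alphabetical order. The element you need to add is a Mapping element whose Key attribute is pdf and whose Value attribute is the file name of the gif you copied in Step 8.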


12. Save that file and close Notepad.

Now, the moment of truth: a search now returns the PDFs in its results.
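
For reference, docicon.xml maps file extensions to icons with Mapping elements (Key and Value attributes) inside a ByExtension element. The edit from steps 9 through 12 can be sketched in Python as follows. The sample XML is a cut-down stand-in for the real file, add_icon_mapping is a made-up helper, and pdficon.gif is only a placeholder for whatever your downloaded gif is called.

```python
import xml.etree.ElementTree as ET

# A cut-down stand-in for docicon.xml; the real file has many more mappings.
SAMPLE = """<DocIcons>
  <ByExtension>
    <Mapping Key="doc" Value="icdoc.gif"/>
    <Mapping Key="xls" Value="icxls.gif"/>
  </ByExtension>
</DocIcons>"""


def add_icon_mapping(xml_text, extension, icon_file):
    """Return xml_text with a Mapping for `extension` inside ByExtension.

    An existing mapping for the same extension is updated, not duplicated.
    """
    root = ET.fromstring(xml_text)
    by_extension = root.find("ByExtension")
    for mapping in by_extension.findall("Mapping"):
        if mapping.get("Key") == extension:
            mapping.set("Value", icon_file)
            break
    else:
        # No mapping for this extension yet, so append a new one.
        ET.SubElement(by_extension, "Mapping", Key=extension, Value=icon_file)
    return ET.tostring(root, encoding="unicode")


# "pdficon.gif" is a placeholder for the name of the gif from Step 1.
updated = add_icon_mapping(SAMPLE, "pdf", "pdficon.gif")
print(updated)
```

Editing the file by hand in Notepad, as described above, achieves the same thing; the script form is mainly useful if you have to repeat the change on several servers. Either way, keep a backup copy of docicon.xml before changing it.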

7 comments:

  1. Anonymous, 2:20 pm

    I've both of these but I am still finding that the actual content inside my pdf files are not being indexed. Anywhere I should start looking?

  2. Hey, this is a great post but you forget step 13: At the end you must make an IISReset else the pdf icon is not activated.

  3. Anonymous, 2:26 am

    A word to the wise on the icon piece of this (which I experienced). Depending on how you put the GIF icon into the \images directory (copy/paste vs. cut/paste), the GIF file may not inherit the appropriate permissions from the parent directory, thereby denying regular users read permissions to that file. We had a lot of people with PDFs in their document libraries getting repeated credentials prompts because IIS was trying to load that icon in the document list. Eventually they'd get into the library with a missing image placeholder, but it was very disruptive to users. To rectify this, just go into the advanced ACL settings and check the box for inheritance from the parent object for the GIF file itself.

  4. Anonymous, 7:02 pm

    I find that on my MOSS 2007 server there is nothing below the 'Applications' key in the registry:

    HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Shared Tools\Web Server Extensions\12.0\Search\Applications\

    So I can't get this to work, any thoughts?
