Exchange 2007 Content Indexing – attachment behavior / IFilters
Exchange 2007 mailbox databases have content indexing enabled by default, which is new in Exchange 2007. Searching and indexing are much faster and much improved over previous versions. Several posts could be written about Exchange Search; this post will just discuss how attachments can affect your indexing and search results.
Filters (IFilters) are used to extract the text from specific types of documents: html, doc, xml, xls, pdf, and so on. In the registry under HKLM\Software\Microsoft\Exchange\MSSearch\Filters there is a list of the filters that the server is able to use; see picture below.
Now here is what you need to understand: if a filter exists for a file type, Exchange Search will attempt to index the attachment. If the server does not have a filter installed for a file type, Exchange Search will skip the attachment. OK, so that makes sense, but why is it important? Well, if the Exchange server has a filter installed for a file type but cannot read a particular attachment of that type, then the entire message will fail to get indexed. For example, let’s say you have an xls file that is protected by DRM or a password: Exchange Search will not be able to read that attachment, and will not index the message either. Now take the same example, but this time you DO NOT have an xls filter: Exchange Search will NOT try to read the attachment and WILL index the message.
So, a message will fail to be indexed if Exchange fails to index an attachment that it HAS a filter installed for.
A message will be indexed successfully if Exchange does NOT have a filter installed for the attachment’s file type.
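The rule above can be written down as a small decision function. This is only a sketch of the behavior described in this post, not Exchange code: the function name `index_message` and the idea of passing a "readable" flag per attachment are made up for illustration.

```python
from typing import Iterable, List, Tuple


def index_message(installed_filters: Iterable[str],
                  attachments: Iterable[Tuple[str, bool]]) -> Tuple[bool, List[str]]:
    """Model of the rule described in the post (hypothetical helper).

    installed_filters: file extensions that have a filter, e.g. {"xls", "doc"}
    attachments: (file name, readable) pairs, where readable=False stands for
                 a DRM-protected, password-protected or malformed file.
    Returns (message_indexed, names_of_indexed_attachments).
    """
    filters = {ext.lower() for ext in installed_filters}
    indexed: List[str] = []
    for name, readable in attachments:
        ext = name.rsplit(".", 1)[-1].lower() if "." in name else ""
        if ext not in filters:
            # No filter installed: the attachment is skipped, nothing fails.
            continue
        if not readable:
            # Filter installed but the attachment cannot be read:
            # the whole message fails indexing.
            return False, []
        indexed.append(name)
    return True, indexed


# Filter installed, attachment readable: both are indexed.
print(index_message({"xls"}, [("report.xls", True)]))   # (True, ['report.xls'])
# Filter installed, attachment unreadable: the message fails too.
print(index_message({"xls"}, [("locked.xls", False)]))  # (False, [])
# No xls filter: attachment skipped, message still indexed.
print(index_message(set(), [("locked.xls", False)]))    # (True, [])
```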
All of this only matters in situations where Exchange cannot read the attachment. If a normal xls attachment is sent and can be opened and read normally, then both the attachment and the email message will be indexed (as long as there is an xls IFilter). This all works great; the problem is where there are issues with the attachments.
There is one other interesting piece to this: situations where the file is not truly the type its extension claims. For example, let’s say some random NON-Microsoft application spits out automated spreadsheets or documents and gives them an .xls or .doc file extension. The file was not created in Microsoft Excel, so the attachment will not be read correctly, meaning again both the attachment and the message will fail indexing. In the same sense that you cannot rename a .pdf to .doc and expect Word to open it correctly, you cannot expect an IFilter to read a file that was not truly created by that file type’s application.
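One rough way to spot the most obvious mismatches is to compare the first bytes of an attachment against the signature its extension implies. The sketch below assumes you have already saved the attachment to disk or have its bytes in hand; the helper name `looks_like_declared_type` is made up for this post. It relies on two well-known signatures: legacy .xls and .doc files are OLE compound files that start with the bytes D0 CF 11 E0 A1 B1 1A E1, and PDF files normally start with `%PDF`. Passing this check does not guarantee the IFilter can read the file; it only catches cases such as HTML or CSV text saved with an .xls extension.

```python
import os
from typing import Optional

# Signature of an OLE compound file (legacy .xls / .doc containers).
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")
# PDF files normally begin with this marker.
PDF_MAGIC = b"%PDF"

# Extension -> expected leading bytes. Only a few types are covered here.
EXPECTED_MAGIC = {
    ".xls": OLE2_MAGIC,
    ".doc": OLE2_MAGIC,
    ".pdf": PDF_MAGIC,
}


def looks_like_declared_type(filename: str, data: bytes) -> Optional[bool]:
    """Return True/False if the leading bytes match the extension's
    signature, or None when the extension is not in EXPECTED_MAGIC."""
    ext = os.path.splitext(filename)[1].lower()
    magic = EXPECTED_MAGIC.get(ext)
    if magic is None:
        return None
    return data.startswith(magic)


# An HTML table saved as .xls opens in Excel but is not a real xls file.
print(looks_like_declared_type("export.xls", b"<html><table>"))  # False
```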
When troubleshooting Exchange Search cases where items are missing from your search results, check the attachments: can you find anything that sets them apart? If they were created by another application or process, what are your options?
1. Open each attachment and resave it with the appropriate application. Example: a 3rd party app creates .xls files that open fine in Excel but were not created with Excel, so indexing fails on them. Open each attachment in Excel and re-save it; the attachment and message should now be added to the content index catalog.
2. Remove the associated filter. Example: delete the xls IFilter from the registry. Exchange will now skip xls attachments instead of attempting to read them, and the actual message will be indexed. Of course, this means none of your xls attachments will be indexed.
3. Disable Content Indexing for the mailbox database in question. By disabling Content Indexing for this database you will force Exchange to use Store Search when clients query for items. Store Search does not index attachments and would not be affected by the behavior explained above. Store Search is much slower and has some limitations, but it would work around your issue.
Side Note: Exchange 2007 does not have IFilters for Office 2007 by default; you need to install these - http://support.microsoft.com/kb/944516
