What are Unindexable Documents on DataCove?

 

Understanding DataCove’s Email Processing and Indexing processes

On occasion, DataCove will run into what it refers to as an “unindexable” email, often first noticed on DataCove’s nightly email notification about the previous day’s traffic. These unindexable emails are emails, or pieces of them, that could not be fully indexed for one reason or another. This will be discussed in detail a little further down.

An example of a nightly email notification showing unindexable documents:

To explain unindexable emails, we must first understand how DataCove treats emails upon their arrival into the system. No matter the source of the emails, whether a POP3 or IMAP4 fetcher, direct SMTP receipt or an upload of PST or EML files, DataCove treats all new emails it sees the same way.

First comes the Processing phase:

  1. When an email first arrives on DataCove, it is broken out into multiple “documents” or pieces: the email headers and the email body form one plaintext document, any HTML content forms a separate HTML or Rich Text document, and each attachment is broken out into its own separate document. Every “email” on DataCove is therefore really composed of at least two documents, sometimes more depending on how many attachments are present.

  2. If any of those attachments already exist on the system, deduplication is performed via hash comparison: the duplicate copy is discarded and a “pointer” is put in its place, so that anyone who attempts to download or view that attachment is “pointed” at the single original copy that DataCove stores.

  3. All of the respective documents are then spread across various database tables that will allow for searching upon their individual properties, once indexing is completed. They are no longer held as a contiguous email at this point.
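
The Processing steps above can be sketched in a few lines of Python. This is only an illustration built on Python's standard email and hashlib modules, not DataCove's actual implementation: the function name, the in-memory attachment_store dictionary and the choice of SHA-256 as the hash are all assumptions made for the example.

```python
import hashlib
from email import message_from_bytes, policy
from email.message import EmailMessage


def split_into_documents(raw_email: bytes, attachment_store: dict) -> list:
    """Break one email into documents, storing each attachment once by hash."""
    msg = message_from_bytes(raw_email, policy=policy.default)
    documents = []

    # Headers plus the plain-text body form the first document.
    headers = "\n".join(f"{name}: {value}" for name, value in msg.items())
    plain = msg.get_body(preferencelist=("plain",))
    body_text = plain.get_content() if plain is not None else ""
    documents.append({"kind": "headers+body", "content": headers + "\n\n" + body_text})

    # Any HTML rendition becomes its own document.
    html = msg.get_body(preferencelist=("html",))
    if html is not None:
        documents.append({"kind": "html", "content": html.get_content()})

    # Each attachment becomes a document; a duplicate is kept only as a pointer.
    for part in msg.walk():
        if part.get_content_disposition() != "attachment":
            continue
        payload = part.get_payload(decode=True) or b""
        digest = hashlib.sha256(payload).hexdigest()  # assumed hash algorithm
        is_duplicate = digest in attachment_store
        if not is_duplicate:
            attachment_store[digest] = payload
        documents.append({
            "kind": "attachment",
            "filename": part.get_filename(),
            "pointer": digest,
            "deduplicated": is_duplicate,
        })
    return documents


# Hypothetical sample email: plain body, HTML body and one attachment.
sample = EmailMessage()
sample["From"] = "alice@example.com"
sample["To"] = "bob@example.com"
sample["Subject"] = "Quarterly report"
sample.set_content("Report attached.")
sample.add_alternative("<p>Report attached.</p>", subtype="html")
sample.add_attachment(b"report bytes", maintype="application",
                      subtype="octet-stream", filename="report.bin")

store = {}
first = split_into_documents(sample.as_bytes(), store)
second = split_into_documents(sample.as_bytes(), store)  # same attachment again
```

Running the same email through twice shows the idea: the second pass produces the same three documents, but its attachment is flagged as deduplicated and the store still holds a single copy.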

These document totals and deduplication totals can be seen by logging into DataCove, selecting Status in the top header bar, then selecting Summary Counts on the left hand side menu.

In the top Totals section of the page, a breakdown of the various documents on DataCove is shown:

  1. Emails are the total number of emails on the system, just as one would think of any singular email they received.

  2. Attachments are the total number of attachments known to the system, which includes the HTML portions of emails, PDF files, Excel spreadsheets, Word documents, etc.

  3. Unique Attachments are the total number of attachments actually being stored by DataCove, with all other attachments having been replaced with pointers. The difference between these two numbers shows how many documents have been deduplicated, and therefore how much space DataCove has conserved for further email archival. In the example picture below, approximately 47.3% of attachments have been deduplicated, yielding a considerable space savings.

  4. Stored Documents are the combined counts of Emails and Unique Attachments forming the total number of Documents on the system. These are the Documents that the next phase talks about.
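
The relationship between these four totals is simple arithmetic, sketched below. The counts are hypothetical, picked only so that the rate comes out at the 47.3% mentioned above; it is also an assumption that the percentage is computed as deduplicated attachments divided by total attachments.

```python
def summary_totals(emails: int, attachments: int, unique_attachments: int) -> dict:
    """Derive the deduplication figures from the three raw counts."""
    deduplicated = attachments - unique_attachments
    rate = deduplicated / attachments if attachments else 0.0
    return {
        "deduplicated_attachments": deduplicated,
        "deduplication_rate": rate,
        # Stored Documents = Emails + Unique Attachments
        "stored_documents": emails + unique_attachments,
    }


# Hypothetical counts, not taken from a real system.
totals = summary_totals(emails=100_000, attachments=200_000, unique_attachments=105_400)
print(f"{totals['deduplication_rate']:.1%} of attachments deduplicated")
print(f"{totals['stored_documents']:,} stored documents")
```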

Moving on to the Index Related section below, the Emails Processing queue shows how many emails are pending Processing, the first phase of email insertion described above.

Once Processing of emails has completed, the now-disassembled Documents are moved into the Indexing queue, which is what “reads” them to make all of their words searchable. Many aspects of these documents are indexed; a non-exhaustive list of the fields being tracked is below:

  1. Sender email address

  2. Recipient email address

  3. Subject line

  4. Message ID

  5. Email body

  6. Attachment filename

  7. Attachment file type

  8. Attachment contents

Documents that are currently in the Indexing queue are visible under the Documents Being Indexed field, also under the Index Related section. The total number of Indexed Documents on the entire system is directly below that, and generally will match the Stored Documents count from the Totals section. Once emails have been both Processed and Indexed, they will be searchable on the system and will remain present until a Data Retention Policy removes them (if one has been configured).

Below the total number of Indexed Documents is the total number of Emails with Unindexable Documents, which is this article’s subject. Now that we have an understanding of what the DataCove does in the Processing and Indexing phases, we can discuss the nature of Unindexable Documents.

 

What happens when an email cannot be read?

When a piece of an email cannot be read by DataCove’s indexing engine, it winds up as an Unindexable Document. This can happen to any field in the email, such as the message headers, body or attachments; if a field cannot be read, it cannot be indexed and thus cannot be made searchable. Common reasons for this inability to read a document are:

  1. Malformed content that does not comply with the RFC822 standard for email formatting and layout. Invalid line breaks in message headers are a common cause for this.

  2. Disrupted transmission, like an email that got garbled while in transit from the sending server to the recipient server, or to DataCove.

  3. Use of non-UTF-8 encoding, common in malicious spam emails. Such encodings were sometimes used in an attempt to bypass spam filters; it is an old strategy, but one that worked fairly effectively until around 2012.

  4. Encrypted body content or attachments, including use of Information Rights Management.

  5. Password protected attachments.

  6. Compressed attachments.

  7. Attachments not in a format that can be indexed; music or movie files, for example.

  8. Hotlinked content in the body of a message that no longer exists at the source location. This used to be an occasional issue, but it is becoming more common with phishing emails that use temporary websites or websites that change based on the campaign being run.

  9. Filesize over 35MB.

With all that said, only the unreadable portion of the email is unsearchable, not the entire email. DataCove will continue to attempt to index all other aspects of the email, and many other fields will likely be read and indexed just fine.

As an example, in the case of a password protected spreadsheet attached to an email, DataCove does not have the password and cannot open the file to index its contents. That email can still be searched by From, To, Subject Line, Date, Attachment Type, Attachment Name and the text in the body of the email, all of which are still indexed normally.
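
A rough sketch of this "index what can be read, flag what cannot" behavior follows. It is a simplified stand-in rather than DataCove's indexing engine: the extract_text helper, the UnreadableDocument exception and the field names are hypothetical, and undecodable bytes are used here as a substitute for a protected attachment that cannot be opened.

```python
MAX_INDEXABLE_BYTES = 35 * 1024 * 1024  # 35MB limit from the list above (binary MB assumed)


class UnreadableDocument(Exception):
    """Raised when a document's contents cannot be read for indexing."""


def extract_text(name: str, payload: bytes) -> str:
    """Hypothetical extractor: refuses oversized or undecodable input."""
    if len(payload) > MAX_INDEXABLE_BYTES:
        raise UnreadableDocument(f"{name}: larger than 35MB")
    try:
        return payload.decode("utf-8")
    except UnicodeDecodeError as exc:
        raise UnreadableDocument(f"{name}: contents could not be read") from exc


def index_email(documents: dict) -> tuple:
    """Index every readable document; record the rest instead of failing the email."""
    index, unindexable = {}, []
    for name, payload in documents.items():
        try:
            index[name] = extract_text(name, payload).lower().split()
        except UnreadableDocument as exc:
            unindexable.append(str(exc))
    return index, unindexable


# Hypothetical email whose attachment contents cannot be read.
email_documents = {
    "subject": b"Q3 budget",
    "body": b"Figures attached",
    "attachment_contents": b"\xff\xfe\x00\x01",  # stands in for a protected file
}
index, unindexable = index_email(email_documents)
```

Here the subject and body end up in the index and remain searchable, while only the attachment contents are recorded as unindexable.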

Documents that cannot be read at this current moment are not necessarily unindexable forever; newer versions of DataCove and various updates may allow for indexing of these documents (depending on the original cause of the illegibility), although a DataCove-to-DataCove migration is necessary for this reprocessing and reindexing.

The general expectation is that unindexable emails make up less than 1% of any given day’s traffic: they are common enough to occur, but very small in number overall. Most organizations will see a couple roll through per day, as meeting one of the aforementioned criteria is simply a fact of modern email and data transfer mechanisms. If an unexplained spike of unindexable documents exceeding 1% of a regular day’s traffic is observed (with no spam filter breach, mail server setting change, mass mailing worm or user compromise to account for it), DataCove Support should be contacted for follow-on investigation. Seeing a lot of them can indicate a mail delivery, transport or networking problem or, depending on what they are, may suggest that a different means of data transfer and collaboration, such as ShareFile, SharePoint, OneDrive or Dropbox, should be investigated in place of sending large files around via email.
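
For administrators who track the figures from the nightly notification, the 1% guideline can be checked with a small calculation like the one below. The function names and the sample counts are illustrative assumptions; DataCove itself does not necessarily expose anything of this shape.

```python
def unindexable_share(unindexable_emails: int, total_emails: int) -> float:
    """Fraction of a day's traffic that contained unindexable documents."""
    if total_emails == 0:
        return 0.0
    return unindexable_emails / total_emails


def needs_investigation(unindexable_emails: int, total_emails: int,
                        threshold: float = 0.01) -> bool:
    """True when the share exceeds the 1% guideline."""
    return unindexable_share(unindexable_emails, total_emails) > threshold


# Hypothetical daily counts.
typical_day = needs_investigation(unindexable_emails=12, total_emails=8_000)
spike_day = needs_investigation(unindexable_emails=400, total_emails=8_000)
```

With these sample numbers, the typical day sits at 0.15% and is not flagged, while the spike day sits at 5% and is.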

 

How to search “around” an email

Similarly to how astronomers have found black holes, which could not be directly observed, by searching for their theorized effects on things around them, searching for emails with unindexable documents has a comparable method.

As mentioned above, while a specific Document of an email may not be indexed, the rest of the email is likely to be just fine. Searching “around” an email therefore remains a feasible way to ascertain whether its content may be relevant to a case. The component most likely to be unindexable is an attachment, simply due to the variety of attachment formats and the protection mechanisms that can be applied to them, so searching on the other indexed content is the best option.

With the example of a password-protected spreadsheet attachment, searching for related content via the attachment name or file type can help yield the desired results, even if the content itself cannot be read.

  • Tagging it as an email of interest, and then following up to obtain the password (or brute forcing it via cracking applications), would be the next logical step. Someone sending password-protected content would at some point have needed to communicate that password to the recipient; if they also did that via email or Microsoft Teams chat message, the odds are good that the pieces can be put back together simply by searching for that kind of information.

Outside of fringe searches for content, these unindexable documents can be viewed with their respective emails by clicking Email Viewing in the top header bar, then selecting Parse Failures on the left hand side menu, which shows a calendar view of those emails to directly look through when needed. Selecting the respective days that the search calls for can help isolate emails that may be affected and allows direct analysis of the content.

Reviewing these can be prudent when searches are expected to find content that is otherwise not coming up in the search results, especially for more technically savvy senders and recipients who may have encrypted the content.

Although decryption generally needs to be configured in advance of having to deal with encrypted content, Tangent’s guide on activating decryption for journaled messages can be a great aid in preventing some of these unindexable documents from occurring in the first place.

This guide can be found here: https://datacove.net/knowledge-base/activating-journal-decryption-on-o365
