Document Deduplication Module

Parashift's Document Deduplication module lets users find duplicate documents within Alfresco. The module uses a search index to find documents that are duplicates of each other and displays the results within Alfresco Share.

Features

  • Ability to index content in an external Solr instance
  • Ability to find similar documents based upon the contents of the document
  • Ability to find similar images based upon their image signature

Changelog

The change log for Document Deduplication can be found here

Installing

Alfresco

Document Deduplication comes with both a Share and a Repo AMP, as well as a Solr core configuration. Follow our Installation guide to install this module into Alfresco, then follow the instructions below to install and configure Solr.

Solr

Deduplication relies on a separate Solr index to detect duplicates.

Configuring

Add the following settings to alfresco-global.properties:

solr.content.enable=true
external.solr.url=http://<your_solr_instance>:8983/deduplicate
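Before restarting Alfresco, it can help to confirm that both settings are actually present in the file. The following is a minimal sketch, not part of the module: the helper names parse_properties and missing_dedup_settings are hypothetical, the sample host is a placeholder, and the parser handles only simple key=value lines rather than the full Java properties syntax.

```python
# Minimal sketch: parse simple .properties text and check that the two
# deduplication settings are set. Helper names are hypothetical and are
# not provided by the Document Deduplication module.

REQUIRED_KEYS = ("solr.content.enable", "external.solr.url")


def parse_properties(text):
    """Parse key=value lines, skipping blank lines and # or ! comments."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", "!")):
            continue
        # Split on the first "=" only, so values may contain "=" themselves.
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props


def missing_dedup_settings(props):
    """Return the required keys that are absent or have an empty value."""
    return [key for key in REQUIRED_KEYS if not props.get(key)]


if __name__ == "__main__":
    # Placeholder content; in practice, read alfresco-global.properties.
    sample = """
    # deduplication settings
    solr.content.enable=true
    external.solr.url=http://solr.example.com:8983/deduplicate
    """
    print(missing_dedup_settings(parse_properties(sample)))
```

An empty list means both settings were found. Any key in the returned list still needs to be added to alfresco-global.properties.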

Initial Data Import

If you are installing into an existing environment, you can have Solr reprocess existing files by running the Data Import Handler from the Solr Administration panel.

Before doing so, edit data-config.xml in the deduplicate directory and add the username and password of your administrator account.
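The import can also be started over HTTP instead of through the Administration panel. The sketch below rests on assumptions: that the Data Import Handler is registered at the conventional /dataimport path under the core (this depends on the core's solrconfig.xml), and that the core is reachable at the same URL used for external.solr.url. The function names are hypothetical; command=full-import is the standard Data Import Handler command for a complete reimport.

```python
from urllib.parse import urlencode
from urllib.request import urlopen


def build_full_import_url(core_url, handler="/dataimport"):
    """Build the URL that asks the Data Import Handler for a full import.

    The handler path is an assumption; check solrconfig.xml for the
    name under which the handler is actually registered.
    """
    query = urlencode({"command": "full-import"})
    return core_url.rstrip("/") + handler + "?" + query


def trigger_full_import(core_url, timeout=30):
    """Send the full-import request and return the HTTP status code."""
    with urlopen(build_full_import_url(core_url), timeout=timeout) as resp:
        return resp.status
```

For example, trigger_full_import("http://<your_solr_instance>:8983/deduplicate") would send the request to that core. Progress would still need to be checked in the Solr Administration panel.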

Usage

To find a duplicate document:

  • Navigate to the document details page for a document
  • Scroll down to the bottom of the page
  • Similar and duplicate documents are listed underneath the Version History