Document Deduplication Module
Parashift's Document Deduplication module lets users find duplicate documents within Alfresco. The module uses a search index to find documents that are duplicates of each other and displays the results within Alfresco Share:
- Ability to index content in an external Solr instance
- Ability to find similar documents based upon the contents of the document
- Ability to find similar images based upon their image signature
The change log for Document Deduplication can be found here
Document Deduplication comes with both a Share and a Repo AMP, plus a Solr core configuration. Follow our Installation guide to install this module into Alfresco, then follow the instructions below to install and configure Solr.
Deduplication relies on a separate Solr index to detect duplicates:
- Install Solr by following the instructions here: https://cwiki.apache.org/confluence/display/solr/Installing+Solr
- Copy the deduplicate folder into the cores directory
- Copy the two jar files into the dist directory of Solr
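The two copy steps above could be scripted roughly as follows. This is only a sketch under assumed paths: the package layout (a deduplicate folder sitting next to the two jar files) and the cores and dist locations under the Solr directory are illustrative, and the function name install_dedup_core is hypothetical rather than part of the module.

```python
import shutil
from pathlib import Path
from typing import List


def install_dedup_core(package_dir: Path, solr_dir: Path) -> List[Path]:
    """Copy the deduplicate core folder and the bundled jars into Solr.

    Assumed layout (hypothetical, adjust to your installation):
      package_dir/deduplicate  -> solr_dir/cores/deduplicate
      package_dir/*.jar        -> solr_dir/dist/
    """
    # Copy the whole core configuration folder into the cores directory.
    core_src = package_dir / "deduplicate"
    core_dst = solr_dir / "cores" / "deduplicate"
    shutil.copytree(core_src, core_dst, dirs_exist_ok=True)

    # Copy every jar shipped next to the core folder into dist.
    dist_dir = solr_dir / "dist"
    dist_dir.mkdir(parents=True, exist_ok=True)
    copied = []
    for jar in sorted(package_dir.glob("*.jar")):
        copied.append(Path(shutil.copy2(jar, dist_dir / jar.name)))
    return copied
```

The function returns the paths of the copied jars so that the result can be checked before restarting Solr.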
Add the following configurations to
Initial Data Import
If you are installing into an existing environment, you can have Solr reprocess existing files by running the Data Import Handler from the Solr Administration panel.
Before doing so, edit data-config.xml in the deduplicate directory to add the username and password of your administrator account.
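As an alternative to the Administration panel, the Data Import Handler can usually be triggered over HTTP as well. The sketch below assumes that the core is named deduplicate, that Solr listens on http://localhost:8983/solr, and that the handler is registered at the conventional /dataimport path; none of these are confirmed by this guide, so check the core's solrconfig.xml first.

```python
from urllib.parse import urlencode
from urllib.request import urlopen


def dataimport_url(base_url: str, core: str, command: str = "full-import") -> str:
    """Build the Data Import Handler URL for a core.

    The /dataimport path is the conventional handler name and is an
    assumption here; the actual name is set in the core's solrconfig.xml.
    """
    query = urlencode({"command": command, "wt": "json"})
    return f"{base_url.rstrip('/')}/{core}/dataimport?{query}"


def run_full_import(base_url: str, core: str = "deduplicate") -> str:
    """Start a full import and return Solr's raw JSON response.

    Requires a running Solr instance; the core name is an assumption.
    """
    with urlopen(dataimport_url(base_url, core)) as response:
        return response.read().decode("utf-8")
```

For example, run_full_import("http://localhost:8983/solr") would start the import on the assumed core, and calling dataimport_url with command="status" builds the URL for checking its progress.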
To find a duplicate document:
- Navigate to the document details for a document
- Scroll down to the bottom of the page.
- Similar and duplicate documents will be listed underneath the