Document Deduplication Module
Parashift's Document Deduplication module lets users find duplicate documents within Alfresco. The module uses a search index to find documents that are duplicates of each other and displays the results within Alfresco Share.
Features
- Ability to index content in an external Solr instance
- Ability to find similar documents based upon the contents of the document
- Ability to find similar images based upon their image signature
Changelog
The change log for Document Deduplication can be found here
Installing
Alfresco
Document Deduplication ships with a Share AMP, a Repo AMP, and a Solr core configuration. Follow our Installation guide to install the module into Alfresco, then follow the instructions below to install and configure Solr.
Solr
Deduplication relies on a separate Solr index to detect duplicates:
- Follow the Solr installation instructions here: https://cwiki.apache.org/confluence/display/solr/Installing+Solr
- Copy the deduplicate folder into the cores directory
- Copy the two jar files into the dist directory of Solr
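The two copy steps above can be scripted. The following is a minimal sketch, assuming the module package has been unpacked to a directory that contains the deduplicate folder with the jar files next to it; that layout, the function name `install_deduplicate_core`, and all three path parameters are hypothetical and should be adjusted to your environment.

```python
import shutil
from pathlib import Path


def install_deduplicate_core(package_dir, solr_cores_dir, solr_dist_dir):
    """Copy the deduplicate core folder and the bundled jar files into Solr.

    All three paths are assumptions supplied by the caller; this sketch does
    not know where your Solr installation keeps its cores or dist directory.
    """
    package_dir = Path(package_dir)
    cores_dir = Path(solr_cores_dir)
    dist_dir = Path(solr_dist_dir)

    # Copy the core configuration folder into the cores directory.
    core_target = cores_dir / "deduplicate"
    shutil.copytree(package_dir / "deduplicate", core_target, dirs_exist_ok=True)

    # Copy every jar found at the top level of the package into dist.
    dist_dir.mkdir(parents=True, exist_ok=True)
    copied = []
    for jar in sorted(package_dir.glob("*.jar")):
        shutil.copy2(jar, dist_dir / jar.name)
        copied.append(jar.name)
    return core_target, copied
```

The function returns the target core path and the names of the copied jar files, so you can confirm that exactly two jars were found before restarting Solr.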
Configuring
Add the following settings to alfresco-global.properties:
solr.content.enable=true
external.solr.url=http://<your_solr_instance>:8983/deduplicate
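Before restarting Alfresco, it can help to confirm that both settings are present and that the placeholder host has been replaced. The sketch below is one way to do that; the helper names `parse_properties` and `check_dedup_settings` are hypothetical, and the parser only handles simple key=value lines rather than the full Java properties syntax.

```python
def parse_properties(text):
    """Parse simple key=value lines, skipping blank lines and # or ! comments."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", "!")):
            continue
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props


def check_dedup_settings(props):
    """Return a list of problems found with the two deduplication settings."""
    problems = []
    if props.get("solr.content.enable", "").lower() != "true":
        problems.append("solr.content.enable is not set to true")
    url = props.get("external.solr.url", "")
    if not url.startswith(("http://", "https://")):
        problems.append("external.solr.url is missing or not an HTTP(S) URL")
    elif "<" in url or ">" in url:
        problems.append("external.solr.url still contains a placeholder")
    return problems
```

To use it, read the contents of alfresco-global.properties into a string, pass it through `parse_properties`, and check that `check_dedup_settings` returns an empty list.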
Initial Data Import
If you are installing into an existing environment, you can have Solr reprocess existing files by running the Data Import Handler from the Solr Administration panel.
Before doing so, edit data-config.xml in the deduplicate directory to add the username and password of your administrator account.
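The Data Import Handler can also be triggered over HTTP rather than from the administration panel. The sketch below only builds the request URL. It assumes the handler is registered at the conventional /dataimport path in the core's solrconfig.xml, which this guide does not confirm, and the function name `dataimport_url` and the host solr.example.com are hypothetical.

```python
from urllib.parse import urlencode


def dataimport_url(core_url, command="full-import", **params):
    """Build a Data Import Handler request URL for the given core URL.

    Assumes the handler is mapped to /dataimport; check solrconfig.xml in the
    deduplicate core if your configuration uses a different path.
    """
    query = {"command": command}
    query.update(params)
    return core_url.rstrip("/") + "/dataimport?" + urlencode(query)


# Example: a full import against the URL configured in external.solr.url.
print(dataimport_url("http://solr.example.com:8983/deduplicate"))
```

Requesting the resulting URL starts the import. Calling the same function with `command="status"` gives a URL for checking progress.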
Usage
To find a duplicate document:
- Navigate to the document details for a document
- Scroll down to the bottom of the page.
- Similar and duplicate documents are listed underneath the Version History section.