Alfresco Solr Trackers Showcase

In 2014, while working at Alfresco, I helped upgrade the product from Solr 1.4 to Solr 4.9, and in doing so I changed much of the Solr tracking code.  We were on Alfresco version 4.2, and our next big release would be 5.0.  Here’s an overview of my work.

We started with one solr project.  In ACE-916 and other issues we split code out of it into two additional projects, solr-client and solr4.

The solr-client Project

The solr-client project contains our Java API to connect to Alfresco for tracking.  Despite its name, this project really provides a proxy to Alfresco rather than to Solr.  I moved the code from the original solr project that was relevant for all Solr versions into this project.  I created the SOLRAPIClientFactory and its associated SOLRAPIClientFactoryTest.  I also created adapter interfaces to keep some dependent code in the project free from dependencies on specific versions of the classes they wrap.  The solr and solr4 projects each depend on solr-client but are independent of each other.

The solr Project

The solr project contains code relevant to Solr 1.4.  I created the Solr 1.4 implementations of the adapters.  The original CoreTracker had more than 2,000 lines of code and did all kinds of tracking sequentially.  This was refactored in the solr4 project.  I also created the InformationServer interface and the LegacySolrInformationServer.

The solr4 Project

The solr4 project is specific to Solr 4.9.  I created this project, and the majority of my work was done here.  I created the ModelTracker, ContentTracker, and MetadataTracker, and I contributed to the AbstractTracker and AclTracker.  With tracking now split across several kinds of trackers, they used the ThreadHandler, QueueHandler, and AbstractWorkableRunner to take advantage of multi-threading.  The ContentTracker took advantage of our new SolrContentStore, which we used as a cache to avoid having to hit Alfresco for a reindex.  Associated tracker tests are here.
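To illustrate the general idea of moving from sequential tracking to multi-threaded tracking, here is a minimal sketch.  It is not the actual ThreadHandler, QueueHandler, or AbstractWorkableRunner code; the class BatchingTrackerSketch, its method names, and the use of plain node ids are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Consumer;

/** Hypothetical sketch: index node ids in parallel batches instead of one by one. */
class BatchingTrackerSketch {
    private final int batchSize;
    private final int threads;

    BatchingTrackerSketch(int batchSize, int threads) {
        this.batchSize = batchSize;
        this.threads = threads;
    }

    /**
     * Splits the ids into batches, hands each batch to a worker thread,
     * and returns the number of batches once every batch has finished.
     */
    int trackAll(List<Long> nodeIds, Consumer<List<Long>> indexBatch) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<?>> pending = new ArrayList<>();
            for (int start = 0; start < nodeIds.size(); start += batchSize) {
                int end = Math.min(start + batchSize, nodeIds.size());
                List<Long> batch = nodeIds.subList(start, end);
                pending.add(pool.submit(() -> indexBatch.accept(batch)));
            }
            for (Future<?> future : pending) {
                future.get(); // wait for the batch and surface any worker failure
            }
            return pending.size();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Tracking was interrupted", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("A batch failed to index", e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

In this sketch the caller supplies the indexing step as a callback, so the batching and threading logic can be tested without a running Solr or Alfresco instance.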

I also changed how we triggered tracking.  Alfresco has a separate Solr core for each Alfresco store, e.g. workspace://SpacesStore for “live” content and archive://SpacesStore for “deleted” content.  The AlfrescoCoreAdminHandler, which is a custom CoreAdminHandler, instantiates a SolrTrackerScheduler, which schedules a CoreWatcherJob.  The CoreWatcherJob goes through the Solr cores and registers each core's information server and trackers with the admin handler.  To do this I created a TrackerRegistry to register trackers per core.  The SolrTrackerSchedulerTest and the TrackerRegistryTest cover this code.
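The idea of registering trackers per core can be sketched as a map keyed by core name.  This is a simplified illustration and not the real TrackerRegistry: the Tracker interface, the class TrackerRegistrySketch, and all method names here are hypothetical.

```java
import java.util.Collection;
import java.util.Collections;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical stand-in for a tracker; the real trackers do far more. */
interface Tracker {
    void track();
}

/** Hypothetical sketch of a registry: core name -> (tracker class -> tracker). */
class TrackerRegistrySketch {
    private final Map<String, Map<Class<? extends Tracker>, Tracker>> trackersByCore =
            new ConcurrentHashMap<>();

    /** Registers a tracker for a core, replacing any tracker of the same class. */
    void register(String coreName, Tracker tracker) {
        trackersByCore
                .computeIfAbsent(coreName, name -> new ConcurrentHashMap<>())
                .put(tracker.getClass(), tracker);
    }

    /** Names of all cores that currently have trackers registered. */
    Set<String> getCoreNames() {
        return Collections.unmodifiableSet(trackersByCore.keySet());
    }

    /** Trackers registered for one core; empty if the core is unknown. */
    Collection<Tracker> getTrackersForCore(String coreName) {
        Map<Class<? extends Tracker>, Tracker> trackers = trackersByCore.get(coreName);
        if (trackers == null) {
            return Collections.<Tracker>emptyList();
        }
        return Collections.unmodifiableCollection(trackers.values());
    }

    /** Drops everything registered for a core, e.g. when the core closes. */
    boolean removeTrackersForCore(String coreName) {
        return trackersByCore.remove(coreName) != null;
    }
}
```

Keying the inner map by tracker class is one possible design choice: it keeps at most one tracker of each kind per core, so a periodic job that revisits the cores would not pile up duplicates.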

As was required by the new SolrCore, I created an AlfrescoSolrCloseHook along with its AlfrescoSolrCloseHookTest.  I created the InformationServer interface and the SolrInformationServer implementation for solr4.  I wanted to have one InformationServer interface in the solr-client project, but the implementations were so different that it didn’t fit.  Now I almost feel like there is no point in having the interface in either project, but I left both in anyway.

I implemented the adapters mentioned above in this project as well.  Since we were trying to make our new implementation of Solr cloud-friendly, I implemented Cloud to facilitate running Solr queries in the cloud.  Along with that came the SolrCoreTestBase and the CloudTest.

For ACE-3126 I ensured that module models are fetched before any queries go through during installation.  I created the EnsureModelsComponent, which makes queries block until the first model sync with the repository is done.
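One common way to make callers wait for a one-time event is a CountDownLatch.  The sketch below shows that technique under stated assumptions; it is not the EnsureModelsComponent code, and the class ModelSyncGate and its methods are hypothetical names.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

/** Hypothetical sketch: hold queries back until the first model sync has finished. */
class ModelSyncGate {
    private final CountDownLatch firstSyncDone = new CountDownLatch(1);

    /** Called by the model tracking code after its first sync; extra calls are harmless. */
    void markFirstSyncDone() {
        firstSyncDone.countDown();
    }

    /**
     * Called at the start of query handling.  Blocks for up to the given
     * timeout and returns true only if the models are ready.
     */
    boolean awaitModels(long timeoutMillis) {
        try {
            return firstSyncDone.await(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt for the caller
            return false;
        }
    }
}
```

Once the latch has been released, later calls to awaitModels return immediately, so the gate only costs anything during startup.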

Testing

When writing unit tests, I prefer to use a mocking framework so that the tests have no external dependencies such as a database or an app server.  That’s why I invested time in blazing a trail for using Mockito at Alfresco.  Of course we also performed integration tests, like those in the AlfrescoCoreAdminTester, and manual tests as documented in various Jira issues.  I didn’t participate in the performance tests, but I know they were done.

Mavenization

The code was originally built using Ant, and I mavenized the build.  This included a solr4-ssl profile in the solr4 pom to enable secure communication and a solr-http profile in the solr pom to disable SSL.

Conclusion

I learned about Solr, multi-threading, scheduling jobs in Alfresco, Ant, Maven, and generally how to populate the Solr index with all relevant information in an enterprise content management system like Alfresco.  I have since learned that others, such as Lucidworks, have solved this problem.  I have developed a passion for this area and would like to do more work in it in the future.