Disclaimer: The views and opinions expressed in this article come from my own research and learning, and the work showcased here is part of a self-learning POC that I created.
This Apache Spark with Drools integration POC was created to see whether an external Java-based rule engine can be fitted into the Spark framework. The article below explains the motivation behind integrating the Apache Spark cluster-computing framework with the Drools rule engine, and the advantages of using the two together.
The POC details, along with the GitHub link to my POC code, are given below.
Motivation:
Drools provides the following advantages:
1) Re-usability: All the business rules can be kept in their own component, de-coupled from the application, which increases re-usability.
2) Flexibility: When rules live inside the application code, every rule change requires retesting and redeployment effort; keeping them in Drools avoids this.
3) Simplified rules: Compared with an in-code implementation, rules in a separate rule engine like Drools are easier to understand, and with tools like Talend BRMS it is also easier to write new rules.
4) De-coupling logic from data: The main advantage we get is that the logic is decoupled from the data, making it easy to adapt to new changes.
5) Optimized performance: Drools uses the Rete algorithm to optimize the logical flow of rules, and the engine's performance improves further with each new release.
Reference article - explaining Drools advantages
Spark provides the following advantages:
1) In-memory cluster processing: Spark is an in-memory distributed data-processing engine, which is 10 to 100x faster than MapReduce.
2) Distributed application of Drools: When used with Spark, Drools rules can be applied to the data in a distributed fashion, which is faster than traditional single-node applications.
GitHub URL for POC code
Spark-Drools-POC GitHub URL
Note: This code has been tested in Eclipse using winutils.exe. It has not been tested on a cluster yet, but it should work fine, perhaps with a little tweaking in case of any serialization issues.
The architecture of the system:
The architecture of the system/application will look like this; we can call it a "Distributed Rule Engine" (powered by Apache Spark).
POC main code snippets with explanation:
Input Data: This dataset is city-specific traffic-signal data to which the rules are applied.
RDD: With the RDD map function, we can take each generic row and create a case-class object, Traffic. This object can be supplied to a KieSession object (from the Drools API), and with the fireAllRules() method all the rules in the specified DRL file are applied to the dataset.
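A minimal sketch of this RDD approach is given below; the Traffic fields (city, signal, action), the input path, and the KieSession name are my assumptions and are not taken from the actual POC code:

import org.apache.spark.sql.SparkSession
import org.kie.api.KieServices

// Hypothetical case class representing one row of the traffic data set
case class Traffic(city: String, signal: String, var action: String)

val spark = SparkSession.builder()
  .appName("spark-drools-poc")
  .master("local[*]")
  .getOrCreate()

// Parse each CSV line into a Traffic fact (assumed columns: city, signal)
val trafficRdd = spark.sparkContext.textFile("data/traffic.csv")
  .map(_.split(","))
  .map(cols => Traffic(cols(0), cols(1), action = ""))

// Build the KieSession once per partition so the non-serializable Drools
// objects are created on the executors instead of being shipped from the driver
val ruledRdd = trafficRdd.mapPartitions { rows =>
  val kieSession = KieServices.Factory.get()
    .getKieClasspathContainer          // loads traffic.drl from the classpath
    .newKieSession("TrafficSession")   // assumed session name from kmodule.xml
  rows.map { fact =>
    kieSession.insert(fact)
    kieSession.fireAllRules()          // rules set the fact's action field
    fact
  }
}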
DataFrame: With a DataFrame, we can apply the rules through a UDF and use the response to fill a new column.
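A sketch of the UDF variant, reusing the SparkSession and Traffic case class from the RDD snippet above; DroolsHelper is a hypothetical wrapper around the KieSession, sketched under "Drools Invocation code" below:

import org.apache.spark.sql.functions.{col, udf}

// The UDF builds a Traffic fact per row, fires the rules on it, and returns
// the derived action, which Spark writes into a new "action" column
val applyRulesUdf = udf { (city: String, signal: String) =>
  val fact = Traffic(city, signal, action = "")
  DroolsHelper.applyRules(fact)
  fact.action
}

val trafficDf = spark.read.option("header", "true").csv("data/traffic.csv")
val resultDf  = trafficDf.withColumn("action", applyRulesUdf(col("city"), col("signal")))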
DataSet: With a Dataset, we first load the data into a DataFrame, give it a structured schema, and then use encoders to convert the DataFrame to a Dataset; the rules can then be applied with the Dataset map() function.
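A sketch of the Dataset variant under the same assumptions; the encoder for the Traffic case class comes from spark.implicits:

import org.apache.spark.sql.functions.{col, lit}
import spark.implicits._

// Load into a DataFrame, add an empty "action" column so the schema matches
// Traffic, then convert to a typed Dataset via the case-class encoder
val trafficDs = spark.read.option("header", "true").csv("data/traffic.csv")
  .select(col("city"), col("signal"))
  .withColumn("action", lit(""))
  .as[Traffic]

// map() runs on the executors; the hypothetical DroolsHelper fires the rules
val resultDs = trafficDs.map { fact =>
  DroolsHelper.applyRules(fact)
  fact
}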
Drools Invocation code: traffic.drl is the file that contains the Drools rules.
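A minimal sketch of the invocation used by the snippets above; the helper object and the session name are my assumptions, and the actual POC may wire this up differently:

import org.kie.api.KieServices
import org.kie.api.runtime.KieSession

// A Scala object is initialized once per executor JVM, so the KieSession is
// built lazily on each executor and never has to be serialized by Spark
object DroolsHelper extends Serializable {
  @transient lazy val kieSession: KieSession =
    KieServices.Factory.get()
      .getKieClasspathContainer        // picks up traffic.drl via kmodule.xml
      .newKieSession("TrafficSession") // hypothetical session name

  def applyRules(fact: Traffic): Unit = {
    kieSession.insert(fact)            // insert the fact into working memory
    kieSession.fireAllRules()          // run every rule defined in traffic.drl
  }
}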
Reference article for Drools invocation with Scala
Output: This is the output after the rules are applied; the action column is derived from the rules.