
Disclaimer: The views and opinions expressed in this article come from my own research and learning, and the work showcased here was created as part of a self-learning POC.

This Apache Spark with Drools integration POC was created to see whether an external Java-based rule engine can be fitted into the Spark framework. The article below explains the motivation behind integrating the Apache Spark cluster-computing framework with the Drools rule engine and the advantages of using the two together.

The POC details, along with the GitHub reference for my POC code, are given below.

Motivation:

Drools provides the following advantages:

1) Re-usability: All the business rules can be kept in their own component, de-coupled from the application, which increases re-usability.

2) Flexibility: When rules live inside the application code, every rule change requires retesting and redeployment; keeping them in a rule engine makes such changes far cheaper.

3) Simplified rules: Compared to an in-code implementation, rules kept in a separate engine like Drools are easier to understand, and with tools like Talend BRMS it is also easier to write new rules.

4) De-coupling logic from data: The main advantage we get is that the business logic is decoupled from the data, making it easy to adapt to new changes.

5) Optimized performance: Drools uses the Rete algorithm to optimize rule evaluation, and engine performance improves further with new releases.

Reference article explaining the advantages of Drools

Spark provides the following advantages:

1) In-memory cluster processing:

Spark is an in-memory distributed data-processing engine that can be 10 to 100x faster than MapReduce.

2) Distributed application of Drools:

When used with Spark, Drools rules can be applied to the data faster than in traditional single-node applications.

GitHub URL for POC code

Spark-Drools-POC GitHub URL

Note: This code has been tested in Eclipse using winutils.exe. It has not been tested on a cluster yet, but it should work with little tweaking, apart from possibly resolving serialization issues.

The architecture of the system:

The architecture of the system/application looks like this; we can call it a "Distributed Rule Engine" (powered by Apache Spark).

[Image: architecture diagram of the Distributed Rule Engine]


POC main code snippets with explanation:

Input Data:

This data set contains city-specific traffic-signal rule data.

[Image: sample of the city-specific traffic-signal input data]


RDD:

[Image: RDD code snippet]

With the RDD map function, we can take each generic row and create a case class object, Traffic. This object can be supplied to a KieSession object (Drools-API specific); calling fireAllRules() applies all the rules in the specified DRL file to the data set.
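As a rough illustration, a minimal sketch of this approach could look like the following. The column names, the input path, and the session name "trafficSession" are assumptions for illustration, not the exact code from the POC. A stateless session is created once per partition via mapPartitions so the non-serializable Drools session never has to be shipped across the cluster; its execute() call is the stateless equivalent of insert() plus fireAllRules().

```scala
import org.apache.spark.sql.SparkSession
import org.kie.api.KieServices
import scala.beans.BeanProperty

// Illustrative case class for one row of the traffic data set.
// @BeanProperty generates the getters/setters that the DRL rules expect.
case class Traffic(@BeanProperty signalId: String,
                   @BeanProperty city: String,
                   @BeanProperty color: String,
                   @BeanProperty var action: String)

object RddDroolsExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("spark-drools-rdd").master("local[*]").getOrCreate()

    val result = spark.sparkContext
      .textFile("data/traffic.csv")            // assumed input location
      .map(_.split(","))
      .map(a => Traffic(a(0), a(1), a(2), "")) // generic row -> case class object
      .mapPartitions { rows =>
        // One stateless session per partition: the Drools session is not
        // serializable, so it must be created on the executor side.
        val session = KieServices.Factory.get()
          .getKieClasspathContainer
          .newStatelessKieSession("trafficSession") // assumed name from kmodule.xml
        rows.map { t => session.execute(t); t }     // fires all rules in traffic.drl
      }

    result.collect().foreach(println)
    spark.stop()
  }
}
```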

DataFrame:

[Image: DataFrame code snippet]

With a DataFrame, we can apply the rules through a UDF; the response from the rule engine is used to fill a new column.
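A sketch of this variant, reusing the illustrative Traffic case class from the RDD example; the column names and session name remain assumptions. The UDF runs the rules for a single row and returns the derived action, which withColumn writes into a new column:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, udf}
import org.kie.api.KieServices

object DataFrameDroolsExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("spark-drools-df").master("local[*]").getOrCreate()

    val df = spark.read.option("header", "true").csv("data/traffic.csv")

    // Evaluate the rules for one row and return the derived action.
    // Note: this creates a session per row, which is simple but not the
    // cheapest option; sharing a KieBase would amortize the setup cost.
    val applyRules = udf { (signalId: String, city: String, color: String) =>
      val t = Traffic(signalId, city, color, "")
      KieServices.Factory.get()
        .getKieClasspathContainer
        .newStatelessKieSession("trafficSession")
        .execute(t)
      t.action
    }

    df.withColumn("action", applyRules(col("signalId"), col("city"), col("color")))
      .show()

    spark.stop()
  }
}
```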

DataSet:

[Image: Dataset code snippet]

With a Dataset, we first load the data into a structured DataFrame, then use encoders to convert the DataFrame to a Dataset, and apply the rules through the Dataset's map() function.
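A sketch of the Dataset variant under the same assumptions: the CSV is loaded as a structured DataFrame, the implicit encoders from spark.implicits._ convert it to a Dataset[Traffic], and a per-partition stateless session applies the rules, as in the RDD version:

```scala
import org.apache.spark.sql.{Dataset, SparkSession}
import org.apache.spark.sql.functions.lit
import org.kie.api.KieServices

object DatasetDroolsExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("spark-drools-ds").master("local[*]").getOrCreate()
    import spark.implicits._ // provides the encoder for the Traffic case class

    // Load a structured DataFrame, then convert it to a typed Dataset.
    val ds: Dataset[Traffic] = spark.read
      .option("header", "true")
      .csv("data/traffic.csv")
      .toDF("signalId", "city", "color") // assumed column names
      .withColumn("action", lit(""))     // placeholder for the rule output
      .as[Traffic]

    // Apply the rules with a per-partition stateless session.
    val result = ds.mapPartitions { rows =>
      val session = KieServices.Factory.get()
        .getKieClasspathContainer
        .newStatelessKieSession("trafficSession")
      rows.map { t => session.execute(t); t }
    }

    result.show()
    spark.stop()
  }
}
```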

Drools invocation code:

[Image: Drools invocation code snippet]

traffic.drl is the file that contains the Drools rules; a sketch of the rule-file format is shown below.
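The actual rules live in the repository's traffic.drl. Purely to illustrate the DRL format, a hypothetical rule over the Traffic fact could look like this (the package, import, and field names match the illustrative Traffic case class above, not necessarily the POC):

```
package rules

import Traffic

// Hypothetical rule: derive the action from the signal color.
rule "Red signal means stop"
    when
        t : Traffic( color == "RED" )
    then
        t.setAction( "STOP" );
end
```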

Reference article for Drools invocation with Scala


Output:

This is the output after the rules are applied; the action column is derived from the rules.

[Image: output data with the derived action column]