Disclaimer: The views and opinions expressed in this article come from my own research and learning, and the work showcased here is part of a self-learning POC that I created.
This Apache Spark with Drools integration POC was created to see whether an external Java-based rule engine can be fitted into the Spark framework. The article below explains the motivation behind integrating the Apache Spark cluster-computing framework with the Drools rule engine, and the advantages of using the two together.
The POC details, along with the GitHub link to my POC code, are given below.
Motivation:
Drools provides the following advantages:
1) Re-usability: All the business rules can be kept in their own component, de-coupled from the application, which increases re-usability.
2) Flexibility: When rules live inside the application code, every rule change requires retesting and redeployment effort; keeping them in Drools avoids this.
3) Simplified rules: Compared with an in-code implementation, rules in a separate rule engine like Drools are easier to understand, and with tools like Talend BRMS it is also easier to write new rules.
4) De-coupling logic from data: The main advantage we get is that the logic is decoupled from the data, making it easy to adapt to new changes.
5) Optimized performance: Drools uses the Rete algorithm to optimize the logical flow of rules, and the engine's performance improves further with each new release.
Reference article - explaining Drools advantages
Spark provides the following advantages:
1) In-memory cluster processing: Spark is an in-memory distributed data-processing engine, which is 10 to 100x faster than MapReduce.
2) Distributed application of Drools: When used with Spark, Drools rules can be applied to the data in a distributed fashion, which is faster than traditional single-node applications.
GitHub URL for POC code
Spark-Drools-POC GitHub URL
Note: This code has been tested in Eclipse using winutils.exe. It has not been tested on a cluster yet, but it should work fine, perhaps with a little tweaking in case of any serialization issues.
The architecture of the system:
The architecture of the system/application will look like this; we can call it a "Distributed Rule Engine" (powered by Apache Spark).
POC main code snippets with explanation:
Input Data: This dataset is city-specific traffic-signal data to which the rules are applied.
RDD: With the RDD map function, we can take each generic row and create a case-class object, Traffic. This object can be supplied to a KieSession object (from the Drools API), and with the fireAllRules() method all the rules in the specified DRL file are applied to the dataset.
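A minimal sketch of this RDD approach is given below; the Traffic fields (city, signal, action), the input path, and the KieSession name are my assumptions and are not taken from the actual POC code:

import org.apache.spark.sql.SparkSession
import org.kie.api.KieServices

// Hypothetical case class representing one row of the traffic data set
case class Traffic(city: String, signal: String, var action: String)

val spark = SparkSession.builder()
  .appName("spark-drools-poc")
  .master("local[*]")
  .getOrCreate()

// Parse each CSV line into a Traffic fact (assumed columns: city, signal)
val trafficRdd = spark.sparkContext.textFile("data/traffic.csv")
  .map(_.split(","))
  .map(cols => Traffic(cols(0), cols(1), action = ""))

// Build the KieSession once per partition so the non-serializable Drools
// objects are created on the executors instead of being shipped from the driver
val ruledRdd = trafficRdd.mapPartitions { rows =>
  val kieSession = KieServices.Factory.get()
    .getKieClasspathContainer          // loads traffic.drl from the classpath
    .newKieSession("TrafficSession")   // assumed session name from kmodule.xml
  rows.map { fact =>
    kieSession.insert(fact)
    kieSession.fireAllRules()          // rules set the fact's action field
    fact
  }
}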
DataFrame: With a DataFrame, we can apply the rules through a UDF and use the response to fill a new column.
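A sketch of the UDF variant, reusing the SparkSession and Traffic case class from the RDD snippet above; DroolsHelper is a hypothetical wrapper around the KieSession, sketched under "Drools Invocation code" below:

import org.apache.spark.sql.functions.{col, udf}

// The UDF builds a Traffic fact per row, fires the rules on it, and returns
// the derived action, which Spark writes into a new "action" column
val applyRulesUdf = udf { (city: String, signal: String) =>
  val fact = Traffic(city, signal, action = "")
  DroolsHelper.applyRules(fact)
  fact.action
}

val trafficDf = spark.read.option("header", "true").csv("data/traffic.csv")
val resultDf  = trafficDf.withColumn("action", applyRulesUdf(col("city"), col("signal")))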
DataSet: With a Dataset, we first load the data into a DataFrame, give it a structured schema, and then use encoders to convert the DataFrame to a Dataset; the rules can then be applied with the Dataset map() function.
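A sketch of the Dataset variant under the same assumptions; the encoder for the Traffic case class comes from spark.implicits:

import org.apache.spark.sql.functions.{col, lit}
import spark.implicits._

// Load into a DataFrame, add an empty "action" column so the schema matches
// Traffic, then convert to a typed Dataset via the case-class encoder
val trafficDs = spark.read.option("header", "true").csv("data/traffic.csv")
  .select(col("city"), col("signal"))
  .withColumn("action", lit(""))
  .as[Traffic]

// map() runs on the executors; the hypothetical DroolsHelper fires the rules
val resultDs = trafficDs.map { fact =>
  DroolsHelper.applyRules(fact)
  fact
}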
Drools Invocation code: traffic.drl is the file that contains the Drools rules.
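A minimal sketch of the invocation used by the snippets above; the helper object and the session name are my assumptions, and the actual POC may wire this up differently:

import org.kie.api.KieServices
import org.kie.api.runtime.KieSession

// A Scala object is initialized once per executor JVM, so the KieSession is
// built lazily on each executor and never has to be serialized by Spark
object DroolsHelper extends Serializable {
  @transient lazy val kieSession: KieSession =
    KieServices.Factory.get()
      .getKieClasspathContainer        // picks up traffic.drl via kmodule.xml
      .newKieSession("TrafficSession") // hypothetical session name

  def applyRules(fact: Traffic): Unit = {
    kieSession.insert(fact)            // insert the fact into working memory
    kieSession.fireAllRules()          // run every rule defined in traffic.drl
  }
}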
Reference article for Drools invocation with Scala
Output: This is the output after the rules are applied; the action column is derived from the rules.