{"
lower_case
":
"true"
,"
remove_brackets
":
"true"
,"
simple_chinese
":
"true"
,"
remove_blank
":
"true"
,"
remove_special
":
"true"
,"
ignore_order
":
"true"
,"
match_signal
":
"true"
}
Pitfall 1: the JSON parameters above cannot be passed to spark-submit as-is.
/data/app/spark/bin/spark-submit --name 测试 --class com.karakal.lanchao.process.ProcessData --driver-memory 6g --master yarn --deploy-mode cluster --executor-memory 8g --num-executors 2 --executor-cores 2 --files hdfs://hadoop-cluster-ha/lanchao/bigdata_support/songlist/testdata.txt hdfs://hadoop-cluster-ha/lanchao/bigdata_support/sparkjar.jar
/lanchao/bigdata_support/songlist/testdata.txt {"dest_catalog":"测试文件1","site":"tencent","song_settings":{"lower_case":"true","remove_brackets":"true","simple_chinese":"true","remove_blank":"true","remove_special":"true"},"artist_settings":{"lower_case":"true","remove_brackets":"true","simple_chinese":"true","remove_blank":"true","remove_special":"true","ignore_order":"true","match_signal":"true"}}
Passed this way, the arguments received inside the Spark program are wrong:
args(0)=/lanchao/bigdata_support/songlist/testdata.txt
args(1)={"dest_catalog"
args(2)="测试文件1"
args(3)="site"
Solution:
1. Wrap the whole JSON string in double quotes.
2. Escape every double quote inside it with a backslash (\).
/data/app/spark/bin/spark-submit --name 测试 --class com.karakal.lanchao.process.ProcessData --driver-memory 6g --master yarn --deploy-mode cluster --executor-memory 8g --num-executors 2 --executor-cores 2 \
--files hdfs://hadoop-cluster-ha/lanchao/bigdata_support/songlist/testdata.txt hdfs://hadoop-cluster-ha/lanchao/bigdata_support/sparkjar.jar \
/lanchao/bigdata_support/songlist/testdata.txt \
"{\"dest_catalog\":\"测试文件1\",\"site\":\"tencent\",\"song_settings\":{\"lower_case\":\"true\",\"remove_brackets\":\"true\",\"simple_chinese\":\"true\",\"remove_blank\":\"true\",\"remove_special\":\"true\"},\"artist_settings\":{\"lower_case\":\"true\",\"remove_brackets\":\"true\",\"simple_chinese\":\"true\",\"remove_blank\":\"true\",\"remove_special\":\"true\",\"ignore_order\":\"true\",\"match_signal\":\"true\"}}"
That solves passing the JSON as one argument, but it leads straight into the next pitfall.
Print the arguments inside the program and check the log output:
LogType:stdout
Log Upload Time:Fri Dec 08 14:32:01 +0800 2017
LogLength:423
Log Contents:
*********************************json参数输出***********************************
{"dest_catalog":"测试文件1","site":"tencent","song_settings":{"lower_case":"true","remove_brackets":"true","simple_chinese":"true","remove_blank":"true","remove_special":"true"},"artist_settings":{"lower_case":"true","remove_brackets":"true","simple_chinese":"true","remove_blank":"true","remove_special":"true","ignore_order":"true"
End of LogType:stdout
What? Where are the last two braces? The trailing "}}" has disappeared!
My first reaction was that spark-submit limits the argument length, but testing showed that is not the case.
After more experiments it turned out that a single "}" is received fine; the problem appears only when two closing braces sit next to each other. Having found no proper fix, the workaround for now is to keep the braces apart by reordering the JSON keys (moving "site" to the end so "}}" never occurs):
/data/app/spark/bin/spark-submit --name 测试 --class com.karakal.lanchao.process.ProcessData --driver-memory 6g --master yarn --deploy-mode cluster --executor-memory 8g --num-executors 2 --executor-cores 2 \
--files hdfs://hadoop-cluster-ha/lanchao/bigdata_support/songlist/testdata.txt hdfs://hadoop-cluster-ha/lanchao/bigdata_support/sparkjar.jar \
/lanchao/bigdata_support/songlist/testdata.txt \
"{\"dest_catalog\":\"测试文件1\",\"song_settings\":{\"lower_case\":\"true\",\"remove_brackets\":\"true\",\"simple_chinese\":\"true\",\"remove_blank\":\"true\",\"remove_special\":\"true\"},\"artist_settings\":{\"lower_case\":\"true\",\"remove_brackets\":\"true\",\"simple_chinese\":\"true\",\"remove_blank\":\"true\",\"remove_special\":\"true\",\"ignore_order\":\"true\",\"match_signal\":\"true\"},\"site\":\"tencent\"}"
With this, the program receives the complete JSON:
LogType:stdout
Log Upload Time:Fri Dec 08 14:50:32 +0800 2017
LogLength:447
Log Contents:
*********************************json参数输出***********************************
{"dest_catalog":"测试文件1","song_settings":{"lower_case":"true","remove_brackets":"true","simple_chinese":"true","remove_blank":"true","remove_special":"true"},"artist_settings":{"lower_case":"true","remove_brackets":"true","simple_chinese":"true","remove_blank":"true","remove_special":"true","ignore_order":"true","match_signal":"true"},"site":"tencent"}
End of LogType:stdout
This is not an ideal solution; if you know a better one, please leave a comment.
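One candidate workaround (an untested suggestion on my part, not from the original post): base64-encode the JSON before calling spark-submit so the argument never contains braces at all, then decode it again inside the driver:

```shell
JSON='{"song_settings":{"lower_case":"true"}}'
# Encode on the submitting side; base64 output uses only [A-Za-z0-9+/=],
# so neither "{{" nor "}}" can ever appear in the argument
ENCODED=$(printf '%s' "$JSON" | base64 | tr -d '\n')
# Decode inside the job (shown in shell here purely for illustration)
DECODED=$(printf '%s' "$ENCODED" | base64 -d)
printf '%s\n' "$DECODED"
```

The round trip is lossless, so the JSON keys no longer have to be reordered.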
The cause: when YARN launches the container, it rewrites the command line and replaces the parameter-expansion markers "{{" and "}}". The snippet below is an abridged reconstruction of Hadoop YARN's ContainerLaunch.expandEnvironment (the original is truncated here); on Linux, "{{" becomes "$" and "}}" is simply removed:
@VisibleForTesting
public static String expandEnvironment(String var,
    Path containerLogDir) {
  var = var.replace(ApplicationConstants.LOG_DIR_EXPANSION_VAR,
      containerLogDir.toString());
  // On Linux, "{{" (PARAMETER_EXPANSION_LEFT) is replaced by "$" and
  // "}}" (PARAMETER_EXPANSION_RIGHT) is removed -- so a trailing "}}" vanishes
  var = var.replace(ApplicationConstants.PARAMETER_EXPANSION_LEFT, "$");
  var = var.replace(ApplicationConstants.PARAMETER_EXPANSION_RIGHT, "");
  return var;
}
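The effect of that substitution on a JSON argument can be simulated directly with sed (a sketch of the transformation only, not the real YARN code path):

```shell
ARG='{"a":{"b":"c"}}'
# Mimic the Linux branch: "{{" -> "$", "}}" -> removed
MANGLED=$(printf '%s' "$ARG" | sed -e 's/{{/$/g' -e 's/}}//g')
printf '%s\n' "$MANGLED"   # {"a":{"b":"c"  -- the trailing braces are gone
```

This matches the symptom exactly: only adjacent closing braces are affected, while a lone "}" survives.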