" dest_catalog ": "测试文件1" , " site ": "tencent" , " song_settings ": {" lower_case ": "true" ," remove_brackets ": "true" ," simple_chinese ": "true" ," remove_blank ": "true" ," remove_special ": "true" } , " artist_settings ": {" lower_case ": "true" ," remove_brackets ": "true" ," simple_chinese ": "true" ," remove_blank ": "true" ," remove_special ": "true" ," ignore_order ": "true" ," match_signal ": "true" }

Pitfall 1: the JSON parameter above cannot be passed to spark-submit as-is.

/data/app/spark/bin/spark-submit --name 测试 --class com.karakal.lanchao.process.ProcessData --driver-memory 6g --master yarn --deploy-mode cluster --executor-memory 8g --num-executors 2 --executor-cores 2 \
--files hdfs://hadoop-cluster-ha/lanchao/bigdata_support/songlist/testdata.txt hdfs://hadoop-cluster-ha/lanchao/bigdata_support/sparkjar.jar \
/lanchao/bigdata_support/songlist/testdata.txt \
{"dest_catalog":"测试文件1","site":"tencent","song_settings":{"lower_case":"true","remove_brackets":"true","simple_chinese":"true","remove_blank":"true","remove_special":"true"},"artist_settings":{"lower_case":"true","remove_brackets":"true","simple_chinese":"true","remove_blank":"true","remove_special":"true","ignore_order":"true","match_signal":"true"}}

Passed this way, the arguments received inside the Spark program are wrong:

args(0)=/lanchao/bigdata_support/songlist/testdata.txt
args(1)={"dest_catalog"
args(2)="测试文件1"
args(3)="site"
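
For reference, output like the above can be produced by dumping every element of args on the driver side. A minimal sketch; ArgsDebug is a hypothetical stand-in for the real driver class, not the author's actual code:

object ArgsDebug {
  def main(args: Array[String]): Unit = {
    // Print each argument with its index to see exactly how the shell split the input
    args.zipWithIndex.foreach { case (arg, i) => println(s"args($i)=$arg") }
  }
}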

The fix:
1. Wrap the whole JSON string in double quotes.
2. Escape every inner double quote with a backslash (\").

/data/app/spark/bin/spark-submit --name 测试 --class com.karakal.lanchao.process.ProcessData --driver-memory 6g --master yarn --deploy-mode cluster --executor-memory 8g --num-executors 2 --executor-cores 2 \
--files hdfs://hadoop-cluster-ha/lanchao/bigdata_support/songlist/testdata.txt hdfs://hadoop-cluster-ha/lanchao/bigdata_support/sparkjar.jar \
/lanchao/bigdata_support/songlist/testdata.txt \
"{\"dest_catalog\":\"测试文件1\",\"site\":\"tencent\",\"song_settings\":{\"lower_case\":\"true\",\"remove_brackets\":\"true\",\"simple_chinese\":\"true\",\"remove_blank\":\"true\",\"remove_special\":\"true\"},\"artist_settings\":{\"lower_case\":\"true\",\"remove_brackets\":\"true\",\"simple_chinese\":\"true\",\"remove_blank\":\"true\",\"remove_special\":\"true\",\"ignore_order\":\"true\",\"match_signal\":\"true\"}}"

This solved passing the JSON as a single argument, but it led straight to the next pitfall.
Printing the argument inside the program and checking the YARN log:

LogType:stdout
Log Upload Time:Fri Dec 08 14:32:01 +0800 2017
LogLength:423
Log Contents:
*********************************json参数输出***********************************
{"dest_catalog":"测试文件1","site":"tencent","song_settings":{"lower_case":"true","remove_brackets":"true","simple_chinese":"true","remove_blank":"true","remove_special":"true"},"artist_settings":{"lower_case":"true","remove_brackets":"true","simple_chinese":"true","remove_blank":"true","remove_special":"true","ignore_order":"true"
End of LogType:stdout

What? Where did the last two braces go? The "}}" simply disappeared!
My first thought was that spark-submit limits the length of an argument, but testing showed that is not the case.
After more experimenting, a single "}" comes through fine; the bug only appears when two closing braces are adjacent. Lacking a better fix, the workaround for now is to keep the two braces apart, by moving the "site" field to the end of the JSON so the string no longer ends in "}}":

/data/app/spark/bin/spark-submit --name 测试 --class com.karakal.lanchao.process.ProcessData --driver-memory 6g --master yarn --deploy-mode cluster --executor-memory 8g --num-executors 2 --executor-cores 2 \
--files hdfs://hadoop-cluster-ha/lanchao/bigdata_support/songlist/testdata.txt hdfs://hadoop-cluster-ha/lanchao/bigdata_support/sparkjar.jar \
/lanchao/bigdata_support/songlist/testdata.txt \
"{\"dest_catalog\":\"测试文件1\",\"song_settings\":{\"lower_case\":\"true\",\"remove_brackets\":\"true\",\"simple_chinese\":\"true\",\"remove_blank\":\"true\",\"remove_special\":\"true\"},\"artist_settings\":{\"lower_case\":\"true\",\"remove_brackets\":\"true\",\"simple_chinese\":\"true\",\"remove_blank\":\"true\",\"remove_special\":\"true\",\"ignore_order\":\"true\",\"match_signal\":\"true\"},\"site\":\"tencent\"}"

With this, the program receives the complete JSON parameter:

LogType:stdout
Log Upload Time:Fri Dec 08 14:50:32 +0800 2017
LogLength:447
Log Contents:
*********************************json参数输出***********************************
{"dest_catalog":"测试文件1","song_settings":{"lower_case":"true","remove_brackets":"true","simple_chinese":"true","remove_blank":"true","remove_special":"true"},"artist_settings":{"lower_case":"true","remove_brackets":"true","simple_chinese":"true","remove_blank":"true","remove_special":"true","ignore_order":"true","match_signal":"true"},"site":"tencent"}
End of LogType:stdout

This is not an ideal solution; if you know a better one, please leave a comment.

Update on the root cause: the "}}" is not eaten by spark-submit itself. In yarn-cluster mode the launch command goes through the expandEnvironment method in the Hadoop YARN container-launch code, which treats "{{" and "}}" as parameter-expansion markers and rewrites them (on Linux "{{" becomes "$" and "}}" becomes the empty string), which is exactly why the adjacent closing braces vanish:

@VisibleForTesting
public static String expandEnvironment(String var, Path containerLogDir) {
    var = var.replace(ApplicationConstants.LOG_DIR_EXPANSION_VAR,
        containerLogDir.toString());
    ...
    // {{ and }} are parameter-expansion markers and get rewritten here
    var = var.replace(ApplicationConstants.PARAMETER_EXPANSION_LEFT, "$");
    var = var.replace(ApplicationConstants.PARAMETER_EXPANSION_RIGHT, "");
    return var;
}
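
The effect is easy to reproduce in isolation. A small sketch that mimics that substitution on the JSON argument (it imitates the rewrite, it is not the YARN code itself):

object BraceStripDemo {
  def main(args: Array[String]): Unit = {
    val arg = """{"song_settings":{"lower_case":"true"}}"""
    // Mimic the parameter-expansion rewrite: "{{" -> "$", "}}" -> ""
    val mangled = arg.replace("{{", "$").replace("}}", "")
    println(mangled) // prints {"song_settings":{"lower_case":"true"  -- the trailing }} is gone
  }
}

This also explains why reordering the keys works: once the object no longer ends with two adjacent closing braces, there is nothing for the rewrite to strip.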