contains hive json数组函数 hive解析json嵌套数组_mob64ca13f6bbea的技术博客_

link之家

链接快照平台

输入网页链接，自动生成快照
标签化管理网页链接

相关文章推荐

强健的企鹅 · 近代青岛旅游大事记· 1 年前 ·

风流倜傥的匕首 · 纪念潘光旦诞辰120周年研讨会：老一代学者的 ...· 1 年前 ·

悲伤的鸭蛋 · 【赠书】深入浅出Python量化交易实战 ...· 1 年前 ·

有情有义的帽子 · 沈阳市政府工作报告（2022年1月8日 ...· 1 年前 ·

苦闷的杯子 · 【Python爬虫】60秒破解百度文库收费限 ...· 1 年前 ·

在hive的内置函数中，提供了两个函数用来解析json，分别是get_json_object和json_tuple

以下出现的hive函数不懂的请到 hive2.0中常用的内置函数大全这篇博客去查看了解

首先来查看下get_json_object函数的详细信息

> desc function extended get_json_object;
tab_name
get_json_object(json_txt, path) - Extract a json object from path 
Extract json object from a json string based on json path specified, and return json string of the extracted json object. It will return null if the input js
on string is invalid.A limited version of JSONPath supported:
  $   : Root object
  .   : Child operator
  []  : Subscript operator for array
  *   : Wildcard for []
Syntax not supported that is worth noticing:
  ''  : Zero length string as key
  ..  : Recursive descent
  @   : Current object/element
  ()  : Script expression
  ?() : Filter (script) expression.
  [,] : Union operator
  [start:end:step] : array slice operator
Function class:org.apache.hadoop.hive.ql.udf.UDFJson
Function type:BUILTIN
Time taken: 1.33 seconds, Fetched: 18 row(s)

语法: get_json_object(string json_string, string path)
返回值: string
说明：解析json的字符串json_string,返回path指定的内容。如果输入的json字符串无效，那么返回NULL。

select get_json_object('[{"website":"","name":"浮云"},{"website":"/article/details/86606379","name":"hive中的开窗函数"},{"website":"/article/details/82470278","name":"利用hive统计连续登陆的用户数"}]'
, "$.[0].website"); -- 返回第一个
Time taken: 0.448 seconds, Fetched: 1 row(s)
select get_json_object('[{"website":"","name":"浮云"},{"website":"/article/details/86606379","name":"hive中的开窗函数"},{"website":"/article/details/82470278","name":"利用hive统计连续登陆的用户数"}]'
, "$.[0,1,2].website"); -- 返回第一、二、三个
["","/article/details/86606379","/article/details/82470278"]
Time taken: 0.585 seconds, Fetched: 1 row(s)
select get_json_object('[{"website":"","name":"浮云"},{"website":"/article/details/86606379","name":"hive中的开窗函数"},{"website":"/article/details/82470278","name":"利用hive统计连续登陆的用户数"}]'
, "$.[0:2:1].website"); -- 返回从下标为0-2的，步长为1的
["","/article/details/86606379","/article/details/82470278"]
Time taken: 0.585 seconds, Fetched: 1 row(s)

json解析函数：json_tuple

语法:json_tuple(json_string, path1, path2, …, pathn)
返回值: tuple
说明：解析json的字符串json_string,返回path1-pathn指定的内容。如果输入的json字符串无效，那么返回NULL。

select json_tuple('{"website":"/article/list/1","name":"浮云"}', 'website', 'name');
c0	c1
Time taken: 0.788 seconds, Fetched: 1 row(s)

虽然get_json_object函数能通过下标解析指定位置的json，但当不知道json数组有多少个时，或者数组长度不断变化时，无法进行解析

我们可以通过以下办法解析json数组

先通过正则表达式将数组中的,用;代替

select regexp_replace(
'[{"website":"","name":"浮云"},{"website":"/article/details/86606379","name":"hive中的开窗函数"},{"website":"/article/details/82470278","name":"利用hive统计连续登陆的用户数"}]', '\\}\\,\\{','\\}\\;\\{'
[{"website":"","name":"浮云"};{"website":"/article/details/86606379","name":"hive中的开窗函数"};{"web
site":"/article/details/82470278","name":"利用hive统计连续登陆的用户数"}]
Time taken: 0.458 seconds, Fetched: 1 row(s)

再通过正则表达式把数组中的[]去掉

select regexp_replace(
regexp_replace(
'[{"website":"","name":"浮云"},{"website":"/article/details/86606379","name":"hive中的开窗函数"},{"website":"/article/details/82470278","name":"利用hive统计连续登陆的用户数"}]', '\\}\\,\\{','\\}\\;\\{'
),'\\[|\\]',''
{"website":"","name":"浮云"};{"website":"/article/details/86606379","name":"hive中的开窗函数"};{"webs
ite":"/article/details/82470278","name":"利用hive统计连续登陆的用户数"}
Time taken: 0.521 seconds, Fetched: 1 row(s)

接下来用split函数按;进行分割

select 
split(
regexp_replace(
regexp_replace(
'[{"website":"","name":"浮云"},{"website":"/article/details/86606379","name":"hive中的开窗函数"},{"website":"/article/details/82470278","name":"利用hive统计连续登陆的用户数"}]', '\\}\\,\\{','\\}\\;\\{'
),'\\[|\\]',''
),'\\;'
["{\"website\":\"\",\"name\":\"浮云\"}","{\"website\":\"/article/details/86606379\",\"name\":\"hive中
的开窗函数\"}","{\"website\":\"/article/details/82470278\",\"name\":\"利用hive统计连续登陆的用户数\"}"]
Time taken: 0.434 seconds, Fetched: 1 row(s)

再利用explode函数将数组转换按列输出

select explode(
split(
regexp_replace(
regexp_replace(
'[{"website":"","name":"浮云"},{"website":"/article/details/86606379","name":"hive中的开窗函数"},{"website":"/article/details/82470278","name":"利用hive统计连续登陆的用户数"}]', '\\}\\,\\{','\\}\\;\\{'
),'\\[|\\]',''
),'\\;'
{"website":"","name":"浮云"}
{"website":"/article/details/86606379","name":"hive中的开窗函数"}
{"website":"/article/details/82470278","name":"利用hive统计连续登陆的用户数"}
Time taken: 0.416 seconds, Fetched: 3 row(s)

最后用json_tuple函数将explode转换的json格式的列进行解析

select json_tuple(json, 'website', 'name') 
select explode(
split(
regexp_replace(
regexp_replace(
'[{"website":"","name":"浮云"},{"website":"/article/details/86606379","name":"hive中的开窗函数"},{"website":"/article/details/82470278","name":"利用hive统计连续登陆的用户数"}]', '\\}\\,\\{','\\}\\;\\{'
),'\\[|\\]',''
),'\\;'
) as json
) fuyun;
c0	c1
Time taken: 0.497 seconds, Fetched: 3 row(s)

还可以通过自定义UDF函数实现，继承UDF类，实现evaluate方法，在evaluate方法中写实现功能的代码，再打成jar包，将jar包添加到hive中，在创建函数进行使用
具体代码实现：

package com.jsonarray.udf.json;
import org.apache.hadoop.hive.ql.exec.Description;
import org.apache.hadoop.hive.ql.exec.UDF;
import org.json.JSONArray;
import org.json.JSONException;
import java.util.ArrayList;
@Description(name = "json_array",
        value = "_FUNC_(array_string) - Convert a string of a JSON-encoded array to a Hive array of strings.")
public class UDFJsonAsArray extends UDF {
    public ArrayList<String> evaluate(String jsonString) {
        if (jsonString == null) {
            return null;
        try {
            JSONArray extractObject = new JSONArray(jsonString);
            ArrayList<String> result = new ArrayList<String>();
            for (int ii = 0; ii < extractObject.length(); ++ii) {
                result.add(extractObject.get(ii).toString());
            return result;
        } catch (JSONException e) {
            return null;
        } catch (NumberFormatException e) {
            return null;
}

将上面的代码进行编译打包，假设打包完的 jar 包名称为 jsonarray.jar，然后我们就可以如下使用这个函数了。
添加jar包

> add jar /opt/datas/ jsonarray.jar;

创建函数

> create temporary function json_array as 'com.jsonarray.udf.json.UDFJsonAsArray';

使用自定义UDF函数实现

select json_tuple(json, 'website', 'name') 
select explode(
json_array(
'[{"website":"","name":"浮云"},{"website":"/article/details/86606379","name":"hive中的开窗函数"},{"website":"/article/details/82470278","name":"利用hive统计连续登陆的用户数"}]'
) as json
) fuyun;
c0	c1
Time taken: 0.098 seconds, Fetched: 3 row(s)

python flask 如何修改配置常量后全局生效 flask环境配置

Flask项目的配置信息1.安装pipenv并创建虚拟环境windows: pip install pipenv Linux或macOS: sudo pip install pipenv如果有多个版本的python注意是安装到哪个python上，查看使用pip还是pip3。检查pipenv是否安装和版本号pipenv --version创建虚拟环境先在项目跟目录下创建.venv文件夹，这样使得虚