I am reading the contents of an API into a DataFrame using the PySpark code below in a Databricks notebook. I validated the JSON payload and the string is in valid JSON format. I suspect the error is due to the multi-line JSON string. The code below worked fine with other JSON API payloads.
Spark version < 2.2
import requests
user = "usr"
password = "aBc!23"
response = requests.get('https://myapi.com/allcolor', auth=(user, password))
jsondata = response.json()
from pyspark.sql import *
df = spark.read.option("multiline", "true").json(sc.parallelize([jsondata]))
df.show()
JSON payload:
{
  "colors": [
    {
      "color": "black",
      "category": "hue",
      "type": "primary",
      "code": {
        "rgba": [ ... ],
        "hex": "#000"
      }
    },
    {
      "color": "white",
      "category": "value",
      "code": {
        "rgba": [ ... ],
        "hex": "#FFF"
      }
    },
    {
      "color": "red",
      "category": "hue",
      "type": "primary",
      "code": {
        "rgba": [ ... ],
        "hex": "#FF0"
      }
    },
    {
      "color": "blue",
      "category": "hue",
      "type": "primary",
      "code": {
        "rgba": [ ... ],
        "hex": "#00F"
      }
    },
    {
      "color": "yellow",
      "category": "hue",
      "type": "primary",
      "code": {
        "rgba": [ ... ],
        "hex": "#FF0"
      }
    },
    {
      "color": "green",
      "category": "hue",
      "type": "secondary",
      "code": {
        "rgba": [ ... ],
        "hex": "#0F0"
      }
    }
  ]
}
Error:
pyspark.sql.dataframe.DataFrame = [_corrupt_record: string]
Thanks for the question and for using the Microsoft Q&A platform.
Note that a file offered as a JSON file is not a typical JSON file: each line must contain a separate, self-contained, valid JSON object. As a consequence, a regular multi-line JSON file will most often fail to parse.
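If it helps, here is a minimal sketch of the two usual workarounds (the storage path and variable names are only placeholders, and the multiLine option assumes Spark 2.2 or later):

# 1) Reading a multi-line JSON file from storage: enable the multiLine option
#    so the whole file is parsed as a single JSON document.
df_file = spark.read.option("multiLine", "true").json("/mnt/raw/colors.json")

# 2) Reading an in-memory payload: serialize the parsed response back to a
#    JSON string with json.dumps before parallelizing it, so Spark receives
#    one valid JSON string record.
import json
df_mem = spark.read.json(sc.parallelize([json.dumps(jsondata)]))
df_mem.show()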
You may check out the threads below, which address a similar issue:
https://stackoverflow.com/questions/38895057/reading-json-with-apache-spark-corrupt-record
https://www.mail-archive.com/user@spark.apache.org/msg59206.html
Hope this helps. Do let us know if you have any further queries.
Please don’t forget to Accept Answer and Up-Vote wherever the information provided helps you; this can be beneficial to other community members.
Hello @PRADEEPCHEEKATLA-MSFT,
Modified code:
spark.sql("set spark.databricks.delta.preview.enabled=true")
spark.sql("set spark.databricks.delta.retentionDurationCheck.preview.enabled=false")
import json
import requests
from requests.auth import HTTPDigestAuth
import pandas as pd
user = "username"
password = "password"
myResponse = requests.get('https://myapi.com/allcolor', auth=(user, password))
if myResponse.ok:
    # parse the API response
    jData = json.loads(myResponse.content)
    s1 = json.dumps(jData)
    # load data from the API into a pandas DataFrame
    x = json.loads(s1)
    data = pd.read_json(json.dumps(x))
    # create a Spark DataFrame from the pandas DataFrame
    spark_df = spark.createDataFrame(data)
    spark_df.show()
    # write the DataFrame out to blob storage as JSON
    spark.conf.set("fs.azure.account.key.<your-storage-account-name>.blob.core.windows.net", "<your-storage-account-access-key>")
    spark_df.write.mode("overwrite").json("wasbs://<container>@<storage-account-name>.blob.core.windows.net/<directory>/")
else:
    myResponse.raise_for_status()
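To make the intermediate shapes easier to follow, here is a minimal sketch (not part of the original run) for inspecting what pandas and Spark each infer from the payload before it is written out:

print(data.dtypes)        # pandas DataFrame: a single "colors" column, one row per color
spark_df.printSchema()    # schema Spark inferred from that pandas DataFrame
spark_df.show(truncate=False)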
The output is not in the same format as the source:
{
  "colors": {
    "color": "black",
    "category": "hue",
    "type": "primary",
    "code": {
      "rgba": [ ... ],
      "hex": "#000"
    }
  }
}
{
  "colors": {
    "color": "white",
    "category": "value",
    "code": {
      "rgba": [ ... ],
      "hex": "#FFF"
    }
  }
}
Could you please point out where I am going wrong? The output file I am storing in ADLS Gen2 does not match the source API JSON payload.
Thank you