I am reading the contents of an API into a DataFrame using the PySpark code below in a Databricks notebook. I validated the JSON payload and the string is in valid JSON format. I suspect the error is due to the multi-line JSON string. The code below worked fine with other JSON API payloads.
Spark version < 2.2
import requests
user = "usr"
password = "aBc!23"
response = requests.get('https://myapi.com/allcolor', auth=(user, password))
jsondata = response.json()
from pyspark.sql import *
df = spark.read.option("multiline", "true").json(sc.parallelize([jsondata]))
df.show()
JSON payload:
{
  "colors": [
    {
      "color": "black",
      "category": "hue",
      "type": "primary",
      "code": {
        "rgba": [ ... ],
        "hex": "#000"
      }
    },
    {
      "color": "white",
      "category": "value",
      "code": {
        "rgba": [ ... ],
        "hex": "#FFF"
      }
    },
    {
      "color": "red",
      "category": "hue",
      "type": "primary",
      "code": {
        "rgba": [ ... ],
        "hex": "#FF0"
      }
    },
    {
      "color": "blue",
      "category": "hue",
      "type": "primary",
      "code": {
        "rgba": [ ... ],
        "hex": "#00F"
      }
    },
    {
      "color": "yellow",
      "category": "hue",
      "type": "primary",
      "code": {
        "rgba": [ ... ],
        "hex": "#FF0"
      }
    },
    {
      "color": "green",
      "category": "hue",
      "type": "secondary",
      "code": {
        "rgba": [ ... ],
        "hex": "#0F0"
      }
    }
  ]
}
Error:
pyspark.sql.dataframe.DataFrame = [_corrupt_record: string]
Thanks for the question and for using the Microsoft Q&A platform.
Note that a file offered as a JSON file is not a typical JSON file: each line must contain a separate, self-contained, valid JSON object. As a consequence, a regular multi-line JSON file will most often fail to parse.
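If it helps, here is a minimal sketch of the two usual workarounds (the storage path and variable names are only placeholders, and the multiLine option assumes Spark 2.2 or later):

# 1) Reading a multi-line JSON file from storage: enable the multiLine option
#    so the whole file is parsed as a single JSON document.
df_file = spark.read.option("multiLine", "true").json("/mnt/raw/colors.json")

# 2) Reading an in-memory payload: serialize the parsed response back to a
#    JSON string with json.dumps before parallelizing it, so Spark receives
#    one valid JSON string record.
import json
df_mem = spark.read.json(sc.parallelize([json.dumps(jsondata)]))
df_mem.show()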
You may check out the threads below, which address a similar issue:
https://stackoverflow.com/questions/38895057/reading-json-with-apache-spark-corrupt-record
https://www.mail-archive.com/user@spark.apache.org/msg59206.html
Hope this helps. Do let us know if you have any further queries.
Please don’t forget to Accept Answer and Up-Vote wherever the information provided helps you; this can be beneficial to other community members.
Hello @PRADEEPCHEEKATLA-MSFT,
Modified code:
spark.sql("set spark.databricks.delta.preview.enabled=true")
spark.sql("set spark.databricks.delta.retentionDurationCheck.preview.enabled=false")
import json
import requests
from requests.auth import HTTPDigestAuth
import pandas as pd
user = "username"
password = "password"
myResponse = requests.get('https://myapi.com/allcolor', auth=(user, password))
if myResponse.ok:
    # parse the API response
    jData = json.loads(myResponse.content)
    s1 = json.dumps(jData)
    # load data from the API into a pandas DataFrame
    x = json.loads(s1)
    data = pd.read_json(json.dumps(x))
    # create a Spark DataFrame from the pandas DataFrame
    spark_df = spark.createDataFrame(data)
    spark_df.show()
    # write the DataFrame out to blob storage as JSON
    spark.conf.set("fs.azure.account.key.<your-storage-account-name>.blob.core.windows.net", "<your-storage-account-access-key>")
    spark_df.write.mode("overwrite").json("wasbs://<container>@<storage-account-name>.blob.core.windows.net/<directory>/")
else:
    myResponse.raise_for_status()
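To make the intermediate shapes easier to follow, here is a minimal sketch (not part of the original run) for inspecting what pandas and Spark each infer from the payload before it is written out:

print(data.dtypes)        # pandas DataFrame: a single "colors" column, one row per color
spark_df.printSchema()    # schema Spark inferred from that pandas DataFrame
spark_df.show(truncate=False)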
The output is not in the same format as the source:
{
  "colors": {
    "color": "black",
    "category": "hue",
    "type": "primary",
    "code": {
      "rgba": [ ... ],
      "hex": "#000"
    }
  }
}
{
  "colors": {
    "color": "white",
    "category": "value",
    "code": {
      "rgba": [ ... ],
      "hex": "#FFF"
    }
  }
}
Could you please point out where I am going wrong? The output file I am storing in ADLS Gen2 does not match the source API JSON payload.
Thank you