Hello,
I am unable to run a simple spark.sql() query (e.g. df = spark.sql("SELECT * FROM table1")) in Synapse notebooks. I can load and view the file without using SQL, but when I use spark.sql() I receive errors for all files, including CSV and Parquet file types.
I have tried different-sized clusters, restarting clusters, different Spark versions, and changing the language and code from PySpark to Scala. My workspace has permission to access my data in ADLS Gen2. Apologies if this question has already been answered elsewhere. Below is the error I am receiving.
AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.NullPointerException;
Traceback (most recent call last):
File "/opt/spark/python/lib/pyspark.zip/pyspark/sql/session.py", line 767, in sql
return DataFrame(self._jsparkSession.sql(sqlQuery), self._wrapped)
File "/opt/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in
call
answer, self.gateway_client, self.target_id, self.name)
File "/opt/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 75, in deco
raise AnalysisException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.NullPointerException;
Thanks
I am using the "Getting Started with Delta Lake" notebook from the Gallery that Microsoft created. I am getting an error in cell 33, see below. I didn't really do anything except run the notebook a cell at a time. I had already created a Delta database from a very simple script.
I can't believe the answer is to create a new workspace. For me that would be a lot of work. Is there no way to fix this?
Does anyone have a better answer?
Thanks,
AnalysisException: Table default.ManagedDeltaTable already exists
Traceback (most recent call last):
File "/opt/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 1158, in saveAsTable
self._jwrite.saveAsTable(name)
File "/home/trusted-service-user/cluster-env/env/lib/python3.8/site-packages/py4j/java_gateway.py", line 1304, in __call__
return_value = get_return_value(
File "/opt/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 117, in deco
raise converted from None
pyspark.sql.utils.AnalysisException: Table default.ManagedDeltaTable already exists
I am loading with the folder path, such as the following:
path = 'abfss://<container>@<account>.dfs.core.windows.net/<path>'
df = spark.read.option('header', 'true') \
    .option('delimiter', ',') \
    .csv(path)
dfSQL = spark.sql("SELECT * FROM df")
The Parquet file was loaded with this:
parquetFile = spark.read.parquet("<path>")
parquetFile.createOrReplaceTempView("parquetFile")
df = spark.sql("SELECT * FROM parquetFile")
df.show()
If I removed df = spark.sql("SELECT * FROM parquetFile") and ran parquetFile.show(), it would load.
Ah, no, you can't select from a DataFrame by name in that way.
If you want to use SQL on a DataFrame, then, as you did in a subsequent comment, you can call df.createOrReplaceTempView(<name of temp view>) and then run spark.sql("SELECT * FROM <name of temp view>").
Otherwise, you can use the DataFrame API directly with df.select(...), df.filter(...), etc.
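To illustrate, here is a minimal sketch of both approaches (the path and column name below are placeholders, not taken from your notebook):
# Read the Parquet file into a DataFrame (placeholder path).
parquetFile = spark.read.parquet("<path>")
# Option 1: register a temp view and query it with Spark SQL.
parquetFile.createOrReplaceTempView("parquet_view")
spark.sql("SELECT * FROM parquet_view").show()
# Option 2: use the DataFrame API directly, no temp view needed.
parquetFile.select("*").show()
parquetFile.filter(parquetFile["<column>"].isNotNull()).show()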
Hmmm, are you suggesting something like the following?
parquetFile = spark.read.parquet("<path>")
parquetFile.createOrReplaceTempView("df")
df2 = spark.sql("SELECT * FROM df")
df2.show()
This seems to shoot me back an error as well. Sorry for all the trouble with something simple!
parquetFile = spark.read.parquet("<path>")
display(parquetFile)
Here is the error that I get when running:
parquetFile = spark.read.parquet("<path>")
parquetFile.createOrReplaceTempView("df")
df2 = spark.sql("SELECT * FROM df")
df2.show()
Error:
AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.NullPointerException;
Traceback (most recent call last):
File "/opt/spark/python/lib/pyspark.zip/pyspark/sql/session.py", line 767, in sql
return DataFrame(self._jsparkSession.sql(sqlQuery), self._wrapped)
File "/opt/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in call
answer, self.gateway_client, self.target_id, self.name)
File "/opt/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 75, in deco
raise AnalysisException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.NullPointerException;
# Write data to a new managed catalog table.
# Old: data.write.format("delta").saveAsTable("ManagedDeltaTable")
# New: use overwrite mode so the write succeeds even when the table already exists.
data.write.format("delta").mode("overwrite").saveAsTable("ManagedDeltaTable")