I am trying to write PySpark code in a Jupyter Notebook and have run into a problem with the split() function. This is the DataFrame I am using:
import_csv=spark.read.csv("F:\\Learning\\PySpark\\DATA\\Iris.csv",header="true")
import_csv.show()
+---+-------------+------------+-------------+------------+-----------+
| Id|SepalLengthCm|SepalWidthCm|PetalLengthCm|PetalWidthCm|    Species|
+---+-------------+------------+-------------+------------+-----------+
|  1|          5.1|         3.5|          1.4|         0.2|Iris-setosa|
|  2|          4.9|         3.0|          1.4|         0.2|Iris-setosa|
|  3|          4.7|         3.2|          1.3|         0.2|Iris-setosa|
|  4|          4.6|         3.1|          1.5|         0.2|Iris-setosa|
|  5|          5.0|         3.6|          1.4|         0.2|Iris-setosa|
|  6|          5.4|         3.9|          1.7|         0.4|Iris-setosa|
|  7|          4.6|         3.4|          1.4|         0.3|Iris-setosa|
|  8|          5.0|         3.4|          1.5|         0.2|Iris-setosa|
|  9|          4.4|         2.9|          1.4|         0.2|Iris-setosa|
| 10|          4.9|         3.1|          1.5|         0.1|Iris-setosa|
| 11|          5.4|         3.7|          1.5|         0.2|Iris-setosa|
| 12|          4.8|         3.4|          1.6|         0.2|Iris-setosa|
| 13|          4.8|         3.0|          1.4|         0.1|Iris-setosa|
| 14|          4.3|         3.0|          1.1|         0.1|Iris-setosa|
| 15|          5.8|         4.0|          1.2|         0.2|Iris-setosa|
| 16|          5.7|         4.4|          1.5|         0.4|Iris-setosa|
| 17|          5.4|         3.9|          1.3|         0.4|Iris-setosa|
| 18|          5.1|         3.5|          1.4|         0.3|Iris-setosa|
| 19|          5.7|         3.8|          1.7|         0.3|Iris-setosa|
| 20|          5.1|         3.8|          1.5|         0.3|Iris-setosa|
+---+-------------+------------+-------------+------------+-----------+
only showing top 20 rows
I am trying to split each row of the RDD on "," (comma):
csv_split= import_csv.rdd.map(lambda var1: var1.split(','))
print(csv_split.collect())
The error I get is 'split' is not in list:
Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.runJob.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 65.0 failed 1 times, most recent failure: Lost task 0.0 in stage 65.0 (TID 65, DESKTOP-NPEMBC9, executor driver): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "C:\spark-3.0.0-bin-hadoop2.7\python\lib\pyspark.zip\pyspark\sql\types.py", line 1595, in __getattr__
    idx = self.__fields__.index(item)
ValueError: 'split' is not in list
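A possible reading of the traceback: it originates in __getattr__ in pyspark\sql\types.py, which suggests that each element of import_csv.rdd is a Row object (a tuple of already-parsed fields) and not a comma-separated string, so the lookup of split is treated as a field name and fails. Below is a minimal sketch of that behaviour in plain Python. It uses a namedtuple as a stand-in for Row, which is an assumption made only so the example runs without Spark; the name IrisRow and the three-field layout are hypothetical.

```python
from collections import namedtuple

# Stand-in for pyspark.sql.Row (assumption, so the sketch runs without Spark).
# Like Row, a namedtuple is a tuple subclass with named fields and no split().
IrisRow = namedtuple("IrisRow", ["Id", "SepalLengthCm", "Species"])
row = IrisRow("1", "5.1", "Iris-setosa")

# split() is a str method, so calling it on a row object fails.
try:
    row.split(",")
    split_failed = False
except AttributeError:
    split_failed = True

# The fields are already separate values; converting the row to a list is enough.
fields = list(row)

# If a comma-separated string is really wanted, join first and then split.
line = ",".join(str(value) for value in row)
fields_from_line = line.split(",")

print(split_failed, fields, fields_from_line)
```

In PySpark itself, the analogous options would presumably be import_csv.rdd.map(list), since the CSV reader has already separated the columns, or reading the raw lines with spark.sparkContext.textFile(...) and applying split(',') to each line.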