
Usage of the split() function in PySpark

2 followers

I am trying to write PySpark code in a Jupyter Notebook and ran into a problem while using the split() function. The DataFrame I am working with:

import_csv=spark.read.csv("F:\\Learning\\PySpark\\DATA\\Iris.csv",header="true")
import_csv.show()
+---+-------------+------------+-------------+------------+-----------+
| Id|SepalLengthCm|SepalWidthCm|PetalLengthCm|PetalWidthCm|    Species|
+---+-------------+------------+-------------+------------+-----------+
|  1|          5.1|         3.5|          1.4|         0.2|Iris-setosa|
|  2|          4.9|         3.0|          1.4|         0.2|Iris-setosa|
|  3|          4.7|         3.2|          1.3|         0.2|Iris-setosa|
|  4|          4.6|         3.1|          1.5|         0.2|Iris-setosa|
|  5|          5.0|         3.6|          1.4|         0.2|Iris-setosa|
|  6|          5.4|         3.9|          1.7|         0.4|Iris-setosa|
|  7|          4.6|         3.4|          1.4|         0.3|Iris-setosa|
|  8|          5.0|         3.4|          1.5|         0.2|Iris-setosa|
|  9|          4.4|         2.9|          1.4|         0.2|Iris-setosa|
| 10|          4.9|         3.1|          1.5|         0.1|Iris-setosa|
| 11|          5.4|         3.7|          1.5|         0.2|Iris-setosa|
| 12|          4.8|         3.4|          1.6|         0.2|Iris-setosa|
| 13|          4.8|         3.0|          1.4|         0.1|Iris-setosa|
| 14|          4.3|         3.0|          1.1|         0.1|Iris-setosa|
| 15|          5.8|         4.0|          1.2|         0.2|Iris-setosa|
| 16|          5.7|         4.4|          1.5|         0.4|Iris-setosa|
| 17|          5.4|         3.9|          1.3|         0.4|Iris-setosa|
| 18|          5.1|         3.5|          1.4|         0.3|Iris-setosa|
| 19|          5.7|         3.8|          1.7|         0.3|Iris-setosa|
| 20|          5.1|         3.8|          1.5|         0.3|Iris-setosa|
+---+-------------+------------+-------------+------------+-----------+
only showing top 20 rows

I am trying to split each row of the RDD on "," (comma):

csv_split= import_csv.rdd.map(lambda var1: var1.split(','))
print(csv_split.collect())

The error I get is 'split' is not in list:

Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.runJob.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 65.0 failed 1 times, most recent failure: Lost task 0.0 in stage 65.0 (TID 65, DESKTOP-NPEMBC9, executor driver): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "C:\spark-3.0.0-bin-hadoop2.7\python\lib\pyspark.zip\pyspark\sql\types.py", line 1595, in __getattr__
    idx = self.__fields__.index(item)
ValueError: 'split' is not in list
    
1 comment
Bro, your var1 is an RDD Row, not a comma-separated string. The spark.read.csv() method has already done that for you: it split the CSV and stored the parsed columns in your Spark DataFrame import_csv.
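
In other words, each element of import_csv.rdd is already a parsed Row, which is why calling split() on it raises this error. A minimal sketch of two working alternatives, assuming the same Iris.csv path as above: read the raw text so each line really is a comma-separated string, or use the already-parsed Row values directly.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Alternative 1: read the raw text, so each element is a
# comma-separated string and split(',') applies
raw_split = spark.sparkContext.textFile("F:\\Learning\\PySpark\\DATA\\Iris.csv") \
    .map(lambda line: line.split(','))
print(raw_split.take(3))

# Alternative 2: spark.read.csv() has already split the columns;
# each RDD element is a Row, so convert it to a list of values instead
import_csv = spark.read.csv("F:\\Learning\\PySpark\\DATA\\Iris.csv", header=True)
row_values = import_csv.rdd.map(lambda row: list(row))
print(row_values.take(3))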
python
apache-spark
pyspark
APRIL
Posted on 2020-07-21
1 answer
dsk
Posted on 2020-07-22
Accepted
0 upvotes

In Spark, split() is used to split a string/column into multiple parts based on a delimiter, and it returns an ArrayType (list) column.

from pyspark.sql import functions as F

df_b = spark.createDataFrame([('1', 'ABC-07-DEF')], ["ID", "col1"])
df_b = df_b.withColumn('post_split', F.split(F.col('col1'), "-"))
df_b.show()
+---+----------+--------------+
| ID|      col1|    post_split|
+---+----------+--------------+
|  1|ABC-07-DEF|[ABC, 07, DEF]|
+---+----------+--------------+

Additionally, you can use getItem() to extract individual columns from that array column, as shown below:

df_b = df_b.withColumn('split_col1', F.col('post_split').getItem(0)) \
           .withColumn('split_col2', F.col('post_split').getItem(1)) \
           .withColumn('split_col3', F.col('post_split').getItem(2))
df_b.show()
+---+----------+--------------+----------+----------+----------+
| ID|      col1|    post_split|split_col1|split_col2|split_col3|
+---+----------+--------------+----------+----------+----------+
|  1|ABC-07-DEF|[ABC, 07, DEF]|       ABC|        07|       DEF|
+---+----------+--------------+----------+----------+----------+
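
As a side note beyond the original answer: when the number of pieces is known, the same expansion can be generated with a loop instead of chaining withColumn() calls. A minimal sketch (the alias names are illustrative):

# Build split_col1..split_colN in one select;
# getItem(i) returns null if the array has fewer than i+1 elements
n = 3
df_b = df_b.select(
    'ID', 'col1',
    *[F.col('post_split').getItem(i).alias('split_col{}'.format(i + 1)) for i in range(n)]
)
df_b.show()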