I want to read a very large CSV file. All of my columns are floats, but pyarrow seems to be inferring int64.
How do I specify a dtype for all of the columns?
import gcsfs
import pyarrow.dataset as ds

fs = gcsfs.GCSFileSystem(project='my-google-cloud-project')
my_dataset = ds.dataset("bucket/foo/bar.csv", format="csv", filesystem=fs)
my_dataset.to_table()
It produces:
ArrowInvalid                              Traceback (most recent call last)
........py in <module>
----> 65 my_dataset.to_table()

File /opt/conda/envs/py39/lib/python3.9/site-packages/pyarrow/_dataset.pyx:491, in pyarrow._dataset.Dataset.to_table()
File /opt/conda/envs/py39/lib/python3.9/site-packages/pyarrow/_dataset.pyx:3235, in pyarrow._dataset.Scanner.to_table()
File /opt/conda/envs/py39/lib/python3.9/site-packages/pyarrow/error.pxi:143, in pyarrow.lib.pyarrow_internal_check_status()
File /opt/conda/envs/py39/lib/python3.9/site-packages/pyarrow/error.pxi:99, in pyarrow.lib.check_status()

ArrowInvalid: In CSV column #172: Row #28: CSV conversion error to int64: invalid value '6.58841482364418'
Posted on 2022-03-18 23:48:14
Pyarrow's dataset module reads CSV files in blocks (the default is 1MB, I believe) and processes those blocks in parallel. That makes column inference a bit tricky, and it handles this by inferring the data types from the first block only. So the error you are getting is very common: a column looks like integers in the first block of the file, but a later block contains decimal values in that column.
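You can reproduce that behaviour locally with pyarrow's streaming CSV reader; the following is a minimal sketch where the column name x, the generated data, and the 64 KB block size are made up for illustration. The first block only contains integer-looking values, so the column is inferred as int64, and the decimal value in a later block fails to convert:

import io
import pyarrow.csv as csv

# Column 'x' looks like int64 for roughly the first 200 KB; a decimal value appears at the end.
data = b"x\n" + b"1\n" * 100_000 + b"6.58841482364418\n"

# Shrink the block size so type inference never sees the decimal value,
# mimicking a large file where inference only looks at the first block.
read_options = csv.ReadOptions(block_size=64 * 1024)

reader = csv.open_csv(io.BytesIO(data), read_options=read_options)
reader.read_all()  # ArrowInvalid: CSV conversion error to int64: invalid value '6.58841482364418'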
If you know the column names ahead of time, you can specify the data types of the columns:
import pyarrow as pa
import pyarrow.csv as csv
import pyarrow.dataset as ds

column_types = {'a': pa.float64(), 'b': pa.float64(), 'c': pa.float64()}
convert_options = csv.ConvertOptions(column_types=column_types)
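To actually apply those ConvertOptions with the dataset API from the question, they can be wrapped in a ds.CsvFileFormat and passed as the format. This is a sketch that reuses the question's placeholder bucket path and project, with 'a', 'b', 'c' standing in for your real column names:

import gcsfs
import pyarrow as pa
import pyarrow.csv as csv
import pyarrow.dataset as ds

fs = gcsfs.GCSFileSystem(project='my-google-cloud-project')

# Explicit float64 types for the affected columns (placeholder names).
column_types = {'a': pa.float64(), 'b': pa.float64(), 'c': pa.float64()}
convert_options = csv.ConvertOptions(column_types=column_types)
custom_csv_format = ds.CsvFileFormat(convert_options=convert_options)

my_dataset = ds.dataset("bucket/foo/bar.csv", format=custom_csv_format, filesystem=fs)
my_dataset.to_table()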