Common pandas operations (time-series CSV)

kanchigo

1. Reading a CSV without an extra index column

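By default, read_csv() gives the DataFrame an integer index. If the CSV was itself written by pandas with the index included, its first column is that saved index and has no name, so after reading it appears as a column called 'Unnamed: 0', and every further save adds yet another such column. Dropping it with drop(columns=['name']) is awkward because you never named it. It is simpler to pass index_col=0 when reading, which tells pandas to use the first column as the index instead of adding a new one. Only do this when the first column really is an index, because otherwise a data column gets turned into the index.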

import pandas as pd
df = pd.read_csv('filename.csv', encoding='utf-8', index_col=0)

Likewise, when writing a CSV with to_csv() you can stop the index column from being written at all by passing index=False.

df.to_csv('C:/filepath/xxx.csv', index=False)
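
To see where the unnamed column comes from, here is a minimal sketch of a write/read round trip. It uses in-memory strings instead of real files, and the sample data (columns name and value) is made up for illustration.

```python
import io
import pandas as pd

# Hypothetical sample data standing in for 'filename.csv'.
raw = io.StringIO("name,value\na,1\nb,2\n")
df = pd.read_csv(raw)

# With the default index=True, the index is saved as an unnamed first column.
with_index = df.to_csv()
# With index=False, only the real columns are written.
without_index = df.to_csv(index=False)

# Re-reading the first version gives an extra 'Unnamed: 0' column ...
reread = pd.read_csv(io.StringIO(with_index))
print(list(reread.columns))

# ... unless index_col=0 tells pandas to use that column as the index.
reread_fixed = pd.read_csv(io.StringIO(with_index), index_col=0)
print(list(reread_fixed.columns))
```

In practice, writing with index=False in the first place is usually the cleaner fix, and index_col=0 is the way to recover files that were already saved with the index.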

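During analysis you will run into rows where some key value is missing. It is usually best to drop those rows entirely so they do not affect later steps.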

df1 = df.dropna(subset=['列名'])
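
A small sketch of what subset does, using made-up columns (task, time, note): only a missing value in the listed column removes the row, while gaps in other columns are left alone.

```python
import io
import pandas as pd

# Hypothetical data: row B has no time, row A has no note.
raw = io.StringIO("task,time,note\nA,2023-01-01,\nB,,x\nC,2023-01-03,y\n")
df = pd.read_csv(raw)

# Drop only the rows where the key column 'time' is missing.
df1 = df.dropna(subset=["time"])

print(len(df), len(df1))
print(list(df1["task"]))
```

Row A survives even though its note is empty, because note is not in subset.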

Some data depends on chronological order and needs to be sorted by time. In that case, first convert the time column to datetime, then sort.

df1['time'] = pd.to_datetime(df1['time'])
df1.sort_values('time', inplace=True)
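
The conversion matters because time strings sort character by character, not chronologically. A minimal sketch with made-up dates, assuming they are written as month/day/year:

```python
import pandas as pd

# Hypothetical data with dates stored as text (month/day/year).
df1 = pd.DataFrame({
    "time": ["1/10/2023", "1/2/2023", "1/9/2023"],
    "task": ["c", "a", "b"],
})

# Sorting the raw strings compares characters, so "1/10" comes before "1/2".
string_sorted = df1.sort_values("time")

# After converting to datetime, the sort is chronological.
df1["time"] = pd.to_datetime(df1["time"], format="%m/%d/%Y")
sorted_df = df1.sort_values("time").reset_index(drop=True)

print(list(string_sorted["task"]))
print(list(sorted_df["task"]))
```

Passing format explicitly is optional, but it avoids any ambiguity about whether the day or the month comes first.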

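Sometimes you need to add a new column, for example a column that is all 1s, so that after merging rows by time you can sum it to count occurrences.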

df['xxx number'] = 1
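
One way the counting might look, as a sketch: group the rows by calendar day and sum the column of 1s. The sample timestamps are made up, and grouping by day is only an assumption about what "merging rows by time" means for your data.

```python
import pandas as pd

# Hypothetical events: two on 2023-01-01 and one on 2023-01-02.
df = pd.DataFrame({
    "time": pd.to_datetime([
        "2023-01-01 08:00",
        "2023-01-01 12:00",
        "2023-01-02 09:00",
    ])
})

# Every row counts as one occurrence.
df["xxx number"] = 1

# Merge rows that fall on the same day and add up the 1s.
daily = df.groupby(df["time"].dt.date)["xxx number"].sum()
print(daily)
```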

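Use pandas to compute the difference between two time columns, that is, how long each task in the CSV dataset lasted. Both columns first need to be converted to datetime.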

df['end time'] = pd.to_datetime(df['end time'])
df['start time'] = pd.to_datetime(df['start time'])
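
Once both columns are datetimes, subtracting them gives a Timedelta column. A minimal sketch with made-up timestamps; the column names duration and duration seconds are hypothetical and only used for illustration.

```python
import pandas as pd

# Hypothetical tasks with start and end times stored as text.
df = pd.DataFrame({
    "start time": ["2023-01-01 08:00:00", "2023-01-01 09:00:00"],
    "end time": ["2023-01-01 08:30:00", "2023-01-01 11:15:00"],
})

df["end time"] = pd.to_datetime(df["end time"])
df["start time"] = pd.to_datetime(df["start time"])

# Subtracting two datetime columns yields a Timedelta per row.
df["duration"] = df["end time"] - df["start time"]

# Convert to plain seconds if a numeric value is easier to work with.
df["duration seconds"] = df["duration"].dt.total_seconds()
print(df[["duration", "duration seconds"]])
```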