How can I optimize this Python loop?


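I am running this code on a large CSV file (1.5 million rows). Is there any way to optimize it?

df is a pandas DataFrame. I take a row and want to know which of the following happens first within the next 1000 rows:

either I find my value + 0.0004, or I find my value - 0.0004.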

result = []
for row in range(len(df)-1000):
    start = df.get_value(row,'A')
    win = start + 0.0004
    lose = start - 0.0004
    for n in range(1000):
        ref = df.get_value(row + n,'B')
        if ref > win:
            result.append(1)
            break
        elif ref <= lose:
            result.append(-1)
            break
        elif n==999 :
            result.append(0)
The DataFrame looks like this:
         timestamp           A         B
0   20190401 00:00:00.127  1.12230  1.12236
1   20190401 00:00:00.395  1.12230  1.12237
2   20190401 00:00:00.533  1.12229  1.12234
3   20190401 00:00:00.631  1.12228  1.12233
4   20190401 00:00:01.019  1.12230  1.12234
5   20190401 00:00:01.169  1.12231  1.12236 
The result is: result[0,0,1,0,0,1,-1,1,…]
This works, but it takes a very long time to process such a large file.


E. Zeytinci:


           
            
Can you share the expected output?


           
            
Please post a sample DataFrame and your desired output.


           
            
@Cleb: I added a sample DataFrame; the output is a list containing the values 1, -1, or 0.


           
            
So, if B is greater than A by 0.004, you want to add 1 to the list; if it is less by 0.004, then -1; otherwise 0?


           
            
@Datanovice: I take the value A of a given row and want to know which case happens first in the following 1000 rows:
- I find a value in B > A+0.0004 => I return 1
- Or I find a value in B <= A-0.0004 => I return -1
- I find nothing in 1000 rows (A-0.0004 < B < A+0.0004) => I return 0


         
          python


         
          pandas


         
          performance


         
          for-loop


          
Accepted


          
           
To generate the value for the "first outlier", define the following function:
           
import numpy as np

def firstOutlier(row, dltRow = 4, dltVal = 0.1):
    ''' Find the value for the first "outlier". Parameters:
    row    - the current row
    dltRow - number of rows to check, starting from the current
    dltVal - delta in value of "B", compared to "A" in the current row
    '''
    rowInd = row.name                        # Index of the current row
    df2 = df.iloc[rowInd : rowInd + dltRow]  # "dltRow" rows from the current
    outliers = df2[abs(df2.B - row.A) >= dltVal]
    if outliers.index.size == 0:  # No outliers within the range of rows
        return 0
    return int(np.sign(outliers.iloc[0].B - row.A))
Then apply it to each row:
df.apply(firstOutlier, axis=1)
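This function relies on the fact that the DataFrame has an index consisting of
consecutive numbers starting from 0. So, having ind (the index of
any row), we can access that row by calling df.iloc[ind], and a slice of n rows
starting from that row by calling df.iloc[ind : ind + n].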
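If your DataFrame does not already have such an index (for example, after rows were filtered out), one possible way to restore it is `reset_index(drop=True)`. The sketch below uses a made-up frame named `demo`, which is my own assumption and not part of the answer:

```python
import pandas as pd

# Hypothetical frame whose index is NOT 0..n-1 (e.g. left over after filtering).
demo = pd.DataFrame({"A": [1.00, 0.99, 1.00], "B": [1.00, 1.00, 0.80]},
                    index=[10, 20, 30])

# Discard the old labels and renumber the rows 0, 1, 2, ...
demo = demo.reset_index(drop=True)

# Each row's label (row.name inside apply) now equals its position,
# so demo.iloc[label : label + n] really starts at that row.
labels = list(demo.index)
window = demo.iloc[1 : 1 + 2]   # 2 rows starting from the row labelled 1
print(labels)
print(window)
```
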
For my test, I set the default parameter values to:
dltRow = 4 - look at 4 rows, starting from the current one,
dltVal = 0.1 - look for rows with B column "distant by" 0.1
or more from A in the current row.
My test DataFrame is:
      A     B
0  1.00  1.00
1  0.99  1.00
2  1.00  0.80
3  1.00  1.05
4  1.00  1.20
5  1.00  1.00
6  1.00  0.80
7  1.00  1.00
8  1.00  1.00
The result (for my data and the default parameter values) is:
0   -1
1   -1
2   -1
3    1
4    1
5   -1
6   -1
7    0
8    0
dtype: int64
To suit your needs, change the default parameter values to 1000 and 0.0004, respectively.
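Instead of editing the defaults, the parameters can also be passed through `apply`, which forwards extra keyword arguments to the function. The sketch below is a self-contained variant run on the test data from this answer; the names `first_outlier`, `frame`, and `test_df` are my own assumptions (the original function reads a global `df` instead):

```python
import numpy as np
import pandas as pd

def first_outlier(row, frame, dltRow=4, dltVal=0.1):
    """Sign of the first B within dltRow rows that differs from row.A by >= dltVal."""
    row_ind = row.name                               # assumes a 0..n-1 index
    window = frame.iloc[row_ind : row_ind + dltRow]  # dltRow rows from the current
    outliers = window[(window.B - row.A).abs() >= dltVal]
    if outliers.index.size == 0:                     # nothing found in the window
        return 0
    return int(np.sign(outliers.iloc[0].B - row.A))

# The test data shown in the answer.
test_df = pd.DataFrame({
    "A": [1.00, 0.99, 1.00, 1.00, 1.00, 1.00, 1.00, 1.00, 1.00],
    "B": [1.00, 1.00, 0.80, 1.05, 1.20, 1.00, 0.80, 1.00, 1.00],
})

# Extra keyword arguments given to apply() are forwarded to first_outlier.
res = test_df.apply(first_outlier, axis=1, frame=test_df, dltRow=4, dltVal=0.1)
print(res.tolist())
```

For the data in the question, the call would presumably become `df.apply(first_outlier, axis=1, frame=df, dltRow=1000, dltVal=0.0004)`.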


           
            
             
              
               
                
I think you must change dlt to dltVal in "outliers = df2[abs(df2.B - row.A) >= dlt]". Thank you, I am testing your solution.


           
            
             
              
               
                
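Thanks. On 10,000 rows your code takes 8.5 seconds and mine takes 78.2 seconds. Processing 1.5 million rows will still take a long time, but this is a big improvement!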


          
           
            
             
              
               
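The idea is to loop through A and B while maintaining a sorted list of A values. Then, for each B, find the lowest A that loses and the highest A that wins. Since the list is sorted, each search is O(log(n)). Only those A's whose index lies within the last 1000 rows are used to set the result vector. After that, the A's that are no longer waiting for a B are removed from the sorted list to keep it small.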
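To make the two boundary searches concrete, here is a tiny sketch on a hand-made sorted list of `(A, index)` pairs. The values and the names `pending`, `first_lose`, and `last_win` are illustrative assumptions, not taken from the answer:

```python
import bisect

thresh = 0.3
# Sorted (A value, row index) pairs that are still waiting for a result.
pending = [(0.1, 0), (0.5, 1), (0.9, 2)]
b = 0.5  # the B value currently being processed

# Entries from first_lose onward have A >= b + thresh, i.e. b <= A - thresh: they lose.
first_lose = bisect.bisect_left(pending, (b + thresh, -1))
# Entries before last_win have A < b - thresh, i.e. b > A + thresh: they win.
last_win = bisect.bisect_right(pending, (b - thresh, -1))

winners = pending[:last_win]
losers = pending[first_lose:]
print(winners, losers)
```

The `-1` in the probe tuple sorts before every real row index, so ties on the A value fall on the intended side of each boundary.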
               
import numpy as np
import bisect
import time
N = 10
M = 3
#N=int(1e6)
#M=int(1e3)
thresh = 0.4
A = np.random.rand(N)
B = np.random.rand(N)
result = np.zeros(N)
l = []
t_start = time.time()
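for i in range(N):
    # Insert the current A (with its row index) into the sorted waiting list
    a = (A[i],i)
    bisect.insort(l,a)
    b = B[i]
    # Entries from firstLoseInd onward have A >= b + thresh, so they lose
    firstLoseInd = bisect.bisect_left(l,(b+thresh,-1))
    # Entries before lastWinInd have A < b - thresh, so they win
    lastWinInd = bisect.bisect_right(l,(b-thresh,-1))
    for j in range(lastWinInd):
        curInd = l[j][1]
        if curInd > i-M:  # only rows within the last M rows count
            result[curInd] = 1
    for j in range(firstLoseInd,len(l)):
        curInd = l[j][1]
        if curInd > i-M:
            result[curInd] = -1
    # Drop the entries matched by this B; they no longer wait for a result
    del l[firstLoseInd:]
    del l[:lastWinInd]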
t_done = time.time()
print(A)
print(B)
print(result)
print(t_done - t_start)
Here is a sample output:
[ 0.22643589  0.96092354  0.30098532  0.15569044  0.88474775  0.25458535
  0.78248271  0.07530432  0.3460113   0.0785128 ]
[ 0.83610433  0.33384085  0.51055061  0.54209458  0.13556121  0.61257179