Python - Show rows with repeated values in a csv file

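I have a .csv file with several columns, one of which is filled with random numbers, and I want to find duplicated values in that column. If there are any (which would be strange, but that is exactly what I want to check), I would like to display/store the complete rows in which those values appear.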

To make it clear, I have something like this:

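First, Whatever, 230, Whichever, etc
Second, Whatever, 11, Whichever, etc
Third, Whatever, 46, Whichever, etc
Fourth, Whatever, 18, Whichever, etc
Fifth, Whatever, 14, Whichever, etc
Sixth, Whatever, 48, Whichever, etc
Seventh, Whatever, 91, Whichever, etc
Eighth, Whatever, 18, Whichever, etc
Ninth, Whatever, 67, Whichever, etc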

And I would like to get:

Fourth, Whatever, 18, Whichever, etc
Eighth, Whatever, 18, Whichever, etc

To find the duplicated values, I store that column in a dictionary and count every key to find out how many times each one appears.

import csv
from collections import Counter, defaultdict, OrderedDict
with open(file, 'rt') as inputfile:
        data = csv.reader(inputfile)
        seen = defaultdict(set)
        counts = Counter(row[col_2] for row in data)
print "Numbers and times they appear: %s" % counts
  Counter({' 18 ': 2, ' 46 ': 1, ' 67 ': 1, ' 48 ': 1,...}) 
Now comes the problem, because I have not managed to link each key with its number of repetitions and work with it afterwards. If I do
for value in counts:
        if counts > 1:
            print counts
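I only get the keys, which is not what I want, and I get every value (not to mention that I want to print not only that, but the whole row...).
Basically, I am looking for a way to do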
If there's a repeated number:
        print rows containing those numbers
else:
        print "No repetitions"
Thanks in advance.
             
Would an answer using awk be OK? It is quite straightforward that way.
             
Informatico_Sano: Yes... as you wish. I have no background in awk at all, but since Python is a multi-paradigm language, maybe the solution can be adapted.
          
Tags: python, csv, dictionary
           
Informatico_Sano
Posted on 2014-07-11
           
Accepted answer
            
Try this; it might work for you.
            
entries = []            # third-column values seen so far
duplicate_entries = []  # third-column values seen more than once

# First pass: collect the values that occur more than once.
with open('in.txt', 'r') as my_file:
    for line in my_file:
        columns = line.strip().split(',')
        if columns[2] not in entries:
            entries.append(columns[2])
        else:
            duplicate_entries.append(columns[2])

# Second pass: print and save every line whose value is duplicated.
if len(duplicate_entries) > 0:
    with open('out.txt', 'w') as out_file:
        with open('in.txt', 'r') as my_file:
            for line in my_file:
                columns = line.strip().split(',')
                if columns[2] in duplicate_entries:
                    print(line.strip())
                    out_file.write(line)
else:
    print("No repetitions")
              
Informatico_Sano: This is exactly what I was asking for. Thanks!
              
joel goldstick: If your file is very long, you can avoid rereading it by saving the key (columns[2]) together with the complete line, as a tuple or dict, in duplicate_entries.
              
@joelgoldstick In that case you would miss the first occurrence of each duplicated entry.
              
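Informatico_Sano: Sorry @Sar009, but what should I do to store the duplicates in a file instead of printing them? I cannot get more than one line written to the file (or maybe each line is overwriting the previous one). Thanks in advance.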
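
The commenter's code is not shown, so this is only a guess at the cause: a file that ends up with a single line is often the result of calling open(..., 'w') inside the loop, which truncates the file on every iteration. A minimal Python 3 sketch of the alternative follows. The helper name write_lines and the temporary output path are made up for illustration and do not come from the thread.

```python
import os
import tempfile


def write_lines(out_path, lines):
    """Write every line in `lines` to out_path, one per row of the file."""
    # Open the file ONCE, before the loop. Calling open(out_path, 'w') inside
    # the loop would truncate the file on every iteration and keep only the
    # last line.
    with open(out_path, 'w') as out_file:
        for line in lines:
            out_file.write(line.rstrip('\n') + '\n')


# Hypothetical input: the duplicated rows from the question's sample.
duplicates = [
    "Fourth, Whatever, 18, Whichever, etc",
    "Eighth, Whatever, 18, Whichever, etc",
]

# A throwaway location is used here; in practice this would be 'out.txt'.
out_path = os.path.join(tempfile.mkdtemp(), 'out.txt')
write_lines(out_path, duplicates)

with open(out_path) as f:
    print(f.read())
```

If the file really has to be reopened for each line, mode 'a' (append) instead of 'w' avoids the truncation, at the cost of reopening the file every time.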


           
            
             
              
Informatico_Sano: @Sar009 Thank you very much
             
You should build your dictionary as shown below, so that repeated entries do not overwrite each other.
             
# rows_by_num maps each number (num) to the list of rows (val) containing it
if num not in rows_by_num:
    rows_by_num[num] = []
rows_by_num[num].append(val)
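Then loop through the list stored under each key in the dictionary; if the list for a key has more than one element, that key appears more than once.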
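
To make the idea above concrete, here is a small self-contained Python 3 sketch. It assumes the number sits in the third comma-separated column, as in the question's sample, and the function names group_rows_by_column and repeated_rows are invented for this illustration. Because every full line is stored under its key, the first occurrence of a repeated value is kept along with the later ones, and the input is read only once.

```python
from collections import defaultdict


def group_rows_by_column(lines, column=2):
    """Map each value of `column` to the list of full lines containing it."""
    rows_by_value = defaultdict(list)
    for line in lines:
        line = line.rstrip('\n')
        if not line:
            continue  # skip blank lines
        key = line.split(',')[column].strip()
        rows_by_value[key].append(line)
    return rows_by_value


def repeated_rows(lines, column=2):
    """Return the lines whose `column` value appears more than once."""
    groups = group_rows_by_column(lines, column)
    return [line for rows in groups.values() if len(rows) > 1 for line in rows]


sample = [
    "First, Whatever, 230, Whichever, etc",
    "Second, Whatever, 11, Whichever, etc",
    "Third, Whatever, 46, Whichever, etc",
    "Fourth, Whatever, 18, Whichever, etc",
    "Fifth, Whatever, 14, Whichever, etc",
    "Sixth, Whatever, 48, Whichever, etc",
    "Seventh, Whatever, 91, Whichever, etc",
    "Eighth, Whatever, 18, Whichever, etc",
    "Ninth, Whatever, 67, Whichever, etc",
]

for line in repeated_rows(sample) or ["No repetitions"]:
    print(line)
```

Note that the result is grouped by value rather than kept in strict file order, and the whole file is held in memory, which may matter for very large inputs.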


           
            
             
              
               
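Informatico_Sano: But I am already doing this with my piece of code. My problem appears in the loop you mention, because I do not know how to get hold of a key and the number of times it appears at the same time.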
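
For the specific difficulty described in this comment, one option is Counter.items(), which yields each key together with its count. The sketch below is Python 3 and uses a hand-written list of column values, modelled on the Counter output shown in the question, in place of the real file. In the question's loop, `counts > 1` compares the whole Counter object with 1 rather than a single count, which is why it never selects individual keys.

```python
from collections import Counter

# Stand-in for the third column of the file (values keep their spaces,
# as in the Counter output shown in the question).
values = [' 230 ', ' 11 ', ' 46 ', ' 18 ', ' 14 ', ' 48 ', ' 91 ', ' 18 ', ' 67 ']
counts = Counter(values)

# .items() yields (key, count) pairs, so both are available in the loop;
# iterating over `counts` alone yields only the keys.
repeated = {value: n for value, n in counts.items() if n > 1}

if repeated:
    for value, n in repeated.items():
        print("%s appears %d times" % (value.strip(), n))
else:
    print("No repetitions")
```

The keys of `repeated` can then be used to decide which rows of the file to print.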


          
           
            
             
              
Let's just loop through the file twice:

first, keep track of how many times each value of the 3rd column appears;
second, loop through the lines again, printing those whose 3rd column appeared more than once.

awk -F, 'FNR==NR{a[$3]++; next}
         {if (a[$3]>1) {print}}' file file

$ awk -F, 'FNR==NR{a[$3]++; next} {if (a[$3]>1) {print}}' a a
Fourth, Whatever, 18, Whichever, etc
Eighth, Whatever, 18, Whichever, etc
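
Since the asker wondered whether the awk solution could be adapted, here is a rough Python 3 counterpart of the same two-pass idea. It is a sketch rather than a translation endorsed by the answer: the function name repeated_lines is made up, and, like the awk version with -F, the column value is compared as-is, including any surrounding spaces.

```python
import os
import tempfile
from collections import Counter


def repeated_lines(path, column=2):
    """Return the lines of `path` whose comma-separated `column` repeats."""
    # Pass 1 (the FNR==NR block in awk): count each value of the column.
    with open(path) as f:
        counts = Counter(
            line.rstrip('\n').split(',')[column] for line in f if line.strip()
        )
    # Pass 2: keep the lines whose value was counted more than once.
    with open(path) as f:
        return [
            line.rstrip('\n')
            for line in f
            if line.strip() and counts[line.rstrip('\n').split(',')[column]] > 1
        ]


# Small demonstration file, written to a temporary directory.
demo_path = os.path.join(tempfile.mkdtemp(), 'a')
with open(demo_path, 'w') as f:
    f.write("Third, Whatever, 46, Whichever, etc\n"
            "Fourth, Whatever, 18, Whichever, etc\n"
            "Seventh, Whatever, 91, Whichever, etc\n"
            "Eighth, Whatever, 18, Whichever, etc\n")

for line in repeated_lines(demo_path) or ["No repetitions"]:
    print(line)
```

Column index 2 in Python corresponds to $3 in awk, because awk numbers fields from 1.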


          
           
            
             
              
               
                
                 
                 
Tal Folkman
Posted on 2022-04-07