Using grep to search a file and output only part of each line

Question 1

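I'm going through log files and trying to get less cluttered output in my final file. When I grep for a value, I'd like to format the output so that everything except the date and the URL is removed.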

For example, here is one line from the file:

Sep 25 08:07:51 10.20.30.40 FF_STUFF[]: 1545324890 1 55.44.33.22 10.9.8.7 - 10.60.154.41 http://website.com 0 BYF ALLOWED CLEAN 2 1 0 0 0 (-) 0 - 0 - 0 - 0 sqm.microsoft.com - [-] sqm.microsoft.com - - 0

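I'd like a grep, or a better command if necessary, that writes to a .txt file and lists only the date and the URL. So how do I tell it to take the first 15 characters, including spaces, then find the first http/https and take everything up to the first space? Every line has a different length, so I can't simply go by character position.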

So my output would be

Sep 25 08:07:51 http://website.com

Question 2

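You can't easily use grep's -o option here, because you have two patterns separated by a variable number of characters (and -o prints the entire matched part).

If you only want to extract the URL, this is enough: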

$ grep -oE 'https?:[^ ]+' file
http://website.com
But to extract both the date and the URL, the simplest solution is probably GNU awk:
$ awk '{ match($0, /https?:[^ ]+/, url); print $1, $2, $3, url[0]; }' file
Sep 25 08:07:51 http://website.com
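Here you print the first three fields ($1 through $3, separated by whitespace), then search for a URL with match() (assuming it contains no spaces, i.e. space characters are always properly escaped as either + or %20), and then print the first URL found (after the date).
If you have a POSIX awk (or run gawk with the --posix flag), the solution is a bit more verbose. Because POSIX match() does not support storing the matched parts in an array (the third argument, url), you have to extract the URL explicitly with substr() once a match is found: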
$ awk '{ match($0, /https?:[^ ]+/); print $1, $2, $3, substr($0, RSTART, RLENGTH); }' file
Sep 25 08:07:51 http://website.com
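One caveat with both awk commands above: on a line that contains no URL, match() fails and the command still prints the date followed by an empty field. Below is a minimal sketch of a guard, assuming you would rather skip such lines. The file name sample.log and the shortened log lines in it are made up for illustration.

```shell
# Create a small sample log (hypothetical data; the second line has no URL).
cat > sample.log <<'EOF'
Sep 25 08:07:51 10.20.30.40 FF_STUFF[]: 1545324890 1 55.44.33.22 10.9.8.7 - 10.60.154.41 http://website.com 0 BYF ALLOWED CLEAN
Sep 25 08:07:52 10.20.30.40 FF_STUFF[]: 1545324891 1 55.44.33.22 10.9.8.7 - no url on this line
EOF

# match() returns 0 when nothing matches, so using it as the pattern
# means only lines with a URL reach the print action.
result=$(awk 'match($0, /https?:[^ ]+/) { print $1, $2, $3, substr($0, RSTART, RLENGTH) }' sample.log)
printf '%s\n' "$result"
```

This uses only POSIX features (match(), RSTART, RLENGTH, substr()), so it should behave the same in gawk, mawk and BSD awk.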

Question 3

To complement @randomir's answer, we can also use sed:
$ sed 's/\(.\{15\}\).*\(https\?:\/\/[^ ]\+\).*/\1 \2/' < input.txt > output.txt
This pattern assumes that the first 15 characters make up the date and that the URL contains no spaces. It works for both http and https URLs.
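The \? and \+ operators in that command are GNU extensions to basic regular expressions. As a rough sketch, the same substitution can be written with extended regular expressions, assuming a sed that accepts -E (GNU and BSD sed both do). The sample line is shortened from the question, and the second input line is invented to show what happens when there is no URL.

```shell
line='Sep 25 08:07:51 10.20.30.40 FF_STUFF[]: 1545324890 1 55.44.33.22 10.9.8.7 - 10.60.154.41 http://website.com 0 BYF ALLOWED CLEAN'

# -E: extended regex, so ( ) { } ? + need no backslashes.
# -n plus the trailing p flag: print only lines where the substitution succeeded.
# "#" is used as the s delimiter so the slashes in the URL need no escaping.
result=$(printf '%s\n' "$line" "a line without any URL" |
  sed -nE 's#^(.{15}).*(https?://[^ ]+).*#\1 \2#p')
printf '%s\n' "$result"
```

Without -n and p, lines that do not match the pattern would be passed through unchanged, which may or may not be what you want in the output file.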
Edit: to address this, and for the sake of learning, we can also have sed perform grep-like line matching:
sed -n '/10\.45\.19\.151/p' < input.txt
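...will print the lines of input.txt that contain the IP address 10.45.19.151. The -n option suppresses the default output of every line. We combine it with the p command to print only the lines that match the pattern.
We can merge this approach with the first command to "grep" for lines and transform them using a single command: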
sed -n '/<line-match-pattern>/ s/<...>/<...>/ p' < input.txt
...will select only the lines that match <line-match-pattern>, perform the substitution, and output the result. To illustrate, here's an example using the information provided in the comment:
sed -n '/10\.45\.19\.151/ s/\(.\{15\}\).*\(https\?:\/\/[^ ]\+\).*/\1 \2/ p' \
    < messages-20171001 \
    > /backup/mikesanders-fwlog-10012017.txt

Question 4

awk '{match($0,/http[^com]*/);print $1,$2,$3,substr($0,RSTART,RLENGTH+3)}' Input_file
Explanation of the above code:
awk '{
match($0,/http[^com]*/);                   ## match() searches the line for "http" followed by any run of characters other than c, o or m. Note that [^com] is a bracket expression, not the string "com", so this relies on the host name containing none of those letters before ".com".
print $1,$2,$3,substr($0,RSTART,RLENGTH+3) ## Print fields 1 to 3 (the date and time), then the substring of the line that starts at RSTART and is RLENGTH+3 characters long; the extra 3 characters are the "com" that the regex stops in front of. RSTART and RLENGTH are built-in awk variables that match() sets when a match is found.
}' Input_file                              ## The name of the input file.
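To see how much the bracket expression in that regex depends on the host name, here is a small comparison. It is only a sketch: the file urls.log and both of its lines are hypothetical, and the second regex is the space-delimited one from the earlier POSIX awk answer.

```shell
# Two hypothetical lines: the first URL is the one from the question,
# the second has an "m" right after "http://".
printf '%s\n' \
  'Sep 25 08:07:51 x http://website.com 0' \
  'Sep 25 08:07:52 x http://microsoft.com 0' > urls.log

# [^com]* stops at the first c, o or m, so the second URL is cut short.
bracket=$(awk '{ match($0, /http[^com]*/); print substr($0, RSTART, RLENGTH+3) }' urls.log)

# [^ ]+ stops at the first space instead, whatever the host name is.
space=$(awk '{ match($0, /https?:[^ ]+/); print substr($0, RSTART, RLENGTH) }' urls.log)

printf '%s\n---\n%s\n' "$bracket" "$space"
```

The first variant prints http://website.com and then http://mic, while the second prints both URLs in full.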

Question 5

You can use grep -o to match each segment of the line that you want, then recombine the lines grep returns:
$ grep -Eo '^.{15}|https?://[^ ]+' f | paste - -
Sep 25 08:07:51 http://website.com
Note that FreeBSD and OSX ship an old version of GNU grep (2.5.1) that is buggy, so the date needs to be matched more explicitly:
$ grep -Eo '[A-Z][a-z]{2} ([0-9]{2}[ :]){3}[0-9]{2}|https?://[^ ]+' f | paste - -
Sep 25 08:07:51 http://website.com
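A workaround on FreeBSD is to use bsdgrep, which is functionally equivalent to GNU grep but without the bug. On macOS you may need to install a replacement with Homebrew or MacPorts ... or just use the POSIX awk solution from another answer.
In short, in both cases the regular expression consists of two expressions joined by an or-bar (|, just before https). The first subexpression matches your date, the second matches your URLs.
As long as every line of input contains text matching both elements, you should get two lines of output from grep for each log entry. paste then recombines them into a single line.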
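The "every line" condition matters in practice. The sketch below, using a made-up file paste-demo.log, shows the pairs drifting out of step when one line has no URL, and one possible fix: pre-filter with a plain grep. It also passes -d ' ' to paste, because paste joins with a tab by default.

```shell
# Hypothetical sample: the middle line has a date but no URL.
printf '%s\n' \
  'Sep 25 08:07:51 host http://a.example 0' \
  'Sep 25 08:07:52 host no url here' \
  'Sep 25 08:07:53 host http://b.example 0' > paste-demo.log

pattern='^.{15}|https?://[^ ]+'

# Naive pipeline: the URL-less line yields one match instead of two,
# so every pair after it is shifted.
naive=$(grep -Eo "$pattern" paste-demo.log | paste -d ' ' - -)

# Keep only lines that contain a URL before splitting them into matches.
filtered=$(grep -E 'https?://' paste-demo.log | grep -Eo "$pattern" | paste -d ' ' - -)

printf 'naive:\n%s\n\nfiltered:\n%s\n' "$naive" "$filtered"
```

In the naive output the second row pairs two dates with each other; the filtered output contains one date and URL per row.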

Question 6

As a one-liner:
msr -p my.log -t "^(.*?\d+:\d+:\d+).*?(https?://\S+).*" -o '$1 $2' -PIC > output.txt
If the first 15 characters are more reliable than the pattern "^(.*?\d+:\d+:\d+)", use "^(.{15})" instead, like:
-t "^(.{15}).*?(https?://\S+).*"
If you want to filter further, for example to lines containing the IP 10.9.8.7 as plain text (-x):
msr -p my.log -x 10.9.8.7 -t "^(.*?\d+:\d+:\d+).*?(https?://\S+).*" -o '$1 $2'
If more IPs must be included, such as 10.9.8.7 through 10.9.8.9, or the output needs further processing:
msr -p my.log -t "^(.*?\d+:\d+:\d+).*?(https?://\S+).*" -o '$1 $2' -PAC | msr -t "10\.9\.8\.[7-9]" -PAC > output.txt
msr.exe / msr.gcc* is a single exe tool for such ETL-like work (Load -> Extract -> Transform or Replace file) in my open project, about 1.6 MB, with no dependencies, cross-platform builds, and x86 / x64 versions.
Load files recursively (-r) and filter by directory name, file name, time, size, etc.:
-r -p dir1,dirN,file1,fileN  -f "\.(log|txt)$" --w1 2017-09-25
and --nf "excluded-files" --nd "excluded-directories", --s1 1.5MB --s2 30MB, --w2 "2017-09-30 22:30:50" etc.
Extract by general Regex, unlike sed or awk; exactly the same as C++ / C# / Java / Scala / etc.:
-t "^(.*?\d+:\d+:\d+).*?(https?://\S+).*"
To ignore case, add -i, like: -i -t or -it
Transform output like:
-o '$1 $2' for Linux or Cygwin / Powershell on Windows.
-o "$1 $2" for Windows CMD console window or *.bat / *.cmd files.