Have I been using grep wrong this whole time?

At some point in our lives we may stop and ask ourselves - are we doing the right thing? I've asked myself that question numerous times, most recently - am I using grep wrong?

Let's start from the beginning - what is grep?

grep is a command-line pattern-searching tool on Linux that reads a file, searches it for the pattern you provide, and prints the matching lines. I use it quite a lot, though only in simple ways, nothing fancy. Here is its description from the man pages[1]:

DESCRIPTION
       grep searches for PATTERNS in each FILE.  PATTERNS is one or more
       patterns separated by newline characters, and grep prints each
       line that matches a pattern.  Typically PATTERNS should be quoted
       when grep is used in a shell command.

       A FILE of “-” stands for standard input.  If no FILE is given,
       recursive searches examine the working directory, and
       nonrecursive searches read standard input.

Okay, that was straightforward. Now, the reason I'm questioning my decisions - I often use grep in combination with cat. So what is cat?

cat, besides being a domestic species of small carnivorous mammal[2], is also a popular command-line tool on Linux, used primarily (at least, this is how I use it) for printing the contents of a file. Here is the short description from the man pages[3]:

DESCRIPTION
       Concatenate FILE(s) to standard output.

       With no FILE, or when FILE is -, read standard input.

How do I use them together, you may ask?

I cat a file and pipe its output to grep, which searches it for a pattern. Something like the snippet below:

$ cat some-file.txt | grep <pattern>

And what is wrong with the above command? Well, apart from spinning up an additional process and typing a bit more, nothing, at least as far as I can tell.
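For comparison, here is a minimal sketch of the same search written three ways. The sample file and its contents are made up for illustration (they are not from my real log), and the snippet assumes bash:

```shell
#!/usr/bin/env bash
# Hypothetical sample file; the contents are invented for this illustration.
sample=$(mktemp)
printf '%s\n' 'E0111 first error' 'I0111 info line' 'E0111 second error' > "$sample"

# 1. cat reads the file and writes it into a pipe; grep reads from the pipe.
via_cat=$(cat "$sample" | grep 'E0111')

# 2. grep opens the file itself: one process, no pipe.
via_arg=$(grep 'E0111' "$sample")

# 3. The shell opens the file and hands it to grep as standard input.
via_stdin=$(grep 'E0111' < "$sample")

# All three variables now hold the same two matching lines.
printf '%s\n' "$via_arg"
rm -f "$sample"
```

With a single input file, all three forms print the same lines; they differ only in which process opens the file and in how many processes are started.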

This got me thinking - why don't I just use grep on its own? Would that be the faster way to search? As it turns out, this very question came up in a Reddit post[4], and the person who started the thread was quite annoyed that people were doing it wrong - the cat | grep way instead of grep alone. The linked blog post was no longer available, so I had to consult the Wayback Machine to get to the source article[5]. Okay, it seems I've been doing it wrong this whole time. Never mind that, though; maybe I can run some tests and find out for sure whether I was doing it wrong.

Let's find out. Below is the output of the first test I ran on my CentOS 7 machine:

[user@host]$ du -sh kubelet.log
1.4G    kubelet.log
[user@host]$ time cat kubelet.log | grep "E0111" > cat_grep.log

real    0m24.291s
user    0m3.778s
sys     0m7.941s
[user@host]$ time grep "E0111" kubelet.log > grep.log

real    0m22.256s
user    0m0.676s
sys     0m10.507s
[user@host]$ wc -l cat_grep.log
288355 cat_grep.log
[user@host]$ wc -l grep.log
288355 grep.log

The first command shows the size of the log file. As you can see, it is a big one, taking up 1.4G on the machine. The second command uses time to measure how long the whole pipeline takes to run[6].

In the first part, I'm running cat, piping it to grep, and writing the matches to the cat_grep.log file. Why? Because I want to count the lines that grep found, for the sake of comparison.

The second part is almost the same as the first, but instead of running cat, I'm running grep directly and writing to grep.log. The last two commands, wc -l, just print the number of lines in each file[7].
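Matching line counts are a good sign, but they don't prove the two outputs are identical. A minimal sketch of a stricter check with cmp follows; the tiny sample.log here is a made-up stand-in for the real log, and the snippet assumes bash:

```shell
#!/usr/bin/env bash
# Hypothetical stand-ins for kubelet.log, cat_grep.log and grep.log.
workdir=$(mktemp -d)
printf '%s\n' 'E0111 a' 'I0111 b' 'E0111 c' > "$workdir/sample.log"

cat "$workdir/sample.log" | grep 'E0111' > "$workdir/cat_grep.log"
grep 'E0111' "$workdir/sample.log" > "$workdir/grep.log"

# cmp -s prints nothing and reports through its exit status:
# 0 when the files are byte-for-byte identical, 1 when they differ.
if cmp -s "$workdir/cat_grep.log" "$workdir/grep.log"; then
    result="identical"
else
    result="different"
fi
echo "$result"
rm -rf "$workdir"
```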

The thing with the above test is that it might not be a fair one, because both commands write the matching lines to a separate file, so the timings depend on how busy the disk is at that moment. I ran it several times and got different numbers each time: sometimes cat | grep was faster, and other times grep alone was.

However, if we leave out the writing to disk and just pipe the output into a wc command, the numbers are a bit different:

[user@host]$ time cat kubelet.log | grep "E0111" | wc -l
288355

real    0m10.072s
user    0m2.320s
sys     0m6.277s

[user@host]$ time grep "E0111" kubelet.log | wc -l
288355

real    0m15.221s
user    0m1.224s
sys     0m7.518s

Each time I ran this test, the time command reported a shorter run for cat | grep. That was really interesting to me, especially because I had expected grep alone to be faster. Maybe the reason is that I piped everything into wc to keep the output short. Okay, let's run it one last time, but this time without the last pipe:

[user@host]$ time cat kubelet.log | grep "E0111" 
...
...
real    6m37.486s
user    0m43.496s
sys     0m18.369s

[user@host]$ time grep "E0111" kubelet.log 
...
...
real    6m58.121s
user    0m44.362s
sys     0m23.814s

The last test also shows cat | grep finishing first, although here the matches are printed to the terminal and the gap is only about 20 seconds out of nearly 7 minutes. I understand that much more is going on below the surface when we run each of the commands above. As to why cat | grep comes out faster? I can't give a proper answer right now, because I don't know. I might explore this in some other post(s).
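One way to make such a comparison less noisy is to repeat each variant several times against a file that has already been read once. The following is only a sketch under stated assumptions: it needs bash (it uses the time keyword on a loop), the generated file is a hypothetical stand-in for kubelet.log, and the run count is arbitrary.

```shell
#!/usr/bin/env bash
# Hypothetical test data: a small generated file standing in for kubelet.log.
log=$(mktemp)
awk 'BEGIN {
    for (i = 1; i <= 20000; i++) {
        print "E0111 error line " i
        print "I0111 info line " i
    }
}' > "$log"

runs=5   # arbitrary; more runs give a steadier picture

# Read the file once so both variants start with it in the page cache.
cat "$log" > /dev/null

# The output goes to wc -l, as in the test above, rather than to /dev/null:
# GNU grep can detect /dev/null as its output and stop at the first match,
# which would make the timing meaningless.
echo "cat | grep:"
time for _ in $(seq "$runs"); do
    count_cat=$(cat "$log" | grep 'E0111' | wc -l)
done

echo "grep alone:"
time for _ in $(seq "$runs"); do
    count_grep=$(grep 'E0111' "$log" | wc -l)
done

echo "matches: $count_cat (cat | grep), $count_grep (grep alone)"
rm -f "$log"
```

The timings will differ from machine to machine and from run to run, so this only helps to compare the two variants under the same conditions; it does not explain the difference.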

For now, I'm going to keep using my cat | grep pattern and maybe use grep alone from time to time, when I get bored with typing. I'm totally okay with that, because I feel that in this case there is no right or wrong! :)

Footnotes


  1. https://www.man7.org/linux/man-pages/man1/grep.1.html

  2. https://en.wikipedia.org/wiki/Cat

  3. https://www.man7.org/linux/man-pages/man1/cat.1.html

  4. https://www.reddit.com/r/linux/comments/b1fqp/stop_piping_cat_into_grep/

  5. https://web.archive.org/web/20130402064017/http://www.rootninja.com/stop-piping-cat-into-grep/

  6. https://man7.org/linux/man-pages/man1/time.1.html

  7. https://man7.org/linux/man-pages/man1/wc.1.html