Ghost in the subshell
Bash Subshells
Submitted by Mark Clarke on Thu, 08/03/2017 - 14:27
Bash is a deceptively simple interpreter: seemingly easy to use, but harder to master in its subtleties. But this is what makes free software so much more fun and intellectually stimulating than software rushed out a corporate door under deadlines and a minimum-viable-product mentality.
One of those shell subtleties that takes a while to grok is subshells, in this case subshells invoked via pipes. Most people are familiar with bash pipes and redirection of "stdin" and "stdout". If not, you should attend one of our Linux training courses :).
With pipes we can architect solutions from various components (commands, shell functionality, etc.) to perform a task impossible to achieve with any single tool alone, and for which no specific utility exists. One subtlety about pipes is that, besides redirecting input and output streams, the shell invokes a subshell for each command in the pipeline, i.e. child processes are created that run in parallel with each other rather than serially. Let's look at a trivial example using pipes.
When we see a pipeline such as
cat /etc/passwd | tr a-z A-Z
instinctively we assume the "cat" command is run, then its output is passed to "tr", and then the "tr" command is run. Although it might appear to us that the commands are running serially, this is a misconception. In fact both commands are running concurrently; it is just that the command on the right is waiting on its standard input stream for data, which it is getting from the command on the left. To prove this we can use a pipeline with two commands where the right command is not dependent on the left command's output.
To demo this we will also introduce the "time" command. As its name suggests, it provides timing information on commands, e.g.
time sleep 2
should show an elapsed time slightly longer than 2 seconds. So if we took two sleep commands and piped them together like so
time sleep 2 | sleep 2
What time would we expect to see? If they ran serially it would be slightly more than 4 seconds, but if they ran in parallel it would be slightly more than 2 seconds. As you will see if you run the command, the time is slightly more than 2 seconds - showing the commands are running in parallel.
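We can also see the separate processes directly. Here is a quick sketch, assuming bash 4 or later, where $BASHPID reports the PID of the current (sub)shell while $$ always reports the parent shell's PID:
echo "parent shell: $$"
echo "left side:  $BASHPID" | { cat; echo "right side: $BASHPID"; }
All three PIDs printed will differ, because each side of the pipe runs in its own child process.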
So how can we use this behaviour? Let's say you have a long list of numbers you need to sum: how would you do this in bash, and how can we optimise it?
First, some explanation of the examples to follow, so that we can separate the new commands from the subshell concept we are trying to show. To simulate a "list of numbers" we will generate a sequence from 1 to 9,000,000 using the "seq" command, and then sum them all up. ("seq 1 9000000")
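If "seq" is new to you, it simply prints a sequence of integers, one per line. For example:
seq 1 5           # prints 1 2 3 4 5, one per line
seq 1 5 | wc -l   # prints 5 - a handy sanity check of the range size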
Yes, that's right: basically we are going to calculate the sum of a range of integers. We are using a "large" range from 1 to 9 million so we can actually see the difference when we optimise our initial approach later. For a processor doing billions of operations a second this workload is no challenge. On faster machines you may need to increase the range to be able to observe any improvement.
We could, of course, do this calculation in constant time, O(1), using Gauss's formula n(n+1)/2, but that would reduce the didactic value of this post :) So to add up the range generated by "seq" we will use "awk", and the "time" command to get some readings for comparison.
Note: because we are using a pipeline of commands and passing it to the "time" command, we need to let "time" know we want to measure the execution time of the entire pipeline, not just the first command, so we wrap our pipeline up in a "bash -c" call. We also have to escape the "awk" script's $ variables so bash doesn't try to interpret them as bash variables. Later, when we put this in a script, it will read much better.
time bash -c "seq 1 9000000 | awk 'BEGIN { SUM=0 } { SUM=SUM+\$0 } END{ printf \"%13.0f\", SUM }'"
Running this 4 times on my desktop gave me an average time of 0.7355s. We can check the accuracy of the calculation with Gauss's formula -
echo $(( 9000000 * (9000000 + 1) / 2 ))    # prints 40500004500000, matching our awk total
So how can we speed this up? As already stated, pipelines do more than "funnel stdout to stdin"; they also invoke each command in a subshell, a child process of the parent process. Essentially all the commands are running in parallel, as we explained above. The entire pipeline exits once all the subshell processes return, so if there are no dependencies, the longest-running process in the pipeline determines the run time of the pipeline.
We can use the fact that each command runs in a separate process to split our number range into several subsets, have concurrent processes sum the individual subsets, and then gather and aggregate the results, in a map/reduce, scatter/gather type of algorithm.
We are going to achieve this by dividing our sequence into 3 subsets and launching an explicit subshell to sum each subset. We will use a named pipe to gather the results of the subshells and aggregate them. (If this were a file containing a long list of numbers we could use something like "sed -n '1,Np' numbers.txt" to apportion the file, as sketched below.)
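For example, assuming a hypothetical numbers.txt holding all 9,000,000 values, one per line, the slices could be carved out like this:
sed -n '1,3000000p' numbers.txt          # first third
sed -n '3000001,6000000p' numbers.txt    # second third
sed -n '6000001,9000000p' numbers.txt    # final third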
First let's create a named pipe with the "mkfifo" command. A named pipe is just like a bash pipe except that its lifetime is independent of any command that uses it, and it can be written to by more than one command at a time and read from by more than one command at a time. We are going to use this ability to be written to by more than one command to gather the results of our separate subshells.
mkfifo ~/pipe
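A quick way to convince yourself of the FIFO's behaviour before we use it in anger:
echo "hello" > ~/pipe &   # the writer blocks until a reader opens the pipe
cat ~/pipe                # prints "hello", releasing the writer
Note the writer has to be backgrounded (or run in a second terminal), otherwise it would sit blocked waiting for a reader.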
With our pipe established we can use it in our process as follows:
time bash -c "seq 1 3000000 | awk 'BEGIN { SUM=0 } { SUM=SUM+\$0 } END { printf \"%13.0f\\n\", SUM }' > ~/pipe | seq 3000001 6000000 | awk 'BEGIN { SUM=0 } { SUM=SUM+\$0 } END { printf \"%13.0f\\n\", SUM }' > ~/pipe | seq 6000001 9000000 | awk 'BEGIN { SUM=0 } { SUM=SUM+\$0 } END { printf \"%13.0f\\n\", SUM }' > ~/pipe | awk 'BEGIN {SUM=0} {SUM=SUM+\$0} END {printf \"%13.0f\" ,SUM }' <~/pipe"
Sjo, that's a lot of text, so let's break it down. The first 3 core command sequences are just the same as our initial command, except the results are redirected to the named pipe. Each "step" handles a subset of the range 1 to 9,000,000 (shown here with the escaping removed):
seq 1 3000000 | awk 'BEGIN { SUM=0 } { SUM=SUM+$0 } END { printf "%13.0f\n", SUM }' > ~/pipe
Since the stdout of each one of these steps is written to the named pipe, they all run independently. In fact each "step" invokes two "linked" subshells to get its job done.
The last step aggregates the totals from the 3 steps by reading from the pipe:
awk 'BEGIN { SUM=0 } { SUM=SUM+$0 } END { printf "%13.0f\n", SUM }' < ~/pipe
On my system the average time over 4 iterations was 0.27325s. This is a massive improvement over the 0.7355s it took to complete the task in a single pipeline - more than 2.5 times faster (0.7355 / 0.27325 ≈ 2.7).
To make this easier to read and less repetitive, we can put the subset-sum step into a function and invoke it in a script:
#!/bin/bash
set -e
set -o pipefail

function sum {
    seq $1 $2 | awk 'BEGIN { SUM=0 } { SUM=SUM+$0 } END { printf "%13.0f\n", SUM }';
}

sum 1 3000000 > ~/pipe |
    sum 3000001 6000000 > ~/pipe |
    sum 6000001 9000000 > ~/pipe |
    awk 'BEGIN { SUM=0 } { SUM=SUM+$0 } END { printf "%13.0f\n", SUM }' < ~/pipe

exit 0
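To run it, save the script as, say, parallel-sum.sh (the name is purely illustrative) and make sure the FIFO exists first:
mkfifo ~/pipe 2>/dev/null   # create the FIFO if it does not already exist
chmod +x parallel-sum.sh
time ./parallel-sum.sh
rm ~/pipe                   # clean up the FIFO when done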
Understanding subshells in Bash is one of those concepts that takes time to really grok. It's quick to pick up the pipe character as a way to redirect stdin and stdout, but it is so much more subtle than just that. There are tools such as GNU parallel which take this kind of parallel processing to a whole new level.
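As a teaser, here is a rough, untested sketch of the same scatter/gather sum using GNU parallel, if you have it installed. The -N2 option feeds the range bounds to each job in pairs, and a final awk aggregates the partial sums as before:
parallel -N2 "seq {1} {2} | awk '{ s += \$0 } END { printf \"%.0f\n\", s }'" \
    ::: 1 3000000 3000001 6000000 6000001 9000000 \
    | awk '{ s += $0 } END { printf "%.0f\n", s }'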