I’ve been tepidly looking for documentation on which cores are the slow ones and which the fast ones when looking at “top” or “htop” output for the Odroid XU4. It has 8 cores: 4 slower Cortex-A7 cores and 4 faster Cortex-A15 cores. Well, I got tired of the occasional poking around not yielding much, so decided to just test it myself.
In theory, an operating system can be tuned for maximum performance or for minimum energy consumption, as desired. The intent of ARM’s “big.LITTLE” architecture is to let you make a system using an SBC (Single Board Computer) with a big.LITTLE chip in it and have it use very little power when idle, but ramp up to high performance when needed. The idea being that you use the A7 (lower power and lower speed) cores until they are not enough, THEN you jump up to the A15 cores to “get ‘er done”.
But in the “htop” command, the usage bars typically are in cores 5, 6, 7 & 8, with only minor blips up in cores 1-4 when doing normal things like running a browser and having a terminal window open. If the “little cores first” idea held, those busy cores 5-8 ought to be the low-power A7s, which would make cores 1-4 the A15 cores. But if so, why, when something demanding launched, would it start in cores 5-8 and then stay there?
Well, the answer is that cores 5-8 are the high performance A15 cores, and Debian just starts things in them most of the time by default. Now this chip does have frequency scaling and some other power management built into it, so that isn’t as wasteful as you might think. Running an A15 core at a low clock rate is not going to use much power at all, and ramping up the clock with demand lets you easily gain performance without a lot of fancy scheduler work to move the process to a different type of core.
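If you want to watch that scaling happen yourself, the standard Linux cpufreq files will show the current clock of each core (assuming the kernel build on this board exposes the usual sysfs interface, which most Odroid images do):

# current clock of each core in kHz (cpu0-3 = htop 1-4, cpu4-7 = htop 5-8)
cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_cur_freq

That prints one number per core; given the core layout found below, the first four lines would be the A7 cluster and the last four the A15 cluster, so you can see the big cores idling down at a low clock when nothing much is going on.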
In essence, the XU4 under Debian (Devuan uplift or not) really acts like a quad-core 2 GHz A15 chip that can glue on 4 more cores of 1.4 GHz A7 performance if needed; and it sporadically tosses very small tasks to the A7 cores under normal use (there are tiny blips of use of them in htop).
I’m OK with that. This board runs from mains power, so it isn’t like I need to save every watt from a Li-ion battery… I’d rather have the bigger cores running and avoid the scheduler action and context switch penalty.
I made a little script called “looper”. It lets me load up a core with a task that has no I/O, so it pegs the CPU at 100% “doing nothing really” without interacting with other shared systems (like I/O) that might cause it to enter wait states. “bcat” is a little script that prints out my personal scripts from my “bin” directory. A specialized “cat” (concatenate and print) command, if you will. Saves me typing a “cd ~/bin” when I want to look at one of my scripts ;-) Over the years, that kind of ‘mini script’ can save hours of typing…
chiefio@odroidxu4:~$ bcat bcat
cd $HOME/bin
more $*
I have another one, ‘cmd’, that does a “cd” (change directory) into my bin directory and pops up a given command into the vi editor (or opens a new session to make a new command) and then sets the permission bits to ‘executable’.
chiefio@odroidxu4:~$ bcat cmd
cd $HOME/bin
vi $1
allow $1
And yes, “allow” is another one of mine. It does a chmod +x $1 … I actually wrote it first, because I’d write something and forget to set the execute bits and I just wanted to say “allow it, damn it”… so I did ;-)
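Given that description, a minimal sketch of what “allow” amounts to is just:

# set the execute bits on the named script
chmod +x $1

One line, but it saves remembering (and typing) the chmod incantation every time.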
But that’s all a digression to explain why I have “bcat looper” to show what is in “looper”. It also nicely shows how shell scripting is a threaded interpretive language, and some of the advantages of a library of “words” you create to do things for you in Linux / Unix land. In many ways, the Linux I use is full of commands different from the ones everyone else uses. My own shorthand. So I first wanted to just allow things to run, then I wanted a command to make commands, then one to just look at them, then… All of them saving me typing, and remembering for me exactly which option flags I wanted to set.
But, back at “looper”:
chiefio@odroidxu4:~$ bcat looper
i=0
lim=$1
while [ $i -lt $lim ]
do
   i=$(( i+1 ))
done
echo $i
It basically just takes in one argument, “$1”, that is the loop limit. I used 1,000,000 in my runs. It has a counter, $i, that gets initialized to zero. Then it has a “do math only” loop from 0 to $lim. (I could have just used $1 there, but $i and $1 look similar, so I stuffed the value into $lim, which is easier to notice as different and as being a limit.)
So the “while” loop runs as long as the counter is less than the limit; once the limit is reached, it goes on down to print out, or “echo”, the final value reached by the loop counter. Inside the loop, I just increment the loop counter by 1 each pass. Basically, it’s a count-to-a-million loop, given the parameter I passed into it. Again, not liking to type things over and over: why type the argument “1000000” 8 times when I can make a new word to do that for me? I named it loop1, for “loop 1 million”.
chiefio@odroidxu4:~$ bcat loop1
time looper 1000000
I also put the “time” command in front of it, so it will report the “real” or elapsed time, how much was “user” time in the script, and how much was “sys” (system) time doing overhead to run the script. I then launched 8 of these into the background:
loop1 > loop1& loop1 > loop2& loop1 > loop3& loop1 > loop4& loop1 > loop5& loop1 > loop6& loop1 > loop7& loop1 > loop8 &
Note that the “&” puts a given command running in the background, and the “greater than” sign sends the output to a file (in this case sending 1000000 into each file; a kind of silly thing to do really, but it keeps that output out of the way of the report from the “time” command, which by default does not go to the “standard output” but comes to your screen on the “standard error” device). So this long line rapidly launches 8 jobs and sends the output of each to a different file in my current working directory.
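As an aside, if you ever do want that timing report in a file instead of on the screen, redirecting the standard error of a command group captures it. A quick sketch, assuming bash and made-up file names:

# the timing report goes to stderr, so redirect the whole group
{ time looper 1000000 ; } > loop.out 2> loop.time

The count lands in loop.out and the real / user / sys lines land in loop.time.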
I hit “enter” and watch “htop”. The usage bars go to 100%, starting in cores 5-8, then filling cores 1-4. After 28 seconds, the first batch of four completed; then after 1 minute 2 seconds, the second batch finished. Of interest to me was that, once running in a slower A7 core, even after the A15 cores were no longer busy, the processes stayed in those cores. The scheduler didn’t move them. I would expect that any tight loop would be treated that way, only moving a process on an interrupt of some sort. That also likely explains why they set up the OS to start things on the A15 cores. It would take some new code to do the “take an interrupt and move a process to a faster core IF it is at 100% in a slow core and an A15 is available”, and that code isn’t written yet. Basically, they didn’t want to fool around with the scheduler in the first porting effort to this board (or nobody volunteered to do it – schedulers are tricky things).
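By the way, you can watch that stickiness directly without staring at htop: “ps” will report which core each process is currently on in its PSR column (kernel numbering 0-7, so htop’s cores 5-8 show up here as 4-7). Something like:

# show pid, current core, elapsed time, and command name for the loopers
ps -o pid,psr,etime,comm -C looper

assuming the jobs show up under the script’s name on your system.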
So here’s the output to the screen:
real    0m27.959s
user    0m27.800s
sys     0m0.015s

real    0m28.246s
user    0m28.135s
sys     0m0.005s

real    0m28.284s
user    0m28.130s
sys     0m0.020s

real    0m28.316s
user    0m27.995s
sys     0m0.005s

real    1m1.838s
user    1m1.815s
sys     0m0.015s

real    1m1.845s
user    1m1.835s
sys     0m0.010s

real    1m1.850s
user    1m1.825s
sys     0m0.005s

real    1m2.338s
user    1m2.220s
sys     0m0.015s
You can see that the 4 A15 cores finished first, then I got to sit here staring at the htop display showing cores 1-4 pegged at 100% for another half minute until the last 4 results came in.
All in all, from first sitting down at the keyboard until results were known was about 4 minutes. Far far faster than writing it up in this article. That’s what I like about shell scripting in *nix, and personal mini-script tools. Things can be done “right quick” ;-)
So now I know a bit more about how to use this board. Make sure, IFF I’m going to load it up with hard tasks, to launch the 4 biggest, slowest-running ones first, then load up the rest of the cores.
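If I wanted to force the issue rather than depend on launch order, “taskset” can pin a job to a chosen set of cores. Note it uses the kernel’s 0-7 numbering, so htop’s cores 5-8 are taskset’s 4-7. A sketch, with made-up output file names:

taskset -c 4-7 loop1 > big1 &    # pin a big job onto the A15 cluster
taskset -c 0-3 loop1 > small1 &  # pin a small job onto the A7 cluster

That takes the scheduler’s first-dispatch choice out of the picture entirely.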
Still TBD is to run the test with, say, 16 loopers running and see if taking interrupts has them all finish about equally, or if the scheduler has core-type stickiness based on the core type of first dispatch.
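That test is about a one-liner to set up; a sketch of it:

# launch 16 loopers, each with its own output file
for n in $(seq 1 16); do loop1 > out$n & done

which would start roughly two loopers per core and let the scheduler sort out the rest.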
This is not the kind of information the average desktop user would need to know, but it does matter if using the board for ‘distcc’ compiles of whole operating systems (where you load up all the cores in the cluster with jobs and final completion time depends on how fast each job gets done) or in running things like models and simulations. But now I know. Load up 4 “big ones” and let the rest take the small ones.