Author Archives: zo0ok

Testing Mackmyra

There has been a Swedish malt whisky distillery since 1999: Mackmyra. They have released a large number of small series of single malt whiskies and people can “buy” their own casks as well. But I am not interested in those now, I will focus on Mackmyra standard products and try to answer the simple question: are they any good?

My expectation is that a standard Mackmyra is comparable to common Scottish malts. For that level of quality I would be willing to pay a little premium (for Mackmyra being a small, new, Swedish distillery).

I bought the following Mackmyra single malts (no age indication):

  • Mackmyra Brukswhisky (hard to translate, but the cheapest one)
  • Mackmyra Svensk Ek (Swedish Oak)
  • Mackmyra Svensk Rök (Swedish Smoke)

My testing and tasting method is simple. On each testing occasion I try the Mackmyra and a Scottish malt that I expect to be similar. The idea is to decide if the Mackmyra is comparable, better or worse. Note that I did the three testings on different days.

1. Mackmyra Brukswhisky vs Glenturret 10 years old
Glenturret is one of the single malt whiskies that they let me try in The Whisky Experience in Edinburgh. My bottle says 10 years old and not much more. It does not get more standard when it comes to Scottish Single malt, I think.

Appearance: Very similar, the Mackmyra being slightly paler.

Aroma: Glenturret has a richer, sweeter aroma, but also one I don’t find entirely pleasant (it smells like a blend, to me). Mackmyra is more subtle, and a bit more fruity (not sweet, perhaps pear).

Taste: Glenturret is quite bitter, fading away with time. Mackmyra has some bitterness, tastes a bit of wood (young/dry/burnt), and fades away quicker. At second try the Glenturret reveals more fruitiness. Adding a little bit of water to the Mackmyra brings out much more fruitiness and that pear I felt in my nose. Adding water to the Glenturret: it has some spiciness and heaviness and improves a little as I slowly finish the small glasses.

Badness: Both of them have very little badness. The Mackmyra tastes slightly too young (the freshly cut and slightly burnt dry wood, like the smell in a carpenter shop). The Glenturret, on the other hand, is a little bit chemical and too bitter.

Conclusions: The Glenturret tastes older, and ridiculous as it may be – it tastes more Scottish. The younger Mackmyra is a bit different, but it clearly tastes like a single malt.

Winner: no winner. You can serve me Glenturret or Mackmyra – I will be equally satisfied.

2. Mackmyra Svensk Ek vs Clynelish 14 years old
Both Mackmyra Svensk Ek and Clynelish 14 years old are about 46% strong. The Clynelish I got from a package of three Classic Malts.

Appearance: Very similar, Mackmyra slightly paler.

Aroma: Clynelish clean and elegant. Mackmyra more fruit and vanilla (it’s probably oak). Clynelish a bit heavier and sweeter.

Taste: Clynelish quite thin, a little bitter (probably needs water). Mackmyra some oak, some sourness and bitterness (also in need of water). At this stage, both smell better than they taste so I add water to both.

Clynelish got a nice bourbon flavour with some water. The Swedish oak is clearly there in the Mackmyra – a slightly unusual whisky flavour. While the Clynelish taste is quite well defined, the Mackmyra is more everywhere in the mouth, and a little bit burnt in the finish. I add more water to both.

Well, I have thought about it since the first taste, there is clearly pear in Mackmyra. I think the water did its job and the Mackmyra is now softer, but it also tastes a little diluted. The Clynelish is more oily, sweeter and has more flavour – not bad, but not particularly interesting.

Badness: If you like whisky, there is nothing bad about the Clynelish, but it is not remarkable either. The Mackmyra needs water (and at 46% that is ok) for me to appreciate it, but it quickly tastes a little diluted – to me this is a sign that there simply is not enough flavour in it, and for a young whisky that is not so strange.

Conclusions: The Clynelish is very solid: perfected at 14 years in Bourbon cask to the point that it is not very interesting at all. My impression is that it tastes like a perfect blend, but with little character (Clynelish is not Brora, after all). The Mackmyra, with enough water, tastes fine. But it requires a friendly attitude to come out good.

Winner: The Clynelish wins, and I believe it does for two reasons. First, whisky is Scottish business and while the Clynelish is very solid, the Mackmyra is a little too different, too fruity and too young. Second, the Mackmyra with too little water is not a premium experience. That said, the Mackmyra is more interesting than the rather boring Clynelish, to me. And with enough water, the Mackmyra is a tasty drink.

3. Mackmyra Svensk Rök vs Bunnahabhain 8 years old
I decided to try the Mackmyra Svensk Rök (Swedish Smoke) against a Bunnahabhain from Gordon & MacPhail, 8 years old. It is labeled “heavily peated”, and my hope was that the level of peatiness/smokiness would be quite the same for the two contestants.

Appearance: Mackmyra is slightly paler, perhaps, they look very similar.

Aroma: Bunnahabhain has a classic Islay smell (which I don’t usually expect in a Bunnahabhain). It is a powerful yet soft smell, not so dominated by peat and smoke after all. Unfortunately, I should have smelled the Mackmyra first, because now I realise that the Bunnahabhain is too powerful and the character of Mackmyra appears to be very subtle. However, after waiting a little while the Mackmyra has a clear and pleasant smell, with not so little smoke (it is not peat) after all. The Mackmyra more resembles (as I remember them) the earlier two Mackmyra, than it resembles Bunnahabhain.

Taste: Mackmyra first now: at 46% some smoke, some fruitiness and some sourness, but it clearly needs water. With a little water a nice yet quite subtle smokiness is revealed and behind it the dry flavour of young wood. But it still needs more water. What happens here (with more water) is that the (still subtle) smokiness hides the fruity and nice character of the two previous Mackmyra.

The Bunnahabhain (which arguably smells closet) has a long, complex and soft taste (at its original 43%). It is clearly not as heavy as its more famous Islay neighbours, but compared to Mackmyra it is very rich and oily.

Switching back to Mackmyra is surprisingly pleasant (it tastes nicer after I had the Bunnahabhain, not worse as I expected after something heavier). But the Mackmyra, after Bunnahabhain, is mostly fruity and fresh, not smoky at all.

Badness: Mackmyra, again, requires a certain amount of water to get right. It is a rather thin experience, especially with this competition. Bunnahabhain, I would not call it elegant, is rather wild. It is not that it is very powerful or peaty, it is just a little bit everywhere, and now and then, in some places in my nose and mouth, not very refined or elegant. It is a young little Islay brother.

Conclusions: I knew it was going to be tricky to pick a contestant to Mackmyra Svensk Rök. I did not find a Highland Park in my stash, that could have been better. I have a Jura Superstition: it would perhaps have been less peaty and for that reason a better opponent to Mackmyra. But I really like that Jura and I did not want to pit Mackmyra against a personal favourite.

Winner: Bunnahabhain beats Mackmyra, and usually, head to head, a much heavier whisky beats the lighter one. Bunnahabhain did not come out as fantastic this evening, but there was just too little to explore in the Mackmyra.

Conclusions
My impression is that while the Clynelish is much better than the Glenturret, the Svensk Ek is not much better than the Brukswhisky. And while Bunnahabhain is not necessarily much better than Clynelish, the Mackmyra Svensk Rök fails to improve much compared to the other two. At least, this is my impression when testing them head to head on different occasions.

In fact Svensk Rök was the most disappointing experience (but perhaps the competition was completely unfair).

I think it is unreasonable to expect of a little young Swedish distillery that they produce world class whisky immediately, especially in a business where long storage time is a significant factor in product quality. Mackmyra claims they use small casks to speed up the process, but perhaps this shortcut is not perfect. Mackmyra needs water, but it quickly tastes diluted – there is not so much flavour to reveal. I think it needs more time (and perhaps it needs better casks, I don’t know about that).

I appreciate Mackmyra for being different (sometimes it reminds me of something coming from south of Sweden rather than from west of Sweden). I don’t find Mackmyra unpleasant (disgusting, chemical, bad, as I sometimes do with whisky). But if it is going to beat Scottish whiskies head to head, it needs weaker opponents or more time to mature.

Upgrading ownCloud 7.0.4 to 8.1.1

I am running ownCloud on a Debian machine with MySQL and I have been a little behind with upgrading it. Today I upgraded from 7.0.4 to 8.1.1 following the standard instructions. A few more notes on my environment:

  1. I don’t use encryption for files
  2. I don’t use https/ssl (I am behind an openvpn server)
  3. I did the upgrade in one step (7.0.4 to 8.1.1, not via intermediate versions)

It basically went fine. When I ran:

$ sudo -u www-data php occ upgrade
ownCloud or one of the apps require upgrade - only a limited number of commands are available
Checked database schema update
Checked database schema update for apps
Updated database
Disabled 3rd-party app: calendar
Disabled 3rd-party app: contacts
Disabled 3rd-party app: documents
Updating  ...
Updated  to 0.7
Updating  ...
Updated  to 0.6
Updating  ...
Updated  to 1.1.10
Updating  ...
Updated  to 2.0.1
Updating  ...
Updated  to 0.6.2
Updating  ...
Updated  to 0.6.3
Updating  ...
Updated  to 0.6.0
Update 3rd-party app: calendar
Exception: App does not provide an info.xml file
Update failed
Maintenance mode is kept active

I am a little surprised, because I don’t remember calendar, contacts and documents being 3rd party apps before (?). Anyway, the server did not come up, so I ran the command again:

$ sudo -u www-data php occ upgrade
ownCloud or one of the apps require upgrade - only a limited number of commands are available
Turned on maintenance mode
Checked database schema update
Checked database schema update for apps
Updated database
Update successful
Turned off maintenance mode

Now it worked. Logged in, no traces of the three 3rd-party apps. Whatever, I use ownCloud for the files.

Performance after upgrade
ownCloud is not particularly fast. I did a very quick and unscientific performance check: before upgrading I uploaded a folder (17 files, 4.1 MB) to ownCloud: it took 30 seconds (for the desktop client to complete syncing). After the upgrade the same folder took 19 seconds to sync. This proves nothing of course, but at least it seems promising.

Ubuntu Client
My Ubuntu client was ownCloud client version 1.7. It does not work with 8.1.1. Installing the ownCloud client from the external repository worked fine. Same thing for Debian, obviously.

Rotation speed meter for model engine

I have a model Stirling engine (Böhm Stirling Technik HB13) and a steam engine (Wilesco D10), and I have been curious about how fast they run (rpm). After some not so successful attempts, I managed to build a solution that works very well.

The Sensor
The idea is to use an IR-diode and an IR-transistor (it basically lets electric current through when being exposed to IR light), and mount these on opposite sides of some part of the engine that moves. First a device especially built for the Stirling engine:
rotmeter-stirling1

Here is a picture of a sensor designed to work with the steam engine:
rotmeter-steam

It works just fine with the Stirling engine too:
rotmeter-stirling2

The electronic part is quite simple. I am going to power the sensor and pick up the signal with an Arduino so I design for 5V:

  • The Diode is connected in series with a 330 Ohm resistor. Anode (long) to 5V and cathode (short) to ground.
  • The IR Transistor has two connectors: Collector and Emitter. I connected Emitter (long) to ground. The Collector (short) was connected to 5V via a 10kOhm resistor.

Now the collector can be used as an IR detector: LOW (close to 0V) means there is an IR source present and HIGH (close to 5V) means there is no IR source.


5V --------+------------------+----------------+
           |                  |                |
        330 Ohm            10k Ohm          330 Ohm
           |                  |                |
           |                  |           Indicator LED
           |                  |                |
           |                  +----------------+---- measure here
           |                  |
      IR Diode   ~IR~>   IR Transistor
           |                  |
Ground ----+------------------+

The Indicator LED (and its 330 Ohm resistor) is optional. It will be ON when the IR light from the Diode reaches the IR transistor. It will be OFF when IR beam is blocked.
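
For a rough feel for the currents in the two resistor branches, here is a small host-side C++ sketch of Ohm's law. The 1.2V forward voltage for the IR diode is an assumption (it was not measured here and varies between parts), and the function name is made up for illustration.

```cpp
// Current in milliamps through a resistor that sits in series with a
// diode or a conducting transistor.
// dropVolts is the voltage lost over that diode/transistor; the 1.2 V
// mentioned in the text is an assumed typical value, not a measurement.
double seriesCurrentMa(double supplyVolts, double dropVolts, double ohms) {
    return (supplyVolts - dropVolts) / ohms * 1000.0;
}
```

With these assumptions, seriesCurrentMa(5.0, 1.2, 330.0) comes out at roughly 11.5mA for the IR diode, and seriesCurrentMa(5.0, 0.0, 10000.0) at 0.5mA through the 10k resistor if the transistor pulls the collector all the way to ground.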

The Arduino
I decided to use an Arduino to pick up the signal and display the speed. I bought the official Arduino Starter Kit which contained everything I needed (except the IR Diode and IR Transistor):
rotmeter-uno2
There is no indicator LED in this design. What you see in the picture from the Starter kit:

  • Arduino UNO
  • breadboard
  • 16×2 Character Display
  • 1 potentiometer (to adjust contrast of display)
  • 2 buttons (for a little menu)
  • cables (some are not from the starter kit)
  • resistors
  • a little wooden board to mount the UNO and breadboard on

When everything worked I wanted to make my device more permanent and make my prototyping Arduino UNO available for other projects. I bought an Arduino Nano clone and other things from DealExtreme and put it all together in a more compact design:
rotmeter-nano
This design has two input ports for the two different sensors I built (each having their own indicator LED).

Signal Noise
Based on earlier experiences (see Failed designs below) my primary concern was signal noise. Basically (the sensor is HIGH when IR beam is blocked):


Ideal signal

HIGH:                    -------                   ------

LOW:  -------------------       -------------------      ------------------


My fear

HIGH:                   - -- --- -                 - -    -

LOW:  ------------------ -  -   - ----------------- - ---- -----------------

For this reason I made a configurable max-rpm-value (like 2000 rpm), which in turn could be turned into a silent/numb period during which I ignore any signal:

 =WAITING============== ! =NUMB===== =WAITING===== ! =NUMB===== =WAITING====  

It turned out however, that for this (IR Diode/Transistor) design, the signal was very close to perfect.
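
The numb-period idea can be sketched in plain host-side C++ so it can be tried without hardware. This is not the published source code: the class name PulseCounter, the assumption of one pulse per revolution, and the choice of making the numb period equal to one revolution at the configured max rpm are all illustrative.

```cpp
#include <cstdint>

// Counts rising edges (LOW -> HIGH) and ignores all input during a
// "numb" period after each accepted edge. Names are illustrative.
class PulseCounter {
public:
    // maxRpm must be > 0; one pulse per revolution is assumed.
    explicit PulseCounter(uint32_t maxRpm)
        : numbUs_(60000000u / maxRpm) {}

    // Feed one sample. nowUs is a microsecond timestamp (what micros()
    // would give on an Arduino), high is the sensor level.
    // Returns true when this sample is counted as a new revolution.
    bool sample(uint32_t nowUs, bool high) {
        // Unsigned subtraction keeps working when the timestamp wraps.
        bool numb = seenEdge_ && (nowUs - lastEdgeUs_) < numbUs_;
        if (numb) return false;          // ignore noise after an edge
        bool rising = high && !prevHigh_;
        prevHigh_ = high;
        if (rising) {
            lastEdgeUs_ = nowUs;
            seenEdge_ = true;
            ++count_;
        }
        return rising;
    }

    uint32_t count() const { return count_; }

private:
    uint32_t numbUs_;
    uint32_t lastEdgeUs_ = 0;
    uint32_t count_ = 0;
    bool seenEdge_ = false;
    bool prevHigh_ = false;
};
```

With PulseCounter pc(2000) the numb period is 30000us, so a second HIGH that shows up 200us after an accepted edge is dropped, while one that arrives 40ms later is counted.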

Analog, Digital and Interrupt
The analog reality is of course that my input signal was not going to be perfectly 0 or 5 volts. I was not really sure that Digital or Interrupt input would work fine. Analog (10 bit) input would be in the range [0,1023].

Actual ranges were not completely consistent but could be [16,680] or [180,890] or [200,825]. In the last case 825 was with a paper between the Diode/Transistor. Hiding the Transistor inside a box raised the value from 825 to 1015.

Background IR radiation and other factors clearly play in here, but not enough to cause any problems. It turns out in practice both Digital and Interrupt work just fine.

I anyway built my device and program in such a way that I could choose between Digital, Analog and Interrupt input.
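
For the analog input, one way to turn readings like the ranges above into a clean HIGH/LOW is to use two thresholds (hysteresis). The sketch below is host-side C++ and only an illustration: the class name and the threshold values 350 and 550 are assumptions picked to sit between the observed low ends (16 to 200) and high ends (680 to 890), not values from the published program.

```cpp
#include <cstdint>

// Converts a 10-bit analog reading (0..1023) into a HIGH/LOW state.
// Readings between the two thresholds keep the previous state, so a
// value hovering around a single limit cannot cause rapid toggling.
class AnalogLevel {
public:
    AnalogLevel(uint16_t lowThreshold, uint16_t highThreshold)
        : low_(lowThreshold), high_(highThreshold) {}

    // Returns the state after taking this reading into account.
    bool update(uint16_t reading) {
        if (reading >= high_) {
            state_ = true;       // clearly blocked: HIGH
        } else if (reading <= low_) {
            state_ = false;      // clearly lit: LOW
        }
        return state_;
    }

private:
    uint16_t low_;
    uint16_t high_;
    bool state_ = false;
};
```

With AnalogLevel level(350, 550), a reading of 680 switches the state to HIGH, a following 450 leaves it HIGH, and it only drops back to LOW once a reading of 350 or less arrives.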

Sensor Performance
A rotation speed of 1200 rpm equals 20Hz and 50ms per revolution. If the HIGH is short (5ms) compared to the LOW (45ms) a sampling interval less than 5ms is required, to not fail to detect a revolution (unless I rely on interrupts).

The Steam engine flywheel has five spokes, each being much smaller than the space between them (see picture above). To measure five beats per revolution would require approximately 50 samples per revolution, and at 1800 rpm this equals 1500Hz or about 670 microseconds (us) per sample.

The transistor itself is fast. The Arduino documentation indicates that analogRead() takes about 100us, and digital read is supposed to be faster (about 30us faster, it seems). The main loop must (worst case) complete in ~500us:

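// Simplified outline of the main loop (the helpers are placeholders)
void loop() {
  readInput();          // sample the sensor
  processInput();       // detect a revolution, update rpm
  outputResultToLcd();  // refresh the display
}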

At 16MHz and an 8bit CPU, this does not allow for a lot of wasteful input analysis.
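
The timing budget above is simple integer arithmetic. A small host-side C++ sketch (function names are made up for illustration, integer division rounds down):

```cpp
#include <cstdint>

// Microseconds per revolution at a given speed (rpm must be > 0).
uint32_t revolutionPeriodUs(uint32_t rpm) {
    return 60000000u / rpm;
}

// Longest allowed time between two samples if samplesPerRev samples
// are wanted per revolution (samplesPerRev must be > 0).
uint32_t maxSampleIntervalUs(uint32_t rpm, uint32_t samplesPerRev) {
    return revolutionPeriodUs(rpm) / samplesPerRev;
}
```

revolutionPeriodUs(1200) gives 50000 (50ms per revolution), and maxSampleIntervalUs(1800, 50) gives 666, which is where the requirement of a few hundred microseconds per loop comes from.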

Display Performance
The worst performance problem turned out to be the 16×2 character display. Initially I updated it a few times per second with code like:

LCD.clear();
LCD.setCursor(0,0);
LCD.print("rpm=");
LCD.print(rpmval);
LCD.setCursor(0,1); // first character, second row
LCD.print("something else");

This typically takes 12ms. Even with the spoke-less Stirling engine this design broke down at 1500rpm.

I ended up having two character arrays (16 bytes each). Whenever I wanted to update the output I just wrote to these two arrays. In each loop() iteration, I then called LCD.write() (at most) once in the end (writing just one character per iteration). This method still updates the display much faster than it can turn the pixels on/off, but avoiding the LCD.clear() improves the visual impression. This is clearly not Arduino Best Practice.
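
The buffering idea can be modelled on a PC like this. It is a sketch under assumptions, not the published code: FakeLcd is a stand-in that only records what is written (it mimics setCursor() and write() of a character display), and SlowDisplay, setLine() and step() are made-up names.

```cpp
#include <cstddef>
#include <cstring>
#include <string>

// Stand-in for the real 16x2 display so the logic can run on a PC.
struct FakeLcd {
    char cells[2][16];
    int col = 0;
    int row = 0;
    FakeLcd() { std::memset(cells, ' ', sizeof(cells)); }
    void setCursor(int c, int r) { col = c; row = r; }
    // Like a real character display, the cursor advances after a write.
    void write(char ch) { cells[row][col] = ch; col = (col + 1) % 16; }
};

class SlowDisplay {
public:
    SlowDisplay() { std::memset(wanted_, ' ', sizeof(wanted_)); }

    // Cheap: copies text into the buffer and pads the row with spaces.
    void setLine(int row, const std::string& text) {
        for (std::size_t i = 0; i < 16; ++i)
            wanted_[row][i] = i < text.size() ? text[i] : ' ';
    }

    // Call once per loop(): sends exactly one character to the display,
    // walking through all 32 positions over 32 iterations.
    void step(FakeLcd& lcd) {
        int row = pos_ / 16;
        int col = pos_ % 16;
        if (col == 0) lcd.setCursor(0, row);  // only at the start of a row
        lcd.write(wanted_[row][col]);
        pos_ = (pos_ + 1) % 32;
    }

private:
    char wanted_[2][16];
    int pos_ = 0;
};
```

The cost per loop is bounded by one character write (plus a cursor move twice per 32 iterations), and a full refresh takes 32 iterations. Padding each row with spaces replaces the clear() call.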

There are I2C-compatible 16×2 displays: it would be interesting to know if they are faster or slower than the display I got with the Starter kit.

I am also thinking about sending output via serial. This will clearly require some thinking at 9600bps (unless the Arduino has some serial buffer working in the background).
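
To get a feel for the numbers, the time a message occupies the serial line can be estimated as below. This host-side C++ sketch assumes the common 8N1 framing (1 start bit, 8 data bits, 1 stop bit, so 10 bits per byte); the function name is made up. Whether that time actually blocks loop() depends on how the serial library buffers output, which this sketch does not model.

```cpp
#include <cstdint>

// Time on the wire in microseconds for a number of bytes at a given
// baud rate, assuming 10 bits per byte (8N1 framing). baud must be > 0.
uint32_t serialTimeUs(uint32_t bytes, uint32_t baud) {
    uint64_t bits = static_cast<uint64_t>(bytes) * 10u;
    return static_cast<uint32_t>(bits * 1000000u / baud);
}
```

At 9600bps a single byte takes about 1041us and a 9 byte line such as "rpm=1234" plus a newline takes 9375us, which is far more than one ~500us loop iteration.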

Optimization
Optimizing for the Arduino is a bit different from optimization on Linux or Windows: since there is no context switching and no other processes, wasting CPU cycles does not matter: minimizing the maximum loop() time is everything, even if it means a lot of unnecessary operations are performed during the quicker loops.

In the end my loops (analog input) took from ~30us to ~600us. A bit simplified, the loop:

  1. Reads sensor input (~100us, but not when “numb”)
  2. Reads button 1 input
  3. Reads button 2 input
  4. Analyzes sensor input
  5. Analyzes button input to decide how to respond to user action
  6. Changes state
  7. Reformats LCD output based on input and state
  8. Outputs LCD output (~300us)
  9. Updates the internal LED (#13)

My overall optimization strategy was to perform as few of these steps as possible every iteration, to minimize the maximum delay between reading sensor input.

Apart from that, performance is of course about datastructures and algorithms (as usual). For the Arduino, also the datatype matters: 32 bit division is not for free on an 8 bit micro processor (and finding time intervals, average speeds and things like that requires division). No FPU anywhere.
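
As an illustration of keeping the math in integers, here is a host-side C++ sketch that turns time intervals into rpm without any floating point. It is not taken from the published source; the function names and the pulsesPerRev parameter are assumptions. The 64-bit intermediate in averageRpm() is there to avoid overflow and would be comparatively slow on an 8 bit CPU, so it should only run when a value is reported, not for every pulse.

```cpp
#include <cstdint>

// rpm from the time between two pulses, integer math only.
// Returns 0 for invalid input instead of dividing by zero.
uint32_t rpmFromIntervalUs(uint32_t intervalUs, uint32_t pulsesPerRev) {
    if (intervalUs == 0 || pulsesPerRev == 0) return 0;
    return 60000000u / intervalUs / pulsesPerRev;
}

// Average rpm over a longer window: count the pulses, measure the total
// time, and divide only when the value is needed.
uint32_t averageRpm(uint32_t totalUs, uint32_t pulses, uint32_t pulsesPerRev) {
    if (totalUs == 0 || pulsesPerRev == 0) return 0;
    uint64_t scaled = static_cast<uint64_t>(pulses) * 60000000u;
    return static_cast<uint32_t>(scaled / totalUs / pulsesPerRev);
}
```

For example, rpmFromIntervalUs(50000, 1) gives 1200, and rpmFromIntervalUs(6000, 5) gives 2000 for a five-spoke flywheel that produces a pulse every 6ms.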

I first wrote my program in C style rather than C++ style, as in:

C style:    timer_update(&timer);
C++ style:  timer.update();

I later rewrote my program creating proper C++ classes. This had essentially no cost whatsoever, neither on memory usage nor on performance, but I have to agree that the C++ got cleaner and easier to read.

Source Code
I uploaded the source code to DropBox. I would of course have wanted to clean it up, document and comment it better before publishing it. But this project has taken some time already, and I doubt it will happen, so I publish it as is instead. Feel free to drop comments or questions below.

Power consumption
My device seems to use about 45mA when powered with a 9V battery. With the sensor unplugged it was down to 25mA.

45mA at 9V gives about 400mW and an equivalent load of 200 Ohms.
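
The same arithmetic as a small host-side C++ sketch (function names are made up). The batteryHours() helper is an extra, and any capacity passed to it is an assumption: real 9V batteries deliver less than their nominal capacity at higher currents.

```cpp
// Power in milliwatts from volts and milliamps (P = U * I).
double powerMilliwatts(double volts, double milliamps) {
    return volts * milliamps;
}

// Resistance that would draw the same current (R = U / I).
double equivalentOhms(double volts, double milliamps) {
    return volts / (milliamps / 1000.0);
}

// Idealized runtime: capacity divided by current. Treat as an upper bound.
double batteryHours(double capacityMah, double milliamps) {
    return capacityMah / milliamps;
}
```

powerMilliwatts(9.0, 45.0) is 405 and equivalentOhms(9.0, 45.0) is 200. With an assumed 500mAh battery, batteryHours(500.0, 45.0) is about 11 hours at best.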

Rotation Speeds
The Stirling engine starts slowly at about 300rpm. It runs for about 30 minutes, and can reach maximum speeds of just over 2000rpm (that typically happens after 20-25 minutes). A little extra heating (just burning a match in the flame) makes a big difference. I don’t know why it runs faster and faster with time and if this is normal.

The steam engine quickly reaches 2000rpm or a bit more, and then slows down during the 10 minutes it typically runs.

Both engines could probably be pushed a bit more, but I don’t want a catastrophic failure, especially not with the steam engine.

Failed designs
I have failed to measure rotation speed of my Stirling engine before.

Mobile applications blinking quickly can be used to determine rotation speed (with the right setting, a rotating wheel will look like it is still). I was not satisfied with this.

I recorded the sound (noise) of the Stirling engine and spent a few days writing a program trying to analyze the signal to determine the frequency of the engine. I should have:

  1. used a much better microphone to reduce noise
  2. used FFT or something like that, instead of trying to invent some heuristics like I did

In the end, it kind of gave me the speed, or I could interpret the speed from the output I got. One uncertainty here is that it is not entirely clear if the engine makes one, two (or more) noise-bursts per revolution.
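
For reference, the frequency-analysis approach mentioned above could look roughly like the sketch below. It is not the program I wrote back then: it uses a plain discrete Fourier transform (slower than an FFT, but short), the function name is made up, and the input is assumed to be already recorded mono samples at a known sample rate.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Returns the frequency in Hz, between minHz and maxHz, that has the
// largest spectral power in the samples. Uses a direct DFT, so the cost
// is (number of samples) x (number of frequency bins examined).
double dominantFrequencyHz(const std::vector<double>& samples,
                           double sampleRateHz, double minHz, double maxHz) {
    const double pi = 3.14159265358979323846;
    const std::size_t n = samples.size();
    double bestHz = 0.0;
    double bestPower = -1.0;
    // Bin k corresponds to the frequency k * sampleRateHz / n.
    for (std::size_t k = 1; k < n / 2; ++k) {
        const double hz = k * sampleRateHz / n;
        if (hz < minHz || hz > maxHz) continue;
        double re = 0.0;
        double im = 0.0;
        for (std::size_t i = 0; i < n; ++i) {
            const double angle = 2.0 * pi * k * i / n;
            re += samples[i] * std::cos(angle);
            im -= samples[i] * std::sin(angle);
        }
        const double power = re * re + im * im;
        if (power > bestPower) {
            bestPower = power;
            bestHz = hz;
        }
    }
    return bestHz;  // 0.0 if no bin fell inside the requested range
}
```

Multiplying the result by 60 gives rpm only if the engine makes exactly one noise burst per revolution. With two bursts per revolution the strongest peak may sit at twice the rotation frequency, so the uncertainty described above remains.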

I built an electrical sensor, basically connecting the engine to ground, and every revolution letting a moving part of the engine touch a piece of metal, connected to an Arduino input. This was not entirely unsuccessful, but the design had its disadvantages:

  1. Noise, and sometimes no signal at all
  2. Interferes with engine mechanically
  3. At high speeds, sensor has to be moving (back) fast enough

I did not try to use a photoresistor because they are clearly too slow for my purposes.

Buying cheap Arduino clones

I got curious about Arduino a little while ago and bought the official Arduino Starter Kit. I can really recommend it! It is very nicely put together and the project book really helps to get you started in no time. Even if you don’t care about the projects themselves, they are a great way to learn how to use the Arduino.

After a few projects from the Starter Kit I started building my own project, which took me to a point where I felt I wanted at least one more Arduino.

I also felt that perhaps the UNO is not the right model for a more permanent build. After not so little research I decided the Arduino Nano is quite perfect.

Well, buying an Arduino (UNO, Nano, whatever) is not the easiest thing:

  • The arduino.cc vs arduino.org conflict causes some confusion, and has caused some limited availability of original (Italian) boards, it seems.
  • Some models are deprecated
  • There are kind of official boards with funny names (Adafruit, RedBoard)
  • There are even more unofficial boards
  • The ATMega328 is not a very powerful chip, and original UNO and Nano are quite pricey

This made me consider a cheap Chinese copy. Those are not illegal in any way, they just come with the usual issues:

  • Delivery time
  • Build quality
  • No (or questionable) contribution back to community
  • Compatibility
  • Control, ethical, environmental and other aspects

I decided to give it a try and ordered:

I believe those mini-breadboards together with the Nano make a perfect Arduino. The clones I received were just fine. When you look at them and touch them, of course they don’t have the same quality as the original beautiful Italian-made Arduino I got with the Starter Kit. Especially the headers seem to be of lower quality than the original (not to talk about the print). Most of the clones use a cheaper and less capable USB-controller (CH340G instead of ATMEGA16U2). For Linux this makes no difference whatsoever, but for Windows you probably need to install drivers.

I think it can be good to have a few cheap clones to build into stuff or play with. At the same time, the official Starter Kit is great and the official board is good as reference so you know the clones do what they should. I would not start with only a cheap clone and no Starter Kit.

Simple blinking LED circuit

For reasons I will not go into here, I wanted to build a simple electrical circuit that:

  • Has a blinking LED
  • Is driven by 5V
  • Does not require any IC or Microcontrollers

I found a few options, but this one seemed best.


(image linked from other site)

Well, it was not 5V, and I did not have the right components so I replaced:

  1. 470 Ohm resistors with 220 Ohm
  2. 100k Ohm resistors with 22k Ohm
  3. 10uF capacitors with 100uF
  4. The 2N3904 NPN transistors with BC547B

blinkled
(Picture with 1uF and 10k Ohm/100k Ohm)

That worked fine! But I wanted faster blinking speeds and I wanted to try different blinking speeds. I am quite sure the transistors and the 220 Ohm (470 Ohm) resistors do not significantly affect blinking speed. But I tried different combinations of the capacitors and high resistors.

(Hz)      100uF   22uF    1uF
1000 Ohm  10      47      
2200 Ohm  4.1     19      480
5100 Ohm  1.5     7.0     153
10k Ohm   0.87    4.0     88
100k Ohm                  10

It seems that replacing the capacitors gives a very near-linear change in frequency (and my capacitors are quite accurate). However, changing the resistors is not giving a completely linear change in frequency (or the resistors are of lower quality with more error tolerance). Nevertheless, with the table above, I think it should be quite possible to get roughly the frequency you need. To save energy, you should use small capacitors and high resistors.

Mixing resistors
Mixing resistors gives what it should. I mixed 10k with 100k and got a frequency of 15Hz (basically, half of them are 100k-slow, and the other ones are much faster). The “side” with the “higher” resistor is the side that is off most of the time.
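
For comparison with the measurements, the textbook estimate for a two-transistor astable multivibrator is f = 1 / (ln(2) * (R1*C1 + R2*C2)). The host-side C++ sketch below assumes that my circuit is of that classic type and that the formula's idealizations hold (they do not hold perfectly at 5V with LEDs in the circuit), so treat the output as a ballpark figure only. The function name is made up.

```cpp
#include <cmath>

// Ideal oscillation frequency in Hz of a two-transistor astable
// multivibrator. Each half of the circuit contributes a time of about
// ln(2) * R * C to the full period. Resistances in Ohm, capacitances in
// Farad.
double astableFrequencyHz(double r1Ohm, double c1Farad,
                          double r2Ohm, double c2Farad) {
    const double period = std::log(2.0) * (r1Ohm * c1Farad + r2Ohm * c2Farad);
    return 1.0 / period;
}
```

For 10k Ohm and 100uF on both sides this gives about 0.72Hz (I measured 0.87Hz), and for the mixed 10k/100k case with 1uF it gives about 13Hz (I measured 15Hz). The estimate lands in the right range but my circuit consistently runs somewhat faster than the ideal formula predicts.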

Measuring method
I built an Arduino tool that measures IR blink speed (to measure for example rotation speed). I built my circuit as a test circuit. Clearly, it would have been easier to use a second Arduino with the simple Blink program than to build a circuit like this.

OpenWrt 15.05 on Legacy Devices (16MB RAM)

There are 86 devices on the OpenWrt homepage listed as supported, but with only 16MB of RAM. Those devices work just fine with OpenWrt Backfire 10.03.1, but not with more recent OpenWrt releases.

I myself own a Linksys WRT54GL and I used Barrier Breaker 14.07 with some success.

With 15.05 there is a new feature available: zram-swap. A bit simplified, it means the system can compress its memory, effectively making better use of it.

I decided to try out 15.05 RC3 on my WRT54GL.

The standard image
The standard image is 3936256 bytes and the device page for WRT54GL says: “As the WRT54GL has only 4Mb flash, any image sent to the device must be 3866624 bytes or smaller.” So the standard image is out of the question. Instead I downloaded the Image Builder from the same folder.

The Image Builder
The Image Builder is very easy to use and requires an x86_64 Linux computer.

make image PROFILE=Broadcom-b43 PACKAGES="zram-swap -kmod-ppp -kmod-pppox -kmod-pppoe -ppp -ppp-mod-pppoe -kmod-b43 -kmod-b43legacy -kmod-mac80211 -kmod-cfg80211"

After a little while this has produced custom images, minus ppp-stuff, minus wireless stuff (more on that later), plus zram-swap. Also, LuCI is not there. The image is found in bin/brcm47xx, it is 3012 kB and is installed the normal way on your WRT54GL.

Trying 15.05
Logging in via ssh (dropbear) is fine:

BusyBox v1.23.2 (2015-06-18 17:05:04 CEST) built-in shell (ash)

  _______                     ________        __
 |       |.-----.-----.-----.|  |  |  |.----.|  |_
 |   -   ||  _  |  -__|     ||  |  |  ||   _||   _|
 |_______||   __|_____|__|__||________||__|  |____|
          |__| W I R E L E S S   F R E E D O M
 -----------------------------------------------------
 CHAOS CALMER (15.05-rc3, r46163)
 -----------------------------------------------------
  * 1 1/2 oz Gin            Shake with a glassful
  * 1/4 oz Triple Sec       of broken ice and pour
  * 3/4 oz Lime Juice       unstrained into a goblet.
  * 1 1/2 oz Orange Juice
  * 1 tsp. Grenadine Syrup
 -----------------------------------------------------
root@OpenWrt:~# 

Top looks tight but not alarming (as usual):

Mem: 11568K used, 1056K free, 44K shrd, 1208K buff, 3228K cached
CPU:   8% usr   8% sys   0% nic  83% idle   0% io   0% irq   0% sirq
Load average: 0.13 0.23 0.11 1/31 1061
  PID  PPID USER     STAT   VSZ %VSZ %CPU COMMAND
 1061  1056 root     R     1488  12%  17% top
  631     1 root     S     1656  13%   0% /sbin/netifd
  848     1 root     S     1492  12%   0% /usr/sbin/ntpd -n -S /usr/sbin/ntpd-h
 1056  1055 root     S     1492  12%   0% -ash
  735   631 root     S     1488  12%   0% udhcpc -p /var/run/udhcpc-eth0.2.pid
    1     0 root     S     1444  11%   0% /sbin/procd
 1055   757 root     S     1224  10%   0% /usr/sbin/dropbear -F -P /var/run/dro
  650     1 root     S     1196   9%   0% /usr/sbin/odhcpd
  757     1 root     S     1156   9%   0% /usr/sbin/dropbear -F -P /var/run/dro
  580     1 root     S     1060   8%   0% /sbin/logd -S 16
  868     1 nobody   S      996   8%   0% /usr/sbin/dnsmasq -C /var/etc/dnsmasq
  308     1 root     S      916   7%   0% /sbin/ubusd
  737   631 root     S      812   6%   0% odhcp6c -s /lib/netifd/dhcpv6.script
  337     1 root     S      772   6%   0% /sbin/askfirst /bin/ash --login
    4     2 root     SW       0   0%   0% [kworker/0:0]
    8     2 root     SW       0   0%   0% [kworker/u2:1]
    3     2 root     SW       0   0%   0% [ksoftirqd/0]
   14     2 root     SW       0   0%   0% [kswapd0]
    6     2 root     SW       0   0%   0% [kworker/u2:0]
  237     2 root     SWN      0   0%   0% [jffs2_gcd_mtd5]

The swap seems to work, at least in theory:

root@OpenWrt:~# free
             total         used         free       shared      buffers
Mem:         12624        11552         1072           44         1208
-/+ buffers:              10344         2280
Swap:         6140           72         6068

But that is the end of the good news.

opkg runs out of memory
Trying to install a package fails (in a new way):

# opkg install kmod-b43
Installing kmod-b43 (3.18.17+2015-03-09-3) to root...
Downloading http://downloads.openwrt.org/chaos_calmer/15.05-rc3/brcm47xx/legacy/packages/base/kmod-b43_3.18.17+2015-03-09-3_brcm47xx.ipk.
Collected errors:
 * gz_open: fork: Cannot allocate memory.
 * opkg_install_pkg: Failed to unpack control files from /tmp/opkg-lE7SIf/kmod-b43_3.18.17+2015-03-09-3_brcm47xx.ipk.
 * opkg_install_cmd: Cannot install package kmod-b43.

This happens also without zram-swap installed. I tried different packages but none of those I tried installed successfully. Effectively opkg is broken. One way to deal with this is to build an image with exactly the packages I need, and rebuild the image every time I want a new package. Which leads me to the next problem.

sysupgrade runs out of memory
I have found that flashing my 15.05 image to my WRT54GL (from 10.03.1 or 14.07) is fine. But flashing from 15.05 is tricky because it seems there is not enough RAM for sysupgrade. And it is quite scary when sysupgrade stalls, because you don’t know if it is in the middle of flashing but failing to let you know.

One way to get around this is to flash a smaller image that uses less space on /tmp. I tried 8.09.1 for the first time ever for this reason. Another (not recommended) way is to pipe from nc to mtd directly.

I found out (the hard way) about system recovery (failsafe) mode: start your WRT54GL, press the reset button on the back side (more is better), and it starts in recovery mode where you can telnet to it and sysupgrade runs just fine.

Even in recovery mode not everything is fine: for example, the firstboot command did not finish properly when I tried it and I had to reset the WRT54GL.

A few times I forgot to use the -n option with sysupgrade: that is not such a good thing when you run it in recovery mode and perhaps flash a different firmware version.

testing wifi
I built a new image with WiFi installed and flashed it from failsafe mode:

make image PROFILE=Broadcom-b43 PACKAGES="zram-swap -kmod-ppp -kmod-pppox -kmod-pppoe -ppp -ppp-mod-pppoe"

Well, I tried different things… on one occasion I had WiFi without encryption working. However, most of the time, activating WiFi just makes the WRT54GL unresponsive or very slow.

Conclusion
zram-swap is not the silver bullet that makes OpenWrt run on 16MB devices. As with 14.07, you can probably use Image Builder to build a useful minimal image: get rid of the firewall, the WiFi, LuCI of course, and use it for something else – fine! But as a WiFi router: use Tomato or 10.03.1 instead.

For now, my WRT54GL is flashed with 10.03.1, completely unconfigured, and stored away for future adventures. At least it is not bricked, and I never needed to connect a TTL-cable to it.

Building Node.js for OpenWrt (mipsel)

I managed to build (and run) Node.js v0.10.38 for OpenWrt and my Archer C20i with a MIPS 24K Little Endian CPU, without FPU (target=ramips/mt7620).

First edit (set to false):

deps/v8/build/common.gypi

    # Similar to vfp but on MIPS.
    'v8_can_use_fpu_instructions%': 'false',

    # Similar to the ARM hard float ABI but on MIPS.
    'v8_use_mips_abi_hardfloat%': 'false',

Then, use this script to run configure:

#!/bin/sh -e

#Tools
export CSTOOLS="$STAGING_DIR"
export CSTOOLS_INC=${CSTOOLS}/include
export CSTOOLS_LIB=${CSTOOLS}/lib

export CC="mipsel-openwrt-linux-uclibc-gcc"
export CXX="mipsel-openwrt-linux-uclibc-g++"
export LD="mipsel-openwrt-linux-uclibc-ld"

export CFLAGS="-isystem${CSTOOLS_INC}"
export CPPFLAGS="-isystem${CSTOOLS_INC}"

export PATH="${CSTOOLS}/bin:$PATH"

./configure --without-snapshot --dest-cpu=mipsel --dest-os=linux --without-npm

bash --norc

Then just “make”.

In order to run the node binary on OpenWrt you need to install:

# opkg update
# opkg install librt
# opkg install libstdcpp

I have had no success with v0.12.2 and v0.12.4.

Other MIPS?
The only other MIPS I have had the opportunity to try was my WDR3600, a Big Endian 74K. It does not work:

  • v0.10.38 does not build at all (big endian MIPS seems unsupported)
  • v0.12.* builds, but it does not run (floating point exceptions), even though I managed to build for Soft Float.

It seems the big endian support is rather new, so perhaps it will mature and start working in a later version.

Conclusion
It is quite questionable if CPUs without FPU will be supported at all in Node.js/V8 in the future.

Perhaps for OpenWrt LuaNode would be a better choice?

Node.js performance of Raspberry Pi 1 sucks

In several previous posts I have studied the performance of the Raspberry Pi (version 1) and Node.js to find out why the Raspberry Pi underperforms so badly when running Node.js.

The first two posts indicate that the Raspberry Pi underperforms about 10x compared to an x86/x64 machine, after compensation for clock frequency is made. The small cache size of the Raspberry Pi is often mentioned as a cause for its poor performance. In the third post I examine that, but it is not that horribly bad: about 3x worse performance for big memory needs compared to in-cache-situations. It appears the slow SDRAM of the RPi is more of a problem than the small cache itself.

The Benchmark Program
I wanted to relate the Node.js slowdown to some other scripted language. I decided Lua is nice. And I was lucky to find Mandelbrot implementations in several languages!

I modified the program(s) slightly, increasing the resolution from 80 to 160. I also made a version that did almost nothing (MAX_ITERATIONS=1) so I could measure and subtract the startup cost (which is significant for Node.js) from the actual benchmark values.
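The subtraction idea can be sketched roughly as follows. This is not the benchmark program itself (which is not reproduced here) but a minimal Mandelbrot loop, assuming a function name `mandelbrot`, a coordinate window and a MAX_ITERATIONS value of my own choosing. Note that the post timed whole program runs, so the 1-iteration run also captures Node.js startup; this in-process version only shows the principle.

```javascript
// Minimal Mandelbrot sketch (written for a current Node.js).
// The coordinate window and MAX_ITERATIONS are assumptions for illustration.
function mandelbrot(resolution, maxIterations) {
  let total = 0; // total number of iterations performed, so the work is not optimized away
  for (let y = 0; y < resolution; y++) {
    const ci = -1.5 + (3 * y) / resolution;
    for (let x = 0; x < resolution; x++) {
      const cr = -2 + (3 * x) / resolution;
      let zr = 0, zi = 0, i = 0;
      while (i < maxIterations && zr * zr + zi * zi <= 4) {
        const t = zr * zr - zi * zi + cr;
        zi = 2 * zr * zi + ci;
        zr = t;
        i++;
      }
      total += i;
    }
  }
  return total;
}

// Wall-clock time of fn() in milliseconds.
function timeMs(fn) {
  const start = process.hrtime();
  fn();
  const [s, ns] = process.hrtime(start);
  return s * 1000 + ns / 1e6;
}

const RESOLUTION = 160;      // resolution used in the post
const MAX_ITERATIONS = 1000; // hypothetical value, not taken from the post

const full = timeMs(() => mandelbrot(RESOLUTION, MAX_ITERATIONS));
const baseline = timeMs(() => mandelbrot(RESOLUTION, 1));
console.log('net benchmark time (ms):', (full - baseline).toFixed(1));
```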

The Numbers
Below is the average of three runs (minus the average of three 1-iteration rounds), in ms. The timing values were very stable over several runs.

 (ms)                           C/Hard   C/Soft  Node.js     Lua
=================================================================
 QNAP TS-109 500MHz ARMv5                 17513    49376   39520
 TP-Link Archer C20i 560MHz MIPS          45087    65510   82450
 RPi 700MHz ARMv6 (Raspbian)       493             14660   12130
 RPi 700MHz ARMv6 (OpenWrt)        490    11040    15010   31720
 Eee701 900MHz Celeron x86         295               500    7992
 3000MHz Athlon II X2 x64           56                59    1267

Notes on Hard/Soft floats:

  • Raspbian is armhf, only allowing hard floats (-mfloat-abi=hard)
  • OpenWrt is armel, allowing both hard floats (-mfloat-abi=softfp) and soft floats (-mfloat-abi=soft).
  • The QNAP has no FPU and generates runtime error with hard floats
  • The other targets produce linkage errors with soft floats

The Node.js versions are slightly different, and so are the Lua versions. This makes no significant difference.

Findings
Calculating the Mandelbrot with the FPU is basically “free” (<0.5s). Everything else is waste and overhead.

The cost of soft float is about 10s on the RPi. The difference between Node.js on Raspbian and OpenWrt is quite small – either both use the FPU, or neither does.

Now, the interesting thing is to compare the RPi with the QNAP. For the C-program with the soft floats, the QNAP is about 1.5x slower than the RPi. This matches well with earlier benchmarks I have made (see 1st and 3rd link at top of post). If the RPi had been using soft floats in Node.js, it would have completed in about 30 seconds (based on the QNAP 50 seconds). The only thing (I can come up with) that explains the (unusually) large difference between QNAP and RPi in this test, is that the RPi actually utilizes the FPU (both Raspbian and OpenWrt).
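The estimate above can be redone from the table values. A small sketch, using only numbers from the table (the variable names are mine):

```javascript
// Timings in ms, copied from the table in "The Numbers".
const qnap = { cSoft: 17513, node: 49376 };
const rpiOpenWrt = { cSoft: 11040, node: 15010 };

// How much slower the QNAP is than the RPi for the soft-float C program.
const slowdown = qnap.cSoft / rpiOpenWrt.cSoft;

// If Node.js on the RPi used soft floats, the same factor should roughly apply.
const expectedRpiNodeSoft = qnap.node / slowdown;

console.log('QNAP vs RPi (C, soft float): ' + slowdown.toFixed(2) + 'x');
console.log('Expected RPi Node.js time with soft floats (ms):', Math.round(expectedRpiNodeSoft));
console.log('Measured RPi Node.js time on OpenWrt (ms):', rpiOpenWrt.node);
```

The expected time comes out at roughly twice the measured one, which is the gap the FPU argument tries to explain.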

OpenWrt and FPU
The poor Lua performance in OpenWrt is probably due to two things:

  1. OpenWrt is compiled with -Os rather than -O2
  2. OpenWrt by default uses -mfloat-abi=soft rather than -mfloat-abi=softfp (which is essentially like hard).

It is important to notice that -mfloat-abi=softfp not only makes programs much faster, but also considerably smaller (10%), which would be valuable in OpenWrt.

Different Node.js versions and builds
I have been building Node.js many times for Raspberry Pi and OpenWrt. The above soft/softfp setting for building node does not affect performance much, but it does affect binary size. Node.js v0.10 is faster on Raspberry Pi than v0.12 (which needs some patching to build).

Lua
Apart from the un-optimized OpenWrt Lua build, Lua is consistently 20-25x slower than native for RPi/x86/x64. It is not like the small cache of the RPi, or some other limitation of the CPU, makes it worse for interpreted languages than x86/x64.

RPi ARMv6 VFPv2
While perhaps not the best FPU in the world, the VFPv2 floating point unit of the RPi ARMv6 delivers quite decent performance (slightly worse per clock cycle) compared to x86 and x64. It does not seem like the VFPv2 is to be blamed for the poor performance of Node.js on ARM.

Conclusion and Key finding
While Node.js (V8) for x86/x64 is near-native-speed, on the ARM it is rather near-Lua-speed: just another interpreted language, mostly. This does not seem to be caused by any limitation or flaw in the (RPi) ARM CPU, but rather the V8 implementation for x86/x64 being superior to that for ARM (ARMv6 at least).
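To make the comparison concrete, the slowdown relative to native C (hard float) can be computed from the table for the machines that have all three values. A small sketch; the data is from the table, the names are mine:

```javascript
// Timings in ms from the table in "The Numbers" (C with hard floats, Node.js, Lua).
const timings = [
  { machine: 'RPi 700MHz ARMv6 (Raspbian)', c: 493, node: 14660, lua: 12130 },
  { machine: 'Eee701 900MHz Celeron x86', c: 295, node: 500, lua: 7992 },
  { machine: '3000MHz Athlon II X2 x64', c: 56, node: 59, lua: 1267 },
];

// Slowdown factor of each scripted runtime relative to native C.
const ratios = timings.map((t) => ({
  machine: t.machine,
  nodeVsC: t.node / t.c,
  luaVsC: t.lua / t.c,
}));

for (const r of ratios) {
  console.log(
    r.machine + ': Node.js ' + r.nodeVsC.toFixed(1) + 'x, Lua ' + r.luaVsC.toFixed(1) + 'x slower than C'
  );
}
```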

Effects of cache on performance

It is not clear to me why Node.js is so amazingly slow on a Raspberry Pi (article 1, article 2).

Is it because of the small cache (16kb+128kb)? Is Node.js emitting poor code on ARM? Well, I decided to investigate the cache issue. The 128kb cache of the Raspberry Pi is supposed to be primarily used by the GPU; is it actually effective at all?

A suitable test algorithm
To understand what I test, and because of the fun of it, I wanted to implement a suitable test program. I can imagine a good test program for cache testing would:

  • be reasonably slow/fast, so measuring execution time is practical and meaningful
  • have working data sets in sizes 10kB-10MB
  • solve the same problem with different working set sizes, in a way that the theoretical execution time is the same and any difference is because of cache only
  • be reasonably simple to implement and understand, while not so trivial that the optimizer just gets rid of the problem entirely

Finally, I think it is fun if the program does something slightly meaningful.

I found that Bubblesort (and later Selectionsort) were good problems, if combined with a quasi twist. Original bubble sort:

Array to sort: G A F C B D H E   ( N=8 )
Sorted array:  A B C D E F G H
Theoretical cost: O(N^2) = 64/2 = 32
Actual cost: 7+6+5+4+3+2+1     = 28 (compares and conditional swaps)

I invented the following cache-optimized Bubble-Twist-Sort:

Array to sort:                G A F C B D H E
Sort halves using Bubblesort: A C F G B D E H
Now, the twist:                                 ( G>B : swap )
                              A C F B G D E H   ( F>D : swap )
                              A C D B G F E H   ( C<E : done )
Sort halves using Bubblesort: A B C D E F G H
Theoretical cost = 16/2 + 16/2 (first two Bubblesort)
                 + 4/2         (expected number of twist-swaps)
                 + 16/2 + 16/2 (second two Bubblesort)
                 = 34
Actual cost: 4*(3+2+1) + 2 = 26

Anyway, for larger arrays the actual costs get very close. The idea here is that I can run a Bubblesort on 1000 elements (effectively using 1000 units of memory intensively for ~500000 operations). But instead of doing that, I can replace it with 4 runs on 500 elements (4 * ~125000 operations + ~250 operations). So I am solving the same problem, using the same algorithm, but optimizing for smaller cache sizes.
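The Bubble-Twist-Sort described above can be sketched like this. It is an illustration in JavaScript, not the C code used for the benchmarks; the function names and the compare/swap counters are my own additions.

```javascript
// Bubblesort on a[lo..hi-1]; returns the number of compares performed.
function bubbleSortRange(a, lo, hi) {
  let compares = 0;
  for (let end = hi; end > lo + 1; end--) {
    for (let j = lo + 1; j < end; j++) {
      compares++;
      if (a[j - 1] > a[j]) {
        const tmp = a[j - 1];
        a[j - 1] = a[j];
        a[j] = tmp;
      }
    }
  }
  return compares;
}

// Sort both halves, "twist" (swap the largest of the left half with the
// smallest of the right half until they are in order), then sort both halves again.
function bubbleTwistSort(a) {
  const n = a.length;
  const mid = Math.floor(n / 2);
  let compares = bubbleSortRange(a, 0, mid) + bubbleSortRange(a, mid, n);
  let twistSwaps = 0;
  for (let k = 0; k < mid; k++) {
    const l = mid - 1 - k; // walks down from the end of the left half
    const r = mid + k;     // walks up from the start of the right half
    if (a[l] <= a[r]) break;
    const tmp = a[l];
    a[l] = a[r];
    a[r] = tmp;
    twistSwaps++;
  }
  compares += bubbleSortRange(a, 0, mid) + bubbleSortRange(a, mid, n);
  return { compares, twistSwaps };
}

const letters = 'GAFCBDHE'.split('');
const cost = bubbleTwistSort(letters);
console.log(letters.join(' '), cost);
```

For the 8-letter example this gives 24 compares in the four half-sorts plus 2 twist-swaps, which adds up to the 26 counted above.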

Enough of Bubblesort… you are probably either lost in details or disgusted with this horribly stupid idea of optimizing and not optimizing Bubblesort at the same time.

I made a Selectionsort option. And for a given data size I allowed it either to sort bytes or 32-bit words (which is 16 times faster, for the same data size).

The test machines
I gathered 10 different test machines, with different cache sizes and instruction sets:

	QNAP	wdr3600	ac20i	Rpi	wdr4900	G4	Celeron	Xeon	Athlon	i5
								~2007   ~2010   ~2013
========================================================================================
L1	32	32	32	16	32	64	32	32	128	32
L2				128	256	256	512	6M	1024	256
L3						1024				6M
MHz	500	560	580	700	800	866	900	2800	3000	3100
CPU	ARMv5	Mips74K	Mips24K	ARMv6	PPC	PPC	x86	x64	x64	x64
OS	Debian	OpenWrt	OpenWrt	OpenWrt	OpenWrt	Debian	Ubuntu	MacOSX	Ubuntu	Windows

Note that for the multi-core machines (Xeon, Athlon, i5) the L2/L3 caches may be shared or not between cores and the numbers above are a little ambiguous. The sizes should be for Data cache when separate from Instruction cache.

The benchmarks
I ran Bubblesort for sizes 1000000 bytes down to 1000000/512. For Selectionsort I just ran three rounds. For Bubblesort I also ran for 2000000 and 4000000 but those times are divided by 4 and 16 to be comparable. All times are in seconds.

Bubblesort

	QNAP	wdr3600	ac20i	rpi	wdr4900	G4	Celeron	Xeon	Athlon	i5
========================================================================================
4000000	1248	1332	997	1120	833		507	120	104	93
2000000	1248	1332	994	1118	791	553	506	114	102	93
1000000	1274	1330	1009	1110	757	492	504	113	96	93
500000	1258	1194	959	1049	628	389	353	72	74	63
250000	1219	1116	931	911	445	309	276	53	61	48
125000	1174	1043	902	701	397	287	237	44	56	41
62500	941	853	791	573	373	278	218	38	52	37
31250	700	462	520	474	317	260	208	36	48	36
15625	697	456	507	368	315	258	204	35	49	35
7812	696	454	495	364	315	256	202	34	49	35
3906	696	455	496	364	315	257	203	34	47	35
1953	698	456	496	365	320	257	204	35	45	35

Selectionsort

	QNAP	wdr3600	ac20i	rpi	wdr4900	G4	Celeron	Xeon	Athlon	i5
========================================================================================
1000000	1317	996	877	1056	468	296	255	30	45	19
31250	875	354	539	559	206	147	245	28	40	21
1953	874	362	520	457	209	149	250	30	41	23

Theoretically, all timings for a single machine should be equal. The differences can largely be explained by cache sizes, but obviously there are more things happening here.

Findings
Mostly the data makes sense. The caches create plateaus and the L1 size can almost be predicted from the data. I would have expected even bigger differences between best and worst cases; now it is in the range 180%-340%. The most surprising thing (?) is the Selectionsort results. They are sometimes a lot faster (G4, i5) and sometimes significantly slower! This is strange: I have no idea.
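As a rough check of that range, the sketch below compares the 1000000-byte row with the 1953-byte row of the Bubblesort table for each machine. The numbers are copied from the table; picking exactly these two rows is my assumption about how the range was derived.

```javascript
// Bubblesort times in seconds, copied from the table above.
const machines = ['QNAP', 'wdr3600', 'ac20i', 'rpi', 'wdr4900', 'G4', 'Celeron', 'Xeon', 'Athlon', 'i5'];
const largeSet = [1274, 1330, 1009, 1110, 757, 492, 504, 113, 96, 93]; // 1000000 bytes
const smallSet = [698, 456, 496, 365, 320, 257, 204, 35, 45, 35];      // 1953 bytes

// Time for the large working set as a percentage of the small (in-cache) one.
const slowdowns = machines.map((name, i) => ({
  name,
  percent: Math.round((100 * largeSet[i]) / smallSet[i]),
}));

for (const s of slowdowns) {
  console.log(s.name + ': ' + s.percent + '%');
}
```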

I believe the i5's superior performance for Selectionsort 1000000 is due to cache and branch prediction.

I note that the QNAP and Archer C20i both have DDRII memory, while the RPi has SDRAM. This seems to make a difference when work sizes get bigger.

I have also made other benchmarks where the WDR4900 was faster than the G4 – not this time.

The Raspberry Pi
What did I learn about the Raspberry Pi? Well, memory is slow and branch prediction seems bad. It is typically 10-15 times slower than the modern (Xeon, Athlon, i5) CPUs. But for large selectionsort problems the difference is up to 40x. This starts getting close to the Node.js crap speed. It is not hard to imagine that Node.js benefits heavily from great branch prediction and large cache sizes – both things that the RPi lacks.

What about the 128k cache? Does it work? Well, compared to the L1-only machines, performance of the RPi degrades slightly more slowly, perhaps. Not impressed.

Bubblesort vs Selectionsort
It really puzzles me that Bubblesort ever beats Selectionsort:

void bubbelsort_uint32_t(uint32_t* array, size_t len) {
  size_t i, j, jm1;
  uint32_t tmp;
  for ( i=len ; i>1 ; i-- ) {
    for ( j=1 ; j<i ; j++ ) {
      jm1 = j-1;
      if ( array[jm1] > array[j] ) {
        tmp = array[jm1];
        array[jm1] = array[j];
        array[j] = tmp;
      }
    }
  }
}

void selectionsort_uint32_t(uint32_t* array, size_t len) {
  size_t i, j, best;
  uint32_t tmp;
  for ( i=1 ; i<len ; i++ ) {
    best = i-1;
    for ( j=i ; j<len ; j++ ) {
      if ( array[best] > array[j] ) {
        best = j;
      }
    }
    tmp = array[i-1];
    array[i-1] = array[best];
    array[best] = tmp;
  } 
}

Essentially, the difference is how the swap takes place outside the inner loop (once) instead of all the time. The Selectionsort should also be able to benefit from easier branch prediction and much fewer writes to memory. Perhaps compiling to assembly code would reveal something odd going on.
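The difference in memory writes can be made visible by counting them. The sketch below ports the two C functions above to JavaScript and adds a write counter; the counter and the function names are my additions, and it only illustrates the write counts, not the timing behaviour.

```javascript
// Bubblesort, same loop structure as the C version; counts array writes.
function bubbleSortWrites(input) {
  const a = input.slice();
  let writes = 0;
  for (let i = a.length; i > 1; i--) {
    for (let j = 1; j < i; j++) {
      if (a[j - 1] > a[j]) {
        const tmp = a[j - 1];
        a[j - 1] = a[j];
        a[j] = tmp;
        writes += 2; // every swap inside the inner loop writes two elements
      }
    }
  }
  return { sorted: a, writes };
}

// Selectionsort, same loop structure as the C version; counts array writes.
function selectionSortWrites(input) {
  const a = input.slice();
  let writes = 0;
  for (let i = 1; i < a.length; i++) {
    let best = i - 1;
    for (let j = i; j < a.length; j++) {
      if (a[best] > a[j]) best = j;
    }
    const tmp = a[i - 1];
    a[i - 1] = a[best];
    a[best] = tmp;
    writes += 2; // one swap per outer iteration, outside the inner loop
  }
  return { sorted: a, writes };
}

const reversed = [5, 4, 3, 2, 1];
console.log('bubblesort writes:   ', bubbleSortWrites(reversed).writes);
console.log('selectionsort writes:', selectionSortWrites(reversed).writes);
```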

Power of 2 aligned data sets
I avoided using a data size that is an exact power of two: 1000×1000 rather than 1024×1024. I did this because caches are supposed to work better this way. Perhaps I will make some 1024×1024 runs some day.

JavaScript: switch options

Is the nicest solution also the fastest?

Here is a little thing I ran into that I found interesting enough to test it. In JavaScript, you get a parameter (from a user, perhaps a web service), and depending on the parameter value you will call a particular function.

The first solution that comes to my mind is a switch:

function test_switch(code) {
  switch ( code ) {
  case 'Alfa':
    call_alfa();
    break;
  ...
  case 'Mike':
    call_mike();
    break;
  default:
    call_default();
  }
}

That is good if you know all the labels when you write the code. A more compact solution that allows you to dynamically add functions is to let the functions just be properties of an object:

x1 = {
  Alfa:call_alfa,
  Bravo:call_bravo,
  Charlie:call_charlie,
...
  Mike:call_mike
};

function test_prop(code) {
  var f = x1[code];
  if ( f ) f();
  else call_default();
}

And as a variant – not really making sense in this simple example but anyway – you could loop over the properties (functions) until you find the right one:

function test_prop_loop(code) {
  var p;
  for ( p in x1 ) {
    if ( p === code ) {
      x1[p]();
      return;
    }
  }
  call_default();
}

And, since we are into loops, this construction does not make so much sense in this simple example, but anyway:

x2 = [
  { code:'Alfa'     ,func:call_alfa    },
  { code:'Bravo'    ,func:call_bravo   },
  { code:'Charlie'  ,func:call_charlie },
...
  { code:'Mike'     ,func:call_mike    }
];

function test_array_loop(code) {
  var i, o;
  for ( i=0 ; i<x2.length ; i++ ) {
    o = x2[i];
    if ( o.code === code ) {
      o.func();
      return;
    }
  }
  call_default();
}

Alfa, Bravo…, Mike and default
I created exactly 13 options, and labeled them Alfa, Bravo, … Mike. And all the test functions accept an invalid code and fall back to a default function.

The loops should clearly be worse for more options. However it is not obvious what the cost is for more options in the switch case.

I will make three test runs: 5 options (Alfa to Echo), 13 options (Alfa to Mike) and 14 options (Alfa to November) where the last one ends up in default. For each run, each of the 5/13/14 options will be equally frequent.
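A test run of this kind could be driven by a small harness like the one below. It is a sketch, not the harness actually used: the handler functions are stand-ins that only count calls, and `runBenchmark`, `makeHandler` and the round-robin distribution of codes are my assumptions. Only the property-lookup variant is wired in; the other variants would be passed to `runBenchmark` in the same way.

```javascript
// Stand-in handlers that just count how often they are called.
const calls = {};
function makeHandler(name) {
  calls[name] = 0;
  return function () { calls[name]++; };
}

const CODES = ['Alfa', 'Bravo', 'Charlie', 'Delta', 'Echo', 'Foxtrot', 'Golf',
               'Hotel', 'India', 'Juliett', 'Kilo', 'Lima', 'Mike'];
const handlers = {};
for (const code of CODES) handlers[code] = makeHandler(code);
const callDefault = makeHandler('default');

// Property-lookup dispatch with a default fallback.
function dispatch(code) {
  const f = handlers[code];
  if (f) f();
  else callDefault();
}

// Calls dispatchFn `iterations` times, cycling through `codes` so that
// every code is equally frequent. Returns the elapsed time in ms.
function runBenchmark(dispatchFn, codes, iterations) {
  const start = process.hrtime();
  for (let i = 0; i < iterations; i++) {
    dispatchFn(codes[i % codes.length]);
  }
  const [s, ns] = process.hrtime(start);
  return s * 1000 + ns / 1e6;
}

const runs = {
  5: CODES.slice(0, 5),            // Alfa to Echo
  13: CODES,                       // Alfa to Mike
  14: CODES.concat(['November']),  // November has no handler and ends up in default
};
for (const [label, codes] of Object.entries(runs)) {
  console.log(label + ' options: ' + runBenchmark(dispatch, codes, 1000000).toFixed(1) + ' ms');
}
```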

Benchmark Results
I am benchmarking using Node.js 0.12.2 on a Raspberry Pi 1. The startup time for Node.js is 2.35 seconds, and I have subtracted that from all benchmark times. I also ran the benchmarks on a MacBook Air with Node.js 0.10.35. All benchmarks were repeated three times and the median has been used. Iteration count: 1000000.

(ms)       ======== RPi ========     ==== MacBook Air ====
              5      13      14         5      13      14
============================================================
switch     1650    1890    1930        21      28      30
prop       2240    2330    2890        22      23      37
proploop   2740    3300    3490        31      37      38
loop       2740    4740    4750        23      34      36

Conclusions
Well, most notable (and again), the RPi ARMv6 is not fast running Node.js!

Using the simple property construction seems to make sense from a performance perspective, although the good old switch is also fast. The loops have no advantages. Also, the penalty for the default case is quite heavy for the simple property case; if you know the “code” is valid the property scales very nicely.

It is however a little interesting that on the ARM the loop over properties is better than the loop over integers. On the x64 it is the other way around.

Variants of Simple Property Case
The following are essentially equally fast:

function test_prop(code) {
  var f = x1[code];   
  if ( f ) f();       
  else call_x();                        
}   

function test_prop(code) {
  var f = x1[code];   
  if ( 'function' === typeof f ) f();
  else call_x();                        
}   

function test_prop(code) {
  x1[code]();                          
}   

So, it does not cost much to have a safety test and a default case (just in case), but it is expensive to use it. This one, however:

function test_prop(code) {
  try {
    x1[code]();
  } catch(e) {
    call_x();
  }
}

comes at a cost of 5ms on the MacBook, when the catch is never used. If the catch is used (1 out of 14) the run takes a full second instead of 37ms!