Author Archives: zo0ok

OpenWrt 15.05 on Legacy Devices (16Mb RAM)

There are 86 devices on the OpenWrt homepage listed as supported, but with only 16Mb of RAM. Those devices work just fine with OpenWrt Backfire 10.03.1, but not with more recent OpenWrt releases.

I myself own a Linksys WRT54GL and I used Barrier Breaker 14.07 with some success.

With 15.05 there is a new feature available: zram-swap. A bit simplified, it means the system can compress its memory, effectively making better use of it.

I decided to try out 15.05 RC3 on my WRT54GL.

The standard image
The standard image is 3936256 and the device page for WRT54GL says: As the WRT54GL has only 4Mb flash, any image sent to the device must be 3866624 bytes or smaller. So the standard image is out of the question. Instead I downloaded the Image Builder from the same folder.

The Image Builder
The Image Builder is very easy to use and requires an x64 linux computer.

make image PROFILE=Broadcom-b43 PACKAGES="zram-swap -kmod-ppp -kmod-pppox -kmod-pppoe -ppp -ppp-mod-pppoe -kmod-b43 -kmod-b43legacy -kmod-mac80211 -kmod-cfg80211"

After a little while this has produced custom images, minus ppp-stuff, minus wireless stuff (more on that later), plus zram-swap. Also, LuCi is not there. The image is found in bin/brcm47xx, it is 3012kb and is installed the normal way on your WRT54GL.

Trying 15.05
Logging in via ssh (dropbear) is fine:

BusyBox v1.23.2 (2015-06-18 17:05:04 CEST) built-in shell (ash)

  _______                     ________        __
 |       |.-----.-----.-----.|  |  |  |.----.|  |_
 |   -   ||  _  |  -__|     ||  |  |  ||   _||   _|
 |_______||   __|_____|__|__||________||__|  |____|
          |__| W I R E L E S S   F R E E D O M
 -----------------------------------------------------
 CHAOS CALMER (15.05-rc3, r46163)
 -----------------------------------------------------
  * 1 1/2 oz Gin            Shake with a glassful
  * 1/4 oz Triple Sec       of broken ice and pour
  * 3/4 oz Lime Juice       unstrained into a goblet.
  * 1 1/2 oz Orange Juice
  * 1 tsp. Grenadine Syrup
 -----------------------------------------------------
root@OpenWrt:~# 

Top looks tight but not alarming (as usual):

Mem: 11568K used, 1056K free, 44K shrd, 1208K buff, 3228K cached
CPU:   8% usr   8% sys   0% nic  83% idle   0% io   0% irq   0% sirq
Load average: 0.13 0.23 0.11 1/31 1061
  PID  PPID USER     STAT   VSZ %VSZ %CPU COMMAND
 1061  1056 root     R     1488  12%  17% top
  631     1 root     S     1656  13%   0% /sbin/netifd
  848     1 root     S     1492  12%   0% /usr/sbin/ntpd -n -S /usr/sbin/ntpd-h
 1056  1055 root     S     1492  12%   0% -ash
  735   631 root     S     1488  12%   0% udhcpc -p /var/run/udhcpc-eth0.2.pid
    1     0 root     S     1444  11%   0% /sbin/procd
 1055   757 root     S     1224  10%   0% /usr/sbin/dropbear -F -P /var/run/dro
  650     1 root     S     1196   9%   0% /usr/sbin/odhcpd
  757     1 root     S     1156   9%   0% /usr/sbin/dropbear -F -P /var/run/dro
  580     1 root     S     1060   8%   0% /sbin/logd -S 16
  868     1 nobody   S      996   8%   0% /usr/sbin/dnsmasq -C /var/etc/dnsmasq
  308     1 root     S      916   7%   0% /sbin/ubusd
  737   631 root     S      812   6%   0% odhcp6c -s /lib/netifd/dhcpv6.script
  337     1 root     S      772   6%   0% /sbin/askfirst /bin/ash --login
    4     2 root     SW       0   0%   0% [kworker/0:0]
    8     2 root     SW       0   0%   0% [kworker/u2:1]
    3     2 root     SW       0   0%   0% [ksoftirqd/0]
   14     2 root     SW       0   0%   0% [kswapd0]
    6     2 root     SW       0   0%   0% [kworker/u2:0]
  237     2 root     SWN      0   0%   0% [jffs2_gcd_mtd5]

The swap seems to work, at least in theory:

root@OpenWrt:~# free
             total         used         free       shared      buffers
Mem:         12624        11552         1072           44         1208
-/+ buffers:              10344         2280
Swap:         6140           72         6068

But that is the end of the good news.

opkg runs out of memory
Trying to install a package fails (in a new way):

# opkg install kmod-b43
Installing kmod-b43 (3.18.17+2015-03-09-3) to root...
Downloading http://downloads.openwrt.org/chaos_calmer/15.05-rc3/brcm47xx/legacy/packages/base/kmod-b43_3.18.17+2015-03-09-3_brcm47xx.ipk.
Collected errors:
 * gz_open: fork: Cannot allocate memory.
 * opkg_install_pkg: Failed to unpack control files from /tmp/opkg-lE7SIf/kmod-b43_3.18.17+2015-03-09-3_brcm47xx.ipk.
 * opkg_install_cmd: Cannot install package kmod-b43.

This happens also without zram-swap installed. I tried different packages but none of those I tried installed successfully. Effectively opkg is broken. One way to deal with this is to build an image with exactly the packages I need, and rebuild the image every time I want a new package. Which leads me to the next problem.

sysupgrade runs out of memory
I have found that flashing my 15.05 image to my WRT54GL (from 10.03.1 or 14.07) is fine. But flashing from 15.05 is tricky because it seems there is not enough RAM for sysupgrade. And it is quite scary when sysupgrade stalls, because you dont know if it is in the middle of flashing but failing to let you know.

One way to get around this is to flash a smaller image that use less space on /tmp. I tried 8.09.1 for the first time ever for this reason. Another (not recommended way) is to pipe from nc to mtd directly.

I found out (the hard way) about system recovery mode: start your WRT54GL, press the reset button on the back side (more is better), and it starts in recovery mode where you can telnet to it and sysupgrade runs just fine.

Not even in recovery mode everything is fine: for example, when trying the firstboot command it did not finish properly and I had to reset the WRT54GL.

A few times I forgot to use the -n option with sysupgrade: that is not such a good thing when you run it in recovery mode and perhaps flash a different firmware version.

testing wifi
I built a new image with WiFi installed and flashed it from failsafe mode:

make image PROFILE=Broadcom-b43 PACKAGES="zram-swap -kmod-ppp -kmod-pppox -kmod-pppoe -ppp -ppp-mod-pppoe"

Well, I tried different things… on one occation I had WiFi without encryption working. However, most of the time, activating WiFi just makes the WRT54GL not responding or very slow.

Conclusion
zram-swap is not the silver bullet that makes OpenWrt run on 16Mb devices. As with 14.07, you can probably use Image Builder to build a useful minimal image: get rid of the firewall, the WiFi, LuCl of course, and use it for something else – fine! But as a WiFi router: use Tomato or 10.03.1 instead.

For now, my WRT54GL is flashed with 10.03.1, completely unconfigured, and stored away for future adventures. At least it is not bricked, and I never needed to connect a TTL-cable to it.

Building Node.js for OpenWrt (mipsel)

I managed to build (and run) Node.js v0.10.38 for OpenWrt and my Archer C20i with a MIPS 24K Little Endian CPU, without FPU (target=ramips/mt7620).

First edit (set to false):

deps/v8/build/common.gypi

    54      # Similar to vfp but on MIPS.
    55      'v8_can_use_fpu_instructions%': 'false',
   
    63      # Similar to the ARM hard float ABI but on MIPS.
    64      'v8_use_mips_abi_hardfloat%': 'false',

Then, use this script to run configure:

#!/bin/sh -e

#Tools
export CSTOOLS="$STAGING_DIR"
export CSTOOLS_INC=${CSTOOLS}/include
export CSTOOLS_LIB=${CSTOOLS}/lib

export CC="mipsel-openwrt-linux-uclibc-gcc"
export CXX="mipsel-openwrt-linux-uclibc-g++"
export LD="mipsel-openwrt-linux-uclibc-ld"

export CFLAGS="-isystem${CSTOOLS_INC}"
export CPPFLAGS="-isystem${CSTOOLS_INC}"

export PATH="${CSTOOLS}/bin:$PATH"

./configure --without-snapshot --dest-cpu=mipsel --dest-os=linux --without-npm

bash --norc

Then just “make”.

In order to run the node binary on OpenWrt you need to install:

# opkg update
# opkg install librt
# opkg install libstdcpp

I have had no success with v0.12.2 and v0.12.4.

Other MIPS?
The only other MIPS I have had the opportunity to try was my WDR3600, a Big Endian 74K. It does not work:

  • v0.10.38 does not build at all (big endian MIPS seems unsupported)
  • v0.12.* builds, but it does not run (floating point exceptions), despite I managed to build for Soft Float.

It seems the big endian support is rather new, so perhaps it will mature and start working in a later version.

Conclusion
It is quite questionable if CPUs without FPU will be supported at all in Node.js/V8 in the future.

Perhaps for OpenWrt LuaNode would be a better choice?

Node.js performance of Raspberry Pi 1 sucks

In several previous posts I have studied the performance of the Raspberry Pi (version 1) and Node.js to find out why the Raspberry Pi underperforms so badly when running Node.js.

The first two posts indicate that the Raspberry Pi underperforms about 10x compared to an x86/x64 machine, after compensation for clock frequency is made. The small cache size of the Raspberry Pi is often mentioned as a cause for its poor performance. In the third post I examine that, but it is not that horribly bad: about 3x worse performance for big memory needs compared to in-cache-situations. It appears the slow SDRAM of the RPi is more of a problem than the small cache itself.

The Benchmark Program
I wanted to relate the Node.js slowdown to some other scripted language. I decided Lua is nice. And I was lucky to find Mandelbrot implementations in several languages!

I modified the program(s) slightly, increasing the resolution from 80 to 160. I also made a version that did almost nothing (MAX_ITERATIONS=1) so I could measure and substract the startup cost (which is signifacant for Node.js) from the actual benchmark values.

The Numbers
Below are the average of three runs (minus the average of three 1-iteration rounds), in ms. The timing values were very stable over several runs.

 (ms)                           C/Hard   C/Soft  Node.js     Lua
=================================================================
 QNAP TS-109 500MHz ARMv5                 17513    49376   39520
 TP-Link Archer C20i 560MHz MIPS          45087    65510   82450
 RPi 700MHz ARMv6 (Raspbian)       493             14660   12130
 RPi 700MHz ARMv6 (OpenWrt)        490    11040    15010   31720
 Eee701 900MHz Celeron x86         295               500    7992
 3000MHz Athlon II X2 x64           56                59    1267

Notes on Hard/Soft floats:

  • Raspbian is armhf, only allowing hard floats (-mfloat-abi=hard)
  • OpenWrt is armel, allowing both hard floats (-mfloat-abi=softfp) and soft floats (-mfloat-abi=soft).
  • The QNAP has no FPU and generates runtime error with hard floats
  • The other targets produce linkage errors with soft floats

The Node.js versions are slightly different, and so are the Lua versions. This makes no significant difference.

Findings
Calculating the Mandelbrot with the FPU is basically “free” (<0.5s). Everything else is waste and overhead.

The cost of soft float is about 10s on the RPI. The difference between Node.js on Raspbian and OpenWrt is quite small – either both use the FPU, or none of them does.

Now, the interesting thing is to compare the RPi with the QNAP. For the C-program with the soft floats, the QNAP is about 1.5x slower than the RPi. This matches well with earlier benchmarks I have made (see 1st and 3rd link at top of post). If the RPi would have been using soft floats in Node.js, it would have completed in about 30 seconds (based on the QNAP 50 seconds). The only thing (I can come up with) that explains the (unusually) large difference between QNAP and RPi in this test, is that the RPi actually utilizes the FPU (both Raspbian and OpenWrt).

OpenWrt and FPU
The poor Lua performance in OpenWrt is probably due to two things:

  1. OpenWrt is compiled with -Os rather than -O2
  2. OpenWrt by default uses -mfloat-abi=soft rather than -mfloat-abi=softfp (which is essentially like hard).

It is important to notice that -mfloat-abi=softfp not only makes programs much faster, but also quite much smaller (10%), which would be valuable in OpenWrt.

Different Node.js versions and builds
I have been building Node.js many times for Raspberry Pi and OpenWrt. The above soft/softfp setting for building node does not affect performance much, but it does affect binary size. Node.js v0.10 is faster on Raspberry Pi than v0.12 (which needs some patching to build).

Lua
Apart from the un-optimized OpenWrt Lua build, Lua is consistently 20-25x slower than native for RPi/x86/x64. It is not like the small cache of the RPi, or some other limitation of the CPU, makes it worse for interpreted languages than x86/x64.

RPi ARMv6 VFPv2
While perhaps not the best FPU in the world, the VFPv2 floating point unit of the RPi ARMv6 delivers quite decent performance (slightly worse per clock cycle) compared to x86 and x64. It does not seem like the VFPv2 is to be blamed for the poor performance of Node.js on ARM.

Conclusion and Key finding
While Node.js (V8) for x86/x64 is near-native-speed, on the ARM it is rather near-Lua-speed: just another interpreted language, mostly. This does not seem to be caused by any limitation or flaw in the (RPi) ARM cpu, but rather the V8 implementation for x86/x64 being superior to that for ARM (ARMv6 at least).

Effects of cache on performance

It is not clear to me, why is Node.js so amazyingly slow on a Raspberry Pi (article 1, article 2)?

Is it because of the small cache (16kb+128kb)? Is Node.js emitting poor code on ARM? Well, I decided to investigate the cache issue. The 128kb cache of the Raspberry Pi is supposed to be primarily used by the GPU; is it actually effective at all?

A suitable test algorithm
To understand what I test, and because of the fun of it, I wanted to implement a suitable test program. I can imagine a good test program for cache testing would:

  • be reasonably slow/fast, so measuring execution time is practical and meaningful
  • have working data sets in sizes 10kb-10Mb
  • the same problem should be solvable with different work set sizes, in a way that the theoretical execution time should be the same, but the difference is because of cache only
  • be reasonably simple to implement and understand, while not so trivial that the optimizer just gets rid of the problem entirely

Finally, I think it is fun if the program does something slightly meaningful.

I found that Bubblesort (and later Selectionsort) were good problems, if combined with a quasi twist. Original bubble sort:

Array to sort: G A F C B D H E   ( N=8 )
Sorted array:  A B C D E F G H
Theoretical cost: O(N2) = 64/2 = 32
Actual cost: 7+6+5+4+3+2+1     = 28 (compares and conditional swaps)

I invented the following cache-optimized Bubble-Twist-Sort:

Array to sort:                G A F C B D H E
Sort halves using Bubblesort: A C F G B D E H
Now, the twist:                                 ( G>B : swap )
                              A C F B G D E H   ( D>F : swap )
                              A C D B G F E H   ( C<E : done )
Sort halves using Bubblesort: A B C D E F G H
Theoretical cost = 16/2 + 16/2 (first two bubbelsort)
                 + 4/2         (expected number of twist-swaps)
                 + 16/2 + 16/2 (second two bubbelsort)
                 = 34
Actual cost: 4*(3+2+1) + 2 = 26

Anyway, for larger arrays the actual costs get very close. The idea here is that I can run a bubbelsort on 1000 elements (effectively using 1000 memory units of memory intensively for ~500000 operations). But instead of doing that, I can replace it with 4 runs on 500 elements (4* ~12500 operations + ~250 operations). So I am solving the same problem, using the same algorithm, but optimizing for smaller cache sizes.

Enough of Bubblesort… you are probably either lost in details or disgusted with this horribly stupid idea of optimizing and not optimizing Bubblesort at the same time.

I made a Selectionsort option. And for a given data size I allowed it either to sort bytes or 32-bit words (which is 16 times faster, for same data size).

The test machines
I gathered 10 different test machines, with different cache sizes and instructions sets:

	QNAP	wdr3600	ac20i	Rpi	wdr4900	G4	Celeron	Xeon	Athlon	i5
								~2007   ~2010   ~2013
========================================================================================
L1	32	32	32	16	32	64	32	32	128	32
L2				128	256	256	512	6M	1024	256
L3						1024				6M
Mhz	500	560	580	700	800	866	900	2800	3000	3100
CPU	ARMv5	Mips74K	Mips24K	ARMv6	PPC	PPC	x86	x64	x64	x64
OS	Debian	OpenWrt	OpenWrt	OpenWrt	OpenWrt	Debian	Ubuntu	MacOSX	Ubuntu	Windows

Note that for the multi-core machines (Xeon, Athlon, i5) the L2/L3 caches may be shared or not between cores and the numbers above are a little ambigous. The sizes should be for Data cache when separate from Instruction cache.

The benchmarks
I ran Bubblesort for sizes 1000000 bytes down to 1000000/512. For Selectionsort I just ran three rounds. For Bubblesort I also ran for 2000000 and 4000000 but those times are divided by 4 and 16 to be comparable. All times are in seconds.

Bubblesort

	QNAP	wdr3600	ac20i	rpi	wdr4900	G4	Celeron	Xeon	Athlon	i5
========================================================================================
4000000	1248	1332	997	1120	833		507	120	104	93
2000000	1248	1332	994	1118	791	553	506	114	102	93
1000000	1274	1330	1009	1110	757	492	504	113	96	93
500000	1258	1194	959	1049	628	389	353	72	74	63
250000	1219	1116	931	911	445	309	276	53	61	48
125000	1174	1043	902	701	397	287	237	44	56	41
62500	941	853	791	573	373	278	218	38	52	37
31250	700	462	520	474	317	260	208	36	48	36
15625	697	456	507	368	315	258	204	35	49	35
7812	696	454	495	364	315	256	202	34	49	35
3906	696	455	496	364	315	257	203	34	47	35
1953	698	456	496	365	320	257	204	35	45	35

Selectionsort

	QNAP	wdr3600	ac20i	rpi	wdr4900	G4	Celeron	Xeon	Athlon	i5
========================================================================================
1000000	1317	996	877	1056	468	296	255	30	45	19
31250	875	354	539	559	206	147	245	28	40	21
1953	874	362	520	457	209	149	250	30	41	23

Theoretically, all timings for a single machine should be equal. The differences can be explained much by cache sizes, but obviously there are more things happening here.

Findings
Mostly the data makes sense. The caches creates plateaus and the L1 size can almost be prediced by the data. I would have expected even bigger differences between best/worse-cases; now it is in the range 180%-340%. The most surprising thing (?) is the Selectionsort results. They are sometimes a lot faster (G4, i5) and sometimes significantly slower! This is strange: I have no idea.

I believe the i5 superior performance of Selectionsort 1000000 is due to cache and branch prediction.

I note that the QNAP and Archer C20i both have DDRII memory, while the RPi has SDRAM. This seems to make a difference when work sizes get bigger.

I have also made other Benchmarks where the WDR4900 were faster than the G4 – not this time.

The Raspberry Pi
What did I learn about the Raspberry Pi? Well, memory is slow and branch prediction seems bad. It is typically 10-15 times slower than the modern (Xeon, Athlon, i5) CPUs. But for large selectionsort problems the difference is up to 40x. This starts getting close to the Node.js crap speed. It is not hard to imagine that Node.js benefits heavily from great branch prediction and large cache sizes – both things that the RPi lacks.

What about the 128k cache? Does it work? Well, compared to the L1-only machines, performance of RPi degrades sligthly slower, perhaps. Not impressed.

Bubblesort vs Selectionsort
It really puzzles me that Bubblesort ever beats Selectionsort:

void bubbelsort_uint32_t(uint32_t* array, size_t len) {
  size_t i, j, jm1;
  uint32_t tmp;
  for ( i=len ; i>1 ; i-- ) {
    for ( j=1 ; j<i ; j++ ) {
      jm1 = j-1;
      if ( array[jm1] > array[j] ) {
        tmp = array[jm1];
        array[jm1] = array[j];
        array[j] = tmp;
      }
    }
  }
}

void selectionsort_uint32_t(uint32_t* array, size_t len) {
  size_t i, j, best;
  uint32_t tmp;
  for ( i=1 ; i<len ; i++ ) {
    best = i-1;
    for ( j=i ; j<len ; j++ ) {
      if ( array[best] > array[j] ) {
        best = j;
      }
    }
    tmp = array[i-1];
    array[i-1] = array[best];
    array[best] = tmp;
  } 
}

Essentially, the difference is how the swap takes place outside the inner loop (once) instead of all the time. The Selectionsort should also be able of benefit from easier branch prediction and much fewer writes to memory. Perhaps compiling to assembly code would reveal something odd going on.

Power of 2 aligned data sets
I avoided using a datasize with the size an exact power of two: 1024×1024 vs 1000×1000. I did this becuase caches are supposed to work better this way. Perhaps I will make some 1024×1024 runs some day.

JavaScript: switch options

Is the nicest solution also the fastest?

Here is a little thing I ran into that I found interesting enough to test it. In JavaScript, you get a parameter (from a user, perhaps a web service), and depending on the parameter value you will call a particular function.

The first solution that comes to my mind is a switch:

function test_switch(code) {
  switch ( code ) {
  case 'Alfa':
    call_alfa();
    break;
  ...
  case 'Mike':
    call_mike();
    break;
  }
  call_default();
}

That is good if you know all the labels when you write the code. A more compact solution that allows you to dynamically add functions is to let the functions just be properties of an object:

x1 = {
  Alfa:call_alfa,
  Bravo:call_bravo,
  Charlie:call_charlie,
...
  Mike:call_mike
};

function test_prop(code) {
  var f = x1[code];
  if ( f ) f();
  else call_default();
}

And as a variant – not really making sense in this simple example but anyway – you could loop over the properties (functions) until you find the right one:

function test_prop_loop(code) {
  var p;
  for ( p in x1 ) {
    if ( p === code ) {
      x1[p]();
      return;
    }
  }
  call_default();
}

And, since we are into loops, this construction does not make so much sense in this simple example, but anyway:

x2 = [
  { code:'Alfa'     ,func:call_alfa    },
  { code:'Bravo'    ,func:call_bravo   },
  { code:'Charlie'  ,func:call_charlie },
...
  { code:'Mike'     ,func:call_mike    }
];

function test_array_loop(code) {
  var i, o;
  for ( i=0 ; i<x2.length ; i++ ) {
    o = x2[i];
    if ( o.code === code ) {
      o.func();
      return;
    }
  }
  call_default();
}

Alfa, Bravo…, Mike and default
I created exactly 13 options, and labeled them Alfa, Bravo, … Mike. And all the test functions accept invalid code and falls back to a default function.

The loops should clearly be worse for more options. However it is not obvious what the cost is for more options in the switch case.

I will make three test runs: 5 options (Alfa to Echo), 13 options (Alfa to Mike) and 14 options (Alfa to November) where the last one ends up in default. For each run, each of the 5/13/14 options will be equally frequent.

Benchmark Results
I am benchmarking using Node.js 0.12.2 on a Raspberry Pi 1. The startup time for Nodejs is 2.35 seconds, and I have reduced that from all benchmark times. I also ran the benchmarks on a MacBook Air with nodejs 0.10.35. All benchmarks were repeated three times and the median has been used. Iteration count: 1000000.

(ms)       ======== RPi ========     ==== MacBook Air ====
              5      13      14         5      13      14
============================================================
switch     1650    1890    1930        21      28      30
prop       2240    2330    2890        22      23      37
proploop   2740    3300    3490        31      37      38
loop       2740    4740    4750        23      34      36

Conclusions
Well, most notable (and again), the RPi ARMv6 is not fast running Node.js!

Using the simple property construction seems to make sense from a performance perspective, although the good old switch also fast. The loops have no advantages. Also, the penalty for the default case is quite heavy for the simple property case; if you know the “code” is valid the property scales very nicely.

It is however a little interesting that on the ARM the loop over properties is better than the loop over integers. On the x64 it is the other way around.

Variants of Simple Property Case
The following are essentially equally fast:

function test_prop(code) {
  var f = x1[code];   
  if ( f ) f();       
  else call_x();                        
}   

function test_prop(code) {
  var f = x1[code];   
  if ( 'function' === typeof f ) f();
  else call_x();                        
}   

function test_prop(code) {
  x1[code]();                          
}   

So, it does not cost much to have a safety test and a default case (just in case), but it is expensive to use it. This one, however:

function test_prop(code) {
  try {
    x1[code]();
  } catch(e) {
    call_x();
  }
}

comes at a cost of 5ms on the MacBook, when the catch is never used. If the catch is used (1 out of 14) the run takes a full second instead of 37ms!

Getting your locale right

So, you get this annoying error (in Debian, some Ubuntu, or perhaps any other Linux or even Unix system):

locale: Cannot set LC_CTYPE to default locale: No such file or directory
locale: Cannot set LC_ALL to default locale: No such file or directory

Perhaps you get it when running apt-get?
You Google for answers and you find nothing useful and/or just contradictory information?
You are connecting over SSH?
You are connecting from another system (like Mac OS X)?
Read on…
In it simplest incarnation, the problem looks like this.

$ man some-none-existing-program
man: can't set the locale; make sure $LC_* and $LANG are correct
No manual entry for some-none-existing-program

$ locale
locale: Cannot set LC_CTYPE to default locale: No such file or directory
locale: Cannot set LC_ALL to default locale: No such file or directory
LANG=en_US.UTF-8
LANGUAGE=en_GB:en
LC_CTYPE=UTF-8
LC_NUMERIC=sv_SE.UTF-8
LC_TIME=sv_SE.UTF-8
LC_COLLATE="en_US.UTF-8"
LC_MONETARY=sv_SE.UTF-8
LC_MESSAGES="en_US.UTF-8"
LC_PAPER=sv_SE.UTF-8
LC_NAME=sv_SE.UTF-8
LC_ADDRESS=sv_SE.UTF-8
LC_TELEPHONE=sv_SE.UTF-8
LC_MEASUREMENT=sv_SE.UTF-8
LC_IDENTIFICATION=sv_SE.UTF-8
LC_ALL=

What happens here, in the second case is that “UTF-8″ is not a valid LC_CTYPE (in Debian) and since LC_ALL is not set it can not fall back properly. That is all. Why is LC_CTYPE invalid? Perhaps because you have used ssh from Mac OS X (or something else):

mac $ locale
LANG=
LC_COLLATE="C"
LC_CTYPE="UTF-8"
LC_MESSAGES="C"
LC_MONETARY="C"
LC_NUMERIC="C"
LC_TIME="C"
LC_ALL=

mac $ ssh user@debianhost

The point is, here LC_CTYPE=UTF-8, and it is valid in OS X. But it is not valid in Debian. That is all there is to it. Try:

# This will fix the problem
mac $ LC_CTYPE=en_US.UTF-8 ssh user@debianhost

# This will produce the problem even from a Debian machine
debianclient $ LC_CTYPE=UTF-8 ssh user@debianhost

# This will (kind of) eleminate the problem when already logged in
$ LC_CTYPE=en_US.UTF-8 locale
-bash: warning: setlocale: LC_CTYPE: cannot change locale (UTF-8)
LANG=en_US.UTF-8
LANGUAGE=en_GB:en
LC_CTYPE=en_US.UTF-8
LC_NUMERIC=sv_SE.UTF-8
LC_TIME=sv_SE.UTF-8
LC_COLLATE="en_US.UTF-8"
LC_MONETARY=sv_SE.UTF-8
LC_MESSAGES="en_US.UTF-8"
LC_PAPER=sv_SE.UTF-8
LC_NAME=sv_SE.UTF-8
LC_ADDRESS=sv_SE.UTF-8
LC_TELEPHONE=sv_SE.UTF-8
LC_MEASUREMENT=sv_SE.UTF-8
LC_IDENTIFICATION=sv_SE.UTF-8
LC_ALL=

So, the problem is that all the guides on the internet presume you have a problem with your locales on your server, which you dont! The locale is probably fine. It is just that the ssh client computer has an LC_CTYPE which does not happen to be valid on the server.

The Debian Locale guide itself is of course correct. And it states that you should not set LC_ALL=en_US.UTF-8 (!). That is however the only “working” answer I found online.

Solution
I guess the easiest thing is to add a line to your .profile file (on the server):

$ cat .profile 
.
.
.
LC_CTYPE="en_US.UTF-8"

If you are not using a bourne-shell you probably know how to set a variable for your shell. Any other method that will set LC_CTYPE to a locale valid on your Debian machine will also work. Clearly, you dont need (and you probably should not) set LC_ALL.

Not happy with en_US.UTF-8?
Not everyone use the American locale. To find out what locales are available/valid on your system:

$ locale -a
C
C.UTF-8
en_GB.utf8
en_US.utf8
POSIX
sv_SE.utf8

Want others?

$ sudo dpkg-reconfigure locales

or read the Debian Locale Guide.

Upgrading Qnap TS109 from Wheezy to Jessie

The Qnap TS-109 runs Debian Wheezy just fine, but it has to be upgraded from Squeeze (no direct install). How about upgrading Wheezy to Jessie? This device is old and slow, so I decided to find out, and if it does not work, so be it.

Debian links:

Below follows a shorter version of the upgrade guide, focusing on what I actually did with my Qnap.

Getting ready
First you should of course backup your stuff if the device contains anything you can not lose.

It can also be a good idea to clean out packages that you dont need (to both avoid problems when upgrading them, and to save time during the upgrade):

$ sudo dpkg -l
$ sudo apt-get purge SOMEPACKAGES

It is also adviced to make sure you dont have packages on hold. You can do this with:

$ sudo dpkg --get-selections | grep 'hold$'

I confirmed that my system was then fully wheezy-updated

$ sudo apt-get update

$ sudo apt-get upgrade
Reading package lists... Done
Building dependency tree       
Reading state information... Done
Calculating upgrade... Done
0 upgraded, 0 newly installed, 0 to remove and 0 not upgraded.

$ sudo apt-get dist-upgrade
Reading package lists... Done
Building dependency tree       
Reading state information... Done
Calculating upgrade... Done
0 upgraded, 0 newly installed, 0 to remove and 0 not upgraded.

With that, I decided I was ready for an upgrade to Jessie.

Upgrade itself
I first updated /etc/apt/sources.list.

Old version:

deb http://ftp.se.debian.org/debian/ wheezy main
deb-src http://ftp.se.debian.org/debian/ wheezy main non-free

deb http://security.debian.org/ wheezy/updates main
deb-src http://security.debian.org/ wheezy/updates main non-free

deb http://ftp.df.lth.se/debian wheezy-backports main

New version:

deb http://ftp.se.debian.org/debian/ jessie main
deb-src http://ftp.se.debian.org/debian/ jessie main non-free

deb http://security.debian.org/ jessie/updates main
deb-src http://security.debian.org/ jessie/updates main non-free

I just deleted the backports source.

And begin upgrade:

$ sudo apt-get update
$ sudo apt-get upgrade

That went fine, so I did a reboot. It took longer than usual, but it came up. So I proceeded with:

$ sudo apt-get dist-upgrade

…which went fine, and I rebooted.

System came up and it seems good! If I encounter problems later I will write about it here.

Node.js Benchmark on Raspberry Pi (v1)

I have experimented a bit with Node.js and Raspberry Pi lately, and I have found the performance… surprisingly bad. So I decided to run some standard tests: benchmark-octane (v9).

Octane is essentially run like:

$ npm install benchmark-octane
$ cd node_modules/benchmark-octane
$ node run.js

The distilled result of Octane is a total run time and a score. Here are a few results:

                         OS             Node.js                   Time    Score
QNAP TS-109 500MHz       Debian        v0.10.29 (Debian)         3350s      N/A
Raspberry Pi v1 700MHz   OpenWrt BB    v0.10.35 (self built)     2267s      140
Raspberry Pi v1 700MHz   Raspbian       v0.6.19 (Raspbian)       2083s      N/A
Raspberry Pi v1 700MHz   Raspbian       v0.12.2 (self built)     2176s      104
Eee701 Celeron 900Mhz    Xubuntu       v0.10.25 (Ubuntu)          171s     1655
Athlon II X2@3Hz         Xubuntu       v0.10.25 (Ubuntu)           49s     9475
MacBook Air i5@1.4Ghz    Mac OS X      v0.10.35 (pkgsrc)           47s    10896
HP 2560p i7@2.7Ghz       Xubuntu       v0.10.25 (Ubuntu)           41s    15450

Score N/A means that one test failed and there was no final score.

When I first saw the RPi performance I thought I had done something wrong building (using a cross compiler) Node.js myself for RPi and OpenWRT. However Node.js with Raspbian is basically not faster, and also RPi ARMv6 with FPU is not much faster than the QNAP ARMv5 without FPU.

I think the Eee701 serves as a good baseline here. At first glance, possible reasons for the RPi underperformance relative to the Celeron are:

  • Smaller cache (16kb of L1 cache and L2 only available to GPU, i Read) compared to Celeron (512k)
  • Bad or not well utilised FPU (but there at least is one on the RPi)
  • Node.js (V8) less optimized for ARM

I found that I have benchmarked those to CPUs against each other before. That time the Celeron was twice as fast as the RPi, and the FPU of the RPi performed decently. Blaming the small cache makes more sense to me, than blaming the people who implemented ARM support in V8.

The conclusion is that Raspberry Pi (v1 at least) is extremely slow running Node.js. Other benchmarks indicate that RPi v2 is significantly faster.

Eee701 in 2015

My eee701 is not doing very much anymore, but sometimes it is handy to have it around. I have not upgraded it since Lubuntu 13.10, and that version is not supported anymore. I found that:

  1. Since 13.10 is abandoned upgrading with apt-get generated errors.
  2. 15.04 Lubuntu desktop ISO complains that the hard drive is less than 4.1Gb.
  3. 14.04.2 Lubuntu desktop ISO complains that the hard drive is less than 4.5Gb.

Thus, the standard upgrade or installation paths were blocked. And I was not very interested in putting lots of effort into getting my eee701 running a current system.

Instead, i tried the Ubuntu mini-iso. That was very nice! The iso itself is written with dd (rather than the Startup Disk Creator) to USB drive. The installation is text (curses) based, but very guided (just like Debian). I choose a single 4GB ext4 partition for root, no swap (since I have 2GB RAM) and “Xubuntu minimal” desktop. Keyboard, Wifi and timezones were all correctly set up. When installation was complete and system restarted I had 1.8Gb used and 1.7Gb available. Not even a web browser was installed, but Xubuntu itself was fine.

Not bad at all!

Seeding Xubuntu and Lubuntu

It is not much I do to contribute to the Ubuntu community so when Ubuntu 15.04 was released a few weeks ago I downloaded 4 isos with bittorrent and kept seeding them for the benefit of Ubuntu users.

Now (2015-05-10):

Image                                Size     Ratio
lubuntu-15.04-desktop-i386.iso      696Mb       120
lubuntu-15.04-desktop-amd64.iso     690Mb      60.0
xubuntu-15.04-desktop-i386.iso      970Mb      68.5
xubuntu-15.04-desktop-amd64.iso     963Mb      68.2

Does this mean anything?

If the ratio has anything to do with the popularity of different ubuntu versions:

  1. It surprised my that Lubuntu is so popular
  2. It surprised me that i386 is still very popular

There are, however, a number of factors that could disturb the correlation between my ratio and the true popularity of different Ubuntu flavours:

  • The way Lubuntu vs Xubuntu push people to the torrent (rather than the direct http/ftp) download, I found it harder to find the Lubuntu torrent, though.
  • Lubuntu and Xubuntu may have different audiences, with different attitude to torrent downloads.
  • i386 and amd86 may have different audiences as well
  • If more downloaders mean more seeders, and people on average seed close to 1.0 (or even above) then my Ratio may mean very little.
  • The tracker may have identified me as a stable seeder and sent downloaders of less popular images (both to download and seed) my way, in an attempt to provide equally good service for everyone (no idea if trackers do this).

The i386-heaviness of Lubuntu indicates that there should be correlation between general popularity and my Ratio.