Lisp is fun! Well, since I first knew about Lisp I was fascinated, but I have found it hard to learn Lisp and to play with it in a meaningful way. A few years ago I wrote about it here and here. As usual, the first steps of learning something new can be the hardest.

Occationally I use Hackerrank to find programming challanges to solve for fun. I particularly like the Project Euler competition. I find it particularly good for trying out new languages: you get a “meaningful” challenge, a simple environment prepared in your web browser, and automated test cases. So, this time I didn’t waste my time trying to find the right Lisp implementation for me, I just started hacking on Project Euler 38 and 39 on Hackerrank.

Problem 38 was quite simple, but 39 was more interesting. When I had solved it, I found my implementation was not at all fast enough, so I started experimenting locally (the Hackerrank environment is not optimal for tweaking, optimization and debugging).

**Choosing a (Common) Lisp implementation**

There are quite many Common Lisp implementations out there. The one Hackerrank uses is SBCL. That is clearly the Common Lisp implementation I would recommend (based on my little experience) if it is available for your platform.

I installed SBCL with apt-get in Ubuntu. I also downloaded binaries directly for my Mac OS X computer and my Raspberry Pi (v1) running Arch linux. Installation is a bit non-standard, but you can actually run it without installing (just execute run-sbcl.sh in downloaded folder).

I also tried clisp and ecl, none of these could deal with the memory usage (stack size) of my program. For clisp I found no way to manipulate stack sizes at all. For ecl I made some progress but I could not make it run my program.

SBCL is a Lisp compiler, and it produces fast and efficient code. I later compared it to C.

**Project Euler 39**

Project Euler 39 is basically about finding integer solutions to Pythagoras theorem. For a given, large, perimeter, how many right triangles are there? For example:

*300000^2 + 400000^2 = 500000^2*

This triangle has a perimeter of 300000+400000+500000=1200000. What other values for a and b so that

*a + b = 700000*

*a^2 + b^2 = 500000^2*

are there? The Hackerrank challenge requires you to work with perimeters up to 5000000. If you implement a solution, a few things to immediately note:

- The squares wont fit in a 32bit integer. They will fit with no loss of precision in the 53 bits of a 64 bit double and they will also fit in a 64 bit integer. (This matters not for Common Lisp)
- If you want to do recursion (and of course you want when you code Lisp) it will be millions of recursion steps, which will be a challenge to the stack size. (This also turned out not to matter for SBCL)

**The Lisp implementation**

It turned out that the SBCL compiler optimized the recursion is such a way that the memory usage was quite low. SBCL successfully runs my program on RPi/Arch, Intel/Ubuntu and Intel/OSX with quite reasonable memory usage.

Since this is about learing Lisp I wanted a 100% functional programming implementation. Only pure functions. A lot of my code is about generating, modifying and testing triangles. A triangle (a b c) can obviously be represented as a Lisp list (a b c) and this was my first implementation. Then if you want to read a, b or c from a list abc, or create the list from a, b and c, you can do:

a: (car abc)
b: (car (cdr abc))
c: (car (cdr (cdr abc)))
abc: (list a b c)

I found this cumbersome. It became a lot of list-to-variables and variables-to-list overhead (I didnt care so much about performance, more about my code readability). I learnt that Lisp functions can return multiple values using *value* and that you can bind them with *multiple-value-bind* and use them as arguments to a function using *multiple-value-call*. This felt functional and pure enough, and it made my code 25% faster than the car/cdr pattern above.:

; a (stupid) function returning a triangle as three values
(defun get-345-triangle ()
(values 3 4 5))
; a function calculating the perimeter of triangle (from a function)
(defun triangle-perimeter-1 (tri-func)
(multiple-value-bind (a b c) (funcall tri-func)
(+ a b c)))
; and in this case you dont need to bind, you can use + directly
(defun triangle-perimeter-2 (tri-func)
(multiple-value-call #'+ (funcall tri-func)))
; now this works
(triangle-perimeter-1 #'get-345-triangle)
(triangle-perimeter-2 #'get-345-triangle)

Since I am a very inexperienced Lisp programmer I appreciate suggestions for improvement.

**Performance of Lisp**

My final Hackerrank submission of Lisp code executes in about 4.5 seconds on my Intel i5/Ubuntu. It takes about the same time on the Hackerrank web page, which is fast enough to pass all tests. On the Raspberry Pi v1 (ARMv6 @700 MHz) it takes more than 700 seconds. My intuition told me that 4.5 seconds was very good. This made me ask two questions. How would Lisp compare to C? And why is the ARM more than 100 times slower, how would that compare in C?

**The C implementation**

My ambition was to rewrite Lisp to C line by line. So my C-program has exactly the same functions which take almost exactly the same arguments. All calculations are identical and performed in exactly the same order. The C-program relies entirely on recursion instead of loops (just like the Lisp program). However…

**Functions** in C can not return multiple variables. While Lisp had *values* I decided to use a reference to a struct in C:

(defun get-a-triangle()
(values x y z))
void get_a_triangle(struct triangle *t) {
t->a = x;
t->b = y;
t->c = z;
}

If the C-triangle struct is a local variable on the callers stack the difference is quite small (from a practical point of view, from a theoretic strict functional programming perspective its a different story).

**Numbers** in Lisp have arbitrary precision integers and floats make no difference. So, when porting to C, I had to pick numeric types. For most purposes, int32_t was good enough. But, for the purpose of calculating Pythagoras theorem higher precision was needed (as I wrote above, the 53 bits of double, or 64 bits of int64_t are good). So I ended up with 5 versions of the C-program (to compare performance):

- All 64-bit integers
- 32-bit integers, 64-bit for “triangles”
- 32-bit integers, double for “triangles”
- 32-bit integers, 64-bit only for pythagoras calc
- 32-bit integers, double only for pythagoras calc

(In cases 2,3 the struct triangle has int64_t/doubles properties, and all manipulations and calculations on triangles use these datatypes. In cases 4,5 everything is int32_t, except the internals of a single function, which casts to higher precision before doing its calculations.)

The C-program requires a significant stack size. The stack size can be obtain and changed like (numbers in kb, all values given with ulimit -a):

$ ulimit -s
8192
$ ulimit -s 100000

For my program, a stack size much higher than 8192 is needed (see below). It seems impossible to get large stack than 64Mb in Mac OS X, so my C program could never run there.

**Benchmark findings**

All C-programs are compiled with gcc -O2.

CPU MHZ SBCL 64 32/64 32/double 32(64) 32(double)
==================================================================================
Time (s)
i5-4250U 1300-2600 4.5 1.52 1.52 1.60 1,54 1.58
ARMv6 700 ~715 85 83 45 42 39
ARMv7 900 357 23 21 13 12 10
Max Res (MB)
i5-4250U 41 103 103 103 103 103
ARMv6 50 220 210 79 110 76
ARMv7 57 180 160 87 97 62

This is not too easy to interpret! The ironic thing is that the fastest thing on the x64-cpu (64-bit integers everywhere) is the slowest on the ARMv6. However, the fastest option on the ARMv6 (32-bit everywhere, and when absolutely needed, use double) is almost the worst on the i5 CPU.

When it comes to the 64-bit i5, it basically does not matter what datatypes you use.

When it comes to the ARMv6, the most important thing is to not store the triangles as int64_t. The strange thing here is the stack sizes. Why does it double (compared to x64) when triangles are stored as int64_t? And the doubles, why do they reduce stack size so much (where are all these doubles actually stored)?

The time command gives max resident memory usage. If I set *ulimit -s 128* the first two programs fail (with Segmentation fault 11), and the last three ones succeed, on the ARMv6.

I have found before that the performance of the ARMv6 suffers because of its slow memory and small cache. It is quite possible that the poor performance of the ARMv6 compared to the i5 is related to its slow memory, and the recursion (and stack memory) heavy algorithm.

Finally, SBCL in x64 has very good performance even compared to C (however, an iterative C-implementation, fitting completely in cache, would probably be faster). Note that I am a novice Lisp programmer and this is a math heavy program where the generic number type of Lisp will come at a cost. On the ARMv6, Lisp performance suffers much more.

**Windows stack size limit**

For Windows, stack size limit is set in the binary, not in the shell. With Cygwin/GCC use the flag *-Wl,–stack,1000000* for one million bytes. Note that these are options passed on to the linker.

**Future investigations**

And I am curious about how much faster a minimal-memory-footprint loop-based C-program would perform.

**The source code**

Since this code solves a problem in Hackerrank I hesitate to publish it. If you want it for any other reason than just running it on Hackerrank let me know.