## Monday, July 14, 2014

### Cythonize!

I'm at the stage where my code essentially works: it does everthing I initially set out to have it do, including computing the aforementioned zero sums for elliptic curve $L$-functions. However, the code is written in pure Python, and it is therefore not as fast as it could be.

This is often not a problem; Python is designed to be easy to read and maintain, and I'm hoping that the Python code I wrote is both of those. If we were just planning to run it on elliptic curves with small coefficients - for example, the curve represented by the equation $y^2=x^3-16x+16$ - this wouldn't be an issue. Curves with small coefficients have small conductors and thus few low-lying zeros near the central point, which allows us to run the zero sum code on them with small Delta parameter values. A small Delta value means the computation will finish very quickly regardless of how efficiently it's implemented, so it probably wouldn't be worth my while trying to optimize the code in that case.

To illustrate this point, here is the first most high-level, generic version of the method that computes the sum $\sum_{\gamma} \text{sinc}^2(\Delta \gamma)$ over the zeros of a given elliptic curve $L$-function (minus documentation):

[Of course, there's plenty going on in the background here. I have a separate method, self.cn() which computes the logarithmic derivative coefficients, and I call the SciPy function spence() to compute the part of the sum that comes from the Fourier transform of the digamma function $\frac{\Gamma^{\prime}}{\Gamma}$. Nevertheless, the code is simple and straightforward, and (hopefully) it's easy to follow the logic therein.]

However, the whole point of my GSoC project is to produce code that can be used for mathematical research; ultimately we want to push computations as far as we can and run the code on elliptic curves with large conductor, since curves with small conductor are already well-understood. Ergo, it's time I thought about going back over what I've written and seeing how I can speed it up.

There are two distinct ways to achieve speedups. The first is to rewrite the code more cleverly to eliminates unnecessary loops, coercions, function calls etc. Here is a second version I have written of the same function (still in Python):

The major change I've made between the two versions is improving how the sum involving the logarithmic derivative coefficients is computed - captured in the variable y. In the first version, I simply iterated over all integers $n$ up to the bound $t$, calling the method self.cn() each time. However, the logarithmic derivative coefficient $c_n$ is zero whenever $n$ is not a prime power, and knowing its value for $n=p$ a prime allows us to efficiently compute its value for $p^k$ any power of that prime. It therefore makes sense to do everything "in-house": eliminate the method call to self.cn(), iterate only over primes, and compute the logarithmic derivative coefficients for all powers of a given prime together.

Let's see how the methods match up in terms of speed. Below we run the two versions of the zero sum method on the elliptic curve $E: y^2=x^3-16x+16$, which is a rank 1 curve of conductor 37:

sage: import sage.lfunctions.zero_sums as zero_sums
sage: ZS = zero_sums.LFunctionZeroSum(EllipticCurve([-16,16]))
sage: ZS._zerosum_sincsquared(Delta=1)
1.01038406984
sage: ZS._zerosum_sincsquared_fast(Delta=1)
1.01038406984
sage: %timeit ZS._zerosum_sincsquared(Delta=1)
10 loops, best of 3: 20.5 ms per loop
sage: %timeit ZS._zerosum_sincsquared_fast(Delta=1)
100 loops, best of 3: 3.46 ms per loop

That's about a sixfold speedup we've achieved, just by reworking the section of the code that computes the $c_n$ sum.

The downside, of course, is that the code in method version 2 is more complicated - and thus less readable - than that in version 1. This is often the case in software development: you can write code that is elegant and easy to read but slow, or you can write code that is fast but horribly complicated and difficult to maintain. And when it comes to mathematical programming, unfortunately, sometimes the necessity for speed trumps readability.

The second major way to achieve speedups is to abandon pure Python and switch to a more low-level language. I could theoretically take my code and rewrite it in C, for example; if done relatively intelligently I'm sure the resulting code would blow what I have here out the water in terms of speed. However, I have no experience writing C code, and even if I did getting the code to interface with the rest of the Sage codebase would be a major headache.

Thankfully there is a happy middle ground: Cython. Cython is a programming language - technically a superset of Python - that allows direct interfacing with C and C++ data types and structures. Code written in Cython can be as fast as C code and nearly as readable as pure Python. Crucially, because it's so similar to Python it doesn't require rewriting all my code from scratch. And Sage already knows how to deal with Cython, so there aren't any compatibility issues there.

I am therefore in the process of doing exactly that: rewriting my code in Cython. Mostly this is just a cut-and-paste job and is pleasingly hassle-free; however, in order to achieve the desired speedups, the bottleneck methods - such as our $\text{sinc}^2$ zero sum method above - must be modified to make use of C data types.

Here is the third, most recent version of the _zerosum_sincsquared() method for our zero sum class, this time written in Cython:

Definitely longer and uglier. I now must declare my (C) variables up front; previously Python just created them on the fly, which is nice but slower than allocating memory space for the variables a priori. I've eliminated the use of complex arithmetic, so that everything can be done using C integer and float types. I still iterate over all primes up to the bound $t$; however now I deal with those primes that divide the conductor of $E$ (for which the associated summand is calculated slightly differently) beforehand, so that in the main loop I don't have to check at each point if my prime $p$ divides the conductor or not [This last check is expensive: the conductor $N$ can be very large - too large to cast as a $C$ long long even - so we would have to use slower Python or Sage data types to represent it. Better to get rid of the check altogether].

Let's see how this version holds up in a speed test. The Cython code has already been built into Sage and the class loaded into the global namespace, so I can just call it without having to attach or load any file:

sage: ZS = LFunctionZeroSum(EllipticCurve([-16,16]))
sage: ZS._zerosum_sincsquared_fast(Delta=1)
1.01038406984
sage: %timeit ZS._zerosum_sincsquared_fast(Delta=1)
1000 loops, best of 3: 1.72 ms per loop

The good news: the Cythonized version of the method produces the same output as the Python versions, and it's definitely faster. The bad news: the speedup is only about a factor of 2, which isn't hugely impressive given how much uglier the code is.

Why is this? Crucially, we still iterate over all integers up to the bound $t$, testing for primality at each step. This is very inefficient: most integers are not prime (in fact, asymptotically 0 percent of all positive integers are prime); we should be using sieving methods to eliminate primality testing at those integers which we know before checking are composite. For example, we should at the very least only ever iterate over the odd numbers beyond 3. That immediately halves the number of primality tests we have to do, and we should therefore get a comparable speedup if primality testing is what is dominating the runtime in this method.

This is therefore what I hope to implement next: rework the zero sum code yet again to incorporate prime sieving. This has applicability beyond just the $\text{sinc}^2$ method: all explicit formula-type sums at some point invoke summing over primes or prime powers, so having access to code that can do this quickly would be a huge boon.

1. «we should be using sieving methods to eliminate primality testing at those integers which we know before checking are composite»

What about using the built-in prime_range function?

2. The built-in prime_range() function would work wonderfully for small Delta values (i.e. Delta < ~2.5), and would be a quick way to get impressive speedups there. However, the function would only be good for small Delta values, and would be *horrendously* bad for larger Delta values - which is what I'm really after.

This is because prime_range() wraps PARI's primes() function, which (as it's called in Sage) runs a Sieve of Eratosthenes method to return a list of all primes up to the specified bound.

The Sieve of Eratosthenes works by creating a list in memory of integers up to the bound, and then successively "crossing out" all integers which are a multiple of some other integer; at the end, the only uncrossed-out numbers are by definition prime numbers.

This is much faster than testing each integer for primality, but the tradeoff is that one has to create in memory an array the size of the bound parameter. The issue for my project is that the bound we need to enumerate primes up to is exponential in the parameter Delta. For Delta=2, we need a sieve table about 300 kilobytes in size - no problem. However, use Delta=3 and we're up to 150 MB, and Delta=4 requires a table 82 GB in size. The latter is unworkable, which is why I can't just naively call the prime_range() function.

There are more intelligent ways to sieve, though. For example, we could break the range up into chunks of manageable size and only work with primes in a given interval at each step.This is perhaps a bit slower than a pure start-from-1 sieve, but way more memory efficient. PARI even has a method that does exactly this, I believe; however it's not wrapped in Sage, so I'll have to do a bit of extra work to get at its functionality.

1. I see, thank you for the detailed clarification. :)