Today marks the last day of my Google Summer of Code 2014 project. Evaluations are due midday Friday PDT, and code submissions for successfully passed projects start soon thereafter. The end result of my project can be found at Sage Trac Ticket 16773. In total it's just over 2000 lines of Python and Cython code, to be added to the next major Sage release (6.4) if/when it gets a final thumbs-up review.

When I write just the number of lines of code it doesn't sound like very much at all - I'm sure there are GSoC projects this year that produced at least 10k lines of code! However, given that what I wrote is complex mathematical code that needed a) significant optimization, and b) to be be mathematically sound in the first place, I'd say that isn't too bad. Especially since the code does what the project description aimed for it to do: compute analytic ranks of rational elliptic curves very quickly with a very low rate of failure. Hopefully this functionality can and will be used to produce numerical data for number theory research in the years to come.

The Google Summer of Code has been a thoroughly rewarding experience for me. It's a great way to sharpen one's coding skills and get paid in the process. I recommend it for any university student who's looking to go into any industry that requires programming skills; I'd apply to do it again next year if I wasn't planning to graduate then!

## Friday, August 22, 2014

## Thursday, August 14, 2014

### Things are Better in Parallel

A recent improvement I've implemented in my GSoC code is to allow for parallelized computation. The world has rapidly moved to multi-core as a default, so it makes sense to write code that can exploit this. And it turns out that the zero sum rank estimation method that I've been working on can be parallelized in a very natural way.

But first: how does one compute in parallel in Sage? Suppose I have written a function in a Sage environment (e.g. a SageMathCloud worksheet, .sage file, Sage console etc.) which takes in some input and returns some other input. The simple example f below takes in a number n and returns the square of that number.

This is a fairly straightforward beast; put in a value, get a value out. But what if we have some computation that requires evaluating that function on a large number of inputs? For example, say we want to compute the sum of the first 10 million squares. If you only have one processor to tap, then you're limited to calling f over and over again in series:

In this example you could of course invoke the formula for the sum of the first $n$ squares and just write down the answer without having to add anything up, but in general you won't be so lucky. You can optimize the heck out of f, but in general when you're limited to a single processor you're confined to iterating over all the values you need to consider sequentially .

However, if you have 2 processors available one could try write code that splits the work into two roughly equal parts that can run relatively independently. For example, for our function we could compute the sum of all the even squares up to a given bound in one process and the sum of all the odd squares in another, and then add the two results together to get the sum of all square up to our bound.

Sage has a readily available mechanism to do exactly this: the@parallel decorator. To enable parallel computation in your function, put @parallel above your function definition (the decorator can take some parameters; below ncpus=2 tells it that we want to use 2 processors). Note that we also have to modify the function: now it no longer only takes the bound up to which we must add squares, but also a flag indicating whether we should consider even or odd squares.

### THE @parallel DECORATOR IN SAGE

But first: how does one compute in parallel in Sage? Suppose I have written a function in a Sage environment (e.g. a SageMathCloud worksheet, .sage file, Sage console etc.) which takes in some input and returns some other input. The simple example f below takes in a number n and returns the square of that number.

sage: def f(n):

....: return n*n

....:

sage: f(2),f(3),f(5)

(4, 9, 25)

This is a fairly straightforward beast; put in a value, get a value out. But what if we have some computation that requires evaluating that function on a large number of inputs? For example, say we want to compute the sum of the first 10 million squares. If you only have one processor to tap, then you're limited to calling f over and over again in series:

sage: def g(N):

....: y = 0

....: for n in range(N+1):

....: y += f(n)

....: return y

....:

sage: %time g(10000000)

CPU times: user 17.5 s, sys: 214 ms, total: 17.7 s

Wall time: 17.6 s

333333383333335000000

In this example you could of course invoke the formula for the sum of the first $n$ squares and just write down the answer without having to add anything up, but in general you won't be so lucky. You can optimize the heck out of f, but in general when you're limited to a single processor you're confined to iterating over all the values you need to consider sequentially .

However, if you have 2 processors available one could try write code that splits the work into two roughly equal parts that can run relatively independently. For example, for our function we could compute the sum of all the even squares up to a given bound in one process and the sum of all the odd squares in another, and then add the two results together to get the sum of all square up to our bound.

Sage has a readily available mechanism to do exactly this: the

sage: @parallel(ncpus=2)

....: def h(N,parity):

....: y = 0

....: if parity=="even":

....: nums = range(0,N+1,2)

....: elif parity=="odd":

....: nums = range(1,N+1,2)

....: for n in nums:

....: y += f(n)

....: return y

Instead of calling h with its standard sequence of parameters, we can pass it a list of tuples, where each tuple is a valid sequence of inputs. Sage then sends each tuple of inputs off to an available processor and evaluates the function on them there. What's returned is a generator object that can iterate over all the outputs; we can always see the output directly by calling list() on this returned generator:

sage: for tup in list(h([(1000,"even"),(1000,"odd")])):

....: print(tup)

....:

(((1000, 'even'), {}), 167167000)

(((1000, 'odd'), {}), 166666500)

Note that the tuple of inputs is also returned. Because we're doing things in parallel, we need to know which output corresponds to which input, especially since processes may finish at different times and return order is not necessarily the same as the order of the input list.

Finally, we can write a wrapper function which calls our parallelized function and combines the returned results:

sage: def i(N):

....: y = 0

....: for output in h([(N,"even"),(N,"odd")]):

....: y += output[1]

....: return y

....:

sage: %time i(10000000)

CPU times: user 1.76 ms, sys: 33.2 ms, total: 35 ms

Wall time: 9.26 s

333333383333335000000

Note that i(10000000) produces the same output at g(10000000) but in about half the time, as the work is split between two processes instead of one. This is the basic gist of parallel computation: write code that can be partitioned into parts that can operate (relatively) independently; run those parts on different processors simultaneously; and then collect returned outputs and combine to produce desired result.

### PARALLELIZING THE ZERO SUM COMPUTATION

Let's take a look at the rank estimating zero formula again. Let $E$ be a rational elliptic curve with analytic rank $r$. Then

\begin{align*}

r < \sum_{\gamma} \text{sinc}^2(\Delta\gamma) &= \frac{1}{\pi \Delta}\left(-\eta+\log\left(\frac{\sqrt{N}}{2\pi}\right)\right) \\

&+ \frac{1}{2\pi^2\Delta^2}\left(\frac{\pi^2}{6}-\text{Li}_2\left(e^{-2\pi\Delta}\right) \right) \\

&+ \frac{1}{\pi \Delta}\sum_{\substack{n = p^k \\ n < e^{2\pi\Delta}}} c_n\left(1 - \frac{\log n}{2\pi\Delta}\right)

\end{align*}

where

\begin{align*}

r < \sum_{\gamma} \text{sinc}^2(\Delta\gamma) &= \frac{1}{\pi \Delta}\left(-\eta+\log\left(\frac{\sqrt{N}}{2\pi}\right)\right) \\

&+ \frac{1}{2\pi^2\Delta^2}\left(\frac{\pi^2}{6}-\text{Li}_2\left(e^{-2\pi\Delta}\right) \right) \\

&+ \frac{1}{\pi \Delta}\sum_{\substack{n = p^k \\ n < e^{2\pi\Delta}}} c_n\left(1 - \frac{\log n}{2\pi\Delta}\right)

\end{align*}

where

- $\gamma$ ranges over the nontrivial zeros of the $L$-function attached to $E$
- $\Delta$ is a positive parameter
- $\eta$ is the Euler-Mascheroni constant $=0.5772\ldots$
- $N$ is the conductor of $E$
- $\text{Li}_2(x)$ is the dilogarithm function, defined by $\text{Li}_2(x) = \sum_{k=1}^{\infty} \frac{x^k}{k^2}$
- $c_n$ is the $n$th coefficient of the logarithmic derivative of the $L$-function of $E$, which is zero when $n$ is not a perfect prime power.

The right hand side of the sum, which is what we actually compute, can be broken up into three parts: the first term involving the curve's conductor $N$; the second term involving the dilogarithm function $Li_2(x)$; and the sum over prime powers. The first two parts are quick to compute: evaluating them can basically be done in constant time regardless of the magnitude of $N$ or $\Delta$.

It is therefore not worth considering parallelizing these two components, since the prime power sum dominates computation time for all but the smallest $\Delta$ values. Instead, what I've done is rewritten the zero sum code so that the prime power sum can be evaluated using multiple cores.

As mentioned in this post, we can turn this sum into one indexed by the primes (and not the prime powers); this actually makes parallelization quite straightforward. Recall that all primes except $2$ are odd, and all primes except $2$ and $3$ are either $1$ or $5$ remainder $6$. One can scale this up: given a list of small primes $[p_1,p_2,\ldots,p_n]$, all other primes fall into one of a relatively small number of residue classes modulo $p_1 p_2\cdots p_n$. For example, all primes beyond $2$, $3$, $5$ and $7$ have one of the following 48 remainders when you divide them by $210 = 2\cdot 3\cdot 5 \cdot 7$:

\begin{align*}

&1, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 53, 59, 61, 67, 71, 73, 79,\\

&83, 89, 97, 101, 103, 107, 109, 113, 121, 127, 131, 137, 139, 143, 149, \\

&151, 157, 163, 167, 169, 173, 179, 181, 187, 191, 193, 197, 199, 209,

\end{align*}

and no other.

If we had 48 processors available. the natural thing to do would be to get each of them to iterate over all integers in a particular residue class up to $e^{2\pi\Delta}$, evaluating the summand whenever that integer is prime, and returning the sum thereof. For example, if the bound was 1 million, then processor 1 would compute and return $\sum_{n \equiv 1 (\text{mod } 210)}^{1000000} c_n\left(1 - \frac{\log n}{2\pi\Delta}\right)$. Processor 2 would do the same with all integers that are $11$ modulo $210$, etcetera.

In reality, we have to figure out a) how many processors are available, and b) partition the work relatively equally among those processors. Thankfully sage.parallel.ncpus.ncpus() succinctly addresses the former, and the latter is achieved by splitting the residue classes into $n$ chunks of approximately equal size (where $n$ is the number of available CPUs) and then getting a given processor to evaluate the sum over those residues in a single chunk.

Here is the method I wrote that computes the $\text{sinc}^2$ zero sum with (the option of) multiple processors:

Note that I've defined a subfunction to compute the prime sum over a given subset of residue classes; this is the part that is parallelized. Obtaining the residue chunks and computing the actual summand at each prime are both farmed out to external methods.

Let's see some timings. The machine I'm running Sage on has 12 available processors:

Same answer in a ninth of the time! Note that the parallelized method has some extra overhead, so even with 12 processors we're unlikely to get a full factor of 12 speedup. Nevertheless, the speed boost will allow us to run the zero sum code with larger $\Delta$ values, allowing us to investigate elliptic curves with even larger conductors.

It is therefore not worth considering parallelizing these two components, since the prime power sum dominates computation time for all but the smallest $\Delta$ values. Instead, what I've done is rewritten the zero sum code so that the prime power sum can be evaluated using multiple cores.

As mentioned in this post, we can turn this sum into one indexed by the primes (and not the prime powers); this actually makes parallelization quite straightforward. Recall that all primes except $2$ are odd, and all primes except $2$ and $3$ are either $1$ or $5$ remainder $6$. One can scale this up: given a list of small primes $[p_1,p_2,\ldots,p_n]$, all other primes fall into one of a relatively small number of residue classes modulo $p_1 p_2\cdots p_n$. For example, all primes beyond $2$, $3$, $5$ and $7$ have one of the following 48 remainders when you divide them by $210 = 2\cdot 3\cdot 5 \cdot 7$:

\begin{align*}

&1, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 53, 59, 61, 67, 71, 73, 79,\\

&83, 89, 97, 101, 103, 107, 109, 113, 121, 127, 131, 137, 139, 143, 149, \\

&151, 157, 163, 167, 169, 173, 179, 181, 187, 191, 193, 197, 199, 209,

\end{align*}

and no other.

If we had 48 processors available. the natural thing to do would be to get each of them to iterate over all integers in a particular residue class up to $e^{2\pi\Delta}$, evaluating the summand whenever that integer is prime, and returning the sum thereof. For example, if the bound was 1 million, then processor 1 would compute and return $\sum_{n \equiv 1 (\text{mod } 210)}^{1000000} c_n\left(1 - \frac{\log n}{2\pi\Delta}\right)$. Processor 2 would do the same with all integers that are $11$ modulo $210$, etcetera.

In reality, we have to figure out a) how many processors are available, and b) partition the work relatively equally among those processors. Thankfully sage.parallel.ncpus.ncpus() succinctly addresses the former, and the latter is achieved by splitting the residue classes into $n$ chunks of approximately equal size (where $n$ is the number of available CPUs) and then getting a given processor to evaluate the sum over those residues in a single chunk.

Here is the method I wrote that computes the $\text{sinc}^2$ zero sum with (the option of) multiple processors:

Note that I've defined a subfunction to compute the prime sum over a given subset of residue classes; this is the part that is parallelized. Obtaining the residue chunks and computing the actual summand at each prime are both farmed out to external methods.

Let's see some timings. The machine I'm running Sage on has 12 available processors:

sage: sage.parallel.ncpus.ncpus()

12

12

sage: E = EllipticCurve([12838,-51298])

sage: Z = LFunctionZeroSum(E)

sage: Z = LFunctionZeroSum(E)

sage: %time Z._zerosum_sincsquared_fast(Delta=2.6)

CPU times: user 36.7 s, sys: 62.9 ms, total: 36.8 s

Wall time: 36.8 s

2.8283629046

sage: %time Z._zerosum_sincsquared_parallel(Delta=2.6)

CPU times: user 7.87 ms, sys: 133 ms, total: 141 ms

Wall time: 4.06 s

2.8283629046

## Tuesday, August 5, 2014

### How big should Delta be?

Let's take a look at the central formula in my GSoC rank estimation code. Let $E$ be a rational elliptic curve with analytic rank $r$. Then

$$ r < \sum_{\gamma} \text{sinc}^2(\Delta\gamma) = \frac{1}{\pi\Delta} \left[ C_0 + \frac{1}{2\pi\Delta}\left(\frac{\pi^2}{6}-\text{Li}_2\left(e^{-2\pi\Delta}\right) \right) + \sum_{n=1}^{e^{2\pi\Delta}} c_n\left(1 - \frac{\log n}{2\pi\Delta}\right) \right] $$

where

$$ r < \sum_{\gamma} \text{sinc}^2(\Delta\gamma) = \frac{1}{\pi\Delta} \left[ C_0 + \frac{1}{2\pi\Delta}\left(\frac{\pi^2}{6}-\text{Li}_2\left(e^{-2\pi\Delta}\right) \right) + \sum_{n=1}^{e^{2\pi\Delta}} c_n\left(1 - \frac{\log n}{2\pi\Delta}\right) \right] $$

where

- $\gamma$ ranges over the nontrivial zeros of the $L$-function attached to $E$
- $\Delta$ is a positive parameter
- $C_0 = -\eta + \log\left(\frac{\sqrt{N}}{2\pi}\right)$; $\eta$ is the Euler-Mascheroni constant $=0.5772\ldots$ and $N$ is the conductor of $E$
- $\text{Li}_2(x)$ is the dilogarithm function, defined by $\text{Li}_2(x) = \sum_{k=1}^{\infty} \frac{x^k}{k^2}$
- $c_n$ is the $n$th coefficient of the logarithmic derivative of the $L$-function of $E$.

The thing I want to look at in this post is that parameter $\Delta$. The larger you make it, the closer the sum on the left hand side over the zeros is to the analytic rank, so when trying to determine the rank of $E$ we want to pick as large a $\Delta$ as we can. However, the bigger this parameter is the more terms we have to compute in the sum over the $c_n$ on the right hand side; moreover the number of terms - and thus the total computation time - scales exponentially with $\Delta$. This severely constrains how big we can make $\Delta$; generally a value of $\Delta=2$ may take a second or two for a single curve on SageMathCloud, while $\Delta=3$ may take upwards of an hour. For the average rank project I'm working on I ran the code on 12 million curves using $\Delta=1.3$; the total computation time was about 4 days on SMC.

However, it should be clear that using too large a $\Delta$ is overkill: if you run the code on a curve with $\Delta=1$ and get a bound of zero out, you know that curve's rank exactly zero (since it's at most zero, and rank is a non-negative integer). Thus using larger $\Delta$ values on that curve will do nothing except provide you the same bound while taking much longer to do so.

This begs the question: just how big of a $\Delta$ value is good enough? Can we, given some data defining an elliptic curve, decide a priori what size $\Delta$ to use so that a) the computation returns a bound that is likely to be the true of the curve, and b) it will do so in as little time as possible?

The relevant invariant to look at here is conductor $N$ of the elliptic curve; go back to the formula above and you'll see that the zero sum includes a term which is $O\left(\frac{\log(N)}{2\pi\Delta}\right)$ (coming from the $\frac{1}{\pi \Delta} C_0$ term). This means that size of the returned estimate will scale with $\log(N)$: for a given $\Delta$, the bound returned on a curve with 10-digit conductor will be about double that which is returned for a 5-digit conductor curve, for example. However, we can compensate this by increasing $\Delta$ accordingly.

To put it all more concretely we can pose the following questions:

- Given an elliptic curve $E$ with conductor $N$, how large does $\Delta$ need to be in terms of $N$ so that the returned bound is
*guaranteed*to be the true analytic rank of the curve? - Given a conductor size $N$ and a proportionality constant $P \in [0,1]$, how big does $\Delta$ have to be in terms of $N$ and $P$ so that at least $P\cdot 100$ percent of all elliptic curves with conductor of size about $N$ will, when the rank estimation code is run on them with that $\Delta$ value, yield returned bounds that are equal to their true analytic rank?

[Note: in both of the above questions we are making the implicit assumption that the returned rank bound is monotonically decreasing for increasing $\Delta$. This is not necessarily the case: the function $y = \text{sinc}^2(x)$ is

*not*a decreasing function in $x$. However, in practice any upwards fluctuation we see in the zero sum is small enough to be eliminated when we take the floor function to get an integer rank bound.]### A $\Delta$ CHOICE GOOD ENOUGH FOR MOST CURVES

The first question is easier to phrase, but more difficult to answer, so we will defer it for now. To answer the second question, it is useful mention what we know about the distribution and density of nontrivial zeros of an elliptic curve $L$-function.

Using some complex analysis we can show that, for the $L$-function of an elliptic curve with conductor $N$, the expected number of zeros in the critical strip with imaginary part at most $T$, is $O(T\log N+ T\log T)$. That is, expected zero density has two distinct components: a part that scales linearly with log of the conductor, and a part that doesn't scale with conductor (but does scale slightly faster than linearly with how far out you go).

Consider the following: if we let

$$\Delta(N) = \frac{C_0}{\pi} = \frac{1}{\pi}\left(-\eta+\log\left(\frac{\sqrt{N}}{2\pi}\right)\right)$$

then the first term in the right hand side of the zero sum formula is precisely 1 - this is the term that comes from the $\log N$ part of the zero density. The next term - the one involving $\frac{\pi^2}{6}-\text{Li}_2\left(e^{-2\pi\Delta}\right)$ - is the term that comes from the part independent of $N$; because the right hand side is divided by $\Delta$ it therefore goes to zero as the curve's conductor increases. The third term contains the $c_n$ coefficients which (per Sato-Tate) will be positive half the time and negative half the time, so the entire sum could be positive or negative; we therefore expect its contribution to be small on average when we look at large number of elliptic curves.

It thus stands to reason that for this value of Delta, and when the conductor $N$ is sufficiently large, the zero sum will be about 1, plus or minus a smallish amount coming from the $c_n$ sum. This argument is by no means rigorous, but we might therefore expect the zero sum to be within 2 of the actual analytic rank most of the time. Couple that with knowledge of the root number and you get a rank upper bound out which is equal to the actual analytic rank in all but a few pathological cases.

Empirical evidence bears this out. I ran my rank estimation code with this choice of $\Delta$ scaling on the whole Cremona database, which contains all elliptic curves up to conductor 350000:

As you can see, the percentage of curves for which the rank bound is strictly greater than rank holds fairly constant at about 0.25%. That's minuscule: what this is saying is that if you type in a random Weierstrass equation, there is only about a 1 in 1000 chance that the rank bounding code with $\Delta = \frac{C_0}{\pi}$ will return a value that isn't the actual analytic rank. Moreover, the code runs

*considerably*faster than traditional analytic rank techniques, so if you wanted to, for example, compute the ranks of a database of millions of elliptic curves, this would be a good first port of call.
Of course, this $\Delta$ scaling approach is by no means problem-free. Some more detailed analysis will show that that as stated above, the runtime of the code will actually be $O(N)$ (omitting log factors), i.e. asymptotic scaling is actually

*worse*than traditional analytic rank methods, which rely on evaluating the $L$-function directly and thus are $O(\sqrt{N})$. It's just that with this code we have some very small constants sitting in front, so the crossover point is at large enough conductor values that neither method is feasible anyway.
This choice of $\Delta$ scaling works for conductor ranges up to about $10^9$; that corresponds to $\Delta \approx 2.5$, which will give you a total runtime of about 10-20 seconds for a single curve on SMC. Increase the conductor by a factor of 10 and your runtime will also go up tenfold.

For curves of larger conductor, instead of setting $\Delta = \frac{C_0}{\pi}$ we can choose to set $\Delta$ to be $\alpha\cdot \frac{C_0}{\pi}$ for any $\alpha \in [0,1]$; the resulting asymptotic runtime will then be $O(N^{\alpha})$, at the expense of having a reduced proportion of elliptic curves where rank bound is equal to true rank.

### HOW LARGE DOES $\Delta$ HAVE TO BE TO GUARANTEE TRUE RANK?

When we use $\Delta = \frac{C_0}{\pi}$, the curves for which the returned rank estimate is strictly larger than the true rank are precisely those which have unusually low-lying zeros. For example, the rank 0 curve with Cremona label 256944c1, has a zero with imaginary part at 0.0256 (see here for a plot), compared to an expected value of 0.824. Using $\Delta = \frac{C_0}{\pi}$ on this curve means $\Delta \approx 1.214$; if we compute the corresponding zero sum with this value of $\Delta$ we get a value of 2.07803. The smallest value of $\Delta$ for which we get a zero sum value of less than 2 is empirically about 2.813; at this point taking the floor and invoking parity tells us that the curve's rank is zero.

The above example demonstrates that if we want to guarantee that the returned rank bound is the true analytic rank, we are forced to increase the size of $\Delta$ to something larger than $\frac{C_0}{\pi}$. Do we need to increase $\Delta$ by a fixed value independent of $N$? Do we need to increase it by some constant factor? Or does it need to scale faster than $\log N$? These are hard questions to answer; it comes down to determining how close to the central point the lowest nontrivial zero can be as a function of the conductor $N$ (or some other invariants of $E$), which in turn is intimately related to estimating the size of the leading coefficient of the $L$-series of $E$ at the central point. This is already the topic of a previous post: it is a question that I hope to make progress in answering in my PhD dissertation.

Subscribe to:
Posts (Atom)