I’m not certain this is a novel approach, but I’m pretty sure it is. I’ve done a fairly large web search and not found anyone else doing this. (Then again, it’s a big web and any search done by a mere human will be too small for any certainty).
Why primes? Welll…..
Aside from the fact that they have fascinated mathematicians for thousands of years… aside from my having won a Math Award (cash from an outside agency) at high school graduation and seriously thinking of taking a math track in college… aside from my being a bit Aspie and tending to tease issues to The End, and this having no clear end yet… aside from practical issues (like making more efficient database hash codes or making / cracking encryption)… aside from all that: they are just something I got started on decades back and have an interest in researching. It is one of the “infinite compute” problems of the world that can consume all possible compute power for the foreseeable future (and then some), so finding a way to make it work better would be of benefit.
OK, some background.
In High School I learned about the Sieve of Eratosthenes. It was somewhat fascinating as it is a regular process, but yields irregular results. I worked the Sieve by hand to somewhat more than required for the class ;-)
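For anyone who never worked it by hand: a minimal sketch of the Sieve in Python (my quick illustration, nothing from back then):

```python
def sieve(limit):
    """Sieve of Eratosthenes: return all primes <= limit."""
    is_prime = [True] * (limit + 1)
    is_prime[0] = is_prime[1] = False   # standard convention; see below for my take on 1
    for p in range(2, int(limit ** 0.5) + 1):
        if is_prime[p]:
            # Cross off every multiple of p, starting at p*p
            for m in range(p * p, limit + 1, p):
                is_prime[m] = False
    return [n for n in range(limit + 1) if is_prime[n]]

print(sieve(30))   # -> [2, 3, 5, 7, 11, 13, 17, 19, 23, 29]
```

The process is completely regular; the list that falls out of it is not.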
In college, I played a bit with programming to find prime numbers. Not to excess, but some. (Compute time was limited and charged by the hour then. Despite having “cracked a code” that let me “find” any amount of compute time I wanted, “it would be wrong” to use it for personal interests that would deprive other folks of their compute time; the Burroughs mainframe was pretty much run flat out 24 x 7. So some minor playing, but not to excess.) Oh, you want to know the hack? Seems that they used any old free disk for swap space, and it was not scrubbed on release. Now in FORTRAN there was an “open for random read / write” file option. Do that, and you got a chunk of ‘free space’ that might have been used as swap at some point. OK, open the file. Write record 1, write record 1,000,000… it was up to the programmer to ‘zero’ the space in between on ‘first write’… I decided to read it instead. Search for “assword” and print the following 100 chars ;-)
Later, an early job was doing database work on the old HP3000 mini-computer, using the Image and Query database / languages. They used the size of the database as a ‘seed’ to the hash table function; if the database was a prime number in size, the hash function was more efficient. So I tossed together a sort of ‘brute force’ Sieve and calculated the first 1,000,000 prime numbers. (Needed relatively large ones for typical database sizes.) Printed this out on 120 char wide green stripe computer paper – two big ‘tabs’ of it about a foot tall – and never had to do more than a ‘lookup’ again when building new databases. I also made an online program to which you could give a ‘size’ and it would find the next prime larger than that. It was used by others in the company. That was done after the printout, and more or less obsoleted it.
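That ‘next prime at least this size’ tool is easy to reconstruct. Here’s a hypothetical sketch of the same idea in Python (plain trial division, nothing like the actual HP3000 code):

```python
def is_prime(n):
    """Trial division up to sqrt(n); fine for database-sized numbers."""
    if n < 2:
        return False
    if n % 2 == 0:
        return n == 2
    d = 3
    while d * d <= n:
        if n % d == 0:
            return False
        d += 2
    return True

def next_prime_at_least(size):
    """Smallest prime >= size, e.g. for picking a database capacity."""
    n = size
    while not is_prime(n):
        n += 1
    return n

print(next_prime_at_least(10000))   # -> 10007
```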
Now primes are more important for cryptography than for database hash tables. Most current public key cryptography depends on the fact that it is very hard to factor a very large number that is the product of Very Large Primes. You must do a very large number of ‘attempted divisions’ before you find out it’s a 35 digit prime number that ‘goes’… I have an idea about how to ‘break’ that encryption (involving a few peta-bytes of storage ;-) but it also would need either massive amounts of money (any TLA interested in funding me, just shoot me an Email ;-) or a clever way to find ALL the primes in a large block of number space.
Most of the very large primes that do get found come from special subsets that lie in areas with a high probability of containing a prime. (There are a couple of methods… the Prime Spiral being one of the more interesting ones…) But I want to find ALL the primes (as someone could be using any of them). Turns out that’s rather hard when you get to very large primes.
Because that’s hard, and because The Glory goes to the “Largest Prime” not to “The Most Complete Set”, most folks have concentrated on finding ways to get random primes that are very large. Not much goes into finding ALL the primes. Heck, even for me, I’ve mostly let it lay there as a ‘someday thing’.
There are lots of ways to find prime numbers, and more ways to prove what you found is in fact prime. I won’t go into them. What I’m interested in is the question of “where do you search?”. IFF you could identify that place most likely to contain the next bunch of primes (constraining the search to a smaller area) you can more easily find those primes. Basically, putting the lamp post next to where you lost the keys…
Some years back (the first time I was on contract to Disney, about 9? years ago), I was stuck in a hotel room in Florida with the old Compaq laptop and not much else to do that night, so I set it to finding primes. I then simply plotted them.
Why? Because that’s what Mr. Theil did. He was my math teacher in high school and had this thing about graphing. So when we’d start looking at something new, he’d put some kind of graph up. Sometimes it was useless. Sometimes it explained all. He basically thought you always ought to “visualize the data” just to see if anything showed up. So that’s what I did.
What I noticed was that the primes formed a “Stochastic Paraboloid”. Basically a parabola-like shape, but with some ‘random chatter’ about the target value. (They ARE integers, after all ;-) So take a bit of graph paper. Take the value “1” (which some folks think is a prime, but lately it has been ‘cast out’ by folks who have a pet theory that doesn’t work if 1 is a prime. For me, it’s a prime. It is “divisible by 1 and itself”. That those are both the same thing makes it a Very Special Prime IMHO. Not an outcast.) At any rate, as it’s “Special”, plot it at zero (another special number). Now the next prime is two. Plot it at (1, 2), then move on to three at (2, 3) and five at (3, 5), and on it goes. When you look at that graph with a few dozen numbers on it, it looks rather like a parabolic curve.
So, gee, I thought, could it be that easy? No, it’s not… There is “stochastic jitter” about the expected value.
BUT, that means that it can serve very well as a ‘place to look’. One method is to find a decent parabolic formula fit to the data, then simply search both above and below that projected point. Odds are very high that you will find the next prime reasonably close to that projection. (I’ll leave it to others to do some MATLAB curve fit to the primes and see what’s a decent fit… I did my ‘curve fit’ by hand, so the automated fit will undoubtedly be better).
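That MATLAB-style curve fit is easy enough to automate. Here’s a sketch of my own in Python (pure stdlib; a least-squares parabola solved via the normal equations) that fits the first 9,999 primes and projects where the 10,000th ought to be:

```python
def sieve(limit):
    """All primes <= limit via the Sieve of Eratosthenes."""
    is_p = [True] * (limit + 1)
    is_p[0] = is_p[1] = False
    for p in range(2, int(limit ** 0.5) + 1):
        if is_p[p]:
            for m in range(p * p, limit + 1, p):
                is_p[m] = False
    return [n for n, f in enumerate(is_p) if f]

def quad_fit(xs, ys):
    """Least-squares y ~ a*x^2 + b*x + c via the 3x3 normal equations."""
    s = [sum(x ** k for x in xs) for k in range(5)]                   # sums of x^0..x^4
    t = [sum(y * x ** k for x, y in zip(xs, ys)) for k in range(3)]   # sums of y, y*x, y*x^2
    A = [[s[4], s[3], s[2]], [s[3], s[2], s[1]], [s[2], s[1], s[0]]]
    rhs = [t[2], t[1], t[0]]
    for i in range(3):                       # forward elimination
        for j in range(i + 1, 3):
            f = A[j][i] / A[i][i]
            A[j] = [aj - f * ai for aj, ai in zip(A[j], A[i])]
            rhs[j] -= f * rhs[i]
    coef = [0.0, 0.0, 0.0]
    for i in (2, 1, 0):                      # back substitution
        coef[i] = (rhs[i] - sum(A[i][j] * coef[j] for j in range(i + 1, 3))) / A[i][i]
    return coef                              # [a, b, c]

primes = sieve(110000)                      # big enough to include the 10,000th prime
xs = [i / 10000 for i in range(1, 10000)]   # ordinals, scaled to [0,1] for numerical stability
ys = primes[:9999]                          # the first 9,999 primes
a, b, c = quad_fit(xs, ys)
prediction = a + b + c                      # the fit evaluated at scaled ordinal 1.0
print(round(prediction), primes[9999])      # projection vs. the actual 104729
```

The projection lands within a few percent of the actual 104729, which is exactly the point: it marks where to search, not the answer itself.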
A different approach is to put “bounds” on the search. That is, what curve means “you’ve likely gone too far!”?
That graph is a bit small and kind of hard to read, but it’s ‘for illustration only’ anyway. What it shows is two curves that lie above and below several samples of primes. As I’ve “put away” my notebook “somewhere”, and the Compaq is bootable but also put away “somewhere”, I recreated some of the work in a spreadsheet in a crude way. So I typed in several dozen primes and fit the curves by trial and error (mostly error ;-). They aren’t the same as the ones I originally did, but close.
What does this show?
Well, almost impossible to see are the ‘few dozen’ prime numbers near zero / one and up to a couple of hundred. Then there’s a gap, and I’ve put some more primes in the table at values near 1020. There’s a TINY LITTLE DOT on the graph at that point. A similar dot at the very far right is the primes around the 10,000th prime. The green “linear fit” line more or less obliterates them, but they are there.
What about the other lines? One is a ‘lower bound’ curve fit of a sort. The other is an ‘upper bound’ fit. The curve formula that fits best at low ordinal (counting position) of primes is not the same as what fits best at high count values. That mostly means I’m likely being way too simplistic in my formula and it needs a couple more terms…
Since the graph is somewhat hard to read, I’m going to put some “table data” here. I’m not going to put in a table for the very low values; they’ve already been found anyway. One other point to make is about that green “linear fit” line. The primes are below it at the start, then likely move above it at the end (just after the last data on the chart, which is an intercept). At very large values, a parabola approaches a linear shape. One could do a decent search just doing a ‘linear fit’ to the known data and then searching above that line going forward… As more data are found, do a new fit. Repeat.
For the parabolic fit, or for other quadratic fits to limit curves, a similar ‘recalculate’ will bring the bounds closer to the actual value of interest. So even though these formulas will likely diverge at very large values, one could simply calculate a few thousand values, then ‘re-bound’ the curves. (That is, tune them to still be outside the new data, but not too far outside…)
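That ‘re-bound’ step can be automated too. A sketch of mine (the 2% margin is an arbitrary ‘not too far outside’ choice, and a straight line stands in for the parabola just to keep it short):

```python
def rebound(ordinals, primes, margin=1.02):
    """Refit a trend line to (ordinal, prime) data, then push it out until it
    just clears all the points: returns (lower, upper) bound functions.
    A straight line is used for brevity; the fits in the post are parabolic."""
    n = len(ordinals)
    mx = sum(ordinals) / n
    my = sum(primes) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(ordinals, primes)) \
          / sum((x - mx) ** 2 for x in ordinals)
    fit = lambda x: my + slope * (x - mx)
    hi = max(y - fit(x) for x, y in zip(ordinals, primes))   # worst overshoot
    lo = min(y - fit(x) for x, y in zip(ordinals, primes))   # worst undershoot
    return (lambda x: fit(x) + margin * lo), (lambda x: fit(x) + margin * hi)

# After each new batch of primes is found, call rebound() again on the
# enlarged data set and the bounds re-tighten around the data.
first = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47]
lower, upper = rebound(list(range(1, 16)), first)
print(lower(16), upper(16))   # the window in which to hunt for the 16th prime (53)
```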
Via this iterative approach, the method ought to be extensible to any degree. (Finding even more accurate curve fits would help, but is not essential).
So what are the formulas used in this graph? For the upper bound, it is:
So what’s an “OrdinalPrimeValue”? Well, 2 is the “first prime”, so its ordinal (count) value is 1. Next comes 3, so its ordinal is 2. The 3rd prime is 5, then the fourth ordinal goes with 7. This method depends on knowing ALL the primes; otherwise you can’t ‘count’ them and don’t know the ordinal value…. So it can extend all the known primes, but can’t find ‘the biggest’ prime out in the far netherworld…
Up at the 10,000th prime, the ordinal value is 10,000 and the prime is 104729. The lower bound formula has a value of 104189 and the upper bound value is 119999. So this simple crude first cut would have a search of at most 15810 values to find the 10,000th prime. (A closer curve fit, or a search from the ‘middle out’, or any of several other ‘directed search’ methods could be used to make this better. Oh, and a 30,000 core massively parallel processor machine would test all the values at once and have half its power left unused ;-) As each set found lets the bounds be better fit for the next batch, the ability to find primes ought to be roughly linear work with size (or a very slightly growing work value if the degree of ‘curve fit’ improvement isn’t linear).
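To make that concrete, here’s a sketch of the bounded search using those numbers (simple trial division standing in for a fancier primality test):

```python
def is_prime(n):
    """Trial division; adequate at this scale."""
    if n < 2:
        return False
    if n % 2 == 0:
        return n == 2
    d = 3
    while d * d <= n:
        if n % d == 0:
            return False
        d += 2
    return True

def bounded_search(last_known, lower, upper):
    """Scan the predicted window [lower, upper] for the next prime after
    last_known. Each candidate test is independent, so a many-core box
    could check the whole window at once."""
    start = max(lower, last_known + 1)
    for n in range(start, upper + 1):
        if is_prime(n):
            return n
    return None   # bounds were too tight; widen and retry

# The 9,999th prime is 104723; the crude bounds put the 10,000th
# somewhere between 104189 and 119999.
print(bounded_search(104723, 104189, 119999))   # -> 104729
```

If the bounds prove over-tight (as happens where an ‘upper bound’ slips under the prime), the search simply comes up empty and you widen and retry.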
The lower bound?
All in all, not all that hard.
(One trivial enhancement is to just plot the linear intercept trend line to the last found value and search above it first. Only on very rare occasions ought the next value be below it. The next enhancement is to search a “little bit” below the intercept value just to avoid surprises… then tuning ‘how much’ to search below based on experience.)
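A sketch of that trivial enhancement (the window and slack values are arbitrary tuning knobs of mine):

```python
def is_prime(n):
    """Trial division; fine at these scales."""
    if n < 2:
        return False
    if n % 2 == 0:
        return n == 2
    d = 3
    while d * d <= n:
        if n % d == 0:
            return False
        d += 2
    return True

def next_prime_trend(known, window=25, slack=5):
    """Project the recent average gap one step past the last known prime,
    back off by 'slack' to avoid surprises, then scan upward."""
    window = min(window, len(known))
    gap = (known[-1] - known[-window]) / (window - 1)   # recent average gap
    n = max(known[-1] + 1, int(known[-1] + gap) - slack)
    while not is_prime(n):
        n += 1
    return n

# With the 25 primes below 100 in hand, the trend points at the next one:
print(next_prime_trend([2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37,
                        41, 43, 47, 53, 59, 61, 67, 71, 73, 79, 83, 89, 97]))  # -> 101
```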
OK, some tables. There are four columns here. The first is the ordinal value, then the prime number, followed by the lower bound, then the upper bound.
Count  Prime  Low B  Up Bound
  972   7649   6318   7625
  973   7669   6326   7634
  974   7673   6334   7644
  975   7681   6341   7653
  976   7687   6349   7662
  977   7691   6357   7671
  978   7699   6365   7680
  979   7703   6372   7689
  980   7717   6380   7699
  981   7723   6388   7708
  982   7727   6396   7717
  983   7741   6404   7726
  984   7753   6411   7735
  985   7757   6419   7744
  986   7759   6427   7754
  987   7789   6435   7763
  988   7793   6442   7772
  989   7817   6450   7781
  990   7823   6458   7790
  991   7829   6466   7800
  992   7841   6474   7809
  993   7853   6481   7818
  994   7873   6489   7827
  995   7877   6497   7836
  996   7879   6505   7846
  997   7883   6513   7855
  998   7901   6520   7864
  999   7907   6528   7873
 1000   7919   6536   7883
I had a ‘fit’ with a closer match at this point, but it diverged at the 10,000th ordinal prime. If you look at this carefully you will find that the ‘upper bound’ is smaller than the prime. It was larger than the prime prior to the last change to the formula, made to fit out at the 10,000 end (only the coefficients were made slightly different). As an example of ‘post iteration’ it is interesting: I’d had a formula that was clearly outside the bounds, but now it is inside with the ‘latest fit’… but as I’ve already ‘found’ these primes it’s not so important… Generally, I think an ‘iterative fit’ works best; but even here, a search of a few hundred values to find the next prime is not a very large compute load. I expect that at even larger values, the fit will become even tighter (as the parabolic shape becomes ever more linear…)
The earlier value had been:
So you can see how I iterated the values over time as more data further out were added. Again, a better automated curve fit to known values would be beneficial, but the real value is in the iterative ‘fit’ as new data are added, and in this being a target for new value searches, not a hard wall. So “search here first” ought to find the next value first; but when it doesn’t, the value is very nearby…
The first few million primes have already been computed, so using a larger set of data would give an even tighter fit to the bounds. (Something to do “real soon now” ;-)
Finally, note that the ‘bounds’ are an indication of probability, not of certainty. They tell you where to start looking and get you close to the next value. Searching a bit outside the lines on those occasions where no prime was found is unlikely to be a major compute cost issue, even if you have computed overly tight ‘bounds’ on a further iteration of the fit.
Here’s the ‘fit’ out at the 10,000th prime point (post iteration):
Count   Prime   Low B   Up Bound
 9981  104549  103950   119727
 9982  104551  103962   119741
 9983  104561  103975   119756
 9984  104579  103987   119770
 9985  104593  104000   119784
 9986  104597  104013   119799
 9987  104623  104025   119813
 9988  104639  104038   119827
 9989  104651  104050   119842
 9990  104659  104063   119856
 9991  104677  104076   119870
 9992  104681  104088   119885
 9993  104683  104101   119899
 9994  104693  104113   119913
 9995  104701  104126   119928
 9996  104707  104139   119942
 9997  104711  104151   119956
 9998  104717  104164   119971
 9999  104723  104176   119985
10000  104729  104189   119999
You can see that it’s about 540 terms to test from the lower bound up to the prime, and about 15,000 more up to the upper bound. Clearly a bit better fit is to be had ;-) Still, searching a ‘thousands scale’ area for the next prime is not a lot of compute work. Then, a few thousand primes further along, computing the ‘new bounds’ lets the process continue at about constant work load growth.
That’s the idea and the method. Iterative curve fitting of upper and lower bounds. Probabilistic prediction of ‘which way to look’ and ‘how far’.
As I already have a few million prime numbers, I’ve not computed the next batch(es)… but if anyone needs them, it ought to be pretty easy to do with these methods. (Breaking current crypto techniques would likely take a few hundred thousand dollars and a year or three of work… Oh, and some ‘way cool’ computer gear ;-)
There’s more ‘stuff’ about primes than any sane human could read… but just a couple of pointers here:
The first 10,000 primes:
A Wolfram Alpha curve fitting page:
The first 100,000 prime numbers:
The first 50 Million Primes:
There’s a whole lot more math theory out there about primes. What I find frustrating is that it tends to fall into one of two camps: 1) methods for showing a number is prime, or variations on brute-force finding of primes; 2) ways to find a very large prime while skipping over large areas of primes that are not yet known.
The whole area of ‘directed search’ tends to be ignored. IMHO there’s a very large opportunity to ‘put bounds on the search’ and find a whole lot of primes. Just not the biggest one… FWIW, I also ran into a formula that gave a large number of primes over one small span. It had variations with 4th power and 5th power functions. IMHO, that’s just a statistical quirk in that segment of the prime space. It ought not take a 5th power term to put bounds on primes….
OK, so that’s the “idea”. And it’s on my “someday” list of things to do to tighten it up a bit… but as I’ve got way more “things to do” than “time to do them”, I figured maybe someone else could benefit from the idea. The simple approach is to plot the first few million prime numbers, put a ‘best fit’ curve through them (automated codes for that exist now) and put ‘bounds’ curves on each side of them. At that point, searching each way from the best fit curve and out to the bounds (or just a touch beyond) ought to exhaustively find all primes with minimal computes. What to do with them is “for another day” ;-)
One can only wonder if someone else has thought of this. Pre-compute a whole lot of ‘products of primes’. Put them in a Very Large Lookup Table and then when it’s desired to ‘factor a very large number’ it is just a single ‘lookup’ in said peta-byte of data… not a very large compute workload at all…. As TB scale storage is on the order of $100 (or less…) and a single ‘lookup’ doesn’t take a lot of seek nor transfer speed, it is very do-able to precompute and store a very large number of such products into very slow and cheap storage. It’s the “size” that matters, not the transfer speed….
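Scaled down from peta-bytes to a toy dict, the lookup idea sketches out like this (illustration only; the real key encoding and disk layout are the whole engineering problem):

```python
def sieve(limit):
    """All primes <= limit via the Sieve of Eratosthenes."""
    is_p = [True] * (limit + 1)
    is_p[0] = is_p[1] = False
    for p in range(2, int(limit ** 0.5) + 1):
        if is_p[p]:
            for m in range(p * p, limit + 1, p):
                is_p[m] = False
    return [n for n, f in enumerate(is_p) if f]

# Precompute products of prime pairs once; "factoring" is then a single lookup.
primes = sieve(1000)
table = {}
for i, p in enumerate(primes):
    for q in primes[i:]:                 # p <= q, so each product is stored once
        table[p * q] = (p, q)

print(table[896773])   # -> (941, 953), recovered with zero division attempts
```

At real scale the table would live sorted on cheap, slow disk and the ‘lookup’ becomes a seek; the dict is just the shape of the idea.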
With that, I hope I’ve not just obsoleted a large amount of crypto gear ;-) I’m not too worried, as it likely would take a large budget to prove the thesis and those folks are already well funded and ‘further along’ than I am; so at most I’d be ‘spilling the beans’ to the rest of us… and that’s a good thing…
Peta-FLOP scale computers are ‘reasonable’ and peta-byte scale storage is possible. Folks who can fund that scale of compute farm have a whole different way of looking at problems… so “Why factor a large number when you can just do a lookup?” is a reasonable question at that scale…
I think I’ve shown that computing the data to fill that lookup-store is a reasonable problem to solve.