382 stories

Balls Into Bins In Distributed Systems

1 Share

Balls Into Bins In Distributed Systems

Throwing things can be fun.

If you've come across the Balls Into Bins problem, you probably heard about in context of hash tables. When you hash things into a hash table (especially with separate chaining) it's really useful to be able to ask "If I throw đ‘€ balls into đ‘ bins, what is the distribution of balls in bins?" You can see how this is fundamental to hash tables: the amortized complexity argument for hash tables depends on their being some load factor (i.e. đ‘€/đ‘) for which most bins contain a small number of items. Once this stops being true, lookup and insertion time on hash tables starts to get ugly. So from that perspective it's already clearly an important problem.

Load Balancing and Work Allocation

Hash tables aren't the only place that the Balls Into Bins problem is interesting. It comes up often in distributed systems, too. For one example, think about a load balancer (in this case a distributor of independent requests) sending load to some number of backends. Requests (đ‘€) are balls, and the backends are bins (đ‘) and typically there are multiple requests going to each backend (đ‘€ > đ‘). If we know how to solve for the number of balls in each bin, we can understand the limits of random load balancing, or whether we need a stateful load balancing algorithm like least connections. This is an important question to ask, because sharing consistent state limits scalability, and sharing eventually-consistent state can even make load balancing decisions worse. Load balancing is much easier if it can be done statelessly.

A related problem is push-based work allocation. Here, there is some co-ordinator handing out work items to a fleet of workers, and trying to have those workers do approximately equal amounts of work. One way that systems end up with this pattern is if they are using shuffle sharding or consistent hashing to distribute work items (or records). These hashing-based methods can be great for scaling, and so are widely used across all kinds of large-scale systems. Just as with load balancing, its interesting to be able to understand how well requests are distributed.

Traditionally, papers about this problem have been most concerned about the expectation of the maximum number of balls in a bin ("how bad can it get?"), but other statistics like the expectation of the mean and expectation of the median can be interesting when planning and designing for load. It's also interesting to understand the variance of the maximum, and the size of the right tail on the distribution of the maximum. If the maximum can get really high, but will do so infrequently, then load testing can be difficult.

Closed-Form Analysis

Gaston Gonnet's Expected Length of the Longest Probe Sequence in Hash Code Searching, was one of the first papers to tackle analyzing the problem, in context of separate-chaining hash tables2. Michael Mitzenmacher's PhD thesis (the power of two choices in randomized load balancing) simplifies Gonnet's analysis and finds, for the đ‘€=đ‘ case, the maximum number of balls is1 𝝝(log đ‘/log log đ‘). That's not a curve you'll come across often, so this is what it looks like:

In other words, it grows, but grows pretty slowly. Most real-world cases are probably going to have đ‘€>đ‘, and many will have đ‘€â‰Ťđ‘. To understand those cases, we can turn to one of my favorite papers in this area, Balls Into Bins: A Simple and Tight Analysis by Raab and Steger. They provide a great overview of the problem, a useful survey of the literature, and a table of bounds on the maximum number of balls in any bin, depending on the relationship between đ‘€ and đ‘. The proofs are interesting, and somewhat enlightening, but not necessary to understand to find the paper useful.

While this kind of analysis is very useful, when I've needed to solve these problems in the past, I haven't tended to use the results directly. Instead, I've used them to sanity check the output of simulations. This is where engineering practice diverges a bit from computer science theory (although it does it in a pretty theoretically rigorous way).

Simulating the Problem

There are a couple of limitations on the usefulness of the closed-form analysis of this problem. One problem is that it's fairly difficult to understand clearly (at least for me), and quite complex to communicate. The bigger problem, though, is that it's quite inflexible: extending the analysis to include cases like balls of different sizes (as requests are in the real world) and balls coming out of bins (requests completing) is difficult, and difficult to code review unless you are lucky enough to work with a team that's very mathematically sophisticated. The good news, though, is that this problem is exceptionally easy to simulate.

When I think about doing these kinds of simulations, I don't generally think about using specialized simulation tools or frameworks (although you could certainly do that). Instead, I generally think about just writing a few tens of lines of Python or R which directly try the thing that I want an answer for many times, and then output data in a form that's easy to plot. Computer simulation is a broad and subtle topic, but this kind of thing (throw balls into bins, count, repeat) avoids many of the subtleties because you can avoid floating point (its just counting) and because you can avoid being too concerned about the exact values.

Knowing the closed-form analysis makes it easy to sanity-check the simulation. According to Gonnet, the đ‘€=đ‘ case should approach logđ‘/loglogđ‘ (1+đ‘œ(1)), and we can plot that curve (choosing a value of đ‘œ(1) to minimize the difference) alongside the simulation results to see if the simulation matches the theory. The results look pretty good.

Gonnet's paper also contains a table of example values, which compare very well to our simulated and modelled numbers. That all increases our confidence that the simulation is telling us sensible things.

You can also extend this basic counting to be closer to real-world load-balancing. Follow a Poisson process (a fancy way of saying "use exponentially distributed random numbers to decide how long to wait") to add random balls into bins over time, and follow your completion time distribution (probably exponential too) to pull them out of the bins. Every so often, sample the size of the biggest bin. Next, output those samples for analysis. If you have real arrival-time data, and completion-time distributions, you can use those to avoid making any statistical assumptions. Which is nice.

When you've got the basic simulation, it's easy to add in things like different-sized requests, or bursts of traffic, or the effects of scaling up and down the backends.

Some Basic Results

For small M and N, the constant factors are a big problem. With đ‘€=đ‘=100, I get an expected maximum of around 4.2. In other words, we can expect the busiest backend to be over 4x busier than the average. That means that you either need to significantly over-scale all your backends, or put up with the problems that come with hotspotting. This problem also gets worse (although very very slowly, going back to the closed-form) with scale.

In the closer-to-reality case with đ‘€=1000 and đ‘=100, the gap shrinks. The expected maximum comes to as 18.8, compared to a mean (aka đ‘€/đ‘) of 10. That still means that the hottest backend gets 80% more traffic, but the gap is closing. By đ‘€=10000 and đ‘=100, the gap has closed to 25%, which starts to be close to the realm of acceptable. Up to đ‘€=100,000 and the gap is closed to 8%. In most distributed systems contexts, 8% is probably within the variation in performance due to other factors.

Still, the conclusion of all of this is that random load balancing (and random shuffle-sharding, and random consistent hashing) distributed load rather poorly when đ‘€ is small. Load-sensitive load-balancing, either stateful or stateless with an algorithm like best-of-two is still very much interesting and relevant. The world would be simpler and more convenient if that wasn't the case, but it is.


  1. That's a Big Theta, if you're not familiar with it wikipedia has a good explanation of what it means. If you don't feel like reading that, and replace it with a big O in your head, that's close enough in this case.
  2. That paper also contains some great analysis on the numerical properties of different hash-table probing strategies versus seperate chaining. If you like algorithm analysis, the conclusion section is particularly interesting.
Read the whole story
23 days ago
Share this story




More cat adventures.

Read the whole story
87 days ago
Share this story

Announcing OpenFermion: The Open Source Chemistry Package for Quantum Computers


“The underlying physical laws necessary for the mathematical theory of a large part of physics and the whole of chemistry are thus completely known, and the difficulty is only that the exact application of these laws leads to equations much too complicated to be soluble.”
-Paul Dirac, Quantum Mechanics of Many-Electron Systems (1929)

In this passage, physicist Paul Dirac laments that while quantum mechanics accurately models all of chemistry, exactly simulating the associated equations appears intractably complicated. Not until 1982 would Richard Feynman suggest that instead of surrendering to the complexity of quantum mechanics, we might harness it as a computational resource. Hence, the original motivation for quantum computing: by operating a computer according to the laws of quantum mechanics, one could efficiently unravel exact simulations of nature. Such simulations could lead to breakthroughs in areas such as photovoltaics, batteries, new materials, pharmaceuticals and superconductivity. And while we do not yet have a quantum computer large enough to solve classically intractable problems in these areas, rapid progress is being made. Last year, Google published this paper detailing the first quantum computation of a molecule using a superconducting qubit quantum computer. Building on that work, the quantum computing group at IBM scaled the experiment to larger molecules, which made the cover of Nature last month.

Today, we announce the release of OpenFermion, the first open source platform for translating problems in chemistry and materials science into quantum circuits that can be executed on existing platforms. OpenFermion is a library for simulating the systems of interacting electrons (fermions) which give rise to the properties of matter. Prior to OpenFermion, quantum algorithm developers would need to learn a significant amount of chemistry and write a large amount of code hacking apart other codes to put together even the most basic quantum simulations. While the project began at Google, collaborators at ETH Zurich, Lawrence Berkeley National Labs, University of Michigan, Harvard University, Oxford University, Dartmouth College, Rigetti Computing and NASA all contributed to alpha releases. You can learn more details about this release in our paper, OpenFermion: The Electronic Structure Package for Quantum Computers.

One way to think of OpenFermion is as a tool for generating and compiling physics equations which describe chemical and material systems into representations which can be interpreted by a quantum computer1. The most effective quantum algorithms for these problems build upon and extend the power of classical quantum chemistry packages used and developed by research chemists across government, industry and academia. Accordingly, we are also releasing OpenFermion-Psi4 and OpenFermion-PySCF which are plugins for using OpenFermion in conjunction with the classical electronic structure packages Psi4 and PySCF.

The core OpenFermion library is designed in a quantum programming framework agnostic way to ensure compatibility with various platforms being developed by the community. This allows OpenFermion to support external packages which compile quantum assembly language specifications for diverse hardware platforms. We hope this decision will help establish OpenFermion as a community standard for putting quantum chemistry on quantum computers. To see how OpenFermion is used with diverse quantum programming frameworks, take a look at OpenFermion-ProjectQ and Forest-OpenFermion - plugins which link OpenFermion to the externally developed circuit simulation and compilation platforms known as ProjectQ and Forest.

The following workflow describes how a quantum chemist might use OpenFermion in order to simulate the energy surface of a molecule (for instance, by preparing the sort of quantum computation we described in our past blog post):
  1. The researcher initializes an OpenFermion calculation with specification of:
    • An input file specifying the coordinates of the nuclei in the molecule.
    • The basis set (e.g. cc-pVTZ) that should be used to discretize the molecule.
    • The charge and spin multiplicity (if known) of the system.
  1. The researcher uses the OpenFermion-Psi4 plugin or the OpenFermion-PySCF plugin to perform scalable classical computations which are used to optimally stage the quantum computation. For instance, one might perform a classical Hartree-Fock calculation to choose a good initial state for the quantum simulation.
  2. The researcher then specifies which electrons are most interesting to study on a quantum computer (known as an active space) and asks OpenFermion to map the equations for those electrons to a representation suitable for quantum bits, using one of the available procedures in OpenFermion, e.g. the Bravyi-Kitaev transformation.
  3. The researcher selects a quantum algorithm to solve for the properties of interest and uses a quantum compilation framework such as OpenFermion-ProjectQ to output the quantum circuit in assembly language which can be run on a quantum computer. If the researcher has access to a quantum computer, they then execute the experiment.
A few examples of what one might do with OpenFermion are demonstrated in ipython notebooks here, here and here. While quantum simulation is widely recognized as one of the most important applications of quantum computing in the near term, very few quantum computer scientists know quantum chemistry and even fewer chemists know quantum computing. Our hope is that OpenFermion will help to close the gap between these communities and bring the power of quantum computing to chemists and material scientists. If you’re interested, please checkout our GitHub repository - pull requests welcome!

1 If we may be allowed one sentence for the experts: the primary function of OpenFermion is to encode the electronic structure problem in second quantization defined by various basis sets and active spaces and then to transform those operators into spin Hamiltonians using various isomorphisms between qubit and fermion algebras.

Read the whole story
114 days ago
Share this story

the internet vs music, 2017

1 Comment

How do you release music on the internet in 2017?  We’ve got at least:

  • Bandcamp
  • Spotify / Apple Music / Tidal
  • SoundCloud
  • iTunes
  • Resonate.io
  • Mixcloud
  • YouTube
  • Patreon
  • DropBox
  • Self-hosting
  • ?????

Now let’s look at the tape:

Bandcamp is probably my favorite, but it lacks a bit of jazz, and is more of a store than a place where things can gather momentum.  

Spotify and the streaming services are not really great for bootleg stuff or for things aimed at DJs. (Disclosure:  I’m a Spotify employee, and it’s not perfect, but I don’t think it is bad.)

SoundCloud is probably dying.

iTunes is boring as six horses, and most people don’t buy music.

Resonate is adorable, but is even more niche than Bandcamp.

Mixcloud is designed for radio sets (and has iffffyyy design).

 YouTube makes you no money.  

Patreon doesn’t work for shy people, and has no hosting.

Dropbox and OneDrive and Google Drive are pedestrian.

Most people can’t self host.

So is the answer, as ever, for artists, to do everything?  Cut your 200 vinyl or cassettes, push it to Spotify and everywhere else with DistroKid, put a video on YouTube, link it all to your Patreon on Twitter / Instagram / Snapchat?  

It probably is, really.  As someone who’s been around this field for a minute, it is sort of frustrating that there is no easy answer for me wanting to upload a few edits where they might get a tiny amount of attention organically.  (I am, of course, probably just a lazy old man.)

But along these lines, what’s an instagram record label?  What’s a Snapchat record label?  Is AR / VR stuff going to do anything?  What about Twitch?Will there be a boom in compact-disc (or minidisc!) nostalgia soon?  What about the newsletter trend?  The blockchain?  What’s next?

Read the whole story
126 days ago
Share this story

Saturday Morning Breakfast Cereal - Noah's Ark

2 Comments and 11 Shares

Click here to go see the bonus panel!

I mean, the rainbow thing is just a phenomenon due to refraction. How self-centered do you have to be to think it's just about you?

New comic!
Today's News:

Geeks! Just about 10 days to get in your submissions for BAHFest Seattle and BAHFest San Francisco. We're going to have some really awesome geeks on stage, so please submit soon for your chance to be part of things!

Read the whole story
176 days ago
181 days ago
Share this story
1 public comment
170 days ago
earth dimension c-138

hobbes, Morgan Stanley OSS

1 Comment and 2 Shares

For the last few years, we at Morgan Stanley have been developing hobbes -- a programming language, JIT compiler, and database system. It has developed into a critical piece of infrastructure in our low-latency, high-volume trading applications, and we have decided to release the source code to the public on github (currently can be built for recent Linux and macOS platforms):


The database system is a lightweight (self-contained, header-only) library used by applications to efficiently log structured (binary) data over shared memory to minimize application latency and reflect application type structure as accurately as possible in log files. We use this to record a very detailed image of application state over time, which we analyze/query out of band.

The JIT compiler can be used embedded in another application (e.g. to "hot patch" an application with very efficient, precisely typed intercepts) or as a standalone interactive interpreter (e.g.: to monitor and query application log data).

The programming language is a variant of Haskell (HM type inference, algebraic data types, qualified types / type classes) with some adjustments to help reduce boilerplate and derive very efficient machine code. For example, we use "structural" record, variant, and (iso-)recursive types to represent data as it's naturally represented in applications and can be deconstructed generically at compile time (e.g. a record can be printed if its first field can be printed and its rest can be printed, a variant can be printed if its first case can be printed and its rest cases can be printed, a recursive type can be printed if its one-step unrolling can be printed, etc).

We are actively using this on major projects, we will continue to develop this github project as we need new features, and we are interested in engaging others outside of the firm for their thoughts, feedback, and hopefully pull requests. :)

Read the whole story
216 days ago
very cool
Share this story
Next Page of Stories