Implementing an efficient random iterator over a multi-level data structure
FRONT UPDATE:
Ken B. has pointed out that I was thinking like a C programmer in writing this: I was thinking in terms of shuffling arrays of C-like integers versus arrays of larger C data structures. In that situation, my algorithm would be more efficient.
But that is not how Python does things! These are Python lists, not C arrays, and the list is a list of pointers, whether the list is of integers or of agents. So the shuffle has the same cost either way!
In other words, I did a bunch of fancy coding that would be nice in a different language, but that makes no difference in the one I am actually using.
Still, the coding was interesting.
FRONT UPDATE II:
An important lesson I learn yet again -- I guess I am rustier than I thought! -- Don't "optimize" without profiling (timing portions of) your code!
I timed my fancy-pants version of the random iteration, as well as of reverse iteration, against the "naive" versions that just used basic Python operations: in both cases, the basic versions ran much faster than my fancy versions! I made two mistakes here:
* I thought like a C programmer instead of a Python programmer.
* I failed to account for the fact that the basic Python list operations would be implemented in C code, and thus run very fast compared to my "improved" versions, which had to run in interpreted Python code.
Here is the simple random iterator:
But there is another important point here: By interfacing with my data only through certain abstractions (these iterators), I was able to very easily swap different implementations of the iterators in and out without breaking the code of any model at all. So long as the iterator returned the "same" result -- same is in quotes as by definition a random iterator returns different results each time, but different in the "same" (random) sort of way -- I could experiment with different implementations very easily. As someone noted in 1992, using objects can greatly simplify making extensive changes to the code implementing an interface.
******************
We have a population of agents. Sometimes we want it to look like there is a list of agents... but there isn't, because most often it is more efficient to store them by their type, and access them that way. (Furthermore, different models have different numbers of agent types.) The code to iterate straight through the various populations sequentially was fairly straightforward:
Ken B. has pointed out that I was thinking like a C programmer in writing this: I was thinking in terms of shuffling arrays of C-like integers versus arrays of larger C data structures. In that situation, my algorithm would be more efficient.
But that is not how Python does things! These are Python lists, not C arrays, and the list is a list of pointers, whether the list is of integers or of agents. So the shuffle has the same cost either way!
In other words, I did a bunch of fancy coding that would be nice in a different language, but that makes no difference in the one I am actually using.
Still, the coding was interesting.
FRONT UPDATE II:
An important lesson I learn yet again -- I guess I am rustier than I thought! -- Don't "optimize" without profiling (timing portions of) your code!
I timed my fancy-pants version of the random iteration, as well as of reverse iteration, against the "naive" versions that just used basic Python operations: in both cases, the basic versions ran much faster than my fancy versions! I made two mistakes here:
* I thought like a C programmer instead of a Python programmer.
* I failed to account for the fact that the basic Python list operations would be implemented in C code, and thus run very fast compared to my "improved" versions, which had to run in interpreted Python code.
Here is the simple random iterator:
# simple random iterator:
def all_agents_list(self):
"""
Assemble all agents in a single list.
"""
all_agents = []
for var in self.varieties_iter():
all_agents += self.vars[var]["agents"]
return all_agents
def agent_random_iter(self):
"""
Loop through agents in random order.
"""
all_agents = self.all_agents_list()
random.shuffle(all_agents)
return iter(all_agents)
But there is another important point here: By interfacing with my data only through certain abstractions (these iterators), I was able to very easily swap different implementations of the iterators in and out without breaking the code of any model at all. So long as the iterator returned the "same" result -- same is in quotes as by definition a random iterator returns different results each time, but different in the "same" (random) sort of way -- I could experiment with different implementations very easily. As someone noted in 1992, using objects can greatly simplify making extensive changes to the code implementing an interface.
******************
We have a population of agents. Sometimes we want it to look like there is a list of agents... but there isn't, because most often it is more efficient to store them by their type, and access them that way. (Furthermore, different models have different numbers of agent types.) The code to iterate straight through the various populations sequentially was fairly straightforward:
# iterate over multiple lists as if one:
def __iter__(self):
alists = []
# self.varieties_iter() returns each
# type of agent we have at hand
for var in self.varieties_iter():
alists.append(
self.vars[var]["agents"])
# create an iterator that chains
# the lists together as if one:
return itertools.chain(*alists)
But how do we iterate over the agents randomly, making sure we hit each one? We can form a big list on-the-fly and use Python's random.shuffle()
to mix them up, but that will be very inefficient for large agent populations. The key to doing this somewhat simply and with decent efficiency was to generate a random sequence of integers between 0 and the size of the agent population (minus one, of course!), and then iterate through that list, figuring out which list of agents an index points into along the way. Here is the code that does this:
# iterate over multiple lists
# in a random order:
class AgentRandomIter:
"""
Iterate randomly through our agents.
Eventually this should be made
generic so it can randomly iterate
through anything that implements
an element_at() method.
"""
def __init__(self, agents):
self.i = 0
self.agents = agents
self.indices =
list(range(len(agents)))
random.shuffle(self.indices)
# at this point, indices holds
# a list of the agent indices,
# but scrambled, i.e., with
# four agents, it might hold
# 3, 0, 2, 1
def __iter__(self):
return self
def __next__(self):
"""
Return the next element or
raise exception to stop iterating.
"""
if self.i < len(self.indices):
# get an agent!
agent = self.agents.element_at(
self.indices[self.i])
self.i += 1
return agent
else:
raise StopIteration()
def agent_random_iter(self):
"""
Loop through agents in random order.
"""
return AgentPop.AgentRandomIter(self)
def element_at(self, i):
"""
Another way to treat the AgentPop
as if it were really one big list.
"""
if i < 0 or i > len(self):
raise IndexError()
else:
for var in self.varieties_iter():
l = len(self.vars[var]["agents"])
if i < l:
# that means the agent is
# in this list
return self.vars[var]["agents"][i]
else:
# otherwise, the agent lies
# in one of the
# remaining lists, so
# subtract the length
# of this one from i and
# continue.
i -= l
I forgot how beautiful Python's code is. Nice stuff Gene...
ReplyDeleteNot clear to me why the call you make to shuffle is more efficient than the call to shuffle you are doing this to avoid. Same size set of elements to shuffle.
ReplyDeleteNo, Ken, here I am shuffling a list of integers, rather then a list of agents. An agent object can be arbitrarily large.
DeleteI looked up the Python integer implementation: I guess they can get arbitrarily large as well. But realistically, for any number of agents anyone is likely to use, The index won't exceed the minimum integer size. But agents realistically could be hundreds of bites in size.
DeleteStill, I admit I was thinking like a C programmer, and assuming I was shuffling 4 or 8 byte things, not 24 byters! (In CPython, at least.)
DeleteCan you think of anything more efficient?
You will be shuffling object references in either case I think. In C terms, pointers (of fixed size). So you won't be moving the bits of the large object, just the pointer's bits.
DeleteI doubt there is a more efficient method, since ultimately everything comes down to some version of what shuffle does.
Aargh! Of course the list is pointers!
ReplyDeleteWell, it was a neat little bit of programming. A neat little bit of useless programming.
In your update on profiling, I want to make a point I suspect you have missed. You used exceptions to detect the end of the iteration. This is a very bad idea, for several reasons. One reason is it will be a big performance hit, assuming Python is remotely like any language I have worked in with exceptions.
ReplyDeleteThat is the Python standard: it is *required* to throw that exception at the end of an iteration.
DeleteIndeed. I found this https://www.python.org/dev/peps/pep-0234/
ReplyDeleteIn general, I'm terrible at code optimization. I like to stick with certain operations/routines purely for aesthetic reasons. My custom cryptographic hash runs much slower than SHA-1 on a five sentence string, for instance (several seconds).
ReplyDelete