Making Chords with the PC Speaker by Daniel Ehrman

At least once in every computer engineer's life, the question must arise,

"Can I create the illusion of a chord in QuickBASIC by rapidly multiplexing different notes on my internal PC speaker?"

Well, OK, maybe not that question exactly. But a lot of engineers can probably attest to having a craving for solving tough problems with limited tools, and since I have a passion for both music and retro computers, this was my chosen challenge last weekend.

As a little background, QuickBASIC is the simple language/compiler that came bundled with MS-DOS back in the day. It's the distant ancestor of the Visual Basic many of you may know now and Microsoft's replacement for the older GW-BASIC, which arguably first exposed programming to the masses (perhaps with the exception of Applesoft BASIC of Apple II fame).

QuickBASIC lets you do a lot of fun stuff with your computer (e.g. graphics and sound) with just a couple lines of code. It's like the Arduino of the '80s.

So in my search for the lost chord (yeah, Moody Blues reference), naturally I returned to this glorious language where I first cut my teeth on programming. I won't bore you with the bulk of the code, but here's what the important stuff looks like:

SUB PlayChord (SomeChord AS Chord, Duration, MultiplexingDuration)

' Convert from seconds to clock ticks.
LocalDuration = 18.2 * Duration
LocalMultiplexingDuration = 18.2 * MultiplexingDuration

' Compute local values.
NumIterations = LocalDuration / LocalMultiplexingDuration
NumNotes = SomeChord.NumNotes
NoteDuration = LocalMultiplexingDuration / NumNotes

' Play the notes.
FOR i = 1 TO NumIterations
   SOUND SomeChord.Note1, NoteDuration
   IF (SomeChord.NumNotes >= 2) THEN
      SOUND SomeChord.Note2, NoteDuration
      IF (SomeChord.NumNotes >= 3) THEN
         SOUND SomeChord.Note3, NoteDuration
      END IF
   END IF
NEXT i

END SUB

where a chord is defined as below:

TYPE Chord
   NumNotes AS INTEGER
   Note1 AS INTEGER
   Note2 AS INTEGER
   Note3 AS INTEGER
END TYPE

Each integer in the chord represents the frequency of the given note. I also added a NumNotes member to give us the option of experimenting with different numbers of simultaneous notes. Pretty simple.

Sure, we could improve the scalability of the code, but this isn't a programming exercise, and frankly, this beginner's language from 1985 doesn't exactly make arrays embedded in custom types easy. Our goal here is to make music, so let's get started…

I start with a simple G major chord:

DIM Gmaj AS Chord
Gmaj.NumNotes = 3
Gmaj.Note1 = 196
Gmaj.Note2 = 247
Gmaj.Note3 = 294

which sounds like this with a "multiplexing duration" of 0.02:

Note that the "multiplexing duration" effectively defines how long each pass through the chord's notes lasts, with each note getting an equal slice of that time. So if we increase the multiplexing duration, we should expect to hear the individual notes in the chord become clearer and clearer:

While the higher mux durations give us that cool arcade sound we all know and love, the 0.02 mux value clearly masks the multiplexing best, producing a sound closest to anything we could consider a chord.
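To put numbers on that, here's a small sketch of the time slice each note gets (written in C++ rather than the original QuickBASIC, purely for illustration); the 18.2 ticks-per-second figure is the PC timer rate used in PlayChord above.

```cpp
#include <cassert>
#include <cmath>

// Per-note time slice implied by the multiplexing duration, mirroring the
// arithmetic in PlayChord: the mux duration is converted to clock ticks
// (18.2 ticks per second) and split evenly among the chord's notes.
double note_slice_ticks(double mux_duration_s, int num_notes) {
    double mux_ticks = 18.2 * mux_duration_s;  // seconds -> clock ticks
    return mux_ticks / num_notes;              // equal share per note
}
```

For the 0.02-second mux duration and a three-note chord, each note gets about 0.12 ticks, or roughly 6.7 ms of real time. A 196-Hz G has a period of about 5.1 ms, so each slice holds barely one full cycle of the root note.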

However, the short note durations of the first track yield a distinctly "dirty" sound that's not too appealing. Taking a closer look in Audacity's Frequency Analysis, we can see why:


Whoa! That's a lot of extra frequencies that we really don't want. For reference, this is what the chord looks like if we set NumNotes to 1 and just play the G note:


Much cleaner.

OK, so what's happening in the first plot (the triad G major chord) that's giving us all of those bogus frequencies, and how bad is it really? Well, let's take a closer look at the magnitudes of each of those frequencies by exporting the data to Excel:

[Table: Frequency (Hz) vs. Level (dB), 0.02 mux duration]

OK, so while this provides some clarity—note that we can see frequencies near the three expected ones: 196, 247, and 294—it also raises a couple of interesting, perhaps related, questions:

  1. Why does each of the three primary frequencies come out a little flat (slightly lower than it should be)?
  2. What is this unexpected 277-Hz frequency doing at the top of the chart?

Well, to help us better understand what's going on, let's take a look at the data for the 0.08 mux duration (i.e. longer notes):

[Table: Frequency (Hz) vs. Level (dB), 0.08 mux duration]

Alright, now that looks a little better. This tells us that when the notes are played for a longer period of time, not only do they come out closer to the expected frequencies, but they also appear as the three loudest frequencies in the spectrum.

So it appears from the data (granted, from only two data points) that the accuracy of the notes degrades as their duration decreases.

As it turns out, when the notes are short enough, it doesn't appear that there is enough time to generate a strong consistent tone free of aberrations:


In this waveform view of the 0.02 mux chord, the issue is painfully obvious: the G note—the root note of the chord—is too low a frequency to be played in the small time slice allotted to each note in the chord. The time slice is so small, in fact, that only one cycle of the note can be played, effectively resulting in no audible note.

Upon closer inspection, we can account for the specific reason why the frequency 190 Hz appears stronger than the expected frequency of 196 Hz: the transition from the D note to the G note is slightly longer than the pulse width of the G note itself (i.e. one half of 1/196 seconds). This slightly longer duty cycle is what yields the flat G and is likely what causes the other flat notes as well.
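The arithmetic behind that explanation can be sketched out directly. This is my own back-of-the-envelope model, not code from the original program: treat the speaker output as a square wave and stretch one half-cycle by the transition overhead.

```cpp
#include <cassert>
#include <cmath>

// Perceived frequency of a square wave whose half-period is stretched by
// `extra_s` seconds of transition overhead. For an ideal tone the
// half-period is 1/(2*f); any extra time lowers the effective pitch.
double perceived_frequency(double ideal_hz, double extra_s) {
    double half_period = 1.0 / (2.0 * ideal_hz);    // ideal half-period (s)
    double period = 2.0 * (half_period + extra_s);  // stretched full period
    return 1.0 / period;
}
```

Plugging in the chord's numbers: roughly 80 microseconds of extra transition time per half-cycle is enough to drag an ideal 196-Hz G down to about 190 Hz, matching the flat peak in the spectrum.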

The final big issue is the inconsistency between note transitions. If you choose any note transition (e.g. B to D) and then look for the same transition later in the waveform, you'll likely find that the duration of the transition is slightly different. While this discrepancy might seem trivial on the scale of microseconds, our ears (as well as our software) are sensitive enough to pick up the difference, and it's these little discrepancies that are likely responsible for the unwanted extra frequencies in our chord.

Unfortunately, the transition consistency problem is far from trivial: because commands sent to the PC speaker are interrupt-driven, using these high-level constructs for generating sounds results in fairly non-deterministic behavior. To gain full control, we need lower-level code (i.e. assembly) that can manipulate the PC speaker directly. This, of course, is a whole other blog entry—and probably an entire project—altogether.

For reference, there is a lot of information and previous work out there for playing full-fledged WAV-like music through the PC speaker that completely blows this effort out of the water. Check out this example if you're interested:

A More Efficient Model For Problem-Solving Across Hierarchies by Daniel Ehrman

We work in organizations where different people have different specialties in different fields of knowledge, all of which we depend on in one way or another. We have some worker in Site A who's an ARM expert, another one in Site B who knows DDR, and a third one in Site C who understands verification IP. All of these people are solving their own problems independently, despite the fact that most problems consist of a variety of contributing knowledge bases; the worker pushes on through the problem, struggling to understand what he doesn't already know—but what someone else does.

We can think of this situation like a Venn diagram with each person having his own circle of knowledge or expertise. Each person approaches his problem by sort of pecking in the dark all around his circle, maybe expanding his search a bit, hoping that he eventually hits the slice of area intersecting with someone else's domain. A real-world example may be, after asking around the office and bouncing along a chain of e-mails for a few days, finally finding the person who can fill the knowledge gap—and even then, often, only partially.

Perhaps more concretely though, it's best to think of the situation as it fits in with the typical hierarchy:


This kind of structure will be as present in a computer engineer's mind as it is in that of a human resources department. And as any computer engineer could tell you: those "leaf nodes" at the bottom of the tree—they're isolated from each other. In fact, it's inherent in this structure that the only way of getting from one leaf to another is to go up the tree through each node's "parent."

So what's the point? The fact is, passing information between leaves (i.e. lower-level employees) is remarkably easy these days with e-mail, web conferencing, and other collaboration tools. The key insight, though, is to recognize that the leaves only see what they've been given, and they will only communicate with each other when asked. So although in reality they may have direct connections (phone, e-mail, etc.), they remain isolated by what they're told to work on.

Taking a step back, let's think about how a task makes its way to the individual problem solvers at the bottom of the tree. High-level goals are constructed by upper management; criteria are defined; and tasks designed to meet those criteria are assigned to workers at the next lowest level, with increasing amounts of detail at each level down.

But at the point when a problem is assigned to a specific node at the next lowest level, an important (and potentially costly) decision has been made: the other nodes at that level, and all of the workers beneath them, have just been pruned from solving that problem. And at that moment, the decision-maker must operate with only the limited information he has been provided by the single level beneath him.

As it turns out, this is a classic problem in artificial intelligence: in the searching of decision trees, such as in the game of chess, often there are far too many possibilities to consider in a reasonable amount of time. The solution is to develop a heuristic, search only a few levels below oneself, and make an educated guess about what you think is probably the best move.

This is effectively how the business world is working now, and has been for a long time—managers making big decisions with highly filtered information.

But with the technology available to us these days, there is no reason to keep proceeding with this outdated approach. Think about how you find an answer to a question when you're searching with Google. Do you browse through a hierarchical list of categories, at each step of the way considering all of the links and choosing the one which poses the highest likelihood of containing your answer? No. That model of search has been dead for years now because people understand the power of letting the algorithms choose your results for you (based on their unbeatable knowledge of millions of available options).

So where is the Google of problem-solving in today's businesses? Where are the "page rank" algorithms for telling managers the best resources to assign a task? Imagine indexing everything related to a problem in a centralized database: products affected, people involved in the solution, who knew a lot about it (and who didn't), tools used, etc. And imagine that every time a person contributes to a solution, his information is linked to that problem, effectively creating a "portfolio" of problems that worker has under his belt. Finally, we can pull in live data like individuals' schedules and project timelines to ensure that resources are always properly allocated.

Over time, we have a map of our resources embedded in this database, and we can use "big data" algorithms to intelligently assign workers to a task based on their comparative advantage.

The final result is something like matrix management but with even fewer barriers between teams. The new style dramatically crushes information silos and makes tribal-style management all but impossible. But here it is important to be wary of a potential motivational pitfall that may result from these new highly cross-functional resources: the flattening effect that occurs when all resources become available to all possible problems may actually induce competitive issues at managerial levels. A manager is responsible for the work of his team, and thus if his people are being pulled into other teams' work, it will be difficult for the manager to find reason to support such "outside" activities.

Therefore, if such a system were implemented, it may be most effective to apply the traditional resource-pruning at a moderately high level while still leveraging the power of the algorithm's decision-making based on its in-depth knowledge of all available resources. Even still, a front end for the problem-solvers database could be provided to the lower-level employees to give them a means to get answers more quickly. In this case, employees would have the power to solve their problems efficiently, while the managers would still have the power to decide how those employees should spend their time.

Exploring Multithreaded Performance by Daniel Ehrman

Building on my last post regarding a cleaner way to use threads in C++, I decided to see what kind of performance I could expect out of these "threaded objects." My experiments ended up being more of an exploration of threads themselves than my specific implementation of them, but the results were nonetheless quite interesting.

What I found wasn't all that surprising—the overhead of swapping threads in and out of context is constantly drilled into the head of any computer engineer—but getting a first-hand look at the performance implications certainly puts that concept into perspective.

[Figure: Throughput vs. Number of Threads]

I wanted to start things off by understanding what kind of cost I incur from the "multithreadedness" alone, so I designed the threaded function to do one simple operation and then return: that way, I can mostly rule out irrelevant cycles and instead focus on the operating system's management of the threads by itself.

Note that here, I consider a "job" to be a block of work that is assigned to a thread (i.e. a workload).

As you can see in the chart above, not surprisingly, the optimum number of threads available to the program is that which aligns perfectly with the number of cores on the system. Anything less, and the cores aren't being fully utilized. Anything more, and threads are left waiting in a queue to be swapped into context. The data does imply that there is some asymptote that may approach the performance of 4 threads, but why waste resources achieving what you can do with much less memory?
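If you want a program to size its thread pool this way automatically, C++11 can report the core count at runtime. Note that std::thread is my addition here, not something the experiments used (they were built on pthreads):

```cpp
#include <cassert>
#include <thread>

// Pick a worker-pool size matching the observation above: one thread per
// core. hardware_concurrency() may return 0 when the count is unknown,
// so a caller-supplied fallback covers that case.
unsigned pick_worker_count(unsigned reported_cores, unsigned fallback = 4) {
    return reported_cores ? reported_cores : fallback;
}

// At runtime: pick_worker_count(std::thread::hardware_concurrency())
```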

Oh yeah, this is why…

[Figure: Throughput vs. Length of Job]

Since in the first experiment we were only trying to capture the overhead of multithreading, we weren't letting the threads perform any real work and show off their stuff. Here in this data though, we give the threads something to do by varying the number of "iterations" contained in each job (just a useless loop with some foobar math).

As you can see, and again, as you'd expect, the longer the job is, the more useful multithreading becomes; essentially, with long jobs, the time saved through parallelizing the "user" portion of the program overshadows the time lost through the overhead of the "system" portion (managing the threads at the kernel level).

So this data is great and all, but what do we do with it? I see a couple of key intersections and inflection points that could be useful in designing a program, but how do I make this information general enough to apply to my design?

Well, the two terms I used above—"user" and "system"—were actually used with good reason. Through my experimentation I found a very handy C function called getrusage, which provides me with a lot of really cool details about how my program is operating under the hood—including a breakdown of its CPU time between "user" space and "system" space. The idea was that if I could use this function to get solid (consistent) numbers for user time and system time, I could find a pattern that would give me some insight into how to generalize this information.
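For reference, here's roughly what that getrusage call looks like (POSIX-only; the struct fields below are standard, but timer resolution varies by OS):

```cpp
#include <sys/resource.h>
#include <cassert>

// Cumulative user and system CPU time for this process, via getrusage.
struct CpuSplit { double user_s; double sys_s; };

CpuSplit cpu_split() {
    struct rusage ru = {};
    getrusage(RUSAGE_SELF, &ru);
    return {
        ru.ru_utime.tv_sec + ru.ru_utime.tv_usec / 1e6,  // user space
        ru.ru_stime.tv_sec + ru.ru_stime.tv_usec / 1e6   // kernel (system)
    };
}
```

Sampling cpu_split() before and after a run, and subtracting, gives the user/system breakdown used in the analysis below.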

So to find this pattern I returned to the original experiment, measuring only thread-related overhead and ignoring the cost of the user code. Since the test was designed to stress thread management alone, the curve of system time per thread followed the original (total time) curve exactly (but as an inverse, since we're measuring time rather than throughput):

[Figure: System Time vs. Number of Threads]

Now we have a sense of how much time it takes to manage a thread during its lifetime—let's say about 100 microseconds to make the math easy.

So what's next? Well, now we need to go back to the second experiment—evaluating the usefulness of multithreading for varying lengths of jobs. Again, here, I can see a couple of key points:

[Figure: Throughput vs. Length of Job (with annotations)]

Point A is important because it marks the point at which the job is long enough for a threaded program to at least equal the throughput of a sequential program doing the same job. Beyond that point, multithreading is always better. Point B is our second interesting point because it marks the point at which the marginal throughput benefit of using 128 threads instead of only 4 is so small that it may not justify the additional resources required.

Now, if we take a look at the user time for programs at these two points, we should be able to get a quantifiable sense (in seconds) of how long a job needs to be in order to make use of threads.

What I found is that, for this specific kind of job, the sequential program was in user space for about 0.361 microseconds per iteration. That means that a sequential program at Point A is in user space for about 18 milliseconds, and a sequential program at Point B is in user space for about 3600 milliseconds.
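Those per-iteration and total user-time figures imply rough iteration counts at the two points. This is my own arithmetic from the numbers above, so treat the exact counts as estimates:

```cpp
#include <cassert>
#include <cmath>

// Iterations that fit in a user-time budget, given a measured per-iteration
// cost. 0.361 us/iteration is the figure measured for this type of job.
long iterations_for_budget(double budget_ms, double per_iter_us) {
    return std::lround(budget_ms * 1000.0 / per_iter_us);
}
```

That puts Point A near 18 ms / 0.361 µs ≈ 50,000 iterations per job, and Point B near 3600 ms / 0.361 µs ≈ 10 million iterations.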

But we're still lacking some generalization: how do these results scale to another processor, another operating system, or even a system busy working on other (unrelated) processes at the same time? What we need is a way to remove the units from this data—a ratio.

The table below showing the ratio of user time to system time, when compared to the data in the second experiment, demonstrates a clear positive correlation between the "user:system" ratio and the throughput of the particular configuration.

[Table: User-to-System Ratios]

From this data, we can gather that a user-to-system ratio greater than 2 is a reasonable lower threshold for estimating that parallelizing a program (of this style) with threads will provide a throughput benefit.
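As a rule of thumb, that threshold can be encoded in a one-liner (the 2.0 cutoff is this post's estimate for this style of job, not a universal constant):

```cpp
#include <cassert>

// Heuristic: parallelizing is likely to pay off only when measured user
// time is more than twice the measured system (thread-management) time.
bool worth_threading(double user_s, double sys_s) {
    return sys_s > 0.0 && (user_s / sys_s) > 2.0;
}
```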

The upper threshold is interesting as well: the data suggests that at some point between 1,000,000 and 10,000,000 iterations per job, the user:system ratios of the 4-thread program and the 128-thread program converge, perhaps in the neighborhood of 5,000,000 iterations, slightly before Point B on the graph. In fact, if we look at the graph closely, we can see that there do appear to be two opposing inflection points on the 4-thread and 128-thread curves at precisely that point.

So where does all of this information leave us? Well, it gives us a framework for evaluating the throughput potential of parallelizing a sequential program. Of course, in order to accurately determine this potential, we need to gather some rough time measurements on a case-by-case basis. But once we have that user and system data for a given operating system and machine, it should be relatively straightforward to cross-reference the resulting user:system ratio with the results shown above to correctly size the number of threads allotted to the program.

OK, so what are we leaving out here? Well…frankly, a lot. The jobs that we were analyzing were extremely straightforward arithmetic with minimal data dependencies and practically no memory footprint of any real significance. What happens when we try these methods with lots of data dependencies—say traversing a tree? Or perhaps even more importantly, what happens when the same program is run in a different environment? Will our optimizations still be optimum?

To make this a true general solution, there needs to be some underlying automated intelligence to the design. One could imagine an API that decides internally how many threads to use for the jobs, if any at all, based on what it has learned from trying a variety of solutions in recent runtime history.

Even this "automated solution picker" method has its pitfalls: for instance, non-standard jobs break the "learning" concept pretty quickly, but those may just be corner cases, especially if the designer is considering that the jobs are uniform enough to parallelize in the first place. Perhaps more important would be ensuring that the overhead to attempt different solutions and pick the best one wouldn't be large enough to impact the performance of the program.

Hopefully in the future I'll have a chance to test out some of these concepts (or research existing solutions).

Object-Oriented Threads by Daniel Ehrman

Over the weekend, I decided to throw together a C++ class to simplify the user's (programmer's) interface with pthreads. The pthreads library is great in a C environment but exposes too much complexity for my liking in an object-oriented program. What I wanted was a way to treat an object—or more specifically, an object's member function—like a thread without having to create and join the threads myself. The end result would be, say a vector of objects of a common class, all executing in parallel and returning their result without (visible) barriers. (And yes, this basically amounts to a watered-down version of the Python Thread class.)

Here's what I came up with:

template<class return_type> class Thread {
public:
  // Virtual function to be specified by the user
  virtual return_type thread_function() = 0;

  // Start the thread. Returns true on success.
  bool start_thread() {
    return !pthread_create(&internal_thread, NULL, start_thread, this);
  }

  // Get the return value from the thread.
  return_type thread_return() {
    pthread_join(internal_thread, NULL);
    return return_value;
  }

private:
  // Actual function passed to pthread_create()
  static void *start_thread(void *obj_ptr) {
    // Execute the user-specified virtual function and store its
    // return value for later use.
    ((Thread<return_type> *)obj_ptr)->return_value =
      ((Thread<return_type> *)obj_ptr)->thread_function();
    return NULL;
  }

  return_type return_value;
  pthread_t internal_thread;
};

To use the class, you inherit from it in the definition of the class that you want to make "threadable," like so:

class TestClass : public Thread<int> {
public:
  virtual int thread_function() { return some_function(); }
  int some_function() { return arg0 * arg1; }
  int arg0;
  int arg1;
};

Just as you would with Python's Thread class, here you need to specify which member function should execute on a thread: the simplest way to do this is to have the "thread_function" return the function of your choice. The "thread_function" is what the internal Thread functions will look for, so it's important to define it here. (Otherwise, you won't be threading anything.)

So in the end a sample usage of these threaded objects might look like this:

int results[NUM_THREADS];
TestClass some_objects[NUM_THREADS];

for (int i = 0; i < NUM_THREADS; i++) {
    // Initialize the object.
    some_objects[i].arg0 = i;
    some_objects[i].arg1 = i + 1;

    // Start the thread.
    if (!some_objects[i].start_thread()) {
        cout << "ERROR: Could not create thread " << i << endl;
        return 1;
    }
}

// Get the results of the threads.
for (int i = 0; i < NUM_THREADS; i++)
    results[i] = some_objects[i].thread_return();

Sure, this is a largely useless block of code, but it demonstrates the point: there's no need to think about thread IDs, mutexes, or barriers here. Just initialize the object, kick it off, and collect the results when it comes back.
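For comparison, modern C++ (C++11's std::thread, which is not what the class above uses) gets you much of the same "threaded object" convenience with lambdas. This sketch is my own, paralleling the TestClass example, where worker i computes i * (i + 1):

```cpp
#include <cassert>
#include <thread>
#include <vector>

// Run one worker thread per job; worker i computes i * (i + 1), mirroring
// the TestClass example. join() acts as the implicit barrier.
void run_jobs(int num_jobs, std::vector<int>& results) {
    results.assign(num_jobs, 0);
    std::vector<std::thread> workers;
    for (int i = 0; i < num_jobs; i++)
        workers.emplace_back([i, &results] { results[i] = i * (i + 1); });
    for (std::thread& t : workers)
        t.join();  // wait for every worker before reading results
}
```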

It's pretty simple, and you'll likely find many other solutions like this online. But things got interesting when I decided to test this out with some sample programs—more on the performance characteristics of these threaded objects in my next post…