19 September 2006 - 21:22The assembly line
A while ago I worked on a project that was laid-out as an assembly line: it had an entry point which did some work after which it sent a message out to a queue. This queue was taking the message, did some work, transformed it and send it out to another queue. And so on… At the end something was produced that was stored in a DB. It was the first time I saw this architecture and I found it interesting. It is essentially message-passing done in a sequential order, in this particular project there was no branching, no task could pass a message to multiple dependent tasks. It is obvious that the architecture could support branching, but there was no need for it in this particular project.
In this posting the term “task” will represent a task that is deployed on a node in a cluster.
I grew interested in this architecture because of the similarity it has with an assembly line. I am interested in economic concepts applied to software development and such similarities attract my attention. So, rather than talk in this post about message passing programming (you can find some good literature on the web) I’ll take a look at the effects of this architecture from a managerial point of view and I will compare this architecture to the same architecture running in a single JVM. You can set-up an interaction similar to an assembly line in one JVM by passing messages between various business logic beans. From the computer science point of view the main difference between the 2 architectures is the treatment of state. If the process that is carried out on this assembly line requires that state is kept across the whole process then a distributed assembly line is probably not the best choice. If you do go ahead with this architecture then you will have to find a way to propagate this state from task to task across a cluster.
So what would be side-effects of this architecture?
The first and most important side-effect is that the failure of a task doesn’t block the process, the rest of the items that have to be processed go on. Error reporting becomes more fine-grained, but also more difuse. You don’t have to scan a log that records every step of the process in order to get to the task’s stack trace because the node on which you have the log is dedicated to that particular task. One problem is that the error which is reported is particular to that task, you do not get the benefit of seeing the whole process dumped in a stack trace. In order to determine what went wrong you have to back-track from the current task to the previous task and in order to capture the whole information pertaining to that process instance. Trying to do this along a process spread across a cluster is not the easiest thing… In order to make this more easier you would have to propagate information considered helpful from task to task in order to report errors more meaningfully.
We would also have fine grained scalability. Once the process has been split into tasks and the communication between these tasks implemented you can scale each task independently of another. The result is that a computationally-intensive task cand be carried out over a larger part of the cluster than one that doesn’t require so many resources. The bad thing about fine grained scalability is that sometimes you need coarse-grained scalability. For example, let’s say that your organization plans to increase the usage of this process by a factor of 4, it is moving from processing 1000 items to 4000 items. You would basically have to scale the computational resources for every task 4 times and then test the new cluster. For a process with a large number of tasks it could be tedious.
Another bad side-effect is the difficulty with which you version such an architecture. If your process spreads across a whole cluster you may have to replicate the whole cluster and make sure that all the relationships between tasks are maintained in the new version. If you have a high tolerance to pain you may try to mix-and-match tasks from different versions: for example task 15 from version 4 would be related to task 16 from version 3 while waiting messaged from task 14 deployed in version 2, 4 and 5.
One good point about this architecture which is being brought up is the fact that you could reuse a task. Well, this doesn’t really apply because code-reuse is related to the specificity of a task: the more specific is a task, the less the potential for reuse.
To conclude this posting I would say that this architecture is a pretty good architecture when a process consists of a series of transformations, it decomposes the process meaningfully. Most of the side-effects are related to the length of the process, so it would help to keep the number of tasks under control. In order to better manage your process you may find that you may have to propagate additional information from task to task apart from the tasks require for carrying out their work.
If you have any other thoughts on this particular architecture or you have worked with such processes please drop a comment.
2 Comments | Tags: Development, Econo-computing, Favorites