16 April 2007 - 21:20
Yahoo Pipes - the revolution took place

I took a look at Yahoo Pipes and decided to write a small post about it. Don’t expect too much from it: as has been the case lately, I don’t have the time to take a deep look.
I like the theme that Tim O’Reilly put on it in this post:
Jon expressed a vision of web sites as data sources that could be re-used, and of a new programming paradigm that took the whole internet as its platform… Using the Pipes editor, you can fetch any data source via its RSS, Atom or other XML feed, extract the data you want, combine it with data from another source, apply various built-in filters (sort, unique (with the “ue” this time:-), count, truncate, union, join, as well as user-defined filters), and apply simple programming tools like for loops. In short, it’s a good start on the Unix shell for mashups.
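
To make the operators named in the quote concrete, here is a minimal Python sketch of union, unique, a user-defined filter, sort and truncate chained over two feeds. It is not how Pipes itself works internally: the feeds are hypothetical in-memory lists standing in for parsed RSS/Atom items, and every function name below is made up for the illustration.

```python
from itertools import islice

# Hypothetical, already-parsed feed items (a real pipe would fetch RSS/Atom).
feed_a = [
    {"title": "Item A", "link": "http://example.com/a", "date": "2007-02-10"},
    {"title": "Item B", "link": "http://example.com/b", "date": "2007-03-01"},
]
feed_b = [
    {"title": "Item B", "link": "http://example.com/b", "date": "2007-03-01"},
    {"title": "Item C", "link": "http://example.com/c", "date": "2007-04-10"},
]


def union(*feeds):
    """Concatenate several feeds into one stream of items."""
    for feed in feeds:
        yield from feed


def unique(items, key):
    """Drop items whose key value has already been seen."""
    seen = set()
    for item in items:
        if item[key] not in seen:
            seen.add(item[key])
            yield item


def keep(items, predicate):
    """User-defined filter: keep only the items matching the predicate."""
    return (item for item in items if predicate(item))


def sort_by(items, key, reverse=False):
    """Sort items by one of their fields."""
    return sorted(items, key=lambda item: item[key], reverse=reverse)


def truncate(items, count):
    """Keep at most the first `count` items."""
    return list(islice(items, count))


# The "pipe": each step consumes the output of the previous one.
merged = unique(union(feed_a, feed_b), key="link")
recent = keep(merged, lambda item: item["date"] >= "2007-03-01")
pipe_output = truncate(sort_by(recent, key="date", reverse=True), 2)
titles = [item["title"] for item in pipe_output]
print(titles)
```

The data transformation itself is a handful of lines; choosing which feeds to combine and which fields to filter on is the part the code cannot do for you.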

I concur with O’Reilly: through RSS, data is being exposed in a standard way, and this standard is used for putting together different data sources. You could argue that RSS broke down the barriers that were separating content in different sites and that Pipes is the software which lets you combine these datasources in very imaginative ways. Pretty good. Except that I don’t think it will achieve the penetration touted by most people.
The biggest obstacle to making this vision of web sites as a series of datasources come true is that Yahoo Pipes serves niche users, not the work of breaking down the barriers between the various sites/datasources. Mind you, breaking these barriers was not an easy task, and keeping them down relies heavily on unpaid labor (imagine what would happen to most of the Yahoo Pipes if people stopped pushing stories onto digg, bookmarking items on del.icio.us, blogging about Lindsay Lohan, etc…). In a sense Yahoo Pipes lets you surf people’s pastimes, but this is another story.

As I was saying, the problem is that the needs these datasources serve are very narrow. The number of times each pipe has been run indicates that the pipes created so far are not that popular (run counts in the low five digits are nothing today), which suggests that they are not easily consumed. I assume that they are not easily consumed because these pipes service very pointed needs, needs which are not shared by most people.
Looking at Yahoo Pipes as if it were a market, you end up with the impression that it is an illiquid market: you cannot map pipe consumers to pipe producers easily. The main way of mapping pipe consumers to pipe producers is to have each pipe consumer create his own pipes by using an editor (which I must confess I have not used): the pipe consumer becomes a pipe producer and the market becomes liquid. Well, this is not happening because your average Joe will not be able to construct a meaningful Pipe, partly because he doesn’t really know where to look for content or how to filter that content efficiently, and partly because he has no desire to learn how to chop up XML (even if Yahoo claims to have made it braindead-easy).

Your average Joe needs an agent that would do that for him, that would bridge the gap between a pipe consumer and a pipe producer. Well, that agent would very likely not do it for free. So how would you put together average Joe and the agent that would service his very narrow needs? In an old post I was musing about how to service niche communities by creating mashups that would use other sites (Amazon, eBay, Google Maps and now Yahoo Pipes). An example of a Yahoo Pipe that could target a niche efficiently is this Pipe that lets you find an apartment near parks, public libraries, produce markets, etc… You could take this pipe and put it up on a real estate agency’s site and provide that agency’s customers with something worthwhile. The customer could do exactly the same thing on Yahoo Pipes, but he doesn’t know how to. Well, the real estate agency could provide this on its site and provide a better experience. The real estate agency, coupled with the web-site developer that puts the pipe up on the agency’s site, is the agent that bridges the gap between the pipe consumer and the pipe producers. I see this as a way to make the Yahoo Pipes market liquid.
Another way to make this market liquid would be to create a marketplace where pipe consumers would make requests for pipes along with prices, and pipe producers would fulfill these requests and get paid, similar to Amazon’s MTurk. To be honest, I’m not sure if this is feasible, but it is worth noting.

Yahoo Pipes is an interesting service. The web has been turned into a huge database where anyone with a browser can find what they need. The revolution happened. Nobody cares.

Later Edit: In a certain way a Yahoo Pipe is a piece of code, a piece of intellectual property if you wish. Yahoo Pipes forces pipe producers to give their IP away for free (as far as I can see). I wonder if this is not a barrier to adoption as well. If I make a good pipe I’d like to see some benefit coming out of it.
Another point concerning pipe creation is the large number of websites/datasources. Putting together an efficient pipe requires the pipe creator to know the data in these datasources pretty well: the barrier to pipe creation is not just chopping up data coming from different sources, but also picking the appropriate sources for the data. This is a far greater barrier to adoption as far as I can see, because it shifts the effort of pipe creation from data transformation (trivial) to knowledge about data (not so trivial). My $.02

No Comments | Tags: Econo-computing, Miscellaneous

8 January 2007 - 23:37
New category

I created a new category “Econo-computing” which addresses issues in the computing field seen from an economics point of view. I am not sure what will go in this category, but I find this field interesting and I will continue to explore it.

No Comments | Tags: Econo-computing

8 January 2007 - 23:36
Supply and demand in a computing environment

I was talking in a previous post about how the demand for interceptions in an application was not met properly by the supply until AOP came about and created a scalable process that could handle interceptions efficiently. I am thinking that you could generalize this case and state that a mismatch between the demand for a behavior or feature and the supply of that behavior or feature betrays an inefficiency in the computing environment (language + IDEs + frameworks + containers + etc…) that tried to address the issue. In the example above, implementing interceptions following strict OOP concepts was certainly possible, but it was impractical.
In a market environment, high demand that cannot be met by supply usually translates into a high price. In the above example the high price would have been the cost of developing and, even more, maintaining the system in which interceptions were implemented in pure OOP fashion. In the example of OOP done in C the cost would have been the cost of coordinating a team of programmers and the cost of turning a programmer into a human compiler.

Mismatches between demand and supply in a computing environment sometimes appear as the inability to scale. In the interceptions example, the inability to scale resulted in a mushrooming of sub-classes which would basically re-implement the original methods with some interception logic around them. In the example of OOP done in C, the inability to scale was the inability to have a large team of developers follow coding rules that would have resulted in OOP-style code. These two examples could not scale or, in other words, as the demand (for interceptions) was increasing, the supply (the OOP environment in which these interceptions were coded) could not keep up. Inability to scale a behavior is basically a facet of a mismatch between demand and supply for a particular feature.
It would be interesting to see how you could address these mismatches. Looking at how the interception and C++ problems were addressed, it would appear that doing so requires a large investment of some kind. This investment, which usually takes the form of a framework that addresses the problem, is strategic to a certain extent, since it will basically keep your costs down for a certain period of time.

P.S. I know that I am stretching some concepts over here, but quite frankly I don’t care. I’ll be continuing to explore this field as I find it pretty interesting.
P.P.S. I also wrote this post in a hurry; I may come back to it. I gotta go right now.

No Comments | Tags: Econo-computing, Favorites, Management

8 January 2007 - 16:20
Interceptors and scalability

I was thinking about the roots of AOP, about what makes it different from other programming paradigms, about what defines it, and it came down to interceptions. AOP is pretty much about intercepting method calls and doing something before or after. This interception mechanism grew into a full-fledged language in the case of AspectJ or into a full-fledged binding mechanism in the case of Spring AOP.
Intercepting method calls was doable in plain OOP but it was pretty hard to do. You could, for example, extend a class and override a method with a method that would intercept the original call, do something around it and then call the original method. You could do this, but it could not scale. You could not do this for dozens of methods and you could not turn an interceptor on or off as your application needed. In comes AOP, which takes this interception mechanism and builds a whole environment around it so that you can apply interceptions as you need. Both AOP environments I mentioned (AspectJ and Spring AOP) scale very well, so that they can match the demand for interceptions with the supply, either through a programming language, as in the case of AspectJ, or declaratively, as in the case of Spring AOP.
So I would say that AOP is the main way to implement interception-based programming in an OOP environment.
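
The contrast described above can be sketched in a few lines. This is a Python illustration under stated assumptions, not AspectJ or Spring AOP: `AccountService`, `InventoryService`, `Interceptor` and `weave` are all hypothetical names, and `weave` is only a crude stand-in for what a real AOP environment does with pointcuts.

```python
import functools


class AccountService:
    def __init__(self):
        self.balance = 0

    def deposit(self, amount):
        self.balance += amount
        return self.balance


# Plain OOP: one hand-written subclass per class, one override per method.
class LoggedAccountService(AccountService):
    def __init__(self, log):
        super().__init__()
        self.log = log

    def deposit(self, amount):
        self.log.append("before deposit")
        result = super().deposit(amount)
        self.log.append("after deposit")
        return result


# Interception-based: the advice is written once and can be switched off.
class Interceptor:
    def __init__(self, log):
        self.log = log
        self.enabled = True

    def around(self, func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if not self.enabled:
                return func(*args, **kwargs)
            self.log.append(f"before {func.__name__}")
            result = func(*args, **kwargs)
            self.log.append(f"after {func.__name__}")
            return result
        return wrapper


def weave(cls, interceptor, method_names):
    """Bind the interceptor to the named methods of cls."""
    for name in method_names:
        setattr(cls, name, interceptor.around(getattr(cls, name)))


class InventoryService:
    def __init__(self):
        self.items = []

    def add(self, item):
        self.items.append(item)

    def remove(self, item):
        self.items.remove(item)


subclass_log = []
LoggedAccountService(subclass_log).deposit(10)

aop_log = []
tracer = Interceptor(aop_log)
weave(InventoryService, tracer, ["add", "remove"])
inventory = InventoryService()
inventory.add("widget")       # intercepted
tracer.enabled = False        # interception turned off at run time
inventory.remove("widget")    # runs without the advice
```

The subclass version needs new code for every method you want to intercept; the woven version only needs another name in the list passed to `weave`.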

P.S. I wrote this post in a hurry, and unfortunately I could not expand on some aspects ;-) of this (the supply and demand of interceptions was something I wanted to explore in more detail). I will probably revisit this post later.

1 Comment | Tags: Development, Econo-computing

18 October 2006 - 19:52
Economics applied to software design

I have picked up economics as a hobby. It is a very interesting field and I think it pertains significantly to software development since it deals primarily with the efficient allocation of scarce resources.
Anyway, I started looking at various applications through an economist’s glasses. Let’s take a look at various P2P applications. Peers on a P2P network have a weird economic status: they are both consumers and producers (distributors?) of the same good, namely content. A peer downloads content from another peer and then publishes the same content on the P2P network. Well, this is pretty interesting, but what I found more interesting was the differentiation between various P2P networks. BitTorrent emerged as the dominant P2P network primarily because its P2P client distributes the content (the product in our example) as soon as it gets it. The rest of the P2P networks do not do this: they wait for the whole file to be downloaded before making it available to the rest of the network. When you treat a BitTorrent peer as a producer/distributor you see that it makes its product (the content shared by the P2P network) available much sooner than a peer on a competitor’s network. You can say that a BitTorrent peer has a much shorter time-to-market than a peer on a competing P2P network. And we all know that companies with a reduced time-to-market usually beat the competition ;-). This would explain both BitTorrent’s incredible success and the rush by the competing P2P networks to copy its economic model (namely having the peer share content as it receives it rather than at the end of the file reception, which translates into a reduced time-to-market).
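
The time-to-market argument can be put into numbers with a toy model. This is a deliberately simplified sketch and not a description of any real P2P protocol: it assumes a file split into equal pieces, a single chain of peers, and one time unit to move one piece across one hop. Both function names are made up for the illustration.

```python
def whole_file_finish_time(pieces, hops):
    """Each peer starts forwarding only once it holds the complete file."""
    return pieces * hops


def piecewise_finish_time(pieces, hops):
    """Each peer forwards a piece as soon as it has it and its link is free."""
    # arrival[p] is the time at which piece p is available at the current peer;
    # the original seed holds every piece at time 0.
    arrival = [0] * pieces
    for _ in range(hops):
        next_arrival = []
        link_free_at = 0
        for ready in arrival:
            start = max(ready, link_free_at)  # wait for the piece and the link
            link_free_at = start + 1          # one time unit per piece per hop
            next_arrival.append(link_free_at)
        arrival = next_arrival
    return arrival[-1]  # time at which the last peer holds the last piece


pieces, hops = 100, 3
print(whole_file_finish_time(pieces, hops))  # the last peer waits for every hop in turn
print(piecewise_finish_time(pieces, hops))   # the hops overlap in time
```

Under these assumptions the piece-wise chain finishes in `pieces + hops - 1` time units instead of `pieces * hops`, and the first peer starts "selling" after one time unit rather than after the whole file has arrived.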

Another software design aspect which I tried to treat from an economist’s perspective is the managed environment (EJB containers, IoC containers, AOP-enabled environments, etc…). In the managed environment the developer focuses exclusively on the business logic that has to be implemented, while outside actors transparently manage issues such as concurrency, security, transactions, etc… Such a managed environment resembles a vertically integrated organization which outsources its tasks (out-tasking is a pretty popular word these days, did you know that ;-)?) to a specialized work-force. Such an environment adapts itself to changing conditions more easily because it is the sum of some loosely coupled entities rather than a rigid entity whose elements are tightly coupled to one another. The modules that make up an application running in a managed environment would be very similar to the resources making up a multi-national: the modules/resources can be loaded/hired or let go according to the application/multi-national’s needs as these needs come and go. The result is a leaner application that responds to changes in the business environment more easily (again similar to a multi-national). The application would be a composition of various modules carried out by different actors, just like a multi-national is increasingly a composition of various tasks carried out by different actors. I’m not sure if I am seriously going off-road…

Anyway, I find mixing software design and economic theory fascinating and I will try to continue to treat various software design issues from an economic perspective. Comments welcome!!!

No Comments | Tags: Development, Econo-computing, Favorites

19 September 2006 - 21:22
The assembly line

A while ago I worked on a project that was laid out as an assembly line: it had an entry point which did some work, after which it sent a message out to a queue. A task took the message off this queue, did some work, transformed the message and sent it out to another queue. And so on… At the end something was produced that was stored in a DB. It was the first time I saw this architecture and I found it interesting. It is essentially message-passing done in a sequential order. In this particular project there was no branching: no task could pass a message to multiple dependent tasks. It is obvious that the architecture could support branching, but there was no need for it in this particular project.
In this posting the term “task” will represent a task that is deployed on a node in a cluster.
I grew interested in this architecture because of the similarity it has with an assembly line. I am interested in economic concepts applied to software development and such similarities attract my attention. So, rather than talk in this post about message-passing programming (you can find some good literature on the web), I’ll take a look at the effects of this architecture from a managerial point of view and I will compare it to the same architecture running in a single JVM. You can set up an interaction similar to an assembly line in one JVM by passing messages between various business logic beans. From the computer science point of view the main difference between the two architectures is the treatment of state. If the process that is carried out on this assembly line requires that state is kept across the whole process, then a distributed assembly line is probably not the best choice. If you do go ahead with this architecture then you will have to find a way to propagate this state from task to task across a cluster.
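
For readers who have not seen this layout, here is a minimal single-process sketch of such an assembly line. It uses Python threads and in-memory queues rather than a cluster or a JVM, so it only illustrates the shape of the architecture; `run_task`, `assembly_line` and the three stages are hypothetical, and the final list stands in for the DB.

```python
import queue
import threading

STOP = object()  # sentinel telling a task to shut down


def run_task(work, inbox, outbox):
    """Take messages off inbox, transform them, pass the result on to outbox."""
    while True:
        message = inbox.get()
        if message is STOP:
            outbox.put(STOP)  # propagate the shutdown down the line
            return
        outbox.put(work(message))


def assembly_line(stages, items):
    """Run items through the stages, one worker thread and one queue per stage."""
    queues = [queue.Queue() for _ in range(len(stages) + 1)]
    workers = [
        threading.Thread(target=run_task, args=(work, queues[i], queues[i + 1]))
        for i, work in enumerate(stages)
    ]
    for worker in workers:
        worker.start()
    for item in items:
        queues[0].put(item)  # the entry point of the line
    queues[0].put(STOP)
    for worker in workers:
        worker.join()
    results = []  # stands in for the DB at the end of the line
    while True:
        message = queues[-1].get()
        if message is STOP:
            return results
        results.append(message)


stages = [str.strip, str.upper, lambda name: {"name": name}]
results = assembly_line(stages, ["  alice ", "bob  "])
print(results)
```

There is no branching here either: every task has exactly one inbox and one outbox, and no state is shared between tasks other than the message itself.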

So what would be the side-effects of this architecture?
The first and most important side-effect is that the failure of a task doesn’t block the process: the rest of the items that have to be processed go on. Error reporting becomes more fine-grained, but also more diffuse. You don’t have to scan a log that records every step of the process in order to get to the task’s stack trace, because the node on which you have the log is dedicated to that particular task. One problem is that the error which is reported is particular to that task: you do not get the benefit of seeing the whole process dumped in a stack trace. In order to determine what went wrong you have to back-track from the current task to the previous tasks in order to capture all the information pertaining to that process instance. Trying to do this along a process spread across a cluster is not the easiest thing… To make this easier you would have to propagate information considered helpful from task to task so that errors can be reported more meaningfully.
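
One way to propagate that helpful information is to wrap each item in an envelope that records which tasks it has already passed through. The sketch below is an assumption about how this could look, not what the project actually did; it runs sequentially in one process, and `Envelope`, `run_line` and the three tasks are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Envelope:
    item_id: int
    payload: object
    history: list = field(default_factory=list)  # tasks already completed


def run_line(tasks, items):
    """Run each item through every task; a failure sidelines only that item."""
    done, failed = [], []
    for item_id, payload in enumerate(items):
        envelope = Envelope(item_id, payload)
        try:
            for name, work in tasks:
                envelope.payload = work(envelope.payload)
                envelope.history.append(name)
        except Exception as error:
            # The report carries the context a per-task stack trace lacks.
            failed.append({
                "item_id": envelope.item_id,
                "failed_task": name,
                "completed": list(envelope.history),
                "error": repr(error),
            })
        else:
            done.append(envelope)
    return done, failed


tasks = [
    ("parse", int),
    ("invert", lambda number: 1 / number),
    ("format", lambda value: f"{value:.2f}"),
]
done, failed = run_line(tasks, ["4", "0", "x", "5"])
for report in failed:
    print(report)
```

The two bad items end up in `failed` together with the list of tasks they had completed, while the other two items reach the end of the line untouched.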
We would also have fine-grained scalability. Once the process has been split into tasks and the communication between these tasks implemented, you can scale each task independently of the others. The result is that a computationally intensive task can be carried out over a larger part of the cluster than one that doesn’t require so many resources. The bad thing about fine-grained scalability is that sometimes you need coarse-grained scalability. For example, let’s say that your organization plans to increase the usage of this process by a factor of 4: it is moving from processing 1000 items to 4000 items. You would basically have to scale the computational resources for every task 4 times and then test the new cluster. For a process with a large number of tasks this could be tedious.
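
A back-of-the-envelope version of this capacity planning is sketched below. The task names and per-item costs are invented, the 1000 and 4000 items are assumed to be per hour, and workers are assumed to scale linearly, so treat the numbers as an illustration only.

```python
MS_PER_HOUR = 3_600_000

# Hypothetical processing cost of each task, in milliseconds per item.
cost_ms = {"parse": 500, "enrich": 9000, "store": 1800}


def workers_needed(items_per_hour, ms_per_item):
    """Smallest number of workers that keeps one task up with the load."""
    busy_ms = items_per_hour * ms_per_item
    return max(1, -(-busy_ms // MS_PER_HOUR))  # integer ceiling division


def plan(items_per_hour):
    """Work out the worker count for every task on the line."""
    return {
        task: workers_needed(items_per_hour, ms)
        for task, ms in cost_ms.items()
    }


before = plan(1000)
after = plan(4000)
for task in cost_ms:
    print(task, before[task], "->", after[task])
```

Even in this tiny example every task has to be re-planned and re-tested on its own, which is the tedium described above once the line has dozens of tasks.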

Another bad side-effect is the difficulty with which you version such an architecture. If your process spreads across a whole cluster you may have to replicate the whole cluster and make sure that all the relationships between tasks are maintained in the new version. If you have a high tolerance for pain you may try to mix-and-match tasks from different versions: for example task 15 from version 4 would be related to task 16 from version 3 while waiting for messages from task 14 deployed in versions 2, 4 and 5.

One good point about this architecture which is often brought up is the fact that you could reuse a task. Well, this doesn’t really apply, because code reuse is related to the specificity of a task: the more specific a task is, the lower its potential for reuse.

To conclude this posting I would say that this is a pretty good architecture when a process consists of a series of transformations: it decomposes the process meaningfully. Most of the side-effects are related to the length of the process, so it would help to keep the number of tasks under control. In order to better manage your process you may find that you have to propagate additional information from task to task, beyond what the tasks require for carrying out their work.
If you have any other thoughts on this particular architecture or you have worked with such processes please drop a comment.

2 Comments | Tags: Development, Econo-computing, Favorites

17 July 2006 - 13:45
AOP and the division of labor

I am looking at AOP as an example of applying the theory of Division of Labor to development. With AOP it is possible to have different types of professionals working on the same project at the same time.
In theory a security expert would secure the application or, even better, a security expert in one domain would secure one part of the application while a security expert in another domain would secure another part. A transaction expert would trace transaction boundaries within the application, a concurrency engineer would serialize calls to a state repository in order to implement concurrent access, etc… I say this is possible in theory because managing this interaction takes some good managerial skills and processes, processes with which I am not familiar for the time being.
The division of labor is a process that scales very well in diversity and ultimately in complexity. Various tasks are assigned to people trained specifically for them and are executed in an efficient manner. At the end of it the process has managed to put together a product which is quite complex and has received input from people of various backgrounds. This could very well be the solution to managing complex interactions when developing software: dividing them into tasks which can be performed individually by experts and assembling them at the end. When you have a coffee you don’t have to worry about growing a coffee tree, harvesting the beans and roasting them, because someone else did that for you. In the same way, when you are writing an application you should not have to concern yourself with securing it or implementing failover through exception handling, because someone else will do this as well.
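
As a rough illustration of this division of labor, the sketch below keeps the business logic, the security rule and the transaction handling in three separate pieces that could each be written by a different person. It uses plain Python decorators as a stand-in for aspects; `secured`, `transactional`, `transfer` and the journal are all hypothetical.

```python
import functools


def secured(required_role):
    """Written by the security expert: reject callers lacking the role."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(user, *args, **kwargs):
            if required_role not in user["roles"]:
                raise PermissionError(f"{required_role} role required")
            return func(user, *args, **kwargs)
        return wrapper
    return decorator


def transactional(journal):
    """Written by the transaction expert: mark the transaction boundaries."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            journal.append("begin")
            try:
                result = func(*args, **kwargs)
            except Exception:
                journal.append("rollback")
                raise
            journal.append("commit")
            return result
        return wrapper
    return decorator


journal = []


# Written by the application developer: business logic only.
@secured("teller")
@transactional(journal)
def transfer(user, accounts, source, target, amount):
    if accounts[source] < amount:
        raise ValueError("insufficient funds")
    accounts[source] -= amount
    accounts[target] += amount


accounts = {"a": 100, "b": 0}
teller = {"roles": ["teller"]}
guest = {"roles": []}
transfer(teller, accounts, "a", "b", 30)
```

The `transfer` function never mentions roles or transactions, yet a guest cannot call it and every call is bracketed by begin and commit or rollback entries in the journal.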
The division of labor pulled the human race out of the Middle Ages and into the industrial age. Hopefully AOP will be able to pull large, complex projects out of the quagmire of pointless meetings, weekly merges, versions, etc… and into functional, deployed applications.

P.S. I wrote this post on the most horrible coffee I had in a long time. If it had an effect on the post itself I am sorry.
P.P.S. Separation of concerns resembles the division of labor as well.

No Comments | Tags: AOP, Development, Econo-computing, Favorites