The Accidental History of Hadoop

Creative Commons: Attribution by Flickr User Efecto; Negativo

There are two very different types of collaboration: intentional and accidental.  Intentional collaboration is carried out by a defined team with a shared purpose.  Interactions are marked by introductions, updates, “take a look at this”, “please review…” etc.  Most “collaboration” technology fits this category.  It is boring, line-cook, kitchen-model collaboration.  Read the recipe. Gather the ingredients. Cook, plate and serve.  There is efficient repetition but little or no innovation.

So where do new recipes come from?

The answer is accidental collaboration.  Accidental collaboration is time and context shifted.  It subverts or ignores original intent (of authors, findings, content or audience).  It finds new uses and applications for old information.  It is disruptive, innovative and amazing.  Examples in technology include re-blogging (Tumblr, Pinterest), reporting, content curation, re-use, re-purposing, re-search.  When information is available and accessible, new insights can occur.  This is because each new re-combination of content allows different features to emerge.  A collection of events in a city becomes a holiday schedule.  A collection of medical journal articles reveals a new drug delivery pathway.

A thread of an idea that started in 1676 with the mathematician Leibniz can be traced through history to David Hilbert (1928), Alonzo Church (1936), John McCarthy (1958), Dean & Ghemawat (2004), and finally Doug Cutting (2006), who stands on the shoulders of these giants to create Hadoop.  Hadoop is at the center of the “Big Data” buzz.  Big data is all about deriving insight from huge amounts of disparate data.  It is accidental collaboration.

The original intent of the data is largely irrelevant.  It’s the data, and the availability of that data, that is important.  Leibniz wanted to create a language that could prove or disprove any proposition.  Hilbert came along to challenge that idea.  Church created lambda calculus to prove that Hilbert’s challenge was actually unsolvable.  McCarthy used lambda calculus to create LISP.  Dean and Ghemawat used LISP programming ideas to create MapReduce.  Cutting read their research and combined MapReduce with Lucene to create Hadoop.

Although McCarthy never worked on a project team with Church, the content Church created for lambda calculus was indispensable in helping McCarthy create LISP.  Similarly, the ideas behind LISP directly influenced Dean and Ghemawat at Google to create the map & reduce capabilities that allow massive distributed problem solving.  From that inspiration, a lot of hard work, and some help from Yahoo!, Hadoop was born.
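To make the map and reduce idea concrete, here is a minimal single-process sketch in Python of the classic word count.  It is not Hadoop's or Google's implementation, and the function names (map_phase, reduce_phase, word_count) are made up for illustration.  It only shows the shape of the approach: a map step that handles each document on its own, and a reduce step that merges the partial results.

```python
from collections import Counter
from functools import reduce


def map_phase(document):
    """Map step: turn one document into its own word counts."""
    return Counter(document.lower().split())


def reduce_phase(left, right):
    """Reduce step: merge two partial counts into one."""
    return left + right


def word_count(documents):
    # Each document is mapped independently of the others, which is the
    # property that lets a real cluster spread this step across machines.
    partial_counts = map(map_phase, documents)
    return reduce(reduce_phase, partial_counts, Counter())


documents = ["the quick brown fox", "the lazy dog", "the fox"]
totals = word_count(documents)
print(totals.most_common(2))
```

Because the map step never looks at more than one document, the documents could sit on different machines.  Only the much smaller partial counts need to be brought together in the reduce step.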

The men involved, the content and the approach were all from different eras, but they came together to create something special, innovative and impactful.  If that information had not been available or accessible, Hadoop (and all the applications that rely upon it) would never have happened.

Accidental collaboration throughout history has been incredibly slow.  Modern information management technology like Hadoop or active content archives can speed it up and deliver amazing insight to us in incredibly short periods of time.

This post originally appeared on July 16

5 Must Have Steps To Combine Big Data and Collaboration

Creative Commons Attribution by Flickr user crsan

This is not a fluffy-puff collaboration article.  You won’t read about team unity, the importance of open and honest dialogue or flattened org charts.  This is about big data.  This is about the 5 steps you must take if you hope to tap the power of big data and use it to drive fundamental improvements in collaboration.  This is not about solving problems.  This is about dissolving problems.

So here is what you have got to do:

1)      Improve Your Collection – Big Data, Big Content

There is so much information that chances are what we want is out there, if only something could ensure it was captured and then bubble it up to the surface.  Traditional content management, records management, knowledge management and collaboration systems all rely on severe user disruption.  Even the sync and share systems that are popular now (e.g. Dropbox, YouSendIt) require disruption.  All of these require end users to stop being brilliant, stop creating, stop working and check something in.  Some have big forms to fill out, some have drag-and-drop, but they all start by stopping you from doing what you do best.

Big data analytics systems do it better by scraping and crawling data that has been identified by some kind of integration, ETL process or migration tool.  This is better than disruptive check-ins but it still creates a lag in the data.  Even where collection is real-time, Big Data requires analytics to turn it into consumable information.

On the Big Content front, over a decade of ECM experience should have taught us by now that adoption suffers and collections aren’t as rich as they might be because of check-in disruption.

The best solution is one that frictionlessly collects content and data as it is created or updated.  If you’re focused on collaboration, either across time through easily accessible historical collections or immediately for teams that span departments, then ensuring you have everything is vital.  Why base your collaboration strategies on the hope that owners will be nice, stop what they’re doing and check in their content (and do it correctly)?  Instead, automatically ingest content with no requirement on employees.  Automatic backup and recovery software has been doing this for years for servers and email.  Software like Digitiliti does it for business content including email.  The difference is that, instead of sitting in an inaccessible archive somewhere, all that information is immediately available.

2)      Improve Your Aggregation – Classification, Grouping, Analytics

Look, we spend too much time helping computers understand what we mean: metadata, tagging, summarizations, search within, drill-down, SEO, keywords.  It’s really incredible when you consider that an entire industry has grown up around helping computers understand what we have already created.  This is all done so that other people who are removed from us by time or distance can find and enjoy what we’ve created.  How about we change that paradigm?

Here’s how:

  • Encourage voluntary participation with entertainment, ease & rewards. (Hint: the industry buzzword for this is “gamification”.)  Sites like Pinterest and Tumblr prove that user-generated collections are incredibly valuable sources of classification and intelligence.
  • Use some of that Big Data power to track and aggregate what people are doing in the course of their daily work.  Searches, application opening behavior, website referrers all become useful aggregation points that can help spur collaboration.  See a trend in searches? Promote that content before a search is executed.  Lots of traffic being generated by employees from the same web site referrers? Partner with or advertise on that site.  Or grab a feed of it and put it on your intranet.  It will make your intranet suck less.
  • See which content or data aggregations have staying power.  These become curated collections.  Promote those to your user base.  Collections spur additional participation.  It is always easier to comment on something that has already started than to create something totally new.
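As a rough illustration of the “see a trend in searches?” idea above, the sketch below ranks search queries by how many different people ran them.  The log format (a list of user and query pairs), the function name trending_searches and the thresholds are all assumptions made for this example, not part of any particular product.

```python
from collections import Counter


def trending_searches(search_log, min_users=2, top_n=3):
    """Return the queries searched by the most distinct users.

    search_log is a list of (user, query) pairs; this format is an
    assumption made for the sketch.
    """
    # Count each query once per user so one busy user cannot create a trend.
    unique_pairs = {(user, query.strip().lower()) for user, query in search_log}
    users_per_query = Counter(query for _, query in unique_pairs)
    # Most users first; ties are broken alphabetically for stable output.
    ranked = sorted(users_per_query.items(), key=lambda item: (-item[1], item[0]))
    return [(query, count) for query, count in ranked if count >= min_users][:top_n]


search_log = [
    ("ana", "expense policy"),
    ("ben", "Expense Policy"),
    ("ana", "expense policy"),
    ("cho", "vpn setup"),
    ("ben", "vpn setup"),
    ("dee", "expense policy"),
    ("dee", "holiday schedule"),
]
trends = trending_searches(search_log)
print(trends)
```

Whatever rises to the top of a list like this is a candidate for promotion on the intranet before the next person has to search for it.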

3)      Tracking – usage and creation patterns are more kinds of metadata. 

This is a bit tricky.  Europe has new “do not track” laws which impact cookie use.  The USA is considering similar legislation, while the W3C is working on a standard for “do not track / do not follow”.  However, that still leaves intranets, extranets and opt-in sites wide open.  If you’re looking to spur collaboration within your organization, then you absolutely must understand what is leading employees to information and what is drawing them away.  Here are some ideas:

  • Track when people use certain information and what they do while using it.  What applications are open? What was their search/navigation path to that content?
  • Take a cue from the Web Gateway, Compliance and Data Loss Prevention industries.  Most of those systems look at and log what is going on.  How about using that data for intelligence?


4)      Prediction – Business Intelligence from Big Data and Big Content. 

Once you have all that Big Data and are using those collections to feed analytics and intelligence engines, one of the results is the ability to spot trends.  Instead of stopping at understanding why certain things happened, start predicting what will happen.  This allows business to move from things like issue resolution to issue interception.  Issue interception means solving issues before they explode into PR nightmares.  Here are some other ideas:

  • Anticipate the files, web resources, documentation articles, URLs people will want, even before they know it.
  • Identify project teams that fit interest and expertise and invite employees to act in an advisory capacity for the project.  This spurs cross-departmental knowledge sharing and breaks down silos of structure and information.

Ecommerce and online dating sites have been doing this for years.  It’s not magical technology.  But business has been doing little to match content to anticipated employee need.  It’s time this changed.
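One simple way ecommerce-style recommendations are often built is co-occurrence: people who looked at this also looked at that.  The sketch below applies the same idea to employee content.  The view history, the document names and the functions build_co_views and recommend are all hypothetical, and a production system would need far more data and care than this.

```python
from collections import Counter, defaultdict
from itertools import permutations


def build_co_views(view_history):
    """Count how often two documents were viewed by the same employee."""
    co_views = defaultdict(Counter)
    for documents in view_history.values():
        for first, second in permutations(set(documents), 2):
            co_views[first][second] += 1
    return co_views


def recommend(employee, view_history, top_n=2):
    """Suggest documents the employee has not seen yet."""
    seen = set(view_history.get(employee, []))
    co_views = build_co_views(view_history)
    scores = Counter()
    for document in seen:
        for other, count in co_views[document].items():
            if other not in seen:
                scores[other] += count
    # Highest score first; ties are broken alphabetically for stable output.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [document for document, _ in ranked[:top_n]]


view_history = {
    "ana": ["safety-manual", "maintenance-log", "shift-roster"],
    "ben": ["safety-manual", "maintenance-log"],
    "cho": ["safety-manual", "shift-roster", "parts-catalog"],
    "dee": ["safety-manual"],
}
suggestions = recommend("dee", view_history)
print(suggestions)
```

Here dee has only opened the safety manual, so the sketch suggests the documents that colleagues who read the same manual went on to open most often.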

5)      Delivery – Get Usable Content into the Hands of People to Whom it Makes a Difference. 

The best insight and intelligence is pointless if it’s not available.  So get it into the hands of those who can use it.  This means delivering it to them where they are.  Where are most of us these days?  On our smartphones and tablets.  So if you’re wondering if you need a mobile collaboration strategy, the answer is Yes!  Make sure that all that predictive intelligence, those invitations to participate, that recommended content is available not just on your corporate portal but in the places where it is most relevant.

  • Got a shop floor?  Make sure the information most commonly needed there is available there.  Do you still have paper maintenance logs or 3-ring binder policies and procedures stuffed away in a back office somewhere? Make them available via mobile phone or tablet so employees can fill out the log or look up the procedure where the work is being done.  Not sure how to do it?  Start easy.  Use a QR code that links a phone browser to an internal web page or form.
  • Provide access at the point of transaction.  This makes the decision to engage easier for users.  Invitations to participate in a survey or team project, or to download a file, become disruptive and annoying when the user gets the invite and then must go back to a desk, open up an application, view, decide, click yes/no/start or download and then start collaborating.  Why make it so hard?  Why make people wait to decide to collaborate?  Keep it in the moment and keep it agile.

These 5 steps make collaborating with people AND with information so much easier.  The promise of collaboration is an accessibility of ideas and information.  When that happens, we start to realize that innovation and insight come not just from the hand-me-downs of original programmers, founders and CEOs but from the collective insight of everyone involved, over time.  The big trick is making sure the information stays available.  With these 5 steps, you’ll see that collaboration with content and not just people yields information and insight that come from very different sources, hidden in our org charts, buried by time, by project team, by anonymity.

This article originally appeared on July 12