Identity in the Blogosphere

The blogosphere is exploding with activity. Driven by empowered users with easy-to-use tools, blogging is transforming the publishing industry. CNN had an interesting article on the number of blogs in China.

BEIJING, China (Reuters) — The number of blog sites in China reached 34 million in August, a 30-fold increase from four years ago, state media said on Tuesday, despite a series of curbs on media and dissent.

China has more than 17 million people writing blogs (short for Web logs) and more than 75 million people reading them, Xinhua news agency said.

David Sifry of Technorati, in his latest State of the Blogosphere post, says that they are tracking 50 million blogs and that the blogosphere is doubling in size roughly every six months.

  • Technorati is now tracking over 50 Million Blogs.
  • The Blogosphere is over 100 times bigger than it was just 3 years ago.
  • Today, the blogosphere is doubling in size every 200 days, or about once every 6 and a half months.
  • From January 2004 until July 2006, the number of blogs that Technorati tracks has continued to double every 5-7 months.
  • About 175,000 new weblogs were created each day, which means that on average, there are more than 2 blogs created each second of each day.
  • About 8% of new blogs get past Technorati’s filters, even if it is only for a few hours or days.
  • About 70% of the pings Technorati receives are from known spam sources, but we drop them before we have to send out a spider to go and index the splog.
  • Total posting volume of the blogosphere continues to rise, showing about 1.6 Million postings per day, or about 18.6 posts per second.
  • This is about double the volume of about a year ago.
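The figures quoted above can be cross-checked with a little arithmetic. The following is a minimal sketch, assuming steady exponential growth and a three-year span of 3 × 365 days; the function names are my own and are not from Technorati.

```python
import math

SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds


def per_second(per_day: float) -> float:
    """Convert a daily count into an average per-second rate."""
    return per_day / SECONDS_PER_DAY


def implied_doubling_days(growth_factor: float, period_days: float) -> float:
    """Doubling time that produces `growth_factor` over `period_days`,
    assuming steady exponential growth (an assumption, not a quoted fact)."""
    return period_days * math.log(2) / math.log(growth_factor)


def growth_over(period_days: float, doubling_days: float) -> float:
    """Total growth factor over `period_days` at a fixed doubling time."""
    return 2 ** (period_days / doubling_days)


THREE_YEARS = 3 * 365  # days, ignoring leap years

new_blogs_per_second = per_second(175_000)      # quoted: 175,000 new blogs/day
posts_per_second = per_second(1_600_000)        # quoted: 1.6 million posts/day
doubling_for_100x = implied_doubling_days(100, THREE_YEARS)
growth_at_200_days = growth_over(THREE_YEARS, 200)

print(f"new blogs per second:            {new_blogs_per_second:.2f}")
print(f"posts per second:                {posts_per_second:.2f}")
print(f"doubling time for 100x in 3 yrs: {doubling_for_100x:.0f} days")
print(f"3-year growth at 200-day pace:   {growth_at_200_days:.1f}x")
```

Under these assumptions, 175,000 blogs a day works out to just over 2 per second, and 1.6 million posts a day to about 18.5 per second, so the quoted 18.6 suggests the daily figure was rounded down slightly. A 100-fold increase in three years implies a doubling time of about 165 days, inside the quoted 5-7 month range, while a steady 200-day doubling would give only about 44-fold growth over the same period. In other words, the current pace is somewhat slower than the three-year average.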

In the article David mentions that only 12% of the posts are in Chinese, compared to 41% in English. This suggests that Technorati is under-counting blogs in languages other than English. It is also likely under-counting the blogs at community sites such as MySpace. I believe a more accurate picture is presented by the Blog Herald survey from February (it’s a bit dated but still very instructive…I hope they come up with another survey soon):

The good news: the blogosphere continues to boom. This month I estimate there to be 200 million blogs in existence.

The sums by country add up to approx 154 million blogs and by host 185 million, but this doesn’t take into account a pile of places + smaller hosts + self hosted blogs. Hence I’m calling the figure 200 million blogs.

Broken down by the hosts the data looks as follows (from Marketingfacts):

Broken down by country, the data looks as follows:

Overall, my guess is that the number of blogs is now well above 250 million, based on a review of the Blog Herald sources to get an updated count. Of these, about 30% are spam blogs; some estimates are higher, but taking into account the estimates for the hosting sites, this seems like a reasonable number. Another 35-40% are inactive blogs, in which users stopped posting after creating the blog (refer to David Sifry’s post for inactive blog rates; the rates are probably lower for community blogs because of their ease of use). This leaves us with about 80 million active blogs. Taking into account the world Internet usage statistics, this means that about 4-8% of the people on-line are blogging, assuming some users have multiple blogs. To get a better understanding of who is creating these blogs, let’s break blog usage into three different categories.

  1. Community Blogs: These are blogs created to participate in an existing community where blogging is the main method of communication. Often the authors of such blogs don’t even know that they are blogging. Examples of such sites are MySpace, LiveJournal, MSN Spaces, Xanga etc. These kinds of blogs make up the majority of the blogosphere. MySpace has more than 100 million members now. I don’t think Technorati indexes most of these blogs. One can argue that these should not even be considered blogs, but if they enable users to publish information in an easy and democratic way, I think they should be.
  2. Personal Blogs: These blogs are created by individuals to express a point of view and to interact with other like-minded bloggers in an open community. These are blogs like this one, hosted by individuals or on blog hosting sites like WordPress, TypePad etc. The community interaction on such blogs is less closed, and the level of technical sophistication required to manage such a blog is a lot higher than for community blogs. Such blogs form a small part of the blogosphere (less than 15% would be my guess).
  3. Corporate Blogs: These blogs are created by companies to propagate or enhance the company’s positioning. They also serve to humanize the company (e.g. Microsoft blogs or Google employee blogs). Not many companies have created formal structures for blogging yet, but these are likely to come down the line. They might include corporate hosting of blogs, universal branding etc. Such blogs form a small part of the blogosphere (less than 5% would be my guess).
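The estimate of about 80 million active blogs, and the 4-8% share of on-line users who blog, can be reproduced with a short calculation. This is only a sketch: the figure of 1.1 billion Internet users is my own assumption for illustration (the post only refers to "world Internet usage statistics"), and the range of one to two blogs per blogger is likewise hypothetical.

```python
def remaining_blogs(total: int, spam_pct: int, inactive_pct: int) -> int:
    """Blogs left after removing spam and inactive blogs.

    Percentages are whole numbers so the arithmetic stays exact.
    """
    return total * (100 - spam_pct - inactive_pct) // 100


def blogger_share(active_blogs: float, internet_users: float,
                  blogs_per_blogger: float) -> float:
    """Fraction of Internet users who blog, given blogs per blogger."""
    return active_blogs / blogs_per_blogger / internet_users


TOTAL_BLOGS = 250_000_000               # the post's overall guess
ASSUMED_INTERNET_USERS = 1_100_000_000  # assumption, not from the post

# 30% spam, with inactive blogs at the two ends of the 35-40% range.
low_estimate = remaining_blogs(TOTAL_BLOGS, spam_pct=30, inactive_pct=40)
high_estimate = remaining_blogs(TOTAL_BLOGS, spam_pct=30, inactive_pct=35)

# Share of users blogging if each blogger keeps one blog, or two blogs.
share_one_blog = blogger_share(80_000_000, ASSUMED_INTERNET_USERS, 1.0)
share_two_blogs = blogger_share(80_000_000, ASSUMED_INTERNET_USERS, 2.0)

print(f"active blogs: {low_estimate:,} to {high_estimate:,}")
print(f"share of users blogging: {share_two_blogs:.1%} to {share_one_blog:.1%}")
```

With these inputs the active-blog count falls between 75 and 87.5 million, which brackets the round figure of 80 million, and the share of users who blog falls between roughly 3.6% and 7.3%, close to the 4-8% range stated earlier.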

By topic, the non-community part of the blogosphere can be broken down into six main areas:

  1. Technology
  2. Politics and world events
  3. Arts (celebrity, jokes, films, music, TV etc.)
  4. Sports (Don’t think this one is a huge segment yet)
  5. Personal
  6. All other

Identity on a community blog site is typically governed by the community owner, who makes each user sign up and provide some basic identity information that is shared.

In the personal and corporate blogosphere, identity is a problem. Some hosting sites try to address the issue by requiring users to log in and create a blog before they can interact with the blogs on the site, but this does not help the identity situation if the users never post anything on their own blogs. In reality, anonymity is the norm for most interactions in the personal and corporate blogosphere. This default of anonymity provides the wrong incentives for participation in communities and thereby degrades the quality of conversations.

BusinessWeek article on Click Fraud

Last week, BusinessWeek had a great article on click fraud. The writers did a great job detailing how paid-to-click businesses work with domain parking services to make click fraud happen. One additional angle I would have liked to see in the article is competitive click fraud. Competitive click fraud is when company A pays somebody to click on ads for company B in order to drain company B’s resources. I am not sure such an arrangement would even be illegal, besides being difficult to prosecute.

After reading the article, I wanted to leave a comment at the BW site, but they have comment moderation turned on. So after leaving the comment I got a message saying that my comment would be reviewed by somebody within 24 hours…There isn’t much I hate more than having to wait 24 hours to get into a conversation. Has anybody else had the same experience? On the other hand, I guess BW has to be careful about spammers. Also, being an established old business, they probably believe in erring on the side of caution rather than free-flowing discussion. I am not really upset with BW, as this is an issue facing most established brands…short of moderating/censoring the discussion there really isn’t a way to ensure a good quality of discussion.

Taking passwords to the grave

Interesting article on CNET about the issues with estate planning in the on-line world. The problem is getting messier as users move a lot of their financial and organizational information on to the web. One of the suggestions in the article is that on-line users add all their passwords and account information to their estate plans. This does not really work if it means that you need to update your estate plan every time you create a new account or change a password. The problem is worse if users are trying out new sites, especially with so many cool services like Gmail, GCal or stock trading companies coming on-line. What we need is a universal mechanism to store all the passwords and other on-line identities in a central location. Access to this central location is what should be passed on, in a structured manner. There are systems like InfoCard, OpenID and SXIP that provide these services…let’s just hope that these systems see strong adoption.

Privacy and Social networks

There was a time a few years back when privacy was a huge issue on the web. Consumer advocates were up in arms about companies not guarding customers’ information or even selling it to other companies. The concern was that companies and spammers would use that information to steal customers’ identities, and thereby cause them financial harm, or to send unsolicited communications.

With the advent of social networks, things seem to have changed.

  • There are now 110 million profiles on MySpace.
  • There are millions of users sharing their deepest thoughts on YouTube.
  • There are over 60 million blogs where users are publishing their thoughts and at times their identification information like email and address.

A lot of the information on these sites makes the job of spammers/companies a lot easier. In addition to providing contact information, this user generated content also provides a great deal of information which can be used by spammers/companies to better target their offers. The strange thing, though, is that the users creating this content do not seem to care. What is going on?

I think what is going on is simple utility optimization. As my economics 101 professor would have said, the utility the users derive from participating in these communities is greater than any downside in terms of privacy. Another reason could be that users of these social sites are sharing non-transactional information (as opposed to transactional information like credit cards, SSN etc.) that cannot easily be used to cause financial harm. My guess is that in the busy world that we live in, people are starved for attention. As a result, users on these social sites might actually welcome targeted offers or communications from people who take the time to read through all the information they have published. For social network users this is a way to fulfill the basic need of connecting with other humans. In this sense, the social networks are replacing real-world communities and relationships. It could also be that the tools available right now make it hard to limit access to the information to a smaller community. I guess that is what SixApart is trying to address with their new VOX platform.

The big problem with the current system is that information in online communities makes users a lot more vulnerable than they are in real-world communities. The reason is that in online communities all information is logged and is available to the all-seeing eyes of Google and Technorati in perpetuity. See the interesting post from Eric Norlin on the topic (he defines the cool-sounding “Norlin’s maxim”). Thoughts?

Pretexting and social engineering

Great summary from Kim Cameron of the NPR show on pretexting and the privacy issues brought forth by the HP spying scandal (originally from Craig Burton)…Pretexting is a problem that will be there as long as there is profit to be made by pretending to be somebody else. In real-world communities, short of DNA profiling or a chip planted into each human being, there is not much that can be done to eliminate it. And even then, enterprising social engineers/pretexters will find a way to pretend to be somebody else.

As with all new technologies that facilitate communication, there is a price to be paid in terms of an increase in pretexting. The advent of phones brought in a wave of new pretexting scams (Kevin Mitnick does a good job of documenting them in “The Art of Deception”) and the same is now true of the Internet. So what is the solution? How do on-line communities handle rampant pretexting?

I do not believe there are any silver bullets to deal with this issue. Technologies like InfoCard make managing identities easier (it’s a big problem) and add some good encryption mechanisms that make it harder for pretexters to steal identities. But any time a fixed set of credentials (like name, SSN, credit card etc.) is used to establish identity, pretexters will be able to deploy clever techniques (albeit with a bit more difficulty) to collect these credentials. Another approach is to rely on a more decentralized identity mechanism shared within a tight-knit community. Establishing identity in such a community requires a user not only to have the right credentials but also to understand all the old interactions, including the shared context with the community members. This will not stop pretexters, but it will make their job a whole lot harder.

Mystery of online community

John C. Dvorak, the often controversial and flamboyant columnist at PC Magazine, had an interesting post on the problems with virtual on-line communities.

The problem with on-line communities has been the lack of an identity infrastructure and of the other word-of-mouth mechanisms typically available in real-world communities. In real-world communities like a church group or a professional group, word-of-mouth mechanisms give all participants a strong incentive to contribute positively to the shared interest of the group. In virtual communities, where no physical presence is required and joining a new community costs nothing, none of these identity or word-of-mouth mechanisms exist. As a result, most web conversations degenerate into a series of venting or spamming entries. So is it impossible to have a workable virtual community?

One of the communities John looked at in the article is Slashdot. Slashdot is a very successful community (over 100K members) that a number of my techie friends swear by. Slashdot replaces the real-world word-of-mouth mechanisms with its Karma/reputation scores in order to provide incentives to all members to contribute positively to the community. A lot of what Slashdot does is manual member-driven management of the moderation and meta-moderation system but the results are a vibrant community that provides a lot of value to its members. The takeaway then is that if one can provide the right incentives for positive participation along with a reliable identity mechanism, it is possible to have a vibrant on-line community. Now who is up to that challenge :-).
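To make the karma idea concrete, here is a minimal sketch of a moderation-driven reputation score. It is not Slashdot's actual algorithm: the class, the method names, the starting score of 1 and the clamping of comment scores to the range -1 to 5 are all assumptions chosen for illustration.

```python
from collections import defaultdict

# Assumed bounds on a comment's score (illustrative only).
MIN_SCORE, MAX_SCORE = -1, 5


class Community:
    """Toy model: moderators nudge comment scores, authors accrue karma."""

    def __init__(self) -> None:
        self.comment_author: dict[str, str] = {}
        self.comment_score: dict[str, int] = {}
        self.karma: defaultdict[str, int] = defaultdict(int)

    def post(self, comment_id: str, author: str, start_score: int = 1) -> None:
        """Register a new comment with an assumed starting score."""
        self.comment_author[comment_id] = author
        self.comment_score[comment_id] = start_score

    def moderate(self, comment_id: str, delta: int) -> int:
        """Apply one +1 or -1 moderation and update the author's karma.

        Karma changes only by the amount the score actually moved, so
        piling onto a comment already at a bound has no further effect.
        """
        if delta not in (-1, 1):
            raise ValueError("a moderation is either +1 or -1")
        old = self.comment_score[comment_id]
        new = max(MIN_SCORE, min(MAX_SCORE, old + delta))
        self.comment_score[comment_id] = new
        self.karma[self.comment_author[comment_id]] += new - old
        return new


community = Community()
community.post("c1", "alice")
community.post("c2", "bob")
for _ in range(2):
    community.moderate("c1", +1)   # two moderators find alice's comment useful
for _ in range(3):
    community.moderate("c2", -1)   # three moderators flag bob's comment
print(dict(community.karma))
```

The design choice worth noticing is that reputation is earned through the judgment of other members rather than claimed by the user, which is the on-line stand-in for word of mouth described above. A real system would also need to decide who gets to moderate and how to review the moderators, which is the role meta-moderation plays.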

What is identity?

There are a number of problems with the identity systems available on the Internet:

  • Trying to keep track of the usernames and passwords for all your different accounts is hard enough, but if you are like my wife, who likes to have a separate password for each of her accounts, the problem is ten times more vexing.
  • Trying to ascertain that you are indeed on the web page you think you are on is not easy for technically unsophisticated users. This leads to a number of phishing incidents.
  • Trying to ascertain who you are dealing with is hard on the Internet. This leads to a number of baiting scams.
  • Identity theft is a growing menace, with offenders able to easily complete a number of fraudulent transactions with the stolen identity data.
  • Email spam and comment spam on blogs are a growing problem.

Kim Cameron’s laws of identity provide an excellent roadmap for building solutions that can address the identity infrastructure needs. Based on some of the laws, there are a number of products in the market waiting to mature and solve some of the problems listed above. A few of these approaches are SXIP Identity, OpenID and InfoCard (Microsoft). While these solutions and laws are important in addressing the glaring needs of identity infrastructure, they might not apply to all layers of identity.

Multiple Personas

Every individual has multiple personas. People have a persona as a professional (VP of engineering), a persona as a customer (buying a book from Amazon.com), a persona as a citizen (INS etc.), a persona as a member of social clubs (treasurer of TIE), a persona for friends (you don’t know him like I do!), a persona for parents (I am not intimidated by him as I have seen him in diapers), a persona as a spouse and a parent (remember that time in Hawaii) etc.

Some of these personas, like the customer or citizen personas, require explicit validation of claims based on credentials, but several others, like treasurer of a social club, are validated by other people based on shared experiences. Remember that famous scene from Ghost, when Oda Mae Brown (Whoopi) allows Sam to take over her body and touch Molly? Molly does not ask Sam for a social security number or a password; a touch based on their shared past is all the identification she needs to feel Sam’s presence. These shared experiences are an important form of identification, especially in online social networks. In fact, companies are willing to pay money for some of these personas if they can be unambiguously identified.

What kind of infrastructure is needed to capture such shared experiences? Do all the laws of identity still apply? How does it fit with the first law of user control and consent?

 

Introducing KarmaWeb

BIO
Jitendra has over 15 years of experience in software technology. He started his career as a software engineer in the EDA industry. In 2000, after his MBA, Jitendra joined Siebel Systems as product manager for the Siebel web platform, the platform used to deliver all Siebel applications on the web. When he left Siebel in April 2005, he was managing a team of 4 product managers and 3 product lines. In May 2005 Jitendra joined InQuira as director of product management. At InQuira, he managed the company’s flagship search product. In that role, Jitendra set the course of product development and participated in closing a number of sizable deals.

Jitendra obtained his MBA degree from the University of Chicago in 2000 and his B.Tech degree in EE from IIT Kanpur in 1993. At Chicago, he developed and marketed chibus.com, the on-line edition of the University of Chicago GSB school paper. He won the best PM award in the PM group at Siebel for his efforts on the initiative to make the Siebel architecture more flexible.

Update 5/20/2009: Jitendra started (late 2006) and sold (March 2009) SezWho, an online reputation service for social media participants. In the process Jitendra raised $1.3 M in VC money, built up massive distribution, met a lot of amazing people, did some innovative deals and all in all had a blast…