Considering the blog here has been inactive for a while I figured I would contribute alittle post. One problem in analyzing historical voting patterns that has come up a few times how it is problematic that older posts have a long time to accumulate more posts. For instance, in my analysis of reputation effects it is problematic that older votes have alonger exposure time. Recently a question on average votes on CV meta reminded me of it as well. I pointed to the fact that the average number of upvotes for questions and answers on the site is 3.2, but that doesn’t take into account how long the questions have been available to be voted on. So I figured I would dig into the data a bit, aided by the Stack Exchange Data Explorer, and look at how votes accumulate over time.
First, lets just look at the distribution of voting after a question in created.
The above chart plots a histogram of the
post vote date - post creation date, and bins all votes after one week (this query was resticted to 23,344 posts in 2012 to not reach the SO data explorer limit). We can see here in the histogram that over half of voting is done on the same day the post is created, and over another 20% are voted on the following day. By a week, these historical votes have nearly accumulated 90% of their totals. Below I just arbitrarily assigned posts to during the week or after the week, and graphed vote contributions aggregated weekly over time.
Looking at the trends in the data in cumulative area charts, you can see this slow accumulation over time pretty easily. Also you can see a trend to more voting (partially due to more posts) on the site, but if one just looked at the historical accumulation of total votes it may mask such a trend. Graphing the proportion of votes over time shows an even more striking pattern.
I kind of expected this to show a plateau, but that doesn’t appear to be the case. Probably in the future if I am doing any analysis on historical voting patterns I will likely just chop off votes after a pre-specified time (like a week), but certainly more interesting and useful information can be culled from such voting patterns. In fact, as far as reputation effects are concerned, one could reasonably argue the flattery of having older posts voted up is a type of reputation effect (although in general such things are potentially confounded by a host of other factors).
These graphics open up other interesting questions as well, such as are the historical votes from new members? I encourage all to brush up on some SQL and go check out the data explorer. Maybe we can harass the StackExchange group to implement some more graphs as well (being able to show a histogram would be nice, as well as interactive redefine what variables go on the x and y axis’s for the line charts).
One of the community projects we have started at Cross Validated is a journal club. The journal club is mainly intended to engage the community in general discussion related to really any statistical material (either theoretical or applied). In a meta thread we subsequently propose papers (that are publicly available), and then once enough community interest seems to be in place we set a time and invite the author of the article to come and join us in chat.
Because statistical methods are applied to such a wide array of topics, there is really an expansive set of topics the journal club can cover. For prior discussions, one can peruse the journal club tag on the CV meta site, but for a quick round-up here are a few example discussions so far;
- Data confidentiality: A review of methods for statistical disclosure limitation and methods for assessing privacy by Gregory J. Matthews and Ofer Harel
- Douglas N. Arnold, Kristine K. Fowler (2011). Nefarious Numbers. Which involves criticism of impact factor scores for journals.
- A Statistical Analysis Of Multiple Temperature Proxies: Are Reconstructions Of Surface Temperatures Over The Last 1000 Years Reliable? By Blakeley B. McShane and Abraham J. Wyner
I’d like here to encourage both regular cross-validated members and passer-bys from other communities to propose material. In particular, I’d like to note that the proposed material need not be advanced statistical methodology. In fact we probably get more lively discussions when the material is basic enough to allow a wide array of individuals with varying backgrounds to understand the material and participate in general discussion.
So stop on by the current proposal, the seventh Cross Validated journal club, and either vote up the current proposal or propose some more material to discuss!
A recent question on meta.stats.se has brought up some concerns over how visible the blog is on the site. Here I will present some statistics on site views and referrals over the time of the blogs existence.
First I made a comment in the aforementioned link that the majority of the site traffic so far has not come from the main site. Below is a time series chart of the accumulated site views for the site. I have superimposed blue dots for when an item was posted on the blog, and red dots for when key figures either tweeted or referred to one of our posts in a blog.
I’ve annotated the one huge spike of over 2,000 views in the chart, as the main driver behind that surge was a tweet by Joel Spolsky. Below is the same chart with the Y axis restricted to within the range of more usual site traffic flow.
In general it appears that site traffic sees a slight increase whenever a new post is published. Again I’ve highlighted several key referrals from outside sources that appear to have driven site traffic up that aren’t due to the initial spike from a new post. One from a mention of the Emacs post by David Smith over at the revolution analytics blog, and another mention of the recent tables post by Rob Hyndman on his personal blog. Below I have inserted the table of refferal sites from the wordpress dashboard (site urls are truncated to the first 35 characters).
Of these sources I arbitrarily collapsed them to different categories. So for the grand finale, sit down and have a slice of pie to visualize the proportion of referrals that come from stats.stackexchange.
Just kidding. Pie charts aren’t necessarily awful here in this instance, but it is difficult to tell the difference in sizes among the three categories of stats.stackexchange, key blogs/tweets or other sources. This is trivial task with either a bar chart or a dot plot, and below I have reposted the same data in a bar plot. It is easy there to see that referrals from stats.stachexchange is the smallest among the categories.
Note that these are just referrals that are recorded by wordpress. As of writing this post, there was approximately 6,000 total site views. That means about 20% of the total site views so far are from Tal Galil’s R-Bloggers syndication! The other sources category includes referrals from blog.overflow and other questions on the SE sites, but referrals from the main Stats.SE site should be largely represented within that category. These referrals obviously under-count referrals as well. Twitter is only listed as accounting for a total of around 300 referrals, although the huge spike of over 2,000 views on 8/5/2011 can only be reasonably explained by the mention of Joel Spolsky on twitter.
The reason pie charts suck is that visualizing the angles in slices in pie charts is more difficult than visualizing the length of a line (bar charts) or position of a point in a cartesian coordinate system (dot plots). Frequently bad pie charts are chastised for having too many categories, they are worse than bar charts or dot plots even when they have a small number of categories too! If your being difficult you could perhaps argue that pie charts are still useful because they don’t need a common scale with which to make comparisons between (see the maps of Charles Minard for an example) or that there ubiquity should leave them as an option (as they are so prevalent we have developed a gestalt for interpreting them). I think my response to these critiques would be its refreshing to hear an argument for pie charts that isn’t I like the way they look!
Website analytics is a bit out of my ken, but my speculation from the site traffic and referral statistics is as follows;
- Click throughs to the blog from the main stats site are pretty sad. I don’t know what the average site visits are for stats.se (apparently such info is a secret), and I don’t know what a reasonable number of clickthroughs would be. But I do know averaged over the time period the blog has been in existence, we are averaging around 3~4 referrals from the main stats site to the blog per day. Cats walking on the keyboard and by chance clicking on the link to the blog at the very bottom of the page are perhaps to blame.
- Referrals from outside sources have a much greater overall potential to increase traffic to the site, regardless of how much we improve referrals from the main site.
So where to go from here? Maybe we should just ask Joel to tweet all of our blog posts, or just spam the R-Bloggers feed with all our posts. Being serious though, I would just like to see the community take greater participation in writing posts. I assume that quality content is the best means to attract more visitors to the blog, and along the way we can figure out how to do a better job of integrating the blog with the main site and what the role the blog will take in supplement to the main Q/A site.
That being said I do think that a permanent link to the blog in the header of the main page is a good idea as well (although I have no idea how much traffic overall it will bring). Also all of you folks with twitter accounts (along with other social networking updates) would be doing us a favor by pointing to posts on our blog you think are worthwhile. It could potentially cascade into a much wider audience than we could ever get directly from the Cross Validated main site as well.
The blog is an excellent platform for issues that don’t fit well within the constraint of questions and answers on the main sites, and so I believe it is a useful tool in the dissemination of information that community members agree is important. I’d like here to remind the community that every member of the community is invited to contribute a post to the blog. We undoubtedly need greater involvement from the community though to make the blog sustainable. Several suggested thematic post series have gone unwritten because we need help! Surely more analysis of the wealth of public data in the stack exchange data dump would be of wide interest to the general community as well.
I’ve posted the data used in this post (along with SPSS syntax to produce the charts) in this google code link. Thoughts from the stats community on the topic are always appreciated, and if other communities have advice about promoting their blogs I would love to hear it as well.
It is almost a year since CrossValidated was launched. Today we start a new activity at CrossValidated – a community blog. It is the fourth (after the main site, meta and chat) place for getting in touch with the community and contributing to it.
To get started, we plan to post series of posts about the following topics:
- Question of the Week Each week there will be a survey on meta in which the users will nominate and elect some recent splendid question — then we’ll ask the author of best answer either to elaborate a bit, write some summary of the whole thread, start a miniseries of posts inspired by the QoW or do something else in this manner.
- Journal Club report There will be some summary of the each JC chat. While the JC has a summer break currently, you can expect some summaries of archived chats.
- R tips&tricks We can not ignore our most popular tag. Hopefully we would manage to attract some people from R community on SO to participate here.
- Challenge alert Announcements (with some introduction) of machine learning or data mining challenges.
Finally the most important thing: remember that this is your place. Anyone can write a post here — just visit Skewed distribution – blog’s chat room and ask for an author account or maybe just suggest some topic or share your critic. We are open to any idea and we hope we’ll manage to make it an useful and avid place for the community.