This would take some work (more on the server's part than yours), but I think it would be more accurate if you tested the links for 301 redirects and consolidated the results accordingly. For example, in the ASP.NET tag one of my pages has 112 links to an older URL (http://encosia.com/2008/05/29/using-jquery-to-directly-call-...) that 301 redirects to a newer URL (http://encosia.com/using-jquery-to-directly-call-aspnet-ajax...) which has 129 links itself. It would be interesting to see only the latter URL show up in the list with 241 links.
Thanks. Re the redirects, yeah, I have a lot of work to do on my crawler; my current solution is hand-rolled and has only limited capabilities. If anybody knows of a good open source project, let me know. I have been looking at using the Common Crawl, but that's not a complete solution either.
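The consolidation step itself is simple once you have a redirect map; a minimal sketch (the function names and data shapes here are illustrative, not the site's actual code, and the map of old URL to final URL is assumed to be built separately by issuing requests and recording where each 301 lands):

```python
from collections import defaultdict

def consolidate(link_counts, redirect_map):
    """Merge citation counts for URLs that 301-redirect elsewhere.

    link_counts:  {url: citation_count}
    redirect_map: {old_url: final_url} (identity assumed if absent)
    """
    merged = defaultdict(int)
    for url, count in link_counts.items():
        merged[redirect_map.get(url, url)] += count
    return dict(merged)
```

With the example above, 112 citations of the older URL and 129 of the newer one would merge into a single entry with 241.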
I'd be interested in seeing what happens when you bucket citations by domain instead of URL. Google Analytics counts over a thousand unique referring SO pages to Cocoa Controls (http://www.cocoacontrols.com) this year alone. But, of course, most of the links back to my site are long tail.
Another good reason to do this is for example the android tag. Most of the top results are Android documentation for basic classes like ASyncTask and Activity. If I could filter out google domains I'd be left with interesting links related to Android.
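Bucketing by domain and filtering out documentation hosts both fall out of the same grouping step; a rough sketch (the exclude list is just an example of how one might drop Google-owned reference domains):

```python
from collections import Counter
from urllib.parse import urlparse

def bucket_by_domain(urls, exclude=()):
    """Count citations per domain rather than per URL, optionally
    skipping any domain (or subdomain) in `exclude`."""
    counts = Counter()
    for url in urls:
        host = urlparse(url).netloc.lower()
        if any(host == d or host.endswith("." + d) for d in exclude):
            continue
        counts[host] += 1
    return counts
```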
You learn something every day. I was sure URLs were not case-sensitive until now.
While domain names are not case-sensitive, the rest of the URL might be. In our example, this would be everything that follows “.com” as in wisegeek.com/are-urls-case-sensitive.htm.
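This split matters if you canonicalize URLs before counting citations: only the scheme and host can safely be lowercased, while the path has to be left alone. A small sketch of that idea:

```python
from urllib.parse import urlsplit, urlunsplit

def canonicalize_host(url):
    """Lowercase only the scheme and host of a URL; leave the path,
    query, and fragment untouched, since those may be case-sensitive
    on the origin server."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(),
                       parts.path, parts.query, parts.fragment))
```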
Somewhat related, it took me an embarrassingly long time to realize that browsers treat assets with different URL casing as different assets for purposes of caching. Three image elements that reference Foo.jpg, foo.jpg, and foo.JPG each require a separate request and separate space in the cache, even if the web server you're using is case-insensitive and all three URLs resolve to the same image.
Ideally, it would do a redirect to the ‘canonical’ resource (all lowercase, e.g.), so that the browser only has to cache stuff once – but that would take another request at least once. Is there some way for the server to serve the content and redirect in the same reply? As in ‘Here’s your picture, but it is really called foo.jpg rather than FOO.jpg’?
Looking at the results for C++, it looks like the text sanitizing you're doing on link titles is losing the "++". All the titles seem to have C instead of C++, e.g. "Boost C Libraries" should be "Boost C++ Libraries". You're also losing the # in C# and the dot in .NET, and probably others.
Thanks for the heads up; I will modify the parser. Again, if anybody knows of a good open source crawler, let me know. I rolled my own very quickly but would love to find a third-party solution. Another option I have explored is pulling in the titles and descriptions provided by search engines, but currently only DuckDuckGo offers anything useful, and even then its coverage of these low-ranking programming pages isn't great. Bing offers pay-per-use access to its index, but the pricing structure really doesn't fit my use case.
The most popular link for Android answers is AsyncTask. It makes sense: one of the biggest complaints about Android is that it isn't always perfectly smooth, and people notice the jerkiness in the UI. I would say that a large majority of the time it is because an Android app developer is running slow code on the UI thread instead of doing it correctly.
Nice! Nitpick: converting HTML entities (such as '&amp;mdash;') to their actual representation ('—') in the titles would be nice; currently they show up as 'mdash'.
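Both title problems (the dropped '++' and '#', and the leftover 'mdash') point at overly aggressive sanitizing; a gentler cleaner that only decodes entities and collapses whitespace might look like this (a sketch, not the site's actual parser):

```python
import re
from html import unescape

def clean_title(raw):
    """Decode HTML entities (e.g. &mdash; -> em dash) and collapse
    runs of whitespace, without stripping punctuation such as the
    '++' in C++, the '#' in C#, or the '.' in .NET."""
    return re.sub(r"\s+", " ", unescape(raw)).strip()
```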
How different are the results from the linked_lists tool compared with Google results for a specific topic? Are they closer to what a developer needs?
Ah, you guessed my next paper :) I have done a little informal analysis of the top 10 results from linked_lists vs. the top 10 for that tag used as a Google query. They are quite different; almost totally different, actually. But this is not really surprising if you think of the developers curating the links posted on Stack Overflow. I know there was an attempt a few years back to build a search engine based on the SO data set; I don't know what happened to that.
Very cool I like it! I like that it adds whatever tag you clicked on to the top, but maybe you should save that between sessions. Also, a lot of the more popular languages will reflect what everyone here has already seen and worked with - which is cool because it does what you advertise, but the usefulness is somewhat limited for us folk. It would be cool to have a year/month/week filter to see what has been linked to the most lately (I just saw that was suggested earlier). Things like node would benefit from that since it's growing so fast, but everyone knows about express. A simple thing to enable a bit more usefulness would be to link to the actual Stack Overflow posts so we can look at the comments. Cool stuff though!
Thanks for the support. Yes to everything above :) We are currently mining the post history data to enable those kinds of time-range queries. Can't wait to get that out there; it should be very cool indeed. We also want to allow users to search by their SO ID and to filter their links by tag. (As an aside, once we mine the history we will be able to determine more accurately which users actually posted which links, rather than just the post owner.)
This is nicely done, but so far it isn't returning anything interesting for me. All the results are very basic (Django docs, PHP docs, etc.), which makes sense: the most often cited will be the most general.
What about a change in the algorithm to add some discovery here? What if you looked at votes for answers vs. cites, so that things with a high average vote:cite ratio rank higher?
Thanks. Yeah, I agree; our initial use case was to provide an interface to the dataset and to let us explore what kinds of things developers were sharing on SO. For the next version we are working on new ranking metrics that will improve the discovery aspect; vote:cite and view:cite are two we are looking at.
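A vote:cite ranking could be sketched like this (the field names and the minimum-citation cutoff are illustrative assumptions, not the site's actual schema; some floor on citations is needed so a one-off link with a single high-voted answer doesn't dominate):

```python
def rank_by_vote_cite(links, min_cites=5):
    """Order links by average answer score per citation.

    links: list of dicts with 'url', 'votes' (sum of answer scores),
    and 'cites' (citation count). Links below min_cites are dropped.
    """
    eligible = [l for l in links if l["cites"] >= min_cites]
    return sorted(eligible, key=lambda l: l["votes"] / l["cites"], reverse=True)
```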
Wow, that's really cool; I have to admit I didn't see that before. Let me know if you're interested in sharing data; we are doing a lot of research in this area, and the more data the better!
This is very cool!
This is only a minor quibble, but when you've selected, e.g., C and then C++, the 'x' to remove one or the other tag from the search results is nearly invisible. I only found it because I've used a similar 'x' box on a different site to remove results.
I actually had that in an earlier version and took it out just to simplify the design, but I am looking at it again to produce a better sorting experience. The number of views a post receives may also be a good metric.
This is an awesome idea! Is there any way to create a randomized list weighted with popularity (so for a given tag you can refresh to find new links, but still ones likely to be interesting)?
Thanks for the support, really encouraged by the feedback here on Hacker News. You guys are great.
To your question: yes, I am currently testing some ideas for a magic ranking system. One option is, as you say, a kind of random selection among popular or trending links. Another is a smart weighted sort when filtering by multiple tags. One problem right now is that if you add jQuery to your filter, the JavaScript results just dominate everything else.
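The popularity-weighted random selection could be a weighted sample without replacement, so each refresh surfaces a different but still likely-interesting set (a sketch; the function name and parameters are hypothetical):

```python
import random

def random_discovery(links, counts, k=10, seed=None):
    """Pick k distinct links, where a link's chance of being chosen
    is proportional to its citation count."""
    rng = random.Random(seed)
    pool, weights = list(links), list(counts)
    picked = []
    for _ in range(min(k, len(pool))):
        # Draw one index weighted by citation count, then remove it
        # so the same link cannot be picked twice.
        i = rng.choices(range(len(pool)), weights=weights)[0]
        picked.append(pool.pop(i))
        weights.pop(i)
    return picked
```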
Doesn't work on an iPad. I type in my search term, press enter, and nothing happens. Furthermore, when I enter my search term, the "Filter by tag (e.g. javascript)" text doesn't disappear.
Top Objective-C link points to ASIHTTPRequest, an obsolete networking library that is no longer supported (and even its developer recommends against using it).
Right, but there are a lot of people still maintaining code that uses it, and also a lot of tutorials people follow that likely use that framework.
I suspect it has more to do with the large amount of undefined/unspecified/implementation-defined behavior in C. A lot of questions on Stack Overflow can only be answered correctly by referring to the standard.
These are taken from the March 2013 data dump, which includes questions going right back to the start of SO, so some of these links have been collecting citations for a few years. We only mined the actual post content as of the date the dump was created; we did not mine the post history. But we are working on that right now; it's a lot of data to process :)
I dunno... 11 of the top 14 links under the "Javascript" tag are about jQuery (which has its own tag). I suppose it's not a flaw in your algorithm; the results just aren't very interesting.
I think JavaScript and jQuery suffer from their popularity and utility in this analysis. There is also the issue of the jQuery and javascript tags being used as synonyms. jQuery in particular is a 500-pound gorilla across the whole dataset; it shows up everywhere. As I mentioned above, there was a joke on SO at one point that the answer to nearly every question was jQuery with a link to the site. Apparently it wasn't a joke.
No, just the post bodies (questions and answers) at the moment, but we are working on parsing the comments and the post history; those datasets are about four times the size of the posts! So there are probably a lot more URLs in there, although we will have to decide whether to treat all URLs the same or to differentiate between URLs in the post bodies contained in the dump, URLs in the post history (which may have been removed from a post), and URLs in the comments. Maybe not a concern for the website, but more so for research.
You usually see notes or clarifications posted to questions in the form of comments first, where the linked pages tend to be more introductory and general-purpose than the specific ones you might find in answers.
Can't wait to see the updated stats.
If you could make "more" load more than just a few more records, though, that'd make it a lot easier to dig deeper.