The most popular links posted by developers to Stack Overflow (linkedlists.net)
203 points by bcleary on May 18, 2013 | 65 comments



Nice work.

This would take some work (more on the server's part than yours), but I think it would be more accurate if you tested the links for 301 redirects and consolidated the results accordingly. For example, in the ASP.NET tag one of my pages has 112 links to an older URL (http://encosia.com/2008/05/29/using-jquery-to-directly-call-...) that 301 redirects to a newer URL (http://encosia.com/using-jquery-to-directly-call-aspnet-ajax...) which has 129 links itself. It would be interesting to see only the latter URL show up in the list with 241 links.
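
The consolidation suggested above can be sketched roughly as follows. This is only an illustration, assuming the crawler has already recorded which URLs 301-redirect where; the `redirects` map and the example.com URLs are hypothetical placeholders, and only the counts (112 and 129) come from the comment.

```python
from collections import Counter


def resolve(url, redirects):
    """Follow a chain of known 301 redirects to its final target."""
    seen = set()
    while url in redirects and url not in seen:
        seen.add(url)  # guard against redirect loops
        url = redirects[url]
    return url


def consolidate(link_counts, redirects):
    """Sum link counts under each URL's final redirect target."""
    totals = Counter()
    for url, count in link_counts.items():
        totals[resolve(url, redirects)] += count
    return totals


# Hypothetical URLs; the counts are the ones quoted in the comment.
link_counts = {
    "http://example.com/2008/05/29/old-post": 112,
    "http://example.com/new-post": 129,
}
redirects = {
    "http://example.com/2008/05/29/old-post": "http://example.com/new-post",
}

totals = consolidate(link_counts, redirects)
print(totals["http://example.com/new-post"])  # 241
```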


Thanks. Re the redirects: yes, I have a lot of work to do on my crawler. My current solution is hand-rolled and has only limited capabilities. If anybody knows of a good open source project, let me know. I have been looking at using Common Crawl, but that's not a complete solution either.


I'd be interested in seeing what happens when you bucket citations by domain instead of URL. Google Analytics counts over a thousand unique referring SO pages to Cocoa Controls (http://www.cocoacontrols.com) this year alone. But, of course, most of the links back to my site are long tail.
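
Bucketing by domain could look something like the sketch below. It assumes the data is available as a URL-to-count mapping and that stripping a leading "www." is an acceptable way to merge hosts; the paths and counts are made up for illustration.

```python
from collections import Counter
from urllib.parse import urlsplit


def domain_of(url):
    """Return the lowercased host, without a leading 'www.'."""
    host = (urlsplit(url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def bucket_by_domain(link_counts):
    """Sum per-URL citation counts under each URL's domain."""
    totals = Counter()
    for url, count in link_counts.items():
        totals[domain_of(url)] += count
    return totals


# Hypothetical paths and counts, for illustration only.
link_counts = {
    "http://www.cocoacontrols.com/controls/a": 3,
    "http://www.cocoacontrols.com/controls/b": 2,
    "http://cocoacontrols.com/": 1,
    "http://example.org/page": 4,
}

by_domain = bucket_by_domain(link_counts)
print(by_domain.most_common())
```

A long tail of individually rare URLs then adds up to a single, more visible domain entry.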


Great idea, will try and look at that next.


Another good reason to do this is, for example, the android tag. Most of the top results are Android documentation for basic classes like AsyncTask and Activity. If I could filter out Google domains I'd be left with interesting links related to Android.
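
A domain filter of this kind might be sketched as below. The blocklist and the sample URLs are assumptions chosen for illustration, not anything the site actually uses; the suffix check is written so that "notgoogle.com" is not mistaken for a Google host.

```python
from urllib.parse import urlsplit


def in_domains(url, domains):
    """True if the URL's host is one of `domains` or a subdomain of one."""
    host = urlsplit(url).hostname or ""
    return any(host == d or host.endswith("." + d) for d in domains)


def exclude_domains(urls, domains):
    """Keep only the URLs whose host is outside the given domains."""
    return [u for u in urls if not in_domains(u, domains)]


EXCLUDED = {"google.com", "android.com"}  # assumed blocklist

# Illustrative URLs only.
urls = [
    "http://developer.android.com/reference/android/os/AsyncTask.html",
    "http://example.org/android-tips",
    "http://code.google.com/p/some-project/",
    "http://notgoogle.com/post",
]

filtered = exclude_domains(urls, EXCLUDED)
print(filtered)
```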


thanks!


The top python link, which should direct to http://www.crummy.com/software/BeautifulSoup/ instead directs to http://www.crummy.com/software/beautifulsoup/ --which causes crummy.com to throw a 404 error at you.


You learn something every day. I was sure URLs are not case sensitive until now.

While domain names are not case-sensitive, the rest of the URL might be. In our example, this would be everything that follows “.com” as in wisegeek.com/are-urls-case-sensitive.htm.


Somewhat related, it took me an embarrassingly long time to realize that browsers treat assets with different URL casing as different assets for purposes of caching. Three image elements that reference Foo.jpg, foo.jpg, and foo.JPG all require a separate request and space in the cache, even if the web server you're using is case-insensitive and all three URLs resolve to the same image.
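
To make the distinction concrete, here is a small sketch that lowercases only the scheme and host (the case-insensitive parts) and then reports URLs that differ purely by letter case elsewhere. The example.com URLs are hypothetical, and the code assumes URLs without a user:password part, since lowercasing the whole netloc would alter it.

```python
from collections import defaultdict
from urllib.parse import urlsplit, urlunsplit


def normalize(url):
    """Lowercase only the scheme and host; path and query stay untouched."""
    p = urlsplit(url)
    return urlunsplit((p.scheme.lower(), p.netloc.lower(), p.path, p.query, p.fragment))


def case_variants(urls):
    """Group URLs that differ only by letter case in the path or query."""
    groups = defaultdict(set)
    for url in urls:
        norm = normalize(url)
        groups[norm.lower()].add(norm)
    return [sorted(g) for g in groups.values() if len(g) > 1]


# Hypothetical URLs mirroring the Foo.jpg / foo.jpg / foo.JPG case.
urls = [
    "http://Example.com/img/Foo.jpg",
    "http://example.com/img/foo.jpg",
    "http://example.com/img/foo.JPG",
    "http://example.com/other.png",
]

variants = case_variants(urls)
print(variants)
```

Each group is a set of URLs that a browser would cache separately even if the server treats them as the same resource.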


I always add:

    CheckSpelling off
To my dev (but not production) server so that I catch things like that. This won't help with a case insensitive web server though.


Think about it, the host server gets the path string raw, it's up to it how to interpret it.


Ideally, it would do a redirect to the ‘canonical’ resource (all lowercase, e.g.), so that the browser only has to cache stuff once – but that would take another request at least once. Is there some way for the server to serve the content and redirect in the same reply? As in ‘Here’s your picture, but it is really called foo.jpg rather than FOO.jpg’?


Thanks for the heads up, fixed.


Looking at the results for C++, it looks like the text sanitizing you're doing on link titles is losing the "++". All the titles seem to have C instead of C++, e.g. "Boost C Libraries" should be "Boost C++ Libraries". You're also losing the # in C# and the dot in .NET, and probably others.
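
The site's actual sanitizer isn't shown anywhere in the thread, so the following is only a guess at how this symptom can arise: a cleaner that strips every non-word character will produce exactly these titles. Both function names are hypothetical; the second one drops control characters and collapses whitespace while leaving punctuation alone.

```python
import re


def strip_punctuation(title):
    """An over-eager cleaner of the kind that would produce the reported symptom."""
    no_punct = re.sub(r"[^\w\s]", "", title)
    return re.sub(r"\s+", " ", no_punct).strip()


def clean_title(title):
    """Drop control characters and collapse whitespace, keeping punctuation."""
    title = re.sub(r"[\x00-\x1f\x7f]", " ", title)
    return re.sub(r"\s+", " ", title).strip()


lossy = strip_punctuation("Boost C++ Libraries")
kept = clean_title("Boost  C++\tLibraries ")
print(lossy)  # Boost C Libraries
print(kept)   # Boost C++ Libraries
```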


Thanks for the heads up, will modify the parser. Again, if anybody knows of a good open source crawler, let me know. I rolled my own very quickly but would love it if I could find a 3rd party solution. Another option I have explored is pulling in the titles and descriptions provided by search engines, but currently only DuckDuckGo offers anything useful, and even then its coverage of some of these low-ranking programming pages isn't great. Bing offers pay-per-use access to its index, but the costing structure really doesn't fit my use case.


Thanks for the comments and votes. By the way if anybody is interested this was presented as part of the mining challenge at MSR2013 http://2013.msrconf.org/challenge.php and here is the paper http://thechiselgroup.org/2013/03/27/a-study-of-innovation-d...


I was just about to ask this question! I read that paper and found it quite interesting.


The most popular link for Android answers is AsyncTask. It makes sense: one of the biggest complaints about Android is that it isn't always perfectly smooth, and people notice the jerkiness in the UI. I would say a large majority of the time it is because an Android app developer is running slow code on the UI thread instead of doing it correctly.


Interesting. In C# the second most popular link is for the BackgroundWorker class; not entirely the same use case, but I guess similar motivations. http://msdn.microsoft.com/en-us/library/system.componentmode...


The javascript entry doubles as a Table of Contents for the jQuery API. Totally unsurprising


looks like the android entry then.


Nice! Nitpick: converting entities (such as '&mdash;') to their actual representation ('—') in the titles would be nice; currently they show up as 'mdash'.
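
For what it's worth, decoding entities is a one-liner in many languages. A minimal Python sketch, assuming the raw title still contains the entity text (the sample title is made up):

```python
import html


def decode_entities(title):
    """Turn HTML entities such as &mdash; or &amp; into the characters they name."""
    return html.unescape(title)


decoded = decode_entities("Strings &mdash; a primer &amp; reference")
print(decoded)
```

The important detail is to decode before any step that strips punctuation; otherwise the "&" and ";" disappear and only "mdash" is left.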


Thanks, yea we will have to do some work on cleaning up our title and description parser. Will add to the bug list.


Cool. Also, props for the pun, that's a fine name you found here :)


How different would the results from the linked_lists tool be compared with Google results for a specific topic? Are they closer to what a developer needs?


Ah, you guessed my next paper :) I have done a little informal analysis on the top 10 results from linked_lists vs the top 10 for that tag used as a Google query. They are quite different, almost totally different actually. But this is not really surprising if you think of the developers curating the links posted on Stack Overflow. I know there was an attempt a few years back to build a search engine based on the SO data set, don't know what happened to that.


Very cool I like it! I like that it adds whatever tag you clicked on to the top, but maybe you should save that between sessions. Also, a lot of the more popular languages will reflect what everyone here has already seen and worked with - which is cool because it does what you advertise, but the usefulness is somewhat limited for us folk. It would be cool to have a year/month/week filter to see what has been linked to the most lately (I just saw that was suggested earlier). Things like node would benefit from that since it's growing so fast, but everyone knows about express. A simple thing to enable a bit more usefulness would be to link to the actual Stack Overflow posts so we can look at the comments. Cool stuff though!


Thanks for the support. Yes to everything above :) We are currently mining the post history data to be able to do those kinds of time range queries. Can't wait to get that out there, it should be very cool indeed. We also want to allow users to search by their SO id and to filter their links by tag. (As an aside, when we do mine the history we will be able to get more accuracy on which users actually posted which links rather than just the post owner.)


This is nicely done, but so far isn't returning anything interesting for me. All the results are very basic (Django docs, PHP docs, etc), which makes sense, the most often cited will be the most general.

What about a change in algorithm to try and add some discovery here. What if you look at votes for answers vs cites? So things that have a high average vote:cite ratio rank?
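
A rough sketch of what a vote:cite ranking might look like. The `LinkStats` record and its fields are hypothetical, and the smoothing toward a prior is an addition of this sketch (not something proposed in the comment) so that a link cited once in a single highly voted answer doesn't automatically top the list.

```python
from dataclasses import dataclass


@dataclass
class LinkStats:
    url: str
    cites: int        # number of posts citing the link
    total_votes: int  # summed score of those posts


def vote_cite_ratio(link, prior_cites=5, prior_ratio=1.0):
    """Average votes per citation, smoothed toward prior_ratio for rarely cited links."""
    return (link.total_votes + prior_cites * prior_ratio) / (link.cites + prior_cites)


def rank(links):
    """Order links from highest to lowest smoothed vote:cite ratio."""
    return sorted(links, key=vote_cite_ratio, reverse=True)


# Made-up numbers, for illustration only.
links = [
    LinkStats("http://example.org/docs", cites=1000, total_votes=1500),
    LinkStats("http://example.org/deep-dive", cites=40, total_votes=400),
    LinkStats("http://example.org/one-off", cites=1, total_votes=30),
]

ranked = rank(links)
for link in ranked:
    print(link.url, round(vote_cite_ratio(link), 2))
```

With these numbers the heavily cited general docs page drops to the bottom, which is the discovery effect the comment is after.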


Thanks. Yes, I agree. Our initial use case was to provide an interface to the dataset and to allow us to explore what kinds of things developers were sharing on SO. For the next version we are working on new ranking metrics that will improve the discovery aspect; vote:cite and view:cite are 2 we are looking at.


Looking forward to v2!


Here's the same but for Hacker News http://www.hnstore.co/42.html


Wow, that's really cool, have to admit I didn't see that before. Let me know if you're interested in sharing data; we are doing a lot of research in this area and the more data the better!


This is very cool! This is only a minor quibble, but when you've selected e.g. C and then C++, the 'x' to remove one or other tag from the search results is nearly invisible. I only found it because I've used a similar 'x' box before on a different site to remove results.


Thanks, will update.


It would be interesting to weight the links by the score of the corresponding comment.


I actually had that in an earlier version and took it out just to simplify the design, but I am looking at this again to produce a better sorting experience. The number of views a post receives may also be a good metric.


This is an awesome idea! Is there any way to create a randomized list weighted with popularity (so for a given tag you can refresh to find new links, but still ones likely to be interesting)?


Thanks for the support, really encouraged by the feedback here on Hacker News. You guys are great.

To your question: yes, I am currently testing some ideas for a magic ranking system. One option is, as you say, a kind of random selection amongst popular or trending links. Another is a smart weighted sort when filtering by multiple tags. One problem is that right now, if you add jQuery to your filter, the javascript results are going to just dominate everything else.
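
One possible shape for the "random but weighted by popularity" idea, sketched with the Efraimidis-Spirakis key trick for weighted sampling without replacement. This is a swapped-in technique chosen for illustration, not what the site does; the URLs and counts are invented.

```python
import random


def weighted_sample(link_counts, k, rng=random):
    """Pick k distinct URLs, favouring higher counts (Efraimidis-Spirakis keys)."""
    keyed = [
        (rng.random() ** (1.0 / count), url)
        for url, count in link_counts.items()
        if count > 0
    ]
    keyed.sort(reverse=True)  # the largest keys win
    return [url for _, url in keyed[:k]]


# Invented counts, for illustration only.
link_counts = {
    "http://example.org/very-popular": 500,
    "http://example.org/popular": 120,
    "http://example.org/niche": 8,
    "http://example.org/rare": 1,
}

sample = weighted_sample(link_counts, 2, random.Random(0))
print(sample)
```

Every refresh (a new seed) gives a different list, but well-cited links still show up far more often than rare ones.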


Doesn't work on an iPad. I type in my search term, press enter, and nothing happens. Furthermore, when I enter my search term, the "Filter by tag (e.g. javascript)" text doesn't disappear.


Wow - ok, looking into that now. What version of iOS?


The latest, iOS 6.1.3


Great work. I am sure this would be useful if there was a filter on official documentation sites like jquery or php.net etc.

In the spirit of HN, how does it work and what powers it? :)


Thanks. Yes the domain filter is a great idea, will go to the top of the feature request list.

The site is C#, ASP.NET MVC, with a JavaScript front end, backed by SQL Server 2012, and running on AWS.


Top Objective-C link points to ASIHTTPRequest, an obsolete networking library that is no longer supported (and even its developer recommends against using it).


Right but there's a lot of people still maintaining code that has it implemented and also a lot of tutorials that people are using that likely use that framework.


Ha, the top result for C is the spec. Every other language is a popular library.

I wonder if there are more RTFM-type responses for C than other languages.


I suspect it has more to do with the large amount of undefined/unspecified/implementation-defined behavior in C. A lot of questions on Stack Overflow can only be answered correctly by referring to the standard.


Are these links from only 2013? It would be interesting to see what the top links are by year/month or something too.


So these are taken from the March 2013 data dump, which includes questions going right back to the start of SO. So some of these links have been collecting citations for a few years. We only mined the actual post content on the date the dump was created; we did not mine the post history. But we are working on that right now, it's a lot of data to process :)


Interesting. This data may potentially be useful in evaluating framework/library/OSS trends and prominence.


Thanks, yes we are actually working on a paper to that effect at the moment.


you're


Totally. Fix the bad grammar, and flesh out what cc/sa means for those unfamiliar with it.


Thanks, missed that one, will fix.


Fixed.


I dunno... 11 of the top 14 links under the "Javascript" tag are about jQuery (which has its own tag). I suppose it's not a flaw in your algorithm, just not really very interesting results.


I know javascript and jQuery suffer from their popularity and utility I think in this analysis. There is also the issue of jQuery and javascript tags being used by people as synonyms. jQuery particularly is a 500 pound gorilla over all the dataset, it shows up everywhere. As I mentioned above there was a joke on SO at one point that the answer to nearly every question was jQuery with a link to the site. Apparently it wasn't a joke.


Feature idea: add a time filter, e.g. last six months, last year, etc.


Thanks, yea second most requested feature after the domain filter. Will hopefully add soon.


Does this scrape links from the comments posted to questions as well?


No, just the post bodies (questions and answers) at the moment, but we are working on parsing the comments and the post history; those datasets are about 4 times the size of the posts! So there are probably a lot more URLs in there, although we will have to decide whether we treat all URLs the same or differentiate between URLs in post bodies contained in the dump, URLs in the post history (that may have been removed from the post) and URLs in the comments. Maybe not a concern for the website, but more so for research.
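
For readers curious what pulling links out of a post body involves, here is a minimal sketch using only the Python standard library. It assumes the post body is available as an HTML string; the class and function names are hypothetical, and this is not the author's crawler.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect absolute http(s) hrefs from anchor tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value and value.startswith(("http://", "https://")):
                self.links.append(value)


def extract_links(body_html):
    """Return the external links found in one post body, in document order."""
    parser = LinkExtractor()
    parser.feed(body_html)
    parser.close()
    return parser.links


body = '<p>See <a href="http://api.jquery.com/">the docs</a> and <a href="#top">top</a>.</p>'
found = extract_links(body)
print(found)
```

Comments and post history would likely need their own handling, since they may not be stored in the same format as post bodies.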


You usually see notes or clarification posted to questions in the form of comments first, where the types of links are more introductory, general purpose, than specific as you might find in answers.

Can't wait to see the updated stats.

If you could make "more" load more than just a few more records, though, that'd make it a lot easier to dig deeper.


PHP's top result: jQuery...


Yea, I know. I think it probably goes back to that SO joke about "Q - I have this programming problem" "A - jQuery"



