This would take some work (more on the server's part than yours), but I think it would be more accurate if you tested the links for 301 redirects and consolidated the results accordingly. For example, in the ASP.NET tag one of my pages has 112 links to an older URL (http://encosia.com/2008/05/29/using-jquery-to-directly-call-...) that 301 redirects to a newer URL (http://encosia.com/using-jquery-to-directly-call-aspnet-ajax...) which has 129 links itself. It would be interesting to see only the latter URL show up in the list with 241 links.
Thanks. Re the redirects, yeah, I have a lot of work to do on my crawler; my current solution is hand-rolled and has only limited capabilities. If anybody knows of a good open source project, let me know. I have been looking at using the Common Crawl, but that's not a complete solution either.
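The consolidation step itself is simple once you have a redirect map; a minimal sketch (the function names and data shapes here are illustrative, not the site's actual code, and the map of old URL to final URL is assumed to be built separately by issuing requests and recording where each 301 lands):

```python
from collections import defaultdict

def consolidate(link_counts, redirect_map):
    """Merge citation counts for URLs that 301-redirect elsewhere.

    link_counts:  {url: citation_count}
    redirect_map: {old_url: final_url} (identity assumed if absent)
    """
    merged = defaultdict(int)
    for url, count in link_counts.items():
        merged[redirect_map.get(url, url)] += count
    return dict(merged)
```

With the example above, 112 citations of the older URL and 129 of the newer one would merge into a single entry with 241.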
I'd be interested in seeing what happens when you bucket citations by domain instead of URL. Google Analytics counts over a thousand unique referring SO pages to Cocoa Controls (http://www.cocoacontrols.com) this year alone. But, of course, most of the links back to my site are long tail.
Another good reason to do this is for example the android tag. Most of the top results are Android documentation for basic classes like ASyncTask and Activity. If I could filter out google domains I'd be left with interesting links related to Android.
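Bucketing by domain and filtering out documentation hosts both fall out of the same grouping step; a rough sketch (the exclude list is just an example of how one might drop Google-owned reference domains):

```python
from collections import Counter
from urllib.parse import urlparse

def bucket_by_domain(urls, exclude=()):
    """Count citations per domain rather than per URL, optionally
    skipping any domain (or subdomain) in `exclude`."""
    counts = Counter()
    for url in urls:
        host = urlparse(url).netloc.lower()
        if any(host == d or host.endswith("." + d) for d in exclude):
            continue
        counts[host] += 1
    return counts
```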
You learn something every day. I was sure URLs were not case-sensitive until now.
While domain names are not case-sensitive, the rest of the URL might be. In our example, this would be everything that follows “.com” as in wisegeek.com/are-urls-case-sensitive.htm.
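This split matters if you canonicalize URLs before counting citations: only the scheme and host can safely be lowercased, while the path has to be left alone. A small sketch of that idea:

```python
from urllib.parse import urlsplit, urlunsplit

def canonicalize_host(url):
    """Lowercase only the scheme and host of a URL; leave the path,
    query, and fragment untouched, since those may be case-sensitive
    on the origin server."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(),
                       parts.path, parts.query, parts.fragment))
```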
Somewhat related, it took me an embarrassingly long time to realize that browsers treat assets with different URL casing as different assets for purposes of caching. Three image elements that reference Foo.jpg, foo.jpg, and foo.JPG each require a separate request and separate space in the cache, even if the web server you're using is case-insensitive and all three URLs resolve to the same image.
Ideally, it would do a redirect to the ‘canonical’ resource (all lowercase, e.g.), so that the browser only has to cache stuff once – but that would take another request at least once. Is there some way for the server to serve the content and redirect in the same reply? As in ‘Here’s your picture, but it is really called foo.jpg rather than FOO.jpg’?
Looking at the results for C++, it looks like the text sanitizing you're doing on link titles is losing the "++". All the titles seem to have C instead of C++, e.g. "Boost C Libraries" should be "Boost C++ Libraries". You're also losing the # in C# and the dot in .NET, and probably others.
Thanks for the heads up; I will modify the parser. Again, if anybody knows of a good open source crawler, let me know. I rolled my own very quickly but would love to find a third-party solution. Another option I have explored is pulling in the titles and descriptions provided by search engines, but currently only DuckDuckGo offers anything useful, and even then its coverage of these low-ranking programming pages isn't great. Bing offers pay-per-use access to its index, but the pricing structure really doesn't fit my use case.
The most popular link for Android answers is AsyncTask. It makes sense: one of the biggest complaints about Android is that it isn't always perfectly smooth, and people notice the jerkiness in the UI. I would say that a large majority of the time it is because an Android app developer is running slow code on the UI thread instead of doing it correctly.
Nice! Nitpick: converting HTML entities (such as '&amp;mdash;') to their actual representation ('—') in the titles would be nice; currently they show up as 'mdash'.
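Both title problems (the dropped '++' and '#', and the leftover 'mdash') point at overly aggressive sanitizing; a gentler cleaner that only decodes entities and collapses whitespace might look like this (a sketch, not the site's actual parser):

```python
import re
from html import unescape

def clean_title(raw):
    """Decode HTML entities (e.g. &mdash; -> em dash) and collapse
    runs of whitespace, without stripping punctuation such as the
    '++' in C++, the '#' in C#, or the '.' in .NET."""
    return re.sub(r"\s+", " ", unescape(raw)).strip()
```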
How different are the results from the linked_lists tool compared with Google results for a specific topic? Are they closer to what a developer needs?
Ah, you guessed my next paper :) I have done a little informal analysis of the top 10 results from linked_lists vs. the top 10 for that tag used as a Google query. They are quite different; almost totally different, actually. But this is not really surprising if you think of the developers curating the links posted on Stack Overflow. I know there was an attempt a few years back to build a search engine based on the SO data set; I don't know what happened to that.
Very cool I like it! I like that it adds whatever tag you clicked on to the top, but maybe you should save that between sessions. Also, a lot of the more popular languages will reflect what everyone here has already seen and worked with - which is cool because it does what you advertise, but the usefulness is somewhat limited for us folk. It would be cool to have a year/month/week filter to see what has been linked to the most lately (I just saw that was suggested earlier). Things like node would benefit from that since it's growing so fast, but everyone knows about express. A simple thing to enable a bit more usefulness would be to link to the actual Stack Overflow posts so we can look at the comments. Cool stuff though!
Thanks for the support. Yes to everything above :) We are currently mining the post history data to enable those kinds of time-range queries. Can't wait to get that out there; it should be very cool indeed. We also want to allow users to search by their SO ID and to filter their links by tag. (As an aside, once we mine the history we will be able to determine more accurately which users actually posted which links, rather than just the post owner.)
This is nicely done, but so far it isn't returning anything interesting for me. All the results are very basic (Django docs, PHP docs, etc.), which makes sense: the most often cited will be the most general.
What about a change in the algorithm to add some discovery here? What if you looked at votes for answers vs. cites, so that things with a high average vote:cite ratio rank higher?
Thanks. Yeah, I agree; our initial use case was to provide an interface to the dataset and to let us explore what kinds of things developers were sharing on SO. For the next version we are working on new ranking metrics that will improve the discovery aspect; vote:cite and view:cite are two we are looking at.
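A vote:cite ranking could be sketched like this (the field names and the minimum-citation cutoff are illustrative assumptions, not the site's actual schema; some floor on citations is needed so a one-off link with a single high-voted answer doesn't dominate):

```python
def rank_by_vote_cite(links, min_cites=5):
    """Order links by average answer score per citation.

    links: list of dicts with 'url', 'votes' (sum of answer scores),
    and 'cites' (citation count). Links below min_cites are dropped.
    """
    eligible = [l for l in links if l["cites"] >= min_cites]
    return sorted(eligible, key=lambda l: l["votes"] / l["cites"], reverse=True)
```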
Wow, that's really cool; I have to admit I didn't see that before. Let me know if you're interested in sharing data; we are doing a lot of research in this area, and the more data the better!
This is very cool!
This is only a minor quibble, but when you've selected, e.g., C and then C++, the 'x' to remove one or the other tag from the search results is nearly invisible. I only found it because I've used a similar 'x' box on a different site to remove results.
I actually had that in an earlier version and took it out just to simplify the design, but I am looking at it again to produce a better sorting experience. The number of views a post receives may also be a good metric.
This is an awesome idea! Is there any way to create a randomized list weighted with popularity (so for a given tag you can refresh to find new links, but still ones likely to be interesting)?
Thanks for the support, really encouraged by the feedback here on Hacker News. You guys are great.
To your question: yes, I am currently testing some ideas for a magic ranking system. One option is, as you say, a kind of random selection among popular or trending links. Another is a smart weighted sort when filtering by multiple tags. One problem right now is that if you add jQuery to your filter, the JavaScript results just dominate everything else.
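The popularity-weighted random selection could be a weighted sample without replacement, so each refresh surfaces a different but still likely-interesting set (a sketch; the function name and parameters are hypothetical):

```python
import random

def random_discovery(links, counts, k=10, seed=None):
    """Pick k distinct links, where a link's chance of being chosen
    is proportional to its citation count."""
    rng = random.Random(seed)
    pool, weights = list(links), list(counts)
    picked = []
    for _ in range(min(k, len(pool))):
        # Draw one index weighted by citation count, then remove it
        # so the same link cannot be picked twice.
        i = rng.choices(range(len(pool)), weights=weights)[0]
        picked.append(pool.pop(i))
        weights.pop(i)
    return picked
```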
Doesn't work on an iPad. I type in my search term, press enter, and nothing happens. Furthermore, when I enter my search term, the "Filter by tag (e.g. javascript)" text doesn't disappear.
Top Objective-C link points to ASIHTTPRequest, an obsolete networking library that is no longer supported (and even its developer recommends against using it).
Right, but there are a lot of people still maintaining code that uses it, and also a lot of tutorials people follow that likely use that framework.
I suspect it has more to do with the large amount of undefined/unspecified/implementation-defined behavior in C. A lot of questions on Stack Overflow can only be answered correctly by referring to the standard.
These are taken from the March 2013 data dump, which includes questions going right back to the start of SO, so some of these links have been collecting citations for a few years. We only mined the actual post content as of the date the dump was created; we did not mine the post history. But we are working on that right now; it's a lot of data to process :)
I dunno... 11 of the top 14 links under the "Javascript" tag are about jQuery (which has its own tag). I suppose it's not a flaw in your algorithm; the results just aren't very interesting.
I think JavaScript and jQuery suffer from their popularity and utility in this analysis. There is also the issue of the jQuery and javascript tags being used as synonyms. jQuery in particular is a 500-pound gorilla across the whole dataset; it shows up everywhere. As I mentioned above, there was a joke on SO at one point that the answer to nearly every question was jQuery with a link to the site. Apparently it wasn't a joke.
No, just the post bodies (questions and answers) at the moment, but we are working on parsing the comments and the post history; those datasets are about four times the size of the posts! So there are probably a lot more URLs in there, although we will have to decide whether to treat all URLs the same or to differentiate between URLs in the post bodies contained in the dump, URLs in the post history (which may have been removed from a post), and URLs in the comments. Maybe not a concern for the website, but more so for research.
You usually see notes or clarifications posted to questions in the form of comments first, where the linked pages tend to be more introductory and general-purpose than the specific ones you might find in answers.
Can't wait to see the updated stats.
If you could make "more" load more than just a few more records, though, that'd make it a lot easier to dig deeper.