Academic Torrents

cing · on Jan 30, 2014

The team should learn from the ghost-town that is BioTorrents[1] and offer more than just a tracker. [1] http://www.biotorrents.net/browse.php?incldead=1

_delirium · on Jan 30, 2014

That's one reason I'd prefer that academics just put data into some kind of local university archive, where possible. Many universities provide resources to host scientific data (and have done so for decades, since the days of ftp.dept.university.edu servers), and putting it there makes it more likely that it'll still be there in 10 years. Torrents by comparison tend to be: 1) slow, as you rely on random seeders rather than a university that's peered onto Internet2 or the LambdaRail; and 2) unreliably seeded, as people drop off. Plus the workflow of "curl -O URL" is nicer than torrenting.

Universities typically have great bandwidth and good peering, and already host much larger data repositories than this seems to be targeting (e.g. here's a 30-terabyte repository, http://gis.iu.edu/), so they should be able to provide space for your local scientific data. Complain if not!

capnrefsmmat · on Jan 30, 2014

Another alternative is something like the Dryad Digital Repository:

http://datadryad.org/

It's meant to include companion datasets for published papers, and gives out DOIs so datasets can be cited in other works. And it's mirrored at various universities to prevent loss.

runarberg · on Jan 30, 2014

Perhaps if universities would robustly seed their staffs and students torrents?

glomph · on Jan 30, 2014

Kind of solves a problem that doesn't exist though doesn't it? It isn't like these universities are crying about bandwidth costs and it isn't like demand is maxing out their upstream.

icebraining · on Jan 30, 2014

It's not just about their bandwidth; some countries / zones / networks have much better local connectivity than external, particularly international.

For example, until a few years ago, some of our ISPs had different caps for national vs international traffic, and there were popular forks of P2P clients that allowed you to filter based on that.

We have since moved to unlimited everything, but I wouldn't be surprised if some countries still had different caps or speeds for international traffic.

nl · on Jan 30, 2014

I agree about university data.

But there is a need for a way to distributed large datasets that come out of nonacademic projects.

For example, the DBPedia data dumps are very slow to download at the moment.

voltagex_ · on Jan 30, 2014

You can have both an use a web seed with most clients.

reitzensteinm · on Jan 30, 2014

So, it's quite cheap to get a seeding box from LeaseWeb, in ascending levels of sophistication:

* 100mbps unmetered 2x2tb 39 eur/mo

* 1gbps unmetered 24x2tb 349 eur/mo

* 10gbps unmetered 24x2tb 1089 eur/mo

I'm tempted to grab the first, and open a GitTip account in case anyone wants to chip in towards the second (4tb isn't a lot of space as far as this stuff goes). The third is unlikely to be useful; this stuff is long tail by its nature, so storage is probably more important.

Though in a world containing Google Fiber, would it still be a valuable service?

There's a university box seeding the torrent I'm grabbing (2011 weather patterns), but it still seems to be going quite slowly.

blueblob · on Jan 30, 2014

That does not sound cheap to me as a graduate student.

aroch · on Jan 30, 2014

$50/month for a side project that I want to grow and nurture isn't a lot by any stretch, even on a gTA stipend. Given that you could subsidize it through your program, it becomes even cheaper

mathgladiator · on Jan 30, 2014

like: http://docs.aws.amazon.com/AmazonS3/latest/dev/S3Torrent.htm...

TheBiv · on Jan 30, 2014

This is really cool.

I simply wished that the messaging was more clear and told a story that I could tell to my friends who ultimately are "too busy" to think about the value of this product.

Unfortunately "We've designed a distributed system for sharing enormous datasets - for researchers, by researchers. The result is a scalable, secure, and fault-tolerant repository for data, with blazing fast download speeds." Just isn't a story that I can tell to my buddies and get them excited.

henryzlo · on Jan 30, 2014

Thanks for the comment. We've created a shorter "pitch" style presentation for the non-technical / too-busy, which summarizes the benefits etc. in a short several minute description.

https://docs.google.com/presentation/d/1JC2d1g9U6HaenGSn_Xvk...

Schwolop · on Jan 30, 2014

Respectfully - that's an EVEN LONGER message. You need a sentence. Ten words, tops.

gog · on Jan 30, 2014

What do you use for the tracker?

hardwaresofton · on Jan 30, 2014

Wow, this is pretty cool -- one of the most direct approaches to open-data that I've seen so far (and the research world is of course in dire need of this kind of open data/connect-the-dots enabling effort)!

I think it would be pretty cool to have trending datasets on the front page (I'm sure you could do a small cron that would find the most-downloaded per-week/per-day/etc)

Also, while not a dire necessity, I think a cooler name would help this project fly farther -- You should be able to make a play on "data torrents", maybe something like datastorm/samplerain/datawave/dataswell/Acadata?

Any way, trivial stuff aside, nice implementation -- bookmarked for when I get the urge to do a data-analysis project!

yinghang · on Jan 30, 2014

Thanks! But this is not my project. It is something created by a grad student I met just a couple hours ago at a hack night discussion.

teddyh · on Jan 30, 2014

So what do I do if I want to seed them all? Also, are all the data sets (and other things) freely licensed, i.e. no “non-commercial use only” clauses or things of that nature? Can I count on this going forward?

jakeogh · on Jan 30, 2014

A few TB of FOIA information related to the September 11th attacks is available via BT.

Direct link: http://911datasets.org/images/911datasets.org_all_torrents_J...

dav- · on Jan 30, 2014

Any reason passwords for user accounts are limited to 40 chars?

ses · on Jan 30, 2014

Projects like this confirm my suspicion that traditional academic publishing is going to take a nosedive in the next few years. Working in this industry as I do, I don't see commercial publishers moving quickly enough to change. Really love the idea of this and can't help but support the general ethos of it, even if it / its descendants will put a lot of us out of a job.

kartikkumar · on Jan 31, 2014

Brilliant idea if I understand it correctly. Just want to check that my use case would fit. I just submitted my first and main paper for my PhD to Icarus. I'm planning on soon uploading it to ArXiv as well. My paper is theoretical in nature and through a suite of Monte Carlo simulations I generated a few hundred MBs of data. Can I make use of this system as a way to deposit that data so that it's available to anyone that wants to verify the conclusions I reach in my paper and possibly extend the research?

thedudemabry · on Jan 30, 2014

Wow! That's a snappy site. Major props to the frontend dev(s).

kirubakaran · on Jan 30, 2014

Looks like just stock Bootstrap (not that there is anything wrong with that).

pointernil · on Jan 30, 2014

True. I guess ppl of academia are used to "different" quality/snappiness levels ;)

csense · on Jan 30, 2014

I'm surprised they don't have the Google Books n-gram dataset [1]. Then again, maybe they're more focused on data that doesn't have a good home already than on mirroring.

[1] http://storage.googleapis.com/books/ngrams/books/datasetsv2....

lancemjoseph · on Jan 31, 2014

Many of the datasets that I've seen in academia are stored in static SQL databases that tend to be about 10-20 terabytes. Where does this leave individuals with limited resources who would like to query large databases without having to juggle the data management side of research? Are there softwares that make database querying P2P accessible?

macarthy12 · on Jan 30, 2014

The problem is the word "torrent". Too many negative connotations for many in the traditional academic world.

teddyh · on Jan 30, 2014

And you’re writing this on “Hacker News”…

ctrl · on Jan 31, 2014

Maybe this can help in turning around those connotations.

shitlord · on Jan 31, 2014

I have an idle server with 500 Mb/s upload. Now I can finally put it to good use! :)

incogmind · on Jan 30, 2014

I remember the old days of DC++ whenever I hear blazing fast speeds.

linux_devil · on Jan 30, 2014

Great ! Looking forward to coursera, edx and ocw videos too

dombili · on Jan 30, 2014

That'd be great actually. Especially for those who're not able to reach Coursera because of the stupid laws.

sitkack · on Jan 31, 2014

If you can get a cheap VPS in the US, you can use coursera downloader to grab all the content and then rsync it to your home country.

dombili · on Jan 31, 2014

I'm able to reach Coursera just fine. I don't live in any of those countries (nor I'm from any of them). I just thought it'd be nice to make them available to everyone, because that's the way it should be.

I use coursera downloader because it's hard to keep up with Coursera's own schedule. I already have a ton of materials from different courses on my computer and I would be happy to make them available to everyone, but my upload speed sucks.

alagappanr · on Jan 30, 2014

We would need a significant number of seeders in order for this to become a successfully used product. Perhaps, universities can seed data?

nvdk · on Jan 30, 2014

this seems to be very focused on US academics, at least that is what impression I'm given by labeling ".edu" addresses. It gives a feeling that these torrents/datasets are of better quality. I'm also missing a catalog on this tracker, some basic taxonomy would be most welcome...

jsumrall · on Jan 30, 2014

I didn't get that impression. Are you referring to the ".edu" address of the creators of the site? Do you mean people with a ".edu" address, and therefore at an American institution, give you a sense of their work being higher quality?

mineo · on Jan 30, 2014

I think he's referring to the "[edu]" label on the browsing pages (like [0]) which indicates that the uploader has a .edu email address. I'm not too sure about other countries, but at least in Germany, not many academical institutions actually have those, just normal .de ones.

[0] http://academictorrents.com/browse.php?cat=5

nvdk · on Jan 30, 2014

to clarify: torrents are marked "edu" if the user has a .edu address, this makes those torrents stand out. The majority of non us universities do not offer *.edu addresses to their staff and students.

jsumrall · on Jan 31, 2014

Yes, you're right. Which then brings up the question about how to determine if the data comes from an "academic" address, as was pointed out, only US institutions or institutions which are accredited by the US Dept. of Education can apply for an .edu top level domain— meaning nobody has it.

mathattack · on Jan 30, 2014

I am no expert on torrents, but I like this conceptually. Publicly funded academic research should be free.

talles · on Jan 30, 2014

What a wonderful idea. This fits so well with the torrent protocol (maybe even philosophically speaking).

huevosabio · on Jan 30, 2014

This awesome! Thanks for sharing!

erikb · on Jan 30, 2014

awesome invention! Could this be connected with the google scholar to add keyword searching?

guspe · on Jan 30, 2014

Aaron Swartz's dream come true?

skrebbel · on Jan 30, 2014

maerF0x0 · on Jan 30, 2014

what exactly was his dream?

sillysaurus2 · on Jan 30, 2014

One problem with offering a dataset as a torrent is that it's impossible to edit it after it's released. However, it seems like that doesn't matter at all in this case, because any scenario I can think of which could be solved by editing the dataset (like redacting private info that was accidentally included) wouldn't avoid the original problem: that they accidentally released private info in the first place. Perhaps it'd be useful to edit the original dataset in order to add to it / enhance it with more info, but in that case they could just release a second dataset as an addendum.

So the core idea seems solid. Thank you for this!

jzelinskie · on Jan 30, 2014

There are attempts to feel out a process for "updating torrents". However, this is long from becoming a standard practice in the BitTorrent ecosystem. Check this[0] out for more info.

[0] http://www.bittorrent.org/beps/bep_0039.html

dnautics · on Jan 31, 2014

One problem with offering a dataset as a torrent is that it's impossible to edit it after it's released.

for scientific endeavours, this should be considered a feature, not a bug.

dspillett · on Jan 30, 2014

> but in that case they could just release a second dataset as an addendum

Or for some data it would make sense to partition the data into smaller chunks instead of one huge archive. That way adding a chunk (the new year's data for a multi-year dataset perhaps) just menas releasing a new torrent with the extra srchive in and a name meaningful enough to indicate the difference. Anyone with the last set could then just download the new partition (and any modified ones).

pixelcort · on Jan 30, 2014

BitTorrent Sync might be useful for that:

http://www.bittorrent.com/sync

rodolphoarruda · on Jan 30, 2014

I'd used BT Sync for a couple of weeks to sync data between my own machines. It works neatly. One question here. When you modify some part of a big file, does the program send out only the difference to the other authorized machines, or entire file? Let's say a researcher exports her data to a 1GB CSV file of my interest. I download it. In the following week the same researcher updates her CVS with more data, now it has 1.01GB in size. How big my next download will be?

lojack · on Jan 30, 2014

Seems as though it supports patching so only the parts that are changed would be synced. Of course, the download size is completely dependent on what parts of the file were changed.

http://forum.bittorrent.com/topic/19228-latest-sync-build-11...

whadar · on Jan 30, 2014

Hopefully sharefest.me would be another alternative pretty soon

axman6 · on Jan 31, 2014

If it's stored on ZFS, Copy on Write will let you edit a copy that only stores the changed files, and deduplication could give back even more space (if necessary and RAM permits).

jackmaney · on Jan 30, 2014

Excellent! It's far too early to tell, but I'd like to be hopeful that this distribution network could be another nail in the coffin of the old, expensive, dead-tree journals.

rfoeorfisdjus · on Jan 30, 2014

I guess you mean papers. Go back to reddit, libtard.

jackmaney · on Jan 30, 2014

Yes, I do mean papers. I'm not on reddit, you mouth-breathing neanderthal.

CompleteMoron · on Jan 30, 2014

thanks for sharing! I shall store in the vault of Hard Drives I keep here in the desert