That's one reason I'd prefer that academics just put data into some kind of local university archive, where possible. Many universities provide resources to host scientific data (and have done so for decades, since the days of ftp.dept.university.edu servers), and putting it there makes it more likely that it'll still be there in 10 years. Torrents by comparison tend to be: 1) slow, as you rely on random seeders rather than a university that's peered onto Internet2 or the LambdaRail; and 2) unreliably seeded, as people drop off. Plus the workflow of "curl -O URL" is nicer than torrenting.
Universities typically have great bandwidth and good peering, and already host much larger data repositories than this seems to be targeting (e.g. here's a 30-terabyte repository, http://gis.iu.edu/), so they should be able to provide space for your local scientific data. Complain if not!
It's meant to include companion datasets for published papers, and gives out DOIs so datasets can be cited in other works. And it's mirrored at various universities to prevent loss.
Kind of solves a problem that doesn't exist, though, doesn't it? It isn't as if these universities are crying about bandwidth costs, and it isn't as if demand is maxing out their upstream.
It's not just about their bandwidth; some countries / zones / networks have much better local connectivity than external, particularly international.
For example, until a few years ago, some of our ISPs had different caps for national vs international traffic, and there were popular forks of P2P clients that allowed you to filter based on that.
We have since moved to unlimited everything, but I wouldn't be surprised if some countries still had different caps or speeds for international traffic.
So, it's quite cheap to get a seeding box from LeaseWeb, in ascending levels of sophistication:
* 100 Mbps unmetered, 2x2 TB: 39 EUR/mo
* 1 Gbps unmetered, 24x2 TB: 349 EUR/mo
* 10 Gbps unmetered, 24x2 TB: 1089 EUR/mo
I'm tempted to grab the first, and open a GitTip account in case anyone wants to chip in towards the second (4 TB isn't a lot of space as far as this stuff goes). The third is unlikely to be useful; this stuff is long-tail by nature, so storage is probably more important than bandwidth.
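As a rough back-of-envelope (assuming, unrealistically, that the line could stay saturated around the clock), even the cheapest box can move far more data per month than it can store:

    # Back-of-envelope: how much data a fully saturated unmetered line
    # could move in a month, vs. the 4 TB of disk on the cheapest box.
    # Assumes 24/7 saturation, which real seeding never achieves.
    SECONDS_PER_MONTH = 30 * 24 * 3600          # ~2.59 million seconds

    def monthly_transfer_tb(mbps):
        bytes_per_second = mbps * 1e6 / 8        # megabits -> bytes
        return bytes_per_second * SECONDS_PER_MONTH / 1e12

    for mbps in (100, 1000, 10000):
        print(f"{mbps:>5} Mbps -> ~{monthly_transfer_tb(mbps):,.0f} TB/month")
    # 100 Mbps alone works out to ~32 TB/month, roughly 8x the 4 TB of
    # storage, which is why storage looks like the real constraint.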
Though in a world containing Google Fiber, would it still be a valuable service?
There's a university box seeding the torrent I'm grabbing (2011 weather patterns), but it still seems to be going quite slowly.
$50/month for a side project that I want to grow and nurture isn't a lot by any stretch, even on a grad TA stipend. Given that you could subsidize it through your program, it becomes even cheaper.
I simply wish the messaging were clearer and told a story I could pass on to my friends, who are ultimately "too busy" to think about the value of this product.
Unfortunately "We've designed a distributed system for sharing enormous datasets - for researchers, by researchers. The result is a scalable, secure, and fault-tolerant repository for data, with blazing fast download speeds." Just isn't a story that I can tell to my buddies and get them excited.
Thanks for the comment. We've created a shorter, pitch-style presentation for the non-technical / too-busy crowd, which summarizes the benefits in a few minutes.
Wow, this is pretty cool -- one of the most direct approaches to open-data that I've seen so far (and the research world is of course in dire need of this kind of open data/connect-the-dots enabling effort)!
I think it would be pretty cool to have trending datasets on the front page (I'm sure you could do a small cron job that finds the most-downloaded per week/day/etc.).
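Just to sketch the idea (the schema here is entirely made up; I have no idea what the site's database actually looks like), the cron job could boil down to one aggregate query over a download log:

    # Hypothetical sketch of a "trending datasets" cron job.
    # Assumes a downloads(torrent_id, downloaded_at) log table and a
    # torrents(id, name) table -- both invented for illustration.
    import sqlite3

    def trending(db_path, days=7, limit=10):
        conn = sqlite3.connect(db_path)
        rows = conn.execute(
            """
            SELECT t.name, COUNT(*) AS hits
            FROM downloads d
            JOIN torrents t ON t.id = d.torrent_id
            WHERE d.downloaded_at >= datetime('now', ?)
            GROUP BY t.id
            ORDER BY hits DESC
            LIMIT ?
            """,
            (f"-{days} days", limit),
        ).fetchall()
        conn.close()
        return rows

    if __name__ == "__main__":
        for name, hits in trending("tracker.db", days=7):
            print(f"{hits:>6}  {name}")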
Also, while not a dire necessity, I think a cooler name would help this project fly farther -- you should be able to make a play on "data torrents", maybe something like datastorm/samplerain/datawave/dataswell/Acadata?
Anyway, trivial stuff aside, nice implementation -- bookmarked for when I get the urge to do a data-analysis project!
So what do I do if I want to seed them all? Also, are all the data sets (and other things) freely licensed, i.e. no “non-commercial use only” clauses or things of that nature? Can I count on this going forward?
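As far as I can tell there's no official bulk-mirroring tool; the crudest thing I can think of is to grab every .torrent file from a list of URLs and drop them into whatever watch directory your client monitors. A rough sketch (the URL list and watch path below are placeholders, not anything the site publishes):

    # Crude bulk-seeding sketch: fetch .torrent files from a list of URLs
    # and drop them into a BitTorrent client's watch directory.
    # "torrent_urls.txt" and the watch path are placeholders.
    import os
    import urllib.request

    WATCH_DIR = os.path.expanduser("~/torrents/watch")

    def mirror_all(url_list_path):
        os.makedirs(WATCH_DIR, exist_ok=True)
        with open(url_list_path) as f:
            for url in (line.strip() for line in f):
                if not url:
                    continue
                name = url.rstrip("/").split("/")[-1]
                dest = os.path.join(WATCH_DIR, name)
                if not os.path.exists(dest):
                    urllib.request.urlretrieve(url, dest)

    if __name__ == "__main__":
        mirror_all("torrent_urls.txt")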
Projects like this confirm my suspicion that traditional academic publishing is going to take a nosedive in the next few years. Working in this industry as I do, I don't see commercial publishers moving quickly enough to change.
Really love the idea of this and can't help but support the general ethos of it, even if it / its descendants will put a lot of us out of a job.
Brilliant idea if I understand it correctly. Just want to check that my use case would fit. I just submitted my first and main paper for my PhD to Icarus. I'm planning on soon uploading it to ArXiv as well. My paper is theoretical in nature and through a suite of Monte Carlo simulations I generated a few hundred MBs of data. Can I make use of this system as a way to deposit that data so that it's available to anyone that wants to verify the conclusions I reach in my paper and possibly extend the research?
I'm surprised they don't have the Google Books n-gram dataset [1]. Then again, maybe they're more focused on data that doesn't have a good home already than on mirroring.
Many of the datasets that I've seen in academia are stored in static SQL databases that tend to be about 10-20 terabytes. Where does this leave individuals with limited resources who would like to query large databases without having to juggle the data management side of research?
Is there software that makes database querying P2P-accessible?
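I don't know of a mature P2P query layer either. For the limited-resources case, the workaround I usually fall back on is to stream the dump and aggregate incrementally instead of loading it all. A minimal sketch (the file and column names are made up):

    # Streaming aggregation over a large CSV dump without loading it into
    # memory or a database. File and column names ("weather_2011.csv",
    # "station", "temp_c") are made up purely for illustration.
    import csv
    from collections import defaultdict

    def mean_by_key(path, key_col, value_col):
        totals = defaultdict(lambda: [0.0, 0])   # key -> [sum, count]
        with open(path, newline="") as f:
            for row in csv.DictReader(f):
                try:
                    key = row[key_col]
                    value = float(row[value_col])
                except (KeyError, ValueError, TypeError):
                    continue                      # skip malformed rows
                acc = totals[key]
                acc[0] += value
                acc[1] += 1
        return {k: s / n for k, (s, n) in totals.items() if n}

    if __name__ == "__main__":
        print(mean_by_key("weather_2011.csv", "station", "temp_c"))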
I'm able to reach Coursera just fine. I don't live in any of those countries (nor am I from any of them). I just thought it'd be nice to make the materials available to everyone, because that's the way it should be.
I use coursera downloader because it's hard to keep up with Coursera's own schedule. I already have a ton of materials from different courses on my computer and I would be happy to make them available to everyone, but my upload speed sucks.
This seems to be very focused on US academics; at least, that's the impression the ".edu" labelling gives me. It suggests that those torrents/datasets are of better quality.
I'm also missing a catalog on this tracker; some basic taxonomy would be most welcome...
I didn't get that impression. Are you referring to the ".edu" address of the creators of the site? Do you mean people with a ".edu" address, and therefore at an American institution, give you a sense of their work being higher quality?
I think he's referring to the "[edu]" label on the browsing pages (like [0]), which indicates that the uploader has a .edu email address. I'm not too sure about other countries, but at least in Germany, not many academic institutions actually have those, just normal .de ones.
To clarify: torrents are marked "edu" if the uploader has a .edu address, which makes those torrents stand out. The majority of non-US universities do not offer *.edu addresses to their staff and students.
Yes, you're right. Which then brings up the question of how to determine whether the data comes from an "academic" address. As was pointed out, only US institutions, or institutions accredited by the US Dept. of Education, can apply for a .edu domain, meaning nobody outside the US has one.
One problem with offering a dataset as a torrent is that it's impossible to edit it after it's released. However, it seems like that doesn't matter at all in this case, because any scenario I can think of which could be solved by editing the dataset (like redacting private info that was accidentally included) wouldn't avoid the original problem: that they accidentally released private info in the first place. Perhaps it'd be useful to edit the original dataset in order to add to it / enhance it with more info, but in that case they could just release a second dataset as an addendum.
There are attempts to feel out a process for "updating torrents". However, this is far from becoming standard practice in the BitTorrent ecosystem. Check this[0] out for more info.
> but in that case they could just release a second dataset as an addendum
Or for some data it would make sense to partition the data into smaller chunks instead of one huge archive. That way, adding a chunk (the new year's data for a multi-year dataset, perhaps) just means releasing a new torrent with the extra archive in it and a name meaningful enough to indicate the difference. Anyone with the last set could then just download the new partition (and any modified ones).
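A toy sketch of that partitioning, assuming one big CSV with a date column (my own example, not how any particular dataset here is actually laid out):

    # Toy sketch of splitting one big CSV into per-year files, so that a
    # new year's data can later be released as its own torrent. The input
    # file and "date" column are assumptions for illustration.
    import csv
    import os

    def split_by_year(path, date_col="date", out_dir="chunks"):
        os.makedirs(out_dir, exist_ok=True)
        writers = {}    # year -> (file handle, csv writer)
        with open(path, newline="") as f:
            reader = csv.DictReader(f)
            for row in reader:
                year = row[date_col][:4]      # e.g. "2011-06-01" -> "2011"
                if year not in writers:
                    fh = open(os.path.join(out_dir, f"data_{year}.csv"),
                              "w", newline="")
                    w = csv.DictWriter(fh, fieldnames=reader.fieldnames)
                    w.writeheader()
                    writers[year] = (fh, w)
                writers[year][1].writerow(row)
        for fh, _ in writers.values():
            fh.close()

    if __name__ == "__main__":
        split_by_year("all_years.csv")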
I used BT Sync for a couple of weeks to sync data between my own machines. It works neatly. One question here: when you modify some part of a big file, does the program send only the difference to the other authorized machines, or the entire file? Let's say a researcher exports data I'm interested in to a 1 GB CSV file. I download it. The following week the same researcher updates her CSV with more data, and now it's 1.01 GB in size. How big will my next download be?
Seems as though it supports patching, so only the parts that changed would be synced. Of course, the download size depends entirely on which parts of the file changed.
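I don't know BT Sync's exact chunking scheme, but even a naive fixed-block comparison shows why it depends on where the change lands: data appended at the end leaves earlier blocks untouched, while an insertion near the start shifts everything after it. A rough sketch:

    # Naive estimate of how much of a file would need re-transfer under a
    # fixed-size block scheme. This is a simplification of what BT Sync /
    # rsync-style tools actually do (they use rolling hashes to survive
    # insertions), but it shows the append-vs-insert difference.
    import hashlib

    BLOCK = 4 * 1024 * 1024    # 4 MiB blocks (arbitrary choice)

    def block_hashes(path):
        hashes = []
        with open(path, "rb") as f:
            while chunk := f.read(BLOCK):
                hashes.append(hashlib.sha1(chunk).hexdigest())
        return hashes

    def changed_bytes(old_path, new_path):
        old, new = block_hashes(old_path), block_hashes(new_path)
        changed = sum(1 for a, b in zip(old, new) if a != b)
        changed += max(0, len(new) - len(old))   # blocks only in the new file
        return changed * BLOCK

    if __name__ == "__main__":
        # For a 1 GB CSV that only grew by ~10 MB at the end, this reports
        # on the order of 10 MB, not the whole gigabyte.
        print(changed_bytes("data_v1.csv", "data_v2.csv"))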
If it's stored on ZFS, copy-on-write will let you keep an edited copy that only stores the changed blocks, and deduplication could claw back even more space (if necessary, and RAM permits).
Excellent! It's far too early to tell, but I'd like to be hopeful that this distribution network could be another nail in the coffin of the old, expensive, dead-tree journals.