Documentation of IRC node
Interviews: Wednesday morning
Getting the Github data:
Who has it?
* Software Heritage
https://www.softwareheritage.org/
* Mining Software Repositories
http://www.msrconf.org/
* The GHTorrent project
http://ghtorrent.org/
* This looks like it has what we need:
* "GHTorrent monitors the Github public event time line. For each event, it retrieves its contents and their dependencies, exhaustively. It then stores the raw JSON responses to a MongoDB database, while also extracting their structure in a MySQL database."
* Google BigQuery? (fuck you Google, we don't need you ↑)
https://cloud.google.com/blog/products/gcp/github-on-bigquery-analyze-all-the-open-source-code
* 2.8 million open source projects - this is much less than is available AFAIK
Alternative approaches:
* Last ditch py3 async effort ... 90 million repos!?
* Only projects starting with "a"
* "reverse lookup" from IRC channels - topic settings, channel name?
Question: shall we look at package managers like PyPi / Node etc. (like Nokia code compass):
* Projects vs. library view
How many repositories on Github?
Things to consider:
* Private vs. Public
Sources:
* https://octoverse.github.com/
(github PR) - 96 million (40% increase from 2018)
* https://en.wikipedia.org/wiki/GitHub
(28 million public as of June 2018)
How many FLOSS project IRC channels?
* freenode: <53,000 channels total
* OFTC: 2500 channels total
* darkfasel.net: 120 channels total
* mozilla.org: 1772 channels total
Moving on with GHTorrent:
* Get it, load it into the database
* latest dump is
http://ghtorrent-downloads.ewi.tudelft.nl/mysql/mysql-2019-04-01.tar.gz
* Try to point https://www.sqlalchemy.org/ at it to get a free DB mapping
(https://docs.sqlalchemy.org/en/latest/orm/extensions/automap.html)
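The automap idea could be sketched roughly like this: SQLAlchemy reflects whatever schema it finds and hands back mapped classes for free. This is a minimal sketch against a throwaway in-memory SQLite table standing in for the dump; the toy table shape and the MySQL URL in the comment are assumptions, the real table/column names are in http://ghtorrent.org/relational.html.

```python
# Sketch: reflect an existing schema with SQLAlchemy automap, no hand-written
# models. Uses an in-memory SQLite stand-in; for the real dump, the URL would
# look something like "mysql+pymysql://user:pw@localhost/ghtorrent" (assumed).
from sqlalchemy import create_engine, text
from sqlalchemy.ext.automap import automap_base
from sqlalchemy.orm import Session

engine = create_engine("sqlite://")  # in-memory stand-in for the MySQL dump
with engine.begin() as conn:
    # Toy table loosely shaped like GHTorrent's "projects" table (assumption)
    conn.execute(text(
        "CREATE TABLE projects (id INTEGER PRIMARY KEY, name TEXT, language TEXT)"))
    conn.execute(text(
        "INSERT INTO projects (name, language) VALUES ('weechat', 'C'), ('irssi', 'C')"))

Base = automap_base()
Base.prepare(autoload_with=engine)   # reflect tables -> mapped classes
Projects = Base.classes.projects     # class generated from the table name

with Session(engine) as session:
    names = [p.name for p in session.query(Projects).all()]
print(names)
```

Automap only maps tables with a primary key, which the GHTorrent tables have, so the "free DB mapping" should come out of reflection alone.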
* Further documentation:
http://ghtorrent.org/relational.html
* OH! We can only get the metadata of the project and not the file contents
* However, it's a good enough set to get the list and SOME metadata
* From there, we have a list of all public repositories, and with the metadata we can start to close down the search space with queries (like some below)
* Then, it may be feasible to make a file search in some ad-hoc way ...
* Lovely option for further exploring the data while waiting for the 96+ GB to come down
*
http://ghtorrent.org/services.html
* "To obtain access, please send us your public key as described here. When we contact you back, you will be able to setup an SSH tunnel with the following command: ssh -L ..." !!!
* We sent a pull request:
https://github.com/ghtorrent/ghtorrent.org/pull/656
- we will get access soon!
* Oh, and we sent an email:
"
Dear Georgios,
First, an apology: sorry to bother you with this access request!
I got your email from
http://ghtorrent.org/faq.html
and have just submitted
https://github.com/ghtorrent/ghtorrent.org/pull/656
to get access to the GHTorrent services. I would not email so quickly,
but we are working in the context of a work session of only a few days[1],
so it feels urgent enough to write this ;)
We are hoping to learn more about the relationship of free software
projects and IRC.
Best,
Luke & maxigas
[1]:
http://constantvzw.org/site/Call-for-participants-Networks-with-an-Attitude,3102.html
PS. Thanks for your great work on this project. A serious resource!
"
* In the meantime, we are downloading a smaller set so we can work towards preparing the queries / scripts
What are useful parameters for considering the projects for inclusion into search?
* (easy) contributor count
* (hard) sustained frequency of commits
* (easy) stars - relying on the dynamic of github
* (easy) watchers - perhaps more indicating "real" reliance on the project
* (easy) issue count (arguable)
* (hard) searching from search engines to check
* (hard) check against debian listings
* https://pkgs.org
Once we have the data:
* Closing down the window/problem space
* More than 3 contributors etc.
* Reasoning: if we get 28 million projects and find that only 1% of them use IRC, it will be hard to keep speaking of IRC as relevant; but if we frame some conditions on the search space, we can give it more context etc.
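Once the dump is in a database, a cut like "more than 3 contributors" is a single query. A sketch with a made-up SQLite fixture; the table and column names only loosely follow the GHTorrent relational schema and should be checked against http://ghtorrent.org/relational.html before use.

```python
# Sketch of "closing down the search space" with one aggregate query.
# Table/column names (projects, project_members, repo_id, user_id) are
# assumptions modelled on the GHTorrent relational schema; data is fake.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE projects (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE project_members (repo_id INTEGER, user_id INTEGER);
INSERT INTO projects VALUES (1, 'tiny-tool'), (2, 'big-project');
INSERT INTO project_members VALUES (1, 10),
    (2, 10), (2, 11), (2, 12), (2, 13);
""")

rows = con.execute("""
    SELECT p.name, COUNT(m.user_id) AS contributors
    FROM projects p JOIN project_members m ON m.repo_id = p.id
    GROUP BY p.id
    HAVING contributors > 3        -- the "more than 3 contributors" cut
""").fetchall()
print(rows)
```

Each extra condition (stars, watchers, issue count) would just become another column in the HAVING/WHERE clause.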
Other interesting questions:
* After 100+ contributors, did they switch to Slack?
* ???
Scraper:
* List of users/projects: .repos.txt (2.8 gigabytes), downloading to my computer
* Scraper:
$ parallel --joblog .job.log --jobs 100 -a .repos.txt curl --silent --fail --create-dirs -o {} -o {} -o {} https://raw.githubusercontent.com/{}/master/README.md https://raw.githubusercontent.com/{}/master/README https://raw.githubusercontent.com/{}/master/README.txt
RESUMING ON LUNA
$ parallel --resume --joblog .job.log --jobs 200 -a .antirepos.txt curl --silent --fail --create-dirs -o {} -o {} -o {} https://raw.githubusercontent.com/{}/master/README.md https://raw.githubusercontent.com/{}/master/README https://raw.githubusercontent.com/{}/master/README.txt
REVERSING (so the resumed run works the list from the other end):
$ sort -r .repos.txt > .antirepos.txt
* Analysis:
find * -type f -exec sh -c 'grep -H -E " #$(basename "$1")| IRC|irc:|freenode|Freenode|hackint| OFTC|oftc\." "$1"' sh {} \;
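The same match logic as the grep one-liner, redone as a small Python function so the pattern is easier to tweak per repository; the function name and sample README strings are made up for illustration.

```python
# Flag a README as IRC-related if it mentions a channel named after the repo
# (" #<reponame>") or common IRC networks/URIs. Mirrors the grep -E pattern.
import re

IRC_RE = re.compile(r" IRC|irc:|freenode|Freenode|hackint| OFTC|oftc\.")

def mentions_irc(repo_name: str, readme: str) -> bool:
    # " #weechat" catches lines like "join us on #weechat"
    channel = re.compile(r" #" + re.escape(repo_name))
    return bool(IRC_RE.search(readme) or channel.search(readme))

print(mentions_irc("weechat", "Chat with us on #weechat (freenode)"))  # True
print(mentions_irc("leftpad", "A tiny string padding library."))       # False
```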
contact at Software Heritage : Nicolas Dandrimont olasd@softwareheritage.org