Documentation of IRC node

Interviews: Wednesday morning

Getting the Github data:

Who has it?
  * Software Heritage https://www.softwareheritage.org/
  * Mining Software Repositories http://www.msrconf.org/
  * The GHTorrent project  http://ghtorrent.org/
    * This looks like it has what we need:
    * "GHTorrent monitors the Github public event time line. For each event, it retrieves its contents and their dependencies, exhaustively. It then stores the raw JSON responses to a MongoDB database, while also extracting their structure in a MySQL database."
  * Google Big Query? (fuck you Google, we don't need you ↑)
    * https://cloud.google.com/blog/products/gcp/github-on-bigquery-analyze-all-the-open-source-code
    * 2.8 million open source projects - this is much less than is available AFAIK

Alternative approaches:
   * Last ditch py3 async effort ... 90 million repos!?
   * Only projects starting with "a"
   * "reverse lookup" from IRC channels - topic settings, channel name?
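One way the "last ditch py3 async effort" could be structured (a sketch only: `fetch` is a stub standing in for an HTTP GET of a repo's README, and the semaphore limit is an arbitrary choice):

```python
import asyncio

# Hypothetical stand-in for an HTTP GET of one repo's README; a real run would
# fetch e.g. https://raw.githubusercontent.com/<repo>/master/README.md
async def fetch(repo):
    await asyncio.sleep(0)
    return repo, "ok"

async def crawl(repos, limit=100):
    # A semaphore caps how many fetches run at once, so even a 90-million-line
    # repo list never has more than `limit` requests in flight.
    sem = asyncio.Semaphore(limit)

    async def bounded(repo):
        async with sem:
            return await fetch(repo)

    # gather preserves input order in its results
    return await asyncio.gather(*(bounded(r) for r in repos))

results = asyncio.run(crawl(["torvalds/linux", "git/git"]))
```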

Question: shall we look at package managers like PyPI / Node etc. (like Nokia code compass):
    * Projects vs. library view

How many repositories on Github?
Things to consider:
  * Private vs. Public
Sources:
  * https://octoverse.github.com/ (github PR) - 96 million (40% increase from 2018)
  * https://en.wikipedia.org/wiki/GitHub (28 million public as of June 2018)

How many FLOSS project IRC channels?
 * freenode: <53,000 channels total
 * OFTC: 2500 channels total
 * darkfasel.net: 120 channels total
 * mozilla.org: 1772 channels total
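Channel totals like the ones above are announced by each network on connect, in the RPL_LUSERCHANNELS (254) numeric ("<count> :channels formed"). A sketch of pulling the number out of a raw line; `parse_luserchannels` is our own helper, not part of any IRC library:

```python
import re

def parse_luserchannels(line):
    """Return the channel count from a raw 254 numeric line, or None."""
    # e.g. ":irc.example.net 254 surveybot 52734 :channels formed"
    m = re.search(r"\s254\s+\S+\s+(\d+)\s+:channels formed", line)
    return int(m.group(1)) if m else None

count = parse_luserchannels(":irc.example.net 254 surveybot 52734 :channels formed")
```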

Moving on with GHTorrent:
    * Get it, load it into the database
      * latest dump is http://ghtorrent-downloads.ewi.tudelft.nl/mysql/mysql-2019-04-01.tar.gz
    * Try to point https://www.sqlalchemy.org/ at it to get a free DB mapping ( https://docs.sqlalchemy.org/en/latest/orm/extensions/automap.html)
      * Further documentation: http://ghtorrent.org/relational.html
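A minimal automap sketch, using an in-memory SQLite table as a stand-in for the loaded MySQL dump. The `projects` table name follows our reading of http://ghtorrent.org/relational.html; the real connection URL would point at the MySQL database instead:

```python
from sqlalchemy import create_engine, text
from sqlalchemy.ext.automap import automap_base
from sqlalchemy.orm import Session

# In-memory SQLite stand-in for the real dump; for MySQL the URL would be
# something like "mysql+pymysql://user:password@localhost/ghtorrent"
engine = create_engine("sqlite://")
with engine.begin() as conn:
    conn.execute(text("CREATE TABLE projects (id INTEGER PRIMARY KEY, name TEXT)"))
    conn.execute(text("INSERT INTO projects (name) VALUES ('example')"))

Base = automap_base()
Base.prepare(autoload_with=engine)   # builds one mapped class per reflected table
Projects = Base.classes.projects     # the "free DB mapping" we were hoping for

with Session(engine) as session:
    names = [p.name for p in session.query(Projects)]
```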
      
     * OH! We can only get the metadata of the project and not the file contents
     * However, it's a good enough set to get the list and SOME metadata
     * From there, we have a list of all public repositories, and from the metadata we can start to close down the search space with queries (like some below)
     * Then, it may be feasible to make a file search in some ad-hoc way ...
     
     * Lovely option for further exploring the data while waiting for the 96+ GB to come down
       * http://ghtorrent.org/services.html
       * "To obtain access, please send us your public key as described here. When we contact you back, you will be able to setup an SSH tunnel with the following command: ssh -L ..." !!!
      * We sent a pull request: https://github.com/ghtorrent/ghtorrent.org/pull/656 - we will get access soon!
      * Oh, and we sent an email:
          "
Dear Georgios,
First, an apology: sorry to bother you for this access request!
I have got your email from http://ghtorrent.org/faq.html and have
recently just submitted https://github.com/ghtorrent/ghtorrent.org/pull/656
to get access for the GHTorrent services. I would not email so quickly
but we are working in the context of only a few day work session[1], so
it is feeling urgent enough to write this ;)
We are hoping to learn more about the relationship of free software
projects and IRC.
Best,
Luke & maxigas
[1]: http://constantvzw.org/site/Call-for-participants-Networks-with-an-Attitude,3102.html
PS. Thanks for your great work on this project. A serious resource!
       "
      * In the meantime, we are downloading a smaller set so we can work towards preparing the queries / scripts

What are useful parameters for considering the projects for inclusion into search?
* (easy) contributor count
* (hard) sustained frequency of commits
* (easy) stars - relying on the dynamic of github
* (easy) watchers - perhaps a better indicator of "real" reliance on the project
* (easy) issue count (arguable)
* (hard) searching from search engines to check
* (hard) check against debian listings
  * https://pkgs.org
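The "easy" parameters map onto GHTorrent tables. A sketch of the contributor-count cut, run here against a toy SQLite copy of that layout; the table and column names (`projects`, `project_members`) are our reading of http://ghtorrent.org/relational.html, so check them against the real dump:

```python
import sqlite3

# Toy stand-in for the GHTorrent relational layout -- names are assumptions
db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE projects (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE project_members (repo_id INTEGER, user_id INTEGER);
INSERT INTO projects VALUES (1, 'busy'), (2, 'quiet');
INSERT INTO project_members VALUES (1,1),(1,2),(1,3),(1,4),(2,1);
""")

# Keep only projects with more than 3 contributors (the cut discussed under
# "Once we have the data")
rows = db.execute("""
    SELECT p.name, COUNT(DISTINCT m.user_id) AS contributors
    FROM projects p JOIN project_members m ON m.repo_id = p.id
    GROUP BY p.id
    HAVING contributors > 3
""").fetchall()
```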

Once we have the data:
    * Closing down the window/problem space
      * More than 3 contributors etc.
      * Reasoning: if we get 28 million projects and only find 1% of them are using IRC, then we will find it hard to continue speaking of IRC as relevant; but if we frame some conditions on the search space, we can give it more context etc.
Other interesting questions:
    * After 100+ contributors, did they switch to Slack?
    * ???

Scraper:

 * List of users/projects: .repos.txt 2.8 gigabytes, downloading to my computer

 * Scraper: 

$ parallel --joblog .job.log --jobs 100 -a .repos.txt curl --silent --fail --create-dirs -o {} -o {} -o {} https://raw.githubusercontent.com/{}/master/README.md https://raw.githubusercontent.com/{}/master/README https://raw.githubusercontent.com/{}/master/README.txt

RESUMING ON LUNA
$ parallel --resume --joblog .job.log --jobs 200 -a .antirepos.txt curl --silent --fail --create-dirs -o {} -o {} -o {} https://raw.githubusercontent.com/{}/master/README.md https://raw.githubusercontent.com/{}/master/README https://raw.githubusercontent.com/{}/master/README.txt


REVERSING:
    sort -r .repos.txt > .antirepos.txt




 * Analysis:

find * -type f -exec sh -c 'grep -H -E " #$(basename "$1")| IRC|irc:|freenode|Freenode|hackint| OFTC|oftc\." "$1"' foo {} \;
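The same pass, sketched in Python: the regex mirrors the grep alternation, `mentions_irc` is our own helper, and the guess that the repo name is the file's basename is an assumption about the scraper's `-o {}` output layout:

```python
import os
import re

# Mirrors the grep -E alternation above; oftc\. is a literal "oftc."
IRC_RE = re.compile(r" IRC|irc:|[Ff]reenode|hackint| OFTC|oftc\.")

def mentions_irc(path, text):
    """True if a downloaded README mentions IRC, or a channel named after its repo."""
    repo = os.path.basename(path)   # assumes the scraper stored each README as user/repo
    return bool(IRC_RE.search(text)) or (" #" + repo) in text

hit = mentions_irc("torvalds/linux", "Join us on freenode in #linux")
miss = mentions_irc("user/quiet", "No chat channels here")
```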

Contact at Software Heritage: Nicolas Dandrimont <olasd@softwareheritage.org>