Puppeteer Crawling GitHub Guide

Introduction#

Puppeteer is a Node.js library maintained by the Chrome team. It provides a set of APIs for driving headless Chrome (Chrome without a UI), which makes it suitable for scenarios such as web scraping and automation tasks.

Usage#

Installation#

npm install puppeteer-chromium-resolver --save
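
puppeteer-chromium-resolver wraps Puppeteer and resolves (or downloads) a local Chromium build for it to use. Roughly, it is used like the sketch below; the exact fields returned (puppeteer, executablePath) are assumptions based on the package's README and may differ between versions.

const PCR = require('puppeteer-chromium-resolver');

(async () => {
    // Resolve a local Chromium plus the bundled puppeteer instance
    // (field names assumed from the package docs)
    const stats = await PCR({});
    const browser = await stats.puppeteer.launch({
        executablePath: stats.executablePath,
        headless: true,
    });
    // ... crawl something ...
    await browser.close();
})();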

Launch/Close Browser#

// `puppeteer` here is either require('puppeteer') or the instance returned by puppeteer-chromium-resolver (see above)
const browser = await puppeteer.launch({
    args: ['--no-sandbox', '--disable-setuid-sandbox'],
    // Ignore HTTPS certificate errors when visiting https pages
    ignoreHTTPSErrors: true,
    headless: true, // Headless mode: run Chrome without showing a browser window
});

// Close the browser
await browser.close();

Create a New Tab and Navigate#

const page = await browser.newPage();
// Navigate to the repositories page of the specified GitHub user
await page.goto(`https://github.com/${name}?tab=repositories`);

Execute Functions in the Console (evaluate())#

// Get all repository URLs on the current page, plus the URL of the next page of repositories if there is one
const rep = await page.evaluate(() => {
    const url = document.querySelectorAll('.wb-break-all > a');
    const next = document.querySelector('.BtnGroup > a');
    let urlList;
    let nextUrl;
    if (url != null) {
        urlList = Array.prototype.map.call(url, (item) => item.href);
    }
    if (next != null && next.outerText === 'Next') {
        nextUrl = next.href;
    }
    return {
        urlList,
        nextUrl,
    };
});

Get Page Elements#

const el = await page.$(selector)

Click on an Element#

await el.click()

Enter Text#

await el.type(text)
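
These handle methods combine naturally: grab an element with page.$(), then interact with it. A small sketch, assuming GitHub's search box is reachable via the selector input[name="q"] (the selector is an assumption and may change as GitHub's markup evolves):

// Hypothetical example: type a query into GitHub's search box and submit it
const searchBox = await page.$('input[name="q"]');
if (searchBox != null) {
    await searchBox.click();            // focus the input
    await searchBox.type('puppeteer');  // type the query
    await page.keyboard.press('Enter'); // submit the search
}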

Scraping Data from GitHub#

Using express, I scraped the number of followers for a specified user, the dates and number of commits made on each day, and the URL, project name, commit count, and star count for each public project. The data is then returned in JSON format.
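
A minimal sketch of how such an endpoint can be wired up with express. The route path matches the example below, but getAllContributions() here is just a placeholder for the crawling logic in the actual project:

const express = require('express');
const app = express();

// Hypothetical route: crawl a user's profile and return the result as JSON.
// getAllContributions() stands in for the real Puppeteer crawling code in the project.
app.get('/getAllContributions/:name', async (req, res) => {
    try {
        const data = await getAllContributions(req.params.name);
        res.json(data);
    } catch (err) {
        res.status(500).json({ error: err.message });
    }
});

app.listen(4000);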

Project link: getGithub

Example usage:

http://localhost:4000/getAllContributions/Magren0321

Example response (screenshot): test.png

Challenges Encountered#

When scraping the number of contributions for each year, I ran into trouble reading the elements' attributes. Attributes such as data-date and data-count are custom attributes defined by GitHub, so they are not exposed as DOM properties and cannot be read directly; the getAttribute() method must be used instead.

The attribute API also provides setAttribute() and removeAttribute(), which set and remove attributes that are not part of the node's built-in properties.
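
A quick illustration of the three methods inside page.evaluate(); the selector and attribute name here are placeholders for demonstration only:

// Read, set, and remove a custom attribute on some element
const el = document.querySelector('.some-element');     // placeholder selector
if (el != null) {
    const value = el.getAttribute('data-example');       // read a custom attribute (null if absent)
    el.setAttribute('data-example', 'hello');            // create or overwrite the attribute
    el.removeAttribute('data-example');                  // delete the attribute again
}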

// Get the dates and number of commits for each day
async function getDateList(yearData, page) {
    await page.goto(yearData);
    const dateList = await page.evaluate(() => {
        const date = document.querySelectorAll('.ContributionCalendar-day');
        const datelist = [];
        for (const item of date) {
            // Only keep days that actually have contributions
            if (item.getAttribute('data-count') != 0 && item.getAttribute('data-count') != null) {
                datelist.push({
                    data_date: item.getAttribute('data-date'),
                    data_count: item.getAttribute('data-count'),
                });
            }
        }
        return datelist;
    });
    return dateList;
}

There are also some odd quirks in the GitHub data that I won't fully summarize here...
For example, what looked like an empty array turned out to have a length of 2 when I printed it; requesting the page directly showed that it actually contained two line breaks. 😵
All I can do is test against more data. 😔
