AI Powered Blog Writer using Scrapeless and Pinecone Database

If you're an experienced content creator on a startup team, you know how much product content there is to keep up with. Not only do you need to publish a large number of traffic-driving blogs to grow website traffic quickly, you also need to prepare 2-3 posts per week tied to product-update promotion.
Compared with spending heavily on paid-ad bidding in exchange for higher placements and more exposure, content marketing still has irreplaceable advantages: broad topic coverage, low-cost customer-acquisition testing, high output efficiency, relatively low effort, and a rich base of domain knowledge.
But what does all that content marketing actually deliver?
Unfortunately, many articles end up buried on page 10 of Google search.
Is there a good way to avoid producing those "low-traffic" articles in the first place?
Have you ever wanted to create a self-updating SEO writer that clones the knowledge of top-performing blogs and generates fresh content at scale?
In this guide, we'll walk you through building a fully automated SEO content generation workflow using n8n, Scrapeless, Gemini (you can swap in alternatives such as Claude or OpenRouter), and Pinecone.
This workflow uses a Retrieval-Augmented Generation (RAG) system to collect, store, and generate content based on existing high-traffic blogs.
YouTube tutorial: https://www.youtube.com/watch?v=MmitAOjyrT4
What Does This Workflow Do?
This workflow involves four steps:
- Part 1: Call Scrapeless Crawl to crawl all sub-pages of the target website, and use Scrape to deeply analyze the entire content of each page.
- Part 2: Store the crawled data in the Pinecone Vector Store.
- Part 3: Use Scrapeless's Google Search node to analyze the value of the target topic or keywords.
- Part 4: Pass instructions to Gemini, pull contextual content from the prepared database through RAG, and produce the target blogs or answer questions.

If you haven't heard of Scrapeless, it’s a leading infrastructure company focused on powering AI agents, automation workflows, and web crawling. Scrapeless provides the essential building blocks that enable developers and businesses to create intelligent, autonomous systems efficiently.
At its core, Scrapeless delivers browser-level tooling and protocol-based APIs—such as headless cloud browser, Deep SERP API, and Universal Crawling APIs—that serve as a unified, modular foundation for AI agents and automation platforms.
It is built with AI applications in mind, because AI models are not always up to date, whether with current events or new technologies.
Beyond n8n, Scrapeless can also be called through its API, and nodes are available on mainstream automation platforms such as Make.
You can also use it directly on the official website.
We first need to install the Scrapeless community node in n8n:
- Go to Settings > Community Nodes
- Search for n8n-nodes-scrapeless and install it


Credential Connection
Scrapeless API Key
In this tutorial, we will use the Scrapeless service. Please make sure you have registered and obtained the API Key.
- Sign up on the Scrapeless website to get your API key and claim the free trial.
- Then, you can open the Scrapeless node, paste your API key in the credentials section, and connect it.

Pinecone Index and API Key
After crawling the data, we will integrate and process it and collect all the data into the Pinecone database. We need to prepare the Pinecone API Key and Index in advance.
Create API Key
After logging in, click API Keys → Create API key → enter a name for your API key → Create key. Now you can set it up in the n8n credentials.
⚠️ After the creation is complete, please copy and save your API Key. For data security, Pinecone will no longer display the created API key.

Create Index
Click Index to open the creation page. Set the index name → select a model under Configuration → set the appropriate Dimension → Create index.
Two common dimension settings:
- Google Gemini Embedding-001 → 768 dimensions
- OpenAI's text-embedding-3-small → 1536 dimensions
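If you prefer to create the index programmatically instead of through the dashboard, the Pinecone JavaScript SDK can do the same thing. Here is a minimal sketch assuming a serverless index and the 768-dimension Gemini embedding model; the index name, cloud, and region are placeholders:
JavaScript
// npm install @pinecone-database/pinecone
const { Pinecone } = require('@pinecone-database/pinecone');

const pc = new Pinecone({ apiKey: process.env.PINECONE_API_KEY });

async function createBlogIndex() {
  // 768 dimensions matches Google Gemini embedding-001;
  // use 1536 for OpenAI's text-embedding-3-small instead.
  await pc.createIndex({
    name: 'scrapeless-blog', // placeholder index name
    dimension: 768,
    metric: 'cosine',
    spec: { serverless: { cloud: 'aws', region: 'us-east-1' } },
  });
}

createBlogIndex().catch(console.error);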

Phase 1: Scrape and Crawl Websites for Knowledge Base

The first phase aggregates all of the blog content. Crawling content broadly gives our AI agent data sources across every topic area, which helps ensure the quality of the final articles.
- The Scrapeless node crawls the article page and collects all blog post URLs.
- Then it loops through every URL, scrapes the blog content, and organizes the data.
- Each blog post is embedded using your AI model and stored in Pinecone.
- In our case, we scraped 25 blog posts in just a few minutes — without lifting a finger.
Scrapeless Crawl node
This node crawls all content of the target blog website, including metadata and sub-page content, and exports it in Markdown format. This is large-scale content crawling that would be slow to reproduce with hand-written code.
Configuration:
- Connect your Scrapeless API key
- Resource: Crawler
- Operation: Crawl
- Input your target scraping website. Here we use https://www.scrapeless.com/en/blog as a reference.

Code node
After getting the blog data, we need to parse it and extract the structured information we need.

The following is the code I used. You can refer to it directly:
JavaScript
return items.map(item => {
  const md = $input.first().json['0'].markdown;

  // Guard against missing or non-string Markdown content.
  if (typeof md !== 'string') {
    console.warn('Markdown content is not a string:', md);
    return {
      json: {
        title: '',
        mainContent: '',
        extractedLinks: [],
        error: 'Markdown content is not a string'
      }
    };
  }

  // The first level-1 heading is treated as the article title.
  const articleTitleMatch = md.match(/^#\s*(.*)/m);
  const title = articleTitleMatch ? articleTitleMatch[1].trim() : 'No Title Found';

  // Remove the title line; everything else is the main content.
  let mainContent = md.replace(/^#\s*.*(\r?\n)+/, '').trim();

  // Extract Markdown links; excluding '#' in the URL pattern stops
  // the match before any fragment identifier.
  const extractedLinks = [];
  const linkRegex = /\[([^\]]+)\]\((https?:\/\/[^\s#)]+)\)/g;
  let match;
  while ((match = linkRegex.exec(mainContent))) {
    extractedLinks.push({
      text: match[1].trim(),
      url: match[2].trim(),
    });
  }

  return {
    json: {
      title,
      mainContent,
      extractedLinks,
    },
  };
});
Node: Split out
The Split Out node separates the cleaned data into individual items so we can work with the URLs and text content we need.
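For intuition, the Split Out step is roughly equivalent to the following Code-node logic, assuming the extractedLinks field produced by the parser above:
JavaScript
// Rough equivalent of the Split Out node: turn the extractedLinks
// array from the previous Code node into one item per link.
return $input.all().flatMap(item =>
  (item.json.extractedLinks || []).map(link => ({
    json: {
      text: link.text,
      url: link.url,
    },
  }))
);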

Loop Over Items + Scrapeless Scrape

Loop Over Items
Use the Loop Over Items node together with Scrapeless's Scrape to repeatedly perform scraping tasks and deeply analyze all the items obtained previously.

Scrapeless Scrape
The Scrape node crawls the full content of each previously obtained URL, so every page can be analyzed in depth. The result is returned in Markdown format, with metadata and other information included.

Phase 2: Store Data in Pinecone
We have successfully extracted the entire content of the Scrapeless blog page. Now we need to access the Pinecone Vector Store to store this information so that we can use it later.

Node: Aggregate
To store the data in the knowledge base conveniently, we need to use the Aggregate node to combine all the content.
- Aggregate: All Item Data (Into a Single List)
- Put Output in Field: data
- Include: All Fields

Node: Convert to File
Great! All the data has been successfully combined. Now we need to convert it into a text format that Pinecone can read directly. To do this, just add a Convert to File node.

Node: Pinecone Vector store
Now we need to configure the knowledge base. The nodes used are:
- Pinecone Vector Store
- Google Gemini
- Default Data Loader
- Recursive Character Text Splitter
Together, these four nodes split, embed, and load the data we have collected into the Pinecone knowledge base.
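Conceptually, what these nodes automate looks like the stand-alone sketch below: chunk each post, embed the chunks with Gemini's embedding-001 model, and upsert the vectors into Pinecone. The index name and the simple chunking function are illustrative assumptions, not the workflow's exact settings:
JavaScript
// npm install @pinecone-database/pinecone @google/generative-ai
const { Pinecone } = require('@pinecone-database/pinecone');
const { GoogleGenerativeAI } = require('@google/generative-ai');

const pc = new Pinecone({ apiKey: process.env.PINECONE_API_KEY });
const genAI = new GoogleGenerativeAI(process.env.GEMINI_API_KEY);
const embedder = genAI.getGenerativeModel({ model: 'embedding-001' });

// Naive character-based chunking; the Recursive Character Text Splitter
// node does this more carefully (overlap, separators).
function chunkText(text, size = 1000) {
  const chunks = [];
  for (let i = 0; i < text.length; i += size) chunks.push(text.slice(i, i + size));
  return chunks;
}

async function storeBlogPost(title, content) {
  const index = pc.index('scrapeless-blog'); // placeholder index name
  const chunks = chunkText(content);

  for (let i = 0; i < chunks.length; i++) {
    // Embed each chunk, then write the vector plus metadata to Pinecone.
    const { embedding } = await embedder.embedContent(chunks[i]);
    await index.upsert([
      {
        id: `${title}-${i}`,
        values: embedding.values, // 768-dimensional vector
        metadata: { title, text: chunks[i] },
      },
    ]);
  }
}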

Phase 3: SERP Analysis Using AI

To ensure you're writing content that ranks, we perform a live SERP analysis:
- Use the Scrapeless Deep SerpApi to fetch search results for your chosen keyword
- Input both the keyword and search intent (e.g., Scraping, Google trends, API)
- The results are analyzed by an LLM and summarized into an HTML report
Node: Edit Fields
The knowledge base is ready! Now it’s time to determine our target keywords. Fill in the target keywords in the content box and add the intent.

Node: Google Search
The Google Search node calls Scrapeless's Deep SerpApi to retrieve search results for the target keywords.

Node: LLM Chain
Building an LLM Chain with Gemini helps us analyze the data obtained in the previous steps. We explain to the LLM the reference input and the intent so that it can generate output that better meets our needs.

Node: Markdown
Since the LLM usually returns results in Markdown, which isn't the clearest format for end users, add a Markdown node to convert the LLM's output into HTML.
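For reference, this conversion is conceptually the same as running a Markdown library over the LLM output; a minimal sketch using the marked package (the sample input string is just an illustration):
JavaScript
// npm install marked
const { marked } = require('marked');

// Convert the LLM's Markdown answer into HTML for rendering.
const markdownFromLlm = '# Report\n\nSome **analysis** from the LLM...';
const html = marked.parse(markdownFromLlm);
console.log(html);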
Node: HTML
Now we need to use the HTML node to standardize the results, presenting the relevant content intuitively in a blog/report format.
- Operation: Generate HTML Template
The following code is required:
HTML
<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="UTF-8" />
  <title>Report Summary</title>
  <link href="https://fonts.googleapis.com/css2?family=Inter:wght@400;600;700&display=swap" rel="stylesheet">
  <style>
    body {
      margin: 0;
      padding: 0;
      font-family: 'Inter', sans-serif;
      background: #f4f6f8;
      display: flex;
      align-items: center;
      justify-content: center;
      min-height: 100vh;
    }
    .container {
      background-color: #ffffff;
      max-width: 600px;
      width: 90%;
      padding: 32px;
      border-radius: 16px;
      box-shadow: 0 10px 30px rgba(0, 0, 0, 0.1);
      text-align: center;
    }
    h1 {
      color: #ff6d5a;
      font-size: 28px;
      font-weight: 700;
      margin-bottom: 12px;
    }
    h2 {
      color: #606770;
      font-size: 20px;
      font-weight: 600;
      margin-bottom: 24px;
    }
    .content {
      color: #333;
      font-size: 16px;
      line-height: 1.6;
      white-space: pre-wrap;
    }
    @media (max-width: 480px) {
      .container {
        padding: 20px;
      }
      h1 {
        font-size: 24px;
      }
      h2 {
        font-size: 18px;
      }
    }
  </style>
</head>
<body>
  <div class="container">
    <h1>Data Report</h1>
    <h2>Processed via Automation</h2>
    <div class="content">{{ $json.data }}</div>
  </div>
  <script>
    console.log("Hello World!");
  </script>
</body>
</html>
This report includes:
- Top-ranking keywords and long-tail phrases
- User search intent trends
- Suggested blog titles and angles
- Keyword clustering

Phase 4: Generating the Blog with AI + RAG

Now that you've collected and stored the knowledge and researched your keywords, it's time to generate your blog.
- Construct a prompt using insights from the SERP report
- Call an AI agent (e.g., Claude, Gemini, or OpenRouter)
- The model retrieves the relevant context from Pinecone and writes a full blog post

Unlike generic AI output, the result here includes specific ideas, phrases, and tone from Scrapeless' original content — made possible by RAG.
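To make the retrieval-and-generation step concrete, here is a stand-alone sketch: embed the topic, fetch the most similar chunks from Pinecone, and pass them to Gemini as context. The index name, model names, and prompt wording are assumptions for illustration, not the workflow's exact configuration:
JavaScript
// npm install @pinecone-database/pinecone @google/generative-ai
const { Pinecone } = require('@pinecone-database/pinecone');
const { GoogleGenerativeAI } = require('@google/generative-ai');

const pc = new Pinecone({ apiKey: process.env.PINECONE_API_KEY });
const genAI = new GoogleGenerativeAI(process.env.GEMINI_API_KEY);

async function writeBlog(topic) {
  // 1. Embed the topic with the same model used for indexing.
  const embedder = genAI.getGenerativeModel({ model: 'embedding-001' });
  const { embedding } = await embedder.embedContent(topic);

  // 2. Retrieve the most relevant chunks from the knowledge base.
  const index = pc.index('scrapeless-blog'); // placeholder index name
  const results = await index.query({
    vector: embedding.values,
    topK: 5,
    includeMetadata: true,
  });
  const context = results.matches
    .map(m => m.metadata?.text ?? '')
    .join('\n---\n');

  // 3. Ask Gemini to write the post, grounded in the retrieved context.
  const writer = genAI.getGenerativeModel({ model: 'gemini-1.5-flash' });
  const prompt = `Using the reference material below, write an SEO blog post about "${topic}".\n\nReference material:\n${context}`;
  const response = await writer.generateContent(prompt);
  return response.response.text();
}

writeBlog('web scraping with AI agents').then(console.log).catch(console.error);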
Final Thoughts
This end-to-end SEO content engine showcases the power of n8n + Scrapeless + Vector Database + LLMs.
You can:
- Replace Scrapeless Blog Page with any other blog
- Swap Pinecone for other vector stores
- Use OpenAI, Claude, or Gemini as your writing engine
- Build custom publishing pipelines (e.g., auto-post to CMS or Notion)
👉 Get started today by installing the Scrapeless community node and start generating blogs at scale — no coding required.
At Scrapeless, we only access publicly available data while strictly complying with applicable laws, regulations, and website privacy policies. The content in this blog is for demonstration purposes only and does not involve any illegal or infringing activities. We make no guarantees and disclaim all liability for the use of information from this blog or third-party links. Before engaging in any scraping activities, consult your legal advisor and review the target website's terms of service or obtain the necessary permissions.