ainoya.dev

Improving Web Scraping with Readability and Table Support in Markdown


I previously used DOM Distiller for content extraction in my project cloudflare-dom-distiller. I have since switched to Readability because it produces cleaner results: it strips unwanted headers and footers, and the resulting Markdown output looks noticeably better.

I have also added support for the <table> tag, so HTML tables are now output as Markdown tables. For converting HTML to Markdown I use turndown, which has a plugin for GitHub Flavored Markdown (GFM), turndown-plugin-gfm. With this plugin integrated, tables are converted along with the rest of the page.
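To make the target format concrete, here is a minimal, dependency-free sketch of what a GFM table looks like when rendered from rows of cell text. It is not how turndown-plugin-gfm is implemented, and the helper names escapeCell and toGfmTable are hypothetical, introduced only for illustration. In the actual project the conversion is left to the plugin, which is typically enabled by passing its gfm export to the use method of a TurndownService instance.

```typescript
// Hypothetical helpers for illustration; not part of turndown or turndown-plugin-gfm.

// Make cell text safe to place inside a GFM table cell.
function escapeCell(text: string): string {
  return text
    .replace(/\|/g, "\\|")  // a literal pipe would otherwise end the cell
    .replace(/\r?\n/g, " ") // GFM table cells cannot span multiple lines
    .trim();
}

// Render a header row and body rows as a GFM table.
function toGfmTable(header: string[], rows: string[][]): string {
  const renderRow = (cells: string[]): string => {
    // Pad or truncate each row so it has exactly one cell per header column.
    const normalized = header.map((_, i) => escapeCell(cells[i] ?? ""));
    return `| ${normalized.join(" | ")} |`;
  };
  // GFM requires a delimiter row between the header and the body.
  const delimiter = `| ${header.map(() => "---").join(" | ")} |`;
  return [renderRow(header), delimiter, ...rows.map(renderRow)].join("\n");
}

const markdown = toGfmTable(
  ["Tool", "Purpose"],
  [
    ["Readability", "content extraction"],
    ["turndown", "HTML to Markdown"],
  ],
);
console.log(markdown);
```

The delimiter row is the part that is easy to forget: without it, GFM renderers treat the lines as ordinary paragraphs rather than as a table.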

While looking for similar tools, I came across jina-ai/reader, an interesting open-source project that is also available as a SaaS. It offers features comparable to cloudflare-dom-distiller and adds web search, using puppeteer to access the Brave search engine. For those who need more advanced features, jina-ai/reader is a promising option.