I swear I'll do a better write-up sometime soon.
Duplicate feeds are ignored. Domains that block scraping via robots.txt will be skipped.
Any page will work; it doesn't need to be the home page.
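The urls.txt seed file used below is assumed to be plain text, one site URL per line (the URLs here are placeholders):

```shell
# create a hypothetical seed file: one page per site, any page works
cat > urls.txt <<'EOF'
https://example.com/about
https://blog.example.org
EOF
wc -l urls.txt
```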
Use the ExternalUrlsSpider crawler. The output is a .jsonl file with the
rss_url and domain of each website listed.
# on the scraper/ directory
uv run scrapy crawl rss -a urls_file=urls.txt -o rss.jsonl

Import the resulting .jsonl file into the backend's database using
the flask import-feeds command.
# on the backend/ directory
uv run flask import-feeds rss.jsonl

If you have a plain-text list of RSS URLs instead (e.g. rss_urls.txt),
format it to .jsonl before importing with flask import-feeds:
# on the backend/ directory
jq -R -n -c '[inputs] | map({rss_url: .}) | .[]' rss_urls.txt > rss.jsonl
uv run flask import-feeds rss.jsonl

Always use the full list of imported feed_urls. Order them randomly to
reduce the chances of hammering a small provider.
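To sanity-check the jq conversion above, here is what it does to a two-line sample (assuming jq is installed; the URLs are placeholders):

```shell
printf 'https://example.com/feed.xml\nhttps://example.org/rss\n' > rss_urls.txt
# -R reads raw lines, -n with [inputs] collects them all,
# and map({rss_url: .}) | .[] emits one compact JSON object per line
jq -R -n -c '[inputs] | map({rss_url: .}) | .[]' rss_urls.txt > rss.jsonl
cat rss.jsonl
```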
# on the backend/ directory
sqlite3 brcrawl.sqlite3
.output ../website/feeds.txt
SELECT feed_url FROM feeds WHERE status_id = 1 ORDER BY RANDOM();
.output stdout
.quit

Now use the generated feeds.txt file to run the build.sh command from the
website directory.
# on the website/ directory
./build.sh feeds.txt

The resulting .html files can be deployed (e.g. via GitHub Pages or a VPS
with nginx).
Filter by verified status and apply an additional date filter, e.g.:
SELECT feed_url FROM feeds WHERE status_id = 1 AND created_at > '2026-02-12 10:00:00';
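The interactive session above can also be run as a single non-interactive sqlite3 command. A sketch on a scratch database (the feeds table schema is inferred from the queries above; the real database is brcrawl.sqlite3 in backend/, and the real output path is ../website/feeds.txt):

```shell
# build a scratch database with the inferred schema and two sample rows
sqlite3 demo.sqlite3 "CREATE TABLE feeds (feed_url TEXT, status_id INTEGER, created_at TEXT);
INSERT INTO feeds VALUES
  ('https://a.example/feed', 1, '2026-03-01 09:00:00'),
  ('https://b.example/feed', 2, '2026-03-01 09:00:00');"

# one-shot export, equivalent to the .output session above
sqlite3 demo.sqlite3 \
  "SELECT feed_url FROM feeds WHERE status_id = 1 AND created_at > '2026-02-12 10:00:00' ORDER BY RANDOM();" \
  > feeds.txt
cat feeds.txt
```

Only the verified (status_id = 1), recently created feed survives the filter.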