How we source content

Bataoo doesn't crawl the open web

Unlike most search engines, Bataoo indexes only content where the publisher has explicitly consented or where an open licence applies. We do not crawl arbitrary HTML pages. Here's exactly what we ingest and where it comes from.

1 · RSS feeds

We fetch RSS and Atom feeds that publishers have chosen to make available. Publishing a feed is an explicit invitation to syndicate. We display the title, summary and image exactly as provided in the feed, with a link back to your article.
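As a rough illustration of the feed handling described above, the sketch below reads an RSS 2.0 document and keeps only the title, summary, image and link of each item. It is not Bataoo's actual ingestion code: the names `FeedCard`, `parse_rss` and `SAMPLE_FEED` are hypothetical, the sample feed is made up, and Atom feeds (which use different element names) are not handled.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import List, Optional

# Namespace of the Media RSS extension, which some feeds use for images.
MEDIA_NS = "{http://search.yahoo.com/mrss/}"


@dataclass
class FeedCard:
    """Hypothetical record holding only what the feed itself provides."""
    title: str
    summary: str
    image_url: Optional[str]
    link: str


def parse_rss(xml_text: str) -> List[FeedCard]:
    """Extract title, summary, image and link from each RSS 2.0 <item>."""
    root = ET.fromstring(xml_text)
    cards: List[FeedCard] = []
    for item in root.iter("item"):
        title = (item.findtext("title") or "").strip()
        summary = (item.findtext("description") or "").strip()
        link = (item.findtext("link") or "").strip()

        # Prefer an image <enclosure>; fall back to <media:content>.
        image_url = None
        enclosure = item.find("enclosure")
        if enclosure is not None and enclosure.get("type", "").startswith("image/"):
            image_url = enclosure.get("url")
        if image_url is None:
            media = item.find(MEDIA_NS + "content")
            if media is not None:
                image_url = media.get("url")

        # Skip items with no link: there would be nothing to link back to.
        if link:
            cards.append(FeedCard(title, summary, image_url, link))
    return cards


# Made-up feed used only to exercise the sketch.
SAMPLE_FEED = (
    '<rss version="2.0"><channel><title>Example Publisher</title>'
    "<item><title>Monsoon reaches Kerala</title>"
    "<link>https://publisher.example/monsoon</link>"
    "<description>Rain arrived on schedule.</description>"
    '<enclosure url="https://publisher.example/m.jpg" type="image/jpeg" length="1"/>'
    "</item>"
    "<item><title>Item without a link</title></item>"
    "</channel></rss>"
)

cards = parse_rss(SAMPLE_FEED)
```

Note that the sketch never requests the article page itself; everything shown comes from the feed document.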

To remove your feed, write to [email protected] or use /grievance. We act on removal requests within 24 hours.

2 · Open-licence corpora

  • Wikipedia articles (CC-BY-SA 3.0/4.0)
  • Wikidata structured facts (CC0, public domain)
  • AI4Bharat Sangraha Indic-language corpus (CC-BY 4.0)
  • CulturaX / mC4 cleaned multilingual web (ODC-BY)
  • Wikimedia Commons media (per-file CC licences)

3 · Government public-domain sources

  • data.gov.in (Government Open Data Licence)
  • PIB press releases
  • IndiaCode (statutes)
  • RBI Master Circulars · SEBI · ECI
  • MyGov FAQs · AYUSH · NCERT · NIRF rankings
  • Census of India

4 · Real-time data via public APIs

For live cards (cricket scores, stock prices, train PNR, weather, AQI, petrol prices, panchang), Bataoo fetches data at user-query time from public APIs intended for third-party use. We don't store the data: each query is a fresh API call.
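The fetch-at-query-time behaviour can be sketched as below. This is an assumption-laden illustration, not Bataoo's implementation: `build_live_card`, the `fetchers` registry and `fake_weather` are hypothetical names, and the fake source stands in for a real public API so the example runs offline.

```python
from typing import Any, Callable, Dict, List

# A fetcher takes the user's query and returns the provider's response.
Fetcher = Callable[[str], Dict[str, Any]]


def build_live_card(kind: str, query: str, fetchers: Dict[str, Fetcher]) -> Dict[str, Any]:
    """Build a live card by calling the source for `kind` right now.

    There is deliberately no cache or database write here: the response
    is used for this one card and then discarded.
    """
    fetcher = fetchers.get(kind)
    if fetcher is None:
        raise KeyError(f"no live-card source registered for {kind!r}")
    data = fetcher(query)  # fresh call on every query
    return {"kind": kind, "query": query, "data": data}


# Stand-in for a public weather API; the values are made up.
calls: List[str] = []


def fake_weather(query: str) -> Dict[str, Any]:
    calls.append(query)
    return {"city": query, "temp_c": 31}


sources = {"weather": fake_weather}
first = build_live_card("weather", "Pune", sources)
second = build_live_card("weather", "Pune", sources)
```

Asking the same question twice calls the source twice, which is the trade-off of not storing results: fresher data at the cost of one upstream request per query.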

5 · Pucho — community Q&A

Pucho hosts user-generated questions and answers at /pucho/. Bataoo is the intermediary platform; users author the content. Pucho is operated under the IT Rules 2021, with grievance redressal available at /grievance.

What we do NOT do

  • Crawl HTML pages of publishers without an open licence
  • Walk sitemaps of copyrighted sites
  • Store the full article body of any publisher's content
  • Bypass paywalls or login walls
  • Train AI models on third-party content

Earlier crawler (decommissioned)

Bataoo previously operated a polite, self-identifying crawler (BataooBot/2.0). That crawler was retired as of May 2026, and any entries previously scraped from your site have been deleted from our index. If you encounter residual references, please contact [email protected].

Contact

Sourcing questions, opt-outs, or grievances: [email protected]. See /grievance for the formal redressal process.