
Teaching LLMs to Scrape Real Estate Deals

I wanted to scrape real estate deals from Facebook groups. Private investor groups aren't a goldmine, but they're better than most public marketplaces and probably the best way to find off-market deals short of having a relationship with a broker or being in a private mastermind group. The problem is that the listings are a mess. Some people post key numbers in the description. Some put them in the comments. Some slap text overlays on photos with the ARV, rehab estimate, and asking price baked into an image. There's no consistent format, and different people use different terms for the same concepts (ARV, after-repair value, comps, comparable sales). Traditional web scrapers are useless here because there's nothing structured to scrape. The format changes with every poster. When GPT-4V launched with the ability to "see" images, I thought I'd finally found my answer.

The Experiment

I already had a tool for this kind of work: drone, a state-machine web scraper I'd built for testing web apps. Drone models a website as a graph of states (are we on the login page? the search results? a listing?) and transitions (click this button, fill this form, scroll down), then navigates between them like a GPS finding the shortest route. The dream was to stitch drone together with GPT-4V and Puppeteer to create a browser that could navigate any site the way a human would: GPT-4V as the eyes (reads the page and extracts meaning), Puppeteer as the hands (clicks and types), drone as the GPS (knows where we are in the flow and what to do next).
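
To make the "graph of states" idea concrete, here's a minimal sketch in TypeScript. It is not drone's actual API: the state names, the Transition shape, and the route function are all illustrative, but the skeleton is the same: states, transitions that perform Puppeteer actions, and a shortest-path search between them.

```typescript
import type { Page } from "puppeteer";

// States I might care about inside a Facebook group, and the transitions out
// of each one. Names and shapes here are illustrative, not drone's actual API.
type StateName = "loggedOut" | "groupFeed" | "listing";

interface Transition {
  to: StateName;
  // The "hands": a Puppeteer action that moves the browser to the next state.
  perform: (page: Page) => Promise<void>;
}

const graph: Record<StateName, Transition[]> = {
  loggedOut: [{ to: "groupFeed", perform: async () => { /* log in, open the group */ } }],
  groupFeed: [{ to: "listing", perform: async () => { /* click the next post */ } }],
  listing: [{ to: "groupFeed", perform: async () => { /* go back to the feed */ } }],
};

// The "GPS": breadth-first search for the shortest route of transitions from
// wherever we are to wherever we want to be.
function route(from: StateName, to: StateName): Transition[] | null {
  const queue = [{ state: from, path: [] as Transition[] }];
  const seen = new Set<StateName>([from]);
  while (queue.length > 0) {
    const { state, path } = queue.shift()!;
    if (state === to) return path;
    for (const t of graph[state]) {
      if (!seen.has(t.to)) {
        seen.add(t.to);
        queue.push({ state: t.to, path: [...path, t] });
      }
    }
  }
  return null; // no known way to get there
}
```

Breadth-first search is all the "GPS" needs here, because every transition costs the same: one browser action.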

The prototype was simpler than the dream. Puppeteer loaded a listing page, took a screenshot, and sent it to GPT-4V to extract deal information (price, ARV, location, rehab costs) regardless of how or where that information appeared on the page. I tested it against Facebook Marketplace listings and private investor groups, the two places where the messiest deal data lives.
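
In code terms, the loop looked roughly like this sketch. The DealInfo shape, the extractDeal name, and the prompt wording are illustrative rather than the exact code I ran; gpt-4-vision-preview was the model name the API used for GPT-4V at the time.

```typescript
import puppeteer from "puppeteer";
import OpenAI from "openai";

// Illustrative shape for the fields I cared about; not a fixed schema.
interface DealInfo {
  price?: number;
  arv?: number;
  rehab?: number;
  location?: string;
}

const openai = new OpenAI(); // picks up OPENAI_API_KEY from the environment

async function extractDeal(listingUrl: string): Promise<DealInfo> {
  // The "hands": load the listing and capture it as a base64 screenshot.
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto(listingUrl, { waitUntil: "networkidle2" });
  const screenshot = await page.screenshot({ encoding: "base64" });
  await browser.close();

  // The "eyes": ask GPT-4V to pull the numbers out of the screenshot,
  // wherever they appear: description, comments, or text baked into a photo.
  const response = await openai.chat.completions.create({
    model: "gpt-4-vision-preview", // the GPT-4V model name at the time
    max_tokens: 300,
    messages: [
      {
        role: "user",
        content: [
          {
            type: "text",
            text:
              "Extract the asking price, ARV (after-repair value), rehab estimate, " +
              "and location from this real estate listing. Reply with JSON only, e.g. " +
              '{"price": 185000, "arv": 285000, "rehab": 40000, "location": "City, ST"}.',
          },
          {
            type: "image_url",
            image_url: { url: `data:image/png;base64,${screenshot}` },
          },
        ],
      },
    ],
  });

  return JSON.parse(response.choices[0].message.content ?? "{}") as DealInfo;
}
```

Puppeteer is the hands and GPT-4V is the eyes; in the full version, drone would have been the part deciding which listing URL to visit next.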

The results were mixed in a way that was both impressive and disqualifying. GPT-4V passed the spatial reasoning test: it could distinguish the actual listing content from Facebook's toolbars, sidebars, and navigation elements. It understood that the important information was in the main post area, not the header. It could read text overlaid on images, identify property photos, and generally understand the layout of a listing. For a model that had never seen these pages before, that's genuinely remarkable.

Where It Falls Apart

The problem is accuracy on the numbers. In real estate, the difference between an ARV of $185,000 and $186,000 might not matter. But the difference between $185,000 and $285,000 absolutely does, and GPT-4V would occasionally make exactly that kind of error. It would misread digits, transpose numbers, or confidently report a figure that was wrong or that wasn't in the listing at all. For a use case where the whole point is extracting precise financial data, "occasionally wrong" is the same as "useless." This is the same hallucination problem I wrote about with text-based ChatGPT, except with numbers it's harder to catch, because a hallucinated price looks just as plausible as a real one.

Then there's cost. Each GPT-4V API call costs a few cents, and at the volume a regular web scraper churns through pages, that adds up to a few thousand dollars a month. The economics didn't work.
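
To put rough, purely illustrative numbers on it: at around three cents per screenshot call and a few thousand listings a day, you're looking at something like $100 a day, which lands right around $3,000 a month.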

The Open-Source Fallback

I tested several alternatives. LLaVA, the most promising open-source visual model at the time, had general image understanding but couldn't reliably read text from screenshots. It would get the gist of an image (this is a property listing) but miss the specific numbers entirely. Other open-source visual models were even further behind. The honest assessment: GPT-4V is the state of the art for visual understanding, and the state of the art isn't ready for production use in any application where precise text extraction from images matters.

What I Learned

This experiment followed the same arc as my earlier assessment of ChatGPT: impressive demos, real limitations. The DALL-E posts I wrote were about AI producing visual output; this experiment was about AI understanding visual input. Both hit the same wall: the models are good enough to make you think they're ready, but not good enough to trust in real-world scenarios.
