main.py
This Python script handles data scraping, processing, and database operations. It is organized into functions that collect data from web sources, process it, and store it in a database, along with auxiliary functionality such as watermark configuration and image processing.
Here's a detailed breakdown of the code:
1. Imports and Dependencies
import os
from pydoc import ispackage
import time
import util
from log import my_log
import yaml
import json
import pymssql
import decimal
import uuid
import math
import datetime
import random
from util.image import imageThread
from urllib.parse import urlparse
from lxml import etree
import traceback
import _cffi_backend as backend
import operator
from functools import reduce
from benedict import benedict
import re
import keyboard
import pytz
from dateutil import relativedelta
from dateutil.parser import parse
- Standard Libraries: The script imports several standard libraries (os, time, math, datetime, re, etc.) for general-purpose tasks like file handling, time management, regular expressions, and mathematical operations.
- Third-Party Libraries:
  - yaml: For parsing YAML configuration files.
  - pymssql: For connecting to and querying a Microsoft SQL Server database.
  - lxml.etree: For parsing HTML/XML documents.
  - benedict: A library for enhanced dictionary functionality, including keypath access.
  - keyboard: For capturing keyboard events.
  - pytz and dateutil: For timezone handling and date parsing/manipulation.
- Custom Modules:
  - util: A local package of utility helpers; the script imports imageThread from util.image.
  - log: my_log() likely sets up logging.
  - imageThread: Handles image processing, including watermarking.
2. Global Variables
yaml_data = yaml.load(open("config.yaml"), Loader=yaml.FullLoader)
instance = yaml_data['instance']
User_Agent = yaml_data['User_Agent']['value']
host = yaml_data['mssql']['host']
user = yaml_data['mssql']['user']
pwd = yaml_data['mssql']['pwd']
database = yaml_data['mssql']['database']
port = yaml_data['mssql']['port']
ipage = yaml_data['ipage']
daysdiff = yaml_data['daysdiff']
skipcounter = yaml_data['skipcounter']
cdnImg = yaml_data['cdnImg']['value']
- YAML Configuration: The script loads its parameters from a config.yaml file: database connection details (host, user, pwd, database, port), a user-agent string for web requests, the image CDN URL, and scraping controls such as ipage, daysdiff, and skipcounter.
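Note that yaml.load(open("config.yaml"), ...) leaves the file handle open. A minimal, more defensive variant (an illustration, not the author's code) would use safe_load and a context manager:

import yaml

# Sketch: same settings, loaded defensively. safe_load avoids arbitrary
# object construction, and the with-block closes the file handle.
with open("config.yaml", encoding="utf-8") as f:
    yaml_data = yaml.safe_load(f)

host = yaml_data['mssql']['host']            # keys as read by the script
User_Agent = yaml_data['User_Agent']['value']
cdnImg = yaml_data['cdnImg']['value']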
3. Database Connection Setup
conn = pymssql.connect(host=host, user=user, password=str(pwd), database=database, port=str(port))
cursor = conn.cursor(as_dict=True)
- Database Connection: Establishes a connection to a Microsoft SQL Server database using the pymssql library. Because the cursor is created with as_dict=True, query results come back as dictionaries keyed by column name rather than tuples.
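For example (table and columns borrowed from queries later in the script):

# With as_dict=True, each fetched row is a dict keyed by column name.
cursor.execute('SELECT ID, MerchantID FROM dbo.Merchants')
for row in cursor.fetchall():
    print(row['ID'], row['MerchantID'])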
4. Global Variables and Flags
gcount = 0
rcount = 90
quit = False
- Counters: gcount is a running counter incremented on every call to _getgcount(), and rcount controls how often the derived request code advances (once every 90 calls).
- Flag: quit is a global flag used to gracefully terminate long-running processes; it is set by key_capture_thread().
5. Key Capture Thread
def key_capture_thread():
    global quit
    a = keyboard.read_key()
    if a == "esc":
        print("ESC Stop, finishing the current slug")
        quit = True
- Keyboard Listener: This function listens for the "ESC" key to be pressed. If detected, it sets the quit flag to True, which can be used to gracefully stop long-running processes.
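The name suggests it runs on a background thread while the scraper loops. A plausible usage sketch (the threading call is an assumption; it is not shown in the excerpt):

import threading

# Presumed usage: watch for ESC in the background while the main loop runs.
threading.Thread(target=key_capture_thread, daemon=True).start()

while not quit:
    # ... scrape the next slug; the loop exits once ESC sets quit = True
    pass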
6. General Utility Functions
_getgcount()
def _getgcount():
    global gcount
    gcount += 1
    rqcode = math.floor(gcount/rcount) % 250
    return format(rqcode+1, '#04x')[2:]
- Counter Function: Increments the global gcount, derives rqcode = floor(gcount / rcount) % 250, and returns rqcode + 1 formatted as a two-digit hexadecimal string ('01' through 'fa').
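A worked example of the arithmetic:

import math

gcount, rcount = 181, 90                      # state after 181 calls
rqcode = math.floor(gcount / rcount) % 250    # 181 // 90 = 2
print(format(rqcode + 1, '#04x')[2:])         # '0x03' minus the prefix -> '03'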
ele_to_str(ele)
def ele_to_str(ele):
    return etree.tostring(ele, pretty_print=True, encoding="utf-8", method="html").decode("utf-8")
- Element to String: Converts an lxml element object to a pretty-printed string representation.
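For example (illustrative markup):

from lxml import etree

ele = etree.fromstring('<div><p>hello</p></div>')
print(ele_to_str(ele))   # pretty-printed HTML for the element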
_getJSONval() and _getJSON()
def _getJSONval(json, key, type="any"):
    val = None
    try:
        val = reduce(operator.getitem, key.split('.'), json)
    except Exception as e:
        if(type=="array"):
            val = []
    return val

def _getJSON(json, key, type="any"):
    val = None
    try:
        val = json[key]
        if(val):
            if(type == "int"):
                val = int(val)
            elif(type == "decimal"):
                val = float(str(val).replace(",",""))
    except Exception as e:
        if(type=="array"):
            val = []
        else:
            val = None
    return val
- JSON Handling: These functions safely extract values from JSON-like dictionaries. _getJSONval() walks a dotted key path (e.g. "a.b.c") via reduce(operator.getitem, ...), while _getJSON() reads a single key and optionally casts it to int or float (stripping thousands separators for "decimal"). Both return None on failure, or [] when type="array".
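For example (made-up data):

data = {"listing": {"price": "1,250,000"}, "agent": {"name": "Jane"}}

print(_getJSONval(data, "listing.price"))             # '1,250,000'
print(_getJSONval(data, "agent.phone", "array"))      # [] -- key missing
print(_getJSON(data["listing"], "price", "decimal"))  # 1250000.0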
_cleanImageDir()
def _cleanImageDir():
    import shutil
    import os
    cursor.execute('SELECT ID, MerchantID FROM dbo.Merchants WHERE ID NOT IN (SELECT ID FROM dbo.Merchants_copy1)')
    results = cursor.fetchall()
    for imgdir in results:
        path = "F:/image/" + imgdir['MerchantID']
        if os.path.exists(path):
            print(path)
            shutil.rmtree(path)
- Directory Cleanup: This function deletes image directories for merchants that are no longer present in a backup table (Merchants_copy1).
_getrow() and _getcol()
def _getrow(table="", filter=""):
    global conn, cursor
    cursor.execute(f'SELECT ID FROM dbo.{table} WHERE {filter}')
    results = cursor.fetchall()
    return len(results)

def _getcol(table="", col="", filter=""):
    global conn, cursor
    cursor.execute(f'SELECT {col} FROM dbo.{table} WHERE {filter}')
    results = cursor.fetchall()
    return results
- Database Query Helpers: Simple helpers that fetch a row count or a column from a table. Note that the table name and filter are interpolated directly into the SQL string, so they are only safe with trusted, internally generated arguments.
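For untrusted values, a parameterized variant would be safer. A sketch (not the author's code; pymssql can only parameterize values, so the table and column names must still come from trusted strings):

def _getrow_safe(table, column, value):
    # table/column are assumed to be hard-coded, trusted strings;
    # only the value is parameterized by pymssql.
    cursor.execute(f'SELECT COUNT(*) AS n FROM dbo.{table} WHERE {column} = %s',
                   (value,))
    return cursor.fetchone()['n']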
7. Data Collection and Processing Functions
_getcat()
def _getcat():
    global conn, cursor
    cursor.execute('SELECT * FROM dbo.PropType')
    results = cursor.fetchall()
    propType = benedict()
    for row in results:
        propType[f"{row['code']}"] = row['name']
    a = 34
    print(propType[f'{a}'])
- Property Type Data Fetching: This function fetches property type data from the database, stores it in a benedict dictionary keyed by code, and prints a specific entry (code 34).
_getlist_city(), _getlist_state(), _getlist_extra()
These functions scrape and process lists of cities, states, or other entities from a business directory website, then store the results in corresponding database tables (City, State, Jobs2).
- _getlist_city(): Scrapes city data and inserts it into the City table.
- _getlist_state(): Scrapes state data and inserts it into the State table.
- _getlist_extra(): Scrapes extra data for states not found among the cities and inserts it into the Jobs2 table.
_setlist() and _getlistByType()
- _setlist(): Prepares URLs for scraping property listings, either for sale or for rent, and inserts them into the Jobs_Listing table.
- _getlistByType(): Scrapes property listings by type (sale or rent) and processes them, handling pagination and skipping listings that are too old or already in the database; a speculative sketch of this loop follows below.
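The loop itself is not shown in the excerpt, but the configuration values (ipage, daysdiff, skipcounter) and the quit flag suggest a shape roughly like this (a hypothetical reconstruction; fetch_listing_page() and listing_too_old() are made-up stand-ins):

def fetch_listing_page(ltype, page):
    return []          # stub: would scrape one page of listing slugs

def listing_too_old(slug, days):
    return False       # stub: would compare the listing date against daysdiff

def _getlistByType_sketch(ltype):
    page, skipped = 1, 0
    while not quit and page <= ipage:             # ipage: page cap from config.yaml
        for slug in fetch_listing_page(ltype, page):
            if listing_too_old(slug, daysdiff):
                skipped += 1
                if skipped >= skipcounter:        # give up after repeated skips
                    return
                continue
            _getdata(ltype, slug)                 # process one property page
        page += 1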
_getlist(), _getlist_rent()
These functions are responsible for scraping property listings, storing the scraped URLs in the database, and processing them further if needed.
8. Data Insertion and Processing
getTbl()
def getTbl():
    global conn, cursor
    cursor.execute("""SELECT COLUMN_NAME, DATA_TYPE
                      FROM INFORMATION_SCHEMA.COLUMNS
                      WHERE TABLE_NAME = 'ProppyPropertyMarket'
                      AND COLUMN_NAME <> 'ID';""")
    results = cursor.fetchall()
    tbl = {}
    for row in results:
        att = {}
        att['type'] = row['DATA_TYPE']
        att['value'] = None
        tbl[row['COLUMN_NAME']] = att
    print(json.dumps(tbl))
    return
- Database Schema Fetching: This function retrieves the schema of the ProppyPropertyMarket table (excluding the ID column) and prints it as a JSON object.
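The resulting dict maps each column name to its data type and a placeholder value, which is a natural input for building an INSERT statement dynamically. A sketch of that idea (an assumption about how the dict gets used; the excerpt only prints it):

# Sample shape of tbl, with made-up columns and values filled in:
tbl = {
    'Price':   {'type': 'decimal',  'value': 123000.0},
    'Address': {'type': 'nvarchar', 'value': '1 Main St'},
}
cols = ', '.join(tbl)
placeholders = ', '.join(['%s'] * len(tbl))
sql = f'INSERT INTO dbo.ProppyPropertyMarket ({cols}) VALUES ({placeholders})'
cursor.execute(sql, tuple(att['value'] for att in tbl.values()))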
_getdata()
def _getdata(ltype, slug):
    global conn, cursor, quit, idate, watermark_setting, cdnImg, daysdiff
    utc = pytz.UTC
    watermark_setting = get_watermark_config()
    # ...
    return 1
- Core Data Processing: This is the main function responsible for scraping and processing detailed property data from a URL (slug). It performs several tasks:
- Fetches the property details from the URL.
- Processes and extracts data, including the advertiser type, agent information, property details, and images.
- Inserts the processed data into the corresponding database tables.
- Handles watermarking of images and other image-related processing.
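The body is elided above; from the globals it declares and the tasks listed, its skeleton is roughly (a speculative outline, not the author's code):

def _getdata_outline(ltype, slug):
    # Speculative outline; the step comments are inferred, not quoted.
    watermark_setting = get_watermark_config()   # shown in the excerpt
    # 1. fetch the property page for `slug` (using the configured User_Agent)
    # 2. parse it with lxml.etree and extract advertiser, agent, and property fields
    # 3. download each photo, watermark it per watermark_setting, and push to cdnImg
    # 4. insert the assembled record into the corresponding MSSQL tables
    return 1                                     # shown in the excerpt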
date_diff()
def date_diff():
    utc = pytz.UTC
    startWorkDate = "2022-10-12T14:40:49.875Z"
    d1 = parse(startWorkDate)
    d1 = d1 + datetime.timedelta(hours=8)
    print(d1.strftime('%Y-%m-%d %H:%M:%S'))
    # ...
- Date Difference Calculation: A scratch helper for date arithmetic: it parses an ISO-8601 timestamp, shifts it by eight hours (presumably to UTC+8), prints the result, and, per its name, evidently computes a difference against another date in the elided remainder.
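The dateutil import at the top of the script suggests relativedelta is used for the difference. A worked example with illustrative dates:

import datetime
from dateutil import relativedelta
from dateutil.parser import parse

d1 = parse("2022-10-12T14:40:49.875Z") + datetime.timedelta(hours=8)
d2 = parse("2023-01-20T09:00:00Z")                # made-up second date
diff = relativedelta.relativedelta(d2, d1)
print(diff.months, diff.days)                     # -> 3 7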
get_watermark_config()
def get_watermark_config():
    watermark_setting = WaterMark()
    yaml_data = yaml.load(open("config.yaml"), Loader=yaml.FullLoader)
    watermark_setting.enabled = yaml_data['Watermark']['enabled']
    # ...
    return watermark_setting
- Watermark Configuration: This function loads watermark settings from the YAML configuration file and returns them as a WaterMark object.
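The WaterMark class itself is not shown in the excerpt. A plausible shape, inferred from the attribute set here (only enabled is confirmed; the other fields are guesses):

class WaterMark:
    def __init__(self):
        self.enabled = False    # read from yaml_data['Watermark']['enabled']
        self.text = None        # assumed: watermark text or overlay source
        self.position = None    # assumed: placement on the image
        self.opacity = None     # assumed: blend strength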
9. Main Script Execution
if __name__ == '__main__':
    my_log()
    _getlistByType("rent")
    conn.close()
- Script Entry Point: When the script is run, it initializes logging, scrapes rental listings by calling _getlistByType("rent"), and finally closes the database connection.
Summary
This Python script is a sophisticated data scraper and processor designed to fetch real estate or property-related information from various online sources, process it, and store it in a Microsoft SQL Server database. The script handles various tasks such as:
- Scraping and processing city, state, and property data.
- Handling database connections and queries.
- Performing data extraction and transformation.
- Processing images, including watermarking.
- Handling pagination and skip logic while walking listing pages.
- Logging and error handling.
This script is well-suited for an environment where large-scale data collection and processing are needed, particularly in real estate or similar industries where such information is critical.