Hello!
Here is what I'm trying to do:
Index a particular section of a web page. This section is a forum that is updated constantly, and there is only one column that I'm interested in, which is titled "Subject".
How do I accomplish this without running into duplicate entries? Duplicates are what I get when I do the following.
Currently I run the following using PowerShell:
$wc = New-Object System.Net.WebClient
$wc.DownloadString("https://website.com/forum123/") > C:\PS_Output\Output.txt
Then I index Output.txt and use a regex in Splunk to extract a named variable from the occurrences of a particular string (i.e., 4 consecutive capital letters).
But each time Output.txt is overwritten (for example, when I run $wc.downloadstring twice, seconds apart), I get a lot of duplicates.
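To make the pattern concrete, here is a minimal PowerShell sketch of the kind of match described above. It assumes "4 consecutive capital letters" means a standalone four-letter uppercase token; the sample line is made up, since the real forum markup is not shown here.

```powershell
# Hypothetical sample line, not real forum data.
$sample = 'Subject: ABCD earnings call, also see WXYZ'

# [regex]::Matches is case-sensitive by default (unlike the -match operator),
# so [A-Z] only matches capital letters here.
[regex]::Matches($sample, '\b[A-Z]{4}\b') | ForEach-Object { $_.Value }
```

The word boundaries (`\b`) keep the pattern from matching four capitals inside a longer uppercase word; drop them if that is not the intent.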
I believe I have 2 problems:
1) I need to clean up Output.txt so that it contains only the relevant events (there is no need for all the surrounding garbage HTML source). Perhaps I need to apply some regex to the result of $wc.downloadstring?
2) The tricky part is how quickly the web page's table is flushed out by new posts. If I run this every minute, but all 50 posts are replaced by 50 new posts within 30 seconds, I lose about half of the content that I need.
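One possible way to approach both problems, sketched in PowerShell: pull only the "Subject" cells out of the HTML, remember which subjects have already been written, and append just the new ones with a timestamp to a separate log file that Splunk monitors instead of the overwritten Output.txt. This is a rough sketch under stated assumptions, not a tested solution: the `<td class="subject">` markup, the Seen.txt and Subjects.log paths, the `Get-NewSubjects` helper, and the 10-second polling interval are all hypothetical and would need to be adapted to the real page.

```powershell
# Sketch only. Paths, markup pattern, and interval below are assumptions.
$url      = 'https://website.com/forum123/'
$seenFile = 'C:\PS_Output\Seen.txt'      # hypothetical: subjects already logged
$logFile  = 'C:\PS_Output\Subjects.log'  # hypothetical: append-only file to index

# Hypothetical pattern; adjust it to the actual HTML of the "Subject" column.
$pattern = '<td class="subject">(.*?)</td>'

function Get-NewSubjects {
    param(
        [string]$Html,
        [string]$Pattern,
        [System.Collections.Generic.HashSet[string]]$Seen
    )
    foreach ($m in [regex]::Matches($Html, $Pattern)) {
        $subject = $m.Groups[1].Value.Trim()
        # HashSet.Add returns $false when the value is already present,
        # so only subjects not seen before are emitted.
        if ($Seen.Add($subject)) { $subject }
    }
}

# Reload previously logged subjects so a restart does not re-log them.
$seen = New-Object 'System.Collections.Generic.HashSet[string]'
if (Test-Path $seenFile) {
    Get-Content $seenFile | ForEach-Object { [void]$seen.Add($_) }
}

$wc = New-Object System.Net.WebClient
while ($true) {
    $html = $wc.DownloadString($url)
    $new  = @(Get-NewSubjects -Html $html -Pattern $pattern -Seen $seen)
    foreach ($subject in $new) {
        $stamp = Get-Date -Format 'yyyy-MM-dd HH:mm:ss'
        Add-Content -Path $logFile  -Value ('{0} subject="{1}"' -f $stamp, $subject)
        Add-Content -Path $seenFile -Value $subject
    }
    # Assumed interval: poll more often than the ~30 s turnover described above.
    Start-Sleep -Seconds 10
}
```

Because the log file is only ever appended to, each post should appear in it once, stamped with the time it was first seen, which is what makes time-based reports possible later. Note that this sketch treats the subject text itself as the unique key, so two different posts with an identical subject would be logged only once; if the page exposes a post ID or link, that would be a safer key.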
Anyone out there ever tried grabbing content from an external site (not having admin access to the server of course) and keeping historical data?
Thanks!
Index a specific table (forum) of a webpage - allowing me to kick off reports (based on time frame)