Hacker News front-page analytics
A question about which states were most frequently represented on the HN homepage had me doing some quick querying via Hacker News's Algolia search ... which is NOT limited to the front page. Those results were ... surprising (Maine and Iowa outstrip the more probable California and, say, New York), and they're further confounded by other factors.
Thread: https://news.ycombinator.com/item?id=36076870
HN provides an interface to historical front-page stories (https://news.ycombinator.com/front), and that can be crawled by providing a list of corresponding date specifications, e.g.:
https://news.ycombinator.com/front?day=2023-05-25
Easy enough.
So I'm crawling that and compiling a local archive. Rate-limiting and other factors mean that's only about halfway complete, and a full pull will take another day or so.
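A minimal sketch of how such a crawl can be driven; the date range matches the archive, but the wait interval, file names, and exact wget invocation here are illustrative assumptions rather than the command I actually ran:

#!/usr/bin/env bash
# Emit one historical front-page URL per day, then fetch politely with wget.
start=2007-02-20
end=2023-05-25
d="$start"
while [[ ! "$d" > "$end" ]]; do            # ISO-8601 dates compare lexically
    echo "https://news.ycombinator.com/front?day=${d}"
    d=$(date -d "$d + 1 day" +%Y-%m-%d)    # GNU date
done > front-urls.txt

# Throttled fetch; wget's per-request messages go to fetchlog (grepped below).
wget --wait=5 --random-wait --input-file=front-urls.txt \
     --directory-prefix=crawl/ --output-file=fetchlog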
But once it's done I'll be able to look at story titles, sites, submitters, time-based patterns (day of week, day of month, month of year, year-over-year variation), and other patterns, as well as mean points and comments along various dimensions.
Among the surprises: as of January 2015 (roughly as far as the partial crawl reaches), one of the most consistently high-voted sites is The Guardian. I'd thought HN leaned less liberal than that.
The full archive will probably be < 1 GB (raw HTML), currently 123 MB on disk.
Contents are the 30 top-voted stories for each day since 20 February 2007.
If anyone has suggestions for other questions to ask of this, fire away.
And, as of early 2015, top state mentions are:
1. new york: 150
2. california: 101
3. texas: 39
4. washington: 38
5. colorado: 15
6. florida: 10
7. georgia: 10
8. kansas: 10
9. north carolina: 9
10. oregon: 9
NY is highly overrepresented (NY Times, NY Post, NY City), likewise Washington (Post, Times, DC). Adding in "Silicon Valley" and a few other toponyms boosts California's score markedly. I've also got some city-based analytics.
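A rough illustration of the kind of boosting involved; the file name (titles.txt, one title per line) and the synonym lists here are made up for the example, not the actual terms used:

# Titles mentioning California directly or via a proxy toponym:
grep -Eic 'california|silicon valley|bay area|san francisco' titles.txt

# New York, with the news-outlet confound (NY Times, NY Post) left in,
# which is what inflates its count:
grep -Eic 'new york|nyc' titles.txt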
Crawl complete:
FINISHED --2023-05-27 20:11:03--
Total wall clock time: 1d 17h 55m 39s
Downloaded: 5939 files, 217M in 9m 48s (378 KB/s)
NB: wget performed admirably:
grep 'HTTP request sent' fetchlog | sort | uniq -c | sort -k1nr
5939 HTTP request sent, awaiting response... 200 OK
14 HTTP request sent, awaiting response... Read error (Connection reset by peer) in headers.
1 HTTP request sent, awaiting response... Read error (Operation timed out) in headers.
Each request that hit a read error succeeded on a second try.
I'm working on parsing. Playing with identifying countries most often mentioned in titles right now, on still-partial data (missing the past month or so's front pages).
Countries most likely to be confused with a major celebrity and/or IT/tech sector personality: Cuba & Jordan.
Country most likely to be confused with a device connection standard: US (USB).
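Those collisions come from naive substring matching on titles. A sketch of the difference, assuming a titles.txt with one title per line:

# Naive counting: "US" also hits "USB", "Cuba" also hits "Mark Cuban".
grep -c 'US' titles.txt
grep -c 'Cuba' titles.txt

# Whole-word matching drops the USB and Mark Cuban hits:
grep -cw 'US' titles.txt
grep -cw 'Cuba' titles.txt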
Raw stats, top-20, THERE ARE ISSUES WITH THESE DATA:
1 US: 1350 (186 matched "USB")
2 U.S.: 1073 (USA: 59, U.S.A.: 2, America/American: 979)
3 China: 634
4 Japan: 526
5 India: 477
6 UK: 288
7 EU: 225 (E.U.: 54)
8 Russia: 221
9 Germany: 165
10 Canada: 162
11 Australia: 157
12 Korea: 140 (DRK: 69, SK: 38)
13 France: 116
14 Iran: 91
15 Dutch: 80 (25 Netherlands)
16 United States: 75
17 Brazil: 69
18 North Korea: 69
19 Sweden: 68
20 Cuba: 67 (32 "Mark Cuban")
How Much Colorado Love? Or, 16 Years of Hacker News Front-Page Analytics
I've pulled 5,939 front pages from Hacker News, dating from 20 February 2007 to 25 May 2023, initially to answer the question "how often is Colorado mentioned on the front page?" (38 times, 5th most frequent US state). This also affords the opportunity to ask and answer other questions.
Preliminary report: https://news.ycombinator.com/item?id=36098749
I've confirmed that the story shortfall (days with fewer than 30 archived stories) reflects actual HN history rather than crawl errors: several days with fewer than the usual 30 stories, and one day of complete outage, mostly in HN's first year of operation (a counting sketch follows the list):
2007-03-10: 29
2007-03-24: 26
2007-03-25: 25
2007-05-19: 27
2007-05-26: 26
2007-05-28: 29
2007-06-02: 19
2007-06-16: 28
2007-06-23: 17
2007-06-24: 28
2007-06-30: 20
2007-07-01: 28
2007-07-07: 27
2007-07-15: 26
2007-07-28: 27
2014-01-06: 0
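For reference, a sketch of how the per-day counts can be pulled from the raw pages. It assumes one HTML file per day, named by date, and that story rows carry the "athing" class as in current HN markup; the paths are illustrative:

# Count story rows per crawled page and flag days short of the usual 30.
for f in crawl/*; do
    n=$(grep -o 'athing' "$f" | wc -l)
    [ "$n" -lt 30 ] && echo "$(basename "$f"): $n"
done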
I want to test some reporting / queries / logic against a sample of the data.
Since my file-naming convention follows ISO-8601 (YYYY-MM-DD), I can just lexically sort those.
And to grab a random year's worth (365 days) of reports from across the set:
ls rendered-crawl/* | sort -R | head -365 | sort
(I've rendered the pages, using w3m's -dump feature, to speed processing).
The full dataset is large enough, and my awk code sloppy enough (several large sequential lists used in pattern matching), that a full parse takes about 10 minutes, so the sampling shown here speeds development by better than 10x while still providing representative data across time.
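The "large sequential lists" are of roughly this shape; a stripped-down sketch with a made-up term list, not the actual parsing script:

# Tally state mentions across the rendered (w3m -dump) pages.
awk '
BEGIN { split("california colorado texas washington oregon", states, " ") }
{
    line = tolower($0)
    for (i in states)
        if (index(line, states[i]))
            tally[states[i]]++
}
END { for (s in tally) print tally[s], s }
' rendered-crawl/* | sort -k1nr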
@dredmorbius
"bash arithmetic", two of the scariest words in the English language
@niplav You're either hanging out in the right or wrong places. I'm not sure which.
But I've got a Very Simple Bash Script I can trust to answer that question for me ...