Conversation
This looks good. You can just use a list instead of key-value pairs for each proposal. I will have a look at the code more carefully after some time.

I think it becomes easier to use when key-value pairs are used. If one has to know the name of the speaker of the 10th proposal, they can simply do proposals[10]["author"]

The same applies if you store it in a list, right?
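To make the point concrete: a plain list of proposal dicts supports exactly the same lookup as the dict-of-dicts version, so nothing is lost by switching. The sample data below is made up purely for illustration.

```python
# List storage: the index is the position in the list.
proposals = [{"author": "speaker%d" % i} for i in range(20)]
assert proposals[10]["author"] == "speaker10"

# Dict storage differs only in that the index is an explicit key.
proposals_by_number = {i: {"author": "speaker%d" % i} for i in range(20)}
assert proposals_by_number[10]["author"] == "speaker10"
```

The list form additionally gives ordered iteration and `len()` for free, which is usually what a JSON dump of proposals wants.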
Right, got you!

@ananyo2012 Changed it to a list instead of a dict.
}
No newline at end of file

Please add a newline here.

It was just a test file, not of any importance.
# Obey robots.txt rules
ROBOTSTXT_OBEY = True

# Configure maximum concurrent requests performed by Scrapy (default: 16)

Why do we have so many comments?

They are added by default by scrapy.
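Scrapy's startproject template does generate a settings.py full of commented-out options, but they can be deleted. A trimmed sketch keeping only what this project actually sets might look like the following; the module paths are assumptions, not taken from the PR.

```python
# Trimmed settings.py sketch: only the settings the project uses.
# BOT_NAME comes from the log line in this thread; the module paths
# below are assumptions about the project layout.
BOT_NAME = "proposal"

SPIDER_MODULES = ["cfp_crawler.spiders"]
NEWSPIDER_MODULE = "cfp_crawler.spiders"

# Obey robots.txt rules
ROBOTSTXT_OBEY = True
```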
@aaqaishtyaq please review this PR as you are also working with scrapy these days.
import scrapy
import json
from scrapy import signals

There should be 2 empty lines after import statements.
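For reference, the PEP 8 rule being cited is two blank lines between the import block and the first top-level definition. A stdlib-only sketch (the class body is a stand-in, not the PR's actual spider):

```python
import json


# Two blank lines above separate the import block from the first
# top-level definition, as PEP 8 requires.
class CrawlerSpider:
    name = "crawler"

    def to_json(self):
        # trivial method so the sketch does something observable
        return json.dumps({"name": self.name})
```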
ananyo2012 left a comment

@Niveshkrishna can you fix the things I mentioned in the comments? Write good documentation for the usage, and also remove the binary .pyc files from the changes and add them to .gitignore.
@@ -0,0 +1,3 @@
### Basic Usage

scrapy crawl crawler

Can you add some documentation for the project?
cfp_crawler/logs.log (Outdated)

@@ -0,0 +1,2947 @@
2018-06-27 12:25:46 [scrapy.utils.log] INFO: Scrapy 1.5.0 started (bot: proposal)

Is this a generated file? Then please remove it and add it to .gitignore.
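A .gitignore covering the generated files mentioned in this thread might contain entries like the following (the exact paths are assumptions based on the filenames discussed above):

```
# Python bytecode
*.pyc
__pycache__/

# Generated at crawl time
logs.log
proposals.json

# Virtual environment
venv/
```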
cfp_crawler/proposals.json (Outdated)

This is the data dump, right? Just keep one sample and remove the rest.
created_on = response.xpath("//p[@class='text-muted']/small/b/time/text()").extract()[0].strip()

section = response.xpath("//section[@class='col-sm-8 proposal-writeup']/div")
some_dic = {}

Rename this to proposal; it is hard to understand what some_dic is.
some_dic = {}
for div in section:
    heading = div.xpath(".//h4[@class='heading']/b/text()").extract()[0]
    data = self.format_data(div.xpath(".//text()").extract(), heading)

Ok, got it. This is the main proposal content. How about we name the variables as:
heading => section_heading
data => section_content

Also, just being curious: does div.xpath() return 2 values in a tuple? Since I see format_data() takes in 2 arguments.

Which div.xpath() are you referring to?

@Niveshkrishna NVM, I got it. div.xpath(".//text()").extract() is the first argument and heading is the second. Rename the variables as I suggested. Also, to be clearer, make a separate variable for div.xpath(".//text()").extract(), say raw_section_content.
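Putting all the suggested renames together, the loop could read like the sketch below. The scrapy selectors are stubbed out with plain dicts so it runs standalone, and format_data is replaced by a trivial stand-in; only the naming is the point here.

```python
def parse_sections(sections, format_data):
    # proposal replaces some_dic; other names follow the review:
    # heading -> section_heading, data -> section_content, and the
    # extracted text gets its own variable, raw_section_content.
    proposal = {}
    for div in sections:
        section_heading = div["heading"]       # was div.xpath(".//h4...").extract()[0]
        raw_section_content = div["texts"]     # was div.xpath(".//text()").extract()
        section_content = format_data(raw_section_content, section_heading)
        proposal[section_heading[:-1]] = section_content  # strip trailing ":"
    return proposal

# Tiny stand-in for self.format_data, just for this sketch
sections = [{"heading": "Description:", "texts": [" abstract "]}]
result = parse_sections(sections, lambda texts, h: "".join(texts).strip())
# result == {"Description": "abstract"}
```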
some_dic[heading[:-1]] = data

table = response.xpath("//table/tr")
for col in table:

And this should be for row in table_rows.
data = data[2:-2]
some_dic[heading[:-1]] = data

table = response.xpath("//table/tr")

This should be table_rows.
table = response.xpath("//table/tr")
for col in table:
    heading = col.xpath(".//td/small/text()").extract()[0].strip()
    data = col.xpath(".//td/text()").extract()[0].strip()

How about we name these variables as:
heading => extra_info_heading
data => extra_info_content
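With those renames applied, the table loop could look like this sketch. The rows are plain dicts here so it runs without scrapy; the sample row values are invented for illustration.

```python
def parse_extra_info(table_rows):
    # Names per the review: table -> table_rows, col -> row,
    # heading -> extra_info_heading, data -> extra_info_content.
    extra_info = {}
    for row in table_rows:
        extra_info_heading = row["small_text"].strip()  # was col.xpath(".//td/small/text()")
        extra_info_content = row["td_text"].strip()     # was col.xpath(".//td/text()")
        extra_info[extra_info_heading] = extra_info_content
    return extra_info

rows = [{"small_text": " Section ", "td_text": " Core Python "}]
info = parse_extra_info(rows)
# info == {"Section": "Core Python"}
```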
allowed_domains = ['in.pycon.org']
url = "https://in.pycon.org"
proposals = []
file = open("proposals.json", "w")

The file is opened but never closed. You should close the file after use. The best way to do this would be to use with. Something like:

with open("filename.json", "w") as f:
    f.write("data")

Move it inside the spider_closed method and, if you want, make the filename a member.

Oops! Forgot to close it.
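One way to apply the with suggestion inside spider_closed is sketched below. The scrapy base class is skipped so the sketch runs standalone, and keeping the filename as a class member follows the reviewer's suggestion.

```python
import json


class ProposalWriterSketch:
    # Stand-in for the spider's file handling; no scrapy needed here.
    filename = "proposals.json"  # filename kept as a member, per the review
    proposals = []

    def spider_closed(self, spider=None):
        # 'with' guarantees the file is closed even if json.dump raises,
        # so no explicit close() call is needed.
        with open(self.filename, "w") as f:
            json.dump(self.proposals, f, indent=2, sort_keys=True)
```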
@Niveshkrishna also it would be good to add a requirements.txt (pip freeze > requirements.txt) and, as @ananyo2012 suggested, documentation on how to set up the project using virtualenv/pipenv.
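The setup flow being asked for might look roughly like this; the directory name is an assumption, and the dependency install is shown as a comment since it needs network access.

```shell
# Sketch: one-time project setup with virtualenv
python3 -m venv venv           # create an isolated environment
. venv/bin/activate            # activate it
# pip install scrapy           # project dependencies go here
pip freeze > requirements.txt  # pin installed versions for contributors
```

With pipenv the equivalent would be `pipenv install scrapy`, which maintains a Pipfile instead of requirements.txt.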
class CrawlerSpider(scrapy.Spider):
    name = 'crawler'

Also, can you change the name of the spider to something like cfpcrawler or cfp?
def spider_closed(self, spider):
    print("Closing spider")
    json.dump(self.proposals, self.file, indent = 2, sort_keys = True)

It's a common practice to use pipelines to save scrapy items to a file. Something like this in pipelines.py:

class JsonWritePipeline(object):

    def open_spider(self, spider):
        self.file = open('proposal.json', 'w')

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        line = json.dumps(dict(item)) + "\n"
        self.file.write(line)
        return item

True, I could have used middlewares.py as well. But since this is not so complicated to crawl, I just hardcoded it everywhere.

There's always the possibility of extending any project; don't hardcode anything unless that's the only way out.

cc: @realslimshanky
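If the pipeline route is taken, it also has to be enabled in settings.py via ITEM_PIPELINES; the dotted path below is an assumption based on the project layout discussed in this PR.

```python
# settings.py -- activate the pipeline. The integer (0-1000) is the
# execution order when several pipelines are enabled; lower runs first.
ITEM_PIPELINES = {
    "cfp_crawler.pipelines.JsonWritePipeline": 300,
}
```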
class ProposalItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()

@Niveshkrishna, you are not declaring any item in items.py. Why?

Unless there are multiple crawlers using the same item, there is not much difference between using items and normal variables. Since there is only one crawler, it is okay to declare them directly in the crawler.py file. Or is it not? Is there any advantage to using items in this case?

You are right. But it would help to reuse a bunch of code in the future and keep everything tidy, clean, and understandable.
I think everything except

@Niveshkrishna Can you complete the documentation so that we can merge this PR?

This crawler crawls through all the proposals on the CfP page and saves them into the proposals.json file.