chhsiao90 opened a new issue, #1247:
URL: https://github.com/apache/incubator-stormcrawler/issues/1247

   What kind of issue is this?
   
    - [ ] Question. This issue tracker is not the best place for questions. If you want to ask how to do something, or to understand why something isn't working the way you expect it to, use StackOverflow instead with the label 'stormcrawler': https://stackoverflow.com/questions/tagged/stormcrawler
   
    - [x] Bug report. If you’ve found a bug, please include a test if you can, it makes it a lot easier to fix things. Use the label 'bug' on the issue.
    
    - [ ] Feature request. Please use the label 'wish' on the issue.
   
   ### Reproduce steps
   
   To reproduce, run the HttpProtocol main function against many URLs with MultiProxyManager configured.
   
   The crawler.conf:
   ```
   config:
     http.agent.name: test
     http.proxy.manager: org.apache.stormcrawler.proxy.MultiProxyManager
     http.proxy.file: proxies
     http.robots.file.skip: true
   ```
   
   The proxies file:
   ```
   http://first:password@proxy1:8888
   http://second:password@proxy2:8888
   ```
   
   ### Root cause
   
   The HttpProtocol implementations (both okhttp and apache) are not thread-safe:
   - the same instance, initialized by ProxyFactory, may be used by different bolts (different workers) at the same time
   - the shared request/client builder is mutated by different bolts/threads at the same time
   
   Example 1 (wrong proxy auth)
   - (Thread 2) builder.setProxy(secondProxy)
   - (Thread 1) builder.setProxy(firstProxy)
   - (Thread 1) builder.setAuth(firstAuth)
   - (Thread 2) builder.setAuth(secondAuth)
   - (Thread 1) builder.build()
   - Result: Thread 1's request is built with firstProxy + secondAuth
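   
   The interleaving in Example 1 can be replayed deterministically on a single thread. This is only a sketch: `ToyRequestBuilder` and `SharedBuilderReplay` are hypothetical names made up for illustration, not StormCrawler, OkHttp or Apache HttpClient classes. The calls are issued in exactly the order listed above.
   
   ```java
   // Hypothetical stand-in for a shared, mutable request builder.
   // It is NOT a StormCrawler, OkHttp or Apache HttpClient class.
   final class ToyRequestBuilder {
       private String proxy;
       private String auth;

       ToyRequestBuilder setProxy(String proxy) { this.proxy = proxy; return this; }
       ToyRequestBuilder setAuth(String auth) { this.auth = auth; return this; }

       // Snapshot of whatever the fields hold at the moment build() is called.
       String build() { return proxy + " + " + auth; }
   }

   final class SharedBuilderReplay {
       // Replays Example 1 step by step against one shared builder instance.
       static String replayExample1() {
           ToyRequestBuilder shared = new ToyRequestBuilder();
           shared.setProxy("secondProxy"); // (Thread 2)
           shared.setProxy("firstProxy");  // (Thread 1)
           shared.setAuth("firstAuth");    // (Thread 1)
           shared.setAuth("secondAuth");   // (Thread 2)
           return shared.build();          // (Thread 1)
       }

       public static void main(String[] args) {
           System.out.println(replayExample1()); // prints "firstProxy + secondAuth"
       }
   }
   ```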
   
   Example 2 (wrong proxy used)
   - (Thread 1) builder.setProxy(firstProxy)
   - (Thread 1) builder.setAuth(firstAuth)
   - (Thread 2) builder.setProxy(secondProxy)
   - (Thread 2) builder.setAuth(secondAuth)
   - (Thread 1) builder.build()
   - Result: Thread 1's request is built with secondProxy + secondAuth, so both requests use the second proxy
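   
   One possible way to avoid both interleavings is to stop sharing the builder and create a fresh one for each request, so that all mutation stays on the calling thread. The sketch below assumes this approach; it is not necessarily the fix the project will adopt, and `PerRequestBuilder` and `ThreadConfinedFetcher` are hypothetical names, not existing StormCrawler classes.
   
   ```java
   import java.util.ArrayList;
   import java.util.List;
   import java.util.concurrent.Callable;
   import java.util.concurrent.ExecutionException;
   import java.util.concurrent.ExecutorService;
   import java.util.concurrent.Executors;
   import java.util.concurrent.Future;

   // Hypothetical builder; one instance is created per request and never shared.
   final class PerRequestBuilder {
       private String proxy;
       private String auth;

       PerRequestBuilder setProxy(String proxy) { this.proxy = proxy; return this; }
       PerRequestBuilder setAuth(String auth) { this.auth = auth; return this; }
       String build() { return proxy + " + " + auth; }
   }

   final class ThreadConfinedFetcher {
       // No builder field here: the builder is a local variable, so no other
       // thread can change it between setProxy(), setAuth() and build().
       static String buildRequest(String proxy, String auth) {
           return new PerRequestBuilder().setProxy(proxy).setAuth(auth).build();
       }

       // Builds requests from 4 threads, alternating between the two proxies,
       // and counts results whose proxy and auth do not belong together.
       static int countMismatches(int requests) {
           ExecutorService pool = Executors.newFixedThreadPool(4);
           try {
               List<Future<String>> results = new ArrayList<>();
               for (int i = 0; i < requests; i++) {
                   String id = (i % 2 == 0) ? "first" : "second";
                   Callable<String> task = () -> buildRequest(id + "Proxy", id + "Auth");
                   results.add(pool.submit(task));
               }
               int mismatches = 0;
               for (Future<String> f : results) {
                   String r = f.get();
                   boolean consistent = r.equals("firstProxy + firstAuth")
                           || r.equals("secondProxy + secondAuth");
                   if (!consistent) {
                       mismatches++;
                   }
               }
               return mismatches;
           } catch (InterruptedException | ExecutionException e) {
               throw new IllegalStateException(e);
           } finally {
               pool.shutdown();
           }
       }

       public static void main(String[] args) {
           System.out.println("mismatches: " + countMismatches(1000));
       }
   }
   ```
   
   Synchronizing the whole set-and-build sequence on the shared builder would also prevent the race, at the cost of serializing request construction across threads.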


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
