
[PHP] Asynchronously download hundreds of small files.

joncppl

I'm working on a project that involves fetching some data from another server (which is beyond my control). This data happens to be broken into several hundred fairly small XML files.

Now, this server (or the connection to it) is fairly slow; it takes a few seconds to grab each file. Consequently, if I fetch all of these XML files sequentially, the PHP script takes minutes to execute.
 

Now, I have minimal experience with threading, but what I do know is that you shouldn't create a plethora of threads.
For example, after compiling PHP with thread support, I humored myself by creating one thread per file to download. Lo and behold, my script executed in seconds, with accurate results. Unfortunately this plays havoc with thread management and all that, and after executing a few times the PC just dies, which is behavior I wasn't expecting.

 

Does anyone know the best solution to this type of problem? Am I close, or am I approaching it in completely the wrong way? (If possible I'd rather avoid using PHP threads, because I don't like using my self-compiled Apache/PHP; I much prefer the Debian-bundled versions.)

 


Is there an FTP you can connect to?

 

Also, how are the links you're downloading presented?



How do you fetch the XML files? HTTP? FTP?

If it's plain HTTP, a technique I'm familiar with is non-blocking sockets handled with select().

This could be overly complicated, but that's what I know, and it does exactly what you need: it allows parallel downloads without multithreading.
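
Roughly what I mean, as a sketch: open one non-blocking socket per URL, then loop over stream_select() until every response has arrived. It assumes plain HTTP on port 80, and fetch_all() is just an illustrative name; a real version would also handle partial writes, redirects and errors.

<?php
// Sketch of the select() approach: open one non-blocking socket per URL,
// send a plain HTTP/1.0 GET once the connection is writable, then read all
// the responses as they trickle in. Error handling is deliberately thin.

function fetch_all(array $urls)
{
    $pendingWrite = [];   // sockets still waiting to send their request
    $pendingRead  = [];   // sockets we are reading a response from
    $requests     = [];   // raw request string, keyed by socket id
    $keys         = [];   // socket id => URL key
    $results      = [];   // URL key => raw response (headers + body)

    foreach ($urls as $key => $url) {
        $p    = parse_url($url);
        $path = (isset($p['path']) ? $p['path'] : '/')
              . (isset($p['query']) ? '?' . $p['query'] : '');

        $s = @stream_socket_client(
            "tcp://{$p['host']}:80", $errno, $errstr, 0,
            STREAM_CLIENT_ASYNC_CONNECT | STREAM_CLIENT_CONNECT
        );
        if ($s === false) {
            continue;
        }
        stream_set_blocking($s, false);

        $id                = (int) $s;
        $keys[$id]         = $key;
        $requests[$id]     = "GET $path HTTP/1.0\r\nHost: {$p['host']}\r\nConnection: close\r\n\r\n";
        $results[$key]     = '';
        $pendingWrite[$id] = $s;
    }

    while ($pendingWrite || $pendingRead) {
        $read   = array_values($pendingRead);
        $write  = array_values($pendingWrite);
        $except = null;

        $ready = stream_select($read, $write, $except, 10);
        if ($ready === false || $ready === 0) {
            break;   // select error, or 10 s with no activity: give up
        }

        // Connection established: send the request (a robust version would
        // check for partial writes), then wait for the reply.
        foreach ($write as $s) {
            $id = (int) $s;
            fwrite($s, $requests[$id]);
            unset($pendingWrite[$id]);
            $pendingRead[$id] = $s;
        }

        // Data (or EOF) available: append to that URL's buffer.
        foreach ($read as $s) {
            $id    = (int) $s;
            $chunk = fread($s, 8192);
            if ($chunk === '' || $chunk === false) {
                fclose($s);                  // EOF: this response is done
                unset($pendingRead[$id]);
            } else {
                $results[$keys[$id]] .= $chunk;
            }
        }
    }

    // Each entry still contains the HTTP headers; split on "\r\n\r\n"
    // to get at the XML body.
    return $results;
}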


Is there an FTP you can connect to?

 

Also, how are the links you're downloading presented?

 

Specifically, I'm connecting to the MyAnimeList API (which sucks, at least for my purposes):

http://myanimelist.net/modules.php?go=api

I'm acquiring and storing the data from search queries, i.e. I search by name, cross-reference the ID it returns (because there is no ID-based lookup???!) with the one I want, then grab that data.

The problem is that I want to grab the information for about 500 entries at once, and it's impossible to pull that with a single query. Actually handling the data is trivial; the only issue is making so many HTTP requests in sequence.

I'm having my script run from a crontab once every few hours to update my database with the data I'll be using, so I can then get at it nice and fast. It works as it is and doesn't tax the server too much, but it'd be nice if it could go faster. I feel uncomfortable having a script that takes that long to execute, even if it isn't outwardly accessible.
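
Incidentally, one thread-free approach that looks like it would fit is the cURL extension's curl_multi interface, which can drive many HTTP transfers in parallel from a single script. A rough sketch (fetch_batch() and the cap of 10 parallel transfers are illustrative names, not anything from the MyAnimeList API):

<?php
// Sketch of parallel HTTP GETs with PHP's curl_multi interface: all
// transfers run inside one process, capped so only a handful hit the
// remote server at once.

function fetch_batch(array $urls, $maxParallel = 10)
{
    $multi   = curl_multi_init();
    $queue   = array_values($urls);   // URLs not yet started
    $handles = [];                    // URL => curl handle currently running
    $results = [];                    // URL => response body

    // Start the next queued URL on a fresh easy handle.
    $startNext = function () use (&$queue, &$handles, $multi) {
        $url = array_shift($queue);
        $ch  = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($ch, CURLOPT_TIMEOUT, 30);
        curl_multi_add_handle($multi, $ch);
        $handles[$url] = $ch;
    };

    // Prime the pool up to the concurrency cap.
    while ($queue && count($handles) < $maxParallel) {
        $startNext();
    }

    do {
        curl_multi_exec($multi, $running);
        curl_multi_select($multi, 1.0);   // wait until a transfer has activity

        // Harvest finished transfers and top the pool back up.
        while ($info = curl_multi_info_read($multi)) {
            $ch  = $info['handle'];
            $url = array_search($ch, $handles, true);

            $results[$url] = curl_multi_getcontent($ch);
            curl_multi_remove_handle($multi, $ch);
            curl_close($ch);
            unset($handles[$url]);

            if ($queue) {
                $startNext();
            }
        }
    } while ($running || $handles);

    curl_multi_close($multi);
    return $results;
}

// e.g. $xmlByUrl = fetch_batch($searchUrls, 10);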

 


Perhaps a better solution would be to selectively update a few entries at a time, spreading the load out. I also don't want to tax the server I'm making the requests to too much.

As it is I have what I want working; I just want to make sure I'm doing it right, following best practices and all that.
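
Something along these lines is what I have in mind for spreading it out: each cron run refreshes only the stalest entries, so the full set cycles through over several runs. A sketch, where the anime table, its mal_id / last_updated columns, and fetch_entry() are purely illustrative:

<?php
// Sketch of "update a few at a time": each cron run refreshes only the
// entries that have gone the longest without an update.

$pdo = new PDO('mysql:host=localhost;dbname=mydb', 'user', 'pass');

$batchSize = 50;

// Pick the entries that have waited longest for a refresh.
$rows = $pdo->query(
    'SELECT mal_id, title FROM anime ORDER BY last_updated ASC LIMIT ' . (int) $batchSize
)->fetchAll(PDO::FETCH_ASSOC);

$update = $pdo->prepare(
    'UPDATE anime SET data = :data, last_updated = NOW() WHERE mal_id = :id'
);

foreach ($rows as $row) {
    // fetch_entry() stands in for the existing search-and-match routine
    // that pulls one entry's XML from the API.
    $xml = fetch_entry($row['title'], $row['mal_id']);
    if ($xml !== false) {
        $update->execute([':data' => $xml, ':id' => $row['mal_id']]);
    }
    usleep(250000);   // short pause between requests to stay polite
}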

