
You don't really install a "cluster computer". You install each machine and then some software that can run in a cluster, like for example a database.

Remember to either quote or @mention others, so they are notified of your reply


1 minute ago, Electronics Wizardy said:

What do you want the cluster to do? What are your goals with the cluster?

Do you want to run VMs on the cluster? Shared storage? Compute?

Probably for running and compiling code that would take forever on my main pc


Just now, archiso said:

Probably for running and compiling code that would take forever on my main pc

What type of code? What compiler?

Normally this is pretty hard to cluster for a single piece of code. If you're making multiple builds, you can have each computer do one build.

You can't just make a cluster act as one fast virtual computer. Because of latency, you have to split up the work first.


Just now, Eigenvektor said:

You don't really install a "cluster computer". You install each machine and then some software that can run in a cluster, like for example a database.

Sorry for not explaining it correctly. I know this; I just don't know what software to use or how to set it up. I have also tried Slurm, but I couldn't figure out how to set it up.


Just now, archiso said:

Probably for running and compiling code that would take forever on my main pc

Never heard of a cluster-capable compiler. The amount of data to transfer back and forth would probably eat up any improvement from compiling in parallel on different machines. Even on a single machine, compiling isn't always multithreaded.


1 minute ago, Electronics Wizardy said:

What type of code? What compiler?

Normally this is pretty hard to cluster for a single piece of code. If you're making multiple builds, you can have each computer do one build.

You can't just make a cluster act as one fast virtual computer. Because of latency, you have to split up the work first.

The compiling will mostly be Unity projects, and the code is mainly Python. I'm trying to make a search engine in Python right now.


1 minute ago, Eigenvektor said:

Never heard of a cluster-capable compiler. The amount of data to transfer back and forth would probably eat up any improvement from compiling in parallel on different machines. Even on a single machine, compiling isn't always multithreaded.

That is probably correct. It will mainly be for running Python code, though.


3 minutes ago, archiso said:

The compiling will mostly be Unity projects, and the code is mainly Python. I'm trying to make a search engine in Python right now.

Well, you don't normally compile Python. If this is for Unity, I don't think they have a Linux version anyway.

 

2 minutes ago, archiso said:

That is probably correct. It will mainly be for running Python code, though.

What is that python code doing?

 

 

Really, just get a single faster computer. Some things just don't make sense to cluster, and this is one of them.


5 minutes ago, Electronics Wizardy said:

Well, you don't normally compile Python. If this is for Unity, I don't think they have a Linux version anyway.

 

What is that python code doing?

 

 

Really, just get a single faster computer. Some things just don't make sense to cluster, and this is one of them.

The Python code runs through websites to find URLs, then runs through those to find more, and so on. Then it runs a search algorithm over every website it found and returns an ordered list of the websites. Also, I probably won't respond to anything for about an hour.


1 minute ago, archiso said:

The Python code runs through websites to find URLs, then runs through those to find more, and so on. Then it runs a search algorithm over every website it found and returns an ordered list of the websites. Also, I probably won't respond to anything for about an hour.

Could you make 3 lists of websites and then have each computer do 1/3 of the websites? That should do it in about 1/3 of the time.
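
A minimal sketch of that split, assuming the crawl starts from a known list of URLs. The function name `split_round_robin` and the example.com URLs are made up for illustration; they are not from the project being discussed.

```python
def split_round_robin(urls, n_machines):
    """Deal URLs out like cards so every machine gets a similar share."""
    return [urls[i::n_machines] for i in range(n_machines)]


# Hypothetical seed list; a real crawler would load its own.
urls = [f"https://example.com/page{i}" for i in range(7)]
shares = split_round_robin(urls, 3)

for machine, share in enumerate(shares):
    print(f"machine {machine}: {len(share)} URLs")
```

Each machine would then crawl only its own share. The "about 1/3 of the time" estimate only holds if the pages take roughly equal time and the machines are not all waiting on the same internet connection.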


36 minutes ago, Electronics Wizardy said:

Could you make 3 lists of websites and then have each computer do 1/3 of the websites? That should do it in about 1/3 of the time.

With how the algorithm works, I don't think it would be very effective, and it would be better if I could automate it. Also, I just want to mess around with cluster computing and with coding programs for clusters.


1 hour ago, Electronics Wizardy said:

Well, you don't normally compile Python. If this is for Unity, I don't think they have a Linux version anyway.

They do; it's shipped as an AppImage.

 

1 hour ago, archiso said:

The Python code runs through websites to find URLs, then runs through those to find more, and so on. Then it runs a search algorithm over every website it found and returns an ordered list of the websites. Also, I probably won't respond to anything for about an hour.

As for Python, it's compiled/interpreted at runtime. Regardless, for something like this, it would be up to you to make it run across multiple machines. If you're storing data in a database, just have each machine check in before iterating through a site.

Parallel processing probably isn't what you think it is. These systems are set up for very specific use cases and rely on the user to divide the work, if it can be divided.

For compiling, this can be beneficial if you have a group of projects that don't depend on each other being compiled first. You can share the source across multiple machines and have them compile the projects independently at the same time. At the end of the day, however, they are just individual machines performing a task.


1 minute ago, Nayr438 said:

They do; it's shipped as an AppImage.

As for Python, it's compiled/interpreted at runtime. Regardless, for something like this, it would be up to you to make it run across multiple machines. If you're storing data in a database, just have each machine check in before iterating through a site.

Parallel processing probably isn't what you think it is. These systems are set up for very specific use cases and rely on the user to divide the work, if it can be divided.

For compiling, this can be beneficial if you have a group of projects that don't depend on each other being compiled first. You can share the source across multiple machines and have them compile the projects independently at the same time. At the end of the day, however, they are just individual machines performing a task.

What program would you recommend for doing something like this?


Just now, archiso said:

What program would you recommend for doing something like this?

For what?

Compiling? Set up shared storage and separate the build directories. Then you can set up a shell script that issues the commands over SSH to compile each project on its target machine.
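
A rough sketch of that idea, written in Python rather than shell so it matches the rest of the project. Everything specific here is an assumption for illustration: the host names, the `/mnt/shared/...` paths, the use of `make` as the build command, and passwordless SSH keys between the machines.

```python
import shlex
import subprocess

# Hypothetical mapping of host -> project directory on the shared storage.
BUILDS = {
    "node1": "/mnt/shared/project-a",
    "node2": "/mnt/shared/project-b",
    "node3": "/mnt/shared/project-c",
}


def ssh_build_command(host, project_dir, build_cmd="make"):
    """Return the argv list that runs build_cmd inside project_dir on host."""
    remote = f"cd {shlex.quote(project_dir)} && {build_cmd}"
    return ["ssh", host, remote]


def run_all(builds, dry_run=True):
    """Start every build at once, then wait for all of them to finish."""
    commands = [ssh_build_command(host, path) for host, path in builds.items()]
    if dry_run:
        return commands  # only report what would be run
    procs = [subprocess.Popen(cmd) for cmd in commands]
    return [proc.wait() for proc in procs]  # one exit code per host


for cmd in run_all(BUILDS):
    print(shlex.join(cmd))
```

With `dry_run=True` nothing is executed; the script only prints the SSH commands. Passing `dry_run=False` would start all builds in parallel and return their exit codes.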


1 minute ago, Nayr438 said:

For what?

Compiling? Set up shared storage and separate the build directories. Then you can set up a shell script that issues the commands over SSH to compile each project on its target machine.

No, for running Python code.


Just now, archiso said:

No, for running Python code.

That would be entirely up to you implementing it in your code. You just need a central point to check in and compare against. If another machine has already checked in, then continue through your list to the next available one that hasn't. PostgreSQL may be good for this. There isn't a program that will just make python run across multiple machines.


1 hour ago, Nayr438 said:

That would be entirely up to you implementing it in your code. You just need a central point to check in and compare against. If another machine has already checked in, then continue through your list to the next available one that hasn't. PostgreSQL may be good for this. There isn't a program that will just make python run across multiple machines.

Ok. What is the purpose of programs like Slurm then?


5 minutes ago, archiso said:

Ok. What is the purpose of programs like Slurm then?

 

Quote

The Slurm Workload Manager, formerly known as Simple Linux Utility for Resource Management (SLURM), or simply Slurm, is a free and open-source job scheduler for Linux and Unix-like kernels, used by many of the world's supercomputers and computer clusters.

It provides three key functions:

  • allocating exclusive and/or non-exclusive access to resources (computer nodes) to users for some duration of time so they can perform work,
  • providing a framework for starting, executing, and monitoring work, typically a parallel job such as Message Passing Interface (MPI) on a set of allocated nodes, and
  • arbitrating contention for resources by managing a queue of pending jobs.

 

It's just a workload manager. It basically allocates systems and resources, starts whatever tasks are assigned to them, and monitors their current status.


3 minutes ago, Nayr438 said:

 

 

It's just a workload manager. It basically allocates systems and resources, starts whatever tasks are assigned to them, and monitors their current status.

So it's basically what I need, but it doesn't work with Python, so I have to make one in Python?


3 minutes ago, archiso said:

So it's basically what I need, but it doesn't work with Python, so I have to make one in Python?

It just runs tasks. It would be the equivalent of you just running your script on 3 machines. It doesn't matter what the task is or how it's made; Slurm doesn't care.

The task/program that's being executed has to be written for parallel processing in order to use parallel processing. The task itself needs to have its own way to communicate with the task on the other machine.
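
To make that concrete, here is a small sketch of a script that decides for itself which part of the work it owns when Slurm starts several copies of it. It assumes the copies are launched with `srun`, which sets `SLURM_PROCID` and `SLURM_NTASKS` for each task; the function name `my_share` and the example URLs are made up for illustration.

```python
import os
import zlib


def my_share(urls, rank, world_size):
    """Keep only the URLs whose stable hash maps to this task's rank."""
    # crc32 is used instead of hash() because Python randomizes str hashes
    # per process, so different machines would disagree about ownership.
    return [u for u in urls if zlib.crc32(u.encode()) % world_size == rank]


# Outside Slurm the variables are unset, so the script falls back to
# behaving as a single task that owns everything.
rank = int(os.environ.get("SLURM_PROCID", "0"))
world_size = int(os.environ.get("SLURM_NTASKS", "1"))

# Hypothetical URL list; a real crawler would discover these as it goes.
urls = [f"https://example.com/page{i}" for i in range(10)]
print(f"task {rank} of {world_size} handles {my_share(urls, rank, world_size)}")
```

Slurm only starts the copies. The script still has to split the work, and any exchange of results between tasks would need its own channel, such as a shared database or MPI.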


1 hour ago, Nayr438 said:

It just runs tasks. It would be the equivalent of you just running your script on 3 machines. It doesn't matter what the task is or how it's made; Slurm doesn't care.

The task/program that's being executed has to be written for parallel processing in order to use parallel processing. The task itself needs to have its own way to communicate with the task on the other machine.

OK, is there any way to code my Python project for parallel processing?


There are a number of questions you should be asking yourself:

 

1. Can the problem I'm trying to solve be subdivided?

Not every problem is suitable for multi-processing or more specifically distributed computing.

 

As @Electronics Wizardy pointed out, you could modify your crawler so that each node is responsible for some part of the search space (i.e. x out of $total URLs).

 

2. Is subdivision an efficient solution?

To be suitable for distributed computing, the work done by each node should be largely independent, with minimal coordination needed between them. The more coordination you need, the less efficient distribution can become.

Additionally the performance of your nodes should not be bound by a shared resource like network.

 

If your search algorithm is mainly limited by network performance, you're not going to gain much, if anything, if your nodes share the same (slow) internet connection.

 

On the other hand, if your search algorithm is mainly bound by CPU performance then distribution should work fine.

 

3. Which strategy is optimal for distribution?

You need to determine how much coordination is needed between nodes and how to best go about it.

 

In some cases a shared database and transactions/locks may be enough to ensure nodes don't do the same (redundant) work. If this isn't enough, can nodes coordinate directly with each other, or is some central coordination (e.g. a work server) the best solution?

 

4. Is language X a suitable choice?

My experience with Python is extremely limited, so I can't make a reasonable recommendation based on this.

In my limited understanding Python's GIL makes working with multiple threads somewhat difficult, so maybe a different language would be better.

 

A language that is easy to pick up as a beginner doesn't mean it is necessarily the best choice for every problem out there.
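
On the GIL point, a small sketch may help show where it does and does not get in the way. The GIL mainly limits threads that are busy computing; threads that are waiting on I/O release it, so a crawler's downloads can still overlap. The example below assumes the crawler is I/O-bound and uses a hypothetical `fake_fetch` that sleeps instead of making a real HTTP request.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_fetch(url):
    """Stand-in for a real HTTP request; sleeping releases the GIL like I/O."""
    time.sleep(0.05)
    return url, len(url)


# Hypothetical URLs, used only to have something to iterate over.
urls = [f"https://example.com/page{i}" for i in range(8)]

with ThreadPoolExecutor(max_workers=8) as pool:
    # map() returns results in the same order as the input URLs.
    results = list(pool.map(fake_fetch, urls))

print(results[:2])
```

For the CPU-heavy part, such as ranking the pages, the standard library's `multiprocessing` module or `concurrent.futures.ProcessPoolExecutor` runs work in separate processes, each with its own GIL. Whether that is enough for this project would need measuring.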

 

5 hours ago, archiso said:

OK, is there any way to code my Python project for parallel processing?

I understand that my points above don't address your problem directly. The short answer is probably: Yes, of course it can be done (if you have the know-how/skill/experience).

 

The much more interesting question would be: How can it be done? Take the questions above as a starting point for this.


6 hours ago, Eigenvektor said:

There are a number of questions you should be asking yourself:

 

1. Can the problem I'm trying to solve be subdivided?

Not every problem is suitable for multi-processing or more specifically distributed computing.

 

As @Electronics Wizardy pointed out, you could modify your crawler so that each node is responsible for some part of the search space (i.e. x out of $total URLs).

 

2. Is subdivision an efficient solution?

To be suitable for distributed computing, the work done by each node should be largely independent, with minimal coordination needed between them. The more coordination you need, the less efficient distribution can become.

Additionally the performance of your nodes should not be bound by a shared resource like network.

 

If your search algorithm is mainly limited by network performance, you're not going to gain much, if anything, if your nodes share the same (slow) internet connection.

 

On the other hand, if your search algorithm is mainly bound by CPU performance then distribution should work fine.

 

3. Which strategy is optimal for distribution?

You need to determine how much coordination is needed between nodes and how to best go about it.

 

In some cases a shared database and transactions/locks may be enough to ensure nodes don't do the same (redundant) work. If this isn't enough, can nodes coordinate directly with each other, or is some central coordination (e.g. a work server) the best solution?

 

4. Is language X a suitable choice?

My experience with Python is extremely limited, so I can't make a reasonable recommendation based on this.

In my limited understanding Python's GIL makes working with multiple threads somewhat difficult, so maybe a different language would be better.

 

A language that is easy to pick up as a beginner doesn't mean it is necessarily the best choice for every problem out there.

 

I understand that my points above don't address your problem directly. The short answer is probably: Yes, of course it can be done (if you have the know-how/skill/experience).

 

The much more interesting question would be: How can it be done? Take the questions above as a starting point for this.

Ok, @Franck said the same thing about the main code of this project, that a different language like C#, C++, or Java would be better for this. Maybe JavaScript would work? That would be easier to put into the web front end I'm making.
