
Add MMLSPARK_PYSPARK_CORES to specify CPU core count for PySpark#577

Open
ghost wants to merge 2 commits into microsoft:master from Software-Natives-OSS:master

Conversation

@ghost

@ghost ghost commented Jun 1, 2019

This is just a POC to get early feedback.

I ran mmlspark locally on my notebook and noticed that only 2 of my 6 CPU cores were used when calculating Pi with PySpark, using the code below. I couldn't find an easy out-of-the-box mechanism to tweak this behavior, so I thought it would be nice to make it configurable through env-vars, letting users tweak it during container creation. Hence this pull request.

Please give me feedback on whether you like this feature. If you do, I'll extend the documentation accordingly.

For reference, here's the code I ran:

import random

num_samples = 100000000

def inside(p):
    # Sample a random point in the unit square; keep it if it falls inside the quarter circle.
    x, y = random.random(), random.random()
    return x * x + y * y < 1

# Monte Carlo estimate: the fraction of points inside the quarter circle approaches pi/4.
count = sc.parallelize(range(0, num_samples)).filter(inside).count()
pi = 4.0 * count / num_samples
print(pi)
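
For context, the number of worker threads in Spark local mode is set by the master URL, which is the knob this PR makes configurable. A minimal sketch of a standalone session (not part of this PR's diff; the app name is illustrative):

from pyspark import SparkConf, SparkContext

# "local[N]" runs Spark with N worker threads; "local[*]" uses one per logical core.
conf = SparkConf().setAppName("pi-estimate").setMaster("local[6]")
sc = SparkContext(conf=conf)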

@ghost ghost changed the title from "Add MMLSPARK_PYSPARK_CORES to CPU core count for PySpark" to "Add MMLSPARK_PYSPARK_CORES to specify CPU core count for PySpark" on Jun 1, 2019
@mhamilton723
Contributor

Looks good @zulli73! If you add a line in the docs, I'll merge!

Changed MMLSPARK_PYSPARK_CORES to MMLSPARK_PYSPARK_THREADS to better reflect its purpose. Added documentation.
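
For illustration, a minimal sketch of how such a variable could be consumed when building the master URL (the actual wiring in this PR may differ; the fallback to "*" is an assumption of this sketch):

import os
from pyspark import SparkConf, SparkContext

# Hypothetical wiring: default to "*" (all logical cores) when the variable is unset.
threads = os.environ.get("MMLSPARK_PYSPARK_THREADS", "*")
conf = SparkConf().setMaster("local[{}]".format(threads))
sc = SparkContext(conf=conf)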
@msftclas

msftclas commented Jun 16, 2019

CLA assistant check
All CLA requirements met.

@ghost
Copy link
Author

ghost commented Jun 16, 2019

I've added documentation.

I added a whole new section covering all environment variables because I felt it didn't fit into any of the existing parts of the documentation. I also thought about adding it to the example docker run command, but I didn't want to make that example more complicated than necessary. Hence the new section.

Contributor

@drdarshan drdarshan left a comment


Thank you @zulli73! This looks good to me.

@drdarshan
Contributor

Hello @zulli73, if you don't mind, please resolve the conflict and I'll trigger the merge. Thank you for your contribution!

@ghost
Author

ghost commented Aug 12, 2019

Hi @drdarshan.

Short story: has this pull request become obsolete? Searching the codebase for "local[", all results now use "local[*]", which suggests that the latest version on master may already use all CPU cores.

Long story: I'd happily fix the merge conflicts, but I have trouble understanding the change that caused them, d34f9d1: the file I modified was removed, and it's not obvious to me why it became obsolete. It also seems that no new Docker image has been pushed since that change, so I can't easily check whether Spark utilizes all available CPU cores as of that commit. Finally, I couldn't find the docs for building the Docker image myself/locally.
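
(Aside: a quick way to check this from a running session, using standard SparkContext attributes:)

# sc.master shows the master URL the context was created with;
# in local mode, sc.defaultParallelism equals the number of worker threads.
print(sc.master)              # e.g. "local[*]"
print(sc.defaultParallelism)  # e.g. 6 on a 6-core machine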

@mhamilton723 mhamilton723 self-requested a review as a code owner November 17, 2022 12:40