IPython Configuration

This installation workflow loosely follows the one contributed by Fernando Perez here. This should be performed on the machine where the IPython Notebook will be executed, typically one of the Hadoop nodes.

First create an IPython profile for use with PySpark.

    ipython profile create pyspark

This should have created the profile directory ~/.ipython/profile_pyspark/. Edit the file ~/.ipython/profile_pyspark/ipython_notebook_config.py to have:

    c = get_config()

    c.NotebookApp.ip = '*'
    c.NotebookApp.open_browser = False
    c.NotebookApp.port = 8880  # or whatever you want; be aware of conflicts with CDH

If you want a password prompt as well, first generate a password for the notebook app:

    python -c 'from IPython.lib import passwd; print(passwd())' > ~/.ipython/profile_pyspark/nbpasswd.txt

and set the following in the same ipython_notebook_config.py file you just edited:

    import os
    # open() does not expand '~', so expand it explicitly
    PWDFILE = os.path.expanduser('~/.ipython/profile_pyspark/nbpasswd.txt')
    c.NotebookApp.password = open(PWDFILE).read().strip()

Finally, create the file ~/.ipython/profile_pyspark/startup/00-pyspark-setup.py with the following contents:

    import os
    import sys

    spark_home = os.environ.get('SPARK_HOME', None)
    if not spark_home:
        raise ValueError('SPARK_HOME environment variable is not set')
    sys.path.insert(0, os.path.join(spark_home, 'python'))
    sys.path.insert(0, os.path.join(spark_home, 'python/lib/py4j-0.8.1-src.zip'))
    execfile(os.path.join(spark_home, 'python/pyspark/shell.py'))

Starting IPython Notebook with PySpark

IPython Notebook should be run on a machine from which PySpark can be run, typically one of the Hadoop nodes.

First, make sure the following environment variables are set:

    # for the CDH-installed Spark
    export SPARK_HOME='/opt/cloudera/parcels/CDH/lib/spark'

    # this is where you specify all the options you would normally add after bin/pyspark
    export PYSPARK_SUBMIT_ARGS='--master yarn --deploy-mode client --num-executors 24 --executor-memory 10g --executor-cores 5'

Note that you must set whatever other environment variables you want to get Spark running the way you desire. For example, the settings above are consistent with running the CDH-installed Spark in YARN-client mode. If you wanted to run your own custom Spark, you could build it, put the JAR
on HDFS, set the SPARK_JAR environment variable, along with any other necessary parameters. For example, see here for running a custom Spark on YARN.

Finally, decide from what directory to run the IPython Notebook. This directory will contain the .ipynb files that represent the different notebooks that can be served. See the IPython docs for more information. From this directory, execute:

    ipython notebook --profile=pyspark

Note that if you just want to serve the notebooks without initializing Spark, you can start IPython Notebook using a profile that does not execute the shell.py script in the startup file.

Example Session

At this point, the IPython Notebook server should be running. Point your browser to :8880/, which should open up the main access point to the available notebooks. This should look something like this:

This will show the list of possible .ipynb files to serve. If it is empty (because this is the first time you’re running it) you can create a new notebook, which will also create a new .ipynb file.

As an example, here is a screenshot from a session that uses PySpark to analyze the GDELT event dataset:

The full .ipynb file can be obtained as a GitHub gist.
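
A note on the startup file 00-pyspark-setup.py: it hard-codes the py4j version (py4j-0.8.1-src.zip), and the bundled py4j version differs between Spark releases. One way to make the file less brittle might be to look the zip up by pattern. The following is only a minimal sketch under stated assumptions: the helper name find_pyspark_paths is hypothetical (it is not part of Spark or IPython), and it assumes the zip sits under python/lib and is named py4j-*-src.zip, as in the path used in the startup file.

```python
import glob
import os


def find_pyspark_paths(spark_home):
    """Return the sys.path entries PySpark needs under spark_home.

    Hypothetical helper. Assumes the layout used by the startup file:
    a 'python' directory and a py4j source zip under 'python/lib'.
    """
    python_dir = os.path.join(spark_home, 'python')
    if not os.path.isdir(python_dir):
        raise ValueError('no python directory under %s' % spark_home)

    # Match any bundled py4j version instead of hard-coding 0.8.1.
    pattern = os.path.join(python_dir, 'lib', 'py4j-*-src.zip')
    py4j_zips = sorted(glob.glob(pattern))
    if not py4j_zips:
        raise ValueError('no py4j source zip matching %s' % pattern)

    # If several zips match, take the last one in lexicographic order.
    return [python_dir, py4j_zips[-1]]
```

In the startup file, the two sys.path.insert calls could then be replaced by a loop that inserts each entry returned by find_pyspark_paths(spark_home) at position 0. A missing or oddly laid-out Spark installation would then fail with a clear ValueError when the notebook starts, rather than with an import error later.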