Posted By: Red5 | Jun 19th, 2008 @ 6:22 AM
page 1 of 1
Comments: 17 | Views: 1841
Red5
Systems Manager Curmudgen
Q: How do you discover what the line terminator is when using StreamReader.ReadLine?

I'm working with some very large (500-600 MB) PostScript files. They contain a mixture of chr(13) + chr(10) and chr(10) + chr(10) line endings.
My end goal is to write the file back out with some string replacements while retaining the original bytes apart from my changes.

Of course I can read the entire file (StreamReader.ReadToEnd), but then I have memory issues on some of the larger files.
littleguru
<3 Seattle
Doesn't the ReadLine method already figure that out for you? It should only read until the line "terminator" has been found...

Or am I misunderstanding your question?
Wouldn't you just use WriteLine when writing back? Or what exactly do you need?
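
For reference, StreamReader.ReadLine treats "\r", "\n" and "\r\n" all as line ends and returns the line without its terminator, so the original terminator cannot be recovered from the returned string. A minimal sketch of a reader that keeps it is below. The names `LineReader` and `TryReadLine` are made up for illustration, and the sketch assumes an ordinary seekable file, where Peek behaves predictably.

```csharp
using System;
using System.IO;
using System.Text;

static class LineReader
{
    // Reads one line and reports the exact terminator that followed it:
    // "\r\n", "\n", "\r", or "" when the input ended without one.
    // Returns false when there is nothing left to read.
    public static bool TryReadLine(TextReader reader, out string line, out string terminator)
    {
        var text = new StringBuilder();
        terminator = "";
        int c = reader.Read();
        if (c == -1)
        {
            line = string.Empty;
            return false;
        }
        while (c != -1)
        {
            if (c == '\n')
            {
                terminator = "\n";
                break;
            }
            if (c == '\r')
            {
                // A CR followed by LF is one two-character terminator.
                if (reader.Peek() == '\n')
                {
                    reader.Read();
                    terminator = "\r\n";
                }
                else
                {
                    terminator = "\r";
                }
                break;
            }
            text.Append((char)c);
            c = reader.Read();
        }
        line = text.ToString();
        return true;
    }
}
```

Writing back `line + terminator` with Write (not WriteLine) would then reproduce the original line endings, whereas WriteLine always appends the writer's own NewLine.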
foreachdev
Twitter: @foreachdev
Environment.NewLine is what I always use.
littleguru
<3 Seattle

That's something that I wanted to mention: use Read and Write and provide an array of a certain size... that would do the job way better for you; don't use ReadLine in your scenario. Don't read only one byte at a time, but a series of bytes, and operate on them in memory. It is way more convenient. You also get back how many bytes have been read, and when Read returns 0 you know that you have reached the end.
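
A minimal sketch of that read/write loop, assuming plain Stream objects on both sides (`ChunkCopy` and `Copy` are illustrative names, not anything from the thread). Note that Stream.Read may return fewer bytes than requested before the end of the file, so the loop stops only when it returns 0.

```csharp
using System;
using System.IO;

static class ChunkCopy
{
    // Copies input to output in fixed-size chunks and returns the number of bytes copied.
    // Bytes pass through untouched, so CR LF and LF LF sequences survive as-is.
    public static long Copy(Stream input, Stream output, int bufferSize)
    {
        var buffer = new byte[bufferSize];
        long total = 0;
        int read;
        // Read returns 0 only at the end of the stream.
        while ((read = input.Read(buffer, 0, buffer.Length)) > 0)
        {
            // Any per-chunk processing would go here, on buffer[0..read).
            output.Write(buffer, 0, read);
            total += read;
        }
        return total;
    }
}
```

With files this would be called with two FileStream instances, for example `ChunkCopy.Copy(File.OpenRead(source), File.Create(target), 1 << 20)`, where `source` and `target` are placeholder paths.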

littleguru
<3 Seattle
It could be more efficient if you directly inject them... You need to figure that out on your own, perhaps with some performance testing. I mean, if it only happens once in a while it's probably not worth the time; otherwise you should take it into consideration.
You could convert them to byte arrays, but for that you need to specify a text encoding, so the big question is: what encoding is used in those PostScript files? Is it ASCII perhaps (since as far as I know PostScript is kind of old, and Unicode did not exist when it was created)?

And a couple of other notes:

Strings in .NET are UTF-16, which means that 2 bytes per character are needed. In normal cases that doesn't matter, but if we're talking about 500-600 MB of text, you'll end up doing search/replace in 1-1.2 GB of string data, so using byte arrays may speed things up.
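
A rough sketch of search/replace directly on bytes, assuming the files really are ASCII (that is still an open question above). `ByteReplace` is a hypothetical helper that works on one in-memory chunk; a match that straddles two chunks of a streamed file would need extra overlap handling that is not shown here.

```csharp
using System;
using System.IO;
using System.Text;

static class ByteReplace
{
    // Returns a copy of data with every occurrence of find replaced by replacement.
    // All other bytes, including line terminators, are copied unchanged.
    public static byte[] Replace(byte[] data, byte[] find, byte[] replacement)
    {
        if (find.Length == 0) throw new ArgumentException("find must not be empty");
        using (var output = new MemoryStream(data.Length))
        {
            int i = 0;
            while (i < data.Length)
            {
                if (IsMatch(data, i, find))
                {
                    output.Write(replacement, 0, replacement.Length);
                    i += find.Length;
                }
                else
                {
                    output.WriteByte(data[i]);
                    i++;
                }
            }
            return output.ToArray();
        }
    }

    // True when find occurs in data starting exactly at offset.
    static bool IsMatch(byte[] data, int offset, byte[] find)
    {
        if (offset + find.Length > data.Length) return false;
        for (int j = 0; j < find.Length; j++)
        {
            if (data[offset + j] != find[j]) return false;
        }
        return true;
    }
}
```

The search and replacement strings would be converted once up front, for example with `Encoding.ASCII.GetBytes("/OldFont")`, where the string itself is only a placeholder.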

System.IO and VB/C#: I don't know where you read that, but there is absolutely no difference in performance between using System.IO from C# or VB.NET.

Read buffer size: from my experience you need to read at least 32 KB at a time to get good performance. With such large files I'd go with even larger buffers, say 1-16 MB (unless you're memory constrained).
His VB.NET test code uses a VB-specific IO library and not System.IO.
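
One way to check the buffer size advice on your own files is a small timing harness like the sketch below. `BufferTiming` and `ReadAll` are illustrative names, and the numbers will depend heavily on the disk and on the OS file cache (a second run over the same file is usually served from cache), so treat any single measurement with suspicion.

```csharp
using System;
using System.Diagnostics;
using System.IO;

static class BufferTiming
{
    // Reads the whole file using the given chunk size.
    // Returns the number of bytes read and reports the elapsed time in milliseconds.
    public static long ReadAll(string path, int chunkSize, out long elapsedMs)
    {
        var buffer = new byte[chunkSize];
        long total = 0;
        var watch = Stopwatch.StartNew();
        using (var stream = new FileStream(path, FileMode.Open, FileAccess.Read, FileShare.Read, chunkSize))
        {
            int read;
            while ((read = stream.Read(buffer, 0, buffer.Length)) > 0)
            {
                total += read;
            }
        }
        watch.Stop();
        elapsedMs = watch.ElapsedMilliseconds;
        return total;
    }
}
```

Calling it in a loop over sizes such as 4 KB, 32 KB, 1 MB and 16 MB on a representative file gives a quick feel for where larger buffers stop helping.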

What that test does is compare the Microsoft.VisualBasic.dll and System.IO libraries rather than VB.NET and C#. One can just as well use System.IO from VB.NET (like you already did) or Microsoft.VisualBasic.dll from C# (though probably nobody actually does that). It is not a language benchmark but a library benchmark.

In general VB.NET and C# produce pretty much the same code, and cases where performance differs are typically caused by more or less subtle semantic differences in the code (one case I saw used '/' in C# to divide integers and the "same" '/' operator in VB.NET, which actually results in a floating point operation).
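
To make the division case concrete: in C#, '/' on two int operands is integer division, while VB.NET's '/' on two Integer operands produces a Double (VB.NET uses '\' for integer division). A small C# sketch showing both behaviours side by side, with `DivisionDemo` as a made-up name:

```csharp
using System;

static class DivisionDemo
{
    // Both operands are int, so C# performs integer division and truncates toward zero.
    public static int IntegerDivide(int a, int b)
    {
        return a / b;
    }

    // Converting one operand to double first forces floating point division,
    // which is what VB.NET's '/' operator does for Integer operands.
    public static double FloatingDivide(int a, int b)
    {
        return (double)a / b;
    }
}
```

`DivisionDemo.IntegerDivide(7, 2)` gives 3, while `DivisionDemo.FloatingDivide(7, 2)` gives 3.5, so a "line for line" port between the two languages can end up doing different work.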
littleguru
<3 Seattle
You probably would need to use reflector or a similar tool to see how the VB.NET code got translated... probably something ended up with other method calls; otherwise there are no differences between the languages... both run on the same runtime, have the same jitter at the backend and use the same types. They should be exactly even when it comes to speed.

Remember, I said "in general"; a 350 MB file is not "in general" :)

One thing to watch out for:

C# and VB.NET have different defaults for "Check Integer Overflows" compiler option. C# defaults to "no" and VB.NET defaults to "yes". Normally you won't see a big difference but imagine that you have just one + operation per file byte. That's going to leave a mark on the execution time.
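
The difference can be reproduced inside C# itself with the checked and unchecked keywords, which override the compiler option for a block of code. A minimal sketch, with `OverflowDemo` as an illustrative name:

```csharp
using System;

static class OverflowDemo
{
    // Mirrors the C# default: overflow silently wraps around.
    public static int AddUnchecked(int a, int b)
    {
        unchecked
        {
            return a + b;
        }
    }

    // Mirrors the VB.NET default: overflow throws OverflowException,
    // which costs an extra check on every addition.
    public static int AddChecked(int a, int b)
    {
        checked
        {
            return a + b;
        }
    }
}
```

`OverflowDemo.AddUnchecked(int.MaxValue, 1)` wraps to `int.MinValue`, while `OverflowDemo.AddChecked(int.MaxValue, 1)` throws. For a fair comparison, both compilers would need the same overflow setting.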

What that test does is compare the Microsoft.VisualBasic.dll and System.IO libraries rather than VB.NET and C#. One can just as well use System.IO from VB.NET (like you already did) or Microsoft.VisualBasic.dll from C# (though probably nobody actually does that). It is not a language benchmark but a library benchmark.

But the article doesn't say it is comparing the VB DLL vs. the IO DLL. It says .NET languages, so I am assuming they are talking about the IO DLL in both languages. Isn't it something to do with VBC?

Yes, it doesn't specifically say that and I consider the article to be kind of flawed because of that. I had to look at the source code to figure out what's going on.

While the other tests actually test how good a compiler/jitter/interpreter is, the IO test is special because it has a high dependency on how well the IO library performs. 

It's obvious that you cannot use the same IO library in C#, C and Java, but the VB.NET vs. C# case is special because you can use System.IO with both. It is not impossible/difficult/unsupported to use System.IO in VB.NET, and if you choose not to do it in a benchmark, the result you get will only be about Microsoft.VisualBasic vs. System.IO.


